# Why Not Store Everything in Main Memory? Why use disks?

## An RSI Dataset: Decision Tree Induction from pTrees

An RSI dataset: 16 pixels (4 rows, 4 cols), 4 bands, 4 bit-slices per band. Band values in decimal:

```
Band B1:        Band B2:        Band B3:        Band B4:
 3  3  7  7      7  3  3  2      8  8  4  5     11 15 11 11
 3  3  7  7      7  3  3  2      8  8  4  5     11 11 11 11
 2  2 10 15     11 11 10 10      8  8  4  4     15 15 11 11
 2 10 15 15     11 11 10 10      8  8  4  4     15 15 11 11
```

The training relation S (class c = the B1 value), from which the Value BitMap pTrees and BitSlice pTrees are built:

```
X-Y   B1    B2    B3    B4
0,0  0011  0111  1000  1011
0,1  0011  0011  1000  1111
0,2  0111  0011  0100  1011
0,3  0111  0010  0101  1011
1,0  0011  0111  1000  1011
1,1  0011  0011  1000  1011
1,2  0111  0011  0100  1011
1,3  0111  0010  0101  1011
2,0  0010  1011  1000  1111
2,1  0010  1011  1000  1111
2,2  1010  1010  0100  1011
2,3  1111  1010  0100  1011
3,0  0010  1011  1000  1111
3,1  1010  1011  1000  1111
3,2  1111  1010  0100  1011
3,3  1111  1010  0100  1011
```

Class bitmaps: Pc=2, Pc=3, Pc=7, Pc=10, Pc=15. Feature-value bitmaps: PB2=2, PB2=3, PB2=7, PB2=10, PB2=11 (root counts 2, 4, 2, 4, 4); PB3=4, PB3=5, PB3=8 (root counts 6, 2, 8); PB4=11, PB4=15 (root counts 11, 5). (BitSlice pTrees P1,3..P4,0 and their bit-column residue omitted.)

Expected information to classify, one column per class:

```
class c           2        3        7        10       15
rc(Pc)            3        4        4        2        3
p_i = rc/16       0.1875   0.25     0.25     0.125    0.1875
log2(p_i)        -2.415   -2       -2       -3       -2.415
p_i*log2(p_i)    -0.4528  -0.5     -0.5     -0.375   -0.4528

I(c1..cm) = -SUM(p_i*log2(p_i)) = 2.280
```

ENTROPY of an attribute A with values a_1..a_v:

E(A) = Σ_{j=1..v} (s_{1j}+..+s_{mj}) * I(s_{1j},..,s_{mj}) / s
I(s_{1j},..,s_{mj}) = -Σ_{i=1..m} p_{ij} * log2(p_{ij}),  p_{ij} = s_{ij} / |A_j|

where s_{ij} = rc(P_{C=ci} ∧ P_{A=aj}) and the |A_j|'s are the root counts of the P_{A=aj}'s.

For the ten feature values (B2 ∈ {2,3,7,10,11}, B3 ∈ {4,5,8}, B4 ∈ {11,15}):

```
                    B2=2  B2=3  B2=7  B2=10 B2=11 | B3=4 B3=5 B3=8 | B4=11 B4=15
p1j                  0     0     0     0    .75   |  0    0   .375 |  0     .6
p2j                  0    .5    .5     0     0    |  0    0   .5   | .273   .2
p3j                  1    .5     0     0     0    | .33   1    0   | .363    0
p4j                  0     0     0    .25   .25   | .17   0   .125 | .091   .2
p5j                  0     0     0    .75    0    | .5    0    0   | .273    0
-p1j*log2(p1j)       0     0     0     0    .75   |  0    0   .53  |  0     .44
-p2j*log2(p2j)       0    .5    .5     0     0    |  0    0   .5   | .51    .46
-p3j*log2(p3j)       0    .5     0     0     0    | .52   0    0   | .53     0
-p4j*log2(p4j)       0     0     0    .5    .5    | .43   0   .37  | .31    .46
-p5j*log2(p5j)       0     0     0    .31    0    | .5    0    0   | .51     0
(Σs_ij)*I(..)/16     0    .25   .13   .2    .31   | .54   0   .7   | .127   .43
```

GAIN(A) = I(c1..cm) - E(A):
GAIN(B2) = 2.81 - .89 = 1.92;  GAIN(B3) = 2.81 - 1.24 = 1.57;  GAIN(B4) = 2.81 - .557 = 2.53.

So it's all just arithmetic, except for the #Classes × #FeatureValues ANDs and root counts, rc(P_{C=ci} ∧ P_{Bk=aj}). Should these be pre-computed at capture? Are they part of the correlation calculation, or of other often-used calculations? Other speedups include:
1. Use approximate Value pTrees: intervalize the feature values and use the IntervalBitMaps (which can be calculated from either BitSlice or ValueMap pTrees).
2. Using bitslice intervals, we'd pre-calculate all P_{i,j,k} = P_{C=ci} ∧ P_{j,k}, then use HiOrderBit IntervalBitMaps P_{i,j,Kj}, where Kj is the high-order bit of attribute j. For second-high-order-bit intervals, just take P_{i,j,Kj} ∧ P_{i,j,Kj-1}, etc.

Aside: there must be mistakes in the arithmetic above, since I get different GAIN values than on the previous slide. Who can correct them?
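Since everything reduces to root counts, the whole gain computation can be checked in a few lines. A minimal sketch (the function names are mine, not part of any pTree API; the B4 joint counts are read off the S table above):

```python
import math

def info(counts):
    """I(s1..sm) = -sum p_i*log2(p_i) over the nonzero class counts."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)

def gain(class_counts, joint_counts):
    """Information gain of an attribute, purely from pTree root counts.

    class_counts[i]    = rc(P_C=ci)
    joint_counts[j][i] = rc(P_C=ci & P_A=aj), one row per attribute value aj
    """
    n = sum(class_counts)                                       # |X| = 16 pixels
    E = sum(sum(row) / n * info(row) for row in joint_counts)   # E(A)
    return info(class_counts) - E

rc_class = [3, 4, 4, 2, 3]        # rc(Pc=2), rc(Pc=3), rc(Pc=7), rc(Pc=10), rc(Pc=15)
print(round(info(rc_class), 3))   # 2.281 -- the expected info I(c1..cm)

# Joint counts for B4, read off the S table (B4=11 rows, then B4=15 rows):
rc_B4 = [[0, 3, 4, 1, 3], [3, 1, 0, 1, 0]]
print(round(gain(rc_class, rc_B4), 3))  # ~0.568 (cf. the aside above about
                                        # arithmetic mistakes in the GAINs)
```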

## Decision Tree Induction: Information Gain (review)

Let set S contain p elements of class P and n elements of class N. The amount of information needed to decide whether an arbitrary example in S belongs to P or N is

I(p,n) = -(p/(p+n)) * log2(p/(p+n)) - (n/(p+n)) * log2(n/(p+n))

Assume that, using attribute A, S will be partitioned into sets {S1, S2, ..., Sv}. If Si contains pi examples of P and ni examples of N, the entropy, i.e., the expected information needed to classify objects in all the subtrees Si, is

E(A) = Σ_{i=1..v} ((pi+ni)/(p+n)) * I(pi, ni)

and the information gained by branching on A is Gain(A) = I(p,n) - E(A).

Example (Class P: buys_computer = yes; Class N: buys_computer = no; 14 samples, so I(p,n) = I(9,5) = 0.940). Compute the entropy for age:

```
age     pi  ni  I(pi, ni)
<=30    2   3   0.971
30..40  4   0   0
>40     3   2   0.971
```

E(age) = (5/14)*I(2,3) + (4/14)*I(4,0) + (5/14)*I(3,2) = 0.69

Hence Gain(age) = I(p,n) - E(age) = 0.25. Similarly, Gain(income) = 0.029, Gain(student) = 0.151, Gain(credit_rating) = 0.048.

## Bayesian Classification

Bayes' theorem: let X be an unclassified sample and H the hypothesis that X belongs to a given class. P(H|X) is the conditional probability of H given X, and P(H) is the prior probability of H; then P(H|X) = P(X|H)P(H)/P(X).

Naive Bayesian: given a training set R(A1..An, C), where C = {C1..Cm} is the class label attribute, the naive Bayesian classifier predicts the class of an unknown data sample X to be the class Cj having the highest conditional probability conditioned on X: P(Cj|X) ≥ P(Ci|X) for all i ≠ j. From Bayes' theorem, P(Cj|X) = P(X|Cj)P(Cj)/P(X). P(X) is constant for all classes, so we maximize P(X|Cj)P(Cj). To reduce the computational complexity of calculating all the P(X|Cj)'s, we make the naive assumption of class-conditional independence.

## Expected Information from pTree Root Counts

For the RSI example, the expected information to classify is I = -Σ_{i=1..m} (rc(Pci)/|X|) * log2(rc(Pci)/|X|), m = 5:

rc(Pci) = 3, 4, 4, 2, 3;  p_i = rc(Pci)/16 = 3/16, 1/4, 1/4, 1/8, 3/16

I = -(3/16*log2(3/16) + 1/4*log2(1/4) + 1/4*log2(1/4) + 1/8*log2(1/8) + 3/16*log2(3/16)) = 2.28

(If the basic pTree root counts, actually just the value pTrees, are pre-computed, this is pure arithmetic!) The class-by-feature-value root counts rc(P_{C=ci} ∧ P_{Bk=aj}) then feed the entropy and gain tables above.
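A quick numeric check of the buys_computer figures above (a sketch; the age partition counts are the ones in the table):

```python
import math

def I(p, n):
    """Info needed to separate p positives from n negatives."""
    return -sum(x / (p + n) * math.log2(x / (p + n)) for x in (p, n) if x > 0)

age = [(2, 3), (4, 0), (3, 2)]                       # (pi, ni) per age branch
E_age = sum((p + n) / 14 * I(p, n) for p, n in age)

print(round(I(9, 5), 3))           # 0.94
print(round(E_age, 3))             # 0.694
print(round(I(9, 5) - E_age, 3))   # Gain(age) = 0.246, i.e., the 0.25 above
```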


## APPENDIX: pTree Text Mining, Level-1 TermFreqPTrees

(Diagram residue omitted; the recoverable structure:)

- The predicate of tfP0 is mod(sum(mdl-stride), 2) = 1, where mdl = max doc length.
- The level-0 pTree has one bit per reading position per (doc, term); its length = mdl * VocabLen * DocCount (e.g., reading positions 1..7 for doc=1 and terms a, again, all).
- The level-1 TermExistence pTree (te) applies the NOT-pure0 predicate to each mdl-stride; its length = VocabLen * DocCount.
- The level-1 term frequency tf and its bit-slices tf1, tf0 are kept per (doc, term), and the level-2 document-frequency counts dfPk (dfP0, dfP2, ...) are tallied from the te columns.

(Term-by-position count-matrix residue omitted: vocabulary rows a, again., all, always., an, and, apple, April, are; Position columns 1, 2, 3, 4, 7, ...; tf2 and doc-frequency columns.)

The next slide shows how to do it differently, so that even the dfk's come out as level-2 pTrees (pTree Text Mining, from 2012_08_04). Term Existence (mdl = max doc length): dfk isn't a level-2 pTree as built here, since it's not a predicate on level-1 te strides. The DocTrmPos pTreeSet gives a Data Cube Text Mining layout (docs HHS, LMM, JSE), with a level-2 pTree hdfP (High Doc Freq): pred = NOT-pure0 applied to tfP1.

These level-2 pTrees, dfPk, have length = VocabLength (so does hdfP). The level-1 pTrees tfPk (e.g., the predicate of tfP0 is mod(sum(mdl-stride), 2) = 1) hold the bit-slices of each (doc, term) frequency, while the overall level-1 pTree, teP, has length = DocCount * VocabLength: one term-existence bit per term per doc (tePt=a, tePt=again, tePt=all over doc1..doc3). (Bit-column residue omitted.)
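A minimal sketch of the three levels on a toy two-document corpus (the layout and names are illustrative assumptions, not the original pTree implementation):

```python
# Toy corpus; mdl = max doc length. Level-0 has one bit per (doc, term, position).
docs = ["a spider sat beside a tuffet", "humpty sat on a wall"]
vocab = sorted({w for d in docs for w in d.split()})
mdl = max(len(d.split()) for d in docs)

# Level-0 corpusP: corpusP[d][t][k] = 1 iff term t is at position k of doc d.
corpusP = [[[int(k < len(d.split()) and d.split()[k] == t) for k in range(mdl)]
            for t in vocab] for d in docs]

# Level-1: tf = sum over each mdl-stride; te is the NOT-pure0 predicate;
# tfP1/tfP0 are the bit-slices of tf.
tf = [[sum(stride) for stride in doc] for doc in corpusP]
te = [[int(f > 0) for f in row] for row in tf]
tfP0 = [[f & 1 for f in row] for row in tf]
tfP1 = [[(f >> 1) & 1 for f in row] for row in tf]

# Level-2: document frequency per term = column sum of te.
df = [sum(row[t] for row in te) for t in range(len(vocab))]
print(dict(zip(vocab, df)))   # 'a' and 'sat' occur in both docs -> df = 2
```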

The overall level-0 pTree, corpusP, has length = MaxDocLen * DocCount * VocabLen: the Corpus pTreeSet, concatenating the P_{t,d} strides over the vocabulary (a, again., all, always., an, and, apple, ...) in data-cube layout over the docs (e.g., HHS, LMM, JSE). (This slide otherwise repeats the level-1/level-2 diagrams above; residue omitted.)

Any of these masks, e.g., a Verb pTree, an EndOfSentence pTree, a References pTree, a LastChapter pTree, or a Preface pTree, can be ANDed into the P_{t,d} pTrees before they are concatenated as above (or repetitions of the mask can be ANDed in after they are concatenated). (Bit-column residue for Pt=again,d=1 and the term-by-position matrix omitted.)
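For instance, with made-up bits (a sketch, not the actual mask pTrees):

```python
# AND a section mask into a (term, doc) level-0 stride before concatenation.
Pt_a_d1   = [1, 0, 0, 0, 1, 0]   # positions of term 'a' in doc 1 (made up)
last_chpt = [0, 0, 0, 1, 1, 1]   # mask: positions inside the last chapter

masked = [b & m for b, m in zip(Pt_a_d1, last_chpt)]
print(masked)   # [0, 0, 0, 0, 1, 0] -> 'a' occurs once in the last chapter
```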

I have put together a pBase of 75 Mother Goose Rhymes or Stories, and created a pBase of the 15 documents with 30 words (Universal Document Length, UDL), using as vocabulary all white-space-separated strings.

APPENDIX: Level-1 (term freq/exist) pTrees for two documents (182-term vocabulary; rows not shown were all zero):

```
Little Miss Muffet                  Humpty Dumpty
pos  te  tf  tf1 tf0  VOCAB         pos  te  tf  tf1 tf0
  1   1   2   1   0   a               1   1   2   1   0
  6   1   3   1   1   and             2   1   1   0   1
 14   1   1   0   1   away.           3   1   2   1   0
 23   1   1   0   1   beside          6   1   1   0   1
...
182   0   0   0   0   your          182   0   0   0   0
```

Lev-0 position pTrees (one bit-column per position per vocabulary term) follow.

(Lev-0 bit-column residue omitted. The position headers spell out the rhymes word by word: "Little Miss Muffet / sat / on / a / tuffet / eating / ... / and whey. / There / came / a / big spider / and / sat down ...", positions 1-20, and similarly for doc 05HDS, Humpty Dumpty.)

(More Lev-0 residue omitted: "... of curds ..." and, for Humpty Dumpty, "... wall. Humpty Dumpty ...".)

Level-2 pTrees (document frequency): df per vocabulary term, with its bit-slices df3..df0, plus term-existence columns te04, te05, te08, te09, te27, te29, te34 for individual documents (te columns omitted):

```
df3 df2 df1 df0   df  VOCAB
 1   0   0   0     8  a
 0   0   0   1     1  again.
 0   0   1   1     3  all
 0   0   0   1     1  always
 0   0   0   1     1  an
 1   1   0   1    13  and
 0   0   0   1     1  apple
 0   0   0   1     1  April
 0   0   0   1     1  are
 0   0   0   1     1  around
 0   0   0   1     1  ashes,
 0   0   1   0     2  away
 0   0   1   0     2  away
 0   0   0   1     1  away.
 0   0   0   1     1  baby
 0   0   0   1     1  baby.
 0   0   0   1     1  bark!
 0   0   0   1     1  beans
 0   0   0   1     1  beat
 0   0   0   1     1  bed,
 0   0   0   1     1  Beggars
 0   0   0   1     1  begins.
 0   0   0   1     1  beside
 0   0   0   1     1  between
```
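The df bit-slices are just the binary digits of the df counts; a small sketch:

```python
# df and its bit-slices df3..df0, as in the table above.
vocab_df = {"a": 8, "again.": 1, "all": 3, "and": 13, "away": 2}

for term, df in vocab_df.items():
    print(term, [(df >> k) & 1 for k in (3, 2, 1, 0)], df)
# e.g., 'and' -> [1, 1, 0, 1] 13: 'and' appears in 13 of the 15 documents
```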

## FAUST Clustering1

L-Gap Clusterer: cut, C, mid-gap (of F and C), using the next (d,p) from dpSet, where F = L|S|R. Starting with D = d35, then D = the sum of all docs, project each doc x onto the unitized D and cut at gaps in x∘d (per-document projection listings omitted):

C1 (.17 ≤ x∘d ≤ .25) = {2,3,6,16,18,22,42,43,49}
C2 (.34 ≤ x∘d ≤ .56) = {1,4,5,8,9,12,14,15,23,25,27,32,33,36,37,38,44,45,47,48}
C3 (.64 ≤ x∘d ≤ .86) = {10,11,13,17,21,26,28,29,30,39,41,50}
Singletons: 46 (x∘d = .99); 7 (= 1.16); 35 (= 1.47)

Going back to D = d35: how close does HOB (cutting at high-order-bit boundaries) come? 2^1 and 2^0 separate 35; 2^-1 separates 7 and 50; 2^-2 separates the .27s.

Next, on each Ck, try D = the sum of all Ck docs, Thres = .2 (e.g., D = sum of all C3 docs, all C31 docs, all C11 docs):

C311 (≤ .63) = {11,17,29};  C312 (.84) = {13,30,50};  C313 (.95) = {10,26,28,41};  21 is an outlier.
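A minimal sketch of the gap-based cut (an assumed simplification of the L-gap clusterer: project onto a unitized direction, sort, and split wherever consecutive projections differ by more than a threshold):

```python
import numpy as np

def gap_clusters(X, D, thresh):
    """Split the rows of X at gaps > thresh in the projection onto unitized D."""
    d = D / np.linalg.norm(D)
    proj = X @ d
    order = np.argsort(proj)
    clusters, current = [], [order[0]]
    for prev, cur in zip(order, order[1:]):
        if proj[cur] - proj[prev] > thresh:   # cut mid-gap
            clusters.append(current)
            current = []
        current.append(cur)
    clusters.append(current)
    return clusters

# e.g., X = the 50x60 doc-by-word 0/1 matrix, D = X.sum(axis=0) (the sum of
# all docs) or D = X[34] (doc d35), thresh = .2 as on this slide.
```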

## Salient Word/Doc Subgraphs

(Word-association graph residue omitted: word groups such as bread-bake, buy-tow, old-woman-child, king-son, baby-cry-mother, pie-eat, lamb-full, fiddle-three-merry, run-away, day-hill, linked to their document numbers.)

A bake-bread sub-corpus would have been strong (docs {7, 21, 35, 42}). We have captured only a few of the salient subgraphs. (HOB2 Alt: use other HOBs.)

FAUST Cluster 1.2.4, using wAvg+1 and dAvg+1 (word indices across the top):

```
      w2  w9  w10  w25  w45  w47
d21    0   0   1    0    0    1
d35    0   0   1    1    1    0
d39    1   1   0    0    1    0
d46    1   0   0    1    0    0
d50    0   1   0    1    1    1
```

Recursing with wAvg+2, dAvg-1:

```
      w2  w9  w10  w25  w45
d35    0   0   1    1    1
d39    1   1   0    0    1
d46    1   0   0    1    0
d50    0   1   0    1    1
```

Rhyme 12, for reference: "There came an old woman from France who taught grown-up children to dance. But they were so stiff she sent them home in a sniff. This sprightly old woman from France." Its word cluster is {child, old, woman} (word indices 15, 44, 59).

And if we want to pull out a particular word cluster, we just turn the word-pTree into a list (e.g., w=baby and w=boy over docs d9, d26, d27, d28, d39, d45, d50); for a particular doc cluster, just turn the doc-pTree into a list.
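"Turning a pTree into a list" is just collecting the labels at its 1-bit positions; a sketch:

```python
# A word-pTree (or doc-pTree) is a 0/1 vector over word (or doc) indices.
def ptree_to_list(bits, labels):
    return [lab for b, lab in zip(bits, labels) if b]

w_cluster = [0, 1, 0, 1, 1]                    # made-up mask over word indices
print(ptree_to_list(w_cluster, ["w2", "w9", "w10", "w25", "w45"]))
# -> ['w9', 'w25', 'w45']
```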

## FAUST Hull Classification 1

Using the clustering of FAUST Clustering1 as classes, we extract 80% from each class as the TrainingSet (with class = cluster#). How accurate is FAUST Hull Classification on the remaining 20%, plus the outliers (which should come out "Other")? Use Lpd, Sp, and Rpd with p = ClassAvg and d = unitized ClassSum.

Full classes (from the FAUST Clustering1 slide):
C11 = {2,3,16,22,42,43};  C2 = {1,4,5,8,9,12,14,15,23,25,27,32,33,36,37,38,44,45,47,48};  C311 = {11,17,29};  C312 = {13,30,50};  C313 = {10,26,28,41};  OUTLIERS: {18,49}, {6}, {39}, {21}, {46}, {7}, {35}.

80% TrainingSet:
C11 = {2,16,22,42,43};  C2 = {1,5,8,9,12,15,25,27,32,33,36,37,38,44,47,48};  C311 = {11,17};  C312 = {30,50};  C313 = {10,28,41}.

20% TestSet:
C11 = {3};  C2 = {4,14,23,45};  C311 = {29};  C312 = {13};  C313 = {26};  O = {18,49,6,39,21,46,7,35}.

Lpd hulls ([min, max] per class), D1 = TS, p = avTS:

```
CLASS   MIN     MAX
C11    -0.09    .106
C2      .106    .439
C311    .572    .572
C312    .305    .439
C313    .505    .771
```

(Analogous Lpd tables for D11=C11, D2=C2, D311=C311, D312=C312, D313=C313 omitted.) All 6 class hulls are separated using Lpd with p = CLavg and D = CLsum: D311 separates C311, D312 separates C312, and D313 separates C313 from all others; D2 separates C11 and C2.

Now remove some false positives with S and R, using the same p's and d's. Sp, D1 = TS, p = avTS: C11 [1.9, 2.1]; C2 [1.8, 3.8]; C311 [2.4, 3.4]; C312 [4.6, 4.7]; C313 [4.2, 5.4]. (Sp tables for the per-class D's omitted.) Sp removes a lot of the potential for false positives; many of the classes lie at a single distance from p.

Rpd, D1 = TS, p = avTS: C11 [1.3, 1.4]; C2 [1.3, 1.9]; C311 [1.5, 1.8]; C312 [2.1]; C313 [2.0, 2.2]. (Rpd tables for the per-class D's omitted.) Rpd removes even more of the potential for false positives.
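A sketch of the hull construction, assuming the usual FAUST functional forms (L = projection length onto d from p, S = squared distance to p, R = distance from the d-axis); the forms are my assumption from context, not spelled out on the slide:

```python
import numpy as np

def hull_intervals(X, y, p, D):
    """Per-class [min, max] intervals of L, S, R over a training set."""
    d = D / np.linalg.norm(D)
    diff = X - p
    L = diff @ d
    S = (diff ** 2).sum(axis=1)
    R = np.sqrt(np.maximum(S - L ** 2, 0))
    return {cls: {name: (v[y == cls].min(), v[y == cls].max())
                  for name, v in (("L", L), ("S", S), ("R", R))}
            for cls in np.unique(y)}

# Classify a test doc x as class c only if its L, S and R values all fall
# inside c's intervals; if no class hull contains x, predict "Other".
```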
(The remaining Rpd hull tables for D2 and D313, and the Lpd/Sp/Rpd tables repeated for the test run, are omitted as above.)

## FAUST Hull Classification 2 (TESTING)

Test-set functional values under D = TS, p = avTS, with true and predicted classes (the per-functional R/S/L prediction columns are omitted):

```
doc  trueCL  Lpd    Sp    Rpd   predicted
d3    11    -0.4   2.19  1.41   Other
d4    2     -0.3   2.06  1.40   Other
d14   2      0.01  3.71  1.92   Other
d23   2     -0.2   1.99  1.38   Other
d29   311   -0.1   3.92  1.97   Other
d13   312   -0.2   4.99  2.22   Other
d26   313   -0.0   6.78  2.60   Other
d6    Other -0.3   2.13  1.40   Other
d7    Other  0.34  6.37  2.50   Other
d18   Other -0.3   2.06  1.40   Other
d21   Other -0.1   5.92  2.42   Other
d35   Other  0.47 12.2   3.46   Other
d39   Other -0.0   6.78  2.60   Other
d46   Other  0.14  5.57  2.35   Other
d49   Other -0.4   2.19  1.41   Other
```

8/15 = 53% correct, just with D = TS, p = AvgTS. Note: it's likely to get worse as we consider more D's.

Let's think about the TrainingSet quality resulting from clustering. This is a poor-quality TrainingSet (from clustering Mother Goose Rhymes): MGR is a difficult corpus to cluster, since almost every document is isolated (an outlier), so the clustering is vague (no two MGRs deal with the same topic, so their word use is quite different). Instead of tightening the class hulls by replacing CLASSmin and CLASSmax with CLASSfpci (fpci = first precipitous count increase) and CLASSlpcd (lpcd = last precipitous count decrease), we might loosen the class hulls (since we know the classes are somewhat arbitrary) by expanding the [CLASSmin, CLASSmax] interval as follows: let A = Avg{CLASSmin, CLASSmax} and R (for radius) = A - CLASSmin (= CLASSmax - A also), and use [A - R - ε, A + R + ε]. Setting ε = .8 increases accuracy to 100% (assuming all Other stay Other): each test doc's prediction becomes its true class (11, 2, 2, 2, 311, 312, 313, then Other for the rest).

Finally, it occurs to me that clustering to produce a TrainingSet, then setting aside a TestSet, gives a good way to measure the quality of the clustering. If the TestSet part classifies well under the TrainingSet part, the clustering must have been high quality (it produced a good TrainingSet for classification). This clustering quality test method is probably not new (check the literature?). If it is new, we might have a paper here (discuss this quality measure and assess it using different ε's?).
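A sketch of the loosening rule just described:

```python
def loosen(class_min, class_max, eps=0.8):
    """Expand [CLASSmin, CLASSmax] to [A-R-eps, A+R+eps]:
    A = interval midpoint, R = interval radius."""
    A = (class_min + class_max) / 2
    R = A - class_min                 # == class_max - A
    return (A - R - eps, A + R + eps)

print(loosen(0.106, 0.439))   # widen C2's Lpd hull under D1=TS by eps = .8
```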
WP, Wed 11/26: Yes, we have also discovered that one has to think about the quality of the training set. If it is very high quality (expected to fully encompass all borderline cases of all classes), then using exact gap endpoints is probably wise. But if there is reason to worry about the comprehensiveness of the training set (e.g., when there are very few training samples, which is often the case in medical expert systems, where getting a sufficient number of training samples is difficult and expensive), then it is probably better to move the cut-points toward the midpoint, reflecting the vagueness of training-set class boundaries. What does one use to decide how much to move away from the endpoints? That's not an easy question. Cluster deviation seems like a useful measure to employ.

One last thought on how to decide whether to cut at the gap midpoint, at the endpoints, or somewhere in between: if one has a time-stamp on training samples, one might assess the "class endpoint" change rate over time. As the training set gets larger and larger, if an endpoint stops moving much and isn't an outlier, then cutting at the endpoint seems wise. If an endpoint is still changing a lot, then moving away from that endpoint seems wise (maybe based on the rate of change of that endpoint, as well as other measures).

## Maximal Cliques in the Bipartite Doc-Word Graph

A complete subgraph is a clique; a maximal clique is not a proper subset of any other clique. In a bipartite graph G = (X, Y, E), a clique (Sx, Sy) is a complete bipartite subgraph induced by the bipartite vertex set (Sx, Sy). The Consensus Set (or clique) of Sx is CLQ(Sx) = ∩_{x∈Sx} N_Y(x), i.e., the set of all y's that are adjacent (edge-connected) to every x in Sx. Clearly, (Sx, CLQ(Sx)) is a clique.

Thm1: (Sx, Sy) is a maximal clique iff Sy = CLQ(Sx) and Sx = CLQ(Sy).
Thm2: For every Sy ⊆ Y with CLQ(Sy) ≠ ∅, (CLQ(Sy), CLQ(CLQ(Sy))) is maximal.

Find all cliques starting with singleton Sy's, then examine doubletons Sy1y2 with P_X(Sy1y2) ≠ ∅, then tripletons, etc. Examining the MGRs (x = docs, y = words): all singleton wordsets Sy form a nonempty clique. AND pairwise to find all nonempty doubleton-wordset cliques Sy1y2; AND those nonempty doubleton wordsets with each other singleton wordset to find all nonempty tripleton-wordset cliques Sy1y2y3, and so on. (Or start with singleton docs and include another doc each time, until empty.) The last nonempty set is a max-clique, and all its subsets are cliques. Remove them; iterate.
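A small sketch of the consensus-set operator and the Thm1 maximality test (the incidence data is made up):

```python
def CLQ(S, nbrs):
    """Consensus set: the vertices adjacent to every member of S."""
    S = list(S)
    out = set(nbrs[S[0]])
    for v in S[1:]:
        out &= nbrs[v]
    return out

# Toy bipartite incidence: doc -> set of words it contains, and the reverse.
doc_words = {"d1": {"w1", "w2"}, "d2": {"w2", "w3"}, "d3": {"w1", "w2"}}
word_docs = {}
for d, ws in doc_words.items():
    for w in ws:
        word_docs.setdefault(w, set()).add(d)

Sx = {"d1", "d3"}
Sy = CLQ(Sx, doc_words)                  # words shared by d1, d3 -> {w1, w2}
# Thm1 test: since Sy = CLQ(Sx) by construction, (Sx, Sy) is maximal iff
# CLQ(Sy) comes back as exactly Sx.
print(Sy, CLQ(Sy, word_docs) == Sx)      # {'w1', 'w2'} True
```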

(Clique-enumeration residue omitted: per-doc lists of clique wordsets found by the singleton/doubleton scan, e.g., 8:w58, 14:w21, 17:w49, 23:w52, 28:w52, 30:w49, 41:w52, 46:w49, 48:w52, 37+47:w57, 13+46:w51, 29+47:w8, and many doubleton and tripleton doc-sets with their shared words.)

(More enumeration residue omitted.) Tally of the cliques found so far:

```
#CLQs   13  1  7  9  23  48
#docs    2  6  5  4   3   2
#words   2  1  1  1   1   1
```

There is something wrong here: this does not find all maximal cliques.

Next I try the following logic (a code sketch follows the discussion below):
1. Find all 1WdCs (1-Word Cliques).
2. A kWdC contains each of k (k-1)WdCs, so if a (k-1)-wordset is not the wordset of a clique, then none of its supersets are either (the downward closure property).
3. Thus the wordset of any 2WdC can be composed by unioning the wordsets of two 1WdCs, and any kWdC wordset is the union of a (k-1)WdC wordset with a 1WdC wordset.

(Doc-by-word incidence-matrix residue omitted: rows d1-d50, ANDed with w1; for wh = w2, ..., w60, only the counts of wh & wk, k > h, are shown.)
A first goal of our team might be to implement, in optimized Treeminer parallel code, a maximal complete subgraph finder (maximal clique finder). The benefits would be substantial:
1. This would be an exercise in parallel programming (e.g., in a Treeminer Hadoop environment).
2. This is a typical exponential-growth case. If you can find an engineering breakthrough here, it will be a breakthrough for a massive collection of similar existing big-data parallel programming problems.

What is the next step? We would have to have all the wh & wk, k > h, results (not just the counts), but that would have taken about 60 more slides to display ;-(

Just looking at the pTree results of w1 & wk, k > 1, above: even though w1&w2 and w1&w3 both have counts = 1, their AND (which is w1&w2&w3) has count 0, and therefore we need not consider any combinations of the type w1&w2&w3&..., by downward closure. In fact, the only wh for which we need to look further is h = 42: ct(w1&w2&w42) = 1, but all other ct(w1&w2&wh) = 0. The only maximal clique involving w1 and w2 is {DocSet = {29}, WordSet = {1, 2, 42}}, right?
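A sketch of this downward-closed (Apriori-style) enumeration over word-pTrees; the bit vectors are made up to reproduce the w1-w2-w42 situation just described, and `ct` is simply a pop-count (root count):

```python
from itertools import combinations

# Made-up word pTrees, one bit per doc: ct(w1&w2) = ct(w1&w3) = 1,
# ct(w1&w2&w3) = 0, but ct(w1&w2&w42) = 1.
w = {1: 0b0110, 2: 0b0010, 3: 0b0100, 42: 0b0011}

def ct(bits):                         # root count = number of 1-bits
    return bin(bits).count("1")

# k=2: keep only wordsets whose AND is nonempty. By downward closure, no
# superset of a dropped wordset can be a clique wordset.
frontier = {frozenset(pr): w[pr[0]] & w[pr[1]]
            for pr in combinations(w, 2) if ct(w[pr[0]] & w[pr[1]]) > 0}

# k=3: extend each surviving 2-wordset by one more 1WdC wordset.
nxt = {}
for ws, bits in frontier.items():
    for h, ph in w.items():
        if h not in ws and ct(bits & ph) > 0:
            nxt[ws | {h}] = bits & ph
print([sorted(s) for s in nxt])       # [[1, 2, 42]] -- the only survivor
```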

Next we would look at the pTrees of w2 & wk, k > 2. Clearly we only need to consider the 16 WDpTrees {8,9,18,20,23,24,25,27,30,39,42,45,46,49,51,55}, not all 58 of them. And going down to w30, the WDpTreeSet is {30,48} only.

To appreciate why we need engineering breakthroughs here, recall that a typical vocabulary might be 100,000 words, not just 60.

(Surviving-pair listings and incidence-matrix residue omitted, e.g., pairs 1,30; 3,36; 15,60; ...)

We can discard those sub-wordsets that have the same DocSet. (More incidence-matrix residue omitted: the w1 & w matrix over docs d1-d50 and word columns w1-w60.)

0 00 0 00 1 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 00 00 01 00 00 10 00 00 00 6 d26 0 0 1 0 1 0 0 0 0 1 0 0 0 0 0 26 00 0d27 00 0 00 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 00 00 00 00 00 00 00 00 00 0 0 0 0 0 0 1 0 0 0 0 0 0 0 27 00 2d28 00 0 00 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 00 00 00 00 00 00 00 00 00 00 0 10 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 00 00 00 00 00 00 00 00 01 1 0 1 0 0 0 0 1 0 0 0 0 0 28 03 0d29 00 0 00 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 00 00 00 00 00 00 00 00 00 0 1 0 0 0 0 1 0 0 1 0 0 29 00 0d30 30 One possible way to process is to start with 0 d32 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 00 00 00 00 00 00 00 00 00 0 0 0 0 0 0 0 1 0 0 0 32 0 00 00 2 00 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 00 00 00 00 00 00 00 00 00 d33 0 1 0 0 1 0 0 1 0 1 33 00 0 00 0 00 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 00 00 00 00 00 00 00 00 00 0 d35 w1CountVector: 0 0 1 0 1 0 0 0 0 35 00 1d36 00 0 00 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 00 00 00 00 00 00 00 00 00 00 0 00 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 00 00 00 00 00 00 00 00 00 0 0 0 1 0 1 2 0 36 00 0d37 00 0 00 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 00 00 00 00 00 00 00 00 00 0 0 0 0 0 0 0 37 00 1d38 38 01 0 00 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 00 00 00 00 00 00 00 00 00 0 1 1 0 0 0 10 0d39 00 0 00 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 00 00 00 00 00 00 00 00 00 0 0d41 0 0 0 0 0 39 0 41 0 0 d42 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 00 00 00 00 00 00 00 00 00 0 0 1 0 42 00 0d43 00 0 00 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 00 00 00 00 00 00 00 00 00 0 0 0 43 10 0d44 01 0 01 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 00 00 00 00 00 00 00 00 00 00 0 00 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 00 00 00 00 00 00 00 00 00 0 0 44 00 0d45 00 0 00 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 00 00 00 00 00 00 00 00 00 0 45 00 0d46 46 00 0 00 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 00 00 00 00 00 00 00 00 00 00 0d47 47 0 0 0d48 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 00 00 00 00 00 00 00 00 00 48 00 0 00 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 00 00 00 00 00 00 00 00 00 0 d49 49 00 0 00 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 00 00 00 00 00 00 00 00 00 0 d50 50 0 0 0 0 8 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 940 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 3 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 4 0 0 0 0 0

0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 5 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 6 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 7 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 8 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 950 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

FAUST Hull Classification, F(x) = D1 o x: the Scalar pTreeSet SP_F(X) (a column of reals) has its pTrees calculated as mod( int( SP_F(X) / 2^exp ), 2 ), giving the bit-slice pTrees pD1,0 pD1,-1 pD1,-2 pD1,-3 (and pe1,0 ... pe1,-3, pe2,0 ... pe2,-3 for the coordinate functionals). SP_F(X) = SP_D1oX = SUM_i SP_(D1,i*Xi), and we work with SP_F(X) - min.

Example, two classes of 4 points each, with D1 = (1, -1/2):
class 1: a=(1,3), b=(1.5,3), c=(1.2,2.4), d=(.6,2.4);  class 2: e=(2.2,2.1), f=(2.3,3), g=(2,2.4), h=(2.5,2.4).

F(a)=1*1.0-1/2*3.0=-.5    F(b)=1*1.5-1/2*3.0=0     F(c)=1*1.2-1/2*2.4=0     F(d)=1*0.6-1/2*2.4=-.6
F(e)=1*2.2-1/2*2.1=1.15   F(f)=1*2.3-1/2*3.0=.8    F(g)=1*2.0-1/2*2.4=.8    F(h)=1*2.5-1/2*2.4=1.3
(note F(b)=F(c) and F(f)=F(g))

SPD1oX - mn:  a=.1  b=.6  c=.6  d=0  e=1.75  f=1.4  g=1.4  h=1.9;  hull intervals h1=[0,.6], h2=[1.4,1.9].
SPe1oX hull intervals: h1=[.6,1.5], h2=[2,2.5].  SPe2oX hull intervals: h1=[2.4,3], h2=[2.1,3].

Idea: Incrementally build clusters one at a time using all F values. E.g., start with one point, x. Recall that F distances are dominated, which means actual distance >= F difference. If the hull is well developed (close to the convex hull), the max F-difference approximates distance? Then the 1st gap in max F-differences isolates the x-cluster?
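To make the Scalar pTreeSet construction concrete, here is a minimal Python sketch (the names X, D1, bit_slice are mine, not the slides'); it computes F(x) = D1 o x for the eight points above, shifts by the minimum, and pulls out the bit-slice pTrees via mod(int(v / 2^exp), 2):

```python
# Minimal sketch of a scalar pTreeSet and its bit-slice pTrees.
# Points and D1 are taken from the slide; helper names are illustrative.
X = {'a': (1.0, 3.0), 'b': (1.5, 3.0), 'c': (1.2, 2.4), 'd': (0.6, 2.4),
     'e': (2.2, 2.1), 'f': (2.3, 3.0), 'g': (2.0, 2.4), 'h': (2.5, 2.4)}
D1 = (1.0, -0.5)

def F(x, d):
    """Dot-product functional F(x) = d o x."""
    return sum(di * xi for di, xi in zip(d, x))

SP = {k: F(x, D1) for k, x in X.items()}      # scalar pTreeSet (column of reals)
mn = min(SP.values())
SPmn = {k: v - mn for k, v in SP.items()}     # SP_F(X) - min, so all values >= 0

def bit_slice(values, exp):
    """pTree for one bit position: mod(int(v / 2**exp), 2) for each value."""
    return {k: int(v / 2**exp) % 2 for k, v in values.items()}

for exp in (0, -1, -2, -3):                   # pD1,0 pD1,-1 pD1,-2 pD1,-3
    print(exp, bit_slice(SPmn, exp))
```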

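And a companion sketch of the gap idea just stated (the min_gap = .5 threshold is an assumption inferred from the worked numbers that follow); it recomputes the max F-difference table below and isolates each point's core at the first qualifying gap:

```python
# Gap-based x-core finding over max F-differences (sketch; names illustrative).
X = {'a': (1.0, 3.0), 'b': (1.5, 3.0), 'c': (1.2, 2.4), 'd': (0.6, 2.4),
     'e': (2.2, 2.1), 'f': (2.3, 3.0), 'g': (2.0, 2.4), 'h': (2.5, 2.4)}
# The three F values in play: e1 o x, e2 o x and D1 o x with D1 = (1, -1/2).
Fs = [lambda p: p[0], lambda p: p[1], lambda p: p[0] - 0.5 * p[1]]

def mxFdf(x, y):
    """Max F-difference: a lower bound on the actual distance from x to y."""
    return max(abs(F(X[x]) - F(X[y])) for F in Fs)

def core(x, min_gap=0.5):
    """x plus all points inside the 1st gap >= min_gap of x's mxFdf column."""
    dists = sorted((mxFdf(x, y), y) for y in X if y != x)
    for (d1, _), (d2, _) in zip(dists, dists[1:]):
        if d2 - d1 >= min_gap - 1e-9:         # tolerance for float round-off
            return {x} | {y for d, y in dists if d <= d1}
    return None                               # no qualifying gap: no x-core

for x in sorted(X):
    print(x, core(x))   # a-core {a,b,c,d}, no b-core, e-core {e,g,h}, ...
```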
Max F-differences for the same example, mxFdf(x,y) = max over F in {SPe1oX, SPe2oX, SPD1oX} of |F(x)-F(y)|:

mxFdf    a     b     c     d     e     f     g     h
  a      0    .5    .6    .6   1.65  1.3   1.3   1.8
  b     .5     0    .6    .9   1.15   .8    .8   1.3
  c     .6    .6     0    .6   1.15  1.1    .8   1.3
  d     .6    .9    .6     0   1.75  1.7   1.4   1.9
  e    1.65  1.15  1.15  1.75    0    .9   .35    .3
  f    1.3    .8   1.1   1.7    .9     0    .6    .6
  g    1.3    .8    .8   1.4   .35    .6     0    .5
  h    1.8   1.3   1.3   1.9    .3    .6    .5     0

For each x, the 1st sufficiently large gap in x's sorted mxFdf row isolates a candidate x-core (Density = count/radius^2, radius = the core's largest mxFdf):
a-core = {a,b,c,d}: Gap=.7, Density=4/.6^2=11.1.   No b-core.   No c-core.
d-core = {a,b,c,d}: Gap=.5, Density=4/.9^2=4.9.
e-core = {e,g,h}: Gap=.55, Density=3/.35^2=24.5.   No f-core.
g-core = {b,c,e,f,g,h}: Gap=.5, Density=6/.8^2=9.4.
h-core = {e,f,g,h}: Gap=.7, Density=4/.6^2=11.1.

FAUST Oblique, F(x) = D1 o x, with the same setup except a=(3,1.5):
F(a)=1*3.0-1/2*1.5=2.25; F(b),...,F(h) are as before (0, 0, -.6, 1.15, .8, .8, 1.3), so SPD1oX - mn = 2.85, .6, .6, 0, 1.75, 1.4, 1.4, 1.9.

Clustering: Partition; TODO:
http://www.cs.ndsu.nodak.edu/~perrizo/saturday/teach/879s15/dmcluster.htm
http://www.cs.ndsu.nodak.edu/~perrizo/saturday/teach/879s15/dmlearn.htm

1. Attribute selection prior to FAUST (for speed/accuracy, clustering/classification): Treeminer methods, high pTree correlation with class?, high info_gain/gini_index/other? In information theory and machine learning, information gain is a synonym for Kullback-Leibler divergence. The expected value of the information gain is the mutual information I(X; A) of X and A, i.e., the reduction in the entropy of X achieved by learning the state of the random variable A. In machine learning, this concept can be used to define a preferred sequence of attributes to investigate to most rapidly narrow down the state of X. Such a sequence (which depends on the outcome of the investigation of previous attributes at each stage) is called a decision tree. Usually an attribute with high mutual information should be preferred to others. In general terms, the expected information gain is the change in information entropy from a prior state to a state that takes some information as given. Formal definition: Let T denote a set of training examples, each of the form (x1,...,xk, y), where xj is the value of the jth attribute and y is the corresponding class label. The information gain for an attribute A is defined in terms of entropy H as IG(T,A) = H(T) - SUM over values v of A of (|T_A=v|/|T|) * H(T_A=v), where T_A=v is the set of examples in T with A = v. The mutual information is equal to the total entropy for an attribute if, for each of the attribute values, a unique classification can be made for the result attribute; in this case, the relative entropies subtracted from the total entropy are 0. Drawbacks: Although information gain is usually a good measure for deciding the relevance of an attribute, it is not perfect. A notable problem occurs when information gain is applied to attributes that can take on a large number of distinct values.
For example, suppose that one is building a decision tree for some data describing the customers of a business. Information gain is often used to decide which of the attributes are the most relevant, so they can be tested near the root of the tree. One of the input attributes might be the customer's credit card number. This attribute has a high mutual information, because it uniquely identifies each customer, but we do not want to include it in the decision tree: deciding how to treat a customer based on their credit card number is unlikely to generalize to customers we haven't seen before (overfitting). Information gain ratio is sometimes used instead; this biases the decision tree against considering attributes with a large number of distinct values. However, attributes with very low information values may then receive an unfair advantage.

In statistics, dependence is any statistical relationship between two random variables or two sets of data. Correlation refers to any of a class of statistical relationships involving dependence. Familiar examples of dependent phenomena include the correlation between the physical statures of parents and their offspring, and the correlation between the demand for a product and its price. Correlations are useful because they can indicate a predictive relationship that can be exploited in practice. For example, an electrical utility may produce less power on a mild day based on the correlation between electricity demand and weather. In this example there is a causal relationship, because extreme weather causes people to use more electricity for heating or cooling; however, statistical dependence is not sufficient to demonstrate the presence of such a causal relationship (i.e., correlation does not imply causation). Formally, dependence refers to any situation in which random variables do not satisfy a mathematical condition of probabilistic independence. In loose usage, correlation can refer to any departure of two or more random variables from independence, but technically it refers to any of several more specialized types of relationship between mean values. There are several correlation coefficients, often denoted rho or r, measuring the degree of correlation. The most common of these is the Pearson correlation coefficient, which is sensitive only to a linear relationship between two variables. Other correlation coefficients have been developed to be more robust than the Pearson correlation.

(Figure caption: Several sets of (x, y) points, with the Pearson correlation coefficient of x and y for each set. Note that the correlation reflects the noisiness and direction of a linear relationship (top row), but not the slope of that relationship (middle), nor many aspects of nonlinear relationships (bottom). N.B.: the figure in the center has a slope of 0, but in that case the correlation coefficient is undefined because the variance of Y = 0.)
Pearson's product-moment coefficient: The most familiar measure of dependence between two quantities is the Pearson product-moment correlation coefficient, or "Pearson's correlation coefficient", commonly called simply "the correlation coefficient". It is obtained by dividing the covariance of the two variables by the product of their standard deviations. Karl Pearson developed the coefficient from a similar but slightly different idea by Francis Galton.[4] The population correlation coefficient rho_X,Y between two random variables X and Y with expected values mu_X and mu_Y and standard deviations sigma_X and sigma_Y is defined as rho_X,Y = corr(X,Y) = cov(X,Y)/(sigma_X*sigma_Y) = E[(X-mu_X)(Y-mu_Y)]/(sigma_X*sigma_Y), where E is the expected value operator, cov means covariance, and corr is an alternative notation for the correlation coefficient. The Pearson correlation is defined only if both standard deviations are finite and nonzero. It is a corollary of the Cauchy-Schwarz inequality that the correlation cannot exceed 1 in absolute value. The correlation coefficient is symmetric: corr(X,Y) = corr(Y,X). The Pearson correlation is +1 in the case of a perfect direct (increasing) linear relationship (correlation), -1 in the case of a perfect decreasing (inverse) linear relationship (anticorrelation),[5] and some value between -1 and +1 in all other cases, indicating the degree of linear dependence between the variables. As it approaches 0 there is less of a relationship (closer to uncorrelated); the closer the coefficient is to either -1 or +1, the stronger the correlation between the variables. If the variables are independent, Pearson's correlation coefficient is 0, but the converse is not true, because the correlation coefficient detects only linear dependencies between two variables. E.g., suppose the random variable X is symmetrically distributed about 0 and Y = X^2. Then Y is completely determined by X, so that X and Y are perfectly dependent, but their correlation is zero: they are uncorrelated. In the special case where X and Y are jointly normal, uncorrelatedness is equivalent to independence. If we have a series of n measurements of X and Y, written xi and yi, i = 1..n, the sample correlation coefficient can be used to estimate the population Pearson correlation between X and Y: r = SUM_{i=1..n} (xi - xbar)(yi - ybar) / ((n-1)*sx*sy), where xbar and ybar are the sample means of X and Y, and sx and sy are the sample standard deviations of X and Y.
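A minimal Python sketch of the sample coefficient just defined (plain arithmetic, no pTree machinery; the second call replays the Y = X^2 example above, dependent yet uncorrelated):

```python
import math

def pearson_r(xs, ys):
    """Sample Pearson correlation: r = sum((xi-xbar)(yi-ybar)) / ((n-1) sx sy)."""
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    sx = math.sqrt(sum((x - xbar) ** 2 for x in xs) / (n - 1))
    sy = math.sqrt(sum((y - ybar) ** 2 for y in ys) / (n - 1))
    cov = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / (n - 1)
    return cov / (sx * sy)          # undefined if sx or sy is 0

print(pearson_r([1, 2, 3, 4], [2, 4, 6, 8]))    # ~1.0: perfect linear relationship
print(pearson_r([-2, -1, 1, 2], [4, 1, 1, 4]))  # 0.0: Y=X^2, dependent, uncorrelated
```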
Info Gain as an Attribute Selection Measure: minimizes the expected number of tests needed to classify an object and guarantees a simple tree (though not necessarily the simplest). At any stage, let S be a TRAINING SUBSET and let S[C] = {C1,...,Cm} be the distinct classes in S. The EXPECTED INFORMATION needed to classify a sample, given S as the training set, is I(s1,...,sm) = -SUM_{i=1..m} pi*log2(pi), pi = |S_Ci|/|S|. Choosing A (with values a1,...,av) as the decision attribute, the expected information still needed is E(A) = SUM_{j=1..v} ((s1j+...+smj)/|S|) * I(s1j,...,smj), where sij = |S_{A=aj} intersect Ci|. Gain(A) = I(s1,...,sm) - E(A) is the expected reduction of information required to classify after splitting via A-values. The algorithm computes the information gain of each attribute and selects the one with the highest information gain as the test attribute.

The example: Training Data (4 rows x 4 cols; B1 is the class label; its values 2, 3, 7, 10, 15 are C1,...,C5):

Band B1:        Band B2:        Band B3:    Band B4:
 3  3  7  7      7  3  3  2     8 8 4 5     11 15 11 11
 3  3  7  7      7  3  3  2     8 8 4 5     11 11 11 11
 2  2 10 15     11 11 10 10     8 8 4 4     15 15 11 11
 2 10 15 15     11 11 10 10     8 8 4 4     15 15 11 11

S: X-Y  B1   B2   B3   B4
0,0    0011 0111 1000 1011
0,1    0011 0011 1000 1111
0,2    0111 0011 0100 1011
0,3    0111 0010 0101 1011
1,0    0011 0111 1000 1011
1,1    0011 0011 1000 1011
1,2    0111 0011 0100 1011
1,3    0111 0010 0101 1011
2,0    0010 1011 1000 1111
2,1    0010 1011 1000 1111
2,2    1010 1010 0100 1011
2,3    1111 1010 0100 1011
3,0    0010 1011 1000 1111
3,1    1010 1011 1000 1111
3,2    1111 1010 0100 1011
3,3    1111 1010 0100 1011

We need to know the count of pixels (rows in the table above) that contain each value in each attribute, and the count of pixels that contain each pair of values, one from a descriptive attribute and the other from the class label attribute.

EXPECTED INFO needed to classify the sample: I = I(s1,...,sm) = -SUM_{i=1..m} pi*log2(pi), with m=5, s=16 and si = 3,4,4,2,3 (the root counts for the class labels, rc(P1(ci))), so pi = si/s = 3/16, 1/4, 1/4, 1/8, 3/16 and
I = -( 3/16*log2(3/16) + 4/16*log2(4/16) + 4/16*log2(4/16) + 2/16*log2(2/16) + 3/16*log2(3/16) ) = -( -.453 -.5 -.5 -.375 -.453 ) = 2.281.

Take B2 = {a1,...,a5} = {2,3,7,10,11} as the 1st candidate attribute, with a1=0010, a2=0011, a3=0111, a4=1010, a5=1011 and Aj = {t : t(B2)=aj}. sij is the number of samples of class Ci in subset Aj, so sij = rc( P1(ci) ^ P2(aj) ), ci in {2,3,7,10,15}, aj in {2,3,7,10,11}.

ENTROPY based on the partition into subsets by B2: E(B2) = SUM_{j=1..v} (s1j+...+smj)*I(s1j,...,smj)/s, where I(s1j,...,smj) = -SUM_{i=1..m} pij*log2(pij), pij = sij/|Aj|. Since m=5, the sij's are:

        j=1  j=2  j=3  j=4  j=5
s1j      0    0    0    0    3
s2j      0    2    2    0    0
s3j      2    2    0    0    0
s4j      0    0    0    1    1
s5j      0    0    0    3    0
sum      2    4    2    4    4   <- |Aj|, the root counts of the P2(aj)'s

pij = sij/|Aj|:

        j=1  j=2  j=3  j=4  j=5
p1j      0    0    0    0   .75
p2j      0   .5    1    0    0
p3j      1   .5    0    0    0
p4j      0    0    0   .25  .25
p5j      0    0    0   .75   0

I(s1j,...,s5j):              0    1    0   .811  .811
(s1j+...+s5j)*I(...)/16:     0   .25   0   .203  .203

so E(B2) = .656 and GAIN(B2) = I(s1,...,sm) - E(B2) = 2.281 - .656 = 1.625. Similarly GAIN(B3) = I - E(B3) = 1.031 and GAIN(B4) = I - E(B4) = .568, so B2, having the highest gain, is selected as the test attribute.

For Info Gain, IG, we can (and we can do the same for correlation) 1. eliminate attributes based on an IG threshold, or 2. use 1/IG as a weighting in the classification vote, or 3. both!
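To check the arithmetic, here is a short Python sketch of the whole Gain calculation over the 16-pixel table (ordinary counting stands in for the pTree root counts rc(P1(ci) ^ P2(aj)); names are illustrative):

```python
from collections import Counter
from math import log2

# The 16 training pixels as (B1=class, B2, B3, B4) in decimal (from table S).
S = [(3,7,8,11), (3,3,8,15), (7,3,4,11), (7,2,5,11),
     (3,7,8,11), (3,3,8,11), (7,3,4,11), (7,2,5,11),
     (2,11,8,15), (2,11,8,15), (10,10,4,11), (15,10,4,11),
     (2,11,8,15), (10,11,8,15), (15,10,4,11), (15,10,4,11)]

def I(counts):
    """Expected info I(s1,...,sm) = -SUM pi*log2(pi), pi = si/s."""
    s = sum(counts)
    return -sum(si / s * log2(si / s) for si in counts if si)

def E(attr):
    """Entropy remaining after partitioning S on attribute index attr."""
    parts = {}
    for row in S:
        parts.setdefault(row[attr], []).append(row[0])   # class labels per aj
    return sum(len(p) / len(S) * I(Counter(p).values()) for p in parts.values())

info = I(Counter(row[0] for row in S).values())          # I(s1,...,s5) = 2.281
for k, name in [(1, 'B2'), (2, 'B3'), (3, 'B4')]:
    print(f'GAIN({name}) = {info - E(k):.3f}')   # 1.625, 1.031, 0.568: B2 wins
```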
From: Arjun Roy, January 14, 2015 2:03 AM. For starting 'd', we have been using the vector from mean to median, which has worked quite well. But I am curious to know if there is any other starting 'd' which would give even better separation between the classes. I could randomly choose some d's, but I also want to avoid too many computations.

WP: For clustering, we use the vector from mean to median for recursive clustering (to break apart a cluster into subclusters), but those vectors may not be very helpful for classification (of course, they are not hurtful either - the more d's the merrier for classification, right? Computation cost is the only limiting issue). For classification, the vectors connecting pairs of class means (or medians, or ...) would be good vectors to pick (together with the freebees, e1=(1,0,...,0), ..., en=(0,...,0,1)). In the picture below, d seems to connect the two class means? We might also take the "single link" class pair vectors (connecting the pair of points that give the single link distance between the two classes, i.e., the min pairwise distance). Do we know how to find single link distance using pTrees? (Seems like we may have done that????) I apologize if I am forgetting an idea already mentioned in MY lecture slides. ;-) Then of course we can look at the "complete link" vectors (max instead of min). The "average link" vectors are those connecting the means. Of course, the single link and complete link pairs can be approximate.

AR: For the convex hull algorithm, I was looking for the minimum number of d's required to build the class model (kind of an end condition). E.g., in last Saturday's slides, one d was sufficient (the red line) and the black line computation was redundant.

WP: Remember, the black line (using d=e2=(0,1)) and the gold line (using d=e1=(1,0)) are very helpful in eliminating false positives in the case where "none of the classes" is a possibility, right? (e.g., one-class classification). So I recommend (as a minimal set of d's):
1. Use all ei (the freebees).
2. Use all pairwise class mean connectors.
Then, if more are desired, add the following in the order given:
3. Main diagonals (connecting each corner of the (0,x2,...,xn) face with the opposite corner on the (1,x2,...,xn) face). The algorithm for main diagonal endpoints: let x2,...,xn be any sequence of 0's and 1's; D is a main diagonal iff D connects (0,x2,x3,...,xn) to (1,x2',x3',...,xn'), where ' = bit complement (flip each bit). So exactly 2^(n-1) main diagonals?? (A sketch of this enumeration follows this exchange.)
4. All non-main diagonals (basically, fix a set of coordinates at some fixed sequence of 0's and 1's; then, from the other coordinates, pick one, say k, and run a vector from xk=0, with the fixed sequence and all others 0 or 1, to xk=1, with the fixed sequence and all others flipped).
5. Do 3 using xh=1/2.
6. Do 4 using xh=1/2.
So the classifier could build initial hulls using 1 alone (only the freebees, for a quick and dirty FAUST HULL classifier). Then, as the user works with those results, reclassify (add to the class hulls) in the background using 1 and 2; then 1-3; then 1-4; then 1-5; then 1-6.

AR: Is computing variance the only way to choose the next d? I need to check with Mohammad how expensive it is. Graphically, it's intuitive to choose d just by looking at the points, but algorithmically it's quite a computational pain.

WP: Maybe you can re-phrase this question? I'm not catching it.

WP: Alternatively, we could use the freebees, then use the PTM partitioning of the surface of the unit sphere (vectors from the origin to a vertex of a great-circle triangle under PTM or, equivalently, HTM), starting with the partition into 8 triangles (which gives just the freebees, e1, e2, e3), then descend one layer (subdivide all triangles by connecting all side midpoints), etc. Basically, again, at each level in PTM, we would run a vector from the northern hemisphere to the point in the southern hemisphere directly opposite it (or, more simply, from the origin to each northern-hemisphere vertex).
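A tiny sketch of the main-diagonal endpoint rule in item 3 (the function name is mine); the count 2^(n-1) falls out of iterating over all (n-1)-bit tails:

```python
from itertools import product

def main_diagonals(n):
    """Endpoint pairs (0,x2,...,xn) -> (1,x2',...,xn'), ' = bit complement."""
    for tail in product((0, 1), repeat=n - 1):
        start = (0,) + tail
        end = (1,) + tuple(1 - b for b in tail)
        yield start, end

diags = list(main_diagonals(3))
print(len(diags))            # 2**(3-1) = 4 main diagonals in 3-D
for s, e in diags:
    print(s, '->', e)
```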
That way, the vectors D are coincident with the vertexes of the triangles at that level (there are formulas in the literature - see HTM), but it seems like we could get the same thing by:
1. using the freebees {e1,...,en}, which give the top-level PTM triangulation;
2. using all V1 = {ei+ej | i<j}, which I believe gives the 2nd-level PTM triangulation;
3. using all V2 = {v+w | v, w in V1};
4. using all V3 = {v+w | v, w in V2}; and so on.
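A rough sketch of that vertex-generation idea (the normalization back onto the unit sphere is my assumption; the note only writes v+w):

```python
from itertools import combinations
from math import sqrt

def unit(v):
    """Scale a vector back onto the unit sphere (assumed for side midpoints)."""
    n = sqrt(sum(c * c for c in v))
    return tuple(c / n for c in v)

def next_level(V):
    """Next batch of direction vectors: normalized sums of distinct vertices."""
    return {unit(tuple(a + b for a, b in zip(v, w)))
            for v, w in combinations(sorted(V), 2)}

V0 = {(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)}  # freebees e1,e2,e3
V1 = next_level(V0)       # the ei+ej directions (2nd-level PTM-style vertices)
V2 = next_level(V1)       # V2 = {v+w | v,w in V1}, and so on for V3, ...
print(len(V0), len(V1), len(V2))
```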
