Why Not Store Everything in Main Memory? Why use disks?

Today's main takeaway: We can attack both the curse of cardinality (CC) and the curse of dimensionality (CD) with pTree structuring. We data-mine big-cardinality datasets (many rows) by Horizontal Processing of Vertical Data (HPVD). We data-mine big-dimensionality datasets (many columns) with HPVD because it reduces the cost of attribute relevance analysis (eliminating attributes irrelevant to the classification, clustering, ARM, or other task) by orders of magnitude. With respect to attribute relevance analysis, we may be wise to pre-compute not only the bitslice pTrees of all numeric columns and the bitmap pTrees of all categorical columns (the basic pTrees), but also, one time, all attribute-value bitmap pTrees (a bitmap for each value in the extent domain, i.e., the values that actually occur). This amounts to treating numeric columns not only as binary numbers to be bitsliced but also as categorical columns (treating each actual numeric value as a category and bitmapping it). We may even find it worth the trouble to pre-compute the value bitmaps of all composite attributes as well. (See the next few slides.) We need a Genetic Algorithm implementation! (See the next slide.)

TODO:
1. Attribute selection prior to FAUST Oblique (for speed/accuracy): use Treeminer's methods, high pTree correlation with the class?, info_gain, gini_index, other?
2. Replicate the graph below (and expand that accuracy performance study to a full experimental design).
3. Do a comprehensive speed performance study.
4. Implement the FAUST Hull Classifier. Enhance/compare it using a good experimental design and real datasets (Treeminer hasn't gotten to that yet, so we can help a lot there!).
5. Research pTree density-based clustering (https://bb.ndsu.nodak.edu/bbcswebdav/pid-2819939-dt-content-rid-13579458_2/courses/153-NDSU-20309/dmcluster.htm + 1st ppt_set).

Related papers:
DAYVD: Iterative Density-Based Approach for Clusters with VarYing Density, International Journal of Computers and Their Applications, V17:1, ISSN 1076-5204, B. Wang and W. Perrizo, March 2010.
A Hierarchical Approach for Clusters in Different Densities, Proceedings of the International Conference on Software Engineering and Data Engineering, Los Angeles, B. Wang, W. Perrizo, July 2006.
A Comprehensive Hierarchical Clustering Method for Gene Expression Data, ACM Symposium on Applied Computing, ACM SAC 2005, March 2005, Santa Fe, NM, B. Wang, W. Perrizo.
A P-tree-based Outlier Detection Algorithm, ISCA Conference on Applications in Industry and Engineering, ISCA CAINE 2004, Orlando, FL, Nov. 2004 (with B. Wang, D. Ren).
A Cluster-based Outlier Detection Method with Efficient Pruning, ISCA CAINE 2004, Nov. 2004 (with B. Wang, D. Ren).
A Density-based Outlier Detection Algorithm using Pruning Techniques, ISCA CAINE 2004, Nov. 2004 (with B. Wang, K. Scott, D. Ren).
Parameter Reduction for Density-based Clustering on Large Data Sets, ISCA CAINE 2004, Nov. 2004 (with B. Wang).
Outlier Detection with Local Pruning, ACM Conference on Information and Knowledge Management, ACM CIKM 2004, Nov. 2004, Washington, D.C. (with D. Ren).
RDF: A Density-based Outlier Detection Method using Vertical Data Representation, IEEE International Conference on Data Mining, IEEE ICDM 2004, Nov. 2004, Brighton, U.K. (with D. Ren, B. Wang).
A Vertical Outlier Detection Method with Clusters as a By-Product, IEEE International Conference on Tools with Artificial Intelligence, IEEE ICTAI 2004, Nov. 2004, Boca Raton, FL (with D. Ren).

AGNES Clusterer note: start with singleton clusters; for each cluster C, find its closest neighbor C'; if the pair (C, C') qualifies, merge them ("qualifies" can mean always, or d(C, C') < threshold, etc.). Recursively cut a cluster at F-gaps (at the midpoint, or adjusted with the variance); use PCC gaps instead for suspected aberrations.

Mark 11/25: FAUST text classification is capable of accuracy as high as anything out there. On the Stanford newsgroup dataset (7.5K docs), FAUST got an 80% boost by eliminating terms that occur in <= 2 docs, and used chi-squared to reduce attributes by 20% (keeping the 80% best attributes). Vertical data allows us to toss attributes easily before we expend CPU; tossing intelligently improves accuracy and reduces time! We eliminated ~70% of the attributes from the TestSet and achieved better accuracy than the classifiers referenced on the Stanford NLP site! About to turn this loose on datasets approaching 1 TB in size.
Mark 11/26: Adjusting the midpoint as well, based on cluster deviation, gives an extra 4% accuracy. The hull is an interesting case, as we're looking at situations. We are already able to predict which members are poor matches to a class.

TEXT MINING COMMENTS: For a text corpus (d docs as rows and t terms as columns), so far we have recorded only a very tiny part of the total information; there is a wealth of other information not captured. The 2010-2015 notes tried to capture more than just term-existence or term-frequency information (~2012_07-09).

SPRING PLAN: Develop the Treeminer platform and datasets (plus Matt Piehl's datasets) on pTree1, including Hadoop (plus ingest procedures for new datasets, converting them to pTreeSets). Each student: pick a VDMp TopicSet PPT or book, produce an enhanced, up-to-date version, embed audio, and build Blackboard-based AssignmentSets (with solutions) and TestSets (with answers).
Students: Md: Ph.D. CS, VDMp operations; proposal defended. Rajat: MS CS, finished coursework; SE Quality Metrics (Dr. Magel). Maninder: Ph.D. CS (book needs structure); plan: develop in the Treeminer environment. Spoorthy: MS SE. Damian: Ph.D. in SE. Arjun: Ph.D. CS (VDMp algorithms, 2 tasks s15, proposal, journal paper). Arijit: Ph.D. in CS, VDMp data mining of financials; proposal defense this term. Bryan: multilevel pTrees.

FAUST Analytics: X = (X1,...,Xn) in R^n, |X| = N; the training set X u C = (X1,...,Xn, C) adds a class-label column with x.C in {C1,...,Cc}; d = (d1,...,dn) and p = (p1,...,pn) in R^n; F: R^n -> R with F = L, S, or R, each mapping pTreeSets to pTreeSets. (A sketch of these functionals appears below.)
L_d = X o d;  L_d,p = (X-p) o d = X o d - p o d = L_d - p o d
S_p = (X-p) o (X-p) = X o X - 2 X o p + p o p = L_-2p + X o X + p o p
R_d,p = S_p - (L_d,p)^2 = X o X + L_-2p + p o p - (L_d)^2 + 2(p o d) L_d - (p o d)^2
FAUST Top-K Outlier Detector: use rank_(n-1) S_x.

From Mark 1/23/2015: So how would you use a GA in conjunction with FAUST? Use the weightings to adjust the splits? WP: See below, but we used GAs for attribute selection (weighting). For attribute selection (or weighting - selection can be viewed as setting most weights to 0) we think at least three tools will help: correlation with the class column; information gain with respect to the class column (see later slides); and a GA tool (yet to be developed, but it would be for attribute selection from very wide datasets). MS: Also, check this out... we are doing clustering of data in Hadoop, building a classification model directly from the clustering results, and applying it in real time to streaming data (real-time, multi-class classification of text documents). It's all about attribute relevance (well, maybe not "all", but it's huge, especially with 100,000 attributes). I've wanted to play with GA for a while!!!
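Here is a minimal sketch of the three FAUST functionals listed above. In the pTree setting these are computed column-wise on pTreeSets; the plain-array version below is my own simplification just to pin down the algebra, and the class and method names are hypothetical, not Treeminer's API.

```java
// Minimal sketch of the FAUST functionals L, S, R on plain arrays (assumed d, p in R^n).
public class FaustFunctionals {
    static double dot(double[] a, double[] b) {
        double s = 0; for (int i = 0; i < a.length; i++) s += a[i] * b[i]; return s;
    }
    // L_{d,p}(x) = (x - p) o d = x o d - p o d
    static double L(double[] x, double[] d, double[] p) { return dot(x, d) - dot(p, d); }
    // S_p(x) = (x - p) o (x - p)
    static double S(double[] x, double[] p) {
        double s = 0; for (int i = 0; i < x.length; i++) { double t = x[i] - p[i]; s += t * t; } return s;
    }
    // R_{d,p}(x) = S_p(x) - L_{d,p}(x)^2  (squared distance from the line through p in direction d)
    static double R(double[] x, double[] d, double[] p) {
        double l = L(x, d, p); return S(x, p) - l * l;
    }
    public static void main(String[] args) {
        double[] x = {3, 7, 8, 11}, p = {0, 0, 0, 0}, d = {1, 0, 0, 0};
        System.out.printf("L=%.2f S=%.2f R=%.2f%n", L(x, d, p), S(x, p), R(x, d, p));
    }
}
```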
https://www.youtube.com/watch?v=5X65WV0n4rU (everyone should watch this impressive video!)

First we describe and reference the uses of Genetic Algorithms in the two ACM KDD Cup wins.

ACM KDD Cup 2002 (full paper in the ACM SIGKDD Explorations journal: http:///home/perrizo/public_html/saturday/papers/paper): Due to the diversity of the experimental class label and the nature of the attribute data, we needed a classifier that would not require a fixed importance measure on its features, i.e., an adaptable mechanism for arriving at the best possible classifier. In our approach we optimized the weight space W (column-based scaling) with a standard Genetic Algorithm (GA). The set of weights on the features represented the solution space that was encoded for the GA. The AROC value of the classified list was used as the GA fitness evaluator in the search for an optimal solution. RESULTS AND LESSONS: This work resulted in both biological and data-mining insights. Systematic analysis of the relevance of different attributes is essential for successful classification. We found that the function of a protein did not help to classify the hidden system. Sub-cellular localization, which is a sub-hierarchy of the function hierarchy, on the other hand, contributed significantly. Furthermore, it was interesting to note that quantitative information, such as the number of interactions, played a significant role. The fact that a protein has many interactions may suggest that deletion of the corresponding gene would cause changes to many biological processes. Alternatively, a high number of listed interactions may indicate that previous researchers considered the gene important and that it is therefore more likely to be involved in further experiments. For the purpose of classification we didn't have to distinguish between these alternatives.

ACM KDD Cup 2006 (full paper in the ACM SIGKDD Explorations journal: http:///home/perrizo/public_html/saturday/papers/paper): The multiple parameters involved in the two classification algorithms were optimized via an evolutionary algorithm. The most important aspect of an evolutionary algorithm is the evaluation or fitness function that guides it toward an optimized solution. Negative Predictive Value (NPV) and True Negatives (TN) are calculated based on the task specification for KDD Cup 06; NPV = TN/(TN+FN). The fitness function encourages solutions with high TN, provided NPV is within a threshold value. Although the task-specified threshold for NPV was 1.0, with the very low number of negative cases it was hard to expect multiple solutions that met the actual NPV threshold and also maintained a high TN level. In a GA, the collection of quality solutions in each generation potentially influences the set of solutions in the next generation. Since the training data set was small, patient-level bootstrapping was used to validate solutions: take one sample out of the training set for testing and repeat until all samples have been used for testing. In this specific task, which involves multiple candidates for the same patient, we removed all candidates for a given patient from the training set when that patient was used as a test sample. The overall approach of the solution is summarized in the diagram described below.
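Before the walkthrough of that diagram, here is a minimal sketch of the GA weight/parameter-optimization loop both entries relied on: a population of per-attribute weight vectors, tournament selection, uniform crossover, and Gaussian mutation. The evaluate() method is only a stand-in; in the actual entries the fitness was the AROC of the classified list (2002) or TN subject to the NPV threshold (2006). All names here are hypothetical, not the Treeminer API.

```java
import java.util.Arrays;
import java.util.Random;

// Sketch of a GA over per-attribute weight vectors; evaluate() stands in for the real fitness.
public class WeightGA {
    static final Random rnd = new Random(0);

    static double evaluate(double[] w) {             // stand-in fitness (AROC / NPV-constrained TN in the real setup)
        double s = 0; for (double v : w) s -= (v - 0.5) * (v - 0.5); return s;
    }
    static double[] mutate(double[] w) {
        double[] c = w.clone();
        for (int i = 0; i < c.length; i++) if (rnd.nextDouble() < 0.1) c[i] += 0.1 * rnd.nextGaussian();
        return c;
    }
    static double[] crossover(double[] a, double[] b) {
        double[] c = new double[a.length];
        for (int i = 0; i < c.length; i++) c[i] = rnd.nextBoolean() ? a[i] : b[i];
        return c;
    }
    public static void main(String[] args) {
        int nAttr = 20, pop = 30, gens = 50;
        double[][] P = new double[pop][nAttr];
        for (double[] w : P) for (int i = 0; i < nAttr; i++) w[i] = rnd.nextDouble();
        for (int g = 0; g < gens; g++) {
            double[][] next = new double[pop][];
            for (int k = 0; k < pop; k++) {
                double[] a = P[rnd.nextInt(pop)], b = P[rnd.nextInt(pop)];
                double[] p1 = evaluate(a) >= evaluate(b) ? a : b;   // tournament of two
                double[] c = P[rnd.nextInt(pop)], d = P[rnd.nextInt(pop)];
                double[] p2 = evaluate(c) >= evaluate(d) ? c : d;
                next[k] = mutate(crossover(p1, p2));
            }
            P = next;
        }
        double[] best = Arrays.stream(P).max((u, v) -> Double.compare(evaluate(u), evaluate(v))).get();
        System.out.println("best weights ~ " + Arrays.toString(best));
    }
}
```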
As described in that figure, attribute relevance analysis is carried out on the training data to identify the top M attributes. Those attributes are subsequently used to optimize the parameter space with a standard Genetic Algorithm (GA) [8]. The set of weights on the parameters represents the solution space that is encoded for the GA. Multiple iterations, with feedback from the solution evaluation, are used to arrive at the best solution. Finally, the optimized parameters are used in the combined classification algorithm to classify the unknown samples.
[Figure: Training Data -> Attribute Relevance -> Top M Attributes -> Genetic Algorithm (parameters out; fitness fed back from a Nearest Neighbor Vote + Boundary-Based classification algorithm run on the training/test data) -> Optimized Classification Algorithm -> Classified Results.]

Value pTrees in a Data Cube. A data warehouse is usually based on a multidimensional data model which views data in the form of a data cube describing the subject of interest; a data cube allows data to be modeled and viewed in multiple dimensions. Consider an RSI dataset with two feature bands, Red {lo, avg, hi} and Green {lo, avg, hi}, and a Class {bad, avg, good, great} for crop yield. In each cell of the cube we store AND(R-value, G-value, C-value), i.e., the AND of the three value pTrees. Auxiliary dimension tables are added to the central cube for additional information; the fact cube contains the measurement(s) and keys (references) to each of the related dimension tables. So, do we want to pre-compute the roll-ups and store them using data cube software, or do we want to store all the cuboid ANDs separately (that is, store all the roll-ups as pre-computed pTrees)? Rolling up a dimension is pTree ORing: e.g., rolling up Red (ORing the cell ANDs over all Red values) leaves the ANDs of the Green, Class value pairs, so Rollup Red = Class & Green; similarly Rollup Green = Class & Red and Rollup Class = Red & Green, and ORing over both R and G rolls everything up to yield. It's probably better to drill down: start with the R = {lo, avg, hi} and G = {lo, avg, hi} feature-value pTrees and the C = {bad, avg, good, great} class-value pTrees, then drill down by ANDing all value pairs (producing the 2D sides first, then the 3D cube). Or, probably better still, do all of that as pre-computation and store everything ahead of time (or have a background process drilling down to create all of these combo value pTrees while users are using the bitslice pTrees).
[Figure: the roll-up diagrams for the Red x Green x Class cube, plus a generic Product x Date x Country sales-cube illustration.]
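A minimal sketch of the roll-up-by-ORing observation above: each cuboid cell is the AND of the participating value pTrees, and rolling a dimension up is the OR of its cells, which collapses to the AND of the remaining dimensions' value pTrees (e.g., Rollup Red = Green & Class). The names are hypothetical and java.util.BitSet stands in for a pTree.

```java
import java.util.BitSet;

// Roll up one dimension of a (Red, Green, Class) value-pTree cube by ORing its cells.
public class RollupSketch {
    static BitSet and(BitSet a, BitSet b) { BitSet r = (BitSet) a.clone(); r.and(b); return r; }
    static BitSet or(BitSet a, BitSet b)  { BitSet r = (BitSet) a.clone(); r.or(b);  return r; }

    // OR over all Red values of (P_R=r & P_G=g & P_C=c); the result equals P_G=g & P_C=c.
    static BitSet rollupRed(BitSet[] redValuePTrees, BitSet green_g, BitSet class_c) {
        BitSet acc = new BitSet();
        for (BitSet red_r : redValuePTrees)
            acc = or(acc, and(and(red_r, green_g), class_c));
        return acc;
    }
    public static void main(String[] args) {
        BitSet rLo = new BitSet(), rHi = new BitSet(), gHi = new BitSet(), cGood = new BitSet();
        rLo.set(0); rLo.set(1); rHi.set(2); rHi.set(3);          // Red value maps over 4 pixels
        gHi.set(1); gHi.set(3); cGood.set(1); cGood.set(2); cGood.set(3);
        System.out.println(rollupRed(new BitSet[]{rLo, rHi}, gHi, cGood));  // prints {1, 3} = gHi & cGood
    }
}
```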

EXPECTED INFO to classify: I = -SUM(i=1..m) (rc(PCi)/|X|) * log2(rc(PCi)/|X|), with m = 5 classes here.
I = -(3/16*log2(3/16) + 1/4*log2(1/4) + 1/4*log2(1/4) + 1/8*log2(1/8) + 3/16*log2(3/16)) = 2.28
(The GAIN calculations below still use 2.81; see the aside note about arithmetic mistakes.) If the basic pTree root counts are pre-computed (actually just the value pTrees), this is pure arithmetic!

rc(PCi) | pi = rc(PCi)/16 | log2(pi) | pi*log2(pi)
3       | 0.1875          | -2.415   | -0.4528
4       | 0.25            | -2       | -0.5
4       | 0.25            | -2       | -0.5
2       | 0.125           | -3       | -0.375
3       | 0.1875          | -2.415   | -0.4528
-SUM(pi*log2(pi)) = I(c1..cm) = 2.280

ENTROPY: E(A) = SUM(j=1..v) [(s1j+..+smj)/s] * I(s1j..smj), where I(s1j..smj) = -SUM(i=1..m) pij*log2(pij), pij = sij/|Aj|, sij = rc(PC=ci ^ PA=aj), and |Aj| is the root count of the value pTree PA=aj.
[Tables on the original slide: the root counts rc(PC=ci ^ PB2=aj), rc(PC=ci ^ PB3=aj) and rc(PC=ci ^ PB4=aj) for each class value ci and attribute value aj, the resulting pij, the -pij*log2(pij) terms, and the weighted per-value information (s1j+..+s5j)*I(s1j..s5j)/16.]
GAIN(B2) = 2.81 - .89 = 1.92

GAIN(B3) = 2.81 - 1.24 = 1.57
GAIN(B4) = 2.81 - .557 = 2.53
So it's all just arithmetic, except for the #Classes x #FeatureValues ANDs and root counts, rc(PC=ci ^ PBk=aj). Should these be pre-computed at capture time? Are they part of the correlation calculation? Of other often-used calculations? Other speedups include:
1. Use approximate value pTrees: intervalize the feature values and use the interval bitmaps (which can be calculated either from the bitslice or from the value-map pTrees).
2. Using bitslice intervals, we'd pre-calculate all Pi,j,k = PC=ci ^ Pj,k, then use the high-order-bit interval bitmaps Pi,j,Kj, where Kj is the high-order bit of attribute j. For second-high-order-bit intervals, just take Pi,j,Kj & Pi,j,Kj-1, etc.
Aside note: there must be mistakes in the arithmetic above, since I get different GAIN values than on the previous slide. Who can correct them?

The data for this example is an RSI dataset of 16 pixels (4 rows x 4 cols), 4 bitslices per band; band B1 plays the role of the class attribute C (its values 2, 3, 7, 10, 15 are the class values used above):
Band B1 (class): 3 3 7 7 / 3 3 7 7 / 2 2 10 15 / 2 10 15 15
Band B2:         7 3 3 2 / 7 3 3 2 / 11 11 10 10 / 11 11 10 10
Band B3:         8 8 4 5 / 8 8 4 5 / 8 8 4 4 / 8 8 4 4
Band B4:         11 15 11 11 / 11 11 11 11 / 15 15 11 11 / 15 15 11 11
[Figure on the original slide: the 16-row table S (pixel X-Y, B1..B4 in binary), the bitslice pTrees P1,3..P1,0 through P4,3..P4,0, and the value bitmap pTrees Pc=2, Pc=3, Pc=7, Pc=10, Pc=15, PB2=2, PB2=3, PB2=7, PB2=10, PB2=11, PB3=4, PB3=5, PB3=8, PB4=11, PB4=15, each shown with its root count.]
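A minimal sketch, using java.util.BitSet as a stand-in for a pTree, of pre-computing both the bitslice pTrees and the value bitmap pTrees of one column (band B1 above); root counts then come free from cardinality(). The class and variable names are my own.

```java
import java.util.BitSet;
import java.util.Map;
import java.util.TreeMap;

// Build bitslice pTrees and value bitmap pTrees for band B1 of the 16-pixel RSI dataset.
public class ValuePTreeSketch {
    public static void main(String[] args) {
        int[] b1 = {3, 3, 7, 7, 3, 3, 7, 7, 2, 2, 10, 15, 2, 10, 15, 15};
        int bitWidth = 4;

        // Bitslice pTrees P1,3 .. P1,0: one BitSet per bit position, one bit per row.
        BitSet[] slices = new BitSet[bitWidth];
        for (int k = 0; k < bitWidth; k++) slices[k] = new BitSet(b1.length);
        for (int i = 0; i < b1.length; i++)
            for (int k = 0; k < bitWidth; k++)
                if (((b1[i] >> k) & 1) == 1) slices[k].set(i);

        // Value bitmap pTrees: one bitmap per value actually occurring (the extent domain).
        Map<Integer, BitSet> valueMaps = new TreeMap<>();
        for (int i = 0; i < b1.length; i++)
            valueMaps.computeIfAbsent(b1[i], v -> new BitSet(b1.length)).set(i);

        // Root counts; these feed the info-gain arithmetic sketched next.
        valueMaps.forEach((v, p) -> System.out.println("rc(P_B1=" + v + ") = " + p.cardinality()));
        for (int k = bitWidth - 1; k >= 0; k--)
            System.out.println("rc(P_1," + k + ") = " + slices[k].cardinality());
    }
}
```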

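And a minimal sketch of the "it's all just arithmetic" point: given the class-value pTrees and an attribute's value pTrees, I, E(A) and Gain(A) need nothing beyond the root counts rc(PC=ci ^ PA=aj). Again BitSet stands in for a pTree and the names are hypothetical.

```java
import java.util.BitSet;
import java.util.LinkedHashMap;
import java.util.Map;

// Information gain computed purely from root counts of ANDed value pTrees.
public class PTreeInfoGain {
    static double log2(double x) { return Math.log(x) / Math.log(2); }

    static Map<Integer, BitSet> valuePTrees(int[] col) {        // one bitmap per occurring value
        Map<Integer, BitSet> m = new LinkedHashMap<>();
        for (int i = 0; i < col.length; i++) m.computeIfAbsent(col[i], v -> new BitSet()).set(i);
        return m;
    }
    static double info(int[] counts, int total) {               // I(s1..sm)
        double I = 0;
        for (int c : counts) if (c > 0) I -= (c / (double) total) * log2(c / (double) total);
        return I;
    }
    // Gain(A) = I(class) - sum_j (|A=aj|/|S|) * I(class counts restricted to A=aj)
    static double gain(Map<Integer, BitSet> classP, Map<Integer, BitSet> attrP, int n) {
        int[] classCounts = classP.values().stream().mapToInt(BitSet::cardinality).toArray();
        double E = 0;
        for (BitSet a : attrP.values()) {
            int tot = a.cardinality();
            int[] s = new int[classP.size()];
            int i = 0;
            for (BitSet c : classP.values()) {
                BitSet and = (BitSet) a.clone();
                and.and(c);                                      // rc(P_{C=ci} & P_{A=aj})
                s[i++] = and.cardinality();
            }
            if (tot > 0) E += (tot / (double) n) * info(s, tot);
        }
        return info(classCounts, n) - E;
    }
    public static void main(String[] args) {
        // the 16-pixel RSI example: B1 is the class, B2 a candidate feature
        int[] b1 = {3,3,7,7, 3,3,7,7, 2,2,10,15, 2,10,15,15};
        int[] b2 = {7,3,3,2, 7,3,3,2, 11,11,10,10, 11,11,10,10};
        System.out.println("Gain(B2) = " + gain(valuePTrees(b1), valuePTrees(b2), 16));
    }
}
```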

Access to Ptree1 and Ptree2 - William Perrizo. To all my students (plus 3 former students who may be interested in an opportunity to use a pTree environment, replete with massive, real-life big datasets, that was developed by Treeminer Inc., the company that licensed the pTree patents and is becoming a strong player in the vertical data mining area). The only requirement from this side is your willingness to share and willingness to sign a Treeminer NDA. I am particularly interested in having a Genetic Algorithm tool available for everyone to use. As we all know, Dr. Amal Shehan Perera was the lead scientist implementing a GA tool in pTree data mining software, and it helped win two ACM KDD Cups! Since that time my suggestion to anyone has been: "When you're done getting all you can out of your algorithm, GA it!" Baoying (Elizabeth) Wang did some amazing work on pTree clustering (and classification and ARM), and Dr. Imad Rahal did amazing pTree work on classification (and ARM and clustering). That's why I invite Amal, Elizabeth and Imad to use our system once it gets up and going. My hope is that they will be willing to let us use any tools they develop in that system ;-) (synergism). There may be other former DataSURG members who would be interested as well (please advise). Just to update Amal, Elizabeth and Imad: Treeminer is a startup that has implemented a pTree data mining system in Java (about 70,000 lines of code). It is wonderful and is commercially successful. Treeminer has used several big, real-life datasets (captured in pTrees). They have demonstrated both orders-of-magnitude improvement in speed (the traditional focus of pTree technology - speedup through Horizontal Processing of Vertical Data, HPVD) and improvements in accuracy at the same time! This is unheard of and wonderful. Maybe the main reason we are able to get accuracy improvements while getting great speedup is that vertical structuring facilitates fast and effective attribute selection (which might take forever with horizontal data and therefore is not attempted in the horizontal-data world). E.g., when data mining a text corpus of 100,000 vocabulary words (the columns) and 100,000,000 documents (the rows), it is important to find the important words (attribute selection) before running an algorithm (solve the curse of dimensionality first). Amal, Elizabeth and Imad, if you would like to get involved, let us know. We still have Saturday meetings at 10 CST and we have two people who Skype in every week (the more the merrier!). Arjun Roy or Arijit Chatterjee can tell you how to Skype in. Nathan Olson is our department technician who has helped us a lot. At this point we are nearly done getting all of Treeminer's stuff onto the two servers, pTree1 and pTree2, so that you can remote-login, add anything to the GIT repository there, and run and test it against a very impressive suite of datasets. Please note I have posted all recent Saturday notes and other good materials on my web site: http://www.cs.ndsu.nodak.edu/~perrizo - you can get these materials by clicking on the dot (period) to the right of the bullet "media coverage".
William Perrizo: I have put all the Saturday notes for the past few years and many other useful files (e.g., topic synopses) on my web site so you all can access anything you want.
I'm not worried at all about security since, if someone outside our group were to think they had stolen something important from there, they would probably study it and become a pTree data mining expert. How could that be bad? You can get these materials by clicking on the dot (period) to the right of the bullet "media coverage". My homepage is: http://www.cs.ndsu.nodak.edu/~perrizo. There was a time in the past when I contemplated a "pTree Data Mining" monograph. I will put the very rough preliminary version of that book project on the website also. Anyone at all who ever wants to co-author such a thing with me is welcome to work on it using the material at the site. I just wanted to ask if the 4 experienced students (Damian, Mohammad, Arjun and Arijit) would be willing to help the students who will/may join our group (Maninder, Rajat, Spoorthy) and who will then be tasked to do coding, implementation and testing using the Treeminer software environment and Treeminer's big vertical datasets? That would be great and would also help each of you, I believe. I'm sure Mark and Puneet will help also, but they are, no doubt, buried under the weight of other needs right now.

From: Bryan Mesich [email protected]  Sent: Friday, January 23, 2015 12:41 PM
> I had the same question. We can make changes to code on ptree1/2 directly, but it will be difficult via the VI editor. So we have two relevant options:
> 1. Take a copy on your local system, make changes, and move the java file to ptree1/2. Compile and run.
> 2. Take a copy on your local system, make changes, and compile. Move the class file to ptree1/2 and run.
I do all of my development (Perl, C/C++, Assembly, Java, Bash) using VIM. This is my preference. I'm not advocating one IDE/editor over another, as it's a personal choice. The best way to accomplish this would be to use a version control system. That way you can do development on a remote machine and commit your changes. You could then log into ptree[12].ndsu, check out the latest version, and run the code. Version control works well in collaborative efforts when merging code together. Also, I went ahead and installed both Java 7 and Java 8 JDKs on both servers. They are located in /usr/java. By default, Java 8 will be used when running/compiling. Bryan

Rajat: If we have any other alternatives, I would suggest running GIT or SVN.

> From: Arjun Roy
> My IP address is 192.168.0.4
> If for some reason I am not able to copy, Damian can give it a try on campus?
> I think the way we are going to proceed is that we will have the Eclipse software on our client machines (pointing to code on ptree1), but it's going to be compiled on ptree1.
> I don't know if we can directly make changes to code stored on ptree1 or if we would have to make changes locally and push them to ptree1.
> I don't have much knowledge of the Java environment/Eclipse, but does this sound logical (Rajat/Maninder)?
> Thanks, Arjun

> From: Bryan Mesich
> > Please give me remote access. Once I have access, I will put everything on Share.
> > I have used it with JDK 7 without any compiler complaints.
> I need your IP address in order to make the exception in the firewall.
> Also, did you use OpenJDK, or the official JDK from Oracle?

> > Subject: Re: Access to Ptree1 and Ptree2
> > Logins are ready for use. I will be installing a firewall shortly that will only allow on-campus access. If you need remote access, please let me know and I'll make an exception for you in the firewall. Otherwise I'll assume everyone has access unless contacted.
> > > Bryan, I have a few-months-old version of TreeMiner. It's about 2.5 GB. Is it OK if I put the entire thing in my space on ptree1 (might take a while to transfer)?
> > Let's have you put the TreeMiner code under the Shared directory (/stggroups/perrizo/Shared). That way everyone will have access to the code.
> > > For the latest version, we will have to contact Mark and go through some layers of security.
> > We'll want to investigate a way to pull the code in an automated fashion in the future. The current version(s) you have should work for now. Does anyone know what JDK version TreeMiner is using?

Damian Lampl, Fri 1/23/2015 2:12 PM: I'll need remote access, too. My IP is: 208.107.126.159. One reason we wanted to get things centralized on the pTree servers was so we wouldn't have to configure everyone's development environment separately (Java, Eclipse dependencies, etc.), since that's the primary hang-up to getting started with the Treeminer code. I was thinking we could just remote-desktop into the ptree servers using xrdp or something similar, if that works? I really don't know the best way of setting it up so we all have access to the same preconfigured Java/Eclipse environment (.NET and Visual Studio would be a different story: install Visual Studio, done). You mentioned Maven, Bryan, but I'm not familiar with how that works? Or is there a configuration file we can set up that makes sure our local Java, Eclipse, and any other dependencies are configured properly? I agree on a git repository, so we'll have one main trunk and then separate branches for everyone, and when we need to merge with Mark's code we can use the main trunk for that, since he's also using git. Mark also has things working with Hadoop, and some of his datasets will be stored in that.

Maninder Singh, Fri 1/23/2015 12:21 PM: Just a thought - can we think of some way to write a batch file in the VI editor to do some of our work directly on ptree?
Arjun: In my opinion, you can stay back at home and relax this Saturday :) We students will collectively make it operational now that Bryan has created logins for us.

Current Treeminer attribute-selection methods: high pTree correlation with the class?; high information_gain; high gini_index? In information theory and machine learning, information gain is a synonym for Kullback-Leibler divergence. The expected value of the information gain is the mutual information I(X; A) of X and A, i.e., the reduction in the entropy of X achieved by learning the state of the random variable A. In machine learning, this concept can be used to define a preferred sequence of attributes to investigate in order to most rapidly narrow down the state of X. Such a sequence (which depends at each stage on the outcome of investigating the previous attributes) is a decision tree. Usually an attribute with high mutual information is preferred to others. In general terms, the expected information gain is the change in information entropy from a prior state to a state that takes some information as given. Drawbacks: although information gain is a good measure of attribute relevance, it is not perfect, e.g., when applied to attributes that can take on many distinct values. For example, suppose we are building a decision tree from data describing the customers of a business. Information gain is often used to decide which attributes are most relevant, so they can be tested near the tree root. A credit card number uniquely identifies each customer, but deciding how to treat a customer based on the credit card number will not generalize to customers we haven't seen yet (overfitting). Information gain ratio can be used instead: it biases the decision tree against attributes with many distinct values, so that attributes with low information values no longer receive an unfair advantage.

Attribute Selection: http://www.cs.ndsu.nodak.edu/~perrizo/saturday/teach/879s15/dmlearn.htm
In statistics, dependence is any relationship between two random variables or two sets of data. Correlation refers to any of a class of statistical relationships involving dependence, e.g., the correlation between the physical statures of parents and offspring, or between the demand for a product and its price. Correlations can indicate a predictive relationship that can be exploited in practice.
E.g., an electrical utility may produce less power on a mild day based on the correlation between electricity demand and weather. In this example there is a causal relationship, because extreme weather causes the use of more electricity; however, statistical dependence is not sufficient to demonstrate the presence of such a causal relationship (i.e., correlation does not imply causation). Formally, dependence refers to random variables not satisfying a mathematical condition of probabilistic independence. Loosely, correlation is any departure of two or more random variables from independence, but technically it refers to any of several more specialized types of relationship between mean values. There are several correlation coefficients, often denoted r, measuring the degree of correlation, e.g., the Pearson correlation coefficient, which is sensitive only to a linear relationship between two variables. Other correlation coefficients have been developed to be more robust than the Pearson correlation.
[Figure: several sets of (x, y) points, each labeled with the Pearson correlation coefficient of x and y. Correlation reflects the noisiness and direction of a linear relationship (top row) but not its slope (middle row), nor many aspects of nonlinear relationships (bottom row). N.B.: the figure in the center has a slope of 0, but its correlation coefficient is undefined because the variance of Y is 0.]
Pearson's product-moment coefficient: the most familiar measure of dependence between two quantities is the Pearson product-moment correlation coefficient, or "Pearson's correlation coefficient", commonly called simply "the correlation coefficient". It is obtained by dividing the covariance of the two variables by the product of their standard deviations. Karl Pearson developed the coefficient from a similar but slightly different idea by Francis Galton. The population correlation coefficient between two random variables X and Y with expected values muX, muY and standard deviations sigmaX, sigmaY is defined as corr(X,Y) = cov(X,Y)/(sigmaX*sigmaY) = E[(X - muX)(Y - muY)]/(sigmaX*sigmaY), where E is the expected-value operator and cov means covariance. The Pearson correlation is defined only if both standard deviations are finite and nonzero. It is a corollary of the Cauchy-Schwarz inequality that the correlation cannot exceed 1 in absolute value. The correlation coefficient is symmetric: corr(X,Y) = corr(Y,X). The Pearson correlation is +1 in the case of a perfect direct (increasing) linear relationship, -1 in the case of a perfect decreasing (inverse) linear relationship (anticorrelation), and some value between -1 and 1 in all other cases, indicating the degree of linear dependence between the variables. As it approaches 0 there is less of a relationship (closer to uncorrelated); the closer the coefficient is to either -1 or 1, the stronger the correlation between the variables. If the variables are independent, Pearson's correlation coefficient is 0, but the converse is not true, because the correlation coefficient detects only linear dependencies between two variables. E.g., suppose the random variable X is symmetrically distributed about 0 and Y = X^2; then Y is completely determined by X, so X and Y are perfectly dependent, but their correlation is zero: they are uncorrelated. In the special case where X and Y are jointly normal, uncorrelatedness is equivalent to independence. If a series of n measurements of X and Y, written xi and yi for i = 1..n, is available, the sample correlation coefficient can be used to estimate the population Pearson correlation r between X and Y: r = SUM_i (xi - xbar)(yi - ybar) / ((n-1)*sx*sy), equivalently SUM_i (xi - xbar)(yi - ybar) / sqrt(SUM_i (xi - xbar)^2 * SUM_i (yi - ybar)^2), where xbar and ybar are the sample means of X and Y, and sx and sy are the sample standard deviations of X and Y.
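A minimal sketch of computing that sample coefficient; the four (x, y) pairs used in the rank-correlation illustration below give r ≈ 0.754, matching the value quoted there. Class and method names are my own.

```java
// Sample Pearson correlation coefficient: covariance over the product of standard deviations.
public class PearsonR {
    static double pearson(double[] x, double[] y) {
        int n = x.length;
        double mx = 0, my = 0;
        for (int i = 0; i < n; i++) { mx += x[i]; my += y[i]; }
        mx /= n; my /= n;
        double sxy = 0, sxx = 0, syy = 0;
        for (int i = 0; i < n; i++) {
            sxy += (x[i] - mx) * (y[i] - my);
            sxx += (x[i] - mx) * (x[i] - mx);
            syy += (y[i] - my) * (y[i] - my);
        }
        return sxy / Math.sqrt(sxx * syy);          // undefined if either variance is 0
    }
    public static void main(String[] args) {
        // the four pairs from the rank-correlation example below
        double[] x = {0, 10, 101, 102}, y = {1, 100, 500, 2000};
        System.out.println(pearson(x, y));          // ~0.754
    }
}
```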
If x and y are the results of measurements containing error, realistic limits on the correlation coefficient are not -1 to +1 but a smaller range.
Rank correlation coefficients: rank correlation coefficients, such as Spearman's rank correlation coefficient and Kendall's rank correlation coefficient (tau), measure the extent to which one variable increases as the other variable increases, without requiring that increase to be represented by a linear relationship (alternatives to Pearson's coefficient). To illustrate rank correlation, and its difference from linear correlation, consider the 4 pairs (x, y): (0, 1), (10, 100), (101, 500), (102, 2000). As we go from each pair to the next, x increases, and so does y. This relationship is perfect, in the sense that an increase in x is always accompanied by an increase in y. This means that we have a perfect rank correlation, and both Spearman's and Kendall's correlation coefficients are 1, whereas in this example the Pearson product-moment correlation coefficient is 0.7544, indicating that the points are far from lying on a straight line. In the same way, if y always decreases when x increases, the rank correlation coefficients will be -1, while the Pearson product-moment correlation coefficient may or may not be close to -1, depending on how close the points are to a straight line. Although in the extreme cases of perfect rank correlation the two coefficients are equal (both +1 or both -1), this is not in general so, and the values of the two coefficients cannot meaningfully be compared. For example, for the three pairs (1, 1), (2, 3), (3, 2), Spearman's coefficient is 1/2, while Kendall's coefficient is 1/3.
Other measures of dependence: the information given by a correlation coefficient is not enough to define the dependence structure between random variables. The correlation coefficient completely defines the dependence structure only in particular cases, for example when the distribution is a multivariate normal distribution. In the case of elliptical distributions it characterizes the (hyper-)ellipses of equal density, but it does not completely characterize the dependence structure (for example, a multivariate t-distribution's degrees of freedom determine the level of tail dependence). Distance correlation and Brownian covariance/correlation were introduced to address a deficiency of Pearson's correlation: Pearson's r can be zero for dependent random variables, whereas zero distance correlation and zero Brownian correlation imply independence. (See also: association, autocorrelation, coefficient of determination, cointegration, concordance correlation coefficient, cophenetic correlation, copulas, correlation functions, covariance, cross-correlation, ecological correlation, fraction of variance unexplained, genetic correlation, Goodman and Kruskal's lambda, illusory correlation, interclass and intraclass correlation, the modifiable areal unit problem, multiple correlation, the point-biserial coefficient, quadrant count ratio, statistical arbitrage.)

Clustering: partition-based; hierarchical; density-based; grid-based; model-based. http://www.cs.ndsu.nodak.edu/~perrizo/saturday/teach/879s15/dmcluster.htm
AGNES (Agglomerative Nesting), Kaufmann & Rousseeuw (1990): uses the single-link method (the distance between two sets is the minimum pairwise distance); merge the pair of nodes with the greatest similarity.
Eventually all nodes belong to the same cluster. DIANA (Divisive Analysis) is the inverse order of AGNES: start with all objects in one cluster and split by some criterion (e.g., maximize some aggregate or pairwise dissimilarity). Agglomerative clustering doesn't scale well (time complexity O(n^2), n = number of objects) and can never undo what was done previously (a greedy algorithm). Integration with distance-based methods: BIRCH (1996) uses a Cluster Feature tree (CF-tree) and incrementally adjusts the quality of sub-clusters; CURE (1998) selects well-scattered points from a cluster and shrinks them toward the cluster center by a fraction; CHAMELEON (1999) does hierarchical clustering using dynamic modeling. Density clustering: discover clusters of arbitrary shape, handle noise, one scan; needs density parameters as a stop condition. Several interesting studies: DBSCAN (Ester et al., KDD 96); OPTICS (Ankerst et al., SIGMOD 99); DENCLUE (Hinneburg & Keim, KDD 98); CLIQUE (Agrawal et al., SIGMOD 98).

Decision Tree Classification (a flowchart-like tree structure): an internal node denotes a test on an attribute, a branch represents an outcome of the test, and leaf nodes represent class labels or class distributions. Tree pruning identifies and removes branches that reflect noise or outliers. Information gain as an attribute selection measure minimizes the expected number of tests needed to classify an object and guarantees a simple tree (not necessarily the simplest). Let S = {s1,...,sm} be a training subset and S[C] = {C1,...,Cc} the distinct classes in S. The expected information needed to classify a sample given S is I(s1,...,sm) = -SUM(i=1..m) pi*log2(pi), with pi = |S_Ci|/|S|. Choosing A as the decision attribute, the expected information still needed after the split is E(A) = SUM(j=1..v) [(s1j+..+smj)/|S|] * I(s1j,..,smj), where sij = |S_{A=aj} intersect Ci|. Gain(A) = I(s1..sm) - E(A), the expected reduction in the information required to classify after splitting on A's values. The algorithm computes the information gain of each attribute and selects the one with the highest information gain as the test attribute. Branches are created for each value of that attribute and the samples are partitioned accordingly. When a decision tree is built, many branches will reflect anomalies in the training data due to noise or outliers; tree pruning addresses such "overfitting" of the data (classifying situations that are erroneous or accidental). http://www.cs.ndsu.nodak.edu/~perrizo/saturday/teach/879s15/dmlearn.htm

CLASSIFICATION: given a training set T(A1..An, C) with class C and features A1..An, and an unclassified sample (a1..an):
SELECT Max(Count(T.Ci)) FROM T WHERE T.A1=a1 AND T.A2=a2 AND ... AND T.An=an GROUP BY T.C;
i.e., just a selection: C-classification is assigning to (a1..an) the most frequent C-value in R_{A=(a1..an)}. Nearest Neighbor Classification (NNC) selects a set of R-tuples with features similar to the unclassified sample and then lets the corresponding class values vote. NNC won't work well if the vote is inconclusive or if "similar" (near) is not well defined; in that case we build a MODEL of the training set (at, possibly, great one-time expense): eager classifiers model with decision trees, Bayesian methods, neural nets, SVMs, etc. Preparing the data: data cleaning - remove (or reduce) noise by "smoothing" and fill in missing values (with the most common or a statistical value); noise/missing-value management can be done by an NN vote (interpolation). Feature extraction - eliminate irrelevant attributes. Comparing different methods: predictive accuracy (predicting the class label of new data); speed (computation cost of generating and using the model); robustness (roughly the same predictions when training sets are almost the same?); scalability (model construction efficiency on massive datasets).
Decision Trees: each internal node is a test on a feature attribute. Each test outcome is assigned a link to the next level (an outcome is a value, a range of values, or similar). A leaf holds a distribution of classes. Some branches may represent noise or outliers (and should be pruned?). The ID3 algorithm for inducing a decision tree from training tuples is:
1. The tree starts as a single node containing the entire training set.
2. If all training tuples have the same class, this node is a leaf. DONE.
3. Else, use information gain to select the best decision attribute for that node.
4. Create a branch for each value [or interval of values] of the test attribute, partition the training set accordingly, and repeat on each partition.
Information Gain (ID3/C4.5): select the attribute with the highest information gain.

Assume two classes, P and N (positive/negative), and let the set of examples S contain p elements of class P and n elements of class N. The amount of information needed to decide whether an arbitrary example in S belongs to P or N is
I(p, n) = -(p/(p+n))*log2(p/(p+n)) - (n/(p+n))*log2(n/(p+n)).
Assume that, using attribute A, the set S will be partitioned into sets {S1, S2, ..., Sv}. If Si contains pi examples of P and ni examples of N, the entropy, or the expected information needed to classify objects in all subtrees Si, is
E(A) = SUM(i=1..v) [(pi + ni)/(p + n)] * I(pi, ni),
and the information gained by branching on A is Gain(A) = I(p, n) - E(A).
Example (buys_computer): Class P: buys_computer = yes; Class N: buys_computer = no; I(p, n) = I(9, 5) = 0.940. Compute the entropy for age:
age <=30: pi=2, ni=3, I(pi, ni)=0.971;  age 30..40: pi=4, ni=0, I(pi, ni)=0;  age >40: pi=3, ni=2, I(pi, ni)=0.971.
Hence E(age) = (5/14)*I(2,3) + (4/14)*I(4,0) + (5/14)*I(3,2) = 0.69 and Gain(age) = I(p, n) - E(age) = 0.940 - 0.69 = 0.25. Similarly, Gain(income) = 0.029, Gain(student) = 0.151, Gain(credit_rating) = 0.048.

Bayesian: Bayes' theorem: let X be an unclassified sample and H the hypothesis that X belongs to a given class. P(H|X) is the conditional probability of H given X, P(H) is the prior probability of H, and P(H|X) = P(X|H)P(H)/P(X).
Naive Bayesian: given a training set R(A1..An, C) where C = {C1..Cm} is the class label attribute, the naive Bayesian classifier predicts the class of an unknown data sample X to be the class Cj having the highest conditional probability conditioned on X: P(Cj|X) >= P(Ci|X) for all i != j. From Bayes' theorem, P(Cj|X) = P(X|Cj)P(Cj)/P(X); P(X) is constant for all classes, so we maximize P(X|Cj)P(Cj). To reduce the computational complexity of calculating all the P(X|Cj)'s, the naive assumption of class-conditional independence is made: P(X|Cj) = PRODUCT(k) P(xk|Cj).
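A minimal sketch of the naive Bayesian classifier just described: pick the class Cj maximizing P(Cj) * PRODUCT(k) P(xk|Cj). The Laplace-style smoothing and the tiny categorical training set are my own additions for illustration; none of this is tied to any particular library.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.Set;

// Naive Bayes over categorical attributes, maximizing log P(Cj) + sum_k log P(xk|Cj).
public class NaiveBayesSketch {
    public static String classify(String[][] train, String[] labels, String[] x) {
        Set<String> classes = new LinkedHashSet<>(Arrays.asList(labels));
        String best = null; double bestScore = Double.NEGATIVE_INFINITY;
        for (String c : classes) {
            int nc = 0;
            for (String l : labels) if (l.equals(c)) nc++;
            double logp = Math.log(nc / (double) labels.length);     // log P(Cj)
            for (int k = 0; k < x.length; k++) {
                int match = 0;
                for (int i = 0; i < train.length; i++)
                    if (labels[i].equals(c) && train[i][k].equals(x[k])) match++;
                logp += Math.log((match + 1.0) / (nc + 2.0));        // log P(xk|Cj), smoothed
            }
            if (logp > bestScore) { bestScore = logp; best = c; }
        }
        return best;
    }
    public static void main(String[] args) {
        String[][] train = {{"<=30","hi"},{"<=30","lo"},{"31..40","hi"},{">40","lo"},{">40","lo"}};
        String[] labels  = {"no","no","yes","yes","no"};
        System.out.println(classify(train, labels, new String[]{"<=30","lo"}));
    }
}
```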



APPENDIX: pTree Text Mining (from the 2012_08_04 notes). The original slides lay these structures out as a data cube over three sample documents (HHS, LMM, JSE) and a small vocabulary (a, again, all, always, an, and, apple, April, are, ...); only the recoverable organization is summarized here. For a corpus of DocCount documents over a vocabulary of VocabLen terms, with mdl = maximum document length (reading positions per document):
- The overall level-0 pTree, corpusP (one bit per document x term x reading position), has length mdl * VocabLen * DocCount.
- Level-1 pTrees (one bit per document x term) have length VocabLen * DocCount. The TermExistence pTree teP uses the predicate "NOT pure0" applied to each mdl-stride; the TermFrequency bitslice pTrees tfPk are likewise predicates on mdl-strides (e.g., the predicate of tfP0 is mod(sum(mdl-stride), 2) = 1).
- Level-2 pTrees (one bit per vocabulary term) have length VocabLen, e.g., the HighDocFrequency pTree hdfP (predicate "NOT pure0" applied to tfP1). A document-frequency count dfk computed directly isn't a level-2 pTree, since it's not a predicate on level-1 te strides; a later slide shows how to do it differently so that even the dfk's come out as level-2 pTrees.
- Masks such as a Verb pTree, an EndOfSentence pTree, a References pTree, a LastChapter pTree, or a Preface pTree can be ANDed into the P(t=, d=) pTrees before they are concatenated as above (or repetitions of the mask can be ANDed in after they are concatenated).
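A minimal sketch of the level-1/level-2 quantities just summarized (term existence te, term frequency tf, document frequency df) for a toy two-document corpus. This uses my own simplified flat layout, not the notes' exact mdl-stride pTreeSet encoding; the tfPk bitslices would be the bits (tf >> k) & 1 laid out over the same doc x term order.

```java
import java.util.Arrays;
import java.util.BitSet;
import java.util.SortedSet;
import java.util.TreeSet;

// Build te (doc x term existence), tf (doc x term frequency) and df (per-term doc frequency).
public class TextPTreeSketch {
    public static void main(String[] args) {
        String[] docs = {
            "humpty dumpty sat on a wall humpty dumpty had a great fall",
            "little miss muffet sat on a tuffet eating her curds and whey"
        };
        // vocabulary = all white-space separated strings, as in the Mother Goose pBase below
        SortedSet<String> vocab = new TreeSet<>();
        for (String d : docs) vocab.addAll(Arrays.asList(d.split("\\s+")));

        int D = docs.length, V = vocab.size();
        String[] terms = vocab.toArray(new String[0]);
        int[][] tf = new int[D][V];                               // term frequencies per (doc, term)
        for (int d = 0; d < D; d++)
            for (String w : docs[d].split("\\s+"))
                tf[d][Arrays.binarySearch(terms, w)]++;

        // level-1 te pTree (one bit per doc x term) and level-2 df values (one per term)
        BitSet te = new BitSet(D * V);
        int[] df = new int[V];
        for (int d = 0; d < D; d++)
            for (int t = 0; t < V; t++)
                if (tf[d][t] > 0) { te.set(d * V + t); df[t]++; }

        for (int t = 0; t < V; t++)
            System.out.printf("%-8s df=%d  tf(doc0)=%d tf(doc1)=%d%n", terms[t], df[t], tf[0][t], tf[1][t]);
    }
}
```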

I have put together a pBase of 75 Mother Goose Rhymes or Stories, and created a pBase of 15 of the documents with a 30-word Universal Document Length (UDL), using as vocabulary all white-space-separated strings. APPENDIX: level-1 (term frequency/existence) pTrees for two of the documents; only the nonzero vocabulary positions are shown (the vocabulary has 182 entries):
Little Miss Muffet, Lev1: pos 1 "a": te=1, tf=2 (tf1=1, tf0=0); pos 6 "and": te=1, tf=3 (tf1=1, tf0=1); pos 14 "away.": te=1, tf=1 (tf1=0, tf0=1); pos 23 "beside": te=1, tf=1 (tf1=0, tf0=1); all other positions 0.
Humpty Dumpty, Lev1: pos 1: te=1, tf=2 (tf1=1, tf0=0); pos 2: te=1, tf=1 (tf1=0, tf0=1); pos 3: te=1, tf=2 (tf1=1, tf0=0); pos 6: te=1, tf=1 (tf1=0, tf0=1); all other positions 0.
[The Lev-0 matrices (one bit per vocabulary term per reading position of each document, e.g., "... sat on a tuffet eating ... of curds and whey. There came a big spider and sat down ..." for Little Miss Muffet) are shown as bit matrices on the original slides and are omitted here.]

0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 Lev-0 5 on 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2 3 4 5 05HDS Humpty Dumpty a 0 0 again. 0 0 all 0 0 always 0 0 an 0 0 and 0 0 apple 0 0 April 0 0 are 0 0 around 0 0 ashes, 0 0 away 0 0 away 0 0 away. 0 0 baby 0 0 baby. 0 0 bark! 0 0 beans 0 0 beat 0 0 bed, 0 0 Beggars 0 0 begins. 0 0 beside 0 0 between 0 0 sat 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 on 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 a 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 your 1 0 4 sat 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 6 7 8 9 a tuffet eating 1 0 0 0 0 0 0 0 0

(Slide: level-0 position bitmaps continued, roughly columns 10-20 of Little Miss Muffet, "... and whey. There came a big spider and sat down ...".)

(Slide: the remaining level-0 columns, e.g., column 10 "of curds" for Little Miss Muffet and column 8 "wall." for Humpty Dumpty.)

Level-2 pTrees (document frequency). The df count per vocabulary term, with its bit slices df3 df2 df1 df0: a: 8; again.: 1; all: 3; always: 1; an: 1; and: 13; apple: 1; April: 1; are: 1; around: 1; ashes,: 1; away: 2; away: 2; away.: 1; baby: 1; baby.: 1; bark!: 1; beans: 1; beat: 1; bed,: 1; Beggars: 1; begins.: 1; beside: 1; between: 1; ... (The slide also shows the term-existence columns te04, te05, te08, te09, te27, te29, te34.)

FAUST Clustering1: the L-Gap Clusterer. Cut at C, mid-gap, using the next (d,p) from dpSet, where F = L | S | R. With D = d35, project every doc onto D (the x o d column on the slide) and cut at the gaps:

C1 (.17 <= xod <= .25) = {2,3,6,16,18,22,42,43,49}
C2 (.34 <= xod <= .56) = {1,4,5,8,9,12,14,15,23,25,27,32,33,36,37,38,44,45,47,48}
C3 (.64 <= xod <= .86) = {10,11,13,17,21,26,28,29,30,39,41,50}
Singletons: 46 (xod = .99), 7 (= 1.16), 35 (= 1.47)

(Going back to D = d35: how close does HOB come? Bits 2^1 and 2^0 separate 35, 2^-1 separates 7 and 50, and 2^-2 separates the .27s.)

Next, on each Ck try D = sum of all Ck docs, Threshold = .2. Recursing this way: C311 (.63) = {11,17,29}, C312 (.84) = {13,30,50}, C313 (.95) = {10,26,28,41}; 21 is an outlier.
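A minimal sketch of this gap-cut step (assuming documents are rows of a numeric doc-by-term matrix and D is either one document or a class sum; the 0.2 gap threshold mirrors the Thres=.2 above, everything else is illustrative):

import numpy as np

def l_gap_clusters(X, D, gap=0.2):
    """Project rows of X onto unit vector D and cut wherever consecutive
    sorted projection values differ by more than `gap` (mid-gap cut)."""
    d = D / np.linalg.norm(D)
    xod = X @ d                                 # x o d values, one per document
    order = np.argsort(xod)
    clusters, current = [], [order[0]]
    for prev, nxt in zip(order, order[1:]):
        if xod[nxt] - xod[prev] > gap:          # a gap: cut here
            clusters.append(current); current = []
        current.append(nxt)
    clusters.append(current)
    return clusters, xod

# Example: D = one chosen document (like d35), then recurse with D = sum of a cluster.
X = np.random.rand(50, 60)                      # stand-in doc-by-term matrix
top, xod = l_gap_clusters(X, X[34])             # D = d35 (0-indexed row 34)
sub, _ = l_gap_clusters(X[top[0]], X[top[0]].sum(axis=0))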

Recursing on C1 with D = sum of all C1 docs: C11 (xod = .42) = {2,3,16,22,42,43}; 6, 18 and 49 are outliers. Recursing on C3 with D = sum of all C3 docs: C31 (.56 <= xod <= 1.03) = {10,11,13,17,21,26,28,29,30,41,50}; 39 is an outlier. D = sum of all C31 docs then gives C311, C312, C313 as above, with 21 an outlier. (Under other D's: the 0's, .25s and .51s form clusters; d10, d11, d17, d21 come out as outliers; 35, 7, 50 come out as outliers; {28,30,39,41,46} forms a cluster.) Other clustering methods later. The slide lists the sorted xod value of every document under each D; those columns are omitted here.

The clusters, by document (opening words shown; the full rhyme texts are written out on the slide):
C11: 2 (This little pig went to market), 3 (Diddle diddle dumpling, my son John), 16 (Flour of England, fruit of Spain), 22 (Had a little husband no bigger than my thumb), 42 (Bat bat, come under my hat), 43 (Hark hark, the dogs do bark).
C2: 1 (Three blind mice), 4 (Little Miss Muffet), 5 (Humpty Dumpty), 8 (Jack Sprat), 9 (Hush baby), 12 (There came an old woman from France), 14 (If all the seas were one sea), 15 (Great A, little a), 23 (How many miles is it to Babylon), 36 (Little Tommy Tittlemouse), 37 (Here we go round the mulberry bush), 38 (If I had as much money as I could tell), 44 (The hart he loves the high wood), 47 (Cocks crow in the morn), 48 (One two, buckle my shoe), plus 25, 27, 32, 33, 45.
C311: 11 (One misty moisty morning), 17 (Here sits the Lord Mayor), 29 (When little Fred went to bed).
C312: 13 (A robin and a robin's son), 30 (Hey diddle diddle), 50 (Little Jack Horner).
C313: 10 (Jack and Jill), 26 (Sleep baby sleep), 28 (Baa baa black sheep), 41 (Old King Cole).
FAUST Cluster 1.2 (real HOB, Alternate WS0, DS0). Alternate between a high-order-bit WordSet (WS) and the DocSet (DS) it induces, and recurse:

WS0 = {2,3,13,20,22,25,38,42,44,49,50} -> DS1 -> WS1 = {2,20,25,46,49,51}: OUTLIER doc 46 (Tom Tom the piper's son).
DS0 -> WS1 = {7,10,17,23,25,28,33,34,37,40,43,45,50} -> DS2 = {35}: OUTLIER doc 35 (Sing a song of sixpence).
WS0 = {2,3,13,32,38,42,44,52} -> DS1 -> WS1 = {42 (Mother)}: C1, the Mother theme = docs 7 (Old Mother Hubbard), 9 (Hush baby), 27 (Cry baby cry), 29 (When little Fred went to bed), 45 (Bye baby bunting).
DS0 -> WS1 = {2,9,12,18,19,21,26,27,30,32,38,39,42,44,45,47,49,52,54,55,57,60} -> WS2 = {12,19,26,39,44}: OUTLIER doc 10 (Jack and Jill).
WS0 = {22,38,44,52} -> DS1 -> WS1 = {27,38,44}, i.e. fiddle(32,41), man(11,32), old(11,44): C2, the fiddle/old-man theme = docs 11 (One misty moisty morning), 32 (Jack come and give me your fiddle), 41 (Old King Cole).
DS0 -> WS1 = {2,9,18,21,30,38,41,45,47,49,52,54,55,57,60} -> WS2 = {2,9,18,30,39,45,55}: OUTLIER doc 39 (A little cock sparrow).
WS0 = {38,52} -> DS1 -> WS1 = {38,52}: C3, the men/three theme = docs 1 (Three blind mice), 5 (Humpty Dumpty), 14 (If all the seas were one sea), 17 (Here sits the Lord Mayor), 23 (How many miles to Babylon), 28 (Baa baa black sheep), 36 (Little Tommy Tittlemouse), 48 (One two, buckle my shoe).
DS0 -> WS1 = {2,5,8,13,14,15,16,22,24,25,29,36,41,44,47,48,51,54,57,59} -> WS2 = {4,13,47,51,54}: OUTLIER doc 13 (A robin and a robin's son).
WS0 = {2,5,8,11,14,15,16,22,24,25,29,31,36,41,44,47,48,53,54,57,59} -> DS1 -> WS1 (17 words) = {2,5,11,15,16,22,24,25,29,31,41,44,47,48,54,57,59}: C4 = docs 4 (Little Miss Muffet), 6 (See a pin), 8 (Jack Sprat), 12 (old woman from France), 15 (Great A, little a), 18 (I had two pigeons), 21 (Lion and Unicorn), 25 (There was an old woman), 26 (Sleep baby sleep), 30 (Hey diddle diddle), 33 (Buttons, a farthing a pair), 37 (mulberry bush), 43 (Hark hark), 44 (The hart he loves the high wood), 47 (Cocks crow in the morn), 49 (There was a little girl), 50 (Little Jack Horner).
DS0 -> WS1 = {6,7,8,14,43,46,48,51,53,57} -> DS2 = DS1 = {2,3,16,22,42}: each of the 10 words occurs in exactly 1 doc, so all 5 docs (2, 3, 16, 22, 42) are outliers.

(The full rhyme texts are written out on the slides; they are abbreviated to opening words here.)

OUTLIER: doc 38 (If I had as much money as I could tell ...).

Notes: Using HOB, the final WordSet is the document cluster theme! When the theme is too long to be meaningful (as for C4) we can recurse on those clusters (using the opposite DS0|WS0?). The other thing to note is that DS0 almost always gave us an outlier (except for C5) and WS0 almost always gave us clusters (except for the first one, 46). What happens if we reverse it? What happens if we just use WS0?

FAUST Cluster 1.2.1 (real HOB, Alternate WS0, DS0, recursing on C3 and C4):
C4.1: on C4, DS0|WS1 = {41,47,57} -> WS2 = {41 (morn), 57 (way)}, DS3 = DS2 = {37,47}: docs 37 (mulberry bush) and 47 (Cocks crow in the morn), theme morn/way.
On the rest of C4, WS0 = {2,5,11,15,16,22,24,25,29,31,44,47,54,59} stabilizes at WS3 = WS2 = {2,15,16,24,25,44,47,54,59}: the final WordSet is too long, so recurse (C4.2).
C4.2.1: word 47 (plum) = docs 21 (Lion and Unicorn ... plum cake) and 50 (Little Jack Horner ... plum).
C4.2.2: word 44 (old) and word 59 (woman) = docs 12 (old woman from France) and 25 (There was an old woman).
Doc 26 and doc 30 have none of the 12 words in common, so these two come out as outliers on the next recursion: OUTLIERS 26 (Sleep baby sleep) and 30 (Hey diddle diddle).
WS0 = {5,11,22,25,29,31}: OUTLIER doc 6 (See a pin and pick it up).
C4.2.3 (day, eat, girl): DS0|WS1 = {22,25,29} = docs 4 (Little Miss Muffet), 8 (Jack Sprat), 15 (Great A, little a), 18 (I had two pigeons), 33 (Buttons), 49 (There was a little girl); recursing gives no change. Doc 43 and doc 44 have none of the 6 words in common, so these two come out as outliers on the next recursion: OUTLIERS 43 (Hark hark) and 44 (The hart he loves the high wood).
Recurse on C3: C31 = [21]cut [38]men [49]run = docs 1 (Three blind mice), 14 (If all the seas were one sea), 17 (Here sits the Lord Mayor). C32 = [38]men [52]three = docs 5 (Humpty Dumpty), 23 (How many miles to Babylon), 28 (Baa baa black sheep), 36 (Little Tommy Tittlemouse), 48 (One two, buckle my shoe).

FAUST Cluster 1.2.2 (HOB Alternate WS0, DS0): OUTLIERS 2, 3, 6, 10, 13, 16, 22, 26, 30, 35, 38, 39, 42, 43, 44, 46. Categorize clusters (hub-spoke, cyclic, chain, disjoint, ...)? Separate disjoint sub-clusters? Each of the 3 C4.2.3 words gives a disjoint cluster, and each of the 2 C32 words gives disjoint sub-clusters as well: day = {15, 18}; eat = {4, 8}; girl = {33, 49}. The word-labeled sub-clusters found so far: mother = {7, 9, 27, 29, 45}; fiddle/old/men (cyclic) = {11, 32, 41}; cut/men/run = {1, 14, 17}; men = {5, 36}; three = {23, 28, 48}; plum = {21, 50}; old/woman = {12, 25}; morn/way = {37, 47}. (The slides repeat the full rhyme texts for each of these docs.)

Let's pause and ask "What are we after?" Of course it depends upon the client. Three main categories for relationship mining: text corpora, market baskets (including recommenders), bioinformatics? Others? What do we want from text mining? (anomaly detection, cliques, bicliques?) What do we want from market basket mining? (future purchase predictions, recommendations, ...) What do we want in bioinformatics? (cliques, strong clusters, ...?)

FAUST Cluster 1.2.3: a word-labeled document graph. Let's stare at what we got and try to see what we might wish we had gotten in addition. Can we capture more of the salient subgraphs? Of course we can capture a sub-graph for each word, but that might be 100,000 of them.

(The word-labeled document graph itself is drawn on the slide: docs as nodes, edges labeled by shared words such as bread, bake, old, men, three, plum, pie, baby, king, ... .) A bake-bread sub-corpus would have been strong (docs 7, 21, 35, 42). There are many others; we have captured only a few of the salient subgraphs. (A small sketch of assembling such a graph follows below.) HOB2 Alt (use other HOBs).

Using AVG+1 (words 2, 9, 10, 25, 45, 47):
      w2  w9  w10  w25  w45  w47
d21:   0   0    1    0    0    1
d35:   0   0    1    1    1    0
d39:   1   1    0    0    1    0
d46:   1   0    0    1    0    0
d50:   0   1    0    1    1    1
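A minimal sketch of assembling the word-labeled document graph (documents as nodes, one labeled edge per shared word; the three docs and their word sets below are just a hand-picked fragment of the MGR corpus):

from itertools import combinations

def word_labeled_graph(words_of):
    """Return edges (di, dj, word) for every word shared by two documents."""
    edges = []
    for (d1, w1), (d2, w2) in combinations(words_of.items(), 2):
        for w in w1 & w2:
            edges.append((d1, d2, w))
    return edges

print(word_labeled_graph({21: {"plum", "crown"}, 50: {"plum", "pie"}, 35: {"pie", "king"}}))
# [(21, 50, 'plum'), (50, 35, 'pie')]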

(The word-labeled document graph continues on this slide, with the same rhyme texts repeated; e.g., doc 12's bits for the words 15, 44 (old) and 59 (woman) are shown.) And if we want to pull out a particular word cluster, just turn the word-pTree into a list (the slide does this for w=baby and w=boy). For a particular doc cluster, just turn the doc-pTree into a list. (A one-line sketch follows below.)

Recurse with wAvg+2, dAvg-1 (words 2, 9, 10, 25, 45): docs d35, d39, d46, d50.
FAUST Cluster 1.2.4, wAvg+1, dAvg+1 (words 2, 9, 10, 25, 45, 47 over docs d21, d35, d39, d46, d50): the same bit matrix as the AVG+1 table above.
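"Turn the pTree into a list" is literally a one-liner once the pTree is materialized as a bit list; a sketch (the bit values and labels below are illustrative):

def ptree_to_list(bits, labels):
    # A word-pTree (one bit per doc) becomes the list of docs containing the word;
    # a doc-pTree (one bit per word) becomes the list of words in the doc.
    return [lab for bit, lab in zip(bits, labels) if bit]

baby_ptree = [0, 1, 0, 1, 1]                       # illustrative bits over d1..d5
print(ptree_to_list(baby_ptree, ["d1", "d2", "d3", "d4", "d5"]))   # ['d2', 'd4', 'd5']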

FAUST HULL Classification 1. Using the clustering of FAUST Clustering1 as classes, we extract 80% from each class as the TrainingSet (with class = cluster#). How accurate is FAUST Hull Classification on the remaining 20%, plus the outliers (which should come out as "Other")? Use Lpd, Sp, Rpd with p = ClassAvg and d = unitized ClassSum.

Full classes from the FAUST Clustering1 slide:
C11 = {2,3,16,22,42,43}; C2 = {1,4,5,8,9,12,14,15,23,25,27,32,33,36,37,38,44,45,47,48}; C311 = {11,17,29}; C312 = {13,30,50}; C313 = {10,26,28,41}; OUTLIERS: {18,49}, {6}, {39}, {21}, {46}, {7}, {35}.

80% Training Set: C11 = {2,16,22,42,43}; C2 = {1,5,8,9,12,15,25,27,32,33,36,37,38,44,47,48}; C311 = {11,17}; C312 = {30,50}; C313 = {10,28,41}.
20% Test Set: C11 = {3}; C2 = {4,14,23,45}; C311 = {29}; C312 = {13}; C313 = {26}; Other = {18,49,6,39,21,46,7,35}.

For each D (D1=TS, D11=C11, D2=C2, D311=C311, D312=C312, D313=C313), with p = the corresponding class average, the slide tabulates the [MIN, MAX] Lpd interval of every class. All 6 class hulls are separated using Lpd with p = CLavg, D = CLsum: D311 separates C311, D312 separates C312, and D313 separates C313 from all others; D2 separates C11 and C2.

Now, remove some false positives with S and R using the same p's and d's. The Sp interval tables (e.g., under D1=TS: C11 [1.9, 2.1], C2 [1.8, 3.8], C311 [2.4, 3.4], C312 [4.6, 4.7], C313 [4.2, 5.4]) show that Sp removes a lot of the potential for false positives (many of the classes lie at a single distance from p). The Rpd interval tables (e.g., under D1=TS: C11 [1.3, 1.4], C2 [1.3, 1.9], C311 [1.5, 1.8], C312 [2.1], C313 [2.0, 2.2]) remove even more of the potential for false positives.
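A minimal sketch of the hull build-and-classify loop described above. It assumes the usual FAUST functionals Lpd(x) = (x-p).d, Sp(x) = (x-p).(x-p) and Rpd(x) = sqrt(Sp - Lpd^2), with p = class average and d = unitized class sum, and one [min, max] interval per class and functional; a test doc must fall inside all of a class's intervals to get that label, otherwise it is "Other". All names are illustrative:

import numpy as np

def functionals(X, p, d):
    L = (X - p) @ d                              # Lpd
    S = ((X - p) ** 2).sum(axis=1)               # Sp
    R = np.sqrt(np.maximum(S - L ** 2, 0))       # Rpd
    return np.stack([L, S, R], axis=1)

def fit_hulls(classes):                          # classes: dict name -> doc matrix
    hulls = {}
    for name, Xc in classes.items():
        p = Xc.mean(axis=0)
        d = Xc.sum(axis=0); d = d / np.linalg.norm(d)
        F = functionals(Xc, p, d)
        hulls[name] = (p, d, F.min(axis=0), F.max(axis=0))
    return hulls

def classify(x, hulls):
    # First matching class wins in this sketch; no match means "Other".
    for name, (p, d, lo, hi) in hulls.items():
        f = functionals(x[None, :], p, d)[0]
        if np.all((lo <= f) & (f <= hi)):
            return name
    return "Other"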
FAUST Hull Classification 2 (TESTING). Test Set: C11 = {3}; C2 = {4,14,23,45}; C311 = {29}; C312 = {13}; C313 = {26}; Other = {18,49,6,39,21,46,7,35}. The slide repeats the training-set Lpd, Sp and Rpd class-interval tables for D1=TS, D11=C11, D2=C2, D311, D312 and D313 (p = the class average in each case), ready to be applied to the test docs.

Applying the hulls to the test docs (D = TS, p = avTS): the slide tabulates each test doc's Lpd, Sp and Rpd values, its true class, the class predicted by each of R, S and L, and the final prediction. Result: 8/15 = 53% correct just with D=TS, p=AvgTS. Note: it's likely to get worse as we consider more D's.

Let's think about TrainingSet quality resulting from clustering. This is a poor-quality TrainingSet (from clustering Mother Goose Rhymes): MGR is a difficult corpus to cluster, since almost every MGR document is isolated (an outlier), so the clustering is vague (no two MGRs deal with the same topic, so their word use is quite different). Instead of tightening the class hulls by replacing CLASSmin and CLASSmax with CLASSfpci (fpci = first precipitous count increase) and CLASSlpcd, we might loosen the class hulls (since we know the classes are somewhat arbitrary) by expanding the [CLASSmin, CLASSmax] interval as follows: let A = Avg{CLASSmin, CLASSmax} and R (for radius) = A - CLASSmin (= CLASSmax - A also), and use [A-R-ε, A+R+ε] (a small sketch follows below). Letting ε = .8 increases accuracy to 100% (assuming all Other stay Other).

Finally, it occurs to me that clustering to produce a TrainingSet, then setting aside a TestSet, gives a good way to measure the quality of the clustering: if the TestSet part classifies well under the TrainingSet part, the clustering must have been high quality (it produced a good TrainingSet for classification). This clustering quality test method is probably not new (check the literature?). If it is new, we might have a paper here (discuss this quality measure and assess it using different ε's?).
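Since A - R = CLASSmin and A + R = CLASSmax, the loosened hull is just the original interval padded by ε on each side; a tiny sketch:

def loosen(class_min, class_max, eps=0.8):
    # A = midpoint, R = radius: [A-R-eps, A+R+eps] equals [class_min-eps, class_max+eps]
    a = (class_min + class_max) / 2.0
    r = a - class_min
    return (a - r - eps, a + r + eps)

print(loosen(0.106, 0.439))   # about (-0.694, 1.239): the C2 Lpd hull under D1=TS, eps = .8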
WP, Wed 11/26: Yes, we have also discovered that one has to think about the quality of the training set. If it is very high quality (expected to fully encompass all borderline cases of all classes), then using exact gap endpoints is probably wise; but if there is reason to worry about the comprehensiveness of the training set (e.g., when there are very few training samples, which is often the case in medical expert systems, where getting a sufficient number of training samples is difficult and expensive), then it is probably better to move the cut-points toward the midpoint (reflecting the vagueness of training-set class boundaries). What does one use to decide how much to move away from the endpoints? That's not an easy question; cluster deviation seems like a useful measure to employ. One last thought on how to decide whether to cut at the gap midpoint, at the endpoints, or somewhere in between: if one has a time-stamp on training samples, one might assess the "class endpoint" change rate over time. As the training set gets larger and larger, if an endpoint stops moving much and isn't an outlier, then cutting at the endpoint seems wise. If an endpoint is still changing a lot, then moving away from that endpoint seems wise (maybe based on the rate of change of that endpoint as well as other measures?).

A complete subgraph is a clique. A maximal clique is not a proper subset of any other clique. In a bipartite graph G = (X, Y, E), a clique (Sx, Sy) is a complete bipartite subgraph induced by the bipartite vertex set (Sx, Sy). The Consensus Set, or clique, of Sx is CLQ(Sx) = the intersection over all x in Sx of N(x), i.e., the set of all y's that are adjacent (edge-connected) to every x in Sx. Clearly (Sx, CLQ(Sx)) is a clique.
Thm1: (Sx, Sy) is a maximal clique iff Sy = CLQ(Sx) and Sx = CLQ(Sy).
Thm2: for every Sy in Y with CLQ(Sy) nonempty, ( CLQ(Sy), CLQ(CLQ(Sy)) ) is maximal.
Find all cliques starting with Sy = singletons, then the doubletons Sy1y2 whose consensus set is nonempty, then tripletons, etc. Examining the MGRs (x = docs, y = words): all singleton wordsets Sy form a nonempty clique. AND pairwise to find all nonempty doubleton-wordset cliques Sy1y2. AND those nonempty doubleton wordsets with each other singleton wordset to find all nonempty tripleton-wordset cliques Sy1y2y3, and so on. Start with singleton docs, include another, ... until empty. The last nonempty set is a max-clique and all its subsets are cliques. Remove them. Iterate.
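A minimal sketch of CLQ and the Thm1 maximality test over the doc-word incidence (plain Python sets stand in for the pTree ANDs; the three-doc example is made up):

# Bipartite incidence: words_of[doc] = set of words in that doc (illustrative).
words_of = {1: {"a", "b"}, 2: {"a", "b", "c"}, 3: {"b", "c"}}
docs_of = {}
for d, ws in words_of.items():
    for w in ws:
        docs_of.setdefault(w, set()).add(d)

def clq_docs(word_set):             # CLQ(Sy): docs adjacent to every word in Sy
    sets = [docs_of.get(w, set()) for w in word_set]
    return set.intersection(*sets) if sets else set()   # empty Sy treated as empty here

def clq_words(doc_set):             # CLQ(Sx): words adjacent to every doc in Sx
    sets = [words_of[d] for d in doc_set]
    return set.intersection(*sets) if sets else set()

def is_maximal(doc_set, word_set):  # Thm1: maximal iff each side is the other's CLQ
    return clq_words(doc_set) == set(word_set) and clq_docs(word_set) == set(doc_set)

sy = {"b"}                          # by Thm2, (CLQ(Sy), CLQ(CLQ(Sy))) should be maximal
print(clq_docs(sy), is_maximal(clq_docs(sy), clq_words(clq_docs(sy))))   # {1, 2, 3} True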

(Slide: the raw enumeration of singleton- and small-wordset cliques found by ANDing word pTrees, listed as DocSet / WordSet pairs, e.g., w49 -> docs {17,30,46}, w52 -> docs {23,28,41,48}, w57 -> docs {37,47}, and so on.)

(Slide: more of the clique enumeration, e.g., the wordsets {w4,w7,w10,w13,w24}, {w42}, {w44}, ... with their doc sets.) A count summary of what was found, reading the #CLQs / #docs / #words columns: 13 cliques with 2 docs and 2 words; 1 with 6 docs, 1 word; 7 with 5 docs, 1 word; 9 with 4 docs, 1 word; 23 with 3 docs, 1 word; 48 with 2 docs, 1 word. There is something wrong here. This does not find all maximal cliques.

Next I try the following logic (a level-wise sketch follows the incidence-matrix note below):
1. Find all 1WdCs (1-Word Cliques).
2. A kWdC contains each of its k (k-1)WdCs, so if a (k-1)-wordset is not the wordset of a clique then none of its supersets are either (the downward closure property).
3. Thus the wordset of any 2WdC can be composed by unioning the wordsets of two 1WdCs, and any kWdC wordset is the union of a (k-1)WdC wordset with a 1WdC wordset.

(The slide then lists the document-by-word incidence rows d1, d2, ..., ANDed with w1, and notes: for wh = w2, ..., w60, I show only the 1-counts of wh & wk, k > h.)
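A level-wise sketch of exactly that logic (sets of doc ids stand in for the word-existence pTrees and intersection stands in for the pTree AND; the toy input echoes the w1/w2/w42 example discussed below, with made-up extra docs):

def word_cliques(docs_of, max_k=4):
    """Level-wise enumeration: the wordset of a kWdC is the union of a (k-1)WdC
    wordset and a 1WdC wordset; an empty AND prunes all supersets (downward closure)."""
    ones = {frozenset([w]): set(ds) for w, ds in docs_of.items() if ds}   # the 1WdCs
    level, found = dict(ones), dict(ones)
    for _ in range(2, max_k + 1):
        nxt = {}
        for ws, ds in level.items():
            for w1, d1 in ones.items():
                if w1 <= ws:              # this word is already in the wordset
                    continue
                docs = ds & d1            # AND of the word pTrees
                if docs:                  # nonempty -> a k-word clique; empty -> pruned
                    nxt[ws | w1] = docs
        if not nxt:
            break
        level = nxt
        found.update(nxt)
    return found                          # wordset -> its doc set

cliques = word_cliques({"w1": {29}, "w2": {7, 29}, "w42": {29, 41}})
print(cliques[frozenset({"w1", "w2", "w42"})])   # {29}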
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 00 00 00 00 00 00 00 00 00 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 d49 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 00 00 00 00 00 00 00 00 00 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 00 0 00 0 0d50 01 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 00 00 00 00 00 00 00 00 00 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 1 0 1 0 0 1 1 2 0 1 0 0 1 0 0 0 0 0 0 0 0 1 0 0 1 0 0 1 1 0 0 2 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 1 1 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 1 0 0 0 0 0 3 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 1 0 0 2 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 1 0 0 0 1 1 0A1first 0 0 0 0 0of0 our team might be to implement (in optimized Treeminer parallel code ), that is, a 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 goal 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0maximal 0 0 0 0 complete 0 0 1 subgraph finder (maximal clique finder). The benefits would be substantial! 0 0 2 0 0 1 1 0 0 1 0 0 0 0 0 1 1 1 0 0 1 0 0 0 0 1 1 0 0 1 0 0 1 0 1 1 1 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 1 0 01.This 0 0 0 would 1 0 0 0be an exercise in parallel programming (e.g., in a Treeminer Hadoop environment). 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 1 0 0 1 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 is 0a 0typical exponential growth case. If you can find an engineering breakthru here, it will be a 0 1 1 1 0 0 1 0 1 0 0 0 1 1 1 0 0 1 0 0 0 0 1 1 0 0 1 0 0 1 0 1 1 1 1 0 1 0 0 1 0 0 0 2.This 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 2 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0breakthru 1 0 0 0 0for 0 0a massive collection of similar existing big data parallel programming problems . 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 2 0 1 0 0 1 1 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0What 1 0 0 0 0 0next 0 step? We would have to have all the wh&wk, k>h results (not just the 1 counts) but 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0is0the 0 1 1 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 that 0 0 would 0 0 1 0have 1 taken about 60 more slides to display ;-( 0 0 0 0 0 1 0 1 0 0 1 0 0 0 0 1 1 0 0 1 1 0 1 1 0 1 1 1 0 0 1 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 1 0 1 0 0 0 1 0 0 0 1 0 0 0 0 0 1 1 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 1 0 0 0 1 0 1 0 0 0 1 0 0 1 0 1 1 0Just 0 0 looking 0 0 0 0 at 1 the pTee results of w1&wk k>1 above, we see that even though w1&w2 and w1&w3 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 1 0 0 1 0 0 their AND (which is w1&w2&w3) has 0 count and therefore we need not consider 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0both 0 0 have 0 0 0 counts=1, 0 0 1 1 0 1 1 0 0 0 0 1 1 0 0 1 0 0 1 0 0 1 0 1 0 0 0 1 1 0 0 0any 0 0combinations 0 0 0 0 0 of the type w1&w2&w3& by the downward closure). 
0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 1 1 0 0 1 0 0 1 0 0 1 0 2 1 1 0 1 1 1 0 1 0 0 0 0 1 0 0 0 0 0 0 0 1 0 1 0 0 0 2 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 1 0 0 0 0 1 0 0 0 0 1 0 0 1 0 In 0 fact, 0 0 0the 0 0only 0 wh for which we need to look further is h=42. Note that ct(w1&w2&w42)=1 but all 0 0 0 0 1 1 0 0 1 0 0 1 0 0 1 0 1 0 0 0 0 1 0 1 0 0 0 0 0 0 0 1 The only maximal clique involving w1 and w2 is {CDocSet={C29}, 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 other 0 0 0ct(w1&w2&wh)=0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 right? 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 WordsSet={C1,2,42}, 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 1 0 1 0 0 1 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 0 0 1 1 1 0 0 0 0 1 0 1 0 Next 0 0 0we 0 0
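To make the downward-closure step concrete, here is a minimal sketch (not Treeminer's implementation; the names word_bitmap and grow_wordsets and the toy data are made up). Each word's document bitmap is held as a Python int, pTree AND is bitwise &, and ct() is the 1-count; once a word set's AND-count is 0, none of its supersets is ever generated.

```python
def ct(bitmap):
    """1-count of a bitmap = number of documents containing every word ANDed into it."""
    return bin(bitmap).count("1")

def grow_wordsets(word_bitmap):
    """Yield (wordset, bitmap) pairs with nonzero AND-count, Apriori-style.

    By downward closure, once a wordset's AND-count is 0, no superset of it is generated.
    """
    words = sorted(word_bitmap)
    frontier = [((w,), word_bitmap[w]) for w in words if ct(word_bitmap[w]) > 0]
    while frontier:
        yield from frontier
        nxt = []
        for wordset, bmp in frontier:
            for w in words:
                if w <= wordset[-1]:
                    continue                    # extend only with larger word ids (k > h)
                anded = bmp & word_bitmap[w]    # pTree AND of document bitmaps
                if ct(anded) > 0:               # zero count => prune all supersets
                    nxt.append((wordset + (w,), anded))
        frontier = nxt

# Tiny made-up illustration of the w1, w2, w3, w42 situation described above:
# w1&w2 and w1&w3 each have count 1, but w1&w2&w3 has count 0, so the only
# 3-word set produced that contains w1 and w2 is {1, 2, 42}.
word_bitmap = {1: 0b10010, 2: 0b00010, 3: 0b10000, 42: 0b00011}
for ws, bmp in grow_wordsets(word_bitmap):
    print(ws, ct(bmp))
```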

Next we look at the pTrees of w2&wk, k>2. Clearly we only need to consider the 16 WDpTrees {8,9,18,20,23,24,25,27,30,39,42,45,46,49,51,55}, not all 58 of them. And going down to w30, the WDpTreeSet is {30,48} only.

To appreciate that we need engineering breakthroughs here, recall that a typical vocabulary might be 100,000 words, not just 60.

[Word-by-document pTree bit-matrix rows and per-word 1-counts omitted.]

We can discard those subWordSets that have the same DocSet.

[Bit-matrix display of the w1&wk results by document (rows d1–d50) omitted.]

One possible way to process is to start with the w1CountVector.

[Remaining bit-matrix display rows omitted.]


FAUST Hull Classification, F(x)=D1ox: SPF(X) is a Scalar pTreeSet (a column of reals); its bit-slice pTrees are calculated as mod( int( SPF(X)/2^exp ), 2 ). SPF(X) − min = SPD1oX = SUM_i SPD1,i·Xi.

Toy data (2-D points, D1 = (1, −½)): a=(1,3), b=(1.5,3), c=(1.2,2.4), d=(.6,2.4), e=(2.2,2.1), f=(2.3,3), g=(2,2.4), h=(2.5,2.4).

F(a) = 1·1.0 − ½·3.0 = −.5; F(b) = 0; F(c) = 0; F(d) = −.6; F(e) = 1.15; F(f) = .8; F(g) = .8; F(h) = 1.3 (note F(b)=F(c) and F(f)=F(g)). Subtracting the minimum (−.6) gives SPD1oX − mn: .1, .6, .6, 0, 1.75, 1.4, 1.4, 1.9.

Hull intervals (h1 covers points a–d, h2 covers points e–h): on SPe1oX, h1 = [.6, 1.5] and h2 = [2, 2.5]; on SPe2oX, h1 = [2.4, 3] and h2 = [2.1, 3]; on SPD1oX − mn, h1 = [0, .6] and h2 = [1.4, 1.9].

[Bit-slice pTrees pD1,0 … pD1,-3, pe1,0 … pe1,-3, pe2,0 … pe2,-3 for the toy data omitted.]

Idea: Incrementally build clusters one at a time using all F values. E.g., start with one point, x. Recall that F-differences are distance dominated, i.e., actual distance ≥ any F-difference. If the hull is well developed (close to the convex hull), the max F-difference approximates the distance. Then the 1st gap in the max F-differences isolates the x-cluster.
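A minimal sketch of the hull construction just described, assuming plain numpy arrays rather than Treeminer's pTree arithmetic (the function names build_hulls and classify are made up, and points a–d / e–h are grouped as the two hulls per the intervals above): for each projection vector D we keep, per class, the interval [min D·x, max D·x], and a new point is assigned only to classes whose interval contains it for every D.

```python
import numpy as np

def build_hulls(X, y, Ds):
    """hulls[c][j] = (lo, hi) interval of D_j o x over the training points of class c."""
    hulls = {}
    for c in np.unique(y):
        Xc = X[y == c]
        hulls[int(c)] = [(float((Xc @ D).min()), float((Xc @ D).max())) for D in Ds]
    return hulls

def classify(x, hulls, Ds):
    """Return the classes whose every hull interval contains the projections of x."""
    hits = []
    for c, intervals in hulls.items():
        if all(lo <= float(x @ D) <= hi for D, (lo, hi) in zip(Ds, intervals)):
            hits.append(c)
    return hits   # may be empty ("none of the classes") or contain several classes

# Toy data from above: hull 1 = {a, b, c, d}, hull 2 = {e, f, g, h};
# D's are the freebees e1, e2 plus D1 = (1, -1/2).
X = np.array([[1, 3], [1.5, 3], [1.2, 2.4], [.6, 2.4],
              [2.2, 2.1], [2.3, 3], [2, 2.4], [2.5, 2.4]])
y = np.array([1, 1, 1, 1, 2, 2, 2, 2])
Ds = [np.array([1., 0]), np.array([0, 1.]), np.array([1., -.5])]
hulls = build_hulls(X, y, Ds)
print(classify(np.array([1.1, 2.8]), hulls, Ds))   # -> [1]
```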

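A rough sketch (an assumption about the procedure, not the slide's exact rule) of the "first gap in the max F-differences isolates the x-cluster" idea: since |D o x − D o y| ≤ dist(x, y) for unit D, the max of |D_j o x − D_j o y| over several D_j is a cheap lower bound on distance; sorting those values from a seed point and cutting at the first large gap gives the seed's cluster (compare the mxFdf table below).

```python
import numpy as np

def max_f_diff(X, x, Ds):
    """Lower-bound 'distances' from seed x to every row of X: max_j |D_j o x - D_j o row|."""
    return np.max(np.abs((X - x) @ np.array(Ds).T), axis=1)

def x_cluster(X, seed_idx, Ds, min_gap=0.6):
    """Indices of the seed's cluster: everything before the first gap >= min_gap."""
    d = max_f_diff(X, X[seed_idx], Ds)
    order = np.argsort(d)
    gaps = np.diff(d[order])
    cut = int(np.argmax(gaps >= min_gap)) if np.any(gaps >= min_gap) else len(d) - 1
    return {int(i) for i in order[:cut + 1]}

# The eight toy points a..h and the D's (e1, e2, D1) from the FAUST Hull example above;
# D1 is used unnormalized, to match the mxFdf values tabulated below.
X = np.array([[1, 3], [1.5, 3], [1.2, 2.4], [.6, 2.4],
              [2.2, 2.1], [2.3, 3], [2, 2.4], [2.5, 2.4]])
Ds = [np.array([1., 0]), np.array([0, 1.]), np.array([1., -.5])]
print(x_cluster(X, 0, Ds))   # {0, 1, 2, 3} -> the a-core {a, b, c, d}
```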
The max F-difference table (mxFdf, using SPe1oX, SPe2oX and SPD1oX) for the toy points:

mxFdf   a     b     c     d     e     f     g     h
a       0    .5    .6    .6   1.65  1.3   1.3   1.8
b      .5     0    .6    .9   1.15   .8    .8   1.3
c      .6    .6     0    .6   1.15  1.1    .8   1.3
d      .6    .9    .6     0   1.75  1.7   1.4   1.9
e     1.65  1.15  1.15  1.75    0    .9   .35    .3
f     1.3    .8   1.1   1.7    .9     0    .6    .6
g     1.3    .8    .8   1.4   .35    .6     0    .5
h     1.8   1.3   1.3   1.9    .3    .6    .5     0

Gaps and densities: Gap=.5, Density=4/.9²=4.9; Gap=.55, Density=3/.35²=24.5; Gap=.5, Density=6/.8²=9.4; Gap=.7, Density=4/.6²=11.1. Cores: {a,b,c,d} is the a-core (Gap=.7, Density=4/.6²=11.1); no b-core; no c-core; {a,b,c,d} is the d-core; {e,g,h} is the e-core; {b,c,e,f,g,h} is the g-core; {e,f,g,h} is the h-core; no f-core.

FAUST Oblique, F(x)=D1ox (same construction, but with a moved to (3,1.5)): F(a) = 1·3.0 − ½·1.5 = 2.25, so SPD1oX − mn for a becomes 2.85; the other F values are unchanged.

Clustering: Partition; TODO:
http://www.cs.ndsu.nodak.edu/~perrizo/saturday/teach/879s15/dmcluster.htm
http://www.cs.ndsu.nodak.edu/~perrizo/saturday/teach/879s15/dmlearn.htm
1. Attribute selection prior to FAUST (for speed/accuracy, clustering/classification): Treeminer methods, high pTree correlation with class?, high info_gain/gini_index/other?

In information theory and machine learning, information gain is a synonym for Kullback–Leibler divergence. The expected value of the information gain is the mutual information I(X; A) of X and A, i.e., the reduction in the entropy of X achieved by learning the state of the random variable A. In machine learning, this concept can be used to define a preferred sequence of attributes to investigate to most rapidly narrow down the state of X. Such a sequence (which depends on the outcome of the investigation of previous attributes at each stage) is called a decision tree. Usually an attribute with high mutual information should be preferred to others. In general terms, the expected information gain is the change in information entropy from a prior state to a state that takes some information as given.

Formal definition: Let T denote a set of training examples, each of the form (x, y) = (x1, ..., xk, y), where xa is the value of the a-th attribute of example x and y is the corresponding class label. The information gain for an attribute a, in terms of entropy H, is IG(T, a) = H(T) − SUM(v in vals(a))[ |{x in T : xa=v}|/|T| · H({x in T : xa=v}) ]. The mutual information is equal to the total entropy for an attribute if, for each of the attribute values, a unique classification can be made for the result attribute. In this case, the relative entropies subtracted from the total entropy are 0.

Drawbacks: Although information gain is usually a good measure for deciding the relevance of an attribute, it is not perfect. A notable problem occurs when information gain is applied to attributes that can take on a large number of distinct values.
For example, suppose that one is building a decision tree for some data describing the customers of a business. Information gain is often used to decide which of the attributes are the most relevant, so they can be tested near the root of the tree. One of the input attributes might be the customer's credit card number. This attribute has a high mutual information, because it uniquely identifies each customer, but we do not want to include it in the decision tree: deciding how to treat a customer based on their credit card number is unlikely to generalize to customers we haven't seen before (overfitting). Information gain ratio is sometimes used instead. This biases the decision tree against considering attributes with a large number of distinct values. However, attributes with very low information values may then appear to receive an unfair advantage.

In statistics, dependence is any statistical relationship between two random variables or two sets of data. Correlation refers to any of a class of statistical relationships involving dependence. Familiar examples of dependent phenomena include the correlation between the physical statures of parents and their offspring, and the correlation between the demand for a product and its price. Correlations are useful because they can indicate a predictive relationship that can be exploited in practice. For example, an electrical utility may produce less power on a mild day based on the correlation between electricity demand and weather. In this example there is a causal relationship, because extreme weather causes people to use more electricity for heating or cooling; however, statistical dependence is not sufficient to demonstrate the presence of such a causal relationship (i.e., correlation does not imply causation).

Formally, dependence refers to any situation in which random variables do not satisfy a mathematical condition of probabilistic independence. In loose usage, correlation can refer to any departure of two or more random variables from independence, but technically it refers to any of several more specialized types of relationship between mean values. There are several correlation coefficients, often denoted ρ or r, measuring the degree of correlation. The most common of these is the Pearson correlation coefficient, which is sensitive only to a linear relationship between two variables. Other correlation coefficients have been developed to be more robust than the Pearson correlation.

[Figure: several sets of (x, y) points, with the Pearson correlation coefficient of x and y for each set. The correlation reflects the noisiness and direction of a linear relationship (top row), but not the slope of that relationship (middle row), nor many aspects of nonlinear relationships (bottom row). N.B.: the figure in the center has a slope of 0, but in that case the correlation coefficient is undefined because the variance of Y is zero.]
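As a quick, self-contained illustration of the sample correlation coefficient r defined in detail below (plain Python, not pTree code; the function name pearson_r is made up):

```python
import math

def pearson_r(xs, ys):
    """Sample Pearson r = sum((xi-xbar)(yi-ybar)) / (sqrt(sum((xi-xbar)^2)) * sqrt(sum((yi-ybar)^2)))."""
    n = len(xs)
    xbar = sum(xs) / n
    ybar = sum(ys) / n
    cov = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - xbar) ** 2 for x in xs))
    sy = math.sqrt(sum((y - ybar) ** 2 for y in ys))
    return cov / (sx * sy)   # undefined (division by zero) if either variance is 0

print(pearson_r([1, 2, 3, 4], [2, 4, 6, 8]))    # 1.0   (perfect increasing linear relation)
print(pearson_r([1, 2, 3, 4], [8, 6, 4, 2]))    # -1.0  (perfect decreasing linear relation)
```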

Contents: Pearson's product-moment coefficient. Main article: Pearson product-moment correlation coefficient. The most familiar measure of dependence between two quantities is the Pearson product-moment correlation coefficient, or "Pearson's correlation coefficient", commonly called simply "the correlation coefficient". It is obtained by dividing the covariance of the two variables by the product of their standard deviations. Karl Pearson developed the coefficient from a similar but slightly different idea by Francis Galton.[4] The population correlation coefficient ρX,Y between two random variables X and Y with expected values μX and μY and standard deviations σX and σY is defined as

ρX,Y = corr(X,Y) = cov(X,Y) / (σX·σY) = E[(X − μX)(Y − μY)] / (σX·σY),

where E is the expected-value operator, cov means covariance, and corr is an alternative notation for the correlation coefficient. The Pearson correlation is defined only if both standard deviations are finite and nonzero. It is a corollary of the Cauchy–Schwarz inequality that the correlation cannot exceed 1 in absolute value. The correlation coefficient is symmetric: corr(X,Y) = corr(Y,X). The Pearson correlation is +1 in the case of a perfect direct (increasing) linear relationship (correlation), −1 in the case of a perfect decreasing (inverse) linear relationship (anticorrelation),[5] and some value between −1 and 1 in all other cases, indicating the degree of linear dependence between the variables. As it approaches 0 there is less of a relationship (closer to uncorrelated); the closer the coefficient is to either −1 or 1, the stronger the correlation between the variables. If the variables are independent, Pearson's correlation coefficient is 0, but the converse is not true, because the correlation coefficient detects only linear dependencies between two variables. E.g., suppose the random variable X is symmetrically distributed about 0 and Y = X². Then Y is completely determined by X, so X and Y are perfectly dependent, but their correlation is zero; they are uncorrelated. In the special case where X and Y are jointly normal, uncorrelatedness is equivalent to independence.

If a series of n measurements of X and Y, written xi and yi, i = 1..n, is available, the sample correlation coefficient can be used to estimate the population Pearson correlation r between X and Y. The sample correlation coefficient is written

rxy = SUM(i=1..n)[ (xi − x̄)(yi − ȳ) ] / ( (n − 1)·sx·sy ),

where x̄ and ȳ are the sample means of X and Y, and sx and sy are the sample standard deviations of X and Y.

Info Gain as an Attribute Selection Measure: it minimizes the expected number of tests needed to classify an object and guarantees a simple tree (not necessarily the simplest). At any stage, let S = {s1,...,sm} be a TRAINING SUBSET and S[C] = {C1,...,Cc} be the distinct classes in S. The EXPECTED INFORMATION needed to classify a sample given S as the TRAINING SET is

I(s1,...,sm) = -SUM(i=1..m)[ pi * log2(pi) ], where pi = |SCi| / |S|.

Choosing A as the decision attribute, the expected information still required after the split is

E(A) = SUM(j=1..v)[ (s1j+..+smj)/|S| * I(s1j,...,smj) ], where sij = |S(A=aj) ∩ Ci|,

and Gain(A) = I(s1,...,sm) − E(A) is the expected reduction of the information required to classify after splitting via the A-values. The algorithm computes the information gain of each attribute and selects the one with the highest information gain as the test attribute.
The example. Training Data (a 4×4 pixel image with four bands):

Band B1 (class label):   Band B2:        Band B3:      Band B4:
 3  3  7  7               7  3  3  2     8 8 4 5       11 15 11 11
 3  3  7  7               7  3  3  2     8 8 4 5       11 11 11 11
 2  2 10 15              11 11 10 10     8 8 4 4       15 15 11 11
 2 10 15 15              11 11 10 10     8 8 4 4       15 15 11 11

S:  X-Y  B1    B2    B3    B4
    0,0  0011  0111  1000  1011
    0,1  0011  0011  1000  1111
    0,2  0111  0011  0100  1011
    0,3  0111  0010  0101  1011
    1,0  0011  0111  1000  1011
    1,1  0011  0011  1000  1011
    1,2  0111  0011  0100  1011
    1,3  0111  0010  0101  1011
    2,0  0010  1011  1000  1111
    2,1  0010  1011  1000  1111
    2,2  1010  1010  0100  1011
    2,3  1111  1010  0100  1011
    3,0  0010  1011  1000  1111
    3,1  1010  1011  1000  1111
    3,2  1111  1010  0100  1011
    3,3  1111  1010  0100  1011

B1 is the class label (the values 2, 3, 7, 10, 15 are C1,...,C5). We need to know the count of the number of pixels (rows in the table above) that contain each value in each attribute. We also need the count of pixels that contain pairs of values, one from a descriptive attribute and the other from the class-label attribute.

Take B2 = {a1,a2,a3,a4,a5} = {2,3,7,10,11} as the first candidate attribute, with a1=0010, a2=0011, a3=0111, a4=1010, a5=1011, and Aj = {t : t(B2)=aj}. sij is the number of samples of class Ci in the subset Aj, so sij = rc( P1(ci) ^ P2(aj) ), where ci ∈ {2,3,7,10,15} and aj ∈ {2,3,7,10,11}.

The EXPECTED INFO needed to classify the sample is I = I(s1..sm) = -SUM(i=1..m)[ pi * log2(pi) ], with m=5, s=16, si = 3,4,4,2,3 (the rootcounts for the class labels, rc(P1(ci))), pi = si/s = 3/16, 1/4, 1/4, 1/8, 3/16, so
I = -( 3/16*log(3/16) + 4/16*log(4/16) + 4/16*log(4/16) + 2/16*log(2/16) + 3/16*log(3/16) ) = -( -.453 -.5 -.5 -.375 -.453 ) = 2.281.

The ENTROPY based on the partition into subsets by B2 is E(B2) = SUM(j=1..v)[ (s1j+..+smj) * I(s1j..smj) / s ], where Ij = I(s1j..smj) = -SUM(i=1..m)[ pij*log2(pij) ] and pij = sij/|Aj|. Since m=5, the sij's are:

          j=1  j=2  j=3  j=4  j=5
  s1j      0    0    0    0    3
  s2j      0    2    2    0    0
  s3j      2    2    0    0    0
  s4j      0    0    0    1    1
  s5j      0    0    0    3    0
  total    2    4    2    4    4   <- |Aj|, the rootcounts of the P2(aj)'s

[The intermediate work table of pij, pij*log2(pij), I(s1j..s5j) and the weighted contributions is omitted; the subsets j=4 and j=5 each contribute 4·(.81)/16 ≈ .203.] This yields E(B2) = .781, so that GAIN(B2) = I(s1..sm) − E(B2) = 1.750; similarly GAIN(B3) = I(s1..sm) − E(B3) = 1.331 and GAIN(B4) = I(s1..sm) − E(B4) = .568.

For Info Gain, IG, we can (and for that matter, we can do this for correlation too):
1. eliminate attributes based on an IG threshold, or
2. use 1/IG as a weighting in the classification vote, or
3. both!
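A minimal sketch (plain Python lists and Counter rather than pTree operations; function names assumed) of the Gain(A) = I(s1..sm) − E(A) computation on the band data above. In a pTree implementation each count below would instead be a rootcount such as sij = rc( P1(ci) ^ P2(aj) ).

```python
import math
from collections import Counter

def expected_info(counts):
    """I(s1..sm) = -sum pi*log2(pi), pi = si/s, skipping empty classes."""
    s = sum(counts)
    return -sum((si / s) * math.log2(si / s) for si in counts if si)

def info_gain(class_col, attr_col):
    """Gain(A) = I(class distribution) - E(A), E(A) = sum_j |Aj|/s * I(class dist within Aj)."""
    s = len(class_col)
    total_info = expected_info(list(Counter(class_col).values()))
    e_a = 0.0
    for aj in set(attr_col):
        subset = [c for c, a in zip(class_col, attr_col) if a == aj]
        e_a += len(subset) / s * expected_info(list(Counter(subset).values()))
    return total_info - e_a

# Band example above: B1 is the class label; rank B2, B3, B4 by information gain.
B1 = [3,3,7,7, 3,3,7,7, 2,2,10,15, 2,10,15,15]
B2 = [7,3,3,2, 7,3,3,2, 11,11,10,10, 11,11,10,10]
B3 = [8,8,4,5, 8,8,4,5, 8,8,4,4, 8,8,4,4]
B4 = [11,15,11,11, 11,11,11,11, 15,15,11,11, 15,15,11,11]
for name, band in [("B2", B2), ("B3", B3), ("B4", B4)]:
    print(name, round(info_gain(B1, band), 3))
```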
From: Arjun Roy, January 14, 2015, 2:03 AM. For the starting 'd', we have been using the vector from mean to median, which has worked quite well. But I am curious to know whether there is any other starting 'd' which would give even better separation between the classes. I could randomly choose some d's, but I also wanted to avoid too many computations.

WP: For clustering, we use the vector from mean to median for recursive clustering (to break apart a cluster into subclusters), but those vectors may not be very helpful for classification (of course, they are not hurtful either; the more d's the merrier for classification, right? Computation cost is the only limiting issue). For classification, the vectors connecting pairs of class means (or medians, or ...) would be good vectors to pick (together with the freebees, e1=(1,0,...,0), ..., en=(0,...,0,1)). In the picture below, d seems to connect the two class means? We might also take the "single link" class-pair vectors (connecting the pair of points that gives the single-link distance between the two classes, i.e. the minimum pairwise distance). Do we know how to find the single-link distance using pTrees? (It seems like we may have done that?) I apologize if I am forgetting an idea already mentioned in MY lecture slides. ;-) Then of course we can look at the "complete link" vectors (max instead of min). The "average link" vectors are those connecting the means. Of course the single-link and complete-link pairs can be approximate.

Arjun: For the convex hull algorithm, I was looking for the minimum number of d's required to build the class model (a kind of end condition). For example, in last Saturday's slide, one d was sufficient (the red line) and the black-line computation was redundant.

WP: Remember that the black line (using d=e2=(0,1)) and the gold line (using d=e1=(1,0)) are very helpful in eliminating false positives in the case where "none of the classes" is a possibility, right? (e.g., one-class classification). So I recommend (as a minimal set of d's):
1. Use all ei (the freebees).
2. Use all pairwise class-mean connectors.
Then, if more are desired, add the following in the order given:
3. Main diagonals (connecting each corner of the (0,x2,...,xn) face with the opposite corner on the (1,x2,...,xn) face). The algorithm for main-diagonal endpoints: let x2,...,xn be any sequence of 0's and 1's. D is a main diagonal iff D connects (0,x2,x3,...,xn) to (1,x2',x3',...,xn'), where ' = bit complement (flip each bit). So there are exactly 2^(n-1) main diagonals (see the sketch after this exchange).
4. All non-main diagonals (basically, fix a set of coordinates at some fixed sequence of 0's and 1's; then, from the other coordinates, pick one, say k, and run a vector from xk=0 with the fixed sequence and all others 0 or 1, to xk=1 with the fixed sequence and all others flipped).
5. Do 3 using xh=1/2.
6. Do 4 using xh=1/2.
So the classifier could build initial hulls using 1 first (use only the freebees, for a quick-and-dirty FAUST HULL classifier). Then, as the user is using those results, reclassify (add to the class hulls) in the background using 1 and 2; then using 1, 2 and 3; then 1-4; then 1-5; then 1-6.

Arjun: Is computing variance the only way to choose the next d? I need to check with Mohammad how expensive it is. Graphically, it's intuitive to choose d just by looking at the points, but algorithmically it's quite a computational pain.

WP: Maybe you can re-phrase this question? I'm not catching it.
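A hypothetical sketch of item 3 above: enumerating the main-diagonal direction vectors of the unit n-cube. A main diagonal connects (0, x2, ..., xn) to (1, x2', ..., xn') where ' flips each bit, so its direction has entries +1 or −1 and there are 2^(n-1) of them.

```python
from itertools import product

def main_diagonals(n):
    """Yield the 2^(n-1) main-diagonal direction vectors of the unit n-cube."""
    for bits in product((0, 1), repeat=n - 1):
        start = (0,) + bits
        end = (1,) + tuple(1 - b for b in bits)          # opposite corner on the x1=1 face
        yield tuple(e - s for s, e in zip(start, end))   # entries are +1 / -1

print(list(main_diagonals(3)))
# [(1, 1, 1), (1, 1, -1), (1, -1, 1), (1, -1, -1)]  -- 2^2 = 4 directions
```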
WP: Alternatively, we could use the freebees and then use the PTM partitioning of the surface of the unit sphere (vectors from the origin to a vertex of a great-circle triangle under PTM or, equivalently, HTM), starting with the partition into 8 triangles (which gives just the freebees e1, e2, e3), then descending one layer (subdivide all triangles by connecting all side midpoints), etc. Basically, at each level in PTM we would run a vector from each northern-hemisphere vertex to the point in the southern hemisphere directly opposite it (or, more simply, from the origin to each northern-hemisphere vertex). That way, the vectors D are coincident with the vertices of the triangles at that level (there are formulas in the literature; see HTM), but it seems like we could get the same thing by:
1. using the freebees {e1,...,en}, which is the top-level PTM triangulation;
2. using all V1 = {ei ± ej | i=1,2,3; i<j=2,3}, which I believe is the 2nd-level PTM triangulation;
3. using all V2 = {v + w | v, w in V1};
4. using all V3 = {v + w | v, w in V2}; and so on (see the sketch below).
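A rough sketch (an assumption, not the HTM/PTM vertex formulas) of the vector-generation recipe just listed: level 0 is the freebees e1..en, and each next level keeps the previous level's vectors and adds normalized pairwise sums (only the '+' half of the ± is shown, for brevity), roughly mimicking midpoint subdivision of the spherical triangles.

```python
import numpy as np
from itertools import combinations

def next_level(vectors):
    """V_{k+1} = V_k plus normalized {v + w | v, w distinct in V_k}, duplicates removed."""
    out = {tuple(np.round(v, 6)) for v in vectors}
    for v, w in combinations(vectors, 2):
        s = np.asarray(v, float) + np.asarray(w, float)
        norm = np.linalg.norm(s)
        if norm > 0:
            out.add(tuple(np.round(s / norm, 6)))
    return [np.array(v) for v in out]

V0 = [np.eye(3)[i] for i in range(3)]   # freebees e1, e2, e3 (top-level PTM triangulation)
V1 = next_level(V0)                     # adds the 2nd-level directions (ei+ej)/|ei+ej|
print(len(V0), len(V1))                 # 3 6
```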
