Succinct Summarization of Transactional Databases: An Overlapped Hyperrectangle

Succinct Summarization of Transactional Databases: An Overlapped Hyperrectangle

Succinct Summarization of Transactional Databases: An Overlapped Hyperrectangle Scheme Yang Xiang, Ruoming Jin, David Fuhry, Feodor F. Dragan Kent State University Presented by: Yang Xiang Introduction How to succinctly describe a transactional database? i1 i2 i3 i4 i5 i6 i7 i8 i9 t1 t2 t3 t4 t5 t6 t7 t8 t1 t2

t7 t8 t4 t5 t2 t3 t6 t7 i1 i2 i8 i9 {t1,t2,t7,t8}X{i1,i2,i8,i9} i4 i5 i6 {t4,t5}X{i4,i5,i6} i2 i3 i7 i8 {t2,t3,t6,t7}X{i2,i3,i7,i8} Summarization a set of hyperrectangles (Cartesian products) with minimal cost to cover ALL the cells (transaction-item pairs) of the transactional database 2 Problem Formulation Example: cost ({t1,t2,t7,t8}X{i1,i2,i8,i9})

For a hyperrectangle, Ti X Ii , we define its cost to be |Ti|+|Ii| i1 i2 i3 i4 i5 i6 i7 i8 i9 t1 t2 t3 t4 t5 t6 t7 t8 t1 t2 t7 t8 t4 t5 t2 t3 t6 t7 i1 i2 i8 i9 {t1,t2,t7,t8}X{i1,i2,i8,i9} i4 i5 i6

{t4,t5}X{i4,i5,i6} Total Covering Cost=8+5+8=21 i2 i3 i7 i8 {t2,t3,t6,t7}X{i2,i3,i7,i8} 3 Related Work Data Descriptive Mining and Rectangle Covering [Agrawal94] [Lakshmanan02] [Gao06] Summarization for categorical databases [Wang06] [Chandola07] Data Categorization and Comparison [Siebes06] [Leeuwen06] [Vreeken07] Others: Co-clustering [Li05], Approximate Frequent Itemset Mining [Afrati04] [Pei04] [Steinbach04], Data

Compression [Johnson04] 4 Hardness Results Unfortunately, this problem and several variations are proved to be NP-Hard! (Proof hint: Reduce minimum set cover problem to this problem.) 5 Weighted Set Cover Problem The summarization problem is closely related to the weighted set covering problem All cells of the database Ground set Candidate sets (each set has a weight) All possible hyperrectangles (each

hyperrectangle has a cost) Weighted set cover problem: Use a subset of candidate sets to cover the ground set such that the total weight is minimum Use a subset of all possible hyperrectangles to cover the database such that the total cost is minimum 6 Nave Greedy Algorithm Greedy algorithm: Each time choose a hyperrectangle (Hi=TiIi) with lowest price ( H ) | T | | I | . i

i i | Ti I i \ R | |Ti|+|Ii| is hyperrectangle cost. R is the set of covered cells Approximation ratio is ln(k)+1 [V.Chvtal 1979]. k is the number of selected hyperrectangles. The problem? The number of candidate hyperrectangles are 2|T|+|I| !!! 7 Basic Idea-1 Restricted number of candidates A candidate is a hyperrectangle whose itemset is either frequent, or a single item. C = {Ti X Ii | Ii F Is} is the set of candidates.

Given an itemset either frequent or singleton, it corresponds to an exponential number of hyperrectangles. For example: {1,2,3}X{a} It corresponds to the following hyperrectangles: {1} X {a}, {2} X {a}, {3} X {a}, {1,2} X {a}, {1,3} X{a}, {2,3} X{a}, {1,2,3} X{a} The number of hyperrectangle is still exponential 8 Basic Idea-2 Given an itemset, we do NOT try to enumerate the exponential number of hyperrectangles sharing this itemset. A linear algorithm to find the hyperrectangle with the lowest price among all the hyperrectanges sharing the same itemset. 9 Idea-2 Illustration i2 i4

i5 i7 t1 t3 t4 t6 t7 t9 Price=5/4=1.25 Price=6/8=0.75 Hyperrectangle of Lowest Price {t1,t4,t6,t7}{i2,i4,i5,i7} Price=7/10=0.70 Price=8/12=0.67 Price=9/13=0.69 Lowest Price=8/12=0.67 10 HYPER Algorithm While there are some cells not covered

[STEP1] Calculate lowest price hyperrectangle for each frequent or single itemset. (basic idea-2) [STEP2] Find the frequent or single itemset whose corresponding lowest price hyperrectangle is the lowest among all. [STEP3] Output this hyperrectangle. [STEP4] Update Coverage of the database. 11 HYPER We assume Apriori algorithm provides F . HYPER is able to find the best cover which utilizes the exponential number of hyperrectangles, described by candidate sets C = {Ti X Ii | Ii F Is} |T ( I )| | C | 2

( ). i I i F I s Properties: Approximation ratio is still ln(k)+1 w.r.t. C , Running time is O (| T | (| I s | log | T |)(| F | | I s |) k ) polynomial to F I s . 12 Pruning Technique for HYPER One important observation: For each frequent or single itemset, the price of lowest price hyperrectangle will only increase! i2 i4 i5 i7 t1 t3 t4 t6

t7 t9 Hyperrectangle of Lowest Price 0.67 0.80 {t11,t44,t66,t77}{i22,i44,i55,i77} Ii F Is 0.37 0.74 0.94 0.53 0.48 0.66 0.80 I1 I2 I3

1.33 Ii Significantly reduce the time of step 1 In 13 Further Summarization: HYPER+ The number of hyperrectangles returned by HYPER may be too large or the cost is too high. We can do further summarization by allowing false positive budget , i.e. (false cells)/(true cells) t1 t2 t7 t8 i1 i2 i3 i4 i5 i6 i7 i8 i9 t1 t2

t3 t4 t5 t6 t7 t8 =2/7 =0 t4 t5 t2 t3 t6 t7 i1 i2 i8 i9 i4 i5 i6 i2 i3 i7 i8 t1 t2 t3 t6 t7

t8 i1 i2 i3 i7 i8 i9 Cost=12+5=17 Cost=8+8+5=21 14 Experimental Results Two important observations: Convergence behavior Threshold behavior Two important conclusions:

min-sup doesnt need to be too low. We can reduce k to a relatively small number without increasing false positive ratio too much. 15 Conclusion and Future Work Conclusion HYPER can utilize exponential number of candidates to achieve a ln(k)+1 approximate bound but works in polynomial time. We can speed up HYPER by pruning technique. HYPER and HYPER+ works effectively and we find

threshold behavior and convergence behavior in the experiments. Future work Apply this method to real world applications. What is the best for summarization? 16 Thank you! Questions? 17 References [Agrawal94] Rakesh Agrawal, Johannes Gehrke, Dimitrios Gunopulos, and Prabhakar Raghavan. Automatic subspace clustering of high dimensional data for data mining applications. In SIGMOD Conference, pp 94-105, 1998. [Lakshmanan02] Laks V. S. Lakshmanan, Raymond T. Ng, Christine Xing Wang, Xiaodong Zhou, and Theodore J. Johnson. The generalized mdl approach for summarization. In VLDB 02, pp 766777, 2002. [Gao06] Byron J. Gao and Martin Ester. Turning clusters into patterns: Rectangle-based discriminative data description. In ICDM, pages 200211, 2006. [Wang06] Jianyong Wang and George Karypis. On efficiently summarizing categorical databases. Knowl. Inf. Syst., 9(1):1937, 2006. [Chandola07] Varun Chandola and Vipin Kumar. Summarization -compressing data into an informative representation. Knowl. Inf. Syst., 12(3):355378, 2007. [Siebes06] Arno Siebes, Jilles Vreeken, and Matthijs van Leeuwen. Itemsets that compress. In SDM, 2006.

[Leeuwen06] Matthijs van Leeuwen, Jilles Vreeken, and Arno Siebes. Compression picks item sets that matter. In PKDD, pp 585592, 2006. [Vreeken07] Jilles Vreeken, Matthijs van Leeuwen, and Arno Siebes. Characterising the difference. In KDD 07, pages 765774, 2007. [Li05] Tao Li. A general model for clustering binary data. In KDD, pp 188197, 2005. [Afrati04] Foto N. Afrati, Aristides Gionis, and Heikki Mannila. Approximating a collection of frequent sets. In KDD, pp 1219, 2004. [Pei04] Jian Pei, Guozhu Dong, Wei Zou, and Jiawei Han. Mining condensed frequent-pattern bases. Knowl. Inf. Syst., 6(5):570594, 2004. [Steinbach04] Michael Steinbach, Pang-Ning Tan, and Vipin Kumar. Support envelopes: a technique for exploring the structure of association patterns. In KDD 04, pages 296305, New York, NY, USA, 2004. ACM. [Johnson04] David Johnson, Shankar Krishnan, Jatin Chhugani, Subodh Kumar, and Suresh Venkatasubramanian. Compressing large boolean matrices using reordering techniques. In VLDB2004, pages 1323. VLDB Endowment, 2004. [V.Chvtal 1979] V. Chvtal. A greedy heuristic for the set-covering problem. Math. Oper. Res, 4:233235, 1979. 18

Recently Viewed Presentations

  • Gaze palsy Mahmood J Showail INTRODUCTION The control

    Gaze palsy Mahmood J Showail INTRODUCTION The control

    INTRODUCTION. The control of eye movement has three components. The supranuclear pathway (from the cortex and other control centers in the brain to the ocular motor nuclei in the brainstem) . The ocular motor nuclei . The . infranuclear. pathway...
  • New Spin on an Old Theme - PC\|MAC

    New Spin on an Old Theme - PC\|MAC

    New Spin on an Old Theme. Middle School Poetry, Bloom's Taxonomy and ... When the top was more and more uncovered until December fifteenthWhen finally it snowed and snowedI love seeing this mountain like a mouseAttached to the tail of...
  • Vitamin D, Calcium and Rickets Dr. Mohammad Shareq

    Vitamin D, Calcium and Rickets Dr. Mohammad Shareq

    Anterior bowing of tibia and femur. Coxa Vara. Leg pain . HYPOCALCEMIC SYMPTOMS. Tetany. Seizures. Stridor due to laryngeal spasm. INVESTIGATIONS. RADIOLOGY. Changes are most easily visualized on PA view of wrist- although characteristic racitic changes are seen at other...
  • WHAT IS WHITE PRIVILEGE? - Western Kentucky University

    WHAT IS WHITE PRIVILEGE? - Western Kentucky University

    Used to separate "their" behavior from 'my" behavior As humans we tend to feel like other cultures are different and therefore we observe them as being wrong, therefore developing stereotypes Prevent us from identifying our true feelings Franeisha Media Affect...
  • Exercises for CS3511 Week 31 (first week of practical)

    Exercises for CS3511 Week 31 (first week of practical)

    Suppose A B and B C. Prove or disprove A C. A B C C D D C P(C) 4. Translate into English and determine the truth value of each of the following: (R is the set of real numbers.)...
  • Prise en charge du TC grave Réanimation et stratégies ...

    Prise en charge du TC grave Réanimation et stratégies ...

    Recommandations SFMU 1993, Sfar 1999, Sfar-Sfmu 2002 Analgésie médicamenteuse en ft de l'EVA-EN Utilisation large de la morphine titrée : Dose de 0,05 mg/kg puis 2-4 mg / 5-7 min AL et ALR : A développer en médecine d'urgence Incertitudes...
  • Soft Tissue Calcifications Soft Tissue Calcification  Arterial: Arteriosclerosis

    Soft Tissue Calcifications Soft Tissue Calcification Arterial: Arteriosclerosis

    Multiple elongated calcifications in muscle. Used to be problem with pork when pigs fed 'swill' (uncooked garbage) Trichinosis is one of the various parasites that can lead to characteristic calcifications within the soft tissues.
  • Gear and Gear Terminology - WordPress.com

    Gear and Gear Terminology - WordPress.com

    Gear and Gear Terminology By: Gp Cap Dr Hamid Ullah Khan Niazi Objectives After completing this unit you should be able to learn: Identify and state the purposes of six types of gears used in the industry Apply various formulas...