# Introduction to KDD for Tony's MI Course

COMP3503: Inductive Decision Trees, with Daniel L. Silver

## Agenda

- Explanatory/Descriptive Modeling
- Inductive Decision Tree Theory
- The Weka IDT System
- Weka IDT Tutorial

## Explanatory/Descriptive Modeling

### Overview of Data Mining Methods

- Automated Exploration/Discovery
  - e.g. discovering new market segments
  - distance and probabilistic clustering algorithms
- Prediction/Classification
  - e.g. forecasting gross sales given current factors
  - statistics (regression, K-nearest neighbour)
  - artificial neural networks, genetic algorithms
- Explanation/Description
  - e.g. characterizing customers by demographics
  - inductive decision trees/rules
  - rough sets, Bayesian belief nets
  - e.g. if age > 35 and income < \$35k then ...

(Figures: classes A and B plotted over axes x1 and x2; a curve f(x) plotted against x.)

### Inductive Modeling = Learning

Objective: Develop a general model or hypothesis from specific examples

- Function approximation (curve fitting)
- Classification (concept learning, pattern recognition)

(Figures: a curve f(x) fitted against x; classes A and B separated in the x1-x2 plane.)

### Inductive Modeling with IDT

Basic Framework for Inductive Learning

- The environment supplies training examples (x, f(x)) and testing examples.
- The inductive learning system uses the training examples to produce an induced model (classifier) h(x).
- The testing examples are passed through the induced model to give the output classification (x, h(x)).
- The question is whether h(x) ≈ f(x).

The focus is on developing a model h(x) that can be understood (is transparent).

## Inductive Decision Tree Theory

### Inductive Decision Trees (IDTs)

A decision tree is:

- A representational structure
- An acyclic, directed graph
- Nodes are either a:
  - Leaf: indicates class or value (distribution)
  - Decision node: a test on a single attribute, with one branch and subtree for each possible outcome of the test

Classification is made by traversing from the root to a leaf in accord with the tests.

(Figure: an example tree with root test A?, further decision nodes B?, C? and D?, and leaves.)
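
The root-to-leaf traversal described above can be sketched in a few lines of Python. This is only an illustrative sketch: the `Leaf`, `DecisionNode` and `classify` names, the example tree, and the class labels "A" and "B" are assumptions made for the example, not part of the course material or of Weka.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, Union


@dataclass
class Leaf:
    label: str  # the class indicated by this leaf


@dataclass
class DecisionNode:
    attribute: str                 # the single attribute tested at this node
    test: Callable[[Any], str]     # maps the attribute value to an outcome name
    branches: Dict[str, "Node"]    # one subtree per possible outcome


Node = Union[Leaf, DecisionNode]


def classify(node: Node, example: Dict[str, Any]) -> str:
    """Traverse from the root to a leaf, following the outcome of each test."""
    while isinstance(node, DecisionNode):
        outcome = node.test(example[node.attribute])
        node = node.branches[outcome]
    return node.label


# Hypothetical tree for a rule of the form "if age > 35 and income < $35k then B".
tree = DecisionNode(
    attribute="age",
    test=lambda age: "yes" if age > 35 else "no",
    branches={
        "yes": DecisionNode(
            attribute="income",
            test=lambda income: "yes" if income < 35_000 else "no",
            branches={"yes": Leaf("B"), "no": Leaf("A")},
        ),
        "no": Leaf("A"),
    },
)

print(classify(tree, {"age": 40, "income": 30_000}))
```

Each decision node tests exactly one attribute, so the path taken for an example reads directly as an if-then rule, which is what makes the model transparent.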

### A Long and Diverse History

Independently developed in the 60s and 70s by researchers in ...

- Statistics: L. Breiman & J. Friedman - CART (Classification and Regression Trees)
- Pattern Recognition: U of Michigan - AID; G.V. Kass - CHAID (Chi-squared Automated Interaction Detection)
- AI and Info. Theory: R. Quinlan - ID3 (Iterative Dichotomizer), C4.5

### Inducing a Decision Tree

- Given: Set of examples with Pos. & Neg. classes
- Problem: Generate a decision tree model to classify a separate (validation) set of examples with minimal error
- Approach: Occam's Razor - produce the simplest model that is consistent with the training examples -> narrow, short tree. Every traverse should be as short as possible
- Formally: Finding the absolute simplest tree is intractable, but we can at least try our best

How do we produce an optimal tree?

Heuristic (strategy) 1: Grow the tree from the top down. Place the most important variable test at the root of each successive subtree.

The most important variable is:

- the variable (predictor) that gains the most ground in classifying the set of training examples
- the variable that has the most significant relationship to the response variable
- the variable on which the response is most dependent (or of which it is least independent)

Importance of a predictor variable:

- CHAID/CART: The Chi-squared [or F (Fisher)] statistic is used to test the independence between the categorical [or continuous] response variable and each predictor variable. The lowest probability (p-value) from the test determines the most important predictor (p-values are first corrected by the Bonferroni adjustment).
- C4.5 (section 4.3 of WFH, and PDF slides): The information-theoretic Information Gain is computed for each predictor and the one with the highest gain is chosen.
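
As a rough illustration of the information gain criterion, the sketch below computes the entropy of the class labels before and after splitting on each predictor and picks the predictor with the highest gain. The function names and the small weather-style data set are assumptions chosen for the example. Note also that C4.5 itself refines plain gain into a gain ratio, which is not shown here.

```python
import math
from collections import Counter, defaultdict


def entropy(labels):
    """Entropy, in bits, of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def information_gain(examples, attribute, target):
    """Reduction in entropy of `target` obtained by splitting on `attribute`."""
    groups = defaultdict(list)
    for ex in examples:
        groups[ex[attribute]].append(ex[target])
    before = entropy([ex[target] for ex in examples])
    after = sum(len(g) / len(examples) * entropy(g) for g in groups.values())
    return before - after


# Illustrative toy data: (outlook, windy, play).
rows = [
    ("sunny", False, "no"), ("sunny", True, "no"), ("overcast", False, "yes"),
    ("rainy", False, "yes"), ("rainy", False, "yes"), ("rainy", True, "no"),
    ("overcast", True, "yes"), ("sunny", False, "no"), ("sunny", False, "yes"),
    ("rainy", False, "yes"), ("sunny", True, "yes"), ("overcast", True, "yes"),
    ("overcast", False, "yes"), ("rainy", True, "no"),
]
examples = [{"outlook": o, "windy": w, "play": p} for o, w, p in rows]

gains = {a: information_gain(examples, a, "play") for a in ("outlook", "windy")}
best = max(gains, key=gains.get)  # predictor to test at the root
print(best, gains)
```

On this toy data the split on `outlook` removes far more uncertainty about `play` than the split on `windy`, so it would be placed at the root.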

### Inducing a Decision Tree: Heuristic 2

How do we produce an optimal tree?

Heuristic (strategy) 2: To be fair to predictor variables that have only 2 values, divide variables with multiple values into similar groups or segments, which are then treated as separate variables (CART/CHAID only).

The p-values from the Chi-squared or F statistic are used to determine the variable/value combinations that are most similar in terms of their relationship to the response variable.

### Inducing a Decision Tree: Heuristic 3

How do we produce an optimal tree?

Heuristic (strategy) 3: Prevent overfitting the tree to the training data, so that it generalizes well to a validation set, by:

- Stopping: Prevent the split on a predictor variable if its p-value is above the chosen level of statistical significance; simply make it a leaf (CHAID)
- Pruning: After a complex tree has been grown, replace a split (subtree) with a leaf if the predicted validation error is no worse than that of the more complex tree (CART, C4.5)

Stopping (pre-pruning) means a choice of level of significance (CHAID):

- If the probability (p-value) of the statistic is less than the chosen level of significance, then a split is allowed
- Typically the significance level is set to:
  - 0.05, which provides 95% confidence
  - 0.01, which provides 99% confidence

Stopping means a minimum number of examples at a leaf node (C4.5 = J48):

- M factor = minimum number of examples allowed at a leaf node
- M = 2 is default
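
The significance-based stopping rule can be illustrated with a small sketch. It assumes the simplest case, a binary predictor against a binary response (a 2x2 contingency table, so 1 degree of freedom), where the Chi-squared p-value has a closed form. The function names and the example counts are hypothetical. The `n_tests` argument is a simple Bonferroni-style correction, and no continuity correction is applied.

```python
import math


def chi_squared_2x2(table):
    """Pearson Chi-squared statistic for a 2x2 table [[a, b], [c, d]].

    Assumes every row and column total is non-zero.
    """
    (a, b), (c, d) = table
    n = a + b + c + d
    row_totals = [a + b, c + d]
    col_totals = [a + c, b + d]
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    return stat


def p_value_df1(stat):
    """P(X > stat) for a Chi-squared variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(stat / 2.0))


def split_allowed(table, alpha=0.05, n_tests=1):
    """Allow the split only if the (Bonferroni-adjusted) p-value is below alpha."""
    p = min(1.0, p_value_df1(chi_squared_2x2(table)) * n_tests)
    return p < alpha


# Rows: predictor value; columns: response class (counts are made up).
print(split_allowed([[30, 10], [10, 30]]))  # strong association
print(split_allowed([[20, 20], [20, 20]]))  # no association
```

Lowering `alpha` from 0.05 to 0.01 makes splits harder to justify and therefore yields smaller trees.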

### Inducing a Decision Tree: Pruning

Pruning means reducing the complexity of a tree (C4.5 = J48):

- C factor = confidence in the data used to train the tree
- C = 25% is default
- If there is 25% confidence that a pruned branch will generate no more than its training errors on a test set, then prune it (p. 196 of WFH, PDF slides)
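
One common textbook way to make the C factor concrete is a pessimistic error estimate: the observed training error rate at a node is replaced by the upper limit of a confidence interval, and a subtree is replaced by a leaf when the leaf's estimate is no worse than the weighted estimate of the subtree's leaves. The sketch below uses a normal approximation to that upper limit. It is an assumption-laden illustration, not J48's actual code, and the function names and error counts are invented for the example.

```python
import math
from statistics import NormalDist


def pessimistic_error(errors, n, confidence=0.25):
    """Upper confidence limit on the error rate at a node.

    errors: training errors at the node; n: training examples at the node.
    Uses a normal approximation to the binomial (an assumption of this sketch).
    """
    f = errors / n
    z = NormalDist().inv_cdf(1.0 - confidence)
    numerator = (f + z * z / (2 * n)
                 + z * math.sqrt(f / n - f * f / n + z * z / (4 * n * n)))
    return numerator / (1 + z * z / n)


def should_prune(node_errors, node_n, leaves, confidence=0.25):
    """Decide whether to replace a subtree by a single leaf.

    leaves: list of (errors, n) pairs, one per leaf of the subtree.
    """
    subtree_estimate = sum(
        n / node_n * pessimistic_error(e, n, confidence) for e, n in leaves
    )
    leaf_estimate = pessimistic_error(node_errors, node_n, confidence)
    return leaf_estimate <= subtree_estimate


# Hypothetical subtree: three leaves covering 14 examples in total.
leaves = [(2, 6), (1, 2), (2, 6)]
print(should_prune(node_errors=5, node_n=14, leaves=leaves))
```

A smaller `confidence` value gives a larger `z`, which inflates the estimates of small leaves more strongly and so leads to heavier pruning.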

## The Weka IDT System

Weka SimpleCART creates a tree-based classification model:

- The target or response variable must be categorical (multiple classes allowed)
- Uses the Chi-squared test for significance
- Prunes the tree by using a test/tuning set

Copyright (c), 2002 All Rights Reserved

Weka J48 creates a tree-based classification model (= Ross Quinlan's original C4.5 algorithm):

- The target or response variable must be categorical
- Uses information gain test for significance
- Prunes the tree by using a test/tuning set

Weka M5P creates a tree-based regression model (also by Ross Quinlan):

- The target or response variable must be continuous
- Uses information gain test for significance
- Prunes the tree by using a test/tuning set

## IDT Training

How do you ensure that a decision tree has been well trained?

- Objective: To achieve good generalization accuracy on new examples/cases
- Establish a maximum acceptable error rate
- Train the tree using a method to prevent over-fitting (stopping / pruning)
- Validate the trained tree against a separate test set

### Approach #1: Large Sample

When the amount of available data is large ...

- Divide the available examples randomly into a training set (70%) and a test, or hold-out (HO), set (30%)
- The training set is used to develop one IDT model
- The HO set is used to compute goodness of fit
- Generalization = goodness of fit

### Approach #2: Cross-validation

When the amount of available data is small ...

- Divide the available examples into a training set (90%) and a test (HO) set (10%)
- Repeat 10 times, so that the training sets are used to develop 10 different IDT models
- Tabulate goodness of fit stats on each HO set
- Generalization = mean and stddev of goodness of fit
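
Both evaluation approaches can be sketched with the Python standard library alone. In the sketch below, `train_model` is a placeholder for any learner that takes training examples and returns a classifier. A trivial majority-class learner stands in for an IDT purely so that the example runs. All names and the synthetic data are assumptions of the example.

```python
import random
import statistics


def holdout_split(examples, train_fraction=0.7, seed=0):
    """Approach #1: one random split into a training set and a hold-out set."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    cut = round(train_fraction * len(shuffled))
    return shuffled[:cut], shuffled[cut:]


def k_fold_splits(examples, k=10, seed=0):
    """Approach #2: yield k (training set, hold-out set) pairs."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]
    for i in range(k):
        train = [ex for j, fold in enumerate(folds) if j != i for ex in fold]
        yield train, folds[i]


def cross_validate(examples, train_model, k=10, seed=0):
    """Return the mean and standard deviation of hold-out accuracy over k models."""
    accuracies = []
    for train, test in k_fold_splits(examples, k, seed):
        model = train_model(train)
        correct = sum(1 for x, y in test if model(x) == y)
        accuracies.append(correct / len(test))
    return statistics.mean(accuracies), statistics.stdev(accuracies)


def train_majority(train):
    """Stand-in learner: always predicts the most common training label."""
    labels = [y for _, y in train]
    majority = max(set(labels), key=labels.count)
    return lambda x: majority


# Synthetic examples (x, label): 75 "yes" and 25 "no".
data = [(i, "yes" if i % 4 else "no") for i in range(100)]
train_set, ho_set = holdout_split(data)
mean_acc, std_acc = cross_validate(data, train_majority)
print(len(train_set), len(ho_set), mean_acc, std_acc)
```

With cross-validation every example is used for testing exactly once, which is why it makes better use of a small data set than a single 70/30 split.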

### How do you select between two induced decision trees?

A statistical test of hypothesis is required to ensure that a significant difference exists between the fit of two IDT models:

- If the Large Sample method has been used, then apply McNemar's test\* or a difference of proportions test
- If Cross-validation has been used, then use a paired t test for the difference of two proportions

\*We assume a classification problem; if this is function approximation, then use a paired t test for the difference of means.
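
For the large-sample case, McNemar's test compares two classifiers on the same hold-out set using only the examples on which they disagree. The sketch below uses the usual continuity-corrected statistic with 1 degree of freedom. The function name and the disagreement counts are hypothetical, and clamping the numerator at zero is a small convenience of this sketch, not part of the standard formula.

```python
import math


def mcnemar_test(b, c):
    """McNemar's test with continuity correction.

    b: examples tree 1 classified correctly and tree 2 misclassified.
    c: examples tree 2 classified correctly and tree 1 misclassified.
    Returns (statistic, p_value) for a Chi-squared variable with 1 degree of freedom.
    """
    if b + c == 0:
        return 0.0, 1.0  # the two trees never disagree
    # Clamped at 0 so that b == c gives a statistic of 0 (a choice of this sketch).
    stat = max(abs(b - c) - 1, 0) ** 2 / (b + c)
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value


# Made-up disagreement counts from one hold-out set.
stat, p_value = mcnemar_test(b=15, c=5)
print(stat, p_value, p_value < 0.05)
```

Examples that both trees get right, or both get wrong, carry no information about which tree is better, so they do not enter the statistic.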

## Pros and Cons of IDTs

Cons:

- Only one response variable at a time
- Different significance tests required for nominal and continuous responses
- Can have difficulties with noisy data
- Discriminant functions are often suboptimal due to orthogonal decision hyperplanes

Pros:

- Proven modeling method for 20 years
- Provides explanation and prediction
- Ability to learn arbitrary functions
- Handles unknown values well
- Rapid training and recognition speed
- Has inspired many inductive learning algorithms using statistical regression

## The IDT Application Development Process

Guidelines for inducing decision trees:

1. IDTs are a good method to start with
2. Get a suitable training set
3. Use a sensible coding for input variables
4. Develop the simplest tree by adjusting tuning parameters (significance level)
5. Use a method to prevent over-fitting
6. Determine confidence in generalization through cross-validation
