Learning Bayesian Networks from Data Nir Friedman Hebrew U. . Daphne Koller Stanford Overview Introduction Parameter Estimation Model Selection Structure Discovery Incomplete Data Learning from Structured Data 2 Bayesian Networks Compact representation of probability distributions via conditional independence Family of Alarm

Qualitative part: Earthquake Directed acyclic graph (DAG) Nodes - random variables Radio Edges - direct influence Burglary Alarm E B P(A | E,B) e b 0.9 0.1 e b 0.2 0.8 e b 0.9 0.1 e b 0.01 0.99 Call Together: Define a unique distribution in a factored form Quantitative part: Set of conditional probability distributions

P (B,E ,A,C ,R ) P (B )P (E )P (A | B,E )P (R | E )P (C | A ) 3 Example: ICU Alarm network Domain: Monitoring Intensive-Care Patients 37 variables 509 parameters instead of 254 MINVOLSET PULMEMBOLUS PAP KINKEDTUBE INTUBATION SHUNT VENTMACH VENTLUNG DISCONNECT VENITUBE PRESS MINOVL

ANAPHYLAXIS SAO2 TPR HYPOVOLEMIA LVEDVOLUME CVP PCWP LVFAILURE STROEVOLUME FIO2 VENTALV PVSAT ARTCO2 EXPCO2 INSUFFANESTH CATECHOL

HISTORY ERRBLOWOUTPUT CO HR HREKG ERRCAUTER HRSAT HRBP BP 4 Inference Posterior probabilities Most likely explanation

Scenario that explains evidence Rational decision making Probability of any event given any evidence Maximize expected utility Value of Information Effect of intervention Earthquake Radio Burglary Alarm Call 5 Why learning? Knowledge acquisition bottleneck Knowledge acquisition is an expensive process Often we dont have an expert Data is cheap

Amount of available information growing rapidly Learning allows us to construct models from raw data 6 Why Learn Bayesian Networks? Conditional independencies & graphical language capture structure of many real-world distributions Graph structure provides much insight into domain Allows knowledge discovery Learned model can be used for many tasks Supports all the features of probabilistic learning Model selection criteria Dealing with missing data & hidden variables 7

Learning Bayesian networks E Data + Prior Information Learner R B A C E B P(A | E,B) e b .9 .1 e b .7 .3 e b .8 .2 e b .99 .01

8 Known Structure, Complete Data E, B, A . . E B P(A | E,B) e b ? ? e b ? ? e b ? ? e b

? ? Learner E B A B A E B P(A | E,B) e b .9 .1 e b .7 .3 e b .8 .2 e b .99 .01 Network structure is specified

E Inducer needs to estimate parameters Data does not contain missing values 9 Unknown Structure, Complete Data E, B, A . . E B P(A | E,B) e b ? ? e b ? ?

e b ? ? e b ? ? Learner E B A B A E B P(A | E,B) e b .9 .1 e b .7 .3

e b .8 .2 e b .99 .01 Network structure is not specified E Inducer needs to select arcs & estimate parameters Data does not contain missing values 1 Known Structure, Incomplete Data E, B, A . . E B P(A | E,B) e b ?

? e b ? ? e b ? ? e b ? ? E Learner E B A B

A E B P(A | E,B) e b .9 .1 e b .7 .3 e b .8 .2 e b .99 .01 Network structure is specified Data contains missing values Need to consider assignments to missing values 1 Unknown Structure, Incomplete Data E, B, A . .

E B P(A | E,B) e b ? ? e b ? ? e b ? ? e b ? ? E Learner

E B A B A E B P(A | E,B) e b .9 .1 e b .7 .3 e b .8 .2 e b .99 .01 Network structure is not specified Data contains missing values Need to consider assignments to missing values 1 Overview

Introduction Parameter Estimation Likelihood function Bayesian estimation Model Selection Structure Discovery Incomplete Data Learning from Structured Data 1 Learning Parameters Training data has the form: E B A E [1] B[1] A [1] C [1]

D E [ M ] B [ M ] A [ M ] C [ M ] C

1 Likelihood Function Assume i.i.d. samples Likelihood function is L( : D) P (E [m],B[m],A[m],C [m] : ) E B A C m 1 Likelihood Function By definition of network, we get E B A

L( : D) P (E [m],B[m],A[m],C [m] : ) C m P (E [m] : ) P (B[m] : ) m P (A [m] | B[m],E [m] : ) P (C [m] | A [m] : ) E [1] B[1] A [1] C [1]

E [ M ] B [ M ] A [ M ] C [ M ] 1 Likelihood Function Rewriting terms, we get E B

A L( : D) P (E [m],B[m],A[m],C [m] : ) C m P (E [m] : ) m P (B[m] : ) = m P (A[m] | B[m],E [m] : ) m P (C [m] | A[m] : ) m E [1] B[1] A [1] C [1]

E [ M ] B [ M ] A [ M ] C [ M ] 1 General Bayesian Networks Generalizing for any Bayesian network: L( : D) P (x1[m], , xn [m] : ) m P (xi [m] | Pai [m] : i ) i

m Li (i : D) i Decomposition Independent estimation problems 1 Likelihood Function: Multinomials L( : D) P (D | ) P (x [m] | ) m The likelihood for the sequence H,T, T, H, H is L( :D) L( : D) (1 ) (1 ) 0 0.2 0.4 General case: 0.6

0.8 Count of kth outcome in D 1 K L( : D) k k 1 Nk Probability of kth outcome 1 Bayesian Inference Represent uncertainty about parameters using a probability distribution over parameters, data Learning using Bayes rule Likelihood Prior P (x [1], x [M ] | )P ()

P ( | x [1], x [M ]) P (x [1], x [M ]) Posterior Probability of data 2 Bayesian Inference Represent Bayesian distribution as Bayes net X[1] X[2] X[m] Observed data The values of X are independent given P(x[m] | ) = Bayesian prediction is inference in this network 2 Example: Binomial Data

Prior: uniform for in [0,1] P( |D) the likelihood L( :D) P ( | x [1], x [M ]) P (x [1], x [M ] | ) P ( ) (NH,NT ) = (4,1) MLE for P(X = H ) is 4/5 = 0.8 Bayesian prediction is 0 0.2 0.4 0.6 0.8 1 5 P (x [M 1] H | D) P ( | D )d 0.7142 7 2

Dirichlet Priors Recall that the likelihood function is K L( : D) k N k k 1 Dirichlet prior with hyperparameters 1,,K K P () k k 1 k 1 the posterior has the same form, with hyperparameters 1+N N 1,,K +N N K K K K k 1 k 1

k 1 P ( | D) P ()P (D | ) k k 1 k N k k k N k 1 2 Dirichlet Priors - Example 5 4.5 Dirichlet(heads, tails) 4 P(heads) 3.5 3 2.5 Dirichlet(0.5,0.5) 2 Dirichlet(5,5) Dirichlet(2,2) 1.5 Dirichlet(1,1) 1 0.5

0 0 0.2 0.4 0.6 heads 0.8 1 2 Dirichlet Priors (cont.) If P() is Dirichlet with hyperparameters 1,,K k P (X [1] k ) k P ()d Since the posterior is also Dirichlet, we get k N k P (X [M 1] k | D) k P ( | D)d

( N ) 2 Bayesian Nets & Bayesian Prediction Y|X X X m X[1] X[2] X[M] X[M+1] Y[1] Y[2] Y[M] Y[M+1] Observed data

Query X[m] Y|X Y[m] Plate notation Priors for each parameter group are independent Data instances are independent given the unknown parameters 2 Bayesian Nets & Bayesian Prediction Y|X X X[1] X[2] X[M] Y[1] Y[2]

Y[M] Observed data We can also read from the network: Complete data posteriors on parameters are independent Can compute posterior over parameters separately! 2 Learning Parameters: Summary Estimation relies on sufficient statistics For multinomials: counts N(x ,pa ) i i Parameter estimation N (xi , pai ) xi | pai N ( pai ) MLE

(xi , pai ) N (xi , pai ) ~ x | pa i i ( pai ) N ( pai ) Bayesian (Dirichlet) Both are asymptotically equivalent and consistent Both can be implemented in an on-line manner by accumulating sufficient statistics 2 Learning Parameters: Case Study 1.4 Instances sampled from ICU Alarm network KL Divergence to true distribution 1.2 M strength of prior 1 0.8

MLE 0.6 Bayes; M'=20 0.4 Bayes; M'=50 0.2 Bayes; M'=5 0 0 500 1000 1500 2000 2500 3000 3500 4000 4500

5000 # instances 2 Overview Introduction Parameter Learning Model Selection Scoring function Structure search Structure Discovery Incomplete Data Learning from Structured Data 3 Why Struggle for Accurate Structure? Earthquake Alarm Set Burglary Sound Missing an arc

Adding an arc Earthquake Alarm Set Burglary Earthquake Alarm Set Sound Cannot be compensated for by fitting parameters Wrong assumptions about domain structure Burglary Sound Increases the number of parameters to be estimated Wrong assumptions about domain structure 3

Scorebased Learning Define scoring function that evaluates how well a structure matches the data E, B, A . . E B E E A A B A B Search for a structure that maximizes the score 3

Likelihood Score for Structure (G : D) log L(G : D) M I(X i ; PaiG ) H(Xi ) i Mutual information between Xi and its parents Larger dependence of Xi on Pai higher score Adding arcs always helps I(X; Y) I(X; {Y,Z}) Max score attained by fully connected network Overfitting: A bad idea 3 Bayesian Score Likelihood score: L(G : D) P(D | G,G ) Max likelihood params Bayesian approach: Deal with uncertainty by assigning probability to all possibilities P (D | G ) P (D | G, )P ( | G )d Marginal Likelihood Likelihood Prior over parameters

P(D| G)P(G) P(G | D) P(D) 3 Marginal Likelihood: Multinomials Fortunately, in many cases integral has closed form P() is Dirichlet with hyperparameters 1,,K D is a dataset with sufficient statistics N1,,NK Then ( N ) P (D) ( ) ( N )

3 Marginal Likelihood: Bayesian Networks Network structure determines form of marginal likelihood 1 2 3 4 5 6 7 X H T T

H T H H Y H T H H T T H Network 1: Two Dirichlet marginal likelihoods X P( )

Integral over X P( ) Integral over Y Y 3 Marginal Likelihood: Bayesian Networks Network structure determines form of marginal likelihood 1 2 3 4 5 6

7 X H T T H T H H Y H T H H T T

H Network 2: Three Dirichlet marginal likelihoods X Y P( ) Integral over X P( ) Integral over Y|X=H P( ) Integral over Y|X=T 3 Marginal Likelihood for Networks The marginal likelihood has the form: P (D | G ) Dirichlet marginal likelihood i

paiG ( paiG ) for multinomial P(Xi | pai) ( paiG ) N ( paiG ) xi ( (xi , paiG ) N (xi , paiG )) ( (xi , paiG )) N(..) are counts from the data (..) are hyperparameters for each family given G 3 Bayesian Score: Asymptotic Behavior log M log P(D| G) (G : D) dim(G)O(1) 2 log M

M I(X i ; PaiG ) H(X i ) dim(G) O(1) 2 i Fit dependencies in empirical distribution Complexity penalty As M (amount of data) grows, Increasing pressure to fit dependencies in distribution Complexity term avoids fitting noise Asymptotic equivalence to MDL score Bayesian score is consistent Observed data eventually overrides prior 3 Structure Search as Optimization Input: Training data Scoring function Set of possible structures Output:

A network that maximizes the score Key Computational Property: Decomposability: score(G) = score ( family of X in G ) 4 Tree-Structured Networks MINVOLSET PULMEMBOLUS Trees: At most one parent per variable PAP KINKEDTUBE INTUBATION SHUNT VENTLUNG SAO2 TPR HYPOVOLEMIA LVEDVOLUME CVP

PCWP DISCONNECT VENITUBE PRESS MINOVL VENTALV PVSAT Why trees? Elegant math we can solve the optimization problem Sparse parameterization avoid overfitting VENTMACH LVFAILURE STROEVOLUME ARTCO2 EXPCO2 INSUFFANESTH

CATECHOL HISTORY ERRBLOWOUTPUT CO HR HREKG ERRCAUTER HRSAT HRBP BP 4 Learning Trees Let p(i) denote parent of Xi We can write the Bayesian score as Score (G : D) Score (X i : Pai )

i Score(X i : X p(i ) ) Score(X i ) Score(X i ) i Improvement over empty network i Score of empty network Score = sum of edge scores + constant 4 Learning Trees Set w(ji) =Score( Xj Xi ) - Score(Xi) Find tree (or forest) with maximal weight Standard max spanning tree algorithm O(n2 log n)

Theorem: This procedure finds tree with max score 4 Beyond Trees When we consider more complex network, the problem is not as easy Suppose we allow at most two parents per node A greedy algorithm is no longer guaranteed to find the optimal network In fact, no efficient algorithm exists Theorem: Finding maximal scoring structure with at most k parents per node is NP-hard for k > 1 4 Heuristic Search Define a search space: search states are possible structures operators make small changes to structure Traverse space looking for high-scoring structures Search techniques: Greedy hill-climbing Best first search Simulated Annealing

... 4 Local Search Start with a given network empty network best tree a random network At each iteration Evaluate all possible changes Apply change based on score Stop when no modification improves score 4 Heuristic Search Typical operations: S C E C

C Add D S D To update score after local change, E only re-score families that changed C C D score = S({C,E} D) - S({E} D) Re ver se C

S te e l De E E S C E E D D 4 Learning in Practice: Alarm domain 2 true distribution

KL Divergence to 1.5 Structure known, fit params 1 Learn both structure & params 0.5 0 0 500 1000 1500 2000 2500 3000 #samples 3500 4000

4500 5000 4 Local Search: Possible Pitfalls Local search can get stuck in: Local Maxima: All one-edge changes reduce the score Plateaux: Some one-edge changes leave the score unchanged Standard heuristics can escape both Random restarts TABU search Simulated annealing

4 Improved Search: Weight Annealing Standard annealing process: Take bad steps with probability exp(score/t) Probability increases with temperature Weight annealing: Take uphill steps relative to perturbed score Perturbation increases with temperature Score(G|D) G 5 Perturbing the Score Perturb the score by reweighting instances Each weight sampled from distribution: Mean = 1 Variance temperature Instances sampled from original distribution but perturbation changes emphasis

Benefit: Allows global moves in the search space 5 Weight Annealing: ICU Alarm network Cumulative performance of 100 runs of annealed structure search True structure Learned params Greedy hill-climbing Annealed search 5 Structure Search: Summary Discrete optimization problem In some cases, optimization problem is easy Example: learning trees In general, NP-Hard Need to resort to heuristic search In practice, search is relatively fast (~100 vars in ~2-5 min):

Decomposability Sufficient statistics Adding randomness to search is critical 5 Overview Introduction Parameter Estimation Model Selection Structure Discovery Incomplete Data Learning from Structured Data 5 Structure Discovery Task: Discover structural properties Is there a direct connection between X & Y Does X separate between two subsystems Does X causally effect Y Example: scientific data mining

Disease properties and symptoms Interactions between the expression of genes 5 P(G|D) Discovering Structure E R B A C Current practice: model selection Pick a single high-scoring model Use that model to infer domain structure 5 Discovering Structure P(G|D) E B R

A C R E B E A C R B A C E R B A C E R B A

C Problem Small sample size many high scoring models Answer based on one model often useless Want features common to many models 5 Bayesian Approach Posterior distribution over structures Estimate probability of features Edge XY Path X Y Bayesian score for G P (f | D) f (G )P (G | D) G Feature of G, e.g., XY Indicator function

for feature f 5 MCMC over Networks Cannot enumerate structures, so sample structures 1 n P (f (G ) | D) f (Gi ) n i 1 MCMC Sampling Define Markov chain over BNs Run chain to get samples from posterior P(G | D) Possible pitfalls: Huge (superexponential) number of networks Time for chain to converge to posterior is unknown Islands of high posterior, connected by low bridges 5 ICU Alarm BN: No Mixing 500 instances: Score of score sample cuurent

-8400 -8600 -8800 -9000 -9200 empty greedy -9400 0 100000 200000 300000 400000 500000 600000 iteration MCMC Iteration

The runs clearly do not mix 6 Effects of Non-Mixing True BN Random start Two MCMC runs over same 500 instances Probability estimates for edges for two runs True BN True BN Probability estimates highly variable, nonrobust 6 Fixed Ordering Suppose that We know the ordering of variables say, X > X > X > X > > X 1 2 3 4

n parents for Xi must be in X1,,Xi-1 Limit number of parents per nodes to k 2knlog n networks Intuition: Order decouples choice of parents Choice of Pa(X ) does not restrict choice of Pa(X ) 7 12 Upshot: Can compute efficiently in closed form Likelihood P(D | ) Feature probability P(f | D, ) 6 Our Approach: Sample Orderings We can write P (f | D) P (f |,D)P (| D) Sample orderings and approximate n P (f | D) P (f | i ,D) i 1

MCMC Sampling Define Markov chain over orderings Run chain to get samples from posterior P( | D) 6 Mixing with MCMC-Orderings 4 runs on ICU-Alarm with 500 instances fewer iterations than MCMC-Nets approximately same amount of computation -8400 Score score of cuurent sample -8405 -8410 -8415 -8420 -8425 -8430 -8435 -8440 -8445

random greedy -8450 0 10000 20000 30000 40000 50000 60000 iteration MCMC Iteration Process appears to be mixing! 6 Mixing of MCMC runs Two MCMC runs over same instances Probability estimates for edges

500 instances 1000 instances Probability estimates very robust 6 Application: Gene expression Input: Measurement of gene expression under different conditions Thousands of genes Hundreds of experiments Output: Models of gene interaction Uncover pathways 6 Map of Feature Confidence Yeast data [Hughes et al 2000] 600 genes 300 experiments 6

Mating response Substructure KAR4 SST2 TEC1 NDJ1 KSS1 YLR343W YLR334C MFA1 STE6 FUS1 PRM1 AGA1 AGA2 TOM6

FIG1 FUS3 YEL059W Automatically constructed sub-network of highconfidence edges Almost exact reconstruction of yeast mating pathway 6 Overview Introduction Parameter Estimation Model Selection Structure Discovery Incomplete Data Parameter estimation Structure search Learning from Structured Data 6 Incomplete Data Data is often incomplete

Some variables of interest are not assigned values This phenomenon happens when we have Missing values: Some variables unobserved in some instances Hidden variables: Some variables are never observed We might not even know they exist 7 Hidden (Latent) Variables Why should we care about unobserved variables? X1 X2 X3 X1 X2 X3 Y3 Y1 Y2 Y3

H Y1 Y2 17 parameters 59 parameters 7 Example X Y Contour plots of log likelihood for different number of missing values of X (M = 8): Y|X=H P(X) assumed to be known Likelihood function of: Y|X=T, Y|X=H Y|X=T

Y|X=T no missing values 2 missing value Y|X=T 3 missing values In general: likelihood function has multiple modes 7 Incomplete Data In the presence of incomplete data, the likelihood can have multiple maxima H Y Example: We can rename the values of hidden variable H If H has two values, likelihood has two maxima In practice, many local maxima 7 L(|D)

EM: MLE from Incomplete Data Use current point to construct nice alternative function Max of new function scores than current point 7 Expectation Maximization (EM) A general purpose method for learning from incomplete data Intuition: If we had true counts, we could estimate parameters But with missing values, counts are unknown We complete counts using probabilistic inference based on current parameter assignment We use completed counts as if real to re-estimate parameters 7 Expectation Maximization (EM) Data P(Y=H|X=H,Z=T,) = 0.3 Current

model X Y Z H T H H T ? ? H T T T ? ? T H Expected Counts N (X,Y ) X Y # H T H T

H H T T 1.3 0.4 1.7 1.6 P(Y=H|X=T,) = 0.4 7 Expectation Maximization (EM) Reiterate Initial network (G,0) X1 X2 X3 Computation H Y1 Y2

Y3 (E-Step) Expected Counts N(X1) N(X2) N(X3) N(H, X1, X1, X3) N(Y1, H) N(Y2, H) N(Y3, H) Updated network (G,1) X1 Reparameterize (M-Step) X2 X3 H Y1 Y2 Y3

Training Data 7 Expectation Maximization (EM) Formal Guarantees: L(1:D) L(0:D) Each iteration improves the likelihood If 1 = 0 , then 0 is a stationary point of L(:D) Usually, this means a local maximum 7 Expectation Maximization (EM) Computational bottleneck: Computation of expected counts in E-Step Need to compute posterior for each unobserved variable in each instance of training set All posteriors for an instance can be derived from one pass of standard BN inference 7 Summary: Parameter Learning with Incomplete Data

Incomplete data makes parameter estimation hard Likelihood function Does not have closed form Is multimodal Finding max likelihood parameters: EM Gradient ascent Both exploit inference procedures for Bayesian networks to compute expected sufficient statistics 8 Incomplete Data: Structure Scores Recall, Bayesian score: P (G | D) P (G )P (D | G ) P (G )P (D | G, )P ( | G )d With incomplete data: Cannot evaluate marginal likelihood in closed form We have to resort to approximations: Evaluate score around MAP parameters Need to find MAP parameters (e.g., EM) 8

Naive Approach Perform EM for each candidate graph G3 G2 G1 Parameter space Parametric optimization (EM) Local Maximum Computationally expensive: Gn G4 Parameter optimization via EM non-trivial Need to perform EM for all candidate structures Spend time even on poor candidates In practice, considers only a few candidates 8 Structural EM

Recall, in complete data we had Decomposition efficient search Idea: Instead of optimizing the real score Find decomposable alternative score Such that maximizing new score improvement in real score 8 Structural EM Idea: Use current model to help evaluate new structures Outline: Perform search in (Structure, Parameters) space At each iteration, use current model for finding either: Better scoring parameters: parametric EM step or Better scoring structure: structural EM step 8 Reiterate Computation X1 X2 X3 H Y1

Y2 Y3 Training Data Expected Counts N(X1) N(X2) N(X3) N(H, X1, X1, X3) N(Y1, H) N(Y2, H) N(Y3, H) N(X2,X1) N(H, X1, X3) N(Y1, X2) N(Y2, Y1, H) Score & Parameterize X1 X2 X3

H Y1 X1 Y2 X2 Y3 X3 H Y1 Y2 Y3 8 Example: Phylogenetic Reconstruction Input: Biological sequences Human CGTTGC Chimp CCTAGG Orang CGAACG An instance of

evolutionary process Assumption: positions are independent . Output: a phylogeny 10 billion years leaf 8 Phylogenetic Model 8 leaf 2 12 internal node 11 10 1 9 branch (8,9)

3 4 5 Topology: bifurcating Observed species 1N Ancestral species N+12N-2 Lengths t = {ti,j} for each branch (i,j) Evolutionary model: 6 7 P(A changes to T| 10 billion yrs ) 8 Phylogenetic Tree as a Bayes Net

Variables: Letter at each position for each species Current day species observed Ancestral species - hidden BN Structure: Tree topology BN Parameters: Branch lengths (time spans) Main problem: Learn topology If ancestral were observed easy learning problem (learning trees) 8 Algorithm Outline Compute expected pairwise stats Weights: Branch scores Original Tree (T0,t0) 8 Algorithm Outline Compute expected pairwise stats Weights: Branch scores Find: T ' argmaxT w i ,j

(i , j )T Pairwise weights O(N2) pairwise statistics suffice to evaluate all trees 9 Algorithm Outline Compute expected pairwise stats Weights: Branch scores Find: T ' argmaxT w i ,j (i , j )T Construct bifurcation T1 Max. Spanning Tree 9 Algorithm Outline Compute expected pairwise stats Weights: Branch scores Find: T ' argmaxT w

i ,j (i , j )T Construct bifurcation T1 Theorem: L(T1,t1) L(T0,t0) New Tree Repeat until convergence 9 Real Life Data Lysozyme c Mitochondrial genomes # sequences 43 34 # pos 122 3,578 -2,916.2

-74,227.9 -2,892.1 -70,533.5 0.19 1.03 Traditional approach LogStructural EM likelihood Approach Difference per position Each position twice as likely 9 Overview Introduction

Parameter Estimation Model Selection Structure Discovery Incomplete Data Learning from Structured Data 9 Bayesian Networks: Problem Bayesian nets use propositional representation Real world has objects, related to each other Intelligence Difficulty Grade 9 Bayesian Networks: Problem Bayesian nets use propositional representation Real world has objects, related to each other These instances are not independent!

Intell_J.Doe Diffic_CS101 Grade_JDoe_CS101 A Intell_FGump Intell_FGump Diffic_CS101 Grade_FGump_CS101 C Diffic_Geo101 Grade_FGump_Geo101 9 St. Nordaf University Teaching-ability Teaches Teaches Teaching-ability Welcome to In-courseGrade

Registered Satisfac Forrest Gump Geo101 Grade Welcome to Difficulty Registered In-courseSatisfac CS101 Difficulty Intelligence Grade In-courseSatisfac Intelligence Registered Jane Doe 9 Relational Schema

Specifies types of objects in domain, attributes of each type of object, & types of links between objects Professor Classes Student Intelligence Teaching-Ability Take Teach Links Attributes Course Difficulty In Registration Grade Satisfaction 9 Representing the Distribution

Many possible worlds for a given university All possible assignments of all attributes of all objects Infinitely many potential universities Each associated with a very different set of worlds Need to represent infinite set of complex distributions 9 Possible Worlds Prof. High Jones Low Teaching-ability Prof. Smith High Teaching-ability World: assignment to all attributes of all objects in domain B Grade C Welcome to

Hate Satisfac Like Forrest Gump Geo101 Welcome to Easy Difficulty C B Grade Hate Satisfac CS101 Hard Easy Difficulty Weak Intelligence A Grade Hate Like Satisfac

Smart Intelligence Jane Doe 1 Probabilistic Relational Models Key ideas: Universals: Probabilistic patterns hold for all objects in class Locality: Represent direct probabilistic dependencies Links give us potential interactions! Professor Teaching-Ability Course Studen tIntelligence Difficulty easy,weak Reg Grade Satisfaction

A B C easy,smart hard,weak hard,smart 0% 20% 40% 60% 80% 100% 1 PRM Semantics Prof. Jones Teaching-ability Prof. Smith Teaching-ability Instantiated PRM BN variables: attributes of all objects

dependencies: determined by Grade|Intell,Diffic links & PRM Grade Welcome to Satisfac Intelligence Forrest Gump Geo101 Grade Welcome to Difficulty Satisfac CS101 Grade Difficulty Satisfac Intelligence Jane Doe 1 The Web of Influence

Welcome to CS101Objects are all correlated Need to perform inference over entire model For large databases, use approximate inference: Loopy belief propagation C 0% 0% Welcome to Geo101 easy / hard A weak 50% 50% 100% 100% smart weak / smart

1 PRM Learning: Complete Data Prof. Jones Low Teaching-ability Prof. Smith High Teaching-ability Grade|Intell,Diffic Grade C Welcome to Satisfac Like Weak Intelligence Introduce prior over parameters Geo101 Entire

database is single instance Update prior with sufficient statistics: B Grade Parameters Easyused many times in instance Difficulty Hate Satisfac Count(Reg.Grade=A,Reg.Course.Diff=lo,Reg.Student.Intel=hi) Welcome to CS101 Easy Difficulty A Grade Smart Intelligence Like Satisfac 1

PRM Learning: Incomplete Data ??? ??? C Hi ??? Welcome to Use expected sufficient statistics Geo101 But, everything is correlated: A E-step uses (approx) inference over entire model ??? Welcome to Low CS101 ???

B Hi ??? 1 A Web of Data Tom Mitchell Professor Project-of WebKB Project Works-on Advisee-of Sean Slattery Student Contains CMU CS Faculty [Craven et al.] 1 Standard Approach Pag eCategory Word1 ... Word N

Professor department extract information computer science machine learning 0.52 0.54 0.56 0.58 0.6 0.62 0.64 0.66 0.68 1 Whats in a Link To-Pag

eCategory From Pag - Category e Word1 ... Word Word1 N ... Word N Exists Link 0.52 0.54 0.56 0.58 0.6 0.62

0.64 0.66 0.68 1 Discovering Hidden Concepts Internet Movie Database http://www.imdb.com 1 Discovering Hidden Concepts Acto r Type Gender Directo r Type Movie Appeared Credit-Order Genre

Type Year MPAA Rating Rating #Votes Internet Movie Database http://www.imdb.com 1 Web of Influence, Yet Again Movies Actors Directors Wizard of Oz Cinderella Sound of Music The Love Bug Pollyanna The Parent Trap Mary Poppins Swiss Family Robinson Sylvester Stallone Bruce Willis Harrison Ford Steven Seagal Kurt Russell

Kevin Costner Jean-Claude Van Damme Arnold Schwarzenegger Alfred Hitchcock Stanley Kubrick David Lean Milos Forman Terry Gilliam Francis Coppola Terminator 2 Batman Batman Forever Mission: Impossible GoldenEye Starship Troopers Hunt for Red October Anthony Hopkins Robert De Niro Tommy Lee Jones Harvey Keitel Morgan Freeman Gary Oldman Steven Spielberg Tim Burton Tony Scott James Cameron John McTiernan Joel Schumacher

1 Conclusion Many distributions have combinatorial dependency structure Utilizing this structure is good Discovering this structure has implications: To density estimation To knowledge discovery Many applications Medicine Biology Web 1 The END Thanks to

Gal Elidan Lise Getoor Moises Goldszmidt Matan Ninio Dana Peer Eran Segal Ben Taskar Slides will be available from: http://www.cs.huji.ac.il/~nir/ http://robotics.stanford.edu/~koller/ 1