Information Extraction from the World Wide Web
Part III: HMM Learning & CRFs
CSE 454
Daniel S. Weld

Administrivia
Congratulations to ACM Programming Contest team #2, "The Ministry of Silly Walks": Jason Kivlighn, Scott Shawcroft, Alyssa Harding.

Landscape of IE Techniques: Models
Running example: "Abraham Lincoln was born in Kentucky."
- Lexicons: is a candidate string a member of a list (Alabama, Alaska, ..., Wisconsin, Wyoming)?
- Boundary Models: classify each position in the sentence as a BEGIN or END boundary.
- Sliding Window: a classifier asks "which class?" for each window; try alternate window sizes.
- Finite State Machines: find the most likely state sequence.
- Context Free Grammars: find the most likely parse (NNP, V, P, NP, PP, VP, S; BEGIN/END).
Each model can capture words, formatting, or both.
Slides from Cohen & McCallum

Hidden Markov Models (HMMs)
The standard sequence model in genomics, speech, and NLP. A graphical model and a finite state model: a hidden state sequence ... S_{t-1}, S_t, S_{t+1} ... generates an observation sequence ... O_{t-1}, O_t, O_{t+1} ...

Transitions link the states; each state emits one observation, generating o_1 o_2 ... o_{|o|}:

P(s, o) = ∏_{t=1..|o|} P(s_t | s_{t-1}) · P(o_t | s_t)

Parameters: for all states S = {s_1, s_2, ...}
- Start state probabilities: P(s_1)
- Transition probabilities: P(s_t | s_{t-1})
- Observation (emission) probabilities: P(o_t | s_t); usually a multinomial over an atomic, fixed alphabet
Training: see the learning problem below.
Slides from Cohen & McCallum

Definition of a hidden Markov model
Definition: A hidden Markov model (HMM) has
- An alphabet Σ = { b_1, b_2, ..., b_M }
- A set of states Q = { 1, ..., K }
- Transition probabilities between any two states: a_ij = transition probability from state i to state j, with a_i1 + ... + a_iK = 1 for all states i = 1..K
- Start probabilities a_0i, with a_01 + ... + a_0K = 1
- Emission probabilities within each state: e_k(b) = P( x_i = b | π_i = k ), with e_k(b_1) + ... + e_k(b_M) = 1 for all states k = 1..K
Slides from Serafim Batzoglou

Example: The Dishonest Casino
A casino has two dice:
- Fair die: P(1) = P(2) = P(3) = P(4) = P(5) = P(6) = 1/6
- Loaded die: P(1) = P(2) = P(3) = P(4) = P(5) = 1/10 and P(6) = 1/2
The casino player switches back and forth between the fair and loaded die about once every 20 turns.
Game:
1. You bet $1

2. You roll (always with a fair die)
3. The casino player rolls (maybe with the fair die, maybe with the loaded die)
4. Highest number wins $2
Slides from Serafim Batzoglou

The dishonest casino model
Two states, FAIR and LOADED; each stays put with probability 0.95 and switches with probability 0.05.
- FAIR: P(1|F) = P(2|F) = P(3|F) = P(4|F) = P(5|F) = P(6|F) = 1/6
- LOADED: P(1|L) = P(2|L) = P(3|L) = P(4|L) = P(5|L) = 1/10, P(6|L) = 1/2
Slides from Serafim Batzoglou

Information Extraction with HMMs
Example: Research Paper Headers
Daniel S. Weld; slides by Okan Basegmez
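The casino model above is easy to make concrete. A minimal sketch in Python (our encoding, not from the slides); the state names, dictionary layout, and uniform start distribution are our assumptions:

```python
import random

# Our encoding of the dishonest-casino HMM (state names and the
# uniform start distribution are assumptions, not from the slides).
STATES = ("F", "L")
START = {"F": 0.5, "L": 0.5}
TRANS = {"F": {"F": 0.95, "L": 0.05},
         "L": {"F": 0.05, "L": 0.95}}
EMIT = {"F": {d: 1 / 6 for d in range(1, 7)},
        "L": {**{d: 1 / 10 for d in range(1, 6)}, 6: 1 / 2}}

def sample(n, seed=0):
    """Generate n (hidden state, observed roll) pairs from the model."""
    rng = random.Random(seed)
    out = []
    s = rng.choices(STATES, weights=[START[q] for q in STATES])[0]
    for _ in range(n):
        roll = rng.choices(list(EMIT[s]), weights=list(EMIT[s].values()))[0]
        out.append((s, roll))
        s = rng.choices(STATES, weights=[TRANS[s][q] for q in STATES])[0]
    return out

rolls = sample(10)
```

Sampling a few hundred rolls from this model produces exactly the kind of sequence the three questions below are asked about.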

HMMs for Info Extraction
For sparse extraction tasks:
- Use a separate HMM for each type of target.
- Each HMM should model the entire document, consist of target and non-target states, and need not be fully connected.
Given the HMM, how do we extract info?
Daniel S. Weld; slides by Okan Basegmez

A parse of a sequence
Given a sequence x = x_1 ... x_N, a parse of x is a sequence of states π = π_1, ..., π_N.
[Figure: a K-state trellis over positions x_1, x_2, x_3, ..., x_N]
Slides from Serafim Batzoglou

Question #1: Evaluation
GIVEN

A sequence of observations (e.g., rolls by the casino player):
1245526462146146136136661664661636616366163616515615115146123562344
QUESTION
How likely is this sequence, given our HMM?
This is the EVALUATION problem in HMMs.
Slides from Serafim Batzoglou

Question #2: Decoding
GIVEN
A sequence of rolls by the casino player:
1245526462146146136136661664661636616366163616515615115146123562344
QUESTION
What portion of the sequence was generated with the fair die, and what portion with the loaded die?
This is the DECODING question in HMMs.
Slides from Serafim Batzoglou

Question #3: Learning
GIVEN

A sequence of rolls by the casino player:
1245526462146146136136661664661636616366163616515615115146123562344
QUESTION
How loaded is the loaded die? How fair is the fair die? How often does the casino player change from fair to loaded, and back?
This is the LEARNING question in HMMs.
Slides from Serafim Batzoglou

Decoding Problem
Find the best parse of a sequence.
Slides from Serafim Batzoglou

Decoding
GIVEN x = x_1 x_2 ... x_N, we want to find π = π_1, ..., π_N such that P[x, π] is maximized:
π* = argmax_π P[x, π]
We can use dynamic programming! Let
V_k(i) = max_{π_1..π_{i-1}} P[x_1 ... x_{i-1}, π_1, ..., π_{i-1}, x_i, π_i = k]
= probability of the most likely sequence of states ending at π_i = k.
Slides from Serafim Batzoglou

Decoding: main idea
Given V_k(i) for all states k at a fixed position i, what is V_l(i+1)? From the definition:
V_l(i+1) = max_{π_1..π_i} P[x_1 ... x_i, π_1, ..., π_i, x_{i+1}, π_{i+1} = l]
= max_{π_1..π_i} P(x_{i+1}, π_{i+1} = l | x_1 ... x_i, π_1, ..., π_i) · P[x_1 ... x_i, π_1, ..., π_i]
= max_{π_1..π_i} P(x_{i+1}, π_{i+1} = l | π_i) · P[x_1 ... x_{i-1}, π_1, ..., π_{i-1}, x_i, π_i]
= max_k [ P(x_{i+1}, π_{i+1} = l | π_i = k) · max_{π_1..π_{i-1}} P[x_1 ... x_{i-1}, π_1, ..., π_{i-1}, x_i, π_i = k] ]
= e_l(x_{i+1}) · max_k a_kl V_k(i)
Remember: V_k(i) = probability of the most likely state sequence ending with π_i = k.
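The recurrence V_l(i+1) = e_l(x_{i+1}) · max_k a_kl V_k(i) translates almost directly into code. A sketch (ours, not from the slides), computed in log space to avoid underflow, assuming dictionaries of start, transition, and emission probabilities:

```python
import math

# Casino parameters from the slides; the dictionary encoding is ours.
STATES = ("F", "L")
START = {"F": 0.5, "L": 0.5}
TRANS = {"F": {"F": 0.95, "L": 0.05}, "L": {"F": 0.05, "L": 0.95}}
EMIT = {"F": {d: 1 / 6 for d in range(1, 7)},
        "L": {**{d: 1 / 10 for d in range(1, 6)}, 6: 1 / 2}}

def viterbi(obs, states, start, trans, emit):
    """V_l(i+1) = e_l(x_{i+1}) * max_k a_kl V_k(i), in log space."""
    V = [{s: math.log(start[s]) + math.log(emit[s][obs[0]]) for s in states}]
    back = []
    for x in obs[1:]:
        col, ptr = {}, {}
        for l in states:
            best = max(states, key=lambda k: V[-1][k] + math.log(trans[k][l]))
            col[l] = V[-1][best] + math.log(trans[best][l]) + math.log(emit[l][x])
            ptr[l] = best
        V.append(col)
        back.append(ptr)
    # Trace back the most likely state sequence.
    last = max(states, key=lambda k: V[-1][k])
    path = [last]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return list(reversed(path))

path = viterbi([6, 6, 6, 6, 6, 6, 6, 6], STATES, START, TRANS, EMIT)
```

On a run of eight sixes the decoder blames the loaded die for every position.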

Slides from Serafim Batzoglou

The Viterbi Algorithm: Dynamic Programming
[Figure: K × N trellis over x_1 ... x_i, x_{i+1} ... x_N; each cell V_j(i+1) is filled from the maximum over the previous column]
Time: O(K²N); Space: O(KN)
Remember: V_k(i) = probability of the most likely state sequence ending with π_i = k.
Slides from Serafim Batzoglou

Decoding = Extraction
Example: Research Paper Headers (Pedro Domingos header, from Okan Basegmez)
Daniel S. Weld

Evaluation Problem
What's the probability that this observation sequence would have been generated by the HMM?

Slides from Serafim Batzoglou

Why Do We Care?
It will be important for learning.

The Forward Algorithm
We want to calculate P(x) = the probability of x, given the HMM. Sum over all possible ways of generating x:
P(x) = Σ_π P(x, π) = Σ_π P(x | π) P(π)
To avoid summing over an exponential number of paths π, define
f_k(i) = P(x_1 ... x_i, π_i = k)   (the forward probability)
Slides from Serafim Batzoglou

Forward Probabilities
What is the probability that, given an HMM λ, at time t the state is s_i and the partial observation o_1 ... o_t has been emitted?
α_t(i) = P(o_1 ... o_t, q_t = s_i | λ)
Daniel S. Weld; slides by Bonnie Dorr

Problem 1: Probability of an Observation Sequence
What is P(O | λ)? The probability of an observation sequence is the sum, over all possible state sequences in the HMM, of each sequence's probability times the chance that it emitted the observations. Naive computation is very expensive: given T observations and N states, there are N^T possible state sequences. Even small HMMs, e.g. T = 10 and N = 10, contain 10 billion different paths. The solution to this (and to Problem 2) is dynamic programming.
Daniel S. Weld; slides by Bonnie Dorr

Forward Probabilities
α_t(i) = P(o_1 ... o_t, q_t = s_i | λ)

α_t(j) = [ Σ_{i=1..N} α_{t-1}(i) · a_ij ] · b_j(o_t)
Daniel S. Weld; slides by Bonnie Dorr

Relation between Forward and Viterbi
VITERBI
- Initialization: V_0(0) = 1; V_k(0) = 0, for all k > 0
- Iteration: V_j(i) = e_j(x_i) · max_k V_k(i-1) a_kj
- Termination: P(x, π*) = max_k V_k(N)
FORWARD
- Initialization: f_0(0) = 1; f_k(0) = 0, for all k > 0
- Iteration: f_l(i) = e_l(x_i) · Σ_k f_k(i-1) a_kl
- Termination: P(x) = Σ_k f_k(N) a_k0
Slides from Serafim Batzoglou

Backward Probabilities
β_t(i) = P(o_{t+1} ... o_T | q_t = s_i, λ)
β_t(i) = Σ_{j=1..N} a_ij · b_j(o_{t+1}) · β_{t+1}(j)
Daniel S. Weld; slides by Bonnie Dorr

Computational Complexity
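The forward recurrence α_t(j) = [Σ_i α_{t-1}(i) a_ij] b_j(o_t), combined with the rescale-by-a-constant trick for underflow, can be sketched as follows (our encoding, on the casino parameters); the backward pass mirrors the same loop right to left:

```python
import math

# Casino parameters from the slides; the dictionary encoding is ours.
STATES = ("F", "L")
START = {"F": 0.5, "L": 0.5}
TRANS = {"F": {"F": 0.95, "L": 0.05}, "L": {"F": 0.05, "L": 0.95}}
EMIT = {"F": {d: 1 / 6 for d in range(1, 7)},
        "L": {**{d: 1 / 10 for d in range(1, 6)}, 6: 1 / 2}}

def forward(obs, states, start, trans, emit):
    """alpha_t(j) = [sum_i alpha_{t-1}(i) a_ij] * b_j(o_t), rescaled at each
    position by a constant so probabilities never underflow.
    Returns log P(obs | model), the scaled alphas, and the constants."""
    alphas, scales = [], []
    for t, x in enumerate(obs):
        if t == 0:
            a = {s: start[s] * emit[s][x] for s in states}
        else:
            a = {j: sum(alphas[-1][i] * trans[i][j] for i in states) * emit[j][x]
                 for j in states}
        c = sum(a.values())          # scaling constant for position t
        alphas.append({s: a[s] / c for s in states})
        scales.append(c)
    return sum(math.log(c) for c in scales), alphas, scales

log_p, alphas, scales = forward([6, 1, 6], STATES, START, TRANS, EMIT)
```

Because the product of the scaling constants equals P(obs), the log-probability can be recovered exactly even for long sequences.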

What are the running time and space required for Forward and Backward? Time: O(K²N); Space: O(KN).
Useful implementation techniques to avoid underflow:
- Viterbi: sum of logs
- Forward/Backward: rescaling at each position by multiplying by a constant
Slides from Serafim Batzoglou

Learning Problem
Find the most likely HMM given an observed sequence.
Slides from Serafim Batzoglou

We Need to Learn This?!?
Definition: A hidden Markov model (HMM) has
- An alphabet Σ = { b_1, b_2, ..., b_M }
- A set of states Q = { 1, ..., K }
- Transition probabilities a_ij between any two states, with a_i1 + ... + a_iK = 1 for all states i = 1..K
- Start probabilities a_0i, with a_01 + ... + a_0K = 1
- Emission probabilities e_k(b) = P( x_i = b | π_i = k ), with e_k(b_1) + ... + e_k(b_M) = 1 for all states k = 1..K
Slides from Serafim Batzoglou

How to Learn an HMM?
Two questions: structure & parameters.
Daniel S. Weld

Simplest Case
- Fix the structure; learn the transition & emission probabilities.
- Training data? Label each word as target or non-target.
- Challenges: sparse training data; unseen words would get zero probability, hence smoothing!
Daniel S. Weld
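Add-one (Laplace) smoothing is the simplest fix for unseen words when estimating emission probabilities from labeled data. A hypothetical helper (the function name and data layout are ours):

```python
from collections import Counter

def smoothed_emissions(tokens_by_state, vocab, alpha=1.0):
    """Add-alpha (Laplace) smoothing: every word in the vocabulary gets a
    nonzero emission probability, even if it never appears in a state's
    training tokens. `tokens_by_state` maps state -> list of observed
    tokens (names here are ours, for illustration)."""
    emit = {}
    for state, toks in tokens_by_state.items():
        counts = Counter(toks)
        denom = len(toks) + alpha * len(vocab)
        emit[state] = {w: (counts[w] + alpha) / denom for w in vocab}
    return emit

emit = smoothed_emissions({"target": ["pedro", "pedro"], "background": ["the"]},
                          vocab=["the", "pedro", "kentucky"])
```

With alpha = 0 this reduces to the unsmoothed maximum-likelihood estimate, which assigns probability zero to any word unseen in training.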

Problem 3: Learning
So far we assumed we know the underlying model λ = (A, E, a_0) (also written (A, B, π)). Often these parameters are estimated on annotated training data, which has two drawbacks:
- Annotation is difficult and/or expensive.
- Training data is different from the current data.
We want to maximize the parameters with respect to the current data, i.e., we're looking for a model λ' such that
λ' = argmax_λ P(O | λ)
Daniel S. Weld; slides by Bonnie Dorr

Problem 3: Learning
Unfortunately, there is no known way to analytically find a global maximum, i.e., a model λ' such that λ' = argmax_λ P(O | λ). But it is possible to find a local maximum! Given an initial model λ, we can always find a model λ' such that P(O | λ') ≥ P(O | λ).
Daniel S. Weld; slides by Bonnie Dorr

Parameter Re-estimation
Use a hill-climbing algorithm, called the forward-backward (or Baum-Welch) algorithm.
Idea:
0. Use an initial parameter instantiation.
1. Loop: iteratively re-estimate the parameters, improving the probability that the given observations are generated by the new parameters.
2. Until the estimates don't change much.
Daniel S. Weld; slides by Bonnie Dorr

Parameter Re-estimation
Three parameters need to be re-estimated:
- Initial state distribution: π_i
- Transition probabilities: a_i,j
- Emission probabilities: e_i(o_t)
Daniel S. Weld; slides by Bonnie Dorr

Expectation Maximization
The forward-backward algorithm is an instance of the more general EM algorithm.
- The E step: compute the forward and backward probabilities for a given model.
- The M step: re-estimate the model parameters.
Daniel S. Weld
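One EM iteration for an HMM can be sketched end to end. For clarity this sketch (ours, not from the slides) does the E step by brute-force enumeration of state paths rather than the forward-backward recursions, so it is only usable on toy sequences; the M step re-estimates transitions and emissions from expected counts, holding the start distribution fixed:

```python
import itertools

# Casino parameters from the slides; the dictionary encoding is ours.
STATES = ("F", "L")
START = {"F": 0.5, "L": 0.5}
TRANS = {"F": {"F": 0.95, "L": 0.05}, "L": {"F": 0.05, "L": 0.95}}
EMIT = {"F": {d: 1 / 6 for d in range(1, 7)},
        "L": {**{d: 1 / 10 for d in range(1, 6)}, 6: 1 / 2}}

def em_step(obs, states, start, trans, emit):
    """One EM iteration. E step: posterior over complete state paths by
    brute-force enumeration (Baum-Welch would use forward-backward here).
    M step: re-estimate transition and emission probabilities from
    expected counts, holding `start` fixed."""
    T = len(obs)
    joint = {}
    for path in itertools.product(states, repeat=T):
        p = start[path[0]] * emit[path[0]][obs[0]]
        for t in range(1, T):
            p *= trans[path[t - 1]][path[t]] * emit[path[t]][obs[t]]
        joint[path] = p
    Z = sum(joint.values())                            # Z = P(obs | model)
    tc = {i: {j: 0.0 for j in states} for i in states}  # expected transitions
    ec = {i: {} for i in states}                        # expected emissions
    for path, p in joint.items():
        w = p / Z
        for t in range(T):
            ec[path[t]][obs[t]] = ec[path[t]].get(obs[t], 0.0) + w
            if t:
                tc[path[t - 1]][path[t]] += w
    new_trans = {i: {j: tc[i][j] / sum(tc[i].values()) for j in states}
                 for i in states}
    new_emit = {i: {o: c / sum(ec[i].values()) for o, c in ec[i].items()}
                for i in states}
    return new_trans, new_emit, Z

new_trans, new_emit, likelihood = em_step([6, 6, 1], STATES, START, TRANS, EMIT)
```

Repeating em_step with the returned parameters never decreases the likelihood of the observations, which is exactly the hill-climbing guarantee described above.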

Slides by Bonnie Dorr

Chicken & Egg Problem
- If we knew the actual sequence of states, it would be easy to learn a_i,j and e_i(o_t). But we can't observe the states, so we don't!
- If we knew the transition & emission probabilities, it'd be easy to estimate the sequence of states (Viterbi algorithm). But we don't know them!
Daniel S. Weld

Expectation Maximization (EM) (high-level version)
- Pretend we do know the parameters: initialize randomly.
- [E step] Compute the probability of each instance having each possible value of the hidden variable (e.g., which state).

Simplest Version
- Mixture of two distributions.
- Know: the form of the distributions and their variance; just need the mean of each distribution.
[Figure: unlabeled data points along the .01-.09 axis. Coloring the instances would be easy if we knew the Gaussians, and finding the Gaussians would be easy if we knew the coloring.]
Daniel S. Weld

Expectation Maximization (EM)
- Pretend we do know the parameters: initialize randomly, setting μ_1 = ?, μ_2 = ?
- [E step] Compute the probability of each instance having each possible value of the hidden variable.
- [M step] Treating each instance as fractionally having both values, compute the new parameter values.
Daniel S. Weld

ML Mean of a Single Gaussian
μ_ML = argmin_μ Σ_i (x_i − μ)²
Daniel S. Weld

Expectation Maximization (EM)
- [E step] Compute the probability of each instance having each possible value of the hidden variable.
- [M step] Treating each instance as fractionally having both values, compute the new parameter values.
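For the two-Gaussian picture above, the whole E/M loop fits in a few lines. A sketch (ours, not from the slides) with a known, shared variance and only the two means unknown; we initialize from the data extremes rather than randomly so the sketch is reproducible:

```python
import math

def em_two_gaussians(xs, sigma=0.01, iters=30):
    """EM for a mixture of two equal-weight Gaussians with known, shared
    sigma; only the two means are learned. Initialization from the data
    extremes is our choice (the slides initialize randomly)."""
    mu = [min(xs), max(xs)]
    for _ in range(iters):
        # E step: responsibility of component 0 for each point.
        r0 = []
        for x in xs:
            p0 = math.exp(-((x - mu[0]) ** 2) / (2 * sigma ** 2))
            p1 = math.exp(-((x - mu[1]) ** 2) / (2 * sigma ** 2))
            r0.append(p0 / (p0 + p1))
        # M step: means as responsibility-weighted averages
        # (each instance counts fractionally toward both components).
        w0 = sum(r0)
        mu = [sum(r * x for r, x in zip(r0, xs)) / w0,
              sum((1 - r) * x for r, x in zip(r0, xs)) / (len(xs) - w0)]
    return mu

mu = em_two_gaussians([0.029, 0.030, 0.031, 0.069, 0.070, 0.071])
```

On data clustered near .03 and .07 (the scale of the figure above), the two means settle onto the cluster centers within a few iterations.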

Daniel S. Weld

Iterate Until Convergence
Problems:
- Need to assume the form of the distribution.
- Local maxima.
But it really works in practice! Can easily extend to multiple variables (e.g., mean & variance) or to much more complex models.
Daniel S. Weld

Parameter Re-estimation
Use a hill-climbing algorithm, called the forward-backward

(or Baum-Welch) algorithm.
Idea:
0. Use an initial parameter instantiation.
1. Loop: iteratively re-estimate the parameters, improving the probability that the given observations are generated by the new parameters.
2. Until the estimates don't change much.
Daniel S. Weld; slides by Bonnie Dorr

Parameter Re-estimation
Two parameters need to be re-estimated:
- Transition probabilities: a_i,j
- Emission probabilities: e_i(o_t)
Can use the forward & backward algorithms.
Daniel S. Weld; slides by Bonnie Dorr

Re-estimating Transition Probabilities
ξ_t(i, j) = P(q_t = s_i, q_{t+1} = s_j | O, λ)
ξ_t(i, j) = α_t(i) · a_i,j · b_j(o_{t+1}) · β_{t+1}(j) / [ Σ_{i=1..N} Σ_{j=1..N} α_t(i) · a_i,j · b_j(o_{t+1}) · β_{t+1}(j) ]
Daniel S. Weld; slides by Bonnie Dorr

That was HMMs
We Want More than an Atomic View of Words
Would like a richer representation of text: many arbitrary, overlapping features of the words, e.g.:
- identity of word; ends in "-ski"; is capitalized; is part of a noun phrase; is "Wisniewski"; is in a list of city names; is under node X in WordNet; is in bold font; is indented; is in a hyperlink anchor; last person name was female; next two words are "and Associates"
[Figure: HMM states S_{t-1}, S_t, S_{t+1} over observations O_{t-1}, O_t, O_{t+1}, with the features attached to the observations]
Slides from Cohen & McCallum

Problems with Richer Representation and a Joint Model
These arbitrary features are not independent:
- Multiple levels of granularity (chars, words, phrases)
- Multiple dependent modalities (words, formatting, layout)
- Past & future
Two choices:
1. Model the dependencies. Each state would have its own Bayes net. But we are already starved for training data!
2. Ignore the dependencies. This causes over-counting of evidence (à la naive Bayes). A big problem when combining evidence, as in Viterbi!
Slides from Cohen & McCallum

Conditional Sequence Models
We prefer a model trained to maximize a conditional probability rather than a joint probability: P(s|o) instead of P(s,o). Such a model
- Can examine features, but is not responsible for generating them.
- Doesn't have to explicitly model their dependencies.
- Doesn't waste modeling effort trying to generate what we are given at test time anyway.
Slides from Cohen & McCallum

Conditional Finite State Sequence Models
[McCallum, Freitag & Pereira 2000] [Lafferty, McCallum & Pereira 2001]
From HMMs to CRFs: s = s_1, s_2, ..., s_n and o = o_1, o_2, ..., o_n.

Joint: P(s, o) = ∏_{t=1..|o|} P(s_t | s_{t-1}) · P(o_t | s_t)

Conditional: P(s | o) = (1 / P(o)) · ∏_{t=1..|o|} P(s_t | s_{t-1}) · P(o_t | s_t)
           = (1 / Z(o)) · ∏_{t=1..|o|} Φ_s(s_t, s_{t-1}) · Φ_o(o_t, s_t)

where Φ_o(t) = exp( Σ_k λ_k f_k(s_t, o_t) )

(A super-special case of Conditional Random Fields.)
Slides from Cohen & McCallum

Conditional Random Fields
[Lafferty, McCallum & Pereira 2001]
1. FSM special case: a linear chain among the unknowns S_t, S_{t+1}, S_{t+2}, S_{t+3}, S_{t+4}, with parameters tied across time steps, conditioned on O = O_t, O_{t+1}, O_{t+2}, O_{t+3}, O_{t+4}:

P(s | o) = (1 / Z(o)) · ∏_{t=1..|o|} exp( Σ_k λ_k f_k(s_t, s_{t-1}, o, t) )

2. In general: CRFs = "conditionally-trained Markov network" with arbitrary structure among the unknowns.
3. Relational Markov Networks [Taskar, Abbeel, Koller 2002]: parameters tied across hits from SQL-like queries ("clique templates").
4. Markov Logic Networks [Richardson & Domingos]: weights on arbitrary logical sentences.
Slides from Cohen & McCallum
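The linear-chain formula can be made concrete with a toy scorer. This sketch (ours, not from the slides) computes Z(o) by brute force over all label sequences, which is fine only for tiny chains; the feature function and weight are illustrative assumptions:

```python
import itertools
import math

def score(s, o, feats, lam):
    """sum over positions t and features k of lam_k * f_k(s_t, s_{t-1}, o, t);
    s_{t-1} is None at the start of the chain."""
    return sum(lam[k] * f(s[t], s[t - 1] if t > 0 else None, o, t)
               for t in range(len(o)) for k, f in feats.items())

def prob(s, o, states, feats, lam):
    """P(s|o) = exp(score(s,o)) / Z(o), with Z(o) summed by brute force."""
    Z = sum(math.exp(score(sp, o, feats, lam))
            for sp in itertools.product(states, repeat=len(o)))
    return math.exp(score(s, o, feats, lam)) / Z

# Illustrative feature and weight (our assumptions): the current word is
# capitalized AND the current label is NAME.
feats = {"cap_name": lambda st, prev, o, t: float(o[t][0].isupper() and st == "NAME")}
lam = {"cap_name": 2.0}
p_name = prob(("NAME", "OTHER"), ["Pedro", "spoke"], ("NAME", "OTHER"), feats, lam)
```

A positive weight on this feature pushes probability mass toward labelings that tag capitalized words as NAME, without the model having to generate the words themselves.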

Feature Functions: Example
An example feature f_k(s_t, s_{t-1}, o, t):

f_{Capitalized, s_i, s_j}(s_t, s_{t-1}, o, t) = 1 if Capitalized(o_t) ∧ s_{t-1} = s_i ∧ s_t = s_j; 0 otherwise

o = "Yesterday Pedro Domingos spoke this example sentence." (words o_1 ... o_7, with states s_1, s_2, s_3, s_4, ...)
For example, f_{Capitalized, s_1, s_3}(s_1, s_2, o, 2) = 1.
Slides from Cohen & McCallum

Learning Parameters of CRFs
Maximize the log-likelihood of the parameters Λ = {λ_k} given training data D:

L_Λ = Σ_{(s,o) ∈ D} log [ (1 / Z(o)) · ∏_{t=1..|o|} exp( Σ_k λ_k f_k(s_t, s_{t-1}, o, t) ) ] − Σ_k λ_k² / (2σ²)

Log-likelihood gradient:

∂L_Λ / ∂λ_k = Σ_{(s,o) ∈ D} #_k(s, o) − Σ_{(s,o) ∈ D} Σ_{s'} P_Λ(s' | o) · #_k(s', o) − λ_k / σ²

where #_k(s, o) = Σ_t f_k(s_{t-1}, s_t, o, t)

Methods: iterative scaling (quite slow); conjugate gradient (much faster); limited-memory quasi-Newton methods, BFGS (super fast) [Sha & Pereira 2002] [Malouf 2002]
Slides from Cohen & McCallum

General CRFs vs. HMMs
- More general and expressive modeling technique.
- Comparable computational efficiency.
- Features may be arbitrary functions of any or all observations.
- Parameters need not fully specify the generation of observations; require less training data.
- Easy to incorporate domain knowledge.
- State means only the state of the process, vs. the state of the process plus the observational history.
Slides from Cohen & McCallum

Person Name Extraction
[McCallum 2001, unpublished]
Slides from Cohen & McCallum

Features in Experiment
- Capitalized (Xxxxx), Mixed Caps (XxXxxx), All Caps (XXXXX), Initial Cap (X.), Contains Digit (xxx5), All lowercase (xxxx), Initial (X)
- Punctuation (.,:;!(), etc.), Period (.), Comma (,), Apostrophe, Dash
- Preceded by HTML tag
- Character n-gram classifier says string is a person name (80% accurate)
- In stopword list (the, of, their, etc.)
- In honorific list (Mr, Mrs, Dr, Sen, etc.)
- In person suffix list (Jr, Sr, PhD, etc.)
- In name particle list (de, la, van, der, etc.)
- In Census lastname list, segmented by P(name); in Census firstname list, segmented by P(name)
- In location lists (states, cities, countries)
- In company name list (J. C. Penny); in list of company suffixes (Inc, & Associates, Foundation)
- Hand-built FSM person-name extractor says yes (precision/recall ~ 30/95)
- Conjunctions of all previous feature pairs, evaluated at the current time step
- Conjunctions of all previous feature pairs, evaluated at the current step and one step ahead
- All previous features, evaluated two steps ahead
- All previous features, evaluated one step behind
Total number of features: ~500k
Slides from Cohen & McCallum

Training and Testing
- Trained on 65k words from 85 pages, from 30 different companies' web sites.
- Training takes 4 hours on a 1 GHz Pentium.
- Training precision/recall is 96% / 96%.
- Tested on a different set of web pages with similar size characteristics.
- Testing precision is 92-95%; recall is 89-91%.
Slides from Cohen & McCallum

Limitations of Finite State Models
Finite state models have a linear structure

Web documents have a hierarchical structure. Are we suffering by not modeling this structure more explicitly? How can one learn a hierarchical extraction model?
Slides from Cohen & McCallum

IE Resources
Data:
- RISE, http://www.isi.edu/~muslea/RISE/index.html
- Linguistic Data Consortium (LDC): Penn Treebank, Named Entities, Relations, etc.
- http://www.biostat.wisc.edu/~craven/ie
- http://www.cs.umass.edu/~mccallum/data
Code:
- TextPro, http://www.ai.sri.com/~appelt/TextPro
- MALLET, http://www.cs.umass.edu/~mccallum/mallet
- SecondString, http://secondstring.sourceforge.net/
Both:
- http://www.cis.upenn.edu/~adwait/penntools.html
Slides from Cohen & McCallum

References

[Bikel et al 1997] Bikel, D.; Miller, S.; Schwartz, R.; Weischedel, R. Nymble: a high-performance learning name-finder. In Proceedings of ANLP-97, pp. 194-201.
[Califf & Mooney 1999] Califf, M. E.; Mooney, R. Relational Learning of Pattern-Match Rules for Information Extraction. In Proceedings of the Sixteenth National Conference on Artificial Intelligence (AAAI-99).
[Cohen, Hurst, Jensen 2002] Cohen, W.; Hurst, M.; Jensen, L. A flexible learning system for wrapping tables and lists in HTML documents. In Proceedings of the Eleventh International World Wide Web Conference (WWW-2002).
[Cohen, Kautz, McAllester 2000] Cohen, W.; Kautz, H.; McAllester, D. Hardening soft information sources. In Proceedings of the Sixth International Conference on Knowledge Discovery and Data Mining (KDD-2000).
[Cohen 1998] Cohen, W. Integration of Heterogeneous Databases Without Common Domains Using Queries Based on Textual Similarity. In Proceedings of ACM SIGMOD-98.
[Cohen 2000a] Cohen, W. Data Integration using Similarity Joins and a Word-based Information Representation Language. ACM Transactions on Information Systems, 18(3).
[Cohen 2000b] Cohen, W. Automatically Extracting Features for Concept Learning from the Web. In Machine Learning: Proceedings of the Seventeenth International Conference (ML-2000).
[Collins & Singer 1999] Collins, M.; Singer, Y. Unsupervised models for named entity classification. In Proceedings of the Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora, 1999.
[De Jong 1982] De Jong, G. An Overview of the FRUMP System. In Lehnert, W. & Ringle, M. H. (eds.), Strategies for Natural Language Processing. Lawrence Erlbaum, 1982, pp. 149-176.
[Freitag 98] Freitag, D. Information extraction from HTML: application of a general machine learning approach. In Proceedings of the Fifteenth National Conference on Artificial Intelligence (AAAI-98).
[Freitag 1999] Freitag, D. Machine Learning for Information Extraction in Informal Domains. Ph.D. dissertation, Carnegie Mellon University.
[Freitag 2000] Freitag, D. Machine Learning for Information Extraction in Informal Domains. Machine Learning 39(2/3): 99-101 (2000).
[Freitag & Kushmerick 1999] Freitag, D.; Kushmerick, N. Boosted Wrapper Induction. In Proceedings of the Sixteenth National Conference on Artificial Intelligence (AAAI-99).
[Freitag & McCallum 1999] Freitag, D.; McCallum, A. Information extraction using HMMs and shrinkage. In Proceedings of the AAAI-99 Workshop on Machine Learning for Information Extraction. AAAI Technical Report WS-99-11.
[Kushmerick 2000] Kushmerick, N. Wrapper Induction: efficiency and expressiveness. Artificial Intelligence, 118: 15-68.
[Lafferty, McCallum & Pereira 2001] Lafferty, J.; McCallum, A.; Pereira, F. Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data. In Proceedings of ICML-2001.
[Leek 1997] Leek, T. R. Information extraction using hidden Markov models. Master's thesis, UC San Diego.
[McCallum, Freitag & Pereira 2000] McCallum, A.; Freitag, D.; Pereira, F. Maximum entropy Markov models for information extraction and segmentation. In Proceedings of ICML-2000.
[Miller et al 2000] Miller, S.; Fox, H.; Ramshaw, L.; Weischedel, R. A Novel Use of Statistical Parsing to Extract Information from Text. In Proceedings of the 1st Annual Meeting of the North American Chapter of the ACL (NAACL), pp. 226-233.
Slides from Cohen & McCallum