Automatically Building Special Purpose Search Engines with ...


Unified Models of Information Extraction and Data Mining
with Application to Social Network Analysis

Andrew McCallum
Information Extraction and Synthesis Laboratory
Computer Science Department
University of Massachusetts Amherst

Joint work with David Jensen

Site Visit, March 2005
Intelligence Technology Innovation Center (ITIC)

Goal: Mine actionable knowledge from unstructured text.

Extracting Job Openings from the Web

foodscience.com-Job2
JobTitle: Ice Cream Guru
Employer: foodscience.com
JobCategory: Travel/Hospitality
JobFunction: Food Services
JobLocation: Upper Midwest
Contact Phone: 800-488-2611
DateExtracted: January 8, 2001
Source: www.foodscience.com/jobs_midwest.htm
OtherCompanyJobs: foodscience.com-Job1

A Portal for Job Openings
Category = High Tech
Keyword = Java
Location = U.S.
Job Openings:

Data Mining the Extracted Job Information

IE from Chinese Documents regarding Weather
Department of Terrestrial System, Chinese Academy of Sciences
200k+ documents, several millennia old:
- Qing Dynasty Archives
- memos
- newspaper articles
- diaries

IE from Cargo Container Ship Manifests
Cargo Tracking Div., US Navy

IE from Research Papers [McCallum et al 99]

Mining Research Papers [Rosen-Zvi, Griffiths, Steyvers, Smyth, 2004]

What is Information Extraction
As a family of techniques:
Information Extraction = segmentation + classification + association + clustering

October 14, 2002, 4:00 a.m. PT

For years, Microsoft Corporation CEO Bill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a "cancer" that stifled technological innovation.

Today, Microsoft claims to "love" the open-source concept, by which software code is made public to encourage improvement and development by outside programmers. Gates himself says Microsoft will gladly disclose its crown jewels--the coveted code behind the Windows operating system--to select customers.

"We can be open source. We love the concept of shared source," said Bill Veghte, a Microsoft VP. "That's a super-important shift for us in terms of code access."

Richard Stallman, founder of the Free Software Foundation, countered saying...

Segmentation (extracted mentions, in document order):
Microsoft Corporation, CEO, Bill Gates, Microsoft, Gates, Microsoft, Bill Veghte, Microsoft, VP, Richard Stallman, founder, Free Software Foundation

Classification, association, and clustering:

NAME               TITLE     ORGANIZATION
Bill Gates         CEO       Microsoft
Bill Veghte        VP        Microsoft
Richard Stallman   founder   Free Soft..

Larger Context
Document collection -> Spider -> Filter -> IE (Segment, Classify, Associate, Cluster) -> Database -> Data Mining (Discover patterns: entity types, links / relations, events) -> Actionable knowledge (Prediction, Outlier detection, Decision support)

Outline
- Examples of IE and Data Mining.
- Brief review of Conditional Random Fields
- Joint inference: Motivation and examples
  - Joint Labeling of Cascaded Sequences (Belief Propagation)
  - Joint Labeling of Distant Entities (BP by Tree Reparameterization)
  - Joint Co-reference Resolution (Graph Partitioning)
  - Joint Segmentation and Co-ref (Iterated Conditional Samples)

Hidden Markov Models
HMMs are the standard sequence modeling tool in genomics, music, speech, NLP, ...

Two views of the same model: graphical model and finite state model.
State sequence: ..., S_{t-1}, S_t, S_{t+1}, ... (linked by transitions)
Observation sequence: ..., O_{t-1}, O_t, O_{t+1}, ... (emitted by the states)
Generates: o_1 o_2 o_3 o_4 o_5 o_6 o_7 o_8

$$P(\vec{s}, \vec{o}) \propto \prod_{t=1}^{|\vec{o}|} P(s_t \mid s_{t-1}) \, P(o_t \mid s_t)$$

Parameters: for all states S = {s_1, s_2, ...}
- Start state probabilities: P(s_t)
- Transition probabilities: P(s_t | s_{t-1})
- Observation (emission) probabilities: P(o_t | s_t), usually a multinomial over an atomic, fixed alphabet
Training: Maximize probability of training observations (w/ prior)

IE with Hidden Markov Models
Given a sequence of observations:
  Yesterday Rich Caruana spoke this example sentence.
and a trained HMM with states: person name, location name, background
Find the most likely state sequence (Viterbi):
  Yesterday Rich Caruana spoke this example sentence.
Any words said to be generated by the designated "person name" state extract as a person name:
  Person name: Rich Caruana

We Want More than an Atomic View of Words
Would like richer representation of text: many arbitrary, overlapping features of the words.
- identity of word
- ends in "-ski"
- is capitalized
- is part of a noun phrase
- is "Wisniewski"
- is in a list of city names
- is under node X in WordNet
- is in bold font
- is indented
- is in hyperlink anchor
- last person name was female
- next two words are "and Associates"

Problems with Richer Representation and a Joint Model
These arbitrary features are not independent.
- Multiple levels of granularity (chars, words, phrases)
- Multiple dependent modalities (words, formatting, layout)
- Past & future

Two choices:
1. Model the dependencies. Each state would have its own Bayes Net. But we are already starved for training data!
2. Ignore the dependencies. This causes "over-counting" of evidence (à la naïve Bayes). Big problem when combining evidence, as in Viterbi!

Conditional Sequence Models
We prefer a model that is trained to maximize a conditional probability rather than joint probability: P(s|o) instead of P(s,o):
- Can examine features, but not responsible for generating them.
- Don't have to explicitly model their dependencies.
- Don't "waste modeling effort" trying to generate what we are given at test time anyway.

(Linear Chain) Conditional Random Fields [Lafferty, McCallum, Pereira 2001]
Undirected graphical model, trained to maximize conditional probability of outputs given inputs.

Finite state model / graphical model:
Output seq (FSM states) y:    OTHER     PERSON   OTHER     ORG         TITLE
                              y_{t-1}   y_t      y_{t+1}   y_{t+2}     y_{t+3}
Input seq (observations) x:   said      Veght    a         Microsoft   VP
                              x_{t-1}   x_t      x_{t+1}   x_{t+2}     x_{t+3}

$$p(y \mid x) = \frac{1}{Z(x)} \prod_{t=1}^{T} \Phi_y(y_t, y_{t-1}) \, \Phi_{xy}(x_t, y_t)$$

where

$$\Phi(\cdot) = \exp\Big( \sum_k \lambda_k f_k(\cdot) \Big)$$

Fast-growing, wide-spread interest, many positive experimental results:
- Noun phrase, Named entity [HLT03], [CoNLL03]
- Protein structure prediction [ICML04]
- IE from Bioinformatics text [Bioinformatics 04]
- Asian word segmentation [COLING04], [ACL04]
- IE from Research papers [HLT04]

- Object classification in images [CVPR 04]

Table Extraction from Government Reports

Cash receipts from marketings of milk during 1995 at $19.9 billion dollars, was slightly below 1994. Producer returns averaged $12.93 per hundredweight, $0.19 per hundredweight below 1994. Marketings totaled 154 billion pounds, 1 percent above 1994. Marketings include whole milk sold to plants and dealers as well as milk sold directly to consumers.

An estimated 1.56 billion pounds of milk were used on farms where produced, 8 percent less than 1994. Calves were fed 78 percent of this milk with the remainder consumed in producer households.

Milk Cows and Production of Milk and Milkfat: United States, 1993-95

                          Production of Milk and Milkfat 2/
       Number of          Per Milk Cow         Percentage of Fat in    Total
Year   Milk Cows 1/       Milk      Milkfat    All Milk Produced       Milk       Milkfat
       (1,000 Head)       (Pounds)  (Pounds)   (Percent)               (Million Pounds)
1993   9,589              15,704    575        3.66                    150,582    5,514.4
1994   9,500              16,175    592        3.66                    153,664    5,623.7
1995   9,461              16,451    602        3.66                    155,644    5,694.3

1/ Average number during year, excluding heifers not yet fresh.
2/ Excludes milk sucked by calves.

Table Extraction from Government Reports [Pinto, McCallum, Wei, Croft, 2003 SIGIR]
100+ documents from www.fedstats.gov

A CRF assigns one label to each line of the document.

Labels:
- Non-Table
- Table Title
- Table Header
- Table Data Row
- Table Section Data Row
- Table Footnote
- ... (12 in all)

Features:
- Percentage of digit chars
- Percentage of alpha chars
- Indented
- Contains 5+ consecutive spaces
- Whitespace in this line aligns with prev.
- ...
- Conjunctions of all previous features, time offset: {0,0}, {-1,0}, {0,1}, {1,2}.

Table Extraction Experimental Results [Pinto, McCallum, Wei, Croft, 2003 SIGIR]

Method             Line labels, percent correct   Table segments, F1
HMM                65 %                           64 %
Stateless MaxEnt   85 %                           -
CRF                95 %                           92 %

IE from Research Papers [McCallum et al 99]

Method                             Field-level F1   Reference
Hidden Markov Models (HMMs)        75.6             [Seymore, McCallum, Rosenfeld, 1999]
Support Vector Machines (SVMs)     89.7             [Han, Giles, et al, 2003]
Conditional Random Fields (CRFs)   93.9             [Peng, McCallum, 2004]

Going from SVMs to CRFs reduces error by 40%.
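The line-level features listed on the table-extraction slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the definition of "aligns with previous line" (two lines share a column where a token starts after a run of 2+ spaces), and the string format of the conjunction features are all hypothetical, not taken from the cited paper.

```python
import re

# Offset pairs for feature conjunctions, as listed on the slide.
OFFSET_PAIRS = [(0, 0), (-1, 0), (0, 1), (1, 2)]


def token_start_columns(line):
    """Columns where a token starts right after a run of 2+ spaces (assumed definition)."""
    return {m.end() for m in re.finditer(r"\S {2,}(?=\S)", line)}


def line_features(line, prev_line=""):
    """Features for one line of text; names are illustrative."""
    n = max(len(line), 1)
    return {
        "pct_digit": sum(c.isdigit() for c in line) / n,
        "pct_alpha": sum(c.isalpha() for c in line) / n,
        "indented": line[:1] in (" ", "\t"),
        "has_5_spaces": " " * 5 in line,
        "aligns_with_prev": bool(
            token_start_columns(line) & token_start_columns(prev_line)
        ),
    }


def conjunction_features(active, t):
    """Conjoin binary features of lines t+a and t+b for each offset pair.

    active[i] is the set of binary feature names that fire on line i.
    """
    out = set()
    for a, b in OFFSET_PAIRS:
        if 0 <= t + a < len(active) and 0 <= t + b < len(active):
            out |= {f"{f}@{a}&{g}@{b}" for f in active[t + a] for g in active[t + b]}
    return out


rows = ["1993 :     9,589      15,704", "1994 :     9,500      16,175"]
prose = "Marketings totaled 154 billion pounds, 1 percent above 1994."
row_feats = line_features(rows[1], rows[0])
prose_feats = line_features(prose, rows[1])
```

On these sample lines, the data row is digit-heavy, contains a run of five spaces, and aligns with the row above it, while the prose line is mostly alphabetic and does not align. Those are the kinds of signals a line-labeling CRF can combine.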

Named Entity Recognition

CRICKET MILLNS SIGNS FOR BOLAND
CAPE TOWN 1996-08-22
South African provincial side Boland said on Thursday they had signed Leicestershire fast bowler David Millns on a one year contract. Millns, who toured Australia with England A in 1992, replaces former England all-rounder Phillip DeFreitas as Boland's overseas professional.

Labels   Examples
PER      Yayuk Basuki, Innocent Butare
ORG      3M, KDP, Cleveland
LOC      Cleveland, Nirmal Hriday, The Oval
MISC     Java, Basque, 1,000 Lakes Rally

Named Entity Extraction Results [McCallum & Li, 2003, CoNLL]

Method                                                  F1
HMMs, BBN's Identifinder                                73%
CRFs w/out Feature Induction                            83%
CRFs with Feature Induction based on LikelihoodGain     90%

Outline

- Examples of IE and Data Mining.
- Brief review of Conditional Random Fields
- Joint inference: Motivation and examples
  - Joint Labeling of Cascaded Sequences (Belief Propagation)
  - Joint Labeling of Distant Entities (BP by Tree Reparameterization)
  - Joint Co-reference Resolution (Graph Partitioning)
  - Joint Segmentation and Co-ref (Iterated Conditional Samples)

Problem:
Pipeline: Document collection -> IE (Segment, Classify, Associate, Cluster) -> Database -> Knowledge Discovery (Discover patterns: entity types, links / relations, events) -> Actionable knowledge

Combined in serial juxtaposition, IE and DM are unaware of each other's weaknesses and opportunities.
1) DM begins from a populated DB, unaware of where the data came from, or its inherent uncertainties.
2) IE is unaware of emerging patterns and regularities in the DB.
The accuracy of both suffers, and significant mining of complex text sources is beyond reach.

Solution:
IE passes uncertainty info forward to Data Mining; Data Mining passes emerging patterns back to IE.

Solution: Unified Model
IE and Data Mining share a single probabilistic model.
Discriminatively-trained undirected graphical models:
- Conditional Random Fields [Lafferty, McCallum, Pereira]
- Conditional PRMs [Koller], [Jensen], [Getoor], [Domingos]
Complex Inference and Learning: just what we researchers like to sink our teeth into!
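As a concrete companion to the linear-chain CRF reviewed earlier, the following minimal sketch computes the partition function Z(x) with the forward algorithm and the most likely label sequence with Viterbi. The label set and the input words come from the CRF slide; the log-potentials `log_phi_xy` and `log_phi_y` are hand-picked, hypothetical stand-ins for the learned sums of lambda_k f_k, not weights from any trained model.

```python
import itertools
import math

LABELS = ["OTHER", "PERSON", "ORG", "TITLE"]


def log_phi_xy(word, label):
    """Hypothetical observation log-potential, i.e. sum_k lambda_k f_k(x_t, y_t)."""
    score = 0.0
    if word[0].isupper() and label != "OTHER":
        score += 1.0
    if word[0].islower() and label == "OTHER":
        score += 1.5
    if word == "Microsoft" and label == "ORG":
        score += 2.0
    if word == "VP" and label == "TITLE":
        score += 2.0
    return score


def log_phi_y(prev, cur):
    """Hypothetical transition log-potential; prev is None at t = 1."""
    if prev is None:
        return 0.0
    return -1.0 if (prev == cur and cur != "OTHER") else 0.0


def sequence_score(words, labels):
    """Unnormalized log-score: sum over t of the two log-potentials."""
    prev, total = None, 0.0
    for word, label in zip(words, labels):
        total += log_phi_y(prev, label) + log_phi_xy(word, label)
        prev = label
    return total


def logsumexp(values):
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def log_partition(words):
    """Forward algorithm: log Z(x) in O(T * |LABELS|^2)."""
    alpha = {y: log_phi_y(None, y) + log_phi_xy(words[0], y) for y in LABELS}
    for word in words[1:]:
        alpha = {
            y: log_phi_xy(word, y)
            + logsumexp([alpha[p] + log_phi_y(p, y) for p in LABELS])
            for y in LABELS
        }
    return logsumexp(list(alpha.values()))


def viterbi(words):
    """Most likely label sequence under the same potentials."""
    delta = {y: log_phi_y(None, y) + log_phi_xy(words[0], y) for y in LABELS}
    backptrs = []
    for word in words[1:]:
        new_delta, ptr = {}, {}
        for y in LABELS:
            best_prev = max(LABELS, key=lambda p: delta[p] + log_phi_y(p, y))
            new_delta[y] = delta[best_prev] + log_phi_y(best_prev, y) + log_phi_xy(word, y)
            ptr[y] = best_prev
        delta = new_delta
        backptrs.append(ptr)
    path = [max(LABELS, key=lambda y: delta[y])]
    for ptr in reversed(backptrs):
        path.append(ptr[path[-1]])
    return path[::-1]


words = ["said", "Veght", "a", "Microsoft", "VP"]
log_z = log_partition(words)
best = viterbi(words)
prob_best = math.exp(sequence_score(words, best) - log_z)

# Brute-force check of Z(x) over all |LABELS|^T label sequences.
brute_log_z = math.log(
    sum(math.exp(sequence_score(words, s))
        for s in itertools.product(LABELS, repeat=len(words)))
)
```

Because the model only scores labels given the words, arbitrary overlapping features can be added to `log_phi_xy` without having to model how the words themselves were generated.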

Larger-scale Joint Inference for IE
- What model structures will capture salient dependencies?
- Will joint inference improve accuracy?
- How to do inference in these large graphical models?
- How to efficiently train these models, which are built from multiple large components?

1. Jointly labeling cascaded sequences
Factorial CRFs [Sutton, Khashayar, McCallum, ICML 2004]
Layers, top to bottom: Named-entity tag, Noun-phrase boundaries, Part-of-speech, English words
In a cascade of separate taggers, errors cascade--must be perfect at every stage to do well.
Joint prediction of part-of-speech and noun-phrase in newswire, matching accuracy with only 50% of the training data.
Inference: Tree reparameterization BP [Wainwright et al, 2002]

2. Jointly labeling distant mentions
Skip-chain CRFs [Sutton, McCallum, SRL 2004]
Example: Senator Joe Green said today ... . Green ran for ...
In a linear chain, dependency among similar, distant mentions is ignored.
14% reduction in error on most repeated field in email seminar announcements.
Inference: Tree reparameterization BP [Wainwright et al, 2002]

3. Joint co-reference among all pairs
Affinity Matrix CRF ("Entity resolution", "Object correspondence")
Mentions: Mr Powell, Powell, she. Each pair carries a Y/N decision; the figure shows edge weights 45, 99, and 11.
~25% reduction in error on co-reference of proper nouns in newswire.
Inference: Correlation clustering graph partitioning [Bansal, Blum, Chawla, 2002]
[McCallum, Wellner, IJCAI WS 2003, NIPS 2004]

Coreference Resolution
AKA "record linkage", "database record deduplication", "entity resolution", "object correspondence", "identity uncertainty"

Input: News article, with named-entity "mentions" tagged
  Today Secretary of State Colin Powell met with ... he ... Condoleezza Rice ... Mr Powell ... she ... Powell ... President Bush ... Rice ... Bush ...

Output: Number of entities, N = 3
  #1: Secretary of State Colin Powell, he, Mr. Powell, Powell
  #2: Condoleezza Rice, she, Rice
  #3: President Bush, Bush

Inside the Traditional Solution
Pair-wise Affinity Metric: Mention (3) ". . . Mr Powell . . ." vs. Mention (4) ". . . Powell . . .", Y/N?

Fires?   Weight   Feature
N        29       Two words in common
Y        13       One word in common
Y        39       "Normalized" mentions are string identical
Y        17       Capitalized word in common
Y        19       > 50% character tri-gram overlap
N        -34      < 25% character tri-gram overlap
Y        9        In same sentence
Y        8        Within two sentences
N        -1       Further than 3 sentences apart
Y        11       "Hobbs Distance" < 3
N        12       Number of entities in between two mentions = 0
N        -3       Number of entities in between two mentions > 4
Y        1        Font matches
Y        -19      Default

OVERALL SCORE = 98 > threshold = 0

The Problem
Pair-wise merging decisions are being made independently from each other.
Among the mentions Mr Powell, Powell, and she, the three pair-wise decisions are: affinity = 98 (Y), affinity = -104 (N), affinity = 11 (Y).
Affinity measures are noisy and imperfect.
Decisions should be made in relational dependence with each other.

A Markov Random Field for Co-reference (MRF) [McCallum & Wellner, 2003, ICML]

Mentions: Mr Powell, Powell, she. Each pair carries a Y/N decision; the edge weights are 45, -30, and 11.

Make pair-wise merging decisions in dependent relation to each other by
- calculating a joint prob.
- including all edge weights
- adding dependence on consistent triangles.

$$P(\vec{y} \mid \vec{x}) = \frac{1}{Z_{\vec{x}}} \exp\Big( \sum_{i,j} \sum_l \lambda_l f_l(x_i, x_j, y_{ij}) + \sum_{i,j,k} \lambda' f'(y_{ij}, y_{jk}, y_{ik}) \Big)$$

Example configurations (an edge labeled Y adds its weight, an edge labeled N subtracts it):
- 45-edge = N, (-30)-edge = N, 11-edge = Y: -(45) - (-30) + (11) = -4
- 45-edge = Y, (-30)-edge = N, 11-edge = Y: inconsistent triangle, score = -infinity
- 45-edge = Y, (-30)-edge = N, 11-edge = N: +(45) - (-30) - (11) = 64

Inference in these MRFs = Graph Partitioning
[Boykov, Veksler, Zabih, 1999], [Kolmogorov & Zabih, 2002], [Yu, Cross, Shi, 2002]

Mentions: Mr Powell, Powell, Condoleezza Rice, she
Edge weights shown in the figure: 45, 106, 30, 134, 11, 10

$$\log P(\vec{y} \mid \vec{x}) \propto \sum_{i,j} \sum_l \lambda_l f_l(x_i, x_j, y_{ij}) = \sum_{i,j \text{ within partitions}} w_{ij} + \sum_{i,j \text{ across partitions}} w'_{ij}$$

The two example partitionings shown in the figure score 22 and 314.

Co-reference Experimental Results [McCallum & Wellner, 2003]
Proper noun co-reference

DARPA ACE broadcast news transcripts, 117 stories

Method                     Partition F1   Pair F1
Single-link threshold      16 %           18 %
Best prev match [Morton]   83 %           89 %
MRFs                       88 %           92 %
Error reduction            30 %           28 %

DARPA MUC-6 newswire article corpus, 30 stories

Method                     Partition F1   Pair F1
Single-link threshold      11 %           7 %
Best prev match [Morton]   70 %           76 %
MRFs                       74 %           80 %
Error reduction            13 %           17 %
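The pair-wise affinity score and the triangle-consistency scoring from the co-reference slides can be checked numerically. The sketch below makes two assumptions that should be read as such: the Y/N flags and weights are paired with the features in the order they appear on the "Pair-wise Affinity Metric" slide, and the minus signs on the MRF slide were lost in transcription, so one edge weight is taken to be -30. Under that reading the three displayed totals come out as -4, -infinity, and 64. Which of the two edges touching "she" carries which weight is also an assumption; it does not affect the totals.

```python
import itertools
import math

# (feature, fires?, weight), in slide order (the pairing is an assumption).
AFFINITY_FEATURES = [
    ("Two words in common", False, 29),
    ("One word in common", True, 13),
    ('"Normalized" mentions are string identical', True, 39),
    ("Capitalized word in common", True, 17),
    ("> 50% character tri-gram overlap", True, 19),
    ("< 25% character tri-gram overlap", False, -34),
    ("In same sentence", True, 9),
    ("Within two sentences", True, 8),
    ("Further than 3 sentences apart", False, -1),
    ('"Hobbs Distance" < 3', True, 11),
    ("Number of entities in between two mentions = 0", False, 12),
    ("Number of entities in between two mentions > 4", False, -3),
    ("Font matches", True, 1),
    ("Default", True, -19),
]

# Traditional solution: sum the weights of the features that fire.
affinity = sum(weight for _, fires, weight in AFFINITY_FEATURES if fires)

# Three-mention MRF; the assignment of -30 and 11 to specific pairs is assumed.
P1 = ("Mr Powell", "Powell")
P2 = ("Powell", "she")
P3 = ("Mr Powell", "she")
WEIGHTS = {P1: 45, P2: -30, P3: 11}


def config_score(labels):
    """Score a joint Y/N labeling of the three edges (True = coreferent).

    Exactly two Y edges in a triangle violate transitivity, so that
    configuration gets -infinity. Otherwise a Y edge adds its weight and
    an N edge subtracts it.
    """
    if sum(labels.values()) == 2:
        return -math.inf
    return sum(w if labels[pair] else -w for pair, w in WEIGHTS.items())


score_nny = config_score({P1: False, P2: False, P3: True})
score_yny = config_score({P1: True, P2: False, P3: True})
score_ynn = config_score({P1: True, P2: False, P3: False})

# Exhaustive search over all 2^3 labelings (only feasible for tiny graphs).
best_score = max(
    config_score(dict(zip(WEIGHTS, ys)))
    for ys in itertools.product([True, False], repeat=len(WEIGHTS))
)
```

For more than a handful of mentions the exhaustive search is infeasible, which is why the slides turn to graph partitioning for inference.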

4. Joint segmentation and co-reference
Extraction from and matching of research paper citations.

Example citations:
  Laurel, B. Interface Agents: Metaphors with Character, in The Art of Human-Computer Interface Design, B. Laurel (ed), Addison-Wesley, 1990.
  Brenda Laurel. Interface Agents: Metaphors with Character, in Laurel, The Art of Human-Computer Interface Design, 355-366, 1990.

Variables in the model:
- o: observed citation text
- s: segmentation
- c: citation attributes
- y: co-reference decisions
- p: database field values
- World knowledge

35% reduction in co-reference error by using segmentation uncertainty.
6-14% reduction in segmentation error by using co-reference.
Inference: Variant of Iterated Conditional Modes [Besag, 1986]
[Wellner, McCallum, Peng, Hay, UAI 2004]; see also [Marthi, Milch, Russell, 2003]

Joint IE and Coreference from Research Paper Citations
Textual citation mentions (noisy, with duplicates) -> Paper database, with fields, clean, duplicates collapsed

AUTHORS             TITLE      VENUE
Cowell, Dawid       Probab     Springer
Montemerlo, Thrun   FastSLAM   AAAI
Kjaerulff           Approxi    Technic

Citation Segmentation and Coreference

  Laurel, B. Interface Agents: Metaphors with Character , in The Art of Human-Computer Interface Design , T. Smith (ed) , Addison-Wesley , 1990 .
  Brenda Laurel . Interface Agents: Metaphors with Character , in Smith , The Art of Human-Computr Interface Design , 355-366 , 1990 .

1) Segment citation fields
2) Resolve coreferent citations (Y / ? / N)
3) Form canonical database record, resolving conflicts:
   AUTHOR = Brenda Laurel
   TITLE = Interface Agents: Metaphors with Character
   PAGES = 355-366
   BOOKTITLE = The Art of Human-Computer Interface Design
   EDITOR = T. Smith
   PUBLISHER = Addison-Wesley
   YEAR = 1990

Perform 1), 2), and 3) jointly.
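Step 3 above (forming a canonical record from coreferent citations) can be illustrated with a deliberately simple merge rule. The rule used here, keeping the longest value seen for each field, is purely an assumption for illustration and is not the method of the cited work; the two input dictionaries are a hand segmentation of the two example citations, including the "Human-Computr" typo from the second one.

```python
def canonical_record(records):
    """Merge segmented, coreferent citations into one record.

    Illustrative rule only: for each field keep the longest value observed.
    """
    merged = {}
    for record in records:
        for field, value in record.items():
            if len(value) > len(merged.get(field, "")):
                merged[field] = value
    return merged


# Hand-segmented versions of the two example citations (hypothetical output
# of step 1; the second booktitle keeps the typo from the source text).
citation_1 = {
    "AUTHOR": "Laurel, B.",
    "TITLE": "Interface Agents: Metaphors with Character",
    "BOOKTITLE": "The Art of Human-Computer Interface Design",
    "EDITOR": "T. Smith",
    "PUBLISHER": "Addison-Wesley",
    "YEAR": "1990",
}
citation_2 = {
    "AUTHOR": "Brenda Laurel",
    "TITLE": "Interface Agents: Metaphors with Character",
    "EDITOR": "Smith",
    "BOOKTITLE": "The Art of Human-Computr Interface Design",
    "PAGES": "355-366",
    "YEAR": "1990",
}

record = canonical_record([citation_1, citation_2])
```

On this pair the rule happens to reproduce the canonical record shown on the slide, but it is easy to construct cases where the longest string is the noisy one, which is one reason to resolve conflicts jointly with segmentation and co-reference rather than as a post-processing step.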

IE + Coreference Model
The model is built up one layer at a time on three example citation mentions: "J Besag 1986 On the ...", "Smyth . 2001 Data Mining ...", and "Smyth , P Data mining ...".

- Observed citation x: e.g. J Besag 1986 On the ...
- CRF segmentation s: e.g. AUT AUT YR TITL TITL
- Citation mention attributes c: e.g. AUTHOR = J Besag, YEAR = 1986, TITLE = On the ...
- The structure (x, s, c) is repeated for each citation mention.
- Binary coreference variables y for each pair of mentions (y, n, n in the example).
- Research paper entity attribute nodes: e.g. AUTHOR = P Smyth, YEAR = 2001, TITLE = Data Mining ...

Such a highly connected graph makes exact inference intractable.

- Exact inference on these linear-chain regions.
- From each chain pass an N-best List into coreference.

IE + Coreference Model
- Approximate inference by graph partitioning, integrating out uncertainty in samples of extraction.
- Make scale to 1M citations with Canopies [McCallum, Nigam, Ungar 2000].

Inference: Sample = N-best List from CRF Segmentation

N-best segmentations of the first citation:
1. Name: Laurel, B. | Title: Interface Agents: Metaphors with Character | Book Title: The Art of Human Computer Interface Design | Year: 1990
2. Name: Laurel, B. | Title: Interface Agents: Metaphors with Character The Art of Human Computer | Book Title: Interface Design | Year: 1990
3. Name: Laurel, B. Interface | Title: Agents: Metaphors with Character | Book Title: The Art of Human Computer Interface Design

When calculating similarity with another citation, have more opportunity to find correct, matching fields.

N-best segmentations of the other citation:
1. Name: Laurel, B | Title: Interface Agents: Metaphors with Character The
2. Name: Laurel, B. | Title: Interface Agents: Metaphors with Character
3. Name: Laurel, B. Interface | Title: Agents Metaphors with Character
Year: 1990

Coreferent? y / ? / n
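The benefit of passing an N-best list into co-reference, rather than only the single best segmentation, can be sketched as follows. The segmentations are the three shown on the slide (restricted to Name, Title, and Book Title); the probabilities attached to them, their ranking, the exact-match similarity function, and the reference citation are all made up for illustration.

```python
import math


def field_similarity(a, b):
    """Fraction of fields (present in either record) whose values match exactly."""
    keys = set(a) | set(b)
    return sum(a.get(k) == b.get(k) for k in keys) / len(keys)


# Hypothetical N-best list: (probability, segmentation). The ranking is
# invented so that the top-ranked segmentation is a wrong one.
nbest = [
    (0.5, {"Name": "Laurel, B.",
           "Title": "Interface Agents: Metaphors with Character The Art of Human Computer",
           "Book Title": "Interface Design"}),
    (0.3, {"Name": "Laurel, B.",
           "Title": "Interface Agents: Metaphors with Character",
           "Book Title": "The Art of Human Computer Interface Design"}),
    (0.2, {"Name": "Laurel, B. Interface",
           "Title": "Agents: Metaphors with Character",
           "Book Title": "The Art of Human Computer Interface Design"}),
]

# Hypothetical, correctly segmented citation to compare against.
other = {"Name": "Laurel, B.",
         "Title": "Interface Agents: Metaphors with Character",
         "Book Title": "The Art of Human Computer Interface Design"}

top1_similarity = field_similarity(nbest[0][1], other)
best_similarity = max(field_similarity(seg, other) for _, seg in nbest)
# "Integrating out" segmentation uncertainty: probability-weighted average.
expected_similarity = sum(p * field_similarity(seg, other) for p, seg in nbest)
```

With only the top segmentation, one of three fields matches; with the whole list, a fully matching segmentation is available and the expected similarity rises accordingly.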

IE + Coreference Model
- Exact (exhaustive) inference over entity attributes.
- Revisit exact inference on IE linear chain, now conditioned on entity attributes.

Parameter Estimation
Separately for different regions:
- IE linear-chain: exact MAP
- Coref graph edge weights: MAP on individual edges
- Entity attribute potentials: MAP, pseudo-likelihood
In all cases: climb MAP gradient with quasi-Newton method.

Outline
- Examples of IE and Data Mining.
- Brief review of Conditional Random Fields
- Joint inference: Motivation and examples
  - Joint Labeling of Cascaded Sequences (Belief Propagation)
  - Joint Labeling of Distant Entities (BP by Tree Reparameterization)
  - Joint Co-reference Resolution (Graph Partitioning)
  - Joint Segmentation and Co-ref (Iterated Conditional Samples)

Piecewise Training
Efficiently Learning Large Probabilistic Relational Models
Charles Sutton and Andrew McCallum

Piecewise Training

Piecewise Training with NOTA

Experimental Results
Named entity tagging (CoNLL-2003)
Training set = 15k newswire sentences, 9 labels

Method   Test F1   Training time
CRF      89.87     9 hours
MEMM     88.90     1 hour
CRF-PT   90.50     5.3 hours

stat. sig. improvement at p = 0.001

Experimental Results 2
Part-of-speech tagging (Penn Treebank, small subset)
Training set = 1154 newswire sentences, 45 labels

Method   Test F1   Training time
CRF      88.1      14 hours
MEMM     88.1      2 hours
CRF-PT   88.8      2.5 hours

stat. sig. improvement at p = 0.001

Parameter Independence Diagrams
Graphical models = formalism for representing independence assumptions among variables.
Here we represent independence assumptions among parameters (in factor graph).

Piecewise Training Research Questions
- How to select the boundaries of pieces?
- What choices of limited interaction are best?
- How to sample sparse subsets of NOTA instances?
- Application to simpler models (classifiers)
- Application to more complex models (parsing)

Piecewise Training in Factorial CRFs for Transfer Learning [Sutton, McCallum, 2005]

Emailed seminar announcement entities / Email English words
60k words training.

GRAND CHALLENGES FOR MACHINE LEARNING
Jaime Carbonell
School of Computer Science
Carnegie Mellon University
3:30 pm
7500 Wean Hall
Machine learning has evolved from obscurity in the 1970s into a vibrant and popular discipline in artificial intelligence during the 1980s and 1990s. As a result of its success and growth, machine learning is evolving into a collection of related disciplines: inductive concept acquisition, analytic learning in problem solving (e.g. analogy, explanation-based learning), learning theory (e.g. PAC learning), genetic algorithms, connectionist learning, hybrid systems, and so on.

Too little labeled training data.

Train on related task with more data.
Newswire named entities / Newswire English words
200k words training.

CRICKET MILLNS SIGNS FOR BOLAND
CAPE TOWN 1996-08-22
South African provincial side Boland said on Thursday they had signed Leicestershire fast bowler David Millns on a one year contract. Millns, who toured Australia with England A in 1992, replaces former England allrounder Phillip DeFreitas as Boland's overseas professional.

At test time, label email with newswire NEs...
Newswire named entities / Email English words

...then use these labels as features for final task
Emailed seminar announcement entities / Newswire named entities / Email English words

Piecewise training of a joint model.
Seminar announcement entities / Newswire named entities / English words

CRF Transfer Experimental Results [Sutton, McCallum, 2005]

Seminar Announcements Dataset [Freitag 1998]

CRF                 stime   etime   location   speaker   overall
No transfer         99.1    97.3    81.0       73.7      87.8
Cascaded transfer   99.2    96.0    84.3       74.2      88.4
Joint transfer      99.1    96.0    85.3       76.3      89.2

New best published accuracy on common dataset

Social Network Analysis from Textual Message Data

Andrew McCallum, Andres Corrada, Xuerui Wang
Information Extraction and Synthesis Laboratory
Computer Science Department
University of Massachusetts Amherst
Intelligence Technology Innovation Center (ITIC)

Managing and Understanding Connections of People in our Email World
Workplace effectiveness ~ Ability to leverage network of acquaintances
But filling Contacts DB by hand is tedious, and incomplete.
[Figure labels: Contacts DB, Email Inbox, Automatically, WWW]

System Overview
[Figure labels: WWW, Email, CRF, Keyword Extraction, Person Name Extraction, Name Coreference, Homepage Retrieval, names, Contact Info and Person Name Extraction, Social Network Analysis]

An Example
To: Andrew McCallum [email protected]
Subject ...

Search for new people

First Name: Andrew
Middle Name: Kachites
Last Name: McCallum
JobTitle: Associate Professor
Company: University of Massachusetts
Street Address: 140 Governors Dr.
City: Amherst
State: MA
Zip: 01003
Company Phone: (413) 545-1323
Links: Fernando Pereira, Sam Roweis, ...
Key Words: Information extraction, social network, ...

Example keywords extracted

Person               Keywords
William Cohen        Logic programming, Text categorization, Data integration, Rule learning
Daphne Koller        Bayesian networks, Relational models, Probabilistic models, Hidden variables
Deborah McGuinness   Semantic web, Description logics, Knowledge representation, Ontologies
Tom Mitchell         Machine learning, Cognitive states, Learning apprentice, Artificial intelligence

Summary of Results
Contact info and name extraction performance (25 fields)

       Token Acc   Field Prec   Field Recall   Field F1
CRF    94.50       85.73        76.33          80.76

Expert Finding:
- When solving some task, find friends-of-friends with relevant expertise.
- Avoid stove-piping in large orgs by automatically suggesting collaborators.
- Given a task, automatically suggest the right team for the job. (Hiring aid!)

Social Network Analysis:
- Understand the social structure of your organization.
- Suggest structural changes for improved efficiency.

Clustering words into topics with Latent Dirichlet Allocation [Blei, Ng, Jordan 2003]

Example topics

induced from a large collection of text (one topic per column)

JOB             SCIENCE         BALL            FIELD           STORY           MIND            DISEASE         WATER
WORK            STUDY           GAME            MAGNETIC        STORIES         WORLD           BACTERIA        FISH
JOBS            SCIENTISTS      TEAM            MAGNET          TELL            DREAM           DISEASES        SEA
CAREER          SCIENTIFIC      FOOTBALL        WIRE            CHARACTER       DREAMS          GERMS           SWIM
EXPERIENCE      KNOWLEDGE       BASEBALL        NEEDLE          CHARACTERS      THOUGHT         FEVER           SWIMMING
EMPLOYMENT      WORK            PLAYERS         CURRENT         AUTHOR          IMAGINATION     CAUSE           POOL
OPPORTUNITIES   RESEARCH        PLAY            COIL            READ            MOMENT          CAUSED          LIKE
WORKING         CHEMISTRY       FIELD           POLES           TOLD            THOUGHTS        SPREAD          SHELL
TRAINING        TECHNOLOGY      PLAYER          IRON            SETTING         OWN             VIRUSES         SHARK
SKILLS          MANY            BASKETBALL      COMPASS         TALES           REAL            INFECTION       TANK
CAREERS         MATHEMATICS     COACH           LINES           PLOT            LIFE            VIRUS           SHELLS
POSITIONS       BIOLOGY         PLAYED          CORE            TELLING         IMAGINE         MICROORGANISMS  SHARKS
FIND            FIELD           PLAYING         ELECTRIC        SHORT           SENSE           PERSON          DIVING
POSITION        PHYSICS         HIT             DIRECTION       FICTION         CONSCIOUSNESS   INFECTIOUS      DOLPHINS
FIELD           LABORATORY      TENNIS          FORCE           ACTION          STRANGE         COMMON          SWAM
OCCUPATIONS     STUDIES         TEAMS           MAGNETS         TRUE            FEELING         CAUSING         LONG
REQUIRE         WORLD           GAMES           BE              EVENTS          WHOLE           SMALLPOX        SEAL
OPPORTUNITY     SCIENTIST       SPORTS          MAGNETISM       TELLS           BEING           BODY            DIVE
EARN            STUDYING        BAT             POLE            TALE            MIGHT           INFECTIONS      DOLPHIN
ABLE            SCIENCES        TERRY           INDUCED         NOVEL           HOPE            CERTAIN         UNDERWATER

[Tenenbaum et al]

From LDA to Author-Recipient-Topic (ART)
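A minimal sketch of how ART extends LDA's generative story, under this reading of the model: each word in a message first picks one of the message's recipients uniformly, then draws a topic from a distribution specific to that (author, recipient) pair, then draws the word from that topic. Treat this as an assumption, since only the slide title survives here. The vocabulary, people, topic count, and Dirichlet parameters are all hypothetical.

```python
import random

def dirichlet(rng, alpha, size):
    """Symmetric Dirichlet draw built from Gamma variates."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(size)]
    total = sum(draws)
    return [d / total for d in draws]

def generate_message(rng, author, recipients, length, theta, phi, vocab):
    """Generate (recipient, topic, word) triples for one message."""
    tokens = []
    for _ in range(length):
        recipient = rng.choice(recipients)  # uniform over this message's recipients
        topic = rng.choices(range(len(phi)),
                            weights=theta[(author, recipient)])[0]
        word = rng.choices(vocab, weights=phi[topic])[0]
        tokens.append((recipient, topic, word))
    return tokens

rng = random.Random(0)
VOCAB = ["contract", "draft", "pipeline", "meeting", "dinner", "weekend"]
PEOPLE = ["alice", "bob", "carol"]  # hypothetical senders/receivers
NUM_TOPICS = 2

# One word distribution per topic, one topic distribution per (author, recipient) pair.
phi = [dirichlet(rng, 1.0, len(VOCAB)) for _ in range(NUM_TOPICS)]
theta = {(a, r): dirichlet(rng, 1.0, NUM_TOPICS)
         for a in PEOPLE for r in PEOPLE if a != r}

message = generate_message(rng, "alice", ["bob", "carol"], 8, theta, phi, VOCAB)
for recipient, topic, word in message:
    print(recipient, topic, word)
```

In plain LDA the topic distribution would be indexed by document; here it is indexed by the author-recipient pair, which is what lets topics depend on who is writing to whom.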

Inference and Estimation
Gibbs Sampling:
- Easy to implement
- Reasonably fast

Enron Email Corpus
250k email messages
23k people

Date: Wed, 11 Apr 2001 06:56:00 -0700 (PDT)
From: [email protected]
To: [email protected]
Subject: Enron/TransAltaContract dated Jan 1, 2001

Please see below. Katalin Kiss of TransAlta has requested an electronic copy of our final draft? Are you OK with this? If so, the only version I have is the original draft without revisions.

DP

Debra Perlingiere
Enron North America Corp.
Legal Department
1400 Smith Street, EB 3885
Houston, Texas 77002
[email protected]
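To make the "easy to implement" claim about Gibbs sampling concrete, here is a minimal collapsed Gibbs sampler for plain LDA rather than ART, chosen to keep the sketch short. For a corpus like the messages above, each document would be one email's word ids. The tiny corpus, vocabulary, hyperparameters `alpha` and `beta`, and sweep count are hypothetical.

```python
import random

def lda_gibbs(docs, num_topics, vocab_size, alpha=0.5, beta=0.1, sweeps=100, seed=0):
    """Collapsed Gibbs sampling for LDA; docs are lists of integer word ids."""
    rng = random.Random(seed)
    doc_topic = [[0] * num_topics for _ in docs]
    topic_word = [[0] * vocab_size for _ in range(num_topics)]
    topic_total = [0] * num_topics
    # Random initial topic for every token, then build the count tables.
    assign = [[rng.randrange(num_topics) for _ in doc] for doc in docs]
    for d, doc in enumerate(docs):
        for w, z in zip(doc, assign[d]):
            doc_topic[d][z] += 1
            topic_word[z][w] += 1
            topic_total[z] += 1
    for _ in range(sweeps):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                z = assign[d][i]
                # Remove the token's current assignment from the counts.
                doc_topic[d][z] -= 1
                topic_word[z][w] -= 1
                topic_total[z] -= 1
                # Unnormalised conditional of each topic given all other assignments.
                weights = [(doc_topic[d][k] + alpha)
                           * (topic_word[k][w] + beta)
                           / (topic_total[k] + vocab_size * beta)
                           for k in range(num_topics)]
                z = rng.choices(range(num_topics), weights=weights)[0]
                assign[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][w] += 1
                topic_total[z] += 1
    return assign, doc_topic, topic_word

VOCAB = ["contract", "draft", "legal", "dinner", "weekend", "kids"]
DOCS = [[0, 1, 2, 0, 1], [3, 4, 5, 3, 4], [0, 2, 1, 2], [4, 5, 3, 5]]
assign, doc_topic, topic_word = lda_gibbs(DOCS, num_topics=2, vocab_size=len(VOCAB))
for k, counts in enumerate(topic_word):
    top = sorted(range(len(VOCAB)), key=lambda w: -counts[w])[:3]
    print("topic", k, [VOCAB[w] for w in top])
```

An ART sampler would follow the same remove, resample, restore pattern, but would resample a recipient together with the topic and keep counts per author-recipient pair instead of per document.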

Topics, and prominent sender/receivers discovered by ART
Beck = Chief Operations Officer
Dasovich = Government Relations Executive
Shapiro = Vice President of Regulatory Affairs
Steffes = Vice President of Government Affairs

Comparing Role Discovery
connection strength (A,B) =
- Traditional SNA: distribution over recipients
- ART: distribution over authored topics
- Author-Topic: distribution over authored topics

Comparing Role Discovery: Tracy Geaconne, Dan McCarty
- Traditional SNA: Similar roles
- ART: Different roles
- Author-Topic: Different roles
Geaconne = Secretary
McCarty = Vice President

Comparing Role Discovery: Tracy Geaconne, Rod Hayslett
- Traditional SNA: Different roles
- ART: Not very similar
- Author-Topic: Very similar
Geaconne = Secretary
Hayslett = Vice President & CTO

Comparing Role Discovery: Lynn Blair, Kimberly Watson
- Traditional SNA: Different roles
- ART: Very similar
- Author-Topic: Very different
Blair = Gas pipeline logistics
Watson = Pipeline facilities planning

McCallum Email Corpus 2004
January - October 2004
23k email messages
825 people

From: [email protected]
Subject: NIPS and ....
Date: June 14, 2004 2:27:41 PM EDT
To: [email protected]

There is pertinent stuff on the first yellow folder that is completed either travel or other things, so please sign that first folder anyway. Then, here is the reminder of the things I'm still waiting for: NIPS registration receipt. CALO registration receipt.

Thanks,
Kate

McCallum Email Blockstructure

Four most prominent topics in discussions with ____?
Two most prominent topics in discussions with ____?

Words: love, house, time, great, hope, dinner, saturday, left, ll, visit, evening, stay, bring, weekend, road, sunday, kids, flight
Prob: 0.030514, 0.015402, 0.013659, 0.012351, 0.011334, 0.011043, 0.00959, 0.009154, 0.009154, 0.009009, 0.008282, 0.008137, 0.008137, 0.007847, 0.007701, 0.007411, 0.00712, 0.006829, 0.006539, 0.006539

Words       Prob
today       0.051152
tomorrow    0.045393
time        0.041289
ll          0.039145
meeting     0.033877
week        0.025484
talk        0.024626
meet        0.023279
morning     0.022789
monday      0.020767
back        0.019358
call        0.016418
free        0.015621
home        0.013967
won         0.013783
day         0.01311
hope        0.012987
leave       0.012987
office      0.012742
tuesday     0.012558

Role-Author-Recipient-Topic Models

Information Extraction and Mining the Research Literature
Information Extraction and Synthesis Laboratory
Computer Science Department
University of Massachusetts Amherst
Intelligence Technology Innovation Center (ITIC)

Previous Systems
[Figure labels: Cites, Research Paper]

More Entities and Relations
[Figure labels: Expertise, Cites, Research Paper, Grant, Venue, Person, University, Groups]

Summary
Conditional Random Fields
- Conditional probability models of structured data
Data mining complex unstructured text suggests the need for joint inference IE + DM.
Early examples
- Factorial finite state models
- Jointly labeling distant entities
- Coreference analysis
- Segmentation uncertainty aiding coreference
Piecewise Training
- Faster + higher accuracy
Current projects
- Email, contact management, expert-finding, SNA
- Mining the scientific literature

End of Talk
