Bayesian Modeling of Lexical Resources for Low-Resource Settings
Nicholas Andrews, with Mark Dredze, Benjamin Van Durme, and Jason Eisner

This Talk: Sequence Labeling
Corpus: ... known as [Llanfairpwllgwyngyll] (a place name)
$y_t$ = Person? Location? Other?

This Talk: Sequence Labeling with Gazetteers
Gazetteer: Llanfairpwllgwyngyll, Jacksonville, New York, Allentown
Corpus: ... known as [Llanfairpwllgwyngyll]
$y_t$ = Person? Location? Other?
This Talk in One Slide
Gazetteer: Llanfairpwllgwyngyll, Jacksonville
Don't condition on the gazetteer: $P(y \mid \text{gazetteer}, x)$
Do generate the gazetteer: $P(\text{gazetteer}, x, y)$

Warning: Pick a good generative model
It's easy to have rich discriminative models.
It's a little harder for generative models, but possible:
  High-resource case: LSTM language models
  Low-resource case (this paper): hierarchical Bayesian LMs

PART I: GAZETTEER FEATURES

Discriminative Named-Entity Recognition
Corpus: he went to [Jacksonville]
What type is Jacksonville? Person? Location?
Model: $P(\text{labels} \mid \text{words})$
Context: Location!
Spelling: Location!
Result: $y_t$ = loc
What if context and spelling aren't enough?
Corpus: ... known as [Llanfairpwllgwyngyll]
$y_t$ = Person? Location?
Context: ?
Spelling: ?
Hmm, what if we had some sort of list of names we knew were locations?
Solution: use Gazetteers
Gazetteer: Llanfairpwllgwyngyll, Jacksonville, New York, Allentown
Gazetteer: Albert Einstein, Albert Eyntey n, Alberts Eintei ns

Gazetteer Features
Gazetteer: Llanfairpwllgwyngyll, Jacksonville, New York, Allentown
Corpus: ... known as [Llanfairpwllgwyngyll]
$$\text{GazFeature}(\text{str}) := \begin{cases} 1 & \text{if } \text{str} \in \text{GAZ} \\ 0 & \text{otherwise} \end{cases}$$
Context: ?
Spelling: ?
Gazetteer: Location!
Result: $y_t$ = loc
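The binary gazetteer feature defined above can be sketched in a few lines of Python. This is only an illustration of the definition on the slide: the function name `gaz_feature` and the variable `GAZ` are assumptions, and a real CRF feature extractor would combine this with context and spelling features.

```python
def gaz_feature(span, gazetteer):
    """Return 1 if the exact string is a gazetteer entry, else 0."""
    return 1 if span in gazetteer else 0


# Gazetteer entries taken from the slide.
GAZ = {"Llanfairpwllgwyngyll", "Jacksonville", "New York", "Allentown"}

for name in ["Llanfairpwllgwyngyll", "Townville"]:
    print(name, gaz_feature(name, GAZ))
```

Note that the feature fires only on exact string matches, so it carries no information about names that are merely spelled like gazetteer entries.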
PART II: THE TROUBLE WITH GAZETTEER FEATURES

What goes wrong with gazetteer features
  1. Overfitting: the gazetteer inhibits learning of spelling + context features from the annotated corpus
  2. Discriminative training doesn't learn spelling information from the gazetteer
The larger the gazetteer, the more we overfit
Gazetteer (grows one entry per step; the newly added entry and its matching training mention are highlighted): Llanfairpwllgwyngyll, Jacksonville, New York, Allentown
Training corpus:
  1. known as [Llanfairpwllgwyngyll]
  2. he went to [Jacksonville]
  3. She is from [New York]
  4. a statement from [Allentown]
Corpus: [...] a statement from [Clinton]
Evidence: context, gazetteer, spelling

The larger the gazetteer, the more we overfit
Gazetteer: Llanfairpwllgwyngyll, Jacksonville, New York, Allentown
Train corpus: [...] a statement from [Clinton]
Test corpus: a statement by [Townville]
Townville not in gazetteer. Location! Person!
Evidence: context, gazetteer, spelling

What goes wrong with gazetteer features
Aren't more observations supposed to help? (Bayes)
The Problem: So far, we treat the gazetteer as features, not observations

Gazetteer Features Ignore Information
Gazetteer: Llanfairpwllgwyngyll, Jacksonville, New York, Allentown
Corpus: [...] a statement from [Clinton]
Test corpus: A statement by [Townville]
Can we learn spelling from the gazetteer?

Prior Work
We are not the first to notice some of these issues
CRF-specific remedies have been proposed:
  Weight undertraining (Sutton et al., 2006)
  Logarithmic opinion pools (Smith et al., 2005)
Our Solution: Model the corpus and the gazetteer jointly

PART III: GENERATE THE GAZETTEER
Explorer's Gazetteer: Jacksonville, Allentown, Greenville, Georgetown
Explorer's Diary:
  1. known as Centertown
  2. he went to Townville
  3. She is from Georgeville
  4. a statement from Allentown

Explorer names new places
Explorer's Gazetteer: Jacksonville, Allentown, Greenville, Georgetown
$P_{\text{spelling}}(\text{name} \mid y_t = \text{loc})$

Explorer writes about places
$P_{\text{context}}(y_t = \text{loc} \mid \text{context}) \cdot P_{\text{spelling}}(\text{name} \mid y_t = \text{loc})$
NOTE: the SAME spelling model generates both types and tokens

(Conditional model) Condition on x
(Proposed model) Model x: gazetteer + corpus

Context model (graphical model over $y_{t-2}$, $y_{t-1}$, $y_t$)
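One way to write out the joint model these slides describe is sketched below. It is a reading of the slides, not a formula taken from them: it assumes that the label context is the two previous labels (as the $y_{t-2}$, $y_{t-1}$, $y_t$ diagram suggests), that each gazetteer entry $g$ comes with a known type $\ell(g)$ (a hypothetical symbol), and it leaves out any caching of previously generated names.

```latex
P(\text{gazetteer}, x, y)
  \;=\; \underbrace{\prod_{g \,\in\, \text{gazetteer}} P_{\text{spelling}}\bigl(g \mid \ell(g)\bigr)}_{\text{types (gazetteer entries)}}
  \;\times\;
  \underbrace{\prod_{t} P_{\text{context}}\bigl(y_t \mid y_{t-2}, y_{t-1}\bigr)\,
              P_{\text{spelling}}\bigl(x_t \mid y_t\bigr)}_{\text{tokens (corpus mentions)}}
```

The same $P_{\text{spelling}}$ appears in both products, which is how evidence from the gazetteer can inform the labeling of corpus tokens.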
Spelling model (graphical model over $y_{t-2}$, $y_{t-1}$, $y_t$, $x_t$)

We can now generalize from the gazetteer!
Gazetteer: Llanfairpwllgwyngyll, Jacksonville, New York, Allentown
Test corpus: A statement by [Townville]
Gazetteer feature: Townville not in gazetteer.
VERSUS
Spelling model: Location! via $P_{\text{spelling}}(\text{T, o, w, n, v, i, l, l, e} \mid y_t = \text{location})$
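To make the generalization argument concrete, here is a minimal sketch of a per-type spelling model trained only on gazetteer entries. It is not the paper's model: it swaps in an add-one smoothed character bigram model as a stand-in for $P_{\text{spelling}}$, and the class name `CharBigramSpeller` and the person-name list are hypothetical. The location list is the explorer's gazetteer from the slides.

```python
import math
import string

BOS, EOS = "^", "$"  # start-of-word and end-of-word markers
ALPHABET = string.ascii_lowercase + " "


class CharBigramSpeller:
    """Add-one smoothed character bigram model, a stand-in for P_spelling(name | y)."""

    def __init__(self, names):
        self.vocab_size = len(ALPHABET) + 1  # every letter, space, and EOS
        self.counts = {}  # counts[prev][cur] = number of times cur followed prev
        for name in names:
            chars = [BOS] + list(name.lower()) + [EOS]
            for prev, cur in zip(chars, chars[1:]):
                row = self.counts.setdefault(prev, {})
                row[cur] = row.get(cur, 0) + 1

    def log_prob(self, name):
        """Log-probability of spelling `name` character by character."""
        chars = [BOS] + list(name.lower()) + [EOS]
        total = 0.0
        for prev, cur in zip(chars, chars[1:]):
            row = self.counts.get(prev, {})
            numerator = row.get(cur, 0) + 1
            denominator = sum(row.values()) + self.vocab_size
            total += math.log(numerator / denominator)
        return total


# Location gazetteer from the slides; the person list is a made-up example.
loc_model = CharBigramSpeller(["Jacksonville", "Allentown", "Greenville", "Georgetown"])
per_model = CharBigramSpeller(["Clinton", "Einstein", "Andrews", "Eisner"])

name = "Townville"  # not in either list
print("log P(name | loc):", loc_model.log_prob(name))
print("log P(name | per):", per_model.log_prob(name))
```

Because "Townville" shares character sequences such as "town" and "ville" with the location entries, the location model assigns it a higher score even though the exact string never appears in the gazetteer.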
What about Llanfairpwllgwyngyllgogeryc drobwll?
Problem: $P_{\text{spelling}}(\text{L, l, a, n, f, a, i, r, p, w, l, l}, \ldots \mid y = \text{loc})$ is tiny
But gazetteer features handled this case! Gazetteer features recognize specific strings via:
$$\text{GazFeature}(\text{str}) := \begin{cases} 1 & \text{if } \text{str} \in \text{GAZ} \\ 0 & \text{otherwise} \end{cases}$$
Even a weirdly spelled name is a location, if it's in the gazetteer!
Can we account for this in a generative model?

Solution: Stochastic Memoization
Gazetteer: Llanfairpwllgwyngyll, Jacksonville, New York, Allentown
With probability $\alpha$: sample an existing word in the gazetteer, e.g. Llanfairpwllgwyngyll
With probability $1 - \alpha$: spell a new word character-by-character, e.g. Townville
$$\alpha \, P_{\text{cache}}(\text{word}) + (1 - \alpha) \, P_{\text{spelling}}(x = \text{w, o, r, d} \mid y = \text{label})$$

Summary & Trade-offs
Condition on the Gazetteer | Generate the Gazetteer
Fewer independence assumptions | More independence assumptions
Gazetteer features: may overfit | Gazetteer is data: no overfitting
Does not model the gazetteer; needs annotated data to learn spelling | Learns spelling from gazetteers; no need for supervised data

PART IV: EXPERIMENTS
LOW-RESOURCE NAMED-ENTITY RECOGNITION + PART-OF-SPEECH INDUCTION

Experiment 1: Low-Resource NER
Language: Turkish
Baseline: CRF with gazetteer features
We vary:
  Supervision: 1 to 500 sentences
  Gazetteer size: 10, 100, 1000
  For each type: person, location, organization, other
[Figure: F1 of model minus F1 of baseline, by number of labeled sentences for training]
Experiment 2: Part-of-Speech Induction
Use Wiktionary entries as a gazetteer
  (Incomplete) dictionary: words and their parts-of-speech
Baseline: HMM trained with EM (Li et al., 2012)
  Dictionary as constraints on possible parts-of-speech for each word type
Data: CoNLL-X and CoNLL 2007 languages

Concluding Remarks
Key ideas / take-aways
Discriminative training has intrinsic limitations when incorporating gazetteers or other lexical knowledge
Solution: use a generative model and treat gazetteer entries as ordinary observations
Pick your favorite rich generative model:
  Low-resource (this paper): Bayesian backoff via Pitman-Yor processes
  High-resource: LSTM language model + LSTM spelling model
Experiments with more languages in the paper
Code: https://github.com/noa/bayesner
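The cache-or-spell mixture from the stochastic memoization slides can be sketched as follows. This is a simplified, fixed-weight mixture and not the Pitman-Yor model used in the paper: the mixing weight `alpha`, the uniform cache over gazetteer entries, and the uniform character-level spelling model are all assumptions chosen to keep the example short.

```python
import random
import string

ALPHABET = string.ascii_lowercase + " "
V = len(ALPHABET) + 1  # one extra outcome for the end-of-word symbol


def p_spelling(word):
    """Toy spelling model: each character and the end-of-word symbol is uniform."""
    return (1.0 / V) ** (len(word) + 1)


def p_cache(word, gazetteer):
    """Uniform distribution over memoized gazetteer entries (an assumption)."""
    return 1.0 / len(gazetteer) if word in gazetteer else 0.0


def p_word(word, gazetteer, alpha):
    """alpha * P_cache(word) + (1 - alpha) * P_spelling(word)."""
    return alpha * p_cache(word, gazetteer) + (1.0 - alpha) * p_spelling(word)


def sample_word(gazetteer, alpha, rng):
    """With probability alpha reuse a gazetteer entry, otherwise spell a new word."""
    if rng.random() < alpha:
        return rng.choice(sorted(gazetteer))
    chars = []
    while True:
        i = rng.randrange(V)
        if i == len(ALPHABET):  # drew the end-of-word symbol
            return "".join(chars)
        chars.append(ALPHABET[i])


GAZ = {"llanfairpwllgwyngyll", "jacksonville", "new york", "allentown"}
ALPHA = 0.5  # hypothetical mixing weight

print(p_word("llanfairpwllgwyngyll", GAZ, ALPHA))  # dominated by the cache term
print(p_word("townville", GAZ, ALPHA))             # spelling term only
print(sample_word(GAZ, ALPHA, random.Random(0)))
```

The point of the mixture is visible in the first two printed values: a long, oddly spelled name that is in the gazetteer keeps substantial probability through the cache term, while an unseen name is scored by its spelling alone.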
Generate your Gazetteer!
Explorer's Gazetteer: Llanfairpwllgwyngyll, Allentown, Greenville, Georgetown
Explorer's Diary:
  1. known as Llanfairpwllgwyngyll
  2. he went to Townville
  3. She is from Georgeville
  4. a statement from Allentown