Eric Hardisty, Jordan Boyd-Graber, and Philip Resnik. Modeling Perspective using Adaptor Grammars. Empirical Methods in Natural Language Processing, Cambridge, MA, 2010.

Modeling Perspective using Adaptor Grammars

Eric A. Hardisty
Department of Computer Science and UMIACS
University of Maryland
College Park, MD

Jordan Boyd-Graber
UMD iSchool and UMIACS
University of Maryland
College Park, MD

Philip Resnik
Department of Linguistics and UMIACS
University of Maryland
College Park, MD

Abstract

Strong indications of perspective can often come from collocations of arbitrary length; for example, someone writing "get the government out of my X" is typically expressing a conservative rather than progressive viewpoint. However, going beyond unigram or bigram features in perspective classification gives rise to problems of data sparsity. We address this problem using nonparametric Bayesian modeling, specifically adaptor grammars (Johnson et al., 2006). We demonstrate that an adaptive naïve Bayes model captures multiword lexical usages associated with perspective, and establishes a new state-of-the-art for perspective classification results using the Bitter Lemons corpus, a collection of essays about mid-east issues from Israeli and Palestinian points of view.

1 Introduction

Most work on the computational analysis of sentiment and perspective relies on lexical features. This makes sense, since an author's choice of words is often used to express overt opinions (e.g. describing healthcare reform as "idiotic" or "wonderful") or to frame a discussion in order to convey a perspective more implicitly (e.g. using the term "death tax" instead of "estate tax"). Moreover, it is easy and efficient to represent texts as collections of the words they contain, in order to apply a well known arsenal of supervised techniques (Laver et al., 2003; Mullen and Malouf, 2006; Yu et al., 2008).

At the same time, standard lexical features have their limitations for this kind of analysis. Such features are usually created by selecting some small n-gram size in advance. Indeed, it is not uncommon to see the feature space for sentiment analysis limited to unigrams.
However, important indicators of perspective can also be longer ("get the government out of my"). Trying to capture these using standard machine learning approaches creates a problem, since allowing n-grams as features for larger n gives rise to problems of data sparsity.

In this paper, we employ nonparametric Bayesian models (Orbanz and Teh, 2010) in order to address this limitation. In contrast to parametric models, for which a fixed number of parameters are specified in advance, nonparametric models can "grow" to the size best suited to the observed data. In text analysis, models of this type have been employed primarily for unsupervised discovery of latent structure: for example, in topic modeling, when the true number of topics is not known (Teh et al., 2006); in grammatical inference, when the appropriate number of nonterminal symbols is not known (Liang et al., 2007); and in coreference resolution, when the number of entities in a given document is not specified in advance (Haghighi and Klein, 2007). Here we use them for supervised text classification.

Specifically, we use adaptor grammars (Johnson et al., 2006), a formalism for nonparametric Bayesian modeling that has recently proven useful in unsupervised modeling of phonemes (Johnson, 2008), grammar induction (Cohen et al., 2010), and named entity structure learning (Johnson, 2010), to make supervised naïve Bayes classification nonparametric in order to improve perspective modeling. Intuitively, naïve Bayes associates each class or label with a probability distribution over a fixed vocabulary. We introduce adaptive naïve Bayes (ANB), for which in principle the vocabulary can grow as needed to include collocations of arbitrary length, as determined by the properties of the dataset. We show that using adaptive naïve Bayes improves on state of the art classification using the Bitter Lemons corpus (Lin et al., 2006), a document collection that has been used by a variety of authors to evaluate perspective classification.

In Section 2, we review adaptor grammars, show how naïve Bayes can be expressed within the formalism, and describe how (and how easily) an adaptive naïve Bayes model can be created. Section 3 validates the approach via experimentation on the Bitter Lemons corpus. In Section 4, we summarize the contributions of the paper and discuss directions for future work.

2 Adapting Naïve Bayes to be Less Naïve

In this work we apply the adaptor grammar formalism introduced by Johnson, Griffiths, and Goldwater (Johnson et al., 2006). Adaptor grammars are a generalization of probabilistic context free grammars (PCFGs) that make it particularly easy to express nonparametric Bayesian models of language simply and readably using context free rules. Moreover, Johnson et al. provide an inference procedure based on Markov Chain Monte Carlo techniques that makes parameter estimation straightforward for all models that can be expressed using adaptor grammars.[1] Variational inference for adaptor grammars has also been recently introduced (Cohen et al., 2010).

Briefly, adaptor grammars allow nonterminals to be rewritten to entire subtrees. In contrast, a nonterminal in a PCFG rewrites only to a collection of grammar symbols; their subsequent productions are independent of each other. For instance, a traditional PCFG might learn probabilities for the rewrite rule PP → P NP. In contrast, an adaptor grammar can learn (or "cache") the production PP → (P up)(NP (DET a)(N tree)). It does this by positing that the distribution over children for an adapted non-terminal comes from a Pitman-Yor distribution.

A Pitman-Yor distribution (Pitman and Yor, 1997) is a distribution over distributions.
It has three parameters: the discount, $a$, such that $0 \le a < 1$; the strength, $b$, a real number such that $-a < b$; and a probability distribution $G_0$ known as the base distribution. Adaptor grammars allow distributions over subtrees to come from a Pitman-Yor distribution with the PCFG's original distribution over trees as the base distribution. The generative process for obtaining draws from a distribution drawn from a Pitman-Yor distribution can be described by the "Chinese restaurant process" (CRP). We will use the CRP to describe how to obtain a distribution over observations composed of sequences of n-grams, the key to our model's ability to capture perspective-bearing n-grams.

[1] And, better still, they provide code (mj/Software.htm).

Suppose that we have a base distribution $\Omega$ that is some distribution over all sequences of words (the exact structure of such a distribution is unimportant; such a distribution will be defined later in Table 1). Suppose further we have a distribution $\phi$ drawn from $PY(a, b, \Omega)$, and we wish to draw a series of observations $w$ from $\phi$. The CRP gives us a generative process for doing those draws from $\phi$, marginalizing out $\phi$. Following the restaurant metaphor, we imagine the $i$th customer in the series entering the restaurant to take a seat at a table. The customer sits by making a choice that determines the value of the n-gram $w_i$ for that customer: she can either sit at an existing table or start a new table of her own.[2]

If she sits at a new table $j$, that table is assigned a draw $y_j$ from the base distribution, $\Omega$; note that, since $\Omega$ is a distribution over n-grams, $y_j$ is an n-gram. The value of $w_i$ is therefore assigned to be $y_j$, and $y_j$ becomes the sequence of words assigned to that new table.
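
As a rough illustration of the restaurant metaphor, the following Python sketch seats customers one at a time. It assumes the standard Pitman-Yor CRP seating rule: an existing table $j$ with $c_j$ patrons is chosen with weight $c_j - a$, and a new table with weight $b + t a$, where $t$ is the number of occupied tables. The function names and the toy base distribution (a geometric-length sequence over a three-word vocabulary) are hypothetical stand-ins for $\Omega$, not the model used in the paper.

```python
import random

def sample_base(rng, vocab=("health", "care", "reform"), stop_prob=0.5):
    """Toy stand-in for the base distribution Omega (an assumption).

    Returns a word sequence of geometric length (at least one word),
    with each word drawn uniformly from vocab.
    """
    words = [rng.choice(vocab)]
    while rng.random() > stop_prob:
        words.append(rng.choice(vocab))
    return tuple(words)

def pitman_yor_crp(n_customers, a, b, rng):
    """Draw n_customers observations w_i via the Pitman-Yor CRP."""
    if not (0 <= a < 1 and b > -a):
        raise ValueError("need 0 <= a < 1 and b > -a")
    counts = []  # counts[j] = c_j, the number of patrons at table j
    labels = []  # labels[j] = y_j, the n-gram assigned to table j
    draws = []   # draws[i] = w_i, the n-gram observed for customer i
    for _ in range(n_customers):
        t = len(counts)  # number of tables presently occupied
        if t == 0:
            j = 0  # the first customer always opens a new table
        else:
            # Unnormalised weights: c_j - a for each existing table,
            # b + t * a for a new one (they sum to c. + b).
            weights = [c - a for c in counts] + [b + t * a]
            j = rng.choices(range(t + 1), weights=weights)[0]
        if j == t:
            # New table: its n-gram is a fresh draw from the base.
            counts.append(1)
            labels.append(sample_base(rng))
        else:
            # Existing table: reuse the n-gram already assigned to it.
            counts[j] += 1
        draws.append(labels[j])
    return draws, counts, labels
```

Calling `pitman_yor_crp(200, a=0.5, b=1.0, rng=random.Random(0))` returns the drawn n-grams together with the table counts. Tables that already hold many patrons tend to attract more, which is the mechanism by which frequently reused word sequences come to be cached.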
If she instead sits at an existing table, then $w_i$ simply takes the sequence of words already associated with that table (assigned as above when it was first occupied).

The probability of joining an existing table $j$, with $c_j$ patrons already seated at table $j$, is $\frac{c_j - a}{c_\cdot + b}$, where $c_\cdot$ is the number of patrons seated at all tables: $c_\cdot = \sum_{j'} c_{j'}$. The probability of starting a new table is $\frac{b + t a}{c_\cdot + b}$, where $t$ is the number of tables presently occupied.

Notice that $\phi$ is a distribution over the same space as $\Omega$, but it can drastically shift the mass of the distribution, compared with $\Omega$, as more and more patrons are seated at tables. However, there is always a chance of drawing from the base distribution, and therefore every word sequence can also always be drawn from $\phi$.

[2] Note that we are abusing notation by allowing $w_i$ to correspond to a word sequence of length $\ge 1$ rather than a single word.

In the next section we will write a naïve Bayes-like generative process using PCFGs. We will then use the PCFG distribution as the base distribution for a Pitman-Yor distribution, adapting the naïve Bayes process to give us a distribution over n-grams, thus learning new language substructures that are useful for modeling the differences in perspective.

2.1 Classification Models as PCFGs

Naïve Bayes is a venerable and popular mechanism for text classification (Lewis, 1998). It posits that there are $K$ distinct categories of text, each with a distinct distribution over words, and that every document, represented as an exchangeable bag of words, is drawn from one (and only one) of these distributions. Learning the per-category word distributions and global prevalence of the classes is a problem of posterior inference which can be approached using a variety of inference techniques (Lowd and Domingos, 2005).

More formally, naïve Bayes models can be expressed via the following generative process:[3]

1. Draw a global distribution over classes $\theta \sim \mathrm{Dir}(\alpha)$
2. For each class $i \in \{1, \dots, K\}$, draw a word distribution $\phi_i \sim \mathrm{Dir}(\lambda)$
3. For each document $d \in \{1, \dots, M\}$:
   (a) Draw a class assignment $z_d \sim \mathrm{Mult}(\theta)$
   (b) For each word position $n \in \{1, \dots, N_d\}$, draw $w_{d,n} \sim \mathrm{Mult}(\phi_{z_d})$

[3] Here $\alpha$ and $\lambda$ are hyperparameters used to specify priors for the class distribution and classes' word distributions, respectively; $\alpha$ is a symmetric $K$-dimensional vector where each element is $\pi$. $N_d$ is the length of document $d$. Resnik and Hardisty (2010) provide a tutorial introduction to the naïve Bayes generative process and underlying concepts.

A variant of the naïve Bayes generative process can be expressed using the adaptor grammar formalism (Table 1). The left hand side of each rule represents a nonterminal which can be expanded, and the right hand side represents the rewrite rule. The rightmost indices show replication; for instance, there are $|V|$ rules that allow WORD_i to rewrite to each word in the vocabulary. One can assume a symmetric Dirichlet prior of $\mathrm{Dir}(\bar{1})$ over the production choices unless otherwise specified, as with the DOC_d production rule above, where a sparse prior is used.

    SENT            →  DOC_d               d = 1, ..., m
    DOC_d (0.001)   →  ID_d WORDS_i        d = 1, ..., m; i ∈ {1, K}
    WORDS_i         →  WORDS_i WORD_i      i ∈ {1, K}
    WORDS_i         →  WORD_i              i ∈ {1, K}
    WORD_i          →  v                   v ∈ V; i ∈ {1, K}

Table 1: A naïve Bayes-inspired model expressed as a PCFG.

Notice that the distribution over expansions for WORD_i corresponds directly to $\phi_i$ in Figure 1(a). There are, however, some differences between the model that we have described above and the standard naïve Bayes model depicted in Figure 1(a). In particular, there is no longer a single choice of class per document; each sentence is assigned a class. If the distribution over per-sentence labels is sparse (as it is above for DOC_d), this will closely approximate naïve Bayes, since it will be very unlikely for the sentences in a document to have different labels. A non-sparse prior leads to behavior more like models that allow parts of a document to express sentiment or perspective differently.

2.2 Moving Beyond the Bag of Words

The naïve Bayes generative distribution posits that when writing a document, the author selects a category $z_d$ for the document from $\theta$. The author then generates words one at a time: each word is selected independently from a flat multinomial distribution $\phi_{z_d}$ over the vocabulary.

However, this is a very limited picture of how text is related to underlying perspectives. Clearly words are often connected with each other as collocations, and, just as clearly, extending a flat vocabulary to include bigram collocations does not suffice, since sometimes relevant perspective-bearing phrases are longer than two words.
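
To make the bag-of-words assumption concrete, here is a minimal Python sketch of the naïve Bayes generative process from Section 2.1. For brevity it takes the class distribution $\theta$ and the per-class word distributions $\phi_i$ as given rather than drawing them from Dirichlet priors; the vocabulary, class names as dictionary keys, and probabilities are invented for illustration and do not come from the paper's experiments.

```python
import random

def generate_document(theta, phi, vocab, length, rng):
    """Sample one document: a class z_d, then `length` independent words."""
    classes = list(theta)
    # Step 3(a): draw the class assignment z_d from theta.
    z = rng.choices(classes, weights=[theta[c] for c in classes])[0]
    # Step 3(b): draw every word independently from phi[z];
    # word order carries no information under this model.
    words = rng.choices(vocab, weights=phi[z], k=length)
    return z, words

# Hypothetical toy parameters (illustrative assumptions only).
vocab = ["government", "health", "care", "takeover", "all"]
theta = {"PROGRESSIVE": 0.5, "CONSERVATIVE": 0.5}
phi = {
    "PROGRESSIVE":  [0.1, 0.3, 0.3, 0.0, 0.3],
    "CONSERVATIVE": [0.4, 0.2, 0.2, 0.2, 0.0],
}

rng = random.Random(1)
label, doc = generate_document(theta, phi, vocab, length=8, rng=rng)
print(label, doc)
```

Because every word is sampled independently, a multiword phrase can arise only by chance from its individual words; the model has no way to treat it as a single unit.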
Consider phrases like "health care for all" or "government takeover of health care", connected with progressive and conservative positions, respectively, during the national debate on healthcare reform. Simply applying naïve Bayes, or any other model, to a bag of n-grams for high n is going to lead to unworkable levels of data sparsity; a model should be flexible enough to support both unigrams and longer phrases as needed.

Figure 1: A plate diagram for (a) naïve Bayes and (b) adaptive naïve Bayes. Nodes represent random variables and parameters; shaded nodes represent observations; lines represent probabilistic dependencies; and the rectangular plates denote replication.

Following Johnson (2010), however, we can use adaptor grammars to extend naïve Bayes flexibly to include richer structure like collocations when they improve the model, and to omit them when they do not. This can be accomplished by introducing adapted nonterminal rules: in a revised generative process, the author can draw from a Pitman-Yor distribution whose base distribution is over word sequences of arbitrary length.[4] Thus in a setting where, say, $K = 2$, and our two classes are PROGRESSIVE and CONSERVATIVE, the sequence "health care for all" might be generated as a single unit for the progressive perspective, but in the conservative