I256 Applied Natural Language Processing
Fall 2009
Lecture 1 Introduction
Barbara Rosario

Introductions
Barbara Rosario: iSchool alumna (class 2005), Intel Labs
Gopal Vaswani: iSchool master's student (class 2010)
You?

Today
Introductions
Administrivia
What is NLP
NLP Applications
Why is NLP difficult
Corpus-based statistical approaches
Course goals
What we'll do in this course

Administrivia
http://courses.ischool.berkeley.edu/i256/f09/index.html
Books:
Foundations of Statistical NLP, Manning and Schuetze, MIT Press
Natural Language Processing with Python, Bird, Klein & Loper, O'Reilly (also online)
See Web site for additional resources
Work:
Individual coding assignments (Python & NLTK, the Natural Language Toolkit) (4 or 5)
Final group project
Participation
Office hours:
Barbara: Thursday 2:00-3:00 in Room 6
Gopal: Tuesday 2:00-3:00 in Room 6 (to be confirmed)

Administrivia
Communication:
My email: [email protected]
Gopal: [email protected]
Mailing list: [email protected]
Send an email to [email protected] with subscribe i256 in the body
Through intranet
Announcements: webpage and/or mailing list and/or Bspace (TBA)
Public discussion: Bspace(?)
Related course:
Statistical Natural Language Processing, Spring 2009, CS 288
http://www.cs.berkeley.edu/~klein/cs288/sp09/
Instructor: Dan Klein
Much more emphasis on statistical algorithms

Questions?

Natural Language Processing
Fundamental goal: deep understanding of broad language
Not just string processing or keyword matching!
End systems that we want to build:
Ambitious: speech recognition, machine translation, question answering
Modest: spelling correction, text categorization
Slide taken from Klein's course: UCB CS 288 spring 09

Example: Machine Translation

NLP applications
Text Categorization: classify documents by topics, language, author, spam filtering, information retrieval (relevant, not relevant), sentiment classification (positive, negative)
Spelling & Grammar Corrections
Information Extraction
Speech Recognition
Information Retrieval
Synonym Generation
Summarization
Machine Translation
Question Answering
Dialog Systems
Language generation

Why NLP is difficult
An NLP system needs to answer the question "who did what to whom"
Language is ambiguous
At all levels: lexical, phrase, semantic
"Iraqi Head Seeks Arms": word sense is ambiguous (head, arms)
"Stolen Painting Found by Tree": thematic role is ambiguous: is the tree the agent or the location?
"Ban on Nude Dancing on Governor's Desk": syntactic structure (attachment) is ambiguous: is the ban or the dancing on the desk?
"Hospitals Are Sued by 7 Foot Doctors": semantics is ambiguous: what is 7 foot?

Why NLP is difficult
Language is flexible
New words, new meanings
Different meanings in different contexts
Language is subtle
He arrived at the lecture
He chuckled at the lecture
He chuckled his way through the lecture
**He arrived his way through the lecture
Language is complex!

Why NLP is difficult
MANY hidden variables
Knowledge about the world
Knowledge about the context
Knowledge about human communication techniques
"Can you tell me the time?"
Problem of scale
Many (infinite?) possible words, meanings, contexts
Problem of sparsity
Very difficult to do statistical analysis; most things (words, concepts) are never seen before
Long-range correlations

Why NLP is difficult
Key problems:
Representation of meaning
Language presupposes knowledge about the world
Language only reflects the surface of meaning
Language presupposes communication between people

Meaning
What is meaning?
Physical referent in the real world
Semantic concepts, characterized also by relations
How do we represent and use meaning?
"I am Italian"
From lexical database (WordNet):
Italian = a native or inhabitant of Italy
Italy = republic in southern Europe [..]
"I am Italian"
Who is "I"?
"I know she is Italian" / "I think she is Italian"
How do we represent "I know" and "I think"?
Does this mean that I is Italian?
What does it say about the "I" and about the person speaking?
"I thought she was Italian"
How do we represent tenses?

Today
Introductions
Administrivia
What is NLP
NLP Applications
Why is NLP difficult
Corpus-based statistical approaches
Course goals
What we'll do in this course

Corpus-based statistical approaches to tackle NLP problem
How can a machine understand these differences?
Decorate the cake with the frosting
Decorate the cake with the kids
Rule-based approaches, i.e. hand-coded syntactic constraints and preference rules:
The verb decorate requires an animate being as agent
The object cake is formed by any of the following inanimate entities (cream, dough, frosting...)
Such approaches have been shown to be time-consuming to build, do not scale up well, and are very brittle to new, unusual, or metaphorical uses of language
To swallow requires an animate being as agent/subject and a physical object as object
I swallowed his story
The supernova swallowed the planet

Corpus-based statistical approaches to tackle NLP problem
A Statistical NLP approach seeks to solve these problems by automatically learning lexical and structural preferences from text collections (corpora)
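
To make the brittleness of hand-coded rules concrete, here is a minimal sketch of the selectional restriction for "swallow" described above. The word lists and the `accepts` helper are illustrative assumptions, not part of any real lexicon or of NLTK.

```python
# Hypothetical hand-coded word classes; every entry is an illustrative assumption.
ANIMATE = {"I", "kids", "dog"}
PHYSICAL = {"cake", "planet", "apple"}

# Selectional restriction for "swallow": animate agent, physical object.
RULES = {
    "swallow": {"agent": ANIMATE, "object": PHYSICAL},
}

def accepts(verb, agent, obj):
    """Return True if the hand-coded rule for `verb` licenses this agent/object pair."""
    rule = RULES[verb]
    return agent in rule["agent"] and obj in rule["object"]

examples = [
    ("swallow", "I", "apple"),           # literal use
    ("swallow", "I", "story"),           # metaphorical: "I swallowed his story"
    ("swallow", "supernova", "planet"),  # unusual, inanimate agent
]
for verb, agent, obj in examples:
    print(verb, agent, obj, accepts(verb, agent, obj))
```

Only the literal sentence passes; the two perfectly acceptable sentences from the slide are rejected, and every fix means editing the rules by hand.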
Statistical models are robust, generalize well and behave gracefully in the presence of errors and new data.
So:
Get large text collections
Compute statistics over those collections (the bigger the collections, the better the statistics)

Decorate the cake with the frosting
Decorate the cake with the kids
From (labeled) corpora we can learn that:
#(kids are subject/agent of decorate) > #(frosting is subject/agent of decorate)
From (unlabeled) corpora we can learn that:
#(the kids decorate the cake) >> #(the frosting decorates the cake)
#(cake with frosting) >> #(cake with kids)
etc.
Given these facts we then need a statistical model for the attachment decision

Topic categorization: classify the document into semantic topics
Document 1: The U.S. swept into the Davis Cup final on Saturday when twins Bob and Mike Bryan defeated Belarus's Max Mirnyi and Vladimir Voltchkov to give the Americans an unsurmountable 3-0 lead in the best-of-five semi-final tie.
Topic = sport
Document 2: One of the strangest, most relentless hurricane seasons on record reached new bizarre heights yesterday as the plodding approach of Hurricane Jeanne prompted evacuation orders for hundreds of thousands of Floridians and high wind warnings that stretched 350 miles from the swamp towns south of Miami to the historic city of St. Augustine.
Topic = disaster
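
Returning to the "decorate the cake" example, the counting idea can be sketched in a few lines of plain Python. The toy corpus below is invented for illustration and stands in for a large unlabeled collection. The helper names (`count_phrase`, `preferred_reading`) are assumptions, and comparing two raw counts is only a stand-in for a real statistical model of the attachment decision.

```python
# Toy stand-in for a large unlabeled corpus (invented sentences, not real data).
corpus = [
    "the kids decorate the cake",
    "the kids decorate the tree",
    "a cake with frosting",
    "she likes cake with frosting",
    "the kids eat cake with frosting",
]

def count_phrase(phrase, sentences):
    """Count occurrences of `phrase` as a contiguous word sequence."""
    target = phrase.split()
    n = len(target)
    total = 0
    for sentence in sentences:
        words = sentence.split()
        total += sum(words[i:i + n] == target for i in range(len(words) - n + 1))
    return total

def preferred_reading(noun, sentences):
    """Compare #(noun decorate) with #(cake with noun) and report the larger one."""
    as_agent = count_phrase(f"{noun} decorate", sentences)
    with_cake = count_phrase(f"cake with {noun}", sentences)
    if as_agent > with_cake:
        return "agent of decorate"
    if with_cake > as_agent:
        return "goes with the cake"
    return "undecided"

for noun in ("kids", "frosting"):
    print(noun, "->", preferred_reading(noun, corpus))
```

With so little data most counts are zero, which is the sparsity problem in miniature; larger collections give more reliable counts.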
Topic categorization: classify the document into semantic topics
Document 1 (sport): The U.S. swept into the Davis Cup final on Saturday when twins Bob and Mike Bryan ...
Document 2 (disasters): One of the strangest, most relentless hurricane seasons on record reached new bizarre heights yesterday as ...
From (labeled) corpora we can learn that:
#(sport documents containing word Cup) > #(disaster documents containing word Cup) -- feature
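
The document counts behind such a feature can be sketched as follows. The four labeled "documents" are shortened, invented stand-ins for a real labeled corpus, and `document_counts` is a hypothetical helper name.

```python
from collections import Counter, defaultdict

# Tiny labeled corpus; the texts are invented stand-ins for real documents.
labeled_docs = [
    ("sport", "the team swept into the davis cup final"),
    ("sport", "they won the cup after a tie"),
    ("disaster", "hurricane prompted evacuation orders"),
    ("disaster", "high wind warnings as the hurricane approached"),
]

def document_counts(docs):
    """Map each topic to a Counter of how many of its documents contain each word."""
    counts = defaultdict(Counter)
    for topic, text in docs:
        # set() makes each word count at most once per document
        counts[topic].update(set(text.split()))
    return counts

counts = document_counts(labeled_docs)
print("sport documents containing 'cup':", counts["sport"]["cup"])
print("disaster documents containing 'cup':", counts["disaster"]["cup"])
```

These per-topic counts are only the feature statistics; turning them into a decision still requires a model.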
We then need a statistical model for the topic assignment

Corpus-based statistical approaches to tackle NLP problem
Feature extraction (usually linguistically motivated)
Statistical models
Data (corpora, labels, linguistic resources)

Goals of this Course
Learn about the problems and possibilities of natural language analysis:
What are the major issues?
What are the major solutions?
At the end you should:
Agree that language is difficult, interesting and important
Be able to assess language problems
Know which solutions to apply when, and how
Feel some ownership over the algorithms
Be able to use software to tackle some NLP language tasks
Know language resources
Be able to read papers in the field

What We'll Do in this Course
Linguistic Issues
What is the range of language phenomena?
What are the knowledge sources that let us disambiguate?
What representations are appropriate?
Applications
Software (Python and NLTK)
Statistical Modeling Methods

What We'll Do in this Course
Read books, research papers and tutorials
Final project
Your own ideas, or choose from some suggestions I will provide
We'll talk later during the course about ideas/methods etc., but come talk to me if you already have some ideas
Learn Python
Learn/use NLTK (Natural Language ToolKit) to try out various algorithms

Python
Python - Simple yet powerful
The Zen of Python: http://www.python.org/dev/peps/pep-0020/
Very clear, readable syntax
Strong introspection capabilities: http://www.ibm.com/developerworks/linux/library/l-pyint.html (recommended)
Intuitive object orientation
Natural expression of procedural code
Full modularity, supporting hierarchical packages
Exception-based error handling
Very high level dynamic data types
Extensive standard libraries and third-party modules for virtually every task
Excellent functionality for processing linguistic data. NLTK is one such extensive third-party module.
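
As a small illustration of the introspection and exception-handling features listed above, the sketch below uses only built-ins and the standard `inspect` module. The `describe` helper is a made-up name for this example.

```python
import inspect

def describe(obj):
    """Collect a few facts about an object using built-in introspection tools."""
    return {
        "type": type(obj).__name__,
        "public_attributes": [name for name in dir(obj) if not name.startswith("_")],
        "callable": callable(obj),
    }

info = describe("natural language")
print(info["type"])                           # the object's type name
print("split" in info["public_attributes"])   # strings expose a split() method
print(inspect.signature(describe))            # a function's parameters can be inspected too

# Exception-based error handling: a failed conversion raises instead of returning a code.
try:
    int("NLP")
except ValueError as error:
    print("could not convert:", error)
```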
Source: python.org

NLTK
NLTK defines an infrastructure that can be used to build NLP programs in Python. It provides:
Basic classes for representing data relevant to natural language processing.
Standard interfaces for performing tasks such as part-of-speech tagging, syntactic parsing, and text classification.
Standard implementations for each task, which can be combined to solve complex problems.

Language processing task | NLTK modules | Functionality
Accessing corpora | nltk.corpus | standardized interfaces to corpora and lexicons
String processing | nltk.tokenize, nltk.stem | tokenizers, sentence tokenizers, stemmers
Collocation discovery | nltk.collocations | t-test, chi-squared, point-wise mutual information
Parsing | nltk.parse | chart, feature-based, unification, probabilistic, dependency

Resources:
Download at http://www.nltk.org/download
Getting started with NLTK: Chapter 1
NLP and NLTK talk at Google: http://www.youtube.com/watch?v=keXW_5-llD0
Source: nltk.org
This is not the complete list

Topics
Text corpora & other resources
Words (morphology, tokenization, stemming, part-of-speech, WSD, collocations, lexical acquisition, language models)
Syntax: chunking, PCFG & parsing
Statistical models (esp. for classification)
Applications
Text classification
Information extraction
Machine translation
Semantic Interpretation
Sentiment Analysis
QA / Summarization
Information retrieval

Next Assignment
Due before next class, Tue Sep 1
No turn-in
Download and install Python and NLTK
Download the NLTK Book Collection, as described at the beginning of chapter 1 of the book Natural Language Processing with Python
Readings:
Chapter 1 of the book Natural Language Processing with Python
Chapter 3 of Foundations of Statistical NLP
Next class:
Linguistic Essentials
Python Introduction