CSE 535 Information Retrieval
Chapter 1: Introduction to IR
Srihari-CSE535-Spring2008

Motivation
- IR: representation, storage, organization of, and access to unstructured data
- Focus is on the user information need
- User information need, for example:
  - When did the Buffalo Bills last win the Super Bowl?
  - Find all docs containing information on cricket players who are (i) temperamental, (ii) popular in their countries, and (iii) play in international test series.
- Emphasis is on the retrieval of information (not data)

Motivation
- Data retrieval
  - Which docs contain a set of keywords?
  - Well-defined semantics
  - A single erroneous object implies failure!
- Information retrieval
  - Information about a subject or topic
  - Deals with unstructured text
  - Semantics is frequently loose
  - Small errors are tolerated
- IR system:
  - Interprets contents of information items
  - Generates a ranking which reflects relevance
  - Notion of relevance is most important

Basic Concepts
- The User Task (figure labels: Retrieval, Database, Browsing)
- Retrieval
  - Information or data
  - Purposeful
  - "Needle in a haystack" problem
- Browsing
  - Glancing around
  - Formula 1 racing; cars, Le Mans, France, tourism
- Filtering (push rather than pull)

Query
- Which plays of Shakespeare contain the words Brutus AND Caesar but NOT Calpurnia?
- Could grep all of Shakespeare's plays for Brutus and Caesar, then strip out lines containing Calpurnia?
  - Slow (for large corpora)
  - NOT Calpurnia is non-trivial
  - Other operations (e.g., find the phrase "Romans and countrymen") not feasible

Term-document incidence

  Columns, left to right: Antony and Cleopatra, Julius Caesar, The Tempest, Hamlet, Othello, Macbeth

  Antony      1 1 0 0 0 1
  Brutus      1 1 0 1 0 0
  Caesar      1 1 0 1 1 1
  Calpurnia   0 1 0 0 0 0
  ...
  worser      1 0 1 1 1 0

  1 if play contains word, 0 otherwise

Incidence vectors
- So we have a 0/1 vector for each term.
- To answer the query: take the vectors for Brutus, Caesar and Calpurnia (complemented), then bitwise AND them.
- 110100 AND 110111 AND 101111 = 100100.
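The bitwise AND on the Incidence vectors slide can be sketched in a few lines of Python. This is a minimal illustration, not part of the original deck: the function names `boolean_and_not` and `matching_plays` are hypothetical, and the Calpurnia vector `010000` is inferred by un-complementing the `101111` given on the slide.

```python
# Plays in the column order of the term-document incidence matrix.
PLAYS = ["Antony and Cleopatra", "Julius Caesar", "The Tempest",
         "Hamlet", "Othello", "Macbeth"]

# 0/1 incidence vectors as bit strings, one bit per play.
# Brutus and Caesar are quoted from the slide; Calpurnia is the
# complement of the 101111 vector shown there (an inference).
INCIDENCE = {
    "Brutus": "110100",
    "Caesar": "110111",
    "Calpurnia": "010000",
}


def boolean_and_not(include, exclude, incidence=INCIDENCE, n_docs=len(PLAYS)):
    """AND the vectors of `include` with the complemented vectors of `exclude`."""
    mask = (1 << n_docs) - 1          # n_docs one-bits, e.g. 111111
    result = mask                     # start with "every document matches"
    for term in include:
        result &= int(incidence[term], 2)
    for term in exclude:
        # Complement, then mask off the bits above n_docs.
        result &= ~int(incidence[term], 2) & mask
    return format(result, f"0{n_docs}b")


def matching_plays(bits):
    """Map a result bit string back to play titles."""
    return [play for play, bit in zip(PLAYS, bits) if bit == "1"]


answer = boolean_and_not(["Brutus", "Caesar"], ["Calpurnia"])
print(answer)                  # 100100
print(matching_plays(answer))  # ['Antony and Cleopatra', 'Hamlet']
```

The two plays returned are the ones quoted on the Answers to query slide. Python integers are used as bit vectors here only for brevity; at the collection sizes discussed later in the deck a dense matrix of this kind is not practical.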
Answers to query
- Antony and Cleopatra, Act III, Scene ii
  Agrippa [Aside to DOMITIUS ENOBARBUS]: Why, Enobarbus,
  When Antony found Julius Caesar dead,
  He cried almost to roaring; and he wept
  When at Philippi he found Brutus slain.
- Hamlet, Act III, Scene ii
  Lord Polonius: I did enact Julius Caesar: I was killed i' the Capitol; Brutus killed me.

Bigger document collections
- Consider N = 1 million documents, each with about 1K terms.
- Avg 6 bytes/term incl. spaces/punctuation
  - 6 GB of data in the documents.
- Say there are M = 500K distinct terms among these.

Can't build the matrix
- A 500K x 1M matrix has half a trillion 0s and 1s.
- But it has no more than one billion 1s. Why?
  - Each of the 1M documents has about 1K terms, so at most 1M x 1K = 1 billion entries can be 1.
  - Matrix is extremely sparse: >99% zeros
- What's a better representation?
  - We only record the 1 positions.
- Inverted index

Ad-Hoc Retrieval
- Most standard IR task
- System to provide documents from the collection that are relevant to an arbitrary user information need
- Information need: topic that the user wants to know about
- Query: user's abstraction of the information need
- Relevance: a document is relevant if the user perceives it as valuable with respect to their information need

Issues to be Addressed by IR
- How to improve quality of retrieval
  - Precision: what fraction of the returned results are relevant to the information need?
  - Recall: what fraction of the relevant documents in the collection are returned by the system?
- Understanding user information need
- Faster indexes and smaller query response times
- Better understanding of user behaviour
  - interactive retrieval
  - visualization techniques

Inverted index
- For each term T: store a list of all documents that contain T.
- Do we use an array or a list for this?

  Brutus      2  4  8  16  32  64  128
  Calpurnia   1  2  3  5  8  13  21  34
  Caesar      13  16

- What happens if the word Caesar is added to document 14?

Inverted index
- Linked lists generally preferred to arrays
  - Dynamic space allocation
  - Insertion of terms into documents easy
  - Space overhead of pointers
- Postings (figure): Brutus  2  4
- Sorted by docID (more later on why).

Inverted index construction

  Documents to be indexed:   Friends, Romans, countrymen.
    Tokenizer
  Token stream:              Friends  Romans  Countrymen
    Linguistic modules
  Modified tokens:           friend  roman  countryman
    Indexer
  Inverted index:
    friend       2  4
    roman        1  2
    countryman   13  16

  More on these later.

Basic Concepts
- Logical view of the documents
- Documents represented by a set of index terms or keywords
- Figure labels: Docs; structure recognition; accents, spacing; stopwords; noun groups; stemming; automatic or manual indexing; structure; full text; index terms
- Document representation viewed as a continuum: logical view of docs might shift

The Retrieval Process
- Figure labels: User Interface; user need; Text; Text Operations; logical view; Query Operations; user feedback; Indexing; Criteria, Preferences; Indexer; query; Searching; inverted file; Index; retrieved docs; Document Collection; ranked docs; Ranking; Retrieval

Applications of IR
- Specialized domains
  - biomedical, legal, patents, intelligence
- Summarization
- Cross-lingual retrieval, information access
- Question-answering systems
  - Ask Jeeves
- Web/text mining
  - data mining on unstructured text
- Multimedia IR
  - images, document images, speech, music
- Web applications
  - shopbots
  - personal assistant agents

IR Techniques
- Machine learning
  - clustering, SVM, latent semantic indexing, etc.
  - improving relevance feedback, query processing, etc.
- Natural language processing, computational linguistics
  - better indexing, query processing
  - incorporating domain knowledge: e.g., synonym dictionaries
  - use of NLP in IR: benefits yet to be shown for large-scale IR
- Information extraction
  - highly focused natural language processing (NLP)
  - named entity tagging, relationship/event detection
- Text indexing and compression
- User interfaces and visualization
- AI
  - advanced QA systems, inference, etc.
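The inverted index slides earlier in the deck (postings lists sorted by docID, the tokenizer and indexer pipeline, the Caesar and document 14 question) can be tied together with a small Python sketch. Everything here is an illustrative assumption rather than the course's implementation: the names `build_index`, `intersect` and `and_not` are hypothetical, lowercasing and punctuation stripping stand in for the linguistic modules (no stemming is done), the two toy documents are made up, and the postings lists are read off the slide figure.

```python
import bisect
from collections import defaultdict


def build_index(docs):
    """Build term -> docID-sorted postings list from {doc_id: text}."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.split():                # tokenizer (whitespace only)
            term = token.strip(".,;:!?").lower()  # stand-in for linguistic modules
            if term:
                index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}


def intersect(p1, p2):
    """AND: merge two docID-sorted postings lists in O(len(p1) + len(p2))."""
    i = j = 0
    out = []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            out.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return out


def and_not(p1, p2):
    """AND NOT: docIDs in p1 that do not occur in p2 (both docID-sorted)."""
    out = []
    j = 0
    for doc_id in p1:
        while j < len(p2) and p2[j] < doc_id:
            j += 1
        if j == len(p2) or p2[j] != doc_id:
            out.append(doc_id)
    return out


# Postings lists as read off the slide figure.
brutus = [2, 4, 8, 16, 32, 64, 128]
calpurnia = [1, 2, 3, 5, 8, 13, 21, 34]
caesar = [13, 16]

# Brutus AND Caesar AND NOT Calpurnia over these postings.
result = and_not(intersect(brutus, caesar), calpurnia)
print(result)  # [16]

# Adding Caesar to document 14: keep the list sorted on insertion.
caesar_updated = list(caesar)
bisect.insort(caesar_updated, 14)
print(caesar_updated)  # [13, 14, 16]

toy_index = build_index({1: "Friends, Romans, countrymen.", 2: "Romans and friends"})
print(toy_index["romans"])  # [1, 2]
```

Keeping postings sorted by docID is what lets `intersect` and `and_not` run as a single linear merge. The sketch stores postings in Python lists, which are arrays, so `bisect.insort` has to shift elements on insertion; that cost is the trade-off the slide raises when it says linked lists are generally preferred.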