Title

Title

Introduction to Information Retrieval Introduction to Information Retrieval Hinrich Schtze and Christina Lioma Lecture 9: Relevance Feedback & Query Expansion 1 Introduction to Information Retrieval Take-away today Interactive relevance feedback: improve initial retrieval

results by telling the IR system which docs are relevant / nonrelevant Best known relevance feedback method: Rocchio feedback Query expansion: improve retrieval results by adding synonyms / related terms to the query Sources for related terms: Manual thesauri, automatic thesauri, query logs 2 Introduction to Information Retrieval Overview

Motivation Relevance feedback: Basics Relevance feedback: Details Query expansion

3 Introduction to Information Retrieval Outline Motivation Relevance feedback: Basics

Relevance feedback: Details Query expansion 4 Introduction to Information Retrieval How can we improve recall in search? Main topic today: two ways of improving recall: relevance

feedback and query expansion As an example consider query q: [aircraft] . . . . . . and document d containing plane, but not containing aircraft A simple IR system will not return d for q. Even if d is the most relevant document for q! We want to change this: Return relevant documents even if there is no term match with the (original) query 5 Introduction to Information Retrieval Recall

Loose definition of recall in this lecture: increasing the number of relevant documents returned to user This may actually decrease recall on some measures, e.g., when expanding jaguar with panthera . . .which eliminates some relevant documents, but increases relevant documents returned on top pages 6 Introduction to Information Retrieval Options for improving recall Local: Do a local, on-demand analysis for a user query Main local method: relevance feedback

Part 1 Global: Do a global analysis once (e.g., of collection) to produce thesaurus Use thesaurus for query expansion Part 2 7 Introduction to Information Retrieval Google examples for query expansion One that works well flights -flight

One that doesnt work so well hospitals -hospital 8 Introduction to Information Retrieval Outline Motivation

Relevance feedback: Basics Relevance feedback: Details Query expansion 9 Introduction to Information Retrieval

Relevance feedback: Basic idea The user issues a (short, simple) query. The search engine returns a set of documents. User marks some docs as relevant, some as nonrelevant. Search engine computes a new representation of the information need. Hope: better than the initial query. Search engine runs new query and returns new results. New results have (hopefully) better recall.

10 Introduction to Information Retrieval Relevance feedback We can iterate this: several rounds of relevance feedback. We will use the term ad hoc retrieval to refer to regular retrieval without relevance feedback. We will now look at three different examples of relevance feedback that highlight different aspects of the process. 11

Introduction to Information Retrieval Relevance feedback: Example 1 12 Introduction to Information Retrieval Results for initial query 13 Introduction to Information Retrieval

User feedback: Select what is relevant 14 Introduction to Information Retrieval Results after relevance feedback 15 Introduction to Information Retrieval Vector space example: query canine (1) Source:

Fernando Daz 16 Introduction to Information Retrieval Similarity of docs to query canine Source: Fernando Daz 17 Introduction to Information Retrieval

User feedback: Select relevant documents Source: Fernando Daz 18 Introduction to Information Retrieval Results after relevance feedback Source: Fernando Daz 19

Introduction to Information Retrieval Example 3: A real (non-image) example Initial query: [new space satellite applications] Results for initial query: (r = rank) + + + r 1 2 3

0.539 0.533 0.528 4 0.526 5 0.525 6

0.524 7 0.516 8 0.509 NASA Hasnt Scrapped Imaging Spectrometer NASA Scratches Environment Gear From Satellite Plan Science Panel Backs NASA Satellite Plan, But Urges Launches of

Smaller Probes A NASA Satellite Project Accomplishes Incredible Feat: Staying Within Budget Scientist Who Exposed Global Warming Proposes Satellites for Climate Research Report Provides Support for the Critics Of Using Big Satellites to Study Climate Arianespace Receives Satellite Launch Pact From Telesat Canada Telecommunications Tale of Two Companies User then marks relevant documents with +. 20

Introduction to Information Retrieval Expanded query after relevance feedback 2.074 30.816 5.991 4.196 3.516 3.004 2.790 2.003 0.836

new 15.106 space satellite 5.660 application nasa 5.196 eos launch 3.972 aster instrument 3.446 arianespace bundespost 2.806 ss rocket 2.053 scientist broadcast 1.172 earth

oil 0.646 measure Compare to original query: [new space satellite applications] 21 Introduction to Information Retrieval Results for expanded query * *

r 1 2 3 * 4 5 6 7 8 0.513 NASA Scratches Environment Gear From Satellite Plan

0.500 NASA Hasnt Scrapped Imaging Spectrometer 0.493 When the Pentagon Launches a Secret Satellite, Space Sleuths Do Some Spy Work of Their Own 0.493 NASA Uses Warm Superconductors For Fast Circuit 0.492 Telecommunications Tale of Two Companies 0.491 Soviets May Adapt Parts of SS-20 Missile For Commercial Use 0.490 Gaping Gap: Pentagon Lags in Race To Match the Soviets In Rocket Launchers 0.490 Rescue of Satellite By Space Agency To Cost $90 Million 22 Introduction to Information Retrieval

Outline Motivation Relevance feedback: Basics Relevance feedback: Details

Query expansion 23 Introduction to Information Retrieval Key concept for relevance feedback: Centroid The centroid is the center of mass of a set of points. Recall that we represent documents as points in a highdimensional space. Thus: we can compute centroids of documents. Definition:

where D is a set of documents and we use to represent document d. is the vector 24 Introduction to Information Retrieval Centroid: Example 25 Introduction to Information Retrieval

Rocchio algorithm The Rocchio algorithm implements relevance feedback in the vector space model. Rocchio chooses the query that maximizes Dr : set of relevant docs; Dnr : set of nonrelevant docs Intent: ~qopt is the vector that separates relevant and nonrelevant docs maximally. Making some additional assumptions, we can rewrite as: 26

Introduction to Information Retrieval Rocchio algorithm The optimal query vector is: We move the centroid of the relevant documents by the difference between the two centroids. 27 Introduction to Information Retrieval Exercise: Compute Rocchio vector

circles: relevant documents, Xs: nonrelevant documents 28 Introduction to Information Retrieval Rocchio illustrated : centroid of relevant documents 29 Introduction to Information Retrieval Rocchio illustrated

does not separate relevant / nonrelevant. 30 Introduction to Information Retrieval Rocchio illustrated centroid of nonrelevant documents. 31 Introduction to Information Retrieval Rocchio illustrated

32 Introduction to Information Retrieval Rocchio illustrated - difference vector 33 Introduction to Information Retrieval Rocchio illustrated

Add difference vector to 34 Introduction to Information Retrieval Rocchio illustrated to get 35 Introduction to Information Retrieval

Rocchio illustrated separates relevant / nonrelevant perfectly. 36 Introduction to Information Retrieval Rocchio illustrated separates relevant / nonrelevant perfectly. 37 Introduction to Information Retrieval

Terminology We use the name Rocchio for the theoretically better motivated original version of Rocchio. The implementation that is actually used in most cases is the SMART implementation we use the name Rocchio (without prime) for that. 38 Introduction to Information Retrieval Rocchio 1971 algorithm (SMART) Used in practice:

qm: modified query vector; q0: original query vector; Dr and Dnr : sets of known relevant and nonrelevant documents respectively; , , and : weights New query moves towards relevant documents and away from nonrelevant documents. Tradeoff vs. /: If we have a lot of judged documents, we want a higher /. Set negative term weights to 0.

Negative weight for a term doesnt make sense in the vector space model. 39 Introduction to Information Retrieval Positive vs. negative relevance feedback Positive feedback is more valuable than negative feedback. For example, set = 0.75, = 0.25 to give higher weight to positive feedback. Many systems only allow positive feedback. 40

Introduction to Information Retrieval Relevance feedback: Assumptions When can relevance feedback enhance recall? Assumption A1: The user knows the terms in the collection well enough for an initial query. Assumption A2: Relevant documents contain similar terms (so I can hop from one relevant document to a different one when giving relevance feedback). 41 Introduction to Information Retrieval

Violation of A1 Assumption A1: The user knows the terms in the collection well enough for an initial query. Violation: Mismatch of searchers vocabulary and collection vocabulary Example: cosmonaut / astronaut 42 Introduction to Information Retrieval Violation of A2 Assumption A2: Relevant documents are similar.

Example for violation: [contradictory government policies] Several unrelated prototypes Subsidies for tobacco farmers vs. anti-smoking campaigns Aid for developing countries vs. high tariffs on imports from developing countries Relevance feedback on tobacco docs will not help with finding docs on developing countries. 43 Introduction to Information Retrieval Relevance feedback: Evaluation

Pick one of the evaluation measures from last lecture, e.g., precision in top 10: [email protected] Compute [email protected] for original query q0 Compute [email protected] for modified relevance feedback query q1 In most cases: q1 is spectacularly better than q0! Is this a fair evaluation? 44 Introduction to Information Retrieval Relevance feedback: Evaluation Fair evaluation must be on residual collection: docs not yet judged by user.

Studies have shown that relevance feedback is successful when evaluated this way. Empirically, one round of relevance feedback is often very useful. Two rounds are marginally useful. 45 Introduction to Information Retrieval Evaluation: Caveat True evaluation of usefulness must compare to other methods taking the same amount of time. Alternative to relevance feedback: User revises and resubmits query.

Users may prefer revision/resubmission to having to judge relevance of documents. There is no clear evidence that relevance feedback is the best use of the users time. 46 Introduction to Information Retrieval Exercise Do search engines use relevance feedback? Why? 47

Introduction to Information Retrieval Relevance feedback: Problems Relevance feedback is expensive. Relevance feedback creates long modified queries. Long queries are expensive to process. Users are reluctant to provide explicit feedback. Its often hard to understand why a particular document was retrieved after applying relevance feedback. The search engine Excite had full relevance feedback at one point, but abandoned it later.

48 Introduction to Information Retrieval Pseudo-relevance feedback Pseudo-relevance feedback automates the manual part of true relevance feedback. Pseudo-relevance algorithm: Retrieve a ranked list of hits for the users query Assume that the top k documents are relevant. Do relevance feedback (e.g., Rocchio) Works very well on average But can go horribly wrong for some queries.

Several iterations can cause query drift. 49 Introduction to Information Retrieval Pseudo-relevance feedback at TREC4 Cornell SMART system Results show number of relevant documents out of top 100 for 50 queries (so total number of documents is 5000): method number of relevant documents

lnc.ltc 3210 lnc.ltc-PsRF 3634 Lnu.ltu 3709 Lnu.ltu-PsRF

4350 Results contrast two length normalization schemes (L vs. l) and pseudo-relevance feedback (PsRF). The pseudo-relevance feedback method used added only 20 terms to the query. (Rocchio will add many more.) This demonstrates that pseudo-relevance feedback is effective on average. 50 Introduction to Information Retrieval Outline

Motivation Relevance feedback: Basics Relevance feedback: Details Query expansion

51 Introduction to Information Retrieval Query expansion Query expansion is another method for increasing recall. We use global query expansion to refer to global methods for query reformulation. In global query expansion, the query is modified based on some global resource, i.e. a resource that is not querydependent. Main information we use: (near-)synonymy A publication or database that collects (near-)synonyms is called a thesaurus.

We will look at two types of thesauri: manually created and automatically created. 52 Introduction to Information Retrieval Query expansion: Example 53 Introduction to Information Retrieval Types of user feedback User gives feedback on documents.

More common in relevance feedback User gives feedback on words or phrases. More common in query expansion 54 Introduction to Information Retrieval Types of query expansion Manual thesaurus (maintained by editors, e.g., PubMed) Automatically derived thesaurus (e.g., based on cooccurrence statistics) Query-equivalence based on query log mining (common on the web as in the palm example)

55 Introduction to Information Retrieval Thesaurus-based query expansion For each term t in the query, expand the query with words the thesaurus lists as semantically related with t. Example from earlier: HOSPITAL MEDICAL Generally increases recall May significantly decrease precision, particularly with ambiguous terms INTEREST RATE INTEREST RATE FASCINATE

Widely used in specialized search engines for science and engineering Its very expensive to create a manual thesaurus and to maintain it over time. A manual thesaurus has an effect roughly equivalent to annotation with a controlled vocabulary. 56 Introduction to Information Retrieval Example for manual thesaurus: PubMed 57

Introduction to Information Retrieval Automatic thesaurus generation Attempt to generate a thesaurus automatically by analyzing the distribution of words in documents Fundamental notion: similarity between two words Definition 1: Two words are similar if they co-occur with similar words. car motorcycle because both occur with road, gas and license, so they must be similar. Definition 2: Two words are similar if they occur in a given grammatical relation with the same words. You can harvest, peel, eat, prepare, etc. apples and pears, so

apples and pears must be similar. Co-occurrence is more robust, grammatical relations are more accurate. 58 Introduction to Information Retrieval Co-occurence-based thesaurus: Examples Word absolutely bottomed captivating

doghouse makeup mediating keeping lithographs pathogens senses Nearest neighbors absurd whatsoever totally exactly nothing dip copper drops topped slide trimmed shimmer stunningly superbly plucky witty dog porch crawling beside downstairs repellent lotion glossy sunscreen skin gel

reconciliation negotiate case conciliation hoping bring wiping could some would drawings Picasso Dali sculptures Gauguin toxins bacteria organisms bacterial parasite grasp psyche truly clumsy naive innate WordSpace demo on web 59 Introduction to Information Retrieval Query expansion at search engines Main source of query expansion at search engines: query logs

Example 1: After issuing the query [herbs], users frequently search for [herbal remedies]. herbal remedies is potential expansion of herb. Example 2: Users searching for [flower pix] frequently click on the URL photobucket.com/flower. Users searching for [flower clipart] frequently click on the same URL. flower clipart and flower pix are potential expansions of each other. 60 Introduction to Information Retrieval

Take-away today Interactive relevance feedback: improve initial retrieval results by telling the IR system which docs are relevant / nonrelevant Best known relevance feedback method: Rocchio feedback Query expansion: improve retrieval results by adding synonyms / related terms to the query Sources for related terms: Manual thesauri, automatic thesauri, query logs 61 Introduction to Information Retrieval

Resources Chapter 9 of IIR Resources at http://ifnlp.org/ir Salton and Buckley 1990 (original relevance feedback paper) Spink, Jansen, Ozmultu 2000: Relevance feedback at Excite Schtze 1998: Automatic word sense discrimination (describes a simple method for automatic thesuarus generation) 62

Recently Viewed Presentations

  • Implementation Plan ofTRACES NT linked to legal ... - Europa

    Implementation Plan ofTRACES NT linked to legal ... - Europa

    (currently unidirectional link in place for TRACES Classic and iRASFF) fulfil art. 74 non-compliance with and withdrawal of the approval of pre-export controls performed by Third Countries. 2) Exchanges with National Systems (e.g. Netherlands, Germany …) 3) Conditions for Third...
  • STFC Corporate PowerPoint Template

    STFC Corporate PowerPoint Template

    Recent Highlights . Mark Thomson is Professor of Experimental Particle Physics at the University of Cambridge. He is an author on over 1000 publications.
  • The Peril and Promise of the Internet Within Mormonism

    The Peril and Promise of the Internet Within Mormonism

    Borderlands Mormon Stories Sunstone Sunstone Classic An Atheists' Prayer Reproductions of GA Talks LDS Voices BYU Classic Speeches April 05 General Conference Podcasts Podcasts Podcasts Podcasts Podcasts Podcasts Podcasts Wiki Wars FAIR Wiki MormonWiki ZCLM Wiki The Peril and Promise...
  • Nih Fundamentals

    Nih Fundamentals

    NIH is an agency of DHHS (one of 12) Other HHS Agencies include FDA, CDC, HRSA, SAMSA, IHS. IC = Institute or Center. I. The NIH. ... RAs, part time wages) (PSU fringe 36.5% for full time staff, 13.2% for...
  • AREA 1 GUIDING PRINCIPLES - Weebly

    AREA 1 GUIDING PRINCIPLES - Weebly

    Consequential Ethics Utilitarianism pleasure pain Consequential Ethics E.g.Utilitarianism: actions are good or bad depending on the outcome. The moral consequences of the action is the promotion of human happiness/wellbeing and the minimising of unhappiness/pain.
  • Penistone High Street Roadworks

    Penistone High Street Roadworks

    christmas & new year 2018/2019 thurgoland surgery monday 24th december normal opening hours tuesday 25th december christmas day closed wednesday 26th december boxing day closed thursday 27th december normal opening hours friday 28th december normal opening hours monday 31st december...
  • Building An IPv6 Deployment Plan v0.1 Alastair Matthews

    Building An IPv6 Deployment Plan v0.1 Alastair Matthews

    Building An IPv6 Deployment Plan v0.1 Alastair Matthews Network Engineer Outline Why Deploy IPv6 What Where How - Choices Timescales Conclusions Why….
  • Using the Internet for Educational Research

    Using the Internet for Educational Research

    Arial Franklin Gothic Book Perpetua Wingdings 2 Times New Roman Equity 1_Equity 2_Equity 3_Equity 4_Equity Using the Internet for Educational Research Discussion Topics Strengths and Weaknesses Strengths and Weaknesses Strengths and Weaknesses Strengths and Weaknesses Organization of Information Internet Search...