DB & IR: Both Sides Now Gerhard Weikum

Gerhard Weikum
[email protected]
http://www.mpi-inf.mpg.de/~weikum/
in collaboration with Georgiana Ifrim, Gjergji Kasneci, Josiane Parreira, Maya Ramanath, Ralf Schenkel, Fabian Suchanek, Martin Theobald
June 14, 2007

DB and IR: Two Parallel Universes

                         Database Systems                        Information Retrieval
canonical application:   accounting                              libraries
data type:               numbers, short strings                  text
foundation:              algebraic / logic based                 probabilistic / statistics based
search paradigm:         Boolean retrieval                       ranked retrieval
                         (exact queries, result sets/bags)       (vague queries, result lists)
market leaders:          Oracle, IBM DB2, MS SQL Server, etc.    Google, Yahoo!, MSN, Verity, Fast, etc.

parallel universes forever?

Why DB&IR Now? Application Needs

Simplify life for application areas like:
  Global health-care management for monitoring epidemics
  News archives for journalists, press agencies, etc.
  Product catalogs for houses, cars, vacation places, etc.
  Customer support & CRM in insurances, telcom, retail, software, etc.
  Bulletin boards for social communities
  Enterprise search for projects, skills, know-how, etc.
  Personalized & collaborative search in digital libraries, Web, etc.
  Comprehensive archive of blogs with time-travel search

Typical data:
  Disease (DId, Name, Category, Pathogen, ...)
  UMLS-Categories (...)
  Patient (..., Age, HId, Date, Report, TreatedDId)
  Hospital (HId, Address, ...)

Typical query:
  symptoms of tropical virus diseases and reported anomalies with young patients in central Europe in the last two weeks

Why DB&IR Now? Platform Desiderata

                                   Structured data (records)               Unstructured data (documents)
Unstructured search (keywords)     Keyword Search on Relational Graphs     IR Systems, Search Engines
                                   (IIT Bombay, UCSD, MSR, Hebrew U,
                                   CU Hong Kong, Duke U, ...)
Structured search (SQL, XQuery)    DB Systems                              Querying entities & relations from IE
                                                                           (MSR Beijing, UW Seattle, IBM Almaden,
                                                                           UIUC, MPI, ...)
In the center: Integrated DB&IR Platform

Platform desiderata (from app developer's viewpoint):
  Flexible ranking on text, categorical, numerical attributes
    cope with too many answers and no answers
  Ontologies (dimensions, facets) for products, locations, orgs, etc.
    for query rewriting (relaxation, strengthening)
  Complex queries combining text & structured attributes
    XPath/XQuery Full-Text with ranking
  High update rate concurrently with high query load

Why DB&IR Forever?

Turn the Web, Web2.0, and Web3.0 into the world's most comprehensive knowledge base (semantic DB)!
  Data enrichment at very large scale
  Text and speech are key sources of knowledge production (publications, patents, conferences, meetings, ...)

                           2000         2007
indexed Web                2 Bio.       20 Bio.
Flickr photos              --           100 Mio.
digital photos             ?            150 Bio.
Wikipedia                  8 000        1.8 Mio.
OECD researchers           7.4 Mio.     8.4 Mio.
patents world-wide         ?            60 Mio.
US Library of Congress     115 Mio.     134 Mio.
Google Scholar             ---          500 Mio.

Outline
  Past: Matter, Antimatter, and Wormholes
  Present: XML and Graph IR
  Future: From Data to Knowledge

Parallel Universes: A Closer Look

                 Matter                                      Antimatter
user:            programmer                                  your kids
query:           precise spec. of info request               approximation of user's real info needs
interaction:     via API                                     process via GUI
strength:        indexing, QP                                ranking model
weakness:        user model                                  interoperability
eval. measure:   efficiency                                  effectiveness
                 (throughput, response time,                 (precision, recall, F1, MAP, NDCG,
                 TPC-H, XMark, ...)                          TREC & INEX benchmarks, ...)

[Figure: timeline from 1990 to 2005 of work between the DB and IR universes. Labels on the DB side: Web Query Languages (W3QS, WebOQL, Araneus); Uncertain & Prob. Relations (Prob. DB (Cavallo & Pittarelli), Prob. Tuples (Barbara et al.), VAGUE (Motro), Mystiq, Trio); Semistructured Data (Lore, Xyleme, XPath). Labels in between: WHIRL (Cohen); 1st Gen. XML IR (XXL, XIRQL, Elixir, JuruXML); 2nd Gen. XML IR (XRank, Timber, TIJAH, XSearch, FleXPath, CoXML, TopX, MarkLogic, Fast). Labels on the IR side: Prob. Datalog (Fuhr et al.), Proximal Nodes (Baeza-Yates et al.), Digital Libraries, Struct. Docs, Multimedia IR, INEX, XPath Full-Text, Deep Web Search, Faceted Search (Flamenco), Graph IR, Web Entity Search (Libra, Avatar, ExDB).]

WHIRL: IR over Relations [W.W. Cohen: SIGMOD98]

Add text-similarity selection and join to relational algebra
Example:
  Select * From Movies M, Reviews R
  Where M.Plot ~ fight And M.Year > 1990 And R.Rating > 3
  And M.Title ~ R.Title And M.Plot ~ R.Comment

Movies
  Title      Plot                                                        Year
  Matrix     In the near future ... computer hacker Neo ... fight training    1999
  Hero       In ancient China ... fights ... Broken Sword ... sword fight      2002
  Shrek 2    In Far Far Away ... our lovely hero fights with cat killer     2004

Reviews
  Title                   Comment                                        Rating
  Matrix 1                ... cool fights ... new techniques ...             4
  Matrix Reloaded         ... fights ... and more fights ... fairly boring     1
  Matrix Eigenvalues      ... matrix spectrum ... orthonormal ...            5
  Ying xiong aka. Hero    ... fight for peace ... sword fight ... dramatic colors    5

Scoring and ranking:
  $s(\langle x,y \rangle, q: A \sim B) = \mathrm{cosine}(x.A, y.B)$
    with $x_j \sim tf(\text{word } j \text{ in } x) \cdot idf(\text{word } j)$, with dampening & normalization
  $s(\langle x,y \rangle, q_1 \wedge \ldots \wedge q_m) = \prod_{i=1}^{m} s(\langle x,y \rangle, q_i)$

DB&IR for query-time data integration
More recent work: MinorThird, Spider, DBLife, etc.
But scoring models fairly ad hoc
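The WHIRL-style similarity join (cosine over dampened, normalized tf*idf vectors, with conjunctions scored as a product) can be sketched as follows. This is a minimal illustration, not Cohen's implementation: the tokenization, the exact dampening `1 + log(tf)`, the idf form `log(1 + n/df)`, and the function names are assumptions made for this sketch.

```python
import math
from collections import Counter

def tfidf_vectors(texts):
    """Map each text to a unit-length tf*idf vector (dict: word -> weight)."""
    docs = [Counter(t.lower().split()) for t in texts]
    n = len(docs)
    df = Counter(w for d in docs for w in d)  # document frequency per word
    vectors = []
    for d in docs:
        # Dampened tf times idf; both formulas are assumptions of this sketch.
        vec = {w: (1 + math.log(tf)) * math.log(1 + n / df[w]) for w, tf in d.items()}
        norm = math.sqrt(sum(v * v for v in vec.values())) or 1.0
        vectors.append({w: v / norm for w, v in vec.items()})
    return vectors

def cosine(u, v):
    """Cosine similarity of two unit-length sparse vectors."""
    return sum(weight * v.get(w, 0.0) for w, weight in u.items())

def similarity_join(left, right, top_k=3):
    """Rank pairs (x, y) for the condition x ~ y by cosine similarity."""
    vecs = tfidf_vectors(left + right)
    lv, rv = vecs[:len(left)], vecs[len(left):]
    pairs = [(cosine(a, b), x, y) for x, a in zip(left, lv) for y, b in zip(right, rv)]
    pairs = [p for p in pairs if p[0] > 0]
    return sorted(pairs, key=lambda p: -p[0])[:top_k]

def conjunction_score(scores):
    """Score of q1 AND ... AND qm as the product of the per-condition scores."""
    return math.prod(scores)

movie_titles = ["Matrix", "Hero", "Shrek 2"]
review_titles = ["Matrix 1", "Matrix Reloaded", "Ying xiong aka. Hero"]
ranked = similarity_join(movie_titles, review_titles)
for score, movie, review in ranked:
    print(f"{score:.3f}  {movie} ~ {review}")
```

Because every condition contributes a factor between 0 and 1, a pair that fails one similarity condition completely drops out of the ranking, which mirrors the Boolean semantics of a conjunctive Where clause.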

XXL: Early XML IR [Anja Theobald, GW: Adding Relevance to XML, WebDB00]

Motivation: Union of heterogeneous sources has no schema
Similarity-aware XPath:
  Which professors from Saarbruecken (SB)      //~Professor [//* = ~Saarbruecken]
  are teaching IR and have                     [//~Course [//* = ~IR] ]
  research projects on XML?                    [//~Research [//* = ~XML] ]

[Figure: two example XML trees. Professor: Name: Gerhard Weikum; Address (City: SB, Country: Germany, ...); Teaching with Course (Title: IR, Description: Information retrieval ..., Syllabus, Literature: Book, Article) and Seminar (Title, Contents: Ranked retrieval ...); Research with Project (Title: Intelligent Search of Heterogeneous XML Data, Funding: EU). Lecturer: Name: Ralf Schenkel; Address: Max-Planck Institute for Informatics, Germany; Activities: Scientific, Other (Name: INEX task coordinator (Initiative for the Evaluation of XML ...), Sponsor: EU).]

[Figure: ontology neighborhood of "professor" with weighted edges such as HYPONYM (0.749) and RELATED (0.48); neighboring concepts: lecturer, teacher, researcher, investigator, scientist, scholar, mentor, intellectual, academic / academician / faculty member, and for other word senses alchemist, magician, wizard, artist, director, primadonna.]

Scoring and ranking:
  tf*idf for content condition
  ontological similarity for relaxed tag condition
  score aggregation with probabilistic independence
Query expansion model: disjunction of tags
Concept similarity:
  Wu&Palmer: |path| through lca(x,y)
  Dice coeff.: 2 #(x,y) / (#x + #y) on Web

The Past: Lessons Learned

  DB&IR: added flexible ranking to (semi)structured querying to cope with schema and instance diversity
  but ranking seems ad hoc and not consistently good in benchmarks
  to win benchmark: tuning needed, but tuning is easier if ranking is principled!
  ontologies are mixed blessing: quality diverse, concept similarity subtle, danger of topic drift
  ontology-based query expansion (into large disjunctions) poses efficiency challenge:
    //~Professor [...]   becomes   //{Professor, Researcher, Lecturer, Scientist, Scholar, Academic, ...}[...]

[Figure: precision/recall plot; WordNet hypernym paths over the concepts entity, substance, solid, food, produce, element, edible fruit, pome, apple, Golden Delicious, gold.]

Outline
  Past: Matter, Antimatter, and Wormholes
  Present: XML and Graph IR
  Future: From Data to Knowledge

TopX: 2nd Generation XML IR [Martin Theobald, Ralf Schenkel, GW: VLDB05, VLDB Journal]

  Exploit tags & structure for better precision
  Can relax tag names & structure for better recall
  Principled ranking by probabilistic IR (Okapi BM25 for XML)
  Efficient top-k query processing (using improved TA)
  Robust ontology integration (self-throttling to avoid topic drift)
  Efficient query expansion (on demand, by extended TA)
  Relevance feedback for automatic query rewriting

Semantic XPath Full-Text query:
  /Article [ftcontains(//Person, "Max Planck")]
           [ftcontains(//Work, "quantum physics")]
  //Children[@Gender = "female"]//Birthdates

supported by TopX engine: http://infao5501.ag5.mpi-sb.mpg.de:8080/topx/
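For readers unfamiliar with the "TA" that the TopX slide builds on, here is a minimal sketch of the textbook threshold algorithm for top-k queries over score-sorted index lists. It is not the improved or extended TA used in TopX; the summation aggregate, the list layout, and the example scores are assumptions of this sketch.

```python
import heapq

def threshold_algorithm(index_lists, k):
    """Textbook TA: index_lists holds one list per query term, each a list of
    (doc, score) pairs sorted by descending score; scores are aggregated by sum.
    Returns the top-k (total_score, doc) pairs in descending order."""
    random_access = [dict(lst) for lst in index_lists]  # doc -> score per list
    seen = set()
    top = []  # min-heap holding the best k (total, doc) pairs found so far
    max_depth = max(len(lst) for lst in index_lists)
    for depth in range(max_depth):
        threshold = 0.0
        for lst in index_lists:
            if depth >= len(lst):
                continue  # exhausted list: unseen docs score 0 here
            doc, score = lst[depth]
            threshold += score  # best score an unseen doc could still reach
            if doc not in seen:
                seen.add(doc)
                total = sum(ra.get(doc, 0.0) for ra in random_access)
                heapq.heappush(top, (total, doc))
                if len(top) > k:
                    heapq.heappop(top)
        # Stop as soon as the k-th best total is at least the threshold.
        if len(top) == k and top[0][0] >= threshold:
            break
    return sorted(top, reverse=True)

# Hypothetical per-term scores for four documents.
list_a = [("d1", 0.9), ("d2", 0.8), ("d3", 0.3), ("d4", 0.1)]
list_b = [("d2", 0.7), ("d3", 0.6), ("d1", 0.2), ("d4", 0.1)]
result = threshold_algorithm([list_a, list_b], k=2)
print([doc for _, doc in result])
```

In this example the scan stops after three of the four positions, because no unseen document can beat the current second-best total once the threshold has dropped below it.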

http://topx.sourceforge.net

Commercial Break [Martin Theobald, Ralf Schenkel, GW: VLDB05]
  TopX demo today 3:30 to 5:30

Principled Ranking by Probabilistic IR

"God does not play dice." (Einstein)   IR does.

binary features, conditional independence of features [Robertson & Sparck-Jones 1976]
related to but different from statistical language models

odds for item d with terms d_i being relevant for query q = {q_1, ..., q_m}:

  $s(d,q) = \frac{P[d \in R(q) \mid \text{contents of } d]}{P[d \notin R(q) \mid \text{contents of } d]} = \frac{P[R \mid d]}{P[\bar{R} \mid d]}$

  $s(d,q) \sim \sum_{i \in q \cap d} \left( \log \frac{p_i}{1 - p_i} + \log \frac{1 - q_i}{q_i} \right)$   with $p_i = P[d_i \mid R]$, $q_i = P[d_i \mid \bar{R}]$

Relationship to tf*idf:

  $s(d,q) \sim \sum_{i \in q \cap d} \left( \log \frac{tf(i,d)}{\sum_k tf(k,d)} + \log \frac{\sum_k df(k)}{df(i)} \right)$

led to Okapi BM25 (wins TREC tasks); adapted and extended to XML in TopX, ...

Now estimate p_i and q_i values from relevance feedback, pseudo-relevance feedback, corpus statistics by MLE (with statistical smoothing) and store precomputed p_i, q_i in index:

  p_i ~ (# rel. docs) / # docs
  $p_i \approx \frac{tf(i,d)}{\sum_k tf(k,d)}$
  $q_i \approx P[d_i \mid \text{corpus}] = \frac{df(i)}{\sum_k df(k)}$
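The log-odds scoring of the probabilistic IR slide can be made concrete with a small sketch. It sums, over query terms that occur in the document, the weights log(p_i / (1 - p_i)) + log((1 - q_i) / q_i), and estimates q_i from document frequencies as df(i) / sum_k df(k). The document frequencies, the constant p_i = 0.5 (a common default when no relevance feedback is available), and the function names are assumptions made for illustration; this is not BM25 and not the TopX scoring function.

```python
import math

def estimate_q(df):
    """q_i ~ df(i) / sum_k df(k), estimated from corpus statistics."""
    total = sum(df.values())
    return {term: count / total for term, count in df.items()}

def log_odds_score(doc_terms, query_terms, p, q):
    """Sum of log(p_i/(1-p_i)) + log((1-q_i)/q_i) over terms in both query and doc."""
    score = 0.0
    for term in set(query_terms) & set(doc_terms):
        score += math.log(p[term] / (1 - p[term])) + math.log((1 - q[term]) / q[term])
    return score

# Hypothetical document frequencies for a tiny vocabulary.
df = {"quantum": 10, "physics": 40, "the": 950}
q = estimate_q(df)
p = {term: 0.5 for term in df}  # assumption: no relevance information yet

query = ["quantum", "physics", "the"]
score_specific = log_odds_score(["quantum", "physics"], query, p, q)
score_generic = log_odds_score(["the", "physics"], query, p, q)
print(round(score_specific, 3), round(score_generic, 3))
```

With p_i fixed at 0.5 the first logarithm vanishes, so the score reduces to an idf-like weight per matching term; a very frequent term such as "the" even receives a negative weight.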

Probabilistic Ranking for SQL [S. Chaudhuri, G. Das, V. Hristidis, GW: TODS06]

SQL queries that return many answers need ranking
Examples:
  Houses (Id, City, Price, #Rooms, View, Pool, SchoolDistrict, ...)
    Select * From Houses Where View = 'Lake' And City In ('Redmond', 'Bellevue')
  Movies (Id, Title, Genre, Country, Era, Format, Director, Actor1, Actor2, ...)
    Select * From Movies Where Genre = 'Romance' And Era = '90s'

odds for tuple d with attributes X ∪ Y relevant for query q: X_1 = x_1 ∧ ... ∧ X_m = x_m:

  $s(d,q) = \frac{P[R \mid d]}{P[\bar{R} \mid d]} \sim \frac{P[d \mid R]}{P[d \mid \bar{R}]} = \frac{P[XY \mid R]}{P[XY \mid \bar{R}]} \sim \frac{P[Y \mid R]}{P[Y]} \cdot \frac{1}{P[X \mid Y]}$

Estimate probs, exploiting workload W: $P[Y \mid R] \approx P[Y \mid XW]$
Example: frequent queries
  ... Where Genre = 'Romance' And Actor1 = 'Hugh Grant'
  ... Where Actor1 = 'Hugh Grant' And Actor2 = 'Julia Roberts'
boost HG and JR movies in ranking for Genre = 'Romance' And Era = '90s'
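A rough sketch of the workload idea: attribute values that the query leaves unspecified (Y) are boosted when past queries related to the current one asked for them more often than their frequency in the data would suggest. The sketch approximates only the factor P[Y | X, W] / P[Y] and ignores the 1 / P[X | Y] term. Conditioning on workload queries that share at least one predicate with the current query, the additive smoothing, and all names and rows below are assumptions of this illustration, not the estimators of the cited paper.

```python
def workload_score(row, query, workload, data, rank_attrs, smoothing=1.0):
    """Approximate P[Y | X, W] / P[Y] for the rank attributes not fixed by the query."""
    conditions = set(query.items())
    # Assumption: "related" workload queries share at least one predicate with the query.
    related = [w for w in workload if conditions & set(w.items())]
    score = 1.0
    for attr in rank_attrs:
        if attr in query:
            continue  # specified attributes (X) are identical for all answers
        value = row[attr]
        in_workload = sum(1 for w in related if w.get(attr) == value)
        in_data = sum(1 for r in data if r.get(attr) == value)
        p_workload = (in_workload + smoothing) / (len(related) + smoothing)
        p_data = (in_data + smoothing) / (len(data) + smoothing)
        score *= p_workload / p_data
    return score

# Hypothetical answer tuples for: Genre = 'Romance' And Era = '90s'
movies = [
    {"Title": "Movie A", "Genre": "Romance", "Era": "90s", "Actor1": "Hugh Grant"},
    {"Title": "Movie B", "Genre": "Romance", "Era": "90s", "Actor1": "Some Actor"},
]
# Hypothetical query workload, one dict of equality predicates per past query.
workload = [
    {"Genre": "Romance", "Actor1": "Hugh Grant"},
    {"Genre": "Romance", "Actor1": "Hugh Grant"},
    {"Actor1": "Hugh Grant", "Actor2": "Julia Roberts"},
    {"Genre": "Action"},
]
query = {"Genre": "Romance", "Era": "90s"}
scores = {m["Title"]: workload_score(m, query, workload, movies, ["Actor1"]) for m in movies}
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)
```

Both tuples satisfy the query equally well, so the ordering comes entirely from the workload: the movie whose lead actor is frequently requested together with Genre = 'Romance' is ranked first.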

From Tables and Trees to Graphs [BANKS, Discover, DBExplorer, KUPS, SphereSearch, BLINKS]

Schema-agnostic keyword search over multiple tables: graph of tuples with foreign-key relationships as edges
Example:
  Conferences (CId, Title, Location, Year)
  Journals (JId, Title)
  CPublications (PId, Title, CId)
  JPublications (PId, Title, Vol, No, Year)
  Authors (PId, Person)
  Editors (CId, Person)

  Select * From * Where * Contains "Gray, DeWitt, XML, Performance" And Year > 95

Result is connected tree with nodes that contain as many query keywords as possible

Related use cases:
  XML beyond trees
  RDF graphs
  ER graphs (e.g. from IE)
  social networks

Ranking:
  $s(\text{tree}, q) = \alpha \cdot \sum_{\text{nodes } n} nodeScore(n, q) + (1 - \alpha) \cdot \frac{1}{1 + \sum_{\text{edges } e} edgeScore(e)}$
  with nodeScore based on tf*idf or prob. IR and edgeScore reflecting importance of relationships (or confidence, authority, etc.)

Top-k querying: compute best trees, e.g. Steiner trees (NP-hard)

The Present: Observations & Opportunities

  Probabilistic IR and statistical language models yield principled ranking and high effectiveness
    (related to prob. relational models (Suciu, Getoor, ...) but different)
  Structural similarity and ranking based on tree edit distance (FleXPath, Timber, ...)
    [Figure: two small movie trees with differently arranged actor, director, and plot nodes]
  Aim for comprehensive XML ranking model capturing content, structure, ontologies
  Aim to generate structure skeleton in XPath query from user feedback:
    keywords: life physicist Max Planck
    query: //article[//person "Max Planck"] [//category "physicist"] //biography
  Good progress on performance but still many open efficiency issues
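The tree-ranking formula from the slide "From Tables and Trees to Graphs" can be illustrated with a few lines of code. The sketch assumes the form s(tree, q) = alpha * sum of node scores + (1 - alpha) / (1 + sum of edge scores); the weight symbol alpha is not legible in the extracted slide text and is an assumption here, as are the example scores. In this form a larger edge sum lowers the score, so the sketch treats edgeScore as a cost-like weight.

```python
def tree_score(node_scores, edge_scores, alpha=0.5):
    """Combine content relevance of the nodes with compactness of the tree."""
    node_part = sum(node_scores)
    edge_part = 1.0 / (1.0 + sum(edge_scores))
    return alpha * node_part + (1.0 - alpha) * edge_part

# Two hypothetical answer trees that contain the same keyword-matching nodes.
candidates = {
    "compact": ([0.9, 0.8], [1.0]),                # matches joined by one edge
    "sprawling": ([0.9, 0.8], [1.0, 1.0, 1.0]),    # same matches, three edges apart
}
best = max(candidates, key=lambda name: tree_score(*candidates[name]))
print(best)
```

Scoring a given tree is cheap; the expensive part noted on the slide is finding the best trees in the first place, which corresponds to a Steiner-tree problem.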

Outline
  Past: Matter, Antimatter, and Wormholes
  Present: XML and Graph IR
  Future: From Data to Knowledge

Knowledge Queries

Turn the Web, Web2.0, and Web3.0 into the world's most comprehensive knowledge base (semantic DB)!
Answer knowledge queries such as:
  proteins that inhibit both protease and some other enzyme
  neutron stars with Xray bursts > 10^40 erg s^-1 & black holes in 10
  differences in Rembetiko music from Greece and from Turkey
  connection between Thomas Mann and Goethe
  market impact of Web2.0 technology in December 2006
  sympathy or antipathy for Germany from May to August 2006
  Nobel laureate who survived both world wars and his children
  drama with three women making a prophecy to a British nobleman that he will become king

Three Roads to Knowledge

  Handcrafted High-Quality Knowledge Bases
    (Semantic-Web-style ontologies, encyclopedias, etc.)
  Large-scale Information Extraction & Harvesting
    (using pattern matching, NLP, statistical learning, etc. for product search, Web entity/object search, ...)
  Social Wisdom from Web 2.0 Communities
    (social tagging, folksonomies, human computing, e.g.: del.icio.us, flickr, answers.yahoo, iknow.baidu, ...)

High-Quality Knowledge Sources

universal common-sense ontologies:
  SUMO (Suggested Upper Merged Ontology): 60 000 OWL axioms
  Cyc: 5 Mio. facts (OpenCyc: 2 Mio. facts)
domain-specific ontologies:
  UMLS (Unified Medical Language System): 1 Mio. biomedical concepts, 135 categories, 54 relations (e.g. virus causes disease | symptom)
  GeneOntology, etc.
thesauri and concept networks:
  WordNet: 200 000 concepts (word senses) and hypernym/hyponym relations
  can be cast into OWL-lite (or typed graph with statistical weights)
lexical sources:
  Wikipedia (1.8 Mio. articles, 40 Mio. links, 100 languages) etc.
hand-tagged natural-language corpora:
  TEI (Text Encoding Initiative) markup of historic encyclopedia
  FrameNet: sentences classified into frames with semantic roles
growing with strong momentum

High-Quality Knowledge Sources

General-purpose thesauri and concept networks: WordNet family
can be cast into OWL-lite or into graph, with weights for relation strengths (derived from co-occurrence statistics)

  enzyme -- (any of several complex proteins that are produced by cells and act as catalysts in specific biochemical reactions)
    => protein -- (any of a large group of nitrogenous organic compounds that are essential constituents of living cells; ...)
      => macromolecule, supermolecule
      ...
        => organic compound -- (any compound of carbon and another element or a radical)
        ...
    => catalyst, accelerator -- ((chemistry) a substance that initiates or accelerates a chemical reaction without itself being affected)
      => activator -- ((biology) any agency bringing about activation; ...)

High-Quality Knowledge Sources

Wikipedia and other lexical sources
[Figure: Wikipedia page screenshot]

Exploit Hand-Crafted Knowledge

Wikipedia, WordNet, and other lexical sources

  {{Infobox_Scientist
  | name = Max Planck
  | birth_date = [[April 23]], [[1858]]
  | birth_place = [[Kiel]], [[Germany]]
  | death_date = [[October 4]], [[1947]]
  | death_place = [[Göttingen]], [[Germany]]
  | residence = [[Germany]]
  | nationality = [[Germany|German]]
  | field = [[Physicist]]
  | work_institution = [[University of Kiel]]
    [[Humboldt-Universität zu Berlin]]
    [[Georg-August-Universität Göttingen]]
  | alma_mater = [[Ludwig-Maximilians-Universität München]]
  | doctoral_advisor = [[Philipp von Jolly]]
  | doctoral_students = [[Gustav Ludwig Hertz]]
  | known_for = [[Planck's constant]], [[Quantum mechanics|quantum theory]]
  | prizes = [[Nobel Prize in Physics]] (1918)

YAGO: Yet Another Great Ontology [F. Suchanek, G. Kasneci, GW: WWW 2007]

  Turn Wikipedia into explicit knowledge base (semantic DB)
  Exploit hand-crafted categories and templates
  Represent facts as explicit knowledge triples:
    relation (entity1, entity2), drawn as an edge from entity1 to entity2 labeled with the relation
    (in 1st-order logic, compatible with RDF, OWL-lite, XML, etc.)
  Map (and disambiguate) relations into WordNet concept DAG
  Examples:
    Max_Planck bornIn Kiel
    Kiel isInstanceOf City

YAGO Knowledge Representation

  Knowledge Base    # Facts
  KnowItAll         30 000
  SUMO              60 000
  WordNet           200 000
  OpenCyc           300 000
  Cyc               5 000 000
  YAGO              6 000 000

Accuracy: 97%

[Figure: YAGO graph in three layers. Concepts: Person, Location subclass of Entity; Scientist subclass of Person; Biologist, Physicist subclass of Scientist; City, Country subclass of Location. Individuals: Max_Planck instanceOf Physicist; Kiel instanceOf City; Max_Planck bornIn Kiel; Max_Planck hasWon Nobel Prize; Max_Planck diedOn October 4, 1947; Max_Planck bornOn April 23, 1858; Max_Planck FatherOf Erwin_Planck. Words: "Max Planck", "Max Karl Ernst Ludwig Planck", and "Dr. Planck" each means Max_Planck.]

Online access and download at http://www.mpi-inf.mpg.de/~suchanek/yago/

NAGA: Graph IR on YAGO [G. Kasneci et al.: WWW07]

Graph-based search on YAGO-style knowledge bases with built-in ranking based on confidence and informativeness
statistical language model for result graphs

conjunctive queries:
  $x bornIn Kiel
  $x isa scientist

queries with regular expressions:
  $x (hasFirstName | hasLastName) Ling
  $x isa scientist
  $x (coAuthor | advisor)* Beng Chin Ooi
  $x worksFor $y
  $y locatedIn* Zhejiang
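To make the triple representation and the conjunctive queries concrete, here is a minimal sketch that stores facts as (entity1, relation, entity2) tuples and evaluates a conjunctive query by extending variable bindings one pattern at a time. It covers neither NAGA's regular-expression paths nor its ranking; the fact list (including the made-up entity Some_Sailor) and the function names are assumptions for illustration.

```python
def match(pattern, triple, binding):
    """Extend binding so that pattern equals triple, or return None.
    Terms starting with '$' are variables; everything else is a constant."""
    extended = dict(binding)
    for p, t in zip(pattern, triple):
        if p.startswith("$"):
            if extended.setdefault(p, t) != t:
                return None  # variable already bound to a different value
        elif p != t:
            return None
    return extended

def conjunctive_query(patterns, facts):
    """Return all variable bindings that satisfy every pattern."""
    bindings = [{}]
    for pattern in patterns:
        extended_bindings = []
        for b in bindings:
            for fact in facts:
                extended = match(pattern, fact, b)
                if extended is not None:
                    extended_bindings.append(extended)
        bindings = extended_bindings
    return bindings

# Hypothetical mini knowledge base in subject-relation-object order.
facts = [
    ("Max_Planck", "bornIn", "Kiel"),
    ("Max_Planck", "isa", "scientist"),
    ("Kiel", "isInstanceOf", "City"),
    ("Some_Sailor", "bornIn", "Kiel"),
]
answers = conjunctive_query([("$x", "bornIn", "Kiel"), ("$x", "isa", "scientist")], facts)
print(answers)
```

The nested-loop evaluation is quadratic and only meant to show the semantics; a real engine would index the triples by relation and by entity.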

Ranking Factors

Confidence: Prefer results that are likely to be correct
  Certainty of IE
  Authenticity and Authority of Sources
  Examples:
    bornIn (Max Planck, Kiel) from "Max Planck was born in Kiel" (Wikipedia)
    livesIn (Elvis Presley, Mars) from "They believe Elvis hides on Mars" (Martian Bloggeria)

Informativeness: Prefer results that are likely important; may prefer results that are likely new to user
  Frequency in answer
  Frequency in corpus (e.g. Web)
  Frequency in query log
  Examples:
    q: isa (Einstein, $y)      isa (Einstein, scientist), isa (Einstein, vegetarian)
    q: isa ($x, vegetarian)    isa (Einstein, vegetarian), isa (Al Nobody, vegetarian)

Compactness: Prefer results that are tightly connected
  Size of answer graph
  [Figure: answer graphs connecting Einstein, Bohr, Tom Cruise, vegetarian, Nobel Prize, and 1962 via the edges isa, won, bornIn, diedIn]

Information Extraction (IE): Text to Records

  Person             BirthDate     BirthPlace    ...
  Max Planck         4/23, 1858    Kiel
  Albert Einstein    3/14, 1879    Ulm
  Mahatma Gandhi     10/2, 1869    Porbandar

  Person        ScientificResult
  Max Planck    Quantum Theory

  Constant             Value           Dimension
  Planck's constant    6.226×10^23     Js

  Person        Collaborator
  Max Planck    Albert Einstein
  Max Planck    Niels Bohr

combine NLP, pattern matching, lexicons, statistical learning

Knowledge Acquisition from the Web

Learn Semantic Relations from Entire Corpora at Large Scale (as exhaustively as possible but with high accuracy)
Examples:
  all cities, all basketball players, all composers
  headquarters of companies, CEOs of companies, synonyms of proteins
  birthdates of people, capitals of countries, rivers in cities
  which musician plays which instruments
  who discovered or invented what
  which enzyme catalyzes which biochemical reaction

Existing approaches and tools (Snowball [Gravano et al. 2000], KnowItAll [Etzioni et al. 2004], ...):
almost-unsupervised pattern matching and learning:
  seeds (known facts) -> patterns (in text) -> (extraction) rule -> (new) facts

Methods for Web-Scale Fact Extraction

  seeds -> text -> rules -> new facts

Example:
  seeds                   text                             rules
  city (Seattle)          in downtown Seattle              in downtown X
  city (Seattle)          Seattle and other towns          X and other towns
  city (Las Vegas)        Las Vegas and other towns        X and other towns
  plays (Zappa, guitar)   playing guitar: ... Zappa        playing Y: ... X
  plays (Davis, trumpet)  Davis ... blows trumpet          X ... blows Y

  text                    new facts
  in downtown Beijing     city (Beijing)
  Coltrane blows sax      plays (Coltrane, sax)

  new facts               text                             new rules
  city (Beijing)          old center of Beijing            old center of X
  plays (C., sax)         sax player Coltrane              Y player X

Assessment of facts & generation of rules based on statistics
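The seeds, text, rules, new facts loop for the unary relation city(X) can be sketched as follows. This is a deliberately naive illustration of the bootstrapping idea, not the method of Snowball or KnowItAll: it performs a single iteration, does no statistical assessment of patterns or facts, and uses a capitalized-word heuristic for entity names. The helper names, the slot marker, and the text snippets are assumptions.

```python
import re

SLOT = "<X>"  # placeholder marking the entity position in a surface pattern
NAME = r"([A-Z][a-z]+(?: [A-Z][a-z]+)*)"  # heuristic: run of capitalized words

def learn_patterns(seeds, snippets):
    """Turn every snippet that mentions a seed into a pattern with a slot."""
    patterns = set()
    for snippet in snippets:
        for seed in seeds:
            if seed in snippet:
                patterns.add(snippet.replace(seed, SLOT, 1))
    return patterns

def apply_patterns(patterns, snippets):
    """Match the learned patterns against new text and collect slot fillers."""
    facts = set()
    for pattern in patterns:
        prefix, suffix = pattern.split(SLOT, 1)
        regex = re.compile(re.escape(prefix) + NAME + re.escape(suffix))
        for snippet in snippets:
            found = regex.search(snippet)
            if found:
                facts.add(found.group(1))
    return facts

seeds = {"Seattle"}
training_text = ["in downtown Seattle", "Seattle and other towns"]
new_text = ["a hotel in downtown Beijing", "Las Vegas and other towns", "the old center of Beijing"]

patterns = learn_patterns(seeds, training_text)
new_cities = apply_patterns(patterns, new_text)
print(sorted(patterns), sorted(new_cities))
```

Feeding the new facts back in as seeds would let a second iteration learn the pattern "the old center of <X>" from the third snippet, which is where the statistical assessment mentioned on the slide becomes necessary to keep noisy patterns out.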

Rules can be more sophisticated:
  playing NN: (ADJ|ADV)* NP & class(NN)=instrument & class(head(NP))=person -> plays(head(NP), NN)

Performance of Web-IE

State-of-the-art precision/recall results:

  relation        precision    recall    corpus       systems
  countries       80%          90%       Web          KnowItAll
  cities          80%          ???       Web          KnowItAll
  scientists      60%          ???       Web          KnowItAll
  headquarters    90%          50%       News         Snowball, LEILA
  birthdates      80%          70%       Wikipedia    LEILA
  instanceOf      40%          20%       Web          Text2Onto, LEILA
  Open IE         80%          ???       Web          TextRunner

precision value-chain: entities 80%, attributes 70%, facts 60%, events 50%

Anecdotal evidence:
  invented (A.G. Bell, telephone)
  married (Hillary Clinton, Bill Clinton)
  isa (yoga, relaxation technique)
  isa (zearalenone, mycotoxin)
  contains (chocolate, theobromine)
  contains (Singapore sling, gin)
  invented (Johannes Kepler, logarithm tables)
  married (Segolene Royal, Francois Hollande)
  isa (yoga, excellent way)
  isa (your day, good one)
  contains (chocolate, raisins)
  plays (the liver, central role)
  makes (everybody, mistakes)

Beyond Surface Learning with LEILA

Learning to Extract Information by Linguistic Analysis [F. Suchanek, G. Ifrim, GW: KDD06]

Limitation of surface patterns: who discovered or invented what
  "Tesla's work formed the basis of AC electric power"
  "Al Gore funded more work for a better basis of the Internet"

Almost-unsupervised Statistical Learning with Dependency Parsing
  (Cologne, Rhine), (Cairo, Nile), ...
  (Cairo, Rhine), (Rome, 0911), (, [0..9]*), ...

  Cologne lies on the banks of the Rhine
  People in Cairo like wine from the Rhine valley
  Paris was founded on an island in the Seine   ->   (Paris, Seine)

  [Figure: dependency parses of the three sentences, with constituent labels (NP, VP, PP) and link labels (Ss, Sp, Pv, MVp, Mp, Js, Jp, Ds, Dg, DMc, AN, Os)]

LEILA outperforms other Web-IE methods in terms of precision, recall, F1, but:
  dependency parser is slow
  one relation at a time

IE Efficiency and Accuracy Tradeoffs
[see also tutorials by Cohen, Doan/Ramakrishnan/Vaithyanathan, Agichtein/Sarawagi]

IE is cool, but what's in it for DB folks?
  precision vs. recall: two-stage processing (filter pipeline)
    1) recall-oriented harvesting
    2) precision-oriented scrutinizing
  preprocessing
    indexing: NLP trees & graphs, N-grams, PoS-tag patterns?
    exploit ontologies? exploit usage logs?
  turn crawl&extract into set-oriented query processing
    candidate finding
    efficient phrase, pattern, and proximity queries
  optimizing entire text-mining workflows [Ipeirotis et al.: SIGMOD06]

The Future: Challenges

  Generalize YAGO approach (Wikipedia + WordNet)
  Methods for comprehensive, highly accurate mappings across many knowledge sources
    cross-lingual, cross-temporal
    scalable in size, diversity, number of sources
  Pursue DB support towards efficient IE (and NLP)
  Achieve Web-scale IE throughput that can sustain rate of new content production (e.g. blogs) with > 90% accuracy and Wikipedia-like coverage
  Integrate handcrafted knowledge with NLP/ML-based IE
  Incorporate social tagging and human computing

Outline
  Past: Matter, Antimatter, and Wormholes
  Present: XML and Graph IR
  Future: From Data to Knowledge

Major Trends in DB and IR

  Database Systems             Information Retrieval
  malleable schema (later)     deep NLP, adding structure
  record linkage               info extraction
  graph mining                 entity-relationship graph IR
  dataspaces                   ontologies
  Web objects                  statistical language models
  data uncertainty             ranking
  programmability              search as Web Service
  Web 2.0                      Web 2.0

Conclusion

DB&IR integration agenda:
  models: ranking, ontologies, prob. SQL?, graph IR?
  languages and APIs: XQuery Full-Text++?
  systems: drop SQL, go light-weight? combine with P2P, Deep Web, ...?
Rethink progress measures and experimental methodology
Address killer app(s) and grand challenge(s):
  from data to knowledge (Web, products, enterprises)
  integrate knowledge bases, info extraction, social wisdom
  cope with uncertainty; ranking as first-class principle
Bridge cultural differences between DB and IR:
  co-locate SIGIR and SIGMOD

DB&IR: Both Sides Now

Joni Mitchell (1969): Both Sides Now

  I've looked at life from both sides now,
  From up and down, and still somehow
  It's life's illusions I recall.
  I really don't know life at all.

Thank You!
