Big Data Integration - Hong Kong University of Science and ...


Big Data Integration Xin Luna Dong (Google Inc.) Divesh Srivastava (AT&T Labs-Research) What is Big Data Integration? Big data integration = Big data + data integration Data integration: easy access to multiple data sources [DHI12] Virtual: mediated schema, query reformulation, link + fuse answers Warehouse: materialized data, easy querying, consistency issues Big data: all about the Vs Size: large volume of data, collected and analyzed at high velocity Complexity: huge variety of data, of questionable veracity Utility: data of considerable value 2

What is Big Data Integration? Big data integration = Big data + data integration Data integration: easy access to multiple data sources [DHI12] Virtual: mediated schema, query reformulation, link + fuse answers Warehouse: materialized data, easy querying, consistency issues Big data in the context of data integration: still about the Vs Size: large volume of sources, changing at high velocity Complexity: huge variety of sources, of questionable veracity Utility: sources of considerable value 3 Outline Motivation

Why do we need big data integration? How has small data integration been done? Challenges in big data integration Schema alignment Record linkage Data fusion Emerging topics 4 Why Do We Need Big Data Integration? Building web-scale knowledge bases MSR knowledge base A Little Knowledge Goes a Long Way. NELL Google knowledge graph

5 Why Do We Need Big Data Integration? Reasoning over linked data 6 Why Do We Need Big Data Integration? Geo-spatial data fusion http://axiomamuse.wordpress.com/2011/04/18/ 7 Why Do We Need Big Data Integration? Scientific data analysis

http://scienceline.org/2012/01/from-index-cards-to-information-overload/ 8 Outline Motivation Why do we need big data integration? How has small data integration been done? Challenges in big data integration Schema alignment Record linkage Data fusion Emerging topics 9 Small Data Integration: What Is It? Data integration = solving lots of jigsaw puzzles

Each jigsaw puzzle (e.g., Taj Mahal) is an integrated entity Each piece of a puzzle comes from some source Small data integration = solving small puzzles 10 Small Data Integration: How is it Done? Small data integration: alignment + linkage + fusion Schema alignment: mapping of structure (e.g., shape) Schema Alignment ? Record Linkage

Data Fusion 11 Small Data Integration: How is it Done? Small data integration: alignment + linkage + fusion Schema alignment: mapping of structure (e.g., shape) Schema Alignment ? Record Linkage Data Fusion 12 Small Data Integration: How is it

Done? Small data integration: alignment + linkage + fusion Record linkage: matching based on content (e.g., color, pattern) Schema Alignment Record Linkage Data Fusion 13 Small Data Integration: How is it Done? Small data integration: alignment + linkage + fusion Record linkage: matching based on content (e.g., color, pattern)

Schema Alignment Record Linkage Data Fusion 14 Small Data Integration: How is it Done? Small data integration: alignment + linkage + fusion Record linkage: matching based on content (e.g., color, pattern) Schema Alignment Record Linkage Data Fusion 15

Small Data Integration: How is it Done? Small data integration: alignment + linkage + fusion Data fusion: reconciliation of mismatching content (e.g., pattern) Schema Alignment Record Linkage Data Fusion 16 Small Data Integration: How is it Done? Small data integration: alignment + linkage + fusion

Data fusion: reconciliation of mismatching content (e.g., pattern) Schema Alignment Record Linkage Data Fusion 17 Small Data Integration: How is it Done? Small data integration: alignment + linkage + fusion Data fusion: reconciliation of mismatching content (e.g., pattern) Schema Alignment Record Linkage

Data Fusion 18 Outline Motivation Why do we need big data integration? How has small data integration been done? Challenges in big data integration Schema alignment Record linkage Data fusion Emerging topics 19 BDI: Why is it Challenging? Data integration = solving lots of jigsaw puzzles

Big data integration = solving big, messy puzzles, e.g., with missing, duplicate, damaged pieces 20 Case Study I: Domain Specific Data [DMP12] Goal: analysis of domain-specific structured data across the Web Questions addressed: How is the data about a given domain spread across the Web? How easy is it to discover entities, sources in a given domain? How much value do the tail entities in a given domain have? 21 Domain Specific Data: Spread How many sources needed to build a complete DB for a domain?

[DMP12] looked at 9 domains with the following properties:
Access to large comprehensive databases of entities in the domain
Entities have attributes that are (nearly) unique identifiers, e.g., ISBN for Books, phone number or homepage for Restaurants
Methodology of case study:
Used the entire web cache of the Yahoo! search engine
Webpage has an entity if it contains an identifying attribute
Aggregate the set of all entities found on each website (source) 22

Domain Specific Data: Spread
[Plots of recall vs. # of sources, per domain]
1-coverage: top-10 sources: 93%, top-100: 100% (strong aggregator source) 23
5-coverage: top-5000 sources: 90%, top-100K: 95% 24
1-coverage: top-100 sources: 80%, top-10K: 95% 25
5-coverage: top-100 sources: 35%, top-10K: 65% 26
All reviews are distinct; top-100 sources: 65%, top-1000: 85% 27

Domain Specific Data: Connectivity
How well are the sources connected in a given domain? Do you have to be a search engine to find domain-specific sources? [DMP12] considered the entity-source graph for various domains

Bipartite graph with entities and sources (websites) as nodes Edge between entity e and source s if some webpage in s contains e Methodology of case study: Study graph properties, e.g., diameter and connected components 28 Domain Specific Data: Connectivity Almost all entities are connected to each other Largest connected component has more than 99% of entities 29 Domain Specific Data: Connectivity High redundancy and overlap enable use of bootstrapping

Low diameter ensures that most sources can be found quickly 30 Domain Specific Data: Lessons Learned Spread: Even for domains with strong aggregators, we need to go to the long tail of sources to build a reasonably complete database Especially true if we want k-coverage for boosting confidence Connectivity: Sources in a domain are well-connected, with a high degree of content redundancy and overlap

Remains true even when head aggregator sources are removed 31

Case Study II: Deep Web Quality [LDL+13]
Study on two domains: belief of clean data; poor quality data can have big impact

Domain  #Sources  Period   #Objects  #Local attrs  #Global attrs  Considered items
Stock   55        7/2011   1000*20   333           153            16000*20
Flight  38        12/2011  1200*31   43            15             7200*31
32

Deep Web Quality
Is the data consistent? Tolerance to 1% value difference 33

Deep Web Quality

Why such inconsistency? Semantic ambiguity, e.g.:
Yahoo! Finance: Day's Range: 93.80-95.71; 52wk Range: 25.38-95.71
Nasdaq: 52 Wk: 25.38-93.72
34

Deep Web Quality
Why such inconsistency? Unit errors, e.g., 76.82B vs. 76,821,000
35

Deep Web Quality
Why such inconsistency? Pure errors, e.g., the same flight's times as reported by three sources:
FlightView  FlightAware  Orbitz
6:15 PM     6:22 PM      6:15 PM
9:40 PM     8:33 PM      9:54 PM
36

Deep Web Quality
Why such inconsistency? Random sample of 20 data items + 5 items with largest # of values 37

Deep Web Quality
Copying between sources? 38

Deep Web Quality Copying on erroneous data? 39 Deep Web Quality: Lessons Learned Deep Web data has considerable inconsistency Even in domains where poor quality data can have big impact Semantic ambiguity, out-of-date data, unexplainable errors Deep Web sources often copy from each other Copying can happen on erroneous data, spreading poor quality data 40

BDI: Why is it Challenging? Number of structured sources: Volume Millions of websites with domain specific structured data [DMP12] 154 million high quality relational tables on the web [CHW+08] 10s of millions of high quality deep web sources [MKK+08] 10s of millions of useful relational tables from web lists [EMH09] Challenges: Difficult to do schema alignment Expensive to warehouse all the integrated data Infeasible to support virtual integration 41 BDI: Why is it Challenging? Rate of change in structured sources: Velocity

43,000 to 96,000 deep web sources (with HTML forms) [B01] 450,000 databases, 1.25M query interfaces on the web [CHZ05] 10s of millions of high quality deep web sources [MKK+08] Many sources provide rapidly changing data, e.g., stock prices Challenges: Difficult to understand evolution of semantics Extremely expensive to warehouse data history Infeasible to capture rapid data changes in a timely fashion 42 BDI: Why is it Challenging? Representation differences among sources: Variety Free-text extractors

43 BDI: Why is it Challenging? Poor data quality of deep web sources [LDL+13]: Veracity 44 Outline Motivation Schema alignment Overview Techniques for big data Record linkage Data fusion Emerging topics 45

Schema Alignment Matching based on structure (e.g., shape) ? 46 Schema Alignment X Matching based on structure (e.g., shape) ? 47

Schema Alignment: Three Steps [BBR11] Schema alignment: mediated schema + matching + mapping Enables linkage, fusion to be semantically meaningful Mediated Schema S1 (name, hPhone, hAddr, oPhone, oAddr) S2 (name, phone, addr, email) S3 a: (id, name); b: (id, resPh, workPh)

S4 (name, pPh, pAddr) S5 (name, wPh, wAddr) Attribute Matching Schema Mapping 48 Schema Alignment: Three Steps Schema alignment: mediated schema + matching + mapping Enables domain specific modeling Mediated Schema

Attribute Matching S1 (name, hPhone, hAddr, oPhone, oAddr) S2 (name, phone, addr, email) S3 a: (id, name); b: (id, resPh, workPh) S4 (name, pPh, pAddr) S5

(name, wPh, wAddr) MS (n, pP, pA, wP, wA) Schema Mapping 49 Schema Alignment: Three Steps Schema alignment: mediated schema + matching + mapping Identifies correspondences between schema attributes Mediated Schema Attribute Matching Schema Mapping

S1 (name, hPhone, hAddr, oPhone, oAddr) S2 (name, phone, addr, email) S3 a: (id, name); b: (id, resPh, workPh) S4 (name, pPh, pAddr) S5 (name, wPh, wAddr)

MS (n, pP, pA, wP, wA)
AM (attribute matching):
MS.n: S1.name, S2.name, S3a.name, ...
MS.pP: S1.hPhone, S3b.resPh, S4.pPh
MS.pA: S1.hAddr, S4.pAddr
MS.wP: S1.oPhone, S2.phone, ...
MS.wA: S1.oAddr, S2.addr, S5.wAddr 50

Schema Alignment: Three Steps
Schema alignment: mediated schema + matching + mapping
Specifies transformation between records in different schemas
Mediated Schema

Attribute Matching Schema Mapping S1 (name, hPhone, hAddr, oPhone, oAddr) S2 (name, phone, addr, email) S3 a: (id, name); b: (id, resPh, workPh) S4 (name, pPh, pAddr)

S5 (name, wPh, wAddr)
MS (n, pP, pA, wP, wA)
SM (GAV schema mapping):
MS(n, pP, pA, wP, wA) :- S1(n, pP, pA, wP, wA)
MS(n, _, _, wP, wA) :- S2(n, wP, wA, e)
MS(n, pP, _, wP, _) :- S3a(i, n), S3b(i, pP, wP)
MS(n, pP, pA, _, _) :- S4(n, pP, pA)
MS(n, _, _, wP, wA) :- S5(n, wP, wA) 51
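As an illustration of how these GAV rules populate the mediated schema, here is a minimal Python sketch (not from the tutorial); each source is assumed to be a list of tuples with attributes in the order listed above, and attributes a source does not provide are padded with None. The sample tuples at the bottom are made up.

# Minimal sketch: evaluate the GAV mapping rules above over in-memory sources.
def materialize_ms(s1, s2, s3a, s3b, s4, s5):
    ms = []
    # MS(n, pP, pA, wP, wA) :- S1(n, pP, pA, wP, wA)
    ms += [(n, pP, pA, wP, wA) for (n, pP, pA, wP, wA) in s1]
    # MS(n, _, _, wP, wA) :- S2(n, wP, wA, e)
    ms += [(n, None, None, wP, wA) for (n, wP, wA, _e) in s2]
    # MS(n, pP, _, wP, _) :- S3a(i, n), S3b(i, pP, wP)  -- join on id
    names = {i: n for (i, n) in s3a}
    ms += [(names[i], pP, None, wP, None) for (i, pP, wP) in s3b if i in names]
    # MS(n, pP, pA, _, _) :- S4(n, pP, pA)
    ms += [(n, pP, pA, None, None) for (n, pP, pA) in s4]
    # MS(n, _, _, wP, wA) :- S5(n, wP, wA)
    ms += [(n, None, None, wP, wA) for (n, wP, wA) in s5]
    return ms

# Tiny made-up example run:
print(materialize_ms(
    s1=[("Ann", "111", "NY", "222", "NJ")],
    s2=[("Bob", "333", "MA", "bob@example.com")],
    s3a=[(1, "Cal")], s3b=[(1, "444", "555")],
    s4=[("Dee", "666", "TX")],
    s5=[("Eve", "777", "WA")],
))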

Outline Motivation Schema alignment Overview Techniques for big data Record linkage Data fusion Emerging topics 52 BDI: Schema Alignment Volume, Variety

Integrating deep web query interfaces [WYD+04, CHZ05] Crawl, index deep web data [MKK+08] Extract structured data from web tables [CHW+08, LSC10, PS12, DFG+12] and web lists [GS09, EMH09] Dataspace systems [FHM05, HFM06, DHY07] Keyword search based data integration [TJM+08] Velocity Keyword search-based dynamic data integration [TIP10] 53

Space of Strategies
[Chart: Level of Semantic Integration (Low / Medium / High) vs. Availability of Integration Results (Now / Soon / Tomorrow), positioning Keyword Search, Probabilistic Integration, Domain Specific Integration, and Full Semantic Integration] 55

Dataspace Approach [FHM05,

HFM06]
Motivation: SDI approach (as-is) is infeasible for BDI
Volume, variety of sources: unacceptable up-front modeling cost
Velocity of sources: expensive to maintain integration results
Key insight: pay-as-you-go approach may be feasible
Start with simple, universally useful service
Iteratively add complexity when and where needed [JFH08]
Approach has worked for RDBMS, Web, Hadoop 56

Bootstrapping DI Systems [DDH08]
Thesis: completely automated data integration is feasible, but

Need to model uncertainty about semantics of attributes in sources
Automatically create a mediated schema from a set of sources
Uncertainty leads to probabilistic mediated schemas
P-mediated schemas offer benefits in modeling uncertainty
Automatically create mappings from sources to mediated schema
Probabilistic mappings use weighted attribute correspondences 57

Probabilistic Mediated Schemas [DDH08]
S1 (name, hPhone, hAddr, oPhone, oAddr); S4 (name, pPh, pAddr)
Mediated schemas: automatically created by inspecting sources
Clustering of source attributes
Volume, variety of sources lead to uncertainty in the accuracy of clustering 58
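As a toy illustration of this clustering step (not from [DDH08]), source attributes can be grouped by a simple name-similarity measure; different thresholds yield different candidate mediated schemas, which is exactly the uncertainty a P-mediated schema captures. The similarity function and thresholds below are illustrative assumptions:

# Toy sketch: cluster source attributes by name similarity to propose a mediated schema.
from difflib import SequenceMatcher

def similar(a, b):
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def cluster_attributes(attrs, threshold):
    clusters = []
    for a in attrs:
        for c in clusters:
            if any(similar(a, b) >= threshold for b in c):
                c.add(a)
                break
        else:
            clusters.append({a})
    return clusters

# Attributes of S1 and S4 from the running example; each threshold gives one
# candidate clustering, so a set of thresholds yields a set of candidate schemas.
attrs = ["name", "hPhone", "hAddr", "oPhone", "oAddr", "pPh", "pAddr"]
for t in (0.5, 0.7):
    print(t, cluster_attributes(attrs, t))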

Probabilistic Mediated Schemas [DDH08]
S1 (name, hPhone, hAddr, oPhone, oAddr); S4 (name, pPh, pAddr)
Example P-mediated schema MS:
M1({name}, {hPhone, pPh}, {oPhone}, {hAddr, pAddr}, {oAddr})
M2({name}, {hPhone}, {pPh, oPhone}, {hAddr}, {pAddr, oAddr})
M3({name}, {hPhone, pPh}, {oPhone}, {hAddr}, {pAddr}, {oAddr})
M4({name}, {hPhone}, {pPh, oPhone}, {hAddr}, {pAddr}, {oAddr})
MS = {(M1, 0.6), (M2, 0.4)} 59

Probabilistic Mappings [DHY07, DDH08]
Mapping between P-mediated schema and a source schema
S1 (name, hPhone, hAddr, oPhone, oAddr); S4 (name, pPh, pAddr)
Example mappings between M1 and S1:
G1({M1.n, name}, {M1.phP, hPhone}, {M1.phA, hAddr}, ...)
G2({M1.n, name}, {M1.phP, oPhone}, {M1.phA, oAddr}, ...)
G = {(G1, 0.6), (G2, 0.4)} 60

Probabilistic Mappings
Mapping between P-mediated schema and a source schema
Answering queries on P-mediated schema based on P-mappings
By table semantics: one mapping for all tuples in a table
By tuple semantics: different mappings are okay in a table 61

Probabilistic Mappings: By Table Semantics
Consider query Q1: SELECT name, pPh, pAddr FROM MS
S1:
name    hPhone    hAddr     oPhone    oAddr
Ken     111-1111  New York  222-2222  Summit
Barbie  333-3333  Summit    444-4444  New York
Result of Q1, under by table semantics, in the possible world G1({M1.n, name}, {M1.phP, hPhone}, {M1.phA, hAddr}, ...)
Q1R (Prob = 0.60):
name    pPh       pAddr     Map
Ken     111-1111  New York  G1
Barbie  333-3333  Summit    G1
62

Probabilistic Mappings: By Table Semantics
Consider query Q1: SELECT name, pPh, pAddr FROM MS (S1 as above)
Result of Q1, under by table semantics, in the possible world G2({M1.n, name}, {M1.phP, oPhone}, {M1.phA, oAddr}, ...)
Q1R (Prob = 0.40):
name    pPh       pAddr     Map
Ken     222-2222  Summit    G2
Barbie  444-4444  New York  G2
63

Probabilistic Mappings: By Table Semantics
Now consider query Q2: SELECT pAddr FROM MS (S1 as above)
Result of Q2, under by table semantics, across all possible worlds:
Q2R
pAddr     Prob
Summit    1.0
New York  1.0
64

Probabilistic Mappings: By Tuple Semantics
Consider query Q1: SELECT name, pPh, pAddr FROM MS (S1 as above)
Result of Q1, under by tuple semantics, in a possible world (mappings G1, G2 as above)
Q1R (Prob = 0.36):
name    pPh       pAddr     Map
Ken     111-1111  New York  G1
Barbie  333-3333  Summit    G1
65

Probabilistic Mappings: By Tuple Semantics
Consider query Q1: SELECT name, pPh, pAddr FROM MS (S1 as above)
Result of Q1, under by tuple semantics, in a possible world (mappings G1, G2 as above)
Q1R (Prob = 0.16):
name    pPh       pAddr     Map
Ken     222-2222  Summit    G2
Barbie  444-4444  New York  G2
66

Probabilistic Mappings: By Tuple Semantics
Consider query Q1: SELECT name, pPh, pAddr FROM MS (S1 as above)
Result of Q1, under by tuple semantics, in a possible world (mappings G1, G2 as above)
Q1R (Prob = 0.24):
name    pPh       pAddr     Map
Ken     111-1111  New York  G1
Barbie  444-4444  New York  G2
67

Probabilistic Mappings: By Tuple Semantics
Consider query Q1: SELECT name, pPh, pAddr FROM MS (S1 as above)
Result of Q1, under by tuple semantics, in a possible world (mappings G1, G2 as above)
Q1R (Prob = 0.24):
name    pPh       pAddr     Map
Ken     222-2222  Summit    G2
Barbie  333-3333  Summit    G1
68
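The probabilities on these slides can be reproduced with a short script. The following Python sketch (an illustration, not code from [DDH08]) evaluates Q1 and Q2 over S1 under the p-mapping G = {(G1, 0.6), (G2, 0.4)}: by table semantics draws one mapping for the whole table, by tuple semantics draws one mapping per tuple:

# Sketch: probabilities of Q1/Q2 answers under by-table vs. by-tuple semantics.
from itertools import product
from collections import defaultdict

# S1 tuples: (name, hPhone, hAddr, oPhone, oAddr)
S1 = [
    ("Ken", "111-1111", "New York", "222-2222", "Summit"),
    ("Barbie", "333-3333", "Summit", "444-4444", "New York"),
]
# G1 maps MS.pP/MS.pA to S1's home columns, G2 to its office columns
COLS = {"G1": (1, 2), "G2": (3, 4)}     # indices of (pPh, pAddr) in an S1 tuple
PROB = {"G1": 0.6, "G2": 0.4}

def q1(assignment):
    # Q1: SELECT name, pPh, pAddr FROM MS, with one mapping chosen per tuple
    return [(t[0], t[COLS[g][0]], t[COLS[g][1]]) for t, g in zip(S1, assignment)]

def worlds(by_table):
    # Possible worlds: one mapping for the whole table, or one per tuple (two tuples here)
    if by_table:
        return [((g, g), PROB[g]) for g in PROB]
    return [((g1, g2), PROB[g1] * PROB[g2]) for g1, g2 in product(PROB, repeat=2)]

for by_table in (True, False):
    label = "by table" if by_table else "by tuple"
    q2 = defaultdict(float)             # Q2: probability that each pAddr value appears
    for assignment, prob in worlds(by_table):
        print(label, assignment, round(prob, 2), q1(assignment))
        for addr in {row[2] for row in q1(assignment)}:
            q2[addr] += prob
    print(label, "Q2:", dict(q2))

Running it prints the four by-tuple worlds with probabilities 0.36, 0.24, 0.24, 0.16 and gives Q2 value probabilities of 1.0/1.0 under by-table semantics versus 0.76/0.76 under by-tuple semantics, matching the slides.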

Probabilistic Mappings: By Tuple Semantics
Now consider query Q2: SELECT pAddr FROM MS (S1 as above)
Result of Q2, under by tuple semantics, across all possible worlds
Note the difference with the result of Q2, under by table semantics
Q2R
pAddr     Prob
Summit    0.76
New York  0.76
69

WebTables [CHW+08]
Background: Google crawl of the surface web, reported in 2008
154M good relational tables, 5.4M attribute names, 2.6M schemas
ACSDb (schema, count)
[Plot: frequency of frequency-ranked schemas from the ACSDb, by rank]
70

WebTables: Keyword Search [CHW+08]
Query model: keyword search
Goal: Rank tables on the web in response to query keywords
Not web pages (can have multiple tables), not individual records
Challenges:
Web page features apply ambiguously to embedded tables
Web tables on a page may not all be relevant to a query
Web tables have specific features (e.g., schema elements) 71

WebTables: Keyword Search
Example keyword query: presidents of the US 72

WebTables: Keyword Search
FeatureRank: use table-specific features
Query-independent features
Query-dependent features
Linear regression estimator
Heavily weighted features
Result quality: fraction of high-scoring relevant tables
k   Naïve  FeatureRank
10  0.26   0.43
20  0.33   0.56
30  0.34   0.66
73
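The actual features and the trained linear estimator are described in [CHW+08]; the following sketch only illustrates the idea of scoring a table by a linear combination of query-independent and query-dependent features, with made-up feature names and weights:

# Hedged sketch of FeatureRank-style scoring; features and weights are illustrative.
def table_features(table, query_terms):
    header, rows = table["header"], table["rows"]
    body_text = " ".join(cell.lower() for row in rows for cell in row)
    return {
        "num_rows": len(rows),                                   # query-independent
        "num_cols": len(header),                                 # query-independent
        "hits_in_header": sum(t in " ".join(header).lower() for t in query_terms),
        "hits_in_body": sum(t in body_text for t in query_terms),
    }

WEIGHTS = {"num_rows": 0.01, "num_cols": 0.05, "hits_in_header": 2.0, "hits_in_body": 0.5}

def feature_rank(table, query_terms):
    feats = table_features(table, query_terms)
    return sum(WEIGHTS[name] * value for name, value in feats.items())

# Example: score a tiny table for the query "presidents of the us"
table = {"header": ["President", "Vice President"],
         "rows": [["George Washington", "John Adams"], ["John Adams", "Thomas Jefferson"]]}
print(feature_rank(table, ["president", "us"]))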

WebTables: Keyword Search
Example keyword query: presidents of the US 74

WebTables: Keyword Search
SchemaRank: also include schema coherency as a table feature
Use point-wise mutual information (pmi) derived from the ACSDb
p(a) = fraction of unique schemas containing attribute a
pmi(a,b) = log2(p(a,b)/(p(a)*p(b)))
Coherency = average pmi(a,b) over all a, b in attrs(R)
Result quality: fraction of high-scoring relevant tables
k   Naïve  FeatureRank  SchemaRank
10  0.26   0.43         0.47
20  0.33   0.56         0.59
30  0.34   0.66         0.68
75

WebTables: Keyword Search
Example keyword query: presidents of the US
T1(President, Vice President)
T2(President, Term, Party, Vice President)

T3(State, Governor, Party, Term) T4(State, Senator, Party, Term, Born On) T5(Chief Justice, Nominated By, Term) 76 WebTables: Keyword Search Example keyword query: presidents of the US T1(President, Vice President) T2(President, Term, Party, Vice President) T3(State, Governor, Party, Term) T4(State, Senator, Party, Term, Born On) T5(Chief Justice, Nominated By, Term)
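The pmi and coherency values worked out on the next few slides can be reproduced over these five schemas with a short script; the following is an illustrative sketch, not code from [CHW+08]:

# Sketch: compute pmi and schema coherency over the five example schemas above.
from itertools import combinations
from math import log2

SCHEMAS = {
    "T1": {"President", "Vice President"},
    "T2": {"President", "Term", "Party", "Vice President"},
    "T3": {"State", "Governor", "Party", "Term"},
    "T4": {"State", "Senator", "Party", "Term", "Born On"},
    "T5": {"Chief Justice", "Nominated By", "Term"},
}
N = len(SCHEMAS)

def p(*attrs):
    # fraction of schemas containing all the given attributes
    return sum(all(a in s for a in attrs) for s in SCHEMAS.values()) / N

def pmi(a, b):
    return log2(p(a, b) / (p(a) * p(b)))

def coherency(schema):
    pairs = list(combinations(sorted(SCHEMAS[schema]), 2))
    return sum(pmi(a, b) for a, b in pairs) / len(pairs)

print(round(pmi("President", "Vice President"), 2))   # 1.32
print(round(pmi("President", "Term"), 2))             # -0.68
print(round(coherency("T1"), 2), round(coherency("T2"), 2))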

pmi(a,b) = log2(p(a,b)/(p(a)*p(b))) pmi(President, Vice President) = log2(0.4/(0.4 * 0.4)) = 1.32 77 WebTables: Keyword Search Example keyword query: presidents of the US T1(President, Vice President) T2(President, Term, Party, Vice President) T3(State, Governor, Party, Term) T4(State, Senator, Party, Term, Born On) T5(Chief Justice, Nominated By, Term)

pmi(a,b) = log2(p(a,b)/(p(a)*p(b))) pmi(President, Vice President) = log2(0.4/(0.4 * 0.4)) = 1.32 pmi(President, Term) = log2(0.2/(0.4*0.8)) = -0.68 78 WebTables: Keyword Search Example keyword query: presidents of the US

T1(President, Vice President)
T2(President, Term, Party, Vice President)
T3(State, Governor, Party, Term)
T4(State, Senator, Party, Term, Born On)
T5(Chief Justice, Nominated By, Term)
Schema coherency = average pmi(a,b) over all a, b in attrs(R)
coherency(T1) = avg({1.32}) = 1.32
coherency(T2) = avg({1.32, -0.68, -0.26, 0.32, -0.26, -0.68}) = -0.15 79

Annotating Web Tables [LSC10]
Goal: given a Web table, which entities occur in which cells, what

are the column types, and the relationships between columns? Why is this challenging? Text in table cells often mention entities, but can be ambiguous Column headers, if present, do not use controlled vocabulary Benefits of solving this problem Permits use of relational, metadata-aware queries on Web tables Extracts knowledge from Web tables 80 Annotating Web Tables: Entities Goal: given a Web table, which entities occur in which cells, what are the column types, and the relationships between columns? 81

Annotating Web Tables: Column Types Goal: given a Web table, which entities occur in which cells, what are the column types, and the relationships between columns? US Politician 82 Annotating Web Tables: Relationships Goal: given a Web table, which entities occur in which cells, what are the column types, and the relationships between columns? US President US Vice President 83

Annotating Web Tables: Using a Catalog
A catalog consists of a type hierarchy, entities that are instances of (possibly multiple) types, and binary relationships
Type hierarchy: Entity > Person > Politician; Entity > University
Politician: P401 (George Washington), P420 (John Adams)
University: U101 (George Washington University), U107 (University of Washington)
Relation US President - US Vice President: (P401, P420), (P420, P471)
84

Annotating Web Tables: Using a Catalog
How good is it to label cell (r, c), containing text Drc, with entity E?
Similarity between Drc and L(E); e.g., the cell text "George Washington" could match politician P401 or university U101
(catalog as above) 85-87

Annotating Web Tables: Using a Catalog
How good is it to label column c with type T and cell (r, c) with E?
Entity E should belong to type T (but catalog may be incomplete)
(catalog as above) 88-90
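The full model in [LSC10] combines these signals in a probabilistic graphical model (described on the following slides). As a much-simplified illustration only, the sketch below scores candidate entity labels for a column's cells by string similarity to catalog lemmas, boosted when the entity's type agrees with an assumed column type; the similarity measure and bonus are illustrative assumptions, not the paper's scoring function:

# Much-simplified sketch (not the [LSC10] model): label cells of one column.
from difflib import SequenceMatcher

CATALOG = {
    "P401": {"lemma": "George Washington", "types": {"Person", "Politician"}},
    "P420": {"lemma": "John Adams", "types": {"Person", "Politician"}},
    "U101": {"lemma": "George Washington University", "types": {"University"}},
    "U107": {"lemma": "University of Washington", "types": {"University"}},
}

def similarity(text, lemma):
    return SequenceMatcher(None, text.lower(), lemma.lower()).ratio()

def best_labels(column_cells, column_type, type_bonus=0.3):
    labels = []
    for text in column_cells:
        scored = []
        for eid, entry in CATALOG.items():
            score = similarity(text, entry["lemma"])
            if column_type in entry["types"]:
                score += type_bonus      # cell label should agree with the column type
            scored.append((score, eid))
        labels.append((text, max(scored)[1]))
    return labels

# "George Washington" is ambiguous on its own; assuming the column type is
# Politician pushes the label toward P401 rather than U101.
print(best_labels(["George Washington", "John Adams"], "Politician"))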

Annotating Web Tables: Using a Catalog
Do entity annotations erc for cell (r, c) and er'c' for cell (r', c') vote for or against annotating column pair (c, c') with relation R?
(catalog as above, including the relation US President - US Vice President: (P401, P420), (P420, P471)) 91

Annotating Web Tables: Using a Catalog

Model table annotation using interrelated random variables, represented by a probabilistic graphical model Cell text (in Web table) and entity label (in catalog) Column header (in Web table) and type label (in catalog) Column type and cell entity (in Web table) Pair of column types (in Web table) and relation (in catalog) Entity pairs (in Web table) and relation (in catalog) 92 Annotating Web Tables: Using a Catalog Model table annotation using interrelated random variables, represented by a probabilistic graphical model

Cell text (in Web table) and entity label (in catalog) Column header (in Web table) and type label (in catalog) Column type and cell entity (in Web table) 93 Annotating Web Tables: Using a Catalog Model table annotation using interrelated random variables, represented by a probabilistic graphical model Pair of column types (in Web table) and relation (in catalog) Entity pairs (in Web table) and relation (in catalog) 94 Annotating Web Tables: Using a

Catalog Model table annotation using interrelated random variables, represented by a probabilistic graphical model Cell text (in Web table) and entity label (in catalog) Column header (in Web table) and type label (in catalog) Column type and cell entity (in Web table) Pair of column types (in Web table) and relation (in catalog) Entity pairs (in Web table) and relation (in catalog) Task of annotation amounts to searching for an assignment of values to the variables that maximizes the joint probability Problem is NP-hard in the general case

Use iterative belief propagation in factor graphs until convergence 95

Finding Related Tables [DFG+12]
Motivation: given a table T and a corpus C of tables, find tables T' in C that can be integrated with T to augment T's information 96-100

Finding Related Tables
Motivation: given a table T and a corpus C of tables, find tables T' in C that can be integrated with T to augment T's information
Examples of related tables:
Tables that are candidates for union, and add new entities
Tables that are candidates for join, and add new attributes 101

Finding Related Tables
More generally:
Are tables T and T' the results of applying queries Q and Q' on a virtual table U?
Are Q and Q' different, but have a similar select-project structure?
Is virtual table U coherent?
Problem: Find top-k tables with highest relatedness scores to T 102

Finding Related Tables: Entity Complement
Goal: Find top-k tables T' that are candidates for union with T
Methodology:
Entity consistency: T' should have the same type of entities as T
Entity expansion: T' should substantially add new entities to T
Schema consistency: T' and T should have similar schemas 103

Finding Related Tables: Entity Complement
Goal: Find top-k tables T' that are candidates for union with T
Entity consistency, entity expansion 104

Finding Related Tables: Entity Complement
Goal: Find top-k tables T' that are candidates for union with T
Schema consistency 105

Finding Related Tables: Entity Complement
Goal: Find top-k tables T' that are candidates for union with T
[DFG+12] use three signals to ensure entity complement tables:
WebIsA: noisy database of entities (155M) and types (1.5M)
Freebase: curated database of entities (16M) and types (600K)
WebTable labels: count co-occurrence of entities in Web tables
Relatedness score:
Use weighted Jaccard similarity on label sets for entity consistency
Use bipartite max-weight matching for schema consistency 106

Finding Related Tables: Schema Complement
Goal: Find top-k tables T' that are candidates for join with T
Methodology:
Coverage of entity set: T' should contain most of T's entities
Coherent schema expansion: use the ACSDb to measure the maximum benefit that a subset of attributes of T' can provide to T
Recall, ACSDb(Schema, Count) can be used for schema coherency 107

Finding Related Tables: Schema Complement
Goal: Find top-k tables T' that are candidates for join with T
Entity coverage 108

Finding Related Tables: Schema Complement
Goal: Find top-k tables T' that are candidates for join with T
Coherent schema expansion 109

Finding Related Tables: Efficiency Issues
Naïve approach: compute relatedness score for every table pair
Very expensive on large table corpora
Key idea: use filters to scale up computation of table relatedness
Fewer comparisons: use filters as blocking criteria to bucketize tables, and only perform relatedness comparisons within buckets
Faster comparisons: apply a sequence of filters, based on selectivity and computational efficiency of the filters
Useful filters:
Two tables must share entity column name or inferred label
Two tables must share at least n entities, n = 1, 2, 3 110
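As a hedged illustration of the ideas above (not the [DFG+12] implementation), the sketch below computes a weighted Jaccard relatedness score over label sets and applies a shared-entity blocking filter so that not every table pair has to be scored; all table data, labels, and thresholds are made up:

# Sketch: weighted Jaccard over label sets, with a cheap shared-entity blocking filter.
from itertools import combinations

def weighted_jaccard(labels_a, labels_b):
    # labels_*: dict mapping a label (e.g., an entity or type name) to a weight
    keys = set(labels_a) | set(labels_b)
    num = sum(min(labels_a.get(k, 0.0), labels_b.get(k, 0.0)) for k in keys)
    den = sum(max(labels_a.get(k, 0.0), labels_b.get(k, 0.0)) for k in keys)
    return num / den if den else 0.0

def shares_entities(table_a, table_b, n=1):
    return len(set(table_a["entities"]) & set(table_b["entities"])) >= n

def related_pairs(tables, n=1, threshold=0.2):
    # Only score pairs that pass the blocking filter
    for a, b in combinations(tables, 2):
        if shares_entities(a, b, n):
            score = weighted_jaccard(a["labels"], b["labels"])
            if score >= threshold:
                yield a["name"], b["name"], round(score, 2)

tables = [
    {"name": "US presidents", "entities": {"George Washington", "John Adams"},
     "labels": {"president": 1.0, "politician": 0.7}},
    {"name": "Founding fathers", "entities": {"John Adams", "Benjamin Franklin"},
     "labels": {"politician": 0.8, "statesman": 0.5}},
    {"name": "State universities", "entities": {"University of Washington"},
     "labels": {"university": 1.0}},
]
print(list(related_pairs(tables)))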
