Cheshire II

XML Structured Document Retrieval and Distributed Resource Discovery
Ray R. Larson
School of Information Management & Systems
University of California, Berkeley
August 22, 2001 NASA Ames Lecture -- Ray R. Larson

Context
  • NSF/JISC International Digital Library Grant
      • Cross-Domain Resource Discovery: Integrated Discovery and Use of Textual, Numeric and Spatial Data
  • UC Berkeley DLI2 Grant: ReInventing Scholarly Information Access
  • UC Berkeley working with the University of Liverpool/Manchester Computing with participation from:
      • DeMontfort University (MASTER)
      • Arts and Humanities Data Service (http://ahds.ac.uk/)
          • OTA (Oxford), HDS (Essex), PADS (Glasgow), ADS (York), VADS (Surrey & Northumbria)
      • Consortium of University Research Libraries (CURL)
      • UC Berkeley Library (and California Digital Library)
          • Making of America II
          • Online Archive of California
      • British Natural History Museum, London
      • NESSTAR (NEtworked Social Science Tools and Resources)

Research Areas
  • Goals are:
      • Practical application of existing DL technologies to some large-scale cross-domain collections

      • Theoretical examination and evaluation of next-generation designs for systems architecture and distributed cross-domain searching for DLs

Approach
  • For the first goal, we are implementing a distributed search system based on international standards (Z39.50 and SGML/XML) using the Cheshire II information retrieval system
  • Databases include:
      • HE Archives Hub
      • Arts and Humanities Data Service (AHDS)
      • MASTER
      • CURL (Consortium of University Research Libraries)
      • Online Archive of California (OAC)
      • Making of America II (MOA2)

Current Usage of Cheshire II
  • Web clients for:

      • Berkeley NSF/NASA/ARPA Digital Library
      • World Conservation Digital Library
      • SunSite (UC Berkeley Science Libraries)
      • University of Liverpool
      • Higher Education Archives Hub
          • Glasgow, Edinburgh, Bath, Liverpool, Kings College London, University College London, Nottingham, Durham, School of Oriental and African Studies, Manchester, Southampton, Warwick and others (to be expanded)
      • University of Essex, HDS (part of AHDS)
      • Oxford Text Archive (test only)
      • California Sheet Music Project
      • Cha-Cha (Berkeley Intranet Search Engine)
      • Berkeley Metadata project cross-language demo
      • Univ. of Virginia (test implementations)
  • Cheshire ranking algorithm is basis for original Inktomi

Current and Upcoming Usage of Cheshire II
  • DIEPER: Digitized European Periodicals project. http://gdz.sub.uni-goettingen.de/dieper/
  • NESSTAR (Networked Social Science Tools and Resources). http://www.nesstar.org/
  • FASTER: Flexible Access to Statistics Tables and Electronic Resources (continuation of NESSTAR). http://www.faster-data.org/
  • MASTER (Manuscript Access through Standards for Electronic Records). http://www.cta.dmu.ac.uk/projects/master/

Upcoming Usage of Cheshire II
  • ZETOC (Prototype of the Electronic Table of Contents from the British Library) http://zetoc.mimas.ac.uk/
  • Archives Hub http://www.archiveshub.ac.uk/
  • RSLP Palaeography project http://www.palaeography.ac.uk/
  • British Natural History Museum, London
  • JISC data services directory hosted by MIMAS
  • Resource Discovery Network (RDN), where it will be used to harvest RDN records from the various hubs using OAI and provide search

Client/Server Architecture
  • Server Supports:

      • Database storage
      • Indexing
      • Z39.50 access to local data
      • Boolean and Probabilistic Searching
      • Relevance Feedback
      • External SQL database support
  • Client Supports:
      • Programmable (Tcl/Tk) Graphical User Interface
      • Z39.50 access to remote servers
      • SGML/XML & MARC formatting
  • Combined Client/Server CGI scripting via WebCheshire used for web applications

SGML/XML Support
  • Underlying native format for all data is SGML/XML
  • The DTD defines the file format for each file
  • Full SGML/XML parsing
  • XML Configuration Files define the database
  • USMARC DTD and MARC to SGML conversion (and back again)
  • Access to full-text via special SGML tags
  • Support for SGML/XML component definition and indexing

SGML/XML Support
  • Example XML record for a DL document

ELIB-v1.0 756 June 12, 1996 June 1996 Cumulative Watershed Effects: Applicability of Available Methodologies to the Sierra Nevada University of California report USDA Forest Service Neil H. Berg Ken B. Roby Bruce J. McGurk SNEP Vol 3 40 /elib/data/docs/0700/756/HYPEROCR/hyperocr.html /elib/data/docs/0700/756/OCR-ASCII-NOZONE

SGML/XML Support
  • Example SGML/MARC Record

00722n am2 200229 4 50 00100140000000500170001400800410003101000140007203500200008603500170010 610000190012324501050014225000110024726000320025830000330029050400500032365000360 0373700002200409700002200431950003200453998000700485 CUBGGLAD1282B19940414143202.0 830810 1983 nyu eng u 82019962 (CU)ocm08866667(CU)GLAD1282Burch, John G.Information systems :theory and practice /John G. Burch, Jr., Felix R. Strater, Gary Grudnitski3rd edNew York :J. Wiley,1983xvi, 632 p. :ill. ;24 cmIncludes bibliographical references and index

SubjSys="LCSH">Management information systems. ...

SGML/XML Support
  • Configuration files for the Server are also SGML/XML:
      • They include tags describing all of the data files and indexes for the database.
      • They also include instructions on how data is to be extracted for indexing and how Z39.50 attributes map to the indexes for a given database.
      • They include definition of components and component indexes

Component Extraction and Retrieval
  • Any sub-elements of an SGML/XML document can be defined as a separately indexed component.
  • Components can be ranked and retrieved independently of the source document (but linked back to their original source)
  • For example, paragraphs and abstracts in the full text of documents could be defined as components to provide paragraph-level search
  • Example: Glasier archives

Component Extraction and Retrieval
  • The Glasier archive is an EAD document (1.9 Mb in size)

  • Contains Series, Subseries, and Item level descriptions of things in the archive

Excerpt from Glasier Archive

GP-1-1: General correspondence. Public letters.
GP-1-1 Glasier Papers. General correspondence. Public letters.
Arrangement

Public letters arranged alphabetically within each year

GP-1-1-0001 Letter from Richard Murray. Glasgow; 7 Apr 1879. Murray, Richard 1 letter

Employment reference for J.B.G. as draughtsman
Glasier, John Bruce

ETC.

Example Component Def

/home/ray/Work/Glasier_test/indexes/COMPONENT_DB1 NONE c level item

Components
  • Both individual tags and ranges with a starting tag and (different) ending tag can be used as components
  • Components permit parts of complex SGML/XML documents to be treated as separate documents

Cheshire II Searching
  [Figure: Cheshire II Searching. Diagram labels: Local, Remote, Z39.50, Internet, Scanned Images, Text.]
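The component idea from the preceding slides (sub-elements indexed separately but linked back to their source) can be illustrated with a minimal Python sketch. It is not Cheshire II code. The sample markup, the element names (`ead`, `c`, `unitid`, `unittitle`), the `id` attribute and the second letter are assumptions made for illustration; only the "tag `c` with `level` equal to `item`" selection echoes the component definition shown on the slide.

```python
import xml.etree.ElementTree as ET

# Hypothetical EAD-like fragment. Element and attribute names are
# assumptions for illustration, not the real Glasier markup.
SAMPLE_EAD = """
<ead id="glasier-demo">
  <c level="series">
    <unitid>GP-1-1</unitid>
    <unittitle>General correspondence. Public letters.</unittitle>
    <c level="item">
      <unitid>GP-1-1-0001</unitid>
      <unittitle>Letter from Richard Murray.</unittitle>
    </c>
    <c level="item">
      <unitid>GP-1-1-0002</unitid>
      <unittitle>Second sample letter (invented).</unittitle>
    </c>
  </c>
</ead>
"""


def extract_components(xml_text, tag="c", attr="level", value="item"):
    """Return one small record per matching sub-element, linked to its source."""
    root = ET.fromstring(xml_text)
    source_id = root.get("id")
    components = []
    for elem in root.iter(tag):
        if elem.get(attr) != value:
            continue  # e.g. series-level <c> elements are not components here
        # Collapse all text inside the component into one searchable string.
        text = " ".join(" ".join(elem.itertext()).split())
        components.append({
            "component_no": len(components) + 1,
            "source_doc": source_id,
            "text": text,
        })
    return components


components = extract_components(SAMPLE_EAD)
for comp in components:
    print(comp["component_no"], comp["source_doc"], comp["text"])
```

Each extracted record keeps a `source_doc` link, so a ranked component hit can still be traced back to the full finding aid.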

Boolean Search Capability
  • All Boolean operations are supported
      zfind author x and (title y or subject z) not subject A
  • Named sets are supported and stored on the server
  • Boolean operations between stored sets are supported
      zfind SET1 and subject widgets or SET2
  • Nested parentheses and truncation are supported
      zfind xtitle Alice#

Probabilistic Models
  • Rigorous formal model attempts to predict the probability that a given document will be relevant to a given query
  • Ranks retrieved documents according to this probability of relevance (Probability Ranking Principle)
  • Relies on accurate estimates of probabilities

Probability Ranking Principle
  If a reference retrieval system's response to each request is a ranking of the documents in the collections in the order of decreasing probability of usefulness to the user who submitted the request, where the probabilities are estimated as accurately as possible on the basis of whatever data has been made available to the system for this purpose, then the overall effectiveness of the system to its users will be the best that is obtainable on the basis of that data.
  Stephen E. Robertson, J. Documentation 1977

Probabilistic Models: Logistic Regression
  • Estimates for relevance based on log-linear model with various statistical measures of document content as independent variables.
  • Log odds of relevance is a linear function of attributes:
      $\log O(R \mid q_i, d_j, t_k) = c_0 + c_1 v_1 + c_2 v_2 + \dots + c_n v_n$
  • Term contributions summed:
      $\log O(R \mid q_i, d_j) = \sum_{k=1}^{m} \left[ \log O(R \mid q_i, d_j, t_k) - \log O(R) \right]$
  • Probability of Relevance is inverse of log odds:
      $P(R \mid q_i, d_j) = \dfrac{1}{1 + e^{-\log O(R \mid q_i, d_j)}}$

Logistic Regression
  [Figure: Relevance (axis 0 to 100) plotted against Term Frequency in Document (axis 0 to 60).]

Probabilistic Retrieval: Logistic Regression
  • Probability of relevance is based on Logistic regression from a sample set of documents to determine values of the coefficients (TREC).
  • At retrieval the estimate is obtained from the log odds
      $\log O(R \mid Q, D) = c_0 + \sum_{i=1}^{6} c_i X_i$
    converted to a probability by $P(R \mid Q, D) = 1 / (1 + e^{-\log O(R \mid Q, D)})$, for the 6 X attribute measures shown on the next slide

Probabilistic Retrieval: Logistic Regression attributes
  • Average Absolute Query Frequency: $X_1 = \dfrac{1}{M} \sum_{j=1}^{M} \log QAF_{t_j}$
  • Query Length: $X_2 = \sqrt{QL}$
  • Average Absolute Document Frequency: $X_3 = \dfrac{1}{M} \sum_{j=1}^{M} \log DAF_{t_j}$
  • Document Length: $X_4 = \sqrt{DL}$
  • Average Inverse Document Frequency: $X_5 = \dfrac{1}{M} \sum_{j=1}^{M} \log IDF_{t_j}$
  • Inverse Document Frequency: $IDF_{t_j} = \dfrac{N - n_{t_j}}{n_{t_j}}$
  • Number of Terms in common between query and document, logged: $X_6 = \log M$

Cheshire Probabilistic Retrieval
  • Uses Logistic Regression ranking method developed at Berkeley with new algorithm for weight calculation at retrieval time.
  • Z39.50 relevance operator used to indicate probabilistic search
  • Any index can have Probabilistic searching performed:
      zfind topic @ cheshire cats, looking glasses, march hares and other such things
      zfind title @ caucus races
  • Boolean and Probabilistic elements can be combined:
      zfind topic @ government documents and title guidebooks
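As a rough illustration of the logistic regression ranking described on the preceding slides, the Python sketch below computes the six attribute measures for one query/document pair and turns the linear log odds into a probability. It is a minimal sketch under stated assumptions. The function names and input dictionaries are hypothetical, the square roots on query and document length follow the commonly published form of this model rather than anything legible in the slide text, and the coefficients are placeholders, not the fitted TREC values.

```python
import math


def lr_attributes(query_tf, doc_tf, coll_df, n_docs, query_len, doc_len):
    """Compute X1..X6 for one query/document pair.

    query_tf / doc_tf map term -> frequency in the query / document.
    coll_df maps term -> number of documents containing the term.
    Returns None when the query and document share no terms.
    """
    common = [t for t in query_tf if t in doc_tf]
    m = len(common)
    if m == 0:
        return None
    x1 = sum(math.log(query_tf[t]) for t in common) / m  # avg log query freq
    x2 = math.sqrt(query_len)                            # query length (assumed sqrt)
    x3 = sum(math.log(doc_tf[t]) for t in common) / m    # avg log document freq
    x4 = math.sqrt(doc_len)                              # document length (assumed sqrt)
    # IDF = (N - n_t) / n_t, which needs 0 < n_t < N for every shared term.
    x5 = sum(math.log((n_docs - coll_df[t]) / coll_df[t]) for t in common) / m
    x6 = math.log(m)                                     # log of shared-term count
    return [x1, x2, x3, x4, x5, x6]


def probability_of_relevance(x, coeffs):
    """coeffs is [c0, c1, ..., c6]; the log odds are linear in X1..X6."""
    log_odds = coeffs[0] + sum(c * v for c, v in zip(coeffs[1:], x))
    return 1.0 / (1.0 + math.exp(-log_odds))


# Placeholder coefficients c0..c6, NOT the fitted Cheshire/TREC values.
COEFFS = [-3.0, 1.0, -0.1, 0.5, -0.05, 0.3, 1.0]

x = lr_attributes({"cat": 1, "hat": 1}, {"cat": 2}, {"cat": 10}, 110, 2, 4)
print(round(probability_of_relevance(x, COEFFS), 4))
```

Documents would then be sorted by this probability in decreasing order, which is what the Probability Ranking Principle asks for.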

Combining Search Types
  • It is also possible to combine the results of multiple independent searches into a single result set (using the Z39.50 SORT service of the Cheshire system)
  • E.g.:
      • Search of Full Text (Probabilistic)
      • Search of Full Text (Boolean)
      • Search of Components (Probabilistic)
      • Search of Titles (Probabilistic)
      • Search of Subject Headings (Probabilistic)
  • All result sets are merged and re-ranked to produce the final list.
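The slide does not say how the merged sets are re-ranked, so the following Python sketch only illustrates one simple possibility: summing each record's scores across the independent searches and sorting by the total. The summing rule, the function name and the sample scores are all assumptions. Boolean result sets carry no score, so a real implementation would need some convention for them (for example a fixed contribution per match).

```python
def merge_result_sets(result_sets):
    """Merge {record_id: score} dicts and re-rank by summed score."""
    combined = {}
    for results in result_sets:
        for record_id, score in results.items():
            combined[record_id] = combined.get(record_id, 0.0) + score
    # Highest combined score first; ties broken by record id for stable output.
    return sorted(combined.items(), key=lambda item: (-item[1], item[0]))


# Invented scores for three independent searches.
full_text = {"doc1": 0.25, "doc2": 0.5}
titles = {"doc2": 0.25, "doc3": 0.5}
components = {"doc1": 0.5}

ranked = merge_result_sets([full_text, titles, components])
print(ranked)
```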

Relevance Feedback
  • Any records in a result set can be used for Relevance Feedback
  • Uses the set name to receive feedback instructions.
      zfind SET1:2,5-9,30,45
      zfind SET2:6
  • Chosen records are used to build a new probabilistic query
  • Ranked results are returned
  • Planned support for (modified) Rocchio RF

Cheshire II - Two-Stage Retrieval (EVM generation)
  • Example: Using the LC Classification System
  • Pseudo-Document created for each LC class containing terms derived from content-rich portions of documents in that class (subject headings, titles, etc.)
  • Permits searching by any term in the class
  • Ranked Probabilistic retrieval techniques attempt to present the Best Matches to a query first.
  • User selects classes to feed back for the second stage search of documents (which includes info from first stage selections)
  • Can be used with any classified/Indexed collection and controlled vocabulary

Automatic Class Assignment
  • Automatic Class Assignment: Polythetic, Exclusive or Overlapping, usually ordered; clusters are order-independent, usually based on an intellectually derived scheme
  [Figure: several documents (Doc) passed to a Search Engine.]
  1. Create pseudo-documents representing intellectually derived classes.
  2. Search using document contents
  3. Obtain ranked list
  4. Assign document to N categories ranked over threshold, OR assign to top-ranked category

Cheshire II -- Cluster Generation
  • Define basis for clustering records.
      • Select field to form the basis of the cluster.
      • Evidence Fields to use as contents of the pseudo-documents.
  • During indexing cluster keys are generated with basis and evidence from each record.
  • Cluster keys are sorted and merged on basis and pseudo-documents created for each unique basis element containing all evidence fields.
  • Pseudo-Documents (Class clusters) are indexed on combined evidence fields.

Cheshire II - Two-Stage Retrieval
  • Using the MeSH Subject Heading System
  • Pseudo-Document created for each MeSH heading containing terms derived from content-rich portions of documents in that class (other subject headings, titles, abstract, etc.)
  • Permits searching by any term in the class
  • Ranked Probabilistic retrieval techniques attempt to present the Best Matches to a query first.
  • User selects classes to feed back for the second stage search of documents.
  • Can be used with any classified/Indexed collection.

Distributed Search: The Problem
  • Hundreds or Thousands of servers with databases ranging widely in content, topic, format
  • Broadcast search is expensive in terms of bandwidth and in processing too many irrelevant results
  • How to select the best ones to search?
      • What to search first
      • Which to search next
  • Topical/domain constraints on the search selections
  • Variable contents of database (metadata only, full text)
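Returning to the cluster generation steps described earlier in this section (pick a basis field, pick evidence fields, merge evidence per unique basis value), a minimal Python sketch might look as follows. It is not the Cheshire II indexer. The record layout, the field names (`lc_class`, `title`, `subject`) and the sample records are invented for illustration.

```python
from collections import defaultdict


def build_pseudo_documents(records, basis_field, evidence_fields):
    """Group records by the basis field and merge their evidence fields.

    Returns {basis_value: pseudo_document_text}, sorted by basis value.
    """
    clusters = defaultdict(list)
    for record in records:
        basis = record.get(basis_field)
        if not basis:
            continue  # records without a class cannot contribute evidence
        for field in evidence_fields:
            value = record.get(field)
            if value:
                clusters[basis].append(value)
    # One pseudo-document per unique basis element, holding all its evidence.
    return {basis: " ".join(parts) for basis, parts in sorted(clusters.items())}


# Invented sample records; class codes and titles are placeholders.
records = [
    {"lc_class": "QA76", "title": "Information systems",
     "subject": "Management information systems"},
    {"lc_class": "QA76", "title": "Database design", "subject": "Databases"},
    {"lc_class": "Z699", "title": "Information retrieval",
     "subject": "Search engines"},
]

pseudo_docs = build_pseudo_documents(records, "lc_class", ["title", "subject"])
for basis, text in pseudo_docs.items():
    print(basis, "->", text)
```

In a two-stage search, the first stage would rank these pseudo-documents against the query, and the classes the user picks would then feed the second-stage search of the actual records.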

An Approach for Cross-Domain Resource Discovery
  • MetaSearch
      • New approach to building metasearch based on Z39.50
      • Instead of using broadcast search we are using two Z39.50 Services
          • Identification of database metadata using Z39.50 Explain
          • Extraction of distributed indexes using Z39.50 SCAN
  • Evaluation
      • How efficiently can we build distributed indexes?
      • How effectively can we choose databases using the index?
      • How effective is merging search results from multiple sources?
      • Hierarchies of servers (general/meta-topical/individual)?

Z39.50 Overview
  [Figure: Z39.50 Overview. Diagram labels: UI, Map Query, Map Results, Search Engine, Internet.]

Z39.50 Explain
  • Explain supports searches for
      • Server-Level metadata
          • Server Name
          • IP Addresses
          • Ports
      • Database-Level metadata
          • Database name
          • Search attributes (indexes and combinations)
      • Support metadata (record syntaxes, etc)

Z39.50 SCAN
  • Originally intended to support Browsing
  • Query for
      • Database
      • Attributes plus Term (i.e., index and start point)
      • Step Size
      • Number of terms to retrieve
      • Position in Response set
  • Results
      • Number of terms returned
      • List of Terms and their frequency in the database (for the given attribute combination)

Z39.50 SCAN Results
  Syntax: zscan indexname1 term stepsize number_of_terms pref_pos

  % zscan topic cat 1 20 1
  % zscan title cat 1 20 1

  Both scans return the header:
  {SCAN {Status 0} {Terms 20} {StepSize 1} {Position 1}}

  Terms returned by the two scans (the two result columns of the slide are interleaved here):
  {cat 706} {cat 27} {cat-and-mouse 19} {cat-fight 1} {cat-burglar 1} {catalan 19} {cat-carrying 1} {cat-egory 1} {catalogu 37} {cat-fight 1} {catalonia 8} {cat-gut 1} {catalyt 2} {cat-litter 1} {catania 1} {cat-lovers 2} {cataract 1} {cat-pee 1} {catch 173} {cat-run 1} {catch-all 3} {cat-scanners 1} {catch-up 2}
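The braced SCAN output shown above is easy to turn into term/frequency pairs. The Python sketch below assumes the output always has the shape seen on the slide: a `{SCAN ...}` header closed by `}}`, followed by `{term frequency}` pairs. The function name and the shortened sample string are illustrative, and this is a plain text parser, not a Z39.50 client.

```python
import re

# Matches a single {name number} pair with no nested braces.
PAIR = re.compile(r"\{([^\s{}]+) (\d+)\}")


def parse_scan(scan_output):
    """Split zscan output into (header dict, ordered list of (term, freq))."""
    header_end = scan_output.index("}}") + 2  # end of the {SCAN ...} block
    header = {k: int(v) for k, v in PAIR.findall(scan_output[:header_end])}
    terms = [(t, int(f)) for t, f in PAIR.findall(scan_output[header_end:])]
    return header, terms


# Shortened sample in the same shape as the slide output.
sample = ("{SCAN {Status 0} {Terms 3} {StepSize 1} {Position 1}} "
          "{cat 706} {cat-and-mouse 19} {cat-burglar 1}")

header, terms = parse_scan(sample)
print(header)
print(terms)
```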

MetaSearch Server Index Creation
  • For all servers, or a topical subset
      • Get Explain information (especially DC mappings)
      • For each index (or each DC index)
          • Use SCAN to extract terms and frequency
          • Add term + freq + source index + database metadata to the metasearch Collection Document (XML)
  • Planned extensions:
      • Post-Process indexes (especially Geo Names, etc) for special types of data, e.g. create geographical coverage indexes
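The index creation loop above ends with an XML Collection Document per database. The slides do not show that document's schema, so the Python sketch below uses invented element and attribute names (`collection`, `index`, `term`, `freq`) purely to show the shape of the step: harvested terms, frequencies and source index gathered into one XML record. The database name and the term counts are illustrative sample values.

```python
import xml.etree.ElementTree as ET


def build_collection_document(database, index_terms):
    """Build one XML record describing a harvested database.

    index_terms maps index name -> list of (term, frequency) pairs,
    for example the pairs returned by SCAN for that index.
    Element and attribute names are hypothetical.
    """
    root = ET.Element("collection", {"database": database})
    for index_name, terms in index_terms.items():
        index_elem = ET.SubElement(root, "index", {"name": index_name})
        for term, freq in terms:
            term_elem = ET.SubElement(index_elem, "term", {"freq": str(freq)})
            term_elem.text = term
    return root


# Illustrative harvest for a made-up database called "demo_db".
doc = build_collection_document(
    "demo_db",
    {
        "topic": [("cat", 706), ("cat-and-mouse", 19)],
        "title": [("cat", 27)],
    },
)
print(ET.tostring(doc, encoding="unicode"))
```

Indexing these documents like any other XML collection is what lets the metasearch server rank databases before any remote search is sent.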

MetaSearch Approach
  [Figure: MetaSearch Approach. A MetaSearch Server (Map Query, Map Results, Search Engine, Distributed Index) sends Explain and Scan queries over the Internet to search engines serving DB 1, DB 2, DB 3, DB 4, DB 5 and DB 6, each with its own Map Query and Map Results steps.]

Known Problems
  • Not all Z39.50 Servers support SCAN or Explain
  • Solutions:
      • Probing for attributes instead of Explain (e.g. DC attributes or analogs)
      • We also support OAI and can extract OAI metadata for servers that support OAI
  • Collection Documents are static and need to be replaced when the associated collection changes

Evaluation
  • Test Environment
      • TREC Tipster and FT data (approx. 3.5 GB)
      • Partitioned into 236 smaller collections based on source and (for TIPSTER) date by month (Distributed Search Testbed built by French, et al.)
      • High size variability (Range from 1 to thousands of docs)
      • 21,225,299 Words, 142,345,670 chars total for harvested records
  • Efficiency (old data)
      • Average of 23.07 seconds per database to SCAN each database (3.4 indexes on average)
      • Average of 14.07 seconds excluding FT (131 seconds for FT database with 7 indexes)
      • Now collecting more information, so harvest times are longer, but still under one minute on average

Evaluation
  • Effectiveness
      • Still working on evaluation comparing our DB ranking with the TIPSTER relevance judgements
      • Can be compared with published selection methods (CORI, GlOSS, etc.) using the same testbed

Future
  • Testing of variant algorithms for ranking collections
  • Application to real systems and testing in a production environment (Archives Hub)
  • Logically Clustering servers by topic
  • Meta-Meta Servers (treating the MetaSearch database as just another database)

Distributed Metadata Servers
  [Figure: Distributed Metadata Servers. Diagram labels: Database Servers, Meta-Topical Servers, General Servers, Replicated servers.]

Conclusion
  • A lot of interesting work to be done
      • Redesign and development of the Cheshire II system
      • Evaluating new meta-indexing methods
      • Developing and Evaluating methods for merging cross-domain results (or, perhaps, when to keep them separate)

Further Information
  • Full Cheshire II client and server source is available: ftp://cheshire.berkeley.edu/pub/cheshire/
      • Includes HTML documentation
  • Also on Berkeley Digital Library Software Distribution CD
  • Project Web Site: http://cheshire.berkeley.edu/
