ETL - University of Oklahoma

ETL - University of Oklahoma

ETL The process of updating the data warehouse. Recent Developments in Data Warehousing: A Tutorial Hugh J. Watson Terry College of Business University of Georgia [email protected] http://www.terry.uga.edu/~hwatson/dw_tutorial.ppt

Two Data Warehousing Strategies Enterprise-wide warehouse, top down, the Inmon methodology Data mart, bottom up, the Kimball methodology When properly executed, both result in an enterprise-wide data warehouse

The Data Mart Strategy The most common approach Begins with a single mart and architected marts are added over time for more subject areas Relatively inexpensive and easy to implement Can be used as a proof of concept for data warehousing Can perpetuate the silos of information problem Can postpone difficult decisions and activities Requires an overall integration plan

The Enterprise-wide Strategy A comprehensive warehouse is built initially An initial dependent data mart is built using a subset of the data in the warehouse Additional data marts are built using subsets of the data in the warehouse Like all complex projects, it is expensive, time consuming, and prone to failure When successful, it results in an integrated, scalable warehouse

Data Sources and Types Primarily from legacy, operational systems Almost exclusively numerical data at the present time External data may be included, often purchased from third-party sources Technology exists for storing unstructured data and expect this to become more important over time

Extraction, Transformation, and Loading (ETL) Processes The plumbing work of data warehousing Data are moved from source to target data bases A very costly, time consuming part of data warehousing Recent Development: More Frequent Updates

Updates can be done in bulk and trickle modes Business requirements, such as trading partner access to a Web site, requires current data For international firms, there is no good time to load the warehouse Recent Development: Clickstream Data Results from clicks at web sites

A dialog manager handles user interactions. An ODS (operational data store in the data staging area) helps to custom tailor the dialog The clickstream data is filtered and parsed and sent to a data warehouse where it is analyzed Software is available to analyze the clickstream data Data Extraction

Often performed by COBOL routines (not recommended because of high program maintenance and no automatically generated meta data) Sometimes source data is copied to the target database using the replication capabilities of standard RDMS (not recommended because of dirty data in the source systems) Increasing performed by specialized ETL software

Sample ETL Tools Teradata Warehouse Builder from Teradata DataStage from Ascential Software SAS System from SAS Institute Power Mart/Power Center from Informatica Sagent Solution from Sagent Software Hummingbird Genio Suite from Hummingbird Communications

Reasons for Dirty Data

Dummy Values Absence of Data Multipurpose Fields Cryptic Data Contradicting Data Inappropriate Use of Address Lines Violation of Business Rules Reused Primary Keys, Non-Unique Identifiers Data Integration Problems

Data Cleansing Source systems contain dirty data that must be cleansed ETL software contains rudimentary data cleansing capabilities Specialized data cleansing software is often used. Important for performing name and address correction and householding functions Leading data cleansing vendors include Vality

(Integrity), Harte-Hanks (Trillium), and Firstlogic (i.d.Centric) Steps in Data Cleansing Parsing Correcting Standardizing Matching Consolidating Parsing

Parsing locates and identifies individual data elements in the source files and then isolates these data elements in the target files. Examples include parsing the first, middle, and last name; street number and street name; and city and state. Correcting Corrects parsed individual data

components using sophisticated data algorithms and secondary data sources. Example include replacing a vanity address and adding a zip code. Standardizing Standardizing applies conversion routines to transform data into its preferred (and consistent) format using both standard and custom

business rules. Examples include adding a pre name, replacing a nickname, and using a preferred street name. Matching Searching and matching records within and across the parsed, corrected and standardized data based on predefined business rules to eliminate duplications.

Examples include identifying similar names and addresses. Consolidating Analyzing and identifying relationships between matched records and consolidating/merging them into ONE representation. Data Staging Often used as an interim step between data

extraction and later steps Accumulates data from asynchronous sources using native interfaces, flat files, FTP sessions, or other processes At a predefined cutoff time, data in the staging file is transformed and loaded to the warehouse There is usually no end user access to the staging file An operational data store may be used for data staging

Data Transformation Transforms the data in accordance with the business rules and standards that have been established Example include: format changes, deduplication, splitting up fields, replacement of codes, derived values, and aggregates Data Loading

Data are physically moved to the data warehouse The loading takes place within a load window The trend is to near real time updates of the data warehouse as the warehouse is increasingly used for operational applications Meta Data Data about data

Needed by both information technology personnel and users IT personnel need to know data sources and targets; database, table and column names; refresh schedules; data usage measures; etc. Users need to know entity/attribute definitions; reports/query tools available; report distribution information; help desk contact information, etc. Recent Development:

Meta Data Integration A growing realization that meta data is critical to data warehousing success Progress is being made on getting vendors to agree on standards and to incorporate the sharing of meta data among their tools Vendors like Microsoft, Computer Associates, and Oracle have entered the meta data marketplace with significant product offerings

Recently Viewed Presentations

  • Synoptic Gospels Part 01 Jesus Birth and Youth

    Synoptic Gospels Part 01 Jesus Birth and Youth

    What Scripture say is so - is so !! We are not free to ignore it or change it. There is much detail that Scripture does not elaborate upon.-- They are left 'vacant' since they are not necessary to the...
  • Présentation PowerPoint - Free

    Présentation PowerPoint - Free

    L'excision du phage L'excision se produit lors de l'induction du cycle lytique (délysogénisation) du phage (ex : sous l'action des UV), elle est assurée par l'enzyme excisionase qui hydrolyse au niveau des sites d'attachement.
  • Final Jeopardy Fractions Addition Fractions Subtraction  Fractions Multiplication

    Final Jeopardy Fractions Addition Fractions Subtraction Fractions Multiplication

    How many pounds of candy will be in each bag? There will be 2 ⅓ pounds in each bag. Charlie, Dan, and Ellie each made a paper chain. Charlie's was 9 ½ feet long, Dan's was 6 ⅓ feet long,...
  • AP English Literature & Composition Essay Questions

    AP English Literature & Composition Essay Questions

    AP English Literature & Composition Free Response Section Three essays Two hours (about 40 minutes each) Two are "close reading" One on a prose passage One on poetry One is "open" Topic given Student chooses a work from a list...
  • Absolutism in Eastern Europe

    Absolutism in Eastern Europe

    FW I (King of P) FW II Elector of Brandenburg continued the work of his grandfather to strengthen Prussia and pull Brandenburg out of HRE. Created strong army- "soldier king" improved farming and ag production, improved the economy and left...
  • Embedding Technology in Teaching and Learning

    Embedding Technology in Teaching and Learning

    Evidence and research. Things to consider. Need to be clear about how you expect the introduction of technology to improve learning. Technology should support pupils to work harder, for longer or more efficiently to improve their learning.
  • Why Christians are Useless and Irrelevant in Today's Society

    Why Christians are Useless and Irrelevant in Today's Society

    What is Constantinianism? Constantine was the first emperor to declare Christianity as a national religion. He was the first of a long line of leaders to see Christianity as the unifying force that might bring together the empire's diverse population....
  • 第9章 模数转换器与数模转换器 - Shandong University

    第9章 模数转换器与数模转换器 - Shandong University

    并联比较型ADC的缺点是成本高、功耗大。 因为n位输出的ADC, 需要2n个电阻, (2n-1) 个比较器和D触发器, 以及复杂的编码网络, 其元件数量随位数的增加, 以几何级数上升。所以这种ADC适用于要求高速、低分辩率的场合。