DAMA-Big Data (R)evolution Presentation - DAMA Iowa

Big Data - Technical Architecture
Roni Schuling - Enterprise Architecture
Tom Scroggins - IS Domain Architecture
Principal Financial Group

Big Data - Technical Architecture AGENDA: Foundational definitions & where these technologies came from (Big Data, NoSQL, Hadoop); business & technical drivers; how they are being used in many companies; predictions for the future; challenges & obstacles; questions.

Big Data - Technical Architecture Foundational Definition - Big Data: Big data is an evolving term that describes any voluminous amount of structured, semi-structured and unstructured data that has the potential to be mined for information. Big data can be characterized by the 3Vs: the extreme volume of data, the wide variety of types of data, and the velocity at which the data must be processed. There are many other aspects as well, such as viscosity, complexity, and ambiguity. Data in a corporation that cannot be processed using traditional data management techniques and technologies can be broadly classified as Big Data.

Big Data - Technical Architecture: Hadoop & NoSQL are key technologies for working with Big Data effectively.

Big Data - Technical Architecture Foundational Definition - NoSQL: A NoSQL database, also called Not Only SQL, is an approach to data management and database design that's useful for very large sets of distributed data. NoSQL seeks to solve the scalability and big data performance issues that relational databases weren't designed to address. NoSQL is especially useful when an enterprise needs to access and analyze massive amounts of unstructured data, or data that's stored remotely on multiple virtual servers in the cloud. However, NoSQL is not just about Big Data.

Big Data - Technical Architecture Where this technology came from - NoSQL: [timeline, 1970 to 2014+] flat files; the rise of relational databases; relational database dominance; the rise of object databases; many innovators in the 2005 to 2010 timeframe; polyglot persistence (2010-2014+). Along the way: document databases inspired by Lotus Notes; key-value stores that replicate data for 24x7 availability; the need to store tabular data in a distributed system. An enterprise will have a variety of different data storage technologies for different kinds of data & application needs.

Big Data - Technical Architecture Market view of what's out there (we do NOT have all of these at PFG today): there are over 150 NoSQL databases in the market; these are just a few of the top ones.

Big Data - Data Architecture at PFG Foundational Definition - Hadoop: Hadoop is an open source, Java-based programming framework that supports the processing of large data sets in a distributed computing environment. It is part of the Apache project sponsored by the Apache Software Foundation. Hadoop makes it possible to run applications on systems with thousands of nodes involving thousands of terabytes. Its distributed file system facilitates rapid data transfer rates among nodes and allows the system to continue operating uninterrupted in case of a node failure. This approach lowers the risk of catastrophic system failure, even if a significant number of nodes become inoperative.
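
To make the MapReduce model behind Hadoop concrete, here is a minimal word-count sketch against the standard Hadoop Java MapReduce API. It is illustrative only; the class name and the command-line input/output paths are placeholders, not anything from this deck.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map phase: runs on the nodes holding each input split and emits (word, 1) pairs.
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce phase: the framework groups pairs by word across nodes; we sum the counts.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);  // local pre-aggregation before the shuffle
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));    // e.g. an HDFS input directory
    FileOutputFormat.setOutputPath(job, new Path(args[1]));  // HDFS output directory
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

Packaged into a jar and submitted with the hadoop command, the same code scales from a single node to thousands, because the framework handles input splitting, scheduling, and re-running tasks from failed nodes.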

Big Data - Data Architecture at PFG Where this technology came from - Hadoop: [timeline, 1995 to 2014+] 1995-2005: the Yahoo! Search team builds 4+ generations of systems to crawl & index the WWW - 20 billion pages! Google publishes the Google File System & MapReduce papers. Doug Cutting builds Nutch DFS & MapReduce and joins Yahoo!; Yahoo! staffs Juggernaut, an open source DFS & MapReduce effort. Juggernaut & Nutch join forces - Hadoop is born! Other Internet companies add tools / frameworks to enhance Hadoop. Service providers step into the market to provide training, support, & hosting. Through 2014+, the emphasis shifts to enterprise-grade security, analytic tool interoperability, and mass adoption.

Big Data - Technical Architecture The Hadoop Vendor Landscape - 2014

Big Data - Technical Architecture Business Drivers: Provide access to all data needed for analytics (internal or external). Provide the ability to realistically interact with greater depths of data, i.e., tens of years instead of a couple of months. Provide a greater speed to insight for all types of requests. Lower the total cost of ownership across the enterprise for analytics. Allow for exploration of our data in ways we never anticipated, to identify a differentiating understanding of customers and markets. There's an imbalance today.

Big Data - Technical Architecture Technical Drivers: Current technical capabilities don't align with changing expectations.

Big Data - Technical Architecture How they are being used today - NoSQL (not focused on Big Data... yet): Many companies are using, or at least experimenting with, MongoDB. Document stores for web applications that only need to persist content for the lifespan of that interaction. Using NoSQL stores for user preferences to personalize what is presented on a web page for a user's interaction. Beginning to organize social streams of data.
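
As a sketch of the user-preference pattern mentioned above, the following stores and reads back one preference document with the MongoDB Java (sync) driver. The connection string, database, collection, and field names are all hypothetical, and this is a minimal illustration of the pattern rather than any PFG implementation.

```java
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.Filters;
import com.mongodb.client.model.ReplaceOptions;
import org.bson.Document;

public class UserPreferencesStore {
  public static void main(String[] args) {
    // Hypothetical local MongoDB instance; in practice this would be a replica-set URI.
    try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
      MongoCollection<Document> prefs =
          client.getDatabase("web").getCollection("userPreferences");

      // Upsert one schemaless preference document keyed by a user id.
      Document doc = new Document("userId", "u12345")
          .append("theme", "dark")
          .append("landingPage", "accounts")
          .append("widgets", java.util.Arrays.asList("balances", "news"));
      prefs.replaceOne(Filters.eq("userId", "u12345"), doc,
          new ReplaceOptions().upsert(true));

      // Read it back when the user's next page is rendered.
      Document found = prefs.find(Filters.eq("userId", "u12345")).first();
      System.out.println(found == null ? "no preferences" : found.toJson());
    }
  }
}
```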

How they are being used today - Hadoop: Interrogating our web logs to better understand the behavior of people interacting with a website. Merging that semi-structured web activity with other structured legacy data. Massive storage of data for exploration and discovery, often using interoperability with analytic consumption tools.

Big Data - Technical Architecture Plans for the future - NoSQL: Databases for web applications that need that speed of development and nimbleness. Layering of NoSQL solutions on top of Hadoop to improve searchability and performance. Exploration of graph NoSQL solutions for analytics on hierarchical-type data.

Plans for the future - Hadoop: Expansion of web activity data (more logs, more data in logs, more use cases). Speech-to-text translation of call recordings and text analysis / natural language processing to determine call topics and caller sentiment. Extraction of text from documents to aid in analysis. Data lake solutioning, both for ingestion and archive.

Big Data - Technical Architecture Lake of Data / Data Refinery

Big Data - Technical Architecture Data Refinery

Big Data - Technical Architecture Many kinds of data in our organization (conceptual, for illustration; not a vetted/approved picture of the PFG environment)

Big Data - Technical Architecture Conceptual Workload Isolation Today (conceptual, for illustration; not a vetted/approved picture of the PFG environment)

Big Data - Technical Architecture Conceptual Workload Isolation in the Future (conceptual, for illustration; not a vetted/approved picture of the PFG environment)

Big Data - Technical Architecture: Big Data technologies are broader than just Hadoop & NoSQL, but those are the key starting points for us. Market view of what's out there; we do NOT have all of these at PFG today.

Big Data - Technical Architecture Challenges and Obstacles to overcome: Security. Governance. Clear use cases. Integration points. Hosting models.

Big Data - Technical Architecture Q&A [email protected]

NoSQL Data Architecture & Best Practices Data View - Overview: We are in a database revolution. Existing paradigms are being challenged: models, hardware, software, languages. Will tweaking current data solutions be enough?

NoSQL Data Architecture & Best Practices Data View - Five Data Paradigms

Relational Model. PROs: most flexible queries & updates; reuse data structures in any context; great DB-to-DB integration; mature tools; standard query language; easy to hire expertise. CONs: design-time, static relationships; design-time, static structures (design first, then load data); hard to normalize the model; requires code to integrate relational data with object-oriented code; cannot query for relevance.

Dimensional Model. PROs: queries facts in context; self-service, ad hoc queries; high-performance platforms; mature tools and integration; standard query language; turns data into information. CONs: expensive platforms; design-time, static relationships; design-time, static structures (design first, then load data); cannot query for relevance; cannot query for answers that are not built into the model.

NoSQL Data Architecture & Best Practices Data View - Five Data Paradigms: What's wrong (aka challenging) with SQL DBs? Relevance, velocity, volume, variety, variability.

Key Value / Column Family Models. PROs: fast puts and gets; massive scalability; easy to shard & replicate; data colocation; simple to model; inexpensive; data in transactional context; developer in control. CONs: must carefully design the key; shred JSON into flat columns; secondary indexes required to query outside of the hierarchical key; no standard query API or language; hand-code all joins in the app; immature tools and platform; hard to integrate and to hire for.
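
A toy, in-memory illustration of the trade-offs listed above for key-value / column-family stores: the composite key has to be designed up front, records are located by hashing the key to a shard, and anything join-like must be hand coded in the application. This is plain Java written for the example, not any particular product's API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ConcurrentHashMap;

/** Toy in-memory "cluster": a fixed set of shards, each a simple map. Illustrative only. */
public class KeyValueSketch {
  private final List<ConcurrentHashMap<String, String>> shards = new ArrayList<>();

  KeyValueSketch(int shardCount) {
    for (int i = 0; i < shardCount; i++) {
      shards.add(new ConcurrentHashMap<>());
    }
  }

  // Key design happens up front: "customerId:policyId" is the only efficient access path.
  static String key(String customerId, String policyId) {
    return customerId + ":" + policyId;
  }

  // Sharding is just hashing the key; the client routes each request to one shard.
  private ConcurrentHashMap<String, String> shardFor(String key) {
    return shards.get(Math.floorMod(key.hashCode(), shards.size()));
  }

  void put(String key, String jsonValue) { shardFor(key).put(key, jsonValue); }

  String get(String key) { return shardFor(key).get(key); }

  public static void main(String[] args) {
    KeyValueSketch kv = new KeyValueSketch(4);
    kv.put(key("C100", "P-7"), "{\"status\":\"active\",\"premium\":120.50}");

    // Fast get by the designed key...
    System.out.println(kv.get(key("C100", "P-7")));

    // ...but "all policies for customer C100", or any join with another store,
    // has no query language here: the application must maintain its own
    // secondary index or scan every shard by hand.
  }
}
```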

NoSQL Data Architecture & Best Practices Data View - Five Data Paradigms: Document Model. PROs: fast development; schemaless, run-time designed, rich JSON and/or XML data structures; queries everything in context; self-service, ad hoc queries; turns data into information; can query for relevance. CONs: defensive programming for unexpected data structures; expensive platforms, immature tools, and hard to integrate; non-standard query languages, and hard to hire expertise; not as fast as column-family / key-value databases.
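
To illustrate the "defensive programming for unexpected data structures" caveat above, this fragment reads schemaless JSON documents with the Jackson library and tolerates fields that are missing or have changed shape. The document shapes and field names are invented for the example.

```java
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;

public class DefensiveDocumentRead {
  public static void main(String[] args) throws Exception {
    // Two documents from the same collection, written at different times by different apps.
    String v1 = "{\"name\":\"Ann\",\"phone\":\"555-0100\"}";
    String v2 = "{\"name\":\"Bob\",\"contact\":{\"phones\":[\"555-0111\",\"555-0112\"]}}";

    ObjectMapper mapper = new ObjectMapper();
    for (String json : new String[] {v1, v2}) {
      JsonNode doc = mapper.readTree(json);

      // path() returns a "missing" node instead of null, so absent fields degrade gracefully.
      String name = doc.hasNonNull("name") ? doc.get("name").asText() : "unknown";

      JsonNode phoneNode = doc.path("phone");
      String phone;
      if (phoneNode.isTextual()) {
        phone = phoneNode.asText();                      // old, flat shape
      } else {
        JsonNode phones = doc.path("contact").path("phones");
        phone = (phones.isArray() && phones.size() > 0)  // newer, nested shape
            ? phones.get(0).asText()
            : "unknown";
      }
      System.out.println(name + " -> " + phone);
    }
  }
}
```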

NoSQL Data Architecture & Best Practices Data View - Five Data Paradigms: Graph Model. PROs: unlimited flexibility to model any structure; run-time definition of types & relationships; relate anything to anything in any way; query relationship patterns; standard query language (SPARQL); creates maximum context around data. CONs: hard to model at such a low level; hard to integrate with other systems; immature tools; hard to hire expertise; cannot query for relevance because the original document context is not preserved.
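
Since SPARQL is called out as the graph model's standard query language, here is a small in-memory example using Apache Jena. The triples, property names, and the example.org namespace are invented purely for illustration.

```java
import org.apache.jena.query.Query;
import org.apache.jena.query.QueryExecution;
import org.apache.jena.query.QueryExecutionFactory;
import org.apache.jena.query.QueryFactory;
import org.apache.jena.query.QuerySolution;
import org.apache.jena.query.ResultSet;
import org.apache.jena.rdf.model.Model;
import org.apache.jena.rdf.model.ModelFactory;
import org.apache.jena.rdf.model.Property;
import org.apache.jena.rdf.model.Resource;

public class GraphQuerySketch {
  public static void main(String[] args) {
    // Build a tiny in-memory graph: two customers, one refers the other.
    Model model = ModelFactory.createDefaultModel();
    String ns = "http://example.org/";                  // invented namespace for the sketch
    Property referredBy = model.createProperty(ns, "referredBy");
    Property name = model.createProperty(ns, "name");
    Resource ann = model.createResource(ns + "cust/1").addProperty(name, "Ann");
    model.createResource(ns + "cust/2").addProperty(name, "Bob").addProperty(referredBy, ann);

    // Query the relationship pattern: who was referred by whom?
    String sparql =
        "PREFIX ex: <http://example.org/> " +
        "SELECT ?who ?referrer WHERE { " +
        "  ?c ex:referredBy ?r . ?c ex:name ?who . ?r ex:name ?referrer }";

    Query query = QueryFactory.create(sparql);
    try (QueryExecution exec = QueryExecutionFactory.create(query, model)) {
      ResultSet results = exec.execSelect();
      while (results.hasNext()) {
        QuerySolution row = results.next();
        System.out.println(row.getLiteral("who").getString()
            + " was referred by " + row.getLiteral("referrer").getString());
      }
    }
  }
}
```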

NoSQL Data Architecture & Best Practices Data View - Five Data Paradigms: What's wrong (aka challenging) with NoSQL DBs? The developer is responsible for consistency (handling threading): locks, contention, serialization, deadlocks, race conditions, threading bugs.

NoSQL Data Architecture & Best Practices Data View - Modeling Takeaways: Each model has a specialized purpose. Dimensional: business intelligence reporting and analytics. Relational: flexible queries, joins, updates; mature, standard. Column / Key-Value: simple, fast puts and gets, massively scalable. Document: fast development, schemaless JSON/XML, searchable. Graph / RDF: modeling anything at runtime, including relationships.

NoSQL Data Architecture & Best Practices Data View - How do you choose? How much durability do you need? Durable data survives system failures & can be recovered after unwanted deletion. How much atomicity do you need? An atomic transaction is all or nothing, over sets of data and/or sets of commands. How much isolation do you need? Isolation prevents concurrent transactions from affecting each other. How much consistency do you need (or when do you need it)? Consistency exists when data is committed and consistent with all data rules at a point in time.

NoSQL Data Architecture & Best Practices Data View - How do you choose? Durability: Can you live with writing advanced code to compensate (trusting all developers to properly check for partial transaction failures and the current physical layout of the data cluster, and to write code to propagate data across the cluster)? Can you live with lost data (no logs, archives, mirroring, etc.)? Can you live with accidental deletion of data (no point-in-time recovery feature)? Can you live with scripting your own backup & recovery solutions?

NoSQL Data Architecture & Best Practices Data View - How do you choose? Atomicity: Can you live with modifying single documents at a time? Can you live with partially successful transactions (you can achieve higher availability because transactions can partially succeed)? Can you live with inconsistent and incomplete data (is it OK not to know whether data anomalies are caused by bugs in your code or are temporarily inconsistent because they haven't been synchronized yet)? Can you live with writing advanced code to compensate (custom solutions for atomic rollback, handling of transactions that fail, finding & fixing inconsistent data)?

NoSQL Data Architecture & Best Practices Data View - How do you choose? Isolation: Can you live with modifying single documents at a time? Can you live with inaccurate queries (without isolation, query results are inaccurate because concurrent transactions can change data while you are processing it)? Can you live with race conditions and deadlocks? Can you live with writing advanced code to compensate (your own versioning system, code to hide concurrent updates, inserts and deletes from queries, and handling of race conditions and deadlocks)?
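
To show what "writing advanced code to compensate" for missing isolation can look like, here is a minimal optimistic-versioning sketch in plain Java: each document carries a version number and a stale update is rejected, so the application retries instead of silently overwriting a concurrent change. This is a generic pattern sketch, not a feature of any specific NoSQL product.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Optimistic concurrency: reject a write if the document changed since it was read. */
public class OptimisticStore {

  /** Immutable document value plus the version it was written at. */
  record Versioned(long version, String json) {}

  private final Map<String, Versioned> docs = new ConcurrentHashMap<>();

  public Versioned read(String key) {
    return docs.get(key);
  }

  public void create(String key, String json) {
    docs.putIfAbsent(key, new Versioned(1, json));
  }

  /** Returns true only if no one else committed since expectedVersion was read. */
  public boolean update(String key, long expectedVersion, String newJson) {
    Versioned current = docs.get(key);
    if (current == null || current.version() != expectedVersion) {
      return false; // a concurrent writer won; the caller must re-read and retry
    }
    // replace() is atomic on ConcurrentHashMap, acting as our compare-and-set.
    return docs.replace(key, current, new Versioned(expectedVersion + 1, newJson));
  }

  public static void main(String[] args) {
    OptimisticStore store = new OptimisticStore();
    store.create("cust/1", "{\"email\":\"old@example.com\"}");

    Versioned snapshot = store.read("cust/1");  // reader A takes a snapshot at version 1
    store.update("cust/1", snapshot.version(), "{\"email\":\"b@example.com\"}"); // writer B commits first

    // Reader A's update is now stale and is rejected instead of overwriting B's change.
    boolean ok = store.update("cust/1", snapshot.version(), "{\"email\":\"a@example.com\"}");
    System.out.println("stale update accepted? " + ok);  // prints false
  }
}
```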

NoSQL Data Architecture & Best Practices Data View - How do you choose? Consistency: Do you need complete consistency? Not necessarily; instead, you may prefer: absolute fastest performance at the lowest hardware cost; highest global data availability at the lowest hardware cost; working with one document at a time; writing advanced code to create your own consistency model; eventually consistent data; some inconsistent data that can't be reconciled; some missing data that can't be recovered; some inconsistent query results.

NoSQL Data Architecture & Best Practices Data View - How do you choose? What do you need most? Highest performance for queries and transactions; highest data availability across multiple data centers; less data loss (e.g., durability); more query accuracy & fewer deadlocks (e.g., isolation); more data integrity (e.g., atomicity); less code to compensate for the lack of ACID compliance.

NoSQL Data Architecture & Best Practices Key Points: RDBMSs will always have an important place in our architecture. NoSQL implementations have a benefit to our future. Once you have a list of NoSQL databases that meet your modeling needs, choose the one that best meets your need for velocity and volume. It is not an all-in, one-or-the-other choice to make.
