CSE 232A Graduate Database Systems Arun Kumar Topic 7: Parallel RDBMSs and Dataflow Systems Chapters 22 of Cow Book Outline Parallel RDBMSs and Cloud-Native RDBMSs Beyond RDBMSs: A Brief History

Big Data Systems Parallel DBMSs: Motivation Scalability: Database is too large for a single nodes disk Performance: Exploit multiple cores/disks/nodes while maintaining almost all other benefits of (R)DBMSs! Three Paradigms of Parallelism Data/Partitioned Parallelism Contention Interconnect

Contention Interconnect Interconnect Shared-Disk Parallelism Shared-Memory Parallelism Symmetric MultiProcessing (SMP) Shared-Nothing Parallelism

Massively Parallel Processing (MPP) Shared-Nothing Parallelism Followed by almost all parallel RDBMSs (and Big Data sys.) 1 master node orchestrates multiple worker nodes Need partitioned parallel implementation algorithms for relational op implementations and query proc.; modify QO Q: If we give 10 workers (CPUs/nodes) for processing a query in parallel, will its runtime go down by a factor of 10? It depends! (Access patterns of the querys operators, communication of intermediate data, relative startup overhead, etc.) Shared-Nothing Parallelism

Runtime speedup (fixed data size) 12 8 4 1 Linear Speedup Runtime speedup 2 1 Sublinear Speedup

1 4 8 12 Number of workers Speedup plot / Strong scaling 0.5 Linear Scaleup Sublinear Scaleup 1

4 8 12 Factor (# workers, data size) Scaleup plot / Weak scaling Q: Is superlinear speedup/scaleup possible? Shared-Nothing Parallelism: Outline Data Partitioning Parallel Operator Implementations Parallel Query Optimization Parallel vs Distributed DBMSs Data Partitioning

A part of ETL (Extract-Transform-Load) for database Typically, record-wise/horizontal partitioning (aka sharding) Three common schemes (given k machines): Round-robin: assign tuple i to machine i MOD k Hashing-based: needs partitioning attribute(s) Range-based: needs ordinal partitioning attribute(s) Tradeoffs: Round-robin often inefficient for parallel query processing (why?); range-based good for range queries but faces new kind of skew; hashing-based is most common Replication often used for more availability, performance Parallel Scans and Select Intra-operator parallelism is our primary focus Inter-operator and inter-query parallelism also possible! Filescan:

Trivial! Worker simply scans its partition and streams it Apply selection predicate (if any) Indexed: Depends on data partitioning scheme and predicate! Same tradeoffs: Hash index vs B+ Tree index Each worker can have its own (sub-)index Master routes query based on matching workers Parallel Sorting Naive algorithm: (1) Each worker sorts local partition (EMS); (2) Master merges all locally sorted runs Issue: Parallelism is limited during merging phase! Faster algorithm: (1) Scan in parallel and range partition data (most likely a repartitioning) based on SortKey; (2) Each worker sorts local allotted range (EMS); result is globally

sorted and conveniently range-partitioned Potential Issue: Skew in range partitions; handled by roughly estimating distribution using sampling Parallel Sorting Original Partitions Assign SortKey Master Range splits Worker 1 Range-partitioned Globally Sorted

Master Master Worker 1 1 to V 1 2

Worker 2 V2 2 V V1 V to to

V Worker 2 Worker 2 3 Worker n to

V V 2 3 Worker n Worker n V

n to V n1 R epa rti t

V2 V 2 to to V V3

Lo EM ca Worker 1 S l V n to V n1

V n to V n1 Parallel Aggregates and Group By Without Group By List: Trivial for MAX, MIN, COUNT, SUM, AVG (why?) MEDIAN requires parallel sorting (why?) With Group By List:

1. If AggFunc allows, pre-compute partial aggregates 2. Master assigns each worker a set of groups (hash partition) 3. Each worker communicates its partial aggregate for a group to that groups assigned worker (aka shuffle) 4. Each worker finishes aggregating for all its assigned groups Parallel Group By Aggregate Original Partitions Partial Aggs Master

Master l ca Y Lo pB r G Worker 1 Assign GroupingList Hash splits Worker 1

Master G 1 Worker n G 2

Worker 2 G 2 Worker n R G en pa

2 Worker n Worker 2 1 G G 1

Worker 2 Lo Master G c al Ag rpB Worker 1 ai Y n Worker 1

G Worker 2 Final Aggs Re-partitioned Partial Aggs Worker n G n

G n Parallel Project Non-deduplicating Project: Trivial! Pipelined with Scans/Select Deduplicating Project: Each worker deduplicates its partition on ProjectionList If estimated output size is small (catalog?), workers communicate their results to master to finish dedup. If estimated output size is too large for masters disk, similar algorithm as Parallel Aggregate with Group By, except, there is no AggFunc computation

Parallel Nested Loops Join Given two tables A and B and JoinAttribute for equi-join 1. Master assigns range/hash splits on JoinAttribute to workers 2. Repartitioning of A and B separately using same splits on JoinAttribute (unless pre-partitioned on it!) 3. Worker i applies BNLJ locally on its partitions Ai and Bi 4. Overall join output is just collection of all n worker outputs If join is not equi-join, there might be a lot of communication between workers; worst-case: all-to-all for cross-product! Parallel Split and Merge for Joins Repartitioning quite common for parallel (equi-)joins Functionality abstracted as two new physical operators: Split: each worker sends a subset of its partition to another worker based on masters command (hash/range)

Merge: each worker unions subsets sent to it by others and constructs its assigned (re)partitioned subset Useful for parallel BNLJ, Sort-Merge Join, and Hash Join Parallel Sort-Merge and Hash Join For SMJ, split is on ranges of (ordinal) JoinAttribute; for HJ, split is on hash function over JoinAttribute Worker i does local join of Ai and Bi using SMJ or HJ Improved Parallel Hash Join 2-phase parallel HJ to improve performance Idea: Previous version hash partitions JoinAttribute to n (same as # workers); instead, decouple the two and do a 2stage process: partition phase and join phase Partition Phase: Say |A| < |B|; divide A and B into k (can be

> n) partitions using h1() s.t. each F x |Ai| < Cluster RAM Join Phase: Repartition an Ai into n partitions using h2(); build hash table on new Aij at worker j as tuples arrive; repartition Bi using h2(); local HJ of Aij and Bij on worker j in parallel for j = 1 to n; repeat all these steps for each i = 1 to k Uses all n workers for join of each subset pair Parallel Query Optimization Far more complex than single-node QO! I/O cost, CPU cost, and communication cost for each phy. op. Space of PQPs explodes: each node can have its own different local sub-plan (e.g., filescan v indexed) Pipeline parallelism and partitioned parallelism can be interleaved in complex ways! Join order enumeration affected: bushy trees can be good!

(we will skip more details) Parallel vs Distributed RDBMSs A parallel RDBMS layers distribution atop the file system Can handle dozens of nodes (Gamma, Teradata, etc.) Raghus distributed: collection of independent DBMSs Quirk of terminology; federated more accurate term Each base RDBMS can be at a different location! Each RDBMS might host a subset of the database files Might need to ship entire files for distributed QP (we will skip more details) These days: Polystores, federated DBMSs on steroids! Cloud Computing Compute, storage, memory, networking are virtualized and

exist on remote servers; rented by application users Manageability: Managing hardware is not user's problem! Pay-as-you-go: Fine-grained pricing economics based on actual usage (granularity: seconds to years!) Elasticity: Can dynamically add or reduce capacity based on actual workloads demand Infrastructure-as-a-Service (IaaS); Platform-as-a-Service (PaaS); Software-as-a-Service (SaaS) Cloud Computing How to redesign a parallel RDBMS to best exploit the clouds capabilities? Evolution of Cloud Infrastructure

Data Center: Physical space from which a cloud is operated 3 generations of data centers/clouds: Cloud 1.0 (Past): Networked servers; user rents/timesliced access to servers needed for data/software Cloud 2.0 (Current): Virtualization of networked servers; user rents amount of resource capacity; cloud provider has a lot more flexibility on provisioning (multi-tenancy, load balancing, more elasticity, etc.) Cloud 3.0 (Ongoing Research): Serverless and disaggregated resources all connected to fast networks Revisiting Parallelism in the Cloud Interconnect Interconnect

Shared-Disk Parallelism Networks have become much faster: 100GbE to even TbE! Such bundling could under-utilize some resources! Shared-Nothing Parallelism How to exploit clouds virtualization of compute, memory, and storage resources to improve speed and utilization?

Revisiting Parallelism in the Cloud The promise of full serverless / resource disaggregation: All resources (compute, memory, storage) are networkattached and can be elastically added/removed Need more memory for some phy. ops.! Interconnect Need more CPUs to better parallelize aggregates! How to fulfill the promise with minimal added latency?

Cloud-Native Parallel RDBMSs Not just running a regular parallel RDBMS on IaaS! Need to revisit, redesign, and reimplemented storage subsystem, memory management, query processing and optimization, transaction management, and more! Higher levels (data model, SQL, parser, etc.) preserved Cloud providers, traditional database companies, startups Key Example: Regular MPP (shared-nothing style) Heterogeneous and elastic compute capacities Wide variety of storage formats

Spectrum supports adhoc remote reads from S3 vs local storage Key Example: Shared-disk style + elastic compute Each virtual warehouse is an independent MPP compute cluster Compressed columnar format Key Example:

Serverless! Remote reads from S3 Schema-on-read ETL not needed Many data formats Simple interactive queries Federated possible Outline

Parallel RDBMSs and Cloud-Native RDBMSs Beyond RDBMSs: A Brief History Big Data Systems Beyond RDBMSs: A Brief History Relational model and RDBMSs are too restrictive: 1. Flat tables with few data/attribute types Object-Relational DBMSs: UDT, UDFs, text, multimedia, etc. 2. Restricted language interface (SQL)

PL/SQL; recursive SQL; embedded SQL; QBE; visual interfaces 3. Need to know schema first! Schema-later semi-structured XML data model; XQuery 4. Optimized for static dataset Stream data model; standing queries; time windows But the DB community has been addressing these issues! Advertisement: Take CSE 135 and CSE 232B

to learn more! So, why did people still need to look beyond RDBMSs? Beyond RDBMSs: A Brief History The DB community got blindsided by the unstoppable rise of the Web/Internet giants! DB folks underappreciated 4 key concerns of Web folks: Developability Fault Tolerance Elasticity Cost/Politics!

DB/Enterprise vs. Web Dichotomy DB folks underappreciated 4 key concerns of Web folks: Developability: RDBMS extensibility mechanisms (UDTs, UDFs, etc.) are too painful to use for programmers! DB companies: we write the software and sell to our customers, viz., enterprise companies (banks, retail, etc.) Web companies: we will hire an army of software engineers to build own in-house software systems! Need simpler APIs and DBMSs that scale custom programs DB/Enterprise vs. Web Dichotomy DB folks underappreciated 4 key concerns of Web folks: Fault Tolerance: What if we run on 100Ks of machines?! DB companies: our customers do not need more than a

few dozen machines to store and analyze their data! Web companies: we need hundreds of thousands of machines for planetary-scale Web services! If a machine fails, user should not have to rerun entire query! DBMS should take care of fault tolerance, not user/appl. (Cloud-native RDBMSs now offer fault tolerance by design) DB/Enterprise vs. Web Dichotomy DB folks underappreciated 4 key concerns of Web folks: Elasticity: Resources should adapt to query workload DB companies: our customers have fairly predictably sized datasets and workloads; can fix their clusters! Web companies: our workloads could vary widely and the datasets they need vary widely! Need to be able to upsize and downsize clusters easily

on-the-fly, based on current query workload DB/Enterprise vs. Web Dichotomy DB folks underappreciated 4 key concerns of Web folks: Cost/Politics: Commercial RDBMS licenses too costly! DB companies: our customers have $$$! Web companies: our products are mostly free (ads?); why pay so much $$$ if we can build our own DBMSs? Many started with MySQL (!) but then built their own DBMSs New tools were free & open source; led to viral adoption! Cool, so, these new systems jolted the DB folks from being smug and complacent! But what is Big Data?

Big Data Marketing term; think Big as in Big Oil, not big building Wikipedia says: Data that is so large and complex that existing toolkits [read RDBMSs!] are not adequate [hah!] Typical characterization by 3 Vs: Volume: larger-than-RAM; >= TBs, even Exabytes! Variety: relations, webpages, docs, tweets, multimedia, etc. Velocity: high generation rate, e.g., sensors, surveillance, etc. Why Big Data now? 1. Applications New data-driven mentality in almost all applications: Web: search, e-commerce, e-mails, social media Science: satellite imagery, CERNs LHC, document corpora Medicine: pharmacogenomics, precision medicine Logistics: sensors, GPS, Internet of Things

Finance: high-throughput trading, monitoring Humanities: digitized books/literature, social media Governance: e-voting, targeted campaigns, NSA Why Big Data now? 2. Storage Outline Parallel RDBMSs and Cloud-Native RDBMSs Beyond RDBMSs: A Brief History

Big Data Systems The MapReduce/Hadoop Craze Spark and Other Dataflow Systems Key-Value NoSQL Systems

Graph Processing Systems Advanced Analytics/ML Systems The MapReduce/Hadoop Craze Blame Google! Simple problem: index, store, and search the Web! Who were their major systems hires? Jeff Dean and Sanjay Ghemawat (Systems, not DB or IR) Why did they not use RDBMSs? (Haha.)

Developability, data model, fault tolerance, scale, cost, Engineers started with MySQL; abandoned it! What is MapReduce? MapReduce: Simplified Data Processing on Large Clusters. In OSDI 2004. Programming model for writing data-parallel programs + distributed system architecture for processing large data Map and Reduce are terms/ideas from functional PL Engineer only implements the logic of Map and Reduce Libraries in Java, C++, etc. handle orchestration of data distribution, parallelization, etc. under the covers Was radically easier for engineers to write programs with! What is MapReduce?

Standard example: count word occurrences in a doc corpus Input: A set of text documents (say, webpages) Output: A dictionary of unique words and their counts function map sounds (String docname, String doctext) Hmmm, suspiciously familiar : for each word w in doctext : emit (w, 1) Part of MapReduce API

function reduce (String word, Iterator partialCounts) : sum = 0 for each pc in partialCounts : sum += pc emit (word, sum) How MapReduce Works Parallel flow of control and data upon running the MapReduce program: Each Mapper and Reducer is a separate process; parallel! Fault tolerance achieved using data replication SQL Strikes Back! Q: How would you do the word counting in RDBMS / in SQL?

First step: Transform text docs into relations and load (how?) Part of a stage called Extract-Transform-Load (ETL) Suppose we pre-divide each document into words and have the schema: DocWords (DocName, Word) Second step: a single, simple SQL query! SELECT Word, COUNT (*) FROM DocWords Parallelism, scaling, etc. done GROUP BY Word by RDBMS under the covers ORDER BY Word What is Hadoop then? Open-source impl. Of Googles ideas; includes MapReduce

model and a distributed file system (HDFS) Summary: User writes logic of Map and Reduce functions in API; input splitting, data distribution, shuffling, fault tolerance, etc. all handled by the Hadoop library under the covers Exploded in popularity! 100s of papers, 10s of products! A real revolution in scalable data processing that took the DB community by surprise! A Spectacular War of the Worlds No declarativity! Filescan-based! DeWitts work on parallel DBMSs! Cheap rip-off of RDBMSs!

Young Turks vs. Old Guard? Swift and scathing rebuttal from MapReduce/Hadoop world! DBMSs too high-level/hard to use for low-level text ETL Meant for offline fault-tolerant workloads on cheap nodes Google awarded a patent for MapReduce (ahem)! MapReduce/Hadoop not meant to be an RDBMS replacement Enter Hybrid Systems! Clever DB researches: Lets get the best of both worlds! Numerous projects on hybrid systems in industry/academia: Programming model-level: Bring declarativity from RDBMS world to MapReduce/Hadoop world Dataflow language

over Hadoop SQL dialect over Hadoop Systems-level: Intermix system implementation ideas HadoopDB Microsoft from Yale U. Polybase Big Data Systems

Parallel RDBMSs and Cloud-Native RDBMSs Beyond RDBMSs: A Brief History Big Data Systems The MapReduce/Hadoop Craze

Spark and Other Dataflow Systems Key-Value NoSQL Systems Graph Processing Systems Advanced Analytics/ML Systems

Spark from UC Berkeley Extended dataflow programming model (subsumes most of RA; MapReduce); system (re)designed from ground up Agenda: Unified system to handle relations, text, etc.; support more general distributed data processing Tons of sponsors, gazillion bucks, unbelievable hype! Key idea: exploit distributed memory to cache data Key novelty: lineage-based fault tolerance, not replication Open-sourced to Apache; commercialized as Databricks What does Spark have? Resilient Distributed Datasets: A Fault-tolerant Abstraction for In-memory Cluster Computing. In NSDI 2012

Word Count Example in Spark Spark has libraries for Python, Scala, and Java SparkSQL offers an SQL-like front-end Spark-based Ecosystem of Tools The Berkeley Data Analytics Stack (BDAS) How does Spark work? Resilient Distributed Datasets: A Fault-tolerant Abstraction for In-memory Cluster Computing. In NSDI 2012 Databricks is basically building yet another parallel RDBMS!

Reinventing the Wheel? Other Dataflow Systems Stratosphere/Apache Flink from TU Berlin Myria from U Washington AsterixDB from UC Irvine Azure Data Lakes from Microsoft Building such Big Data systems is (was?) one of the hottest topics in both industry and academia My bias: Lot of system building; not sure of research novelty References and More Material MapReduce/Hadoop: MapReduce: Simplified Data Processing on Large Clusters.

Jeffrey Dean and Sanjay Ghemawat. In OSDI 2004. More Examples: http://bit.ly/2rl86bG, http://bit.ly/2rkSRj8 Online Tutorial: http://bit.ly/2rS2B5j Spark: Resilient Distributed Datasets: A Fault-tolerant Abstraction for In-memory Cluster Computing. Matei Zaharia and others. In NSDI 2012. More Examples: http://bit.ly/2rhkhEp, http://bit.ly/2rkT8Tc Online Tutorial: http://bit.ly/2r8lW0S Outline Parallel RDBMSs and Cloud-Native RDBMSs

Beyond RDBMSs: A Brief History Big Data Systems The MapReduce/Hadoop Craze Spark and Other Dataflow Systems

Key-Value NoSQL Systems Graph Processing Systems Advanced Analytics/ML Systems Key-Value NoSQL Systems Simple API: get and put unique records very quickly! Records usually uniquely identified by a key; information

in record is the value (could be general JSON object) Used extensively by Web companies: get user profile or product record quickly and update info, Facebook status, etc. High availability, high scalability, eventual consistency Idea: Discard ACID and 30+ years of DB lessons; use BASE (Basically Available, Soft state, and Eventually consistent) The new RDBMS-hating movement was christened NoSQL Key-Value NoSQL Systems Also called transactional NoSQL (read-write) Hadoop / Spark aka analytical NoSQL (read mostly)

Key-Value NoSQL Systems Recent work on relaxed consistency models with guarantees in between full ACID and fuzzy best-effort BASE/Eventual 5 consistency levels of Microsoft Azure CosmosDB (a geodistributed cloudnative DBMS) My bias: Key area of research at the intersection of DB & distributed systems! Advertisement: Take CSE 223B to learn more! Outline Parallel RDBMSs and Cloud-Native RDBMSs

Beyond RDBMSs: A Brief History Big Data Systems The MapReduce/Hadoop Craze Spark and Other Dataflow Systems

Key-Value NoSQL Systems Graph Processing Systems Advanced Analytics/ML Systems Graph Processing Systems Not a workload DB folks used to care much about

Specialized graph systems have been around for years (Neo4j), but more popular now (Facebook, LinkedIn, etc.) Data Model: set of nodes, and set of (multi-)edges Ops/queries: nearest neighbors, shortest path, connectivity, density, cliques, etc. Graph Processing Systems Can be handled as an application on an RDBMS, but might be inefficient transitive closure, repeated self-joins, etc. Graph Processing Systems Advertisement: Graph Analytics course on Coursera by UCSD

Recently Viewed Presentations

  • Agriculture Scientist - Sam Houston State University

    Agriculture Scientist - Sam Houston State University

    The cotton gin was a groundbreaking invention in the southern United States, and it had an enormous positive impact on the economies of the southern states in the U.S. The cotton gin also had an equally important impact on slavery...
  • BCH 300 LECTURE NOTES - WordPress.com

    BCH 300 LECTURE NOTES - WordPress.com

    The homopolysaccharides starch and glycogen are stored fuels in plant, animal, and bacterial cells. They consist of D-glucose with linkages, and all three contain some branches. The homopolysaccharides cellulose, chitin, and dextran serve structural roles.
  • The New Jersey Department of Human Services Division of ...

    The New Jersey Department of Human Services Division of ...

    SHC issues rental subsidies in partnership with the Division of Developmental Disabilities and the Division of Mental Health and Addiction Services . DDD has identified potential recipients of these vouchers. This webinar will provide an overview of the voucher process...
  • San Jose Announcements - ACN Inc.

    San Jose Announcements - ACN Inc.

    vitesse résidentiel d'ACN + Voix Territoires de Telus: Ouest canadien (Alberta et Colombie-Britannique) Les clients qui décident de modifier leur vitesse de connexion pourraient recevoir un nouveau modem. Dans le cas où un nouveau modem serait envoyé, l'ancien modem devra...
  • Install Additional Domain Controllers

    Install Additional Domain Controllers

    What is the purpose of the directory partition? Which partition types can be included in the directory partition? Once Active Directory has been installed, what must you do to make that server a domain controller? To remove Active Directory from...
  • The electromagnetic (EM) field serves as a model

    The electromagnetic (EM) field serves as a model

    Notice that for the EM field, we started withthe E and B fields -and showed that the relativistic "field" was a superposition of an infinite number of individual "plane wave" particles, with momentum k .The second quantization fell out naturally....
  • Speed Test - Transum

    Speed Test - Transum

    Speed Test You will see 20 questions ... What number did I think of ? 18. How many cm are there in 3.2 m? 19. Find the volume of a cuboid which measures 7 cm by 10 cm by 2...
  • Ch 1 Basic Imaging Principles - UCO: Department of ...

    Ch 1 Basic Imaging Principles - UCO: Department of ...

    This is a line whose unite normal is oriented at an angle . θ. relative to the x-axis and is at distance . l. from the origin in the direction of the unit normal. The . line impulse . ???,...