Developing with Apache Spark - UC Berkeley AMP Camp

Introduction to Apache Spark
Patrick Wendell - Databricks

What is Spark?
Fast and expressive cluster computing engine, compatible with Apache Hadoop:
- Efficient: general execution graphs; in-memory storage (up to 10x faster than Hadoop on disk, 100x in memory)
- Usable: rich APIs in Java, Scala, Python (2-5x less code); interactive shell

The Spark Community
+You!

Today's Talk
- The Spark programming model
- Language and deployment choices
- Example algorithm (PageRank)

Spark Programming Model

Key Concept: RDDs
Write programs in terms of operations on distributed datasets.
Resilient Distributed Datasets (RDDs):
- Collections of objects spread across a cluster, stored in RAM or on disk
- Built through parallel transformations
- Automatically rebuilt on failure

Operations
- Transformations (e.g. map, filter, groupBy)
- Actions (e.g. count, collect, save)

Example: Log Mining
Load error messages from a log into memory, then interactively search for various patterns.

lines = spark.textFile("hdfs://...")                     # Base RDD
errors = lines.filter(lambda s: s.startswith("ERROR"))   # Transformed RDD
messages = errors.map(lambda s: s.split("\t")[2])
messages.cache()

messages.filter(lambda s: "mysql" in s).count()          # Action
messages.filter(lambda s: "php" in s).count()

[Diagram: the driver ships tasks to three workers; each worker reads one block of the file (Block 1-3) and caches its partition of messages (Cache 1-3)]

Result: full-text search of 60 GB of Wikipedia on 20 EC2 machines in 0.5 sec (vs. 20 sec for on-disk data).
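
As a hedged aside, here is a minimal self-contained version of the same pattern that runs in local mode; the file name errors.log and its tab-separated layout are hypothetical stand-ins for the HDFS path above.

from pyspark import SparkContext

if __name__ == "__main__":
    sc = SparkContext("local[2]", "LogMining")

    lines = sc.textFile("errors.log")                       # base RDD
    errors = lines.filter(lambda s: s.startswith("ERROR"))  # lazy transformation
    messages = errors.map(lambda s: s.split("\t")[2])       # third tab-separated field
    messages.cache()  # keep in memory once the first action computes it

    # The first count materializes (and caches) messages; the second reuses the cache.
    print(messages.filter(lambda s: "mysql" in s).count())
    print(messages.filter(lambda s: "php" in s).count())

    sc.stop()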

Scaling Down
[Chart: execution time (s) vs. % of working set in cache - about 69 s with nothing cached, falling through 58, 41, and 30 s to about 12 s when fully cached]

Fault Recovery
RDDs track lineage information that can be used to efficiently recompute lost data.

msgs = textFile.filter(lambda s: s.startswith("ERROR")) \
               .map(lambda s: s.split("\t")[2])

[Diagram: HDFS File -> filter (func = startswith(...)) -> Filtered RDD -> map (func = split(...)) -> Mapped RDD]

Programming with RDDs

SparkContext
- Main entry point to Spark functionality
- Available in the shell as the variable sc
- In standalone programs, you'd make your own (see later for details)

Creating RDDs

# Turn a Python collection into an RDD
> sc.parallelize([1, 2, 3])

# Load text file from local FS, HDFS, or S3
> sc.textFile("file.txt")
> sc.textFile("directory/*.txt")
> sc.textFile("hdfs://namenode:9000/path/file")

# Use existing Hadoop InputFormat (Java/Scala only)
> sc.hadoopFile(keyClass, valClass, inputFmt, conf)

Basic Transformations
> nums = sc.parallelize([1, 2, 3])

# Pass each element through a function
> squares = nums.map(lambda x: x * x)           # => {1, 4, 9}

# Keep elements passing a predicate
> even = squares.filter(lambda x: x % 2 == 0)   # => {4}

# Map each element to zero or more others
> nums.flatMap(lambda x: range(x))              # => {0, 0, 1, 0, 1, 2}
# (range(x) is a sequence of numbers 0, 1, ..., x-1)

Basic Actions
> nums = sc.parallelize([1, 2, 3])

# Retrieve RDD contents as a local collection
> nums.collect()  # => [1, 2, 3]

# Return first K elements
> nums.take(2)    # => [1, 2]

# Count number of elements
> nums.count()  # => 3

# Merge elements with an associative function
> nums.reduce(lambda x, y: x + y)  # => 6

# Write elements to a text file
> nums.saveAsTextFile("hdfs://file.txt")

Working with Key-Value Pairs
Spark's "distributed reduce" transformations operate on RDDs of key-value pairs.

Python: pair = ("a", "b")
        pair[0]  # => a
        pair[1]  # => b

Scala:  val pair = ("a", "b")
        pair._1  // => a
        pair._2  // => b

Java:   Tuple2 pair = new Tuple2("a", "b");
        pair._1  // => a

Some Key-Value Operations
> pets = sc.parallelize(
    [("cat", 1), ("dog", 1), ("cat", 2)])
> pets.reduceByKey(lambda x, y: x + y)

# => {(cat, 3), (dog, 1)}

> pets.groupByKey()
# => {(cat, [1, 2]), (dog, [1])}

> pets.sortByKey()
# => {(cat, 1), (cat, 2), (dog, 1)}

reduceByKey also automatically implements combiners on the map side.

Example: Word Count
> lines = sc.textFile("hamlet.txt")
> counts = lines.flatMap(lambda line: line.split(" ")) \
                .map(lambda word: (word, 1)) \
                .reduceByKey(lambda x, y: x + y)

Input:        "to be or" / "not to be"
flatMap:      to, be, or, not, to, be
map:          (to, 1) (be, 1) (or, 1) (not, 1) (to, 1) (be, 1)
reduceByKey:  (be, 2) (not, 1) (or, 1) (to, 2)
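
To see why that map-side combining matters, here is a small comparison sketch on the word-count data, assuming a local PySpark shell where sc already exists:

words = sc.parallelize(["to", "be", "or", "not", "to", "be"]) \
          .map(lambda w: (w, 1))

# reduceByKey pre-aggregates within each partition before the shuffle,
# so only one (word, partial_sum) record per key crosses the network
by_reduce = words.reduceByKey(lambda x, y: x + y)

# groupByKey ships every (word, 1) pair across the shuffle, then sums
by_group = words.groupByKey().mapValues(sum)

sorted(by_reduce.collect())  # [('be', 2), ('not', 1), ('or', 1), ('to', 2)]
sorted(by_group.collect())   # same result, but more shuffle traffic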

Other Key-Value Operations
> visits = sc.parallelize([ ("index.html", "1.2.3.4"),
                            ("about.html", "3.4.5.6"),
                            ("index.html", "1.3.3.1") ])
> pageNames = sc.parallelize([ ("index.html", "Home"),
                               ("about.html", "About") ])

> visits.join(pageNames)
# ("index.html", ("1.2.3.4", "Home"))
# ("index.html", ("1.3.3.1", "Home"))
# ("about.html", ("3.4.5.6", "About"))

> visits.cogroup(pageNames)
# ("index.html", (["1.2.3.4", "1.3.3.1"], ["Home"]))

# ("about.html", (["3.4.5.6"], ["About"]))

Setting the Level of Parallelism
All the pair RDD operations take an optional second parameter for the number of tasks:

> words.reduceByKey(lambda x, y: x + y, 5)
> words.groupByKey(5)
> visits.join(pageNames, 5)

Using Local Variables
Any external variables you use in a closure will automatically be shipped to the cluster:

> query = sys.stdin.readline()
> pages.filter(lambda x: query in x).count()

Some caveats:

- Each task gets a new copy (updates aren't sent back)
- Variable must be Serializable / Pickle-able
- Don't use fields of an outer object (ships all of it!)
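
To make the first caveat concrete, a minimal local-mode sketch (the variable and function names are made up): the captured variable is copied into each task, so worker-side updates never reach the driver.

from pyspark import SparkContext

sc = SparkContext("local[2]", "ClosureCaveat")
counter = 0

def bump(x):
    global counter
    counter += 1      # increments the task's copy in a worker process
    return x

sc.parallelize(range(10)).map(bump).count()
print(counter)        # still 0 on the driver: the updates weren't sent back
sc.stop()

(Spark's accumulators exist for exactly this send-updates-back case, though they are not covered in these slides.)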

Under The Hood: DAG Scheduler
- General task graphs
- Automatically pipelines functions
- Data locality aware
- Partitioning aware to avoid shuffles

[Diagram: a task graph over RDDs A-F, cut by groupBy, map, filter, and join into Stages 1-3; legend: box = RDD, shaded box = cached partition]

More RDD Operators
map, filter, groupBy, sort, union, join, leftOuterJoin, rightOuterJoin,
reduce, count, fold, reduceByKey, groupByKey, cogroup, cross, zip,
sample, take, first, partitionBy, mapWith, pipe, save, ...
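
One way to peek at what the DAG scheduler built, assuming a PySpark shell with sc defined: toDebugString() prints an RDD's lineage, and the indentation marks the shuffle (stage) boundaries.

pairs = sc.parallelize(range(100)).map(lambda x: (x % 10, x))
result = pairs.groupByKey().mapValues(list)

# The map is pipelined into a single stage; groupByKey introduces a
# shuffle boundary, so the printout shows a second stage.
print(result.toDebugString())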

How to Run Spark

Language Support

Python:
lines = sc.textFile(...)
lines.filter(lambda s: "ERROR" in s).count()

Scala:
val lines = sc.textFile(...)
lines.filter(x => x.contains("ERROR")).count()

Java:
JavaRDD<String> lines = sc.textFile(...);
lines.filter(new Function<String, Boolean>() {
  Boolean call(String s) {
    return s.contains("error");
  }
}).count();

- Standalone programs: Python, Scala, & Java
- Interactive shells: Python & Scala
- Performance: Java & Scala are faster due to static typing, but Python is often fine

Interactive Shell
- The fastest way to learn Spark
- Available in Python and Scala
- Runs as an application on an existing Spark cluster, OR can run locally

Standalone Application (Python)
import sys
from pyspark import SparkContext

if __name__ == "__main__":
    sc = SparkContext("local", "WordCount", sys.argv[0], None)
    lines = sc.textFile(sys.argv[1])
    counts = lines.flatMap(lambda s: s.split(" ")) \
                  .map(lambda word: (word, 1)) \
                  .reduceByKey(lambda x, y: x + y)
    counts.saveAsTextFile(sys.argv[2])

Create a SparkContext
The constructor takes: a cluster URL (or local / local[N]), an app name, the path to the Spark install on the cluster, and a list of JARs with your app code (to ship).

Scala:
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._

val sc = new SparkContext("url", "name", "sparkHome", Seq("app.jar"))

Java:
import org.apache.spark.api.java.JavaSparkContext;

JavaSparkContext sc = new JavaSparkContext(
  "masterUrl", "name", "sparkHome", new String[] {"app.jar"});

Python:
from pyspark import SparkContext

sc = SparkContext("masterUrl", "name", "sparkHome", ["library.py"])

Add Spark to Your Project
Scala / Java: add a Maven dependency on:

  groupId:    org.spark-project
  artifactId: spark-core_2.10
  version:    0.9.0

Python: run your program with our pyspark script
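
In pom.xml terms, that dependency would look roughly like this sketch (standard Maven syntax, using exactly the coordinates above):

<dependency>
  <groupId>org.spark-project</groupId>
  <artifactId>spark-core_2.10</artifactId>
  <version>0.9.0</version>
</dependency>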

Administrative GUIs
http://<cluster master>:8080 (by default)

Software Components
- Spark runs as a library in your program (one instance per app)
- Runs tasks locally or on a cluster: Mesos, YARN, or standalone mode
- Accesses storage systems via the Hadoop InputFormat API; can use HBase, HDFS, S3, ...

[Diagram: your application (SparkContext + local threads) talks to a cluster manager, which starts Spark executors on workers; executors read from HDFS or other storage]

EXAMPLE APPLICATION: PAGERANK

Example: PageRank
- A good example of a more complex algorithm: multiple stages of map & reduce
- Benefits from Spark's in-memory caching: multiple iterations over the same data

Basic Idea
Give pages ranks (scores) based on links to them:
- Links from many pages => high rank
- Link from a high-rank page => high rank

[Image: en.wikipedia.org/wiki/File:PageRank-hi-res-2.png]

Algorithm
1. Start each page at a rank of 1
2. On each iteration, have page p contribute rank_p / |neighbors_p| to its neighbors
3. Set each page's rank to 0.15 + 0.85 * contribs

[Diagram: the algorithm animated on a four-page example; all ranks start at 1.0, pass through intermediate values such as 1.85, 0.58, 1.31, and 1.72, and reach a final state of 1.44, 1.37, 0.46, and 0.73]

Scala Implementation
val links = // load RDD of (url, neighbors) pairs
var ranks = // load RDD of (url, rank) pairs

for (i <- 1 to ITERATIONS) {
  val contribs = links.join(ranks).flatMap {
    case (url, (links, rank)) =>
      links.map(dest => (dest, rank / links.size))
  }
  ranks = contribs.reduceByKey(_ + _)
                  .mapValues(0.15 + 0.85 * _)
}
ranks.saveAsTextFile(...)
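
For comparison, a hedged Python translation of the same loop, runnable in local mode; the three-page link graph is made up for illustration:

from pyspark import SparkContext

ITERATIONS = 10
sc = SparkContext("local[2]", "PageRank")

# (url, [neighbors]) pairs; cached because every iteration re-joins against it
links = sc.parallelize([("a", ["b", "c"]), ("b", ["c"]), ("c", ["a"])]).cache()
ranks = links.mapValues(lambda _: 1.0)   # start every page at rank 1

for i in range(ITERATIONS):
    # Each page splits its rank evenly among its neighbors...
    contribs = links.join(ranks).flatMap(
        lambda url_nr: [(dest, url_nr[1][1] / len(url_nr[1][0]))
                        for dest in url_nr[1][0]])
    # ...and ranks are rebuilt from the summed contributions
    ranks = contribs.reduceByKey(lambda x, y: x + y) \
                    .mapValues(lambda s: 0.15 + 0.85 * s)

print(sorted(ranks.collect()))
sc.stop()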

PageRank Performance
[Chart: iteration time (s) vs. number of machines]
                  Hadoop   Spark
  30 machines       171      23
  60 machines        80      14

Other Iterative Algorithms

[Chart: time per iteration (s)]
                        Hadoop   Spark
  K-Means Clustering      155     4.1
  Logistic Regression     110    0.96

CONCLUSION

Conclusion
- Spark offers a rich API to make data analytics fast: both fast to write and fast to run
- Achieves 100x speedups in real applications
- Growing community with 25+ companies contributing

Get Started

Up and running in a few steps: Download -> Unzip -> Shell

Project resources:
- Examples on the project site
- Examples in the distribution
- Documentation

http://spark.incubator.apache.org
