大规模数据处理/云计算 Introduction - PKU

大规模数据处理/云计算 Introduction - PKU

/ Lecture 1 Introduction to MapReduce 7/9/2013 http://net.pku.edu.cn/~course/cs402/ Jimmy Lin University of Maryland SEWMGroup This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States See http://creativecommons.org/licenses/by-nc-sa/3.0/us/ for details What is this course about?

Data-intensive information processing Large-data (web-scale) problems Focus on MapReduce programming An entry-level course~ 2 What is MapReduce? Programming model for expressing distributed computations at a massive scale Execution framework for organizing and

performing such computations Open-source implementation called Hadoop 3 Why Large Data? How much data? Google processes 20 PB a day (2008) Wayback Machine has 3 PB + 100 TB/month (3/2009)

Facebook has 2.5 PB of user data + 15 TB/day (4/2009) eBay has 6.5 PB of user data + 50 TB/day (5/2009) CERNs LHC will generate 15 PB a year 640K ought to be enough for anybody. 5

6 Happening everywhere! microarray chips microprocessors Molecular biology (cancer) fiber optics Network traffic (spam) 300M/day

Simulations (Millennium) particle colliders Particle events (LHC) 1B 7 1M/sec 8 Maximilien Brice, CERN

9 Maximilien Brice, CERN 10 Maximilien Brice, CERN 11 Maximilien Brice, CERN No data like more data! s/knowledge/data/g; How do we get here if were not Google?

(Banko and Brill, ACL 2001) (Brants et al., EMNLP 2007) 12 Example: information extraction Answering factoid questions Pattern matching on the Web Works amazingly well Who shot Abraham Lincoln? X shot Abraham Lincoln

Learning relations Start with seed instances Search for patterns on the Web Using patterns to find more instances Wolfgang Amadeus Mozart (1756 - 1791) Einstein was born in 1879 Birthday-of(Mozart, 1756) Birthday-of(Einstein, 1879)

PERSON (DATE PERSON was born in DATE (Brill et al., TREC 2001; Lin, ACM TOIS 2007) (Agichtein and Gravano, DL 2000; Ravichandran and Hovy, ACL 2002; ) 13 Example: Scene Completion Hays, Efros (CMU), Scene Completion Using Millions of Photographs SIGGRAPH, 2007 Image Database Grouped by Semantic Content

30 different Flickr.com groups 2.3 M images total (396 GB). Select Candidate Images Most Suitable for Filling Hole

Classify images with gist scene detector [Torralba] Color similarity Local context matching Computation

Index images offline 50 min. scene matching, 20 min. local matching, 4 min. compositing Reduces to 5 minutes total by using 5 machines Extension Flickr.com has over 500 million images 14 14

More Data More Gains? CNNIC 2010 6 4.2 31.8% 4334 2.77 18.6% 1.4 30% 15 2009 2009 238868 145475 93393 37. 88

312.46 73.4 1.41 0.33 567.27 4.73 8.86% 11.24% 5.36% 4.53% 4.61% 8.94% 16 Did you know? 17 Did you know? We are currently preparing our students for

jobs that dont yet exist It is estimated that a weeks worth of the New York Times contains more information than a person was likely to come across in a lifetime in the 18th century The amount of new technical information is doubling every 2 years So what does IT ALL MEAN? 18 We are living in exponential times 19 Two Different Views

a thrower-awayer Jennifer Widom trying to live an efficient life so that one has time to work and be with ones family. MyLifeBits Gordon Bell 20

Information Overloading 21

What is Cloud Computing? The best thing since sliced bread? Before clouds Grids Vector supercomputers

Cloud computing means many different things: Large-data processing Rebranding of web 2.0 Utility computing Everything as a service 23 Rebranding of web 2.0

Rich, interactive web applications Clouds refer to the servers that run them AJAX as the de facto standard (for better or worse) Examples: Facebook, YouTube, Gmail, The network is the computer: take two

User data is stored in the clouds Rise of the netbook, smartphones, etc. Browser is the OS 24 Source: Wikipedia (Electricity meter) Utility Computing What?

Why? Computing resources as a metered service (pay as you go) Ability to dynamically provision virtual machines Cost: capital vs. operating expenses Scalability: infinite capacity Elasticity: scale up or down on demand Does it make sense?

Benefits to cloud users Business case for cloud providers I think there is a world market for about five computers. 26 Everything as a Service Utility computing = Infrastructure as a Service (IaaS)

Platform as a Service (PaaS) Why buy machines when you can rent cycles? Examples: Amazons EC2, Rackspace Give me nice API and take care of the maintenance, upgrades, Example: Google App Engine

Software as a Service (SaaS) Just run it for me! Example: Gmail, Salesforce 27 Utility Computing pay-as-you-go Microsoft pay less

utility computing 500 use less pay less cloud computing 28 Platform as a Service (PaaS) Web Application Services PaaS Internet Multi-tenant architecture platform 29

Software as a Service (SaaS) a model of software deployment whereby a provider licenses an application to customers for use as a service on demand. 30 Who cares? Ready-made large-data problems

Lots of user-generated content Even more user behavior data Examples: Facebook friend suggestions, Google ad placement Business intelligence: gather everything in a data warehouse and run analytics to generate insight Utility computing

Provision Hadoop clusters on-demand in the cloud Lower barrier to entry for tackling large-data problem Commoditization and democratization of large-data capabilities 31 Story around Hadoop Google-IBM Cloud Computing Initiative 2007 10 Google IBM 6

MapReduce Hadoop 200 0 2500 33 Cloud Computing Initiative Google and IBM team on cloud computing initiative for universities(2007) provide several hundred computers access through the Internet to test parallel programming projects The idea for the program from

Google senior software engineer Christophe Bisciglia Google Code University 34 The Information Factories Googleplex(pre-2008) servers number 450,000, according to the lowest estimate 200 petabytes of hard disk storage four petabytes of RAM To handle the current load of 100 million queries a day, input-output bandwidth must be

in the neighborhood of 3 petabits per second 35 Google Infrastructure 2003 "The Google file system," in sosp. Bolton Landing, NY, USA: ACM Press, 2003. 2004 "MapReduce: Simplified Data Processing on Large Clusters," in osdi, 2004, 2006 "Bigtable: A Distributed Storage System for Structured Data (Awarded Best Paper!)," in osdi, 2006 . 36

Hadoop Project Doug Cutting 37 History of Hadoop

2004 - Initial versions of what is now Hadoop Distributed File System and Map-Reduce implemented by Doug Cutting & Mike Cafarella December 2005 - Nutch ported to the new framework. Hadoop runs reliably on 20 nodes. January 2006 - Doug Cutting joins Yahoo! February 2006 - Apache Hadoop project official started to support the standalone development of Map-Reduce and HDFS. March 2006 - Formation of the Yahoo! Hadoop team May 2006 - Yahoo sets up a Hadoop research cluster - 300 nodes

April 2006 - Sort benchmark run on 188 nodes in 47.9 hours May 2006 - Sort benchmark run on 500 nodes in 42 hours (better hardware than April benchmark) October 2006 - Research cluster reaches 600 Nodes December 2006 - Sort times 20 nodes in 1.8 hrs, 100 nodes in 3.3 hrs, 500 nodes in 5.2 hrs, 900 nodes in 7.8 January 2007 - Research cluster reaches 900 node April 2007 - Research clusters - 2 clusters of 1000 nodes April 2008: Won the 1-terabyte sort benchmark in 209 seconds on 900 nodes. October 2008: Loading 10 terabytes of data per day onto research clusters. March 2009: 17 clusters with a total of 24,000 nodes. April 2009: Won the minute sort by sorting 500 GB in 59 seconds (on 1,400 nodes) and the 100terabyte sort in 173 minutes (on 3,400 nodes). 38 Google Code University 2008, Seminar: Mass Data

Processing Technology on Large Scale Clusters, Tsinghua University Aaron Kimball 39 Startup: Cloudera Cloudera is pushing a commercial distribution for Hadoop Mike Olson Christophe Bisciglia Doug Cutting Aaron Kimball Tom White

40 Course Administrivia TextBooks [Tom] Tom White, Hadoop: The Definitive Guide, O'Reilly, 3rd, 2012.5. [Lin] Jimmy Lin and Chris Dyer, Data-Intensive Text Processing with MapReduce, 2013.1. 42 This schedule is tentative and subject to change without notice ID Topics

Contents Reading 7/9 Introduction to MapReduce ( ppt) Why large data? Cloud Computing Story about Hadoop [Lin]Ch1:Introduction

[Tom]Ch1:Meet Hadoop 7/11 MapReduce Basics (ppt) How do we scale up? MapReduce HDFS [Lin]Ch2:Mapreduce Basics [Tom]Ch6:How mapreduce works [GFS&MapReduce Paper] MapReduce Program Develop

Basic MapReduce algorithm design and design patterns [Tom]Ch5:Developing a MapReduce Application [Lin]Ch3:MapReduce algorithm design Introduction to Information Retrieval Inverted Index on MapReduce Retrieval Problems [Lin]Ch4:Inverted Indexing for Text Retrieval Graph Algorithm and Mapreduce Parallel Breadth-First-Search

PageRank [Lin]Ch5:Graph Algorithms What is clustering Applications of clustering in information retrieval K-means algorithm Evaluation of clustering How many clusters Canopy clustering [IIR] Ch16 7/16

7/18 7/23 7/25 MapReduce Algorithm Design(ppt) Text retrieval ( ppt) Graph Algorithm (ppt ) Flat Clustering

(ppt) Canopy Clustering (ppt) 43 Recap Why large data? Cloud computing Story about Hadoop 44

Recently Viewed Presentations

  • Canada Between the Wars 1919-1929 - Mr. Forsyth's Classes

    Canada Between the Wars 1919-1929 - Mr. Forsyth's Classes

    King accused Lord Byng and the Conservatives of "twisting the Constitution." The Progressives continued to support the Liberals and Meighen was quickly defeated. An election called for September 14, 1926 returned King and the Liberals to power. Robert W. White...
  • Adjectives


    They had trained 8) long and 9) hard to get here. The race was due to start at 11 o'clock, but started 10) late as the track wasn't 11) clean. There was a 12) loud roar when all the drivers...
  • PowerPoint Presentation

    PowerPoint Presentation

    Values guide a person's needs, wants, and goals. Values, needs, wants, and goals influence a person's daily decisions. Decisions affect an individual's financial situation because we make financial decisions based on what we feel is important (our . ... PowerPoint...
  • 9.1 Types of Electric Charge - Mrs. N. Gill

    9.1 Types of Electric Charge - Mrs. N. Gill

    9.4 Electric Force. The Van de Graff generator is a device that separates large quantities of electric charge. A charge (negative or positive) is transferred onto a moving belt at the base of the generator and is transferred off the...
  • Course Offerings - Miami Dade College

    Course Offerings - Miami Dade College

    Basic Report Writing (JOU 1100) - Journalistic writing emphasizing the elements of reporting with an emphasis on the modern news story, analysis of the elements of news, style structure of news stories, news sources and the mechanics of newspaper production....
  • Chapter 16

    Chapter 16

    Demerit goods are those considered socially undesirable, and are often taxed, regulated, or prohibited. Redistributing Income. Income Redistribution: Government activity that takes income from some people through taxation and uses it to help citizens in need.
  • iblog.dearbornschools.org


    Walk-In Directions. Find a seat—YES, you will have a seating chart . Fill out the index. card with the following information: Bell Work. Front of Card Back of Card. Hour/Period.
  • The Transition to Modern Office Add-in Development

    The Transition to Modern Office Add-in Development

    Add-in contexts. Content add-in. Add-in that runs within a document content with read/write access. Excel, PowerPoint, Access. Add-in command. Command in the Office UI to launch add-in or perform UI-less operation