Key-Value stores

Key-Value stores

Key-Value stores simple data model that maps keys to a list of values Easy to achieve Performance Fault tolerance Heterogeneity Availability due to its schema-less data model and fine granularity partitioning of the data

Google BigTable Google use the key-value paradigm to map URLs to multidimensional data, such as: Timestamps/Versions Rank Keywords Links No explicit ordering is needed on keys since a hash function is used

MapReduce Many data distributed over hundreds or thousands of machines These data need to be processed and produce new data (usually the process is a simple task) ex. aggregate functions, filtering, etc. MapReduce Partition the data according to pre-defined key

ex. words, URLs, etc. A master node assigns to workers specific partitions (=keys, =mappings of data to keys) The worker will produce a new list of keyvalue, corresponding to the new intermediate processed data (Map phase) Workers then will gather all intermediate data belonging to a specific key, and reduce them to the requested output ordered by key (Reduce phase) MapReduce map (in_key, in_value) -> list(out_key, intermediate_value) reduce (out_key, list(intermediate_value)) -> list(out_value) k, v k', v' map map k1,v1 k2,v2

k3,v3 ... k1,v1' k2,v2' k3,v3' ... sort sort k1, v1 v1' ... k2, v2 v2'

... reduce reduce r1 r2 ... r1 r2 ... The dirty little secret of Google that is too obvious is ... Hadoop Map/Reduce open source implementation of the map/reduce idea takes care of scheduling tasks, monitoring them

and re-executes the failed tasks a single master JobTracker and one slave TaskTracker per cluster-node HadoopDB HadoopDB: Database Connector extends Hadoop's InputFormat class connects to a database, executes the SQL query and returns results as key-value pairs should support any JDBC-compliant database that resides in the cluster HadoopDB: Catalog

The catalog maintains meta-information about the databases connection parameters such as database location, driver class and credentials metadata such as data sets contained in the cluster, replica locations, and data partitioning properties HadoopDB: Data Loader responsible for: globally repartitioning data on a given partition key upon loading

breaking apart single node data into multiple smaller partitions or chunks and finally bulk-loading the single-node databases with the chunks The [not so] easy way to do the labwork... HadoopDB with MonetDB instead of PostgreSQL read the HadoopDB paper download HadoopDB hook in MonetDB define a benchmark

do experiments write an excellent report! implementation details+experiments The [not so] fun way to do the labwork... Combine something of the following: Map/Reduce URLs DHTs - Chord N-gram Strings Strings

M5 module DHTs a traditional key-value store based on a distributed hash table there are no master nodes keys are distributed to nodes according to a hash function values are retrieved with O(logN) messages by employing routing tables Chord protocol

keys are assigned an identifier: hash(key) Each peer maintains a routing table (finger table) to route lookups peers are assigned an identifier: hash(IP) store and retrieve pairs of (key, data): lookup(key) peer6 peer1 V+2^0 peer2 V+2^1 peer4 V+2^2 peer5 peer5 Chord Ring modulo 2^m peer2 peer4 peer3

The [not so] fun way to do the labwork... Design the BAT representation of distributed KeyValue store module Develop a wrapper around it to (simulate) parallel behavior define a benchmark do experiments write an excellent report! implementation details + experiments What is the interface of the KV

store? SiteStat Basic event ns_utc=1117835999527& Time=86399527& type=view& ns_m2=no& name=statistics.basics& Ip= ns_site_cookie=426EA62C01D400FA& agent=Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)& ns_pageurl= &ns_js=yes& referrer=http%3A// &lang=nl &secure=no &id=3165999& Pv=2& cntry=NL& language=NL Dealing with formatted key strings"""""""""" url box" http zvon org xxl DTDTutorial General book.html" Oid

Value Oid Value Oid value 1 org 1 xxl 1 DTDtutorial 2 com 1

DOM2reference 3 net [ 0, [ 1, [ 2, Insert 1000 urls -> 2 seconds [ 3, [ 4, [ 5, [ 6, [ 7, "urlbox_0", "urlbox_1", "urlbox_2", "urlbox_3", "urlbox_4", "urlbox_5", "urlbox_6", "urlbox_7", 1, 1 270, 236,

180, 136, 56, 48 34, 32 5, 5 ] 270 219 173 122 ] ] ] ] ] ] ] N-gram indexing" Break the string in overlapping n-grams Oid

value 1 Tuto Oid value 1 Tuto 1 utor 1 tori Use a single table per n-gram + wildcards Oid value Oid value

1 utor 1 tori

Recently Viewed Presentations

  • 3rd SG13 Regional Workshop for Africa on ITU-T

    3rd SG13 Regional Workshop for Africa on ITU-T

    ICT Standards Policy 2011. Objectives. Improve . the effective functioning of ICT sector. through adoption of high-quality and interoperable standards for ICT infrastructures in order to increase the efficiency of service delivery.
  • 2008 Mic 中文簡報範本 - Fki

    2008 Mic 中文簡報範本 - Fki

    MIC at a Glance Mission Assist businesses in seizing market opportunities Assist the Taiwanese government in formulating industry development strategies Since 1987, MIC has been serving as a strategic think thank for senior decision makers in industries, public sectors, academic...
  • BRAIN RULE #9 -

    BRAIN RULE #9 -

    Modality principle . Although multimedia only combines 2 of the senses, we use this method quite often. Cognitive psychologist Richard Mayer studied 3 different groups. One group who received info via hearing, one group via sight, and one group with...
  • Why Study Spanish?

    Why Study Spanish?

    study abroad - to live, travel, and take classes in a foreign country - for a semester or even a year. Knowing a second language such as SPANISH increases the opportunity to immerse oneself in a foreign language and culture.
  • Hebrews 12:1-4 - Bible Answer

    Hebrews 12:1-4 - Bible Answer

    Hebrews 12:1-2. 1 Therefore we also, since we are surrounded by so great a cloud of witnesses, let us lay aside every weight, and the sin which so easily ensnares us, and let us run with endurance the race that...
  • Healthy Eating -

    Healthy Eating -

    Task: To create a food pyramid showing the foods that we should eat most of at the bottom and the foods that we should eat less of at the top layer. Once you have drawn the pyramid, place the foods...
  • Module 2 - Santa Fe College

    Module 2 - Santa Fe College

    Digital Video Disc or Digital Versatile Disc (DVD) DVD is an optical media standard that can be used to store large amounts of different types of data (computer data, video, audio). ... The NTFS file system automatically detects bad sectors...
  • A Look at Expected Amendments to Iso/Iec 17025 (1999)

    A Look at Expected Amendments to Iso/Iec 17025 (1999)

    Adam Gouker A2LA, EMC Program Manager ... Telecom 88 Bluetooth 10 CTIA 23 Product Safety (electrical) 68 VCCI 67 SAR 13 WiMAX 4 Energy Star - 23 (1 CB) TCB's - 8 Upcoming A2LA Training November 14-15 (SC) - Intro...