Web Algorithmics - Dipartimento di Informatica


Algorithms for IR - Prologue

References:
- Managing Gigabytes, A. Moffat, T. Bell and I. Witten, Morgan Kaufmann Publishers, 1999.
- Mining the Web: Discovering Knowledge from Hypertext Data, S. Chakrabarti, Morgan Kaufmann Publishers, 2003.
- A bunch of scientific papers available on the course site!!

About this course
It is a mix of algorithms for data compression, data indexing, data streaming (and sketching), data searching, and data mining. Massive data!!

Paradigm shift... Web 2.0 is about the many: Big DATA, Big PC?
Suppose we have three types of algorithms: T1(n) = n, T2(n) = n^2, T3(n) = 2^n, and assume that 1 step = 1 time unit. How many input data n may each algorithm process within t time units?
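Before the closed-form answers below, a small C++ sketch (purely illustrative) that makes the arithmetic concrete by computing the largest solvable n for a few time budgets:

// For a time budget of t steps, invert each running time:
//   T1(n) = n   -> n1 = t
//   T2(n) = n^2 -> n2 = sqrt(t)
//   T3(n) = 2^n -> n3 = log2(t)
#include <cmath>
#include <cstdio>

int main() {
    for (double t : {1e6, 1e9, 1e12}) {   // time budgets (illustrative)
        std::printf("t=%.0e  n1=%.0e  n2=%.0e  n3=%.1f\n",
                    t, t, std::sqrt(t), std::log2(t));
    }
    // A k-times faster CPU multiplies the budget by k: n1 grows by a
    // factor k, n2 only by sqrt(k), and n3 gains just an additive log2(k).
}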

n1 = t, n2 = √t, n3 = log2 t.
What about a k-times faster processor? ...or, what is n when the time budget is k*t?
n1 = k*t, n2 = √(k*t) = √k * √t, n3 = log2(k*t) = log2 k + log2 t

A new scenario
Data are more available than ever before: n → ∞ is more than a theoretical assumption. The RAM model is too simple: it assumes every step costs Θ(1) time, so the goal cannot be just to minimize the number of steps.

The memory hierarchy (CPU registers → L1/L2 cache → RAM → HD → network):

Level         Size       Access time        Transfer unit
L1/L2 cache   few MBs    some nanosecs      few words fetched
RAM           few GBs    tens of nanosecs   some words fetched
HD            few TBs    few millisecs      B = 32K page
net           many TBs   even secs          packets

You should be ??-aware programmers.

I/O-conscious Algorithms
[Figure: disk internals - track, read/write head, read/write arm, magnetic surface.]
"The difference in speed between modern CPU and disk technologies is analogous to the difference in speed in sharpening a pencil using a sharpener on one's desk or by taking an airplane to the other side of the world and using a sharpener on someone else's desk." (D. Comer)
Spatial locality vs temporal locality.

The space issue
M = memory size, N = problem size
T(n) = time complexity of an algorithm using linear space
p = fraction of steps that access memory, in the range [0.3, 0.4] (Hennessy-Patterson)
C = cost of an I/O, in the range [10^5, 10^6] steps (Hennessy-Patterson)

If N = (1+f)M, then the avg cost per step is C * p * f/(1+f), which is at least 10^4 * f/(1+f). If we fetch a block of B ≈ 4Kb in time C, and the algorithm uses all of its items: (1/B) * (p * f/(1+f) * C) ≈ 30 * f/(1+f).

Space-conscious Algorithms
Compressed data structures: less space means fewer I/Os for search and access.

Streaming Algorithms
Data arrive continuously, or we wish FEW scans. Streaming algorithms:

- Use few scans
- Handle each element fast
- Use small space

Cache-Oblivious Algorithms
[Memory-hierarchy figure as before: registers and caches, RAM, HD, network.]
Devices are unknown and/or changing. Block access is important on all levels of the memory hierarchy, but memory hierarchies are very diverse. Cache-oblivious algorithms:
- Explicitly, they do not assume any model parameters
- Implicitly, they use blocks efficiently on all memory levels

Toy problem #1: Max Subarray
Goal: given a stock and its performance over time, find the time window in which it achieved the best market performance.
Math problem: find the subarray of maximum sum.
A = 2 -5 6 1 -2 4 3 -13 9 -6 7

Running times of the naive algorithms (1 step = 1 time unit):

n      4K   8K   16K  32K   128K  256K  512K  1M
n^3    22s  3m   26m  3.5h  28h   --    --    --
n^2    0    0    0    1s    26s   106s  7m    28m

An optimal solution
We assume every subsum ≠ 0.
[Figure: A drawn as a prefix of sum < 0 followed by the optimum window, within which the running sum stays > 0.]

A = 2 -5 6 1 -2 4 3 -13 9 -6 7

Algorithm
sum = 0; max = -1;
for i = 1, ..., n do
  if (sum + A[i] <= 0) then sum = 0;
  else { sum += A[i]; max = MAX{max, sum}; }

Note: sum < 0 right before OPT starts; sum > 0 within OPT.
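A runnable C++ rendering of the scan above (a minimal sketch; the function name and the 64-bit sums are mine, not the slide's):

// Single left-to-right scan: reset the running sum when it would drop
// to <= 0, and record the best sum seen so far.
#include <algorithm>
#include <cstdio>
#include <vector>

long long maxSubarraySum(const std::vector<int>& A) {
    long long sum = 0, best = -1;            // max = -1 as in the pseudocode
    for (int x : A) {
        if (sum + x <= 0) sum = 0;           // sum < 0: restart the window
        else { sum += x; best = std::max(best, sum); }   // sum > 0 within OPT
    }
    return best;
}

int main() {
    std::vector<int> A = {2, -5, 6, 1, -2, 4, 3, -13, 9, -6, 7};
    std::printf("%lld\n", maxSubarraySum(A));  // prints 12, i.e. 6+1-2+4+3
}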

Toy problem #2: sorting
How to sort tuples (objects) on disk?
[Figure: memory holds the array A of pointers; the tuples lie elsewhere.]
Key observation: A is an array of pointers to objects, so each object-to-object comparison A[i] vs A[j] costs 2 random accesses to the memory locations pointed to by A[i] and A[j]. MergeSort thus performs Θ(n log n) random memory accesses (I/Os??).

B-trees for sorting?
Using a well-tuned B-tree library (Berkeley DB), n insertions: data get distributed arbitrarily!!!
[Figure: B-tree internal nodes and leaves store tuple pointers; the tuples themselves sit elsewhere on disk.]
What about listing the tuples in order? Possibly 10^9 random I/Os = 10^9 * 5ms ≈ 2 months.

Binary Merge-Sort

Merge-Sort(A, i, j)
  if (i < j) then
    m = (i + j) / 2;          // Divide
    Merge-Sort(A, i, m);      // Conquer
    Merge-Sort(A, m + 1, j);
    Merge(A, i, m, j)         // Combine
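For concreteness, a runnable C++ version of the pseudocode (a sketch; the merge copies through a temporary buffer):

#include <vector>

// Combine step: merge the sorted halves A[i..m] and A[m+1..j].
void merge(std::vector<int>& A, int i, int m, int j) {
    std::vector<int> tmp;
    tmp.reserve(j - i + 1);
    int a = i, b = m + 1;
    while (a <= m && b <= j) tmp.push_back(A[a] <= A[b] ? A[a++] : A[b++]);
    while (a <= m) tmp.push_back(A[a++]);
    while (b <= j) tmp.push_back(A[b++]);
    for (int k = 0; k < (int)tmp.size(); ++k) A[i + k] = tmp[k];
}

void mergeSort(std::vector<int>& A, int i, int j) {
    if (i < j) {                 // Divide
        int m = (i + j) / 2;
        mergeSort(A, i, m);      // Conquer
        mergeSort(A, m + 1, j);
        merge(A, i, m, j);       // Combine
    }
}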

Cost of Mergesort on large data
Take Wikipedia in Italian and compute the word frequencies: n = 10^9 tuples, i.e. a few GBs. Typical disk (Seagate Cheetah, 150GB): seek time ≈ 5ms.
Analysis of mergesort on disk: it is an indirect sort, hence Θ(n log2 n) random I/Os, i.e. 5ms * n log2 n ≈ 1.5 * 10^8 secs, which is years of work. In practice it is faster, because of caching...

Merge-Sort Recursion Tree
[Figure: the recursion tree of binary mergesort on sorted runs; each of the log2 N levels reads and writes the whole input, i.e. 2 passes (R/W) per level.]
If the run-size is larger than B (i.e. after the first step!!), fetching all of it in memory for merging does not help. How do we deploy the disk/mem features?
With internal memory M we get N/M runs, each sorted in internal memory (no I/Os), and the I/O-cost for merging them is 2 (N/B) log2 (N/M).

Multi-way Merge-Sort
The key is to balance run-size and #runs to merge. Sort N items with main memory M and disk pages of B items:
- Pass 1: produce N/M sorted runs.
- Pass i: merge X ≤ M/B runs at a time, giving log_{M/B}(N/M) passes.

[Figure: X input buffers (one per run) and one output buffer, all of B items, held in main memory between the input disk and the output disk.]

Multiway Merging
Each run r (X = M/B runs in total) has a current page Bf_r scanned by a pointer p_r. Repeatedly move min(Bf_1[p_1], Bf_2[p_2], ..., Bf_X[p_X]) to the output buffer Bf_o; fetch the next page of run r when p_r = B; flush Bf_o to the out file (the merged run) whenever it is full, until every run reaches EOF.
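To make the merging loop concrete, a minimal in-memory C++ sketch that replaces the explicit min over the X current pages with a min-heap (run layout and names are illustrative; on disk, each run would be read one B-sized page at a time):

#include <cstdio>
#include <functional>
#include <queue>
#include <tuple>
#include <vector>

std::vector<int> multiwayMerge(const std::vector<std::vector<int>>& runs) {
    // Heap entries: (current value, run index, position within the run).
    using Entry = std::tuple<int, size_t, size_t>;
    std::priority_queue<Entry, std::vector<Entry>, std::greater<Entry>> heap;
    for (size_t r = 0; r < runs.size(); ++r)
        if (!runs[r].empty()) heap.emplace(runs[r][0], r, 0);

    std::vector<int> out;                      // plays the role of Bf_o
    while (!heap.empty()) {
        auto [v, r, p] = heap.top(); heap.pop();
        out.push_back(v);                      // on disk: flush when full
        if (p + 1 < runs[r].size()) heap.emplace(runs[r][p + 1], r, p + 1);
    }
    return out;
}

int main() {
    std::vector<std::vector<int>> runs = {{1, 5, 9}, {2, 6}, {3, 4, 8}};
    for (int v : multiwayMerge(runs)) std::printf("%d ", v);  // 1 2 3 4 5 6 8 9
}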

Cost of Multi-way Merge-Sort
Number of passes = log_{M/B} #runs = log_{M/B}(N/M).
Optimal cost = Θ((N/B) log_{M/B}(N/M)) I/Os.
In practice the fan-out M/B is large (1000 or more), so #passes = log_{M/B}(N/M) is 1: one multiway merge, hence 2 passes (R/W) = a few mins. Tuning depends on the disk features: a large fan-out (M/B) decreases #passes, and compression would decrease the cost of a pass!

May compression help? Goal: enlarge M and reduce N. #passes = O(log_{M/B}(N/M)), cost of a pass = O(N/B).

Part of Vitter's paper addresses the related issues of:

- Disk Striping: sorting easily on D disks
- Distribution sort: top-down sorting
- Lower Bounds: how far we can go

Toy problem #3: Top-freq elements
Goal: top-frequent queries over a stream of N items (N large).
Math problem: find the item y whose frequency is > N/2, using the smallest space (i.e., find the mode, provided it occurs > N/2 times).
A = b a c c c d c b a a a c c b c c c

Algorithm
Use a pair of variables <X, C>.

For each item s of the stream:
  if (X == s) then C++;
  else { C--; if (C == 0) { X = s; C = 1; } }
Return X;

Proof. If X ≠ y at the end, then every one of y's occurrences has a distinct "negative mate", hence these mates should be at least #occ(y) in number. As a result, the stream would contain 2 * #occ(y) > N items: a contradiction. (Problems arise if the mode occurs ≤ N/2 times.)
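A runnable C++ rendering of the <X, C> scheme, in the common majority-vote formulation (equivalent to the pseudocode above, with the initialization folded into the C == 0 case; the test stream matches the slide):

#include <cstdio>
#include <string>

// One scan, two variables: the candidate X and its counter C.
char majorityCandidate(const std::string& stream) {
    char X = 0;
    long long C = 0;
    for (char s : stream) {
        if (C == 0)      { X = s; C = 1; }   // adopt a new candidate
        else if (X == s) { ++C; }            // a supporting occurrence
        else             { --C; }            // a "negative mate"
    }
    return X;   // the mode, provided it occurs > N/2 times
}

int main() {
    std::string A = "bacccdcbaaaccbccc";     // b a c c c d c b a a a c c b c c c
    std::printf("%c\n", majorityCandidate(A));  // prints 'c'
}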

Toy problem #4: Indexing
Consider the following TREC collection: N = 6 * 10^9 chars, i.e. size 6GB; n = 10^6 documents; TotT = 10^9 total terms (avg term length is 6 chars); t = 5 * 10^5 distinct terms. What kind of data structure should we build to support word-based searches?

Solution 1: Term-Doc matrix (t = 500K rows, n = 1 million columns; an entry is 1 if the play contains the word, 0 otherwise)

            Antony&Cleopatra  Julius Caesar  The Tempest  Hamlet  Othello  Macbeth
Antony             1                1             0          0       0        1
Brutus             1                1             0          1       0        0
Caesar             1                1             0          1       1        1
Calpurnia          0                1             0          0       0        0
Cleopatra          1                0             0          0       0        0
mercy              1                0             1          1       1        1
worser             1                0             1          1       1        0

Space is 500Gb!

Solution 2: Inverted index

Brutus    -> 2 4 8 16 32 64 128
Calpurnia -> 1 2 3 5 8 13 21 34
Caesar    -> 13 16

1. Each posting typically uses about 12 bytes.
2. We have 10^9 total terms, hence at least 12GB of space.
3. Compressing the 6GB of documents gets ≈ 1.5GB of data.
We can do still better, i.e. 30-50% of the original text.

Please!! Do not underestimate the features of disks in algorithmic design.
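To ground Solution 2, a toy C++ sketch of an inverted index over the posting lists in the figure (illustrative only; a real index would store compressed, gap-encoded postings):

#include <cstdio>
#include <map>
#include <string>
#include <vector>

int main() {
    // Map each term to the sorted list of docIDs containing it.
    std::map<std::string, std::vector<int>> index;
    index["Brutus"]    = {2, 4, 8, 16, 32, 64, 128};
    index["Calpurnia"] = {1, 2, 3, 5, 8, 13, 21, 34};
    index["Caesar"]    = {13, 16};

    // A word-based search is one lookup plus a scan of the posting list.
    for (int doc : index["Brutus"]) std::printf("%d ", doc);
    // Storing gaps (deltas) between sorted docIDs, then compressing them,
    // is what leads to the 30-50% of original text mentioned above.
}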
