Algoritmi per IR (Algorithms for Information Retrieval)

Prologue

References:
- Managing Gigabytes. A. Moffat, T. Bell and I. Witten, Morgan Kaufmann, 1999.
- Mining the Web: Discovering Knowledge from... S. Chakrabarti, Morgan Kaufmann, 2003.
- A bunch of scientific papers available on the course site!!

About this course. It is a mix of algorithms for:
- data compression
- data indexing
- data streaming (and sketching)
- data searching
- data mining
over massive data!!

Paradigm shift... Web 2.0 is about the many: Big DATA, Big PC?

We have three types of algorithms: T1(n) = n, T2(n) = n^2, T3(n) = 2^n, and we assume that 1 step = 1 time unit. How many input data n may each algorithm process within t time units?

n1 = t,  n2 = sqrt(t),  n3 = log2 t

What about a k-times faster processor? ...or, what is n when the time budget is k*t?
n1 = k * t,  n2 = sqrt(k * t) = sqrt(k) * sqrt(t),  n3 = log2 (k * t) = log2 k + log2 t

A new scenario. Data are more available than ever before: a growing n is more than a theoretical assumption. The RAM model is too simple: every step costs Theta(1) time, so minimizing the #steps is not the whole story.
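The arithmetic above can be checked with a short script (a sketch; `max_input` is a name of my choosing, not from the course material):

```python
import math

def max_input(t):
    """Largest input n each algorithm can process in t time units:
    T1(n)=n -> n = t ;  T2(n)=n^2 -> n = sqrt(t) ;  T3(n)=2^n -> n = log2(t)."""
    return t, math.isqrt(t), int(math.log2(t))

t = 1_000_000
n1, n2, n3 = max_input(t)        # 1000000, 1000, 19
k = 100                          # a 100x faster processor
m1, m2, m3 = max_input(k * t)    # 100000000, 10000, 26
# The linear algorithm scales by k, the quadratic one only by sqrt(k),
# and the exponential one gains just log2(k) ~ 7 extra input items.
```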

The memory hierarchy:
CPU (registers): few words fetched
L1/L2 cache: few MBs, some nanosecs, few words fetched
RAM: few GBs, tens of nanosecs, some words fetched
HD: few TBs, few millisecs, one page of B = 32KB fetched
Network: many TBs, even secs, packets

You should be ??-aware programmers

I/O-conscious Algorithms

(figure: disk internals: magnetic surface, tracks, read/write head and arm)

"The difference in speed between modern CPU and disk technologies is analogous to the difference in speed in sharpening a pencil using a sharpener on one's desk or by taking an airplane to the other side of the world and using a sharpener on someone else's desk." (D. Comer)

Spatial locality vs Temporal locality.

The space issue:
M = memory size, N = problem size
T(n) = time complexity of an algorithm using linear space
p = fraction of memory accesses [0.3 - 0.4 (Hennessy-Patterson)]
C = cost of an I/O [10^5 - 10^6 steps (Hennessy-Patterson)]

If N = (1+f) M, then the avg cost per step is: C * p * f/(1+f), which is at least 10^4 * f/(1+f).
If we fetch a block of B = 4KB in time C, and the algorithm uses all of it: (1/B) * (p * f/(1+f) * C), i.e. about 30 * f/(1+f).

Space-conscious Algorithms: fewer I/Os for search and access, via compressed data structures.

Streaming Algorithms

Data arrive continuously, or we wish FEW scans of them. Streaming algorithms:

- Use few scans
- Handle each element fast
- Use small space

Cache-Oblivious Algorithms

Unknown and/or changing devices. Block access is important on all levels of the memory hierarchy, but memory hierarchies are very diverse. Cache-oblivious algorithms:
- Explicitly, they do not assume any model parameters.
- Implicitly, they use blocks efficiently on all memory levels.

Toy problem #1: Max Subarray

Goal: given a stock and its performance over time, find the time window in which it achieved the best market performance.
Math problem: find the subarray of maximum sum.
A = 2 -5 6 1 -2 4 3 -13 9 -6 7
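For reference, the naive cubic and quadratic solutions that the timings below refer to can be sketched as follows (a sketch; function names are mine):

```python
def max_sub_cubic(A):
    # Try every window [i, j] and re-sum it from scratch: O(n^3) time.
    return max(sum(A[i:j + 1]) for i in range(len(A)) for j in range(i, len(A)))

def max_sub_quadratic(A):
    # Grow each window incrementally, so extending [i, j] costs O(1): O(n^2) time.
    best = A[0]
    for i in range(len(A)):
        s = 0
        for j in range(i, len(A)):
            s += A[j]
            best = max(best, s)
    return best

A = [2, -5, 6, 1, -2, 4, 3, -13, 9, -6, 7]
# Both return 12, the sum of the window 6 1 -2 4 3.
```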

Running times as n grows:

n     4K   8K   16K  32K   128K  256K  512K  1M
n^3   22s  3m   26m  3.5h  28h   --    --    --
n^2   0    0    0    1s    26s   106s  7m    28m

An optimal solution. Observe that every prefix of the optimal window has a positive sum, while the part of A preceding the optimum has a negative sum (or the optimum starts at A's beginning):
A = [ prefix of sum < 0 | Optimum: every prefix of sum > 0 ]
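Based on this observation, the optimal one-pass scan can be sketched in Python (a hedged sketch mirroring the pseudocode that follows; names are mine, and it returns 0 when every window is negative, whereas the slide initializes max = -1):

```python
def max_sub_linear(A):
    # One pass: reset the running sum whenever it would drop to <= 0,
    # since the optimum is never entered with a negative running sum.
    best, s = 0, 0
    for x in A:
        s = max(s + x, 0)    # sum < 0 just before OPT starts -> reset to 0
        best = max(best, s)  # sum > 0 while scanning within OPT
    return best

A = [2, -5, 6, 1, -2, 4, 3, -13, 9, -6, 7]
# max_sub_linear(A) == 12  (the window 6 1 -2 4 3)
```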

A = 2 -5 6 1 -2 4 3 -13 9 -6 7

Algorithm:
sum = 0; max = -1;
for i = 1, ..., n do
  if (sum + A[i] <= 0) then sum = 0;
  else sum += A[i];
  max = MAX{max, sum};

Note: sum < 0 just before OPT starts; sum > 0 within OPT.

Toy problem #2: sorting

How to sort tuples (objects) on disk, when memory contains the tuples and an array A? Key observation:

Array A is an array of pointers to objects. For each object-to-object comparison A[i] vs A[j], we pay 2 random accesses to the memory locations pointed by A[i] and A[j]. Hence MergeSort costs Theta(n log n) random memory accesses (I/Os ??).

B-trees for sorting? Using a well-tuned B-tree library (Berkeley DB), perform n insertions: the B-tree leaves store tuple pointers, and the tuples get distributed arbitrarily!!! What about listing the tuples in order? Possibly 10^9 random I/Os = 10^9 * 5ms, i.e. about 2 months.
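The disk-friendly alternative developed in the next slides is merge-based sorting; its core step, merging X sorted runs while keeping only one current element per run in memory, can be sketched with a heap (a sketch: in a real external sort the runs would be files read page by page):

```python
import heapq

def multiway_merge(runs):
    # Merge X sorted runs; the heap holds one "head" element per run,
    # as the X-way merge of external mergesort keeps one page per run.
    heap = [(run[0], i, 0) for i, run in enumerate(runs) if run]
    heapq.heapify(heap)
    out = []
    while heap:
        x, i, j = heapq.heappop(heap)   # overall minimum among run heads
        out.append(x)
        if j + 1 < len(runs[i]):        # advance run i's cursor
            heapq.heappush(heap, (runs[i][j + 1], i, j + 1))
    return out

runs = [[1, 5, 9], [2, 2, 7], [3, 8]]
# multiway_merge(runs) == [1, 2, 2, 3, 5, 7, 8, 9]
```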

Binary Merge-Sort

Merge-Sort(A, i, j)
01  if (i < j) then
02    m = (i+j)/2;            // Divide
03    Merge-Sort(A, i, m);    // Conquer
04    Merge-Sort(A, m+1, j);
05    Merge(A, i, m, j)       // Combine

Cost of Mergesort on large data

Take Wikipedia in Italian and compute the word frequencies: n = 10^9 tuples, a few GBs of data. Typical disk (Seagate Cheetah, 150GB): seek time about 5ms.
Analysis of mergesort on disk: being an indirect sort, it incurs Theta(n log2 n) random I/Os, i.e. [5ms] * n log2 n, about 1.5 years. In practice it is faster, thanks to caching: each merging level makes 2 passes (R/W) over the data.

Merge-Sort Recursion Tree: log2 N levels. If the run-size is larger than B (i.e. already after the first step!!),

fetching all of it in memory for merging does not help.

(figure: recursion tree of Merge-Sort over the example keys, showing sorted runs merged level by level)

How do we deploy the disk/mem features?

With internal memory M we get N/M runs, each sorted in internal memory (no I/Os). The I/O-cost for merging them pairwise is 2 (N/B) log2 (N/M).

Multi-way Merge-Sort. The key is to balance run-size and #runs to merge. Sort N items with main memory M and disk pages of B items:
Pass 1: produce N/M sorted runs.
Pass i: merge X = M/B runs at a time.
This takes log_{M/B} (N/M) passes.

(figure: X input buffers and 1 output buffer, each of B items, in main memory; the runs stream in from disk, the merged run streams back out)

Multiway Merging. Keep one current page per run, in buffers Bf1, ..., BfX with cursors p1, ..., pX, plus an output buffer Bfo with cursor po. Repeatedly move min(Bf1[p1], Bf2[p2], ..., BfX[pX]) to Bfo; fetch the next page of run i when pi = B; flush Bfo to the merged run when full; stop at the EOF of all the X = M/B runs.

Cost of Multi-way Merge-Sort. Number of passes = log_{M/B} #runs = log_{M/B} (N/M).

Optimal cost = Theta((N/B) log_{M/B} (N/M)) I/Os.
In practice M/B is about 1000, so #passes = log_{M/B} (N/M) is about 1: one multiway merge suffices, i.e. 2 passes (R/W) = a few mins. Tuning depends on the disk features: a large fan-out (M/B) decreases #passes, and compression would decrease the cost of a pass!

Does compression help? Goal: enlarge M and reduce N, since #passes = O(log_{M/B} N/M) and the cost of a pass is O(N/B).

Part of Vitter's paper addresses related issues:

- Disk Striping: sorting easily on D disks.
- Distribution sort: top-down sorting.
- Lower Bounds: how far we can go.

Toy problem #3: Top-freq elements

Goal: find the top-frequent queries over a stream of N items (N large).
Math problem: find the item y whose frequency is > N/2, using the smallest space (i.e. find the mode, provided it occurs > N/2 times).
A = b a c c c d c b a a a c c b c c c

Algorithm: use a pair of variables <X, C>.
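A runnable sketch of the pairing scan described next (often known as the Boyer-Moore majority vote; variable names are mine):

```python
def majority_candidate(stream):
    # Keep one candidate X and a counter C; matching items increment C,
    # mismatches cancel one occurrence. A true majority item
    # (frequency > N/2) survives all the cancellations.
    X, C = None, 0
    for s in stream:
        if C == 0:
            X, C = s, 1
        elif X == s:
            C += 1
        else:
            C -= 1
    return X   # correct only if a majority item actually exists

A = list("bacccdcbaaaccbccc")
# majority_candidate(A) == 'c'  (9 of the 17 items are 'c')
```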

For each item s of the stream:
  if (X == s) then C++
  else { C--; if (C == 0) { X = s; C = 1; } }
Return X;

Proof. If y occurs more than N/2 times, then X = y at the end. Indeed, if X != y, then every one of y's occurrences has a negative mate, hence these mates should be at least #occ(y); but then N >= 2 * #occ(y) > N, a contradiction. (Problems arise if no item occurs more than N/2 times.)

Toy problem #4: Indexing

Consider the following TREC collection:

N = 6 * 10^9 chars (size = 6GB)
n = 10^6 documents
TotT = 10^9 total terms (avg term length is 6 chars)
t = 5 * 10^5 distinct terms

What kind of data structure do we build to support word-based searches?

Solution 1: Term-Doc matrix (t = 500K terms, n = 1 million docs)

            Antony&Cleopatra  Julius Caesar  The Tempest  Hamlet  Othello  Macbeth
Antony             1                1             0          0       0        1
Brutus             1                1             0          1       0        0
Caesar             1                1             0          1       1        1
Calpurnia          0                1             0          0       0        0
Cleopatra          1                0             0          0       0        0
mercy              1                0             1          1       1        1
worser             1                0             1          1       1        0

(1 if the play contains the word, 0 otherwise.) Space is 500GB!

Solution 2: Inverted index

Brutus    -> 2 4 8 16 32 64 128
Calpurnia -> 1 2 3 8 13 21 34
Caesar    -> 13 16

We can do still better, i.e. 30-50% of the original text:
1. Each posting typically uses about 12 bytes.
2. We have 10^9 total terms, hence at least 12GB of space.
3. Compressing the 6GB of documents gets about 1.5GB of data.

Please!! Do not underestimate the features of disks in algorithmic design.
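As an illustration of point 3 (compressing the postings), docID lists are typically stored as gaps, which are small and compress well, e.g. with variable-byte codes (a sketch under my own naming; not the course's reference implementation):

```python
def vbyte_encode(doc_ids):
    # Store gaps between sorted docIDs; 7 data bits per byte,
    # with the high bit flagging the last byte of each gap.
    out, prev = bytearray(), 0
    for d in doc_ids:
        gap, chunks = d - prev, []
        prev = d
        while True:
            chunks.append(gap % 128)
            gap //= 128
            if gap == 0:
                break
        chunks.reverse()       # most significant 7-bit chunk first
        chunks[-1] += 128      # flag the terminating byte
        out.extend(chunks)
    return bytes(out)

def vbyte_decode(data):
    ids, n, prev = [], 0, 0
    for b in data:
        if b < 128:
            n = n * 128 + b    # continuation byte
        else:                  # terminating byte: emit the decoded gap
            prev += n * 128 + (b - 128)
            ids.append(prev)
            n = 0
    return ids

postings = [2, 4, 8, 16, 32, 64, 128]
# All gaps are < 128, so the list takes 7 bytes instead of 28 (4-byte ints).
```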