Peer-to-Peer Networks

Overlay Network
A logical network laid on top of the Internet. [Figure: nodes A, B, C joined by logical links AB and BC, running over the underlying Internet.]

The Formal Model
Let V be a set of nodes. The functions
  id : V -> Z+   assigns a unique id to each node in V
  rs : V -> {0,1}*   assigns a random bit string to each node in V (optional)

A family of overlay networks is a function ON : F -> G, where F is the set of all labeled triples s = (V, id, rs) and G is the set of all directed graphs. A unique directed graph ON(s) in G is associated with each labeled set s = (V, id, rs) of nodes. Each node contains one or more objects. One important objective is SEARCH: any node must be able to access any object as quickly as possible.

Structured vs. Unstructured Overlay Networks
Unstructured: no restriction on the network topology. Examples: Gnutella, Kazaa, BitTorrent, Skype, etc.
Structured: the network topology satisfies specific invariants. Examples: Chord, CAN, Pastry, Skip Graph, etc.

Gnutella
The Gnutella network is a fully distributed alternative to the centralized Napster. The network's initial popularity was spurred by Napster's legal demise in early 2001.

What is Gnutella?
A protocol for distributed search. [Figure: peers, each holding objects such as object1 and object2, exchanging queries directly.]

No central authority.

Remarks
Gnutella uses the simple idea of searching by flooding, but scalability is an issue, since query flooding wastes bandwidth. It uses a TTL to control flooding; sometimes existing objects are not located because the TTL is too small. Subsequently, various improved search strategies have been proposed.

Searching in Gnutella
The topology is dynamic, i.e. constantly changing. How do we model a constantly changing topology? Usually we begin with a static topology, and later account for the effect of churn. Modeling the topology:
-- Random graph
-- Power-law graph
(measurements provide useful inputs)

Random graph: the Erdős–Rényi model
A random graph G(n, p) is constructed by starting with a set of n vertices and adding edges between pairs of nodes at random: every possible edge occurs independently with probability p.
Q. Is the Gnutella topology a random graph?

Gnutella topology
The Gnutella topology is almost a power-law graph (also called a scale-free graph). What is a power-law graph? The number of nodes with degree k is c·k^(-r). (Contrast this with a Gaussian-like distribution, where the number of nodes with degree k falls off exponentially, roughly c·2^(-k).) Many graphs in nature exhibit power-law characteristics. Examples: in the world-wide web, the number of pages that have k in-links is proportional to k^(-2); the fraction of scientific papers that receive k citations is proportional to k^(-3); etc.

AT&T Call Graph
How many telephone numbers receive calls from k different telephone numbers? [Figure: log-log plot of the number of telephone numbers called against the number of telephone numbers from which calls were made, showing a power-law trend.]

Gnutella network
[Figure: log-log plot of the proportion of nodes against the number of neighbors, showing a power-law link distribution; a power-law fit with exponent 2.07 matches the data. Summer 2000, data provided by Clip2.]

A possible explanation
Nodes join at different times. The more connections a node has, the more likely it is to acquire new connections ("the rich get richer").

Popular webpages attract new pointers. It has been shown mathematically that such a growth process produces a power-law network.

Search strategies
-- Flooding
-- Random walk / biased random walk / multiple-walker random walk
-- (combined with) one-hop replication / two-hop replication / k-hop replication

On random walks
Let p(d) be the probability that a random walk on a d-dimensional lattice returns to the origin. In 1921, Pólya proved that (1) p(1) = p(2) = 1, but (2) p(d) < 1 for d > 2.

There are similar results on two walkers meeting each other via random walk.

Search via random walk
The existence of a path does not necessarily mean that such a path can be discovered.

Search metrics:
Delay = discovery time in hops.
Overhead = total distance covered by the walker.
Both should be as small as possible. For a single random walker, these are equal. Using k random walkers is a compromise. For search by flooding, if delay = h then overhead = d + d^2 + ... + d^h, where d is the degree of a node.
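To make the delay/overhead tradeoff concrete, here is a small sketch (not from the slides) of the flooding overhead d + d^2 + ... + d^h, under the idealized assumption that the graph is d-regular and revisits are ignored; function names are mine:

```python
def flooding_overhead(d, h):
    """Messages sent when flooding to depth h in an idealized d-regular
    topology with no revisits: d + d^2 + ... + d^h."""
    return sum(d ** i for i in range(1, h + 1))

def walker_overhead(h):
    """A single random walker covers one edge per hop, so for delay h
    the overhead is exactly h."""
    return h
```

For d = 4 and h = 3, flooding costs 4 + 16 + 64 = 84 messages while a single walker with the same delay costs 3, which is why walkers trade delay for overhead.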

A simple analysis of random walk
Let p = population of the object, i.e. the fraction of nodes hosting the object, and T = TTL (time to live).

  Hop count h    Probability of success
  1              p
  2              (1-p)·p
  3              (1-p)^2·p
  ...            ...
  T              (1-p)^(T-1)·p

The expected hop count is
  E(h) = 1·p + 2·(1-p)·p + 3·(1-p)^2·p + ... + T·(1-p)^(T-1)·p
       = (1/p)·(1 - (1-p)^T) - T·(1-p)^T
With a large TTL, E(h) ≈ 1/p. With a small TTL, there is a risk that the search will time out before an existing object is located.

K random walkers

Assume that all k walkers start in unison. The probability that none finds the object after one hop is (1-p)^k, and the probability that none has succeeded after T hops is (1-p)^(kT). So the probability that at least one walker succeeds is 1 - (1-p)^(kT). A typical assumption is that the search is abandoned as soon as at least one walker succeeds. Using these, one can derive a new value of E(h). As k increases, the overhead increases but the delay decreases: there is a tradeoff.

Increasing search efficiency
Major strategies:
1. Biased walk utilizing node-degree heterogeneity.
2. Utilizing structural properties like random graphs, power-law graphs, or small-world properties.
3. Topology adaptation for faster search.
4. Introducing two layers in the graph structure using supernodes.
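The random-walk formulas above can be checked numerically. A small sketch (function names are mine); p is the fraction of nodes hosting the object and T is the TTL:

```python
def expected_hops(p, T):
    """Closed form from the slides: E(h) = (1/p)(1 - (1-p)^T) - T(1-p)^T."""
    q = 1.0 - p
    return (1.0 / p) * (1.0 - q ** T) - T * q ** T

def expected_hops_direct(p, T):
    """Direct sum: E(h) = sum over h of h * (1-p)^(h-1) * p."""
    q = 1.0 - p
    return sum(h * q ** (h - 1) * p for h in range(1, T + 1))

def k_walker_success(p, k, T):
    """Probability that at least one of k independent walkers succeeds
    within T hops: 1 - (1-p)^(kT)."""
    return 1.0 - (1.0 - p) ** (k * T)
```

With p = 0.1 and a large TTL, expected_hops returns approximately 10 = 1/p, matching the large-TTL approximation, and the closed form agrees with the direct sum.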

One-hop replication
Each node keeps track of the indices of the files belonging to its immediate neighbors. As a result, high-capacity / high-degree nodes can provide much better clues for a large number of search queries.

Biased random walk
[Figure: a walker picks the next neighbor with probability proportional to that neighbor's degree, e.g. P = 2/10, 5/10, 3/10.] Each node records the degrees of its neighboring nodes. Search easily gravitates towards high-degree nodes, which hold more clues.

[Figure: deterministic biased walk on a power-law graph; the counts of nodes found at successive steps were 94, 67, 63, 54, 2, 6, 1.]

The next step
This growing surge in popularity revealed the limits of the initial protocol's scalability. In early 2001, variations on the protocol improved scalability. Instead of treating every node as both client and server, some resource-rich nodes were used as ultrapeers or supernodes, containing indices of the objects in their local neighborhood. Search requests and responses were routed through them, leading to faster responses.

The KaZaA approach
Powerful nodes (supernodes) act as local index servers, and client queries are propagated to other supernodes ("Where is ABC?" goes to a supernode, which answers so that the client can download ABC directly from the peer holding it). Two-layered architecture.
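The two-layer idea can be sketched as a toy model (this is not KaZaA's actual protocol; the class and method names are mine): each supernode indexes the objects of its attached peers and forwards misses to the other supernodes.

```python
class Supernode:
    """Toy local index server for a two-layered overlay."""

    def __init__(self, name):
        self.name = name
        self.index = {}                 # object name -> peer that holds it

    def register(self, peer, objects):
        """A client peer reports its objects to its supernode."""
        for obj in objects:
            self.index[obj] = peer

    def lookup(self, obj, others=()):
        """Answer from the local index if possible; otherwise propagate
        the query to the other supernodes."""
        if obj in self.index:
            return self.index[obj]
        for sn in others:
            if obj in sn.index:
                return sn.index[obj]
        return None
```

A query that misses at the client's own supernode is resolved by another supernode's index, and the client then downloads directly from the peer returned; only the small supernode layer sees the query traffic.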

The Chord P2P Network
Some slides have been borrowed from the original presentation by the authors.

Main features of Chord:
-- Load balancing via consistent hashing.
-- Small routing tables per node: log n.
-- Small routing delay: log n hops.
-- Fast join/leave protocol (polylog time).

Consistent hashing
-- Assigns both nodes and objects an m-bit key.
-- Orders the nodes around an identifier circle (what does a circle mean here?) according to the order of their keys (0 .. 2^m - 1). This ring is known as the Chord ring. An object with key k is assigned to the first node whose key is >= k (called the successor node of key k).

Nodes and objects on the Chord ring
[Figure: circular 7-bit ID space with nodes N32, N90, N105 and keys K5, K20, K80.] A key k is stored at its successor (the first node with key >= k).

Consistent hashing [Karger 97]
Property 1. If there are N nodes and K object keys, then with high probability, each node is responsible for (1+ε)K/N objects.
Property 2. When a node joins or leaves the network, the responsibility for at most O(K/N) keys changes hands (and only to or from the node that is joining or leaving). When K is large, the impact is quite small.

The log N fingers
[Figure: circular ID space showing the distances of N80's neighbors from N80: 1/2, 1/4, 1/8, 1/16, 1/32, 1/64, 1/128 of the ring.] Each node knows of only log N other nodes: finger i points to the successor of n + 2^i (e.g. N80's finger for 80 + 32 = 112 points to N120).

Chord finger table
[Figure: N = 128 (m = 7); ring with nodes N32, N40, N52, N60, N70, N79, N80, N85, N102, N113.]
N32's finger table:
  33..33   N40
  34..35   N40
  36..39   N40
  40..47   N40
  48..63   N52
  64..95   N70
  96..31   N102

The finger table actually contains both the ID and the IP address of each entry. Node n's i-th entry is the first node >= n + 2^(i-1).

Lookup: greedy routing
[Figure: ring with nodes N32, N40, N52, N60, N70, N79, N80, N85, N102, N113.]
N32's finger table:
  33..33   N40
  34..35   N40
  36..39   N40
  40..47   N40
  48..63   N52
  64..95   N70
  96..31   N102
N70's finger table:
  71..71    N79
  72..73    N79
  74..77    N79
  78..85    N80
  86..101   N102
  102..5    N102
  6..69     N32
N80's finger table:
  81..81    N85
  82..83    N85
  84..87    N85
  88..95    N102
  96..111   N102
  112..15   N113
  16..79    N32
Node 32, lookup(82): 32 -> 70 -> 80 -> 85.

New node join
Assume that the new node N20 knows one of the existing nodes. Node 20 asks that node to locate the successors of 21, 22, 24, 28, 36, 52, and 84 (i.e. 20 + 2^i for i = 0..6).
N20's finger table, after the lookups:
  21..21   N32
  22..23   N32
  24..27   N32
  28..35   N32
  36..51   N40
  52..83   N52
  84..19   N102

The join procedure
The new node id asks a gateway node n to find the successor of id.

  n.find_successor(id)
    if id ∈ (n, successor]
      then return successor
      else forward the query around the circle
    fi

This needs O(n) messages for a simple Chord ring, which is slow.

Steps in join: a linked-list insert. The new node id is placed between n and successor(n). But the transition does not happen immediately.

A more efficient join

  // ask n to find the successor of id
  n.find_successor(id)
    if id ∈ (n, successor]
      then return successor
      else n' = closest_preceding_node(id)
           return n'.find_successor(id)
    fi

  // search for the highest predecessor of id
  n.closest_preceding_node(id)
    for i = log N downto 1
      if finger[i] ∈ (n, id) then return finger[i]

Example: N20 wants to find the successor of key 65. [Figure: the query travels from N20 across the ring to N70, the successor of K65.]

After join, move objects
[Figure: N20 becomes responsible for the arc of keys D114..20.] Notify the nodes that must include N20 in their tables: N113[1] = N20, not N32. Node 20 moves documents from node 32.

Three steps in join
Step 1. Initialize the predecessor and fingers of the new node. (Knowledge of the predecessor is useful in stabilization.)
Step 2. Update the predecessor and the fingers of the existing nodes. (Thus, notify the nodes that must include N20 in their table: N113[1] = N20, not N32.)
Step 3. Transfer objects to the new node as appropriate.
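The greedy lookup can be sketched in a few lines. This is a simplified single-process model (no RPCs, no IP addresses) with m = 7 as in the slides; the helper names are mine:

```python
M = 7                                   # m-bit identifier space, 2**M = 128

def in_half_open(x, a, b):
    """True if x lies in the circular interval (a, b]."""
    x, a, b = x % 2**M, a % 2**M, b % 2**M
    return a < x <= b if a < b else (x > a or x <= b)

def in_open(x, a, b):
    """True if x lies in the circular interval (a, b)."""
    x, a, b = x % 2**M, a % 2**M, b % 2**M
    return a < x < b if a < b else (x > a or x < b)

def build_fingers(ids):
    """finger[n][i] = successor of n + 2**i, for every node n."""
    ring = sorted(ids)
    def succ(k):
        k %= 2**M
        return next((n for n in ring if n >= k), ring[0])
    return {n: [succ(n + 2**i) for i in range(M)] for n in ring}

def find_successor(fingers, n, key):
    """Greedy routing: jump to the closest preceding finger of key."""
    path = [n]
    while not in_half_open(key, n, fingers[n][0]):
        n = next((f for f in reversed(fingers[n]) if in_open(f, n, key)),
                 fingers[n][0])          # fall back to the plain successor
        path.append(n)
    return fingers[n][0], path
```

On the ring {32, 40, 52, 60, 70, 79, 80, 85, 102, 113}, find_successor from node 32 for key 82 returns node 85, matching the slides' lookup(82) destination (the exact hop sequence depends on the finger intervals).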

Concurrent join
[Figure: two new nodes join between n1 and n2 at the same time; before stabilization their successor pointers may be inconsistent, afterwards the ring is repaired.]

Stabilization
Periodic stabilization is needed to integrate the new node into the network and restore the invariant. [Figure: new node n lies between n1 and n2.] If successor(n1).predecessor ≠ n1, then n1 adopts successor(n1).predecessor = n as its new successor.

The complexity of join
With high probability, any node joining or leaving an N-node Chord network will use O(log^2 N) messages to re-establish the Chord routing invariants and finger tables.
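A minimal sketch of the stabilize/notify pair (adapted from the slides' description; class and function names are mine, and the ring size is assumed to be 2^7 as in the examples):

```python
RING = 2 ** 7          # identifier space, m = 7 as in the slides

class Node:
    def __init__(self, ident):
        self.id = ident
        self.successor = self
        self.predecessor = None

def between(x, a, b):
    """x in the circular open interval (a, b)."""
    x, a, b = x % RING, a % RING, b % RING
    return a < x < b if a < b else (x > a or x < b)

def stabilize(n):
    """Run periodically: if a new node has slipped in between n and its
    successor, adopt it, then tell the successor about ourselves."""
    x = n.successor.predecessor
    if x is not None and between(x.id, n.id, n.successor.id):
        n.successor = x
    notify(n.successor, n)

def notify(s, n):
    """s learns that n might be its predecessor."""
    if s.predecessor is None or between(n.id, s.predecessor.id, s.id):
        s.predecessor = n
```

If node 20 joins between 16 and 32 by setting its successor to 32 and notifying it, the next stabilize at node 16 makes 16 adopt 20 as its successor, restoring the ring invariant.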

Chord summary
-- log n lookup messages and table space per node.
-- Well-defined location for each ID.
-- Natural load balance due to consistent hashing.
-- No name structure imposed.
-- Minimal join/leave disruption.

Chord advanced issues: analysis
Theorem. Search takes O(log N) time (2^m = key space, N = number of nodes).
Proof. After log N forwarding steps, the distance to the key is at most 2^m / N (since N = 2^(log N)). The number of nodes in the remaining range is O(log N) with high probability (a property of consistent hashing). So, by using successors in that range, it takes at most an additional O(log N) forwarding steps.

Analysis (contd.)
The O(log N) search time holds if the finger and successor entries are correct. But what if these entries are wrong (which is possible during join or leave operations, or after a process crash)?

Search under peer failures
[Figure: m = 7; ring with N16, N32, N45, N80, N96, N112. N32 has crashed, so the lookup for K42 fails: N16 does not know N45, where the file abcnews.com (which hashes to K42) is stored.]

One solution: maintain r multiple successor entries; in case of a failure, use the other successor entries.
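The benefit of keeping r successor entries is easy to quantify; the slides suggest r = 2·log2 N, which drives the failure probability down to 1/N^2 even when half the nodes die. A quick check (function names are mine, failures assumed independent):

```python
import math

def none_alive_prob(fail_prob, r):
    """Probability that all r successor entries point to dead nodes,
    assuming independent failures."""
    return fail_prob ** r

def lookup_survives(fail_prob, n):
    """With r = 2*log2(n) successor entries, probability that at least
    one successor is alive."""
    r = 2 * math.log2(n)
    return 1.0 - fail_prob ** r
```

With fail_prob = 0.5 and N = 16, the survival probability is 1 - (1/2)^8 = 1 - 1/256, i.e. exactly 1 - 1/N^2.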

This can be a reactive or a proactive approach. [Figure: m = 7; with extra successor entries, N16 routes the query for K42 around the crashed N32 to N45, where the file abcnews.com (which hashes to K42) is stored.]

Choosing r = 2·log N suffices to maintain correctness with high probability. Say 50% of the nodes fail (i.e. the probability of failure is 1/2). For a given node,
  Pr(at least one successor alive) = 1 - (1/2)^(2 log N) = 1 - 1/N^2

Search under peer failures (2)
[Figure: m = 7; now N45 itself is dead, so the lookup for K42 fails even though routing succeeds.] One solution: replicate the file/key at r successors and predecessors. [Figure: K42 is replicated at N32 and N80, the neighbors of the dead N45, so the query for abcnews.com can still be answered.]

Dealing with dynamic issues
Peers fail, new peers join, peers leave. We need to update successors and fingers, and ensure that keys reside in the right places.

New peers joining
-- Some gateway node directs N40 to its successor N45.
-- N32 updates its successor to N40.
-- N40 initializes its successor to N45, and obtains its fingers from it.
-- N40 periodically talks to its neighbors to update its finger table (the stabilization protocol).
[Figure: m = 7; ring with N16, N32, N45, N80, N96, N112 and the new node N40.]

New peers joining (2)
N40 may need to copy some files/keys from N45 (files with fileid between 32 and 40). [Figure: keys K34 and K38 move from N45 to N40.]

Concurrent join
[Figure: m = 7; nodes N20, N24, N28 join concurrently between N16 and N32; keys K24 and K38 are shown.] Argue that each node will eventually be reachable.

Effect of join on lookup
If, in a stable network with N nodes, another set of N nodes joins the network, and the join protocol correctly sets their successors, then lookups will take O(log N) steps w.h.p.
Consistent hashing guarantees that there are O(log N) new nodes w.h.p. between two consecutive old nodes. [Figure: while the transfer of K24 to N24 is pending, a linear scan past N20 and N24 will still locate K24.]

Weak and Strong Stabilization
[Figure: a loopy network on nodes N1, N3, N5, N24, N63, N78, N96, in which the successor pointers wind around the ring more than once.] Here ∀u: successor(predecessor(u)) = u, so the network is weakly stable; still, it is not strongly stable. Why?

What is funny / awkward about this? There is a node v with u < v < successor(u) for some u, even though the network is weakly stable. Strong stability additionally requires that no such v exists.

Strong stabilization
The key idea of recovery from loopiness: let each node u ask its successor to walk around the ring until it reaches a node v with u < v <= successor(u); if v ≠ successor(u), a loop has been detected and u's successor pointer is corrected.

Bidirectional Chord
Each node u has fingers to u+1, u+2, u+4, u+8, ... as well as u-1, u-2, u-4, u-8, ... How does it help?

Skip Lists and Skip Graphs
Some slides adapted from the original slides by James Aspnes and Gauri Shah.

Definition of skip list
A skip list for a set L of distinct (key, element) items is a series of linked lists L0, L1, ..., Lh such that:
-- Each list Li contains the special keys -∞ and +∞.
-- List L0 contains the keys of L in non-decreasing order.
-- Each list is a subsequence of the previous one, i.e., L0 ⊇ L1 ⊇ ... ⊇ Lh.
-- List Lh contains only the two special keys -∞ and +∞.

Skip list
A dictionary based on a probabilistic data structure. It allows efficient search, insert, and delete operations. Each element in the dictionary typically stores additional useful information beside its search key (e.g. an element for the University of Iowa, an element for the Daily Iowan). A probabilistic alternative to a balanced tree.

[Figure: a skip list with levels 0-2 between HEAD and TAIL; keys A, G, J, M, R, W appear at level 0, with a subset at level 1 and a smaller subset at level 2. Each node is linked at the next higher level with probability 1/2.]

Another example
[Figure: L0 holds 12, 23, 26, 31, 34, 44, 56, 64, 78; L1 holds 23, 31, 34, 64; L2 holds 31.] Each element of Li appears in Li+1 with probability p. Higher levels act as express lanes.

Searching in a skip list
Search for a key x as follows:
-- Start at the first position of the top list.
-- At the current position P, compare x with the key y of the entry after P:
     if x = y, return element(after(P));
     if x > y, scan forward;
     if x < y, drop down.
-- If we move past the bottom list, then no such key exists.
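The search procedure (and the coin-flip insertion described next in the deck) can be sketched compactly. This is a standard textbook skip list, not tied to any particular implementation; -∞ is modeled by a head tower:

```python
import random

class _Node:
    def __init__(self, key, height):
        self.key = key
        self.next = [None] * height     # next[i] = successor in list Li

class SkipList:
    def __init__(self, p=0.5, max_level=16, seed=1):
        self.p, self.max_level = p, max_level
        self.rng = random.Random(seed)
        self.head = _Node(float("-inf"), max_level)   # the -inf tower

    def _height(self):
        """Coin flips: promote to the next level with probability p."""
        h = 1
        while h < self.max_level and self.rng.random() < self.p:
            h += 1
        return h

    def insert(self, key):
        update = [self.head] * self.max_level
        node = self.head
        for i in reversed(range(self.max_level)):
            while node.next[i] is not None and node.next[i].key < key:
                node = node.next[i]
            update[i] = node            # last node before key in Li
        new = _Node(key, self._height())
        for i in range(len(new.next)):
            new.next[i] = update[i].next[i]
            update[i].next[i] = new

    def search(self, key):
        """Scan forward while the next key is smaller, else drop down."""
        node = self.head
        for i in reversed(range(self.max_level)):
            while node.next[i] is not None and node.next[i].key < key:
                node = node.next[i]
        node = node.next[0]
        return node is not None and node.key == key
```

Inserting the level-0 keys from the example figure and searching for 78 (present) and 40 (absent) reproduces the expected hit and miss.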

Example of a search for 78
[Figure: the same skip list with an empty top level L3; the search starts at the top left and follows the express lanes right and down.] At L1, the key after P is +∞, which is bigger than 78, so we drop down. At L0, 78 = 78, so the search is over.

Insertion
The insert algorithm uses randomization to decide in how many levels the new item should be added to the skip list. After inserting the new item at the bottom level, flip a coin.

If it returns tails, insertion is complete. Otherwise, move to the next higher level, insert the item at the appropriate position in that level, and repeat the coin flip.

Insertion example
1) Suppose we want to insert 15.
2) Do a search, and find the spot between 10 and 23 in the bottom list.
3) Suppose the coin comes up heads three times.
[Figure: before, L0 = 10, 23, 36 with 23 promoted to L1; after, 15 appears at the bottom level and is promoted to the higher levels, with a new empty top level created.]

Deletion
Search for the given key. If a position with the key is not found, then no such key exists. Otherwise, if a position with the key is found (it will definitely be found on the bottom level), remove all occurrences of the key from every level. If the uppermost level becomes empty, remove it.

Deletion example
1) Suppose we want to delete 34.
2) Do a search, and find the spot between 23 and 45 in the bottom list.
3) Remove all the positions holding 34, and remove the now-empty top level.
[Figure: before, L0 = 12, 23, 34, 45 with 23, 34 at L1 and 34 at L2; after, L0 = 12, 23, 45 with 23 at L1.]

Constant number of pointers
The average number of pointers per node is O(1): the total number of pointers is 2n + 2·n/2 + 2·n/4 + 2·n/8 + ... = 4n, so the average number of pointers per node is 4.
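The geometric series can be checked directly (a sketch; it assumes the expected halving of list sizes per level and two pointers, forward and backward, per node per level):

```python
def total_pointers(n):
    """Expected pointer count: 2n at level 0, 2n/2 at level 1, 2n/4 at
    level 2, ..., summing to less than 4n."""
    total, level_size = 0.0, float(n)
    while level_size >= 1.0:
        total += 2.0 * level_size
        level_size /= 2.0
    return total
```

For n = 1024 the sum is 4094, so the average per node is just under 4, matching the bound above.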

Number of levels
The number of levels is O(log n) w.h.p.:
  Pr[a given element x is above level c·log n] = 1/2^(c·log n) = 1/n^c
  Pr[any element is above level c·log n] <= n · 1/n^c = 1/n^(c-1)

Search time
Consider a skip list with two levels, L0 and L1. To search for a key, first search L1 and then search L0.
  Cost (i.e. search time) = length(L1) + n / length(L1)
This is minimized when length(L1) = n / length(L1), i.e. length(L1) = n^(1/2), giving cost 2·n^(1/2). With three lists the minimum cost is 3·n^(1/3); with log n lists the minimum cost is log n · n^(1/log n) = 2·log n.

Skip lists for P2P?
Advantages:
-- O(log n) expected search time.
-- Retains locality.
-- Dynamic node additions/deletions.
Disadvantages:
-- Heavily loaded top-level nodes.

-- Easily susceptible to failures. Lacks redundancy.

A Skip Graph
[Figure: a skip graph on keys A, G, J, M, R, W with membership vectors A = 001, G = 100, J = 001, M = 011, R = 110, W = 101. Level 0 is a single list holding all keys; level 1 splits the keys by the first bit of the membership vector; level 2 splits them by the first two bits.] Link at level i to nodes with a matching prefix of length i. Think of a tree of skip lists that share their lower layers.
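The level structure follows mechanically from the membership vectors. A sketch using the vectors from the figure (the function name is mine):

```python
def level_lists(membership, level):
    """Group keys whose membership vectors share a prefix of the given
    length; each sorted group is one linked list at that level."""
    groups = {}
    for key, mv in membership.items():
        groups.setdefault(mv[:level], []).append(key)
    return {prefix: sorted(keys) for prefix, keys in groups.items()}

# Membership vectors from the slides' example.
MV = {"A": "001", "G": "100", "J": "001",
      "M": "011", "R": "110", "W": "101"}
```

Level 0 is the single list A G J M R W; level 1 splits it into {A, J, M} (prefix 0) and {G, R, W} (prefix 1); level 2 splits further, e.g. A and J stay together under prefix 00.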

Properties of skip graphs
1. Efficient searching.
2. Efficient node insertions & deletions.
3. Independence from system size.
4. Locality and range queries.

Searching: average O(log n)
[Figure: restricting the skip graph to the lists containing the starting element of the search, we get a skip list.] Same performance as DHTs.

Node insertion - 1
Starting at a buddy node, find the nearest key at level 0. This takes O(log n) time on average. [Figure: new node J, with membership vector 001, enters the level-0 list next to its buddy.]

Node insertion - 2
At each level i, find the nearest node with a matching prefix of the membership vector of length i+1. [Figure: J joins the level-1 and level-2 lists whose prefixes match 001.] Total time for insertion: O(log n). DHTs take O(log^2 n).

Independence from system size
No need to know the size of the keyspace or the number of nodes. [Figure: inserting J into a skip graph containing only E and Z; old nodes extend their membership vectors as required as new nodes arrive.] DHTs require knowledge of the keyspace size initially.

Locality and range queries
Find key < F, > F. Find the largest key < x. Find the least key > x. Find all keys in the interval [D..O]. [Figure: keys A, D, F, I, L, O, S in order at level 0.] Initial node insertion is at level 0.

Applications of locality
Version control: e.g. find the latest news from yesterday -- find the largest key < news:02/13. [Figure: level 0 holds news:02/09, news:02/10, news:02/11, news:02/12, news:02/13.]
Data replication: e.g. find any copy of some Britney Spears song. [Figure: level 0 holds britney01 through britney05.]
DHTs cannot do this easily, as hashing destroys locality.
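Because level 0 keeps keys in order, a range query is just a search for the left endpoint followed by a walk to the right. A sketch over a plain sorted list standing in for the level-0 list (in a real skip graph the initial search costs O(log n); function names are mine):

```python
import bisect

def range_query(level0_keys, lo, hi):
    """All keys in [lo, hi]: find the least key >= lo, then walk right."""
    out = []
    i = bisect.bisect_left(level0_keys, lo)
    while i < len(level0_keys) and level0_keys[i] <= hi:
        out.append(level0_keys[i])
        i += 1
    return out

def largest_below(level0_keys, x):
    """Largest key < x, e.g. the latest news item before a given date."""
    i = bisect.bisect_left(level0_keys, x)
    return level0_keys[i - 1] if i > 0 else None
```

On the figure's keys A D F I L O S, the interval [D..O] yields D, F, I, L, O; on the news keys, largest_below("news:02/13") returns news:02/12, the version-control query from the slide.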

Load balancing
We are interested in the average load on a node u, i.e. the number of searches from a source s to a destination t that pass through u.
Theorem. Let dist(u, t) = d. Then the probability that a search from s to t passes through u is <= 2/(d+1), where V = {nodes v : u <= v <= t} and |V| = d+1.

Skip list restriction
[Figure: node u is on the search path from s to t only if it is in the skip list formed from the lists of s at each level.]

Tallest nodes
Node u is on the search path from s to t only if it is in T, the set of k tallest nodes in [u..t]. [Figure: examples where u is, and is not, among the tallest nodes, and hence on or off the path.]
  Pr[u ∈ T] = Σ_{k=1}^{d+1} Pr[|T| = k] · k/(d+1) = E[|T|]/(d+1)
Heights are independent of position, so distances are symmetric.