Link Prediction in Social Networks Laks V.S. Lakshmanan (based on Kleinberg and Liben-Nowell. The Link Prediction Problem for Social Networks. CIKM 2003.) The Problem Examples: Co-authorship graphs, scientists serving on common PC/ editorial board, business leaders working on common board, etc. Can current state of network be used to predict future links? Note: factors extraneous to n/w

(need to be) ignored (e.g., they met on a beach). Motivation & Significance Model for Network evolution. Predict likely interactions, not explicitly observed, based on observed links useful for terrorist network monitoring. Link prediction is one instance of Social Network Analysis. Application for Friend suggestion in online SN. Notice, this takes on a flavor of link Main Contributions

Features intrinsic to n/w can offer a good measure of how likely a future link is. Can significantly outperform chance. Need more sophisticated measures than direct ones (e.g., geodesic distances). Note: Extra-network features can significantly improve prediction accuracy: e.g., keywords describing interests of each scientist, or keywords extracted from their papers/abstracts/titles. Not considered here. See Al Hasan et al. SDM workshop 2006. http://www.cs.rpi.edu/~zaki/PaperDir/LINK06.pdf Ben Taskar et al. Link prediction in relational data. In Proceedings of Neural Information Processing Systems, 2004. A. Popescul and L. Ungar. Statistical relational learning for link prediction. In Workshop on Learning Statistical Models from Relational Data at IJCAI, 2003.

Link Prediction Problem Formalized G[t0,t0] = the multi-graph of social interactions between actors (users) during the training interval [t0,t0]. Separate edge for each interaction (paper, phone call, email, ...) with timestamp. Predict edges in test interval [t1,t1], where t1>t0. E_old = edges present in G[t0,t0]. E_new = edges in G[t1,t1] but not in G[t0,t0].

Refinement: Restrict to those nodes adjacent to >= edges in G[t0,t0] and to >= edges in G[t1,t1]. Call them the core nodes. Example John Mary Hui Asif Ram John Nita Mary Asif

Ram Hui Say = = 2. Only evaluate how well we do on {John, Ram, Asif, Mary}. E*_new = {(J,M), (R,M)}, when restricted to the core nodes above. Evaluation: among the top-most likely edges predicted, how well we do on precision and recall. Precision = analog of soundness.

Recall = analog of completeness. Some simple notions Triadic closure [Easley & Kleinberg 2010]: if x and y have a common friend, then the likelihood of x, y being friends increases. Clustering coefficient of node x = fraction of its neighbors that are connected by a link. Bearman and Moodys study: teenage girls who have a low clustering coefficient in their network of friends are significantly more likely to contemplate suicide than those whose clustering coefficient is high! (Ouch!)

Various predictors Notation: (u) = neighbors of u in G[t0,t0]. score(u,v) = score that indicates likelihood of (u,v) being in E*_new. Neighborhood-based: Shortest distance: (negated) length of shortest path (geodesic). Common neighbors: (u) (v). Jaccard coefficient: |(u)(v)|/|(u)(v)|(u)(v)|/|/|(u)(v)|/| (u)(v)|(u)(v)|/|. More predictors Adamic/Adar: z(u)(v)(u)(v) 1/log|(u)(v)|/|(z)|(u)(v)|/|.

Preferential attachment: |(u)(v)|/|(u)|(u)(v)|/|x|(u)(v)|/|(v)|(u)(v)|/|. Based on ensemble of paths: Katz, Hitting/Commute Time, Rooted PageRank, SimRank. The Katz Score Katz: i i|(u)(v)|/|pathsi(u,v)|(u)(v)|/|, where 0<<1.<<1.1. Dealing with multi-graphs: Unweighted boolean. Weighted count #interactions b/w neighbors. Shorter paths more affinity.

More paths more affinity. K = A + 2A2 + ... = I (I A)-1, where A = adjacency matrix of graph. Notice computational challenges. Fact: For convergence, need <<1. [1/(largest eigenvalue of A)]. Hitting & Commute Time Hitting & Commute Time: Perform a random walk, starting at u, follow neighbor of current node w.p. 1/deg(curr). Adapted in the obvious way if edges are weighted. Hu,v=expected time for walk to reach v.

Not symmetric. Cu,v=Hu,v+Hv,u. x=stationary distribution weight of x. Stationary normed versions: Hu,vv. Hu,vv+Hv,uu. Score, in all cases, negative of measure. (Why?) Rooted PageRank Rooted PR:v in the following RW: At any step,

Jump to u with probability p. Move to one of the neighbors of current node at random with probability 1-p. Avoids some pitfalls of HT/CT by minimizing dependence on nodes far away from both u and v. Computational challenge (of RPR, HT, CT) similar to PageRank. SimRank SimRank: SR(u,u) = 1. When u != v, SR(u,v) =

.w(u)(v)(u)x(u)(v)(v)SR(w,x)/|(u)(v)|/|(u)|(u)(v)|/||(u)(v)|/|(v)|(u)(v)|/|, where 0<<1.<<1.1. SR(u,v) = 0 when (u) u (v) = . RW interpretation: E[i], where i=earliest time at which RWs started at u and v meet for the first time. Qn: Can the RW interpretation be leveraged to exploit PageRank-like techniques to compute SimRank? Predictors Summary Neighborhood-based (easy to compute) and Path-based (more accurate). Output sorted in descending score order of pages. Consider top-n pages among core nodes for evaluation.

A performance trick, which pays off with quality as well: use low rank approximation Ak of adjacency matrix A. Works very well since usually A is extremely sparse. Meta-measures score*(u,v) = |(u)(v)|/|(v) Top-K(u)|(u)(v)|/|, for some K. //how many of your neighbors are my nearest score-neighbors?// score*(u,v) = z(u)(v)(v)TopK(u)score(u,z). //what is the total strength of my relationship with my nearest score-neighbors who are your

neighbors?// unseen bigrams. Conclusions While many predictors significantly outperform random, their precision can use considerable improvement (e.g., <<1. 20%). Need for better prediction algorithms: Perhaps more recent training data is more important? Combining measures (e.g., using Bayesian classifier). Leveraging node properties. Need fast algorithms for approximating

measures. Recommended Papers Link Prediction: & SN evolution Al Hasan et al. Link prediction using supervised learning. SDM 2006 Workshop. Ben Taskar et al. Link prediction in relational data. In Proceedings of Neural Information Processing Systems, 2004. A. Popescul and L. Ungar. Statistical relational learning for link prediction. In Workshop on Learning Statistical Models from Relational Data at IJCAI, 2003. Jure Leskovec, Lars Backstrom, Ravi Kumar, and Andrew Tomkins. Microscopic evolution of social networks. In Proceedings of the 14th ACM International Conference on Knowledge Discovery and Data Mining (KDD'08). Jure Leskovec, Jon M. Kleinberg, and Christos Faloutsos. Graphs over time: densification laws, shrinking diameters and possible explanations. In Proceedings of the 11th ACM

International Conference on Knowledge Discovery and Data Mining (KDD'05). Recommended Papers Computational Aspects: Kurt C. Foster, Stephen Q. Muth, John J. Potterat, and Richard B. Rothenberg. A faster katz status score algorithm. Computational & Mathematical Organization Theory, 7(4):275285, 2001. Purnamrita Sarkar and Andrew W. Moore. A tractable approach to finding closest truncated-commute-time neighbors in large graphs. In Proceedings of the 23 rd Conference on Uncertainty in Artificial Intelligence (UAI'07). Purnamrita Sarkar, Andrew W. Moore, and Amit Prakash. Fast incremental proximity search in large graphs. In Proceedings of the 25th International Conference on Machine Learning (ICML'08). Glen Jeh and Jennifer Widom. Simrank: a measure of structural-context similarity. In Proceedings of the Eighth ACM International Conference on Knowledge

Discovery and Data Mining (KDD'02). Dmitry Lizorkin et al. Accuracy estimate and optimization techniques for SimRank computation. VLDB Journal, 2008. Daniel Fogaras, et al. Towards scaling fully personalized pagerank: Algorithms, lower bounds, and experiments. Internet Mathematics, 2(3), 2005. Peixiang Zhao, Jiawei Han, and Yizhou Sun. P-Rank: a comprehensive structural similarity measure over information networks. CIKM 2009. P. Zhao, J. Han, and Y. Sun. P-rank: a comprehensive structural similarity measure over information networks. InCIKM, pages 553{562, 2009. Pei Lee, Laks V.S. Lakshmanan and Jeffrey Xu Yu. On Top-k Structural Similarity Search. ICDE 2012. Recommended Papers Surveys & Some Theory: Lise Getoor and Christopher P. Diehl. Link Mining: A Survey. SIGKDD Explorations 2005.

Linyuan L and Tao Zhou. Link prediction in complex networks: A survey. Physica A: Statistical Mechanics and its Applications. Volume 390, Issue 6, 15 March 2011, Pages 11501170. Mohammad Al Hasan and Mohammed J. Zaki. A Survey of Link Prediction in Social Networks. Social Network Data Analytics 2011, pp. 243-275. Purnamrita Sarkar, Deepayan Chakrabarti, and Andrew W. Moore. 2011. Theoretical justification of popular link prediction heuristics. In Proceedings of the Twenty-Second international joint conference on Artificial Intelligence Volume Volume Three (IJCAI'11), Toby Walsh (Ed.), Vol. Volume Three. AAAI Press 2722-2727.