Approximate Graph Matching R. Srikant ECE/CSL UIUC Coauthor Joseph Lubars Problem Statement Given two correlated graphs One with known node identities, One with unknown (or incorrect)

node identities Goal: Infer the identities of the nodes in the second graph Problem Given two correlated graphs One with known node identities, One with unknown (or incorrect) node identities Goal: Infer the identities of the nodes in the second graph Computational Complexity Requirement

We are interested in very large graphs: Problem Goes by Many Names Approximate Graph Matching Random Graph Isomorphism: Special case Network Deanonymization: Privacy Network Alignment: Biology Application 1: Social Networks Friendship Graph Bob Alice Carol

Bob Bob Alice Carol Sample edges from an underlying friendship graph to obtain social networks Alice Carol Application 1: Social Networks

? Bob ? Alice ? Carol Use the graph topology of one social network to deanonymize members of another network Application 2: Protein Interaction

Human Network Mouse Network Q8WUU5 Q920S3 P06436 P58391 Q9Y365 P62805

Q9JMD3 P62806 Find proteins with similar functions across different species based on the topologies of their interaction networks Application 3: Wikipedia Articles English Wikipedia French Wikipdia Hydrosphre Hydrosphere Sun

Soleil Terre Earth Solar System Systme solaire Supercontinent Supercontinent Automatically find or correct corresponding articles in different versions of Wikipedia based on the graph of article links.

Mathematical Model Note: permuting the node identities or giving them different identities, or erasing the node identities, are all equivalent 1 2

Start with a random graph (generated according to some random graph model) Independently sample each edge of twice with probability Permute node labels of one graph Prior Results Special Case:

In this case, the problem reduces to the graph isomorphism problem find a permutation of the nodes of that produces exactly w.h.p. results below are for ErdsRnyi graphs (each edge occurs with probability ) Result 1 (Erds and Rnyi, 1963): The identity permutation is the unique solution w.h.p. (i.e., ErdsRnyi graphs have no nontrivial automorphisms w.h.p.) Result 2 (Babai, Erds, and Seklow, 1980): There is a linear time algorithm to find the correct permutation (random graph isomorphism) with high probability. General Case: Permutation Matrices

Permuting the node labels of a graph changes its adjacency matrix If is the original adjacency matrix, the new adjacency matrix can be represented as , where is a permutation matrix (consists of exactly one 1 in each column and one 1 in each row, and 0s elsewhere) Mismatch Metric General case (: Permute the nodes of so that the adjacency matrices of and are as similar as possible 1 In other words, minimize the mismatch: min 1

2 2 Adjacency matrix for : Set of permutation matrices The solution produces the correct permutation w.h.p. for many graph models (e.g., CullinaKiyavash, 2016). However, computationally infeasible 2

2 Convex Relaxation Change the problem to: min 1 2

2 1 is the set of doubly stochastic matrices, which is the convex hull of the set of permutation matrices Algorithm: Solve this problem and project the result onto the set of permutation matrices No guarantees for recovery 2

2 Seed-Based Approaches (Narayanan- Shmatikov, 2009) Many recent algorithms assume that some node identities are known in . These identities are used to discover all of the remaining node identities in the graph: Result 1 (Yartseva and Grossglauser, 2013): For if , and at least

initial seed identities are known, then the algorithm recovers the identities of nodes with high probability Seed-Based Approaches Many recent algorithms assume that some node identities are known in . These identities are used to discover all of the remaining node identities in the graph: Result 2 (Korula and Lattanzi, 2013): A similar algorithm to result 1, but slightly worse theoretical guarantees for ErdsRnyi graphs The authors also show that the algorithm recovers 97% of the identities if is generated according to a certain preferential attachment model Seed-Based Approaches Many recent algorithms assume that some node identities are known in . These

identities are used to discover all of the remaining node identities in the graph: Result 3 (Kazemi, Hassani, and Grossglauser, 2015): A similar algorithm to result 1, but allows for erroneous initial seed identities Makes errors Our Model/Results Our Model Stochastic Block Model

Two communities and Edge probability within the same community Edge probability across communities Assume 1 2

Our Model We start with two independently edge sampled copies of a graph generated by a stochastic block model A fraction of the node labels of are correct, but the rest are incorrect (call this permutation ). But we dont know which labels are correct Problem: Find the correct labels of the nodes of Sample Edges 1

Permute node labels 2 Motivation Problem: we are given a partially correct permutation (), we have to make it fully correct Use convex relaxation (or some other method) to produce a preliminary estimate of the correspondence between the nodes of the two graphs () In practice, some fraction of the nodes will be correctly matched, some will not. But we dont know which nodes are correctly matched Key Assumption: The correct and incorrect matches are distributed uniformly at random

Main Result Suppose nodes are placed into community with probability and community with probability , with . Then, if , and , then there exists an efficient algorithm which recovers the correct permutation exactly with high probability. Main Result Suppose nodes are placed into community with probability and community with probability , with . Then, if But not too large to make the graph densely connected

Fraction of correctly matched nodes is not too small, then there exists an efficient algorithm which recovers the correct permutation exactly with high probability. The Algorithm: Witnesses (KorulaLattanzi) 1 Question: Is in the same as in ? If is matched to , and is a neighbor of and is a neighbor of , then - is a witness for The more witnesses we have, the more confident we are about a match We wish to match nodes in with nodes in

in a way that maximizes the total number of witnesses ^ 2

( , )=2 MWM on Bipartite Graphs Bipartite graph, with the nodes of on the left and those of on the right: 1. For every , calculate the number of witnesses . This is the weight of the edge 2. Find the maximum weighted matching Complexity of a nave implementation: .

Can we reduce it? 2 1 (, ) ( , )

Step 1: Efficiently calculating 2 1 In parallel, for each , we can efficiently calculate the number of witnesses for every as follows: The number of paths from to in the graph on the right gives New complexity: , where is the max degree of a node in graph

^

( , )=1 ( , )=3 ( , )=2

( , )=1 Step 2: Greedy Matching, instead of MWM Add the match with the largest number of witnesses, remove conflicts, add the match with the next largest number of witnesses, Why does it work? On the right: the weights of the bipartite graph as a matrix, with the true as the identity. (Dark blue: low values, dark red: high values) The diagonal entries tend to dominate their respective rows and columns.

Why Does Greedy Matching Work? E-R graph with being the probability of an edge (Yartseva-Grossglauser, KorulaLattanzi). Each edge appears in , with probability Probability that is a witness for if it is a correct match: Probability that is a witness for if it is an incorrect match:

If is small, , so the number of witnesses for is much larger on average for a correct match Simulations In practice, the algorithm can be run repeatedly Suppose 10% of the matches are correct initially, then by running the algorithm once, one may increase this to something larger than 10%

Run it again, increase the number of correct matches Repeat several times. Threshold phenomenon: if the initial number of correct matches is small, doesnt help; otherwise, can match all nodes correctly E-R Graphs Running the Algorithm Once Fraction of initially correct matches Running the Algorithm Iteratively Fraction of initially correct matches Performance on Various Graph

Models Stochastic Block Model Fraction of initially correct matches Barabsi-Albert Model Fraction of initially correct matches Possible Algorithm for Seedless Matching 1 , 2 Seedless Algorithm (e.g., Convex

Relaxation Approach) ^ Witness-Based Correction Technique ~ Seedless Matching In Practice Our algorithm on Barabsi-Albert graphs for varying values of

Conclusions Can we recover all node identities w.h.p. with no seeds? No low-complexity algorithm is known for any random graph model for For a stochastic block model, if we are given an initial estimate of the correct permutation, then exact matches can be found w.h.p. The initial estimate must contain a fraction of nodes that are correctly matched but we dont need to know which matches are correct Motivation: convex relaxation or other approaches can provide an initial estimate Assumption: the correct and incorrect matches are uniformly distributed Can remove this assumption, but much weaker theoretical guarantees