18734: Foundations of Privacy
Privacy-Preserving Release of Statistics: Differential Privacy
Anupam Datta, CMU, Fall 2015

Privacy-Preserving Statistics: Non-Interactive Setting
Database D = (x1, ..., xn) is maintained by a trusted curator, who releases
a sanitized database to the analyst (add noise, sample, generalize, suppress).
Goals: accurate statistics (low noise); preserve individual privacy
(what does that mean?)

Examples: census data, health data, network data.

Privacy-Preserving Statistics: Interactive Setting
Database D = (x1, ..., xn) is maintained by a trusted curator. The analyst
sends a query f (e.g., "# individuals with salary > $30K") and receives
f(D) + noise.
Goals: accurate statistics (low noise); preserve individual privacy
(what does that mean?)
Examples: census data, health data, network data.

Classical Intuition for Privacy
"If the release of statistics S makes it possible to determine the value
[of private information] more accurately than is possible without access
to S, a disclosure has taken place." [Dalenius 1977]
Privacy means that anything that can be learned about a respondent from
the statistical database can be learned without access to the database.
This is similar to the semantic security of encryption.

Impossibility Result [Dwork, Naor 2006]
Result: for any reasonable notion of breach, if the sanitized database
contains information about the database, then some adversary breaks this
definition.
Example:

Terry Gross is two inches shorter than the average Lithuanian woman. The
database allows computing the average height of a Lithuanian woman. This
database breaks Terry Gross's privacy according to this definition, even
if her record is not in the database!

Very Informal Proof Sketch
Suppose DB is uniformly random, and a breach is predicting a predicate
g(DB). The adversary's background knowledge is the pair
(r, H(r; San(DB)) ⊕ g(DB)), where H is a suitable hash function and
r = H(DB). By itself, this background knowledge does not leak anything
about DB; together with San(DB), it reveals g(DB).

Differential Privacy: Idea [Dwork, McSherry, Nissim, Smith 2006]

The released statistic is about the same if any individual's record is
removed from the database.

An Information Flow Idea
Changing the input database in a specific way changes the output statistic
by only a small amount.

Not Absolute Confidentiality
Differential privacy does not guarantee that Terry Gross's height won't be
learned by the adversary.

Differential Privacy: Definition
A randomized sanitization function K has ε-differential privacy if for all
data sets D1 and D2 differing in at most one element, and for all subsets S
of the range of K,
    Pr[K(D1) ∈ S] ≤ e^ε · Pr[K(D2) ∈ S]
Example: the answer to the query "# individuals with salary > $30K" is in
the range [100, 110] with approximately the same probability in D1 and D2.

Achieving Differential Privacy: Interactive Setting
The user asks the curator of database D = (x1, ..., xn) "Tell me f(D)" and
receives f(D) + noise.
How much and what type of noise should be added?
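As a concrete check of the differential-privacy definition, one can compute the exact probability that a Laplace-noised count lands in [100, 110] for two neighboring databases and verify the e^ε bound. This is a sketch: the true counts 105 and 104 are made-up values for two hypothetical neighboring databases, and the noise scale (sensitivity/ε) anticipates the Laplace mechanism introduced later in the lecture.

```python
import math

def laplace_cdf(x, mu, b):
    """CDF of the Laplace distribution with mean mu and scale b."""
    if x < mu:
        return 0.5 * math.exp((x - mu) / b)
    return 1.0 - 0.5 * math.exp(-(x - mu) / b)

def interval_prob(true_count, lo, hi, eps, sensitivity=1.0):
    """Pr[true_count + Lap(sensitivity/eps) lands in [lo, hi]]."""
    b = sensitivity / eps
    return laplace_cdf(hi, true_count, b) - laplace_cdf(lo, true_count, b)

eps = 0.5
p1 = interval_prob(105, 100, 110, eps)  # hypothetical count in D1
p2 = interval_prob(104, 100, 110, eps)  # neighboring D2: one record fewer
ratio = max(p1 / p2, p2 / p1)
print(p1, p2, ratio)
assert ratio <= math.exp(eps)  # the epsilon-DP guarantee for this event
```

The two probabilities are nearly equal, exactly as the definition demands: observing the event "answer in [100, 110]" barely distinguishes D1 from D2.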

(Slide: Adam Smith)

Example: Noise Addition (figure; slide: Adam Smith)

Global Sensitivity
For databases D1, D2 differing in at most one element, the global
sensitivity of f is GS_f = max |f(D1) − f(D2)| over all such pairs
(for vector-valued f, use the L1 norm ||f(D1) − f(D2)||_1).

Exercise
Function f: # individuals with salary > $30K. Global sensitivity of f = ?
Answer: 1 (changing one individual's record changes the count by at most 1).

Background on Probability Theory
(see the Oct 11, 2013 recitation)

Continuous Probability Distributions
Probability density function (PDF), f_X. Example distributions: normal
(Gaussian), exponential, Laplace.

Laplace Distribution
PDF: f(x | μ, b) = (1/(2b)) exp(−|x − μ|/b)
Mean: μ
Variance: 2b²
Source: Wikipedia

Laplace Distribution
(Same distribution, with a change of notation from the previous slide:
x is relabeled y and the mean μ is set to 0; the scale parameter is b.)

Achieving Differential Privacy (figure)
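The mean and variance stated above can be sanity-checked empirically. A minimal stdlib-only sketch, using the standard inverse-CDF method for sampling Laplace noise (a uniform draw on (−1/2, 1/2) mapped through the inverse CDF):

```python
import math
import random

def sample_laplace(b, rng):
    """Inverse-CDF sampling: U ~ Uniform(-1/2, 1/2) maps to Lap(0, b)."""
    u = rng.random() - 0.5
    return -b * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

rng = random.Random(0)  # fixed seed for reproducibility
b = 1.0
samples = [sample_laplace(b, rng) for _ in range(200_000)]
mean = sum(samples) / len(samples)
var = sum((s - mean) ** 2 for s in samples) / len(samples)
print(mean, var)  # close to the theoretical mean 0 and variance 2*b*b
```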

(Slide: Adam Smith)

Laplace Mechanism
A(x) = f(x) + Lap(λ) with scale λ = GS_f / ε: answer the query f with
Laplace noise whose scale is proportional to the global sensitivity of f.

Laplace Mechanism: Proof Idea
For neighboring x, x′ and any output t, the ratio of output densities
Pr[A(x) = t] / Pr[A(x′) = t] is at most e^(|f(x) − f(x′)|/λ) ≤ e^ε.
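The proof idea, that the ratio of output densities on neighboring inputs is bounded by e^ε, can be checked numerically. A sketch: the query values 105 and 104 on two hypothetical neighbors and the grid of outputs are made up; the Laplace scale is GS_f/ε as in the mechanism.

```python
import math

def laplace_pdf(t, mu, b):
    """Density of the Laplace distribution with mean mu and scale b."""
    return math.exp(-abs(t - mu) / b) / (2.0 * b)

eps, gs = 0.5, 1.0        # privacy parameter and global sensitivity
b = gs / eps              # Laplace scale used by the mechanism
f1, f2 = 105.0, 104.0     # hypothetical f(x), f(x') on neighboring inputs
# Scan a grid of outputs t and record the worst-case density ratio.
worst = max(laplace_pdf(t, f1, b) / laplace_pdf(t, f2, b)
            for t in [f1 + 0.01 * k for k in range(-2000, 2001)])
print(worst, math.exp(eps))  # worst-case ratio meets e^eps exactly
```

The worst case is attained for outputs t on the far side of f1, where the triangle inequality |t − f2| − |t − f1| = |f1 − f2| is tight.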

Laplace Mechanism: More Details
For λ = GS_f / ε, we have ε-differential privacy.
(Slide: Adam Smith)
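A minimal sketch of the Laplace mechanism as described above. The salary data and the query are made up for illustration; the counting query has global sensitivity 1, as in the earlier exercise.

```python
import math
import random

def sample_laplace(b, rng):
    """Draw one Lap(0, b) sample by the inverse-CDF method."""
    u = rng.random() - 0.5
    return -b * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def laplace_mechanism(data, query, global_sensitivity, eps, rng):
    """Release query(data) + Lap(GS/eps); eps-differentially private."""
    return query(data) + sample_laplace(global_sensitivity / eps, rng)

# Hypothetical salary database; the counting query has GS = 1.
salaries = [28_000, 45_000, 31_000, 90_000, 22_000, 60_000]
count_over_30k = lambda d: sum(1 for s in d if s > 30_000)

rng = random.Random(42)
noisy = laplace_mechanism(salaries, count_over_30k, 1.0, 0.5, rng)
print(noisy)  # true count is 4; the noise has scale 1/0.5 = 2
```

Note the accuracy/privacy trade-off: a smaller ε means a larger noise scale λ = GS_f/ε, hence stronger privacy but a less accurate answer.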

Example: Noise Addition (figure)

Using Global Sensitivity
Many natural functions have low global sensitivity: histograms, covariance
matrices, strongly convex optimization problems.
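The histogram is a good illustration of low global sensitivity. Assuming neighboring databases differ by adding or removing one record, each record affects exactly one bin, so the histogram's L1 global sensitivity is 1 and it suffices to add Lap(1/ε) noise to every bin independently. A sketch with made-up salary data and bins:

```python
import math
import random

def sample_laplace(b, rng):
    """Draw one Lap(0, b) sample by the inverse-CDF method."""
    u = rng.random() - 0.5
    return -b * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def dp_histogram(data, bins, eps, rng):
    """Noisy histogram: one record falls in exactly one bin, so the L1
    global sensitivity is 1 and Lap(1/eps) noise per bin gives eps-DP
    (under add/remove-one-record neighbors)."""
    hist = {bin_: 0 for bin_ in bins}
    for x in data:
        for lo, hi in bins:
            if lo <= x < hi:
                hist[(lo, hi)] += 1
                break
    return {bin_: c + sample_laplace(1.0 / eps, rng)
            for bin_, c in hist.items()}

salaries = [28_000, 45_000, 31_000, 90_000, 22_000, 60_000]
bins = [(0, 30_000), (30_000, 60_000), (60_000, 120_000)]
noisy = dp_histogram(salaries, bins, eps=1.0, rng=random.Random(7))
print(noisy)  # true counts are 2 per bin, plus Lap(1) noise
```

The key point: the privacy cost does not grow with the number of bins, because a single record can only ever change one bin.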

Composition Theorem
If A1 is ε1-differentially private and A2 is ε2-differentially private, and
they use independent random coins, then <A1, A2> is (ε1 + ε2)-differentially
private.
Repeated querying degrades privacy; the degradation is quantifiable.
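The composition theorem suggests tracking a cumulative "privacy budget" across queries. A sketch of a hypothetical budget accountant (the class name, data, and queries are made up; budgets add up by the basic composition theorem above):

```python
import math
import random

def sample_laplace(b, rng):
    """Draw one Lap(0, b) sample by the inverse-CDF method."""
    u = rng.random() - 0.5
    return -b * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

class PrivacyAccountant:
    """Tracks cumulative epsilon under basic (sequential) composition."""
    def __init__(self, rng):
        self.spent = 0.0
        self.rng = rng
    def release(self, data, query, gs, eps):
        self.spent += eps  # basic composition: epsilons add up
        return query(data) + sample_laplace(gs / eps, self.rng)

data = [28_000, 45_000, 31_000, 90_000, 22_000, 60_000]
acct = PrivacyAccountant(random.Random(1))
a1 = acct.release(data, lambda d: sum(1 for s in d if s > 30_000), 1.0, 0.1)
a2 = acct.release(data, lambda d: sum(1 for s in d if s > 50_000), 1.0, 0.2)
print(acct.spent)  # total privacy cost: 0.1 + 0.2 = 0.3
```

This makes the slide's point concrete: every additional query spends budget, so repeated querying quantifiably degrades privacy.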

Applications
Netflix data set [McSherry, Mironov 2009; MSR]: the accuracy of
differentially private recommendations (with respect to one movie rating)
is comparable to the baseline set by Netflix.
Network trace data sets [McSherry, Mahajan 2010; MSR].

Challenge: High Sensitivity

Approach: add noise proportional to sensitivity to preserve ε-differential
privacy.
Improvements: smooth sensitivity [Nissim, Raskhodnikova, Smith 2007];
restricted sensitivity [Blocki, Blum, Datta, Sheffet 2013; CMU].

Challenge: Identifying an Individual's Information

Information about an individual may not reside only in their own record.
Example: in a social network, information about node A is also in a node B
influenced by A, for example because A may have caused a link between B
and C.

Differential Privacy: Summary
An approach to releasing privacy-preserving statistics, with a rigorous
privacy guarantee. There is significant activity in the theoretical CS
community, and several applications to real data sets (recommendation
systems, network trace data, ...).
Some challenges: high sensitivity, identifying an individual's information,
repeated querying.