Differential privacy

Differential privacy (DP) is a system for publicly sharing information about a dataset by describing the patterns of groups within the dataset while withholding information about individuals in the dataset. The idea behind differential privacy is that if the effect of making an arbitrary single substitution in the database is small enough, the query result cannot be used to infer much about any single individual, and therefore provides privacy. Another way to describe differential privacy is as a constraint on the algorithms used to publish aggregate information about a statistical database, which limits the disclosure of private information of records whose information is in the database. For example, differentially private algorithms are used by some government agencies to publish demographic information or other statistical aggregates while ensuring confidentiality of survey responses, and by companies to collect information about user behavior while controlling what is visible even to internal analysts. Roughly, an algorithm is differentially private if an observer seeing its output cannot tell whether a particular individual's information was used in the computation. Differential privacy is often discussed in the context of identifying individuals whose information may be in a database. Although it does not directly refer to identification and reidentification attacks, differentially private algorithms provably resist such attacks. Because differential privacy was developed by cryptographers, it is often associated with cryptography and draws much of its language from that field.


History

Official statistics organizations are charged with collecting information from individuals or establishments, and publishing aggregate data to serve the public interest. For example, the 1790 United States Census collected information about individuals living in the United States and published tabulations based on sex, age, race, and condition of servitude. Statistical organizations have long collected information under a promise of
confidentiality
that the information provided will be used for statistical purposes, but that the publications will not produce information that can be traced back to a specific individual or establishment. To accomplish this goal, statistical organizations have long suppressed information in their publications. For example, in a table presenting the sales of each business in a town grouped by business category, a cell that has information from only one company might be suppressed, in order to maintain the confidentiality of that company's specific sales. The adoption of electronic information processing systems by statistical agencies in the 1950s and 1960s dramatically increased the number of tables that a statistical organization could produce and, in so doing, significantly increased the potential for an improper disclosure of confidential information. For example, if a business that had its sales numbers suppressed also had those numbers appear in the total sales of a region, then it might be possible to determine the suppressed value by subtracting the other sales from that total. But there might also be other combinations of additions and subtractions that could cause the private information to be revealed. The number of combinations that need to be checked increases exponentially with the number of publications, and it is potentially unbounded if data users are able to make queries of the statistical database using an interactive query system. In 1977, Tore Dalenius formalized the mathematics of cell suppression. In 1979,
Dorothy Denning, Peter J. Denning and Mayer D. Schwartz formalized the concept of a tracker, an adversary that could learn the confidential contents of a statistical database by creating a series of targeted queries and remembering the results. This and later research showed that privacy properties in a database could only be preserved by considering each new query in light of (possibly all) previous queries. This line of work is sometimes called ''query privacy,'' with the final result being that tracking the impact of a query on the privacy of individuals in the database was NP-hard. In 2003,
Kobbi Nissim
and Irit Dinur demonstrated that it is impossible to publish arbitrary queries on a private statistical database without revealing some amount of private information, and that the entire information content of the database can be revealed by publishing the results of a surprisingly small number of random queries, far fewer than was implied by previous work. The general phenomenon is known as the Fundamental Law of Information Recovery, and its key insight, namely that in the most general case, privacy cannot be protected without injecting some amount of noise, led to the development of differential privacy. In 2006,
Cynthia Dwork, Frank McSherry, Kobbi Nissim
and Adam D. Smith published an article formalizing the amount of noise that needed to be added and proposing a generalized mechanism for doing so. Their work was a co-recipient of the 2016 TCC Test-of-Time Award and the 2017 Gödel Prize. Subsequent research has shown that there are many ways to produce very accurate statistics from a database while still ensuring high levels of privacy.


ε-differential privacy

The 2006 Dwork, McSherry, Nissim and Smith article introduced the concept of ε-differential privacy, a mathematical definition for the privacy loss associated with any data release drawn from a statistical database. (Here, the term ''statistical database'' means a set of data that are collected under the pledge of confidentiality for the purpose of producing statistics that, by their production, do not compromise the privacy of those individuals who provided the data.) The intuition for the 2006 definition of ε-differential privacy is that a person's privacy cannot be compromised by a statistical release if their data are not in the database. Therefore, with differential privacy, the goal is to give each individual roughly the same privacy that would result from having their data removed. That is, the statistical functions run on the database should not overly depend on the data of any one individual. Of course, how much any individual contributes to the result of a database query depends in part on how many people's data are involved in the query. If the database contains data from a single person, that person's data contributes 100%. If the database contains data from a hundred people, each person's data contributes just 1%. The key insight of differential privacy is that as the query is made on the data of fewer and fewer people, more noise needs to be added to the query result to produce the same amount of privacy. Hence the name of the 2006 paper, "Calibrating noise to sensitivity in private data analysis." The 2006 paper presents both a mathematical definition of differential privacy and a mechanism based on the addition of Laplace noise (i.e. noise coming from the Laplace distribution) that satisfies the definition.


Definition of ε-differential privacy

Let ε be a positive real number and \mathcal{A} be a randomized algorithm that takes a dataset as input (representing the actions of the trusted party holding the data). Let \textrm{im}\ \mathcal{A} denote the image of \mathcal{A}. The algorithm \mathcal{A} is said to provide \varepsilon-differential privacy if, for all datasets D_1 and D_2 that differ on a single element (i.e., the data of one person), and all subsets S of \textrm{im}\ \mathcal{A}:

\Pr[\mathcal{A}(D_1) \in S] \leq \exp\left(\varepsilon\right) \cdot \Pr[\mathcal{A}(D_2) \in S]

where the probability is taken over the randomness used by the algorithm. Differential privacy offers strong and robust guarantees that facilitate modular design and analysis of differentially private mechanisms due to its
composability
, robustness to post-processing, and graceful degradation in the presence of correlated data.
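
The inequality can be checked exactly for a mechanism with a small, finite output space. The sketch below in Python (purely illustrative, not taken from the cited papers) uses a toy single-bit mechanism that reports a person's bit truthfully with probability 3/4 and flipped with probability 1/4; this mechanism satisfies the definition with ε = ln 3.

import math

def report_bit_distribution(true_bit):
    # Output distribution of a toy mechanism: report the true bit with
    # probability 3/4 and the flipped bit with probability 1/4.
    return {true_bit: 0.75, 1 - true_bit: 0.25}

# Two neighboring "datasets": a single record whose bit differs.
dist_0 = report_bit_distribution(0)
dist_1 = report_bit_distribution(1)

epsilon = math.log(3)  # the claimed privacy parameter of this toy mechanism
for output in (0, 1):
    ratio = dist_0[output] / dist_1[output]
    assert ratio <= math.exp(epsilon) + 1e-12
    print(f"output={output}: probability ratio {ratio:.3f} <= exp(epsilon) = {math.exp(epsilon):.3f}")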


Composability

(Self-)composability refers to the fact that the joint distribution of the outputs of (possibly adaptively chosen) differentially private mechanisms satisfies differential privacy.

Sequential composition. If we query an ε-differentially private mechanism t times, and the randomization of the mechanism is independent for each query, then the result is \varepsilon t-differentially private. In the more general case, if there are n independent mechanisms \mathcal{M}_1,\dots,\mathcal{M}_n, whose privacy guarantees are \varepsilon_1,\dots,\varepsilon_n-differential privacy, respectively, then any function g of them, g(\mathcal{M}_1,\dots,\mathcal{M}_n), is \left(\sum_{i=1}^{n} \varepsilon_i\right)-differentially private.

Parallel composition. If the previous mechanisms are computed on ''disjoint'' subsets of the private database, then the function g is (\max_i \varepsilon_i)-differentially private instead.
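
A minimal sketch of how these two rules translate into privacy-loss budget accounting (the function names are our own, not a standard API):

def sequential_budget(epsilons):
    # Sequential composition over the same data: the individual losses add up.
    return sum(epsilons)

def parallel_budget(epsilons):
    # Parallel composition over disjoint subsets: the worst single loss applies.
    return max(epsilons)

# Three mechanisms with epsilon = 0.1, 0.2 and 0.3:
print(sequential_budget([0.1, 0.2, 0.3]))  # 0.6 when all run on the same records
print(parallel_budget([0.1, 0.2, 0.3]))    # 0.3 when run on disjoint subsets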


Robustness to post-processing

For any deterministic or randomized function F defined over the image of the mechanism \mathcal{A}, if \mathcal{A} satisfies ε-differential privacy, so does F(\mathcal{A}). Together, composability and robustness to post-processing permit modular construction and analysis of differentially private mechanisms and motivate the concept of the ''privacy loss budget''. If all elements that access sensitive data of a complex mechanism are separately differentially private, then so is their combination, followed by arbitrary post-processing.
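
As a small illustration (a sketch that assumes a sensitivity-1 count released through the Laplace mechanism described later), any post-processing of the released value, here rounding and clamping, touches only the output and therefore keeps the same ε:

import random

def noisy_count(true_count, epsilon):
    # epsilon-DP count: add Laplace noise of scale 1/epsilon (sensitivity 1).
    # The difference of two exponential samples is Laplace-distributed.
    noise = random.expovariate(epsilon) - random.expovariate(epsilon)
    return true_count + noise

def postprocess(released_value):
    # Arbitrary post-processing: round and clamp to a non-negative integer.
    # It never sees the raw data, so the epsilon guarantee is unchanged.
    return max(0, round(released_value))

print(postprocess(noisy_count(42, epsilon=0.5)))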


Group privacy

In general, ε-differential privacy is designed to protect the privacy between neighboring databases which differ only in one row. This means that no adversary with arbitrary auxiliary information can tell whether one particular participant submitted their information. However, this is also extendable: we may want to protect databases differing in c rows, which amounts to an adversary with arbitrary auxiliary information knowing whether c particular participants submitted their information. This can be achieved because if c items change, the probability dilation is bounded by \exp(\varepsilon c) instead of \exp(\varepsilon), i.e., for D_1 and D_2 differing on c items: \Pr[\mathcal{A}(D_1)\in S]\leq \exp(\varepsilon c)\cdot\Pr[\mathcal{A}(D_2)\in S]. Thus setting ε instead to \varepsilon/c achieves the desired result (protection of c items). In other words, instead of having each item ε-differentially private protected, now every group of c items is ε-differentially private protected (and each item is (\varepsilon/c)-differentially private protected).
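
A short sketch of the budget arithmetic implied by this bound (the helper names are illustrative):

import math

def group_dilation_bound(per_item_epsilon, group_size):
    # Changing c items dilates output probabilities by at most exp(epsilon * c).
    return math.exp(per_item_epsilon * group_size)

def per_item_budget(target_group_epsilon, group_size):
    # To protect every group of c items at level epsilon, run the mechanism at epsilon / c per item.
    return target_group_epsilon / group_size

print(group_dilation_bound(0.1, 5))  # exp(0.5) for groups of 5 under a 0.1-DP mechanism
print(per_item_budget(1.0, 5))       # 0.2 per item gives every group of 5 an epsilon of 1.0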


ε-differentially private mechanisms

Since differential privacy is a probabilistic concept, any differentially private mechanism is necessarily randomized. Some of these, like the Laplace mechanism described below, rely on adding controlled noise to the function that we want to compute. Others, like the exponential mechanism and posterior sampling, sample from a problem-dependent family of distributions instead.


Sensitivity

Let d be a positive integer, \mathcal{D} be a collection of datasets, and f \colon \mathcal{D} \rightarrow \mathbb{R}^d be a function. The ''sensitivity'' of a function, denoted \Delta f, is defined by \Delta f = \max_{D_1, D_2} \lVert f(D_1)-f(D_2) \rVert_1, where the maximum is over all pairs of datasets D_1 and D_2 in \mathcal{D} differing in at most one element and \lVert \cdot \rVert_1 denotes the \ell_1 norm. In the example of the medical database below, if we consider f to be the function Q_i, then the sensitivity of the function is one, since changing any one of the entries in the database causes the output of the function to change by either zero or one. There are techniques, described below, by which we can create a differentially private algorithm for functions with low sensitivity.
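
A minimal sketch (using tiny, hand-enumerated datasets; the helper names are hypothetical) of how the ℓ1 sensitivity of a counting query can be checked numerically:

def l1_sensitivity(query, neighboring_pairs):
    # Largest l1 change of `query` over the supplied neighboring dataset pairs.
    # For a true worst-case bound the pairs must cover the worst case;
    # here they are enumerated by hand for a toy example.
    return max(
        sum(abs(a - b) for a, b in zip(query(d1), query(d2)))
        for d1, d2 in neighboring_pairs
    )

def count_query(dataset):
    # Counting query: number of records with the flag set (a 1-dimensional output).
    return [sum(dataset)]

# Neighboring datasets differ in at most one record.
pairs = [([1, 0, 1], [1, 1, 1]), ([1, 0, 1], [0, 0, 1])]
print(l1_sensitivity(count_query, pairs))  # 1: changing one record moves the count by at most 1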


The Laplace mechanism

The Laplace mechanism adds Laplace noise, i.e. noise from the Laplace distribution, which can be expressed by the probability density function \text{noise}(y) \propto \exp(-|y|/\lambda), which has mean zero and standard deviation \sqrt{2}\,\lambda. Now in our case we define the output function of \mathcal{A} as a real-valued function (called the transcript output by \mathcal{A}) as \mathcal{T}_{\mathcal{A}}(x) = f(x) + Y, where Y \sim \text{Lap}(\lambda) and f is the original real-valued query/function we planned to execute on the database. Now clearly \mathcal{T}_{\mathcal{A}}(x) can be considered to be a continuous random variable, where

\frac{\mathrm{pdf}(\mathcal{T}_{\mathcal{A},D_1}(x) = t)}{\mathrm{pdf}(\mathcal{T}_{\mathcal{A},D_2}(x) = t)} = \frac{\text{noise}(t - f(D_1))}{\text{noise}(t - f(D_2))},

which is at most e^{|f(D_1) - f(D_2)|/\lambda} \leq e^{\Delta f/\lambda}. We can consider \Delta f/\lambda to be the privacy factor \varepsilon. Thus \mathcal{T} follows a differentially private mechanism (as can be seen from the definition above). If we try to use this concept in our diabetes example, then it follows from the above derived fact that in order to have \mathcal{A} as the \varepsilon-differentially private algorithm we need to have \lambda = 1/\varepsilon. Though we have used Laplace noise here, other forms of noise, such as Gaussian noise, can be employed, but they may require a slight relaxation of the definition of differential privacy. According to this definition, differential privacy is a condition on the release mechanism (i.e., the trusted party releasing information ''about'' the dataset) and not on the dataset itself. Intuitively, this means that for any two datasets that are similar, a given differentially private algorithm will behave approximately the same on both datasets. The definition gives a strong guarantee that the presence or absence of an individual will not affect the final output of the algorithm significantly. For example, assume we have a database of medical records D_1 where each record is a pair (Name, X), where X is a
Boolean
denoting whether a person has diabetes or not. For example:

Name      Has Diabetes (X)
Ross      1
Monica    1
Joey      0
Phoebe    0
Chandler  1

Now suppose a malicious user (often termed an ''adversary'') wants to find out whether Chandler has diabetes or not. Suppose he also knows in which row of the database Chandler resides. Now suppose the adversary is only allowed to use a particular form of query Q_i that returns the partial sum of the first i rows of column X in the database. In order to find Chandler's diabetes status the adversary executes Q_5(D_1) and Q_4(D_1), then computes their difference. In this example, Q_5(D_1) = 3 and Q_4(D_1) = 2, so their difference is 1. This indicates that the "Has Diabetes" field in Chandler's row must be 1. This example highlights how individual information can be compromised even without explicitly querying for the information of a specific individual. Continuing this example, if we construct D_2 by replacing (Chandler, 1) with (Chandler, 0), then this malicious adversary will be able to distinguish D_2 from D_1 by computing Q_5 - Q_4 for each dataset. If the adversary were required to receive the values Q_i via an \varepsilon-differentially private algorithm, for a sufficiently small \varepsilon, then he or she would be unable to distinguish between the two datasets.
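
The following sketch (the data and the value of ε are illustrative; it uses NumPy's Laplace sampler) applies the Laplace mechanism with λ = 1/ε to the two partial-sum queries from the example above, so the adversary's difference Q_5 - Q_4 becomes Chandler's bit plus noise of standard deviation 2/ε and no longer cleanly reveals a 0 or a 1:

import numpy as np

rng = np.random.default_rng()

def laplace_mechanism(true_value, sensitivity, epsilon):
    # Release the true value plus Laplace noise of scale lambda = sensitivity / epsilon.
    return true_value + rng.laplace(loc=0.0, scale=sensitivity / epsilon)

# Column X of the toy medical database (1 = has diabetes); Chandler is row 5.
X = [1, 1, 0, 0, 1]

def Q(i):
    # Partial sum of the first i rows of column X (sensitivity 1).
    return sum(X[:i])

epsilon = 0.5
noisy_q5 = laplace_mechanism(Q(5), sensitivity=1, epsilon=epsilon)
noisy_q4 = laplace_mechanism(Q(4), sensitivity=1, epsilon=epsilon)
print(noisy_q5 - noisy_q4)  # true difference 1, heavily blurred by the added noise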


Randomized response

A simple example, developed especially in the social sciences, is to ask a person to answer the question "Do you possess ''attribute A''?", according to the following procedure:

# Toss a coin.
# If heads, then toss the coin again (ignoring the outcome), and answer the question honestly.
# If tails, then toss the coin again and answer "Yes" if heads, "No" if tails.

(The seemingly redundant extra toss in the first case is needed in situations where just the ''act'' of tossing a coin may be observed by others, even if the actual result stays hidden.) The confidentiality then arises from the refutability of the individual responses. But, overall, these data with many responses are significant, since a positive response is given with probability one-quarter by people who do not have ''attribute A'' and with probability three-quarters by people who actually possess it. Thus, if ''p'' is the true proportion of people with ''A'', then we expect to obtain (1/4)(1-''p'') + (3/4)''p'' = (1/4) + ''p''/2 positive responses. Hence it is possible to estimate ''p''. In particular, if ''attribute A'' is synonymous with illegal behavior, then answering "Yes" is not incriminating, insofar as every person has some probability of a "Yes" response, whatever their true status. Although this example, inspired by randomized response, might be applicable to
microdata
(i.e., releasing datasets with each individual response), by definition differential privacy excludes microdata releases and is only applicable to queries (i.e., aggregating individual responses into one result), since releasing individual responses would violate the requirements, more specifically the plausible deniability that a subject participated or not.
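
A short simulation of the coin-flipping procedure (a sketch; the sample size and the true proportion are invented for illustration) shows how the aggregate proportion can be recovered even though every individual answer remains deniable:

import random

def randomized_response(has_attribute):
    # First toss: heads -> answer honestly (the extra second toss is ignored here).
    if random.random() < 0.5:
        return has_attribute
    # Tails -> the second toss decides the answer: "Yes" on heads, "No" on tails.
    return random.random() < 0.5

def estimate_proportion(answers):
    # Invert E[yes rate] = 1/4 + p/2 to recover an estimate of the true proportion p.
    yes_rate = sum(answers) / len(answers)
    return 2 * (yes_rate - 0.25)

# Simulate 100,000 respondents, 30% of whom truly have the attribute.
true_p = 0.3
answers = [randomized_response(random.random() < true_p) for _ in range(100_000)]
print(estimate_proportion(answers))  # close to 0.3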


Stable transformations

A transformation T is c-stable if the Hamming distance between T(A) and T(B) is at most c times the Hamming distance between A and B for any two databases A, B. It can be shown that if there is a mechanism M that is \varepsilon-differentially private, then the composite mechanism M \circ T is (\varepsilon \times c)-differentially private. This can be generalized to group privacy, as the group size can be thought of as the Hamming distance h between A and B (where A contains the group and B does not). In this case M \circ T is (\varepsilon \times c \times h)-differentially private.
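
A minimal sketch of the resulting budget arithmetic (the function name is illustrative):

def composed_epsilon(mechanism_epsilon, stability_c, group_size_h=1):
    # Privacy guarantee of M composed with a c-stable transformation T:
    # epsilon * c for single rows, and epsilon * c * h when protecting groups of h rows.
    return mechanism_epsilon * stability_c * group_size_h

print(composed_epsilon(0.5, 2))     # 1.0: a 2-stable transformation followed by a 0.5-DP mechanism
print(composed_epsilon(0.5, 2, 3))  # 3.0: the same pipeline protecting groups of 3 rows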


Other notions of differential privacy

Since differential privacy is considered to be too strong or too weak for some applications, many variants of it have been proposed. The most widespread relaxation is (ε, δ)-differential privacy, which weakens the definition by allowing an additional small δ density of probability on which the upper bound ε does not hold: a mechanism \mathcal{A} is (ε, δ)-differentially private if, for all neighboring datasets D_1, D_2 and all subsets S of its image, \Pr[\mathcal{A}(D_1) \in S] \leq \exp(\varepsilon)\cdot\Pr[\mathcal{A}(D_2) \in S] + \delta.


Adoption of differential privacy in real-world applications

To date there are over 12 real-world deployments of differential privacy, the most notable being:

* 2008: U.S. Census Bureau, for showing commuting patterns.
* 2014: Google's RAPPOR, for telemetry such as learning statistics about unwanted software hijacking users' settings.
* 2015: Google, for sharing historical traffic statistics.
* 2016: Apple announced its intention to use differential privacy in iOS 10 to improve its intelligent personal assistant technology.
* 2017: Microsoft, for telemetry in Windows.
* 2019: Privitar Lens, an API using differential privacy.
* 2020: LinkedIn, for advertiser queries.
* 2021: The US Census Bureau uses differential privacy to release redistricting data from the 2020 Census.


Public purpose considerations

There are several public purpose considerations regarding differential privacy that are particularly important for policymakers and policy-focused audiences interested in the social opportunities and risks of the technology:

* Data Utility & Accuracy. The main concern with differential privacy is the tradeoff between data utility and individual privacy. If the privacy loss parameter is set to favor utility, the privacy benefits are lowered (less “noise” is injected into the system); if the privacy loss parameter is set to favor heavy privacy, the accuracy and utility of the dataset are lowered (more “noise” is injected into the system). It is important for policymakers to consider the tradeoffs posed by differential privacy in order to help set appropriate best practices and standards around the use of this privacy-preserving practice, especially considering the diversity in organizational use cases. It is worth noting, though, that decreased accuracy and utility is a common issue among all statistical disclosure limitation methods and is not unique to differential privacy. What is unique, however, is how policymakers, researchers, and implementers can consider mitigating against the risks presented through this tradeoff.
* Data Privacy & Security. Differential privacy provides a quantified measure of privacy loss with an upper bound, and allows curators to choose the explicit tradeoff between privacy and accuracy. It is robust to privacy attacks that are still unknown. However, it encourages greater data sharing, which, if done poorly, increases privacy risk. Differential privacy implies that privacy is protected, but this depends very much on the privacy loss parameter chosen and may instead lead to a false sense of security. Finally, though it is designed to be robust against unforeseen future privacy attacks, an attack may yet be devised that we cannot predict.


See also

* Implementations of differentially private analyses – deployments of differential privacy
* Quasi-identifier
* Exponential mechanism (differential privacy) – a technique for designing differentially private algorithms
* k-anonymity
* Differentially private analysis of graphs
* Protected health information
* Local differential privacy


External links


Publications


Calibrating noise to sensitivity in private data analysis
Cynthia Dwork, Frank McSherry, Kobbi Nissim, and Adam Smith. 2006. In Proceedings of the Third Conference on Theory of Cryptography (TCC'06). Springer-Verlag, Berlin, Heidelberg, 265–284. https://doi.org/10.1007/11681878_14 (This is the original publication of differential privacy, and not the eponymous article by Dwork that was published the same year.)
Differential Privacy: A Survey of Results
by Cynthia Dwork, Microsoft Research, April 2008 (Presents what was discovered during the first two years of research on differential privacy.)
The Algorithmic Foundations of Differential Privacy
by Cynthia Dwork and Aaron Roth, 2014. (This is the open-source textbook published by Dwork and Roth.)

by Úlfar Erlingsson, Google Research Blog, October 2014 (Google's use of local differential privacy in the Chrome Browser, later abandoned.)
Differential Privacy: A Primer for a Non-Technical Audience
Alexandra Wood, Micah Altman, Aaron Bembenek, Mark Bun, Marco Gaboardi, et al., Vanderbilt Journal of Entertainment & Technology Law, Volume 21, Issue 1, Fall 2018. (A good introductory document, but definitely *not* for non-technical audiences!)
Technology Factsheet: Differential Privacy
by Raina Gandhi and Amritha Jayanti, Belfer Center for Science and International Affairs, Fall 2020
Differential Privacy and the 2020 US Census
MIT Case Studies in Social and Ethical Responsibilities of Computing, no. Winter 2022 (January). https://doi.org/10.21428/2c646de5.7ec6ab93.


Explanatory Videos


Privacy of Dynamic Data: Continual Observation and Pan Privacy
by Moni Naor, Institute for Advanced Study, November 2009
Tutorial on Differential Privacy
by Katrina Ligett, California Institute of Technology, December 2013
Privacy and the 2020 Census
Simson Garfinkel, Rice University Symposium on Data Privacy, January 28, 2019


Tutorials


A Practical Beginner's Guide To Differential Privacy
by Christine Task, Purdue University, April 2012


References


Further reading


Abowd, John. 2017. “How Will Statistical Agencies Operate When All Data Are Private?”. Journal of Privacy and Confidentiality 7 (3). (slides)

"Differential Privacy: A Primer for a Non-technical Audience"
Kobbi Nissim, Thomas Steinke, Alexandra Wood, Micah Altman, Aaron Bembenek, Mark Bun, Marco Gaboardi, David R. O’Brien, and Salil Vadhan, Harvard Privacy Tools Project, February 14, 2018
* Dinur, Irit and Kobbi Nissim. 2003. Revealing information while preserving privacy. In Proceedings of the Twenty-Second ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems (PODS '03). ACM, New York, NY, USA, 202–210.
* Dwork, Cynthia, Frank McSherry, Kobbi Nissim, and Adam Smith. 2006. Calibrating Noise to Sensitivity in Private Data Analysis. In Halevi, S. & Rabin, T. (Eds.), Theory of Cryptography: Third Theory of Cryptography Conference, TCC 2006, New York, NY, USA, March 4–7, 2006, Proceedings. Springer Berlin Heidelberg, 265–284.
* Dwork, Cynthia. 2006. Differential Privacy. 33rd International Colloquium on Automata, Languages and Programming, part II (ICALP 2006), Springer Verlag, 4052, 1–12.
* Dwork, Cynthia and Aaron Roth. 2014. The Algorithmic Foundations of Differential Privacy. Foundations and Trends in Theoretical Computer Science, Vol. 9, Nos. 3–4, 211–407.
* Machanavajjhala, Ashwin, Daniel Kifer, John M. Abowd, Johannes Gehrke, and Lars Vilhuber. 2008. Privacy: Theory Meets Practice on the Map. International Conference on Data Engineering (ICDE) 2008: 277–286.
* Dwork, Cynthia and Moni Naor. 2010. On the Difficulties of Disclosure Prevention in Statistical Databases or The Case for Differential Privacy. Journal of Privacy and Confidentiality: Vol. 2: Iss. 1, Article 8. Available at: http://repository.cmu.edu/jpc/vol2/iss1/8.
* Kifer, Daniel and Ashwin Machanavajjhala. 2011. No free lunch in data privacy. In Proceedings of the 2011 ACM SIGMOD International Conference on Management of Data (SIGMOD '11). ACM, New York, NY, USA, 193–204.
* Erlingsson, Úlfar, Vasyl Pihur and Aleksandra Korolova. 2014. RAPPOR: Randomized Aggregatable Privacy-Preserving Ordinal Response. In Proceedings of the 2014 ACM SIGSAC Conference on Computer and Communications Security (CCS '14). ACM, New York, NY, USA, 1054–1067.
* Abowd, John M. and Ian M. Schmutte. 2017. Revisiting the economics of privacy: Population statistics and confidentiality protection as public goods. Labor Dynamics Institute, Cornell University, at https://digitalcommons.ilr.cornell.edu/ldi/37/
* Abowd, John M. and Ian M. Schmutte. Forthcoming. An Economic Analysis of Privacy Protection and Statistical Accuracy as Social Choices. American Economic Review, arXiv:1808.06303.
* Apple, Inc. 2016. Apple previews iOS 10, the biggest iOS release ever. Press Release (June 13). https://www.apple.com/newsroom/2016/06/apple-previews-ios-10-biggest-ios-release-ever.html.
* Ding, Bolin, Janardhan Kulkarni, and Sergey Yekhanin. 2017. Collecting Telemetry Data Privately, NIPS 2017.
* http://www.win-vector.com/blog/2015/10/a-simpler-explanation-of-differential-privacy/
* Ryffel, Theo, Andrew Trask, et al. "A generic framework for privacy preserving deep learning"