''k''-anonymity is a property possessed by certain anonymized data. The concept of ''k''-anonymity was first introduced by Latanya Sweeney and Pierangela Samarati in a paper published in 1998 as an attempt to solve the problem: "Given person-specific field-structured data, produce a release of the data with scientific guarantees that the individuals who are the subjects of the data cannot be re-identified while the data remain practically useful." A release of data is said to have the ''k''-anonymity property if the information for each person contained in the release cannot be distinguished from at least <math>k - 1</math> individuals whose information also appears in the release. Unfortunately, the guarantees provided by ''k''-anonymity are aspirational, not mathematical.

Methods for ''k''-anonymization

To use ''k''-anonymity to process a dataset so that it can be released with privacy protection, a data scientist must first examine the dataset and decide whether each attribute (column) is an ''identifier'' (identifying), a ''non-identifier'' (not identifying), or a ''quasi-identifier'' (somewhat identifying). Identifiers such as names are suppressed, non-identifying values are allowed to remain, and the quasi-identifiers need to be processed so that every distinct combination of quasi-identifiers designates at least ''k'' records.

The example table below presents a fictional non-anonymized database consisting of the patient records for a fictitious hospital. The ''Name'' column is an identifier; ''Age'', ''Gender'', ''State of domicile'', and ''Religion'' are quasi-identifiers; and ''Disease'' is a non-identifying sensitive value. But what about ''Height'' and ''Weight''? Are they also non-identifying sensitive values, or are they quasi-identifiers?

; Patients treated in the study on April 30:

There are 6 attributes and 10 records in this data. There are two common methods for achieving ''k''-anonymity for some value of ''k''.

# Suppression: In this method, certain values of the attributes are replaced by an asterisk '*'. All or some values of a column may be replaced by '*'. In the anonymized table below, we have replaced all the values in the 'Name' attribute and all the values in the 'Religion' attribute with a '*'.
# Generalization: In this method, individual values of attributes are replaced with a broader category. For example, the value '19' of the attribute 'Age' may be replaced by 'Age ≤ 20', the value '23' by '20 < Age ≤ 30', etc.

The next table shows the anonymized database.

; Patients treated in the study on April 30:

This data has 2-anonymity with respect to the attributes 'Age', 'Gender' and 'State of domicile', since for any combination of these attributes found in any row of the table there are always at least 2 rows with those exact attributes.
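
The two methods described above can be sketched in a few lines of Python. This is only an illustrative sketch: the function names, the catch-all age bucket, and the sample records are assumptions made for the example and are not taken from the article's tables.

```python
def generalize_age(age):
    """Replace an exact age with a broader range (generalization)."""
    if age <= 20:
        return "Age <= 20"
    if age <= 30:
        return "20 < Age <= 30"
    # Catch-all bucket: an assumption, since the text only names two ranges.
    return "Age > 30"


def anonymize(records, suppressed, generalizers):
    """Return a copy of `records` with suppression and generalization applied."""
    released = []
    for record in records:
        row = dict(record)  # copy so the original records are left untouched
        for column in suppressed:
            row[column] = "*"  # suppression: hide the value entirely
        for column, generalize in generalizers.items():
            row[column] = generalize(row[column])
        released.append(row)
    return released


# Hypothetical records, not the article's table.
patients = [
    {"Name": "Patient A", "Age": 19, "Religion": "Religion X", "Disease": "Cancer"},
    {"Name": "Patient B", "Age": 23, "Religion": "Religion Y", "Disease": "Viral infection"},
    {"Name": "Patient C", "Age": 45, "Religion": "Religion X", "Disease": "Heart-related"},
]

released = anonymize(
    patients,
    suppressed=["Name", "Religion"],
    generalizers={"Age": generalize_age},
)
for row in released:
    print(row)
```

Applying these transformations does not by itself guarantee ''k''-anonymity; the resulting release still has to be checked against the chosen quasi-identifiers.
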
The attributes available to an adversary are called quasi-identifiers. Each quasi-identifier tuple occurs in at least ''k'' records for a dataset with ''k''-anonymity.
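
The condition that each quasi-identifier tuple occurs in at least ''k'' records can be checked directly by counting tuples. The following is a minimal sketch; the function names and the sample release are hypothetical and serve only to illustrate the check.

```python
from collections import Counter


def equivalence_class_sizes(records, quasi_identifiers):
    """Count how many records share each combination of quasi-identifier values."""
    return Counter(tuple(r[q] for q in quasi_identifiers) for r in records)


def anonymity_level(records, quasi_identifiers):
    """Largest k for which the release is k-anonymous (0 for an empty release)."""
    sizes = equivalence_class_sizes(records, quasi_identifiers)
    return min(sizes.values()) if sizes else 0


def is_k_anonymous(records, quasi_identifiers, k):
    """True if every quasi-identifier tuple occurs in at least k records."""
    return anonymity_level(records, quasi_identifiers) >= k


# Hypothetical released records, not the article's table.
release = [
    {"Age": "Age <= 20", "Gender": "Female", "State": "Kerala", "Disease": "Cancer"},
    {"Age": "Age <= 20", "Gender": "Female", "State": "Kerala", "Disease": "Viral infection"},
    {"Age": "20 < Age <= 30", "Gender": "Male", "State": "Karnataka", "Disease": "Heart-related"},
    {"Age": "20 < Age <= 30", "Gender": "Male", "State": "Karnataka", "Disease": "Heart-related"},
]
quasi_identifiers = ["Age", "Gender", "State"]

print(anonymity_level(release, quasi_identifiers))
print(is_k_anonymous(release, quasi_identifiers, 3))
```

Note that the result depends entirely on which columns are passed as quasi-identifiers; the check says nothing about columns that were classified as non-identifying.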

Critiques of ''k''-anonymity

This example demonstrates a failing of ''k''-anonymity: there may exist other data records that can be linked on the variables that are allegedly non-identifying. For example, if an attacker is able to obtain a log from the person who was taking vital signs as part of the study and learns that Kishor was at the hospital on April 30 and is 180 cm tall, this information can be used to link with the "anonymized" database (which may have been published on the Internet) and learn that Kishor has a heart-related disease. An attacker who knows that Kishor visited the hospital on April 30 may be able to infer this simply by knowing that Kishor is 180 cm tall, weighs roughly 80-82 kg, and comes from Karnataka.

The root of this problem is the core problem with ''k''-anonymity: there is no way to mathematically, unambiguously determine whether an attribute is an identifier, a quasi-identifier, or a non-identifying sensitive value. In fact, all values are potentially identifying, depending on their prevalence in the population and on auxiliary data that the attacker may have. Other privacy mechanisms such as differential privacy do not share this problem.

Meyerson and Williams (2004) demonstrated that optimal ''k''-anonymity is an NP-hard problem; however, heuristic methods such as ''k''-Optimize, as given by Bayardo and Agrawal (2005), often yield effective results. A practical approximation algorithm that enables solving the ''k''-anonymization problem with an approximation guarantee of <math>O(\log k)</math> was presented by Kenig and Tassa.
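
The linkage attack described at the start of this section can be illustrated with a short sketch. The column names, the function name, and every record other than Kishor's height and diagnosis are assumptions invented for the example; the point is only that a column left untouched because it was judged non-identifying can serve as a join key.

```python
def link_records(release, auxiliary, shared_columns):
    """Match auxiliary records to released rows that agree on the shared columns.

    A person is considered re-identified only when exactly one released row matches.
    """
    reidentified = {}
    for person in auxiliary:
        key = tuple(person[c] for c in shared_columns)
        candidates = [
            row for row in release
            if tuple(row[c] for c in shared_columns) == key
        ]
        if len(candidates) == 1:
            reidentified[person["Name"]] = candidates[0]
    return reidentified


# Hypothetical release: 2-anonymous on "Age", with "Height_cm" left unmodified
# because it was assumed to be non-identifying.
release = [
    {"Age": "20 < Age <= 30", "Height_cm": 180, "Disease": "Heart-related"},
    {"Age": "20 < Age <= 30", "Height_cm": 165, "Disease": "Viral infection"},
    {"Age": "Age <= 20", "Height_cm": 172, "Disease": "Cancer"},
    {"Age": "Age <= 20", "Height_cm": 158, "Disease": "Cancer"},
]

# Auxiliary data the attacker obtained, e.g. a vital-signs log.
vital_signs_log = [{"Name": "Kishor", "Height_cm": 180}]

reidentified = link_records(release, vital_signs_log, shared_columns=["Height_cm"])
print(reidentified["Kishor"]["Disease"])
```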

Possible attacks

While ''k''-anonymity is a promising approach to group-based anonymization given its simplicity and the wide array of algorithms that perform it, it is susceptible to many attacks. When background knowledge is available to an attacker, such attacks become even more effective. Such attacks include:
* ''Homogeneity Attack'': This attack leverages the case where all the values for a sensitive attribute within a set of ''k'' records are identical. In such cases, even though the data has been ''k''-anonymized, the sensitive value for the set of ''k'' records may be exactly predicted.
* ''Background Knowledge Attack'': This attack leverages an association between one or more quasi-identifier attributes and the sensitive attribute to reduce the set of possible values for the sensitive attribute. For example, Machanavajjhala, Kifer, Gehrke, and Venkitasubramaniam (2007) showed that knowing that heart attacks occur at a reduced rate in Japanese patients could be used to narrow the range of values for a sensitive attribute of a patient's disease.
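
A release can be screened for exposure to the homogeneity attack by looking for groups of records that share both their quasi-identifier values and a single sensitive value. The sketch below is illustrative only; the function name and sample records are hypothetical.

```python
from collections import defaultdict


def homogeneous_classes(records, quasi_identifiers, sensitive):
    """Return the quasi-identifier tuples whose records all share one sensitive value."""
    values_by_class = defaultdict(set)
    for record in records:
        key = tuple(record[q] for q in quasi_identifiers)
        values_by_class[key].add(record[sensitive])
    return {
        key: next(iter(values))
        for key, values in values_by_class.items()
        if len(values) == 1
    }


# Hypothetical 2-anonymous release.
release = [
    {"Age": "Age <= 20", "State": "Kerala", "Disease": "Cancer"},
    {"Age": "Age <= 20", "State": "Kerala", "Disease": "Viral infection"},
    {"Age": "20 < Age <= 30", "State": "Karnataka", "Disease": "Heart-related"},
    {"Age": "20 < Age <= 30", "State": "Karnataka", "Disease": "Heart-related"},
]

exposed = homogeneous_classes(release, ["Age", "State"], "Disease")
print(exposed)
```

Anyone known to fall in an exposed group has their sensitive value revealed, even though the release satisfies 2-anonymity.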

Caveats

Because ''k''-anonymization does not include any randomization, attackers can still make inferences about data sets that may harm individuals. For example, if the 19-year-old John from Kerala is known to be in the database above, then it can be reliably said that he has either cancer, a heart-related disease, or a viral infection.

''k''-anonymization is not a good method to anonymize high-dimensional datasets. For example, researchers showed that, given 4 locations, the unicity of mobile phone timestamp-location datasets (<math>\mathcal{E}_4</math>, ''k''-anonymity when <math>k=1</math>) can be as high as 95%.

It has also been shown that ''k''-anonymity can skew the results of a data set if it disproportionately suppresses and generalizes data points with unrepresentative characteristics. The suppression and generalization algorithms used to ''k''-anonymize datasets can be altered, however, so that they do not have such a skewing effect.
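
To make the notion of unicity more concrete, the sketch below estimates the fraction of traces that are uniquely pinned down by a few known points. It is a simplified illustration under stated assumptions, not the procedure used by the researchers: each trace is modelled as a set of hypothetical (time, location) pairs, and only one random draw of points is made per trace.

```python
import random


def unicity(traces, p, seed=0):
    """Estimate the fraction of traces identified uniquely by p of their own points.

    `traces` is a list of sets of (time, location) pairs. Traces with fewer
    than p points are skipped.
    """
    rng = random.Random(seed)  # seeded so repeated runs give the same estimate
    eligible = 0
    unique = 0
    for trace in traces:
        if len(trace) < p:
            continue
        eligible += 1
        # Points the attacker is assumed to know about this individual.
        known = set(rng.sample(sorted(trace), p))
        # Count how many traces in the dataset are consistent with those points.
        matching = sum(1 for other in traces if known <= other)
        if matching == 1:
            unique += 1
    return unique / eligible if eligible else 0.0


# Hypothetical traces: the first two are identical, the third is distinct.
traces = [
    {(8, "Tower A"), (12, "Tower B"), (18, "Tower C")},
    {(8, "Tower A"), (12, "Tower B"), (18, "Tower C")},
    {(9, "Tower D"), (13, "Tower E"), (19, "Tower F")},
]

print(unicity(traces, p=2))
```

A trace counted as unique here corresponds to an equivalence class of size 1, that is, a record with no protection beyond ''k'' = 1.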

References

{{reflist}}
