Unicity (computer science)

Unicity (\varepsilon_p) is a risk metric for measuring the re-identifiability of high-dimensional anonymous data. First introduced in 2013, unicity is measured by the number of points ''p'' needed to uniquely identify an individual in a data set. The fewer points needed, the more unique the traces are and the easier they are to re-identify using outside information. In a high-dimensional, human behavioural data set, such as mobile phone metadata, there may exist thousands of records for each person. In the case of mobile phone metadata, credit card transaction histories, and many other types of personal data, this information includes the time and location of an individual. In research, unicity is widely used to illustrate the re-identifiability of anonymous data sets. In 2013, researchers from the MIT Media Lab showed that only four points were needed to uniquely identify 95% of individual trajectories in a de-identified data set of 1.5 million mobility trajectories. These ''points'' were location-time pairs that appeared with a resolution of 1 hour and between 0.15 km² and 15 km². These results were shown to hold for credit card transaction data as well, with four points being enough to re-identify 90% of trajectories. Further research studied the unicity of the apps installed by people on their smartphones, the trajectories of vehicles, mobile phone data from Boston and Singapore, and public transport data in Singapore obtained from smartcards.
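
To make the idea concrete, the following is a minimal sketch in Python; the trajectories and point values are made up purely for illustration and are not taken from the studies above. It checks, for a toy set of three trajectories, which pairs of points (''p'' = 2) are enough to single out one trajectory:

```python
from itertools import combinations

# Toy trajectories: each is a set of (location, hour) points.
# These values are hypothetical, chosen only to illustrate unicity.
trajectories = [
    {("A", 8), ("B", 9), ("C", 18)},
    {("A", 8), ("B", 9), ("D", 18)},
    {("E", 8), ("B", 9), ("C", 18)},
]

target, others = trajectories[0], trajectories[1:]
for pair in combinations(sorted(target), 2):
    points = set(pair)
    # The pair identifies the target uniquely if no other
    # trajectory also contains both points.
    unique = not any(points <= other for other in others)
    print(points, "unique:", unique)
```

Running this shows that some pairs, such as ("A", 8) together with ("C", 18), appear in no other trajectory and therefore single the target out, while other pairs are shared and do not.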


Measuring unicity

Unicity (\varepsilon_p) is formally defined as the expected value of the fraction of uniquely identifiable trajectories, given ''p'' points selected from those trajectories uniformly at random. A full computation of \varepsilon_p for a data set D requires picking ''p'' points uniformly at random from each trajectory T_i \in D and then checking whether any other trajectory also contains those ''p'' points. Averaging over all possible sets of ''p'' points for each trajectory yields \varepsilon_p. This is usually prohibitively expensive, as it requires considering every possible set of ''p'' points for each trajectory in the data set, and trajectories sometimes contain thousands of points. Instead, unicity is usually estimated using sampling techniques. Specifically, given a data set D, the estimated unicity is computed by sampling a fraction of the trajectories S from D and then checking whether each trajectory T_j \in S is unique in D given ''p'' points selected at random from T_j. The fraction of S that is uniquely identifiable is then the unicity estimate.
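
A minimal sketch of such a sampling estimator is shown below in Python, assuming trajectories are represented as sets of hashable points; the function name and data representation are illustrative choices, not taken from the original work:

```python
import random

def estimate_unicity(dataset, p, sample_size, seed=0):
    """Monte Carlo estimate of unicity (epsilon_p).

    dataset:     list of trajectories, each a set of hashable
                 points, e.g. (location, hour) pairs.
    p:           number of points drawn uniformly at random from
                 each sampled trajectory.
    sample_size: number of trajectories sampled from the data set.
    """
    rng = random.Random(seed)
    unique = evaluated = 0
    for i in rng.sample(range(len(dataset)), sample_size):
        trajectory = dataset[i]
        if len(trajectory) < p:
            continue  # too short to draw p distinct points
        evaluated += 1
        points = set(rng.sample(list(trajectory), p))
        # The trajectory is unique if no *other* trajectory in the
        # data set also contains all p selected points.
        if not any(points <= t
                   for j, t in enumerate(dataset) if j != i):
            unique += 1
    return unique / evaluated if evaluated else 0.0
```

On the toy data from the earlier sketch, estimate_unicity(trajectories, p=2, sample_size=3) returns the fraction of the three trajectories that a random pair of their points singles out. Note that each sampled trajectory is checked against every other trajectory in D, so for large data sets an inverted index from points to trajectories could be used to speed up the containment test.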


See also

* Quasi-identifier
* Personally identifiable information

