Datafly algorithm is an
algorithm
In mathematics and computer science, an algorithm () is a finite sequence of Rigour#Mathematics, mathematically rigorous instructions, typically used to solve a class of specific Computational problem, problems or to perform a computation. Algo ...
for providing anonymity in medical data. The algorithm was developed by
Latanya Arvette Sweeney in 1997−98. Anonymization is achieved by automatically generalizing, substituting, inserting, and removing information as appropriate without losing many of the details found within the data. The method can be used on-the-fly in
role-based security within an institution, and in
batch mode for
exporting data from an institution.
Organizations release and receive medical data with all explicit
identifier
An identifier is a name that identifies (that is, labels the identity of) either a unique object or a unique ''class'' of objects, where the "object" or class may be an idea, person, physical countable object (or class thereof), or physical mass ...
s—such as name—removed, in the erroneous belief that
patient confidentiality is maintained because the resulting data look anonymous. However the remaining data can be used to re-identify individuals by linking or matching the data to other databases or by looking at unique characteristics found in the
fields and
records of the
database
In computing, a database is an organized collection of data or a type of data store based on the use of a database management system (DBMS), the software that interacts with end users, applications, and the database itself to capture and a ...
itself.
The Datafly algorithm has been criticized for trying to achieve anonymization by overgeneralization. The algorithm selects the
attribute with the greatest number of distinct
values as the one to generalize first.
Core algorithm
An outline of the Datafly algorithm is presented below.
Input:
Private
Table
Table may refer to:
* Table (database), how the table data arrangement is used within the databases
* Table (furniture), a piece of furniture with a flat surface and one or more legs
* Table (information), a data arrangement with rows and column ...
PT;
quasi-identifier QI = ( ''A''
1, ..., ''A''
n ), ''k''-anonymity constraint ''k''; domain generalization hierarchies DGH
Ai, where ''i'' = 1,...,''n'' with accompanying
functions ''f''
Ai, and loss, which is a limit on the percentage of
tuple
In mathematics, a tuple is a finite sequence or ''ordered list'' of numbers or, more generally, mathematical objects, which are called the ''elements'' of the tuple. An -tuple is a tuple of elements, where is a non-negative integer. There is o ...
s that can be suppressed. PT
dis the set
of unique identifiers or keys for each tuple.
Output
Output may refer to:
* The information produced by a computer, see Input/output
* An output state of a system, see state (computer science)
* Output (economics), the amount of goods and services produced
** Gross output in economics, the valu ...
:
MGT a generalization of PT
Ithat enforces ''k''-anonymity
Assumes: , PT , ≤ ''k'', and loss * , PT , = ''k''
algorithm Datafly:
// Construct a frequency
list
A list is a Set (mathematics), set of discrete items of information collected and set forth in some format for utility, entertainment, or other purposes. A list may be memorialized in any number of ways, including existing only in the mind of t ...
containing unique
sequence
In mathematics, a sequence is an enumerated collection of objects in which repetitions are allowed and order matters. Like a set, it contains members (also called ''elements'', or ''terms''). The number of elements (possibly infinite) is cal ...
s of values across the quasi-identifier in PT,
// along with the number of occurrences of each sequence.
:1. let freq be an expandable and collapsible
vector
Vector most often refers to:
* Euclidean vector, a quantity with a magnitude and a direction
* Disease vector, an agent that carries and transmits an infectious pathogen into another living organism
Vector may also refer to:
Mathematics a ...
with no elements initially. Each element is of the form ( QI, frequency, SID ), where SID = ; and, frequency = , SID, . Therefore, freq is also accessible as a table over (QI, frequency, SID).
:2. let pos
0, total
0
:3. while total ≠ , PT, do
::3.1 freq
os ( ''t''
I occurs, SID ) where ''t''
I∈
I ( ''t''
QI __, ___ )
freq; occurs = , PT, - , PT
I– , ; and, SID =
::3.2 pos
pos + 1, total
total + occurs
:// Make a solution by generalizing the attribute with the most number of distinct values
:// and suppressing no more than the allowed number of tuples.
:4. let belowk
0
:5. for pos
1 to , freq, do
::5.1 ( __, count )
freq
os
::5.2 if count < ''k'' then do
:::5.2.1 belowk
belowk + count
:6. if belowk > ''k'' then do: // Note. loss * , PT, = ''k''.
::6.1 freq
generalize(freq)
::6.2 go to step 4
:7. else do
:// assert: the number of tuples to suppress in freq is ≤ loss * , PT,
::7.1 freq
suppress(freq, belowk )
::7.2 MGT
reconstruct(freq)
:8. return MGT.
References
{{reflist
External links
Details of the Datafly algorithm
Privacy
Anonymity
Medical records
Data protection
Datasets
Obfuscation
Articles with example pseudocode