Datafly Algorithm
   HOME

TheInfoList



OR:

Datafly algorithm is an
algorithm In mathematics and computer science, an algorithm () is a finite sequence of rigorous instructions, typically used to solve a class of specific Computational problem, problems or to perform a computation. Algorithms are used as specificat ...
for providing anonymity in medical data. The algorithm was developed by
Latanya Arvette Sweeney Latanya Arvette Sweeney is an American computer scientist. She is the Daniel Paul Professor of the Practice of Government and Technology at the Harvard Kennedy School and in the Harvard Faculty of Arts and Sciences at Harvard University. She is t ...
in 1997−98. Anonymization is achieved by automatically generalizing, substituting, inserting, and removing information as appropriate without losing many of the details found within the data. The method can be used on-the-fly in role-based security within an institution, and in batch mode for exporting data from an institution. Organizations release and receive medical data with all explicit
identifier An identifier is a name that identifies (that is, labels the identity of) either a unique object or a unique ''class'' of objects, where the "object" or class may be an idea, physical countable object (or class thereof), or physical noncountable ...
s—such as name—removed, in the erroneous belief that
patient confidentiality A patient is any recipient of health care services that are performed by healthcare professionals. The patient is most often ill or injured and in need of treatment by a physician, nurse, optometrist, dentist, veterinarian, or other health ...
is maintained because the resulting data look anonymous. However the remaining data can be used to re-identify individuals by linking or matching the data to other databases or by looking at unique characteristics found in the
field Field may refer to: Expanses of open ground * Field (agriculture), an area of land used for agricultural purposes * Airfield, an aerodrome that lacks the infrastructure of an airport * Battlefield * Lawn, an area of mowed grass * Meadow, a grass ...
s and records of the
database In computing, a database is an organized collection of data stored and accessed electronically. Small databases can be stored on a file system, while large databases are hosted on computer clusters or cloud storage. The design of databases sp ...
itself. The Datafly algorithm has been criticized for trying to achieve anonymization by overgeneralization. The algorithm selects the attribute with the greatest number of distinct
value Value or values may refer to: Ethics and social * Value (ethics) wherein said concept may be construed as treating actions themselves as abstract objects, associating value to them ** Values (Western philosophy) expands the notion of value beyo ...
s as the one to generalize first.


Core algorithm

An outline of the Datafly algorithm is presented below. Input: Private
Table Table may refer to: * Table (furniture), a piece of furniture with a flat surface and one or more legs * Table (landform), a flat area of land * Table (information), a data arrangement with rows and columns * Table (database), how the table data ...
PT;
quasi-identifier Quasi-identifiers are pieces of information that are not of themselves unique identifiers, but are sufficiently well correlated with an entity that they can be combined with other quasi-identifiers to create a unique identifier. Quasi-identifiers c ...
QI = ( ''A''1, ..., ''A''n ), ''k''-anonymity constraint ''k''; domain generalization hierarchies DGHAi, where ''i'' = 1,...,''n'' with accompanying functions ''f''Ai, and loss, which is a limit on the percentage of
tuple In mathematics, a tuple is a finite ordered list (sequence) of elements. An -tuple is a sequence (or ordered list) of elements, where is a non-negative integer. There is only one 0-tuple, referred to as ''the empty tuple''. An -tuple is defi ...
s that can be suppressed. PT dis the set of unique identifiers or keys for each tuple.
Output Output may refer to: * The information produced by a computer, see Input/output * An output state of a system, see state (computer science) * Output (economics), the amount of goods and services produced ** Gross output in economics, the value of ...
: MGT a generalization of PT Ithat enforces ''k''-anonymity Assumes: , PT , ≤ ''k'', and loss * , PT , = ''k'' algorithm Datafly: // Construct a frequency list containing unique
sequence In mathematics, a sequence is an enumerated collection of objects in which repetitions are allowed and order matters. Like a set, it contains members (also called ''elements'', or ''terms''). The number of elements (possibly infinite) is calle ...
s of values across the quasi-identifier in PT, // along with the number of occurrences of each sequence. :1. let freq be an expandable and collapsible vector with no elements initially. Each element is of the form ( QI, frequency, SID ), where SID = ; and, frequency = , SID, . Therefore, freq is also accessible as a table over (QI, frequency, SID). :2. let pos \gets 0, total \gets 0 :3. while total ≠ , PT, do ::3.1 freq os \gets ( ''t'' I occurs, SID ) where ''t'' II ( ''t'' QI __, ___ ) \not\in freq; occurs = , PT, - , PT I– , ; and, SID = ::3.2 pos \gets pos + 1, total \gets total + occurs :// Make a solution by generalizing the attribute with the most number of distinct values :// and suppressing no more than the allowed number of tuples. :4. let belowk \gets 0 :5. for pos \gets 1 to , freq, do ::5.1 ( __, count ) \gets freq os ::5.2 if count < ''k'' then do :::5.2.1 belowk \gets belowk + count :6. if belowk > ''k'' then do: // Note. loss * , PT, = ''k''. ::6.1 freq \gets generalize(freq) ::6.2 go to step 4 :7. else do :// assert: the number of tuples to suppress in freq is ≤ loss * , PT, ::7.1 freq \gets suppress(freq, belowk ) ::7.2 MGT \gets reconstruct(freq) :8. return MGT.


References

{{reflist


External links


Details of the Datafly algorithm
Privacy Anonymity Medical records Data protection Datasets Obfuscation Articles with example pseudocode