HOME

TheInfoList



OR:

Datafly algorithm is an
algorithm In mathematics and computer science, an algorithm () is a finite sequence of Rigour#Mathematics, mathematically rigorous instructions, typically used to solve a class of specific Computational problem, problems or to perform a computation. Algo ...
for providing anonymity in medical data. The algorithm was developed by Latanya Arvette Sweeney in 1997−98. Anonymization is achieved by automatically generalizing, substituting, inserting, and removing information as appropriate without losing many of the details found within the data. The method can be used on-the-fly in role-based security within an institution, and in batch mode for exporting data from an institution. Organizations release and receive medical data with all explicit
identifier An identifier is a name that identifies (that is, labels the identity of) either a unique object or a unique ''class'' of objects, where the "object" or class may be an idea, person, physical countable object (or class thereof), or physical mass ...
s—such as name—removed, in the erroneous belief that patient confidentiality is maintained because the resulting data look anonymous. However the remaining data can be used to re-identify individuals by linking or matching the data to other databases or by looking at unique characteristics found in the fields and records of the
database In computing, a database is an organized collection of data or a type of data store based on the use of a database management system (DBMS), the software that interacts with end users, applications, and the database itself to capture and a ...
itself. The Datafly algorithm has been criticized for trying to achieve anonymization by overgeneralization. The algorithm selects the attribute with the greatest number of distinct values as the one to generalize first.


Core algorithm

An outline of the Datafly algorithm is presented below. Input: Private
Table Table may refer to: * Table (database), how the table data arrangement is used within the databases * Table (furniture), a piece of furniture with a flat surface and one or more legs * Table (information), a data arrangement with rows and column ...
PT; quasi-identifier QI = ( ''A''1, ..., ''A''n ), ''k''-anonymity constraint ''k''; domain generalization hierarchies DGHAi, where ''i'' = 1,...,''n'' with accompanying functions ''f''Ai, and loss, which is a limit on the percentage of
tuple In mathematics, a tuple is a finite sequence or ''ordered list'' of numbers or, more generally, mathematical objects, which are called the ''elements'' of the tuple. An -tuple is a tuple of elements, where is a non-negative integer. There is o ...
s that can be suppressed. PT dis the set of unique identifiers or keys for each tuple.
Output Output may refer to: * The information produced by a computer, see Input/output * An output state of a system, see state (computer science) * Output (economics), the amount of goods and services produced ** Gross output in economics, the valu ...
: MGT a generalization of PT Ithat enforces ''k''-anonymity Assumes: , PT , ≤ ''k'', and loss * , PT , = ''k'' algorithm Datafly: // Construct a frequency
list A list is a Set (mathematics), set of discrete items of information collected and set forth in some format for utility, entertainment, or other purposes. A list may be memorialized in any number of ways, including existing only in the mind of t ...
containing unique
sequence In mathematics, a sequence is an enumerated collection of objects in which repetitions are allowed and order matters. Like a set, it contains members (also called ''elements'', or ''terms''). The number of elements (possibly infinite) is cal ...
s of values across the quasi-identifier in PT, // along with the number of occurrences of each sequence. :1. let freq be an expandable and collapsible
vector Vector most often refers to: * Euclidean vector, a quantity with a magnitude and a direction * Disease vector, an agent that carries and transmits an infectious pathogen into another living organism Vector may also refer to: Mathematics a ...
with no elements initially. Each element is of the form ( QI, frequency, SID ), where SID = ; and, frequency = , SID, . Therefore, freq is also accessible as a table over (QI, frequency, SID). :2. let pos \gets 0, total \gets 0 :3. while total ≠ , PT, do ::3.1 freq os \gets ( ''t'' I occurs, SID ) where ''t'' II ( ''t'' QI __, ___ ) \not\in freq; occurs = , PT, - , PT I– , ; and, SID = ::3.2 pos \gets pos + 1, total \gets total + occurs :// Make a solution by generalizing the attribute with the most number of distinct values :// and suppressing no more than the allowed number of tuples. :4. let belowk \gets 0 :5. for pos \gets 1 to , freq, do ::5.1 ( __, count ) \gets freq os ::5.2 if count < ''k'' then do :::5.2.1 belowk \gets belowk + count :6. if belowk > ''k'' then do: // Note. loss * , PT, = ''k''. ::6.1 freq \gets generalize(freq) ::6.2 go to step 4 :7. else do :// assert: the number of tuples to suppress in freq is ≤ loss * , PT, ::7.1 freq \gets suppress(freq, belowk ) ::7.2 MGT \gets reconstruct(freq) :8. return MGT.


References

{{reflist


External links


Details of the Datafly algorithm
Privacy Anonymity Medical records Data protection Datasets Obfuscation Articles with example pseudocode