HOME

TheInfoList



OR:

Data integrity is the maintenance of, and the assurance of, data accuracy and consistency over its entire life-cycle and is a critical aspect to the design, implementation, and usage of any system that stores, processes, or retrieves data. The term is broad in scope and may have widely different meanings depending on the specific context even under the same general umbrella of computing. It is at times used as a proxy term for
data quality Data quality refers to the state of qualitative or quantitative pieces of information. There are many definitions of data quality, but data is generally considered high quality if it is "fit for tsintended uses in operations, decision making an ...
, while
data validation In computer science, data validation is the process of ensuring data has undergone data cleansing to ensure they have data quality, that is, that they are both correct and useful. It uses routines, often called "validation rules", "validation cons ...
is a prerequisite for data integrity. Data integrity is the opposite of
data corruption In the pursuit of knowledge, data (; ) is a collection of discrete values that convey information, describing quantity, quality, fact, statistics, other basic units of meaning, or simply sequences of symbols that may be further interpreted. ...
. The overall intent of any data integrity technique is the same: ensure data is recorded exactly as intended (such as a database correctly rejecting mutually exclusive possibilities). Moreover, upon later
retrieval Retrieval could refer to: Computer science * RETRIEVE, Tymshare database that inspired dBASE and others * Data retrieval * Document retrieval * Image retrieval * Information retrieval * Knowledge retrieval * Medical retrieval * Music information ...
, ensure the data is the same as when it was originally recorded. In short, data integrity aims to prevent unintentional changes to information. Data integrity is not to be confused with
data security Data security means protecting digital data, such as those in a database, from destructive forces and from the unwanted actions of unauthorized users, such as a cyberattack or a data breach. Technologies Disk encryption Disk encryption ref ...
, the discipline of protecting data from unauthorized parties. Any unintended changes to data as the result of a storage, retrieval or processing operation, including malicious intent, unexpected hardware failure, and
human error Human error refers to something having been done that was " not intended by the actor; not desired by a set of rules or an external observer; or that led the task or system outside its acceptable limits".Senders, J.W. and Moray, N.P. (1991) Human ...
, is failure of data integrity. If the changes are the result of unauthorized access, it may also be a failure of data security. Depending on the data involved this could manifest itself as benign as a single pixel in an image appearing a different color than was originally recorded, to the loss of vacation pictures or a business-critical database, to even catastrophic loss of human life in a
life-critical system A safety-critical system (SCS) or life-critical system is a system whose failure or malfunction may result in one (or more) of the following outcomes: * death or serious injury to people * loss or severe damage to equipment/property * environme ...
.


Integrity types


Physical integrity

Physical integrity deals with challenges which are associated with correctly storing and fetching the data itself. Challenges with physical integrity may include
electromechanical In engineering, electromechanics combines processes and procedures drawn from electrical engineering and mechanical engineering. Electromechanics focuses on the interaction of electrical and mechanical systems as a whole and how the two systems ...
faults, design flaws, material
fatigue Fatigue describes a state of tiredness that does not resolve with rest or sleep. In general usage, fatigue is synonymous with extreme tiredness or exhaustion that normally follows prolonged physical or mental activity. When it does not resolve ...
,
corrosion Corrosion is a natural process that converts a refined metal into a more chemically stable oxide. It is the gradual deterioration of materials (usually a metal) by chemical or electrochemical reaction with their environment. Corrosion engine ...
,
power outages A power outage (also called a powercut, a power out, a power failure, a power blackout, a power loss, or a blackout) is the loss of the electrical power network supply to an end user. There are many causes of power failures in an electricity ...
, natural disasters, and other special environmental hazards such as ionizing radiation, extreme temperatures, pressures and
g-force The gravitational force equivalent, or, more commonly, g-force, is a measurement of the type of force per unit mass – typically acceleration – that causes a perception of weight, with a g-force of 1 g (not gram in mass measure ...
s. Ensuring physical integrity includes methods such as redundant hardware, an
uninterruptible power supply An uninterruptible power supply or uninterruptible power source (UPS) is an electrical apparatus that provides emergency power to a load when the input power source or mains power fails. A UPS differs from an auxiliary or emergency power system ...
, certain types of
RAID Raid, RAID or Raids may refer to: Attack * Raid (military), a sudden attack behind the enemy's lines without the intention of holding ground * Corporate raid, a type of hostile takeover in business * Panty raid, a prankish raid by male college ...
arrays,
radiation hardened Radiation hardening is the process of making electronic components and circuits resistant to damage or malfunction caused by high levels of ionizing radiation (particle radiation and high-energy electromagnetic radiation), especially for environ ...
chips, error-correcting memory, use of a
clustered file system A clustered file system is a file system which is shared by being simultaneously mounted on multiple servers. There are several approaches to clustering, most of which do not employ a clustered file system (only direct attached storage for ...
, using file systems that employ block level
checksum A checksum is a small-sized block of data derived from another block of digital data for the purpose of detecting errors that may have been introduced during its transmission or storage. By themselves, checksums are often used to verify data ...
s such as
ZFS ZFS (previously: Zettabyte File System) is a file system with volume management capabilities. It began as part of the Sun Microsystems Solaris operating system in 2001. Large parts of Solaris – including ZFS – were published under an open ...
, storage arrays that compute parity calculations such as
exclusive or Exclusive or or exclusive disjunction is a logical operation that is true if and only if its arguments differ (one is true, the other is false). It is symbolized by the prefix operator J and by the infix operators XOR ( or ), EOR, EXOR, , , ...
or use a
cryptographic hash function A cryptographic hash function (CHF) is a hash algorithm (a map of an arbitrary binary string to a binary string with fixed size of n bits) that has special properties desirable for cryptography: * the probability of a particular n-bit output re ...
and even having a
watchdog timer A watchdog timer (sometimes called a ''computer operating properly'' or ''COP'' timer, or simply a ''watchdog'') is an electronic or software timer that is used to detect and recover from computer malfunctions. Watchdog timers are widely used in ...
on critical subsystems. Physical integrity often makes extensive use of error detecting algorithms known as
error-correcting codes In computing, telecommunication, information theory, and coding theory, an error correction code, sometimes error correcting code, (ECC) is used for controlling errors in data over unreliable or noisy communication channels. The central idea is ...
. Human-induced data integrity errors are often detected through the use of simpler checks and algorithms, such as the
Damm algorithm In error detection, the Damm algorithm is a check digit algorithm that detects all single-digit errors and all adjacent transposition errors. It was presented by H. Michael Damm in 2004. Strengths and weaknesses Strengths The Damm algorithm is ...
or
Luhn algorithm The Luhn algorithm or Luhn formula, also known as the " modulus 10" or "mod 10" algorithm, named after its creator, IBM scientist Hans Peter Luhn, is a simple checksum formula used to validate a variety of identification numbers, such as credit c ...
. These are used to maintain data integrity after manual transcription from one computer system to another by a human intermediary (e.g. credit card or bank routing numbers). Computer-induced transcription errors can be detected through
hash functions A hash function is any function that can be used to map data of arbitrary size to fixed-size values. The values returned by a hash function are called ''hash values'', ''hash codes'', ''digests'', or simply ''hashes''. The values are usually ...
. In production systems, these techniques are used together to ensure various degrees of data integrity. For example, a computer
file system In computing, file system or filesystem (often abbreviated to fs) is a method and data structure that the operating system uses to control how data is stored and retrieved. Without a file system, data placed in a storage medium would be one lar ...
may be configured on a fault-tolerant RAID array, but might not provide block-level checksums to detect and prevent
silent data corruption Silent may mean any of the following: People with the name * Silent George, George Stone (outfielder) (1876–1945), American Major League Baseball outfielder and batting champion * Brandon Silent (born 1973), South African former footballer * ...
. As another example, a database management system might be compliant with the
ACID In computer science, ACID ( atomicity, consistency, isolation, durability) is a set of properties of database transactions intended to guarantee data validity despite errors, power failures, and other mishaps. In the context of databases, a sequ ...
properties, but the RAID controller or hard disk drive's internal write cache might not be.


Logical integrity

This type of integrity is concerned with the correctness or
rationality Rationality is the quality of being guided by or based on reasons. In this regard, a person acts rationally if they have a good reason for what they do or a belief is rational if it is based on strong evidence. This quality can apply to an abil ...
of a piece of data, given a particular context. This includes topics such as
referential integrity Referential integrity is a property of data stating that all its references are valid. In the context of relational databases, it requires that if a value of one attribute (column) of a relation (table) references a value of another attribute ( ...
and
entity integrity Entity integrity is concerned with ensuring that each row of a table has a unique and non-null primary key value; this is the same as saying that each row in a table represents a single instance of the entity type modelled by the table. A requiremen ...
in a relational database or correctly ignoring impossible sensor data in robotic systems. These concerns involve ensuring that the data "makes sense" given its environment. Challenges include
software bugs A software bug is an error, flaw or fault in the design, development, or operation of computer software that causes it to produce an incorrect or unexpected result, or to behave in unintended ways. The process of finding and correcting bugs ...
, design flaws, and human errors. Common methods of ensuring logical integrity include things such as check constraints, foreign key constraints, program assertions, and other run-time sanity checks. Both physical and logical integrity often share many common challenges such as human errors and design flaws, and both must appropriately deal with concurrent requests to record and retrieve data, the latter of which is entirely a subject on its own. If a data sector only has a logical error, it can be reused by overwriting it with new data. In case of a physical error, the affected data sector is permanently unusable.


Databases

Data integrity contains guidelines for
data retention Data retention defines the policies of persistent data and records management for meeting legal and business data archival requirements. Although sometimes interchangeable, it is not to be confused with the Data Protection Act 1998. The different ...
, specifying or guaranteeing the length of time data can be retained in a particular database. To achieve data integrity, these rules are consistently and routinely applied to all data entering the system, and any relaxation of enforcement could cause errors in the data. Implementing checks on the data as close as possible to the source of input (such as human data entry), causes less erroneous data to enter the system. Strict enforcement of data integrity rules results in lower error rates, and time saved troubleshooting and tracing erroneous data and the errors it causes to algorithms. Data integrity also includes rules defining the relations a piece of data can have to other pieces of data, such as a ''Customer'' record being allowed to link to purchased ''Products'', but not to unrelated data such as ''Corporate Assets''. Data integrity often includes checks and correction for invalid data, based on a fixed
schema The word schema comes from the Greek word ('), which means ''shape'', or more generally, ''plan''. The plural is ('). In English, both ''schemas'' and ''schemata'' are used as plural forms. Schema may refer to: Science and technology * SCHEMA ...
or a predefined set of rules. An example being textual data entered where a date-time value is required. Rules for data derivation are also applicable, specifying how a data value is derived based on algorithm, contributors and conditions. It also specifies the conditions on how the data value could be re-derived.


Types of integrity constraints

Data integrity is normally enforced in a
database system In computing, a database is an organized collection of data stored and accessed electronically. Small databases can be stored on a file system, while large databases are hosted on computer clusters or cloud storage. The design of databases s ...
by a series of integrity constraints or rules. Three types of integrity constraints are an inherent part of the relational data model: entity integrity, referential integrity and domain integrity. * ''
Entity integrity Entity integrity is concerned with ensuring that each row of a table has a unique and non-null primary key value; this is the same as saying that each row in a table represents a single instance of the entity type modelled by the table. A requiremen ...
'' concerns the concept of a
primary key In the relational model of databases, a primary key is a ''specific choice'' of a ''minimal'' set of attributes (columns) that uniquely specify a tuple ( row) in a relation (table). Informally, a primary key is "which attributes identify a record, ...
. Entity integrity is an integrity rule which states that every table must have a primary key and that the column or columns chosen to be the primary key should be unique and not null. * ''
Referential integrity Referential integrity is a property of data stating that all its references are valid. In the context of relational databases, it requires that if a value of one attribute (column) of a relation (table) references a value of another attribute ( ...
'' concerns the concept of a
foreign key A foreign key is a set of attributes in a table that refers to the primary key of another table. The foreign key links these two tables. Another way to put it: In the context of relational databases, a foreign key is a set of attributes subject to ...
. The referential integrity rule states that any foreign-key value can only be in one of two states. The usual state of affairs is that the foreign-key value refers to a primary key value of some table in the database. Occasionally, and this will depend on the rules of the data owner, a foreign-key value can be
null Null may refer to: Science, technology, and mathematics Computing * Null (SQL) (or NULL), a special marker and keyword in SQL indicating that something has no value *Null character, the zero-valued ASCII character, also designated by , often used ...
. In this case, we are explicitly saying that either there is no relationship between the objects represented in the database or that this relationship is unknown. * ''Domain integrity'' specifies that all columns in a relational database must be declared upon a defined domain. The primary unit of data in the relational data model is the data item. Such data items are said to be non-decomposable or atomic. A domain is a set of values of the same type. Domains are therefore pools of values from which actual values appearing in the columns of a table are drawn. * ''User-defined integrity'' refers to a set of rules specified by a user, which do not belong to the entity, domain and referential integrity categories. If a database supports these features, it is the responsibility of the database to ensure data integrity as well as the
consistency model In computer science, a consistency model specifies a contract between the programmer and a system, wherein the system guarantees that if the programmer follows the rules for operations on memory, memory will be consistent and the results of readi ...
for the data storage and retrieval. If a database does not support these features, it is the responsibility of the applications to ensure data integrity while the database supports the
consistency model In computer science, a consistency model specifies a contract between the programmer and a system, wherein the system guarantees that if the programmer follows the rules for operations on memory, memory will be consistent and the results of readi ...
for the data storage and retrieval. Having a single, well-controlled, and well-defined data-integrity system increases * stability (one centralized system performs all data integrity operations) * performance (all data integrity operations are performed in the same tier as the consistency model) * re-usability (all applications benefit from a single centralized data integrity system) * maintainability (one centralized system for all data integrity administration). Modern
database In computing, a database is an organized collection of data stored and accessed electronically. Small databases can be stored on a file system, while large databases are hosted on computer clusters or cloud storage. The design of databases span ...
s support these features (see
Comparison of relational database management systems The following tables compare general and technical information for a number of relational database management systems. Please see the individual products' articles for further information. Unless otherwise specified in footnotes, comparisons are ba ...
), and it has become the de facto responsibility of the database to ensure data integrity. Companies, and indeed many database systems, offer products and services to migrate legacy systems to modern databases.


Examples

An example of a data-integrity mechanism is the parent-and-child relationship of related records. If a parent record owns one or more related child records all of the referential integrity processes are handled by the database itself, which automatically ensures the accuracy and integrity of the data so that no child record can exist without a parent (also called being orphaned) and that no parent loses their child records. It also ensures that no parent record can be deleted while the parent record owns any child records. All of this is handled at the database level and does not require coding integrity checks into each application.


File systems

Various research results show that neither widespread
filesystem In computing, file system or filesystem (often abbreviated to fs) is a method and data structure that the operating system uses to control how data is stored and retrieved. Without a file system, data placed in a storage medium would be one larg ...
s (including UFS, Ext,
XFS XFS is a high-performance 64-bit computing, 64-bit journaling file system created by Silicon Graphics, Inc., Silicon Graphics, Inc (SGI) in 1993. It was the default file system in SGI's IRIX operating system starting with its version 5.3. XFS ...
, JFS and
NTFS New Technology File System (NTFS) is a proprietary journaling file system developed by Microsoft. Starting with Windows NT 3.1, it is the default file system of the Windows NT family. It superseded File Allocation Table (FAT) as the preferred fi ...
) nor hardware RAID solutions provide sufficient protection against data integrity problems. Some filesystems (including
Btrfs Btrfs (pronounced as "better F S", "butter F S", "b-tree F S", or simply by spelling it out) is a computer storage format that combines a file system based on the copy-on-write (COW) principle with a logical volume manager (not to be confused w ...
and
ZFS ZFS (previously: Zettabyte File System) is a file system with volume management capabilities. It began as part of the Sun Microsystems Solaris operating system in 2001. Large parts of Solaris – including ZFS – were published under an open ...
) provide internal data and
metadata Metadata is "data that provides information about other data", but not the content of the data, such as the text of a message or the image itself. There are many distinct types of metadata, including: * Descriptive metadata – the descriptive ...
checksumming that is used for detecting
silent data corruption Silent may mean any of the following: People with the name * Silent George, George Stone (outfielder) (1876–1945), American Major League Baseball outfielder and batting champion * Brandon Silent (born 1973), South African former footballer * ...
and improving data integrity. If a corruption is detected that way and internal RAID mechanisms provided by those filesystems are also used, such filesystems can additionally reconstruct corrupted data in a transparent way. This approach allows improved data integrity protection covering the entire data paths, which is usually known as
end-to-end data protection End-to-end or End to End may refer to: * End-to-end auditable voting systems, a voting system * End-to-end delay, the time for a packet to be transmitted across a network from source to destination * End-to-end encryption, a cryptographic paradigm ...
.


Data integrity as applied to various industries

* The U.S.
Food and Drug Administration The United States Food and Drug Administration (FDA or US FDA) is a federal agency of the Department of Health and Human Services. The FDA is responsible for protecting and promoting public health through the control and supervision of food s ...
has created draft guidance on data integrity for the pharmaceutical manufacturers required to adhere to U.S. Code of Federal Regulations 21 CFR Parts 210–212. Outside the U.S., similar data integrity guidance has been issued by the United Kingdom (2015), Switzerland (2016), and Australia (2017). * Various standards for the manufacture of medical devices address data integrity either directly or indirectly, including
ISO 13485 ISO 13485 ''Medical devices -- Quality management systems -- Requirements for regulatory purposes'' is a voluntary standard, published by International Organization for Standardization (ISO) for the first time in 1996, and contains a comprehensiv ...
, ISO 14155, and ISO 5840. * In early 2017, the
Financial Industry Regulatory Authority The Financial Industry Regulatory Authority (FINRA) is a private American corporation that acts as a self-regulatory organization (SRO) that regulates member brokerage firms and exchange markets. FINRA is the successor to the National Associati ...
(FINRA), noting data integrity problems with automated trading and money movement surveillance systems, stated it would make "the development of a data integrity program to monitor the accuracy of the submitted data" a priority. In early 2018, FINRA said it would expand its approach on data integrity to firms' "technology change management policies and procedures" and Treasury securities reviews. * Other sectors such as mining and product manufacturing are increasingly focusing on the importance of data integrity in associated automation and production monitoring assets. * Cloud storage providers have long faced significant challenges ensuring the integrity or provenance of customer data and tracking violations.


See also

*
End-to-end data integrity End-to-end or End to End may refer to: * End-to-end auditable voting systems, a voting system * End-to-end delay, the time for a packet to be transmitted across a network from source to destination * End-to-end encryption, a cryptographic paradigm ...
*
Message authentication In information security, message authentication or data origin authentication is a property that a message has not been modified while in transit (data integrity) and that the receiving party can verify the source of the message. Message authentica ...
*
National Information Assurance Glossary Committee on National Security Systems Instruction No. 4009, National Information Assurance Glossary, published by the United States federal government, is an unclassified glossary of Information security terms intended to provide a common vocabular ...
*
Single version of the truth In computerized business management, single version of the truth (SVOT), is a technical concept describing the data warehousing ideal of having either a single centralised database, or at least a distributed synchronised database, which stores all ...
*


References


Further reading

* * {{DEFAULTSORT:Data Integrity Data quality Transaction processing