Data corruption

Data corruption refers to errors in computer data that occur during writing, reading, storage, transmission, or processing, which introduce unintended changes to the original data. Computer, transmission, and storage systems use a number of measures to provide end-to-end data integrity, or lack of errors.

In general, when data corruption occurs, a file containing that data will produce unexpected results when accessed by the system or the related application. Results can range from a minor loss of data to a system crash. For example, if a document file is corrupted, a person trying to open it with a document editor may get an error message; the file might not open at all, or it might open with some of its data corrupted (or, in some cases, completely corrupted, leaving the document unintelligible).

Some types of malware may intentionally corrupt files as part of their payloads, usually by overwriting them with inoperative or garbage code, while a non-malicious virus may also unintentionally corrupt files when it accesses them. If a virus or trojan with this payload method manages to alter files critical to the running of the computer's operating system software or physical hardware, the entire system may be rendered unusable. Some programs can offer to repair a file automatically after such an error, while others cannot; whether repair is possible depends on the severity of the corruption and on the built-in error-handling functionality of the application. Data corruption has a variety of causes.


Overview

There are two types of data corruption associated with computer systems: undetected and detected. Undetected data corruption, also known as ''silent data corruption'', results in the most dangerous errors, as there is no indication that the data is incorrect. Detected data corruption may be permanent, with the loss of data, or it may be temporary, when some part of the system is able to detect and correct the error; in the latter case there is no data corruption.

Data corruption can occur at any level in a system, from the host to the storage medium. Modern systems attempt to detect corruption at many layers and then recover or correct it; this is almost always successful, but in rare cases the information arriving in the system's memory is already corrupted and can cause unpredictable results.

Data corruption during transmission has a variety of causes. Interruption of data transmission causes information loss. Environmental conditions can interfere with data transmission, especially when dealing with wireless transmission methods: heavy clouds can block satellite transmissions, and wireless networks are susceptible to interference from devices such as microwave ovens. Hardware and software failure are the two main causes of data loss. Background radiation, head crashes, and aging or wear of the storage device fall into the former category, while software failure typically occurs due to bugs in the code. Cosmic rays cause most soft errors in DRAM.


Silent

Some errors go unnoticed, without being detected by the disk firmware or the host operating system; these errors are known as ''silent data corruption''. There are many error sources beyond the disk storage subsystem itself: cables might be slightly loose, the power supply might be unreliable, external vibrations such as a loud sound can disturb the drive, the network might introduce undetected corruption, and cosmic radiation and many other causes of soft memory errors can flip bits. In 39,000 storage systems that were analyzed, firmware bugs accounted for 5–10% of storage failures. All in all, the error rates observed in a CERN study on silent corruption are far higher than one in every 10^16 bits. The online retailer Amazon.com has acknowledged similarly high data corruption rates in its systems. In 2021, faulty processor cores were identified as an additional cause in publications by Google and Facebook; cores were found to be faulty at a rate of several per thousands of cores.

One problem is that hard disk drive capacities have increased substantially, while their error rates remain unchanged. The data corruption rate has always been roughly constant in time, meaning that modern disks are not much safer than old disks. In old disks, the probability of data corruption was very small because they stored tiny amounts of data; in modern disks, the probability is much larger because they store much more data while not being any safer. In that sense, silent data corruption was not a serious concern while storage devices remained relatively small and slow. With the advent of larger drives and very fast RAID setups, users are capable of transferring 10^16 bits in a reasonably short time, thus easily reaching the data corruption thresholds.
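To put that threshold in perspective, here is a minimal back-of-the-envelope sketch in Python; the one-per-10^16-bits error rate is the figure quoted above, and the array throughput is an illustrative assumption, not a measured value:

```python
# Back-of-the-envelope estimate: how quickly a fast storage array reaches the
# ~1-in-10^16-bits silent-corruption regime. Throughput figure is illustrative.

BIT_ERROR_RATE = 1e-16          # assumed silent-corruption rate, errors per bit
ARRAY_THROUGHPUT_GB_S = 10      # assumed sustained RAID throughput, gigabytes/s

bits_per_second = ARRAY_THROUGHPUT_GB_S * 1e9 * 8
hours_to_1e16_bits = 1e16 / bits_per_second / 3600
expected_errors_per_day = BIT_ERROR_RATE * bits_per_second * 86_400

print(f"Time to move 10^16 bits: {hours_to_1e16_bits:.1f} hours")
print(f"Expected silent errors per day at this rate: {expected_errors_per_day:.2f}")
```

Under these assumptions the array moves 10^16 bits in roughly a day and a half, so an error rate that once looked negligible becomes a near-daily event.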
As an example, ZFS creator Jeff Bonwick stated that the fast database at Greenplum, a database software company specializing in large-scale data warehousing and analytics, faces silent corruption every 15 minutes. As another example, a real-life study performed by NetApp on more than 1.5 million HDDs over 41 months found more than 400,000 silent data corruptions, of which more than 30,000 were not detected by the hardware RAID controller and were only found during scrubbing. Another study, performed by CERN over six months and involving about 97 petabytes of data, found that about 128 megabytes of data became permanently and silently corrupted somewhere in the pathway from network to disk.
Silent data corruption may result in cascading failures, in which the system may run for a period of time with an undetected initial error that causes increasingly more problems until the corruption is ultimately detected. For example, a failure affecting file system metadata can result in multiple files being partially damaged or made completely inaccessible as the file system continues to be used in its corrupted state.


Countermeasures

When data corruption behaves as a Poisson process, where each bit of data has an independently low probability of being changed, data corruption can generally be detected by the use of checksums, and can often be corrected by the use of error-correcting codes (ECC). If an uncorrectable data corruption is detected, procedures such as automatic retransmission or restoration from backups can be applied. Certain levels of RAID disk arrays have the ability to store and evaluate parity bits for data across a set of hard disks, and can reconstruct corrupted data upon the failure of a single disk or of multiple disks, depending on the level of RAID implemented.
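As a rough illustration of both ideas, rather than any particular product's implementation, the following Python sketch shows a CRC-32 checksum catching a single flipped bit, and a simple XOR parity block (the idea behind RAID 5 style parity) reconstructing a lost block:

```python
import zlib

# Detecting corruption with a checksum: store CRC-32 alongside each block.
block = b"example payload" * 64
stored_crc = zlib.crc32(block)

corrupted = bytearray(block)
corrupted[100] ^= 0x01                                 # flip one bit in transit/storage
assert zlib.crc32(bytes(corrupted)) != stored_crc      # mismatch reveals the corruption

# Reconstructing a lost block with XOR parity (the principle used by RAID 5).
blocks = [bytes([i]) * 16 for i in range(4)]           # four data blocks
parity = bytes(a ^ b ^ c ^ d for a, b, c, d in zip(*blocks))

lost_index = 2
survivors = [b for i, b in enumerate(blocks) if i != lost_index]
rebuilt = bytes(a ^ b ^ c ^ p for a, b, c, p in zip(*survivors, parity))
assert rebuilt == blocks[lost_index]                   # lost block fully recovered
```

The checksum only detects that something changed; the parity provides enough redundancy to rebuild a missing block, which is why the two are usually combined.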
Some CPU architectures employ various transparent checks to detect and mitigate data corruption in CPU caches, CPU buffers, and instruction pipelines; an example is ''Intel Instruction Replay'' technology, which is available on Intel Itanium processors.

Many errors are detected and corrected by hard disk drives using the ECC codes that are stored on disk for each sector. If the disk drive detects multiple read errors on a sector, it may make a copy of the failing sector on another part of the disk by remapping the failed sector to a spare sector, without the involvement of the operating system (though this may be delayed until the next write to the sector). This "silent correction" can be monitored using S.M.A.R.T. and tools, available for most operating systems, that automatically check the disk drive for impending failures by watching for deteriorating SMART parameters.
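A hedged sketch of such monitoring is shown below; it assumes smartmontools' `smartctl` is installed and that the drive reports the classic ATA attribute table. The watched attribute names and the column layout vary by vendor and are assumptions here, not guarantees:

```python
import subprocess

# Poll a handful of SMART attributes commonly associated with impending failure.
# Assumes `smartctl -A` prints the classic 10-column ATA attribute table.
WATCHED = ("Reallocated_Sector_Ct", "Current_Pending_Sector", "Offline_Uncorrectable")

def smart_warnings(device="/dev/sda"):
    out = subprocess.run(["smartctl", "-A", device],
                         capture_output=True, text=True, check=False).stdout
    warnings = []
    for line in out.splitlines():
        fields = line.split()
        if len(fields) >= 10 and fields[1] in WATCHED and fields[9] != "0":
            warnings.append((fields[1], fields[9]))    # (attribute name, raw value)
    return warnings

if __name__ == "__main__":
    for name, raw in smart_warnings():
        print(f"deteriorating SMART attribute: {name} raw={raw}")
```

A rising raw value for any of these attributes is a common sign that the drive's silent remapping is no longer keeping up and the disk should be replaced.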
Some file systems, such as Btrfs, HAMMER, ReFS, and ZFS, use internal data and metadata checksumming to detect silent data corruption. In addition, if a corruption is detected and the file system uses integrated RAID mechanisms that provide data redundancy, such file systems can also reconstruct corrupted data in a transparent way. This approach allows improved data integrity protection covering the entire data path, which is usually known as ''end-to-end data protection''; it contrasts with data integrity approaches that do not span the different layers of the storage stack and therefore allow data corruption to occur while the data passes the boundaries between layers.
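The sketch below illustrates the end-to-end idea: a checksum computed when the data is produced is verified when it is consumed, with a second toy replica standing in for the file system's redundancy. The function names and the two-replica on-disk layout are illustrative, not any real file system's format:

```python
import hashlib

# Minimal sketch of end-to-end checksumming with toy redundancy: the checksum
# travels with the data, so corruption anywhere along the path is caught, and a
# second replica allows the read to succeed anyway.

def write_block(path: str, data: bytes) -> None:
    digest = hashlib.sha256(data).hexdigest()
    for replica in (path + ".a", path + ".b"):        # two toy replicas
        with open(replica, "w") as f:
            f.write(digest + "\n")
            f.write(data.hex())

def read_block(path: str) -> bytes:
    last_error = None
    for replica in (path + ".a", path + ".b"):
        with open(replica) as f:
            digest = f.readline().strip()
            payload = bytes.fromhex(f.read().strip())
        if hashlib.sha256(payload).hexdigest() == digest:
            return payload                             # checksum verified end to end
        last_error = f"checksum mismatch in {replica}"
    raise IOError(last_error or "no replica readable")
```

Because the check happens at the consumer rather than inside any single layer, a bad cable, controller, or driver in between is caught just as readily as a bad sector.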
''Data scrubbing'' is another method to reduce the likelihood of data corruption, as disk errors are caught and recovered from before multiple errors accumulate and overwhelm the number of parity bits. Instead of parity being checked on each read, the parity is checked during a regular scan of the disk, often done as a low-priority background process; it is the data scrubbing operation that activates the parity check. If a user simply runs a normal program that reads data from the disk, the parity is not checked unless parity-check-on-read is both supported and enabled on the disk subsystem.
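A user-space analogue of a scrub pass might look like the following sketch, which periodically re-reads files and compares them against previously recorded SHA-256 digests; the manifest file and the daily interval are illustrative assumptions, not part of any standard tool:

```python
import hashlib
import json
import pathlib
import time

# Hedged sketch of a user-space "scrub": periodically re-read every file listed
# in a previously built manifest of SHA-256 digests, so latent corruption is
# found before it accumulates.

def scrub(root: str, manifest_path: str = "manifest.json") -> None:
    manifest = json.loads(pathlib.Path(manifest_path).read_text())
    for name, recorded in manifest.items():
        current = hashlib.sha256(pathlib.Path(root, name).read_bytes()).hexdigest()
        if current != recorded:
            print(f"scrub: silent corruption detected in {name}")

if __name__ == "__main__":
    while True:                   # low-priority background loop
        scrub("/srv/archive")
        time.sleep(24 * 3600)     # one pass per day
```

Real RAID and file-system scrubbers work the same way in principle, but repair the affected blocks from parity or replicas instead of merely reporting them.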
If appropriate mechanisms are employed to detect and remedy data corruption, data integrity can be maintained. This is particularly important in commercial applications (e.g. banking), where an undetected error could either corrupt a database index or change data in a way that drastically affects an account balance, and in the use of encrypted or compressed data, where a small error can make an extensive dataset unusable.


See also

* Various resources:
** Data degradation, also called data rot
** Computer science
** Data integrity
** Database integrity
** Radiation hardening
** Software rot
* Countermeasures:
** Data Integrity Field
** ECC memory
** Forward error correction
** List of data recovery software
** Parchive
** RAID
** Reed–Solomon error correction


References


External links


SoftECC: A System for Software Memory Integrity Checking

A Tunable, Software-based DRAM Error Detection and Correction Library for HPC

Detection and Correction of Silent Data Corruption for Large-Scale High-Performance Computing

End-to-end Data Integrity for File Systems: A ZFS Case Study

DRAM Errors in the Wild: A Large-Scale Field Study

A study on silent corruptions and an associated paper on data integrity (CERN, 2007)
End-to-end Data Protection in SAS and Fibre Channel Hard Disk Drives
(HGST)