HOME

TheInfoList



OR:

A checksum is a small-sized
block Block or blocked may refer to: Arts, entertainment and media Broadcasting * Block programming, the result of a programming strategy in broadcasting * W242BX, a radio station licensed to Greenville, South Carolina, United States known as ''96.3 ...
of data derived from another block of
digital data Digital data, in information theory and information systems, is information represented as a string of Discrete mathematics, discrete symbols, each of which can take on one of only a finite number of values from some alphabet (formal languages ...
for the purpose of detecting errors that may have been introduced during its transmission or storage. By themselves, checksums are often used to verify
data integrity Data integrity is the maintenance of, and the assurance of, data accuracy and consistency over its entire Information Lifecycle Management, life-cycle. It is a critical aspect to the design, implementation, and usage of any system that stores, proc ...
but are not relied upon to verify
data authenticity In information security, message authentication or data origin authentication is a property that a message has not been modified while in transit (data integrity) and that the receiving party can verify the source of the message. Description M ...
. The procedure which generates this checksum is called a checksum function or checksum algorithm. Depending on its design goals, a good checksum algorithm usually outputs a significantly different value, even for small changes made to the input. This is especially true of
cryptographic hash function A cryptographic hash function (CHF) is a hash algorithm (a map (mathematics), map of an arbitrary binary string to a binary string with a fixed size of n bits) that has special properties desirable for a cryptography, cryptographic application: ...
s, which may be used to detect many data corruption errors and verify overall
data integrity Data integrity is the maintenance of, and the assurance of, data accuracy and consistency over its entire Information Lifecycle Management, life-cycle. It is a critical aspect to the design, implementation, and usage of any system that stores, proc ...
; if the computed checksum for the current data input matches the stored value of a previously computed checksum, there is a very high probability the data has not been accidentally altered or corrupted. Checksum functions are related to
hash function A hash function is any Function (mathematics), function that can be used to map data (computing), data of arbitrary size to fixed-size values, though there are some hash functions that support variable-length output. The values returned by a ...
s,
fingerprint A fingerprint is an impression left by the friction ridges of a human finger. The recovery of partial fingerprints from a crime scene is an important method of forensic science. Moisture and grease on a finger result in fingerprints on surfa ...
s, randomization functions, and
cryptographic hash function A cryptographic hash function (CHF) is a hash algorithm (a map (mathematics), map of an arbitrary binary string to a binary string with a fixed size of n bits) that has special properties desirable for a cryptography, cryptographic application: ...
s. However, each of those concepts has different applications and therefore different design goals. For instance, a function returning the start of a string can provide a hash appropriate for some applications but will never be a suitable checksum. Checksums are used as
cryptographic primitive Cryptographic primitives are well-established, low-level cryptography, cryptographic algorithms that are frequently used to build cryptographic protocols for computer security systems. These routines include, but are not limited to, one-way hash fun ...
s in larger authentication algorithms. For cryptographic systems with these two specific design goals, see
HMAC In cryptography, an HMAC (sometimes expanded as either keyed-hash message authentication code or hash-based message authentication code) is a specific type of message authentication code (MAC) involving a cryptographic hash function and a se ...
.
Check digit A check digit is a form of redundancy check used for Error detection and correction, error detection on identification numbers, such as bank account numbers, which are used in an application where they will at least sometimes be input manually. It ...
s and
parity bit A parity bit, or check bit, is a bit added to a string of binary code. Parity bits are a simple form of error detecting code. Parity bits are generally applied to the smallest units of a communication protocol, typically 8-bit octets (bytes) ...
s are special cases of checksums, appropriate for small blocks of data (such as
Social Security number In the United States, a Social Security number (SSN) is a nine-digit number issued to United States nationality law, U.S. citizens, Permanent residence (United States), permanent residents, and temporary (working) residents under section 205(c)(2 ...
s,
bank account A bank account is a financial account maintained by a bank or other financial institution in which the financial transaction A financial transaction is an Contract, agreement, or communication, between a buyer and seller to exchange goods, ...
numbers, computer words, single
byte The byte is a unit of digital information that most commonly consists of eight bits. Historically, the byte was the number of bits used to encode a single character of text in a computer and for this reason it is the smallest addressable un ...
s, etc.). Some
error-correcting code In computing, telecommunication, information theory, and coding theory, forward error correction (FEC) or channel coding is a technique used for controlling errors in data transmission over unreliable or noisy communication channels. The centra ...
s are based on special checksums which not only detect common errors but also allow the original data to be recovered in certain cases.


Algorithms


Parity byte or parity word

The simplest checksum algorithm is the so-called longitudinal parity check, which breaks the data into "words" with a fixed number of bits, and then computes the bitwise
exclusive or Exclusive or, exclusive disjunction, exclusive alternation, logical non-equivalence, or logical inequality is a logical operator whose negation is the logical biconditional. With two inputs, XOR is true if and only if the inputs differ (on ...
(XOR) of all those words. The result is appended to the message as an extra word. In simpler terms, for =1 this means adding a bit to the end of the data bits to guarantee that there is an even number of '1's. To check the integrity of a message, the receiver computes the bitwise exclusive or of all its words, including the checksum; if the result is not a word consisting of zeros, the receiver knows a transmission error occurred. With this checksum, any transmission error which flips a single bit of the message, or an odd number of bits, will be detected as an incorrect checksum. However, an error that affects two bits will not be detected if those bits lie at the same position in two distinct words. Also swapping of two or more words will not be detected. If the affected bits are independently chosen at random, the probability of a two-bit error being undetected is .


Sum complement

A variant of the previous algorithm is to add all the "words" as unsigned binary numbers, discarding any overflow bits, and append the
two's complement Two's complement is the most common method of representing signed (positive, negative, and zero) integers on computers, and more generally, fixed point binary values. Two's complement uses the binary digit with the ''greatest'' value as the ''s ...
of the total as the checksum. To validate a message, the receiver adds all the words in the same manner, including the checksum; if the result is not a word full of zeros, an error must have occurred. This variant, too, detects any single-bit error, but the pro modular sum is used in
SAE J1708 Society of Automotive Engineers standard SAE J1708 is a standard used for serial communications between ECUs on a heavy duty vehicle and also between a computer and the vehicle. With respect to Open System Interconnection model (OSI), J1708 defin ...
.


Position-dependent

The simple checksums described above fail to detect some common errors which affect many bits at once, such as changing the order of data words, or inserting or deleting words with all bits set to zero. The checksum algorithms most used in practice, such as
Fletcher's checksum The Fletcher checksum is an algorithm for computing a position-dependent checksum devised by John G. Fletcher (1934–2012) at Lawrence Livermore Labs in the late 1970s. The objective of the Fletcher checksum was to provide error-detection prope ...
,
Adler-32 Adler-32 is a checksum algorithm written by Mark Adler in 1995, modifying Fletcher's checksum. Compared to a cyclic redundancy check of the same length, it trades reliability for speed. Adler-32 is more reliable than Fletcher-16, and slightly le ...
, and
cyclic redundancy check A cyclic redundancy check (CRC) is an error-detecting code commonly used in digital networks and storage devices to detect accidental changes to digital data. Blocks of data entering these systems get a short ''check value'' attached, based on ...
s (CRCs), address these weaknesses by considering not only the value of each word but also its position in the sequence. This feature generally increases the
cost Cost is the value of money that has been used up to produce something or deliver a service, and hence is not available for use anymore. In business, the cost may be one of acquisition, in which case the amount of money expended to acquire it i ...
of computing the checksum.


Fuzzy checksum

The idea of fuzzy checksum was developed for detection of
email spam Email spam, also referred to as junk email, spam mail, or simply spam, refers to unsolicited messages sent in bulk via email. The term originates from a Spam (Monty Python), Monty Python sketch, where the name of a canned meat product, "Spam (food ...
by building up cooperative databases from multiple ISPs of email suspected to be spam. The content of such spam may often vary in its details, which would render normal checksumming ineffective. By contrast, a "fuzzy checksum" reduces the body text to its characteristic minimum, then generates a checksum in the usual manner. This greatly increases the chances of slightly different spam emails producing the same checksum. The ISP spam detection software, such as
SpamAssassin Apache SpamAssassin is a computer program used for e-mail spam filtering. It uses a variety of spam-detection techniques, including DNS and fuzzy checksum techniques, Bayesian filtering, external programs, blacklists and online databases. It ...
, of co-operating ISPs, submits checksums of all emails to the centralised service such as
DCC DCC may refer to: Biology * Netrin receptor DCC, human receptor protein, and the gene encoding it * Dosage compensation complex Business * Day Chocolate Company * DCC plc, an Irish holding company * Doppelmayr Cable Car, cable car company * D ...
. If the count of a submitted fuzzy checksum exceeds a certain threshold, the database notes that this probably indicates spam. ISP service users similarly generate a fuzzy checksum on each of their emails and request the service for a spam likelihood.


General considerations

A message that is bits long can be viewed as a corner of the -dimensional
hypercube In geometry, a hypercube is an ''n''-dimensional analogue of a square ( ) and a cube ( ); the special case for is known as a ''tesseract''. It is a closed, compact, convex figure whose 1- skeleton consists of groups of opposite parallel l ...
. The effect of a checksum algorithm that yields an -bit checksum is to map each -bit message to a corner of a larger hypercube, with dimension . The corners of this hypercube represent all possible received messages. The valid received messages (those that have the correct checksum) comprise a smaller set, with only corners. A single-bit transmission error then corresponds to a displacement from a valid corner (the correct message and checksum) to one of the adjacent corners. An error which affects bits moves the message to a corner which is steps removed from its correct corner. The goal of a good checksum algorithm is to spread the valid corners as far from each other as possible, to increase the likelihood "typical" transmission errors will end up in an invalid corner.


See also

General topic *
Algorithm In mathematics and computer science, an algorithm () is a finite sequence of Rigour#Mathematics, mathematically rigorous instructions, typically used to solve a class of specific Computational problem, problems or to perform a computation. Algo ...
*
Check digit A check digit is a form of redundancy check used for Error detection and correction, error detection on identification numbers, such as bank account numbers, which are used in an application where they will at least sometimes be input manually. It ...
*
Damm algorithm In error detection, the Damm algorithm is a check digit algorithm that detects all single-digit errors and all adjacent transposition errors. It was presented by H. Michael Damm in 2004, as a part of his PhD dissertation entitled ''Totally Antisy ...
*
Data rot Data degradation is the gradual corruption of computer data due to an accumulation of non-critical failures in a data storage device. It is also referred to as data decay, data rot or bit rot. This results in a decline in data quality over time ...
* File verification *
Fletcher's checksum The Fletcher checksum is an algorithm for computing a position-dependent checksum devised by John G. Fletcher (1934–2012) at Lawrence Livermore Labs in the late 1970s. The objective of the Fletcher checksum was to provide error-detection prope ...
* Frame check sequence *
cksum cksum is a command in Unix and Unix-like operating systems that generates a checksum value for a file or stream of data. The cksum command reads each file given in its arguments, or standard input if no arguments are provided, and outputs the fi ...
*
md5sum is a computer program that calculates and verifies 128-bit MD5 hashes, as described in RFC 1321. The MD5 hash functions as a compact digital fingerprint of a file. As with all such hashing algorithms, there is theoretically an unlimited number ...
*
sha1sum is a computer program that calculates and verifies SHA-1 Cryptographic hash function, hashes. It is commonly used to verify the integrity of files. It (or a variant) is installed by default on most Linux distributions. Typically distributed alo ...
*
Parchive Parchive (a portmanteau of parity archive, and formally known as Parity Volume Set Specification) is an erasure code system that produces par files for checksum verification of data integrity, with the capability to perform data recovery operatio ...
* Sum (Unix) * SYSV checksum * BSD checksum * xxHash Error correction *
Hamming code In computer science and telecommunications, Hamming codes are a family of linear error-correcting codes. Hamming codes can detect one-bit and two-bit errors, or correct one-bit errors without detection of uncorrected errors. By contrast, the ...
*
Reed–Solomon error correction In information theory and coding theory, Reed–Solomon codes are a group of error-correcting codes that were introduced by Irving S. Reed and Gustave Solomon in 1960. They have many applications, including consumer technologies such as MiniDiscs, ...
* IPv4 header checksum Hash functions * List of hash functions *
Luhn algorithm The Luhn algorithm or Luhn formula (creator: IBM scientist Hans Peter Luhn), also known as the " modulus 10" or "mod 10" algorithm, is a simple check digit formula used to validate a variety of identification numbers. The algorithm is in the pub ...
*
Parity bit A parity bit, or check bit, is a bit added to a string of binary code. Parity bits are a simple form of error detecting code. Parity bits are generally applied to the smallest units of a communication protocol, typically 8-bit octets (bytes) ...
*
Rolling checksum Rolling is a type of motion that combines rotation (commonly, of an axially symmetric object) and translation of that object with respect to a surface (either one or the other moves), such that, if ideal conditions exist, the two are in contact ...
* Verhoeff algorithm File systems *
Bcachefs Bcachefs is a copy-on-write (COW) file system for Linux-based operating systems. Its primary developer, Kent Overstreet, first announced it in 2015, and it was added to the Linux kernel beginning with 6.7. It is intended to compete with the moder ...
,
Btrfs Btrfs (pronounced as "better F S", "butter F S", "b-tree F S", or "B.T.R.F.S.") is a computer storage format that combines a file system based on the copy-on-write (COW) principle with a logical volume manager (distinct from Linux's LVM), d ...
,
ReFS Resilient File System (ReFS), codenamed "Protogon", is a Microsoft proprietary file system introduced with Windows Server 2012 with the intent of becoming the "next generation" file system after NTFS. ReFS was designed to overcome problem ...
and
ZFS ZFS (previously Zettabyte File System) is a file system with Volume manager, volume management capabilities. It began as part of the Sun Microsystems Solaris (operating system), Solaris operating system in 2001. Large parts of Solaris, includin ...
 – file systems that perform automatic file integrity checking using checksums Related concepts *
Isopsephy In numerology, isopsephy (stressed on the ''I'' and the ''E''; , ) or isopsephism is the practice of adding up the Greek numerals, number values of the letters in a word to form a single number. The total number is then used as a metaphorical brid ...
*
Gematria In numerology, gematria (; or , plural or ) is the practice of assigning a numerical value to a name, word, or phrase by reading it as a number, or sometimes by using an alphanumeric cipher. The letters of the alphabets involved have standar ...
*
File fixity File fixity is a digital preservation In library science, library and archival science, digital preservation is a formal process to ensure that digital information of continuing value remains accessible and usable in the long term. It involves pl ...


References


Further reading

* *


External links

{{wikibooks , 1= Algorithm Implementation , 2= Checksums
Additive Checksums (C)
theory from Barr Group
A4US-LetterUS-Letter two-columnChecksum CalculatorOpen source python based application with GUI used to verify downloads.