File verification is the process of using an algorithm for verifying the integrity of a computer file, usually by checksum. This can be done by comparing two files bit-by-bit, but this requires two copies of the same file and may miss systematic corruptions that occur in both copies. A more popular approach is to generate a hash of the copied file and compare it to the hash of the original file.
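As a simple illustration, the bit-by-bit approach amounts to comparing the raw contents of the two copies directly. A minimal Python sketch (the file names are hypothetical):

```python
import filecmp

# Bit-by-bit comparison requires both copies of the file to be present.
# shallow=False forces a byte-for-byte content comparison instead of
# only comparing os.stat() metadata such as size and modification time.
identical = filecmp.cmp("original.iso", "copy.iso", shallow=False)
print("files match" if identical else "files differ")
```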
Integrity verification
File integrity can be compromised, usually referred to as the file becoming corrupted. A file can become corrupted in a variety of ways: faulty storage media, errors in transmission, write errors during copying or moving, software bugs, and so on.
Hash-based verification ensures that a file has not been corrupted by comparing the file's hash value to a previously calculated value. If these values match, the file is presumed to be unmodified. Due to the nature of hash functions,
hash collisions may result in
false positives, but the likelihood of collisions is often negligible with random corruption.
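As an illustration, hash-based verification takes only a few lines of Python. A minimal sketch, assuming SHA-256 as the hash and hypothetical file names:

```python
import hashlib

def sha256_of(path: str, chunk_size: int = 65536) -> str:
    """Compute the SHA-256 digest of a file, reading it in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def verify(path: str, expected_hex: str) -> bool:
    """Compare the file's digest to a previously recorded value."""
    return sha256_of(path) == expected_hex.lower()

# e.g. verify("download.iso", "9f86d081884c7d65...")
```

Only the short digest string needs to be stored or transmitted alongside the file, rather than a second full copy.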
Authenticity verification
It is often desirable to verify that a file has not been modified in transmission or storage by untrusted parties, for example, to include malicious code such as viruses or backdoors. To verify authenticity, a classical hash function is not enough, as such functions are not designed to be collision resistant; it is computationally trivial for an attacker to cause deliberate hash collisions, meaning that a malicious change in the file is not detected by a hash comparison. In cryptography, this attack is called a preimage attack.
For this purpose, cryptographic hash functions are often employed. As long as the hash sums cannot be tampered with (for example, if they are communicated over a secure channel), the files can be presumed to be intact. Alternatively, digital signatures can be employed to assure tamper resistance.
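As a sketch of the signature-based approach, the following assumes the third-party Python "cryptography" package and Ed25519 keys; in practice the publisher's public key must itself be obtained through a trusted channel:

```python
from cryptography.exceptions import InvalidSignature
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey

data = b"contents of the file being distributed"  # stands in for the file bytes

# The publisher signs the file once with their private key...
private_key = Ed25519PrivateKey.generate()
signature = private_key.sign(data)

# ...and anyone holding the authentic public key can check for tampering.
public_key = private_key.public_key()
try:
    public_key.verify(signature, data)
    print("signature valid: file is intact")
except InvalidSignature:
    print("signature invalid: file was modified")
```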
File formats
A checksum file is a small file that contains the checksums of other files.
There are a few well-known checksum file formats.
Several utilities, such as
md5deep, can use such checksum files to automatically verify an entire directory of files in one operation.
The particular hash algorithm used is often indicated by the file extension of the checksum file.
The ".sha1" file extension indicates a checksum file containing 160-bit
SHA-1
In cryptography, SHA-1 (Secure Hash Algorithm 1) is a hash function which takes an input and produces a 160-bit (20-byte) hash value known as a message digest – typically rendered as 40 hexadecimal digits. It was designed by the United States ...
hashes in
sha1sum format.
The ".md5" file extension, or a file named "MD5SUMS", indicates a checksum file containing 128-bit
MD5
The MD5 message-digest algorithm is a widely used hash function producing a 128-bit hash value. MD5 was designed by Ronald Rivest in 1991 to replace an earlier hash function MD4, and was specified in 1992 as Request for Comments, RFC 1321.
MD5 ...
hashes in
md5sum format.
The ".sfv" file extension indicates a checksum file containing 32-bit CRC32 checksums in
simple file verification format.
The "crc.list" file indicates a checksum file containing 32-bit CRC checksums in brik format.
As of 2012, best practice recommendations are to use SHA-2 or SHA-3 to generate new file integrity digests, and to accept MD5 and SHA-1 digests for backward compatibility if stronger digests are not available. The theoretically weaker SHA-1, the weaker MD5, and the much weaker CRC were previously commonly used for file integrity checks.
(Simson Garfinkel, Gene Spafford, and Alan Schwartz, Practical UNIX and Internet Security, p. 630.)
CRC checksums cannot be used to verify the authenticity of files, as CRC32 is not a collision-resistant hash function; even if the hash sum file is not tampered with, it is computationally trivial for an attacker to replace a file with one having the same CRC digest as the original, meaning that a malicious change in the file is not detected by a CRC comparison.
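The underlying reason is that CRC32 is an affine function of the message bits over GF(2): for any three equal-length messages, crc(x ^ y ^ z) = crc(x) ^ crc(y) ^ crc(z), so an attacker can predict exactly how a change to a file will change its CRC and compensate for it. A short Python demonstration of this identity (the message contents are arbitrary examples):

```python
import zlib

def xor_bytes(a: bytes, b: bytes) -> bytes:
    return bytes(p ^ q for p, q in zip(a, b))

# Any three messages of equal length will do.
x = b"pay $0100 to Alice"
y = b"pay $9999 to Mallo"
z = b"pay $0000 to nobod"

lhs = zlib.crc32(xor_bytes(xor_bytes(x, y), z))
rhs = zlib.crc32(x) ^ zlib.crc32(y) ^ zlib.crc32(z)
assert lhs == rhs  # differences combine linearly, so CRCs can be forged
print(hex(lhs), hex(rhs))
```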
See also
* Checksum
* Data deduplication