A rolling hash (also known as recursive hashing or rolling checksum) is a
hash function
where the input is hashed in a window that moves through the input. A few hash functions allow a rolling hash to be computed very quickly—the new hash value is rapidly calculated given only the old hash value, the old value removed from the window, and the new value added to the window—similar to the way a
moving average
function can be computed much more quickly than other low-pass filters. One of the main applications is the Rabin–Karp string search algorithm, which uses the rolling hash described below. Another popular application is the
rsync
program, which uses a checksum based on Mark Adler's
adler-32
as its rolling hash. Low Bandwidth Network Filesystem (LBFS) uses a
Rabin fingerprint
as its rolling hash. FastCDC (Fast Content-Defined Chunking) uses a compute-efficient Gear fingerprint as its rolling hash. At best, rolling hash values are pairwise independent (Daniel Lemire, Owen Kaser: Recursive ''n''-gram hashing is pairwise independent, at best, Computer Speech & Language 24 (4), pages 698–710, 2010, arXiv:0705.4676) or strongly
universal
. They cannot be 3-wise independent, for example.


Polynomial rolling hash

The Rabin–Karp string search algorithm is often explained using a rolling hash function that only uses multiplications and additions: :H = c_1 a^{k-1} + c_2 a^{k-2} + c_3 a^{k-3} + \ldots + c_k a^0, where a is a constant, and c_1, ..., c_k are the input characters (but this function is not a
Rabin fingerprint
, see below). In order to avoid manipulating huge H values, all math is done modulo n. The choice of a and n is critical to get good hashing; in particular, the modulus n is typically a prime number. See
linear congruential generator
for more discussion. Removing and adding characters simply involves adding or subtracting the first or last term. Shifting all characters by one position to the left requires multiplying the entire sum H by a. Shifting all characters by one position to the right requires dividing the entire sum H by a. Note that in modulo arithmetic, a can be chosen to have a
multiplicative inverse a^{-1} by which H can be multiplied to get the result of the division without actually performing a division.
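As a concrete illustration, the rolling update described above can be sketched in Python; the base and modulus below are arbitrary example choices, not canonical values from any particular implementation:

```python
# Sketch of the polynomial rolling hash above.  The base `a` and the
# prime modulus `n` are illustrative picks, not mandated constants.
A = 256          # base: one more than the largest byte value
MOD = 1_000_003  # example prime modulus

def hash_window(s: bytes) -> int:
    """Hash of the full window: c_1*a^(k-1) + ... + c_k*a^0 (mod n)."""
    h = 0
    for c in s:
        h = (h * A + c) % MOD
    return h

def roll(h: int, old: int, new: int, k: int) -> int:
    """Slide the window one position: drop `old`, append `new`."""
    h = (h - old * pow(A, k - 1, MOD)) % MOD  # subtract the leading term
    return (h * A + new) % MOD                # shift left by one and add

data = b"abcdef"
k = 3
h = hash_window(data[:k])           # hash of b"abc"
h = roll(h, data[0], data[k], k)    # now the hash of b"bcd"
assert h == hash_window(data[1:k+1])
```

The rolled value matches a from-scratch hash of the new window, which is the constant-time update the section describes.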


Rabin fingerprint

The
Rabin fingerprint
is another hash, which also interprets the input as a polynomial, but over the
Galois field GF(2).
Instead of seeing the input as a polynomial of bytes, it is seen as a polynomial of bits, and all arithmetic is done in GF(2) (similarly to CRC32). The hash is the result of the division of that polynomial by an irreducible polynomial over GF(2). It is possible to update a Rabin fingerprint using only the entering and the leaving byte, making it effectively a rolling hash. Because it shares the same author as the Rabin–Karp string search algorithm, which is often explained with another, simpler rolling hash, and because this simpler rolling hash is also a polynomial, both rolling hashes are often mistaken for each other. The backup software restic uses a Rabin fingerprint for splitting files, with blob sizes varying within a configured range.
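The core reduction step can be illustrated with a toy (non-rolling) fingerprint: the input bits form a polynomial that is reduced modulo an irreducible polynomial over GF(2). The degree-8 polynomial below (x^8 + x^4 + x^3 + x + 1, known from AES) is an example choice, and this sketch is not restic's implementation:

```python
# Toy Rabin-style fingerprint over GF(2): feed the message in bit by
# bit and keep reducing modulo an irreducible polynomial.  POLY is an
# illustrative degree-8 irreducible polynomial, so the fingerprint fits
# in 8 bits; real schemes use a much higher degree.
POLY = 0x11B                       # x^8 + x^4 + x^3 + x + 1
DEG = POLY.bit_length() - 1        # degree 8

def rabin_fingerprint(data: bytes) -> int:
    fp = 0
    for byte in data:
        for i in range(7, -1, -1):         # most significant bit first
            fp = (fp << 1) | ((byte >> i) & 1)
            if fp >> DEG:                  # degree reached DEG: reduce
                fp ^= POLY
    return fp                              # remainder, at most DEG bits
```

For example, the message 0x01 0x00 is the polynomial x^8, whose remainder modulo POLY is x^4 + x^3 + x + 1 = 27.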


Cyclic polynomial

Hashing by cyclic polynomial—sometimes called Buzhash—is also simple, but it has the benefit of avoiding multiplications, using barrel shifts instead. It is a form of
tabulation hashing
: it presumes that there is some hash function h from characters to integers in the interval [0, 2^L). This hash function might be simply an array or a hash table mapping characters to random integers. Let the function s be a cyclic binary rotation (or
circular shift
): it rotates the bits by 1 to the left, pushing the leading bit into the last position. E.g., s(101)=011. Let \oplus be the bitwise
exclusive or
. The hash values are defined as :H = s^{k-1}(h(c_1)) \oplus s^{k-2}(h(c_2)) \oplus \ldots \oplus s(h(c_{k-1})) \oplus h(c_k), where the multiplications by powers of two can be implemented by binary shifts. The result is a number in [0, 2^L). Computing the hash values in a rolling fashion is done as follows. Let H be the previous hash value. Rotate H once: H \leftarrow s(H). If c_1 is the character to be removed, rotate it k times: s^k(h(c_1)). Then simply set :H \leftarrow s(H) \oplus s^k(h(c_1)) \oplus h(c_{k+1}), where c_{k+1} is the new character. Hashing by cyclic polynomials is strongly universal or pairwise independent: simply keep the first L-k+1 bits. That is, take the result H and dismiss any k-1 consecutive bits. In practice, this can be achieved by an integer division H \rightarrow H \div 2^{k-1}.
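A minimal sketch of hashing by cyclic polynomial follows; the table seed, the hash width L = 32, and the window length are illustrative choices:

```python
# Buzhash (cyclic polynomial) sketch.  h is realized as a table of
# random L-bit integers; the seed is fixed only for reproducibility.
import random

L = 32                       # hash width in bits
MASK = (1 << L) - 1
rng = random.Random(42)
TABLE = [rng.getrandbits(L) for _ in range(256)]   # h: character -> integer

def rotl(x: int, r: int) -> int:
    """Cyclic binary rotation s^r: rotate x left by r bits within L bits."""
    r %= L
    return ((x << r) | (x >> (L - r))) & MASK

def buzhash(window: bytes) -> int:
    """H = s^(k-1)(h(c_1)) xor ... xor s(h(c_{k-1})) xor h(c_k)."""
    k = len(window)
    h = 0
    for i, c in enumerate(window):
        h ^= rotl(TABLE[c], k - 1 - i)
    return h

def roll(h: int, old: int, new: int, k: int) -> int:
    """Rotate H once, cancel the outgoing character, add the incoming one."""
    return rotl(h, 1) ^ rotl(TABLE[old], k) ^ TABLE[new]

data = b"hello world"
k = 4
h = buzhash(data[:k])               # hash of b"hell"
h = roll(h, data[0], data[k], k)    # now the hash of b"ello"
assert h == buzhash(data[1:k+1])
```

Only rotations and exclusive-ors are used, which is the multiplication-free property the section highlights.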


Content-based slicing using a rolling hash

One of the interesting use cases of the rolling hash function is that it can create dynamic, content-based chunks of a stream or file. This is especially useful when it is required to send only the changed chunks of a large file over a network: a simple byte addition at the front of the file would normally cause all fixed-size windows to become updated, while in reality, only the first "chunk" has been modified. A simple approach to making dynamic chunks is to calculate a rolling hash, and if the hash value matches an arbitrary pattern (e.g. all zeroes) in the lower ''n'' bits (with a probability of 2^{-n}, given the hash has a uniform probability distribution), then it is chosen to be a chunk boundary. Each chunk will thus have an average size of 2^n bytes. This approach ensures that unmodified data (more than a window size away from the changes) will have the same boundaries. Once the boundaries are known, the chunks need to be compared by cryptographic hash value to detect changes. The backup software Attic uses the Buzhash algorithm with a customizable chunk size range for splitting file streams.
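The boundary rule above can be sketched as follows; the window size, the number of low bits checked, and the stand-in polynomial rolling hash are all illustrative assumptions, not any particular tool's parameters:

```python
# Sketch of content-based slicing: declare a boundary wherever the
# rolling hash has all zeroes in its lower N bits, so chunks come out
# at 2^N bytes on average.  A simple polynomial rolling hash stands in
# for whichever rolling hash a real tool would use.
A, MOD = 256, (1 << 61) - 1
WINDOW = 16     # rolling window, in bytes
N = 6           # cut when the low 6 bits are zero -> ~64-byte chunks

def chunk_boundaries(data: bytes):
    """Yield positions p such that a chunk ends just before byte p."""
    top = pow(A, WINDOW - 1, MOD)
    h = 0
    for i, c in enumerate(data):
        if i >= WINDOW:
            h = (h - data[i - WINDOW] * top) % MOD  # drop the oldest byte
        h = (h * A + c) % MOD                       # append the new byte
        if i + 1 >= WINDOW and h & ((1 << N) - 1) == 0:
            yield i + 1                             # boundary after byte i

data = bytes(((i * 31) ^ (i >> 3)) & 0xFF for i in range(4096))
cuts = list(chunk_boundaries(data))
# Editing byte 0 can only move boundaries within one window of it:
changed = bytes([data[0] ^ 0xFF]) + data[1:]
cuts2 = list(chunk_boundaries(changed))
assert [c for c in cuts if c > WINDOW] == [c for c in cuts2 if c > WINDOW]
```

The final assertion demonstrates the locality property described above: boundaries more than a window away from a change are unaffected.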


Content-based slicing using moving sum

Several programs, including gzip (with the --rsyncable option) and rsyncrypto, do content-based slicing based on this specific (unweighted) moving sum ("Rsyncrypto Algorithm"):
:S(n) = \sum_{i=n-8195}^{n} c_i,
:H(n) = S(n) \mod 4096,
where
* S(n) is the sum of 8196 consecutive bytes ending with byte n (requires 21 bits of storage),
* c_i is byte i of the file,
* H(n) is a "hash value" consisting of the bottom 12 bits of S(n).
Shifting the window by one byte simply involves adding the new character to the sum and subtracting the oldest character (no longer in the window) from the sum. For every n where H(n) = 0, these programs cut the file between n and n+1. This approach will ensure that any change in the file will only affect its current and possibly the next chunk, but no other chunk.
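A sketch of the moving-sum scheme follows; a shortened window is used so the example stays small, whereas the programs above use the 8196-byte window:

```python
# Sketch of the unweighted moving-sum slicing above.  WINDOW is a toy
# stand-in for the real 8196-byte window; the 12-bit mask matches the
# "bottom 12 bits" rule (mod 4096).
WINDOW = 32
MASK = 4096 - 1

def cut_points(data: bytes):
    """Yield positions n where H(n) == 0, i.e. where the file is cut."""
    s = 0
    for n, c in enumerate(data):
        s += c                        # add the newest byte to the sum
        if n >= WINDOW:
            s -= data[n - WINDOW]     # subtract the byte leaving the window
        if s & MASK == 0:
            yield n                   # cut between n and n+1

data = bytes((i * 37) % 256 for i in range(2000))
for n in cut_points(data):
    start = max(0, n - WINDOW + 1)
    assert sum(data[start:n + 1]) % 4096 == 0   # H(n) really is 0
```

The update is a single addition and subtraction per byte, which is why this sum is cheaper than the multiplicative rolling hashes above.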


Gear fingerprint and content-based chunking algorithm FastCDC

Chunking is a technique to divide a data stream into a set of blocks, also called chunks. Content-defined chunking (CDC) is a chunking technique in which the division of the data stream is not based on a fixed chunk size, as in fixed-size chunking, but on its content. A content-defined chunking algorithm needs to compute the hash value of a data stream byte by byte and split the data stream into chunks when the hash value meets a predefined value. However, computing the hash byte by byte introduces heavy computation overhead. FastCDC proposes a new and efficient content-defined chunking approach. It uses a fast rolling Gear hash algorithm, skips the minimum chunk length, normalizes the chunk-size distribution, and, last but not least, rolls two bytes each time to speed up the CDC algorithm, achieving about 10X higher throughput than the Rabin-based CDC approach.{{cite journal , title=The Design of Fast Content-Defined Chunking for Data Deduplication Based Storage Systems , journal=IEEE Transactions on Parallel and Distributed Systems , date=2020-06-16 , doi=10.1109/TPDS.2020.2984632 , s2cid=215817722 , url=https://ieeexplore.ieee.org/document/9055082 , access-date=2020-07-22, last1=Xia , first1=Wen , last2=Zou , first2=Xiangyu , last3=Jiang , first3=Hong , last4=Zhou , first4=Yukun , last5=Liu , first5=Chuanyi , last6=Feng , first6=Dan , last7=Hua , first7=Yu , last8=Hu , first8=Yuchong , last9=Zhang , first9=Yucheng , volume=31 , issue=9 , pages=2017–2031}} The basic version pseudocode is provided as follows:

 algorithm FastCDC
     input: data buffer ''src'', data length ''n''
     output: cut point ''i''
     ''MinSize'' ← 2KB                    // split minimum chunk size is 2 KB
     ''MaxSize'' ← 64KB                   // split maximum chunk size is 64 KB
     ''fp'' ← 0
     ''i'' ← ''MinSize''
     ''Mask'' ← 0x0000d93003530000LL
     // buffer size is less than minimum chunk size
     if ''n'' ≤ ''MinSize'' then
         return ''n''
     if ''n'' ≥ ''MaxSize'' then
         ''n'' ← ''MaxSize''
     while ''i'' < ''n'' do
         ''fp'' ← (''fp'' << 1) + ''Gear''[''src''[''i'']]
         if !(''fp'' & ''Mask'') then
             return ''i''
         ''i'' ← ''i'' + 1
     return ''i''

where the Gear array is a pre-calculated hashing array. FastCDC uses the Gear hashing algorithm, which can calculate the rolling hash quickly while keeping the hashing results as uniformly distributed as Rabin's. Compared with the traditional Rabin hashing algorithm, it achieves a much faster speed. Experiments suggest that it can generate nearly the same chunk-size distribution in much less time (about 1/10 of Rabin-based chunking) when segmenting the data stream.


Computational complexity

All rolling hash functions can be computed in time linear in the number of characters and updated in constant time when characters are shifted by one position. In particular, computing the Rabin–Karp rolling hash of a string of length k requires O(k) modular arithmetic operations, and hashing by cyclic polynomials requires O(k) bitwise
exclusive or
s and
circular shift
s.


Software


rollinghashcpp
is a
free-software
C++ implementation of several rolling hash functions
rollinghashjava
is an Apache-licensed Java implementation of rolling hash functions


See also

*
MinHash
*
w-shingling


External links


MIT 6.006: Introduction to Algorithms (2011), Recitation 9: Rolling Hash

