Rabin–Karp Algorithm
   HOME

TheInfoList



OR:

In
computer science Computer science is the study of computation, automation, and information. Computer science spans theoretical disciplines (such as algorithms, theory of computation, information theory, and automation) to Applied science, practical discipli ...
, the Rabin–Karp algorithm or Karp–Rabin algorithm is a
string-searching algorithm In computer science, string-searching algorithms, sometimes called string-matching algorithms, are an important class of string algorithms that try to find a place where one or several string (computer science), strings (also called patterns) are f ...
created by that uses
hashing Hash, hashes, hash mark, or hashing may refer to: Substances * Hash (food), a coarse mixture of ingredients * Hash, a nickname for hashish, a cannabis product Hash mark * Hash mark (sports), a marking on hockey rinks and gridiron football fiel ...
to find an exact match of a pattern string in a text. It uses a
rolling hash A rolling hash (also known as recursive hashing or rolling checksum) is a hash function where the input is hashed in a window that moves through the input. A few hash functions allow a rolling hash to be computed very quickly—the new hash value ...
to quickly filter out positions of the text that cannot match the pattern, and then checks for a match at the remaining positions. Generalizations of the same idea can be used to find more than one match of a single pattern, or to find matches for more than one pattern. To find a single match of a single pattern, the
expected time In computational complexity theory, the average-case complexity of an algorithm is the amount of some computational resource (typically time) used by the algorithm, averaged over all possible inputs. It is frequently contrasted with worst-case com ...
of the algorithm is
linear Linearity is the property of a mathematical relationship (''function'') that can be graphically represented as a straight line. Linearity is closely related to '' proportionality''. Examples in physics include rectilinear motion, the linear r ...
in the combined length of the pattern and text, although its worst-case time complexity is the product of the two lengths. To find multiple matches, the expected time is linear in the input lengths, plus the combined length of all the matches, which could be greater than linear. In contrast, the Aho–Corasick algorithm can find all matches of multiple patterns in worst-case time and space linear in the input length and the number of matches (instead of the total length of the matches). A practical application of the algorithm is detecting plagiarism. Given source material, the algorithm can rapidly search through a paper for instances of sentences from the source material, ignoring details such as case and punctuation. Because of the abundance of the sought strings, single-string searching algorithms are impractical.


Overview

A naive string matching algorithm compares the given pattern against all positions in the given text. Each comparison takes time proportional to the length of the pattern, and the number of positions is proportional to the length of the text. Therefore, the worst-case time for such a method is proportional to the product of the two lengths. In many practical cases, this time can be significantly reduced by cutting short the comparison at each position as soon as a mismatch is found, but this idea cannot guarantee any speedup. Several string-matching algorithms, including the Knuth–Morris–Pratt algorithm and the
Boyer–Moore string-search algorithm In computer science, the Boyer–Moore string-search algorithm is an efficient string-searching algorithm that is the standard benchmark for practical string-search literature. It was developed by Robert S. Boyer and J Strother Moore in 1977. T ...
, reduce the worst-case time for string matching by extracting more information from each mismatch, allowing them to skip over positions of the text that are guaranteed not to match the pattern. The Rabin–Karp algorithm instead achieves its speedup by using a
hash function A hash function is any function that can be used to map data of arbitrary size to fixed-size values. The values returned by a hash function are called ''hash values'', ''hash codes'', ''digests'', or simply ''hashes''. The values are usually u ...
to quickly perform an approximate check for each position, and then only performing an exact comparison at the positions that pass this approximate check. A hash function is a function which converts every string into a numeric value, called its ''hash value''; for example, we might have hash("hello")=5. If two strings are equal, their hash values are also equal. For a well-designed hash function, the inverse is true, in an approximate sense: strings that are unequal are very unlikely to have equal hash values. The Rabin–Karp algorithm proceeds by computing, at each position of the text, the hash value of a string starting at that position with the same length as the pattern. If this hash value equals the hash value of the pattern, it performs a full comparison at that position. In order for this to work well, the hash function should be selected randomly from a family of hash functions that are unlikely to produce many
false positive A false positive is an error in binary classification in which a test result incorrectly indicates the presence of a condition (such as a disease when the disease is not present), while a false negative is the opposite error, where the test result ...
s, that is, positions of the text which have the same hash value as the pattern but do not actually match the pattern. These positions contribute to the running time of the algorithm unnecessarily, without producing a match. Additionally, the hash function used should be a
rolling hash A rolling hash (also known as recursive hashing or rolling checksum) is a hash function where the input is hashed in a window that moves through the input. A few hash functions allow a rolling hash to be computed very quickly—the new hash value ...
, a hash function whose value can be quickly updated from each position of the text to the next. Recomputing the hash function from scratch at each position would be too slow.


The algorithm

The algorithm is as shown: function RabinKarp(string s ..n string pattern ..m hpattern := hash(pattern ..m; for i from 1 to n-m+1 hs := hash(s ..i+m-1 if hs = hpattern if s ..i+m-1= pattern ..m return i return not found Lines 2, 4, and 6 each require O(''m'') time. However, line 2 is only executed once, and line 6 is only executed if the hash values match, which is unlikely to happen more than a few times. Line 5 is executed O(''n'') times, but each comparison only requires constant time, so its impact is O(''n''). The issue is line 4. Naively computing the hash value for the substring s +1..i+m/code> requires O(''m'') time because each character is examined. Since the hash computation is done on each loop, the algorithm with a naive hash computation requires O(mn) time, the same complexity as a straightforward string matching algorithm. For speed, the hash must be computed in constant time. The trick is the variable hs already contains the previous hash value of s ..i+m-1/code>. If that value can be used to compute the next hash value in constant time, then computing successive hash values will be fast. The trick can be exploited using a
rolling hash A rolling hash (also known as recursive hashing or rolling checksum) is a hash function where the input is hashed in a window that moves through the input. A few hash functions allow a rolling hash to be computed very quickly—the new hash value ...
. A rolling hash is a hash function specially designed to enable this operation. A trivial (but not very good) rolling hash function just adds the values of each character in the substring. This rolling hash formula can compute the next hash value from the previous value in constant time:
s +1..i+m= s ..i+m-1- s + s +m
This simple function works, but will result in statement 5 being executed more often than other more sophisticated rolling hash functions such as those discussed in the next section. Good performance requires a good hashing function for the encountered data. If the hashing is poor (such as producing the same hash value for every input), then line 6 would be executed O(''n'') times (i.e. on every iteration of the loop). Because character-by-character comparison of strings with length ''m'' takes O(m) time, the whole algorithm then takes a worst-case O(''mn'') time.


Hash function used

The key to the Rabin–Karp algorithm's performance is the efficient computation of
hash value A hash function is any function that can be used to map data of arbitrary size to fixed-size values. The values returned by a hash function are called ''hash values'', ''hash codes'', ''digests'', or simply ''hashes''. The values are usually u ...
s of the successive substrings of the text. The
Rabin fingerprint The Rabin fingerprinting scheme is a method for implementing fingerprints using polynomials over a finite field. It was proposed by Michael O. Rabin. Scheme Given an ''n''-bit message ''m''0,...,''m''n-1, we view it as a polynomial of degree ''n''- ...
is a popular and effective rolling hash function. The hash function described here is not a Rabin fingerprint, but it works equally well. It treats every substring as a number in some base, the base being usually the size of the character set. For example, if the substring is "hi", the base is 256, and prime modulus is 101, then the hash value would be 104 × 256 ) % 101 + 105% 101 = 65 (
ASCII ASCII ( ), abbreviated from American Standard Code for Information Interchange, is a character encoding standard for electronic communication. ASCII codes represent text in computers, telecommunications equipment, and other devices. Because of ...
of 'h' is 104 and of 'i' is 105) '%' is 'mod' or modulo, or remainder after integer division, operator Technically, this algorithm is only similar to the true number in a non-decimal system representation, since for example we could have the "base" less than one of the "digits". See
hash function A hash function is any function that can be used to map data of arbitrary size to fixed-size values. The values returned by a hash function are called ''hash values'', ''hash codes'', ''digests'', or simply ''hashes''. The values are usually u ...
for a much more detailed discussion. The essential benefit achieved by using a
rolling hash A rolling hash (also known as recursive hashing or rolling checksum) is a hash function where the input is hashed in a window that moves through the input. A few hash functions allow a rolling hash to be computed very quickly—the new hash value ...
such as the Rabin fingerprint is that it is possible to compute the hash value of the next substring from the previous one by doing only a constant number of operations, independent of the substrings' lengths. For example, if we have text "abracadabra" and we are searching for a pattern of length 3, the hash of the first substring, "abr", using 256 as the base, and 101 as the prime modulus is: // ASCII a = 97, b = 98, r = 114. hash("abr") = (_[_(_[__(97_×_256)_%_101_+_98_%_101_)_×_256_.html" ;"title="(_[__(97_×_256)_%_101_+_98_.html" ;"title="( [ ( [ (97 × 256) % 101 + 98 ">( [ ( [ (97 × 256) % 101 + 98 % 101 ) × 256 ">(_[__(97_×_256)_%_101_+_98_.html" ;"title="( [ ( [ (97 × 256) % 101 + 98 ">( [ ( [ (97 × 256) % 101 + 98 % 101 ) × 256 % 101 ) + 114 ] % 101 = 4 We can then compute the hash of the next substring, "bra", from the hash of "abr" by subtracting the number added for the first 'a' of "abr", i.e. 97 × 2562, multiplying by the base and adding for the last a of "bra", i.e. 97 × 2560. Like so: // ''old hash (-ve avoider)* old 'a' left base offset base shift new 'a prime modulus hash("bra") = (_4___+_101_________-__97_*_[(256%101)*256%_101_)_*_256_________+____97_.html" ;"title="256%101)*256.html" ;"title="( 4 + 101 - 97 * [(256%101)*256">( 4 + 101 - 97 * [(256%101)*256% 101 ) * 256 + 97 ">256%101)*256.html" ;"title="( 4 + 101 - 97 * [(256%101)*256">( 4 + 101 - 97 * [(256%101)*256% 101 ) * 256 + 97 % 101 = 30 * (-ve avoider) = "underflow avoider". Necessary if using unsigned integers for calculations. Because we know all hashes h \leq p for prime modulus $p$, we can ensure no underflow by adding p to the old hash before subtracting the value corresponding to the old 'a' (mod p). the last '* 256' is the shift of the subtracted hash to the left although ((256%101)*256)%101 is the same as 2562 % 101, to avoid overflowing integer maximums when the pattern string is longer (e.g. 'Rabin-Karp' is 10 characters, 2569 is the offset without modulation ), the pattern length base offset is pre-calculated in a loop, modulating the result each iteration If we are matching the search string "bra", using similar calculation of hash("abr"), hash'("bra") = [ ( [ ( [ ( 98 × 256) %101 + 114] % 101 ) × 256 ] % 101) + 97 ] % 101 = 30 If the substrings in question are long, this algorithm achieves great savings compared with many other hashing schemes. Theoretically, there exist other algorithms that could provide convenient recomputation, e.g. multiplying together ASCII values of all characters so that shifting substring would only entail dividing the previous hash by the first character value, then multiplying by the new last character's value. The limitation, however, is the limited size of the integer
data type In computer science and computer programming, a data type (or simply type) is a set of possible values and a set of allowed operations on it. A data type tells the compiler or interpreter how the programmer intends to use the data. Most progra ...
and the necessity of using
modular arithmetic In mathematics, modular arithmetic is a system of arithmetic for integers, where numbers "wrap around" when reaching a certain value, called the modulus. The modern approach to modular arithmetic was developed by Carl Friedrich Gauss in his book ...
to scale down the hash results, (see
hash function A hash function is any function that can be used to map data of arbitrary size to fixed-size values. The values returned by a hash function are called ''hash values'', ''hash codes'', ''digests'', or simply ''hashes''. The values are usually u ...
article). Meanwhile, naive hash functions do not produce large numbers quickly, but, just like adding ASCII values, are likely to cause many
hash collision In computer science, a hash collision or hash clash is when two pieces of data in a hash table share the same hash value. The hash value in this case is derived from a hash function which takes a data input and returns a fixed length of bits. Al ...
s and hence slow down the algorithm. Hence the described hash function is typically the preferred one in the Rabin–Karp algorithm.


Multiple pattern search

The Rabin–Karp algorithm is inferior for single pattern searching to Knuth–Morris–Pratt algorithm,
Boyer–Moore string-search algorithm In computer science, the Boyer–Moore string-search algorithm is an efficient string-searching algorithm that is the standard benchmark for practical string-search literature. It was developed by Robert S. Boyer and J Strother Moore in 1977. T ...
and other faster single pattern
string searching algorithm In computer science, string-searching algorithms, sometimes called string-matching algorithms, are an important class of string algorithms that try to find a place where one or several strings (also called patterns) are found within a larger stri ...
s because of its slow worst case behavior. However, it is a useful algorithm for multiple pattern search. To find any of a large number, say ''k'', fixed length patterns in a text, a simple variant of the Rabin–Karp algorithm uses a
Bloom filter A Bloom filter is a space-efficient probabilistic data structure, conceived by Burton Howard Bloom in 1970, that is used to test whether an element is a member of a set. False positive matches are possible, but false negatives are not – in ...
or a
set data structure In computer science, a set is an abstract data type that can store unique values, without any particular order. It is a computer implementation of the mathematical concept of a finite set. Unlike most other collection types, rather than retrievin ...
to check whether the hash of a given string belongs to a set of hash values of patterns we are looking for: function RabinKarpSet(string s ..n set of string subs, m): set hsubs := emptySet foreach sub in subs insert hash(sub ..m into hsubs hs := hash(s ..m for i from 1 to n-m+1 if hs ∈ hsubs and s ..i+m-1∈ subs return i hs := hash(s +1..i+m return not found We assume all the substrings have a fixed length ''m''. A naïve way to search for ''k'' patterns is to repeat a single-pattern search taking O(''n+m'') time, totaling in O(''(n+m)k'') time. In contrast, the above algorithm can find all ''k'' patterns in O(''n''+''km'') expected time, assuming that a hash table check works in O(1) expected time.


References

* * * (for the Bloom filter extension)
Yet another explanation


External links

* {{DEFAULTSORT:Rabin-Karp String Search Algorithm String matching algorithms Hashing