Boyer–Moore–Horspool Algorithm
   HOME

TheInfoList



OR:

In
computer science Computer science is the study of computation, automation, and information. Computer science spans theoretical disciplines (such as algorithms, theory of computation, information theory, and automation) to Applied science, practical discipli ...
, the Boyer–Moore–Horspool algorithm or Horspool's algorithm is an
algorithm In mathematics and computer science, an algorithm () is a finite sequence of rigorous instructions, typically used to solve a class of specific Computational problem, problems or to perform a computation. Algorithms are used as specificat ...
for finding
substring In formal language theory and computer science, a substring is a contiguous sequence of characters within a string. For instance, "''the best of''" is a substring of "''It was the best of times''". In contrast, "''Itwastimes''" is a subsequence ...
s in
strings String or strings may refer to: *String (structure), a long flexible structure made from threads twisted together, which is used to tie, bind, or hang other objects Arts, entertainment, and media Films * ''Strings'' (1991 film), a Canadian anim ...
. It was published by
Nigel Horspool R. Nigel Horspool is a retired professor of computer science, formerly of the University of Victoria. He invented the Boyer–Moore–Horspool algorithm, a fast string search algorithm adapted from the Boyer–Moore string-search algorithm. Horsp ...
in 1980 as SBM. It is a simplification of the
Boyer–Moore string-search algorithm In computer science, the Boyer–Moore string-search algorithm is an efficient string-searching algorithm that is the standard benchmark for practical string-search literature. It was developed by Robert S. Boyer and J Strother Moore in 1977. T ...
which is related to the
Knuth–Morris–Pratt algorithm In computer science, the Knuth–Morris–Pratt string-searching algorithm (or KMP algorithm) searches for occurrences of a "word" W within a main "text string" S by employing the observation that when a mismatch occurs, the word itself embodies s ...
. The algorithm trades space for time in order to obtain an
average-case complexity In computational complexity theory, the average-case complexity of an algorithm is the amount of some computational resource (typically time) used by the algorithm, averaged over all possible inputs. It is frequently contrasted with worst-case comp ...
of ''O(n)'' on random text, although it has ''O(nm)'' in the
worst case In computer science, best, worst, and average cases of a given algorithm express what the resource usage is ''at least'', ''at most'' and ''on average'', respectively. Usually the resource being considered is running time, i.e. time complexity, b ...
, where the length of the pattern is ''m'' and the length of the search string is ''n''.


Description

Like Boyer–Moore, Boyer–Moore–Horspool preprocesses the pattern to produce a table containing, for each symbol in the
alphabet An alphabet is a standardized set of basic written graphemes (called letters) that represent the phonemes of certain spoken languages. Not all writing systems represent language in this way; in a syllabary, each character represents a syll ...
, the number of characters that can safely be skipped. The preprocessing phase, in pseudocode, is as follows (for an alphabet of 256 symbols, i.e., bytes): ''Unlike the original, we use zero-based indices here.'' function preprocess(pattern) T ← new table of 256 integers for i from 0 to 256 exclusive T ← length(pattern) for i from 0 to length(pattern) - 1 exclusive T attern[i ← length(pattern) - 1 - i return T Pattern search proceeds as follows. The procedure reports the index of the first occurrence of in . function same(str1, str2, len) ''Compares two strings, up to the first len characters.'' i ← len - 1 while str1 = str2 ''Note: this is equivalent to !memcmp(str1, str2, len).'' if i = 0 ''The original algorithm tries to play smart here: it checks for the'' return true ''last character, and then starts from the first to the second-last.'' i ← i - 1 return false function search(needle, haystack) T ← preprocess(needle) skip ← 0 while length(haystack) - skip ≥ length(needle) ''haystack[skip:] -- substring starting with "skip". &haystack[skip] in C.'' if same(haystack[skip:], needle, length(needle)) return skip skip ← skip + T[haystack[skip + length(needle) - 1 return ''not-found''


Performance

The algorithm performs best with long needle strings, when it consistently hits a non-matching character at or near the final byte of the current position in the haystack and the final byte of the needle does not occur elsewhere within the needle. For instance a 32 byte needle ending in "z" searching through a 255 byte haystack which does not have a 'z' byte in it would take up to 224 byte comparisons. The best case is the same as for the Boyer–Moore string-search algorithm in
big O notation Big ''O'' notation is a mathematical notation that describes the limiting behavior of a function when the argument tends towards a particular value or infinity. Big O is a member of a family of notations invented by Paul Bachmann, Edmund Lan ...
, although the constant overhead of initialization and for each loop is less. The worst case behavior happens when the bad character skip is consistently low (with the lower limit of 1 byte movement) and a large portion of the needle matches the haystack. The bad character skip is only low, on a partial match, when the final character of the needle also occurs elsewhere within the needle, with 1 byte movement happening when the same byte is in both of the last two positions. The canonical degenerate case similar to the above "best" case is a needle of an 'a' byte followed by 31 'z' bytes in a haystack consisting of 255 'z' bytes. This will do 31 successful byte comparisons, a 1 byte comparison that fails and then move forward 1 byte. This process will repeat 223 more times (255 − 32), bringing the total byte comparisons to 7,168 (32 × 224). (A different byte-comparison loop will have a different behavior.) The worst case is significantly higher than for the Boyer–Moore string-search algorithm, although obviously this is hard to achieve in normal use cases. It is also worth noting that this worst case is also the worst case for the naive (but usual) memcmp() algorithm, although the implementation of that tends to be significantly optimized (and is more cache friendly).


Tuning the comparison loop

The original algorithm had a more sophisticated same() loop. It uses an extra pre-check before proceeding in the positive direction: function same_orig(str1, str2, len) i ← 0 if str1 en - 1= str2 en - 1 while str1 = str2 if i = len - 2 return true i ← i + 1 return false A tuned version of the BMH algorithm is the
Raita algorithm In computer science, the Raita algorithm is a string searching algorithm which improves the performance of Boyer–Moore–Horspool algorithm. This algorithm preprocesses the string being searched for the pattern, which is similar to Boyer–Moore ...
. It adds an additional precheck for the middle character, in the order of last-first-middle. The algorithm enters the full loop only when the check passes: function same_raita(str1, str2, len) i ← 0 mid ← len / 2 ''Three prechecks.'' if len ≥ 3 if str id!= str2 id return false if len ≥ 1 if str != str2 return false if len ≥ 2 if str en - 1!= str2 en - 1 return false ''Any old comparison loop.'' return len < 3 ''or'' SAME(&str1 &str2 len - 2) It is unclear whether this 1992 tuning still holds its performance advantage on modern machines. The rationale by the authors is that actual text usually contains some patterns which can be effectively prefiltered by these three characters. It appears that Raita is not aware of the old last-character precheck (he believed that the backward-only routine is the Horspool implementation), so readers are advised to take the results with a grain of salt. On modern machines, library functions like tends to provide better throughput than any of the hand-written comparison loops. The behavior of an "SFC" loop (Horspool's terminology) both in libstdc++ and libc++ seems to suggest that a modern Raita implementation should not include any of the one-character shifts, since they have detrimental effects on data alignment. Also see
String-searching algorithm In computer science, string-searching algorithms, sometimes called string-matching algorithms, are an important class of string algorithms that try to find a place where one or several string (computer science), strings (also called patterns) are f ...
which has detailed analysis of other string searching algorithms.


References


External links


Description of the algorithm

An implementation from V8 JavaScript engine written in C++
{{DEFAULTSORT:Boyer-Moore-Horspool algorithm String matching algorithms Articles with example C code