HOME

TheInfoList



OR:

The bitap algorithm (also known as the shift-or, shift-and or Baeza-Yates–Gonnet algorithm) is an
approximate string matching In computer science, approximate string matching (often colloquially referred to as fuzzy string searching) is the technique of finding strings that match a pattern approximately (rather than exactly). The problem of approximate string matching i ...
algorithm. The algorithm tells whether a given text contains a substring which is "approximately equal" to a given pattern, where approximate equality is defined in terms of
Levenshtein distance In information theory, linguistics, and computer science, the Levenshtein distance is a string metric for measuring the difference between two sequences. Informally, the Levenshtein distance between two words is the minimum number of single-charact ...
if the substring and pattern are within a given distance ''k'' of each other, then the algorithm considers them equal. The algorithm begins by precomputing a set of
bitmask In computer science, a mask or bitmask is data that is used for bitwise operations, particularly in a bit field. Using a mask, multiple bits in a byte, nibble, word, etc. can be set either on or off, or inverted from on to off (or vice versa) in ...
s containing one bit for each element of the pattern. Then it is able to do most of the work with
bitwise operation In computer programming, a bitwise operation operates on a bit string, a bit array or a binary numeral (considered as a bit string) at the level of its individual bits. It is a fast and simple action, basic to the higher-level arithmetic operati ...
s, which are extremely fast. The bitap algorithm is perhaps best known as one of the underlying algorithms of the
Unix Unix (; trademarked as UNIX) is a family of multitasking, multiuser computer operating systems that derive from the original AT&T Unix, whose development started in 1969 at the Bell Labs research center by Ken Thompson, Dennis Ritchie, and ot ...
utility As a topic of economics, utility is used to model worth or value. Its usage has evolved significantly over time. The term was introduced initially as a measure of pleasure or happiness as part of the theory of utilitarianism by moral philosopher ...
agrep, written by
Udi Manber Udi Manber ( he, אודי מנבר) is an Israeli computer scientist. He is one of the authors of agrep and GLIMPSE. After a career in engineering and management, he worked on medical research. Education He earned both his bachelor's degree in 1 ...
, Sun Wu, and Burra Gopal. Manber and Wu's original paper gives extensions of the algorithm to deal with fuzzy matching of general
regular expression A regular expression (shortened as regex or regexp; sometimes referred to as rational expression) is a sequence of characters that specifies a search pattern in text. Usually such patterns are used by string-searching algorithms for "find" or ...
s. Due to the data structures required by the algorithm, it performs best on patterns less than a constant length (typically the
word length In computing, a word is the natural unit of data used by a particular processor design. A word is a fixed-sized datum handled as a unit by the instruction set or the hardware of the processor. The number of bits or digits in a word (the ''word si ...
of the machine in question), and also prefers inputs over a small alphabet. Once it has been implemented for a given alphabet and word length ''m'', however, its
running time In computer science, the time complexity is the computational complexity that describes the amount of computer time it takes to run an algorithm. Time complexity is commonly estimated by counting the number of elementary operations performed by t ...
is completely predictableit runs in O(''mn'') operations, no matter the structure of the text or the pattern. The bitap algorithm for exact string searching was invented by Bálint Dömölki in 1964 and extended by R. K. Shyamasundar in 1977, before being reinvented by
Ricardo Baeza-Yates Ricardo A. Baeza-Yates (born March 21, 1961) is a Chilean- Catalan computer scientist that currently is a Research Professor at the Institute for Experiential AI of Northeastern University in the Silicon Valley campus. He is also part-time profe ...
and
Gaston Gonnet Gaston H. Gonnet is a Uruguayan Canadian computer scientist and entrepreneur. He is best known for his contributions to the Maple computer algebra system and the creation of a digital version of the Oxford English Dictionary. Education and ear ...
in 1989 (one chapter of first author's PhD thesis) which also extended it to handle classes of characters, wildcards, and mismatches. In 1991, it was extended by Manber and Wu to handle also insertions and deletions (full fuzzy string searching). This algorithm was later improved by Baeza-Yates and Navarro in 1996. __TOC__


Exact searching

The bitap algorithm for exact string searching, in full generality, looks like this in pseudocode: algorithm bitap_search is input: ''text'' as a string. ''pattern'' as a string. output: string ''m'' := length(''pattern'') if ''m'' = 0 then return ''text'' /* Initialize the bit array R. */ ''R'' := new array 'm''+1of bit, initially all 0 ''R'' := 1 for ''i'' := 0; ''i'' < length(''text''); ''i'' += 1 do /* Update the bit array. */ for ''k'' := ''m''; ''k'' ≥ 1; ''k'' -= 1 do ''R'' := ''R'' 'k'' - 1& (''text'' 'i''= ''pattern'' 'k'' - 1 if ''R'' 'm''then return (''text'' + ''i'' - ''m'') + 1 return null Bitap distinguishes itself from other well-known string searching algorithms in its natural mapping onto simple bitwise operations, as in the following modification of the above program. Notice that in this implementation, counterintuitively, each bit with value zero indicates a match, and each bit with value 1 indicates a non-match. The same algorithm can be written with the intuitive semantics for 0 and 1, but in that case we must introduce another instruction into the
inner loop Inner loop may refer to: * Inner loop in computer programs * Inner Loop (Phoenix), a section of Interstate 10 in downtown Phoenix, Arizona, United States * Inner Loop (Rochester), an expressway around downtown Rochester, New York, United States * I ...
to set R , = 1. In this implementation, we take advantage of the fact that left-shifting a value shifts in zeros on the right, which is precisely the behavior we need. Notice also that we require CHAR_MAX additional bitmasks in order to convert the (text

pattern -1
condition in the general implementation into bitwise operations. Therefore, the bitap algorithm performs better when applied to inputs over smaller alphabets. #include #include const char *bitap_bitwise_search(const char *text, const char *pattern)


Fuzzy searching

To perform fuzzy string searching using the bitap algorithm, it is necessary to extend the bit array ''R'' into a second dimension. Instead of having a single array ''R'' that changes over the length of the text, we now have ''k'' distinct arrays ''R''1..''k''. Array ''Ri'' holds a representation of the prefixes of ''pattern'' that match any suffix of the current string with ''i'' or fewer errors. In this context, an "error" may be an insertion, deletion, or substitution; see
Levenshtein distance In information theory, linguistics, and computer science, the Levenshtein distance is a string metric for measuring the difference between two sequences. Informally, the Levenshtein distance between two words is the minimum number of single-charact ...
for more information on these operations. The implementation below performs
fuzzy matching Record linkage (also known as data matching, data linkage, entity resolution, and many other terms) is the task of finding records in a data set that refer to the same entity across different data sources (e.g., data files, books, websites, and da ...
(returning the first match with up to ''k'' errors) using the fuzzy bitap algorithm. However, it only pays attention to substitutions, not to insertions or deletionsin other words, a
Hamming distance In information theory, the Hamming distance between two strings of equal length is the number of positions at which the corresponding symbols are different. In other words, it measures the minimum number of ''substitutions'' required to chan ...
of ''k''. As before, the semantics of 0 and 1 are reversed from their conventional meanings. #include #include #include const char *bitap_fuzzy_bitwise_search(const char *text, const char *pattern, int k)


See also

* agrep *
TRE (computing) TRE is an open-source library for pattern matching in text, which works like a regular expression engine with the ability to do approximate string matching. It was developed by Ville Laurikari and is distributed under a 2-clause BSD-like license. ...


External links and references

# Bálint Dömölki, An algorithm for syntactical analysis, Computational Linguistics 3, Hungarian Academy of Science pp. 29–46, 1964. # Bálint Dömölki, A universal compiler system based on production rules,
BIT Numerical Mathematics ''BIT Numerical Mathematics'' is a quarterly peer-reviewed mathematics journal that covers research in numerical analysis. It was established in 1961 by Carl Erik Fröberg and is published by Springer Science+Business Media. The name "BIT" is a re ...
, 8(4), pp 262–275, 1968. # R. K. Shyamasundar, Precedence parsing using Dömölki's algorithm,
International Journal of Computer Mathematics The ''International Journal of Computer Mathematics'' is a monthly peer-reviewed scientific journal covering numerical analysis and scientific computing. It was established in 1964 and is published by Taylor & Francis. The editors-in-chief are Ch ...
, 6(2)pp 105–114, 1977. # Ricardo Baeza-Yates. "Efficient Text Searching." PhD Thesis, University of Waterloo, Canada, May 1989. # Udi Manber, Sun Wu. "Fast text searching with errors." Technical Report TR-91-11. Department of Computer Science,
University of Arizona The University of Arizona (Arizona, U of A, UArizona, or UA) is a public land-grant research university in Tucson, Arizona. Founded in 1885 by the 13th Arizona Territorial Legislature, it was the first university in the Arizona Territory. T ...
, Tucson, June 1991.
gzipped PostScript
# Ricardo Baeza-Yates, Gastón H. Gonnet. "A New Approach to Text Searching." ''
Communications of the ACM ''Communications of the ACM'' is the monthly journal of the Association for Computing Machinery (ACM). It was established in 1958, with Saul Rosen as its first managing editor. It is sent to all ACM members. Articles are intended for readers with ...
'', 35(10): pp. 74–82, October 1992. # Udi Manber, Sun Wu. "Fast text search allowing errors." ''
Communications of the ACM ''Communications of the ACM'' is the monthly journal of the Association for Computing Machinery (ACM). It was established in 1958, with Saul Rosen as its first managing editor. It is sent to all ACM members. Articles are intended for readers with ...
'', 35(10): pp. 83–91, October 1992, . # R. Baeza-Yates and G. Navarro. A faster algorithm for approximate string matching. In Dan Hirchsberg and Gene Myers, editors, ''Combinatorial Pattern Matching'' (CPM'96), LNCS 1075, pages 1–23, Irvine, CA, June 1996. # G. Myers. "A fast bit-vector algorithm for approximate string matching based on dynamic programming." ''
Journal of the ACM The ''Journal of the ACM'' is a peer-reviewed scientific journal covering computer science in general, especially theoretical aspects. It is an official journal of the Association for Computing Machinery. Its current editor-in-chief is Venkatesan ...
'' 46 (3), May 1999, 395–415.
libbitap
a free implementation that shows how the algorithm can easily be extended for most regular expressions. Unlike the code above, it places no limit on the pattern length. #Ricardo Baeza-Yates, Berthier Ribeiro-Neto. ''Modern Information Retrieval''. 1999. .
bitap.py
- Python implementation of Bitap algorithm with Wu-Manber modifications. {{Strings String matching algorithms Articles with example C code