HOME

TheInfoList



OR:

In
computer science Computer science is the study of computation, automation, and information. Computer science spans theoretical disciplines (such as algorithms, theory of computation, information theory, and automation) to Applied science, practical discipli ...
, approximate string matching (often colloquially referred to as fuzzy string searching) is the technique of finding
strings String or strings may refer to: *String (structure), a long flexible structure made from threads twisted together, which is used to tie, bind, or hang other objects Arts, entertainment, and media Films * ''Strings'' (1991 film), a Canadian anim ...
that match a
pattern A pattern is a regularity in the world, in human-made design, or in abstract ideas. As such, the elements of a pattern repeat in a predictable manner. A geometric pattern is a kind of pattern formed of geometric shapes and typically repeated l ...
approximately (rather than exactly). The problem of approximate string matching is typically divided into two sub-problems: finding approximate
substring In formal language theory and computer science, a substring is a contiguous sequence of characters within a string. For instance, "''the best of''" is a substring of "''It was the best of times''". In contrast, "''Itwastimes''" is a subsequence ...
matches inside a given string and finding dictionary strings that match the pattern approximately.


Overview

The closeness of a match is measured in terms of the number of primitive operations necessary to convert the string into an exact match. This number is called the
edit distance In computational linguistics and computer science, edit distance is a string metric, i.e. a way of quantifying how dissimilar two strings (e.g., words) are to one another, that is measured by counting the minimum number of operations required to tr ...
between the string and the pattern. The usual primitive operations are: * insertion: ''cot'' → ''coat'' * deletion: ''coat'' → ''cot'' * substitution: ''coat'' → ''cost'' These three operations may be generalized as forms of substitution by adding a NULL character (here symbolized by *) wherever a character has been deleted or inserted: * insertion: ''co*t'' → ''coat'' * deletion: ''coat'' → ''co*t'' * substitution: ''coat'' → ''cost'' Some approximate matchers also treat ''transposition'', in which the positions of two letters in the string are swapped, to be a primitive operation. * transposition: ''cost'' → ''cots'' Different approximate matchers impose different constraints. Some matchers use a single global unweighted cost, that is, the total number of primitive operations necessary to convert the match to the pattern. For example, if the pattern is ''coil'', ''foil'' differs by one substitution, ''coils'' by one insertion, ''oil'' by one deletion, and ''foal'' by two substitutions. If all operations count as a single unit of cost and the limit is set to one, ''foil'', ''coils'', and ''oil'' will count as matches while ''foal'' will not. Other matchers specify the number of operations of each type separately, while still others set a total cost but allow different weights to be assigned to different operations. Some matchers permit separate assignments of limits and weights to individual groups in the pattern.


Problem formulation and algorithms

One possible definition of the approximate string matching problem is the following: Given a pattern string P = p_1p_2...p_m and a text string T = t_1t_2\dots t_n, find a substring T_ = t_\dots t_j in ''T'', which, of all substrings of ''T'', has the smallest edit distance to the pattern ''P''. A brute-force approach would be to compute the edit distance to P for all substrings of T, and then choose the substring with the minimum distance. However, this algorithm would have the running time ''O''(''n''3 ''m''). A better solution, which was proposed by Sellers, relies on
dynamic programming Dynamic programming is both a mathematical optimization method and a computer programming method. The method was developed by Richard Bellman in the 1950s and has found applications in numerous fields, from aerospace engineering to economics. I ...
. It uses an alternative formulation of the problem: for each position ''j'' in the text ''T'' and each position ''i'' in the pattern ''P'', compute the minimum edit distance between the ''i'' first characters of the pattern, P_i, and any substring T_ of ''T'' that ends at position ''j''. For each position ''j'' in the text ''T'', and each position ''i'' in the pattern ''P'', go through all substrings of ''T'' ending at position ''j'', and determine which one of them has the minimal edit distance to the ''i'' first characters of the pattern ''P''. Write this minimal distance as ''E''(''i'', ''j''). After computing ''E''(''i'', ''j'') for all ''i'' and ''j'', we can easily find a solution to the original problem: it is the substring for which ''E''(''m'', ''j'') is minimal (''m'' being the length of the pattern ''P''.) Computing ''E''(''m'', ''j'') is very similar to computing the edit distance between two strings. In fact, we can use the Levenshtein distance computing algorithm for ''E''(''m'', ''j''), the only difference being that we must initialize the first row with zeros, and save the path of computation, that is, whether we used ''E''(''i'' − 1,''j''), E(''i'',''j'' − 1) or ''E''(''i'' − 1,''j'' − 1) in computing ''E''(''i'', ''j''). In the array containing the ''E''(''x'', ''y'') values, we then choose the minimal value in the last row, let it be ''E''(''x''2, ''y''2), and follow the path of computation backwards, back to the row number 0. If the field we arrived at was ''E''(0, ''y''1), then ''T'' 'y''1 + 1nbsp;... ''T'' 'y''2is a substring of T with the minimal edit distance to the pattern ''P''. Computing the ''E''(''x'', ''y'') array takes ''O''(''mn'') time with the dynamic programming algorithm, while the backwards-working phase takes ''O''(''n'' + ''m'') time. Another recent idea is the similarity join. When matching database relates to a large scale of data, the ''O''(''mn'') time with the dynamic programming algorithm cannot work within a limited time. So, the idea is to reduce the number of candidate pairs, instead of computing the similarity of ''all'' pairs of strings. Widely used algorithms are based on filter-verification, hashing,
Locality-sensitive hashing In computer science, locality-sensitive hashing (LSH) is an algorithmic technique that hashes similar input items into the same "buckets" with high probability. (The number of buckets is much smaller than the universe of possible input items.) Since ...
(LSH),
Trie In computer science, a trie, also called digital tree or prefix tree, is a type of ''k''-ary search tree, a tree data structure used for locating specific keys from within a set. These keys are most often strings, with links between nodes def ...
s and other greedy and approximation algorithms. Most of them are designed to fit some framework (such as Map-Reduce) to compute concurrently.


On-line versus off-line

Traditionally, approximate string matching algorithms are classified into two categories:
on-line In computer technology and telecommunications, online indicates a state of connectivity and offline indicates a disconnected state. In modern terminology, this usually refers to an Internet connection, but (especially when expressed "on line" or ...
and off-line. With on-line algorithms the pattern can be processed before searching but the text cannot. In other words, on-line techniques do searching without an index. Early algorithms for on-line approximate matching were suggested by Wagner and Fisher and by Sellers. Both algorithms are based on
dynamic programming Dynamic programming is both a mathematical optimization method and a computer programming method. The method was developed by Richard Bellman in the 1950s and has found applications in numerous fields, from aerospace engineering to economics. I ...
but solve different problems. Sellers' algorithm searches approximately for a substring in a text while the algorithm of Wagner and Fisher calculates
Levenshtein distance In information theory, linguistics, and computer science, the Levenshtein distance is a string metric for measuring the difference between two sequences. Informally, the Levenshtein distance between two words is the minimum number of single-charact ...
, being appropriate for dictionary fuzzy search only. On-line searching techniques have been repeatedly improved. Perhaps the most famous improvement is the bitap algorithm (also known as the shift-or and shift-and algorithm), which is very efficient for relatively short pattern strings. The Bitap algorithm is the heart of the
Unix Unix (; trademarked as UNIX) is a family of multitasking, multiuser computer operating systems that derive from the original AT&T Unix, whose development started in 1969 at the Bell Labs research center by Ken Thompson, Dennis Ritchie, and ot ...
searching
utility As a topic of economics, utility is used to model worth or value. Its usage has evolved significantly over time. The term was introduced initially as a measure of pleasure or happiness as part of the theory of utilitarianism by moral philosopher ...
agrep. A review of on-line searching algorithms was done by G. Navarro. Although very fast on-line techniques exist, their performance on large data is unacceptable. Text preprocessing or indexing makes searching dramatically faster. Today, a variety of indexing algorithms have been presented. Among them are
suffix tree In computer science, a suffix tree (also called PAT tree or, in an earlier form, position tree) is a compressed trie containing all the suffixes of the given text as their keys and positions in the text as their values. Suffix trees allow particu ...
s,
metric tree A metric tree is any tree data structure specialized to index data in metric spaces. Metric trees exploit properties of metric spaces such as the triangle inequality to make accesses to the data more efficient. Examples include the M-tree, vp-t ...
s and
n-gram In the fields of computational linguistics and probability, an ''n''-gram (sometimes also called Q-gram) is a contiguous sequence of ''n'' items from a given sample of text or speech. The items can be phonemes, syllables, letters, words or b ...
methods. A detailed survey of indexing techniques that allows one to find an arbitrary substring in a text is given by Navarro ''et al.'' A computational survey of dictionary methods (i.e., methods that permit finding all dictionary words that approximately match a search pattern) is given by Boytsov.


Applications

Common applications of approximate matching include
spell checking In software, a spell checker (or spelling checker or spell check) is a software feature that checks for misspellings in a text file, text. Spell-checking features are often embedded in software or services, such as a word processor, email client, el ...
. With the availability of large amounts of DNA data, matching of
nucleotide Nucleotides are organic molecules consisting of a nucleoside and a phosphate. They serve as monomeric units of the nucleic acid polymers – deoxyribonucleic acid (DNA) and ribonucleic acid (RNA), both of which are essential biomolecules wi ...
sequences has become an important application. Approximate matching is also used in
spam filtering Various anti-spam techniques are used to prevent email spam (unsolicited bulk email). No technique is a complete solution to the spam problem, and each has trade-offs between incorrectly rejecting legitimate email (false positives) as opposed to ...
.
Record linkage Record linkage (also known as data matching, data linkage, entity resolution, and many other terms) is the task of finding records in a data set that refer to the same entity across different data sources (e.g., data files, books, websites, and da ...
is a common application where records from two disparate databases are matched. String matching cannot be used for most binary data, such as images and music. They require different algorithms, such as
acoustic fingerprint An acoustic fingerprint is a condensed digital summary, a fingerprint, deterministically generated from an audio signal, that can be used to identify an audio sample or quickly locate similar items in an audio database. Practical uses of aco ...
ing. A common command-line tool ''fzf'' is often used to integrate approximate string searching into various command-line applications.


See also

*
Concept search A concept search (or conceptual search) is an automated information retrieval method that is used to search electronically stored unstructured data, unstructured text (for example, digital archives, email, scientific literature, etc.) for informatio ...
*
Jaro–Winkler distance In computer science and statistics, the Jaro–Winkler distance is a string metric measuring an edit distance between two sequences. It is a variant proposed in 1990 by William E. Winkler of the Jaro distance metric (1989, Matthew A. Jaro). ...
*
Levenshtein distance In information theory, linguistics, and computer science, the Levenshtein distance is a string metric for measuring the difference between two sequences. Informally, the Levenshtein distance between two words is the minimum number of single-charact ...
*
Locality-sensitive hashing In computer science, locality-sensitive hashing (LSH) is an algorithmic technique that hashes similar input items into the same "buckets" with high probability. (The number of buckets is much smaller than the universe of possible input items.) Since ...
*
Metaphone Metaphone is a phonetic algorithm, published by Lawrence Philips in 1990, for indexing words by their English pronunciation. It fundamentally improves on the Soundex algorithm by using information about variations and inconsistencies in English sp ...
*
Needleman–Wunsch algorithm The Needleman–Wunsch algorithm is an algorithm used in bioinformatics to align protein or nucleotide sequences. It was one of the first applications of dynamic programming to compare biological sequences. The algorithm was developed by Saul ...
*
Plagiarism detection Plagiarism detection or content similarity detection is the process of locating instances of plagiarism or copyright infringement within a work or document. The widespread use of computers and the advent of the Internet have made it easier to pla ...
*
Regular expressions A regular expression (shortened as regex or regexp; sometimes referred to as rational expression) is a sequence of characters that specifies a search pattern in text. Usually such patterns are used by string-searching algorithms for "find" or ...
for fuzzy and non-fuzzy matching *
Smith–Waterman algorithm The Smith–Waterman algorithm performs local sequence alignment; that is, for determining similar regions between two strings of nucleic acid sequences or protein sequences. Instead of looking at the entire sequence, the Smith–Waterman algorit ...
*
Soundex Soundex is a phonetic algorithm for indexing names by sound, as pronounced in English. The goal is for homophones to be encoded to the same representation so that they can be matched despite minor differences in spelling. The algorithm mainly enc ...
*
String metric In mathematics and computer science, a string metric (also known as a string similarity metric or string distance function) is a metric that measures distance ("inverse similarity") between two text strings for approximate string matching or com ...


References

* * * * * * * * * * * * * *


External links


Flamingo Project


with recent advances in approximate string matching based on an edit distance threshold.
StringMetric project
a Scala library of string metrics and phonetic algorithms
Natural project
a
JavaScript JavaScript (), often abbreviated as JS, is a programming language that is one of the core technologies of the World Wide Web, alongside HTML and CSS. As of 2022, 98% of Website, websites use JavaScript on the Client (computing), client side ...
natural language processing library which includes implementations of popular string metrics {{DEFAULTSORT:Approximate String Matching * Pattern matching Dynamic programming