HOME

TheInfoList



OR:

In
information theory Information theory is the mathematical study of the quantification (science), quantification, Data storage, storage, and telecommunications, communication of information. The field was established and formalized by Claude Shannon in the 1940s, ...
and
computer science Computer science is the study of computation, information, and automation. Computer science spans Theoretical computer science, theoretical disciplines (such as algorithms, theory of computation, and information theory) to Applied science, ...
, the Damerau–Levenshtein distance (named after Frederick J. Damerau and Vladimir I. Levenshtein.) is a
string metric In mathematics and computer science, a string metric (also known as a string similarity metric or string distance function) is a metric (mathematics), metric that measures distance ("inverse similarity") between two string (computer science), tex ...
for measuring the
edit distance In computational linguistics and computer science, edit distance is a string metric, i.e. a way of quantifying how dissimilar two String (computing), strings (e.g., words) are to one another, that is measured by counting the minimum number of opera ...
between two sequences. Informally, the Damerau–Levenshtein distance between two words is the minimum number of operations (consisting of insertions, deletions or substitutions of a single character, or transposition of two adjacent characters) required to change one word into the other. The Damerau–Levenshtein distance differs from the classical Levenshtein distance by including transpositions among its allowable operations in addition to the three classical single-character edit operations (insertions, deletions and substitutions). In his seminal paper, Damerau stated that in an investigation of spelling errors for an information-retrieval system, more than 80% were a result of a single error of one of the four types. Damerau's paper considered only misspellings that could be corrected with at most one edit operation. While the original motivation was to measure distance between human misspellings to improve applications such as
spell checker In software, a spell checker (or spelling checker or spell check) is a software feature that checks for misspellings in a text. Spell-checking features are often embedded in software or services, such as a word processor, email client, electronic ...
s, Damerau–Levenshtein distance has also seen uses in biology to measure the variation between protein sequences.


Definition

To express the Damerau–Levenshtein distance between two strings a and b, a function d_(i,j) is defined, whose value is a distance between an prefix (initial substring) of string a and a prefix of b. The ''restricted distance'' function is defined recursively as: d_(i,j) = \min \begin 0 & \text i = j = 0, \\ d_(i-1,j) + 1 & \text i > 0, \\ d_(i,j-1) + 1 & \text j > 0, \\ d_(i-1,j-1) + 1_ & \text i, j > 0, \\ d_(i-2,j-2) + 1_ & \text i, j > 1 \text a_i = b_ \text a_ = b_j, \\ \end where 1_ is the
indicator function In mathematics, an indicator function or a characteristic function of a subset of a set is a function that maps elements of the subset to one, and all other elements to zero. That is, if is a subset of some set , then the indicator functio ...
equal to 0 when a_i = b_j and equal to 1 otherwise. Each recursive call matches one of the cases covered by the Damerau–Levenshtein distance: * d_(i-1,j) + 1 corresponds to a deletion (from ''a'' to ''b''), * d_(i,j-1) + 1 corresponds to an insertion (from ''a'' to ''b''), * d_(i-1,j-1) + 1_ corresponds to a match or mismatch, depending on whether the respective symbols are the same, * d_(i-2,j-2) + 1_ corresponds to a transposition between two successive symbols. The Damerau–Levenshtein distance between and is then given by the function value for full strings: d_\big(, a, , , b, \big), where i = , a, denotes the length of string , and j = , b, is the length of .


Algorithm

Presented here are two algorithms: the first, simpler one, computes what is known as the ''optimal string alignment distance'' or ''restricted edit distance'', while the second one computes the Damerau–Levenshtein distance with adjacent transpositions. Adding transpositions adds significant complexity. The difference between the two algorithms consists in that the ''optimal string alignment algorithm'' computes the number of edit operations needed to make the strings equal under the condition that ''no substring is edited more than once'', whereas the second one presents no such restriction. Take for example the edit distance between CA and ABC. The Damerau–Levenshtein distance LD(CA, ABC) = 2 because CA → AC → ABC, but the optimal string alignment distance OSA(CA, ABC) = 3 because if the operation CA → AC is used, it is not possible to use AC → ABC because that would require the substring to be edited more than once, which is not allowed in OSA, and therefore the shortest sequence of operations is CA → A → AB → ABC. Note that for the optimal string alignment distance, the
triangle inequality In mathematics, the triangle inequality states that for any triangle, the sum of the lengths of any two sides must be greater than or equal to the length of the remaining side. This statement permits the inclusion of Degeneracy (mathematics)#T ...
does not hold: OSA(CA, AC) + OSA(AC, ABC) < OSA(CA, ABC), and so it is not a true metric.


Optimal string alignment distance

Optimal string alignment distance can be computed using a straightforward extension of the Wagner–Fischer dynamic programming algorithm that computes Levenshtein distance. In
pseudocode In computer science, pseudocode is a description of the steps in an algorithm using a mix of conventions of programming languages (like assignment operator, conditional operator, loop) with informal, usually self-explanatory, notation of actio ...
: algorithm OSA-distance is input: strings a ..length(a) b ..length(b) output: distance, integer let d ..length(a), 0..length(b)be a 2-d array of integers, dimensions length(a)+1, length(b)+1 ''// note that d is zero-indexed, while a and b are one-indexed.'' for i := 0 to length(a) inclusive do d
, 0 The comma is a punctuation mark that appears in several variants in different languages. Some typefaces render it as a small line, slightly curved or straight, but inclined from the vertical; others give it the appearance of a miniature fille ...
:= i for j := 0 to length(b) inclusive do d
, j The comma is a punctuation mark that appears in several variants in different languages. Some typefaces render it as a small line, slightly curved or straight, but inclined from the vertical; others give it the appearance of a miniature fille ...
:= j for i := 1 to length(a) inclusive do for j := 1 to length(b) inclusive do if a = b then cost := 0 else cost := 1 d
, j The comma is a punctuation mark that appears in several variants in different languages. Some typefaces render it as a small line, slightly curved or straight, but inclined from the vertical; others give it the appearance of a miniature fille ...
:= minimum(d -1, j+ 1, ''// deletion'' d
, j-1 The comma is a punctuation mark that appears in several variants in different languages. Some typefaces render it as a small line, slightly curved or straight, but inclined from the vertical; others give it the appearance of a miniature fille ...
+ 1, ''// insertion'' d -1, j-1+ cost) ''// substitution'' if i > 1 and j > 1 and a = b -1and a -1= b then d
, j The comma is a punctuation mark that appears in several variants in different languages. Some typefaces render it as a small line, slightly curved or straight, but inclined from the vertical; others give it the appearance of a miniature fille ...
:= minimum(d
, j The comma is a punctuation mark that appears in several variants in different languages. Some typefaces render it as a small line, slightly curved or straight, but inclined from the vertical; others give it the appearance of a miniature fille ...
d -2, j-2+ 1) ''// transposition'' return d ength(a), length(b) The difference from the algorithm for Levenshtein distance is the addition of one recurrence: if i > 1 and j > 1 and a = b -1and a -1= b then d
, j The comma is a punctuation mark that appears in several variants in different languages. Some typefaces render it as a small line, slightly curved or straight, but inclined from the vertical; others give it the appearance of a miniature fille ...
:= minimum(d
, j The comma is a punctuation mark that appears in several variants in different languages. Some typefaces render it as a small line, slightly curved or straight, but inclined from the vertical; others give it the appearance of a miniature fille ...
d -2, j-2+ 1) ''// transposition''


Distance with adjacent transpositions

The following algorithm computes the true Damerau–Levenshtein distance with adjacent transpositions; this algorithm requires as an additional parameter the size of the alphabet , so that all entries of the arrays are in : algorithm DL-distance is input: strings a ..length(a) b ..length(b) output: distance, integer da := new array of , Σ, integers for i := 1 to , Σ, inclusive do da := 0 let d ��1..length(a), −1..length(b)be a 2-d array of integers, dimensions length(a)+2, length(b)+2 ''// note that d has indices starting at −1, while a, b and da are one-indexed.'' maxdist := length(a) + length(b) d ��1, −1:= maxdist for i := 0 to length(a) inclusive do d , −1:= maxdist d
, 0 The comma is a punctuation mark that appears in several variants in different languages. Some typefaces render it as a small line, slightly curved or straight, but inclined from the vertical; others give it the appearance of a miniature fille ...
:= i for j := 0 to length(b) inclusive do d ��1, j:= maxdist d
, j The comma is a punctuation mark that appears in several variants in different languages. Some typefaces render it as a small line, slightly curved or straight, but inclined from the vertical; others give it the appearance of a miniature fille ...
:= j for i := 1 to length(a) inclusive do db := 0 for j := 1 to length(b) inclusive do k := da [j â„“ := db if a = b then cost := 0 db := j else cost := 1 d
, j The comma is a punctuation mark that appears in several variants in different languages. Some typefaces render it as a small line, slightly curved or straight, but inclined from the vertical; others give it the appearance of a miniature fille ...
:= minimum(d[i−1, j−1] + cost, ''//substitution'' d[i, j−1] + 1, ''//insertion'' d[i−1, j ] + 1, ''//deletion'' d −1, ℓ−1+ (i−k−1) + 1 + (j-ℓ−1)) ''//transposition'' da [i := i return d ength(a), length(b) To devise a proper algorithm to calculate unrestricted Damerau–Levenshtein distance, note that there always exists an optimal sequence of edit operations, where once-transposed letters are never modified afterwards. (This holds as long as the cost of a transposition, W_T, is at least the average of the cost of an insertion and deletion, i.e., 2W_T \ge W_I + W_D.) Thus, we need to consider only two symmetric ways of modifying a substring more than once: (1) transpose letters and insert an arbitrary number of characters between them, or (2) delete a sequence of characters and transpose letters that become adjacent after deletion. The straightforward implementation of this idea gives an algorithm of cubic complexity: O\big(M \cdot N \cdot \max(M, N)\big), where ''M'' and ''N'' are string lengths. Using the ideas of Lowrance and Wagner, this naive algorithm can be improved to be O(M \cdot N) in the worst case, which is what the above pseudocode does. It is interesting that the bitap algorithm can be modified to process transposition. See the information retrieval section of for an example of such an adaptation.


Applications

Damerau–Levenshtein distance plays an important role in
natural language processing Natural language processing (NLP) is a subfield of computer science and especially artificial intelligence. It is primarily concerned with providing computers with the ability to process data encoded in natural language and is thus closely related ...
. In natural languages, strings are short and the number of errors (misspellings) rarely exceeds 2. In such circumstances, restricted and real edit distance differ very rarely. Oommen and Loke even mitigated the limitation of the restricted edit distance by introducing ''generalized transpositions''. Nevertheless, one must remember that the restricted edit distance usually does not satisfy the
triangle inequality In mathematics, the triangle inequality states that for any triangle, the sum of the lengths of any two sides must be greater than or equal to the length of the remaining side. This statement permits the inclusion of Degeneracy (mathematics)#T ...
, and thus cannot be used with
metric tree A metric tree is any tree data structure specialized to index data in metric spaces. Metric trees exploit properties of metric spaces such as the triangle inequality to make accesses to the data more efficient. Examples include the M-tree, vp-tr ...
s.


DNA

Since
DNA Deoxyribonucleic acid (; DNA) is a polymer composed of two polynucleotide chains that coil around each other to form a double helix. The polymer carries genetic instructions for the development, functioning, growth and reproduction of al ...
frequently undergoes insertions, deletions, substitutions, and transpositions, and each of these operations occurs on approximately the same timescale, the Damerau–Levenshtein distance is an appropriate metric of the variation between two strands of DNA. More common in DNA, protein, and other bioinformatics related alignment tasks is the use of closely related algorithms such as Needleman–Wunsch algorithm or Smith–Waterman algorithm.


Fraud detection

The algorithm can be used with any set of words, like vendor names. Since entry is manual by nature, there is a risk of entering a false vendor. A fraudster employee may enter one real vendor such as "Rich Heir Estate Services" versus a false vendor "Rich Hier State Services". The fraudster would then create a false bank account and have the company route checks to the real vendor and false vendor. The Damerau–Levenshtein algorithm will detect the transposed and dropped letter and bring attention of the items to a fraud examiner.


Export control

The U.S. Government uses the Damerau–Levenshtein distance with its Consolidated Screening List API.


See also

*
Ispell Ispell is a spelling checker for Unix that supports most Western languages. It offers several interfaces, including a programmatic interface for use by editors such as Emacs. Unlike GNU Aspell, ispell will only suggest corrections that are based ...
– Suggests corrections that are based on a Damerau–Levenshtein distance of 1 *


References


Further reading

* {{DEFAULTSORT:Damerau-Levenshtein Distance String metrics Information theory Dynamic programming