Jaro–Winkler distance

In computer science and statistics, the Jaro–Winkler distance is a string metric measuring an edit distance between two sequences. It is a variant of the Jaro distance metric (Matthew A. Jaro, 1989) proposed in 1990 by William E. Winkler.

The Jaro–Winkler distance uses a prefix scale p which gives more favourable ratings to strings that match from the beginning for a set prefix length \ell. The higher the Jaro–Winkler distance for two strings is, the less similar the strings are. The score is normalized such that 0 means an exact match and 1 means there is no similarity. The original paper defined the metric in terms of similarity, so the distance is defined as the complement of that value (distance = 1 − similarity).

Although often referred to as a ''distance metric'', the Jaro–Winkler distance is not a metric in the mathematical sense of that term because it does not obey the triangle inequality.


Definition


Jaro similarity

The Jaro similarity sim_j of two given strings s_1 and s_2 is

: sim_j = \left\{ \begin{array}{l l} 0 & \text{if } m = 0\\ \frac{1}{3}\left(\frac{m}{|s_1|} + \frac{m}{|s_2|} + \frac{m-t}{m}\right) & \text{otherwise} \end{array} \right.

where:
* |s_i| is the length of the string s_i;
* m is the number of ''matching characters'' (see below);
* t is the number of ''transpositions'' (see below).

The Jaro similarity score is 0 if the strings do not match at all, and 1 if they are an exact match.

In the first step, each character of s_1 is compared with all its matching characters in s_2. Two characters from s_1 and s_2 respectively are considered ''matching'' only if they are the same and not farther than \left\lfloor\frac{\max(|s_1|, |s_2|)}{2}\right\rfloor - 1 characters apart. For example, the two nine-character strings FAREMVIEL and FARMVILLE have 8 matching characters. 'F', 'A' and 'R' are in the same position in both strings. 'M', 'V', 'I', 'E' and 'L' are each no more than three characters apart (the result of \lfloor\tfrac{\max(9, 9)}{2}\rfloor - 1).

If no matching characters are found, the strings are not similar and the algorithm terminates by returning a Jaro similarity score of 0. If at least one matching character is found, the next step is to find the number of transpositions. The number of transpositions is the number of matching characters that are not in the right order, divided by two. In the above example of FAREMVIEL and FARMVILLE, 'E' and 'L' are the matching characters that are not in the right order, so the number of transpositions is one.

Finally, plugging in the number of matching characters m and the number of transpositions t, the Jaro similarity of FAREMVIEL and FARMVILLE is

: \frac{1}{3}\left(\frac{8}{9} + \frac{8}{9} + \frac{8-1}{8}\right) \approx 0.88
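The steps above can be sketched in Python as follows. This is a minimal illustration, not the original implementation: the function name `jaro_similarity`, the greedy left-to-right matching, and the clamping of the match window to zero for very short strings are assumptions made for this example.

```python
def jaro_similarity(s1: str, s2: str) -> float:
    """Jaro similarity of s1 and s2 (illustrative sketch)."""
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0  # no characters can match, so m = 0

    # Characters match only if equal and at most `window` positions apart.
    # Clamping to 0 for very short strings is an assumption of this sketch.
    window = max(max(len1, len2) // 2 - 1, 0)

    matched1 = [False] * len1
    matched2 = [False] * len2
    m = 0
    for i, ch in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len2, i + window + 1)
        for j in range(lo, hi):
            # Take the first unmatched, equal character inside the window.
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                m += 1
                break
    if m == 0:
        return 0.0

    # Read the matched characters in each string's own order; every
    # position where the two sequences disagree is half a transposition.
    seq1 = [c for c, hit in zip(s1, matched1) if hit]
    seq2 = [c for c, hit in zip(s2, matched2) if hit]
    t = sum(a != b for a, b in zip(seq1, seq2)) // 2

    return (m / len1 + m / len2 + (m - t) / m) / 3


print(round(jaro_similarity("FAREMVIEL", "FARMVILLE"), 2))  # 0.88
```

For the example pair the sketch finds m = 8 and t = 1, reproducing the value computed by hand.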


Jaro–Winkler similarity

Jaro–Winkler similarity uses a prefix scale p which gives more favorable ratings to strings that match from the beginning for a set prefix length \ell. Given two strings s_1 and s_2, their Jaro–Winkler similarity sim_w is:

: sim_w = sim_j + \ell p (1 - sim_j),

where:
* sim_j is the Jaro similarity for strings s_1 and s_2
* \ell is the length of the common prefix at the start of the strings, up to a maximum of 4 characters
* p is a constant scaling factor for how much the score is adjusted upwards for having common prefixes. p should not exceed 0.25 (i.e. 1/4, with 4 being the maximum length of the prefix being considered), otherwise the similarity could become larger than 1. The standard value for this constant in Winkler's work is p = 0.1.

The Jaro–Winkler distance d_w is defined as d_w = 1 - sim_w.

Although often referred to as a ''distance metric'', the Jaro–Winkler distance is not a metric in the mathematical sense of that term because it does not obey the triangle inequality. The Jaro–Winkler distance also does not satisfy the identity axiom d(x,y)=0 \leftrightarrow x = y.
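As a rough illustration of the prefix adjustment, the Python sketch below applies the formula for sim_w to an already computed Jaro similarity. The function names and the `max_prefix` parameter are hypothetical, chosen for this example only. The Jaro similarity is passed in as an argument so that the sketch stays self-contained.

```python
def jaro_winkler_similarity(sim_j: float, s1: str, s2: str,
                            p: float = 0.1, max_prefix: int = 4) -> float:
    """Apply the Winkler prefix adjustment to a Jaro similarity sim_j."""
    # p above 1 / max_prefix could push the result past 1.
    if not 0 <= p <= 1 / max_prefix:
        raise ValueError("p must lie between 0 and 1 / max_prefix")

    # ell: length of the common prefix, capped at max_prefix characters.
    ell = 0
    for a, b in zip(s1, s2):
        if a != b or ell == max_prefix:
            break
        ell += 1

    return sim_j + ell * p * (1 - sim_j)


def jaro_winkler_distance(sim_j: float, s1: str, s2: str,
                          p: float = 0.1) -> float:
    """Distance d_w = 1 - sim_w."""
    return 1 - jaro_winkler_similarity(sim_j, s1, s2, p)


# Jaro similarity of FAREMVIEL and FARMVILLE with m = 8 and t = 1.
sim_j = (8 / 9 + 8 / 9 + (8 - 1) / 8) / 3
sim_w = jaro_winkler_similarity(sim_j, "FAREMVIEL", "FARMVILLE")
print(round(sim_w, 3))  # 0.919
```

FAREMVIEL and FARMVILLE share the prefix "FAR", so \ell = 3 and the similarity rises from about 0.884 to about 0.919 with the standard p = 0.1.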


Relationship with other edit distance metrics

There are other popular measures of edit distance, which are calculated using a different set of allowable edit operations. For instance,
* the Levenshtein distance allows deletion, insertion and substitution;
* the Damerau–Levenshtein distance allows insertion, deletion, substitution, and the transposition of two adjacent characters;
* the longest common subsequence (LCS) distance allows only insertion and deletion, not substitution;
* the Hamming distance allows only substitution, hence, it only applies to strings of the same length.

Edit distance is usually defined as a parameterizable metric calculated with a specific set of allowed edit operations, and each operation is assigned a cost (possibly infinite). This is further generalized by DNA sequence alignment algorithms such as the Smith–Waterman algorithm, which make an operation's cost depend on where it is applied.
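The idea of an edit distance parameterized by operation costs can be sketched with a small dynamic-programming routine in Python. This is an illustrative sketch only: the function names and the parameters `ins`, `dele` and `sub` are assumptions made for this example. With unit costs it computes the Levenshtein distance, and with an infinite substitution cost it reduces to the LCS distance.

```python
import math


def edit_distance(a: str, b: str, ins=1, dele=1, sub=1):
    """Minimum total cost of turning a into b with the given operation costs."""
    # prev[j] is the cost of turning the processed prefix of a into b[:j].
    prev = [j * ins for j in range(len(b) + 1)]
    for i, ca in enumerate(a, start=1):
        curr = [i * dele]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + dele,                          # delete ca
                curr[j - 1] + ins,                       # insert cb
                prev[j - 1] + (0 if ca == cb else sub),  # keep or substitute
            ))
        prev = curr
    return prev[-1]


def hamming_distance(a: str, b: str) -> int:
    """Number of differing positions; only defined for equal lengths."""
    if len(a) != len(b):
        raise ValueError("strings must have the same length")
    return sum(x != y for x, y in zip(a, b))


print(edit_distance("kitten", "sitting"))                # 3 (Levenshtein)
print(edit_distance("kitten", "sitting", sub=math.inf))  # 5 (LCS distance)
print(hamming_distance("karolin", "kathrin"))            # 3
```

The Damerau–Levenshtein distance would need an extra case for adjacent transpositions, which this sketch omits.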


See also

* Record linkage
* Census




External links


strcmp.c - Original C implementation by the author of the algorithm


Python implementation in the Natural Language Toolkit