In computer science and statistics, the Jaro–Winkler similarity is a string metric measuring an edit distance between two sequences. It is a variant of the Jaro distance metric (1989, Matthew A. Jaro) proposed in 1990 by William E. Winkler.

The Jaro–Winkler distance uses a prefix scale $p$ which gives more favourable ratings to strings that match from the beginning for a set prefix length $\ell$.

The higher the Jaro–Winkler distance for two strings is, the less similar the strings are. The score is normalized such that 0 means an exact match and 1 means there is no similarity. The original paper defined the metric in terms of similarity, so the distance is defined as the inversion of that value (distance = 1 − similarity).

Although often referred to as a distance metric, the Jaro–Winkler distance is not a metric in the mathematical sense of that term because it does not obey the triangle inequality.

Definition

Jaro similarity

The Jaro similarity $sim_j$ of two given strings $s_1$ and $s_2$ is:

$$sim_j = \begin{cases} 0 & \text{if } m = 0 \\ \frac{1}{3}\left(\frac{m}{|s_1|} + \frac{m}{|s_2|} + \frac{m-t}{m}\right) & \text{otherwise} \end{cases}$$

where:

* $|s_i|$ is the length of the string $s_i$;
* $m$ is the number of matching characters (see below);
* $t$ is the number of transpositions (see below).

The Jaro similarity score is 0 if the strings do not match at all, and 1 if they are an exact match.

In the first step, each character of $s_1$ is compared with all its matching characters in $s_2$. Two characters from $s_1$ and $s_2$ respectively are considered matching only if they are the same and not farther than

$$\left\lfloor\frac{\max(|s_1|, |s_2|)}{2}\right\rfloor - 1$$

characters apart. For example, the two nine-character strings FAREMVIEL and FARMVILLE have 8 matching characters. 'F', 'A' and 'R' are in the same position in both strings. Also 'M', 'V', 'I', 'E' and 'L' are within three (the result of $\lfloor\tfrac{\max(9, 9)}{2}\rfloor - 1$) characters of each other.

If no matching characters are found, the strings are not similar and the algorithm terminates by returning a Jaro similarity score of 0. If matching characters are found, the next step is to find the number of transpositions. The number of transpositions is the number of matching characters that are not in the right order, divided by two. In the above example between FAREMVIEL and FARMVILLE, 'E' and 'L' are the matching characters that are not in the right order, so the number of transpositions is one.

Finally, plugging in the number of matching characters $m$ and the number of transpositions $t$, the Jaro similarity of FAREMVIEL and FARMVILLE can be calculated:

$$\frac{1}{3}\left(\frac{8}{9} + \frac{8}{9} + \frac{8-1}{8}\right) \approx 0.88$$
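The matching and transposition steps described above can be sketched in Python as follows. This is a minimal illustration rather than a reference implementation: the function name `jaro_similarity` is hypothetical, matches are assigned greedily from left to right (a common choice that the text does not mandate), and the matching window is clamped at 0 so that very short strings can still match, which is an assumption beyond the formula given here.

```python
def jaro_similarity(s1: str, s2: str) -> float:
    """Jaro similarity of s1 and s2, following the definition above."""
    if not s1 or not s2:
        return 0.0  # no characters can match, so m = 0

    # Matching window: floor(max(|s1|, |s2|) / 2) - 1.
    # Assumption: clamp at 0 so single-character strings can still match.
    window = max(max(len(s1), len(s2)) // 2 - 1, 0)

    matched1 = [False] * len(s1)
    matched2 = [False] * len(s2)
    m = 0
    for i, ch in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len(s2), i + window + 1)
        # Greedy choice: take the first unused equal character in the window.
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                m += 1
                break

    if m == 0:
        return 0.0

    # Matched characters of each string, in their original order.
    seq1 = [c for c, used in zip(s1, matched1) if used]
    seq2 = [c for c, used in zip(s2, matched2) if used]
    # t is half the number of matched characters that are out of order.
    t = sum(a != b for a, b in zip(seq1, seq2)) // 2

    return (m / len(s1) + m / len(s2) + (m - t) / m) / 3


print(round(jaro_similarity("FAREMVIEL", "FARMVILLE"), 2))  # 0.88
```

For the FAREMVIEL and FARMVILLE example this sketch finds $m = 8$ and $t = 1$, in agreement with the worked calculation.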

Jaro–Winkler similarity

Jaro–Winkler similarity uses a prefix scale $p$ which gives more favorable ratings to strings that match from the beginning for a set prefix length $\ell$. Given two strings $s_1$ and $s_2$, their Jaro–Winkler similarity $sim_w$ is:

$$sim_w = sim_j + \ell p (1 - sim_j),$$

where:

* $sim_j$ is the Jaro similarity for strings $s_1$ and $s_2$;
* $\ell$ is the length of the common prefix at the start of the strings, up to a maximum of 4 characters;
* $p$ is a constant scaling factor for how much the score is adjusted upwards for having common prefixes. $p$ should not exceed 0.25 (i.e. 1/4, with 4 being the maximum length of the prefix being considered), otherwise the similarity could become larger than 1. The standard value for this constant in Winkler's work is $p = 0.1$.
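As a rough sketch, the prefix adjustment can be written as a small Python function that takes an already computed Jaro similarity as input. The name `jaro_winkler_similarity`, its parameter names, and the decision to raise an error when $p$ exceeds 1 divided by the maximum prefix length are illustrative assumptions, not part of Winkler's definition.

```python
def jaro_winkler_similarity(sim_j: float, s1: str, s2: str,
                            p: float = 0.1, max_prefix: int = 4) -> float:
    """Apply the prefix bonus sim_w = sim_j + l * p * (1 - sim_j)."""
    # Assumption: reject values of p that could push the result above 1.
    if not 0.0 <= p <= 1.0 / max_prefix:
        raise ValueError("p must lie between 0 and 1 / max_prefix")

    # l: length of the common prefix, capped at max_prefix characters.
    ell = 0
    for a, b in zip(s1, s2):
        if a != b or ell == max_prefix:
            break
        ell += 1

    return sim_j + ell * p * (1.0 - sim_j)


# FAREMVIEL and FARMVILLE share the prefix "FAR" (l = 3); their Jaro
# similarity, from the worked example above, is 191/216.
print(round(jaro_winkler_similarity(191 / 216, "FAREMVIEL", "FARMVILLE"), 3))  # 0.919
```

Passing the Jaro similarity in as an argument keeps the sketch independent of any particular Jaro implementation.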

The Jaro–Winkler distance $d_w$ is defined as $d_w = 1 - sim_w$.

Although often referred to as a distance metric, the Jaro–Winkler distance is not a metric in the mathematical sense of that term because it does not obey the triangle inequality. The Jaro–Winkler distance also does not satisfy the identity axiom $d(x,y) = 0 \leftrightarrow x = y$.
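To make the triangle inequality failure concrete, here is one small worked example. The strings "ab", "acbd" and "cd" are chosen for illustration and do not come from the original papers; the values follow from the definitions above with $p = 0.1$. For strings of maximum length 4 the matching window is 1. Between "ab" and "acbd", both 'a' and 'b' match in order, so $m = 2$, $t = 0$, $sim_j = \tfrac{1}{3}(\tfrac{2}{2} + \tfrac{2}{4} + \tfrac{2}{2}) = \tfrac{5}{6}$ and $\ell = 1$. Between "acbd" and "cd", only 'c' matches (the two 'd' characters are two positions apart), so $m = 1$, $sim_j = \tfrac{1}{3}(\tfrac{1}{4} + \tfrac{1}{2} + 1) = \tfrac{7}{12}$ and $\ell = 0$. Between "ab" and "cd" there are no matching characters, so $sim_j = 0$.

```latex
\begin{align*}
d_w(\text{ab}, \text{acbd}) &= 1 - \left[\tfrac{5}{6} + 1 \cdot 0.1 \cdot \left(1 - \tfrac{5}{6}\right)\right] = 0.15 \\
d_w(\text{acbd}, \text{cd}) &= 1 - \tfrac{7}{12} \approx 0.417 \\
d_w(\text{ab}, \text{cd}) &= 1 - 0 = 1 \\
d_w(\text{ab}, \text{acbd}) + d_w(\text{acbd}, \text{cd}) &\approx 0.567 < 1 = d_w(\text{ab}, \text{cd})
\end{align*}
```

Going through "acbd" is therefore shorter than the direct distance from "ab" to "cd", which a true metric would not allow.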

Relationship with other edit distance metrics

There are other popular measures of edit distance, which are calculated using a different set of allowable edit operations. For instance,

* the Levenshtein distance allows deletion, insertion and substitution;
* the Damerau–Levenshtein distance allows insertion, deletion, substitution, and the transposition of two adjacent characters;
* the longest common subsequence (LCS) distance allows only insertion and deletion, not substitution;
* the Hamming distance allows only substitution, hence it only applies to strings of the same length.

Edit distance is usually defined as a parameterizable metric calculated with a specific set of allowed edit operations, and each operation is assigned a cost (possibly infinite). This is further generalized by DNA sequence alignment algorithms such as the Smith–Waterman algorithm, which make an operation's cost depend on where it is applied.
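The idea of a parameterizable edit distance can be illustrated with a short dynamic-programming sketch in Python. The function name `edit_distance` and its cost parameters are hypothetical. Setting a cost to infinity forbids the corresponding operation, which recovers three of the measures listed above. Transposition, and therefore the Damerau–Levenshtein distance, is not covered by this sketch.

```python
import math


def edit_distance(a: str, b: str, ins: float = 1.0,
                  dele: float = 1.0, sub: float = 1.0) -> float:
    """Minimum total cost of turning a into b with the given operation costs."""
    # prev[j] is the cost of turning the empty prefix of a into b[:j].
    # Costs are accumulated by addition so that infinite costs stay well defined.
    prev = [0.0]
    for _ in b:
        prev.append(prev[-1] + ins)

    for i in range(1, len(a) + 1):
        cur = [prev[0] + dele]
        for j in range(1, len(b) + 1):
            same = a[i - 1] == b[j - 1]
            cur.append(min(
                prev[j] + dele,                         # delete a[i-1]
                cur[j - 1] + ins,                       # insert b[j-1]
                prev[j - 1] + (0.0 if same else sub),   # keep or substitute
            ))
        prev = cur
    return prev[-1]


# Unit costs for all three operations: Levenshtein distance.
print(edit_distance("kitten", "sitting"))                # 3.0
# Substitution forbidden: LCS distance.
print(edit_distance("kitten", "sitting", sub=math.inf))  # 5.0
# Only substitution allowed: Hamming distance (equal-length strings).
print(edit_distance("karolin", "kathrin", ins=math.inf, dele=math.inf))  # 3.0
```

With insertion and deletion forbidden, strings of different lengths get an infinite distance, which mirrors the restriction of the Hamming distance to equal-length strings.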

External links

* strcmp.c - Original C implementation by the author of the algorithm
* Python implementation in the Natural Language Toolkit