In computer science and statistics, the Jaro–Winkler distance is a string metric measuring an edit distance between two sequences. It is a variant proposed in 1990 by William E. Winkler of the Jaro distance metric (1989, Matthew A. Jaro).

The Jaro–Winkler distance uses a prefix scale $p$ which gives more favourable ratings to strings that match from the beginning for a set prefix length $\ell$.

The higher the Jaro–Winkler distance for two strings is, the less similar the strings are. The score is normalized such that 0 means an exact match and 1 means there is no similarity. The original paper defined the metric in terms of similarity, so the distance is defined as the complement of that value (distance = 1 − similarity).

Although often referred to as a distance metric, the Jaro–Winkler distance is not a metric in the mathematical sense of that term because it does not obey the triangle inequality.

Definition

Jaro similarity

The Jaro similarity $sim_j$ of two given strings $s_1$ and $s_2$ is

$$
sim_j = \left\{
\begin{array}{l l}
  0 & \text{if } m = 0\\
  \frac{1}{3}\left(\frac{m}{|s_1|} + \frac{m}{|s_2|} + \frac{m-t}{m}\right) & \text{otherwise}
\end{array} \right.
$$

where:

* $|s_i|$ is the length of the string $s_i$;
* $m$ is the number of matching characters (see below);
* $t$ is the number of transpositions (see below).

The Jaro similarity score is 0 if the strings do not match at all, and 1 if they are an exact match.

In the first step, each character of $s_1$ is compared with all its matching characters in $s_2$. Two characters from $s_1$ and $s_2$ respectively are considered matching only if they are the same and not farther than $\left\lfloor\frac{\max(|s_1|, |s_2|)}{2}\right\rfloor - 1$ characters apart. For example, the two nine-character strings FAREMVIEL and FARMVILLE have 8 matching characters: 'F', 'A' and 'R' are in the same position in both strings, and 'M', 'V', 'I', 'E' and 'L' are within three characters (the result of $\lfloor\tfrac{\max(9, 9)}{2}\rfloor - 1$) of their counterparts. If no matching characters are found, the strings are not similar and the algorithm terminates by returning a Jaro similarity score of 0.

If matching characters are found, the next step is to find the number of transpositions. The number of transpositions is the number of matching characters that are not in the right order, divided by two. In the example above, 'E' and 'L' are the matching characters that are not in the right order, so the number of transpositions is one.

Finally, plugging in the number of matching characters $m$ and the number of transpositions $t$, the Jaro similarity of FAREMVIEL and FARMVILLE can be calculated:

$$\frac{1}{3}\left(\frac{8}{9} + \frac{8}{9} + \frac{8-1}{8}\right) \approx 0.88$$
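The matching and transposition steps can be sketched in Python as follows. This is a minimal illustration of the definition, not the original implementation: the function name `jaro_similarity` is hypothetical, the match window is clamped to zero for very short strings, and an odd out-of-order count is halved with integer division. Both of those choices are assumptions that the text above does not specify.

```python
def jaro_similarity(s1: str, s2: str) -> float:
    """Jaro similarity of two strings, following the definition in the text."""
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0  # no characters can match, so m = 0

    # Largest allowed offset between two matching characters
    # (clamped at 0 for very short strings; an assumption of this sketch).
    window = max(max(len1, len2) // 2 - 1, 0)

    matched1 = [False] * len1
    matched2 = [False] * len2
    m = 0
    for i, ch in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len2, i + window + 1)
        # Take the first unused equal character of s2 inside the window.
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                m += 1
                break
    if m == 0:
        return 0.0

    # Read off the matched characters of each string in order; t is half
    # the number of positions where the two sequences disagree.
    seq1 = [c for c, hit in zip(s1, matched1) if hit]
    seq2 = [c for c, hit in zip(s2, matched2) if hit]
    t = sum(a != b for a, b in zip(seq1, seq2)) // 2

    return (m / len1 + m / len2 + (m - t) / m) / 3


print(round(jaro_similarity("FAREMVIEL", "FARMVILLE"), 2))  # 0.88
```

For the example pair the sketch finds $m = 8$ and $t = 1$, which reproduces the value computed by hand.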

Jaro–Winkler similarity

Jaro–Winkler similarity uses a prefix scale $p$ which gives more favourable ratings to strings that match from the beginning for a set prefix length $\ell$. Given two strings $s_1$ and $s_2$, their Jaro–Winkler similarity $sim_w$ is

$$sim_w = sim_j + \ell p (1 - sim_j),$$

where:

* $sim_j$ is the Jaro similarity for strings $s_1$ and $s_2$;
* $\ell$ is the length of the common prefix at the start of the strings, up to a maximum of 4 characters;
* $p$ is a constant scaling factor for how much the score is adjusted upwards for having common prefixes. $p$ should not exceed 0.25 (i.e. 1/4, with 4 being the maximum length of the prefix being considered), otherwise the similarity could become larger than 1. The standard value for this constant in Winkler's work is $p = 0.1$.
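As a rough illustration of the prefix adjustment, the sketch below applies the formula to an already computed Jaro similarity. The function name `jaro_winkler_similarity`, its parameter names, and the decision to reject values of `p` that could push the result above 1 are assumptions made for this example, not part of Winkler's description.

```python
def jaro_winkler_similarity(sim_j: float, s1: str, s2: str,
                            p: float = 0.1, max_prefix: int = 4) -> float:
    """Apply the Winkler prefix adjustment to a given Jaro similarity sim_j."""
    if not 0 <= p * max_prefix <= 1:
        # With p > 1 / max_prefix the result could exceed 1.
        raise ValueError("p must lie between 0 and 1 / max_prefix")

    # ell: length of the common prefix, capped at max_prefix characters.
    ell = 0
    for a, b in zip(s1, s2):
        if a != b or ell == max_prefix:
            break
        ell += 1

    return sim_j + ell * p * (1 - sim_j)


# (8/9 + 8/9 + 7/8) / 3, the Jaro similarity of the FAREMVIEL / FARMVILLE example
sim_j = 191 / 216
sim_w = jaro_winkler_similarity(sim_j, "FAREMVIEL", "FARMVILLE")
print(round(sim_w, 3))      # 0.919
print(round(1 - sim_w, 3))  # 0.081
```

The two strings share the prefix "FAR", so $\ell = 3$ and the similarity rises from about 0.884 to about 0.919. The second printed value is the corresponding distance, 1 minus the similarity.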

The Jaro–Winkler distance $d_w$ is defined as $d_w = 1 - sim_w$.

Although often referred to as a distance metric, the Jaro–Winkler distance is not a metric in the mathematical sense of that term because it does not obey the triangle inequality. The Jaro–Winkler distance also does not satisfy the identity axiom $d(x,y) = 0 \leftrightarrow x = y$.

Relationship with other edit distance metrics

There are other popular measures of edit distance, which are calculated using a different set of allowable edit operations. For instance,

* the Levenshtein distance allows deletion, insertion and substitution;
* the Damerau–Levenshtein distance allows insertion, deletion, substitution, and the transposition of two adjacent characters;
* the longest common subsequence (LCS) distance allows only insertion and deletion, not substitution;
* the Hamming distance allows only substitution; hence, it only applies to strings of the same length.

Edit distance is usually defined as a parameterizable metric calculated with a specific set of allowed edit operations, and each operation is assigned a cost (possibly infinite). This is further generalized by DNA sequence alignment algorithms such as the Smith–Waterman algorithm, which make an operation's cost depend on where it is applied.
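To make the idea of a parameterizable set of operations concrete, here is a small Python sketch with unit costs. The names `edit_distance`, `allow_substitution` and `hamming_distance` are hypothetical. Switching substitution off turns the Levenshtein distance into the LCS distance, and the Hamming distance is the substitution-only case. Transpositions (Damerau–Levenshtein) and position-dependent costs are deliberately left out.

```python
def edit_distance(a: str, b: str, allow_substitution: bool = True) -> int:
    """Unit-cost edit distance by dynamic programming.

    With substitution allowed this is the Levenshtein distance;
    without it, only insertions and deletions remain (LCS distance).
    """
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            best = min(prev[j] + 1, curr[j - 1] + 1)  # deletion, insertion
            if ca == cb:
                best = min(best, prev[j - 1])          # characters agree
            elif allow_substitution:
                best = min(best, prev[j - 1] + 1)      # substitution
            curr.append(best)
        prev = curr
    return prev[-1]


def hamming_distance(a: str, b: str) -> int:
    """Number of differing positions; defined only for equal lengths."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(x != y for x, y in zip(a, b))


print(edit_distance("kitten", "sitting"))                            # 3
print(edit_distance("kitten", "sitting", allow_substitution=False))  # 5
print(hamming_distance("karolin", "kathrin"))                        # 3
```

Disallowing an operation is equivalent to giving it an infinite cost, which is how the text's "possibly infinite" cost shows up in this sketch.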


External links

* strcmp.c - Original C implementation by the author of the algorithm
* Python implementation in the Natural Language Toolkit