In computer science and statistics, the Jaro–Winkler distance is a string metric measuring an edit distance between two sequences. It is a variant proposed in 1990 by William E. Winkler of the Jaro distance metric (1989, Matthew A. Jaro).
The Jaro–Winkler distance uses a prefix scale $p$ which gives more favourable ratings to strings that match from the beginning for a set prefix length $\ell$.
The higher the Jaro–Winkler distance for two strings is, the less similar the strings are. The score is normalized such that 0 means an exact match and 1 means there is no similarity. The original paper actually defined the metric in terms of similarity, so the distance is defined as the inversion of that value (distance = 1 − similarity).
Although often referred to as a ''distance metric'', the Jaro–Winkler distance is not a metric in the mathematical sense of that term because it does not obey the triangle inequality.
Definition
Jaro similarity
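The Jaro similarity $sim_j$ of two given strings $s_1$ and $s_2$ is
: $sim_j = \begin{cases} 0 & \text{if } m = 0 \\ \frac{1}{3}\left(\frac{m}{|s_1|} + \frac{m}{|s_2|} + \frac{m - t}{m}\right) & \text{otherwise} \end{cases}$
Where:
* $|s_i|$ is the length of the string $s_i$;
* $m$ is the number of ''matching characters'' (see below);
* $t$ is the number of ''transpositions'' (see below).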
The Jaro similarity score is 0 if the strings do not match at all, and 1 if they are an exact match. In the first step, each character of $s_1$ is compared with all its matching characters in $s_2$. Two characters from $s_1$ and $s_2$ respectively are considered ''matching'' only if they are the same and not farther than $\left\lfloor \frac{\max(|s_1|, |s_2|)}{2} \right\rfloor - 1$ characters apart. For example, the two nine-character strings FAREMVIEL and FARMVILLE have 8 matching characters. 'F', 'A' and 'R' are in the same position in both strings. Also 'M', 'V', 'I', 'E' and 'L' are no more than three (result of $\lfloor 9/2 \rfloor - 1$) characters away from their counterparts. If no matching characters are found then the strings are not similar and the algorithm terminates by returning Jaro similarity score 0.
If a non-zero number of matching characters is found, the next step is to find the number of transpositions. The number of transpositions is the number of matching characters that are not in the right order, divided by two. In the above example between FAREMVIEL and FARMVILLE, 'E' and 'L' are the matching characters that are not in the right order, so the number of transpositions is one.
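Finally, plugging in the number of matching characters $m = 8$ and number of transpositions $t = 1$, the Jaro similarity of FAREMVIEL and FARMVILLE can be calculated:
: $sim_j = \frac{1}{3}\left(\frac{8}{9} + \frac{8}{9} + \frac{8 - 1}{8}\right) \approx 0.88$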
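The matching and transposition steps described above can be sketched in Python. This is a minimal illustration rather than a reference implementation: the function name `jaro_similarity` is hypothetical, and clamping the match window at zero for very short strings is an assumption the text does not state.

```python
def jaro_similarity(s1: str, s2: str) -> float:
    """Jaro similarity following the matching/transposition steps above."""
    len1, len2 = len(s1), len(s2)
    # Matching characters may be at most this many positions apart.
    # Assumption: clamp at 0 so one-character strings can still match.
    window = max(max(len1, len2) // 2 - 1, 0)

    matched1 = [False] * len1
    matched2 = [False] * len2
    m = 0
    for i, ch in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len2, i + window + 1)
        for j in range(lo, hi):
            # Each character of s2 can be matched at most once.
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                m += 1
                break

    if m == 0:
        return 0.0  # no matching characters: similarity is 0

    # Matched characters of each string, in their original order.
    seq1 = [c for c, hit in zip(s1, matched1) if hit]
    seq2 = [c for c, hit in zip(s2, matched2) if hit]
    # Out-of-order matches, divided by two, give the transpositions.
    t = sum(a != b for a, b in zip(seq1, seq2)) / 2

    return (m / len1 + m / len2 + (m - t) / m) / 3


print(round(jaro_similarity("FAREMVIEL", "FARMVILLE"), 2))  # 0.88
```

For the example pair this finds 8 matches and 1 transposition, reproducing the value computed by hand.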
Jaro–Winkler similarity
Jaro–Winkler similarity uses a prefix scale $p$ which gives more favorable ratings to strings that match from the beginning for a set prefix length $\ell$. Given two strings $s_1$ and $s_2$, their Jaro–Winkler similarity $sim_w$ is:
: $sim_w = sim_j + \ell p (1 - sim_j)$
where:
* $sim_j$ is the Jaro similarity for strings $s_1$ and $s_2$
* $\ell$ is the length of the common prefix at the start of the strings, up to a maximum of 4 characters
* $p$ is a constant scaling factor for how much the score is adjusted upwards for having common prefixes. $p$ should not exceed 0.25 (i.e. 1/4, with 4 being the maximum length of the prefix being considered), otherwise the similarity could become larger than 1. The standard value for this constant in Winkler's work is $p = 0.1$
The Jaro–Winkler distance $d_w$ is defined as $d_w = 1 - sim_w$.
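As a worked example, take the strings FAREMVIEL and FARMVILLE from the Jaro similarity section. Their Jaro similarity is about 0.884, they share the three-character prefix FAR (so the prefix length $\ell$ is 3), and the standard scaling factor $p = 0.1$ is assumed:

```latex
\begin{align*}
sim_w &= sim_j + \ell p \,(1 - sim_j) \\
      &= 0.884 + 3 \times 0.1 \times (1 - 0.884) \\
      &\approx 0.919, \\
d_w   &= 1 - sim_w \approx 0.081.
\end{align*}
```

The prefix bonus closes part of the gap between the Jaro similarity and 1, so it can only raise the similarity, never lower it.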
Although often referred to as a ''distance metric'', the Jaro–Winkler distance is not a metric in the mathematical sense of that term because it does not obey the triangle inequality. The Jaro–Winkler distance also does not satisfy the identity axiom $d(x, y) = 0 \leftrightarrow x = y$.
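Both failures can be checked numerically. The sketch below is illustrative only: the helper names `jaro` and `jaro_winkler_distance` are hypothetical, the Jaro helper is included so the block runs on its own, the match window is clamped at zero (an assumption), and the strings "ab", "abab" and "ba" are simply one counterexample found by hand under the definitions above with a scaling factor of 0.1.

```python
def jaro(s1, s2):
    """Jaro similarity as defined in the section above."""
    # Assumption: the match window is clamped at 0 for very short strings.
    window = max(max(len(s1), len(s2)) // 2 - 1, 0)
    used = [False] * len(s2)
    hits1 = []  # matched characters of s1, in s1 order
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len(s2), i + window + 1)):
            if not used[j] and s2[j] == ch:
                used[j] = True
                hits1.append(ch)
                break
    m = len(hits1)
    if m == 0:
        return 0.0  # also the result for two empty strings
    hits2 = [c for c, u in zip(s2, used) if u]
    t = sum(a != b for a, b in zip(hits1, hits2)) / 2
    return (m / len(s1) + m / len(s2) + (m - t) / m) / 3


def jaro_winkler_distance(s1, s2, p=0.1):
    """1 minus the Jaro-Winkler similarity, with a prefix cap of 4."""
    sim_j = jaro(s1, s2)
    ell = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        ell += 1
    return 1 - (sim_j + ell * p * (1 - sim_j))


# Triangle inequality: going via "abab" is "shorter" than going directly.
direct = jaro_winkler_distance("ab", "ba")
detour = (jaro_winkler_distance("ab", "abab")
          + jaro_winkler_distance("abab", "ba"))
print(direct, round(detour, 4))  # 1.0 0.4667

# Identity axiom: two empty strings are equal, yet m = 0 gives distance 1.
print(jaro_winkler_distance("", ""))  # 1.0
```

The empty-string case follows from the literal definition (similarity 0 when there are no matching characters); an implementation may choose to special-case it.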
Relationship with other edit distance metrics
There are other popular measures of edit distance, which are calculated using a different set of allowable edit operations. For instance,
* the Levenshtein distance allows deletion, insertion and substitution;
* the Damerau–Levenshtein distance allows insertion, deletion, substitution, and the transposition of two adjacent characters;
* the longest common subsequence (LCS) distance allows only insertion and deletion, not substitution;
* the Hamming distance allows only substitution, hence, it only applies to strings of the same length.
Edit distance is usually defined as a parameterizable metric calculated with a specific set of allowed edit operations, and each operation is assigned a cost (possibly infinite). This is further generalized by DNA sequence alignment algorithms such as the Smith–Waterman algorithm, which make an operation's cost depend on where it is applied.
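The idea of one parameterizable edit distance with per-operation costs can be illustrated with a short Python sketch. The function `edit_distance` and its parameter names are hypothetical; it uses the standard dynamic-programming recurrence and does not cover adjacent transpositions (as in Damerau–Levenshtein) or position-dependent costs.

```python
import math


def edit_distance(a, b, ins_cost=1.0, del_cost=1.0, sub_cost=1.0):
    """Minimum total cost of insertions, deletions and substitutions
    that turn string a into string b. A cost of math.inf forbids
    the corresponding operation."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0.0] * cols for _ in range(rows)]
    for i in range(1, rows):
        d[i][0] = d[i - 1][0] + del_cost  # delete all of a[:i]
    for j in range(1, cols):
        d[0][j] = d[0][j - 1] + ins_cost  # insert all of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            same = a[i - 1] == b[j - 1]
            d[i][j] = min(
                d[i - 1][j] + del_cost,
                d[i][j - 1] + ins_cost,
                d[i - 1][j - 1] + (0.0 if same else sub_cost),
            )
    return d[-1][-1]


# Unit costs for all three operations: Levenshtein distance.
print(edit_distance("kitten", "sitting"))  # 3.0
# Substitution forbidden: LCS distance (insertions and deletions only).
print(edit_distance("kitten", "sitting", sub_cost=math.inf))  # 5.0
# Insertion and deletion forbidden: Hamming distance (equal lengths).
print(edit_distance("karolin", "kathrin",
                    ins_cost=math.inf, del_cost=math.inf))  # 3.0
```

With insertion and deletion forbidden, strings of different lengths yield an infinite cost, which mirrors the restriction of the Hamming distance to strings of the same length.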
See also
* Record linkage
* Census
External links
* strcmp.c - Original C implementation by the author of the algorithm
* Python implementation in the Natural Language Toolkit