In statistics and related fields, a similarity measure, similarity function, or similarity metric is a real-valued function that quantifies the similarity between two objects. Although no single definition of similarity exists, such measures are usually in some sense the inverse of distance metrics: they take on large values for similar objects and either zero or a negative value for very dissimilar objects. More broadly, though, a similarity function may also satisfy metric axioms.
Cosine similarity is a commonly used similarity measure for real-valued vectors, used in (among other fields) information retrieval to score the similarity of documents in the vector space model. In machine learning, common kernel functions such as the RBF kernel can be viewed as similarity functions.
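As a rough illustration of the two measures just mentioned, the following minimal sketch computes cosine similarity and an RBF kernel value for plain Python lists. The function names and the `gamma` bandwidth parameter are assumptions made for this example, not part of any particular library.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two real-valued vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)

def rbf_kernel(u, v, gamma=1.0):
    """RBF kernel exp(-gamma * ||u - v||^2); gamma is an assumed bandwidth."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-gamma * sq_dist)

# Orthogonal vectors have cosine similarity 0; identical points give kernel value 1.
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # 0.0
print(rbf_kernel([1.0, 2.0], [1.0, 2.0]))         # 1.0
```

Both functions return larger values for more similar inputs, which matches the informal description of a similarity measure given above.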
Use of different similarity measure formulas
Different similarity measures exist for different types of objects, and for each type of object there are several similarity formulas to choose from.
Similarity between two data points

Many options are available for measuring the similarity between two data points, some of which combine other similarity methods. Commonly used measures include Euclidean distance, Manhattan distance, Minkowski distance, and Chebyshev distance. The Euclidean distance formula gives the straight-line distance between two points on a plane, which is visualized in the image below. Manhattan distance is commonly used in GPS applications, as it can be used to find the shortest route between two addresses. Generalizing the Euclidean and Manhattan distance formulas yields the Minkowski distance formulas, which can be used in a wide variety of applications.
* Euclidean distance
* Manhattan distance
* Minkowski distance
* Chebyshev distance
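The relationship between these four distances can be sketched in a few lines. In the hypothetical helper below, the order `p` selects the distance: `p = 1` gives Manhattan distance, `p = 2` gives Euclidean distance, and `p = infinity` is treated as the Chebyshev distance (the limiting case of the Minkowski formula). The function name and argument order are assumptions for illustration only.

```python
def minkowski_distance(x, y, p):
    """Minkowski distance of order p between two equal-length points."""
    if len(x) != len(y):
        raise ValueError("points must have the same number of coordinates")
    diffs = [abs(a - b) for a, b in zip(x, y)]
    if p == float("inf"):
        # Limiting case: the largest coordinate difference (Chebyshev distance).
        return max(diffs)
    return sum(d ** p for d in diffs) ** (1.0 / p)

a, b = (0.0, 0.0), (3.0, 4.0)
print(minkowski_distance(a, b, 1))             # Manhattan: 7.0
print(minkowski_distance(a, b, 2))             # Euclidean: 5.0
print(minkowski_distance(a, b, float("inf")))  # Chebyshev: 4.0
```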
Similarity between strings
For comparing strings, there are various measures of string similarity that can be used. Some of these methods include edit distance, Levenshtein distance, Hamming distance, and Jaro distance. The best-fit formula depends on the requirements of the application. For example, edit distance is frequently used in natural language processing applications and features, such as spell-checking. Jaro distance is commonly used in record linkage to compare first and last names to other sources.
* Edit distance
* Levenshtein distance
* Lee distance
* Hamming distance
* Jaro distance
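Two of the string measures listed above are short enough to sketch directly. The example below is a minimal, unoptimized illustration: Hamming distance counts differing positions in equal-length strings, and Levenshtein distance uses the standard dynamic-programming recurrence over insertions, deletions, and substitutions. The function names are chosen for this example.

```python
def hamming_distance(s, t):
    """Number of positions at which two equal-length strings differ."""
    if len(s) != len(t):
        raise ValueError("Hamming distance requires strings of equal length")
    return sum(a != b for a, b in zip(s, t))

def levenshtein_distance(s, t):
    """Minimum number of single-character insertions, deletions, or substitutions."""
    prev = list(range(len(t) + 1))  # distances from the empty prefix of s
    for i, a in enumerate(s, start=1):
        curr = [i]
        for j, b in enumerate(t, start=1):
            cost = 0 if a == b else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

print(hamming_distance("karolin", "kathrin"))     # 3
print(levenshtein_distance("kitten", "sitting"))  # 3
```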
Similarity between two probability distributions
Typical measures of similarity for probability distributions are the Bhattacharyya distance and the Hellinger distance. Both quantify the similarity of two probability distributions on the same domain, and they are mathematically closely linked. The Bhattacharyya distance does not fulfill the triangle inequality, meaning it does not form a metric. The Hellinger distance does form a metric on the space of probability distributions.
* Bhattacharyya distance
* Hellinger distance
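The close link between the two distances can be seen in a small sketch for discrete distributions, assuming each distribution is given as a list of probabilities over the same outcomes. Both are computed from the Bhattacharyya coefficient BC = sum of sqrt(p_i * q_i): the Bhattacharyya distance is -ln(BC) and the Hellinger distance is sqrt(1 - BC). The function names are assumptions for this example.

```python
import math

def bhattacharyya_coefficient(p, q):
    """Sum of sqrt(p_i * q_i) over a shared discrete domain."""
    return sum(math.sqrt(a * b) for a, b in zip(p, q))

def bhattacharyya_distance(p, q):
    """-ln(BC); infinite when the distributions share no support."""
    bc = bhattacharyya_coefficient(p, q)
    return math.inf if bc == 0 else -math.log(bc)

def hellinger_distance(p, q):
    """sqrt(1 - BC), which always lies in [0, 1]."""
    bc = bhattacharyya_coefficient(p, q)
    return math.sqrt(max(0.0, 1.0 - bc))  # clamp guards against rounding error

p = [0.7, 0.3]
q = [0.5, 0.5]
print(bhattacharyya_distance(p, q))
print(hellinger_distance(p, q))
```

Because the Hellinger distance is bounded and satisfies the triangle inequality, it may be the more convenient choice when a true metric is required.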
Similarity between two sets
The Jaccard index formula measures the similarity between two sets based on the number of items that are present in both sets relative to the total number of distinct items across the two sets. It is commonly used in recommendation systems and social media analysis. The Sørensen–Dice coefficient also compares the number of shared items to the number of items present, but gives the shared items a larger weight. The Sørensen–Dice coefficient is commonly used in biology applications, measuring the similarity between two sets of genes or species.
* Jaccard index
* Sørensen–Dice coefficient
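A minimal sketch of both set measures follows, using the usual definitions: the Jaccard index is the size of the intersection divided by the size of the union, and the Sørensen–Dice coefficient is twice the size of the intersection divided by the sum of the two set sizes. Returning 1.0 for two empty sets is a convention assumed here, and the function names are illustrative.

```python
def jaccard_index(a, b):
    """|A intersect B| / |A union B|."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # assumed convention: two empty sets are identical
    return len(a & b) / len(a | b)

def dice_coefficient(a, b):
    """2 * |A intersect B| / (|A| + |B|)."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # same assumed convention as above
    return 2 * len(a & b) / (len(a) + len(b))

A, B = {1, 2, 3}, {2, 3, 4}
print(jaccard_index(A, B))     # 2 shared / 4 distinct = 0.5
print(dice_coefficient(A, B))  # 2 * 2 shared / (3 + 3)
```

For the same pair of sets the Dice value is never smaller than the Jaccard value, which reflects the extra weight it gives to shared items.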
Similarity between two sequences
When comparing temporal sequences (time series), some similarity measures must additionally account for similarity of two sequences that are not fully aligned.
* Dynamic time warping
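Dynamic time warping handles misaligned sequences by allowing one element of a sequence to be matched against several consecutive elements of the other. The following is a minimal quadratic-time sketch for numeric sequences, assuming the absolute difference as the local cost and no window constraint; practical implementations often add such constraints. The function name is chosen for this example.

```python
import math

def dtw_distance(s, t):
    """Dynamic time warping cost between two numeric sequences."""
    n, m = len(s), len(t)
    # cost[i][j] = minimal cumulative cost of aligning s[:i] with t[:j]
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(s[i - 1] - t[j - 1])  # assumed local cost
            cost[i][j] = d + min(cost[i - 1][j],      # s[i-1] matched again
                                 cost[i][j - 1],      # t[j-1] matched again
                                 cost[i - 1][j - 1])  # both sequences advance
    return cost[n][m]

# The second sequence repeats one value, yet the warped alignment is exact.
print(dtw_distance([1, 2, 3], [1, 2, 2, 3]))  # 0.0
```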
Use in clustering
Clustering, or cluster analysis, is a data mining technique that is used to discover patterns in data by grouping similar objects together. It involves partitioning a set of data points into groups or clusters based on their similarities. One of the fundamental aspects of clustering is how to measure similarity between data points.
Similarity measures play a crucial role in many clustering techniques, as they are used to determine how closely related two data points are and whether they should be grouped together in the same cluster. A similarity measure can take many different forms depending on the type of data being clustered and the specific problem being solved.
One of the most commonly used similarity measures is the Euclidean distance, which is used in many clustering techniques including k-means clustering and hierarchical clustering. The Euclidean distance is a measure of the straight-line distance between two points in a high-dimensional space. It is calculated as the square root of the sum of the squared differences between the corresponding coordinates of the two points. For example, if we have two data points $(x_1, y_1)$ and $(x_2, y_2)$, the Euclidean distance between them is $d = \sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2}$.
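To make the role of Euclidean distance in clustering more concrete, the sketch below assigns each point to its nearest centroid, which corresponds to the assignment step of k-means. It is only that one step, not a full clustering algorithm, and the function names and sample data are hypothetical.

```python
import math

def euclidean_distance(p, q):
    """Square root of the sum of squared coordinate differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def assign_to_nearest(points, centroids):
    """Return, for each point, the index of the closest centroid."""
    return [
        min(range(len(centroids)),
            key=lambda k: euclidean_distance(p, centroids[k]))
        for p in points
    ]

points = [(0.0, 0.0), (1.0, 0.0), (9.0, 9.0), (10.0, 10.0)]
centroids = [(0.0, 0.0), (10.0, 10.0)]  # assumed fixed for this example
print(assign_to_nearest(points, centroids))  # [0, 0, 1, 1]
```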