statistics Statistics (from German language, German: ', "description of a State (polity), state, a country") is the discipline that concerns the collection, organization, analysis, interpretation, and presentation of data. In applying statistics to a s ...

and related fields, a similarity measure or similarity function or similarity metric is a

real-valued function In mathematics, a real-valued function is a function whose values are real numbers. In other words, it is a function that assigns a real number to each member of its domain. Real-valued functions of a real variable (commonly called ''real ...

that quantifies the similarity between two objects. Although no single definition of a similarity exists, usually such measures are in some sense the inverse of distance metrics: they take on large values for similar objects and either zero or a negative value for very dissimilar objects. Though, in more broad terms, a similarity function may also satisfy metric axioms. Cosine similarity is a commonly used similarity measure for real-valued vectors, used in (among other fields)

information retrieval Information retrieval (IR) in computing and information science is the task of identifying and retrieving information system resources that are relevant to an Information needs, information need. The information need can be specified in the form ...

to score the similarity of documents in the

vector space model Vector space model or term vector model is an algebraic model for representing text documents (or more generally, items) as vector space, vectors such that the distance between vectors represents the relevance between the documents. It is used in i ...

. In

machine learning Machine learning (ML) is a field of study in artificial intelligence concerned with the development and study of Computational statistics, statistical algorithms that can learn from data and generalise to unseen data, and thus perform Task ( ...

, common kernel functions such as the RBF kernel can be viewed as similarity functions.

Use of different similarity measure formulas

Different types of similarity measures exist for various types of objects, depending on the objects being compared. For each type of object there are various similarity measurement formulas. Similarity between two data points Euclidean distance vis

There are many various options available when it comes to finding similarity between two data points, some of which are a combination of other similarity methods. Some of the methods for similarity measures between two data points include Euclidean distance, Manhattan distance, Minkowski distance, and Chebyshev distance. The Euclidean distance formula is used to find the distance between two points on a plane, which is visualized in the image below. Manhattan distance is commonly used in GPS applications, as it can be used to find the shortest route between two addresses. When you generalize the Euclidean distance formula and Manhattan distance formula you are left with the Minkowski distance formulas, which can be used in a wide variety of applications. *

Euclidean distance In mathematics, the Euclidean distance between two points in Euclidean space is the length of the line segment between them. It can be calculated from the Cartesian coordinates of the points using the Pythagorean theorem, and therefore is o ...

Manhattan distance Taxicab geometry or Manhattan geometry is geometry where the familiar Euclidean distance is ignored, and the distance between two point (geometry), points is instead defined to be the sum of the absolute differences of their respective Cartesian ...

* Minkowski distance * Chebyshev distance Similarity between strings For comparing strings, there are various measures of string similarity that can be used. Some of these methods include edit distance, Levenshtein distance, Hamming distance, and Jaro distance. The best-fit formula is dependent on the requirements of the application. For example, edit distance is frequently used for

natural language processing Natural language processing (NLP) is a subfield of computer science and especially artificial intelligence. It is primarily concerned with providing computers with the ability to process data encoded in natural language and is thus closely related ...

applications and features, such as spell-checking. Jaro distance is commonly used in record linkage to compare first and last names to other sources. *

Edit distance In computational linguistics and computer science, edit distance is a string metric, i.e. a way of quantifying how dissimilar two String (computing), strings (e.g., words) are to one another, that is measured by counting the minimum number of opera ...

* Levenshtein distance * Lee distance *

Hamming distance In information theory, the Hamming distance between two String (computer science), strings or vectors of equal length is the number of positions at which the corresponding symbols are different. In other words, it measures the minimum number ...

* Jaro distance Similarity between two probability distributions Typical measures of similarity for

probability distributions In probability theory and statistics, a probability distribution is a function that gives the probabilities of occurrence of possible events for an experiment. It is a mathematical description of a random phenomenon in terms of its sample spac ...

are the Bhattacharyya distance and the

Hellinger distance In probability and statistics, the Hellinger distance (closely related to, although different from, the Bhattacharyya distance) is used to quantify the similarity between two probability distributions. It is a type of ''f''-divergence. The Hell ...

. Both provide a quantification of similarity for two probability distributions on the same domain, and they are mathematically closely linked. The Bhattacharyya distance does not fulfill the

triangle inequality In mathematics, the triangle inequality states that for any triangle, the sum of the lengths of any two sides must be greater than or equal to the length of the remaining side. This statement permits the inclusion of Degeneracy (mathematics)#T ...

, meaning it does not form a

metric Metric or metrical may refer to: Measuring * Metric system, an internationally adopted decimal system of measurement * An adjective indicating relation to measurement in general, or a noun describing a specific type of measurement Mathematics ...

. The Hellinger distance does form a metric on the space of probability distributions. * Bhattacharyya distance *

Similarity between two sets The Jaccard index formula measures the similarity between two sets based on the number of items that are present in both sets relative to the total number of items. It is commonly used in recommendation systems and social media analysis. The Sørensen–Dice coefficient also compares the number of items in both sets to the total number of items present but the weight for the number of shared items is larger. The Sørensen–Dice coefficient is commonly used in

biology Biology is the scientific study of life and living organisms. It is a broad natural science that encompasses a wide range of fields and unifying principles that explain the structure, function, growth, History of life, origin, evolution, and ...

applications, measuring the similarity between two sets of genes or species. * Jaccard index * Sørensen–Dice coefficient Similarity between two sequences When comparing temporal sequences (time series), some similarity measures must additionally account for similarity of two sequences that are not fully aligned. * Dynamic time warping

Use in clustering

Clustering or

Cluster analysis Cluster analysis or clustering is the data analyzing technique in which task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more Similarity measure, similar (in some specific sense defined by the ...

is a data mining technique that is used to discover patterns in data by grouping similar objects together. It involves partitioning a set of data points into groups or clusters based on their similarities. One of the fundamental aspects of clustering is how to measure similarity between data points. Similarity measures play a crucial role in many clustering techniques, as they are used to determine how closely related two data points are and whether they should be grouped together in the same cluster. A similarity measure can take many different forms depending on the type of data being clustered and the specific problem being solved. One of the most commonly used similarity measures is the

, which is used in many clustering techniques including

K-means clustering ''k''-means clustering is a method of vector quantization, originally from signal processing, that aims to partition of a set, partition ''n'' observations into ''k'' clusters in which each observation belongs to the cluster (statistics), cluste ...

and Hierarchical clustering. The Euclidean distance is a measure of the straight-line distance between two points in a high-dimensional space. It is calculated as the square root of the sum of the squared differences between the corresponding coordinates of the two points. For example, if we have two data points

(x_1,y_1)

and

(x_2,y_2)

, the Euclidean distance between them is

d = \surd x_2-x_1)^2 + (y_2-y_1)^2 /math>.

Another commonly used similarity measure is the Jaccard index or Jaccard similarity, which is used in clustering techniques that work with binary data such as presence/absence data or Boolean data; The Jaccard similarity is particularly useful for clustering techniques that work with text data, where it can be used to identify clusters of similar documents based on their shared features or keywords. It is calculated as the size of the intersection of two sets divided by the size of the union of the two sets:

J(A,B)=

. Similarities among 162 Relevant Nuclear Profile are tested using the Jaccard Similarity measure (see figure with heatmap). The Jaccard similarity of the nuclear profile ranges from 0 to 1, with 0 indicating no similarity between the two sets and 1 indicating perfect similarity with the aim of clustering the most similar nuclear profile. Manhattan distance, also known as Taxicab geometry, is a commonly used similarity measure in clustering techniques that work with continuous data. It is a measure of the distance between two data points in a high-dimensional space, calculated as the sum of the absolute differences between the corresponding coordinates of the two points

\left\vert x_1 - x_2 \right\vert +\left\vert y_1 -y_2 \right\vert

. When dealing with mixed-type data, including nominal, ordinal, and numerical attributes per object, Gower's distance (or similarity) is a common choice as it can handle different types of variables implicitly. It first computes similarities between the pair of variables in each object, and then combines those similarities to a single weighted average per object-pair. As such, for two objects

i

and

j

having

p

descriptors, the similarity

S

is defined as:

S_ = \frac,

where the

w_

are non-negative weights and

s_

is the similarity between the two objects regarding their

k

-th variable. In

spectral clustering In multivariate statistics, spectral clustering techniques make use of the spectrum (eigenvalues) of the similarity matrix of the data to perform dimensionality reduction before clustering in fewer dimensions. The similarity matrix is provided ...

, a similarity, or affinity, measure is used to transform data to overcome difficulties related to lack of convexity in the shape of the data distribution. The measure gives rise to an

(n, n)

-sized for a set of points, where the entry

(i,j)

in the matrix can be simply the (reciprocal of the)

between

i

and

j

, or it can be a more complex measure of distance such as the Gaussian

e^

. Further modifying this result with network analysis techniques is also common. The choice of similarity measure depends on the type of data being clustered and the specific problem being solved. For example, working with continuous data such as gene expression data, the Euclidean distance or cosine similarity may be appropriate. If working with binary data such as the presence of a genomic loci in a nuclear profile, the Jaccard index may be more appropriate. Lastly, working with data that is arranged in a grid or lattice structure, such as image or signal processing data, the Manhattan distance is particularly useful for the clustering.

Use in recommender systems

Similarity measures are used to develop

recommender systems A recommender system (RecSys), or a recommendation system (sometimes replacing ''system'' with terms such as ''platform'', ''engine'', or ''algorithm'') and sometimes only called "the algorithm" or "algorithm", is a subclass of information fil ...

. It observes a user's perception and liking of multiple items. On recommender systems, the method is using a distance calculation such as or to generate a with values representing the similarity of any pair of targets. Then, by analyzing and comparing the values in the matrix, it is possible to match two targets to a user's preference or link users based on their marks. In this system, it is relevant to observe the value itself and the absolute distance between two values. Gathering this data can indicate a mark's likeliness to a user as well as how mutually closely two marks are either rejected or accepted. It is possible then to recommend to a user targets with high similarity to the user's likes. Recommender systems are observed in multiple online entertainment platforms, in social media and streaming websites. The logic for the construction of this systems is based on similarity measures.

Use in sequence alignment

Similarity matrices are used in

sequence alignment In bioinformatics, a sequence alignment is a way of arranging the sequences of DNA, RNA, or protein to identify regions of similarity that may be a consequence of functional, structural biology, structural, or evolutionary relationships between ...

. Higher scores are given to more-similar characters, and lower or negative scores for dissimilar characters.

Nucleotide Nucleotides are Organic compound, organic molecules composed of a nitrogenous base, a pentose sugar and a phosphate. They serve as monomeric units of the nucleic acid polymers – deoxyribonucleic acid (DNA) and ribonucleic acid (RNA), both o ...

similarity matrices are used to align

nucleic acid Nucleic acids are large biomolecules that are crucial in all cells and viruses. They are composed of nucleotides, which are the monomer components: a pentose, 5-carbon sugar, a phosphate group and a nitrogenous base. The two main classes of nuclei ...

sequences. Because there are only four nucleotides commonly found in

DNA Deoxyribonucleic acid (; DNA) is a polymer composed of two polynucleotide chains that coil around each other to form a double helix. The polymer carries genetic instructions for the development, functioning, growth and reproduction of al ...

(

Adenine Adenine (, ) (nucleoside#List of nucleosides and corresponding nucleobases, symbol A or Ade) is a purine nucleotide base that is found in DNA, RNA, and Adenosine triphosphate, ATP. Usually a white crystalline subtance. The shape of adenine is ...

(A),

Cytosine Cytosine () (symbol C or Cyt) is one of the four nucleotide bases found in DNA and RNA, along with adenine, guanine, and thymine ( uracil in RNA). It is a pyrimidine derivative, with a heterocyclic aromatic ring and two substituents attac ...

(C),

Guanine Guanine () (symbol G or Gua) is one of the four main nucleotide bases found in the nucleic acids DNA and RNA, the others being adenine, cytosine, and thymine ( uracil in RNA). In DNA, guanine is paired with cytosine. The guanine nucleoside ...

(G) and

Thymine Thymine () (symbol T or Thy) is one of the four nucleotide bases in the nucleic acid of DNA that are represented by the letters G–C–A–T. The others are adenine, guanine, and cytosine. Thymine is also known as 5-methyluracil, a pyrimidine ...

(T)), nucleotide similarity matrices are much simpler than

protein Proteins are large biomolecules and macromolecules that comprise one or more long chains of amino acid residue (biochemistry), residues. Proteins perform a vast array of functions within organisms, including Enzyme catalysis, catalysing metab ...

similarity matrices. For example, a simple matrix will assign identical bases a score of +1 and non-identical bases a score of −1. A more complicated matrix would give a higher score to transitions (changes from a

pyrimidine Pyrimidine (; ) is an aromatic, heterocyclic, organic compound similar to pyridine (). One of the three diazines (six-membered heterocyclics with two nitrogen atoms in the ring), it has nitrogen atoms at positions 1 and 3 in the ring. The oth ...

such as C or T to another pyrimidine, or from a

purine Purine is a heterocyclic aromatic organic compound that consists of two rings (pyrimidine and imidazole) fused together. It is water-soluble. Purine also gives its name to the wider class of molecules, purines, which include substituted puri ...

such as A or G to another purine) than to transversions (from a pyrimidine to a purine or vice versa). The match/mismatch ratio of the matrix sets the target evolutionary distance. The +1/−3 DNA matrix used by BLASTN is best suited for finding matches between sequences that are 99% identical; a +1/−1 (or +4/−4) matrix is much more suited to sequences with about 70% similarity. Matrices for lower similarity sequences require longer sequence alignments.

Amino acid Amino acids are organic compounds that contain both amino and carboxylic acid functional groups. Although over 500 amino acids exist in nature, by far the most important are the 22 α-amino acids incorporated into proteins. Only these 22 a ...

similarity matrices are more complicated, because there are 20 amino acids coded for by the

genetic code Genetic code is a set of rules used by living cell (biology), cells to Translation (biology), translate information encoded within genetic material (DNA or RNA sequences of nucleotide triplets or codons) into proteins. Translation is accomplished ...

, and so a larger number of possible substitutions. Therefore, the similarity matrix for amino acids contains 400 entries (although it is usually

symmetric Symmetry () in everyday life refers to a sense of harmonious and beautiful proportion and balance. In mathematics, the term has a more precise definition and is usually used to refer to an object that is invariant under some transformations ...

). The first approach scored all amino acid changes equally. A later refinement was to determine amino acid similarities based on how many base changes were required to change a codon to code for that amino acid. This model is better, but it doesn't take into account the selective pressure of amino acid changes. Better models took into account the chemical properties of amino acids. One approach has been to empirically generate the similarity matrices. The Dayhoff method used phylogenetic trees and sequences taken from species on the tree. This approach has given rise to the PAM series of matrices. PAM matrices are labelled based on how many nucleotide changes have occurred, per 100 amino acids. While the PAM matrices benefit from having a well understood evolutionary model, they are most useful at short evolutionary distances (PAM10–PAM120). At long evolutionary distances, for example PAM250 or 20% identity, it has been shown that the BLOSUM matrices are much more effective. The BLOSUM series were generated by comparing a number of divergent sequences. The BLOSUM series are labeled based on how much entropy remains unmutated between all sequences, so a lower BLOSUM number corresponds to a higher PAM number.

Use in computer vision

References