In data analysis, cosine similarity is a measure of similarity between two sequences of numbers. For defining it, the sequences are viewed as vectors in an inner product space, and the cosine similarity is defined as the cosine of the angle between them, that is, the dot product of the vectors divided by the product of their lengths. It follows that the cosine similarity does not depend on the magnitudes of the vectors, but only on their angle. The cosine similarity always belongs to the interval [−1, 1]. For example, two proportional vectors have a cosine similarity of 1, two orthogonal vectors have a similarity of 0, and two opposite vectors have a similarity of −1. The cosine similarity is particularly used in positive space, where the outcome is neatly bounded in [0, 1].

For example, in information retrieval and text mining, each word is assigned a different coordinate and a document is represented by the vector of the numbers of occurrences of each word in the document. Cosine similarity then gives a useful measure of how similar two documents are likely to be, in terms of their subject matter, and independently of the length of the documents. The technique is also used to measure cohesion within clusters in the field of data mining.

One advantage of cosine similarity is its low complexity, especially for sparse vectors: only the non-zero coordinates need to be considered.

Other names for cosine similarity include Orchini similarity and Tucker coefficient of congruence; the Otsuka–Ochiai similarity (see below) is cosine similarity applied to binary data.


Definition

The cosine of two non-zero vectors can be derived by using the Euclidean dot product formula:

:\mathbf{A} \cdot \mathbf{B} = \left\| \mathbf{A} \right\| \left\| \mathbf{B} \right\| \cos\theta

Given two vectors of attributes, ''A'' and ''B'', the cosine similarity, \cos(\theta), is represented using a dot product and magnitude as

:\text{cosine similarity} = S_C(A,B) := \cos(\theta) = \frac{\mathbf{A} \cdot \mathbf{B}}{\|\mathbf{A}\| \, \|\mathbf{B}\|} = \frac{\sum\limits_{i=1}^{n} A_i B_i}{\sqrt{\sum\limits_{i=1}^{n} A_i^2} \cdot \sqrt{\sum\limits_{i=1}^{n} B_i^2}},

where A_i and B_i are the i-th components of vectors A and B respectively.

The resulting similarity ranges from −1, meaning exactly opposite, to 1, meaning exactly the same, with 0 indicating orthogonality or decorrelation, while in-between values indicate intermediate similarity or dissimilarity.

For text matching, the attribute vectors ''A'' and ''B'' are usually the term frequency vectors of the documents. Cosine similarity can be seen as a method of normalizing document length during comparison. In the case of information retrieval, the cosine similarity of two documents will range from 0 to 1, since the term frequencies cannot be negative. This remains true when using tf–idf weights. The angle between two term frequency vectors cannot be greater than 90°.

If the attribute vectors are normalized by subtracting the vector means (e.g., A - \bar{A}), the measure is called the centered cosine similarity and is equivalent to the Pearson correlation coefficient.
For an example of centering,

:\text{if}\ A = [A_1, A_2]^T, \text{ then } \bar{A} = \left[\tfrac{A_1+A_2}{2}, \tfrac{A_1+A_2}{2}\right]^T, \text{ so } A - \bar{A} = \left[\tfrac{A_1-A_2}{2}, \tfrac{A_2-A_1}{2}\right]^T.

The term cosine distance is commonly used for the complement of cosine similarity in positive space, that is

:\text{cosine distance} = D_C(A,B) := 1 - S_C(A,B).

It is important to note, however, that the cosine distance is not a proper distance metric, as it does not have the triangle inequality property (or, more formally, the Schwarz inequality) and it violates the coincidence axiom. One way to see this is to note that the cosine distance is half of the squared Euclidean distance of the L_2 normalization of the vectors, and squared Euclidean distance does not satisfy the triangle inequality either. To repair the triangle inequality property while maintaining the same ordering, it is necessary to convert to angular distance or Euclidean distance. Alternatively, the triangular inequality that does work for angular distances can be expressed directly in terms of the cosines; see below.
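
As a concrete illustration of the definition above, the following is a minimal sketch (using NumPy; the helper names and the example term-frequency vectors are hypothetical) of how cosine similarity and cosine distance might be computed:

<syntaxhighlight lang="python">
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two non-zero vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def cosine_distance(a: np.ndarray, b: np.ndarray) -> float:
    """Complement of cosine similarity in positive space (not a proper metric)."""
    return 1.0 - cosine_similarity(a, b)

# Hypothetical term-frequency vectors for two short documents.
doc_a = np.array([2.0, 1.0, 0.0, 3.0])
doc_b = np.array([1.0, 0.0, 1.0, 2.0])

print(cosine_similarity(doc_a, doc_b))        # in [0, 1] for non-negative vectors
print(cosine_similarity(doc_a, 5 * doc_a))    # 1.0: proportional vectors
print(cosine_similarity(np.array([1.0, 0.0]),
                        np.array([0.0, 1.0])))  # 0.0: orthogonal vectors
</syntaxhighlight>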


Angular distance and similarity

The normalized angle, referred to as angular distance, between any two vectors A and B is a formal distance metric and can be calculated from the cosine similarity. The complement of the angular distance metric can then be used to define an angular similarity function bounded between 0 and 1, inclusive.

When the vector elements may be positive or negative:

:\text{angular distance} = D_\theta := \frac{\arccos(\text{cosine similarity}(A,B))}{\pi} = \frac{\theta}{\pi}

:\text{angular similarity} = S_\theta := 1 - \text{angular distance} = 1 - \frac{\theta}{\pi}

Or, if the vector elements are always positive:

:\text{angular distance} = D_\theta := \frac{2 \cdot \arccos(\text{cosine similarity}(A,B))}{\pi} = \frac{2\theta}{\pi}

:\text{angular similarity} = S_\theta := 1 - \text{angular distance} = 1 - \frac{2\theta}{\pi}

Unfortunately, computing the inverse cosine (arccos) function is slow, making the use of the angular distance more computationally expensive than using the more common (but not metric) cosine distance above.
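
A minimal sketch of these formulas (using NumPy; the function names and the nonnegative flag are illustrative assumptions, not a standard API) might look like:

<syntaxhighlight lang="python">
import numpy as np

def angular_distance(a, b, nonnegative=False):
    """Normalized angle between two vectors; a proper distance metric."""
    cos_sim = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
    # Clip to guard against floating-point values slightly outside [-1, 1].
    theta = np.arccos(np.clip(cos_sim, -1.0, 1.0))
    # For strictly non-negative data the angle is at most pi/2,
    # so the normalization factor is pi/2 instead of pi.
    return theta / (np.pi / 2) if nonnegative else theta / np.pi

def angular_similarity(a, b, nonnegative=False):
    return 1.0 - angular_distance(a, b, nonnegative)
</syntaxhighlight>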


L2-normalized Euclidean distance

Another effective proxy for cosine distance can be obtained by L_2 normalisation of the vectors, followed by the application of normal Euclidean distance. Using this technique, each term in each vector is first divided by the magnitude of the vector, yielding a vector of unit length. Then the Euclidean distance over the end-points of any two vectors is a proper metric which gives the same ordering as the cosine distance (a monotonic transformation of Euclidean distance; see below) for any comparison of vectors, and furthermore avoids the potentially expensive trigonometric operations required to yield a proper metric. Once the normalisation has occurred, the vector space can be used with the full range of techniques available to any Euclidean space, notably standard dimensionality reduction techniques. This normalised form distance is often used within many deep learning algorithms.
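
As a sketch of this proxy (assuming NumPy and non-zero example vectors), note that after L_2 normalisation the Euclidean distance and the cosine distance are related by \|\hat{A} - \hat{B}\| = \sqrt{2 D_C(A,B)}, so the ordering is preserved:

<syntaxhighlight lang="python">
import numpy as np

def l2_normalize(v):
    return v / np.linalg.norm(v)

def normalized_euclidean_distance(a, b):
    """Euclidean distance between the L2-normalized vectors (a proper metric)."""
    return np.linalg.norm(l2_normalize(a) - l2_normalize(b))

a = np.array([3.0, 1.0, 2.0])
b = np.array([1.0, 0.0, 4.0])

cos_dist = 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
# Monotonic relation: distance on normalized vectors equals sqrt(2 * cosine distance).
assert np.isclose(normalized_euclidean_distance(a, b), np.sqrt(2.0 * cos_dist))
</syntaxhighlight>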


Otsuka–Ochiai coefficient

In biology, there is a similar concept known as the Otsuka–Ochiai coefficient, named after Yanosuke Otsuka (also spelled as Ōtsuka, Ootsuka or Otuka, Japanese: 大塚 弥之助) and Akira Ochiai (Japanese: 落合 明), also known as the Ochiai–Barkman or Ochiai coefficient, which can be represented as:

:K = \frac{|A \cap B|}{\sqrt{|A| \times |B|}}

Here, A and B are sets, and |A| is the number of elements in A. If sets are represented as bit vectors, the Otsuka–Ochiai coefficient can be seen to be the same as the cosine similarity.

In a recent book, the coefficient is misattributed to another Japanese researcher with the family name Otsuka. The confusion arises because in 1957 Akira Ochiai attributes the coefficient only to Otsuka (no first name mentioned) by citing an article by Ikuso Hamai (Japanese: 浜井 生三), who in turn cites the original 1936 article by Yanosuke Otsuka.
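
A small sketch (with hypothetical example sets) of the set form of the coefficient, checking that it agrees with the cosine similarity of the corresponding bit vectors:

<syntaxhighlight lang="python">
import numpy as np

def otsuka_ochiai(a: set, b: set) -> float:
    """K = |A ∩ B| / sqrt(|A| * |B|) for two non-empty sets."""
    return len(a & b) / np.sqrt(len(a) * len(b))

# Hypothetical sets of species observed at two sites.
site_a = {"heron", "gull", "tern"}
site_b = {"gull", "tern", "plover", "sandpiper"}

# The same data encoded as bit vectors over the union of elements.
universe = sorted(site_a | site_b)
vec_a = np.array([1.0 if x in site_a else 0.0 for x in universe])
vec_b = np.array([1.0 if x in site_b else 0.0 for x in universe])
cosine = np.dot(vec_a, vec_b) / (np.linalg.norm(vec_a) * np.linalg.norm(vec_b))

assert np.isclose(otsuka_ochiai(site_a, site_b), cosine)
</syntaxhighlight>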


Properties

The most noteworthy property of cosine similarity is that it reflects a relative, rather than absolute, comparison of the individual vector dimensions. For any positive constant a and vector V, the vectors V and aV are maximally similar. The measure is thus most appropriate for data where frequency is more important than absolute values; notably, term frequency in documents. However, more recent metrics with a grounding in information theory, such as Jensen–Shannon, SED, and triangular divergence, have been shown to have improved semantics in at least some contexts.

Cosine similarity is related to Euclidean distance as follows. Denote Euclidean distance by the usual \|A - B\|, and observe that

:\|A - B\|^2 = (A - B) \cdot (A - B) = \|A\|^2 + \|B\|^2 - 2 (A \cdot B)

(polarization identity) by expansion. When A and B are normalized to unit length, \|A\|^2 = \|B\|^2 = 1, so this expression is equal to

:2 (1 - \cos(A, B)).

In short, the cosine distance can be expressed in terms of Euclidean distance as

:D_C(A, B) = \frac{\|A - B\|^2}{2} \quad \text{when} \quad \|A\|^2 = \|B\|^2 = 1.

The Euclidean distance is called the ''chord distance'' (because it is the length of the chord on the unit circle), and it is the Euclidean distance between the vectors which were normalized to unit sum of squared values within them.

Null distribution: For data which can be negative as well as positive, the null distribution for cosine similarity is the distribution of the dot product of two independent random unit vectors. This distribution has a mean of zero and a variance of 1/n (where n is the number of dimensions), and although the distribution is bounded between −1 and +1, as n grows large the distribution is increasingly well-approximated by the normal distribution. For other types of data, such as bitstreams which only take the values 0 or 1, the null distribution takes a different form and may have a nonzero mean.
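
The null distribution described above can be sketched with a small Monte Carlo simulation (the dimension and sample count below are arbitrary illustrative choices):

<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(0)
n, trials = 100, 50_000  # dimension and number of samples, chosen arbitrarily

def random_unit_vector(dim):
    v = rng.standard_normal(dim)
    return v / np.linalg.norm(v)

# The cosine similarity of independent random unit vectors is simply their dot product.
sims = np.array([np.dot(random_unit_vector(n), random_unit_vector(n))
                 for _ in range(trials)])

print(sims.mean())  # close to 0
print(sims.var())   # close to 1/n = 0.01
</syntaxhighlight>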


Triangle inequality for cosine similarity

The ordinary triangle inequality for angles (i.e., arc lengths on a unit hypersphere) gives us that

:|\angle{AC} - \angle{CB}| \le \angle{AB} \le \angle{AC} + \angle{CB}.

Because the cosine function decreases as an angle in [0, \pi] radians increases, the sense of these inequalities is reversed when we take the cosine of each value:

:\cos(\angle{AC} - \angle{CB}) \ge \cos(\angle{AB}) \ge \cos(\angle{AC} + \angle{CB}).

Using the cosine addition and subtraction formulas, these two inequalities can be written in terms of the original cosines,

:\cos(A,C) \cdot \cos(C,B) + \sqrt{\left(1 - \cos(A,C)^2\right) \cdot \left(1 - \cos(C,B)^2\right)} \geq \cos(A,B),

:\cos(A,B) \geq \cos(A,C) \cdot \cos(C,B) - \sqrt{\left(1 - \cos(A,C)^2\right) \cdot \left(1 - \cos(C,B)^2\right)}.

This form of the triangle inequality can be used to bound the minimum and maximum similarity of two objects A and B if the similarities to a reference object C are already known. This is used for example in metric data indexing, but has also been used to accelerate spherical k-means clustering the same way the Euclidean triangle inequality has been used to accelerate regular k-means.
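
As a sketch (with hypothetical vectors), these bounds can be used to bracket \cos(A,B) from already-known similarities to a reference object C, as done in metric indexing:

<syntaxhighlight lang="python">
import numpy as np

def cosine(u, v):
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

def similarity_bounds(cos_ac, cos_cb):
    """Lower and upper bounds on cos(A, B) given cos(A, C) and cos(C, B)."""
    cross = np.sqrt((1.0 - cos_ac**2) * (1.0 - cos_cb**2))
    return cos_ac * cos_cb - cross, cos_ac * cos_cb + cross

A = np.array([1.0, 2.0, 0.5])
B = np.array([0.5, 1.0, 1.5])
C = np.array([1.0, 1.0, 1.0])

lower, upper = similarity_bounds(cosine(A, C), cosine(C, B))
assert lower <= cosine(A, B) <= upper
</syntaxhighlight>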


Soft cosine measure

A soft cosine or ("soft" similarity) between two vectors considers similarities between pairs of features. The traditional cosine similarity considers the vector space model (VSM) features as independent or completely different, while the soft cosine measure proposes considering the similarity of features in the VSM, which helps generalize the concept of cosine (and soft cosine) as well as the idea of (soft) similarity.

For example, in the field of natural language processing (NLP) the similarity among features is quite intuitive. Features such as words, ''n''-grams, or syntactic ''n''-grams can be quite similar, though formally they are considered as different features in the VSM. For example, the words "play" and "game" are different words and thus mapped to different points in the VSM; yet they are semantically related. In the case of ''n''-grams or syntactic ''n''-grams, Levenshtein distance can be applied (in fact, Levenshtein distance can be applied to words as well).

For calculating soft cosine, the matrix ''s'' is used to indicate similarity between features. It can be calculated through Levenshtein distance, WordNet similarity, or other similarity measures. Then we just multiply by this matrix.

Given two ''N''-dimension vectors a and b, the soft cosine similarity is calculated as follows:

:\operatorname{soft\_cosine}_1(a,b) = \frac{\sum\nolimits_{i,j}^{N} s_{ij} a_i b_j}{\sqrt{\sum\nolimits_{i,j}^{N} s_{ij} a_i a_j} \sqrt{\sum\nolimits_{i,j}^{N} s_{ij} b_i b_j}},

where s_{ij} = \text{similarity}(\text{feature}_i, \text{feature}_j). If there is no similarity between features (s_{ii} = 1, s_{ij} = 0 for i \neq j), the given equation is equivalent to the conventional cosine similarity formula.

The time complexity of this measure is quadratic, which makes it applicable to real-world tasks. Note that the complexity can be reduced to subquadratic. An efficient implementation of such soft cosine similarity is included in the Gensim open source library.
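
A minimal dense-matrix sketch of the formula above (assuming NumPy; the feature names and similarity values are hypothetical, and an efficient implementation such as Gensim's handles sparse data differently):

<syntaxhighlight lang="python">
import numpy as np

def soft_cosine_similarity(a, b, s):
    """Soft cosine of a and b, given a feature-similarity matrix s (s[i][i] == 1)."""
    def soft_dot(x, y):
        return x @ s @ y  # sum over i, j of s_ij * x_i * y_j
    return soft_dot(a, b) / np.sqrt(soft_dot(a, a) * soft_dot(b, b))

# Hypothetical counts over the features ["play", "game", "weather"].
a = np.array([1.0, 0.0, 1.0])
b = np.array([0.0, 1.0, 1.0])

# Hypothetical feature similarities: "play" and "game" are related (0.7).
s = np.array([[1.0, 0.7, 0.0],
              [0.7, 1.0, 0.0],
              [0.0, 0.0, 1.0]])

print(soft_cosine_similarity(a, b, s))          # higher than the plain cosine (0.5)
print(soft_cosine_similarity(a, b, np.eye(3)))  # identity s reduces to ordinary cosine
</syntaxhighlight>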


See also

* Sørensen–Dice coefficient
* Hamming distance
* Correlation
* Jaccard index
* SimRank
* Information retrieval


References


External links


* Weighted cosine measure
* A tutorial on cosine similarity using Python