The Dice-Sørensen coefficient (see below for other names) is a statistic used to gauge the similarity of two samples. It was independently developed by the botanists Lee Raymond Dice and Thorvald Sørensen, who published in 1945 and 1948 respectively.

Name

The index is known by several other names, especially Sørensen–Dice index, Sørensen index and Dice's coefficient. Other variations include the "similarity coefficient" or "index", such as Dice similarity coefficient (DSC). Common alternate spellings for Sørensen are ''Sorenson'', ''Soerenson'' and ''Sörenson'', and all three can also be seen with the ''–sen'' ending (the Danish letter ø is phonetically equivalent to the German/Swedish ö, which can be written as oe in ASCII). Other names include: * F1 score * Czekanowski's binary (non-quantitative) index * Measure of genetic similarity * Zijdenbos similarity index, referring to a 1994 paper of Zijdenbos et al.

Formula

Sørensen's original formula was intended to be applied to discrete data. Given two sets, X and Y, it is defined as :

DSC =  \frac

where , ''X'', and , ''Y'', are the cardinalities of the two sets (i.e. the number of elements in each set). The Sørensen index equals twice the number of elements common to both sets divided by the sum of the number of elements in each set. Equivalently, the index is the size of the intersection as a fraction of the average size of the two sets. When applied to Boolean data, using the definition of true positive (TP), false positive (FP), and false negative (FN), it can be written as :

DSC = \frac

. It is different from the Jaccard index which only counts true positives once in both the numerator and denominator. DSC is the quotient of similarity and ranges between 0 and 1. It can be viewed as a similarity measure over sets. Similarly to the Jaccard index, the set operations can be expressed in terms of vector operations over binary vectors a and b: :

s_v = \frac

which gives the same outcome over binary vectors and also gives a more general similarity metric over vectors in general terms. For sets ''X'' and ''Y'' of keywords used in

information retrieval Information retrieval (IR) in computing and information science is the task of identifying and retrieving information system resources that are relevant to an Information needs, information need. The information need can be specified in the form ...

, the coefficient may be defined as twice the shared information (intersection) over the sum of cardinalities : When taken as a string similarity measure, the coefficient may be calculated for two strings, ''x'' and ''y'' using bigrams as follows: :

s = \frac

where ''n''_''t'' is the number of character bigrams found in both strings, ''n''_''x'' is the number of bigrams in string ''x'' and ''n''_''y'' is the number of bigrams in string ''y''. For example, to calculate the similarity between: :night :nacht We would find the set of bigrams in each word: : : Each set has four elements, and the intersection of these two sets has only one element: ht. Inserting these numbers into the formula, we calculate, ''s'' = (2 · 1) / (4 + 4) = 0.25.

Continuous Dice Coefficient

Source: For a discrete (binary) ground truth

A

and continuous measures

B

in the interval ,1 the following formula can be used:

cDC =  \frac

Where

, A \cap B,  = \Sigma_i a_ib_i

and

, B,  = \Sigma_i b_i

c can be computed as follows:

c = \frac

\Sigma_i a_i \operatorname = 0

which means no overlap between A and B, c is set to 1 arbitrarily.

Difference from Jaccard

This coefficient is not very different in form from the Jaccard index. In fact, both are equivalent in the sense that given a value for the Sørensen–Dice coefficient

S

, one can calculate the respective Jaccard index value

J

and vice versa, using the equations

J=S/(2-S)

and

S=2J/(1+J)

. Since the Sørensen–Dice coefficient does not satisfy the

triangle inequality In mathematics, the triangle inequality states that for any triangle, the sum of the lengths of any two sides must be greater than or equal to the length of the remaining side. This statement permits the inclusion of Degeneracy (mathematics)#T ...

, it can be considered a semimetric version of the Jaccard index. The function ranges between zero and one, like Jaccard. Unlike Jaccard, the corresponding difference function :

d(X, Y) = 1 -  \frac

is not a proper distance metric as it does not satisfy the triangle inequality.Gallagher, E.D., 1999
COMPAH Documentation
University of Massachusetts, Boston The simplest counterexample of this is given by the three sets

X=\

Y=\

and

Z = X \cup Y = \

. We have

d(X,Y)=1

and

d(X,Z)=d(Y,Z)=1/3

. To satisfy the triangle inequality, the sum of any two sides must be greater than or equal to that of the remaining side. However,

d(X, Z) + d(Y, Z) = 2/3 < 1 = d(X, Y)

Applications

The Sørensen–Dice coefficient is useful for ecological community data (e.g. Looman & Campbell, 1960). Justification for its use is primarily empirical rather than theoretical (although it can be justified theoretically as the intersection of two

fuzzy set Fuzzy or Fuzzies may refer to: Music * Fuzzy (band), a 1990s Boston indie pop band * Fuzzy (composer), Danish composer Jens Vilhelm Pedersen (born 1939) * Fuzzy (album), ''Fuzzy'' (album), 1993 debut album of American rock band Grant Lee Buffalo ...

s). As compared to

Euclidean distance In mathematics, the Euclidean distance between two points in Euclidean space is the length of the line segment between them. It can be calculated from the Cartesian coordinates of the points using the Pythagorean theorem, and therefore is o ...

, the Sørensen distance retains sensitivity in more heterogeneous data sets and gives less weight to outliers. Recently the Dice score (and its variations, e.g. logDice taking a logarithm of it) has become popular in computer

lexicography Lexicography is the study of lexicons and the art of compiling dictionaries. It is divided into two separate academic disciplines: * Practical lexicography is the art or craft of compiling, writing and editing dictionaries. * Theoretical le ...

for measuring the lexical association score of two given words. logDice is also used as part of the Mash Distance for genome and metagenome distance estimation Finally, Dice is used in image segmentation, in particular for comparing algorithm output against reference masks in medical applications.

Abundance version

The expression is easily extended to abundance instead of presence/absence of species. This quantitative version is known by several names: * Quantitative Sørensen–Dice index * Quantitative Sørensen index * Quantitative Dice index * Bray–Curtis similarity (1 minus the ''Bray-Curtis dissimilarity'') * Czekanowski's quantitative index * Steinhaus index * Pielou's percentage similarity * 1 minus the

Hellinger distance In probability and statistics, the Hellinger distance (closely related to, although different from, the Bhattacharyya distance) is used to quantify the similarity between two probability distributions. It is a type of ''f''-divergence. The Hell ...

* Proportion of specific agreement or positive agreement

References

External links

{{DEFAULTSORT:Sorensen-Dice coefficient Information retrieval evaluation String metrics Measure theory Similarity measures