Variation of information

In probability theory and information theory, the variation of information or shared information distance is a measure of the distance between two clusterings (partitions of elements). It is closely related to mutual information; indeed, it is a simple linear expression involving the mutual information. Unlike the mutual information, however, the variation of information is a true metric, in that it obeys the triangle inequality.[1]


Definition

Suppose we have two partitions X and Y of a set A into disjoint subsets, namely X = \{X_1, X_2, \ldots, X_k\} and Y = \{Y_1, Y_2, \ldots, Y_l\}.

Let:

:n = \sum_i |X_i| = \sum_j |Y_j| = |A|
:p_i = |X_i| / n and q_j = |Y_j| / n
:r_{ij} = |X_i \cap Y_j| / n

Then the variation of information between the two partitions is:

:\mathrm{VI}(X; Y) = -\sum_{i,j} r_{ij} \left[ \log(r_{ij}/p_i) + \log(r_{ij}/q_j) \right].

This is equivalent to the shared information distance between the random variables i and j with respect to the uniform probability measure on A defined by \mu(B) := |B|/n for B \subseteq A.
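
For concreteness, here is a minimal Python sketch (not part of the original article; the function name and the representation of partitions as lists of disjoint sets are illustrative assumptions) that computes the variation of information directly from the definition above:

 from math import log
 
 def variation_of_information(X, Y):
     """Variation of information between two partitions of the same set,
     each given as a list of disjoint Python sets."""
     n = float(sum(len(x) for x in X))          # n = |A|
     vi = 0.0
     for x in X:
         p = len(x) / n                         # p_i = |X_i| / n
         for y in Y:
             q = len(y) / n                     # q_j = |Y_j| / n
             r = len(x & y) / n                 # r_ij = |X_i ∩ Y_j| / n
             if r > 0.0:                        # empty intersections contribute nothing
                 vi -= r * (log(r / p) + log(r / q))
     return vi
 
 # Example: two partitions of {1, ..., 6}
 X = [{1, 2, 3}, {4, 5, 6}]
 Y = [{1, 2}, {3, 4}, {5, 6}]
 print(variation_of_information(X, Y))          # ≈ 0.868 (natural logarithm)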


Explicit information content

We can rewrite this definition in terms that explicitly highlight the information content of this metric. The set of all partitions of a set forms a compact lattice in which the partial order induces two operations, the meet \wedge and the join \vee, where the maximum \overline{1} is the partition with only one block, i.e., all elements grouped together, and the minimum is \overline{0}, the partition consisting of all elements as singletons. The meet of two partitions X and Y is easy to understand as the partition formed by all pairwise intersections of one block X_i of X and one block Y_j of Y. It then follows that X \wedge Y \subseteq X and X \wedge Y \subseteq Y. Let us define the entropy of a partition X as
:H(X) = -\sum_i p_i \log p_i,
where p_i = |X_i|/n. Clearly, H(\overline{1}) = 0 and H(\overline{0}) = \log n. The entropy of a partition is a monotonic function on the lattice of partitions, in the sense that X \subseteq Y \Rightarrow H(X) \geq H(Y). Then the VI distance between X and Y is given by
:\mathrm{VI}(X, Y) = 2H(X \wedge Y) - H(X) - H(Y).
The difference d(X, Y) \equiv |H(X) - H(Y)| is a pseudo-metric, as d(X, Y) = 0 does not necessarily imply that X = Y. From the definition of \overline{1}, it follows that \mathrm{VI}(X, \overline{1}) = H(X). If in the Hasse diagram we draw an edge from every partition to the maximum \overline{1} and assign it a weight equal to the VI distance between the given partition and \overline{1}, we can interpret the VI distance as essentially an average of differences of edge weights to the maximum:
:\mathrm{VI}(X, Y) = |\mathrm{VI}(X, \overline{1}) - \mathrm{VI}(X \wedge Y, \overline{1})| + |\mathrm{VI}(Y, \overline{1}) - \mathrm{VI}(X \wedge Y, \overline{1})| = d(X, X \wedge Y) + d(Y, X \wedge Y).
For H(X) as defined above, it holds that the joint information of two partitions coincides with the entropy of the meet
:H(X, Y) = H(X \wedge Y)
and we also have that d(X, X \wedge Y) = H(X \wedge Y \mid X) coincides with the conditional entropy of the meet (intersection) X \wedge Y relative to X.
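
As a sketch of the lattice formulation (under the same illustrative assumptions as the snippet above; the helper names entropy, meet, and vi_from_meet are hypothetical), the same distance can be obtained from the entropy of the meet partition:

 from math import log
 
 def entropy(P, n):
     """Entropy H(P) of a partition P of an n-element set (uniform measure)."""
     return -sum((len(b) / n) * log(len(b) / n) for b in P if b)
 
 def meet(X, Y):
     """Meet X ∧ Y: all non-empty pairwise intersections of blocks."""
     return [x & y for x in X for y in Y if x & y]
 
 def vi_from_meet(X, Y):
     """VI(X, Y) = 2 H(X ∧ Y) - H(X) - H(Y)."""
     n = float(sum(len(b) for b in X))
     return 2 * entropy(meet(X, Y), n) - entropy(X, n) - entropy(Y, n)
 
 X = [{1, 2, 3}, {4, 5, 6}]
 Y = [{1, 2}, {3, 4}, {5, 6}]
 print(vi_from_meet(X, Y))   # ≈ 0.868, agreeing with the direct definition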


Identities

The variation of information satisfies

:\mathrm{VI}(X; Y) = H(X) + H(Y) - 2I(X, Y),

where H(X) is the entropy of X, and I(X, Y) is the mutual information between X and Y with respect to the uniform probability measure on A. This can be rewritten as

:\mathrm{VI}(X; Y) = H(X, Y) - I(X, Y),

where H(X, Y) is the joint entropy of X and Y, or

:\mathrm{VI}(X; Y) = H(X \mid Y) + H(Y \mid X),

where H(X \mid Y) and H(Y \mid X) are the respective conditional entropies.

The variation of information can also be bounded, either in terms of the number of elements:

:\mathrm{VI}(X; Y) \leq \log(n),

or with respect to a maximum number of clusters K^*:

:\mathrm{VI}(X; Y) \leq 2 \log(K^*).
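
A short numerical check of these identities and bounds (again a sketch under the same assumptions as the snippets above; natural logarithms, so entropies are in nats):

 from math import log
 
 def entropy(P, n):
     """Entropy of a partition P of an n-element set under the uniform measure."""
     return -sum((len(b) / n) * log(len(b) / n) for b in P if b)
 
 def mutual_information(X, Y, n):
     """I(X, Y) between two partitions under the uniform measure."""
     mi = 0.0
     for x in X:
         for y in Y:
             r = len(x & y) / n
             if r > 0.0:
                 mi += r * log(r / ((len(x) / n) * (len(y) / n)))
     return mi
 
 X = [{1, 2, 3}, {4, 5, 6}]
 Y = [{1, 2}, {3, 4}, {5, 6}]
 n = 6.0
 vi = entropy(X, n) + entropy(Y, n) - 2 * mutual_information(X, Y, n)
 assert vi <= log(n) + 1e-12        # VI(X; Y) <= log(n)
 assert vi <= 2 * log(3) + 1e-12    # K* = 3 clusters: VI(X; Y) <= 2 log(K*)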


References

1. Marina Meila, "Comparing Clusterings by the Variation of Information", Learning Theory and Kernel Machines, Lecture Notes in Computer Science, vol. 2777, 2003, pp. 173–187.

External links


* Partanalyzer: includes a C++ implementation of VI and other metrics and indices for analyzing partitions and clusterings
* C++ implementation with MATLAB mex files