Entropy-mutual-information-relative-entropy-relation-diagram

In probability theory and information theory, the mutual information (MI) of two random variables is a measure of the mutual dependence between the two variables. More specifically, it quantifies the " amount of information" (in

units Unit may refer to: Arts and entertainment * UNIT, a fictional military organization in the science fiction television series ''Doctor Who'' * Unit of action, a discrete piece of action (or beat) in a theatrical presentation Music * Unit (album), ...

such as shannons (

bit The bit is the most basic unit of information in computing and digital communications. The name is a portmanteau of binary digit. The bit represents a logical state with one of two possible values. These values are most commonly represente ...

s), nats or hartleys) obtained about one random variable by observing the other random variable. The concept of mutual information is intimately linked to that of entropy of a random variable, a fundamental notion in information theory that quantifies the expected "amount of information" held in a random variable. Not limited to real-valued random variables and linear dependence like the

correlation coefficient A correlation coefficient is a numerical measure of some type of correlation, meaning a statistical relationship between two variables. The variables may be two columns of a given data set of observations, often called a sample, or two components ...

, MI is more general and determines how different the

joint distribution Given two random variables that are defined on the same probability space, the joint probability distribution is the corresponding probability distribution on all possible pairs of outputs. The joint distribution can just as well be considered ...

of the pair

(X,Y)

is from the product of the marginal distributions of

X

and

Y

. MI is the expected value of the pointwise mutual information (PMI). The quantity was defined and analyzed by Claude Shannon in his landmark paper "

A Mathematical Theory of Communication "A Mathematical Theory of Communication" is an article by mathematician Claude E. Shannon published in ''Bell System Technical Journal'' in 1948. It was renamed ''The Mathematical Theory of Communication'' in the 1949 book of the same name, a sma ...

", although he did not call it "mutual information". This term was coined later by

Robert Fano Roberto Mario "Robert" Fano (11 November 1917 – 13 July 2016) was an Italian-American computer scientist and professor of electrical engineering and computer science at the Massachusetts Institute of Technology. He became a student and working ...

. Mutual Information is also known as

information gain Information is an abstract concept that refers to that which has the power to inform. At the most fundamental level information pertains to the interpretation of that which may be sensed. Any natural process that is not completely random, ...

Definition

Let

(X,Y)

be a pair of random variables with values over the space

\mathcal\times\mathcal

. If their joint distribution is

P_

and the marginal distributions are

P_X

and

P_Y

, the mutual information is defined as where

D_

is the

Kullback–Leibler divergence In mathematical statistics, the Kullback–Leibler divergence (also called relative entropy and I-divergence), denoted D_\text(P \parallel Q), is a type of statistical distance: a measure of how one probability distribution ''P'' is different fro ...

. Notice, as per property of the

, that

I(X;Y)

is equal to zero precisely when the joint distribution coincides with the product of the marginals, i.e. when

X

and

Y

are independent (and hence observing

Y

tells you nothing about

X

I(X;Y)

is non-negative, it is a measure of the price for encoding

(X,Y)

as a pair of independent random variables, when in reality they are not. If the

natural logarithm The natural logarithm of a number is its logarithm to the base of the mathematical constant , which is an irrational and transcendental number approximately equal to . The natural logarithm of is generally written as , , or sometimes, if ...

is used, the unit of mutual information is the

nat Nat or NAT may refer to: Computing * Network address translation (NAT), in computer networking Organizations * National Actors Theatre, New York City, U.S. * National AIDS trust, a British charity * National Archives of Thailand * National As ...

. If the log base 2 is used, the unit of mutual information is the shannon, also known as the bit. If the log base 10 is used, the unit of mutual information is the hartley, also known as the ban or the dit.

In terms of PMFs for discrete distributions

The mutual information of two jointly discrete random variables

X

and

Y

is calculated as a double sum: where

P_

is the joint probability ''mass'' function of

X

and

Y

, and

P_X

and

P_Y

are the

marginal probability In probability theory and statistics, the marginal distribution of a subset of a collection of random variables is the probability distribution of the variables contained in the subset. It gives the probabilities of various values of the variables ...

mass functions of

X

and

Y

respectively.

In terms of PDFs for continuous distributions

In the case of jointly continuous random variables, the double sum is replaced by a

double integral In mathematics (specifically multivariable calculus), a multiple integral is a definite integral of a function of several real variables, for instance, or . Integrals of a function of two variables over a region in \mathbb^2 (the real-number ...

: where

P_

is now the joint probability ''density'' function of

X

and

Y

, and

P_X

and

P_Y

are the marginal probability density functions of

X

and

Y

respectively.

Motivation

Intuitively, mutual information measures the information that

X

and

Y

share: It measures how much knowing one of these variables reduces uncertainty about the other. For example, if

X

and

Y

are independent, then knowing

X

does not give any information about

Y

and vice versa, so their mutual information is zero. At the other extreme, if

X

is a deterministic function of

Y

and

Y

is a deterministic function of

X

then all information conveyed by

X

is shared with

Y

: knowing

X

determines the value of

Y

and vice versa. As a result, in this case the mutual information is the same as the uncertainty contained in

Y

(or

X

) alone, namely the

entropy Entropy is a scientific concept, as well as a measurable physical property, that is most commonly associated with a state of disorder, randomness, or uncertainty. The term and the concept are used in diverse fields, from classical thermodynam ...

Y

(or

X

). Moreover, this mutual information is the same as the entropy of

X

and as the entropy of

Y

. (A very special case of this is when

X

and

Y

are the same random variable.) Mutual information is a measure of the inherent dependence expressed in the

X

and

Y

relative to the marginal distribution of

X

and

Y

under the assumption of independence. Mutual information therefore measures dependence in the following sense:

\operatorname(X;Y) = 0

if and only if In logic and related fields such as mathematics and philosophy, "if and only if" (shortened as "iff") is a biconditional logical connective between statements, where either both statements are true or both are false. The connective is bicondi ...

X

and

Y

are independent random variables. This is easy to see in one direction: if

X

and

Y

are independent, then

p_(x,y)=p_X(x) \cdot p_Y(y)

, and therefore: :

\log = \log 1 = 0 .

Moreover, mutual information is nonnegative (i.e.

\operatorname(X;Y) \ge 0

see below) and

symmetric Symmetry (from grc, συμμετρία "agreement in dimensions, due proportion, arrangement") in everyday language refers to a sense of harmonious and beautiful proportion and balance. In mathematics, "symmetry" has a more precise definiti ...

(i.e.

\operatorname(X;Y) = \operatorname(Y;X)

see below).

Properties

Nonnegativity

Using

Jensen's inequality In mathematics, Jensen's inequality, named after the Danish mathematician Johan Jensen, relates the value of a convex function of an integral to the integral of the convex function. It was proved by Jensen in 1906, building on an earlier pr ...

on the definition of mutual information we can show that

\operatorname(X;Y)

is non-negative, i.e. :

\operatorname(X;Y) \ge 0

Symmetry

\operatorname(X;Y) = \operatorname(Y;X)

The proof is given considering the relationship with entropy, as shown below.

Relation to conditional and joint entropy

Mutual information can be equivalently expressed as: :

\begin
  \operatorname(X;Y) & \equiv \Eta(X) - \Eta(X\mid Y) \\
         & \equiv \Eta(Y) - \Eta(Y\mid X) \\
         & \equiv \Eta(X) + \Eta(Y) - \Eta(X, Y) \\
         & \equiv \Eta(X, Y) - \Eta(X\mid Y) - \Eta(Y\mid X)
\end

where

\Eta(X)

and

\Eta(Y)

are the marginal entropies,

\Eta(X\mid Y)

and

\Eta(Y\mid X)

are the conditional entropies, and

\Eta(X,Y)

is the

joint entropy In information theory, joint entropy is a measure of the uncertainty associated with a set of variables. Definition The joint Shannon entropy (in bits) of two discrete random variables X and Y with images \mathcal X and \mathcal Y is defined ...

X

and

Y

. Notice the analogy to the union, difference, and intersection of two sets: in this respect, all the formulas given above are apparent from the Venn diagram reported at the beginning of the article. In terms of a communication channel in which the output

Y

is a noisy version of the input

X

, these relations are summarised in the figure: Figchannel2017ab

Because

\operatorname(X;Y)

is non-negative, consequently,

\Eta(X) \ge \Eta(X\mid Y)

. Here we give the detailed deduction of

\operatorname(X;Y)=\Eta(Y)-\Eta(Y\mid X)

for the case of jointly discrete random variables: :

\begin
\operatorname(X;Y) &  = \sum_ p_(x,y) \log \frac\\
&  = \sum_ p_(x,y) \log \frac - \sum_ p_(x,y) \log p_Y(y)  \\

&  = \sum_ p_X(x)p_(y) \log p_(y) - \sum_ p_(x,y) \log p_Y(y) \\
&  = \sum_ p_X(x) \left(\sum_ p_(y) \log p_(y)\right) - \sum_ \left(\sum_ p_(x,y)\right) \log p_Y(y) \\
&  = -\sum_ p_X(x) \Eta(Y\mid X=x) - \sum_ p_Y(y) \log p_Y(y) \\
&  = -\Eta(Y\mid X) + \Eta(Y)  \\
&  = \Eta(Y) - \Eta(Y\mid X).  \\
\end

The proofs of the other identities above are similar. The proof of the general case (not just discrete) is similar, with integrals replacing sums. Intuitively, if entropy

\Eta(Y)

is regarded as a measure of uncertainty about a random variable, then

\Eta(Y\mid X)

is a measure of what

X

does ''not'' say about

Y

. This is "the amount of uncertainty remaining about

Y

after

X

is known", and thus the right side of the second of these equalities can be read as "the amount of uncertainty in

Y

, minus the amount of uncertainty in

Y

which remains after

X

is known", which is equivalent to "the amount of uncertainty in

Y

which is removed by knowing

X

". This corroborates the intuitive meaning of mutual information as the amount of information (that is, reduction in uncertainty) that knowing either variable provides about the other. Note that in the discrete case

\Eta(Y\mid Y) = 0

and therefore

\Eta(Y) = \operatorname(Y;Y)

. Thus

\operatorname(Y; Y) \ge \operatorname(X; Y)

, and one can formulate the basic principle that a variable contains at least as much information about itself as any other variable can provide.

Relation to Kullback–Leibler divergence

For jointly discrete or jointly continuous pairs

(X,Y)

, mutual information is the

from the product of the

marginal distribution In probability theory and statistics, the marginal distribution of a subset of a collection of random variables is the probability distribution of the variables contained in the subset. It gives the probabilities of various values of the varia ...

p_X \cdot p_Y

, of the

p_

, that is, Furthermore, let

p_(x,y) =p_(x)* p_Y(y)

be the conditional mass or density function. Then, we have the identity The proof for jointly discrete random variables is as follows: :

\end

Similarly this identity can be established for jointly continuous random variables. Note that here the Kullback–Leibler divergence involves integration over the values of the random variable

X

only, and the expression

D_\text(p_ \parallel p_X)

still denotes a random variable because

Y

is random. Thus mutual information can also be understood as the expectation of the Kullback–Leibler divergence of the

univariate distribution In statistics, a univariate distribution is a probability distribution of only one random variable. This is in contrast to a multivariate distribution, the probability distribution of a random vector (consisting of multiple random variables). Examp ...

p_X

X

from the

conditional distribution In probability theory and statistics, given two jointly distributed random variables X and Y, the conditional probability distribution of Y given X is the probability distribution of Y when X is known to be a particular value; in some cases the ...

p_

X

given

Y

: the more different the distributions

p_

and

p_X

are on average, the greater the

Bayesian estimation of mutual information

If samples from a joint distribution are available, a Bayesian approach can be used to estimate the mutual information of that distribution. The first work to do this, which also showed how to do Bayesian estimation of many other information-theoretic properties besides mutual information, was. Subsequent researchers have rederived and extended this analysis. See for a recent paper based on a prior specifically tailored to estimation of mutual information per se. Besides, recently an estimation method accounting for continuous and multivariate outputs,

Y

, was proposed in .

Independence assumptions

The Kullback-Leibler divergence formulation of the mutual information is predicated on that one is interested in comparing

p(x,y)

to the fully factorized

outer product In linear algebra, the outer product of two coordinate vector In linear algebra, a coordinate vector is a representation of a vector as an ordered list of numbers (a tuple) that describes the vector in terms of a particular ordered basis. An ea ...

p(x) \cdot p(y)

. In many problems, such as

non-negative matrix factorization Non-negative matrix factorization (NMF or NNMF), also non-negative matrix approximation is a group of algorithms in multivariate analysis and linear algebra where a matrix is factorized into (usually) two matrices and , with the property that ...

, one is interested in less extreme factorizations; specifically, one wishes to compare

p(x,y)

to a low-rank matrix approximation in some unknown variable

w

; that is, to what degree one might have :

p(x,y)\approx \sum_w p^\prime (x,w) p^(w,y)

Alternately, one might be interested in knowing how much more information

p(x,y)

carries over its factorization. In such a case, the excess information that the full distribution

p(x,y)

carries over the matrix factorization is given by the Kullback-Leibler divergence :

\operatorname_ = \sum_ \sum_
    ,

The conventional definition of the mutual information is recovered in the extreme case that the process

W

has only one value for

w

Variations

Several variations on mutual information have been proposed to suit various needs. Among these are normalized variants and generalizations to more than two variables.

Metric

Many applications require a

metric Metric or metrical may refer to: * Metric system, an internationally adopted decimal system of measurement * An adjective indicating relation to measurement in general, or a noun describing a specific type of measurement Mathematics In mathem ...

, that is, a distance measure between pairs of points. The quantity :

\begin
  d(X,Y) &= \Eta(X,Y) - \operatorname(X;Y) \\
         &= \Eta(X) + \Eta(Y) - 2\operatorname(X;Y) \\
         &= \Eta(X\mid Y) + \Eta(Y\mid X) \\
         &= 2\Eta(X,Y) - \Eta(X) - \Eta(Y)
\end

satisfies the properties of a metric (

triangle inequality In mathematics, the triangle inequality states that for any triangle, the sum of the lengths of any two sides must be greater than or equal to the length of the remaining side. This statement permits the inclusion of degenerate triangles, but ...

, non-negativity, indiscernability and symmetry). This distance metric is also known as the variation of information. If

X, Y

are discrete random variables then all the entropy terms are non-negative, so

0 \le d(X,Y) \le \Eta(X,Y)

and one can define a normalized distance :

D(X,Y) = \frac \le 1.

The metric

D

is a universal metric, in that if any other distance measure places

X

and

Y

close by, then the

D

will also judge them close. Plugging in the definitions shows that :

D(X,Y) = 1 - \frac.

This is known as the Rajski Distance. In a set-theoretic interpretation of information (see the figure for

Conditional entropy In information theory, the conditional entropy quantifies the amount of information needed to describe the outcome of a random variable Y given that the value of another random variable X is known. Here, information is measured in shannons, na ...

), this is effectively the Jaccard distance between

X

and

Y

. Finally, :

D^\prime(X, Y) = 1 - \frac

is also a metric.

Conditional mutual information

Sometimes it is useful to express the mutual information of two random variables conditioned on a third. For jointly

discrete random variable A random variable (also called random quantity, aleatory variable, or stochastic variable) is a mathematical formalization of a quantity or object which depends on random events. It is a mapping or a function from possible outcomes (e.g., the po ...

s this takes the form :

\operatorname(X;Y, Z)  =  \sum_ \sum_ \sum_
    ,

which can be simplified as :

\operatorname(X;Y, Z) = \sum_ \sum_ \sum_
      p_(x,y,z) \log \frac.

For jointly

continuous random variable In probability theory and statistics, a probability distribution is the mathematical function that gives the probabilities of occurrence of different possible outcomes for an experiment. It is a mathematical description of a random phenomenon ...

s this takes the form :

\operatorname(X;Y, Z)  =  \int_ \int_ \int_
     dx dy dz,

which can be simplified as :

\operatorname(X;Y, Z) = \int_ \int_ \int_
      p_(x,y,z) \log \frac dx dy dz.

Conditioning on a third random variable may either increase or decrease the mutual information, but it is always true that :

\operatorname(X;Y, Z) \ge 0

for discrete, jointly distributed random variables

X,Y,Z

. This result has been used as a basic building block for proving other

inequalities in information theory Inequalities are very important in the study of information theory. There are a number of different contexts in which these inequalities appear. Entropic inequalities Consider a tuple X_1,X_2,\dots,X_n of n finitely (or at most countably) supp ...

Interaction information

Several generalizations of mutual information to more than two random variables have been proposed, such as total correlation (or multi-information) and

dual total correlation In information theory, dual total correlation (Han 1978), information rate (Dubnov 2006), excess entropy (Olbrich 2008), or binding information (Abdallah and Plumbley 2010) is one of several known non-negative generalizations of mutual information ...

. The expression and study of multivariate higher-degree mutual information was achieved in two seemingly independent works: McGill (1954) who called these functions "interaction information", and Hu Kuo Ting (1962). Interaction information is defined for one variable as follows: :

\operatorname(X_1) = \Eta(X_1)

and for

n > 1,

\operatorname(X_1;\,...\,;X_n)
  = \operatorname(X_1;\,...\,;X_) 
      - \operatorname(X_1;\,...\,;X_\mid X_n).

Some authors reverse the order of the terms on the right-hand side of the preceding equation, which changes the sign when the number of random variables is odd. (And in this case, the single-variable expression becomes the negative of the entropy.) Note that :

I(X_1;\ldots;X_\mid X_) =
  \mathbb_P_ \otimes\cdots\otimes P_ )

Multivariate statistical independence

The multivariate mutual information functions generalize the pairwise independence case that states that

X_1, X_2

if and only if

I(X_1; X_2) = 0

, to arbitrary numerous variable. n variables are mutually independent if and only if the

2^n - n - 1

mutual information functions vanish

I(X_1; \ldots; X_k) = 0

with

n \ge k \ge 2

(theorem 2). In this sense, the

I(X_1; \ldots; X_k) = 0

can be used as a refined statistical independence criterion.

Applications

For 3 variables, Brenner et al. applied multivariate mutual information to neural coding and called its negativity "synergy" and Watkinson et al. applied it to genetic expression. For arbitrary k variables, Tapia et al. applied multivariate mutual information to gene expression. It can be zero, positive, or negative. The positivity corresponds to relations generalizing the pairwise correlations, nullity corresponds to a refined notion of independence, and negativity detects high dimensional "emergent" relations and clusterized datapoints ). One high-dimensional generalization scheme which maximizes the mutual information between the joint distribution and other target variables is found to be useful in

feature selection In machine learning and statistics, feature selection, also known as variable selection, attribute selection or variable subset selection, is the process of selecting a subset of relevant features (variables, predictors) for use in model construc ...

. Mutual information is also used in the area of signal processing as a

measure of similarity In statistics and related fields, a similarity measure or similarity function or similarity metric is a real-valued function that quantifies the similarity between two objects. Although no single definition of a similarity exists, usually such meas ...

between two signals. For example, FMI metric is an image fusion performance measure that makes use of mutual information in order to measure the amount of information that the fused image contains about the source images. The

Matlab MATLAB (an abbreviation of "MATrix LABoratory") is a proprietary multi-paradigm programming language and numeric computing environment developed by MathWorks. MATLAB allows matrix manipulations, plotting of functions and data, implementation ...

code for this metric can be found at. A python package for computing all multivariate mutual informations, conditional mutual information, joint entropies, total correlations, information distance in a dataset of n variables is available.

Directed information

Directed information Directed information, I(X^n\to Y^n) , is an information theory measure that quantifies the information flow from the random process X^n = \ to the random process Y^n = \. The term ''directed information'' was coined by James Massey and is defined ...

\operatorname\left(X^n \to Y^n\right)

, measures the amount of information that flows from the process

X^n

Y^n

, where

X^n

denotes the vector

X_1, X_2, ..., X_n

and

Y^n

denotes

Y_1, Y_2, ..., Y_n

. The term ''directed information'' was coined by

James Massey James Lee Massey (February 11, 1934 – June 16, 2013) was an American information theorist and cryptographer, Professor Emeritus of Digital Technology at ETH Zurich. His notable work includes the application of the Berlekamp–Massey algorithm to ...

and is defined as :

\operatorname\left(X^n \to Y^n\right)
  = \sum_^n \operatorname\left(X^i; Y_i\mid Y^\right)

. Note that if

n=1

, the directed information becomes the mutual information. Directed information has many applications in problems where

causality Causality (also referred to as causation, or cause and effect) is influence by which one event, process, state, or object (''a'' ''cause'') contributes to the production of another event, process, state, or object (an ''effect'') where the cau ...

plays an important role, such as capacity of channel with feedback.

Normalized variants

Normalized variants of the mutual information are provided by the ''coefficients of constraint'',

uncertainty coefficient In statistics, the uncertainty coefficient, also called proficiency, entropy coefficient or Theil's U, is a measure of nominal association. It was first introduced by Henri Theil and is based on the concept of information entropy. Definition Sup ...

or proficiency: :

C_ = \frac
  ~~~~\mbox~~~~ 
  C_ = \frac.

The two coefficients have a value ranging in

, 1 The comma is a punctuation mark that appears in several variants in different languages. It has the same shape as an apostrophe or single closing quotation mark () in many typefaces, but it differs from them in being placed on the baseline (t ...

but are not necessarily equal. In some cases a symmetric measure may be desired, such as the following '' redundancy'' measure: :

R = \frac

which attains a minimum of zero when the variables are independent and a maximum value of :

R_\max = \frac

when one variable becomes completely redundant with the knowledge of the other. See also ''

Redundancy (information theory) In information theory, redundancy measures the fractional difference between the entropy of an ensemble , and its maximum possible value \log(, \mathcal_X, ). Informally, it is the amount of wasted "space" used to transmit certain data. Data comp ...

''. Another symmetrical measure is the ''symmetric uncertainty'' , given by :

U(X, Y) = 2R = 2\frac

which represents the

harmonic mean In mathematics, the harmonic mean is one of several kinds of average, and in particular, one of the Pythagorean means. It is sometimes appropriate for situations when the average rate is desired. The harmonic mean can be expressed as the recipro ...

of the two uncertainty coefficients

C_, C_

. If we consider mutual information as a special case of the total correlation or

, the normalized version are respectively, :

\frac

and

\frac\; .

This normalized version also known as Information Quality Ratio (IQR) which quantifies the amount of information of a variable based on another variable against total uncertainty: :

= \frac = \frac - 1

There's a normalization which derives from first thinking of mutual information as an analogue to

covariance In probability theory and statistics, covariance is a measure of the joint variability of two random variables. If the greater values of one variable mainly correspond with the greater values of the other variable, and the same holds for the ...

(thus

Shannon entropy Shannon may refer to: People * Shannon (given name) * Shannon (surname) * Shannon (American singer), stage name of singer Shannon Brenda Greene (born 1958) * Shannon (South Korean singer), British-South Korean singer and actress Shannon Arrum W ...

is analogous to

variance In probability theory and statistics, variance is the expectation of the squared deviation of a random variable from its population mean or sample mean. Variance is a measure of dispersion, meaning it is a measure of how far a set of numbers ...

). Then the normalized mutual information is calculated akin to the

Pearson correlation coefficient In statistics, the Pearson correlation coefficient (PCC, pronounced ) ― also known as Pearson's ''r'', the Pearson product-moment correlation coefficient (PPMCC), the bivariate correlation, or colloquially simply as the correlation coefficient ...

, :

\frac\; .

Weighted variants

In the traditional formulation of the mutual information, :

\operatorname(X;Y) 
  = \sum_ \sum_ p(x, y) \log \frac,

each ''event'' or ''object'' specified by

(x, y)

is weighted by the corresponding probability

p(x, y)

. This assumes that all objects or events are equivalent ''apart from'' their probability of occurrence. However, in some applications it may be the case that certain objects or events are more ''significant'' than others, or that certain patterns of association are more semantically important than others. For example, the deterministic mapping

\

may be viewed as stronger than the deterministic mapping

\

, although these relationships would yield the same mutual information. This is because the mutual information is not sensitive at all to any inherent ordering in the variable values (, , ), and is therefore not sensitive at all to the form of the relational mapping between the associated variables. If it is desired that the former relation—showing agreement on all variable values—be judged stronger than the later relation, then it is possible to use the following ''weighted mutual information'' . :

\operatorname(X;Y) 
  = \sum_ \sum_ w(x,y) p(x,y) \log \frac,

which places a weight

w(x,y)

on the probability of each variable value co-occurrence,

p(x,y)

. This allows that certain probabilities may carry more or less significance than others, thereby allowing the quantification of relevant ''holistic'' or '' Prägnanz'' factors. In the above example, using larger relative weights for

w(1,1)

w(2,2)

, and

w(3,3)

would have the effect of assessing greater ''informativeness'' for the relation

\

than for the relation

\

, which may be desirable in some cases of pattern recognition, and the like. This weighted mutual information is a form of weighted KL-Divergence, which is known to take negative values for some inputs, and there are examples where the weighted mutual information also takes negative values.

Adjusted mutual information

A probability distribution can be viewed as a

partition of a set In mathematics, a partition of a set is a grouping of its elements into non-empty subsets, in such a way that every element is included in exactly one subset. Every equivalence relation on a set defines a partition of this set, and every part ...

. One may then ask: if a set were partitioned randomly, what would the distribution of probabilities be? What would the expectation value of the mutual information be? The adjusted mutual information or AMI subtracts the expectation value of the MI, so that the AMI is zero when two different distributions are random, and one when two distributions are identical. The AMI is defined in analogy to the

adjusted Rand index Adjustment may refer to: *Adjustment (law), with several meanings *Adjustment (psychology), the process of balancing conflicting needs *Adjustment of observations, in mathematics, a method of solving an overdetermined system of equations *Calibra ...

of two different partitions of a set.

Absolute mutual information

Using the ideas of

Kolmogorov complexity In algorithmic information theory (a subfield of computer science and mathematics), the Kolmogorov complexity of an object, such as a piece of text, is the length of a shortest computer program (in a predetermined programming language) that produ ...

, one can consider the mutual information of two sequences independent of any probability distribution: :

\operatorname_K(X;Y) = K(X) - K(X\mid Y).

To establish that this quantity is symmetric up to a logarithmic factor (

\operatorname_K(X;Y) \approx \operatorname_K(Y;X)

) one requires the chain rule for Kolmogorov complexity . Approximations of this quantity via

compression Compression may refer to: Physical science *Compression (physics), size reduction due to forces *Compression member, a structural element such as a column *Compressibility, susceptibility to compression * Gas compression *Compression ratio, of a ...

can be used to define a

distance measure Distance measures are used in physical cosmology to give a natural notion of the distance between two objects or events in the universe. They are often used to tie some ''observable'' quantity (such as the luminosity of a distant quasar, the red ...

to perform a hierarchical clustering of sequences without having any

domain knowledge Domain knowledge is knowledge of a specific, specialized discipline or field, in contrast to general (or domain-independent) knowledge. The term is often used in reference to a more general discipline—for example, in describing a software engin ...

of the sequences .

Linear correlation

Unlike correlation coefficients, such as the

product moment correlation coefficient In statistics, the Pearson correlation coefficient (PCC, pronounced ) ― also known as Pearson's ''r'', the Pearson product-moment correlation coefficient (PPMCC), the bivariate correlation, or colloquially simply as the correlation coefficien ...

, mutual information contains information about all dependence—linear and nonlinear—and not just linear dependence as the correlation coefficient measures. However, in the narrow case that the joint distribution for

X

and

Y

is a

bivariate normal distribution In probability theory and statistics, the multivariate normal distribution, multivariate Gaussian distribution, or joint normal distribution is a generalization of the one-dimensional (univariate) normal distribution to higher dimensions. One ...

(implying in particular that both marginal distributions are normally distributed), there is an exact relationship between

\operatorname

and the correlation coefficient

\rho

. :

\operatorname = -\frac \log\left(1 - \rho^2\right)

The equation above can be derived as follows for a bivariate Gaussian: :

= 1 + \log(2\pi) + \log\left(\sigma_1 \sigma_2\right) + \frac\log\left(1 - \rho^2\right) \\ \end

Therefore, :

\operatorname\left(X_1; X_2\right) 
  = \Eta\left(X_1\right) + \Eta\left(X_2\right) - \Eta\left(X_1, X_2\right) 
  = -\frac\log\left(1 - \rho^2\right)

For discrete data

When

X

and

Y

are limited to be in a discrete number of states, observation data is summarized in a contingency table, with row variable

X

(or

i

) and column variable

Y

(or

j

). Mutual information is one of the measures of

association Association may refer to: *Club (organization), an association of two or more people united by a common interest or goal *Trade association, an organization founded and funded by businesses that operate in a specific industry *Voluntary associatio ...

correlation In statistics, correlation or dependence is any statistical relationship, whether causal or not, between two random variables or bivariate data. Although in the broadest sense, "correlation" may indicate any type of association, in statistics ...

between the row and column variables. Other measures of association include Pearson's chi-squared test statistics,

G-test In statistics, ''G''-tests are likelihood ratio test, likelihood-ratio or maximum likelihood statistical significance tests that are increasingly being used in situations where chi-squared tests were previously recommended. The general formula fo ...

statistics, etc. In fact, with the same log base, mutual information will be equal to the

log-likelihood statistic divided by

2N

, where

N

is the sample size.

Applications

In many applications, one wants to maximize mutual information (thus increasing dependencies), which is often equivalent to minimizing

conditional entropy In information theory, the conditional entropy quantifies the amount of information needed to describe the outcome of a random variable Y given that the value of another random variable X is known. Here, information is measured in shannons, na ...

. Examples include: * In

search engine technology A search engine is an information retrieval software program that discovers, crawls, transforms and stores information for retrieval and presentation in response to user queries. A search engine normally consists of four components, that are sear ...

, mutual information between phrases and contexts is used as a feature for k-means clustering to discover semantic clusters (concepts).Parsing a Natural Language Using Mutual Information Statistics
by David M. Magerman and Mitchell P. Marcus For example, the mutual information of a bigram might be calculated as: : where

f_

is the number of times the bigram xy appears in the corpus,

f_

is the number of times the unigram x appears in the corpus, B is the total number of bigrams, and U is the total number of unigrams. * In

telecommunications Telecommunication is the transmission of information by various types of technologies over wire, radio, optical, or other electromagnetic systems. It has its origin in the desire of humans for communication over a distance greater than that fe ...

, the channel capacity is equal to the mutual information, maximized over all input distributions. * Discriminative training procedures for

hidden Markov model A hidden Markov model (HMM) is a statistical Markov model in which the system being modeled is assumed to be a Markov process — call it X — with unobservable ("''hidden''") states. As part of the definition, HMM requires that there be an ob ...

s have been proposed based on the maximum mutual information (MMI) criterion. * RNA secondary structure prediction from a

multiple sequence alignment Multiple sequence alignment (MSA) may refer to the process or the result of sequence alignment of three or more biological sequences, generally protein, DNA, or RNA. In many cases, the input set of query sequences are assumed to have an evolutio ...

. *

Phylogenetic profiling Phylogenetic profiling is a bioinformatics technique in which the joint presence or joint absence of two traits across large numbers of species is used to infer a meaningful biological connection, such as involvement of two different proteins in the ...

prediction from pairwise present and disappearance of functionally link

gene In biology, the word gene (from , ; "...Wilhelm Johannsen coined the word gene to describe the Mendelian units of heredity..." meaning ''generation'' or ''birth'' or ''gender'') can have several different meanings. The Mendelian gene is a ba ...

s. * Mutual information has been used as a criterion for

and feature transformations in

machine learning Machine learning (ML) is a field of inquiry devoted to understanding and building methods that 'learn', that is, methods that leverage data to improve performance on some set of tasks. It is seen as a part of artificial intelligence. Machine ...

. It can be used to characterize both the relevance and redundancy of variables, such as the

minimum redundancy feature selection Minimum redundancy feature selection is an algorithm frequently used in a method to accurately identify characteristics of genes and phenotypes and narrow down their relevance and is usually described in its pairing with relevant feature selection a ...

. * Mutual information is used in determining the similarity of two different clusterings of a dataset. As such, it provides some advantages over the traditional

Rand index The RAND Corporation (from the phrase "research and development") is an American nonprofit global policy think tank created in 1948 by Douglas Aircraft Company to offer research and analysis to the United States Armed Forces. It is financed ...

. * Mutual information of words is often used as a significance function for the computation of

collocation In corpus linguistics, a collocation is a series of words or terms that co-occur more often than would be expected by chance. In phraseology, a collocation is a type of compositional phraseme, meaning that it can be understood from the words th ...

s in

corpus linguistics Corpus linguistics is the study of language, study of a language as that language is expressed in its text corpus (plural ''corpora''), its body of "real world" text. Corpus linguistics proposes that a reliable analysis of a language is more feas ...

. This has the added complexity that no word-instance is an instance to two different words; rather, one counts instances where 2 words occur adjacent or in close proximity; this slightly complicates the calculation, since the expected probability of one word occurring within

N

words of another, goes up with

N

* Mutual information is used in

medical imaging Medical imaging is the technique and process of imaging the interior of a body for clinical analysis and medical intervention, as well as visual representation of the function of some organs or tissues (physiology). Medical imaging seeks to rev ...

for

image registration Image registration is the process of transforming different sets of data into one coordinate system. Data may be multiple photographs, data from different sensors, times, depths, or viewpoints. It is used in computer vision, medical imaging, milit ...

. Given a reference image (for example, a brain scan), and a second image which needs to be put into the same

coordinate system In geometry, a coordinate system is a system that uses one or more numbers, or coordinates, to uniquely determine the position of the points or other geometric elements on a manifold such as Euclidean space. The order of the coordinates is sig ...

as the reference image, this image is deformed until the mutual information between it and the reference image is maximized. * Detection of

phase synchronization {{no footnotes, date=June 2017 Phase synchronization is the process by which two or more cyclic signals tend to oscillate with a repeating sequence of relative phase angles. Phase synchronisation is usually applied to two waveforms of the same fr ...

time series In mathematics, a time series is a series of data points indexed (or listed or graphed) in time order. Most commonly, a time series is a sequence taken at successive equally spaced points in time. Thus it is a sequence of discrete-time data. Exa ...

analysis. * In the

infomax Infomax is an optimization principle for artificial neural networks and other information processing systems. It prescribes that a function that maps a set of input values ''I'' to a set of output values ''O'' should be chosen or learned so as to ...

method for neural-net and other machine learning, including the infomax-based

Independent component analysis In signal processing, independent component analysis (ICA) is a computational method for separating a multivariate signal into additive subcomponents. This is done by assuming that at most one subcomponent is Gaussian and that the subcomponents ar ...

algorithm * Average mutual information in

delay embedding theorem In the study of dynamical systems, a delay embedding theorem gives the conditions under which a chaotic dynamical system can be reconstructed from a sequence of observations of the state of a dynamical system. The reconstruction preserves the prop ...

is used for determining the ''embedding delay'' parameter. * Mutual information between

genes In biology, the word gene (from , ; "...Wilhelm Johannsen coined the word gene to describe the Mendelian units of heredity..." meaning ''generation'' or ''birth'' or ''gender'') can have several different meanings. The Mendelian gene is a ba ...

in expression microarray data is used by the ARACNE algorithm for reconstruction of gene networks. * In

statistical mechanics In physics, statistical mechanics is a mathematical framework that applies statistical methods and probability theory to large assemblies of microscopic entities. It does not assume or postulate any natural laws, but explains the macroscopic be ...

Loschmidt's paradox Loschmidt's paradox, also known as the reversibility paradox, irreversibility paradox or ', is the objection that it should not be possible to deduce an irreversible process from time-symmetric dynamics. This puts the time reversal symmetry of (al ...

may be expressed in terms of mutual information.

Hugh Everett Hugh Everett III (; November 11, 1930 – July 19, 1982) was an American physicist who first proposed the many-worlds interpretation (MWI) of quantum physics, which he termed his "relative state" formulation. In contrast to the then-dominant Cope ...

br>Theory of the Universal Wavefunction
Thesis, Princeton University, (1956, 1973), pp 1–140 (page 30) Loschmidt noted that it must be impossible to determine a physical law which lacks

time reversal symmetry T-symmetry or time reversal symmetry is the theoretical symmetry of physical laws under the transformation of time reversal, : T: t \mapsto -t. Since the second law of thermodynamics states that entropy increases as time flows toward the future ...

(e.g. the

second law of thermodynamics The second law of thermodynamics is a physical law based on universal experience concerning heat and Energy transformation, energy interconversions. One simple statement of the law is that heat always moves from hotter objects to colder objects ( ...

) only from physical laws which have this symmetry. He pointed out that the

H-theorem In classical statistical mechanics, the ''H''-theorem, introduced by Ludwig Boltzmann in 1872, describes the tendency to decrease in the quantity ''H'' (defined below) in a nearly-ideal gas of molecules. L. Boltzmann,Weitere Studien über das Wä ...

Boltzmann Ludwig Eduard Boltzmann (; 20 February 1844 – 5 September 1906) was an Austrian physicist and philosopher. His greatest achievements were the development of statistical mechanics, and the statistical explanation of the second law of thermodyn ...

made the assumption that the velocities of particles in a gas were permanently uncorrelated, which removed the time symmetry inherent in the H-theorem. It can be shown that if a system is described by a probability density in

phase space In dynamical system theory, a phase space is a space in which all possible states of a system are represented, with each possible state corresponding to one unique point in the phase space. For mechanical systems, the phase space usually ...

, then Liouville's theorem implies that the joint information (negative of the joint entropy) of the distribution remains constant in time. The joint information is equal to the mutual information plus the sum of all the marginal information (negative of the marginal entropies) for each particle coordinate. Boltzmann's assumption amounts to ignoring the mutual information in the calculation of entropy, which yields the thermodynamic entropy (divided by the Boltzmann constant). * The mutual information is used to learn the structure of

Bayesian network A Bayesian network (also known as a Bayes network, Bayes net, belief network, or decision network) is a probabilistic graphical model that represents a set of variables and their conditional dependencies via a directed acyclic graph (DAG). Bay ...

dynamic Bayesian network A Dynamic Bayesian Network (DBN) is a Bayesian network (BN) which relates variables to each other over adjacent time steps. This is often called a ''Two-Timeslice'' BN (2TBN) because it says that at any point in time T, the value of a variable c ...

s, which is thought to explain the causal relationship between random variables, as exemplified by the GlobalMIT toolkit: learning the globally optimal dynamic Bayesian network with the Mutual Information Test criterion. * The mutual information is used to quantify information transmitted during the updating procedure in the

Gibbs sampling In statistics, Gibbs sampling or a Gibbs sampler is a Markov chain Monte Carlo (MCMC) algorithm for obtaining a sequence of observations which are approximated from a specified multivariate probability distribution, when direct sampling is dif ...

algorithm. * Popular cost function in

decision tree learning Decision tree learning is a supervised learning approach used in statistics, data mining and machine learning. In this formalism, a classification or regression decision tree is used as a predictive model to draw conclusions about a set of ob ...

. * The mutual information is used in

cosmology Cosmology () is a branch of physics and metaphysics dealing with the nature of the universe. The term ''cosmology'' was first used in English in 1656 in Thomas Blount (lexicographer), Thomas Blount's ''Glossographia'', and in 1731 taken up in ...

to test the influence of large-scale environments on galaxy properties in the

Galaxy Zoo Galaxy Zoo is a crowdsourced astronomy project which invites people to assist in the morphological classification of large numbers of galaxies. It is an example of citizen science as it enlists the help of members of the public to help in scien ...

. * The mutual information was used in

Solar Physics Solar physics is the branch of astrophysics that specializes in the study of the Sun. It deals with detailed measurements that are possible only for our closest star. It intersects with many disciplines of pure physics, astrophysics, and compu ...

to derive the solar

differential rotation Differential rotation is seen when different parts of a rotating object move with different angular velocities (rates of rotation) at different latitudes and/or depths of the body and/or in time. This indicates that the object is not solid. In fl ...

profile, a travel-time deviation map for sunspots, and a time–distance diagram from quiet-Sun measurements * Used in Invariant Information Clustering to automatically train neural network classifiers and image segmenters given no labelled data.Invariant Information Clustering for Unsupervised Image Classification and Segmentation
by Xu Ji, Joao Henriques and Andrea Vedaldi

Notes

References

* * * * * * English translation of original in ''Uspekhi Matematicheskikh Nauk'' 12 (1): 3-52. * * * * David J. C. MacKay.
Information Theory, Inference, and Learning Algorithms
' Cambridge: Cambridge University Press, 2003. (available free online) * * Athanasios Papoulis. ''Probability, Random Variables, and Stochastic Processes'', second edition. New York: McGraw-Hill, 1984. ''(See Chapter 15.)'' * * * * * {{Authority control Information theory Entropy and information