In
probability theory and
information theory, the mutual information (MI) of two
random variables is a measure of the mutual
dependence between the two variables. More specifically, it quantifies the "
amount of information" (in
units
Unit may refer to:
Arts and entertainment
* UNIT, a fictional military organization in the science fiction television series ''Doctor Who''
* Unit of action, a discrete piece of action (or beat) in a theatrical presentation
Music
* Unit (album), ...
such as
shannons (
bit
The bit is the most basic unit of information in computing and digital communications. The name is a portmanteau of binary digit. The bit represents a logical state with one of two possible values. These values are most commonly represente ...
s),
nats or
hartleys) obtained about one random variable by observing the other random variable. The concept of mutual information is intimately linked to that of
entropy of a random variable, a fundamental notion in information theory that quantifies the expected "amount of information" held in a random variable.
Not limited to real-valued random variables and linear dependence like the
correlation coefficient
A correlation coefficient is a numerical measure of some type of correlation, meaning a statistical relationship between two variables. The variables may be two columns of a given data set of observations, often called a sample, or two components ...
, MI is more general and determines how different the
joint distribution
Given two random variables that are defined on the same probability space, the joint probability distribution is the corresponding probability distribution on all possible pairs of outputs. The joint distribution can just as well be considered ...
of the pair
is from the product of the marginal distributions of
and
. MI is the
expected value of the
pointwise mutual information (PMI).
The quantity was defined and analyzed by
Claude Shannon in his landmark paper "
A Mathematical Theory of Communication
"A Mathematical Theory of Communication" is an article by mathematician Claude E. Shannon published in ''Bell System Technical Journal'' in 1948. It was renamed ''The Mathematical Theory of Communication'' in the 1949 book of the same name, a sma ...
", although he did not call it "mutual information". This term was coined later by
Robert Fano
Roberto Mario "Robert" Fano (11 November 1917 – 13 July 2016) was an Italian-American computer scientist and professor of electrical engineering and computer science at the Massachusetts Institute of Technology. He became a student and working ...
. Mutual Information is also known as
information gain
Information is an abstract concept that refers to that which has the power to inform. At the most fundamental level information pertains to the interpretation of that which may be sensed. Any natural process that is not completely random, ...
.
Definition
Let
be a pair of random variables with values over the space
. If their joint distribution is
and the marginal distributions are
and
, the mutual information is defined as
where
is the
Kullback–Leibler divergence
In mathematical statistics, the Kullback–Leibler divergence (also called relative entropy and I-divergence), denoted D_\text(P \parallel Q), is a type of statistical distance: a measure of how one probability distribution ''P'' is different fro ...
.
Notice, as per property of the
Kullback–Leibler divergence
In mathematical statistics, the Kullback–Leibler divergence (also called relative entropy and I-divergence), denoted D_\text(P \parallel Q), is a type of statistical distance: a measure of how one probability distribution ''P'' is different fro ...
, that
is equal to zero precisely when the joint distribution coincides with the product of the marginals, i.e. when
and
are independent (and hence observing
tells you nothing about
).
is non-negative, it is a measure of the price for encoding
as a pair of independent random variables, when in reality they are not.
If the
natural logarithm
The natural logarithm of a number is its logarithm to the base of the mathematical constant , which is an irrational and transcendental number approximately equal to . The natural logarithm of is generally written as , , or sometimes, if ...
is used, the unit of mutual information is the
nat
Nat or NAT may refer to:
Computing
* Network address translation (NAT), in computer networking
Organizations
* National Actors Theatre, New York City, U.S.
* National AIDS trust, a British charity
* National Archives of Thailand
* National As ...
. If the
log base 2 is used, the unit of mutual information is the
shannon, also known as the bit. If the
log base 10 is used, the unit of mutual information is the
hartley, also known as the ban or the dit.
In terms of PMFs for discrete distributions
The mutual information of two jointly discrete random variables
and
is calculated as a double sum:
where
is the
joint probability ''mass'' function of
and
, and
and
are the
marginal probability
In probability theory and statistics, the marginal distribution of a subset of a collection of random variables is the probability distribution of the variables contained in the subset. It gives the probabilities of various values of the variables ...
mass functions of
and
respectively.
In terms of PDFs for continuous distributions
In the case of jointly continuous random variables, the double sum is replaced by a
double integral
In mathematics (specifically multivariable calculus), a multiple integral is a definite integral of a function of several real variables, for instance, or . Integrals of a function of two variables over a region in \mathbb^2 (the real-number ...
:
where
is now the joint probability ''density'' function of
and
, and
and
are the marginal probability density functions of
and
respectively.
Motivation
Intuitively, mutual information measures the information that
and
share: It measures how much knowing one of these variables reduces uncertainty about the other. For example, if
and
are independent, then knowing
does not give any information about
and vice versa, so their mutual information is zero. At the other extreme, if
is a deterministic function of
and
is a deterministic function of
then all information conveyed by
is shared with
: knowing
determines the value of
and vice versa. As a result, in this case the mutual information is the same as the uncertainty contained in
(or
) alone, namely the
entropy
Entropy is a scientific concept, as well as a measurable physical property, that is most commonly associated with a state of disorder, randomness, or uncertainty. The term and the concept are used in diverse fields, from classical thermodynam ...
of
(or
). Moreover, this mutual information is the same as the entropy of
and as the entropy of
. (A very special case of this is when
and
are the same random variable.)
Mutual information is a measure of the inherent dependence expressed in the
joint distribution
Given two random variables that are defined on the same probability space, the joint probability distribution is the corresponding probability distribution on all possible pairs of outputs. The joint distribution can just as well be considered ...
of
and
relative to the marginal distribution of
and
under the assumption of independence. Mutual information therefore measures dependence in the following sense:
if and only if
In logic and related fields such as mathematics and philosophy, "if and only if" (shortened as "iff") is a biconditional logical connective between statements, where either both statements are true or both are false.
The connective is bicondi ...
and
are independent random variables. This is easy to see in one direction: if
and
are independent, then
, and therefore:
:
Moreover, mutual information is nonnegative (i.e.
see below) and
symmetric
Symmetry (from grc, συμμετρία "agreement in dimensions, due proportion, arrangement") in everyday language refers to a sense of harmonious and beautiful proportion and balance. In mathematics, "symmetry" has a more precise definiti ...
(i.e.
see below).
Properties
Nonnegativity
Using
Jensen's inequality
In mathematics, Jensen's inequality, named after the Danish mathematician Johan Jensen, relates the value of a convex function of an integral to the integral of the convex function. It was proved by Jensen in 1906, building on an earlier pr ...
on the definition of mutual information we can show that
is non-negative, i.e.
:
Symmetry
:
The proof is given considering the relationship with entropy, as shown below.
Relation to conditional and joint entropy
Mutual information can be equivalently expressed as:
:
where
and
are the marginal
entropies,
and
are the
conditional entropies, and
is the
joint entropy
In information theory, joint entropy is a measure of the uncertainty associated with a set of variables.
Definition
The joint Shannon entropy (in bits) of two discrete random variables X and Y with images \mathcal X and \mathcal Y is defined ...
of
and
.
Notice the analogy to the union, difference, and intersection of two sets: in this respect, all the formulas given above are apparent from the Venn diagram reported at the beginning of the article.
In terms of a communication channel in which the output
is a noisy version of the input
, these relations are summarised in the figure:
Because
is non-negative, consequently,
. Here we give the detailed deduction of
for the case of jointly discrete random variables:
:
The proofs of the other identities above are similar. The proof of the general case (not just discrete) is similar, with integrals replacing sums.
Intuitively, if entropy
is regarded as a measure of uncertainty about a random variable, then
is a measure of what
does ''not'' say about
. This is "the amount of uncertainty remaining about
after
is known", and thus the right side of the second of these equalities can be read as "the amount of uncertainty in
, minus the amount of uncertainty in
which remains after
is known", which is equivalent to "the amount of uncertainty in
which is removed by knowing
". This corroborates the intuitive meaning of mutual information as the amount of information (that is, reduction in uncertainty) that knowing either variable provides about the other.
Note that in the discrete case
and therefore
. Thus
, and one can formulate the basic principle that a variable contains at least as much information about itself as any other variable can provide.
Relation to Kullback–Leibler divergence
For jointly discrete or jointly continuous pairs
, mutual information is the
Kullback–Leibler divergence
In mathematical statistics, the Kullback–Leibler divergence (also called relative entropy and I-divergence), denoted D_\text(P \parallel Q), is a type of statistical distance: a measure of how one probability distribution ''P'' is different fro ...
from the product of the
marginal distribution
In probability theory and statistics, the marginal distribution of a subset of a collection of random variables is the probability distribution of the variables contained in the subset. It gives the probabilities of various values of the varia ...
s,
, of the
joint distribution
Given two random variables that are defined on the same probability space, the joint probability distribution is the corresponding probability distribution on all possible pairs of outputs. The joint distribution can just as well be considered ...
, that is,
Furthermore, let
be the conditional mass or density function. Then, we have the identity
The proof for jointly discrete random variables is as follows:
:
Similarly this identity can be established for jointly continuous random variables.
Note that here the Kullback–Leibler divergence involves integration over the values of the random variable
only, and the expression
still denotes a random variable because
is random. Thus mutual information can also be understood as the
expectation of the Kullback–Leibler divergence of the
univariate distribution In statistics, a univariate distribution is a probability distribution of only one random variable. This is in contrast to a multivariate distribution, the probability distribution of a random vector (consisting of multiple random variables).
Examp ...
of
from the
conditional distribution
In probability theory and statistics, given two jointly distributed random variables X and Y, the conditional probability distribution of Y given X is the probability distribution of Y when X is known to be a particular value; in some cases the ...
of
given
: the more different the distributions
and
are on average, the greater the
information gain
Information is an abstract concept that refers to that which has the power to inform. At the most fundamental level information pertains to the interpretation of that which may be sensed. Any natural process that is not completely random, ...
.
Bayesian estimation of mutual information
If samples from a joint distribution are available, a Bayesian approach can be used to estimate the mutual information of that distribution. The first work to do this, which also showed how to do Bayesian estimation of many other information-theoretic properties besides mutual information, was. Subsequent researchers have rederived and extended
this analysis. See for a recent paper based on a prior specifically tailored to estimation of mutual
information per se. Besides, recently an estimation method accounting for continuous and multivariate outputs,
, was proposed in
.
Independence assumptions
The Kullback-Leibler divergence formulation of the mutual information is predicated on that one is interested in comparing
to the fully factorized
outer product
In linear algebra, the outer product of two coordinate vector
In linear algebra, a coordinate vector is a representation of a vector as an ordered list of numbers (a tuple) that describes the vector in terms of a particular ordered basis. An ea ...
. In many problems, such as
non-negative matrix factorization
Non-negative matrix factorization (NMF or NNMF), also non-negative matrix approximation is a group of algorithms in multivariate analysis and linear algebra where a matrix is factorized into (usually) two matrices and , with the property that ...
, one is interested in less extreme factorizations; specifically, one wishes to compare
to a low-rank matrix approximation in some unknown variable
; that is, to what degree one might have
:
Alternately, one might be interested in knowing how much more information
carries over its factorization. In such a case, the excess information that the full distribution
carries over the matrix factorization is given by the Kullback-Leibler divergence
:
The conventional definition of the mutual information is recovered in the extreme case that the process
has only one value for
.
Variations
Several variations on mutual information have been proposed to suit various needs. Among these are normalized variants and generalizations to more than two variables.
Metric
Many applications require a
metric
Metric or metrical may refer to:
* Metric system, an internationally adopted decimal system of measurement
* An adjective indicating relation to measurement in general, or a noun describing a specific type of measurement
Mathematics
In mathem ...
, that is, a distance measure between pairs of points. The quantity
:
satisfies the properties of a metric (
triangle inequality
In mathematics, the triangle inequality states that for any triangle, the sum of the lengths of any two sides must be greater than or equal to the length of the remaining side.
This statement permits the inclusion of degenerate triangles, but ...
,
non-negativity,
indiscernability and symmetry). This distance metric is also known as the
variation of information.
If
are discrete random variables then all the entropy terms are non-negative, so
and one can define a normalized distance
:
The metric
is a universal metric, in that if any other distance measure places
and
close by, then the
will also judge them close.
Plugging in the definitions shows that
:
This is known as the Rajski Distance. In a set-theoretic interpretation of information (see the figure for
Conditional entropy
In information theory, the conditional entropy quantifies the amount of information needed to describe the outcome of a random variable Y given that the value of another random variable X is known. Here, information is measured in shannons, na ...
), this is effectively the
Jaccard distance between
and
.
Finally,
:
is also a metric.
Conditional mutual information
Sometimes it is useful to express the mutual information of two random variables conditioned on a third.
For jointly
discrete random variable
A random variable (also called random quantity, aleatory variable, or stochastic variable) is a mathematical formalization of a quantity or object which depends on random events. It is a mapping or a function from possible outcomes (e.g., the po ...
s this takes the form
:
which can be simplified as
:
For jointly
continuous random variable
In probability theory and statistics, a probability distribution is the mathematical function that gives the probabilities of occurrence of different possible outcomes for an experiment. It is a mathematical description of a random phenomenon ...
s this takes the form
:
which can be simplified as
:
Conditioning on a third random variable may either increase or decrease the mutual information, but it is always true that
:
for discrete, jointly distributed random variables
. This result has been used as a basic building block for proving other
inequalities in information theory
Inequalities are very important in the study of information theory. There are a number of different contexts in which these inequalities appear.
Entropic inequalities
Consider a tuple X_1,X_2,\dots,X_n of n finitely (or at most countably) supp ...
.
Interaction information
Several generalizations of mutual information to more than two random variables have been proposed, such as
total correlation (or multi-information) and
dual total correlation
In information theory, dual total correlation (Han 1978), information rate (Dubnov 2006), excess entropy (Olbrich 2008), or binding information (Abdallah and Plumbley 2010) is one of several known non-negative generalizations of mutual information ...
. The expression and study of multivariate higher-degree mutual information was achieved in two seemingly independent works: McGill (1954) who called these functions "interaction information", and Hu Kuo Ting (1962).
Interaction information is defined for one variable as follows:
:
and for
:
Some authors reverse the order of the terms on the right-hand side of the preceding equation, which changes the sign when the number of random variables is odd. (And in this case, the single-variable expression becomes the negative of the entropy.) Note that
:
Multivariate statistical independence
The multivariate mutual information functions generalize the pairwise independence case that states that
if and only if
, to arbitrary numerous variable. n variables are mutually independent if and only if the
mutual information functions vanish
with
(theorem 2
). In this sense, the
can be used as a refined statistical independence criterion.
Applications
For 3 variables, Brenner et al. applied multivariate mutual information to neural coding and called its negativity "synergy" and Watkinson et al. applied it to genetic expression. For arbitrary k variables, Tapia et al. applied multivariate mutual information to gene expression.
[ It can be zero, positive, or negative.] The positivity corresponds to relations generalizing the pairwise correlations, nullity corresponds to a refined notion of independence, and negativity detects high dimensional "emergent" relations and clusterized datapoints [).
One high-dimensional generalization scheme which maximizes the mutual information between the joint distribution and other target variables is found to be useful in ]feature selection
In machine learning and statistics, feature selection, also known as variable selection, attribute selection or variable subset selection, is the process of selecting a subset of relevant features (variables, predictors) for use in model construc ...
.
Mutual information is also used in the area of signal processing as a measure of similarity
In statistics and related fields, a similarity measure or similarity function or similarity metric is a real-valued function that quantifies the similarity between two objects. Although no single definition of a similarity exists, usually such meas ...
between two signals. For example, FMI metric is an image fusion performance measure that makes use of mutual information in order to measure the amount of information that the fused image contains about the source images. The Matlab
MATLAB (an abbreviation of "MATrix LABoratory") is a proprietary multi-paradigm programming language and numeric computing environment developed by MathWorks. MATLAB allows matrix manipulations, plotting of functions and data, implementation ...
code for this metric can be found at. A python package for computing all multivariate mutual informations, conditional mutual information, joint entropies, total correlations, information distance in a dataset of n variables is available.
Directed information
Directed information
Directed information, I(X^n\to Y^n) , is an information theory measure that quantifies the information flow from the random process X^n = \ to the random process Y^n = \. The term ''directed information'' was coined by James Massey and is defined ...
, , measures the amount of information that flows from the process to , where denotes the vector and denotes . The term ''directed information'' was coined by James Massey
James Lee Massey (February 11, 1934 – June 16, 2013) was an American information theorist and
cryptographer, Professor Emeritus of Digital Technology at ETH Zurich. His notable work includes the application of the Berlekamp–Massey algorithm to ...
and is defined as
:.
Note that if , the directed information becomes the mutual information. Directed information has many applications in problems where causality
Causality (also referred to as causation, or cause and effect) is influence by which one event, process, state, or object (''a'' ''cause'') contributes to the production of another event, process, state, or object (an ''effect'') where the cau ...
plays an important role, such as capacity of channel with feedback.
Normalized variants
Normalized variants of the mutual information are provided by the ''coefficients of constraint'', uncertainty coefficient
In statistics, the uncertainty coefficient, also called proficiency, entropy coefficient or Theil's U, is a measure of nominal association. It was first introduced by Henri Theil and is based on the concept of information entropy.
Definition
Sup ...
or proficiency:
:
The two coefficients have a value ranging in , 1
The comma is a punctuation mark that appears in several variants in different languages. It has the same shape as an apostrophe or single closing quotation mark () in many typefaces, but it differs from them in being placed on the baseline (t ...
but are not necessarily equal. In some cases a symmetric measure may be desired, such as the following '' redundancy'' measure:
:
which attains a minimum of zero when the variables are independent and a maximum value of
:
when one variable becomes completely redundant with the knowledge of the other. See also ''Redundancy (information theory)
In information theory, redundancy measures the fractional difference between the entropy of an ensemble , and its maximum possible value \log(, \mathcal_X, ). Informally, it is the amount of wasted "space" used to transmit certain data. Data comp ...
''.
Another symmetrical measure is the ''symmetric uncertainty'' , given by
:
which represents the harmonic mean
In mathematics, the harmonic mean is one of several kinds of average, and in particular, one of the Pythagorean means. It is sometimes appropriate for situations when the average rate is desired.
The harmonic mean can be expressed as the recipro ...
of the two uncertainty coefficients .
If we consider mutual information as a special case of the total correlation or dual total correlation
In information theory, dual total correlation (Han 1978), information rate (Dubnov 2006), excess entropy (Olbrich 2008), or binding information (Abdallah and Plumbley 2010) is one of several known non-negative generalizations of mutual information ...
, the normalized version are respectively,
: and
This normalized version also known as Information Quality Ratio (IQR) which quantifies the amount of information of a variable based on another variable against total uncertainty:
:
There's a normalization which derives from first thinking of mutual information as an analogue to covariance
In probability theory and statistics, covariance is a measure of the joint variability of two random variables. If the greater values of one variable mainly correspond with the greater values of the other variable, and the same holds for the ...
(thus Shannon entropy
Shannon may refer to:
People
* Shannon (given name)
* Shannon (surname)
* Shannon (American singer), stage name of singer Shannon Brenda Greene (born 1958)
* Shannon (South Korean singer), British-South Korean singer and actress Shannon Arrum W ...
is analogous to variance
In probability theory and statistics, variance is the expectation of the squared deviation of a random variable from its population mean or sample mean. Variance is a measure of dispersion, meaning it is a measure of how far a set of numbers ...
). Then the normalized mutual information is calculated akin to the Pearson correlation coefficient
In statistics, the Pearson correlation coefficient (PCC, pronounced ) ― also known as Pearson's ''r'', the Pearson product-moment correlation coefficient (PPMCC), the bivariate correlation, or colloquially simply as the correlation coefficient ...
,
:
Weighted variants
In the traditional formulation of the mutual information,
:
each ''event'' or ''object'' specified by is weighted by the corresponding probability . This assumes that all objects or events are equivalent ''apart from'' their probability of occurrence. However, in some applications it may be the case that certain objects or events are more ''significant'' than others, or that certain patterns of association are more semantically important than others.
For example, the deterministic mapping may be viewed as stronger than the deterministic mapping , although these relationships would yield the same mutual information. This is because the mutual information is not sensitive at all to any inherent ordering in the variable values (, , ), and is therefore not sensitive at all to the form of the relational mapping between the associated variables. If it is desired that the former relation—showing agreement on all variable values—be judged stronger than the later relation, then it is possible to use the following ''weighted mutual information'' .
:
which places a weight on the probability of each variable value co-occurrence, . This allows that certain probabilities may carry more or less significance than others, thereby allowing the quantification of relevant ''holistic'' or '' Prägnanz'' factors. In the above example, using larger relative weights for , , and would have the effect of assessing greater ''informativeness'' for the relation than for the relation , which may be desirable in some cases of pattern recognition, and the like. This weighted mutual information is a form of weighted KL-Divergence, which is known to take negative values for some inputs, and there are examples where the weighted mutual information also takes negative values.
Adjusted mutual information
A probability distribution can be viewed as a partition of a set
In mathematics, a partition of a set is a grouping of its elements into non-empty subsets, in such a way that every element is included in exactly one subset.
Every equivalence relation on a set defines a partition of this set, and every part ...
. One may then ask: if a set were partitioned randomly, what would the distribution of probabilities be? What would the expectation value of the mutual information be? The adjusted mutual information or AMI subtracts the expectation value of the MI, so that the AMI is zero when two different distributions are random, and one when two distributions are identical. The AMI is defined in analogy to the adjusted Rand index
Adjustment may refer to:
*Adjustment (law), with several meanings
*Adjustment (psychology), the process of balancing conflicting needs
*Adjustment of observations, in mathematics, a method of solving an overdetermined system of equations
*Calibra ...
of two different partitions of a set.
Absolute mutual information
Using the ideas of Kolmogorov complexity
In algorithmic information theory (a subfield of computer science and mathematics), the Kolmogorov complexity of an object, such as a piece of text, is the length of a shortest computer program (in a predetermined programming language) that produ ...
, one can consider the mutual information of two sequences independent of any probability distribution:
:
To establish that this quantity is symmetric up to a logarithmic factor () one requires the chain rule for Kolmogorov complexity . Approximations of this quantity via compression
Compression may refer to:
Physical science
*Compression (physics), size reduction due to forces
*Compression member, a structural element such as a column
*Compressibility, susceptibility to compression
* Gas compression
*Compression ratio, of a ...
can be used to define a distance measure
Distance measures are used in physical cosmology to give a natural notion of the distance between two objects or events in the universe. They are often used to tie some ''observable'' quantity (such as the luminosity of a distant quasar, the red ...
to perform a hierarchical clustering of sequences without having any domain knowledge
Domain knowledge is knowledge of a specific, specialized discipline or field, in contrast to general (or domain-independent) knowledge. The term is often used in reference to a more general discipline—for example, in describing a software engin ...
of the sequences .
Linear correlation
Unlike correlation coefficients, such as the product moment correlation coefficient
In statistics, the Pearson correlation coefficient (PCC, pronounced ) ― also known as Pearson's ''r'', the Pearson product-moment correlation coefficient (PPMCC), the bivariate correlation, or colloquially simply as the correlation coefficien ...
, mutual information contains information about all dependence—linear and nonlinear—and not just linear dependence as the correlation coefficient measures. However, in the narrow case that the joint distribution for and is a bivariate normal distribution
In probability theory and statistics, the multivariate normal distribution, multivariate Gaussian distribution, or joint normal distribution is a generalization of the one-dimensional (univariate) normal distribution to higher dimensions. One ...
(implying in particular that both marginal distributions are normally distributed), there is an exact relationship between and the correlation coefficient .
:
The equation above can be derived as follows for a bivariate Gaussian:
:
Therefore,
:
For discrete data
When and are limited to be in a discrete number of states, observation data is summarized in a contingency table, with row variable (or ) and column variable (or ). Mutual information is one of the measures of association
Association may refer to:
*Club (organization), an association of two or more people united by a common interest or goal
*Trade association, an organization founded and funded by businesses that operate in a specific industry
*Voluntary associatio ...
or correlation
In statistics, correlation or dependence is any statistical relationship, whether causal or not, between two random variables or bivariate data. Although in the broadest sense, "correlation" may indicate any type of association, in statistics ...
between the row and column variables.
Other measures of association include Pearson's chi-squared test statistics, G-test
In statistics, ''G''-tests are likelihood ratio test, likelihood-ratio or maximum likelihood statistical significance tests that are increasingly being used in situations where chi-squared tests were previously recommended.
The general formula fo ...
statistics, etc. In fact, with the same log base, mutual information will be equal to the G-test
In statistics, ''G''-tests are likelihood ratio test, likelihood-ratio or maximum likelihood statistical significance tests that are increasingly being used in situations where chi-squared tests were previously recommended.
The general formula fo ...
log-likelihood statistic divided by , where is the sample size.
Applications
In many applications, one wants to maximize mutual information (thus increasing dependencies), which is often equivalent to minimizing conditional entropy
In information theory, the conditional entropy quantifies the amount of information needed to describe the outcome of a random variable Y given that the value of another random variable X is known. Here, information is measured in shannons, na ...
. Examples include:
* In search engine technology
A search engine is an information retrieval software program that discovers, crawls, transforms and stores information for retrieval and presentation in response to user queries.
A search engine normally consists of four components, that are sear ...
, mutual information between phrases and contexts is used as a feature for k-means clustering to discover semantic clusters (concepts).[Parsing a Natural Language Using Mutual Information Statistics]
by David M. Magerman and Mitchell P. Marcus For example, the mutual information of a bigram might be calculated as:
: where is the number of times the bigram xy appears in the corpus, is the number of times the unigram x appears in the corpus, B is the total number of bigrams, and U is the total number of unigrams.[
* In ]telecommunications
Telecommunication is the transmission of information by various types of technologies over wire, radio, optical, or other electromagnetic systems. It has its origin in the desire of humans for communication over a distance greater than that fe ...
, the channel capacity is equal to the mutual information, maximized over all input distributions.
* Discriminative training procedures for hidden Markov model
A hidden Markov model (HMM) is a statistical Markov model in which the system being modeled is assumed to be a Markov process — call it X — with unobservable ("''hidden''") states. As part of the definition, HMM requires that there be an ob ...
s have been proposed based on the maximum mutual information (MMI) criterion.
* RNA secondary structure prediction from a multiple sequence alignment
Multiple sequence alignment (MSA) may refer to the process or the result of sequence alignment of three or more biological sequences, generally protein, DNA, or RNA. In many cases, the input set of query sequences are assumed to have an evolutio ...
.
* Phylogenetic profiling Phylogenetic profiling is a bioinformatics technique in which the joint presence or joint absence of two traits across large numbers of species is used to infer a meaningful biological connection, such as involvement of two different proteins in the ...
prediction from pairwise present and disappearance of functionally link gene
In biology, the word gene (from , ; "...Wilhelm Johannsen coined the word gene to describe the Mendelian units of heredity..." meaning ''generation'' or ''birth'' or ''gender'') can have several different meanings. The Mendelian gene is a ba ...
s.
* Mutual information has been used as a criterion for feature selection
In machine learning and statistics, feature selection, also known as variable selection, attribute selection or variable subset selection, is the process of selecting a subset of relevant features (variables, predictors) for use in model construc ...
and feature transformations in machine learning
Machine learning (ML) is a field of inquiry devoted to understanding and building methods that 'learn', that is, methods that leverage data to improve performance on some set of tasks. It is seen as a part of artificial intelligence.
Machine ...
. It can be used to characterize both the relevance and redundancy of variables, such as the minimum redundancy feature selection Minimum redundancy feature selection is an algorithm frequently used in a method to accurately identify characteristics of genes and phenotypes and narrow down their relevance and is usually described in its pairing with relevant feature selection a ...
.
* Mutual information is used in determining the similarity of two different clusterings of a dataset. As such, it provides some advantages over the traditional Rand index
The RAND Corporation (from the phrase "research and development") is an American nonprofit global policy think tank created in 1948 by Douglas Aircraft Company to offer research and analysis to the United States Armed Forces. It is financed ...
.
* Mutual information of words is often used as a significance function for the computation of collocation
In corpus linguistics, a collocation is a series of words or terms that co-occur more often than would be expected by chance. In phraseology, a collocation is a type of compositional phraseme, meaning that it can be understood from the words th ...
s in corpus linguistics
Corpus linguistics is the study of language, study of a language as that language is expressed in its text corpus (plural ''corpora''), its body of "real world" text. Corpus linguistics proposes that a reliable analysis of a language is more feas ...
. This has the added complexity that no word-instance is an instance to two different words; rather, one counts instances where 2 words occur adjacent or in close proximity; this slightly complicates the calculation, since the expected probability of one word occurring within words of another, goes up with
* Mutual information is used in medical imaging
Medical imaging is the technique and process of imaging the interior of a body for clinical analysis and medical intervention, as well as visual representation of the function of some organs or tissues (physiology). Medical imaging seeks to rev ...
for image registration
Image registration is the process of transforming different sets of data into one coordinate system. Data may be multiple photographs, data from different sensors, times, depths, or viewpoints. It is used in computer vision, medical imaging, milit ...
. Given a reference image (for example, a brain scan), and a second image which needs to be put into the same coordinate system
In geometry, a coordinate system is a system that uses one or more numbers, or coordinates, to uniquely determine the position of the points or other geometric elements on a manifold such as Euclidean space. The order of the coordinates is sig ...
as the reference image, this image is deformed until the mutual information between it and the reference image is maximized.
* Detection of phase synchronization
{{no footnotes, date=June 2017
Phase synchronization is the process by which two or more cyclic signals tend to oscillate with a repeating sequence of relative phase angles.
Phase synchronisation is usually applied to two waveforms of the same fr ...
in time series
In mathematics, a time series is a series of data points indexed (or listed or graphed) in time order. Most commonly, a time series is a sequence taken at successive equally spaced points in time. Thus it is a sequence of discrete-time data. Exa ...
analysis.
* In the infomax Infomax is an optimization principle for artificial neural networks and other information processing systems. It prescribes that a function that maps a set of input values ''I'' to a set of output values ''O'' should be chosen or learned so as to ...
method for neural-net and other machine learning, including the infomax-based Independent component analysis
In signal processing, independent component analysis (ICA) is a computational method for separating a multivariate signal into additive subcomponents. This is done by assuming that at most one subcomponent is Gaussian and that the subcomponents ar ...
algorithm
* Average mutual information in delay embedding theorem
In the study of dynamical systems, a delay embedding theorem gives the conditions under which a chaotic dynamical system can be reconstructed from a sequence of observations of the state of a dynamical system. The reconstruction preserves the prop ...
is used for determining the ''embedding delay'' parameter.
* Mutual information between genes
In biology, the word gene (from , ; "...Wilhelm Johannsen coined the word gene to describe the Mendelian units of heredity..." meaning ''generation'' or ''birth'' or ''gender'') can have several different meanings. The Mendelian gene is a ba ...
in expression microarray data is used by the ARACNE algorithm for reconstruction of gene networks.
* In statistical mechanics
In physics, statistical mechanics is a mathematical framework that applies statistical methods and probability theory to large assemblies of microscopic entities. It does not assume or postulate any natural laws, but explains the macroscopic be ...
, Loschmidt's paradox
Loschmidt's paradox, also known as the reversibility paradox, irreversibility paradox or ', is the objection that it should not be possible to deduce an irreversible process from time-symmetric dynamics. This puts the time reversal symmetry of (al ...
may be expressed in terms of mutual information.Hugh Everett
Hugh Everett III (; November 11, 1930 – July 19, 1982) was an American physicist who first proposed the many-worlds interpretation (MWI) of quantum physics, which he termed his "relative state" formulation. In contrast to the then-dominant Cope ...
br>Theory of the Universal Wavefunction
Thesis, Princeton University, (1956, 1973), pp 1–140 (page 30) Loschmidt noted that it must be impossible to determine a physical law which lacks time reversal symmetry
T-symmetry or time reversal symmetry is the theoretical symmetry of physical laws under the transformation of time reversal,
: T: t \mapsto -t.
Since the second law of thermodynamics states that entropy increases as time flows toward the future ...
(e.g. the second law of thermodynamics
The second law of thermodynamics is a physical law based on universal experience concerning heat and Energy transformation, energy interconversions. One simple statement of the law is that heat always moves from hotter objects to colder objects ( ...
) only from physical laws which have this symmetry. He pointed out that the H-theorem
In classical statistical mechanics, the ''H''-theorem, introduced by Ludwig Boltzmann in 1872, describes the tendency to decrease in the quantity ''H'' (defined below) in a nearly-ideal gas of molecules.
L. Boltzmann,Weitere Studien über das Wä ...
of Boltzmann
Ludwig Eduard Boltzmann (; 20 February 1844 – 5 September 1906) was an Austrian physicist and philosopher. His greatest achievements were the development of statistical mechanics, and the statistical explanation of the second law of thermodyn ...
made the assumption that the velocities of particles in a gas were permanently uncorrelated, which removed the time symmetry inherent in the H-theorem. It can be shown that if a system is described by a probability density in phase space
In dynamical system theory, a phase space is a space in which all possible states of a system are represented, with each possible state corresponding to one unique point in the phase space. For mechanical systems, the phase space usually ...
, then Liouville's theorem implies that the joint information (negative of the joint entropy) of the distribution remains constant in time. The joint information is equal to the mutual information plus the sum of all the marginal information (negative of the marginal entropies) for each particle coordinate. Boltzmann's assumption amounts to ignoring the mutual information in the calculation of entropy, which yields the thermodynamic entropy (divided by the Boltzmann constant).
* The mutual information is used to learn the structure of Bayesian network
A Bayesian network (also known as a Bayes network, Bayes net, belief network, or decision network) is a probabilistic graphical model that represents a set of variables and their conditional dependencies via a directed acyclic graph (DAG). Bay ...
s/dynamic Bayesian network
A Dynamic Bayesian Network (DBN) is a Bayesian network (BN) which relates variables to each other over adjacent time steps. This is often called a ''Two-Timeslice'' BN (2TBN) because it says that at any point in time T, the value of a variable c ...
s, which is thought to explain the causal relationship between random variables, as exemplified by the GlobalMIT toolkit: learning the globally optimal dynamic Bayesian network with the Mutual Information Test criterion.
* The mutual information is used to quantify information transmitted during the updating procedure in the Gibbs sampling
In statistics, Gibbs sampling or a Gibbs sampler is a Markov chain Monte Carlo (MCMC) algorithm for obtaining a sequence of observations which are approximated from a specified multivariate probability distribution, when direct sampling is dif ...
algorithm.
* Popular cost function in decision tree learning
Decision tree learning is a supervised learning approach used in statistics, data mining and machine learning. In this formalism, a classification or regression decision tree is used as a predictive model to draw conclusions about a set of ob ...
.
* The mutual information is used in cosmology
Cosmology () is a branch of physics and metaphysics dealing with the nature of the universe. The term ''cosmology'' was first used in English in 1656 in Thomas Blount (lexicographer), Thomas Blount's ''Glossographia'', and in 1731 taken up in ...
to test the influence of large-scale environments on galaxy properties in the Galaxy Zoo
Galaxy Zoo is a crowdsourced astronomy project which invites people to assist in the morphological classification of large numbers of galaxies. It is an example of citizen science as it enlists the help of members of the public to help in scien ...
.
* The mutual information was used in Solar Physics
Solar physics is the branch of astrophysics that specializes in the study of the Sun. It deals with detailed measurements that are possible only for our closest star. It intersects with many disciplines of pure physics, astrophysics, and compu ...
to derive the solar differential rotation
Differential rotation is seen when different parts of a rotating object move with different angular velocities (rates of rotation) at different latitudes and/or depths of the body and/or in time. This indicates that the object is not solid. In fl ...
profile, a travel-time deviation map for sunspots, and a time–distance diagram from quiet-Sun measurements
* Used in Invariant Information Clustering to automatically train neural network classifiers and image segmenters given no labelled data.[Invariant Information Clustering for Unsupervised Image Classification and Segmentation]
by Xu Ji, Joao Henriques and Andrea Vedaldi
See also
* Data differencing
In computer science and information theory, data differencing or differential compression is producing a technical description of the difference between two sets of data – a source and a target. Formally, a data differencing algorithm takes as in ...
* Pointwise mutual information
In statistics, probability theory and information theory, pointwise mutual information (PMI), or point mutual information, is a measure of association. It compares the probability of two events occurring together to what this probability would b ...
* Quantum mutual information
In quantum information theory, quantum mutual information, or von Neumann mutual information, after John von Neumann, is a measure of correlation between subsystems of quantum state. It is the quantum mechanical analog of Shannon mutual informati ...
* Specific-information
Notes
References
*
*
*
*
*
* English translation of original in ''Uspekhi Matematicheskikh Nauk'' 12 (1): 3-52.
*
*
*
* David J. C. MacKay.
Information Theory, Inference, and Learning Algorithms
' Cambridge: Cambridge University Press, 2003. (available free online)
*
* Athanasios Papoulis. ''Probability, Random Variables, and Stochastic Processes'', second edition. New York: McGraw-Hill, 1984. ''(See Chapter 15.)''
*
*
*
*
*
{{Authority control
Information theory
Entropy and information