In mathematical statistics, the Kullback–Leibler divergence (also called relative entropy and I-divergence), denoted D_\text{KL}(P \parallel Q), is a type of statistical distance: a measure of how one probability distribution ''P'' is different from a second, reference probability distribution ''Q''. A simple interpretation of the KL divergence of ''P'' from ''Q'' is the expected excess surprise from using ''Q'' as a model when the actual distribution is ''P''. While it is a distance, it is not a metric, the most familiar type of distance: it is not symmetric in the two distributions (in contrast to variation of information), and it does not satisfy the triangle inequality. Instead, in terms of information geometry, it is a type of divergence, a generalization of squared distance, and for certain classes of distributions (notably an exponential family), it satisfies a generalized Pythagorean theorem (which applies to squared distances).

In the simple case, a relative entropy of 0 indicates that the two distributions in question have identical quantities of information. Relative entropy is a non-negative function of two distributions or measures. It has diverse applications, both theoretical, such as characterizing the relative (Shannon) entropy in information systems, randomness in continuous time series, and information gain when comparing statistical models of inference; and practical, such as applied statistics, fluid mechanics, neuroscience and bioinformatics.
Introduction and context
Consider two probability distributions P and Q. Usually, P represents the data, the observations, or a measured probability distribution. Distribution Q instead represents a theory, a model, a description or an approximation of P. The Kullback–Leibler divergence D_\text{KL}(P \parallel Q) is then interpreted as the average difference of the number of bits required for encoding samples of P using a code optimized for Q rather than one optimized for P. Note that the roles of P and Q can be reversed in some situations where that is easier to compute, such as with the expectation–maximization (EM) algorithm and evidence lower bound (ELBO) computations.
Etymology
The relative entropy was introduced by Solomon Kullback and Richard Leibler in 1951 as "the mean information for discrimination between H_1 and H_2 per observation from \mu_1", where one is comparing two probability measures \mu_1, \mu_2, and H_1, H_2 are the hypotheses that one is selecting from measure \mu_1, \mu_2 (respectively). They denoted this by I(1:2), and defined the "divergence" between \mu_1 and \mu_2 as the symmetrized quantity J(1,2) = I(1:2) + I(2:1), which had already been defined and used by Harold Jeffreys in 1948. In Kullback's 1959 book ''Information Theory and Statistics'', the symmetrized form is again referred to as the "divergence", and the relative entropies in each direction are referred to as "directed divergences" between two distributions; Kullback preferred the term discrimination information.
The term "divergence" is in contrast to a distance (metric), since the symmetrized divergence does not satisfy the triangle inequality. Numerous references to earlier uses of the symmetrized divergence and to other statistical distances are given in Kullback (1959). The asymmetric "directed divergence" has come to be known as the Kullback–Leibler divergence, while the symmetrized "divergence" is now referred to as the Jeffreys divergence.
Definition
For discrete probability distributions P and Q defined on the same probability space, \mathcal{X}, the relative entropy from Q to P is defined to be

:D_\text{KL}(P \parallel Q) = \sum_{x \in \mathcal{X}} P(x) \log\frac{P(x)}{Q(x)},

which is equivalent to

:D_\text{KL}(P \parallel Q) = -\sum_{x \in \mathcal{X}} P(x) \log\frac{Q(x)}{P(x)}.

In other words, it is the expectation of the logarithmic difference between the probabilities P and Q, where the expectation is taken using the probabilities P.

Relative entropy is defined in this way only if, for all x, Q(x) = 0 implies P(x) = 0 (absolute continuity). Otherwise it is often defined as +\infty, but the value +\infty is possible even if Q(x) \neq 0 everywhere, provided that the space \mathcal{X} is infinite. Analogous comments apply to the continuous and general measure cases defined below. Whenever P(x) is zero, the contribution of the corresponding term is interpreted as zero because

:\lim_{x \to 0^{+}} x \log x = 0.

For distributions P and Q of a continuous random variable, relative entropy is defined to be the integral

:D_\text{KL}(P \parallel Q) = \int_{-\infty}^{\infty} p(x) \log\frac{p(x)}{q(x)}\, dx,

where p and q denote the probability densities of P and Q.

More generally, if P and Q are probability measures on a measurable space \mathcal{X}, and P is absolutely continuous with respect to Q, then the relative entropy from Q to P is defined as

:D_\text{KL}(P \parallel Q) = \int_{\mathcal{X}} \log\frac{dP}{dQ}\, dP,

where \frac{dP}{dQ} is the Radon–Nikodym derivative of P with respect to Q, i.e. the unique Q-almost-everywhere defined function r on \mathcal{X} such that dP = r\, dQ, which exists because P is absolutely continuous with respect to Q. We also assume that the expression on the right-hand side exists. Equivalently (by the chain rule), this can be written as

:D_\text{KL}(P \parallel Q) = \int_{\mathcal{X}} \frac{dP}{dQ} \log\frac{dP}{dQ}\, dQ,

which is the entropy of P relative to Q. Continuing in this case, if \mu is any measure on \mathcal{X} for which densities p and q with dP = p\, d\mu and dQ = q\, d\mu exist (meaning that P and Q are both absolutely continuous with respect to \mu), then the relative entropy from Q to P is given as

:D_\text{KL}(P \parallel Q) = \int_{\mathcal{X}} p \log\frac{p}{q}\, d\mu.

Note that such a measure \mu for which densities can be defined always exists, since one can take \mu = \tfrac{1}{2}(P + Q), although in practice it will usually be one that arises naturally in the context, like the counting measure for discrete distributions, or the Lebesgue measure, or a convenient variant thereof such as the Gaussian measure, the uniform measure on the sphere, or the Haar measure on a Lie group, for continuous distributions.
The logarithms in these formulae are usually taken to base 2 if information is measured in units of bits, or to base ''e'' if information is measured in nats. Most formulas involving relative entropy hold regardless of the base of the logarithm.
Various conventions exist for referring to D_\text{KL}(P \parallel Q) in words. Often it is referred to as the divergence ''between'' P and Q, but this fails to convey the fundamental asymmetry in the relation. Sometimes, as in this article, it may be described as the divergence of P ''from'' Q or as the divergence ''from'' Q ''to'' P. This reflects the asymmetry in Bayesian inference, which starts ''from'' a prior Q and updates ''to'' the posterior P. Another common way to refer to D_\text{KL}(P \parallel Q) is as the relative entropy of P ''with respect to'' Q, or the information gain from P over Q.
Basic example
Kullback gives the following example (Table 2.1, Example 2.1). Let P and Q be the distributions shown in the table and figure. P is the distribution on the left side of the figure, a binomial distribution with N = 2 and p = 0.4. Q is the distribution on the right side of the figure, a discrete uniform distribution with the three possible outcomes x = 0, 1, 2 (i.e. \mathcal{X} = \{0, 1, 2\}), each with probability 1/3.

Relative entropies D_\text{KL}(P \parallel Q) and D_\text{KL}(Q \parallel P) are calculated as follows. This example uses the natural log with base ''e'', designated \ln, to get results in nats (see units of information).

:D_\text{KL}(P \parallel Q) = \sum_{x \in \mathcal{X}} P(x) \ln\frac{P(x)}{Q(x)} = \frac{9}{25}\ln\frac{9/25}{1/3} + \frac{12}{25}\ln\frac{12/25}{1/3} + \frac{4}{25}\ln\frac{4/25}{1/3} \approx 0.0853

:D_\text{KL}(Q \parallel P) = \sum_{x \in \mathcal{X}} Q(x) \ln\frac{Q(x)}{P(x)} = \frac{1}{3}\ln\frac{1/3}{9/25} + \frac{1}{3}\ln\frac{1/3}{12/25} + \frac{1}{3}\ln\frac{1/3}{4/25} \approx 0.0975
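The two divergences above can be checked numerically. The following is a minimal sketch in Python (assuming NumPy is available); the helper name kl_divergence is ours, not part of any standard library.

<syntaxhighlight lang="python">
import numpy as np

def kl_divergence(p, q):
    """Discrete KL divergence D_KL(p || q) in nats.

    Assumes p and q are probability vectors over the same outcomes,
    and that q[i] > 0 wherever p[i] > 0 (absolute continuity).
    """
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    mask = p > 0                      # terms with p(x) = 0 contribute 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

# P: binomial with N = 2, p = 0.4  ->  (0.36, 0.48, 0.16)
# Q: uniform over {0, 1, 2}
P = np.array([9/25, 12/25, 4/25])
Q = np.array([1/3, 1/3, 1/3])

print(kl_divergence(P, Q))   # ~0.0853 nats
print(kl_divergence(Q, P))   # ~0.0975 nats (note the asymmetry)
</syntaxhighlight>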
Interpretations
Statistics
In the field of statistics, the Neyman–Pearson lemma states that the most powerful way to distinguish between the two distributions P and Q based on an observation Y (drawn from one of them) is through the log of the ratio of their likelihoods: \log P(Y) - \log Q(Y). The KL divergence is the expected value of this statistic if Y is actually drawn from P. Kullback motivated the statistic as an expected log likelihood ratio.
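A quick Monte Carlo sketch of this interpretation (Python with NumPy; the two distributions are arbitrary illustrative choices): the sample mean of the log-likelihood ratio under P converges to D_\text{KL}(P \parallel Q).

<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(0)

# Two discrete distributions over {0, 1, 2, 3} (illustrative values).
P = np.array([0.1, 0.4, 0.3, 0.2])
Q = np.array([0.25, 0.25, 0.25, 0.25])

# Exact KL divergence in nats.
kl_exact = np.sum(P * np.log(P / Q))

# Draw samples from P and average the log-likelihood ratio log P(Y) - log Q(Y).
samples = rng.choice(len(P), size=200_000, p=P)
llr = np.log(P[samples]) - np.log(Q[samples])

print(kl_exact)        # exact value
print(llr.mean())      # Monte Carlo estimate, close to the exact value
</syntaxhighlight>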
Coding
In the context of coding theory, D_\text{KL}(P \parallel Q) can be constructed by measuring the expected number of extra bits required to code samples from P using a code optimized for Q rather than the code optimized for P.
Inference
In the context of machine learning, D_\text{KL}(P \parallel Q) is often called the information gain achieved if P would be used instead of Q, which is currently used. By analogy with information theory, it is called the ''relative entropy'' of P with respect to Q.
Expressed in the language of Bayesian inference, D_\text{KL}(P \parallel Q) is a measure of the information gained by revising one's beliefs from the prior probability distribution Q to the posterior probability distribution P. In other words, it is the amount of information lost when Q is used to approximate P.
Information geometry
In applications, P typically represents the "true" distribution of data, observations, or a precisely calculated theoretical distribution, while Q typically represents a theory, model, description, or approximation of P. In order to find a distribution Q that is closest to P, we can minimize the KL divergence and compute an information projection.

While it is a statistical distance, it is not a metric, the most familiar type of distance, but instead it is a divergence. While metrics are symmetric and generalize ''linear'' distance, satisfying the triangle inequality, divergences are asymmetric and generalize ''squared'' distance, in some cases satisfying a generalized Pythagorean theorem. In general D_\text{KL}(P \parallel Q) does not equal D_\text{KL}(Q \parallel P), and the asymmetry is an important part of the geometry. The infinitesimal form of relative entropy, specifically its Hessian, gives a metric tensor that equals the Fisher information metric; see the section on the Fisher information metric below. Relative entropy satisfies a generalized Pythagorean theorem for exponential families (geometrically interpreted as dually flat manifolds), and this allows one to minimize relative entropy by geometric means, for example by information projection and in maximum likelihood estimation.

Relative entropy is a special case of a broader class of statistical divergences called ''f''-divergences as well as the class of Bregman divergences, and it is the only such divergence over probabilities that is a member of both classes.
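The following sketch (Python with NumPy; the helper names are ours) illustrates, for two small discrete distributions, that the f-divergence generated by f(t) = t log t and the Bregman divergence generated by negative entropy both reduce to the Kullback–Leibler divergence.

<syntaxhighlight lang="python">
import numpy as np

P = np.array([0.2, 0.5, 0.3])
Q = np.array([0.4, 0.4, 0.2])

# Plain KL divergence.
kl = np.sum(P * np.log(P / Q))

# f-divergence with f(t) = t*log(t):  sum_i q_i * f(p_i / q_i).
f = lambda t: t * np.log(t)
f_div = np.sum(Q * f(P / Q))

# Bregman divergence generated by negative entropy F(p) = sum_i p_i*log(p_i):
# D_F(p, q) = F(p) - F(q) - <grad F(q), p - q>.
F = lambda p: np.sum(p * np.log(p))
gradF = lambda p: np.log(p) + 1.0
bregman = F(P) - F(Q) - np.dot(gradF(Q), P - Q)

print(kl, f_div, bregman)   # all three agree
</syntaxhighlight>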
Finance (game theory)
Consider a growth-optimizing investor in a fair game with mutually exclusive outcomes
(e.g. a “horse race” in which the official odds add up to one).
The rate of return expected by such an investor is equal to the relative entropy
between the investor’s believed probabilities and the official odds.
This is a special case of a much more general connection between financial returns and divergence measures.
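As a hedged sketch of this statement (Python with NumPy; the three-horse race and the numbers are assumptions for illustration, not from the source): a proportional (Kelly) bettor who believes probabilities p and bets at fair odds 1/q earns an expected log growth rate equal to D_\text{KL}(p \parallel q).

<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(1)

# Believed (true) probabilities p and official odds paying 1/q_i per unit,
# with q summing to one (a "fair" horse race). Values are illustrative.
p = np.array([0.5, 0.3, 0.2])
q = np.array([0.4, 0.4, 0.2])

# Kelly bettor: stake fraction p_i of current wealth on horse i each race.
# If horse i wins, wealth is multiplied by p_i * (1 / q_i).
n_races = 100_000
winners = rng.choice(3, size=n_races, p=p)
log_growth = np.log(p[winners] / q[winners]).mean()   # empirical rate per race

kl = np.sum(p * np.log(p / q))
print(log_growth, kl)   # the empirical growth rate approaches D_KL(p || q)
</syntaxhighlight>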
Motivation
In information theory, the Kraft–McMillan theorem establishes that any directly decodable coding scheme for coding a message to identify one value x_i out of a set of possibilities X can be seen as representing an implicit probability distribution q(x_i) = 2^{-\ell_i} over X, where \ell_i is the length of the code for x_i in bits. Therefore, relative entropy can be interpreted as the expected extra message-length per datum that must be communicated if a code that is optimal for a given (wrong) distribution Q is used, compared to using a code based on the true distribution P: it is the ''excess'' entropy.

:D_\text{KL}(P \parallel Q) = \sum_{x \in \mathcal{X}} p(x) \log\frac{1}{q(x)} - \sum_{x \in \mathcal{X}} p(x) \log\frac{1}{p(x)} = H(P, Q) - H(P)

where H(P, Q) is the cross entropy of Q relative to P, and H(P) is the entropy of P (which is the same as the cross-entropy of P with itself).
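This excess-bits reading can be checked directly in code (Python with NumPy); idealized, non-integer code lengths of -log2 q(x) bits are assumed rather than an actual prefix code, and the distributions are illustrative.

<syntaxhighlight lang="python">
import numpy as np

P = np.array([0.7, 0.2, 0.1])   # true distribution (illustrative)
Q = np.array([1/3, 1/3, 1/3])   # distribution the code was designed for

# Idealized code lengths in bits: -log2 of the modeled probability.
len_under_Q = -np.log2(Q)
len_under_P = -np.log2(P)

# Expected message length per symbol when symbols are drawn from P.
cross_entropy = np.sum(P * len_under_Q)   # H(P, Q)
entropy = np.sum(P * len_under_P)         # H(P)

kl_bits = np.sum(P * np.log2(P / Q))      # D_KL(P || Q) in bits
print(cross_entropy - entropy, kl_bits)   # the excess length equals the KL divergence
</syntaxhighlight>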
The relative entropy D_\text{KL}(P \parallel Q) can be thought of geometrically as a statistical distance, a measure of how far the distribution ''Q'' is from the distribution ''P''. Geometrically it is a divergence: an asymmetric, generalized form of squared distance. The cross-entropy H(P, Q) is itself such a measurement (formally a loss function), but it cannot be thought of as a distance, since H(P, P) = H(P) isn't zero. This can be fixed by subtracting H(P), to make D_\text{KL}(P \parallel Q) agree more closely with our notion of distance, as the ''excess'' loss. The resulting function is asymmetric, and while this can be symmetrized (see the symmetrised divergence below), the asymmetric form is more useful. See the section on information geometry above for more on the geometric interpretation.
Relative entropy relates to the "rate function" in the theory of large deviations.[Novak S.Y. (2011), ''Extreme Value Methods with Applications to Finance'', ch. 14.5 (Chapman & Hall).]
Arthur Hobson proved that relative entropy is the only measure of difference between probability distributions that satisfies some desired properties, which are the canonical extension to those appearing in a commonly used characterization of entropy. Consequently, mutual information is the only measure of mutual dependence that obeys certain related conditions, since it can be defined in terms of Kullback–Leibler divergence.
Properties
* Relative entropy is always non-negative, a result known as Gibbs' inequality, with D_\text{KL}(P \parallel Q) equal to zero if and only if P = Q as measures. In particular, if P(dx) = p(x)\,\mu(dx) and Q(dx) = q(x)\,\mu(dx), then p(x) = q(x) \mu-almost everywhere. The entropy H(P) thus sets a minimum value for the cross-entropy H(P, Q), the expected number of bits required when using a code based on Q rather than P; and the Kullback–Leibler divergence therefore represents the expected number of extra bits that must be transmitted to identify a value x drawn from X, if a code is used corresponding to the probability distribution Q, rather than the "true" distribution P.
* No upper bound exists in general. However, it has been shown that if P and Q are two discrete probability distributions built by distributing the same discrete quantity, then the maximum value of D_\text{KL}(P \parallel Q) can be calculated.
* Relative entropy remains well-defined for continuous distributions, and furthermore is invariant under parameter transformations. For example, if a transformation is made from variable x to variable y(x), then, since P(x)\,dx = P(y)\,dy and Q(x)\,dx = Q(y)\,dy, where \left|\frac{dy}{dx}\right| denotes the absolute value of the derivative, or more generally of the Jacobian, relating the two variables, the relative entropy may be rewritten:

:D_\text{KL}(P \parallel Q) = \int_{x_a}^{x_b} P(x)\log\frac{P(x)}{Q(x)}\,dx = \int_{y_a}^{y_b} P(y)\log\frac{P(y)}{Q(y)}\,dy,

where y_a = y(x_a) and y_b = y(x_b). Although it was assumed that the transformation was continuous, this need not be the case. This also shows that the relative entropy produces a dimensionally consistent quantity, since if x is a dimensioned variable, P(x) and Q(x) are also dimensioned, since e.g. P(x)\,dx is dimensionless. The argument of the logarithmic term is and remains dimensionless, as it must. It can therefore be seen as in some ways a more fundamental quantity than some other properties in information theory (such as self-information or Shannon entropy), which can become undefined or negative for non-discrete probabilities.[See the section "differential entropy – 4" in the video lecture "Relative Entropy" by Sergio Verdú, NIPS 2009.]
* Relative entropy is additive for independent distributions in much the same way as Shannon entropy. If P_1, P_2 are independent distributions, with the joint distribution P(x, y) = P_1(x)\,P_2(y), and Q, Q_1, Q_2 likewise form a joint distribution of independent distributions, then

:D_\text{KL}(P \parallel Q) = D_\text{KL}(P_1 \parallel Q_1) + D_\text{KL}(P_2 \parallel Q_2).
* Relative entropy D_\text{KL}(P \parallel Q) is convex in the pair of probability measures (P, Q), i.e. if (P_1, Q_1) and (P_2, Q_2) are two pairs of probability measures, then

:D_\text{KL}\big(\lambda P_1 + (1 - \lambda) P_2 \parallel \lambda Q_1 + (1 - \lambda) Q_2\big) \le \lambda\, D_\text{KL}(P_1 \parallel Q_1) + (1 - \lambda)\, D_\text{KL}(P_2 \parallel Q_2) \quad \text{for } 0 \le \lambda \le 1.
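A small numerical sketch of the non-negativity, additivity and convexity properties above (Python with NumPy; the distributions are arbitrary illustrative choices):

<syntaxhighlight lang="python">
import numpy as np

def kl(p, q):
    p, q = np.asarray(p, float), np.asarray(q, float)
    m = p > 0
    return float(np.sum(p[m] * np.log(p[m] / q[m])))

P1, Q1 = np.array([0.3, 0.7]), np.array([0.5, 0.5])
P2, Q2 = np.array([0.1, 0.6, 0.3]), np.array([0.2, 0.2, 0.6])

# Non-negativity (Gibbs' inequality): both divergences are >= 0.
assert kl(P1, Q1) >= 0 and kl(P2, Q2) >= 0

# Additivity over independent components:
# D(P1*P2 || Q1*Q2) = D(P1||Q1) + D(P2||Q2).
P_joint = np.outer(P1, P2).ravel()
Q_joint = np.outer(Q1, Q2).ravel()
assert np.isclose(kl(P_joint, Q_joint), kl(P1, Q1) + kl(P2, Q2))

# Convexity in the pair of measures, checked for one value of lambda.
lam = 0.3
P3, Q3 = np.array([0.8, 0.2]), np.array([0.6, 0.4])
lhs = kl(lam * P1 + (1 - lam) * P3, lam * Q1 + (1 - lam) * Q3)
rhs = lam * kl(P1, Q1) + (1 - lam) * kl(P3, Q3)
assert lhs <= rhs + 1e-12

print("all three properties hold on this example")
</syntaxhighlight>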
Duality formula for variational inference
The following result, due to Donsker and Varadhan, is known as Donsker and Varadhan's variational formula: for probability measures P absolutely continuous with respect to Q,

:D_\text{KL}(P \parallel Q) = \sup_{f} \left\{ \operatorname{E}_P[f] - \log \operatorname{E}_Q\!\left[e^{f}\right] \right\},

where the supremum is taken over bounded measurable functions f. For an alternative proof using measure theory, see the references.
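For discrete distributions the supremum is attained at f = log(p/q), which can be verified numerically (Python with NumPy; a minimal sketch under that assumption):

<syntaxhighlight lang="python">
import numpy as np

P = np.array([0.2, 0.5, 0.3])
Q = np.array([0.4, 0.4, 0.2])

def dv_objective(f):
    """Donsker-Varadhan objective E_P[f] - log E_Q[exp(f)]."""
    return np.sum(P * f) - np.log(np.sum(Q * np.exp(f)))

kl = np.sum(P * np.log(P / Q))

f_opt = np.log(P / Q)           # the maximizing function
print(dv_objective(f_opt), kl)  # equal: the bound is tight at f = log(p/q)

# Any other f gives a value no larger than the KL divergence.
rng = np.random.default_rng(0)
for _ in range(5):
    f_rand = rng.normal(size=3)
    assert dv_objective(f_rand) <= kl + 1e-12
</syntaxhighlight>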
Examples
Multivariate normal distributions
Suppose that we have two multivariate normal distributions, with means \mu_0, \mu_1 and with (non-singular) covariance matrices \Sigma_0, \Sigma_1. If the two distributions have the same dimension, k, then the relative entropy between the distributions is as follows:

:D_\text{KL}(\mathcal{N}_0 \parallel \mathcal{N}_1) = \frac{1}{2}\left( \operatorname{tr}\left(\Sigma_1^{-1}\Sigma_0\right) - k + (\mu_1 - \mu_0)^\mathsf{T}\Sigma_1^{-1}(\mu_1 - \mu_0) + \ln\frac{\det\Sigma_1}{\det\Sigma_0} \right).

The logarithm in the last term must be taken to base ''e'' since all terms apart from the last are base-''e'' logarithms of expressions that are either factors of the density function or otherwise arise naturally. The equation therefore gives a result measured in nats. Dividing the entire expression above by \ln 2 yields the divergence in bits.

In a numerical implementation, it is helpful to express the result in terms of the Cholesky decompositions L_0, L_1 such that \Sigma_0 = L_0 L_0^\mathsf{T} and \Sigma_1 = L_1 L_1^\mathsf{T}. Then with M and y solutions to the triangular linear systems L_1 M = L_0 and L_1 y = \mu_1 - \mu_0,

:D_\text{KL}(\mathcal{N}_0 \parallel \mathcal{N}_1) = \frac{1}{2}\left( \sum_{i,j=1}^{k} M_{ij}^2 - k + \lVert y \rVert^2 + 2\sum_{i=1}^{k}\ln\frac{(L_1)_{ii}}{(L_0)_{ii}} \right).

A special case, and a common quantity in variational inference, is the relative entropy between a diagonal multivariate normal and a standard normal distribution (with zero mean and unit variance):

:D_\text{KL}\left( \mathcal{N}\left((\mu_1, \ldots, \mu_k)^\mathsf{T}, \operatorname{diag}(\sigma_1^2, \ldots, \sigma_k^2)\right) \parallel \mathcal{N}(\mathbf{0}, \mathbf{I}) \right) = \frac{1}{2}\sum_{i=1}^{k}\left( \sigma_i^2 + \mu_i^2 - 1 - \ln(\sigma_i^2) \right).

For two univariate normal distributions ''p'' = \mathcal{N}(\mu_0, \sigma_0^2) and ''q'' = \mathcal{N}(\mu_1, \sigma_1^2), the above simplifies to

:D_\text{KL}(p \parallel q) = \ln\frac{\sigma_1}{\sigma_0} + \frac{\sigma_0^2 + (\mu_0 - \mu_1)^2}{2\sigma_1^2} - \frac{1}{2}.
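A minimal numerical sketch of the general formula and of the Cholesky-based rearrangement (Python with NumPy and SciPy; the function names and the example means and covariances are ours):

<syntaxhighlight lang="python">
import numpy as np
from scipy.linalg import cholesky, solve_triangular

def kl_mvn(mu0, S0, mu1, S1):
    """D_KL( N(mu0, S0) || N(mu1, S1) ) in nats, via the closed form."""
    k = len(mu0)
    S1_inv = np.linalg.inv(S1)
    d = mu1 - mu0
    return 0.5 * (np.trace(S1_inv @ S0) - k + d @ S1_inv @ d
                  + np.log(np.linalg.det(S1) / np.linalg.det(S0)))

def kl_mvn_chol(mu0, S0, mu1, S1):
    """Same quantity computed from Cholesky factors, as in the text."""
    k = len(mu0)
    L0 = cholesky(S0, lower=True)
    L1 = cholesky(S1, lower=True)
    M = solve_triangular(L1, L0, lower=True)          # solves L1 M = L0
    y = solve_triangular(L1, mu1 - mu0, lower=True)   # solves L1 y = mu1 - mu0
    return 0.5 * (np.sum(M**2) - k + y @ y
                  + 2 * np.sum(np.log(np.diag(L1) / np.diag(L0))))

mu0, S0 = np.array([0.0, 0.0]), np.array([[1.0, 0.3], [0.3, 1.5]])
mu1, S1 = np.array([0.5, -0.2]), np.array([[2.0, 0.0], [0.0, 1.0]])
print(kl_mvn(mu0, S0, mu1, S1), kl_mvn_chol(mu0, S0, mu1, S1))  # the two agree
</syntaxhighlight>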
Relation to metrics
While relative entropy is a statistical distance, it is not a metric on the space of probability distributions, but instead it is a divergence. While metrics are symmetric and generalize ''linear'' distance, satisfying the triangle inequality, divergences are asymmetric in general and generalize ''squared'' distance, in some cases satisfying a generalized Pythagorean theorem. In general D_\text{KL}(P \parallel Q) does not equal D_\text{KL}(Q \parallel P), and while this can be symmetrized (see the symmetrised divergence below), the asymmetry is an important part of the geometry.
It generates a topology on the space of probability distributions. More concretely, if \{P_1, P_2, \ldots\} is a sequence of distributions such that

:\lim_{n \to \infty} D_\text{KL}(P_n \parallel Q) = 0,

then it is said that

:P_n \xrightarrow{D} Q.

Pinsker's inequality entails that

:P_n \xrightarrow{D} P \;\Rightarrow\; P_n \xrightarrow{TV} P,

where the latter stands for the usual convergence in total variation.
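Pinsker's inequality itself bounds the total variation distance by \sqrt{D_\text{KL}(P \parallel Q)/2} (with the logarithm in nats); the following sketch checks the bound on random discrete distributions (Python with NumPy):

<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(0)

def kl(p, q):
    return float(np.sum(p * np.log(p / q)))

def total_variation(p, q):
    return 0.5 * float(np.sum(np.abs(p - q)))

for _ in range(1000):
    p = rng.dirichlet(np.ones(5))   # random probability vectors on 5 outcomes
    q = rng.dirichlet(np.ones(5))
    # Pinsker: TV(p, q) <= sqrt( D_KL(p || q) / 2 ).
    assert total_variation(p, q) <= np.sqrt(kl(p, q) / 2) + 1e-12

print("Pinsker's inequality held in every trial")
</syntaxhighlight>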
Fisher information metric
Relative entropy is directly related to the Fisher information metric. This can be made explicit as follows. Assume that the probability distributions P and Q are both parameterized by some (possibly multi-dimensional) parameter \theta. Consider then two close-by values of P = P(\theta) and Q = P(\theta_0), so that the parameter \theta differs by only a small amount from the parameter value \theta_0. Specifically, up to first order one has (using the Einstein summation convention)

:P(\theta) = P(\theta_0) + \Delta\theta_j\, P_j(\theta_0) + \cdots,

with \Delta\theta_j = (\theta - \theta_0)_j a small change of \theta in the ''j'' direction, and P_j(\theta_0) = \frac{\partial P}{\partial\theta_j}(\theta_0) the corresponding rate of change in the probability distribution. Since relative entropy has an absolute minimum 0 for P = Q, i.e. \theta = \theta_0, it changes only to ''second'' order in the small parameters \Delta\theta_j. More formally, as for any minimum, the first derivatives of the divergence vanish,

:\left.\frac{\partial}{\partial\theta_j}\right|_{\theta = \theta_0} D_\text{KL}(P(\theta) \parallel P(\theta_0)) = 0,

and by the Taylor expansion one has up to second order

:D_\text{KL}(P(\theta) \parallel P(\theta_0)) = \frac{1}{2}\,\Delta\theta_j\,\Delta\theta_k\, g_{jk}(\theta_0) + \cdots,

where the Hessian matrix of the divergence

:g_{jk}(\theta_0) = \left.\frac{\partial^2}{\partial\theta_j\,\partial\theta_k}\right|_{\theta = \theta_0} D_\text{KL}(P(\theta) \parallel P(\theta_0))

must be positive semidefinite. Letting \theta_0 vary (and dropping the subindex 0) the Hessian g_{jk}(\theta) defines a (possibly degenerate) Riemannian metric on the \theta parameter space, called the Fisher information metric.
Fisher information metric theorem
When p(x; \rho) satisfies suitable regularity conditions, namely that its first three derivatives with respect to the parameter \rho exist and are bounded by integrable functions independent of \rho, then

:D_\text{KL}\big(p(x; \rho_0) \parallel p(x; \rho_0 + \delta\rho)\big) = \frac{c\,\delta\rho^2}{2} + \mathcal{O}(\delta\rho^3) \quad \text{as } \delta\rho \to 0,

where c = \int \left(\frac{\partial \log p}{\partial\rho}\right)^2 p\, dx is the Fisher information of p evaluated at \rho_0.
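A quick numerical sketch of this quadratic behaviour (Python with NumPy), using the Bernoulli family, whose Fisher information at \rho_0 is 1/(\rho_0(1-\rho_0)):

<syntaxhighlight lang="python">
import numpy as np

def kl_bernoulli(a, b):
    """D_KL( Bernoulli(a) || Bernoulli(b) ) in nats."""
    return a * np.log(a / b) + (1 - a) * np.log((1 - a) / (1 - b))

rho0 = 0.3
fisher = 1.0 / (rho0 * (1 - rho0))   # Fisher information of the Bernoulli family

for d in (0.1, 0.01, 0.001):
    exact = kl_bernoulli(rho0, rho0 + d)
    quad = 0.5 * fisher * d**2       # c * (delta rho)^2 / 2
    print(d, exact, quad, exact / quad)   # the ratio tends to 1 as d shrinks
</syntaxhighlight>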
Variation of information
Another information-theoretic metric is variation of information, which is roughly a symmetrization of conditional entropy. It is a metric on the set of partitions of a discrete probability space.
Relation to other quantities of information theory
Many of the other quantities of information theory can be interpreted as applications of relative entropy to specific cases.
Self-information
The self-information, also known as the information content of a signal, random variable, or event, is defined as the negative logarithm of the probability of the given outcome occurring.

When applied to a discrete random variable, the self-information can be represented as

:\operatorname{I}(m) = D_\text{KL}\left(\delta_{im} \parallel \{ p_i \}\right);

it is the relative entropy of the probability distribution P(i) from a Kronecker delta representing certainty that i = m, i.e. the number of extra bits that must be transmitted to identify i if only the probability distribution P(i) is available to the receiver, not the fact that i = m.
Mutual information
The mutual information,

:\operatorname{I}(X; Y) = D_\text{KL}\left(P(X, Y) \parallel P(X)\,P(Y)\right),

is the relative entropy of the product P(X)P(Y) of the two marginal probability distributions from the joint probability distribution P(X, Y), i.e. the expected number of extra bits that must be transmitted to identify X and Y if they are coded using only their marginal distributions instead of the joint distribution. Equivalently, if the joint probability P(X, Y) ''is'' known, it is the expected number of extra bits that must on average be sent to identify Y if the value of X is not already known to the receiver.
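A sketch of this identity for a small joint table (Python with NumPy; the joint probabilities are arbitrary illustrative values):

<syntaxhighlight lang="python">
import numpy as np

# Joint distribution of two binary variables X (rows) and Y (columns).
P_xy = np.array([[0.30, 0.10],
                 [0.20, 0.40]])

P_x = P_xy.sum(axis=1)          # marginal of X
P_y = P_xy.sum(axis=0)          # marginal of Y
P_prod = np.outer(P_x, P_y)     # product of the marginals

# Mutual information as D_KL(joint || product of marginals), in bits.
mi = np.sum(P_xy * np.log2(P_xy / P_prod))
print(mi)
</syntaxhighlight>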
Shannon entropy
The Shannon entropy,

:H(X) = \operatorname{E}[\operatorname{I}_X(x)] = \log N - D_\text{KL}\left(p_X(x) \parallel P_U(X)\right),

is the number of bits which would have to be transmitted to identify X from N equally likely possibilities, ''less'' the relative entropy of the uniform distribution P_U(X) on the random variates of X from the true distribution p_X(x), i.e. ''less'' the expected number of bits saved, which would have had to be sent if the value of X were coded according to the uniform distribution P_U(X) rather than the true distribution p_X(x). This definition of Shannon entropy forms the basis of E. T. Jaynes's alternative generalization to continuous distributions, the limiting density of discrete points (as opposed to the usual differential entropy), which defines the continuous entropy as

:\lim_{N \to \infty} H_N(X) = \log N - \int p(x)\log\frac{p(x)}{m(x)}\,dx,

which is equivalent to:

:\log N - D_\text{KL}\left(p(x) \parallel m(x)\right).
Conditional entropy
The conditional entropy,

:H(X \mid Y) = \log N - D_\text{KL}\left(P(X, Y) \parallel P_U(X)\,P(Y)\right),

is the number of bits which would have to be transmitted to identify X from N equally likely possibilities, ''less'' the relative entropy of the product distribution P_U(X)\,P(Y) from the true joint distribution P(X, Y), i.e. ''less'' the expected number of bits saved which would have had to be sent if the value of X were coded according to the uniform distribution P_U(X) rather than the conditional distribution P(X \mid Y) of X given Y.
Cross entropy
When we have a set of possible events, coming from the distribution p, we can encode them (with a lossless data compression) using entropy encoding. This compresses the data by replacing each fixed-length input symbol with a corresponding unique, variable-length, prefix-free code (e.g.: the events (A, B, C) with probabilities p = (1/2, 1/4, 1/4) can be encoded as the bits (0, 10, 11)). If we know the distribution p in advance, we can devise an encoding that would be optimal (e.g.: using Huffman coding), meaning the messages we encode will have the shortest length on average (assuming the encoded events are sampled from p), which will be equal to Shannon's entropy of p (denoted as H(p)). However, if we use a different probability distribution (q) when creating the entropy encoding scheme, then a larger number of bits will be used (on average) to identify an event from a set of possibilities. This new (larger) number is measured by the cross entropy between p and q.

The cross entropy between two probability distributions (p and q) measures the average number of bits needed to identify an event from a set of possibilities, if a coding scheme is used based on a given probability distribution q, rather than the "true" distribution p. The cross entropy for two distributions p and q over the same probability space is thus defined as follows.

:H(p, q) = \operatorname{E}_p[-\log q] = H(p) + D_\text{KL}(p \parallel q)

For explicit derivation of this, see the Motivation section above.

Under this scenario, relative entropies (KL divergence) can be interpreted as the extra number of bits, on average, that are needed (beyond H(p)) for encoding the events because of using q for constructing the encoding scheme instead of p.
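Continuing the (A, B, C) example from the text, a brief sketch (plain Python) of the code-length reading; the "true" distribution p used here is an arbitrary illustrative choice: the prefix code (0, 10, 11) is ideal for q = (1/2, 1/4, 1/4), and its average length under p exceeds H(p) by exactly D_\text{KL}(p \parallel q) bits.

<syntaxhighlight lang="python">
import math

# Prefix code built for q = (1/2, 1/4, 1/4): codewords 0, 10, 11 (lengths 1, 2, 2).
q = [0.5, 0.25, 0.25]
code_lengths = [1, 2, 2]            # exactly -log2(q_i), since q is dyadic

# Suppose the events actually follow a different distribution p.
p = [0.7, 0.2, 0.1]                 # illustrative "true" distribution

expected_length = sum(pi * li for pi, li in zip(p, code_lengths))      # H(p, q)
entropy_p = -sum(pi * math.log2(pi) for pi in p)                       # H(p)
kl_bits = sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q))         # D_KL(p || q)

print(expected_length)              # 1.30 bits per symbol with the mismatched code
print(entropy_p + kl_bits)          # equals H(p) + D_KL(p || q), the same 1.30
</syntaxhighlight>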
Bayesian updating
In Bayesian statistics, relative entropy can be used as a measure of the information gain in moving from a prior distribution to a posterior distribution: p(x) \to p(x \mid I). If some new fact Y = y is discovered, it can be used to update the posterior distribution for X from p(x \mid I) to a new posterior distribution p(x \mid y, I) using Bayes' theorem:

:p(x \mid y, I) = \frac{p(y \mid x, I)\, p(x \mid I)}{p(y \mid I)}

This distribution has a new entropy:

:H\big(p(x \mid y, I)\big) = -\sum_x p(x \mid y, I)\log p(x \mid y, I),

which may be less than or greater than the original entropy H\big(p(x \mid I)\big). However, from the standpoint of the new probability distribution one can estimate that to have used the original code based on p(x \mid I) instead of a new code based on p(x \mid y, I) would have added an expected number of bits:

:D_\text{KL}\big(p(x \mid y, I) \parallel p(x \mid I)\big) = \sum_x p(x \mid y, I)\log\frac{p(x \mid y, I)}{p(x \mid I)}

to the message length. This therefore represents the amount of useful information, or information gain, about X, that has been learned by discovering Y = y.

If a further piece of data, Y_2 = y_2, subsequently comes in, the probability distribution for x can be updated further, to give a new best guess p(x \mid y_1, y_2, I). If one reinvestigates the information gain for using p(x \mid y_1, I) rather than p(x \mid I), it turns out that it may be either greater or less than previously estimated:

:\sum_x p(x \mid y_1, y_2, I)\log\frac{p(x \mid y_1, I)}{p(x \mid I)} may be ≤ or > than \sum_x p(x \mid y_1, I)\log\frac{p(x \mid y_1, I)}{p(x \mid I)}

and so the combined information gain does ''not'' obey the triangle inequality:

:D_\text{KL}\big(p(x \mid y_1, y_2, I) \parallel p(x \mid I)\big) may be <, = or > than D_\text{KL}\big(p(x \mid y_1, y_2, I) \parallel p(x \mid y_1, I)\big) + D_\text{KL}\big(p(x \mid y_1, I) \parallel p(x \mid I)\big)

All one can say is that on ''average'', averaging using p(y_2 \mid y_1, x, I), the two sides will average out.
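A compact sketch of such an update (Python with NumPy; the prior and likelihood values are made up for illustration): the information gain from prior to posterior is D_\text{KL}(\text{posterior} \parallel \text{prior}).

<syntaxhighlight lang="python">
import numpy as np

def kl(p, q):
    return float(np.sum(p * np.log2(p / q)))   # in bits

# Prior over three hypotheses x, and the likelihood p(y | x) of an observed fact y.
prior = np.array([0.5, 0.3, 0.2])          # p(x | I), illustrative
likelihood_y = np.array([0.9, 0.3, 0.1])   # p(y | x, I), illustrative

# Bayes' theorem: posterior proportional to likelihood * prior.
posterior = likelihood_y * prior
posterior /= posterior.sum()               # divide by p(y | I)

info_gain = kl(posterior, prior)           # D_KL( p(x|y,I) || p(x|I) )
print(posterior, info_gain)
</syntaxhighlight>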
Bayesian experimental design
A common goal in Bayesian experimental design is to maximise the expected relative entropy between the prior and the posterior. When posteriors are approximated as Gaussian distributions, a design maximising the expected relative entropy is called Bayes d-optimal.
Discrimination information
Relative entropy can also be interpreted as the expected discrimination information for over : the mean information per sample for discriminating in favor of a hypothesis against a hypothesis , when hypothesis is true. Another name for this quantity, given to it by I. J. Good
Irving John Good (9 December 1916 – 5 April 2009)The Times of 16-apr-09, http://www.timesonline.co.uk/tol/comment/obituaries/article6100314.ece
was a British mathematician who worked as a cryptologist at Bletchley Park with Alan Turing. Afte ...
, is the expected weight of evidence for over to be expected from each sample.
The expected weight of evidence for over is not the same as the information gain expected per sample about the probability distribution of the hypotheses,
:
Either of the two quantities can be used as a utility function in Bayesian experimental design, to choose an optimal next question to investigate: but they will in general lead to rather different experimental strategies.
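A small simulation makes the "expected weight of evidence" reading concrete: with two assumed discrete sampling distributions p(x | H₁) and p(x | H₀) (the values below are invented), the average per-sample log-likelihood ratio under H₁ approaches D_KL(p(x | H₁) ∥ p(x | H₀)).
<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(0)

p_H1 = np.array([0.6, 0.3, 0.1])    # hypothetical sampling distribution p(x | H1), taken to be true
p_H0 = np.array([0.2, 0.3, 0.5])    # hypothetical sampling distribution p(x | H0)

weight_of_evidence = np.log2(p_H1 / p_H0)        # bits of evidence for H1 over H0, per outcome

samples = rng.choice(3, size=100_000, p=p_H1)    # draw data under H1
print(weight_of_evidence[samples].mean())        # empirical mean weight of evidence
print(np.sum(p_H1 * weight_of_evidence))         # D_KL(p(.|H1) || p(.|H0)): the two agree
</syntaxhighlight>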
On the entropy scale of ''information gain'' there is very little difference between near certainty and absolute certainty—coding according to a near certainty requires hardly any more bits than coding according to an absolute certainty. On the other hand, on the logit scale implied by weight of evidence, the difference between the two is enormous – infinite perhaps; this might reflect the difference between being almost sure (on a probabilistic level) that, say, the Riemann hypothesis is correct, compared to being certain that it is correct because one has a mathematical proof. These two different scales of loss function for uncertainty are ''both'' useful, according to how well each reflects the particular circumstances of the problem in question.
Principle of minimum discrimination information
The idea of relative entropy as discrimination information led Kullback to propose the Principle of Minimum Discrimination Information (MDI): given new facts, a new distribution should be chosen which is as hard to discriminate from the original distribution as possible; so that the new data produces as small an information gain as possible.
For example, if one had a prior distribution p(x, a) over x and a, and subsequently learnt the true distribution of a was u(a), then the relative entropy between the new joint distribution for x and a, q(x | a) u(a), and the earlier prior distribution p(x, a) would be:
:<math>D_\text{KL}\bigl(q(x \mid a)\,u(a) \parallel p(x, a)\bigr) = \operatorname{E}_{u(a)}\!\left[ D_\text{KL}\bigl(q(x \mid a) \parallel p(x \mid a)\bigr) \right] + D_\text{KL}\bigl(u(a) \parallel p(a)\bigr),</math>
i.e. the sum of the relative entropy of p(a), the prior distribution for a, from the updated distribution u(a), plus the expected value (using the probability distribution u(a)) of the relative entropy of the prior conditional distribution p(x | a) from the new conditional distribution q(x | a). (Note that often the latter expected value is called the ''conditional relative entropy'' (or ''conditional Kullback–Leibler divergence'') and denoted by D_KL(q(x | a) ∥ p(x | a)).) This is minimized if q(x | a) = p(x | a) over the whole support of u(a); and we note that this result incorporates Bayes' theorem, if the new distribution u(a) is in fact a δ function representing certainty that a has one particular value.
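The decomposition above is easy to verify numerically. In the sketch below, the joint prior p(x, a), the updated marginal u(a) and the updated conditionals q(x | a) are arbitrary made-up tables; the two sides of the identity agree to floating-point precision.
<syntaxhighlight lang="python">
import numpy as np

def kl(p, q):
    """Relative entropy in nats between two arrays of matching shape."""
    p, q = np.ravel(np.asarray(p, float)), np.ravel(np.asarray(q, float))
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

# Hypothetical prior joint p(x, a) (rows: x, columns: a) and updated pieces u(a), q(x | a).
p_xa = np.array([[0.20, 0.10],
                 [0.15, 0.25],
                 [0.05, 0.25]])
u_a = np.array([0.7, 0.3])
q_x_given_a = np.array([[0.5, 0.2],
                        [0.3, 0.5],
                        [0.2, 0.3]])      # each column sums to 1

p_a = p_xa.sum(axis=0)                    # prior marginal p(a)
p_x_given_a = p_xa / p_a                  # prior conditionals p(x | a)

lhs = kl(q_x_given_a * u_a, p_xa)         # D_KL( q(x|a) u(a) || p(x, a) )
rhs = kl(u_a, p_a) + sum(u_a[j] * kl(q_x_given_a[:, j], p_x_given_a[:, j])
                         for j in range(len(u_a)))
print(lhs, rhs)                           # equal up to rounding error
</syntaxhighlight>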
MDI can be seen as an extension of Laplace's Principle of Insufficient Reason, and the Principle of Maximum Entropy of E. T. Jaynes. In particular, it is the natural extension of the principle of maximum entropy from discrete to continuous distributions, for which Shannon entropy ceases to be so useful (see ''differential entropy''), but the relative entropy continues to be just as relevant.
In the engineering literature, MDI is sometimes called the Principle of Minimum Cross-Entropy (MCE) or Minxent for short. Minimising relative entropy from m to p with respect to m is equivalent to minimising the cross-entropy of p and m, since
:<math>H(p, m) = H(p) + D_\text{KL}(p \parallel m),</math>
which is appropriate if one is trying to choose an adequate approximation to p. However, this is just as often ''not'' the task one is trying to achieve. Instead, just as often it is m that is some fixed prior reference measure, and p that one is attempting to optimise by minimising D_KL(p ∥ m) subject to some constraint. This has led to some ambiguity in the literature, with some authors attempting to resolve the inconsistency by redefining cross-entropy to be D_KL(p ∥ m), rather than H(p, m).
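The identity behind the MCE terminology can be checked directly for discrete distributions; the distributions in this minimal sketch are arbitrary.
<syntaxhighlight lang="python">
import numpy as np

p = np.array([0.7, 0.2, 0.1])        # arbitrary fixed target distribution
m = np.array([0.4, 0.4, 0.2])        # arbitrary candidate approximation / reference measure

H_p  = -np.sum(p * np.log2(p))               # entropy H(p)
H_pm = -np.sum(p * np.log2(m))               # cross-entropy H(p, m)
D_pm =  np.sum(p * np.log2(p / m))           # relative entropy D_KL(p || m)
print(np.isclose(H_pm, H_p + D_pm))          # True: they differ only by H(p), constant in m
</syntaxhighlight>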
Relationship to available work
Surprisals add where probabilities multiply. The surprisal for an event of probability p is defined as s = k ln(1/p). If k is {1, 1/ln 2, 1.38 × 10⁻²³} then surprisal is in {nats, bits, or J/K} so that, for instance, there are N bits of surprisal for landing all "heads" on a toss of N coins.
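For instance, with k = 1/ln 2 the surprisal is simply the negative base-2 logarithm of the probability, as in this minimal sketch (the coin example mirrors the text; the function name is ours):
<syntaxhighlight lang="python">
import numpy as np

def surprisal_bits(p):
    """Surprisal of an event of probability p, in bits (i.e. k = 1/ln 2)."""
    return -np.log2(p)

n_coins = 10                                 # illustrative number of fair coins
print(surprisal_bits(0.5 ** n_coins))        # 10.0 bits for all heads on a toss of 10 coins
print(surprisal_bits(0.25) == surprisal_bits(0.5) + surprisal_bits(0.5))  # surprisals add
</syntaxhighlight>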
Best-guess states (e.g. for atoms in a gas) are inferred by maximizing the ''average surprisal'' S (entropy) for a given set of control parameters (like pressure P or volume V). This constrained entropy maximization, both classically and quantum mechanically, minimizes Gibbs availability in entropy units A ≡ −k ln Z, where Z is a constrained multiplicity or partition function.
When temperature T is fixed, free energy (T × A) is also minimized. Thus if T, V and number of molecules N are constant, the Helmholtz free energy F = U − TS (where U is energy and S is entropy) is minimized as a system "equilibrates." If T and P are held constant (say during processes in your body), the Gibbs free energy G = U + PV − TS is minimized instead. The change in free energy under these conditions is a measure of available work that might be done in the process. Thus available work for an ideal gas at constant temperature T₀ and pressure P₀ is W = ΔG = NkT₀Θ(V/V₀), where V₀ = NkT₀/P₀ and Θ(x) = x − 1 − ln x ≥ 0 (see also Gibbs inequality).
More generally the work available relative to some ambient is obtained by multiplying ambient temperature T₀ by relative entropy or ''net surprisal'' ΔI ≥ 0, defined as the average value of k ln(p/p₀) where p₀ is the probability of a given state under ambient conditions. For instance, the work available in equilibrating a monatomic ideal gas to ambient values of V₀ and T₀ is thus W = T₀ ΔI, where relative entropy
:<math>\Delta I = N k \left[\Theta\!\left(\frac{V}{V_0}\right) + \tfrac{3}{2}\,\Theta\!\left(\frac{T}{T_0}\right)\right].</math>
The resulting contours of constant relative entropy, shown at right for a mole of Argon at standard temperature and pressure, for example put limits on the conversion of hot to cold as in flame-powered air-conditioning or in the unpowered device to convert boiling-water to ice-water discussed here. Thus relative entropy measures thermodynamic availability in bits.
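Stripped of the gas-specific details, the net-surprisal recipe is: average k ln(p/p₀) over states and multiply by the ambient temperature. The sketch below does exactly that for a made-up three-state system; the occupation probabilities are purely illustrative and not tied to any physical example in the text.
<syntaxhighlight lang="python">
import numpy as np

k_B = 1.380649e-23        # Boltzmann constant in J/K
T_ambient = 298.0         # ambient temperature in K

# Purely illustrative occupation probabilities over a few states:
p_actual  = np.array([0.70, 0.20, 0.10])   # actual (non-equilibrium) distribution
p_ambient = np.array([0.25, 0.50, 0.25])   # distribution under ambient conditions

net_surprisal = np.sum(p_actual * np.log(p_actual / p_ambient))   # dimensionless (nats)
work_available = k_B * T_ambient * net_surprisal                  # joules
print(net_surprisal, work_available)
</syntaxhighlight>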
Quantum information theory
For density matrices P and Q on a Hilbert space, the quantum relative entropy from Q to P is defined to be
:<math>D_\text{KL}(P \parallel Q) = \operatorname{Tr}\bigl(P (\log P - \log Q)\bigr).</math>
In quantum information science the minimum of D_KL(P ∥ Q) over all separable states Q can also be used as a measure of entanglement in the state P.
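A direct numerical transcription of the definition, using a matrix logarithm, is shown below; the two single-qubit density matrices are arbitrary full-rank examples (the formula is undefined when the support of P is not contained in that of Q).
<syntaxhighlight lang="python">
import numpy as np
from scipy.linalg import logm

def quantum_relative_entropy(P, Q):
    """Tr[ P (log P - log Q) ] for full-rank density matrices, natural log."""
    return float(np.real(np.trace(P @ (logm(P) - logm(Q)))))

P = np.array([[0.9, 0.0],
              [0.0, 0.1]])     # arbitrary diagonal single-qubit state
Q = np.array([[0.5, 0.1],
              [0.1, 0.5]])     # arbitrary full-rank single-qubit state
print(quantum_relative_entropy(P, Q))
print(quantum_relative_entropy(P, P))   # zero, as for the classical divergence
</syntaxhighlight>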
Relationship between models and reality
Just as relative entropy of "actual from ambient" measures thermodynamic availability, relative entropy of "reality from a model" is also useful even if the only clues we have about reality are some experimental measurements. In the former case relative entropy describes ''distance to equilibrium'' or (when multiplied by ambient temperature) the amount of ''available work'', while in the latter case it tells you about surprises that reality has up its sleeve or, in other words, ''how much the model has yet to learn''.
Although this tool for evaluating models against systems that are accessible experimentally may be applied in any field, its application to selecting a statistical model via Akaike information criterion is particularly well described in papers and a book by Burnham and Anderson. In a nutshell the relative entropy of reality from a model may be estimated, to within a constant additive term, by a function of the deviations observed between data and the model's predictions (like the mean squared deviation). Estimates of such divergence for models that share the same additive term can in turn be used to select among models.
When trying to fit parametrized models to data there are various estimators which attempt to minimize relative entropy, such as maximum likelihood and maximum spacing estimators.
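The link between maximum likelihood and relative entropy can be seen in a small sketch: for a discrete model, the average negative log-likelihood of the data is the cross-entropy between the empirical distribution and the model, so ranking models by likelihood is the same as ranking them by their divergence from the empirical distribution. The data-generating probabilities and candidate models below are invented.
<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(1)
truth = np.array([0.5, 0.3, 0.2])                        # "reality", unknown to the modeller
data = rng.choice(3, size=10_000, p=truth)
empirical = np.bincount(data, minlength=3) / data.size   # empirical distribution of the data

def avg_neg_log_lik(model):
    """Average negative log-likelihood = cross-entropy H(empirical, model), in nats."""
    return -np.mean(np.log(model[data]))

def kl(p, q):
    return float(np.sum(p * np.log(p / q)))

model_a = np.array([1/3, 1/3, 1/3])                      # hypothetical candidate model A
model_b = np.array([0.5, 0.3, 0.2])                      # hypothetical candidate model B
for model in (model_a, model_b):
    print(avg_neg_log_lik(model), kl(empirical, model))  # rankings agree: they differ by a constant
</syntaxhighlight>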
Symmetrised divergence
Kullback and Leibler also considered the symmetrized function:
:<math>D_\text{KL}(P \parallel Q) + D_\text{KL}(Q \parallel P),</math>
which they referred to as the "divergence", though today the "KL divergence" refers to the asymmetric function (see for the evolution of the term). This function is symmetric and nonnegative, and had already been defined and used by Harold Jeffreys in 1948; it is accordingly called the Jeffreys divergence.
This quantity has sometimes been used for feature selection in classification problems, where P and Q are the conditional pdfs of a feature under two different classes. In the Banking and Finance industries, this quantity is referred to as Population Stability Index (PSI), and is used to assess distributional shifts in model features through time.
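In practice the PSI is usually computed from binned proportions of a feature in a reference sample versus a recent sample. The sketch below uses that standard recipe, which is exactly the symmetrized divergence above; the bin proportions are invented, and the small smoothing constant is an implementation detail (to avoid empty bins), not part of the definition. Practitioners often cite rough thresholds such as flagging values above about 0.25 as a significant shift.
<syntaxhighlight lang="python">
import numpy as np

def population_stability_index(expected, actual, eps=1e-6):
    """Symmetrized divergence sum((a - e) * ln(a / e)) between two binned distributions."""
    e = np.asarray(expected, float) + eps     # eps smoothing: illustrative implementation choice
    a = np.asarray(actual, float) + eps
    e, a = e / e.sum(), a / a.sum()
    return float(np.sum((a - e) * np.log(a / e)))

# Hypothetical bin proportions of one model feature: at development time vs. today.
expected = [0.10, 0.20, 0.40, 0.20, 0.10]
actual   = [0.05, 0.15, 0.35, 0.25, 0.20]
print(population_stability_index(expected, actual))
</syntaxhighlight>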
An alternative is given via the λ-divergence,
:<math>D_\lambda(P \parallel Q) = \lambda\, D_\text{KL}\bigl(P \parallel \lambda P + (1 - \lambda) Q\bigr) + (1 - \lambda)\, D_\text{KL}\bigl(Q \parallel \lambda P + (1 - \lambda) Q\bigr),</math>
which can be interpreted as the expected information gain about X from discovering which probability distribution X is drawn from, P or Q, if they currently have probabilities λ and 1 − λ respectively.
The value λ = 0.5 gives the Jensen–Shannon divergence, defined by
:<math>D_\text{JS} = \tfrac{1}{2} D_\text{KL}(P \parallel M) + \tfrac{1}{2} D_\text{KL}(Q \parallel M),</math>
where M is the average of the two distributions,
:<math>M = \tfrac{1}{2}(P + Q).</math>
D_JS can also be interpreted as the capacity of a noisy information channel with two inputs giving the output distributions P and Q. The Jensen–Shannon divergence, like all ''f''-divergences, is ''locally'' proportional to the Fisher information metric. It is similar to the Hellinger metric (in the sense that it induces the same affine connection on a statistical manifold).
Furthermore, the Jensen-Shannon divergence can be generalized using abstract statistical M-mixtures relying on an abstract mean M.
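A minimal implementation of the λ = 0.5 case follows directly from the two formulas above; the input distributions here are arbitrary examples.
<syntaxhighlight lang="python">
import numpy as np

def kl_bits(p, q):
    mask = p > 0
    return float(np.sum(p[mask] * np.log2(p[mask] / q[mask])))

def jensen_shannon_bits(p, q):
    """0.5 * D_KL(p || M) + 0.5 * D_KL(q || M) with M = (p + q) / 2."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    m = 0.5 * (p + q)
    return 0.5 * kl_bits(p, m) + 0.5 * kl_bits(q, m)

p = np.array([0.8, 0.1, 0.1])   # arbitrary example distributions
q = np.array([0.1, 0.1, 0.8])
print(jensen_shannon_bits(p, q))   # symmetric and bounded by 1 bit
print(jensen_shannon_bits(q, p))   # same value
</syntaxhighlight>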
Relationship to other probability-distance measures
There are many other important measures of probability distance. Some of these are particularly connected with relative entropy. For example:
* The total variation distance, <math>\delta(P, Q)</math>. This is connected to the divergence through Pinsker's inequality: <math>\delta(P, Q) \le \sqrt{\tfrac{1}{2} D_\text{KL}(P \parallel Q)}</math>. Pinsker's inequality is vacuous for any distributions where <math>D_\text{KL}(P \parallel Q) > 2</math>, since the total variation distance is at most 1. For such distributions, an alternative bound can be used, due to Bretagnolle and Huber (see, also, Tsybakov[Tsybakov, Alexandre B., ''Introduction to nonparametric estimation'', Revised and extended from the 2004 French original. Translated by Vladimir Zaiats. Springer Series in Statistics. Springer, New York, 2009. xii+214 pp., Equation 2.25.]): <math>\delta(P, Q) \le \sqrt{1 - e^{-D_\text{KL}(P \parallel Q)}}</math> (a numerical sketch of both bounds is given after this list).
* The family of Rényi divergences generalize relative entropy. Depending on the value of a certain parameter, α, various inequalities may be deduced.
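The sketch below evaluates both bounds for an arbitrary pair of discrete distributions; note that the Pinsker bound can exceed 1 when the divergence is large, while the Bretagnolle–Huber bound cannot.
<syntaxhighlight lang="python">
import numpy as np

def total_variation(p, q):
    """delta(P, Q) = max_A |P(A) - Q(A)| = 0.5 * sum(|p - q|), at most 1."""
    return 0.5 * float(np.sum(np.abs(p - q)))

def kl_nats(p, q):
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

p = np.array([0.9, 0.05, 0.05])   # arbitrary example distributions
q = np.array([0.2, 0.40, 0.40])

d = kl_nats(p, q)
print(total_variation(p, q))          # the quantity being bounded
print(np.sqrt(0.5 * d))               # Pinsker bound: can exceed 1 when d > 2
print(np.sqrt(1.0 - np.exp(-d)))      # Bretagnolle–Huber bound: always at most 1
</syntaxhighlight>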
Other notable measures of distance include the Hellinger distance, ''histogram intersection'', ''chi-squared statistic'', ''quadratic form distance'', ''match distance'', ''Kolmogorov–Smirnov distance'', and ''earth mover's distance''.
Data differencing
Just as ''absolute'' entropy serves as theoretical background for data ''compression'', ''relative'' entropy serves as theoretical background for data ''differencing'' – the absolute entropy of a set of data in this sense being the data required to reconstruct it (minimum compressed size), while the relative entropy of a target set of data, given a source set of data, is the data required to reconstruct the target ''given'' the source (minimum size of a patch).
See also
* Akaike information criterion
* Bayesian information criterion
* Bregman divergence
* Cross-entropy
* Deviance information criterion
* Entropic value at risk
* Entropy power inequality
* Hellinger distance
* Information gain in decision trees
* Information gain ratio
* Information theory and measure theory
* Jensen–Shannon divergence
* Quantum relative entropy
* Solomon Kullback and Richard Leibler
References
*
* Kullback, Solomon (1959). ''Information Theory and Statistics''. John Wiley & Sons. Republished by Dover Publications in 1968; reprinted in 1978.
External links
* Information Theoretical Estimators Toolbox
* Ruby gem for calculating Kullback–Leibler divergence
* Jon Shlens' tutorial on Kullback–Leibler divergence and likelihood theory
* Matlab code for calculating Kullback–Leibler divergence for discrete distributions
* Sergio Verdú, ''Relative Entropy'', NIPS 2009. One-hour video lecture.
* A modern summary of info-theoretic divergence measures
{{DEFAULTSORT:Kullback-Leibler Divergence
Entropy and information
F-divergences
Information geometry
Thermodynamics