[Portrait of Josiah Willard Gibbs]
In information theory, Gibbs' inequality is a statement about the information entropy of a discrete probability distribution. Several other bounds on the entropy of probability distributions are derived from Gibbs' inequality, including Fano's inequality. It was first presented by J. Willard Gibbs in the 19th century.
Gibbs' inequality
Suppose that
: $P = \{ p_1, \ldots, p_n \}$
is a discrete probability distribution. Then for any other probability distribution
: $Q = \{ q_1, \ldots, q_n \},$
the following inequality between positive quantities (since the $p_i$ and $q_i$ are between zero and one) holds:
: $-\sum_{i=1}^n p_i \log_2 p_i \leq -\sum_{i=1}^n p_i \log_2 q_i,$
with equality if and only if
: $p_i = q_i$
for all $i$. Put in words, the information entropy of a distribution $P$ is less than or equal to its cross entropy with any other distribution $Q$.

The difference between the two quantities is the Kullback–Leibler divergence or relative entropy, so the inequality can also be written:
: $D_{\mathrm{KL}}(P \parallel Q) \equiv \sum_{i=1}^n p_i \log_2 \frac{p_i}{q_i} \geq 0.$
Note that the use of base-2 logarithms is optional, and allows one to refer to the quantity on each side of the inequality as an "average surprisal" measured in bits.
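To make the statement concrete, here is a small numerical sketch in Python (an illustration added for this text, with ad hoc helper names; it assumes finite distributions given as lists of probabilities):

```python
import math

def entropy(p):
    """Shannon entropy in bits: -sum_i p_i log2 p_i (with 0 log 0 = 0)."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    """Cross entropy in bits: -sum_i p_i log2 q_i, over the support of p."""
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.25, 0.25]
q = [0.4, 0.4, 0.2]

h = entropy(p)            # H(P) = 1.5 bits
ce = cross_entropy(p, q)  # H(P, Q), always >= H(P) by Gibbs' inequality
kl = ce - h               # D_KL(P || Q) >= 0, zero only when P = Q

assert h <= ce
```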
Proof
For simplicity, we prove the statement using the natural logarithm, $\ln$, since
: $\log_2 a = \frac{\ln a}{\ln 2},$
so the particular logarithm we choose only scales the relationship by a constant factor.

Let $I$ denote the set of all $i$ for which $p_i$ is non-zero. Then, since $\ln x \leq x - 1$ for all $x > 0$, with equality if and only if $x = 1$, we have:
: $-\sum_{i \in I} p_i \ln \frac{q_i}{p_i} \geq -\sum_{i \in I} p_i \left( \frac{q_i}{p_i} - 1 \right) = -\sum_{i \in I} q_i + \sum_{i \in I} p_i = -\sum_{i \in I} q_i + 1 \geq 0.$
The last inequality is a consequence of the $p_i$ and $q_i$ being part of a probability distribution: the sum over $I$ includes every non-zero $p_i$, so the $p_i$ sum to 1, while some non-zero $q_i$ may have been excluded, since the choice of indices is conditioned upon the $p_i$ being non-zero. Therefore the sum of the $q_i$ over $I$ may be less than 1.
So far, over the index set $I$, we have:
: $-\sum_{i \in I} p_i \ln \frac{q_i}{p_i} \geq 0,$
or equivalently
: $-\sum_{i \in I} p_i \ln q_i \geq -\sum_{i \in I} p_i \ln p_i.$
Both sums can be extended to all $i = 1, \ldots, n$, i.e. including indices with $p_i = 0$, by recalling that the expression $p \ln p$ tends to 0 as $p$ tends to 0, and $-\ln q$ tends to $+\infty$ as $q$ tends to 0. We arrive at
: $-\sum_{i=1}^n p_i \ln q_i \geq -\sum_{i=1}^n p_i \ln p_i.$
For equality to hold, we require
# $\frac{q_i}{p_i} = 1$ for all $i \in I$, so that the equality $\ln \frac{q_i}{p_i} = \frac{q_i}{p_i} - 1$ holds,
# and $\sum_{i \in I} q_i = 1$, which means $q_i = 0$ if $i \notin I$, that is, $q_i = 0$ if $p_i = 0$.
This can happen if and only if $p_i = q_i$ for $i = 1, \ldots, n$.
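The chain of inequalities in this proof can be spot-checked numerically. The Python sketch below (our own illustration; the helper name `check_chain` is not standard) verifies that $-\sum_{i \in I} p_i \ln(q_i/p_i) \geq 1 - \sum_{i \in I} q_i \geq 0$ on random distributions, including a case where some $p_i = 0$:

```python
import math
import random

def check_chain(p, q, tol=1e-12):
    """Verify -sum_{i in I} p_i ln(q_i/p_i) >= 1 - sum_{i in I} q_i >= 0,
    where I is the support of p, as in the proof above."""
    I = [i for i in range(len(p)) if p[i] > 0]
    lhs = -sum(p[i] * math.log(q[i] / p[i]) for i in I)
    # Termwise ln x <= x - 1 bounds lhs below by 1 - sum of q over I.
    middle = 1.0 - sum(q[i] for i in I)
    assert lhs + tol >= middle >= -tol

random.seed(0)
for _ in range(200):
    raw_p = [random.random() for _ in range(4)]
    raw_q = [random.random() for _ in range(4)]
    sp, sq = sum(raw_p), sum(raw_q)
    check_chain([x / sp for x in raw_p], [x / sq for x in raw_q])

# An index with p_i = 0 is simply excluded from I.
check_chain([0.5, 0.5, 0.0], [0.25, 0.25, 0.5])
```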
Alternative proofs
The result can alternatively be proved using Jensen's inequality, the log sum inequality, or the fact that the Kullback–Leibler divergence is a form of Bregman divergence. Below we give a proof based on Jensen's inequality.
Because $\log$ is a concave function, we have that:
: $\sum_i p_i \log \frac{q_i}{p_i} \leq \log \sum_i p_i \frac{q_i}{p_i} = \log \sum_i q_i \leq 0,$
where the first inequality is due to Jensen's inequality, and the final inequality holds because $\sum_i q_i \leq 1$, for the same reason given in the proof above.

Furthermore, since $\log$ is strictly concave, the equality condition of Jensen's inequality gives equality when
: $\frac{q_1}{p_1} = \frac{q_2}{p_2} = \cdots = \frac{q_n}{p_n}$
and
: $\sum_i q_i = 1.$
Suppose that this common ratio is $\sigma$; then we have that
: $1 = \sum_i q_i = \sum_i \sigma p_i = \sigma,$
where we use the fact that $P$ and $Q$ are probability distributions. Therefore equality holds when $p_i = q_i$ for all $i$.
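As with the first proof, the Jensen-based bound is easy to verify numerically. A brief Python sketch (illustrative only; the ad hoc helper `kl_nats` computes the divergence in nats):

```python
import math

def kl_nats(p, q):
    """Kullback-Leibler divergence in nats, summed over the support of p."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.2, 0.3, 0.5]
q = [0.1, 0.6, 0.3]

# Jensen: sum_i p_i log(q_i/p_i) <= log sum_i q_i <= 0, i.e. D(P||Q) >= 0.
assert -kl_nats(p, q) <= 0
# Equality requires a constant ratio q_i/p_i, which forces Q = P.
assert kl_nats(p, p) == 0
```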
Corollary
The entropy of $P$ is bounded by:
: $H(p_1, \ldots, p_n) \leq \log_2 n.$
The proof is trivial: simply set $q_i = 1/n$ for all $i$ in Gibbs' inequality.
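The corollary can likewise be checked directly. A short Python sketch (illustrative only), using the uniform $q_i = 1/n$ from the proof:

```python
import math

def entropy_bits(p):
    """Shannon entropy in bits, with 0 log 0 taken as 0."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

p = [0.7, 0.1, 0.1, 0.1]
n = len(p)

# Gibbs' inequality with q_i = 1/n gives H(P) <= log2 n.
assert entropy_bits(p) <= math.log2(n)
# Equality holds exactly when P is uniform.
assert abs(entropy_bits([1 / n] * n) - math.log2(n)) < 1e-12
```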
See also
* Information entropy
* Bregman divergence
* Log sum inequality