[Portrait of Josiah Willard Gibbs]
In information theory, Gibbs' inequality is a statement about the information entropy of a discrete probability distribution. Several other bounds on the entropy of probability distributions are derived from Gibbs' inequality, including Fano's inequality. It was first presented by J. Willard Gibbs in the 19th century.
Gibbs' inequality
Suppose that
: $P = \{ p_1, \ldots, p_n \}$
is a discrete probability distribution. Then for any other probability distribution
: $Q = \{ q_1, \ldots, q_n \},$
the following inequality between positive quantities (since the $p_i$ and $q_i$ are between zero and one) holds:
: $-\sum_{i=1}^n p_i \log_2 p_i \leq -\sum_{i=1}^n p_i \log_2 q_i,$
with equality if and only if
: $p_i = q_i$
for all $i$. Put in words, the information entropy of a distribution $P$ is less than or equal to its cross entropy with any other distribution $Q$.

The difference between the two quantities is the Kullback–Leibler divergence or relative entropy, so the inequality can also be written:
: $D_{\mathrm{KL}}(P \parallel Q) \equiv \sum_{i=1}^n p_i \log_2 \frac{p_i}{q_i} \geq 0.$
Note that the use of base-2 logarithms is optional, and allows one to refer to the quantity on each side of the inequality as an "average surprisal" measured in bits.
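To make the statement concrete, here is a small numerical sketch in Python (an illustration added for this text, with ad hoc helper names; it assumes finite distributions given as lists of probabilities):

```python
import math

def entropy(p):
    """Shannon entropy in bits: -sum_i p_i log2 p_i (with 0 log 0 = 0)."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    """Cross entropy in bits: -sum_i p_i log2 q_i, over the support of p."""
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.25, 0.25]
q = [0.4, 0.4, 0.2]

h = entropy(p)            # H(P) = 1.5 bits
ce = cross_entropy(p, q)  # H(P, Q), always >= H(P) by Gibbs' inequality
kl = ce - h               # D_KL(P || Q) >= 0, zero only when P = Q

assert h <= ce
```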
Proof
For simplicity, we prove the statement using the natural logarithm, $\ln$, since
: $\log_2 a = \frac{\ln a}{\ln 2},$
so the particular logarithm we choose only scales the relationship by a constant factor.

Let $I$ denote the set of all $i$ for which $p_i$ is non-zero. Then, since $\ln x \leq x - 1$ for all $x > 0$, with equality if and only if $x = 1$, we have:
: $-\sum_{i \in I} p_i \ln \frac{q_i}{p_i} \geq -\sum_{i \in I} p_i \left( \frac{q_i}{p_i} - 1 \right) = -\sum_{i \in I} q_i + \sum_{i \in I} p_i = -\sum_{i \in I} q_i + 1 \geq 0.$
The last inequality is a consequence of the $p_i$ and $q_i$ being part of a probability distribution: the sum over $I$ includes every non-zero $p_i$, so the $p_i$ sum to 1, while some non-zero $q_i$ may have been excluded, since the choice of indices is conditioned upon the $p_i$ being non-zero. Therefore the sum of the $q_i$ over $I$ may be less than 1.
So far, over the index set $I$, we have:
: $-\sum_{i \in I} p_i \ln \frac{q_i}{p_i} \geq 0,$
or equivalently
: $-\sum_{i \in I} p_i \ln q_i \geq -\sum_{i \in I} p_i \ln p_i.$
Both sums can be extended to all $i = 1, \ldots, n$, i.e. including indices with $p_i = 0$, by recalling that the expression $p \ln p$ tends to 0 as $p$ tends to 0, and $-\ln q$ tends to $+\infty$ as $q$ tends to 0. We arrive at
: $-\sum_{i=1}^n p_i \ln q_i \geq -\sum_{i=1}^n p_i \ln p_i.$
For equality to hold, we require
# $\frac{q_i}{p_i} = 1$ for all $i \in I$, so that the equality $\ln \frac{q_i}{p_i} = \frac{q_i}{p_i} - 1$ holds,
# and $\sum_{i \in I} q_i = 1$, which means $q_i = 0$ if $i \notin I$, that is, $q_i = 0$ if $p_i = 0$.
This can happen if and only if $p_i = q_i$ for $i = 1, \ldots, n$.
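The chain of inequalities in this proof can be spot-checked numerically. The Python sketch below (our own illustration; the helper name `check_chain` is not standard) verifies that $-\sum_{i \in I} p_i \ln(q_i/p_i) \geq 1 - \sum_{i \in I} q_i \geq 0$ on random distributions, including a case where some $p_i = 0$:

```python
import math
import random

def check_chain(p, q, tol=1e-12):
    """Verify -sum_{i in I} p_i ln(q_i/p_i) >= 1 - sum_{i in I} q_i >= 0,
    where I is the support of p, as in the proof above."""
    I = [i for i in range(len(p)) if p[i] > 0]
    lhs = -sum(p[i] * math.log(q[i] / p[i]) for i in I)
    # Termwise ln x <= x - 1 bounds lhs below by 1 - sum of q over I.
    middle = 1.0 - sum(q[i] for i in I)
    assert lhs + tol >= middle >= -tol

random.seed(0)
for _ in range(200):
    raw_p = [random.random() for _ in range(4)]
    raw_q = [random.random() for _ in range(4)]
    sp, sq = sum(raw_p), sum(raw_q)
    check_chain([x / sp for x in raw_p], [x / sq for x in raw_q])

# An index with p_i = 0 is simply excluded from I.
check_chain([0.5, 0.5, 0.0], [0.25, 0.25, 0.5])
```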
Alternative proofs
The result can alternatively be proved using Jensen's inequality, the log sum inequality, or the fact that the Kullback–Leibler divergence is a form of Bregman divergence. Below we give a proof based on Jensen's inequality.
Because $\log$ is a concave function, we have that:
: $\sum_i p_i \log \frac{q_i}{p_i} \leq \log \sum_i p_i \frac{q_i}{p_i} = \log \sum_i q_i \leq 0,$
where the first inequality is due to Jensen's inequality, and the final inequality holds because $\sum_i q_i \leq 1$, for the same reason given in the proof above.

Furthermore, since $\log$ is strictly concave, the equality condition of Jensen's inequality gives equality when
: $\frac{q_1}{p_1} = \frac{q_2}{p_2} = \cdots = \frac{q_n}{p_n}$
and
: $\sum_i q_i = 1.$
Suppose that this common ratio is $\sigma$; then we have that
: $1 = \sum_i q_i = \sum_i \sigma p_i = \sigma,$
where we use the fact that $P$ and $Q$ are probability distributions. Therefore equality holds when $p_i = q_i$ for all $i$.
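As with the first proof, the Jensen-based bound is easy to verify numerically. A brief Python sketch (illustrative only; the ad hoc helper `kl_nats` computes the divergence in nats):

```python
import math

def kl_nats(p, q):
    """Kullback-Leibler divergence in nats, summed over the support of p."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.2, 0.3, 0.5]
q = [0.1, 0.6, 0.3]

# Jensen: sum_i p_i log(q_i/p_i) <= log sum_i q_i <= 0, i.e. D(P||Q) >= 0.
assert -kl_nats(p, q) <= 0
# Equality requires a constant ratio q_i/p_i, which forces Q = P.
assert kl_nats(p, p) == 0
```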
Corollary
The entropy of $P$ is bounded by:
: $H(p_1, \ldots, p_n) \leq \log_2 n.$
The proof is trivial: simply set $q_i = 1/n$ for all $i$ in Gibbs' inequality.
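The corollary can likewise be checked directly. A short Python sketch (illustrative only), using the uniform $q_i = 1/n$ from the proof:

```python
import math

def entropy_bits(p):
    """Shannon entropy in bits, with 0 log 0 taken as 0."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

p = [0.7, 0.1, 0.1, 0.1]
n = len(p)

# Gibbs' inequality with q_i = 1/n gives H(P) <= log2 n.
assert entropy_bits(p) <= math.log2(n)
# Equality holds exactly when P is uniform.
assert abs(entropy_bits([1 / n] * n) - math.log2(n)) < 1e-12
```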
See also
* Information entropy
* Bregman divergence
* Log sum inequality