Pinsker's inequality
In information theory, Pinsker's inequality, named after its inventor Mark Semenovich Pinsker, is an inequality that bounds the total variation distance (or statistical distance) in terms of the Kullback–Leibler divergence. The inequality is tight up to constant factors.


Formal statement

Pinsker's inequality states that, if P and Q are two probability distributions on a measurable space (X, \Sigma), then

:\delta(P,Q) \le \sqrt{\frac{1}{2} D_{\mathrm{KL}}(P\parallel Q)},

where

:\delta(P,Q) = \sup \bigl\{ |P(A) - Q(A)| : A \in \Sigma \bigr\}

is the total variation distance (or statistical distance) between P and Q, and

:D_{\mathrm{KL}}(P\parallel Q) = \operatorname{E}_P \left( \log \frac{\mathrm{d}P}{\mathrm{d}Q} \right) = \int_X \left( \log \frac{\mathrm{d}P}{\mathrm{d}Q} \right) \mathrm{d}P

is the Kullback–Leibler divergence in nats. When the sample space X is a finite set, the Kullback–Leibler divergence is given by

:D_{\mathrm{KL}}(P\parallel Q) = \sum_{i \in X} \left( \log \frac{P(i)}{Q(i)} \right) P(i).

Note that in terms of the total variation norm \| P - Q \| of the signed measure P - Q, Pinsker's inequality differs from the one given above by a factor of two:

:\| P - Q \| \le \sqrt{2 D_{\mathrm{KL}}(P\parallel Q)}.

A proof of Pinsker's inequality uses the partition inequality for ''f''-divergences.
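As a concrete illustration, the following minimal Python sketch checks the inequality numerically on a small finite sample space. The distributions and helper functions are chosen for this example only and are not part of the original statement.

```python
import numpy as np

def total_variation(p, q):
    """Total variation distance delta(P, Q) = (1/2) sum_i |p_i - q_i| on a finite space."""
    return 0.5 * np.abs(p - q).sum()

def kl_divergence(p, q):
    """Kullback-Leibler divergence D_KL(P || Q) in nats (natural logarithm)."""
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

# Two example distributions on a four-point sample space (illustrative values).
p = np.array([0.4, 0.3, 0.2, 0.1])
q = np.array([0.25, 0.25, 0.25, 0.25])

delta = total_variation(p, q)
d_kl = kl_divergence(p, q)

# Pinsker's inequality: delta(P, Q) <= sqrt(D_KL(P || Q) / 2).
print(f"delta(P,Q)     = {delta:.4f}")
print(f"sqrt(D_KL / 2) = {np.sqrt(d_kl / 2):.4f}")
assert delta <= np.sqrt(d_kl / 2)
```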


Alternative version

Note that the expression of Pinsker's inequality depends on what basis of logarithm is used in the definition of KL-divergence. D_{\mathrm{KL}} is defined using \ln (logarithm in base e), whereas D is typically defined with \log_2 (logarithm in base 2). Then,

:D(P\parallel Q) = \frac{D_{\mathrm{KL}}(P\parallel Q)}{\ln 2}.

Given the above comments, there is an alternative statement of Pinsker's inequality in some literature that relates information divergence to variation distance:

:D(P\parallel Q) = \frac{D_{\mathrm{KL}}(P\parallel Q)}{\ln 2} \ge \frac{1}{2 \ln 2} V^2(p, q),

i.e.,

:\sqrt{\frac{1}{2} D_{\mathrm{KL}}(P\parallel Q)} \ge \frac{V(p, q)}{2},

in which

:V(p, q) = \sum_{x \in \mathcal{X}} |p(x) - q(x)|

is the (non-normalized) variation distance between two probability density functions p and q on the same alphabet \mathcal{X}. This form of Pinsker's inequality shows that "convergence in divergence" is a stronger notion than "convergence in variation distance".

A simple proof by John Pollard is shown by letting r(x) = P(x)/Q(x) - 1 \ge -1:

:\begin{align}
D_{\mathrm{KL}}(P \parallel Q) &= \operatorname{E}_Q\bigl[(1+r(x))\log(1+r(x)) - r(x)\bigr] \\
&\ge \frac{1}{2} \operatorname{E}_Q\!\left[\frac{r(x)^2}{1 + r(x)/3}\right] \\
&\ge \frac{1}{2} \frac{\operatorname{E}_Q[|r(x)|]^2}{\operatorname{E}_Q[1 + r(x)/3]} && \text{(Titu's lemma)} \\
&= \frac{1}{2} \operatorname{E}_Q[|r(x)|]^2 && \text{since } \operatorname{E}_Q[1 + r(x)/3] = 1 \\
&= \frac{1}{2} V(p, q)^2.
\end{align}

Here Titu's lemma is also known as Sedrakyan's inequality.
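The sketch below, a standalone illustration with example distributions not taken from the text, checks the base-2 form D(P\parallel Q) \ge V^2(p,q)/(2\ln 2) numerically, together with the pointwise inequality (1+r)\log(1+r) - r \ge r^2/(2(1+r/3)) used in the first step of Pollard's proof.

```python
import numpy as np

# Example distributions (illustrative values, not from the article).
p = np.array([0.4, 0.3, 0.2, 0.1])
q = np.array([0.25, 0.25, 0.25, 0.25])

V = np.abs(p - q).sum()               # non-normalized variation distance V(p, q)
d_kl = np.sum(p * np.log(p / q))      # D_KL in nats (base e)
d_bits = d_kl / np.log(2)             # D in bits (base 2)

# Alternative form: D(P || Q) >= V^2 / (2 ln 2), equivalently D_KL >= V^2 / 2.
assert d_bits >= V**2 / (2 * np.log(2))

# First step of Pollard's proof, checked pointwise on a grid:
# (1 + r) log(1 + r) - r >= r^2 / (2 (1 + r/3)) for r >= -1.
r = np.linspace(-0.999, 10.0, 1001)
lhs = (1 + r) * np.log1p(r) - r
rhs = r**2 / (2 * (1 + r / 3))
assert np.all(lhs >= rhs - 1e-12)
print("alternative form and pointwise bound hold on this example")
```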
Note that the bound from Pinsker's inequality is vacuous for any distributions where D_{\mathrm{KL}}(P\parallel Q) > 2, since the total variation distance is at most 1. For such distributions, an alternative bound can be used, due to Bretagnolle and Huber (see also Tsybakov):

:\delta(P,Q) \le \sqrt{1 - e^{-D_{\mathrm{KL}}(P\parallel Q)}}.
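To compare the two results, the short sketch below (function names are for this illustration only) evaluates the Pinsker and Bretagnolle–Huber upper bounds on the total variation distance for a few divergence values; beyond D_{\mathrm{KL}} = 2 only the latter is informative.

```python
import numpy as np

def pinsker_bound(d_kl):
    """Upper bound on the total variation distance from Pinsker's inequality."""
    return np.sqrt(d_kl / 2)

def bretagnolle_huber_bound(d_kl):
    """Upper bound on the total variation distance from the Bretagnolle-Huber inequality."""
    return np.sqrt(1 - np.exp(-d_kl))

# For D_KL > 2 the Pinsker bound exceeds 1 (vacuous, since delta <= 1),
# while the Bretagnolle-Huber bound always stays strictly below 1.
for d_kl in [0.5, 2.0, 5.0]:
    print(f"D_KL = {d_kl}: Pinsker <= {pinsker_bound(d_kl):.3f}, "
          f"Bretagnolle-Huber <= {bretagnolle_huber_bound(d_kl):.3f}")
```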


History

Pinsker first proved the inequality with a greater constant. The inequality in the above form was proved independently by Kullback, Csiszár, and Kemperman.


Inverse problem

A precise inverse of the inequality cannot hold: for every \varepsilon > 0, there are distributions P_\varepsilon, Q with \delta(P_\varepsilon,Q) \le \varepsilon but D_{\mathrm{KL}}(P_\varepsilon\parallel Q) = \infty. An easy example is given by the two-point space \{0,1\} with Q(0) = 0, Q(1) = 1 and P_\varepsilon(0) = \varepsilon, P_\varepsilon(1) = 1 - \varepsilon.

However, an inverse inequality holds on finite spaces X with a constant depending on Q (see Lemma 4.1 in the cited reference). More specifically, it can be shown that with the definition

:\alpha_Q := \min_{x \in X : Q(x) > 0} Q(x)

we have for any measure P which is absolutely continuous to Q

:\frac{1}{2} D_{\mathrm{KL}}(P\parallel Q) \le \frac{1}{\alpha_Q} \delta(P,Q)^2.

As a consequence, if Q has full support (i.e. Q(x) > 0 for all x \in X), then

:\delta(P,Q)^2 \le \frac{1}{2} D_{\mathrm{KL}}(P\parallel Q) \le \frac{1}{\alpha_Q} \delta(P,Q)^2.
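A minimal Python sketch of both observations follows; the numerical values and helper functions are assumptions made for this illustration. It reproduces the two-point construction, where the divergence is infinite while the total variation distance stays at \varepsilon, and checks the two-sided bound with \alpha_Q on a full-support Q.

```python
import numpy as np

def total_variation(p, q):
    return 0.5 * np.abs(p - q).sum()

def kl_divergence(p, q):
    # Convention: terms with p = 0 contribute 0; p > 0 where q = 0 gives +infinity.
    if np.any((p > 0) & (q == 0)):
        return np.inf
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

# Counterexample to a precise inverse: small total variation, infinite divergence.
eps = 1e-3
p_eps = np.array([eps, 1 - eps])
q = np.array([0.0, 1.0])
print(total_variation(p_eps, q))   # eps
print(kl_divergence(p_eps, q))     # inf

# Inverse inequality on a full-support Q (illustrative values):
# delta^2 <= D_KL / 2 <= delta^2 / alpha_Q.
p = np.array([0.5, 0.3, 0.2])
q_full = np.array([0.2, 0.5, 0.3])
alpha_q = q_full.min()
delta = total_variation(p, q_full)
d_kl = kl_divergence(p, q_full)
assert delta**2 <= d_kl / 2 <= delta**2 / alpha_q
```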


References

{{Reflist}}


Further reading

* Thomas M. Cover and Joy A. Thomas: ''Elements of Information Theory'', 2nd edition, Wiley-Interscience, 2006
* Nicolo Cesa-Bianchi and Gábor Lugosi: ''Prediction, Learning, and Games'', Cambridge University Press, 2006