In mathematics, Jensen's inequality, named after the Danish mathematician Johan Jensen, relates the value of a convex function of an integral to the integral of the convex function. It was proved by Jensen in 1906, building on an earlier proof of the same inequality for doubly-differentiable functions by Otto Hölder in 1889. Given its generality, the inequality appears in many forms depending on the context, some of which are presented below. In its simplest form the inequality states that the convex transformation of a mean is less than or equal to the mean applied after convex transformation; it is a simple corollary that the opposite is true of concave transformations.

Jensen's inequality generalizes the statement that the secant line of a convex function lies ''above'' the graph of the function, which is Jensen's inequality for two points: the secant line consists of weighted means of the convex function (for ''t'' ∈ [0,1]),

:t f(x_1) + (1-t) f(x_2),

while the graph of the function is the convex function of the weighted means,

:f(t x_1 + (1-t) x_2).

Thus, Jensen's inequality is

:f(t x_1 + (1-t) x_2) \leq t f(x_1) + (1-t) f(x_2).

In the context of probability theory, it is generally stated in the following form: if ''X'' is a random variable and \varphi is a convex function, then

:\varphi(\operatorname{E}[X]) \leq \operatorname{E}\left[\varphi(X)\right].

The difference between the two sides of the inequality, \operatorname{E}\left[\varphi(X)\right] - \varphi\left(\operatorname{E}[X]\right), is called the Jensen gap.
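
A minimal numerical sketch of the two-point form (illustrative only: the convex function f(x) = x^2 and the two points below are arbitrary choices, not part of the statement itself):

```python
# Minimal numerical check of the two-point Jensen inequality:
# f(t*x1 + (1-t)*x2) <= t*f(x1) + (1-t)*f(x2) for convex f.
# Illustrative sketch; f(x) = x**2 is one convenient convex choice.

def f(x):
    return x * x  # convex: f''(x) = 2 >= 0

x1, x2 = -1.0, 3.0
for k in range(11):
    t = k / 10.0
    lhs = f(t * x1 + (1 - t) * x2)      # function of the weighted mean
    rhs = t * f(x1) + (1 - t) * f(x2)   # weighted mean of the function
    assert lhs <= rhs + 1e-12, (t, lhs, rhs)
print("two-point Jensen inequality verified on a grid of t values")
```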


Statements

The classical form of Jensen's inequality involves several numbers and weights. The inequality can be stated quite generally using either the language of measure theory or (equivalently) probability. In the probabilistic setting, the inequality can be further generalized to its ''full strength''.


Finite form

For a real convex function \varphi, numbers x_1, x_2, \ldots, x_n in its domain, and positive weights a_i, Jensen's inequality can be stated as:

:\varphi\left(\frac{\sum a_i x_i}{\sum a_i}\right) \le \frac{\sum a_i \varphi(x_i)}{\sum a_i},

and the inequality is reversed if \varphi is concave, which is

:\varphi\left(\frac{\sum a_i x_i}{\sum a_i}\right) \ge \frac{\sum a_i \varphi(x_i)}{\sum a_i}.

Equality holds if and only if x_1=x_2=\cdots =x_n or \varphi is linear on a domain containing x_1,x_2,\cdots ,x_n. As a particular case, if the weights a_i are all equal, then the two displays above become

:\varphi\left(\frac{\sum x_i}{n}\right) \le \frac{\sum \varphi(x_i)}{n} \quad \text{and} \quad \varphi\left(\frac{\sum x_i}{n}\right) \ge \frac{\sum \varphi(x_i)}{n},

for convex and concave \varphi respectively. For instance, the function \log(x) is ''concave'', so substituting \varphi(x) = \log(x) in the previous formula establishes the (logarithm of the) familiar arithmetic-mean/geometric-mean inequality:

:\log\!\left(\frac{\sum_{i=1}^n x_i}{n}\right) \geq \frac{\sum_{i=1}^n \log(x_i)}{n} \quad \text{or equivalently} \quad \frac{x_1 + x_2 + \cdots + x_n}{n} \geq \sqrt[n]{x_1 x_2 \cdots x_n}.

A common application has ''x'' as a function of another variable (or set of variables) ''t'', that is, x_i = g(t_i). All of this carries directly over to the general continuous case: the weights a_i are replaced by a non-negative integrable function ''f''(''x''), such as a probability distribution, and the summations are replaced by integrals.
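
As a numerical sanity check, the sketch below (an illustration with arbitrary positive inputs) verifies the AM–GM consequence by comparing the two means directly:

```python
# Numerical check of the AM-GM inequality obtained from the finite form
# with the concave function log: log(mean(x)) >= mean(log(x)).
import math
import random

random.seed(0)
xs = [random.uniform(0.1, 10.0) for _ in range(100)]  # positive numbers

arithmetic_mean = sum(xs) / len(xs)
geometric_mean = math.exp(sum(math.log(x) for x in xs) / len(xs))

# Concave log reverses Jensen: the log of the mean dominates the mean of
# the logs, i.e. the arithmetic mean dominates the geometric mean.
assert arithmetic_mean >= geometric_mean
print(arithmetic_mean, ">=", geometric_mean)
```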


Measure-theoretic and probabilistic form

Let (\Omega, A, \mu) be a probability space. Let f : \Omega \to \mathbb{R} be a \mu-measurable function and \varphi : \mathbb{R} \to \mathbb{R} be convex. Then:

:\varphi\left(\int_\Omega f \,\mathrm{d}\mu\right) \leq \int_\Omega \varphi \circ f \,\mathrm{d}\mu.

In real analysis, we may require an estimate on

:\varphi\left(\int_a^b f(x)\, dx\right)

where a, b \in \mathbb{R}, and f\colon [a,b] \to \mathbb{R} is a non-negative Lebesgue-integrable function. In this case, the Lebesgue measure of [a,b] need not be unity. However, by integration by substitution, the interval can be rescaled so that it has measure unity. Then Jensen's inequality can be applied to get

:\varphi\left(\frac{1}{b-a}\int_a^b f(x)\, dx\right) \le \frac{1}{b-a} \int_a^b \varphi(f(x)) \,dx.

The same result can be equivalently stated in a probability theory setting, by a simple change of notation. Let (\Omega, \mathfrak{F}, \operatorname{P}) be a probability space, ''X'' an integrable real-valued random variable and \varphi a convex function. Then:

:\varphi\left(\operatorname{E}[X]\right) \leq \operatorname{E}\left[\varphi(X)\right].

In this probability setting, the measure \mu is intended as a probability \operatorname{P}, the integral with respect to \mu as an expected value \operatorname{E}, and the function f as a random variable ''X''. Note that the equality holds if and only if \varphi is a linear function on some convex set A such that \operatorname{P}(X \in A) = 1 (which follows by inspecting the measure-theoretic proof below).
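
A Monte Carlo sketch (with the illustrative assumptions X ~ Normal(1, 2) and \varphi(x) = |x|, which is convex) approximates both sides of the probabilistic statement by sample averages:

```python
# Monte Carlo illustration of phi(E[X]) <= E[phi(X)] for convex phi.
# Illustrative assumptions: X ~ Normal(mean=1, sd=2), phi(x) = |x|.
import random

random.seed(0)
xs = [random.gauss(1.0, 2.0) for _ in range(100_000)]

phi_of_mean = abs(sum(xs) / len(xs))             # phi(E[X]) ~ |1| = 1
mean_of_phi = sum(abs(x) for x in xs) / len(xs)  # E[|X|] is larger here

assert phi_of_mean <= mean_of_phi
print(f"phi(E[X]) ~ {phi_of_mean:.4f} <= E[phi(X)] ~ {mean_of_phi:.4f}")
```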


General inequality in a probabilistic setting

More generally, let ''T'' be a real topological vector space, and ''X'' a ''T''-valued integrable random variable. In this general setting, ''integrable'' means that there exists an element \operatorname{E}[X] in ''T'', such that for any element ''z'' in the dual space of ''T'': \operatorname{E}|\langle z, X \rangle| < \infty, and \langle z, \operatorname{E}[X]\rangle = \operatorname{E}[\langle z, X \rangle]. Then, for any measurable convex function \varphi and any sub-σ-algebra \mathfrak{G} of \mathfrak{F}:

:\varphi\left(\operatorname{E}\left[X \mid \mathfrak{G}\right]\right) \leq \operatorname{E}\left[\varphi(X) \mid \mathfrak{G}\right].

Here \operatorname{E}[\cdot \mid \mathfrak{G}] stands for the expectation conditioned to the σ-algebra \mathfrak{G}. This general statement reduces to the previous ones when the topological vector space ''T'' is the real axis, and \mathfrak{G} is the trivial σ-algebra \{\varnothing, \Omega\} (where \varnothing is the empty set, and \Omega is the sample space).
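
A small discrete sketch of the conditional form (hypothetical setup: a two-valued label G partitions the sample, and \varphi is the convex "hinge" function): within each conditioning class, the convex image of the class mean stays below the class mean of the convex images.

```python
# Conditional Jensen: phi(E[X | G]) <= E[phi(X) | G], checked on a toy
# sample partitioned into two groups by a label G.  Illustrative setup:
# X uniform on (-2, 2), phi(x) = max(x, 0) (convex).
import random

random.seed(0)
data = [(random.choice("ab"), random.uniform(-2.0, 2.0)) for _ in range(10_000)]

def phi(x):
    return max(x, 0.0)

for label in ("a", "b"):
    xs = [x for (g, x) in data if g == label]
    cond_mean = sum(xs) / len(xs)                      # E[X | G = label]
    cond_mean_phi = sum(phi(x) for x in xs) / len(xs)  # E[phi(X) | G = label]
    assert phi(cond_mean) <= cond_mean_phi
    print(label, round(phi(cond_mean), 4), "<=", round(cond_mean_phi, 4))
```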


A sharpened and generalized form

Let ''X'' be a one-dimensional random variable with mean \mu and variance \sigma^2\ge 0. Let \varphi(x) be a twice differentiable function, and define the function

: h(x) \triangleq \frac{\varphi(x) - \varphi(\mu)}{(x-\mu)^2} - \frac{\varphi'(\mu)}{x-\mu}.

Then

: \sigma^2 \inf_x \frac{\varphi''(x)}{2} \le \sigma^2 \inf_x h(x) \le \operatorname{E}\left[\varphi(X)\right] - \varphi\left(\operatorname{E}[X]\right) \le \sigma^2 \sup_x h(x) \le \sigma^2 \sup_x \frac{\varphi''(x)}{2}.

In particular, when \varphi(x) is convex, then \varphi''(x)\ge 0, and the standard form of Jensen's inequality immediately follows for the case where \varphi(x) is additionally assumed to be twice differentiable.
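
For \varphi(x) = x^2 one computes h(x) = 1 identically and \varphi''(x)/2 = 1, so every bound in the chain collapses to \sigma^2 and the Jensen gap must equal the variance exactly. The sketch below (an illustration with X uniform on (0, 1), so \sigma^2 = 1/12) confirms this numerically:

```python
# For phi(x) = x**2 the function h(x) in the sharpened bound is identically 1,
# so the Jensen gap E[phi(X)] - phi(E[X]) must equal sigma^2 exactly.
# Sketch with X ~ Uniform(0, 1): mu = 1/2, sigma^2 = 1/12.
import random

random.seed(0)
xs = [random.random() for _ in range(200_000)]

mean = sum(xs) / len(xs)
var = sum((x - mean) ** 2 for x in xs) / len(xs)
gap = sum(x * x for x in xs) / len(xs) - mean * mean  # E[X^2] - (E[X])^2

assert abs(var - gap) < 1e-6  # algebraically identical quantities
print(f"sigma^2 ~ {var:.5f}, Jensen gap ~ {gap:.5f}")  # both ~ 1/12 ~ 0.08333
```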


Proofs

Jensen's inequality can be proved in several ways, and three different proofs corresponding to the different statements above will be offered. Before embarking on these mathematical derivations, however, it is worth analyzing an intuitive graphical argument based on the probabilistic case where ''X'' is a real number (see figure). Assuming a hypothetical distribution of ''X'' values, one can immediately identify the position of \operatorname{E}[X] and its image \varphi(\operatorname{E}[X]) in the graph. Noticing that for convex mappings Y = \varphi(X) the corresponding distribution of ''Y'' values is increasingly "stretched out" for increasing values of ''X'', it is easy to see that the distribution of ''Y'' is broader in the interval corresponding to X > X_0 and narrower in X < X_0 for any X_0; in particular, this is also true for X_0 = \operatorname{E}[X]. Consequently, in this picture the expectation of ''Y'' will always shift upwards with respect to the position of \varphi(\operatorname{E}[X]). A similar reasoning holds if the distribution of ''X'' covers a decreasing portion of the convex function, or both a decreasing and an increasing portion of it. This "proves" the inequality, i.e.

:\varphi(\operatorname{E}[X]) \leq \operatorname{E}[\varphi(X)] = \operatorname{E}[Y],

with equality when \varphi(X) is not strictly convex, e.g. when it is a straight line, or when ''X'' follows a degenerate distribution (i.e. is a constant). The proofs below formalize this intuitive notion.


Proof 1 (finite form)

If \lambda_1 and \lambda_2 are two arbitrary nonnegative real numbers such that \lambda_1 + \lambda_2 = 1, then convexity of \varphi implies

:\forall x_1, x_2: \qquad \varphi\left(\lambda_1 x_1 + \lambda_2 x_2\right) \leq \lambda_1\,\varphi(x_1) + \lambda_2\,\varphi(x_2).

This can be generalized: if \lambda_1, \ldots, \lambda_n are nonnegative real numbers such that \lambda_1 + \cdots + \lambda_n = 1, then

:\varphi(\lambda_1 x_1 + \lambda_2 x_2 + \cdots + \lambda_n x_n) \leq \lambda_1\,\varphi(x_1) + \lambda_2\,\varphi(x_2) + \cdots + \lambda_n\,\varphi(x_n),

for any x_1, \ldots, x_n. The ''finite form'' of Jensen's inequality can be proved by induction: by the convexity hypothesis, the statement is true for ''n'' = 2. Suppose the statement is true for some ''n'', so

:\varphi\left(\sum_{i=1}^{n}\lambda_i x_i\right) \leq \sum_{i=1}^{n}\lambda_i \varphi\left(x_i\right)

for any \lambda_1, \ldots, \lambda_n such that \lambda_1 + \cdots + \lambda_n = 1. One needs to prove it for n+1. At least one of the \lambda_i is strictly smaller than 1, say \lambda_{n+1}; therefore by the convexity inequality:

:\begin{align} \varphi\left(\sum_{i=1}^{n+1}\lambda_i x_i\right) &= \varphi\left((1-\lambda_{n+1})\sum_{i=1}^{n} \frac{\lambda_i}{1-\lambda_{n+1}} x_i + \lambda_{n+1} x_{n+1} \right) \\ &\leq (1-\lambda_{n+1}) \varphi\left(\sum_{i=1}^{n} \frac{\lambda_i}{1-\lambda_{n+1}} x_i \right) + \lambda_{n+1}\,\varphi(x_{n+1}). \end{align}

Since \lambda_1 + \cdots + \lambda_n + \lambda_{n+1} = 1,

: \sum_{i=1}^{n} \frac{\lambda_i}{1-\lambda_{n+1}} = 1,

applying the induction hypothesis gives

: \varphi\left(\sum_{i=1}^{n}\frac{\lambda_i}{1-\lambda_{n+1}} x_i\right) \leq \sum_{i=1}^{n}\frac{\lambda_i}{1-\lambda_{n+1}} \varphi(x_i),

therefore

: \varphi\left(\sum_{i=1}^{n+1}\lambda_i x_i\right) \leq (1-\lambda_{n+1}) \sum_{i=1}^{n}\frac{\lambda_i}{1-\lambda_{n+1}} \varphi(x_i) + \lambda_{n+1}\,\varphi(x_{n+1}) = \sum_{i=1}^{n+1}\lambda_i \varphi(x_i).

We deduce that the inequality is true for n+1; by the principle of mathematical induction it follows that the result is also true for all integers ''n'' greater than 2.

In order to obtain the general inequality from this finite form, one needs to use a density argument. The finite form can be rewritten as:

:\varphi\left(\int x\,d\mu_n(x) \right) \leq \int \varphi(x)\,d\mu_n(x),

where \mu_n is a measure given by an arbitrary convex combination of Dirac deltas:

:\mu_n = \sum_{i=1}^n \lambda_i \delta_{x_i}.

Since convex functions are continuous, and since convex combinations of Dirac deltas are weakly dense in the set of probability measures (as could be easily verified), the general statement is obtained simply by a limiting procedure.


Proof 2 (measure-theoretic form)

Let g be a real-valued \mu-integrable function on a probability space \Omega, and let \varphi be a convex function on the real numbers. Since \varphi is convex, at each real number x we have a nonempty set of subderivatives, which may be thought of as lines touching the graph of \varphi at x, but which are at or below the graph of \varphi at all points (support lines of the graph).

Now, if we define

:x_0 := \int_\Omega g\, d\mu,

because of the existence of subderivatives for convex functions, we may choose a and b such that

:ax + b \leq \varphi(x),

for all real x and

:ax_0 + b = \varphi(x_0).

But then we have that

:\varphi \circ g(\omega) \geq ag(\omega) + b

for almost all \omega \in \Omega. Since we have a probability measure, the integral is monotone with \mu(\Omega) = 1 so that

:\int_\Omega \varphi \circ g\, d\mu \geq \int_\Omega (ag + b)\, d\mu = a\int_\Omega g\, d\mu + b\int_\Omega d\mu = ax_0 + b = \varphi(x_0) = \varphi\left(\int_\Omega g\, d\mu \right),

as desired.
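
The support-line argument can be mimicked numerically. The sketch below (hypothetical choices: \varphi(x) = e^x and g sampled uniformly on (-1, 1), for which the subderivative at x_0 is just the tangent slope e^{x_0}) builds the line ax + b and confirms it minorizes \varphi while matching it at x_0:

```python
# Support-line argument from Proof 2, made concrete for phi(x) = exp(x):
# the tangent a*x + b at x0 = E[g] lies below phi everywhere, so averaging
# a*g + b <= phi(g) against the probability measure reproduces the inequality.
import math
import random

random.seed(0)
g = [random.uniform(-1.0, 1.0) for _ in range(50_000)]  # samples of g

x0 = sum(g) / len(g)       # x0 = integral of g d(mu)
a = math.exp(x0)           # phi'(x0): slope of the support line
b = math.exp(x0) - a * x0  # intercept chosen so that a*x0 + b = phi(x0)

lhs = math.exp(x0)                            # phi(integral of g)
rhs = sum(math.exp(x) for x in g) / len(g)    # integral of phi(g)
support = sum(a * x + b for x in g) / len(g)  # integral of a*g + b

assert lhs <= rhs and abs(support - lhs) < 1e-8
print(f"phi(x0) = {lhs:.4f} <= E[phi(g)] = {rhs:.4f}")
```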


Proof 3 (general inequality in a probabilistic setting)

Let ''X'' be an integrable random variable that takes values in a real topological vector space ''T''. Since \varphi: T \to \mathbb{R} is convex, for any x, y \in T, the quantity

:\frac{\varphi(x+\theta\,y)-\varphi(x)}{\theta},

is decreasing as \theta approaches 0^+. In particular, the ''subdifferential'' of \varphi evaluated at x in the direction y is well-defined by

:(D\varphi)(x)\cdot y := \lim_{\theta \downarrow 0} \frac{\varphi(x+\theta\,y)-\varphi(x)}{\theta} = \inf_{\theta \neq 0} \frac{\varphi(x+\theta\,y)-\varphi(x)}{\theta}.

It is easily seen that the subdifferential is linear in y (that is false and the assertion requires the Hahn–Banach theorem to be proved) and, since the infimum taken in the right-hand side of the previous formula is smaller than the value of the same term for \theta = 1, one gets

:\varphi(x) \leq \varphi(x+y) - (D\varphi)(x)\cdot y.

In particular, for an arbitrary sub-σ-algebra \mathfrak{G} we can evaluate the last inequality when x = \operatorname{E}[X \mid \mathfrak{G}],\; y = X - \operatorname{E}[X \mid \mathfrak{G}] to obtain

:\varphi\left(\operatorname{E}[X \mid \mathfrak{G}]\right) \leq \varphi(X) - (D\varphi)\left(\operatorname{E}[X \mid \mathfrak{G}]\right)\cdot \left(X - \operatorname{E}[X \mid \mathfrak{G}]\right).

Now, if we take the expectation conditioned to \mathfrak{G} on both sides of the previous expression, we get the result since:

:\operatorname{E}\left[\left[(D\varphi)\left(\operatorname{E}[X \mid \mathfrak{G}]\right)\cdot \left(X - \operatorname{E}[X \mid \mathfrak{G}]\right)\right] \mid \mathfrak{G}\right] = (D\varphi)\left(\operatorname{E}[X \mid \mathfrak{G}]\right)\cdot \operatorname{E}\left[\left(X - \operatorname{E}[X \mid \mathfrak{G}]\right) \mid \mathfrak{G}\right] = 0,

by the linearity of the subdifferential in the ''y'' variable, and the following well-known property of the conditional expectation:

:\operatorname{E}\left[\left(\operatorname{E}[X \mid \mathfrak{G}]\right) \mid \mathfrak{G}\right] = \operatorname{E}[X \mid \mathfrak{G}].


Applications and special cases


Form involving a probability density function

Suppose \Omega is a measurable subset of the real line and ''f''(''x'') is a non-negative function such that

:\int_{-\infty}^\infty f(x)\,dx = 1.

In probabilistic language, ''f'' is a probability density function. Then Jensen's inequality becomes the following statement about convex integrals: If ''g'' is any real-valued measurable function and \varphi is convex over the range of ''g'', then

: \varphi\left(\int_{-\infty}^\infty g(x)f(x)\, dx\right) \le \int_{-\infty}^\infty \varphi(g(x)) f(x)\, dx.

If ''g''(''x'') = ''x'', then this form of the inequality reduces to a commonly used special case:

:\varphi\left(\int_{-\infty}^\infty x\, f(x)\, dx\right) \le \int_{-\infty}^\infty \varphi(x)\,f(x)\, dx.

This is applied in Variational Bayesian methods.
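
A quadrature sketch of the density form (illustrative choices: f the standard normal density, g(x) = x and \varphi(x) = x^2, so the gap between the two sides is the variance, 1) approximating both integrals by Riemann sums:

```python
# Density form of Jensen, checked by simple midpoint Riemann sums:
# phi( integral of g*f dx ) <= integral of phi(g)*f dx,
# here with f the standard normal pdf, g(x) = x and phi(x) = x**2.
import math

def f(x):  # standard normal probability density
    return math.exp(-x * x / 2.0) / math.sqrt(2.0 * math.pi)

def phi(y):
    return y * y  # convex

N, a, b = 200_000, -10.0, 10.0
dx = (b - a) / N
xs = [a + (i + 0.5) * dx for i in range(N)]

mean = sum(x * f(x) for x in xs) * dx            # integral of g*f ~ 0
mean_phi = sum(phi(x) * f(x) for x in xs) * dx   # integral of phi(g)*f ~ 1

assert phi(mean) <= mean_phi
print(f"phi(mean) ~ {phi(mean):.6f} <= {mean_phi:.6f}")
```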


Example: even moments of a random variable

If ''g''(''x'') = ''x''^{2n}, and ''X'' is a random variable, then ''g'' is convex as

: \frac{d^2 g}{dx^2}(x) = 2n(2n - 1)x^{2n-2} \geq 0 \quad \forall\ x \in \mathbb{R}

and so

: g(\operatorname{E}[X]) = (\operatorname{E}[X])^{2n} \leq \operatorname{E}[X^{2n}].

In particular, if some even moment 2n of ''X'' is finite, ''X'' has a finite mean. An extension of this argument shows ''X'' has finite moments of every order l\in\N dividing ''n''.
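
A quick sketch of the moment inequality (illustrative: n = 2, so g(x) = x^4, with exponentially distributed samples, for which E[X] = 1 and E[X^4] = 24):

```python
# Even-moment consequence of Jensen: (E[X])**(2n) <= E[X**(2n)].
# Sketch with n = 2 (g(x) = x**4) and X ~ Exponential(1).
import random

random.seed(0)
xs = [random.expovariate(1.0) for _ in range(100_000)]

mean = sum(xs) / len(xs)
fourth_moment = sum(x ** 4 for x in xs) / len(xs)  # E[X^4] = 24 for Exp(1)

assert mean ** 4 <= fourth_moment                  # ~1 <= ~24
print(f"(E[X])^4 ~ {mean**4:.3f} <= E[X^4] ~ {fourth_moment:.3f}")
```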


Alternative finite form

Let \Omega = \{x_1, x_2, \ldots, x_n\} and take \mu to be the counting measure on \Omega; then the general form reduces to a statement about sums:

: \varphi\left(\sum_{i=1}^{n} g(x_i)\lambda_i \right) \le \sum_{i=1}^{n} \varphi(g(x_i)) \lambda_i,

provided that \lambda_i \ge 0 and

:\lambda_1 + \cdots + \lambda_n = 1.

There is also an infinite discrete form.


Statistical physics

Jensen's inequality is of particular importance in statistical physics when the convex function is an exponential, giving:

: e^{\operatorname{E}[X]} \leq \operatorname{E}\left[e^X\right],

where the expected values are with respect to some probability distribution in the random variable ''X''.

Proof: Let \varphi(x) = e^x in

:\varphi\left(\operatorname{E}[X]\right) \leq \operatorname{E}\left[\varphi(X)\right].
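
A numeric sketch of the exponential special case (illustrative: X standard normal, for which the right-hand side is exactly e^{1/2}, the lognormal mean):

```python
# Statistical-physics form: exp(E[X]) <= E[exp(X)].
# With X ~ Normal(0, 1), the right side is exactly exp(1/2) ~ 1.6487,
# while the left side is exp(0) = 1.
import math
import random

random.seed(0)
xs = [random.gauss(0.0, 1.0) for _ in range(100_000)]

lhs = math.exp(sum(xs) / len(xs))             # exp(E[X]) ~ 1
rhs = sum(math.exp(x) for x in xs) / len(xs)  # E[exp(X)] ~ exp(0.5)

assert lhs <= rhs
print(f"exp(E[X]) ~ {lhs:.4f} <= E[exp(X)] ~ {rhs:.4f}")
```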


Information theory

If ''p''(''x'') is the true probability density for ''X'', and ''q''(''x'') is another density, then applying Jensen's inequality for the random variable ''Y''(''X'') = ''q''(''X'')/''p''(''X'') and the convex function \varphi(y) = -\log(y) gives

:\operatorname{E}[\varphi(Y)] \ge \varphi(\operatorname{E}[Y]).

Therefore:

:-D(p(x)\|q(x)) = \int p(x) \log\left(\frac{q(x)}{p(x)}\right)\,dx \le \log\left(\int p(x) \frac{q(x)}{p(x)}\,dx\right) = \log\left(\int q(x)\,dx\right) = 0,

a result called Gibbs' inequality. It shows that the average message length is minimised when codes are assigned on the basis of the true probabilities ''p'' rather than any other distribution ''q''. The quantity that is non-negative is called the Kullback–Leibler divergence of ''q'' from ''p''. Since -\log(x) is a strictly convex function for x > 0, it follows that equality holds when ''p''(''x'') equals ''q''(''x'') almost everywhere.
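
A discrete analogue (with two arbitrary illustrative probability vectors p and q) showing the non-negativity that Gibbs' inequality asserts:

```python
# Gibbs' inequality in the discrete case: D(p || q) >= 0, with equality
# iff p == q.  The vectors p and q below are arbitrary illustrative choices.
import math

p = [0.5, 0.3, 0.2]
q = [0.25, 0.25, 0.5]

kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))  # D(p || q)
assert kl >= 0.0
print(f"D(p || q) = {kl:.4f} >= 0")
```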


Rao–Blackwell theorem

If ''L'' is a convex function and \mathfrak{G} a sub-sigma-algebra, then, from the conditional version of Jensen's inequality, we get

:L(\operatorname{E}[\delta(X) \mid \mathfrak{G}]) \le \operatorname{E}[L(\delta(X)) \mid \mathfrak{G}] \quad \Longrightarrow \quad \operatorname{E}[L(\operatorname{E}[\delta(X) \mid \mathfrak{G}])] \le \operatorname{E}[L(\delta(X))].

So if δ(''X'') is some estimator of an unobserved parameter θ given a vector of observables ''X''; and if ''T''(''X'') is a sufficient statistic for θ; then an improved estimator, in the sense of having a smaller expected loss ''L'', can be obtained by calculating

:\delta_1(X) = \operatorname{E}_{\theta}[\delta(X') \mid T(X') = T(X)],

the expected value of δ with respect to θ, taken over all possible vectors of observations ''X'' compatible with the same value of ''T''(''X'') as that observed. Further, because ''T'' is a sufficient statistic, \delta_1(X) does not depend on θ and hence becomes a statistic. This result is known as the Rao–Blackwell theorem.
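
A toy simulation of the theorem (illustrative setup: estimating a Bernoulli parameter, where conditioning the crude single-observation estimator on the sufficient statistic, the sample sum, yields the sample mean) showing the reduction in squared-error loss:

```python
# Rao-Blackwell toy example: X_1, ..., X_n iid Bernoulli(theta).
# Crude estimator delta = X_1; sufficient statistic T = sum(X_i);
# E[delta | T] = T/n (the sample mean), which has smaller squared-error loss.
import random

random.seed(0)
theta, n, trials = 0.3, 10, 20_000
mse_crude = mse_rb = 0.0

for _ in range(trials):
    xs = [1 if random.random() < theta else 0 for _ in range(n)]
    crude = xs[0]     # unbiased but noisy
    rb = sum(xs) / n  # E[crude | T]: the Rao-Blackwellized version
    mse_crude += (crude - theta) ** 2
    mse_rb += (rb - theta) ** 2

# Expected: theta*(1-theta) = 0.21 versus theta*(1-theta)/n = 0.021.
print(f"MSE(X1) ~ {mse_crude/trials:.4f}  vs  MSE(mean) ~ {mse_rb/trials:.4f}")
```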


Financial Performance Simulation

A popular method of measuring the performance of an investment is the internal rate of return (IRR), which is the rate by which a series of uncertain future cash flows is discounted using present value theory so that the sum of the discounted future cash flows equals the initial investment. While it is tempting to perform a Monte Carlo simulation of the IRR, Jensen's inequality introduces a bias, due to the fact that the IRR function is a curved function of the cash flows while the expectation operator is a linear function.
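
A sketch of this bias (hypothetical cash flows and a simple home-rolled bisection IRR solver; not a production implementation) comparing the IRR of the expected cash flows with the expected IRR across simulated scenarios:

```python
# Jensen-style bias in Monte Carlo IRR: because IRR is a nonlinear function
# of the cash flows, mean(IRR(scenarios)) != IRR(mean(scenarios)).
# Hypothetical deal: invest 100 now, receive two uncertain future cash flows.
import random

def irr(cashflows, lo=-0.9, hi=10.0, iters=100):
    """Bisection root of NPV(r) = sum(cf / (1+r)**t); assumes a sign change."""
    def npv(r):
        return sum(cf / (1.0 + r) ** t for t, cf in enumerate(cashflows))
    for _ in range(iters):
        mid = (lo + hi) / 2.0
        if npv(mid) > 0:  # NPV is decreasing in r for this cash-flow pattern
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

random.seed(0)
scenarios = [[-100.0, random.uniform(20, 100), random.uniform(20, 100)]
             for _ in range(5_000)]

mean_irr = sum(irr(cf) for cf in scenarios) / len(scenarios)
mean_cf = [sum(col) / len(scenarios) for col in zip(*scenarios)]
irr_of_mean = irr(mean_cf)

print(f"E[IRR] ~ {mean_irr:.4f}  vs  IRR(E[cash flows]) ~ {irr_of_mean:.4f}")
```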


See also

* Karamata's inequality for a more general inequality
* Popoviciu's inequality
* Law of averages
* A proof without words of Jensen's inequality


Notes


References

* Tristan Needham (1993) "A Visual Explanation of Jensen's Inequality", ''American Mathematical Monthly'' 100(8):768–771.
* Sam Savage (2012) ''The Flaw of Averages: Why We Underestimate Risk in the Face of Uncertainty'' (1st ed.) Wiley. ISBN 978-0471381976


External links


* Jensen's Operator Inequality of Hansen and Pedersen.