Jensen's inequality

In mathematics, Jensen's inequality, named after the Danish mathematician Johan Jensen, relates the value of a convex function of an integral to the integral of the convex function. It was proved by Jensen in 1906, building on an earlier proof of the same inequality for twice-differentiable functions by Otto Hölder in 1889. Given its generality, the inequality appears in many forms depending on the context, some of which are presented below. In its simplest form the inequality states that the convex transformation of a mean is less than or equal to the mean applied after convex transformation; it is a simple corollary that the opposite is true of concave transformations.

Jensen's inequality generalizes the statement that the secant line of a convex function lies ''above'' the graph of the function, which is Jensen's inequality for two points: the secant line consists of weighted means of the convex function (for ''t'' ∈ [0,1]),
:t f(x_1) + (1-t) f(x_2),
while the graph of the function is the convex function of the weighted means,
:f(t x_1 + (1-t) x_2).
Thus, Jensen's inequality is
:f(t x_1 + (1-t) x_2) \leq t f(x_1) + (1-t) f(x_2).

In the context of probability theory, it is generally stated in the following form: if ''X'' is a random variable and \varphi is a convex function, then
:\varphi(\operatorname{E}[X]) \leq \operatorname{E}\left[\varphi(X)\right].
The difference between the two sides of the inequality, \operatorname{E}\left[\varphi(X)\right] - \varphi\left(\operatorname{E}[X]\right), is called the Jensen gap.
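The two-point statement and the Jensen gap can be checked numerically. The following is a minimal sketch for a discrete distribution; the helper name `jensen_gap` and the sample points and weights are illustrative assumptions, not part of the original text.

```python
import math

def jensen_gap(phi, xs, weights):
    """Return E[phi(X)] - phi(E[X]) for a discrete distribution (illustrative helper)."""
    total = sum(weights)
    probs = [w / total for w in weights]          # normalise weights to probabilities
    mean = sum(p * x for p, x in zip(probs, xs))
    mean_of_phi = sum(p * phi(x) for p, x in zip(probs, xs))
    return mean_of_phi - phi(mean)

# Two-point case: secant value minus graph value at t*x1 + (1-t)*x2.
gap_exp = jensen_gap(math.exp, [0.0, 2.0], [0.3, 0.7])   # convex: gap >= 0
gap_log = jensen_gap(math.log, [1.0, 4.0], [0.5, 0.5])   # concave: gap <= 0
```

For an affine function the gap vanishes, which matches the equality case discussed in the statements that follow.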


Statements

The classical form of Jensen's inequality involves several numbers and weights. The inequality can be stated quite generally using either the language of measure theory or (equivalently) probability. In the probabilistic setting, the inequality can be further generalized to its ''full strength''.


Finite form

For a real convex function \varphi, numbers x_1, x_2, \ldots, x_n in its domain, and positive weights a_i, Jensen's inequality can be stated as:
:\varphi\left(\frac{\sum a_i x_i}{\sum a_i}\right) \le \frac{\sum a_i \varphi(x_i)}{\sum a_i} \qquad (1)
and the inequality is reversed if \varphi is concave, which is
:\varphi\left(\frac{\sum a_i x_i}{\sum a_i}\right) \geq \frac{\sum a_i \varphi(x_i)}{\sum a_i}. \qquad (2)
Equality holds if and only if x_1=x_2=\cdots =x_n or \varphi is linear on a domain containing x_1,x_2,\cdots ,x_n.

As a particular case, if the weights a_i are all equal, then (1) and (2) become
:\varphi\left(\frac{\sum x_i}{n}\right) \le \frac{\sum \varphi(x_i)}{n} \qquad (3)
:\varphi\left(\frac{\sum x_i}{n}\right) \geq \frac{\sum \varphi(x_i)}{n} \qquad (4)
For instance, the function \log(x) is ''concave'', so substituting \varphi(x) = \log(x) in the previous formula (4) establishes the (logarithm of the) familiar arithmetic-mean/geometric-mean inequality:
:\log\!\left( \frac{\sum_{i=1}^n x_i}{n}\right) \geq \frac{\sum_{i=1}^n \log\!\left( x_i \right)}{n} \quad \text{or} \quad \frac{x_1 + x_2 + \cdots + x_n}{n} \geq \sqrt[n]{x_1 \cdot x_2 \cdots x_n}

A common application has ''x'' as a function of another variable (or set of variables) ''t'', that is, x_i = g(t_i). All of this carries directly over to the general continuous case: the weights a_i are replaced by a non-negative integrable function f(x), such as a probability distribution, and the summations are replaced by integrals.
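As a quick illustration of the equal-weight concave case with \varphi = \log, the sketch below compares the arithmetic and geometric means of a small sample. The function names and the sample values are assumptions chosen for the example.

```python
import math

def arithmetic_mean(xs):
    return sum(xs) / len(xs)

def geometric_mean(xs):
    # exp of the mean logarithm; requires strictly positive inputs
    return math.exp(sum(math.log(x) for x in xs) / len(xs))

samples = [1.0, 2.0, 4.0, 8.0]   # hypothetical positive numbers
am = arithmetic_mean(samples)    # (1 + 2 + 4 + 8) / 4
gm = geometric_mean(samples)     # fourth root of 1 * 2 * 4 * 8
```

When all inputs coincide the two means agree, which is the equality condition x_1 = x_2 = \cdots = x_n.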


Measure-theoretic and probabilistic form

Let (\Omega, A, \mu) be a probability space. Let f : \Omega \to \mathbb{R} be a \mu-measurable function and \varphi : \mathbb{R} \to \mathbb{R} be convex. Then:
:\varphi\left(\int_\Omega f \,\mathrm{d}\mu\right) \leq \int_\Omega \varphi \circ f \,\mathrm{d}\mu.

In real analysis, we may require an estimate on
:\varphi\left(\int_a^b f(x)\, dx\right)
where a, b \in \mathbb{R}, and f\colon [a, b] \to \R is a non-negative Lebesgue-integrable function. In this case, the Lebesgue measure of [a, b] need not be unity. However, by integration by substitution, the interval can be rescaled so that it has measure unity. Then Jensen's inequality can be applied to get
:\varphi\left(\frac{1}{b-a}\int_a^b f(x)\, dx\right) \le \frac{1}{b-a} \int_a^b \varphi(f(x)) \,dx.

The same result can be equivalently stated in a probability theory setting, by a simple change of notation. Let (\Omega, \mathfrak{F},\operatorname{P}) be a probability space, ''X'' an integrable real-valued random variable and \varphi a convex function. Then:
:\varphi\left(\operatorname{E}[X]\right) \leq \operatorname{E}\left[\varphi(X)\right].
In this probability setting, the measure \mu is intended as a probability \operatorname{P}, the integral with respect to \mu as an expected value \operatorname{E}, and the function f as a random variable ''X''.

Note that the equality holds if and only if \varphi is a linear function on some convex set A such that \operatorname{P}(X \in A) = 1 (which follows by inspecting the measure-theoretical proof below).
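The rescaled interval form can be illustrated numerically. The sketch below uses a simple midpoint rule; the integrator, the interval [0, 3], and the choices f(x) = x^2 and \varphi = \exp are assumptions made only for this example.

```python
import math

def midpoint_integral(func, a, b, n=10_000):
    """Approximate the integral of func over [a, b] with the midpoint rule."""
    h = (b - a) / n
    return h * sum(func(a + (k + 0.5) * h) for k in range(n))

a, b = 0.0, 3.0
f = lambda x: x * x      # non-negative integrand on [a, b]
phi = math.exp           # convex function

# Both sides are divided by (b - a) so that the measure has total mass one.
lhs = phi(midpoint_integral(f, a, b) / (b - a))
rhs = midpoint_integral(lambda x: phi(f(x)), a, b) / (b - a)
```

Because the midpoint rule is itself an equally weighted average of function values, the discrete approximation satisfies the finite form of the inequality exactly, so `lhs <= rhs` is not an artefact of rounding.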


General inequality in a probabilistic setting

More generally, let ''T'' be a real topological vector space, and ''X'' a ''T''-valued integrable random variable. In this general setting, ''integrable'' means that there exists an element \operatorname{E}[X] in ''T'', such that for any element ''z'' in the dual space of ''T'': \operatorname{E}|\langle z, X \rangle| < \infty, and \langle z, \operatorname{E}[X]\rangle = \operatorname{E}[\langle z, X \rangle]. Then, for any measurable convex function \varphi and any sub-σ-algebra \mathfrak{G} of \mathfrak{F}:
:\varphi\left(\operatorname{E}\left[X\mid\mathfrak{G}\right]\right) \leq \operatorname{E}\left[\varphi(X)\mid\mathfrak{G}\right].
Here \operatorname{E}[\cdot\mid\mathfrak{G}] stands for the expectation conditioned to the σ-algebra \mathfrak{G}. This general statement reduces to the previous ones when the topological vector space ''T'' is the real axis, and \mathfrak{G} is the trivial σ-algebra \{\varnothing, \Omega\} (where \varnothing is the empty set, and \Omega is the sample space).
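For a real-valued random variable on a finite sample space, the conditional form can be checked directly. The sketch below assumes a uniform probability on six outcomes and a sub-σ-algebra generated by a partition; the outcome values, the partition, and the helper name `conditional_expectation` are all hypothetical.

```python
# Finite sample space with uniform probability; the sub-sigma-algebra G
# is generated by a partition of the outcome indices into blocks.
outcomes = [-2.0, -1.0, 0.5, 3.0, 4.0, 7.0]   # values X(omega)
partition = [[0, 1], [2, 3, 4], [5]]          # blocks generating G
phi = lambda x: x * x                          # convex function

def conditional_expectation(values, blocks):
    """E[values | G]: replace each value by the average over its block."""
    result = [0.0] * len(values)
    for block in blocks:
        avg = sum(values[i] for i in block) / len(block)
        for i in block:
            result[i] = avg
    return result

# phi(E[X | G]) and E[phi(X) | G], evaluated outcome by outcome.
lhs = [phi(v) for v in conditional_expectation(outcomes, partition)]
rhs = conditional_expectation([phi(v) for v in outcomes], partition)
```

Using the single block `[[0, 1, 2, 3, 4, 5]]` corresponds to the trivial σ-algebra and recovers the unconditional inequality.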


A sharpened and generalized form

Let ''X'' be a one-dimensional random variable with mean \mu and variance \sigma^2\ge 0. Let \varphi(x) be a twice differentiable function, and define the function
:h(x)\triangleq\frac{\varphi\left(x\right)-\varphi\left(\mu\right)}{\left(x-\mu\right)^2}-\frac{\varphi'\left(\mu\right)}{x-\mu}.
Then
:\sigma^2\inf \frac{\varphi''(x)}{2} \le \sigma^2\inf h(x) \leq E\left[\varphi\left(X\right)\right]-\varphi\left(E[X]\right)\le \sigma^2\sup h(x) \le \sigma^2\sup \frac{\varphi''(x)}{2}.
In particular, when \varphi(x) is convex, then \varphi''(x)\ge 0, and the standard form of Jensen's inequality immediately follows for the case where \varphi(x) is additionally assumed to be twice differentiable.
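The chain of bounds can be explored numerically. The sketch below assumes that ''X'' is uniform on the three points 0, 0.5 and 1, that \varphi = \exp, and that the infimum and supremum are taken over the interval [0, 1] containing the support; these choices, and the grid used to approximate the extrema, are assumptions for illustration only.

```python
import math

xs = [0.0, 0.5, 1.0]                 # support of X, uniform weights (assumed)
mu = sum(xs) / len(xs)
sigma2 = sum((x - mu) ** 2 for x in xs) / len(xs)
phi = math.exp                       # phi' and phi'' are also exp

def h(x):
    d = x - mu
    return (phi(x) - phi(mu)) / d ** 2 - phi(mu) / d   # phi'(mu) = exp(mu)

# Grid over [0, 1], skipping the point x = mu where h is not defined.
grid = [k / 1000 for k in range(1001) if abs(k / 1000 - mu) > 1e-9]
h_values = [h(x) for x in grid]

gap = sum(phi(x) for x in xs) / len(xs) - phi(mu)          # E[phi(X)] - phi(E[X])
lower_h, upper_h = sigma2 * min(h_values), sigma2 * max(h_values)
lower_d2, upper_d2 = sigma2 * phi(0.0) / 2, sigma2 * phi(1.0) / 2  # phi''/2 bounds
```

In this example the bounds built from h are tighter than those built from \varphi''/2, as the displayed chain of inequalities suggests.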


Proofs

Jensen's inequality can be proved in several ways, and three different proofs corresponding to the different statements above will be offered. Before embarking on these mathematical derivations, however, it is worth analyzing an intuitive graphical argument based on the probabilistic case where ''X'' is a real number (see figure). Assuming a hypothetical distribution of ''X'' values, one can immediately identify the position of \operatorname{E}[X] and its image \varphi(\operatorname{E}[X]) in the graph. Noticing that for convex mappings ''Y'' = \varphi(''X'') the corresponding distribution of ''Y'' values is increasingly "stretched out" for increasing values of ''X'', it is easy to see that the distribution of ''Y'' is broader in the interval corresponding to ''X'' > ''X''0 and narrower in ''X'' < ''X''0 for any ''X''0; in particular, this is also true for X_0 = \operatorname{E}[X]. Consequently, in this picture the expectation of ''Y'' will always shift upwards with respect to the position of \varphi(\operatorname{E}[X]). A similar reasoning holds if the distribution of ''X'' covers a decreasing portion of the convex function, or both a decreasing and an increasing portion of it. This "proves" the inequality, i.e.
:\varphi(\operatorname{E}[X]) \leq \operatorname{E}[\varphi(X)] = \operatorname{E}[Y],
with equality when \varphi is not strictly convex, e.g. when it is a straight line, or when ''X'' follows a degenerate distribution (i.e. is a constant).

The proofs below formalize this intuitive notion.


Proof 1 (finite form)

If \lambda_1 and \lambda_2 are two arbitrary nonnegative real numbers such that \lambda_1 + \lambda_2 = 1, then convexity of \varphi implies
:\forall x_1, x_2: \qquad \varphi \left (\lambda_1 x_1+\lambda_2 x_2 \right )\leq \lambda_1\,\varphi(x_1)+\lambda_2\,\varphi(x_2).
This can be generalized: if \lambda_1, \ldots, \lambda_n are nonnegative real numbers such that \lambda_1 + \cdots + \lambda_n = 1, then
:\varphi(\lambda_1 x_1+\lambda_2 x_2+\cdots+\lambda_n x_n)\leq \lambda_1\,\varphi(x_1)+\lambda_2\,\varphi(x_2)+\cdots+\lambda_n\,\varphi(x_n),
for any x_1, \ldots, x_n.

The ''finite form'' of Jensen's inequality can be proved by induction: by the convexity hypothesis, the statement is true for ''n'' = 2. Suppose the statement is true for some ''n'', so
:\varphi\left(\sum_{i=1}^{n}\lambda_i x_i\right) \leq \sum_{i=1}^{n}\lambda_i \varphi\left(x_i\right)
for any \lambda_1, \ldots, \lambda_n such that \lambda_1 + \cdots + \lambda_n = 1.

One needs to prove it for ''n'' + 1. At least one of the \lambda_i is strictly smaller than 1, say \lambda_{n+1}; therefore by the convexity inequality:
:\begin{align} \varphi\left(\sum_{i=1}^{n+1}\lambda_i x_i\right) &= \varphi\left((1-\lambda_{n+1})\sum_{i=1}^{n} \frac{\lambda_i}{1-\lambda_{n+1}} x_i + \lambda_{n+1} x_{n+1} \right) \\ &\leq (1-\lambda_{n+1}) \varphi\left(\sum_{i=1}^{n} \frac{\lambda_i}{1-\lambda_{n+1}} x_i \right)+\lambda_{n+1}\,\varphi(x_{n+1}). \end{align}
Since \lambda_1 + \cdots + \lambda_n + \lambda_{n+1} = 1,
:\sum_{i=1}^{n} \frac{\lambda_i}{1-\lambda_{n+1}} = 1,
applying the induction hypothesis gives
:\varphi\left(\sum_{i=1}^{n}\frac{\lambda_i}{1-\lambda_{n+1}} x_i\right) \leq \sum_{i=1}^{n}\frac{\lambda_i}{1-\lambda_{n+1}} \varphi(x_i),
therefore
:\begin{align} \varphi\left(\sum_{i=1}^{n+1}\lambda_i x_i\right) &\leq (1-\lambda_{n+1}) \sum_{i=1}^{n}\frac{\lambda_i}{1-\lambda_{n+1}} \varphi(x_i)+\lambda_{n+1}\,\varphi(x_{n+1}) =\sum_{i=1}^{n+1}\lambda_i \varphi(x_i). \end{align}
We deduce that the inequality is true for ''n'' + 1; by the principle of mathematical induction it follows that the result is true for every integer ''n'' greater than or equal to 2.

In order to obtain the general inequality from this finite form, one needs to use a density argument. The finite form can be rewritten as:
:\varphi\left(\int x\,d\mu_n(x) \right)\leq \int \varphi(x)\,d\mu_n(x),
where ''μ''''n'' is a measure given by an arbitrary convex combination of Dirac deltas:
:\mu_n= \sum_{i=1}^n \lambda_i \delta_{x_i}.
Since convex functions are continuous, and since convex combinations of Dirac deltas are weakly dense in the set of probability measures (as could be easily verified), the general statement is obtained simply by a limiting procedure.
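The induction step can be traced on concrete numbers. The sketch below computes the three quantities that appear in the step from ''n'' to ''n'' + 1: the left-hand side, the intermediate bound obtained from two-point convexity, and the final weighted sum. The function name `induction_chain` and the sample weights and points are assumptions for illustration.

```python
def induction_chain(phi, lambdas, xs):
    """Return (lhs, middle, rhs) of the induction step for n + 1 points."""
    lam_last, x_last = lambdas[-1], xs[-1]
    # Rescaled weights lambda_i / (1 - lambda_{n+1}); they sum to one.
    rescaled = [lam / (1 - lam_last) for lam in lambdas[:-1]]
    inner_mean = sum(r * x for r, x in zip(rescaled, xs[:-1]))
    lhs = phi(sum(lam * x for lam, x in zip(lambdas, xs)))
    middle = (1 - lam_last) * phi(inner_mean) + lam_last * phi(x_last)
    rhs = sum(lam * phi(x) for lam, x in zip(lambdas, xs))
    return lhs, middle, rhs

# Hypothetical example: phi(x) = x**2 with three weighted points.
lhs, middle, rhs = induction_chain(lambda x: x * x, [0.2, 0.3, 0.5], [1.0, 2.0, 4.0])
```

The first inequality `lhs <= middle` is the two-point convexity step, and `middle <= rhs` is where the induction hypothesis is applied to the rescaled weights.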


Proof 2 (measure-theoretic form)

Let g be a real-valued \mu-integrable function on a probability space \Omega, and let \varphi be a convex function on the real numbers. Since \varphi is convex, at each real number x we have a nonempty set of subderivatives, which may be thought of as lines touching the graph of \varphi at x, but which are at or below the graph of \varphi at all points (support lines of the graph).

Now, if we define
:x_0:=\int_\Omega g\, d\mu,
because of the existence of subderivatives for convex functions, we may choose a and b such that
:ax + b \leq \varphi(x),
for all real x and
:ax_0+ b = \varphi(x_0).
But then we have that
:\varphi \circ g (\omega) \geq ag(\omega)+ b
for almost all \omega \in \Omega. Since we have a probability measure, the integral is monotone with \mu(\Omega) = 1 so that
:\int_\Omega \varphi \circ g\, d\mu \geq \int_\Omega (ag + b)\, d\mu = a\int_\Omega g\, d\mu + b\int_\Omega d\mu = ax_0 + b = \varphi (x_0) = \varphi \left (\int_\Omega g\, d\mu \right ),
as desired.
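The support-line argument can be followed on a finite probability space. The sketch below assumes a four-point \Omega and the differentiable convex function \varphi = \exp, so that the derivative at x_0 can serve as the slope a of a support line; for a non-differentiable \varphi any subderivative would play the same role. All numerical values are hypothetical.

```python
import math

g_values = [-1.0, 0.0, 0.5, 2.5]     # values of g on a four-point Omega
mu = [0.1, 0.4, 0.3, 0.2]            # probability of each point (sums to one)
phi, dphi = math.exp, math.exp       # convex phi and its derivative

x0 = sum(p * g for p, g in zip(mu, g_values))   # integral of g with respect to mu
a = dphi(x0)                                    # slope of a support line at x0
b = phi(x0) - a * x0                            # chosen so that a*x0 + b == phi(x0)

# The line lies at or below the graph of phi at every value taken by g.
support_ok = all(a * g + b <= phi(g) + 1e-12 for g in g_values)
integral_phi_g = sum(p * phi(g) for p, g in zip(mu, g_values))
integral_line = sum(p * (a * g + b) for p, g in zip(mu, g_values))
```

Integrating the line reproduces \varphi(x_0), and monotonicity of the integral then gives the inequality.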


Proof 3 (general inequality in a probabilistic setting)

Let ''X'' be an integrable random variable that takes values in a real topological vector space ''T''. Since \varphi: T \to \R is convex, for any x,y \in T, the quantity
:\frac{\varphi(x+\theta\,y)-\varphi(x)}{\theta},
is decreasing as ''θ'' approaches 0+. In particular, the ''subdifferential'' of \varphi evaluated at ''x'' in the direction ''y'' is well-defined by
:(D\varphi)(x)\cdot y:=\lim_{\theta \downarrow 0} \frac{\varphi(x+\theta\,y)-\varphi(x)}{\theta}=\inf_{\theta > 0} \frac{\varphi(x+\theta\,y)-\varphi(x)}{\theta}.
The argument treats the subdifferential as linear in ''y''; this is not true in general, and a rigorous proof of the assertion requires the Hahn–Banach theorem. Since the infimum taken in the right-hand side of the previous formula is smaller than the value of the same term for ''θ'' = 1, one gets
:\varphi(x)\leq \varphi(x+y)-(D\varphi)(x)\cdot y.
In particular, for an arbitrary sub-σ-algebra \mathfrak{G} we can evaluate the last inequality when x = \operatorname{E}[X\mid\mathfrak{G}],\,y=X-\operatorname{E}[X\mid\mathfrak{G}] to obtain
:\varphi(\operatorname{E}[X\mid\mathfrak{G}]) \leq \varphi(X)-(D\varphi)(\operatorname{E}[X\mid\mathfrak{G}])\cdot (X-\operatorname{E}[X\mid\mathfrak{G}]).
Now, if we take the expectation conditioned to \mathfrak{G} on both sides of the previous expression, we get the result since:
:\operatorname{E} \left [\left[(D\varphi)(\operatorname{E}[X\mid\mathfrak{G}])\cdot (X-\operatorname{E}[X\mid\mathfrak{G}])\right]\mid\mathfrak{G} \right] = (D\varphi)(\operatorname{E}[X\mid\mathfrak{G}])\cdot \operatorname{E}[\left( X-\operatorname{E}[X\mid\mathfrak{G}] \right) \mid \mathfrak{G}]=0,
by the linearity of the subdifferential in the ''y'' variable, and the following well-known property of the conditional expectation:
:\operatorname{E} \left [ \left(\operatorname{E}[X\mid\mathfrak{G}] \right) \mid\mathfrak{G} \right ] = \operatorname{E}[ X \mid\mathfrak{G}].


Applications and special cases


Form involving a probability density function

Suppose \Omega is a measurable subset of the real line and ''f''(''x'') is a non-negative function such that
:\int_{-\infty}^\infty f(x)\,dx = 1.
In probabilistic language, ''f'' is a probability density function.

Then Jensen's inequality becomes the following statement about convex integrals: If ''g'' is any real-valued measurable function and \varphi is convex over the range of ''g'', then
:\varphi\left(\int_{-\infty}^\infty g(x)f(x)\, dx\right) \le \int_{-\infty}^\infty \varphi(g(x)) f(x)\, dx.
If ''g''(''x'') = ''x'', then this form of the inequality reduces to a commonly used special case:
:\varphi\left(\int_{-\infty}^\infty x\, f(x)\, dx\right) \le \int_{-\infty}^\infty \varphi(x)\,f(x)\, dx.
This is applied in Variational Bayesian methods.


Example: even moments of a random variable

If ''g''(''x'') = ''x''2''n'', and ''X'' is a random variable, then ''g'' is convex as
:\frac{d^2 g}{dx^2}(x) = 2n(2n - 1)x^{2n - 2} \geq 0\quad \forall\ x \in \R
and so
:g(\operatorname{E}[X]) = (\operatorname{E}[X])^{2n} \leq\operatorname{E}[X^{2n}].
In particular, if some even moment 2''n'' of ''X'' is finite, ''X'' has a finite mean. An extension of this argument shows ''X'' has finite moments of every order l\in\N dividing ''n''.


Alternative finite form

Let \Omega = \{x_1, \ldots, x_n\}, and take \mu to be the counting measure on \Omega, then the general form reduces to a statement about sums:
:\varphi\left(\sum_{i=1}^{n} g(x_i)\lambda_i \right) \le \sum_{i=1}^{n} \varphi(g(x_i)) \lambda_i,
provided that \lambda_i \geq 0 and
:\lambda_1 + \cdots + \lambda_n = 1.
There is also an infinite discrete form.


Statistical physics

Jensen's inequality is of particular importance in statistical physics when the convex function is an exponential, giving:
:e^{\operatorname{E}[X]} \leq \operatorname{E} \left [ e^X \right ],
where the expected values are with respect to some probability distribution in the random variable ''X''.

Proof: Let \varphi(x) = e^x in
:\varphi\left(\operatorname{E}[X]\right) \leq \operatorname{E} \left[ \varphi(X) \right].


Information theory

If ''p''(''x'') is the true probability density for ''X'', and ''q''(''x'') is another density, then applying Jensen's inequality for the random variable ''Y''(''X'') = ''q''(''X'')/''p''(''X'') and the convex function \varphi(y) = -\log(y) gives
:\operatorname{E}[\varphi(Y)] \ge \varphi(\operatorname{E}[Y]).
Therefore:
:-D(p(x)\,\|\,q(x))=\int p(x) \log \left (\frac{q(x)}{p(x)} \right ) \, dx \le \log \left ( \int p(x) \frac{q(x)}{p(x)}\,dx \right ) = \log \left (\int q(x)\,dx \right ) =0,
a result called Gibbs' inequality.

It shows that the average message length is minimised when codes are assigned on the basis of the true probabilities ''p'' rather than any other distribution ''q''. The quantity that is non-negative is called the Kullback–Leibler divergence of ''q'' from ''p''.

Since -\log(x) is a strictly convex function for x > 0, it follows that equality holds when ''p''(''x'') equals ''q''(''x'') almost everywhere.
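Gibbs' inequality has a direct discrete analogue that is easy to check. The sketch below replaces the densities by probability vectors on three outcomes; the function name `kl_divergence` and the two distributions are assumptions for illustration, and the function presumes q_i > 0 wherever p_i > 0.

```python
import math

def kl_divergence(p, q):
    """D(p || q) for discrete distributions; terms with p_i == 0 contribute 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.25, 0.25]        # "true" distribution (hypothetical)
q = [0.2, 0.5, 0.3]          # competing distribution (hypothetical)
d_pq = kl_divergence(p, q)   # non-negative by Gibbs' inequality
d_pp = kl_divergence(p, p)   # equality case: p and q coincide
```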


Rao–Blackwell theorem

If ''L'' is a convex function and \mathfrak{G} a sub-sigma-algebra, then, from the conditional version of Jensen's inequality, we get
:L(\operatorname{E}[\delta(X) \mid \mathfrak{G}]) \le \operatorname{E}[L(\delta(X)) \mid \mathfrak{G}] \quad \Longrightarrow \quad \operatorname{E}[L(\operatorname{E}[\delta(X) \mid \mathfrak{G}])] \le \operatorname{E}[L(\delta(X))].
So if δ(''X'') is some estimator of an unobserved parameter θ given a vector of observables ''X''; and if ''T''(''X'') is a sufficient statistic for θ; then an improved estimator, in the sense of having an expected loss ''L'' that is no larger, can be obtained by calculating
:\delta_1 (X) = \operatorname{E}_{\theta}[\delta(X') \mid T(X')= T(X)],
the expected value of δ with respect to θ, taken over all possible vectors of observations ''X'' compatible with the same value of ''T''(''X'') as that observed. Further, because ''T'' is a sufficient statistic, \delta_1 (X) does not depend on θ, and hence is a statistic.

This result is known as the Rao–Blackwell theorem.
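A small exact computation illustrates the improvement. The sketch below assumes ''n'' independent Bernoulli(θ) observations, squared-error loss, the crude estimator δ(''X'') = ''X''1, and the sufficient statistic ''T''(''X'') = ''X''1 + ... + ''X''n, for which the conditional expectation of ''X''1 given ''T'' is ''T''/''n'' by symmetry. The model, the estimator, and the function name are assumptions chosen for the example.

```python
from itertools import product

def expected_squared_errors(theta, n):
    """Exact risks of delta(X) = X_1 and delta_1(X) = T(X)/n for Bernoulli data."""
    risk_delta = risk_delta1 = 0.0
    for x in product([0, 1], repeat=n):
        t = sum(x)                                    # sufficient statistic T(X)
        prob = theta ** t * (1 - theta) ** (n - t)    # probability of this sample
        risk_delta += prob * (x[0] - theta) ** 2      # loss of the crude estimator
        risk_delta1 += prob * (t / n - theta) ** 2    # E[X_1 | T = t] = t / n
    return risk_delta, risk_delta1

risk_crude, risk_improved = expected_squared_errors(theta=0.3, n=5)
```

Here the crude risk equals θ(1 − θ) and the improved risk equals θ(1 − θ)/''n'', so the conditioned estimator is never worse.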


Financial Performance Simulation

A popular method of measuring the performance of an investment is the internal rate of return (IRR), the rate at which a series of uncertain future cash flows is discounted, using present value theory, so that the sum of the discounted future cash flows equals the initial investment. While it is tempting to perform Monte Carlo simulation of the IRR, Jensen's inequality introduces a bias due to the fact that the IRR is a curved (nonlinear) function of the cash flows while the expectation operator is linear, so the average of the simulated IRRs generally differs from the IRR of the average cash flows.
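The bias can be seen in a deliberately simple setting. The sketch below assumes an investment with a single uncertain payoff after two periods, so that the IRR has the closed form r = \sqrt{\text{payoff}/\text{initial}} - 1, and three equally likely payoff scenarios; the cash-flow figures and the function name `irr_two_period` are hypothetical, and a realistic multi-period IRR would need a numerical root finder instead.

```python
import math

def irr_two_period(initial, payoff):
    """IRR r solving initial = payoff / (1 + r)**2 (single payoff after 2 periods)."""
    return math.sqrt(payoff / initial) - 1.0

initial = 100.0
payoffs = [81.0, 121.0, 169.0]     # equally likely scenarios (hypothetical)

mean_of_irr = sum(irr_two_period(initial, c) for c in payoffs) / len(payoffs)
irr_of_mean = irr_two_period(initial, sum(payoffs) / len(payoffs))
bias = mean_of_irr - irr_of_mean   # negative here: this IRR is concave in the payoff
```

In this setting the IRR is a concave function of the payoff, so averaging simulated IRRs understates the IRR of the expected payoff; the sign of the bias in other settings depends on the curvature of the IRR in the simulated quantities.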


See also

* Karamata's inequality for a more general inequality
* Popoviciu's inequality
* Law of averages
* A proof without words of Jensen's inequality



External links


* Jensen's Operator Inequality of Hansen and Pedersen.