Stein's Lemma

Stein's lemma, named in honor of Charles Stein, is a theorem of probability theory that is of interest primarily because of its applications to statistical inference, in particular to James–Stein estimation and empirical Bayes methods, and to portfolio choice theory. The theorem gives a formula for the covariance of one random variable with the value of a function of another, when the two random variables are jointly normally distributed.

Note that the name "Stein's lemma" is also commonly used to refer to a different result in the area of statistical hypothesis testing, which connects the error exponents in hypothesis testing with the Kullback–Leibler divergence. That result, also known as the Chernoff–Stein lemma, is not related to the lemma discussed in this article.


Statement

Suppose ''X'' is a normally distributed random variable with expectation \mu and variance \sigma^2. Further suppose ''g'' is a differentiable function for which the two expectations \operatorname{E}\bigl(g(X)(X - \mu)\bigr) and \operatorname{E}\bigl(g'(X)\bigr) both exist. (The existence of the expectation of any random variable is equivalent to the finiteness of the expectation of its absolute value.) Then

:\operatorname{E}\bigl(g(X)(X-\mu)\bigr)=\sigma^2 \operatorname{E}\bigl(g'(X)\bigr).
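
The identity is easy to check numerically. The following Monte Carlo sketch (assuming NumPy is available; the choice g(x) = x^3 is ours, made arbitrarily) compares estimates of the two sides:

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 1.5, 2.0  # arbitrary mean and standard deviation
x = rng.normal(mu, sigma, size=1_000_000)

g = lambda t: t**3         # any differentiable g with finite expectations
g_prime = lambda t: 3 * t**2

lhs = np.mean(g(x) * (x - mu))        # estimates E[g(X)(X - mu)]
rhs = sigma**2 * np.mean(g_prime(x))  # estimates sigma^2 * E[g'(X)]
print(lhs, rhs)  # the two estimates agree up to Monte Carlo error
```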


Multidimensional

In general, suppose ''X'' and ''Y'' are jointly normally distributed. Then

:\operatorname{Cov}(g(X),Y)= \operatorname{Cov}(X,Y)\operatorname{E}(g'(X)).

For a general multivariate Gaussian random vector (X_1, \ldots, X_n) \sim \mathcal{N}(\mu, \Sigma) it follows that

:\operatorname{E}\bigl(g(X)(X-\mu)\bigr)=\Sigma\cdot \operatorname{E}\bigl(\nabla g(X)\bigr).

Similarly, when \mu = 0,

:\operatorname{E}\bigl[\partial_i g(X)\bigr] = \operatorname{E}\bigl[g(X)\,(\Sigma^{-1}X)_i\bigr], \qquad \operatorname{E}\bigl[\partial_i\partial_j g(X)\bigr] = \operatorname{E}\bigl[g(X)\bigl((\Sigma^{-1}X)_i(\Sigma^{-1}X)_j - \Sigma^{-1}_{ij}\bigr)\bigr].
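
For example, taking g(x) = x_j, so that \nabla g(X) = e_j (the j-th standard basis vector), recovers the familiar covariance identity

:\operatorname{E}\bigl(X_j(X-\mu)\bigr)=\Sigma e_j,

that is, the j-th column of \Sigma.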


Gradient descent

Stein's lemma can be used to stochastically estimate a gradient:

:\nabla_x \operatorname{E}_{\epsilon}\bigl(g(x + \Sigma^{1/2}\epsilon)\bigr) = \Sigma^{-1/2}\, \operatorname{E}_{\epsilon}\bigl(g(x + \Sigma^{1/2}\epsilon)\,\epsilon\bigr) \approx \Sigma^{-1/2}\, \frac{1}{N} \sum_{i=1}^{N} g(x + \Sigma^{1/2}\epsilon_i)\,\epsilon_i,

where \epsilon_1, \dots, \epsilon_N are IID samples from the standard normal distribution \mathcal{N}(0, I). This form has applications in Stein variational gradient descent and Stein variational policy gradient.
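
A notable feature of this estimator is that it needs only evaluations of g, never its derivatives. A minimal sketch (assuming NumPy, specializing to an isotropic \Sigma = \sigma^2 I, with function and parameter names of our own choosing) might look like:

```python
import numpy as np

def stein_gradient_estimate(g, x, sigma=0.1, n_samples=1000, rng=None):
    """Estimate the gradient of E[g(x + sigma * eps)] at x via Stein's lemma.

    Uses Sigma = sigma^2 * I, so Sigma^{1/2} = sigma * I and
    Sigma^{-1/2} = I / sigma. Only evaluations of g are required.
    """
    rng = rng or np.random.default_rng()
    eps = rng.standard_normal((n_samples, x.size))       # eps_i ~ N(0, I)
    values = np.apply_along_axis(g, 1, x + sigma * eps)  # g(x + sigma * eps_i)
    return (values[:, None] * eps).mean(axis=0) / sigma  # Sigma^{-1/2} * average

# Example: for g(x) = ||x||^2 the smoothed gradient is exactly 2x.
x = np.array([1.0, -2.0])
print(stein_gradient_estimate(lambda v: v @ v, x))  # approx [2, -4]
```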


Proof

The univariate probability density function for the univariate normal distribution with expectation 0 and variance 1 is

:\varphi(x)=\frac{1}{\sqrt{2\pi}} e^{-x^2/2}.

Since \int x \exp(-x^2/2)\,dx = -\exp(-x^2/2), we get from integration by parts:

:\operatorname{E}\bigl(g(X)X\bigr) = \frac{1}{\sqrt{2\pi}}\int g(x)\, x \exp(-x^2/2)\,dx = \frac{1}{\sqrt{2\pi}}\int g'(x) \exp(-x^2/2)\,dx = \operatorname{E}\bigl(g'(X)\bigr).

The case of general variance \sigma^2 follows by substitution.
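
Explicitly, writing X = \mu + \sigma Z with Z standard normal and setting h(z) = g(\mu + \sigma z), so that h'(z) = \sigma g'(\mu + \sigma z), the unit-variance case gives

:\operatorname{E}\bigl(g(X)(X-\mu)\bigr) = \sigma\,\operatorname{E}\bigl(h(Z)Z\bigr) = \sigma\,\operatorname{E}\bigl(h'(Z)\bigr) = \sigma^2\,\operatorname{E}\bigl(g'(X)\bigr).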


Generalizations

Isserlis' theorem is equivalently stated as

:\operatorname{E}(X_1 f(X_1,\ldots,X_n))=\sum_{i=1}^{n} \operatorname{Cov}(X_1,X_i)\operatorname{E}(\partial_{X_i} f(X_1,\ldots,X_n)),

where (X_1,\dots,X_n) is a zero-mean multivariate normal random vector.

Suppose ''X'' is in an exponential family, that is, ''X'' has the density

:f_\eta(x)=\exp(\eta'T(x) - \Psi(\eta))h(x).

Suppose this density has support (a,b), where a,b could be -\infty, \infty, and that, as x\rightarrow a or b, \exp(\eta'T(x))h(x)g(x) \rightarrow 0, where g is any differentiable function such that \operatorname{E}|g'(X)|<\infty, or \exp(\eta'T(x))h(x) \rightarrow 0 if a,b are finite. Then

:\operatorname{E}\left[\left(\frac{h'(X)}{h(X)} + \sum_i \eta_i T_i'(X)\right) g(X)\right]= -\operatorname{E}\bigl(g'(X)\bigr).

The derivation is the same as in the special case, namely, integration by parts.

If we only know that X has support \mathbb{R}, then it could be the case that \operatorname{E}|g(X)| <\infty and \operatorname{E}|g'(X)| <\infty but \lim_{x \to \infty} f_\eta(x) g(x) \ne 0. To see this, put g(x)=1 and let f_\eta(x) have infinitely many spikes towards infinity while still being integrable. One such example could be adapted from

:f(x) = \begin{cases} 1 & x \in [n, n + 2^{-n}) \\ 0 & \text{otherwise} \end{cases}

(for n ranging over the positive integers), smoothed so that f is differentiable.

Extensions to elliptically-contoured distributions also exist.
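
As a concrete instance (our own choice of example, not taken from the discussion above): for the exponential distribution with rate \lambda, take \eta = -\lambda, T(x) = x, h(x) = 1 on support (0, \infty); the identity then reduces to \lambda \operatorname{E}(g(X)) = \operatorname{E}(g'(X)) for differentiable g with g(0) = 0, so that the boundary term vanishes. A short Monte Carlo sketch (assuming NumPy) can verify this:

```python
import numpy as np

rng = np.random.default_rng(1)
lam = 0.7
# Exponential(rate=lam) samples; NumPy parametrizes by scale = 1/rate.
x = rng.exponential(scale=1 / lam, size=1_000_000)

g = lambda t: t**2         # differentiable, with g(0) = 0
g_prime = lambda t: 2 * t

# Both sides equal 2 / lam for this g; estimates agree up to Monte Carlo error.
print(lam * np.mean(g(x)), np.mean(g_prime(x)))
```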


See also

*Stein's method
*Taylor expansions for the moments of functions of random variables
*Stein discrepancy

