Stein's Lemma

Stein's lemma, named in honor of Charles Stein, is a theorem of probability theory that is of interest primarily because of its applications to statistical inference, in particular to James–Stein estimation and empirical Bayes methods, and to portfolio choice theory. The theorem gives a formula for the covariance of one random variable with the value of a function of another, when the two random variables are jointly normally distributed.

Note that the name "Stein's lemma" is also commonly used to refer to a different result in the area of statistical hypothesis testing, which connects the error exponents in hypothesis testing with the Kullback–Leibler divergence. That result, also known as the Chernoff–Stein lemma, is not related to the lemma discussed in this article.


Statement

Suppose ''X'' is a normally distributed random variable with expectation \mu and variance \sigma^2. Further suppose ''g'' is a differentiable function for which the two expectations \operatorname{E}\bigl(g(X)(X - \mu)\bigr) and \operatorname{E}\bigl(g'(X)\bigr) both exist. (The existence of the expectation of any random variable is equivalent to the finiteness of the expectation of its absolute value.) Then

:\operatorname{E}\bigl(g(X)(X-\mu)\bigr)=\sigma^2 \operatorname{E}\bigl(g'(X)\bigr).
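
The identity is easy to check numerically. The following Monte Carlo sketch (assuming NumPy is available; the choice g(x) = x^3 is ours, made arbitrarily) compares estimates of the two sides:

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 1.5, 2.0  # arbitrary mean and standard deviation
x = rng.normal(mu, sigma, size=1_000_000)

g = lambda t: t**3         # any differentiable g with finite expectations
g_prime = lambda t: 3 * t**2

lhs = np.mean(g(x) * (x - mu))        # estimates E[g(X)(X - mu)]
rhs = sigma**2 * np.mean(g_prime(x))  # estimates sigma^2 * E[g'(X)]
print(lhs, rhs)  # the two estimates agree up to Monte Carlo error
```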


Multidimensional

In general, suppose ''X'' and ''Y'' are jointly normally distributed. Then

:\operatorname{Cov}(g(X),Y)= \operatorname{Cov}(X,Y)\operatorname{E}(g'(X)).

For a general multivariate Gaussian random vector (X_1, \ldots, X_n) \sim \mathcal{N}(\mu, \Sigma) it follows that

:\operatorname{E}\bigl(g(X)(X-\mu)\bigr)=\Sigma\cdot \operatorname{E}\bigl(\nabla g(X)\bigr).

Similarly, when \mu = 0,

:\operatorname{E}\bigl[\partial_i g(X)\bigr] = \operatorname{E}\bigl[g(X)\,(\Sigma^{-1}X)_i\bigr], \qquad \operatorname{E}\bigl[\partial_i\partial_j g(X)\bigr] = \operatorname{E}\bigl[g(X)\bigl((\Sigma^{-1}X)_i(\Sigma^{-1}X)_j - \Sigma^{-1}_{ij}\bigr)\bigr].
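
For example, taking g(x) = x_j, so that \nabla g(X) = e_j (the j-th standard basis vector), recovers the familiar covariance identity

:\operatorname{E}\bigl(X_j(X-\mu)\bigr)=\Sigma e_j,

that is, the j-th column of \Sigma.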


Gradient descent

Stein's lemma can be used to stochastically estimate a gradient:

:\nabla_x \operatorname{E}_{\epsilon}\bigl(g(x + \Sigma^{1/2}\epsilon)\bigr) = \Sigma^{-1/2}\, \operatorname{E}_{\epsilon}\bigl(g(x + \Sigma^{1/2}\epsilon)\,\epsilon\bigr) \approx \Sigma^{-1/2}\, \frac{1}{N} \sum_{i=1}^{N} g(x + \Sigma^{1/2}\epsilon_i)\,\epsilon_i,

where \epsilon_1, \dots, \epsilon_N are IID samples from the standard normal distribution \mathcal{N}(0, I). This form has applications in Stein variational gradient descent and Stein variational policy gradient.
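
A notable feature of this estimator is that it needs only evaluations of g, never its derivatives. A minimal sketch (assuming NumPy, specializing to an isotropic \Sigma = \sigma^2 I, with function and parameter names of our own choosing) might look like:

```python
import numpy as np

def stein_gradient_estimate(g, x, sigma=0.1, n_samples=1000, rng=None):
    """Estimate the gradient of E[g(x + sigma * eps)] at x via Stein's lemma.

    Uses Sigma = sigma^2 * I, so Sigma^{1/2} = sigma * I and
    Sigma^{-1/2} = I / sigma. Only evaluations of g are required.
    """
    rng = rng or np.random.default_rng()
    eps = rng.standard_normal((n_samples, x.size))       # eps_i ~ N(0, I)
    values = np.apply_along_axis(g, 1, x + sigma * eps)  # g(x + sigma * eps_i)
    return (values[:, None] * eps).mean(axis=0) / sigma  # Sigma^{-1/2} * average

# Example: for g(x) = ||x||^2 the smoothed gradient is exactly 2x.
x = np.array([1.0, -2.0])
print(stein_gradient_estimate(lambda v: v @ v, x))  # approx [2, -4]
```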


Proof

The univariate probability density function for the univariate normal distribution with expectation 0 and variance 1 is

:\varphi(x)=\frac{1}{\sqrt{2\pi}} e^{-x^2/2}.

Since \int x \exp(-x^2/2)\,dx = -\exp(-x^2/2), we get from integration by parts:

:\operatorname{E}\bigl(g(X)X\bigr) = \frac{1}{\sqrt{2\pi}}\int g(x)\, x \exp(-x^2/2)\,dx = \frac{1}{\sqrt{2\pi}}\int g'(x) \exp(-x^2/2)\,dx = \operatorname{E}\bigl(g'(X)\bigr).

The case of general variance \sigma^2 follows by substitution.
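
Explicitly, writing X = \mu + \sigma Z with Z standard normal and setting h(z) = g(\mu + \sigma z), so that h'(z) = \sigma g'(\mu + \sigma z), the unit-variance case gives

:\operatorname{E}\bigl(g(X)(X-\mu)\bigr) = \sigma\,\operatorname{E}\bigl(h(Z)Z\bigr) = \sigma\,\operatorname{E}\bigl(h'(Z)\bigr) = \sigma^2\,\operatorname{E}\bigl(g'(X)\bigr).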


Generalizations

Isserlis' theorem is equivalently stated as

:\operatorname{E}(X_1 f(X_1,\ldots,X_n))=\sum_{i=1}^{n} \operatorname{Cov}(X_1,X_i)\operatorname{E}(\partial_{X_i} f(X_1,\ldots,X_n)),

where (X_1,\dots,X_n) is a zero-mean multivariate normal random vector.

Suppose ''X'' is in an exponential family, that is, ''X'' has the density

:f_\eta(x)=\exp(\eta'T(x) - \Psi(\eta))h(x).

Suppose this density has support (a,b), where a,b could be -\infty, \infty, and that, as x\rightarrow a or b, \exp(\eta'T(x))h(x)g(x) \rightarrow 0, where g is any differentiable function such that \operatorname{E}|g'(X)|<\infty, or \exp(\eta'T(x))h(x) \rightarrow 0 if a,b are finite. Then

:\operatorname{E}\left[\left(\frac{h'(X)}{h(X)} + \sum_i \eta_i T_i'(X)\right) g(X)\right]= -\operatorname{E}\bigl(g'(X)\bigr).

The derivation is the same as in the special case, namely, integration by parts.

If we only know that X has support \mathbb{R}, then it could be the case that \operatorname{E}|g(X)| <\infty and \operatorname{E}|g'(X)| <\infty but \lim_{x \to \infty} f_\eta(x) g(x) \ne 0. To see this, put g(x)=1 and let f_\eta(x) have infinitely many spikes towards infinity while still being integrable. One such example could be adapted from

:f(x) = \begin{cases} 1 & x \in [n, n + 2^{-n}) \\ 0 & \text{otherwise} \end{cases}

(for n ranging over the positive integers), smoothed so that f is differentiable.

Extensions to elliptically-contoured distributions also exist.
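
As a concrete instance (our own choice of example, not taken from the discussion above): for the exponential distribution with rate \lambda, take \eta = -\lambda, T(x) = x, h(x) = 1 on support (0, \infty); the identity then reduces to \lambda \operatorname{E}(g(X)) = \operatorname{E}(g'(X)) for differentiable g with g(0) = 0, so that the boundary term vanishes. A short Monte Carlo sketch (assuming NumPy) can verify this:

```python
import numpy as np

rng = np.random.default_rng(1)
lam = 0.7
# Exponential(rate=lam) samples; NumPy parametrizes by scale = 1/rate.
x = rng.exponential(scale=1 / lam, size=1_000_000)

g = lambda t: t**2         # differentiable, with g(0) = 0
g_prime = lambda t: 2 * t

# Both sides equal 2 / lam for this g; estimates agree up to Monte Carlo error.
print(lam * np.mean(g(x)), np.mean(g_prime(x)))
```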


See also

*Stein's method
*Taylor expansions for the moments of functions of random variables
*Stein discrepancy

