Stein's method is a general method in probability theory to obtain bounds on the distance between two probability distributions with respect to a probability metric. It was introduced by Charles Stein, who first published it in 1972, to obtain a bound between the distribution of a sum of an m-dependent sequence of random variables and a standard normal distribution in the Kolmogorov (uniform) metric, and hence to prove not only a central limit theorem, but also bounds on the rates of convergence for the given metric.


History

At the end of the 1960s, unsatisfied with the by-then known proofs of a specific central limit theorem, Charles Stein developed a new way of proving the theorem for his statistics lecture (Charles Stein: The Invariant, the Direct and the "Pretentious". Interview given in 2003 in Singapore). His seminal paper was presented in 1970 at the sixth Berkeley Symposium and published in the corresponding proceedings. Later, his Ph.D. student Louis Chen Hsiao Yun modified the method so as to obtain approximation results for the Poisson distribution; therefore the Stein method applied to the problem of Poisson approximation is often referred to as the Stein–Chen method.

Probably the most important contributions are the monograph by Stein (1986), where he presents his view of the method and the concept of ''auxiliary randomisation'', in particular using ''exchangeable pairs'', and the articles by Barbour (1988) and Götze (1991), who introduced the so-called ''generator interpretation'', which made it possible to easily adapt the method to many other probability distributions. An important contribution was also an article by Bolthausen (1984) on the so-called ''combinatorial central limit theorem''.

In the 1990s the method was adapted to a variety of distributions, such as Gaussian processes by Barbour (1990), the binomial distribution by Ehm (1991), Poisson processes by Barbour and Brown (1992), the Gamma distribution by Luk (1994), and many others. The method gained further popularity in the machine learning community in the mid-2010s, following the development of computable Stein discrepancies and the diverse applications and algorithms based on them.


The basic approach


Probability metrics

Stein's method is a way to bound the distance between two probability distributions using a specific probability metric. Let the metric be given in the form

: (1.1)\quad d(P,Q) = \sup_{h\in\mathcal{H}}\left|\int h \, dP - \int h \, dQ \right| = \sup_{h\in\mathcal{H}}\left|E h(W) - E h(Y) \right|.

Here, P and Q are probability measures on a measurable space \mathcal{X}, W and Y are random variables with distribution P and Q respectively, E is the usual expectation operator and \mathcal{H} is a set of functions from \mathcal{X} to the set of real numbers. The set \mathcal{H} has to be large enough, so that the above definition indeed yields a metric.

Important examples are the total variation metric, where we let \mathcal{H} consist of all the indicator functions of measurable sets; the Kolmogorov (uniform) metric for probability measures on the real numbers, where we consider all the half-line indicator functions; and the Lipschitz (first order Wasserstein; Kantorovich) metric, where the underlying space is itself a metric space and we take the set \mathcal{H} to be all Lipschitz-continuous functions with Lipschitz-constant 1. However, note that not every metric can be represented in the form (1.1).

In what follows P is a complicated distribution (e.g., the distribution of a sum of dependent random variables), which we want to approximate by a much simpler and tractable distribution Q (e.g., the standard normal distribution).
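
To make (1.1) concrete, the following sketch (not part of the original article; the Binomial(30, 0.3) example and the evaluation grid are arbitrary illustrative choices) approximates the Kolmogorov metric, i.e. (1.1) with \mathcal{H} the half-line indicators, between a standardized binomial distribution and the standard normal distribution.

```python
import numpy as np
from scipy import stats

# Kolmogorov (uniform) metric: sup_x |P(W <= x) - P(Y <= x)|, approximated on a grid.
# W is a standardized Binomial(n, p) variable, Y is standard normal.
n, p = 30, 0.3
sigma = np.sqrt(n * p * (1 - p))
grid = np.linspace(-4, 4, 2001)

cdf_W = stats.binom.cdf(np.floor(n * p + grid * sigma), n, p)  # P(W <= x)
cdf_Y = stats.norm.cdf(grid)                                   # P(Y <= x)

print("approximate Kolmogorov distance:", np.max(np.abs(cdf_W - cdf_Y)))
```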


The Stein operator

We assume now that the distribution Q is a fixed distribution; in what follows we shall in particular consider the case where Q is the standard normal distribution, which serves as a classical example. First of all, we need an operator \mathcal{A}, which acts on functions f from \mathcal{X} to the set of real numbers and 'characterizes' the distribution Q in the sense that the following equivalence holds:

: (2.1)\quad E ((\mathcal{A}f)(Y)) = 0 \text{ for all } f \quad \iff \quad Y \text{ has distribution } Q.

We call such an operator the ''Stein operator''. For the standard normal distribution, Stein's lemma yields such an operator:

: (2.2)\quad E\left(f'(Y)-Yf(Y)\right) = 0 \text{ for all } f\in C_b^1 \quad \iff \quad Y \text{ has standard normal distribution.}

Thus, we can take

: (2.3)\quad (\mathcal{A}f)(x) = f'(x) - x f(x).

There are in general infinitely many such operators and it still remains an open question which one to choose. However, it seems that for many distributions there is a particularly ''good'' one, like (2.3) for the normal distribution. There are different ways to find Stein operators.
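
A quick Monte Carlo check of the characterization (2.2) (a sketch, not from the original text; the test function f(x) = \sin x and the uniform comparison variable are arbitrary choices): the expression E(f'(Y) - Y f(Y)) vanishes when Y is standard normal, but not for a non-normal variable with the same mean and variance.

```python
import numpy as np

rng = np.random.default_rng(0)
f, f_prime = np.sin, np.cos                      # a smooth bounded test function

Y = rng.standard_normal(10**6)                   # standard normal samples
Z = rng.uniform(-np.sqrt(3), np.sqrt(3), 10**6)  # mean 0, variance 1, but not normal

print(np.mean(f_prime(Y) - Y * f(Y)))            # approximately 0 (Stein's lemma)
print(np.mean(f_prime(Z) - Z * f(Z)))            # clearly bounded away from 0
```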


The Stein equation

P is close to Q with respect to d if the difference of expectations in (1.1) is close to 0. We hope now that the operator \mathcal{A} exhibits the same behavior: if P=Q then E (\mathcal{A}f)(W)=0, and hopefully if P\approx Q we have E (\mathcal{A}f)(W) \approx 0. It is usually possible to define a function f = f_h such that

: (3.1)\quad (\mathcal{A}f)(x)= h(x) - E h(Y) \qquad\text{for all } x.

We call (3.1) the ''Stein equation''. Replacing x by W and taking expectation with respect to W, we get

: (3.2)\quad E(\mathcal{A}f)(W) = E h(W) - E h(Y).

Now all the effort is worthwhile only if the left-hand side of (3.2) is easier to bound than the right-hand side. This is, surprisingly, often the case. If Q is the standard normal distribution and we use (2.3), then the corresponding Stein equation is

: (3.3)\quad f'(x) - x f(x) = h(x) - E h(Y) \qquad\text{for all } x.

If the probability distribution Q has an absolutely continuous (with respect to the Lebesgue measure) density q, then

: (3.4)\quad (\mathcal{A}f)(x) = f'(x)+f(x)q'(x)/q(x).
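
As an illustration of the density form (3.4) (a sketch, not from the original text), take Q to be the standard logistic distribution with density q(x) = e^{-x}/(1+e^{-x})^2, so that q'(x)/q(x) = -\tanh(x/2) and (\mathcal{A}f)(x) = f'(x) - f(x)\tanh(x/2); the choice of distribution and test function here is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
f, f_prime = np.sin, np.cos                     # a smooth bounded test function

def stein_op(x):
    # density-based Stein operator (3.4) for the standard logistic distribution
    return f_prime(x) - f(x) * np.tanh(x / 2)

Y = rng.logistic(size=10**6)                    # standard logistic samples
Z = rng.standard_normal(10**6)                  # normal samples for comparison

print(np.mean(stein_op(Y)))                     # approximately 0
print(np.mean(stein_op(Z)))                     # noticeably non-zero
```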


Solving the Stein equation

''Analytic methods''. Equation (3.3) can be easily solved explicitly:

: (4.1)\quad f(x) = e^{x^2/2}\int_{-\infty}^x \left[h(s)-E h(Y)\right] e^{-s^2/2} \, ds.

''Generator method''. If \mathcal{A} is the generator of a Markov process (Z_t)_{t\geq 0} (see Barbour (1988), Götze (1991)), then the solution to (3.2) is

: (4.2)\quad f(x) = -\int_0^\infty \left[E^x h(Z_t)-E h(Y)\right]\, dt,

where E^x denotes expectation with respect to the process Z being started in x. However, one still has to prove that the solution (4.2) exists for all desired functions h\in\mathcal{H}.
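
The explicit solution (4.1) is easy to evaluate by quadrature. The following sketch (not from the original text; the test function h(x) = \tanh x is an arbitrary Lipschitz choice) computes f from (4.1) and confirms numerically that it satisfies the Stein equation (3.3).

```python
import numpy as np
from scipy import integrate, stats

h = np.tanh                                              # a bounded Lipschitz test function

# E h(Y) for Y ~ N(0,1), computed by quadrature
Eh, _ = integrate.quad(lambda s: h(s) * stats.norm.pdf(s), -np.inf, np.inf)

def f(x):
    # solution (4.1): f(x) = exp(x^2/2) * int_{-inf}^{x} (h(s) - E h(Y)) exp(-s^2/2) ds
    integral, _ = integrate.quad(lambda s: (h(s) - Eh) * np.exp(-s**2 / 2), -np.inf, x)
    return np.exp(x**2 / 2) * integral

# check the Stein equation (3.3): f'(x) - x f(x) = h(x) - E h(Y)
eps = 1e-4
for x in (-1.5, 0.0, 0.7, 2.0):
    f_prime = (f(x + eps) - f(x - eps)) / (2 * eps)      # central difference
    print(x, f_prime - x * f(x), h(x) - Eh)              # the two values agree up to small numerical error
```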


Properties of the solution to the Stein equation

Usually, one tries to give bounds on f and its derivatives (or differences) in terms of h and its derivatives (or differences), that is, inequalities of the form

: (5.1)\quad \|D^k f\| \leq C_{k,l} \|D^l h\|,

for some specific k,l=0,1,2,\dots (typically k\geq l or k\geq l-1, respectively, depending on the form of the Stein operator), where often \|\cdot\| is the supremum norm. Here, D^k denotes the differential operator, but in discrete settings it usually refers to a difference operator. The constants C_{k,l} may contain the parameters of the distribution Q. If there are any, they are often referred to as ''Stein factors''.

In the case of (4.1) one can prove for the supremum norm that

: (5.2)\quad \|f\|_\infty\leq \min\left\{\sqrt{\pi/2}\,\|h\|_\infty,\, 2\|h'\|_\infty\right\},\quad \|f'\|_\infty\leq \min\left\{2\,\|h\|_\infty,\, 4\,\|h'\|_\infty\right\},\quad \|f''\|_\infty\leq 2 \|h'\|_\infty,

where the last bound is of course only applicable if h is differentiable (or at least Lipschitz-continuous, which, for example, is not the case if we regard the total variation metric or the Kolmogorov metric!). As the standard normal distribution has no extra parameters, in this specific case the constants are free of additional parameters.

If we have bounds in the general form (5.1), we usually are able to treat many probability metrics together. One can often start with the next step below, if bounds of the form (5.1) are already available (which is the case for many distributions).
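
The last bound in (5.2) can be checked numerically. In this sketch (not from the original text; h(x) = \tanh x is again an arbitrary Lipschitz test function with \|h'\|_\infty = 1) the second derivative f'' is obtained by differentiating the Stein equation (3.3), which gives f''(x) = (1+x^2)f(x) + x(h(x) - E h(Y)) + h'(x).

```python
import numpy as np
from scipy import integrate, stats

h = np.tanh
h_prime = lambda x: 1.0 / np.cosh(x) ** 2                       # ||h'||_inf = 1

Eh, _ = integrate.quad(lambda s: h(s) * stats.norm.pdf(s), -np.inf, np.inf)

def f(x):
    # solution (4.1) of the Stein equation for this h
    integral, _ = integrate.quad(lambda s: (h(s) - Eh) * np.exp(-s**2 / 2), -np.inf, x)
    return np.exp(x**2 / 2) * integral

# f''(x) from differentiating (3.3): f''(x) = (1 + x^2) f(x) + x (h(x) - Eh) + h'(x)
grid = np.linspace(-4, 4, 161)
f_second = np.array([(1 + x**2) * f(x) + x * (h(x) - Eh) + h_prime(x) for x in grid])

print("max |f''| on the grid:", np.max(np.abs(f_second)))       # stays below the bound
print("bound 2 * ||h'||_inf :", 2.0)
```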


An abstract approximation theorem

We are now in a position to bound the left-hand side of (3.2). As this step heavily depends on the form of the Stein operator, we directly regard the case of the standard normal distribution. At this point we could directly plug in the random variable W, which we want to approximate, and try to find upper bounds. However, it is often fruitful to formulate a more general theorem. Consider here the case of local dependence.

Assume that W=\sum_{i=1}^n X_i is a sum of random variables such that E W = 0 and variance \operatorname{Var} W = 1. Assume that, for every i=1,\dots,n, there is a set A_i\subset\{1,\dots,n\}, such that X_i is independent of all the random variables X_j with j\not\in A_i. We call this set the 'neighborhood' of X_i. Likewise let B_i\subset\{1,\dots,n\} be a set such that all X_j with j\in A_i are independent of all X_k, k\not\in B_i. We can think of B_i as the neighbors in the neighborhood of X_i, a second-order neighborhood, so to speak. For a set A\subset\{1,\dots,n\} define now the sum X_A := \sum_{j\in A} X_j.

Using Taylor expansion, it is possible to prove that

: (6.1)\quad \left|E(f'(W)-Wf(W))\right| \leq \|f''\|_\infty\sum_{i=1}^n \left( \frac{1}{2}E| X_i X_{A_i}^2| + E| X_i X_{A_i}X_{B_i\setminus A_i}| + E| X_i X_{A_i}| \, E| X_{B_i}| \right).

Note that, if we follow this line of argument, we can bound (1.1) only for functions h where \|h'\|_{\infty} is bounded, because of the third inequality of (5.2) (and in fact, if h has discontinuities, so will f''). To obtain a bound similar to (6.1) which contains only the expressions \|f\|_\infty and \|f'\|_\infty, the argument is much more involved and the result is not as simple as (6.1); however, it can be done.

Theorem A. If W is as described above, we have for the Lipschitz metric d_W that

: (6.2)\quad d_W(\mathcal{L}(W),N(0,1)) \leq 2\sum_{i=1}^n \left( \frac{1}{2}E| X_i X_{A_i}^2| + E| X_i X_{A_i}X_{B_i\setminus A_i}| + E| X_i X_{A_i}| \, E| X_{B_i}| \right).

Proof. Recall that the Lipschitz metric is of the form (1.1) where the functions h are Lipschitz-continuous with Lipschitz-constant 1, thus \|h'\|\leq 1. Combining this with (6.1) and the last bound in (5.2) proves the theorem.

Thus, roughly speaking, we have proved that, to calculate the Lipschitz-distance between a W with local dependence structure and a standard normal distribution, we only need to know the third moments of the X_i and the size of the neighborhoods A_i and B_i.
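
As an illustration of Theorem A (a sketch, not from the original text), the right-hand side of (6.2) can be estimated by Monte Carlo for a simple 1-dependent sum: with U_0,\dots,U_n i.i.d. Rademacher variables and X_i = (U_{i-1}+U_i)/\sqrt{4n-2} one has E W = 0, \operatorname{Var} W = 1, and the neighborhoods can be taken as A_i = \{i-1,i,i+1\} and B_i = \{i-2,\dots,i+2\} (intersected with \{1,\dots,n\}); all of these choices are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
n, reps = 20, 100_000

U = rng.choice([-1.0, 1.0], size=(reps, n + 1))     # U_0, ..., U_n, i.i.d. Rademacher
X = (U[:, :-1] + U[:, 1:]) / np.sqrt(4 * n - 2)     # column i holds samples of X_{i+1}; Var(W) = 1

bound = 0.0
for i in range(n):
    A = [j for j in range(n) if abs(j - i) <= 1]    # neighborhood of X_i
    B = [j for j in range(n) if abs(j - i) <= 2]    # second-order neighborhood
    Xi, XA = X[:, i], X[:, A].sum(axis=1)
    XBmA = X[:, [j for j in B if j not in A]].sum(axis=1)
    XB = X[:, B].sum(axis=1)
    bound += (0.5 * np.mean(np.abs(Xi * XA**2))
              + np.mean(np.abs(Xi * XA * XBmA))
              + np.mean(np.abs(Xi * XA)) * np.mean(np.abs(XB)))

print("Wasserstein bound from (6.2):", 2 * bound)
```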


Application of the theorem

We can treat the case of sums of independent and identically distributed random variables with Theorem A. Assume that E X_i = 0, \operatorname{Var} X_i = 1 and W=n^{-1/2}\sum X_i. We can take A_i=B_i=\{i\}. From Theorem A we obtain that

: (7.1)\quad d_W(\mathcal{L}(W),N(0,1)) \leq \frac{E|X_1|^3 + 2\,E|X_1|}{\sqrt{n}} \leq \frac{3\, E|X_1|^3}{\sqrt{n}}.

For sums of random variables another approach related to Stein's method is known as the zero bias transform.
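
The following sketch (not from the original text) compares the right-hand side of (7.1) with a Monte Carlo estimate of the Wasserstein distance itself, using standardized exponential summands X_i = E_i - 1 with E_i \sim \mathrm{Exp}(1); the sample sizes are arbitrary.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n, reps = 50, 100_000

X = rng.exponential(size=(reps, n)) - 1.0              # E X_i = 0, Var X_i = 1
W = X.sum(axis=1) / np.sqrt(n)                         # W = n^{-1/2} * sum_i X_i

EX3 = np.mean(np.abs(rng.exponential(size=10**6) - 1.0) ** 3)   # E|X_1|^3 (about 2.41)
bound = 3 * EX3 / np.sqrt(n)                           # bound from (7.1)

Z = rng.standard_normal(reps)
d_emp = stats.wasserstein_distance(W, Z)               # empirical Lipschitz/Wasserstein distance

print(f"bound (7.1): {bound:.3f}   empirical d_W: {d_emp:.3f}")
```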


Connections to other methods

* Lindeberg's device. Lindeberg (1922) introduced a device, where the difference E h(X_1+\cdots+X_n)-E h(Y_1+\cdots+Y_n) is represented as a sum of step-by-step differences.
* Tikhomirov's method. Clearly the approach via (1.1) and (3.1) does not involve characteristic functions. However, Tikhomirov (1980) presented a proof of a central limit theorem based on characteristic functions and a differential operator similar to (2.3). The basic observation is that the characteristic function \psi(t) of the standard normal distribution satisfies the differential equation \psi'(t)+t\psi(t) = 0 for all t. Thus, if the characteristic function \psi_W(t) of W is such that \psi'_W(t)+t\psi_W(t)\approx 0 we expect that \psi_W(t)\approx \psi(t) and hence that W is close to the normal distribution. Tikhomirov states in his paper that he was inspired by Stein's seminal paper. (A numerical illustration of this observation is sketched below.)
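
A quick numerical check of Tikhomirov's observation (a sketch, not from the original text; the uniform summands and the sample sizes are arbitrary choices): the empirical characteristic function of a standardized sum approximately satisfies the differential equation \psi'(t)+t\psi(t)=0 that characterizes the standard normal characteristic function.

```python
import numpy as np

rng = np.random.default_rng(2)
n, reps = 100, 100_000

X = rng.uniform(-np.sqrt(3), np.sqrt(3), size=(reps, n))   # mean 0, variance 1 summands
W = X.sum(axis=1) / np.sqrt(n)                              # standardized sum

def psi(t):                       # empirical characteristic function of W
    return np.mean(np.exp(1j * t * W))

def psi_prime(t):                 # its derivative in t
    return np.mean(1j * W * np.exp(1j * t * W))

for t in (0.5, 1.0, 2.0):
    print(t, abs(psi_prime(t) + t * psi(t)))                # close to 0, as for N(0,1)
```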


See also

* Stein's lemma
* Stein discrepancy



