Bernstein–von Mises Theorem

In Bayesian inference, the Bernstein–von Mises theorem provides the basis for using Bayesian credible sets for confidence statements in parametric models. It states that under some conditions, a posterior distribution converges in total variation distance to a multivariate normal distribution centered at the maximum likelihood estimator \widehat{\theta}_n with covariance matrix given by n^{-1}\mathcal{I}(\theta_0)^{-1}, where \theta_0 is the true population parameter and \mathcal{I}(\theta_0) is the Fisher information matrix at the true population parameter value:

:\left\| P(\theta \mid x_1, \dots, x_n) - \mathcal{N}\!\left( \widehat{\theta}_n,\, n^{-1} \mathcal{I}(\theta_0)^{-1} \right) \right\|_{\mathrm{TV}} \xrightarrow{P_{\theta_0}} 0.

The Bernstein–von Mises theorem links Bayesian inference with frequentist inference. It assumes there is some true probabilistic process that generates the observations, as in frequentism, and then studies the quality of Bayesian methods of recovering that process and making uncertainty statements about it. In particular, it states that asymptotically, many Bayesian credible sets of a certain credibility level \alpha will act as confidence sets of confidence level \alpha, which allows for the frequentist interpretation of Bayesian credible sets.


Statement

Let (P_\theta : \theta \in \Theta) be a well-specified statistical model, where the parameter space \Theta is a subset of \mathbb{R}^k. Further, let the data X_1, \ldots, X_n \in \mathcal{X} be independently and identically distributed from P_{\theta_0}. Suppose that all of the following conditions hold:
# The model admits densities (p_\theta : \theta \in \Theta) with respect to some measure \mu.
# The Fisher information matrix \mathcal{I}(\theta_0) is nonsingular.
# The model is differentiable in quadratic mean. That is, there exists a measurable function f : \mathcal{X} \rightarrow \mathbb{R}^k such that \int \left[ \sqrt{p_\theta(x)} - \sqrt{p_{\theta_0}(x)} - \tfrac{1}{2} (\theta - \theta_0)^\top f(x) \sqrt{p_{\theta_0}(x)} \right]^2 \mathrm{d}\mu(x) = o(\|\theta - \theta_0\|^2) as \theta \rightarrow \theta_0.
# For every \varepsilon > 0, there exists a sequence of test functions \phi_n : \mathcal{X}^n \rightarrow [0, 1] such that \mathbb{E}_{\theta_0}\left[ \phi_n(\mathbf{X}) \right] \rightarrow 0 and \sup_{\|\theta - \theta_0\| \geq \varepsilon} \mathbb{E}_{\theta}\left[ 1 - \phi_n(\mathbf{X}) \right] \rightarrow 0 as n \rightarrow \infty.
# The prior measure is absolutely continuous with respect to the Lebesgue measure in a neighborhood of \theta_0, with a continuous positive density at \theta_0.
Then for any estimator \widehat{\theta}_n satisfying \sqrt{n}(\widehat{\theta}_n - \theta_0) \xrightarrow{d} \mathcal{N}(0, \mathcal{I}^{-1}(\theta_0)), the posterior distribution \Pi_n of \theta \mid X_1, \ldots, X_n satisfies

:\left\| \Pi_n - \mathcal{N}\!\left( \widehat{\theta}_n,\, n^{-1} \mathcal{I}^{-1}(\theta_0) \right) \right\|_{\mathrm{TV}} \xrightarrow{P_{\theta_0}} 0

as n \rightarrow \infty.
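
As a concrete, purely illustrative instance of the theorem, take coin flips X_i \sim \mathrm{Bernoulli}(\theta_0) with a conjugate \mathrm{Beta}(2, 2) prior; neither choice comes from the statement above. The posterior after s successes in n trials is \mathrm{Beta}(2 + s, 2 + n - s), the Fisher information is \mathcal{I}(\theta) = 1/(\theta(1 - \theta)), and the theorem predicts that the total variation distance between the posterior and \mathcal{N}(\widehat{\theta}_n, \widehat{\theta}_n(1 - \widehat{\theta}_n)/n) shrinks as n grows. A minimal numerical sketch in Python, using the plug-in \widehat{\theta}_n in place of \theta_0 in the Fisher information (asymptotically equivalent):

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
theta0, a, b = 0.3, 2.0, 2.0                   # true parameter; illustrative Beta(a, b) prior

for n in [10, 100, 1000, 10000]:
    s = rng.binomial(n, theta0)                # number of successes (sufficient statistic)
    theta_hat = np.clip(s / n, 1 / n, 1 - 1 / n)   # MLE, kept off the boundary for small n
    posterior = stats.beta(a + s, b + n - s)
    bvm_approx = stats.norm(theta_hat, np.sqrt(theta_hat * (1 - theta_hat) / n))
    grid = np.linspace(1e-6, 1 - 1e-6, 200001)
    dx = grid[1] - grid[0]
    # total variation distance = (1/2) * L1 distance between the two densities
    tv = 0.5 * np.sum(np.abs(posterior.pdf(grid) - bvm_approx.pdf(grid))) * dx
    print(f"n = {n:6d}   TV(posterior, normal approximation) ~ {tv:.4f}")

The printed total variation distance decreases toward zero as n increases, as the theorem predicts.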


Relationship to maximum likelihood estimation

Under certain regularity conditions, the maximum likelihood estimator is an asymptotically efficient estimator and can thus be used as \widehat{\theta}_n in the theorem statement. The posterior distribution then converges in total variation distance to the asymptotic distribution of the maximum likelihood estimator, which is commonly used to construct frequentist confidence sets.
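
Continuing the illustrative Bernoulli/Beta setup from the previous section (an assumption made for demonstration, not part of the theorem), the equal-tailed posterior credible interval and the Wald confidence interval built from the asymptotic normal law of the MLE nearly coincide for large n:

import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
theta0, a, b, n = 0.3, 2.0, 2.0, 5000
s = rng.binomial(n, theta0)
theta_hat = s / n

# 95% equal-tailed posterior credible interval from the Beta posterior
lo_bayes, hi_bayes = stats.beta(a + s, b + n - s).ppf([0.025, 0.975])

# 95% Wald interval from the asymptotic normality of the MLE, with the
# Fisher information evaluated at the plug-in estimate theta_hat
half_width = 1.96 * np.sqrt(theta_hat * (1 - theta_hat) / n)
lo_freq, hi_freq = theta_hat - half_width, theta_hat + half_width

print(f"credible interval:   ({lo_bayes:.4f}, {hi_bayes:.4f})")
print(f"confidence interval: ({lo_freq:.4f}, {hi_freq:.4f})")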


Implications

The most important implication of the Bernstein–von Mises theorem is that Bayesian inference is asymptotically correct from a frequentist point of view. This means that for large amounts of data, one can use the posterior distribution to make statements about estimation and uncertainty that are valid from a frequentist point of view.
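
This frequentist validity can be checked by simulation. Under the same illustrative Bernoulli/Beta setup as above, the long-run frequency with which the 95% posterior credible interval covers the true parameter should approach the nominal 95%:

import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
theta0, a, b, n, reps = 0.3, 2.0, 2.0, 2000, 5000

# simulate many independent datasets and record the 95% equal-tailed
# credible interval of each Beta posterior
s = rng.binomial(n, theta0, size=reps)
posterior = stats.beta(a + s, b + n - s)
lo, hi = posterior.ppf(0.025), posterior.ppf(0.975)

# frequentist coverage of the Bayesian credible interval
coverage = np.mean((lo <= theta0) & (theta0 <= hi))
print(f"empirical coverage: {coverage:.3f}   (nominal: 0.950)")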


History

The theorem is named after Richard von Mises and S. N. Bernstein, although the first proper proof was given by Joseph L. Doob in 1949 for random variables with finite probability space. Later Lucien Le Cam, his PhD student Lorraine Schwartz, David A. Freedman and Persi Diaconis extended the proof under more general assumptions.


Limitations

In the case of a misspecified model, the posterior distribution will also become asymptotically Gaussian with a correct mean, but not necessarily with the Fisher information as the variance. This implies that Bayesian credible sets of level \alpha cannot be interpreted as confidence sets of level \alpha (a simulation illustrating this appears at the end of this section). In the case of nonparametric statistics, the Bernstein–von Mises theorem usually fails to hold, with the notable exception of the Dirichlet process.

A remarkable result was found by Freedman in 1965: the Bernstein–von Mises theorem fails almost surely if the random variable has a countably infinite probability space; however, this depends on allowing a very broad range of possible priors. In practice, the priors typically used in research do have the desirable property even on a countably infinite probability space.

Different summary statistics, such as the mode and mean, may behave differently in the posterior distribution. In Freedman's examples, the posterior density and its mean can converge to the wrong result, but the posterior mode is consistent and will converge to the correct result.
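
The misspecification issue described at the start of this section can be made concrete with a small simulation (an illustrative construction, not drawn from the literature): the analyst assumes unit-variance Gaussian observations with a flat prior on the mean, while the data actually have standard deviation 2. The posterior stays centered at a consistent estimate of the mean, but its spread is wrong, so the 95% credible interval undercovers badly:

import numpy as np

rng = np.random.default_rng(3)
theta0, sigma_true, n, reps = 1.0, 2.0, 500, 5000

# Assumed model: X_i ~ N(theta, 1) with a flat prior on theta, so the
# posterior is N(x_bar, 1/n).  True data: X_i ~ N(theta0, sigma_true^2).
x = rng.normal(theta0, sigma_true, size=(reps, n))
x_bar = x.mean(axis=1)
half_width = 1.96 / np.sqrt(n)        # credible half-width under the wrong model
coverage = np.mean(np.abs(x_bar - theta0) <= half_width)
print(f"empirical coverage of the 95% credible interval: {coverage:.3f}")

Here the nominal 95% interval covers the truth only about 67% of the time: the posterior mean is correct, but the true standard deviation of the sample mean is sigma_true/\sqrt{n}, twice what the misspecified posterior believes, so the coverage is roughly P(|Z| \leq 1.96/2) \approx 0.67.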

