In mathematics, Laplace's approximation fits an un-normalised Gaussian approximation to a (twice differentiable) un-normalised target density. In Bayesian statistical inference it is useful for simultaneously approximating the posterior and the marginal likelihood; see also approximate inference. The method works by matching the log density and the curvature at a mode of the target density.

For example, a (possibly non-linear) regression or classification model with data set \mathcal{D} = \{(x_n, y_n)\}_{n=1}^{N} comprising inputs x and outputs y has (unknown) parameter vector \theta of length D. The likelihood is denoted p(y \mid x, \theta) and the parameter prior p(\theta).
The joint density of outputs and parameters, p(y, \theta \mid x), is the object of inferential desire:

    p(y, \theta \mid x) = p(y \mid x, \theta)\, p(\theta) = p(y \mid x)\, p(\theta \mid x, y) \simeq \tilde q(\theta) = Z q(\theta).

The joint is equal to the product of the likelihood and the prior and, by Bayes' rule, also equal to the product of the marginal likelihood p(y \mid x) and the posterior p(\theta \mid x, y). Seen as a function of \theta, the joint is an un-normalised density.

In Laplace's approximation we approximate the joint by an un-normalised Gaussian \tilde q(\theta) = Z q(\theta), where we use q to denote the (normalised) approximate density, \tilde q the un-normalised density, and Z a constant independent of \theta. Since the marginal likelihood p(y \mid x) does not depend on the parameter \theta, and the posterior p(\theta \mid x, y) normalises over \theta, we can immediately identify them with Z and q(\theta) of our approximation, respectively. Laplace's approximation is

    p(y, \theta \mid x) \simeq p(y, \hat\theta \mid x) \exp\big( -\tfrac{1}{2} (\theta - \hat\theta)^\top S^{-1} (\theta - \hat\theta) \big) = \tilde q(\theta),

where we have defined

    \hat\theta = \operatorname{argmax}_\theta \log p(y, \theta \mid x),
    S^{-1} = -\nabla_\theta \nabla_\theta \log p(y, \theta \mid x) \big|_{\theta = \hat\theta},

where \hat\theta is the location of a mode of the joint target density, also known as the maximum a posteriori (MAP) point,
and S^{-1} is the D \times D positive definite matrix of second derivatives of the negative log joint target density at the mode \theta = \hat\theta. Thus, the Gaussian approximation matches the value and the curvature of the un-normalised target density at the mode; equivalently, it is the exponential of the second-order Taylor expansion of the log joint around \hat\theta, the linear term vanishing because the gradient is zero at a mode. The value of \hat\theta is usually found using a gradient-based method, e.g. Newton's method.
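To make this concrete, the following is a minimal sketch in Python for a one-dimensional toy problem; the target log_joint is invented for illustration (a Gaussian likelihood term plus a heavy-tailed prior term) and is not from the sources. It finds \hat\theta by Newton's method on a finite-difference gradient and reads off S from the curvature at the mode; in practice one would use analytic or automatic derivatives.

    import math

    # Invented un-normalised log joint log p(y, theta | x) for a 1-D toy model:
    # a Gaussian likelihood term centred at 2.0 plus a heavy-tailed prior term.
    # Any twice-differentiable un-normalised log density would do here.
    def log_joint(t):
        return -0.5 * (t - 2.0) ** 2 - math.log(1.0 + t * t)

    def d1(f, t, h=1e-5):   # first derivative, central differences
        return (f(t + h) - f(t - h)) / (2.0 * h)

    def d2(f, t, h=1e-4):   # second derivative, central differences
        return (f(t + h) - 2.0 * f(t) + f(t - h)) / (h * h)

    # Newton's method on the stationarity condition (d/dt) log_joint(t) = 0:
    # t <- t - f'(t) / f''(t), starting from an arbitrary initial point.
    t = 0.0
    for _ in range(100):
        step = d1(log_joint, t) / d2(log_joint, t)
        t -= step
        if abs(step) < 1e-9:
            break

    theta_hat = t                        # MAP point: a mode of the joint
    S = -1.0 / d2(log_joint, theta_hat)  # inverse curvature of the negative
                                         # log joint at the mode
    print(theta_hat, S)                  # both are approximately 1.0 here

For this particular target the mode can be checked analytically: the gradient (2 - t) - 2t/(1 + t^2) vanishes at t = 1, where the second derivative of the log joint is -1, so \hat\theta = 1 and S = 1.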
In summary, we have

    q(\theta) = \mathcal{N}(\theta \mid \mu = \hat\theta, \Sigma = S),
    \log Z = \log p(y, \hat\theta \mid x) + \tfrac{1}{2} \log |S| + \tfrac{D}{2} \log(2\pi),

for the approximate posterior over \theta and the approximate log marginal likelihood, respectively. In the special case of Bayesian linear regression with a Gaussian prior, the approximation is exact. The main weaknesses of Laplace's approximation are that it is symmetric around the mode and that it is very local: the entire approximation is derived from properties at a single point of the target density. Laplace's method is widely used; it was pioneered in the context of neural networks by David MacKay, and for Gaussian processes by Williams and Barber, see references.
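Continuing the toy example above, the sketch below evaluates the Laplace estimate of the log marginal likelihood and compares it with a brute-force quadrature of the un-normalised density; the gap between the two numbers reflects how local the approximation is. If log_joint were exactly quadratic in \theta, as it is for Bayesian linear regression with a Gaussian prior, the two values would coincide.

    import math

    def log_joint(t):  # same invented 1-D target as above
        return -0.5 * (t - 2.0) ** 2 - math.log(1.0 + t * t)

    theta_hat, S, D = 1.0, 1.0, 1  # mode and covariance found above; D = 1

    # Laplace estimate: log Z = log p(y, theta_hat | x)
    #                           + 1/2 log|S| + D/2 log(2 pi)
    log_Z_laplace = (log_joint(theta_hat)
                     + 0.5 * math.log(S)
                     + 0.5 * D * math.log(2.0 * math.pi))

    # Ground truth by naive quadrature: Z = integral of exp(log_joint(t)) dt
    lo, hi, n = -30.0, 30.0, 60001
    h = (hi - lo) / (n - 1)
    Z = h * sum(math.exp(log_joint(lo + i * h)) for i in range(n))

    print(log_Z_laplace, math.log(Z))  # about -0.27 vs -0.34: close but not
                                       # equal, since the target is non-Gaussian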


References

* MacKay, David J. C. (1992). "A Practical Bayesian Framework for Backpropagation Networks". Neural Computation. 4 (3): 448–472.
* Williams, Christopher K. I.; Barber, David (1998). "Bayesian classification with Gaussian processes". IEEE Transactions on Pattern Analysis and Machine Intelligence. 20 (12): 1342–1351.