In Bayesian probability, the Jeffreys prior, named after Sir Harold Jeffreys, is a non-informative (objective) prior distribution for a parameter space; its density function is proportional to the square root of the determinant of the Fisher information matrix:
: p\left(\vec\theta\right) \propto \sqrt{\det I\left(\vec\theta\right)}.\,
It has the key feature that it is invariant under a change of coordinates for the parameter vector \vec\theta. That is, the relative probability assigned to a volume of a probability space using a Jeffreys prior will be the same regardless of the parameterization used to define the Jeffreys prior. This makes it of special interest for use with ''scale parameters''.


Reparameterization


One-parameter case

If \theta and \varphi are two possible parametrizations of a statistical model, and \theta is a continuously differentiable function of \varphi, we say that the prior p_\theta(\theta) is "invariant" under a reparametrization if
:p_\varphi(\varphi) = p_\theta(\theta) \left| \frac{d\theta}{d\varphi} \right|,
that is, if the priors p_\theta(\theta) and p_\varphi(\varphi) are related by the usual change of variables theorem. Since the Fisher information transforms under reparametrization as
:I_\varphi(\varphi) = I_\theta(\theta) \left( \frac{d\theta}{d\varphi} \right)^2,
defining the priors as p_\varphi(\varphi) \propto \sqrt{I_\varphi(\varphi)} and p_\theta(\theta) \propto \sqrt{I_\theta(\theta)} gives us the desired "invariance".
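The transformation rule can be checked numerically. The following is a minimal sketch, assuming a Poisson model with rate \lambda and the illustrative reparametrization \lambda = \varphi^2; the function names and the truncation point `n_max` are hypothetical choices made for this example only.

```python
import math

def poisson_pmf(n, lam):
    # Evaluated in log space so that large n does not overflow.
    return math.exp(-lam + n * math.log(lam) - math.lgamma(n + 1))

def fisher_info_lambda(lam, n_max=200):
    # I_lambda = E[(d/d lambda log f)^2]; the score is n/lambda - 1.
    return sum(poisson_pmf(n, lam) * (n / lam - 1.0) ** 2 for n in range(n_max))

def fisher_info_phi(phi, n_max=200):
    # Same model written in terms of phi, with lambda = phi**2.
    # log f = -phi^2 + 2 n log(phi) - log(n!), so the score is -2 phi + 2 n / phi.
    lam = phi ** 2
    return sum(poisson_pmf(n, lam) * (-2.0 * phi + 2.0 * n / phi) ** 2
               for n in range(n_max))

phi = 1.7
lam = phi ** 2
dlam_dphi = 2.0 * phi

# Fisher information computed directly in phi ...
lhs = fisher_info_phi(phi)
# ... and via the transformation rule I_phi = I_lambda * (d lambda / d phi)^2.
rhs = fisher_info_lambda(lam) * dlam_dphi ** 2

# The two square-root priors are then related by the change-of-variables factor.
prior_phi = math.sqrt(lhs)
prior_lam_times_jacobian = math.sqrt(fisher_info_lambda(lam)) * abs(dlam_dphi)

print(lhs, rhs, prior_phi, prior_lam_times_jacobian)
```

Both routes give the same Fisher information in \varphi, so the prior defined as its square root satisfies the change-of-variables relation without any extra adjustment.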


Multiple-parameter case

Analogous to the one-parameter case, let \vec\theta and \vec\varphi be two possible parametrizations of a statistical model, with \vec\theta a continuously differentiable function of \vec\varphi. We call the prior p_\theta(\vec\theta) "invariant" under reparametrization if
:p_\varphi(\vec\varphi) = p_\theta(\vec\theta) \left| \det J \right|,
where J is the Jacobian matrix with entries
:J_{ij} = \frac{\partial \theta_i}{\partial \varphi_j}.
Since the Fisher information matrix transforms under reparametrization as
:I_\varphi(\vec\varphi) = J^T I_\theta(\vec\theta) J,
we have that
:\det I_\varphi(\vec\varphi) = \det I_\theta(\vec\theta) (\det J)^2
and thus defining the priors as p_\varphi(\vec\varphi) \propto \sqrt{\det I_\varphi(\vec\varphi)} and p_\theta(\vec\theta) \propto \sqrt{\det I_\theta(\vec\theta)} gives us the desired "invariance".
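The determinant identity can be illustrated with a small numerical sketch. It assumes a Gaussian model with \vec\theta = (\mu, \sigma), for which the Fisher information matrix is taken to be the standard diagonal matrix with entries 1/\sigma^2 and 2/\sigma^2, and a hypothetical reparametrization \vec\varphi = (\nu, \tau) with \mu = \nu e^\tau and \sigma = e^\tau. The helper functions and the chosen numbers are illustrative only.

```python
import math

def matmul(a, b):
    # Plain nested-list matrix product.
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transpose(a):
    return [list(row) for row in zip(*a)]

def det2(a):
    # Determinant of a 2x2 matrix.
    return a[0][0] * a[1][1] - a[0][1] * a[1][0]

nu, tau = 1.3, math.log(0.8)   # arbitrary point in the phi parametrization
sigma = math.exp(tau)

# Assumed Fisher information for theta = (mu, sigma) of a Gaussian.
I_theta = [[1.0 / sigma ** 2, 0.0],
           [0.0, 2.0 / sigma ** 2]]

# Jacobian J_ij = d theta_i / d phi_j for mu = nu * exp(tau), sigma = exp(tau).
J = [[sigma, nu * sigma],
     [0.0, sigma]]

# Transformation rule I_phi = J^T I_theta J.
I_phi = matmul(matmul(transpose(J), I_theta), J)

lhs = det2(I_phi)
rhs = det2(I_theta) * det2(J) ** 2

# Square-root priors in each parametrization, related by |det J|.
prior_phi = math.sqrt(lhs)
prior_theta_times_jacobian = math.sqrt(det2(I_theta)) * abs(det2(J))

print(lhs, rhs, prior_phi, prior_theta_times_jacobian)
```

A non-diagonal Jacobian was chosen on purpose, so that the identity is not satisfied trivially entry by entry.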


Attributes

From a practical and mathematical standpoint, a valid reason to use this non-informative prior instead of others, like the ones obtained through a limit in conjugate families of distributions, is that the relative probability of a volume of the probability space does not depend on the set of parameter variables chosen to describe the parameter space.

Sometimes the Jeffreys prior cannot be normalized, and is thus an improper prior. For example, the Jeffreys prior for the distribution mean is uniform over the entire real line in the case of a Gaussian distribution of known variance.

Use of the Jeffreys prior violates the strong version of the likelihood principle, which is accepted by many, but by no means all, statisticians. When using the Jeffreys prior, inferences about \vec\theta depend not just on the probability of the observed data as a function of \vec\theta, but also on the universe of all possible experimental outcomes, as determined by the experimental design, because the Fisher information is computed from an expectation over the chosen universe. Accordingly, the Jeffreys prior, and hence the inferences made using it, may be different for two experiments involving the same \vec\theta parameter even when the likelihood functions for the two experiments are the same, which is a violation of the strong likelihood principle.
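A common textbook illustration of this dependence on the design, not taken from the text above, compares two ways of observing the same coin: fixing the number of tosses in advance (binomial sampling) versus tossing until a fixed number of heads is seen (negative binomial sampling). The sketch below assumes the standard Fisher information formulas for these two designs and checks the second one numerically; all function names and the example counts are hypothetical.

```python
import math

def likelihood_kernel(gamma, heads, tails):
    # Shared by both designs up to a constant that does not involve gamma.
    return gamma ** heads * (1.0 - gamma) ** tails

def jeffreys_binomial(gamma, n):
    # Fixed number of tosses n: I(gamma) = n / (gamma (1 - gamma)).
    return math.sqrt(n / (gamma * (1.0 - gamma)))

def jeffreys_negative_binomial(gamma, r):
    # Toss until r heads: I(gamma) = r / (gamma^2 (1 - gamma)).
    return math.sqrt(r / (gamma ** 2 * (1.0 - gamma)))

def negbin_fisher_numeric(gamma, r, k_max=2000):
    # Direct check of the negative binomial formula: sum pmf * score^2
    # over the number of tails k seen before the r-th head.
    total = 0.0
    for k in range(k_max):
        log_pmf = (math.lgamma(k + r) - math.lgamma(k + 1) - math.lgamma(r)
                   + r * math.log(gamma) + k * math.log(1.0 - gamma))
        score = r / gamma - k / (1.0 - gamma)
        total += math.exp(log_pmf) * score ** 2
    return total

# Both experiments end with 3 heads and 9 tails, so the likelihoods agree.
g1, g2 = 0.2, 0.5
likelihood_ratio = likelihood_kernel(g1, 3, 9) / likelihood_kernel(g2, 3, 9)

# The Jeffreys priors weigh the same two parameter values differently.
ratio_binomial = jeffreys_binomial(g1, 12) / jeffreys_binomial(g2, 12)
ratio_negbin = jeffreys_negative_binomial(g1, 3) / jeffreys_negative_binomial(g2, 3)

print(likelihood_ratio, ratio_binomial, ratio_negbin)
```

Because the likelihood ratio is identical in the two designs while the prior ratio is not, the resulting posteriors differ, which is exactly the behaviour the strong likelihood principle rules out.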


Minimum description length

In the minimum description length approach to statistics, the goal is to describe data as compactly as possible, where the length of a description is measured in bits of the code used. For a parametric family of distributions, one compares a code with the best code based on one of the distributions in the parameterized family. The main result is that in exponential families, asymptotically for large sample size, the code based on the distribution that is a mixture of the elements in the exponential family with the Jeffreys prior is optimal. This result holds if one restricts the parameter set to a compact subset in the interior of the full parameter space. If the full parameter space is used, a modified version of the result should be used.


Examples

The Jeffreys prior for a parameter (or a set of parameters) depends upon the statistical model.


Gaussian distribution with mean parameter

For the Gaussian distribution of the real value x
: f(x\mid\mu) = \frac{e^{-(x - \mu)^2 / 2\sigma^2}}{\sqrt{2 \pi \sigma^2}}
with \sigma fixed, the Jeffreys prior for the mean \mu is
: \begin{align} p(\mu) & \propto \sqrt{I(\mu)} = \sqrt{\operatorname{E}\left[ \left( \frac{d}{d\mu} \log f(x\mid\mu) \right)^2 \right]} = \sqrt{\operatorname{E}\left[ \left( \frac{x - \mu}{\sigma^2} \right)^2 \right]} \\ & = \sqrt{\int_{-\infty}^{+\infty} f(x\mid\mu) \left( \frac{x - \mu}{\sigma^2} \right)^2 dx} = \sqrt{1/\sigma^2} \propto 1.\end{align}
That is, the Jeffreys prior for \mu does not depend upon \mu; it is the unnormalized uniform distribution on the real line, the distribution that is 1 (or some other fixed constant) for all points. This is an improper prior, and is, up to the choice of constant, the unique ''translation''-invariant distribution on the reals (the Haar measure with respect to addition of reals), corresponding to the mean being a measure of ''location'' and translation-invariance corresponding to no information about location.
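The integral in this derivation can be evaluated numerically to confirm that it does not depend on \mu. The following is a rough sketch using a midpoint rule; the integration range (12 standard deviations either side of the mean), the step count, and the function names are assumptions made for the example.

```python
import math

def gaussian_pdf(x, mu, sigma):
    return (math.exp(-(x - mu) ** 2 / (2.0 * sigma ** 2))
            / math.sqrt(2.0 * math.pi * sigma ** 2))

def fisher_info_mean(mu, sigma, half_width=12.0, steps=20000):
    # Midpoint rule for the integral of f(x | mu) * ((x - mu) / sigma^2)^2
    # over [mu - half_width * sigma, mu + half_width * sigma].
    a = mu - half_width * sigma
    h = 2.0 * half_width * sigma / steps
    total = 0.0
    for i in range(steps):
        x = a + (i + 0.5) * h
        score = (x - mu) / sigma ** 2
        total += gaussian_pdf(x, mu, sigma) * score ** 2 * h
    return total

sigma = 1.5
values = [fisher_info_mean(mu, sigma) for mu in (-40.0, 0.0, 3.0, 250.0)]

# Each entry should be close to 1 / sigma^2, whatever the value of mu.
print(values, 1.0 / sigma ** 2)
```

Since the Fisher information is the same constant at every \mu, its square root is a flat, unnormalizable density on the real line.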


Gaussian distribution with standard deviation parameter

For the Gaussian distribution of the real value x
: f(x\mid\sigma) = \frac{e^{-(x - \mu)^2 / 2 \sigma^2}}{\sqrt{2 \pi \sigma^2}},
with \mu fixed, the Jeffreys prior for the standard deviation \sigma > 0 is
: \begin{align} p(\sigma) & \propto \sqrt{I(\sigma)} = \sqrt{\operatorname{E}\left[ \left( \frac{d}{d\sigma} \log f(x\mid\sigma) \right)^2 \right]} = \sqrt{\operatorname{E}\left[ \left( \frac{(x - \mu)^2 - \sigma^2}{\sigma^3} \right)^2 \right]} \\ & = \sqrt{\int_{-\infty}^{+\infty} f(x\mid\sigma) \left( \frac{(x - \mu)^2 - \sigma^2}{\sigma^3} \right)^2 dx} = \sqrt{\frac{2}{\sigma^2}} \propto \frac{1}{\sigma}. \end{align}
Equivalently, the Jeffreys prior for \log \sigma = \int d\sigma/\sigma is the unnormalized uniform distribution on the real line, and thus this distribution is also known as the logarithmic prior. Similarly, the Jeffreys prior for \log \sigma^2 = 2 \log \sigma is also uniform. It is the unique (up to a multiple) prior (on the positive reals) that is ''scale''-invariant (the Haar measure with respect to multiplication of positive reals), corresponding to the standard deviation being a measure of ''scale'' and scale-invariance corresponding to no information about scale. As with the uniform distribution on the reals, it is an improper prior.
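Scale invariance has a concrete reading: under the unnormalized density 1/\sigma, an interval [a, b] carries the same weight as the rescaled interval [ca, cb] for any c > 0, because both integrate to \log(b/a). The sketch below checks this with a simple midpoint rule; the function name, the step count, and the chosen intervals are illustrative assumptions.

```python
import math

def prior_mass(a, b, steps=20000):
    # Midpoint-rule approximation of the integral of 1/sigma over [a, b].
    h = (b - a) / steps
    return sum(h / (a + (i + 0.5) * h) for i in range(steps))

# The same interval expressed in three different units (c = 1, 10, 0.001).
mass_unit = prior_mass(1.0, 2.0)
mass_scaled_up = prior_mass(10.0, 20.0)
mass_scaled_down = prior_mass(0.001, 0.002)

# For comparison, a flat prior on sigma would weight these by interval length.
flat_unit = 2.0 - 1.0
flat_scaled_up = 20.0 - 10.0

print(mass_unit, mass_scaled_up, mass_scaled_down, math.log(2.0))
```

All three masses agree with \log 2, whereas the flat-prior weights change with the unit of measurement.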


Poisson distribution with rate parameter

For the Poisson distribution of the non-negative integer n,
: f(n \mid \lambda) = e^{-\lambda} \frac{\lambda^n}{n!},
the Jeffreys prior for the rate parameter \lambda \ge 0 is
: \begin{align} p(\lambda) & \propto \sqrt{I(\lambda)} = \sqrt{\operatorname{E}\left[ \left( \frac{d}{d\lambda} \log f(n \mid \lambda) \right)^2 \right]} = \sqrt{\operatorname{E}\left[ \left( \frac{n - \lambda}{\lambda} \right)^2 \right]} \\ & = \sqrt{\sum_{n=0}^{+\infty} f(n \mid \lambda) \left( \frac{n - \lambda}{\lambda} \right)^2} = \sqrt{\frac{1}{\lambda}}.\end{align}
Equivalently, the Jeffreys prior for \sqrt\lambda = \int d\lambda/\sqrt\lambda is the unnormalized uniform distribution on the non-negative real line.
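The final equality in the derivation rests on the variance of a Poisson variable being equal to its mean. Spelled out, with \varphi = \sqrt\lambda introduced here only as a label for the transformed parameter:

```latex
\begin{align}
\operatorname{E}\left[ \left( \frac{n - \lambda}{\lambda} \right)^2 \right]
  &= \frac{\operatorname{E}\left[ (n - \lambda)^2 \right]}{\lambda^2}
   = \frac{\operatorname{Var}(n)}{\lambda^2}
   = \frac{\lambda}{\lambda^2}
   = \frac{1}{\lambda}, \\
p_\varphi(\varphi)
  &= p_\lambda(\lambda) \left| \frac{d\lambda}{d\varphi} \right|
   \propto \frac{1}{\sqrt\lambda} \cdot 2\varphi
   = \frac{1}{\varphi} \cdot 2\varphi
   = 2 .
\end{align}
```

The second line is the change-of-variables step behind the statement that the prior is uniform in \sqrt\lambda.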


Bernoulli trial

For a coin that is "heads" with probability \gamma \in [0,1] and is "tails" with probability 1 - \gamma, for a given (H,T) \in \{(0,1), (1,0)\} the probability is \gamma^H (1-\gamma)^T. The Jeffreys prior for the parameter \gamma is
: \begin{align} p(\gamma) & \propto \sqrt{I(\gamma)} = \sqrt{\operatorname{E}\left[ \left( \frac{d}{d\gamma} \log f(x \mid \gamma) \right)^2 \right]} = \sqrt{\operatorname{E}\left[ \left( \frac{H}{\gamma} - \frac{T}{1 - \gamma} \right)^2 \right]} \\ & = \sqrt{\gamma \left( \frac{1}{\gamma} - \frac{0}{1 - \gamma} \right)^2 + (1 - \gamma) \left( \frac{0}{\gamma} - \frac{1}{1 - \gamma} \right)^2} = \frac{1}{\sqrt{\gamma (1 - \gamma)}}\,.\end{align}
This is the arcsine distribution and is a beta distribution with \alpha = \beta = 1/2. Furthermore, if \gamma = \sin^2(\theta) then
:\Pr[\theta] = \Pr[\gamma] \frac{d\gamma}{d\theta} \propto \frac{1}{\sqrt{(\sin^2 \theta)(1 - \sin^2 \theta)}} ~2 \sin \theta \cos \theta = 2\,.
That is, the Jeffreys prior for \theta is uniform in the interval [0, \pi / 2]. Equivalently, \theta is uniform on the whole circle [0, 2 \pi].
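The change of variables to \theta is easy to verify numerically. The sketch below evaluates the unnormalized density of \theta at a few arbitrary points in (0, \pi/2); the function names and sample points are hypothetical.

```python
import math

def jeffreys_gamma(gamma):
    # Unnormalized Jeffreys density for the heads probability gamma.
    return 1.0 / math.sqrt(gamma * (1.0 - gamma))

def jeffreys_theta(theta):
    # Density of theta obtained from gamma = sin^2(theta) by the
    # change-of-variables rule p(theta) = p(gamma) * |d gamma / d theta|.
    gamma = math.sin(theta) ** 2
    dgamma_dtheta = 2.0 * math.sin(theta) * math.cos(theta)
    return jeffreys_gamma(gamma) * abs(dgamma_dtheta)

thetas = [0.1, 0.5, 1.0, 1.5]
densities = [jeffreys_theta(t) for t in thetas]

# Every value should be (numerically) the same constant, namely 2.
print(densities)
```

A density that takes the same value at every \theta in the interval is the uniform prior described in the text.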


''N''-sided die with biased probabilities

Similarly, for a throw of an N-sided die with outcome probabilities \vec\gamma = (\gamma_1, \ldots, \gamma_N), each non-negative and satisfying \sum_{i=1}^N \gamma_i = 1, the Jeffreys prior for \vec\gamma is the Dirichlet distribution with all (alpha) parameters set to one half. This amounts to using a pseudocount of one half for each possible outcome. Equivalently, if we write \gamma_i = \varphi_i^2 for each i, then the Jeffreys prior for \vec\varphi is uniform on the (''N'' − 1)-dimensional unit sphere (''i.e.'', it is uniform on the surface of an ''N''-dimensional unit ball).
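The pseudocount reading can be made concrete. With a Dirichlet prior whose parameters are all one half and observed counts n_i, the posterior is again Dirichlet with parameters n_i + 1/2, so its mean adds one half to every count before normalizing. The sketch below is a minimal illustration; the function name and the tallies for a six-sided die are made up for the example.

```python
def jeffreys_posterior_mean(counts):
    # Posterior mean under a Dirichlet(1/2, ..., 1/2) prior:
    # (n_i + 1/2) / (n + N/2), where n is the total count and N the number of sides.
    total = sum(counts) + 0.5 * len(counts)
    return [(c + 0.5) / total for c in counts]

# Hypothetical tallies from 20 throws of a six-sided die.
counts = [3, 0, 7, 1, 0, 9]
estimate = jeffreys_posterior_mean(counts)

# Faces that were never observed still receive a small positive probability.
print(estimate)
```

Compared with raw relative frequencies, the half pseudocount keeps unobserved outcomes from being assigned probability zero.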


Further reading

* Lee, Peter M. (2012). "Jeffreys' rule". ''Bayesian Statistics: An Introduction'' (4th ed.). Wiley. pp. 96–102. ISBN 978-1-118-33257-3.