In mathematical statistics, the Fisher information is a way of measuring the amount of information that an observable random variable ''X'' carries about an unknown parameter ''θ'' of a distribution that models ''X''. Formally, it is the variance of the score, or the expected value of the observed information.

The role of the Fisher information in the asymptotic theory of maximum-likelihood estimation was emphasized and explored by the statistician Sir Ronald Fisher (following some initial results by Francis Ysidro Edgeworth). The Fisher information matrix is used to calculate the covariance matrices associated with maximum-likelihood estimates. It can also be used in the formulation of test statistics, such as the Wald test.

In Bayesian statistics, the Fisher information plays a role in the derivation of non-informative prior distributions according to Jeffreys' rule. Its inverse also appears as the large-sample covariance of the posterior distribution, provided that the prior is sufficiently smooth (a result known as the Bernstein–von Mises theorem, which was anticipated by Laplace for exponential families). The same result is used when approximating the posterior with Laplace's approximation, where the inverse of the Fisher information appears as the covariance of the fitted Gaussian.

Statistical systems of a scientific nature (physical, biological, etc.) whose likelihood functions obey shift invariance have been shown to obey maximum Fisher information. The level of the maximum depends upon the nature of the system constraints.


Definition

The Fisher information is a way of measuring the amount of information that an observable random variable ''X'' carries about an unknown parameter \theta upon which the probability of ''X'' depends. Let f(X;\theta) be the probability density function (or probability mass function) for ''X'' conditioned on the value of \theta. It describes the probability that we observe a given outcome of ''X'', ''given'' a known value of \theta. If f is sharply peaked with respect to changes in \theta, it is easy to indicate the "correct" value of \theta from the data, or equivalently, the data ''X'' provides a lot of information about the parameter \theta. If f is flat and spread-out, then it would take many samples of ''X'' to estimate the actual "true" value of \theta that ''would'' be obtained using the entire population being sampled. This suggests studying some kind of variance with respect to \theta.

Formally, the partial derivative with respect to \theta of the natural logarithm of the likelihood function is called the ''score''. Under certain regularity conditions, if \theta is the true parameter (i.e. ''X'' is actually distributed as f(X;\theta)), it can be shown that the expected value (the first moment) of the score, evaluated at the true parameter value \theta, is 0:

:\begin{align} \operatorname{E}\left[\left.\frac{\partial}{\partial\theta} \log f(X;\theta)\,\right|\,\theta \right] &= \int_{\mathbb{R}} \frac{\frac{\partial}{\partial\theta} f(x;\theta)}{f(x;\theta)}\, f(x;\theta)\,dx \\ &= \frac{\partial}{\partial\theta} \int_{\mathbb{R}} f(x;\theta)\,dx \\ &= \frac{\partial}{\partial\theta}\, 1 \\ &= 0. \end{align}

The Fisher information is defined to be the variance of the score:

:\mathcal{I}(\theta) = \operatorname{E}\left[\left.\left(\frac{\partial}{\partial\theta} \log f(X;\theta)\right)^2\,\right|\,\theta\right] = \int_{\mathbb{R}} \left(\frac{\partial}{\partial\theta} \log f(x;\theta)\right)^2 f(x;\theta)\,dx.

Note that \mathcal{I}(\theta) \geq 0. A random variable carrying high Fisher information implies that the absolute value of the score is often high. The Fisher information is not a function of a particular observation, as the random variable ''X'' has been averaged out.

If \log f(X;\theta) is twice differentiable with respect to ''θ'', and under certain additional regularity conditions, then the Fisher information may also be written as

:\mathcal{I}(\theta) = -\operatorname{E}\left[\left.\frac{\partial^2}{\partial\theta^2} \log f(X;\theta)\,\right|\,\theta\right].

To see this, begin by taking the second derivative of \log f(X;\theta):

:\frac{\partial^2}{\partial\theta^2} \log f(X;\theta) = \frac{\frac{\partial^2}{\partial\theta^2} f(X;\theta)}{f(X;\theta)} - \left(\frac{\frac{\partial}{\partial\theta} f(X;\theta)}{f(X;\theta)}\right)^2 = \frac{\frac{\partial^2}{\partial\theta^2} f(X;\theta)}{f(X;\theta)} - \left(\frac{\partial}{\partial\theta} \log f(X;\theta)\right)^2.

Now take the expected value:

:\begin{align} \operatorname{E}\left[\left.\frac{\partial^2}{\partial\theta^2} \log f(X;\theta)\,\right|\,\theta\right] &= \operatorname{E}\left[\left.\frac{\frac{\partial^2}{\partial\theta^2} f(X;\theta)}{f(X;\theta)}\,\right|\,\theta\right] - \operatorname{E}\left[\left.\left(\frac{\partial}{\partial\theta} \log f(X;\theta)\right)^2\,\right|\,\theta\right] \\ &= \operatorname{E}\left[\left.\frac{\frac{\partial^2}{\partial\theta^2} f(X;\theta)}{f(X;\theta)}\,\right|\,\theta\right] - \mathcal{I}(\theta), \end{align}

so that

:\mathcal{I}(\theta) = -\operatorname{E}\left[\left.\frac{\partial^2}{\partial\theta^2} \log f(X;\theta)\,\right|\,\theta\right] + \operatorname{E}\left[\left.\frac{\frac{\partial^2}{\partial\theta^2} f(X;\theta)}{f(X;\theta)}\,\right|\,\theta\right].

Next, we show that the last term is equal to 0:

:\operatorname{E}\left[\left.\frac{\frac{\partial^2}{\partial\theta^2} f(X;\theta)}{f(X;\theta)}\,\right|\,\theta\right] = \int_{\mathbb{R}} f(x;\theta)\, \frac{\frac{\partial^2}{\partial\theta^2} f(x;\theta)}{f(x;\theta)}\,dx = \frac{\partial^2}{\partial\theta^2} \int_{\mathbb{R}} f(x;\theta)\,dx = \frac{\partial^2}{\partial\theta^2} (1) = 0.

Therefore,

:\mathcal{I}(\theta) = -\operatorname{E}\left[\left.\frac{\partial^2}{\partial\theta^2} \log f(X;\theta)\,\right|\,\theta\right].

Thus, the Fisher information may be seen as the curvature of the support curve (the graph of the log-likelihood). Near the maximum likelihood estimate, low Fisher information indicates that the maximum appears "blunt", that is, there are many points in the neighborhood that provide a similar log-likelihood. Conversely, high Fisher information indicates that the maximum is sharp.
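
The two characterizations above can be checked numerically. The following minimal sketch (an example of this editor's choosing, not part of the standard exposition) uses a unit-variance normal location model, for which the score is X − θ and the Fisher information equals 1:

```python
# Minimal sketch, assuming X ~ Normal(theta, 1): the score has mean 0, its variance is
# the Fisher information I(theta) = 1, and -E[second derivative of log f] gives the same value.
import numpy as np

rng = np.random.default_rng(0)
theta, reps = 1.5, 1_000_000

x = rng.normal(theta, 1.0, size=reps)
score = x - theta                          # d/dtheta log f(x; theta) for the unit-variance normal
second_derivative = -np.ones_like(x)       # d^2/dtheta^2 log f(x; theta) = -1 for every x

print(score.mean())                        # approximately 0: the score has zero expectation
print(score.var(), -second_derivative.mean())  # both approximately 1 = I(theta)
```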


Regularity conditions

The regularity conditions are as follows:
# The partial derivative of ''f''(''X''; ''θ'') with respect to ''θ'' exists almost everywhere. (It can fail to exist on a null set, as long as this set does not depend on ''θ''.)
# The integral of ''f''(''X''; ''θ'') can be differentiated under the integral sign with respect to ''θ''.
# The support of ''f''(''X''; ''θ'') does not depend on ''θ''.

If ''θ'' is a vector then the regularity conditions must hold for every component of ''θ''. It is easy to find an example of a density that does not satisfy the regularity conditions: the density of a Uniform(0, ''θ'') variable fails to satisfy conditions 1 and 3. In this case, even though the Fisher information can be computed from the definition, it will not have the properties it is typically assumed to have.


In terms of likelihood

Because the likelihood of ''θ'' given ''X'' is always proportional to the probability ''f''(''X''; ''θ''), their logarithms necessarily differ by a constant that is independent of ''θ'', and the derivatives of these logarithms with respect to ''θ'' are necessarily equal. Thus one can substitute a log-likelihood ''l''(''θ''; ''X'') for \log f(X;\theta) in the definitions of Fisher information.


Samples of any size

The value ''X'' can represent a single sample drawn from a single distribution or can represent a collection of samples drawn from a collection of distributions. If there are ''n'' samples and the corresponding ''n'' distributions are statistically independent, then the Fisher information will necessarily be the sum of the single-sample Fisher information values, one for each single sample from its distribution. In particular, if the ''n'' distributions are independent and identically distributed, then the Fisher information will necessarily be ''n'' times the Fisher information of a single sample from the common distribution. Stated in other words, the Fisher information of i.i.d. observations of a sample of size ''n'' from a population is equal to the product of ''n'' and the Fisher information of a single observation from the same population.
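
A minimal numerical sketch of this additivity (the normal model with known variance is an assumption of this example, chosen because its single-sample information is simply 1/σ²):

```python
# Minimal sketch, assuming n i.i.d. draws from Normal(theta, sigma^2) with known sigma:
# the variance of the joint score should be approximately n times I_1(theta) = n / sigma^2.
import numpy as np

rng = np.random.default_rng(1)
theta, sigma, n, reps = 2.0, 1.5, 10, 200_000

def score_n(x, theta, sigma):
    # d/dtheta of the joint log-likelihood of an i.i.d. sample x under Normal(theta, sigma^2)
    return np.sum(x - theta, axis=-1) / sigma**2

x = rng.normal(theta, sigma, size=(reps, n))
empirical_info = score_n(x, theta, sigma).var()   # variance of the joint score
print(empirical_info, n / sigma**2)               # both approximately n * I_1(theta)
```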


Informal derivation of the Cramér–Rao bound

The Cramér–Rao bound states that the inverse of the Fisher information is a lower bound on the variance of any unbiased estimator of ''θ''. The following informal argument gives one way of deriving the Cramér–Rao bound and illustrates the use of the Fisher information.

Informally, we begin by considering an unbiased estimator \hat\theta(X). Mathematically, "unbiased" means that

: \operatorname{E}\left[\left.\hat\theta(X) - \theta\,\right|\,\theta\right] = \int \left(\hat\theta(x) - \theta\right)\, f(x;\theta)\,dx = 0 \text{ regardless of the value of } \theta.

This expression is zero independent of ''θ'', so its partial derivative with respect to ''θ'' must also be zero. By the product rule, this partial derivative is also equal to

: 0 = \frac{\partial}{\partial\theta} \int \left(\hat\theta(x) - \theta\right)\, f(x;\theta)\,dx = \int \left(\hat\theta(x) - \theta\right) \frac{\partial f}{\partial\theta}\,dx - \int f\,dx.

For each ''θ'', the likelihood function is a probability density function, and therefore \int f\,dx = 1. By using the chain rule on the partial derivative of \log f and then dividing and multiplying by f(x;\theta), one can verify that

: \frac{\partial f}{\partial\theta} = f\, \frac{\partial \log f}{\partial\theta}.

Using these two facts in the above, we get

: \int \left(\hat\theta - \theta\right) f\, \frac{\partial \log f}{\partial\theta}\,dx = 1.

Factoring the integrand gives

: \int \left(\left(\hat\theta - \theta\right)\sqrt{f}\right)\left(\sqrt{f}\, \frac{\partial \log f}{\partial\theta}\right)\,dx = 1.

Squaring the expression in the integral, the Cauchy–Schwarz inequality yields

: 1 = \biggl(\int \left[\left(\hat\theta - \theta\right)\sqrt{f}\right] \cdot \left[\sqrt{f}\, \frac{\partial \log f}{\partial\theta}\right]\,dx\biggr)^2 \le \left[\int \left(\hat\theta - \theta\right)^2 f\,dx\right] \cdot \left[\int \left(\frac{\partial \log f}{\partial\theta}\right)^2 f\,dx\right].

The second bracketed factor is defined to be the Fisher information, while the first bracketed factor is the mean-squared error (MSE) of the estimator \hat\theta. Since the estimator is unbiased, its MSE equals its variance. By rearranging, the inequality tells us that

: \operatorname{Var}\left(\hat\theta\right) \geq \frac{1}{\mathcal{I}(\theta)}.

In other words, the precision to which we can estimate ''θ'' is fundamentally limited by the Fisher information of the likelihood function.

Alternatively, the same conclusion can be obtained directly from the Cauchy–Schwarz inequality for random variables, |\operatorname{Cov}(A,B)|^2 \le \operatorname{Var}(A)\,\operatorname{Var}(B), applied to the random variables \hat\theta(X) and \partial_\theta \log f(X;\theta), observing that for unbiased estimators

: \operatorname{Cov}\left[\hat\theta(X),\, \partial_\theta \log f(X;\theta)\right] = \int \hat\theta(x)\, \partial_\theta f(x;\theta)\,dx = \partial_\theta \operatorname{E}\left[\hat\theta\right] = 1.
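
A minimal Monte Carlo sketch of the bound (the Poisson model below is an assumption of this example; its single-observation information 1/θ and the unbiasedness of the sample mean are standard facts):

```python
# Minimal sketch, assuming n i.i.d. Poisson(theta) observations. The sample mean is an
# unbiased estimator of theta, I_1(theta) = 1/theta, so the Cramér–Rao bound is theta/n.
import numpy as np

rng = np.random.default_rng(2)
theta, n, reps = 3.0, 25, 100_000

samples = rng.poisson(theta, size=(reps, n))
estimates = samples.mean(axis=1)          # unbiased estimator of theta
crb = theta / n                           # 1 / (n * I_1(theta))
print(estimates.var(), ">=", crb)         # here the bound is attained (up to sampling noise)
```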


Examples


Single-parameter Bernoulli experiment

A Bernoulli trial is a random variable with two possible outcomes, 0 and 1, with 1 having a probability of ''θ''. The outcome can be thought of as determined by the toss of a biased coin, with the probability of heads (1) being ''θ'' and the probability of tails (0) being 1 − ''θ''. Let ''X'' be a Bernoulli trial of one sample from the distribution. The Fisher information contained in ''X'' may be calculated to be:

:\begin{align} \mathcal{I}(\theta) &= -\operatorname{E}\left[\left.\frac{\partial^2}{\partial\theta^2} \log\left(\theta^X (1-\theta)^{1-X}\right)\,\right|\,\theta\right] \\ &= -\operatorname{E}\left[\left.\frac{\partial^2}{\partial\theta^2} \left(X\log\theta + (1-X)\log(1-\theta)\right)\,\right|\,\theta\right] \\ &= \operatorname{E}\left[\left.\frac{X}{\theta^2} + \frac{1-X}{(1-\theta)^2}\,\right|\,\theta\right] \\ &= \frac{\theta}{\theta^2} + \frac{1-\theta}{(1-\theta)^2} \\ &= \frac{1}{\theta(1-\theta)}. \end{align}

Because Fisher information is additive, the Fisher information contained in ''n'' independent Bernoulli trials is therefore

:\mathcal{I}(\theta) = \frac{n}{\theta(1-\theta)}.

If x_i is one of the 2^n possible outcomes of ''n'' independent Bernoulli trials and x_{ij} is the ''j''th outcome of the ''i''th trial, then the probability of x_i is given by

:p(x_i,\theta) = \prod_{j=1}^n \theta^{x_{ij}} (1-\theta)^{1-x_{ij}}.

The sample mean of the ''i''th trial is \mu_i = (1/n)\sum_{j=1}^n x_{ij}. The expected value of the sample mean (over the sampling distribution) is

:E(\mu) = \sum_i \mu_i\, p(x_i,\theta) = \theta,

where the sum is over all 2^n possible trial outcomes. The expected value of the square of the sample mean is

:E(\mu^2) = \sum_i \mu_i^2\, p(x_i,\theta) = \frac{\theta\bigl((n-1)\theta + 1\bigr)}{n},

so the variance in the value of the mean is

:E(\mu^2) - E(\mu)^2 = \frac{\theta(1-\theta)}{n}.

It is seen that the Fisher information is the reciprocal of the variance of the mean number of successes in ''n'' Bernoulli trials. This is generally true. In this case, the Cramér–Rao bound is an equality.
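
A minimal numerical check of the single-trial result (the particular value of θ below is an arbitrary choice of this sketch):

```python
# Minimal sketch, assuming a single Bernoulli(theta) trial: compare the analytic value
# I(theta) = 1/(theta(1-theta)) with the empirical variance of the score
# d/dtheta log f(X; theta) = X/theta - (1-X)/(1-theta).
import numpy as np

rng = np.random.default_rng(3)
theta, reps = 0.3, 1_000_000

x = rng.binomial(1, theta, size=reps)
score = x / theta - (1 - x) / (1 - theta)
print(score.var(), 1 / (theta * (1 - theta)))   # both approximately 4.76
```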


Estimate ''θ'' from ''X'' ~ Bern (√''θ'')

As another toy example, consider a random variable ''X'' with possible outcomes 0 and 1, with probabilities p_0 = 1 - \sqrt{\theta} and p_1 = \sqrt{\theta}, respectively, for some \theta \in [0,1]. Our goal is estimating \theta from observations of ''X''. The Fisher information reads in this case

:\begin{align} \mathcal{I}(\theta) &= \operatorname{E}\left[\left.\left(\frac{\partial}{\partial\theta} \log f(X;\theta)\right)^2\,\right|\,\theta\right] \\ &= \left(1-\sqrt{\theta}\right)\left(\frac{\partial}{\partial\theta}\log\left(1-\sqrt{\theta}\right)\right)^2 + \sqrt{\theta}\left(\frac{\partial}{\partial\theta}\log\sqrt{\theta}\right)^2 \\ &= \frac{1}{4\theta}\left(\frac{1}{1-\sqrt{\theta}} + \frac{1}{\sqrt{\theta}}\right). \end{align}

This expression can also be derived directly from the reparametrization formula given below. More generally, for any sufficiently regular function f such that f(\theta) \in [0,1], the Fisher information to retrieve \theta from X \sim \operatorname{Bern}(f(\theta)) is similarly computed to be

:\mathcal{I}(\theta) = f'(\theta)^2 \left(\frac{1}{f(\theta)} + \frac{1}{1-f(\theta)}\right).
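
A minimal numerical check of this example (the specific value of θ is an arbitrary choice of this sketch):

```python
# Minimal sketch, assuming X ~ Bernoulli(sqrt(theta)): compare the analytic expression above
# with a direct computation of E[(d/dtheta log f(X; theta))^2] over the two outcomes.
import numpy as np

theta = 0.4
p = np.sqrt(theta)

# Analytic value from the text: (1/(4*theta)) * (1/(1 - sqrt(theta)) + 1/sqrt(theta))
analytic = (1 / (4 * theta)) * (1 / (1 - p) + 1 / p)

# Direct computation, outcome by outcome.
dlog_p1 = 1 / (2 * theta)             # d/dtheta log sqrt(theta)
dlog_p0 = -1 / (2 * p * (1 - p))      # d/dtheta log(1 - sqrt(theta))
direct = p * dlog_p1**2 + (1 - p) * dlog_p0**2

print(analytic, direct)               # the two values agree
```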


Matrix form

When there are ''N'' parameters, so that ''θ'' is an ''N'' × 1 vector \theta = \begin{bmatrix}\theta_1 & \theta_2 & \dots & \theta_N\end{bmatrix}^\textsf{T}, the Fisher information takes the form of an ''N'' × ''N'' matrix. This matrix is called the Fisher information matrix (FIM) and has typical element

: \bigl[\mathcal{I}(\theta)\bigr]_{i,j} = \operatorname{E}\left[\left.\left(\frac{\partial}{\partial\theta_i} \log f(X;\theta)\right)\left(\frac{\partial}{\partial\theta_j} \log f(X;\theta)\right)\,\right|\,\theta\right].

The FIM is a positive semidefinite matrix. If it is positive definite, then it defines a Riemannian metric on the ''N''-dimensional parameter space. The topic of information geometry uses this to connect Fisher information to differential geometry, and in that context, this metric is known as the Fisher information metric.

Under certain regularity conditions, the Fisher information matrix may also be written as

: \bigl[\mathcal{I}(\theta)\bigr]_{i,j} = -\operatorname{E}\left[\left.\frac{\partial^2}{\partial\theta_i\,\partial\theta_j} \log f(X;\theta)\,\right|\,\theta\right].

The result is interesting in several ways:
*It can be derived as the Hessian of the relative entropy.
*It can be used as a Riemannian metric for defining Fisher–Rao geometry when it is positive definite.
*It can be understood as a metric induced from the Euclidean metric, after an appropriate change of variable.
*In its complex-valued form, it is the Fubini–Study metric.
*It is the key part of the proof of Wilks' theorem, which allows confidence region estimates for maximum likelihood estimation (for those conditions for which it applies) without needing the likelihood principle.
*In cases where the analytical calculation of the FIM above is difficult, it is possible to form an average of easy Monte Carlo estimates of the Hessian of the negative log-likelihood function as an estimate of the FIM (a minimal sketch of such an estimate follows this list). The estimates may be based on values of the negative log-likelihood function or the gradient of the negative log-likelihood function; no analytical calculation of the Hessian of the negative log-likelihood function is needed.
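
A minimal Monte Carlo sketch of such a FIM estimate (the two-parameter normal model is an assumption of this example, and it uses the averaged outer product of the score, which coincides with the expected-Hessian form under the regularity conditions above):

```python
# Minimal sketch, assuming X ~ Normal(mu, sigma^2) with theta = (mu, sigma).
# The analytic FIM is diag(1/sigma^2, 2/sigma^2); the Monte Carlo estimate averages
# the outer product of the score over simulated data.
import numpy as np

rng = np.random.default_rng(4)
mu, sigma, reps = 1.0, 2.0, 500_000

x = rng.normal(mu, sigma, size=reps)
score_mu = (x - mu) / sigma**2                          # d/dmu log f
score_sigma = (x - mu)**2 / sigma**3 - 1 / sigma        # d/dsigma log f
scores = np.stack([score_mu, score_sigma], axis=1)      # shape (reps, 2)

fim_mc = scores.T @ scores / reps                       # E[score score^T]
fim_analytic = np.diag([1 / sigma**2, 2 / sigma**2])
print(np.round(fim_mc, 3))
print(fim_analytic)
```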


Information orthogonal parameters

We say that two parameter component vectors ''θ1'' and ''θ2'' are information orthogonal if the Fisher information matrix is block diagonal, with these components in separate blocks. Orthogonal parameters are easy to deal with in the sense that their maximum likelihood estimates are asymptotically uncorrelated. When considering how to analyse a statistical model, the modeller is advised to invest some time searching for an orthogonal parametrization of the model, in particular when the parameter of interest is one-dimensional, but the nuisance parameter can have any dimension.


Singular statistical model

If the Fisher information matrix is positive definite for all ''θ'', then the corresponding statistical model is said to be ''regular''; otherwise, the statistical model is said to be ''singular''. Examples of singular statistical models include the following: normal mixtures, binomial mixtures, multinomial mixtures, Bayesian networks, neural networks, radial basis functions, hidden Markov models, stochastic context-free grammars, reduced rank regressions, and Boltzmann machines.

In machine learning, if a statistical model is devised so that it extracts hidden structure from a random phenomenon, then it naturally becomes singular.


Multivariate normal distribution

The FIM for an ''N''-variate multivariate normal distribution, X \sim N\left(\mu(\theta),\, \Sigma(\theta)\right), has a special form. Let the ''K''-dimensional vector of parameters be \theta = \begin{bmatrix}\theta_1 & \dots & \theta_K\end{bmatrix}^\textsf{T} and the vector of normal random variables be X = \begin{bmatrix}X_1 & \dots & X_N\end{bmatrix}^\textsf{T}. Assume that the mean values of these random variables are \mu(\theta) = \begin{bmatrix}\mu_1(\theta) & \dots & \mu_N(\theta)\end{bmatrix}^\textsf{T}, and let \Sigma(\theta) be the covariance matrix. Then, for 1 \le m,\, n \le K, the (''m'', ''n'') entry of the FIM is:

: \mathcal{I}_{m,n} = \frac{\partial\mu^\textsf{T}}{\partial\theta_m} \Sigma^{-1} \frac{\partial\mu}{\partial\theta_n} + \frac{1}{2}\operatorname{tr}\left(\Sigma^{-1}\frac{\partial\Sigma}{\partial\theta_m}\Sigma^{-1}\frac{\partial\Sigma}{\partial\theta_n}\right),

where (\cdot)^\textsf{T} denotes the transpose of a vector, \operatorname{tr}(\cdot) denotes the trace of a square matrix, and:

: \begin{align} \frac{\partial\mu}{\partial\theta_m} &= \begin{bmatrix} \dfrac{\partial\mu_1}{\partial\theta_m} & \dfrac{\partial\mu_2}{\partial\theta_m} & \cdots & \dfrac{\partial\mu_N}{\partial\theta_m} \end{bmatrix}^\textsf{T}; \\ \frac{\partial\Sigma}{\partial\theta_m} &= \begin{bmatrix} \dfrac{\partial\Sigma_{1,1}}{\partial\theta_m} & \dfrac{\partial\Sigma_{1,2}}{\partial\theta_m} & \cdots & \dfrac{\partial\Sigma_{1,N}}{\partial\theta_m} \\ \dfrac{\partial\Sigma_{2,1}}{\partial\theta_m} & \dfrac{\partial\Sigma_{2,2}}{\partial\theta_m} & \cdots & \dfrac{\partial\Sigma_{2,N}}{\partial\theta_m} \\ \vdots & \vdots & \ddots & \vdots \\ \dfrac{\partial\Sigma_{N,1}}{\partial\theta_m} & \dfrac{\partial\Sigma_{N,2}}{\partial\theta_m} & \cdots & \dfrac{\partial\Sigma_{N,N}}{\partial\theta_m} \end{bmatrix}. \end{align}

Note that a special, but very common, case is the one where \Sigma(\theta) = \Sigma, a constant. Then

: \mathcal{I}_{m,n} = \frac{\partial\mu^\textsf{T}}{\partial\theta_m} \Sigma^{-1} \frac{\partial\mu}{\partial\theta_n}.

In this case the Fisher information matrix may be identified with the coefficient matrix of the normal equations of least squares estimation theory.

Another special case occurs when the mean and covariance depend on two different vector parameters, say, ''β'' and ''θ''. This is especially popular in the analysis of spatial data, which often uses a linear model with correlated residuals. In this case,

: \mathcal{I}(\beta, \theta) = \operatorname{diag}\left(\mathcal{I}(\beta), \mathcal{I}(\theta)\right),

where

: \begin{align} \bigl[\mathcal{I}(\beta)\bigr]_{m,n} &= \frac{\partial\mu^\textsf{T}}{\partial\beta_m} \Sigma^{-1} \frac{\partial\mu}{\partial\beta_n}, \\ \bigl[\mathcal{I}(\theta)\bigr]_{m,n} &= \frac{1}{2}\operatorname{tr}\left(\Sigma^{-1}\frac{\partial\Sigma}{\partial\theta_m}\Sigma^{-1}\frac{\partial\Sigma}{\partial\theta_n}\right). \end{align}
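
A minimal sketch of the constant-covariance special case (the linear mean model and the particular matrices below are assumptions of this example, not taken from the text):

```python
# Minimal sketch, assuming a linear mean model mu(theta) = A @ theta with fixed covariance
# Sigma. The formula above reduces to the least-squares "normal equations" matrix
# [I]_{m,n} = (dmu/dtheta_m)^T Sigma^{-1} (dmu/dtheta_n), i.e. A^T Sigma^{-1} A.
import numpy as np

rng = np.random.default_rng(5)
N, K = 5, 2
A = rng.normal(size=(N, K))                 # columns of A are the vectors dmu/dtheta_m
Sigma = np.diag(rng.uniform(0.5, 2.0, N))   # fixed, theta-independent covariance

Sigma_inv = np.linalg.inv(Sigma)
fim = A.T @ Sigma_inv @ A
print(fim)
```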


Properties


Chain rule

Similar to the entropy or mutual information, the Fisher information also possesses a chain rule decomposition. In particular, if ''X'' and ''Y'' are jointly distributed random variables, it follows that

:\mathcal{I}_{X,Y}(\theta) = \mathcal{I}_X(\theta) + \mathcal{I}_{Y\mid X}(\theta),

where \mathcal{I}_{Y\mid X}(\theta) = \operatorname{E}_X\left[\mathcal{I}_{Y\mid X=x}(\theta)\right] and \mathcal{I}_{Y\mid X=x}(\theta) is the Fisher information of ''Y'' relative to \theta calculated with respect to the conditional density of ''Y'' given a specific value ''X'' = ''x''.

As a special case, if the two random variables are independent, the information yielded by the two random variables is the sum of the information from each random variable separately:

:\mathcal{I}_{X,Y}(\theta) = \mathcal{I}_X(\theta) + \mathcal{I}_Y(\theta).

Consequently, the information in a random sample of ''n'' independent and identically distributed observations is ''n'' times the information in a sample of size 1.


''f''-divergence

Given a convex function f\colon [0,\infty) \to (-\infty,\infty] such that f(x) is finite for all x > 0, f(1) = 0, and f(0) = \lim_{t\to 0^+} f(t) (which could be infinite), it defines an ''f''-divergence D_f. If f is strictly convex at 1, then locally at \theta \in \Theta, the Fisher information matrix is a metric, in the sense that, to leading order in \delta\theta,

:(\delta\theta)^\textsf{T} I(\theta)\, (\delta\theta) = \frac{2}{f''(1)}\, D_f\left(P_{\theta+\delta\theta} \parallel P_\theta\right),

where P_\theta is the distribution parametrized by \theta, that is, the distribution with pdf f(x;\theta). In this form, it is clear that the Fisher information matrix is a Riemannian metric, and varies correctly under a change of variables (see the section on Reparameterization).


Sufficient statistic

The information provided by a sufficient statistic is the same as that of the sample ''X''. This may be seen by using Neyman's factorization criterion for a sufficient statistic. If ''T''(''X'') is sufficient for ''θ'', then

:f(X;\theta) = g(T(X),\theta)\, h(X)

for some functions ''g'' and ''h''. The independence of ''h''(''X'') from ''θ'' implies

:\frac{\partial}{\partial\theta} \log\left[f(X;\theta)\right] = \frac{\partial}{\partial\theta} \log\left[g(T(X);\theta)\right],

and the equality of information then follows from the definition of Fisher information. More generally, if ''T'' = ''t''(''X'') is a statistic, then

: \mathcal{I}_T(\theta) \leq \mathcal{I}_X(\theta)

with equality if and only if ''T'' is a sufficient statistic.


Reparameterization

The Fisher information depends on the parametrization of the problem. If ''θ'' and ''η'' are two scalar parametrizations of an estimation problem, and ''θ'' is a continuously differentiable function of ''η'', then

:\mathcal{I}_\eta(\eta) = \mathcal{I}_\theta(\theta(\eta)) \left(\frac{d\theta}{d\eta}\right)^2,

where \mathcal{I}_\eta and \mathcal{I}_\theta are the Fisher information measures of ''η'' and ''θ'', respectively.

In the vector case, suppose \boldsymbol\theta and \boldsymbol\eta are ''k''-vectors which parametrize an estimation problem, and suppose that \boldsymbol\theta is a continuously differentiable function of \boldsymbol\eta. Then

:\mathcal{I}_{\boldsymbol\eta}(\boldsymbol\eta) = \boldsymbol{J}^\textsf{T}\, \mathcal{I}_{\boldsymbol\theta}(\boldsymbol\theta(\boldsymbol\eta))\, \boldsymbol{J},

where the (''i'', ''j'')th element of the ''k'' × ''k'' Jacobian matrix \boldsymbol J is defined by

: J_{ij} = \frac{\partial\theta_i}{\partial\eta_j},

and where \boldsymbol{J}^\textsf{T} is the matrix transpose of \boldsymbol{J}.

In information geometry, this is seen as a change of coordinates on a Riemannian manifold, and the intrinsic properties of curvature are unchanged under different parametrizations. In general, the Fisher information matrix provides a Riemannian metric (more precisely, the Fisher–Rao metric) for the manifold of thermodynamic states, and can be used as an information-geometric complexity measure for a classification of phase transitions, e.g., the scalar curvature of the thermodynamic metric tensor diverges at (and only at) a phase transition point. In the thermodynamic context, the Fisher information matrix is directly related to the rate of change in the corresponding order parameters. In particular, such relations identify second-order phase transitions via divergences of individual elements of the Fisher information matrix.
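
A minimal sketch of the scalar rule (the exponential distribution, in its rate and mean parametrizations, is an assumption of this example; the closed-form informations 1/λ² and 1/μ² are standard):

```python
# Minimal sketch: for an exponential distribution, I(lambda) = 1/lambda^2 in the rate
# parametrization and I(mu) = 1/mu^2 in the mean parametrization (mu = 1/lambda).
# The rule I_eta(eta) = I_theta(theta(eta)) * (dtheta/deta)^2 maps one to the other.
lam = 2.5
mu = 1.0 / lam

info_lambda = 1.0 / lam**2            # Fisher information in the rate parametrization
dlam_dmu = -1.0 / mu**2               # derivative of theta(eta) = 1/mu with respect to mu
info_mu_via_rule = info_lambda * dlam_dmu**2
print(info_mu_via_rule, 1.0 / mu**2)  # both equal 1/mu^2
```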


Isoperimetric inequality

The Fisher information matrix plays a role in an inequality like the isoperimetric inequality. Of all probability distributions with a given entropy, the one whose Fisher information matrix has the smallest trace is the Gaussian distribution. This is like how, of all bounded sets with a given volume, the sphere has the smallest surface area.

The proof involves taking a multivariate random variable X with density function f and adding a location parameter to form a family of densities \{f(x - \theta) \mid \theta \in \mathbb{R}^n\}. Then, by analogy with the Minkowski–Steiner formula, the "surface area" of X is defined to be

:S(X) = \lim_{\varepsilon\to 0} \frac{e^{H(X + Z_\varepsilon)} - e^{H(X)}}{\varepsilon},

where Z_\varepsilon is a Gaussian variable with covariance matrix \varepsilon I. The name "surface area" is apt because the entropy power e^{H(X)} is the volume of the "effective support set", so S(X) is the "derivative" of the volume of the effective support set, much like the Minkowski–Steiner formula. The remainder of the proof uses the entropy power inequality, which is like the Brunn–Minkowski inequality. The trace of the Fisher information matrix is found to be a factor of S(X).


Applications


Optimal design of experiments

Fisher information is widely used in optimal experimental design. Because of the reciprocity of estimator-variance and Fisher information, ''minimizing'' the ''variance'' corresponds to ''maximizing'' the ''information''.

When the linear (or linearized) statistical model has several parameters, the mean of the parameter estimator is a vector and its variance is a matrix. The inverse of the variance matrix is called the "information matrix". Because the variance of the estimator of a parameter vector is a matrix, the problem of "minimizing the variance" is complicated. Using statistical theory, statisticians compress the information matrix using real-valued summary statistics; being real-valued functions, these "information criteria" can be maximized.

Traditionally, statisticians have evaluated estimators and designs by considering some summary statistic of the covariance matrix (of an unbiased estimator), usually with positive real values (like the determinant or matrix trace). Working with positive real numbers brings several advantages: if the estimator of a single parameter has a positive variance, then the variance and the Fisher information are both positive real numbers; hence they are members of the convex cone of nonnegative real numbers (whose nonzero members have reciprocals in this same cone).

For several parameters, the covariance matrices and information matrices are elements of the convex cone of nonnegative-definite symmetric matrices in a partially ordered vector space, under the Loewner (Löwner) order. This cone is closed under matrix addition and inversion, as well as under the multiplication of positive real numbers and matrices. An exposition of matrix theory and the Loewner order appears in Pukelsheim.

The traditional optimality criteria are the information matrix's invariants, in the sense of invariant theory; algebraically, the traditional optimality criteria are functionals of the eigenvalues of the (Fisher) information matrix (see optimal design).
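
A minimal sketch of one such criterion (the toy linear model, the candidate designs, and the unit noise variance below are assumptions of this example; the determinant criterion corresponds to what is usually called D-optimality):

```python
# Minimal sketch, assuming the simple linear model y = b0 + b1*x + noise with unit variance:
# compare two candidate designs by the determinant of the information matrix X^T X.
import numpy as np

def information_matrix(x_points):
    # Design matrix with an intercept column; the information matrix is X^T X.
    X = np.column_stack([np.ones_like(x_points), x_points])
    return X.T @ X

design_a = np.array([-1.0, -1.0, 1.0, 1.0])     # points pushed to the ends of [-1, 1]
design_b = np.array([-0.5, 0.0, 0.0, 0.5])      # points clustered near the centre

for name, d in [("endpoints", design_a), ("clustered", design_b)]:
    print(name, np.linalg.det(information_matrix(d)))
# The endpoint design has the larger determinant, hence smaller estimator variance.
```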


Jeffreys prior in Bayesian statistics

In Bayesian statistics, the Fisher information is used to calculate the Jeffreys prior, which is a standard, non-informative prior for continuous distribution parameters.
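
A minimal sketch for a one-parameter case (the Bernoulli likelihood is an assumption of this example; the Jeffreys prior is proportional to the square root of the Fisher information, here giving a Beta(1/2, 1/2) shape):

```python
# Minimal sketch: the Jeffreys prior is proportional to sqrt(det I(theta)).
# For a Bernoulli likelihood, I(theta) = 1/(theta(1-theta)), so the unnormalised prior is
# 1/sqrt(theta(1-theta)), i.e. a Beta(1/2, 1/2) density up to the constant 1/pi.
import numpy as np

def jeffreys_prior_bernoulli(theta):
    # Unnormalised density; the normalising constant is 1/pi.
    return 1.0 / np.sqrt(theta * (1.0 - theta))

theta_grid = np.linspace(0.01, 0.99, 5)
print(jeffreys_prior_bernoulli(theta_grid))
```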


Computational neuroscience

The Fisher information has been used to find bounds on the accuracy of neural codes. In that case, ''X'' is typically the joint responses of many neurons representing a low-dimensional variable ''θ'' (such as a stimulus parameter). In particular, the role of correlations in the noise of the neural responses has been studied.


Epidemiology

Fisher information was used to study how informative different data sources are for estimation of the reproduction number of SARS-CoV-2.


Machine learning

The Fisher information is used in machine learning techniques such as elastic weight consolidation, which reduces catastrophic forgetting in artificial neural networks. Fisher information can be used as an alternative to the Hessian of the loss function in second-order gradient-descent network training.
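
A minimal sketch of the elastic-weight-consolidation idea (a simplified reading, not the original method's implementation; the diagonal Fisher estimate, the function names, and the toy numbers are assumptions of this sketch):

```python
# Minimal sketch: a diagonal Fisher estimate F weights a quadratic penalty that keeps
# parameters theta close to the values theta_old learned on a previous task.
import numpy as np

def ewc_penalty(theta, theta_old, fisher_diag, lam=1.0):
    # lam/2 * sum_i F_i * (theta_i - theta_old_i)^2
    return 0.5 * lam * np.sum(fisher_diag * (theta - theta_old) ** 2)

# Toy usage: parameters with high Fisher information are penalised more for moving.
theta_old = np.array([0.5, -1.2, 2.0])
fisher_diag = np.array([10.0, 0.1, 1.0])     # e.g. averaged squared gradients on the old task
theta_new = np.array([0.6, -0.2, 2.0])
print(ewc_penalty(theta_new, theta_old, fisher_diag))
```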


Color discrimination

Using a Fisher information metric, da Fonseca et al. investigated the degree to which MacAdam ellipses (color discrimination ellipses) can be derived from the response functions of the retinal photoreceptors.


Relation to relative entropy

Fisher information is related to relative entropy (Gourieroux & Montfort 1995, p. 87). The relative entropy, or Kullback–Leibler divergence, between two distributions p and q can be written as

:KL(p:q) = \int p(x)\log\frac{p(x)}{q(x)}\,dx.

Now, consider a family of probability distributions f(x;\theta) parametrized by \theta \in \Theta. Then the Kullback–Leibler divergence between two distributions in the family can be written as

:D(\theta,\theta') = KL(p(\cdot;\theta):p(\cdot;\theta')) = \int f(x;\theta)\log\frac{f(x;\theta)}{f(x;\theta')}\,dx.

If \theta is fixed, then the relative entropy between two distributions of the same family is minimized at \theta' = \theta. For \theta' close to \theta, one may expand the previous expression in a series up to second order:

:D(\theta,\theta') = \frac{1}{2}(\theta' - \theta)^\textsf{T} \left(\frac{\partial^2}{\partial\theta'_i\,\partial\theta'_j} D(\theta,\theta')\right)_{\theta'=\theta} (\theta' - \theta) + o\left((\theta'-\theta)^2\right).

But the second-order derivative can be written as

:\left(\frac{\partial^2}{\partial\theta'_i\,\partial\theta'_j} D(\theta,\theta')\right)_{\theta'=\theta} = -\int f(x;\theta)\left(\frac{\partial^2}{\partial\theta'_i\,\partial\theta'_j}\log f(x;\theta')\right)_{\theta'=\theta}\,dx = \bigl[\mathcal{I}(\theta)\bigr]_{i,j}.

Thus the Fisher information represents the curvature of the relative entropy of a conditional distribution with respect to its parameters.
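
A minimal sketch of this second-order relationship (the normal location family with known σ is an assumption of this example; for that family the quadratic approximation happens to be exact):

```python
# Minimal sketch, assuming a normal location family with known sigma: I(theta) = 1/sigma^2,
# and KL( N(theta, sigma^2) || N(theta', sigma^2) ) = (theta' - theta)^2 / (2 sigma^2),
# which matches (1/2) * I(theta) * (theta' - theta)^2.
import numpy as np

def kl_normal_same_sigma(theta, theta_prime, sigma):
    # KL divergence between two normals with equal variance
    return (theta_prime - theta) ** 2 / (2 * sigma**2)

theta, dtheta, sigma = 0.7, 0.01, 1.3
fisher = 1 / sigma**2
print(kl_normal_same_sigma(theta, theta + dtheta, sigma), 0.5 * fisher * dtheta**2)
```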


History

The Fisher information was discussed by several early statisticians, notably F. Y. Edgeworth. For example, Savage says: "In it [Fisher information], he [Fisher] was to some extent anticipated (Edgeworth 1908–9 esp. 502, 507–8, 662, 677–8, 82–5 and references he [Edgeworth] cites including Pearson and Filon 1898 [...])." There are a number of early historical sources and a number of reviews of this early work.


See also

* Efficiency (statistics)
* Observed information
* Fisher information metric
* Formation matrix
* Information geometry
* Jeffreys prior
* Cramér–Rao bound
* Minimum Fisher information
* Quantum Fisher information

Other measures employed in information theory:
* Entropy (information theory)
* Kullback–Leibler divergence
* Self-information

