Fisher information metric

In information geometry, the Fisher information metric is a particular Riemannian metric which can be defined on a smooth statistical manifold, ''i.e.'', a smooth manifold whose points are probability measures defined on a common probability space. It can be used to calculate the informational difference between measurements.

The metric is interesting in several respects. By Chentsov's theorem, the Fisher information metric on statistical models is the only Riemannian metric (up to rescaling) that is invariant under sufficient statistics. It can also be understood to be the infinitesimal form of the relative entropy (''i.e.'', the Kullback–Leibler divergence); specifically, it is the Hessian of the divergence. Alternately, it can be understood as the metric induced by the flat-space Euclidean metric, after appropriate changes of variable. When extended to complex projective Hilbert space, it becomes the Fubini–Study metric; when written in terms of mixed states, it is the quantum Bures metric.

Considered purely as a matrix, it is known as the Fisher information matrix. Considered as a measurement technique, where it is used to estimate hidden parameters in terms of observed random variables, it is known as the observed information.


Definition

Given a statistical manifold with coordinates \theta=(\theta_1, \theta_2, \ldots, \theta_n), one writes p(x,\theta) for the probability distribution as a function of \theta. Here x is drawn from the value space ''R'' for a (discrete or continuous) random variable ''X''. The probability is normalized by

: \int_X p(x,\theta) \,dx = 1.

The Fisher information metric then takes the form:

: g_{jk}(\theta) = \int_X \frac{\partial \log p(x,\theta)}{\partial \theta_j} \frac{\partial \log p(x,\theta)}{\partial \theta_k} p(x,\theta) \, dx.

The integral is performed over all values ''x'' in ''X''. The variable \theta is now a coordinate on a Riemannian manifold. The labels ''j'' and ''k'' index the local coordinate axes on the manifold.

When the probability is derived from the Gibbs measure, as it would be for any Markovian process, then \theta can also be understood to be a Lagrange multiplier; Lagrange multipliers are used to enforce constraints, such as holding the expectation value of some quantity constant. If there are ''n'' constraints holding ''n'' different expectation values constant, then the dimension of the manifold is ''n'' dimensions smaller than the original space. In this case, the metric can be explicitly derived from the partition function; a derivation and discussion is presented there.

Substituting i(x,\theta) = -\log p(x,\theta) from information theory, an equivalent form of the above definition is:

: g_{jk}(\theta) = \int_X \frac{\partial^2 i(x,\theta)}{\partial \theta_j \, \partial \theta_k} p(x,\theta) \, dx = \mathrm{E}\left[ \frac{\partial^2 i(x,\theta)}{\partial \theta_j \, \partial \theta_k} \right].

To show that the equivalent form equals the above definition, note that

: \mathrm{E}\left[ \frac{\partial \log p(x,\theta)}{\partial \theta_j} \right] = 0

and apply \frac{\partial}{\partial \theta_k} on both sides.
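As a concrete illustration of the definition, the following sketch evaluates g_{jk} numerically for a small discrete family. The three-outcome softmax model, the finite-difference step, and the helper names are purely illustrative assumptions, not part of the definition above; the code simply estimates the scores \partial \log p / \partial\theta_j by central differences and averages their products under p.

```python
import numpy as np

def probs(theta):
    """Illustrative 3-outcome categorical family: softmax of logits (theta_1, theta_2, 0)."""
    logits = np.array([theta[0], theta[1], 0.0])
    w = np.exp(logits - logits.max())
    return w / w.sum()

def fisher_metric(theta, eps=1e-5):
    """g_jk(theta) = sum_x p(x,theta) * d(log p)/d(theta_j) * d(log p)/d(theta_k),
    with the scores estimated by central finite differences."""
    theta = np.asarray(theta, dtype=float)
    n = theta.size
    p = probs(theta)
    score = np.empty((n, p.size))       # score[j, x] ~ d log p(x, theta) / d theta_j
    for j in range(n):
        dt = np.zeros(n)
        dt[j] = eps
        score[j] = (np.log(probs(theta + dt)) - np.log(probs(theta - dt))) / (2 * eps)
    return np.einsum('x,jx,kx->jk', p, score, score)

g = fisher_metric([0.3, -0.2])
print(g)                        # a symmetric 2x2 matrix
print(np.linalg.eigvalsh(g))    # eigenvalues >= 0, i.e. positive semi-definite
```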


Relation to the Kullback–Leibler divergence

Alternatively, the metric can be obtained as the second derivative of the ''relative entropy'' or Kullback–Leibler divergence. To obtain this, one considers two probability distributions P(\theta) and P(\theta_0), which are infinitesimally close to one another, so that

: P(\theta) = P(\theta_0) + \sum_j \Delta\theta^j \left.\frac{\partial P}{\partial \theta^j}\right|_{\theta_0}

with \Delta\theta^j an infinitesimally small change of \theta in the ''j'' direction. Then, since the Kullback–Leibler divergence D_\mathrm{KL}[P(\theta_0) \parallel P(\theta)] has an absolute minimum of 0 when P(\theta) = P(\theta_0), one has an expansion, up to second order around \theta = \theta_0, of the form

: f_{\theta_0}(\theta) := D_\mathrm{KL}[P(\theta_0) \parallel P(\theta)] = \frac{1}{2} \sum_{jk}\Delta\theta^j \,\Delta\theta^k \, g_{jk}(\theta_0) + \mathrm{O}(\Delta\theta^3).

The symmetric matrix g_{jk} is positive (semi-)definite and is the Hessian matrix of the function f_{\theta_0}(\theta) at the extremum point \theta_0. This can be thought of intuitively as: "The distance between two infinitesimally close points on a statistical differential manifold is the informational difference between them."
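As a quick numerical check of this relationship, the sketch below (reusing the same illustrative softmax family and hypothetical helper names as above) approximates the Hessian of \theta \mapsto D_\mathrm{KL}[P(\theta_0) \parallel P(\theta)] at \theta_0 by second differences; the result should agree with the Fisher metric computed in the previous sketch.

```python
import numpy as np

def probs(theta):
    # Same illustrative 3-outcome softmax family as in the previous sketch.
    logits = np.array([theta[0], theta[1], 0.0])
    w = np.exp(logits - logits.max())
    return w / w.sum()

def kl(p, q):
    return np.sum(p * np.log(p / q))

def kl_hessian(theta0, eps=1e-3):
    """Hessian of f(theta) = KL(P(theta0) || P(theta)) at theta = theta0,
    estimated by central second differences; it should reproduce g_jk(theta0)."""
    theta0 = np.asarray(theta0, dtype=float)
    p0 = probs(theta0)
    f = lambda th: kl(p0, probs(th))
    n = theta0.size
    H = np.empty((n, n))
    for j in range(n):
        for k in range(n):
            ej = np.zeros(n); ej[j] = eps
            ek = np.zeros(n); ek[k] = eps
            H[j, k] = (f(theta0 + ej + ek) - f(theta0 + ej - ek)
                       - f(theta0 - ej + ek) + f(theta0 - ej - ek)) / (4 * eps**2)
    return H

print(kl_hessian([0.3, -0.2]))   # numerically close to fisher_metric([0.3, -0.2]) above
```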


Relation to Ruppeiner geometry

The Ruppeiner metric and the Weinhold metric are the Fisher information metric calculated for Gibbs distributions, such as those found in equilibrium statistical mechanics.


Change in free entropy

The action of a curve on a Riemannian manifold is given by

: A=\frac{1}{2}\int_a^b \frac{\partial\theta^j}{\partial t} g_{jk}(\theta)\frac{\partial\theta^k}{\partial t}\, dt.

The path parameter here is time ''t''; this action can be understood to give the change in free entropy of a system as it is moved from time ''a'' to time ''b''. Specifically, one has

: \Delta S = (b-a) A \,

as the change in free entropy. This observation has resulted in practical applications in the chemical and processing industries: in order to minimize the change in free entropy of a system, one should follow the minimum geodesic path between the desired endpoints of the process. The geodesic minimizes the entropy, due to the Cauchy–Schwarz inequality, which states that the action is bounded below by the length of the curve, squared.
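The following sketch, assuming NumPy, discretizes this action for a one-parameter Bernoulli family, whose Fisher information 1/(\theta(1-\theta)) is known in closed form. The two example paths (a constant-rate path in \theta and a constant-speed geodesic obtained by interpolating linearly in \arcsin\sqrt{\theta}) are illustrative choices; the geodesic attains the Cauchy–Schwarz lower bound, length squared over 2(b-a).

```python
import numpy as np

def fisher_bernoulli(theta):
    # Fisher information of a Bernoulli(theta) variable: g(theta) = 1 / (theta (1 - theta)).
    return 1.0 / (theta * (1.0 - theta))

def path_action(theta_of_t, a, b, steps=2000):
    """A = (1/2) * integral_a^b (dtheta/dt) g(theta) (dtheta/dt) dt,
    approximated with the trapezoidal rule."""
    t = np.linspace(a, b, steps)
    theta = theta_of_t(t)
    dtheta_dt = np.gradient(theta, t)
    integrand = dtheta_dt * fisher_bernoulli(theta) * dtheta_dt
    return 0.5 * np.sum(0.5 * (integrand[1:] + integrand[:-1]) * np.diff(t))

# Move a Bernoulli parameter from 0.2 to 0.8 over unit time, along two different paths.
phi0, phi1 = np.arcsin(np.sqrt(0.2)), np.arcsin(np.sqrt(0.8))
straight = lambda t: 0.2 + 0.6 * t                          # constant-rate path in theta
geodesic = lambda t: np.sin((1 - t) * phi0 + t * phi1)**2   # constant-speed Fisher geodesic

print(path_action(straight, 0.0, 1.0))   # ~0.832
print(path_action(geodesic, 0.0, 1.0))   # ~0.828, the minimum length^2 / (2 (b - a))
```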


Relation to the Jensen–Shannon divergence

The Fisher metric also allows the action and the curve length to be related to the Jensen–Shannon divergence. Specifically, one has

: (b-a)\int_a^b \frac{\partial\theta^j}{\partial t} g_{jk}\frac{\partial\theta^k}{\partial t} \,dt = 8\int_a^b dJSD

where the integrand ''dJSD'' is understood to be the infinitesimal change in the Jensen–Shannon divergence along the path taken. Similarly, for the curve length, one has

: \int_a^b \sqrt{\frac{\partial\theta^j}{\partial t} g_{jk}\frac{\partial\theta^k}{\partial t}} \,dt = \sqrt{8}\int_a^b \sqrt{dJSD}.

That is, the square root of the Jensen–Shannon divergence is just the Fisher metric (divided by the square root of 8).
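Numerically, this says that for two nearby distributions the Jensen–Shannon divergence is about one-eighth of the squared Fisher line element. A minimal check, using an arbitrary three-outcome distribution and a small perturbation (both illustrative choices):

```python
import numpy as np

def jsd(p, q):
    """Jensen-Shannon divergence (natural logarithm) between two discrete distributions."""
    m = 0.5 * (p + q)
    kl = lambda a, b: np.sum(a * np.log(a / b))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def fisher_length(p, q):
    """Fisher line element between two nearby discrete distributions,
    using ds^2 = sum_i (dp_i)^2 / p_i."""
    dp = q - p
    return np.sqrt(np.sum(dp * dp / p))

p = np.array([0.2, 0.3, 0.5])
q = np.array([0.201, 0.298, 0.501])   # a small perturbation of p

print(np.sqrt(8.0 * jsd(p, q)))       # nearly equal to the Fisher length below
print(fisher_length(p, q))
```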


As Euclidean metric

For a discrete probability space, that is, a probability space on a finite set of objects, the Fisher metric can be understood to simply be the Euclidean metric restricted to a positive "quadrant" of a unit sphere, after appropriate changes of variable.

Consider a flat, Euclidean space, of dimension ''N''+1, parametrized by points y=(y_0,\cdots,y_N). The metric for Euclidean space is given by

: h=\sum_{i=0}^N dy_i \; dy_i

where the dy_i are 1-forms; they are the basis vectors for the cotangent space. Writing \frac{\partial}{\partial y_j} as the basis vectors for the tangent space, so that

: dy_j\left(\frac{\partial}{\partial y_k}\right) = \delta_{jk},

the Euclidean metric may be written as

: h^\mathrm{flat}_{jk} = h\left(\frac{\partial}{\partial y_j}, \frac{\partial}{\partial y_k}\right) = \delta_{jk}.

The superscript 'flat' is there to remind that, when written in coordinate form, this metric is with respect to the flat-space coordinate y.

An ''N''-dimensional unit sphere embedded in (''N'' + 1)-dimensional Euclidean space may be defined as

: \sum_{i=0}^N y_i^2 = 1.

This embedding induces a metric on the sphere; it is inherited directly from the Euclidean metric on the ambient space. It takes exactly the same form as the above, taking care to ensure that the coordinates are constrained to lie on the surface of the sphere. This can be done, e.g., with the technique of Lagrange multipliers.

Consider now the change of variable p_i=y_i^2. The sphere condition now becomes the probability normalization condition

: \sum_i p_i = 1

while the metric becomes

: \begin{align} h &=\sum_i dy_i \; dy_i = \sum_i d\sqrt{p_i} \; d\sqrt{p_i} \\ &= \frac{1}{4}\sum_i \frac{dp_i \; dp_i}{p_i} = \frac{1}{4}\sum_i p_i\; d(\log p_i) \; d(\log p_i). \end{align}

The last can be recognized as one-fourth of the Fisher information metric. To complete the process, recall that the probabilities are parametric functions of the manifold variables \theta, that is, one has p_i = p_i(\theta). Thus, the above induces a metric on the parameter manifold:

: \begin{align} h & = \frac{1}{4}\sum_i p_i(\theta) \; d(\log p_i(\theta))\; d(\log p_i(\theta)) \\ &= \frac{1}{4}\sum_{jk} \sum_i p_i(\theta) \; \frac{\partial \log p_i(\theta)}{\partial \theta_j} \, \frac{\partial \log p_i(\theta)}{\partial \theta_k}\; d\theta_j \, d\theta_k \end{align}

or, in coordinate form, the Fisher information metric is:

: \begin{align} g_{jk}(\theta) = 4h_{jk}^\mathrm{fisher} &= 4\, h\left(\frac{\partial}{\partial \theta_j}, \frac{\partial}{\partial \theta_k}\right) \\ & = \sum_i p_i(\theta) \; \frac{\partial \log p_i(\theta)}{\partial \theta_j} \; \frac{\partial \log p_i(\theta)}{\partial \theta_k} \\ & = \mathrm{E}\left[ \frac{\partial \log p_i(\theta)}{\partial \theta_j} \; \frac{\partial \log p_i(\theta)}{\partial \theta_k} \right] \end{align}

where, as before,

: d\theta_j\left(\frac{\partial}{\partial \theta_k}\right) = \delta_{jk}.

The superscript 'fisher' is present to remind that this expression is applicable for the coordinates \theta; whereas the non-coordinate form is the same as the Euclidean (flat-space) metric. That is, the Fisher information metric on a statistical manifold is simply (four times) the Euclidean metric restricted to the positive quadrant of the sphere, after appropriate changes of variable.

When the random variable p is not discrete, but continuous, the argument still holds. This can be seen in one of two different ways. One way is to carefully recast all of the above steps in an infinite-dimensional space, being careful to define limits appropriately, etc., in order to make sure that all manipulations are well-defined, convergent, etc. The other way, as noted by Gromov, is to use a category-theoretic approach; that is, to note that the above manipulations remain valid in the category of probabilities. Here, one should note that such a category would have the Radon–Nikodym property, that is, the Radon–Nikodym theorem holds in this category. This includes the Hilbert spaces; these are square-integrable, and in the manipulations above, this is sufficient to safely replace the sum over squares by an integral over squares.


As Fubini–Study metric

The above manipulations deriving the Fisher metric from the Euclidean metric can be extended to complex projective Hilbert spaces. In this case, one obtains the Fubini–Study metric. This should perhaps be no surprise, as the Fubini–Study metric provides the means of measuring information in quantum mechanics. The Bures metric, also known as the Helstrom metric, is identical to the Fubini–Study metric, although the latter is usually written in terms of pure states, as below, whereas the Bures metric is written for mixed states. By setting the phase of the complex coordinate to zero, one obtains exactly one-fourth of the Fisher information metric, exactly as above.

One begins with the same trick, of constructing a probability amplitude, written in polar coordinates, so:

: \psi(x;\theta) = \sqrt{p(x; \theta)} \; e^{i\alpha(x;\theta)}

Here, \psi(x;\theta) is a complex-valued probability amplitude; p(x; \theta) and \alpha(x;\theta) are strictly real. The previous calculations are obtained by setting \alpha(x;\theta)=0. The usual condition that probabilities lie within a simplex, namely that

: \int_X p(x;\theta) \, dx = 1,

is equivalently expressed by the idea that the square amplitude be normalized:

: \int_X \vert \psi(x;\theta)\vert^2 \, dx = 1.

When \psi(x;\theta) is real, this is the surface of a sphere.

The Fubini–Study metric, written in infinitesimal form, using quantum-mechanical bra–ket notation, is

: ds^2 = \frac{\langle \delta\psi \mid \delta\psi \rangle}{\langle \psi \mid \psi \rangle} - \frac{\langle \delta\psi \mid \psi \rangle \, \langle \psi \mid \delta\psi \rangle}{\langle \psi \mid \psi \rangle^2}.

In this notation, one has that \langle x\mid \psi\rangle = \psi(x;\theta) and integration over the entire measure space ''X'' is written as

: \langle \phi \mid \psi\rangle = \int_X \phi^*(x;\theta) \psi(x;\theta) \, dx.

The expression \vert \delta \psi \rangle can be understood to be an infinitesimal variation; equivalently, it can be understood to be a 1-form in the cotangent space. Using the infinitesimal notation, the polar form of the probability above is simply

: \delta\psi = \left(\frac{\delta p}{2p} + i \,\delta\alpha\right) \psi.

Inserting the above into the Fubini–Study metric gives:

: \begin{align} ds^2 = {} & \frac{1}{4}\int_X (\delta \log p)^2 \;p\,dx \\ &+ \int_X (\delta \alpha)^2 \;p\,dx - \left(\int_X \delta \alpha \;p\,dx\right)^2 \\ & -\frac{i}{2} \int_X (\delta \log p \,\delta\alpha - \delta\alpha \,\delta \log p) \;p\,dx \end{align}

Setting \delta\alpha=0 in the above makes it clear that the first term is (one-fourth of) the Fisher information metric. The full form of the above can be made slightly clearer by changing notation to that of standard Riemannian geometry, so that the metric becomes a symmetric 2-form acting on the tangent space. The change of notation is done simply by replacing \delta \to d and ds^2\to h and noting that the integrals are just expectation values; so:

: \begin{align} h = {} & \frac{1}{4} \mathrm{E}\left[(d\log p)^2\right] + \mathrm{E}\left[(d\alpha)^2\right] - \left(\mathrm{E}\left[d\alpha\right]\right)^2 \\ & - \frac{i}{2}\mathrm{E}\left[d\log p\wedge d\alpha\right] \end{align}

The imaginary term is a symplectic form; it is the Berry phase or geometric phase. In index notation, the metric is:

: \begin{align} h_{jk} = {} & h\left(\frac{\partial}{\partial\theta_j}, \frac{\partial}{\partial\theta_k}\right) \\ = {} & \frac{1}{4} \mathrm{E}\left[ \frac{\partial \log p}{\partial\theta_j} \frac{\partial \log p}{\partial\theta_k} \right] + \mathrm{E}\left[ \frac{\partial \alpha}{\partial\theta_j} \frac{\partial \alpha}{\partial\theta_k} \right] - \mathrm{E}\left[ \frac{\partial \alpha}{\partial\theta_j} \right]\mathrm{E}\left[ \frac{\partial \alpha}{\partial\theta_k} \right] \\ & - \frac{i}{2}\mathrm{E}\left[ \frac{\partial \log p}{\partial\theta_j} \frac{\partial \alpha}{\partial\theta_k} - \frac{\partial \alpha}{\partial\theta_j} \frac{\partial \log p}{\partial\theta_k} \right] \end{align}

Again, the first term can be clearly seen to be (one fourth of) the Fisher information metric, by setting \alpha=0. Equivalently, the Fubini–Study metric can be understood as the metric on complex projective Hilbert space that is induced by the complex extension of the flat Euclidean metric. The difference between this and the Bures metric is that the Bures metric is written in terms of mixed states.
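A small numerical sketch of the reduction just described: for a family of real, normalized amplitudes \psi = \sqrt{p} (phase \alpha set to zero), the Fubini–Study line element computed directly from \langle\delta\psi\mid\delta\psi\rangle - \langle\delta\psi\mid\psi\rangle\langle\psi\mid\delta\psi\rangle agrees with one-quarter of the Fisher line element. The softmax family, the step size, and the helper names are illustrative assumptions, not part of the derivation.

```python
import numpy as np

def probs(theta):
    # Illustrative 3-outcome softmax family (same form as in the earlier sketches).
    logits = np.array([theta[0], theta[1], 0.0])
    w = np.exp(logits - logits.max())
    return w / w.sum()

def psi(theta, alpha):
    # Probability amplitude psi = sqrt(p) * exp(i alpha), with a constant phase alpha.
    return np.sqrt(probs(theta)) * np.exp(1j * alpha)

def fs_line_element(theta, dtheta, alpha):
    """ds^2 = <dpsi|dpsi> - <dpsi|psi><psi|dpsi> for a normalized state,
    with |dpsi> approximated by a finite difference along dtheta."""
    a = psi(theta, alpha)
    d = psi(theta + dtheta, alpha) - a
    return np.real(np.vdot(d, d) - np.vdot(d, a) * np.vdot(a, d))

theta = np.array([0.3, -0.2])
dtheta = np.array([1e-4, -2e-4])

# With alpha = 0, the Fubini-Study line element is one quarter of the
# Fisher line element  sum_x p(x) (d log p(x))^2.
dlogp = np.log(probs(theta + dtheta)) - np.log(probs(theta))
fisher_ds2 = np.sum(probs(theta) * dlogp**2)

print(fs_line_element(theta, dtheta, alpha=0.0))
print(fisher_ds2 / 4.0)
```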


Continuously-valued probabilities

A slightly more formal, abstract definition can be given, as follows. Let ''X'' be an orientable manifold, and let (X,\Sigma,\mu) be a measure on ''X''. Equivalently, let (\Omega, \mathcal{F}, P) be a probability space on \Omega=X, with sigma algebra \mathcal{F}=\Sigma and probability P=\mu.

The statistical manifold ''S''(''X'') of ''X'' is defined as the space of all measures \mu on ''X'' (with the sigma-algebra \Sigma held fixed). Note that this space is infinite-dimensional, and is commonly taken to be a Fréchet space. The points of ''S''(''X'') are measures.

Pick a point \mu\in S(X) and consider the tangent space T_\mu S. The Fisher information metric is then an inner product on the tangent space. With some abuse of notation, one may write this as

: g(\sigma_1,\sigma_2)=\int_X \frac{d\sigma_1}{d\mu}\frac{d\sigma_2}{d\mu}\,d\mu

Here, \sigma_1 and \sigma_2 are vectors in the tangent space; that is, \sigma_1,\sigma_2\in T_\mu S. The abuse of notation is to write the tangent vectors as if they are derivatives, and to insert the extraneous ''d'' in writing the integral: the integration is meant to be carried out using the measure \mu over the whole space ''X''. This abuse of notation is, in fact, taken to be perfectly normal in measure theory; it is the standard notation for the Radon–Nikodym derivative.

In order for the integral to be well-defined, the space ''S''(''X'') must have the Radon–Nikodym property, and more specifically, the tangent space is restricted to those vectors that are square-integrable. Square integrability is equivalent to saying that a Cauchy sequence converges to a finite value under the weak topology: the space contains its limit points. Note that Hilbert spaces possess this property.

This definition of the metric can be seen to be equivalent to the previous, in several steps. First, one selects a submanifold of ''S''(''X'') by considering only those measures \mu that are parameterized by some smoothly varying parameter \theta. Then, if \theta is finite-dimensional, then so is the submanifold; likewise, the tangent space has the same dimension as \theta.

With some additional abuse of language, one notes that the exponential map provides a map from vectors in a tangent space to points in an underlying manifold. Thus, if \sigma\in T_\mu S is a vector in the tangent space, then p=\exp(\sigma) is the corresponding probability associated with point p\in S(X) (after the parallel transport of the exponential map to \mu). Conversely, given a point p\in S(X), the logarithm gives a point in the tangent space (roughly speaking, as again, one must transport from the origin to the point \mu; for details, refer to original sources). Thus, one has the appearance of logarithms in the simpler definition, previously given.
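For a discrete base measure, the Radon–Nikodym derivatives appearing in this definition reduce to pointwise ratios of weights, so the inner product can be written down directly. A minimal sketch, in which the base measure and the two tangent vectors (represented as signed measures of zero total mass) are illustrative choices:

```python
import numpy as np

# A discrete base measure mu over four points, and two tangent vectors sigma1, sigma2,
# represented as signed measures with zero total mass (perturbations of mu).
mu     = np.array([0.1, 0.2, 0.3, 0.4])
sigma1 = np.array([ 0.02, -0.01, -0.04,  0.03])
sigma2 = np.array([-0.01,  0.02,  0.01, -0.02])

# For discrete measures the Radon-Nikodym derivative d(sigma)/d(mu) is the
# pointwise ratio of weights, so g(sigma1, sigma2) is a weighted sum.
g = np.sum((sigma1 / mu) * (sigma2 / mu) * mu)
print(g)
```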


See also

* Cramér–Rao bound
* Fisher information
* Hellinger distance
* Information geometry




References

* Edward H. Feng and Gavin E. Crooks (2009). "Far-from-equilibrium measurements of thermodynamic length". ''Physical Review E'' 79 (1): 012104. doi:10.1103/PhysRevE.79.012104. arXiv:0807.0621.
* Garvesh Raskutti and Sayan Mukherjee (2014). ''The information geometry of mirror descent''. https://arxiv.org/pdf/1310.7780.pdf
* Shun'ichi Amari (1985). ''Differential-geometrical methods in statistics''. Lecture Notes in Statistics, Springer-Verlag, Berlin.
* Shun'ichi Amari and Hiroshi Nagaoka (2000). ''Methods of information geometry''. Translations of Mathematical Monographs, v. 191, American Mathematical Society.
* Paolo Gibilisco, Eva Riccomagno, Maria Piera Rogantin and Henry P. Wynn (2009). ''Algebraic and Geometric Methods in Statistics''. Cambridge University Press, Cambridge.