In information geometry, a divergence is a kind of statistical distance: a binary function which establishes the separation from one probability distribution to another on a statistical manifold.
The simplest divergence is squared Euclidean distance (SED), and divergences can be viewed as generalizations of SED. The other most important divergence is relative entropy (Kullback–Leibler divergence, KL divergence), which is central to information theory. There are numerous other specific divergences and classes of divergences, notably ''f''-divergences and Bregman divergences (see below).
Definition
Given a differentiable manifold ''M'' of dimension ''n'', a divergence on ''M'' is a C^2-function D : M \times M \to [0, \infty) satisfying:
# D(p, q) \geq 0 for all p, q \in M (non-negativity),
# D(p, q) = 0 if and only if p = q (positivity),
# At every point p \in M, D(p, p + dp) is a positive-definite quadratic form for infinitesimal displacements dp from p.
In applications to statistics, the manifold ''M'' is typically the space of parameters of a parametric family of probability distributions.
Condition 3 means that D defines an inner product on the tangent space T_p M for every p \in M. Since D is C^2 on M, this defines a Riemannian metric g on M.
Locally at p \in M, we may construct a local coordinate chart with coordinates x; then the divergence is
: D(x(p), x(p) + dx) = \tfrac{1}{2} dx^\mathsf{T} g_p(x)\, dx + O(|dx|^3),
where g_p(x) is a matrix of size n \times n. It is the Riemannian metric at the point p expressed in coordinates x.
Dimensional analysis of condition 3 shows that divergence has the dimension of squared distance.
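As a concrete numerical sketch (an illustration added here, not part of the original text): the squared Euclidean distance satisfies all three conditions, and with the ½-convention above its metric matrix is g_p = 2I, which a finite-difference Hessian recovers:

```python
import numpy as np

def sed(p, q):
    """Squared Euclidean distance, the simplest divergence."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return float(np.sum((p - q) ** 2))

def hessian_q(D, p, h=1e-4):
    """Estimate the Hessian of D(p, q) in q at q = p by central differences."""
    n = len(p)
    H = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            for si in (+1.0, -1.0):
                for sj in (+1.0, -1.0):
                    q = np.array(p, float)
                    q[i] += si * h
                    q[j] += sj * h
                    H[i, j] += si * sj * D(p, q)
    return H / (4 * h * h)

p = np.array([0.3, 0.7])
assert sed(p, p) == 0.0          # condition 2: zero exactly on the diagonal
assert sed(p, [0.4, 0.6]) > 0.0  # condition 1: positive off the diagonal

H = hessian_q(sed, p)
print(H)  # approximately 2 * identity, i.e. g_p = 2I under the 1/2-convention
```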
The dual divergence D^* is defined as
: D^*(p, q) = D(q, p).
When we wish to contrast D against D^*, we refer to D as the primal divergence.
Given any divergence D, its symmetrized version is obtained by averaging it with its dual divergence:
: D_S(p, q) = \tfrac{1}{2}\bigl(D(p, q) + D(q, p)\bigr).
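For instance (a sketch added here, using the KL divergence on the Bernoulli family as a standard example), the dual divergence simply swaps arguments, and the symmetrized version is the Jeffreys divergence:

```python
import math

def kl_bernoulli(p, q):
    """KL divergence D(p ‖ q) between Bernoulli(p) and Bernoulli(q)."""
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def kl_dual(p, q):
    """Dual divergence D*(p, q) = D(q, p)."""
    return kl_bernoulli(q, p)

def kl_symmetrized(p, q):
    """Symmetrized (Jeffreys) divergence: the average of D and D*."""
    return 0.5 * (kl_bernoulli(p, q) + kl_dual(p, q))

p, q = 0.2, 0.6
print(kl_bernoulli(p, q), kl_bernoulli(q, p))  # unequal: KL is asymmetric
print(kl_symmetrized(p, q))                    # equals kl_symmetrized(q, p)
```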
Difference from other similar concepts
Unlike metrics, divergences are not required to be symmetric, and the asymmetry is important in applications. Accordingly, one often refers asymmetrically to the divergence "of ''q'' from ''p''" or "from ''p'' to ''q''", rather than "between ''p'' and ''q''". Secondly, divergences generalize ''squared'' distance, not linear distance, and thus do not satisfy the triangle inequality, but some divergences (such as the Bregman divergence) do satisfy generalizations of the Pythagorean theorem.
In general statistics and probability, "divergence" generally refers to any kind of function D(p, q), where p, q are probability distributions or other objects under consideration, such that conditions 1 and 2 are satisfied. Condition 3 is required for "divergence" as used in information geometry.
As an example, the total variation distance, a commonly used statistical divergence, does not satisfy condition 3.
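The failure can be seen numerically: between nearby distributions, total variation shrinks linearly in the displacement, while KL shrinks quadratically, and only the quadratic behavior matches condition 3. A minimal sketch on the Bernoulli family (an illustration added here, not from the original text):

```python
import math

def tv_bernoulli(p, q):
    """Total variation distance between Bernoulli(p) and Bernoulli(q)."""
    return 0.5 * (abs(p - q) + abs((1 - p) - (1 - q)))  # equals |p - q|

def kl_bernoulli(p, q):
    """KL divergence between Bernoulli(p) and Bernoulli(q)."""
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

p = 0.5
for eps in (1e-1, 1e-2, 1e-3):
    # TV shrinks like eps (first order); KL shrinks like eps**2,
    # which is the quadratic-form behavior condition 3 demands.
    print(eps, tv_bernoulli(p, p + eps), kl_bernoulli(p, p + eps))
```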
Notation
Notation for divergences varies significantly between fields, though there are some conventions.
Divergences are generally notated with an uppercase 'D', as in D(x, y), to distinguish them from metric distances, which are notated with a lowercase 'd'. When multiple divergences are in use, they are commonly distinguished with subscripts, as in D_\text{KL} for the Kullback–Leibler divergence (KL divergence).
Often a different separator between parameters is used, particularly to emphasize the asymmetry. In information theory, a double bar is commonly used: D(p \parallel q); this is similar to, but distinct from, the notation for conditional probability, P(A \mid B), and emphasizes interpreting the divergence as a relative measurement, as in relative entropy; this notation is common for the KL divergence. A colon may be used instead, as D(p : q); this emphasizes the relative information supporting the two distributions.
The notation for parameters varies as well. Uppercase P, Q interprets the parameters as probability distributions, while lowercase p, q or x, y interprets them geometrically as points in a space, and \mu, \nu or m, n interprets them as measures.
Geometrical properties
Many properties of divergences can be derived if we restrict ''S'' to be a statistical manifold, meaning that it can be parametrized with a finite-dimensional coordinate system ''θ'', so that for a distribution p \in S we can write p = p(\theta).
For a pair of points p, q \in S with coordinates ''θ''_''p'' and ''θ''_''q'', denote the partial derivatives of ''D''(''p'', ''q'') as
: D\bigl((\partial_i)_p, q\bigr) = \tfrac{\partial}{\partial\theta_p^i} D(p, q),
: D\bigl((\partial_i \partial_j)_p, (\partial_k)_q\bigr) = \tfrac{\partial}{\partial\theta_p^i} \tfrac{\partial}{\partial\theta_p^j} \tfrac{\partial}{\partial\theta_q^k} D(p, q), \quad \text{etc.}
Now we restrict these functions to the diagonal p = q, and denote
: D[\partial_i, \cdot] : p \mapsto D\bigl((\partial_i)_p, p\bigr),
: D[\partial_i, \partial_j] : p \mapsto D\bigl((\partial_i)_p, (\partial_j)_p\bigr), \quad \text{etc.}
By definition, the function ''D''(''p'', ''q'') is minimized at p = q, and therefore
: D[\partial_i, \cdot] = D[\cdot, \partial_i] = 0,
: D[\partial_i \partial_j, \cdot] = D[\cdot, \partial_i \partial_j] = -D[\partial_i, \partial_j] \equiv g_{ij}^{(D)},
where the matrix g^{(D)} is positive semi-definite and defines a unique Riemannian metric on the manifold ''S''.
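As an illustrative numerical check (assuming the Bernoulli family and the KL divergence, neither fixed by the text above): the mixed second derivative -D[\partial_i, \partial_j] on the diagonal recovers the Fisher information 1/(θ(1−θ)):

```python
import math

def kl(tp, tq):
    """KL divergence between Bernoulli(tp) and Bernoulli(tq)."""
    return tp * math.log(tp / tq) + (1 - tp) * math.log((1 - tp) / (1 - tq))

def metric_from_divergence(D, theta, h=1e-4):
    """g = -D[∂_p, ∂_q]: the mixed second derivative on the diagonal,
    estimated with a central finite difference."""
    return -(D(theta + h, theta + h) - D(theta + h, theta - h)
             - D(theta - h, theta + h) + D(theta - h, theta - h)) / (4 * h * h)

theta = 0.3
g = metric_from_divergence(kl, theta)
fisher = 1.0 / (theta * (1 - theta))  # Fisher information of Bernoulli(theta)
print(g, fisher)                      # approximately equal
```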
Divergence ''D''(·, ·) also defines a unique torsion-free affine connection \nabla^{(D)} with coefficients
: \Gamma_{ij,k}^{(D)} = -D[\partial_i \partial_j, \partial_k],
and the dual to this connection, \nabla^*, is generated by the dual divergence ''D''*.
Thus, a divergence ''D''(·, ·) generates on a statistical manifold a unique dualistic structure (g^{(D)}, \nabla^{(D)}, \nabla^{(D^*)}). The converse is also true: every torsion-free dualistic structure on a statistical manifold is induced from some globally defined divergence function (which however need not be unique).
For example, when ''D'' is an ''f''-divergence
: D_f(p, q) = \int p(x)\, f\!\left(\frac{q(x)}{p(x)}\right) dx
for some function ''f''(·), then it generates the metric g^{(D_f)} = c \cdot g and the connection \nabla^{(D_f)} = \nabla^{(\alpha)}, where ''g'' is the canonical Fisher information metric, \nabla^{(\alpha)} is the α-connection, c = f''(1), and \alpha = 3 + 2 f'''(1)/f''(1).
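A numerical sketch of the relation g^{(D_f)} = f''(1) \cdot g (an illustration added here, using the squared-Hellinger generator f(t) = (\sqrt{t} - 1)^2, for which f''(1) = 1/2, on the Bernoulli family):

```python
import math

def f(t):
    """Generator of the squared Hellinger distance, an f-divergence."""
    return (math.sqrt(t) - 1.0) ** 2

def f_divergence(tp, tq):
    """D_f(p, q) = sum_x p(x) f(q(x)/p(x)) on the Bernoulli family."""
    return tp * f(tq / tp) + (1 - tp) * f((1 - tq) / (1 - tp))

def metric(D, theta, h=1e-4):
    """-D[∂_p, ∂_q] on the diagonal, by central finite differences."""
    return -(D(theta + h, theta + h) - D(theta + h, theta - h)
             - D(theta - h, theta + h) + D(theta - h, theta - h)) / (4 * h * h)

theta = 0.3
fisher = 1.0 / (theta * (1 - theta))  # canonical Fisher metric of Bernoulli
c = 0.5                               # f''(1) = 1/2 for this generator
print(metric(f_divergence, theta), c * fisher)  # approximately equal
```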
Examples
The two most important divergences are the relative entropy (Kullback–Leibler divergence, KL divergence), which is central to information theory and statistics, and the squared Euclidean distance (SED). Minimizing these two divergences is the main way that linear inverse problems are solved, via the principle of maximum entropy and least squares, notably in logistic regression and linear regression.
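As a toy illustration (an example added here, not from the text): fitting a constant to binary data by least squares minimizes the summed SED, while maximum likelihood minimizes the summed KL divergence from each observation, which for 0/1 labels is exactly the cross-entropy loss of logistic regression; both land on the sample mean:

```python
import numpy as np

# Toy data: binary labels.
y = np.array([1, 0, 1, 1, 0, 1, 1, 0], float)

thetas = np.linspace(0.01, 0.99, 981)  # candidate constant fits

# Least squares: minimize the summed squared Euclidean distance.
sed_loss = np.array([np.sum((y - t) ** 2) for t in thetas])

# Maximum likelihood: minimize the summed KL divergence from each degenerate
# observation Bernoulli(y_i) to the model Bernoulli(theta); for 0/1 labels
# this is the cross-entropy loss.
ce_loss = np.array([-np.sum(y * np.log(t) + (1 - y) * np.log(1 - t))
                    for t in thetas])

print(thetas[np.argmin(sed_loss)], thetas[np.argmin(ce_loss)], y.mean())
# both minimizers sit at the sample mean
```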
The two most important classes of divergences are the ''f''-divergences and Bregman divergences; however, other types of divergence functions are also encountered in the literature. The only divergence that is both an ''f''-divergence and a Bregman divergence is the Kullback–Leibler divergence; the squared Euclidean divergence is a Bregman divergence (corresponding to the function x^2), but not an ''f''-divergence.
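To make this concrete (a sketch added here, using the standard Bregman definition B_F(p, q) = F(p) - F(q) - \langle \nabla F(q), p - q \rangle, which the text above does not spell out): the generator F(x) = \|x\|^2 yields the squared Euclidean distance, and the negative entropy F(x) = \sum x \log x yields the KL divergence on probability vectors:

```python
import numpy as np

def bregman(F, gradF, p, q):
    """Bregman divergence B_F(p, q) = F(p) - F(q) - <∇F(q), p - q>."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return F(p) - F(q) - float(gradF(q) @ (p - q))

p = np.array([0.2, 0.8])
q = np.array([0.5, 0.5])

# Generator F(x) = ||x||^2 recovers the squared Euclidean distance.
sq = lambda x: float(x @ x)
grad_sq = lambda x: 2 * x
print(bregman(sq, grad_sq, p, q), float((p - q) @ (p - q)))  # equal

# Generator F(x) = sum x log x (negative entropy) recovers the KL
# divergence for probability vectors p, q.
negent = lambda x: float(np.sum(x * np.log(x)))
grad_negent = lambda x: np.log(x) + 1
print(bregman(negent, grad_negent, p, q),
      float(np.sum(p * np.log(p / q))))  # equal
```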
f-divergences
Given a convex function