Bregman divergence
In mathematics, specifically statistics and information geometry, a Bregman divergence or Bregman distance is a measure of difference between two points, defined in terms of a strictly convex function; they form an important class of divergences. When the points are interpreted as probability distributions – notably as either values of the parameter of a parametric model or as a data set of observed values – the resulting distance is a statistical distance. The most basic Bregman divergence is the squared Euclidean distance.

Bregman divergences are similar to metrics, but satisfy neither the triangle inequality (ever) nor symmetry (in general). However, they satisfy a generalization of the Pythagorean theorem, and in information geometry the corresponding statistical manifold is interpreted as a (dually) flat manifold. This allows many techniques of optimization theory to be generalized to Bregman divergences, geometrically as generalizations of least squares.

Bregman divergences are named after Russian mathematician Lev M. Bregman, who introduced the concept in 1967.


Definition

Let F: \Omega \to \mathbb{R} be a continuously-differentiable, strictly convex function defined on a convex set \Omega. The Bregman distance associated with ''F'' for points p, q \in \Omega is the difference between the value of ''F'' at point ''p'' and the value of the first-order Taylor expansion of ''F'' around point ''q'' evaluated at point ''p'':

:D_F(p, q) = F(p) - F(q) - \langle \nabla F(q), p - q \rangle.
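
For concreteness, here is a minimal numerical sketch of the definition (an illustration, assuming NumPy; the helper names are hypothetical). It evaluates D_F directly from F and \nabla F, and checks that the choice F(x) = \|x\|^2 recovers the squared Euclidean distance:

```python
import numpy as np

def bregman_divergence(F, grad_F, p, q):
    """D_F(p, q) = F(p) - F(q) - <grad F(q), p - q>."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return F(p) - F(q) - np.dot(grad_F(q), p - q)

# Example: F(x) = ||x||^2 generates the squared Euclidean distance.
F = lambda x: np.dot(x, x)
grad_F = lambda x: 2 * x

p = np.array([1.0, 2.0])
q = np.array([0.5, 0.0])
print(bregman_divergence(F, grad_F, p, q))  # 4.25
print(np.sum((p - q) ** 2))                 # 4.25, the same value
```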


Properties

* Non-negativity: D_F(p, q) \ge 0 for all p, q. This is a consequence of the convexity of F.
* Positivity: When F is strictly convex, D_F(p, q) = 0 if and only if p = q.
* Uniqueness up to affine difference: D_F = D_G if and only if F - G is an affine function.
* Convexity: D_F(p, q) is convex in its first argument, but not necessarily in the second argument. If F is strictly convex, then D_F(p, q) is strictly convex in its first argument.
** For example, take f(x) = |x|, smooth it at 0, and take y = 1, x_1 = 0.1, x_2 = -0.9, x_3 = 0.9 x_1 + 0.1 x_2; then D_f(y, x_3) \approx 1 > 0.9 D_f(y, x_1) + 0.1 D_f(y, x_2) \approx 0.2, so D_f(y, \cdot) is not convex.
* Linearity: If we think of the Bregman distance as an operator on the function ''F'', then it is linear with respect to non-negative coefficients. In other words, for F_1, F_2 strictly convex and differentiable, and \lambda \ge 0,
::D_{F_1 + \lambda F_2}(p, q) = D_{F_1}(p, q) + \lambda D_{F_2}(p, q)
* Duality: If F is strictly convex, then the function F has a convex conjugate F^* which is also strictly convex and continuously differentiable on some convex set \Omega^*. The Bregman distance defined with respect to F^* is dual to D_F(p, q) in the sense that
::D_{F^*}(p^*, q^*) = D_F(q, p)
:Here, p^* = \nabla F(p) and q^* = \nabla F(q) are the dual points corresponding to p and q.
* Mean as minimizer: A key result about Bregman divergences is that, given a random vector, the mean vector minimizes the expected Bregman divergence from the random vector. This result generalizes the textbook result that the mean of a set minimizes the total squared error to the elements of the set. This result was proved for the vector case by (Banerjee et al. 2005), and extended to the case of functions/distributions by (Frigyik et al. 2008). It is important because it further justifies using a mean as a representative of a random set, particularly in Bayesian estimation. A numerical sketch of this property follows this list.
* Bregman balls are bounded, and compact if X is closed: Define the Bregman ball centered at x with radius r by B_f(x, r) := \{y \in X : D_f(y, x) \le r\}. When X \subset \R^n is finite-dimensional, \forall x \in X, if x is in the relative interior of X, or if X is locally closed at x (that is, there exists a closed ball B(x, \epsilon) centered at x such that B(x, \epsilon) \cap X is closed), then B_f(x, r) is bounded for all r. If X is closed, then B_f(x, r) is compact for all r.
* Law of cosines (see https://www.cs.utexas.edu/users/inderjit/Talks/bregtut.pdf): For any p, q, z,
::D_F(p, q) = D_F(p, z) + D_F(z, q) - (p - z)^T (\nabla F(q) - \nabla F(z))
* Parallelogram law: for any \theta, \theta_1, \theta_2,
::B_F(\theta_1 : \theta) + B_F(\theta_2 : \theta) = B_F\left(\theta_1 : \frac{\theta_1 + \theta_2}{2}\right) + B_F\left(\theta_2 : \frac{\theta_1 + \theta_2}{2}\right) + 2 B_F\left(\frac{\theta_1 + \theta_2}{2} : \theta\right)
* Bregman projection: For any W \subset \Omega, define the "Bregman projection" of q onto W: P_W(q) = \operatorname{argmin}_{\omega \in W} D_F(\omega, q). Then
** if W is convex, then the projection is unique if it exists;
** if W is closed and convex, and \Omega \subset \R^n is finite-dimensional, then the projection exists and is unique.
* Generalized Pythagorean theorem: For any v \in \Omega, a \in W,
::D_F(a, v) \ge D_F(a, P_W(v)) + D_F(P_W(v), v).
:This is an equality if P_W(v) is in the relative interior of W. In particular, this always happens when W is an affine set.
* ''Lack'' of triangle inequality: Since the Bregman divergence is essentially a generalization of squared Euclidean distance, there is no triangle inequality. Indeed, D_F(z, x) - D_F(z, y) - D_F(y, x) = \langle \nabla F(y) - \nabla F(x), z - y \rangle, which may be positive or negative.
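
Below is a small numerical illustration of the mean-as-minimizer property (a sketch of my own, assuming NumPy; it uses the generalized Kullback–Leibler divergence from the Examples section as the Bregman divergence). The empirical mean of the samples should achieve an expected divergence no larger than nearby perturbed candidates:

```python
import numpy as np

rng = np.random.default_rng(0)

def gen_kl(p, q):
    # Generalized KL divergence: the Bregman divergence of F(p) = sum_i p_i log p_i.
    return np.sum(p * np.log(p / q) - p + q)

# A "random vector" given by equally weighted positive sample points.
samples = rng.uniform(0.5, 2.0, size=(100, 3))
mean = samples.mean(axis=0)

def expected_divergence(c):
    return np.mean([gen_kl(s, c) for s in samples])

# The mean minimizes the expected divergence over the second argument,
# so any nearby perturbed candidate should do no better.
for _ in range(5):
    candidate = mean + rng.normal(scale=0.05, size=3)
    assert expected_divergence(mean) <= expected_divergence(candidate)
print("expected divergence at the mean:", expected_divergence(mean))
```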


Proofs

* Non-negativity and positivity: use Jensen's inequality.
* Uniqueness up to affine difference: Fix some x \in \Omega; then for any other y \in \Omega, we have by definition
::F(y) - G(y) = F(x) - G(x) + \langle \nabla F(x) - \nabla G(x), y - x \rangle.
* Convexity in the first argument: by definition, using the convexity of F. The same argument gives strict convexity.
* Linearity in F, law of cosines, parallelogram law: by definition.
* Duality: substitute the conjugate relations \nabla F^*(q^*) = q and F^*(p^*) = \langle p, p^* \rangle - F(p) into the definition of D_{F^*}(p^*, q^*).
* Bregman balls are bounded, and compact if X is closed: Fix x \in X. Add an affine function to f so that \nabla f(x) = 0 (this does not change D_f). Take some \epsilon > 0 such that \partial B(x, \epsilon) \subset X, and consider the "radial-directional" derivative \langle \nabla f(y), y - x \rangle of f on the Euclidean sphere \partial B(x, \epsilon). Since \partial B(x, \epsilon) \subset \R^n is compact, this derivative achieves its minimal value \delta at some y_0 \in \partial B(x, \epsilon), and since f is strictly convex, \delta > 0. Then B_f(x, r) \subset B(x, r/\delta) \cap X. Since D_f(y, x) is C^1 in y, D_f is continuous in y, thus B_f(x, r) is closed if X is.
* Projection P_W is well-defined when W is closed and convex: Fix v \in X. Take some w \in W and let r := D_f(w, v). Then the Bregman ball B_f(v, r) \cap W is closed and bounded, thus compact. Since D_f(\cdot, v) is continuous and strictly convex on it, and bounded below by 0, it achieves a unique minimum on it.
* Pythagorean inequality: By the law of cosines, D_f(w, v) - D_f(w, P_W(v)) - D_f(P_W(v), v) = \langle \nabla_y D_f(y, v) \big|_{y = P_W(v)}, w - P_W(v) \rangle, which must be \ge 0, since P_W(v) minimizes D_f(\cdot, v) in W, and W is convex.
* Pythagorean equality when P_W(v) is in the relative interior of W: If \langle \nabla_y D_f(y, v) \big|_{y = P_W(v)}, w - P_W(v) \rangle > 0, then since P_W(v) is in the relative interior of W, we could move from P_W(v) a small step in the direction opposite to w while staying in W, which would decrease D_f(y, v), a contradiction. Thus \langle \nabla_y D_f(y, v) \big|_{y = P_W(v)}, w - P_W(v) \rangle = 0.


Classification theorems

* The only symmetric Bregman divergences on X \subset \R^n are squared generalized Euclidean distances (Mahalanobis distances), that is, D_f(y, x) = (y - x)^T A (y - x) for some positive definite A.

The following two characterizations are for divergences on \Gamma_n, the set of all probability measures on \{1, 2, \ldots, n\}, with n \ge 2. Define a divergence on \Gamma_n as any function of type D: \Gamma_n \times \Gamma_n \to [0, \infty] such that D(x, x) = 0 for all x \in \Gamma_n. Then:
* The only divergence on \Gamma_n that is both a Bregman divergence and an f-divergence is the Kullback–Leibler divergence.
* If n \ge 3, then any Bregman divergence on \Gamma_n that satisfies the data processing inequality must be the Kullback–Leibler divergence. (In fact, a weaker assumption of "sufficiency" is enough.) Counterexamples exist when n = 2.

Given a Bregman divergence D_F, its "opposite", defined by D_F^*(v, w) = D_F(w, v), is generally not a Bregman divergence. For example, the Kullback–Leibler divergence is both a Bregman divergence and an f-divergence. Its reverse is also an f-divergence, but by the above characterization, the reverse KL divergence cannot be a Bregman divergence.


Examples

* Squared Euclidean distance D_F(x, y) = \|x - y\|^2 is the canonical example of a Bregman distance, generated by the convex function F(x) = \|x\|^2.
* The squared Mahalanobis distance D_F(x, y) = \tfrac{1}{2}(x - y)^T Q (x - y), which is generated by the convex function F(x) = \tfrac{1}{2} x^T Q x. This can be thought of as a generalization of the above squared Euclidean distance.
* The generalized Kullback–Leibler divergence
::D_F(p, q) = \sum_i p(i) \log \frac{p(i)}{q(i)} - \sum_i p(i) + \sum_i q(i)
:is generated by the negative entropy function
::F(p) = \sum_i p(i) \log p(i)
:When restricted to the simplex, this gives D_F(p, q) = \sum_i p(i) \log \frac{p(i)}{q(i)}, the usual Kullback–Leibler divergence.
* The Itakura–Saito distance,
::D_F(p, q) = \sum_i \left( \frac{p(i)}{q(i)} - \log \frac{p(i)}{q(i)} - 1 \right),
:is generated by the convex function
::F(p) = -\sum_i \log p(i)
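
These examples can be checked directly against the definition. The sketch below (my own illustration, assuming NumPy) verifies that the negative entropy generates the generalized Kullback–Leibler divergence and that the Burg entropy F(p) = -\sum_i \log p(i) generates the Itakura–Saito distance:

```python
import numpy as np

def bregman(F, grad_F, p, q):
    """Generic Bregman divergence D_F(p, q) from the definition."""
    return F(p) - F(q) - np.dot(grad_F(q), p - q)

p = np.array([0.2, 0.5, 0.3])
q = np.array([0.4, 0.4, 0.2])

# Negative entropy F(p) = sum_i p(i) log p(i)  ->  generalized KL divergence.
neg_entropy = lambda x: np.sum(x * np.log(x))
grad_neg_entropy = lambda x: np.log(x) + 1
gen_kl = np.sum(p * np.log(p / q) - p + q)
assert np.isclose(bregman(neg_entropy, grad_neg_entropy, p, q), gen_kl)

# Burg entropy F(p) = -sum_i log p(i)  ->  Itakura-Saito distance.
burg = lambda x: -np.sum(np.log(x))
grad_burg = lambda x: -1.0 / x
itakura_saito = np.sum(p / q - np.log(p / q) - 1)
assert np.isclose(bregman(burg, grad_burg, p, q), itakura_saito)
```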


Generalizing projective duality

A key tool in computational geometry is the idea of projective duality, which maps points to hyperplanes and vice versa, while preserving incidence and above-below relationships. There are numerous analytical forms of the projective dual: one common form maps the point p = (p_1, \ldots, p_d) to the hyperplane x_{d+1} = \sum_{i=1}^d 2 p_i x_i. This mapping can be interpreted (identifying the hyperplane with its normal) as the convex conjugate mapping that takes the point p to its dual point p^* = \nabla F(p), where ''F'' defines the ''d''-dimensional paraboloid x_{d+1} = \sum_i x_i^2.

If we now replace the paraboloid by an arbitrary convex function, we obtain a different dual mapping that retains the incidence and above-below properties of the standard projective dual. This implies that natural dual concepts in computational geometry like Voronoi diagrams and Delaunay triangulations retain their meaning in distance spaces defined by an arbitrary Bregman divergence. Thus, algorithms from "normal" geometry extend directly to these spaces (Boissonnat, Nielsen and Nock, 2010).
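
As a small worked check (not from the cited source), take the paraboloid case F(x) = \sum_{i=1}^d x_i^2. Then
:\nabla F(p) = 2p = (2 p_1, \ldots, 2 p_d),
which is exactly the slope vector of the hyperplane x_{d+1} = \sum_{i=1}^d 2 p_i x_i assigned to p by the classical point–hyperplane duality, so the conjugate mapping p \mapsto \nabla F(p) reproduces the standard construction in this case.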


Generalization of Bregman divergences

Bregman divergences can be interpreted as limit cases of skewed Jensen divergences (see Nielsen and Boltz, 2011). Jensen divergences can be generalized using comparative convexity, and limit cases of these generalized skewed Jensen divergences yield generalized Bregman divergences (see Nielsen and Nock, 2017). The Bregman chord divergence is obtained by taking a chord instead of a tangent line.
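
To make the limit statement concrete (using one common normalization; conventions vary across papers, so this should be read as an illustrative form rather than the exact definition used in the cited works), define the skewed Jensen divergence
:J_F^{\alpha}(p : q) = \alpha F(p) + (1 - \alpha) F(q) - F(\alpha p + (1 - \alpha) q), \qquad \alpha \in (0, 1).
A first-order Taylor expansion of F around p then gives
:\lim_{\alpha \to 1^-} \frac{1}{1 - \alpha} J_F^{\alpha}(p : q) = F(q) - F(p) - \langle \nabla F(p), q - p \rangle = D_F(q, p),
so suitably rescaled skewed Jensen divergences recover a Bregman divergence in the limit.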


Bregman divergence on other objects

Bregman divergences can also be defined between matrices, between functions, and between measures (distributions). Bregman divergences between matrices include Stein's loss and the von Neumann entropy. Bregman divergences between functions include total squared error, relative entropy, and squared bias; see Frigyik et al. (2008) for definitions and properties. Similarly, Bregman divergences have also been defined over sets, through a submodular set function, which is known as the discrete analog of a convex function. The submodular Bregman divergences subsume a number of discrete distance measures, like the Hamming distance, precision and recall, mutual information, and some other set-based distance measures; see Iyer & Bilmes (2012) for more details and properties of the submodular Bregman. For a list of common matrix Bregman divergences, see Table 15.1 in.
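
As an illustrative sketch of the matrix case (my own example, not taken from the sources above): choosing F(X) = -\log\det X on symmetric positive-definite matrices yields Stein's loss, and the generic definition with the trace inner product can be checked numerically (assuming NumPy):

```python
import numpy as np

def stein_loss(X, Y):
    """Bregman divergence of F(X) = -log det X on SPD matrices (Stein's loss)."""
    n = X.shape[0]
    XYinv = X @ np.linalg.inv(Y)
    return np.trace(XYinv) - np.log(np.linalg.det(XYinv)) - n

def matrix_bregman(F, grad_F, X, Y):
    """Generic matrix Bregman divergence with the trace inner product."""
    return F(X) - F(Y) - np.trace(grad_F(Y).T @ (X - Y))

F = lambda X: -np.log(np.linalg.det(X))
grad_F = lambda X: -np.linalg.inv(X)

X = np.array([[2.0, 0.3], [0.3, 1.0]])
Y = np.array([[1.5, 0.0], [0.0, 2.0]])
assert np.isclose(stein_loss(X, Y), matrix_bregman(F, grad_F, X, Y))
```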


Applications

In machine learning, Bregman divergences are used to calculate the bi-tempered logistic loss, which performs better than the softmax function on noisy datasets (Ehsan Amid, Manfred K. Warmuth, Rohan Anil, Tomer Koren (2019). "Robust Bi-Tempered Logistic Loss Based on Bregman Divergences". Conference on Neural Information Processing Systems, pp. 14987–14996). Bregman divergence is used in the formulation of mirror descent, which includes optimization algorithms used in machine learning such as gradient descent and the hedge algorithm.
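
A minimal sketch of mirror descent follows (my own illustration, assuming NumPy; the objective and step size are arbitrary choices). With the negative entropy as the mirror map, the induced Bregman divergence is the KL divergence and each iteration becomes a hedge-style multiplicative update on the probability simplex:

```python
import numpy as np

def mirror_descent_simplex(grad, x0, steps=200, eta=0.1):
    """Mirror descent on the probability simplex with the negative-entropy
    mirror map; each step is an exponentiated-gradient update."""
    x = np.asarray(x0, dtype=float)
    for _ in range(steps):
        x = x * np.exp(-eta * grad(x))  # gradient step in the dual space
        x /= x.sum()                    # KL (Bregman) projection back onto the simplex
    return x

# Illustrative objective: a linear cost f(x) = <c, x> minimized over the simplex.
c = np.array([0.3, 0.1, 0.6])
x_opt = mirror_descent_simplex(lambda x: c, np.ones(3) / 3)
print(x_opt)  # mass concentrates on the coordinate with the smallest cost
```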

