A Stein discrepancy is a
statistical divergence between two
probability measures that is rooted in
Stein's method. It was first formulated as a tool to assess the quality of
Markov chain Monte Carlo
samplers,
[J. Gorham and L. Mackey. Measuring Sample Quality with Stein's Method. Advances in Neural Information Processing Systems, 2015.] but has since been used in diverse settings in statistics, machine learning and computer science.
Definition
Let $\mathcal{X}$ be a measurable space and let $\mathcal{M}$ be a set of measurable functions of the form $m : \mathcal{X} \to \mathbb{R}$. A natural notion of distance between two probability distributions $P$, $Q$, defined on $\mathcal{X}$, is provided by an integral probability metric
: $d_{\mathcal{M}}(P,Q) := \sup_{m \in \mathcal{M}} \left| \mathbb{E}_{X \sim P}[m(X)] - \mathbb{E}_{Y \sim Q}[m(Y)] \right| , \qquad (1.1)$
where for the purposes of exposition we assume that the expectations exist, and that the set $\mathcal{M}$ is sufficiently rich that (1.1) is indeed a metric on the set of probability distributions on $\mathcal{X}$, i.e. $d_{\mathcal{M}}(P,Q) = 0$ if and only if $P = Q$. The choice of the set $\mathcal{M}$ determines the topological properties of (1.1). However, for practical purposes the evaluation of (1.1) requires access to both $P$ and $Q$, often rendering direct computation of (1.1) impractical.
Stein's method is a theoretical tool that can be used to bound (1.1). Specifically, we suppose that we can identify an operator $\mathcal{A}_P$ and a set $\mathcal{F}_P$ of real-valued functions in the domain of $\mathcal{A}_P$, both of which may be $P$-dependent, such that for each $m \in \mathcal{M}$ there exists a solution $f_m \in \mathcal{F}_P$ to the ''Stein equation''
: $m(x) - \mathbb{E}_{X \sim P}[m(X)] = \mathcal{A}_P f_m(x) . \qquad (1.2)$
The operator $\mathcal{A}_P$ is termed a ''Stein operator'' and the set $\mathcal{F}_P$ is called a ''Stein set''. Substituting (1.2) into (1.1), we obtain an upper bound
: $d_{\mathcal{M}}(P,Q) = \sup_{m \in \mathcal{M}} \left| \mathbb{E}_{Y \sim Q}\!\left[ m(Y) - \mathbb{E}_{X \sim P}[m(X)] \right] \right| = \sup_{m \in \mathcal{M}} \left| \mathbb{E}_{Y \sim Q}[ \mathcal{A}_P f_m(Y) ] \right| \leq \sup_{f \in \mathcal{F}_P} \left| \mathbb{E}_{Y \sim Q}[ \mathcal{A}_P f(Y) ] \right| .$
This resulting bound
: $D_P(Q) := \sup_{f \in \mathcal{F}_P} \left| \mathbb{E}_{Y \sim Q}[ \mathcal{A}_P f(Y) ] \right|$
is called a ''Stein discrepancy''.
In contrast to the original integral probability metric $d_{\mathcal{M}}$, it may be possible to analyse or compute $D_P(Q)$ using expectations only with respect to the distribution $Q$.
Examples
Several different Stein discrepancies have been studied, with some of the most widely used presented next.
Classical Stein discrepancy
For a probability distribution $P$ with positive and differentiable density function $p$ on a convex set $\mathcal{X} \subseteq \mathbb{R}^d$, whose boundary is denoted $\partial \mathcal{X}$, the combination of the ''Langevin–Stein operator'' $\mathcal{A}_P f = \nabla \cdot f + \langle \nabla \log p , f \rangle$ and the ''classical Stein set''
: $\mathcal{F}_P = \left\{ f : \mathcal{X} \to \mathbb{R}^d \;\middle|\; \sup_{x \neq y} \max\!\left( \| f(x) \| , \| \nabla f(x) \| , \frac{\| \nabla f(x) - \nabla f(y) \|}{\| x - y \|} \right) \leq 1 , \; \langle f(x) , n(x) \rangle = 0 \text{ for all } x \in \partial \mathcal{X} \right\}$
yields the ''classical Stein discrepancy''. Here $\| \cdot \|$ denotes the Euclidean norm and $\langle \cdot , \cdot \rangle$ the Euclidean inner product, while $\| M \|$ denotes the associated operator norm for matrices $M \in \mathbb{R}^{d \times d}$, and $n(x)$ denotes the outward unit normal to $\partial \mathcal{X}$ at location $x$. If $\mathcal{X} = \mathbb{R}^d$ then we interpret $\partial \mathcal{X} = \emptyset$.
In the univariate case $d = 1$, the classical Stein discrepancy can be computed exactly by solving a quadratically constrained quadratic program.
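For instance, if $P$ is the standard Gaussian distribution on $\mathbb{R}$, so that $\nabla \log p(x) = -x$, the Langevin–Stein operator reduces to $\mathcal{A}_P f(x) = f'(x) - x f(x)$, and integration by parts gives the Stein identity $\mathbb{E}_{X \sim P}[ f'(X) - X f(X) ] = 0$ for every sufficiently regular $f$; consequently the classical Stein discrepancy vanishes when $Q = P$.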
Graph Stein discrepancy
The first known computable Stein discrepancies were the graph Stein discrepancies (GSDs). Given a discrete distribution $Q = \sum_{i=1}^n w_i \delta(x_i)$, one can define the graph $G = (V, E)$ with vertex set $V = \{x_1, \ldots, x_n\}$ and edge set $E \subseteq V \times V$. From this graph, one can define the ''graph Stein set'' as a finite relaxation of the classical Stein set, in which the smoothness constraints are imposed only at the vertices and along the edges of $G$.
The combination of the Langevin–Stein operator and the graph Stein set is called the ''graph Stein discrepancy'' (GSD).
The GSD is the solution of a finite-dimensional linear program, with the size of the program as low as linear in $n$, meaning that the GSD can be efficiently computed.
Kernel Stein discrepancy
The supremum arising in the definition of Stein discrepancy can be evaluated in closed form using a particular choice of Stein set. Indeed, let $\mathcal{F}_P$ be the unit ball in a (possibly vector-valued) reproducing kernel Hilbert space $\mathcal{H}(K)$ with reproducing kernel $K$, whose elements are in the domain of the Stein operator $\mathcal{A}_P$. Suppose that
* For each fixed $x \in \mathcal{X}$, the map $f \mapsto \mathcal{A}_P f(x)$ is a continuous linear functional on $\mathcal{H}(K)$.
* $\mathbb{E}_{X \sim Q}\left[ \sqrt{ \mathcal{A}_P \mathcal{A}_P' K(X, X) } \right] < \infty$,
where the Stein operator $\mathcal{A}_P$ acts on the first argument of $K$ and $\mathcal{A}_P'$ acts on the second argument. Then it can be shown
[Matsubara, T., Knoblauch, J., Briol, F-X., Oates, C. J. Robust Generalised Bayesian Inference for Intractable Likelihoods. arXiv:2104.07359.] that
: $D_P(Q) = \sqrt{ \mathbb{E}_{X, X' \sim Q}\left[ \mathcal{A}_P \mathcal{A}_P' K(X, X') \right] } ,$
where the random variables $X$ and $X'$ in the expectation are independent. In particular, if $Q = \sum_{i=1}^n w_i \delta(x_i)$ is a discrete distribution on $\mathcal{X}$, then the Stein discrepancy takes the closed form
: $D_P(Q) = \sqrt{ \sum_{i=1}^n \sum_{j=1}^n w_i w_j \, \mathcal{A}_P \mathcal{A}_P' K(x_i, x_j) } .$
A Stein discrepancy constructed in this manner is called a ''kernel Stein discrepancy''
[Oates, C. J., Girolami, M., & Chopin, N. (2017). Control functionals for Monte Carlo integration. Journal of the Royal Statistical Society B: Statistical Methodology, 79(3), 695–718.][Liu, Q., Lee, J. D., & Jordan, M. I. (2016). A kernelized Stein discrepancy for goodness-of-fit tests and model evaluation. International Conference on Machine Learning, 276–284.][Chwialkowski, K., Strathmann, H., & Gretton, A. (2016). A kernel test of goodness of fit. International Conference on Machine Learning, 2606–2615.]
[Gorham J, Mackey L. Measuring sample quality with kernels. In International Conference on Machine Learning 2017 (pp. 1292-1301). PMLR.] and the construction is closely connected to the theory of
kernel embedding of probability distributions.
Let $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ be a reproducing kernel. For a probability distribution $P$ with positive and differentiable density function $p$ on $\mathcal{X} = \mathbb{R}^d$, the combination of the Langevin–Stein operator $\mathcal{A}_P f = \nabla \cdot f + \langle \nabla \log p , f \rangle$ and the Stein set
: $\mathcal{F}_P = \left\{ f : \mathcal{X} \to \mathbb{R}^d \, : \, \| f \|_{\mathcal{H}(K)} \leq 1 \right\} ,$
associated to the matrix-valued reproducing kernel $K(x,y) = k(x,y) I_{d \times d}$, yields a kernel Stein discrepancy with
: $\mathcal{A}_P \mathcal{A}_P' K(x,y) = \nabla_x \cdot \nabla_y k(x,y) + \langle \nabla_x k(x,y) , \nabla_y \log p(y) \rangle + \langle \nabla_y k(x,y) , \nabla_x \log p(x) \rangle + k(x,y) \, \langle \nabla_x \log p(x) , \nabla_y \log p(y) \rangle , \qquad (2.1)$
where $\nabla_x$ (resp. $\nabla_y$) indicates the gradient with respect to the argument indexed by $x$ (resp. $y$).
Concretely, if we take the ''inverse multi-quadric'' kernel $k(x,y) = \left( 1 + (x - y)^\top \Sigma^{-1} (x - y) \right)^{-\beta}$ with parameters $\beta > 0$ and $\Sigma$ a symmetric positive definite matrix, and if we denote $u(x) = \nabla_x \log p(x)$, then we have
: $\mathcal{A}_P \mathcal{A}_P' K(x,y) = - \frac{4 \beta (\beta + 1) \, \| \Sigma^{-1} (x - y) \|^2}{\left( 1 + (x - y)^\top \Sigma^{-1} (x - y) \right)^{\beta + 2}} + \frac{2 \beta \left[ \operatorname{tr}(\Sigma^{-1}) + \langle u(x) - u(y) , \Sigma^{-1} (x - y) \rangle \right]}{\left( 1 + (x - y)^\top \Sigma^{-1} (x - y) \right)^{\beta + 1}} + \frac{\langle u(x) , u(y) \rangle}{\left( 1 + (x - y)^\top \Sigma^{-1} (x - y) \right)^{\beta}} .$
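The following sketch illustrates how the closed form above is evaluated for a weighted sample; it assumes, for concreteness, the simplified inverse multi-quadric kernel with $\Sigma = I$ and a target whose score function $\nabla \log p$ is available, and the function names are illustrative rather than part of any established library.
<syntaxhighlight lang="python">
import numpy as np

def imq_stein_kernel(x, y, score, beta=0.5):
    """Langevin Stein kernel k_P(x, y) for the inverse multi-quadric base
    kernel k(x, y) = (1 + ||x - y||^2)^(-beta), with Sigma = I."""
    d = x - y
    r2 = np.dot(d, d)
    base = 1.0 + r2
    k = base ** (-beta)
    grad_x_k = -2.0 * beta * base ** (-beta - 1) * d   # gradient of k in x
    grad_y_k = -grad_x_k                               # gradient of k in y
    # divergence term: sum_j d^2 k / (dx_j dy_j)
    div_xy_k = 2.0 * beta * (x.size * base ** (-beta - 1)
                             - 2.0 * (beta + 1) * r2 * base ** (-beta - 2))
    sx, sy = score(x), score(y)
    return (div_xy_k + np.dot(grad_x_k, sy) + np.dot(grad_y_k, sx)
            + k * np.dot(sx, sy))

def kernel_stein_discrepancy(points, weights, score, beta=0.5):
    """KSD of the discrete distribution sum_i weights[i] * delta(points[i])
    from the target P, via the double-sum closed form."""
    total = 0.0
    for i, xi in enumerate(points):
        for j, xj in enumerate(points):
            total += weights[i] * weights[j] * imq_stein_kernel(xi, xj, score, beta)
    return np.sqrt(total)

# Example: assess a sample against a standard Gaussian target, whose score is -x.
rng = np.random.default_rng(0)
sample = rng.normal(size=(200, 2))
uniform_weights = np.full(200, 1.0 / 200)
print(kernel_stein_discrepancy(sample, uniform_weights, score=lambda x: -x))
</syntaxhighlight>
Under the regularity conditions discussed below, larger samples drawn from the target itself drive the reported value towards zero.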
Diffusion Stein discrepancy
''Diffusion Stein discrepancies''
[Gorham, J., Duncan, A. B., Vollmer, S. J., & Mackey, L. (2019). Measuring sample quality with diffusions. The Annals of Applied Probability, 29(5), 2884-2928.] generalize the Langevin–Stein operator $\mathcal{A}_P f = \nabla \cdot f + \langle \nabla \log p , f \rangle$ to a class of ''diffusion Stein operators'' $\mathcal{A}_P f = \frac{1}{p} \nabla \cdot ( p \, m \, f )$, each representing an Itô diffusion that has $P$ as its stationary distribution.
Here, $m$ is a matrix-valued function determined by the infinitesimal generator of the diffusion.
Other Stein discrepancies
Additional Stein discrepancies have been developed for constrained domains,
[Shi, J., Liu, C., & Mackey, L. (2021). Sampling with Mirrored Stein Operators. arXiv preprint arXiv:2106.12506] non-Euclidean domains, discrete domains,
[Shi J, Zhou Y, Hwang J, Titsias M, Mackey L. Gradient Estimation with Discrete Stein Operators. arXiv preprint arXiv:2202.09497. 2022.] improved scalability,
[Huggins JH, Mackey L. Random Feature Stein Discrepancies. In NeurIPS 2018.][Gorham J, Raj A, Mackey L. Stochastic Stein Discrepancies. In NeurIPS 2020.] and gradient-free settings in which derivatives of the density $p$ are circumvented.
Properties
The flexibility in the choice of Stein operator and Stein set in the construction of Stein discrepancy precludes general statements of a theoretical nature. However, much is known about the particular Stein discrepancies described above.
Computable without the normalisation constant
Stein discrepancy can sometimes be computed in challenging settings where the probability distribution $P$ admits a probability density function $p$ (with respect to an appropriate reference measure on $\mathcal{X}$) of the form $p(x) = \tilde{p}(x) / Z$, where $\tilde{p}$ and its derivative can be numerically evaluated but whose normalisation constant $Z$ is not easily computed or approximated. Considering (2.1), we observe that the dependence of the kernel Stein discrepancy on $p$ occurs only through the term
: $\nabla_x \log p(x) = \frac{\nabla_x \tilde{p}(x)}{\tilde{p}(x)} ,$
which does not depend on the normalisation constant $Z$.
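As a minimal illustration of this point, the score $\nabla_x \log p(x)$ can be evaluated directly from an unnormalised density; the specific density below is chosen only for concreteness.
<syntaxhighlight lang="python">
import numpy as np

def log_p_tilde(x):
    """Logarithm of an unnormalised density p_tilde(x) = exp(-x**4 / 4);
    the normalisation constant Z is unknown and never required."""
    return -x ** 4 / 4.0

def score(x, eps=1e-5):
    """Central-difference approximation of d/dx log p(x).  Since
    log p = log p_tilde - log Z and log Z is constant, it cancels here."""
    return (log_p_tilde(x + eps) - log_p_tilde(x - eps)) / (2.0 * eps)

print(score(1.0))  # approximately -1.0, because d/dx(-x**4/4) = -x**3
</syntaxhighlight>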
Stein discrepancy as a statistical divergence
A basic requirement of Stein discrepancy is that it is a statistical divergence, meaning that $D_P(Q) \geq 0$ for all $P$, $Q$ and $D_P(Q) = 0$ if and only if $P = Q$. This property can be shown to hold for the classical Stein discrepancy and the kernel Stein discrepancy, provided that appropriate regularity conditions hold.
Convergence control
A stronger property, compared to being a statistical divergence, is ''convergence control'', meaning that $D_P(Q_n) \to 0$ implies that $Q_n$ converges to $P$ in a sense to be specified. For example, under appropriate regularity conditions, both the classical Stein discrepancy and the graph Stein discrepancy enjoy ''Wasserstein convergence control'', meaning that $D_P(Q_n) \to 0$ implies that the Wasserstein metric between $Q_n$ and $P$ converges to zero.
[Mackey, L., & Gorham, J. (2016). Multivariate Stein factors for a class of strongly log-concave distributions. Electronic Communications in Probability, 21, 1-14.] For the kernel Stein discrepancy, ''weak convergence control'' has been established
[Chen WY, Mackey L, Gorham J, Briol FX, Oates CJ. Stein points. In International Conference on Machine Learning 2018 (pp. 844-853). PMLR.] under regularity conditions on the distribution $P$ and the reproducing kernel $k$, which are applicable in particular to (2.1). Other well-known choices of $k$, such as the Gaussian kernel, provably do not enjoy weak convergence control.
Convergence detection
The converse property to convergence control is ''convergence detection'', meaning that $D_P(Q_n) \to 0$ whenever $Q_n$ converges to $P$ in a sense to be specified. For example, under appropriate regularity conditions, the classical Stein discrepancy enjoys a particular form of ''mean-square convergence detection'', meaning that $D_P(Q_n) \to 0$ whenever $X_n \sim Q_n$ converges in mean-square to $X \sim P$ and $\nabla \log p(X_n)$ converges in mean-square to $\nabla \log p(X)$. For the kernel Stein discrepancy, ''Wasserstein convergence detection'' has been established, under appropriate regularity conditions on the distribution $P$ and the reproducing kernel $k$.
Applications of Stein discrepancy
Several applications of Stein discrepancy have been proposed, some of which are now described.
Optimal quantisation
Given a probability distribution $P$ defined on a measurable space $\mathcal{X}$, the ''quantisation'' task is to select a small number of states $x_1, \ldots, x_n \in \mathcal{X}$ such that the associated discrete distribution $Q_n = \frac{1}{n} \sum_{i=1}^n \delta(x_i)$ is an accurate approximation of $P$ in a sense to be specified.
''Stein points'' are the result of performing ''optimal'' quantisation via minimisation of Stein discrepancy:
: $(x_1, \ldots, x_n) \in \underset{x_1', \ldots, x_n' \in \mathcal{X}}{\operatorname{arg\,min}} \; D_P\!\left( \frac{1}{n} \sum_{i=1}^n \delta(x_i') \right) . \qquad (3.1)$
Under appropriate regularity conditions, it can be shown that $D_P(Q_n) \to 0$ as $n \to \infty$. Thus, if the Stein discrepancy enjoys convergence control, it follows that $Q_n$ converges to $P$. Extensions of this result, to allow for imperfect numerical optimisation, have also been derived.
[Chen WY, Barp A, Briol FX, Gorham J, Girolami M, Mackey L, Oates CJ. Stein Point Markov Chain Monte Carlo. International Conference on Machine Learning (ICML 2019).][Riabiz M, Chen W, Cockayne J, Swietach P, Niederer SA, Mackey L, Oates CJ. Optimal thinning of MCMC output. Journal of the Royal Statistical Society B: Statistical Methodology, to appear. 2021.]
Sophisticated optimisation algorithms have been designed to perform efficient quantisation based on Stein discrepancy, including gradient flow algorithms that aim to minimise kernel Stein discrepancy over an appropriate space of probability measures.
Optimal weighted approximation
If one is allowed to consider weighted combinations of point masses, then more accurate approximation is possible compared to (3.1). For simplicity of exposition, suppose we are given a set of states $\{x_i\}_{i=1}^n \subset \mathcal{X}$. Then the optimal weighted combination of the point masses $\delta(x_i)$, i.e.
: $Q_n^{w^*} = \sum_{i=1}^n w_i^* \delta(x_i) , \qquad w^* \in \underset{w : \sum_{i=1}^n w_i = 1}{\operatorname{arg\,min}} \; D_P\!\left( \sum_{i=1}^n w_i \delta(x_i) \right) ,$
which minimises Stein discrepancy, can be obtained in closed form when a kernel Stein discrepancy is used.
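When only the sum-to-one constraint is imposed, a standard Lagrangian calculation gives the explicit solution
: $w^* = \frac{K_P^{-1} \mathbf{1}}{\mathbf{1}^\top K_P^{-1} \mathbf{1}} ,$
where $(K_P)_{ij} = \mathcal{A}_P \mathcal{A}_P' K(x_i, x_j)$ and $\mathbf{1}$ denotes the vector of ones, provided that the matrix $K_P$ is invertible.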
Some authors consider imposing, in addition, a non-negativity constraint on the weights, i.e. $w_i \geq 0$. However, in both cases the computation of the optimal weights $w^*$ can involve solving systems of linear equations that are numerically ill-conditioned. Interestingly, it has been shown that greedy approximation of $P$ using an un-weighted combination of $m$ states can reduce this computational requirement. In particular, the greedy ''Stein thinning'' algorithm, which selects its $(m+1)$-st state according to
: $x_{m+1}' \in \underset{x \in \{x_1, \ldots, x_n\}}{\operatorname{arg\,min}} \; \left\{ \frac{\mathcal{A}_P \mathcal{A}_P' K(x, x)}{2} + \sum_{j=1}^{m} \mathcal{A}_P \mathcal{A}_P' K(x_j', x) \right\} ,$
has been shown to satisfy an error bound that controls the Stein discrepancy of the resulting un-weighted approximation, under appropriate regularity conditions.
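A sketch of the greedy selection rule is given below, assuming that the Stein kernel matrix over the candidate states has already been computed; the function name is illustrative and the synthetic kernel matrix stands in for a genuine Stein kernel.
<syntaxhighlight lang="python">
import numpy as np

def stein_thinning(K, m):
    """Greedy Stein thinning.  K is the n-by-n matrix with entries
    K[i, j] = k_P(x_i, x_j), the Stein kernel evaluated on candidate states,
    and m is the number of states to select.  Returns the selected indices,
    chosen so that each step greedily minimises the kernel Stein discrepancy
    of the un-weighted approximation built so far."""
    n = K.shape[0]
    selected = []
    running_sum = np.zeros(n)   # running_sum[j] = sum over selected i of K[i, j]
    for _ in range(m):
        objective = 0.5 * np.diag(K) + running_sum
        j = int(np.argmin(objective))
        selected.append(j)
        running_sum += K[j]
    return selected

# Example with a synthetic positive semi-definite matrix standing in for K.
rng = np.random.default_rng(0)
A = rng.normal(size=(50, 50))
K = A @ A.T
print(stein_thinning(K, 10))
</syntaxhighlight>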
Non-myopic and mini-batch generalisations of the greedy algorithm have been demonstrated to yield further improvement in approximation quality relative to computational cost.
Variational inference
Stein discrepancy has been exploited as a ''variational objective'' in variational Bayesian methods.
[Fisher M, Nolan T, Graham M, Prangle D, Oates CJ. Measure transport with kernel Stein discrepancy. International Conference on Artificial Intelligence and Statistics 2021 (pp. 1054-1062). PMLR.] Given a collection $\{ Q_\theta : \theta \in \Theta \}$ of probability distributions on $\mathcal{X}$, parametrised by $\theta \in \Theta$, one can seek the distribution in this collection that best approximates a distribution $P$ of interest:
: $\theta^* \in \underset{\theta \in \Theta}{\operatorname{arg\,min}} \; D_P( Q_\theta ) .$
A possible advantage of Stein discrepancy in this context,
compared to the traditional Kullback–Leibler variational objective, is that $Q_\theta$ need not be absolutely continuous with respect to $P$ in order for $D_P(Q_\theta)$ to be well-defined. This property can be used to circumvent the use of flow-based generative models, for example, which impose diffeomorphism constraints in order to enforce absolute continuity of $Q_\theta$ and $P$.
Statistical estimation
Stein discrepancy has been proposed as a tool to fit parametric statistical models to data. Given a dataset $\{x_i\}_{i=1}^n$, consider the associated discrete distribution $Q_n = \frac{1}{n} \sum_{i=1}^n \delta(x_i)$. For a given parametric collection $\{ P_\theta : \theta \in \Theta \}$ of probability distributions on $\mathcal{X}$, one can estimate a value of the parameter $\theta$ which is compatible with the dataset using a ''minimum Stein discrepancy estimator''
[Barp, A., Briol, F.-X., Duncan, A. B., Girolami, M., & Mackey, L. (2019). Minimum Stein discrepancy estimators. Neural Information Processing Systems, 12964–12976.]
: $\hat{\theta} \in \underset{\theta \in \Theta}{\operatorname{arg\,min}} \; D_{P_\theta}( Q_n ) .$
The approach is closely related to the framework of minimum distance estimation, with the role of the "distance" being played by the Stein discrepancy. Alternatively, a generalised Bayesian approach to estimation of the parameter $\theta$ can be considered where, given a prior probability distribution with density function $\pi(\theta)$ (with respect to an appropriate reference measure on $\Theta$), one constructs a ''generalised posterior'' with probability density function
: $\pi_n(\theta) \propto \pi(\theta) \exp\!\left( - \gamma \, n \, D_{P_\theta}(Q_n)^2 \right) ,$
for some $\gamma > 0$ to be specified or determined.
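As an illustrative sketch of minimum Stein discrepancy estimation (not a reference implementation), the following fits the location parameter of a Gaussian model $P_\theta = N(\theta, 1)$ by minimising a kernel Stein discrepancy built from a one-dimensional inverse multi-quadric kernel; SciPy is assumed to be available and the function names are illustrative.
<syntaxhighlight lang="python">
import numpy as np
from scipy.optimize import minimize_scalar

def ksd_squared(theta, data, beta=0.5):
    """Squared kernel Stein discrepancy between the empirical distribution of
    `data` and the model N(theta, 1), using k(x, y) = (1 + (x - y)^2)^(-beta)."""
    x, y = data[:, None], data[None, :]
    d = x - y
    base = 1.0 + d ** 2
    k = base ** (-beta)
    dk_dx = -2.0 * beta * base ** (-beta - 1) * d
    dk_dy = -dk_dx
    d2k_dxdy = 2.0 * beta * (base ** (-beta - 1)
                             - 2.0 * (beta + 1) * d ** 2 * base ** (-beta - 2))
    score_x = -(x - theta)   # score of N(theta, 1) evaluated at x
    score_y = -(y - theta)
    stein_kernel = d2k_dxdy + dk_dx * score_y + dk_dy * score_x + k * score_x * score_y
    return stein_kernel.mean()

data = np.random.default_rng(1).normal(loc=2.0, size=300)
fit = minimize_scalar(lambda t: ksd_squared(t, data), bounds=(-10.0, 10.0), method="bounded")
print(fit.x)  # close to the true location 2.0
</syntaxhighlight>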
Hypothesis testing
The Stein discrepancy has also been used as a test statistic for performing
goodness-of-fit testing and comparing latent variable models.
[Kanagawa, H., Jitkrittum, W., Mackey, L., Fukumizu, K., & Gretton, A. (2019). A kernel Stein test for comparing latent variable models. arXiv preprint arXiv:1907.00586.]
Since the aforementioned tests have a computational cost quadratic in the sample size, alternatives have been developed with (near-)linear runtimes.
[Jitkrittum W, Xu W, Szabó Z, Fukumizu K, Gretton A. A Linear-Time Kernel Goodness-of-Fit Test.]
See also
* Stein's method
* Divergence (statistics)
References
{{reflist}}
Statistical distance
Theory of probability distributions