In machine learning, the kernel embedding of distributions (also called the kernel mean or mean map) comprises a class of nonparametric methods in which a probability distribution is represented as an element of a reproducing kernel Hilbert space (RKHS).[A. Smola, A. Gretton, L. Song, B. Schölkopf (2007). A Hilbert Space Embedding for Distributions. ''Algorithmic Learning Theory: 18th International Conference''. Springer: 13–31.] A generalization of the individual data-point feature mapping done in classical kernel methods, the embedding of distributions into infinite-dimensional feature spaces can preserve all of the statistical features of arbitrary distributions, while allowing one to compare and manipulate distributions using Hilbert space operations such as inner products, distances, projections, linear transformations, and spectral analysis.[L. Song, K. Fukumizu, F. Dinuzzo, A. Gretton (2013). Kernel Embeddings of Conditional Distributions: A unified kernel framework for nonparametric inference in graphical models. ''IEEE Signal Processing Magazine'' 30: 98–111.] This learning framework is very general and can be applied to distributions over any space $\Omega$ on which a sensible kernel function (measuring similarity between elements of $\Omega$) may be defined. For example, various kernels have been proposed for learning from data which are:
vectors in $\mathbb{R}^d$, discrete classes/categories, strings, graphs/networks, images, time series, manifolds, dynamical systems, and other structured objects. The theory behind kernel embeddings of distributions has been primarily developed by
Alex Smola, Le Song, Arthur Gretton, and Bernhard Schölkopf. A review of recent work on kernel embeddings of distributions can be found in Muandet et al.[K. Muandet, K. Fukumizu, B. Sriperumbudur, B. Schölkopf (2017). Kernel Mean Embedding of Distributions: A Review and Beyond. ''Foundations and Trends in Machine Learning'' 10 (1–2): 1–141.]
The analysis of distributions is fundamental in machine learning and statistics, and many algorithms in these fields rely on information theoretic approaches such as entropy, mutual information, or Kullback–Leibler divergence. However, to estimate these quantities, one must first either perform density estimation, or employ sophisticated space-partitioning/bias-correction strategies which are typically infeasible for high-dimensional data.[L. Song (2008). Learning via Hilbert Space Embedding of Distributions. PhD Thesis, University of Sydney.] Commonly, methods for modeling complex distributions rely on parametric assumptions that may be unfounded or computationally challenging (e.g. Gaussian mixture models), while nonparametric methods like kernel density estimation (note: the smoothing kernels in this context have a different interpretation than the kernels discussed here) or characteristic function representation (via the Fourier transform of the distribution) break down in high-dimensional settings.
Methods based on the kernel embedding of distributions sidestep these problems and also possess the following advantages:
# Data may be modeled without restrictive assumptions about the form of the distributions and relationships between variables
# Intermediate density estimation is not needed
# Practitioners may specify the properties of a distribution most relevant for their problem (incorporating prior knowledge via choice of the kernel)
# If a ''characteristic'' kernel is used, then the embedding uniquely preserves all information about a distribution, while, thanks to the kernel trick, computations on the potentially infinite-dimensional RKHS can be implemented in practice as simple Gram matrix operations
# Dimensionality-independent rates of convergence for the empirical kernel mean (estimated using samples from the distribution) to the kernel embedding of the true underlying distribution can be proven.
# Learning algorithms based on this framework exhibit good generalization ability and finite sample convergence, while often being simpler and more effective than information theoretic methods
Thus, learning via the kernel embedding of distributions offers a principled drop-in replacement for information theoretic approaches and is a framework which not only subsumes many popular methods in machine learning and statistics as special cases, but also can lead to entirely new learning algorithms.
Definitions
Let $X$ denote a random variable with domain $\Omega$ and distribution $P(X)$. Given a symmetric, positive-definite kernel $k: \Omega \times \Omega \to \mathbb{R}$, the Moore–Aronszajn theorem asserts the existence of a RKHS $\mathcal{H}$ (a Hilbert space of functions $f: \Omega \to \mathbb{R}$ equipped with inner product $\langle \cdot, \cdot \rangle_{\mathcal{H}}$ and norm $\| \cdot \|_{\mathcal{H}}$) in which the element $k(x, \cdot)$ satisfies the reproducing property
:$\langle f, k(x, \cdot) \rangle_{\mathcal{H}} = f(x) \qquad \forall f \in \mathcal{H},\ \forall x \in \Omega.$
One may alternatively consider $\varphi(x) = k(x, \cdot)$ an implicit feature mapping from $\Omega$ to $\mathcal{H}$ (which is therefore also called the feature space), so that $k(x, x') = \langle \varphi(x), \varphi(x') \rangle_{\mathcal{H}}$ can be viewed as a measure of similarity between points $x, x' \in \Omega$. While the similarity measure is linear in the feature space, it may be highly nonlinear in the original space depending on the choice of kernel.
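To make the feature-space view concrete, the following sketch (an illustration added here, not part of the original text) contrasts an explicit feature map with the equivalent direct kernel evaluation for the degree-2 homogeneous polynomial kernel $k(x, x') = (x^\top x')^2$ on $\mathbb{R}^2$, one of the few kernels whose feature map is finite-dimensional and easy to write out:

```python
import numpy as np

def phi(x):
    """Explicit feature map of k(x, x') = (x . x')**2 on R^2:
    phi(x) = (x1^2, sqrt(2)*x1*x2, x2^2)."""
    return np.array([x[0]**2, np.sqrt(2) * x[0] * x[1], x[1]**2])

def k(x, xp):
    """The same kernel evaluated directly (the 'kernel trick' view)."""
    return np.dot(x, xp) ** 2

x, xp = np.array([1.0, 2.0]), np.array([0.5, -1.0])

# The feature-space inner product equals the kernel value:
assert np.isclose(np.dot(phi(x), phi(xp)), k(x, xp))
```

For kernels such as the Gaussian, $\varphi$ is infinite-dimensional and only the direct kernel evaluation is available, which is precisely why embeddings are manipulated through kernel evaluations rather than explicit coordinates.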
Kernel embedding
The kernel embedding of the distribution $P(X)$ in $\mathcal{H}$ (also called the kernel mean or mean map) is given by
:$\mu_X := \mathbb{E}[k(X, \cdot)] = \mathbb{E}[\varphi(X)] = \int_\Omega \varphi(x) \, \mathrm{d}P(x)$
If $P(X)$ admits a square integrable density $p(x)$, then $\mu_X = \mathcal{E}_k p$, where $\mathcal{E}_k$ is the Hilbert–Schmidt integral operator associated with $k$. A kernel is ''characteristic'' if the mean embedding $\mu: \{ \text{distributions on } \Omega \} \to \mathcal{H}$ is injective.[K. Fukumizu, A. Gretton, X. Sun, B. Schölkopf (2008). Kernel measures of conditional independence. ''Advances in Neural Information Processing Systems'' 20. MIT Press, Cambridge, MA.] Each distribution can thus be uniquely represented in the RKHS, and all statistical features of distributions are preserved by the kernel embedding if a characteristic kernel is used.
Empirical kernel embedding
Given $n$ training examples $\{x_1, \ldots, x_n\}$ drawn independently and identically distributed (i.i.d.) from $P(X)$, the kernel embedding of $P(X)$ can be empirically estimated as
:$\widehat{\mu}_X = \frac{1}{n} \sum_{i=1}^n \varphi(x_i)$
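Since $\widehat{\mu}_X$ is a finite sum of feature maps, any inner product involving it reduces to kernel evaluations on samples. The sketch below (illustrative only; the Gaussian kernel and its bandwidth are assumptions, not choices made by the article) evaluates the empirical embedding pointwise as $\widehat{\mu}_X(t) = \tfrac{1}{n} \sum_{i=1}^n k(x_i, t)$ and computes the squared RKHS distance between the embeddings of two samples, the (biased) empirical maximum mean discrepancy:

```python
import numpy as np

def gaussian_gram(A, B, sigma=1.0):
    """Gram matrix k(a_i, b_j) of the Gaussian kernel on 1-D samples."""
    return np.exp(-(A[:, None] - B[None, :]) ** 2 / (2 * sigma**2))

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, size=500)   # sample from P
y = rng.normal(0.5, 1.0, size=500)   # sample from Q

# Pointwise evaluation of the empirical mean embedding at test points t:
t = np.linspace(-3, 3, 7)
mu_X_at_t = gaussian_gram(x, t).mean(axis=0)   # (1/n) sum_i k(x_i, t)

# Squared RKHS distance between the two empirical embeddings (biased MMD^2):
Kxx, Kyy, Kxy = gaussian_gram(x, x), gaussian_gram(y, y), gaussian_gram(x, y)
mmd2 = Kxx.mean() + Kyy.mean() - 2 * Kxy.mean()
print(mu_X_at_t, mmd2)
```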
Joint distribution embedding
If $Y$ denotes another random variable (for simplicity, assume the co-domain of $Y$ is also $\Omega$ with the same kernel $k$, which satisfies $\langle \varphi(x) \otimes \varphi(y), \varphi(x') \otimes \varphi(y') \rangle_{\mathcal{H} \otimes \mathcal{H}} = k(x, x') \, k(y, y')$), then the joint distribution $P(X, Y)$ can be mapped into a tensor product feature space $\mathcal{H} \otimes \mathcal{H}$ via
:$\mathcal{C}_{XY} := \mathbb{E}[\varphi(X) \otimes \varphi(Y)] = \int_{\Omega \times \Omega} \varphi(x) \otimes \varphi(y) \, \mathrm{d}P(x, y)$
By the equivalence between a tensor and a linear map, this joint embedding may be interpreted as an uncentered cross-covariance operator $\mathcal{C}_{XY}: \mathcal{H} \to \mathcal{H}$ from which the cross-covariance of functions $f, g \in \mathcal{H}$ can be computed as[L. Song, J. Huang, A. J. Smola, K. Fukumizu (2009). Hilbert space embeddings of conditional distributions. ''Proc. Int. Conf. Machine Learning''. Montreal, Canada: 961–968.]
:$\operatorname{Cov}(f(X), g(Y)) := \mathbb{E}[f(X) \, g(Y)] = \langle f, \mathcal{C}_{XY} \, g \rangle_{\mathcal{H}} = \langle f \otimes g, \mathcal{C}_{XY} \rangle_{\mathcal{H} \otimes \mathcal{H}}$
Given $n$ pairs of training examples $\{(x_1, y_1), \ldots, (x_n, y_n)\}$ drawn i.i.d. from $P(X, Y)$, we can also empirically estimate the joint distribution kernel embedding via
:$\widehat{\mathcal{C}}_{XY} = \frac{1}{n} \sum_{i=1}^n \varphi(x_i) \otimes \varphi(y_i)$
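Concretely, pairing $\widehat{\mathcal{C}}_{XY}$ with RKHS functions requires only kernel evaluations: for $f = \sum_j a_j k(u_j, \cdot)$ and $g = \sum_l b_l k(v_l, \cdot)$, the uncentered cross-covariance estimate is $\langle f, \widehat{\mathcal{C}}_{XY} \, g \rangle = \tfrac{1}{n} \sum_i f(x_i) \, g(y_i)$. A minimal sketch (the anchor points $u_j, v_l$ and coefficients below are hypothetical illustrations, not part of the original text):

```python
import numpy as np

def gram(A, B, sigma=1.0):
    """Gaussian Gram matrix k(a_i, b_j) for 1-D samples."""
    return np.exp(-(A[:, None] - B[None, :]) ** 2 / (2 * sigma**2))

rng = np.random.default_rng(1)
x = rng.normal(size=400)
y = x + 0.3 * rng.normal(size=400)        # a correlated pair (X, Y)

# RKHS functions as kernel expansions over hypothetical anchor points:
u, a = np.array([-1.0, 0.0, 1.0]), np.array([0.2, -0.4, 0.7])
v, b = np.array([-0.5, 0.5]), np.array([1.0, -1.0])
f = lambda t: gram(u, t).T @ a            # f(t) = sum_j a_j k(u_j, t)
g = lambda t: gram(v, t).T @ b            # g(t) = sum_l b_l k(v_l, t)

# <f, C_hat_XY g> = (1/n) sum_i f(x_i) g(y_i): a pure Gram computation
print(np.mean(f(x) * g(y)))
```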
Conditional distribution embedding
Given a conditional distribution $P(Y \mid X)$, one can define the corresponding RKHS embedding as
:$\mu_{Y \mid x} := \mathbb{E}[\varphi(Y) \mid X = x] = \int_\Omega \varphi(y) \, \mathrm{d}P(y \mid x)$
Note that the embedding of $P(Y \mid X)$ thus defines a family of points in the RKHS indexed by the values $x$ taken by the conditioning variable $X$. By fixing $X$ to a particular value $x$, we obtain a single element in $\mathcal{H}$, and thus it is natural to define the operator
:$\mathcal{C}_{Y \mid X} := \mathcal{C}_{YX} \mathcal{C}_{XX}^{-1}$
which, given the feature mapping of $x$, outputs the conditional embedding of $Y$ given $X = x$. Assuming that $\mathbb{E}[g(Y) \mid X = \cdot\,] \in \mathcal{H}$ for all $g \in \mathcal{H}$, it can be shown that
:$\mu_{Y \mid x} = \mathcal{C}_{Y \mid X} \, \varphi(x) = \mathcal{C}_{YX} \mathcal{C}_{XX}^{-1} \varphi(x)$
This assumption is always true for finite domains with characteristic kernels, but may not necessarily hold for continuous domains.
Nevertheless, even in cases where the assumption fails, $\mathcal{C}_{YX} \mathcal{C}_{XX}^{-1} \varphi(x)$ may still be used to approximate the conditional kernel embedding $\mu_{Y \mid x}$, and in practice, the inversion operator is replaced with a regularized version of itself $(\mathcal{C}_{XX} + \lambda \mathbf{I})^{-1}$ (where $\mathbf{I}$ denotes the identity operator and $\lambda > 0$).
Given $n$ training examples $\{(x_1, y_1), \ldots, (x_n, y_n)\}$, the empirical kernel conditional embedding operator may be estimated as
:$\widehat{\mathcal{C}}_{Y \mid X} = \boldsymbol{\Phi} (\mathbf{K} + \lambda \mathbf{I})^{-1} \boldsymbol{\Upsilon}^\top$
where $\boldsymbol{\Phi} = (\varphi(y_1), \ldots, \varphi(y_n))$ and $\boldsymbol{\Upsilon} = (\varphi(x_1), \ldots, \varphi(x_n))$ are implicitly formed feature matrices, $\mathbf{K} = \boldsymbol{\Upsilon}^\top \boldsymbol{\Upsilon}$ is the Gram matrix for samples of $X$, and $\lambda$ is a regularization parameter needed to avoid overfitting.
Thus, the empirical estimate of the kernel conditional embedding is given by a weighted sum of samples of $Y$ in the feature space:
:$\widehat{\mu}_{Y \mid x} = \sum_{i=1}^n \beta_i(x) \, \varphi(y_i) = \boldsymbol{\Phi} \, \boldsymbol{\beta}(x)$
where $\boldsymbol{\beta}(x) = (\mathbf{K} + \lambda \mathbf{I})^{-1} \mathbf{K}_x$ and $\mathbf{K}_x = \big( k(x_1, x), \ldots, k(x_n, x) \big)^\top$.
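Because $\boldsymbol{\beta}(x)$ involves only Gram matrices, conditional expectations of RKHS functions can be approximated as $\mathbb{E}[g(Y) \mid X = x] \approx \sum_i \beta_i(x) \, g(y_i)$. A minimal sketch, assuming a Gaussian kernel with hand-picked bandwidth and regularization (none of which are prescribed by the text above):

```python
import numpy as np

def gram(A, B, sigma=0.5):
    """Gaussian Gram matrix k(a_i, b_j) for 1-D samples."""
    return np.exp(-(A[:, None] - B[None, :]) ** 2 / (2 * sigma**2))

rng = np.random.default_rng(2)
n, lam = 300, 0.1
x = rng.uniform(-2, 2, size=n)
y = np.sin(np.pi * x) + 0.1 * rng.normal(size=n)   # Y | X=x is a noisy sine

K = gram(x, x)                       # Gram matrix on the X samples
K_x = gram(x, np.array([0.5]))       # vector (k(x_1, x), ..., k(x_n, x))

# beta(x) = (K + lam I)^{-1} K_x: the weights of the conditional embedding
beta = np.linalg.solve(K + lam * np.eye(n), K_x).ravel()

# Weighted sum over samples of Y approximates E[Y | X = 0.5] ~ sin(pi/2) = 1
print(beta @ y)
```

With $g$ the identity map on the samples, the weighted sum $\boldsymbol{\beta}(x)^\top \mathbf{y}$ coincides with a kernel ridge regression prediction, which offers one way to sanity-check the estimator.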
Properties
* The expectation of any function $f$ in the RKHS can be computed as an inner product with the kernel embedding:
::$\mathbb{E}[f(X)] = \langle f, \mu_X \rangle_{\mathcal{H}}$
* In the presence of large sample sizes, manipulations of the $n \times n$ Gram matrix may be computationally demanding. Through use of a low-rank approximation of the Gram matrix (such as the incomplete Cholesky factorization), running time and memory requirements of kernel-embedding-based learning algorithms can be drastically reduced without suffering much loss in approximation accuracy (see the sketch below).
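As an illustration of this low-rank idea (a sketch added here, not taken from the article; the greedy pivoting rule, tolerance, and rank cap are implementation choices), the following computes a pivoted, early-stopped Cholesky factorization $\mathbf{K} \approx \mathbf{L} \mathbf{L}^\top$ with $\mathbf{L} \in \mathbb{R}^{n \times r}$ and $r \ll n$:

```python
import numpy as np

def pivoted_chol(diag, col, n, tol=1e-6, max_rank=50):
    """Greedy pivoted Cholesky: returns L with K ~= L @ L.T.
    diag: the diagonal of K; col(j): the j-th column of K."""
    d = diag.astype(float).copy()    # residual diagonal of K - L @ L.T
    L = np.zeros((n, 0))
    while L.shape[1] < max_rank and d.max() > tol:
        j = int(np.argmax(d))        # pivot on largest residual variance
        r = col(j) - L @ L[j]        # residual of column j
        L = np.column_stack([L, r / np.sqrt(d[j])])
        d -= L[:, -1] ** 2           # update residual diagonal
        d[d < 0] = 0.0               # guard against round-off
    return L

rng = np.random.default_rng(3)
x = rng.normal(size=200)
K = np.exp(-(x[:, None] - x[None, :]) ** 2 / 2)   # Gaussian Gram matrix
L = pivoted_chol(np.diag(K).copy(), lambda j: K[:, j], len(x))
print(L.shape[1], np.abs(K - L @ L.T).max())      # low rank, small error
```

Only the diagonal and the selected columns of $\mathbf{K}$ are ever touched, so for smooth kernels the full $n \times n$ matrix need not be formed at all.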
Convergence of empirical kernel mean to the true distribution embedding
* If $k$ is defined such that $f$ takes values in $[0, 1]$ for all $f \in \mathcal{H}$ with $\|f\|_{\mathcal{H}} \le 1$ (as is the case for the widely used radial basis function kernels), then with high probability the RKHS distance $\|\widehat{\mu}_X - \mu_X\|_{\mathcal{H}}$ decreases at rate $O(n^{-1/2})$, independently of the dimension of $\Omega$.
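A quick numerical check of this dimension-free rate (an added illustration; the standard normal data, Gaussian kernel, and bandwidth are assumptions) exploits the fact that for this pairing both $\mu_X(t) = \mathbb{E}[k(X, t)]$ and $\|\mu_X\|_{\mathcal{H}}^2 = \mathbb{E}[k(X, X')]$ have closed forms, so $\|\widehat{\mu}_X - \mu_X\|_{\mathcal{H}}$ can be computed exactly from a sample:

```python
import numpy as np

rng = np.random.default_rng(4)
s = 1.0  # kernel bandwidth; data X ~ N(0, 1), k(x,t) = exp(-(x-t)^2/(2 s^2))

# Closed forms for this Gaussian kernel / standard normal combination:
#   mu_X(t)    = E[k(X, t)]  = s/sqrt(s^2+1) * exp(-t^2 / (2 (s^2+1)))
#   ||mu_X||^2 = E[k(X, X')] = s/sqrt(s^2+2)
mu = lambda t: s / np.sqrt(s**2 + 1) * np.exp(-t**2 / (2 * (s**2 + 1)))
norm_mu2 = s / np.sqrt(s**2 + 2)

for n in [50, 200, 800, 3200]:
    x = rng.normal(size=n)
    Kxx = np.exp(-(x[:, None] - x[None, :]) ** 2 / (2 * s**2))
    # ||mu_hat - mu||^2 = mean(Kxx) - 2 mean(mu(x_i)) + ||mu||^2
    err2 = Kxx.mean() - 2 * mu(x).mean() + norm_mu2
    print(n, np.sqrt(max(err2, 0.0)))  # roughly halves as n quadruples
```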