machine learning Machine learning (ML) is a field of inquiry devoted to understanding and building methods that 'learn', that is, methods that leverage data to improve performance on some set of tasks. It is seen as a part of artificial intelligence. Machine ...

, the kernel embedding of distributions (also called the kernel mean or mean map) comprises a class of

nonparametric Nonparametric statistics is the branch of statistics that is not based solely on Statistical parameter, parametrized families of probability distributions (common examples of parameters are the mean and variance). Nonparametric statistics is based ...

methods in which a

probability distribution In probability theory and statistics, a probability distribution is the mathematical function that gives the probabilities of occurrence of different possible outcomes for an experiment. It is a mathematical description of a random phenomeno ...

is represented as an element of a

reproducing kernel Hilbert space In functional analysis (a branch of mathematics), a reproducing kernel Hilbert space (RKHS) is a Hilbert space of functions in which point evaluation is a continuous linear functional. Roughly speaking, this means that if two functions f and g i ...

(RKHS).A. Smola, A. Gretton, L. Song, B. Schölkopf. (2007)
A Hilbert Space Embedding for Distributions
. ''Algorithmic Learning Theory: 18th International Conference''. Springer: 13–31. A generalization of the individual data-point feature mapping done in classical

kernel methods In machine learning, kernel machines are a class of algorithms for pattern analysis, whose best known member is the support-vector machine (SVM). The general task of pattern analysis is to find and study general types of relations (for example c ...

, the embedding of distributions into infinite-dimensional feature spaces can preserve all of the statistical features of arbitrary distributions, while allowing one to compare and manipulate distributions using Hilbert space operations such as

inner product In mathematics, an inner product space (or, rarely, a Hausdorff pre-Hilbert space) is a real vector space or a complex vector space with an operation called an inner product. The inner product of two vectors in the space is a scalar, often ...

s, distances, projections,

linear transformation In mathematics, and more specifically in linear algebra, a linear map (also called a linear mapping, linear transformation, vector space homomorphism, or in some contexts linear function) is a mapping V \to W between two vector spaces that pr ...

s, and spectral analysis.L. Song, K. Fukumizu, F. Dinuzzo, A. Gretton (2013)
Kernel Embeddings of Conditional Distributions: A unified kernel framework for nonparametric inference in graphical models
''IEEE Signal Processing Magazine'' 30: 98–111. This learning framework is very general and can be applied to distributions over any space

\Omega

on which a sensible

kernel function In operator theory, a branch of mathematics, a positive-definite kernel is a generalization of a positive-definite function or a positive-definite matrix. It was first introduced by James Mercer in the early 20th century, in the context of solving ...

(measuring similarity between elements of

\Omega

) may be defined. For example, various kernels have been proposed for learning from data which are: vectors in

\mathbb^d

, discrete classes/categories,

string String or strings may refer to: *String (structure), a long flexible structure made from threads twisted together, which is used to tie, bind, or hang other objects Arts, entertainment, and media Films * ''Strings'' (1991 film), a Canadian anim ...

graph Graph may refer to: Mathematics *Graph (discrete mathematics), a structure made of vertices and edges **Graph theory, the study of such graphs and their properties *Graph (topology), a topological space resembling a graph in the sense of discre ...

networks Network, networking and networked may refer to: Science and technology * Network theory, the study of graphs as a representation of relations between discrete objects * Network science, an academic field that studies complex networks Mathematics ...

, images,

time series In mathematics, a time series is a series of data points indexed (or listed or graphed) in time order. Most commonly, a time series is a sequence taken at successive equally spaced points in time. Thus it is a sequence of discrete-time data. E ...

manifold In mathematics, a manifold is a topological space that locally resembles Euclidean space near each point. More precisely, an n-dimensional manifold, or ''n-manifold'' for short, is a topological space with the property that each point has a ...

dynamical systems In mathematics, a dynamical system is a system in which a function describes the time dependence of a point in an ambient space. Examples include the mathematical models that describe the swinging of a clock pendulum, the flow of water in ...

, and other structured objects. The theory behind kernel embeddings of distributions has been primarily developed by
Alex SmolaLe Song Arthur Gretton
and Bernhard Schölkopf. A review of recent works on kernel embedding of distributions can be found in. The analysis of distributions is fundamental in

and statistics, and many algorithms in these fields rely on information theoretic approaches such as

entropy Entropy is a scientific concept, as well as a measurable physical property, that is most commonly associated with a state of disorder, randomness, or uncertainty. The term and the concept are used in diverse fields, from classical thermodyna ...

mutual information In probability theory and information theory, the mutual information (MI) of two random variables is a measure of the mutual dependence between the two variables. More specifically, it quantifies the " amount of information" (in units such as ...

, or

Kullback–Leibler divergence In mathematical statistics, the Kullback–Leibler divergence (also called relative entropy and I-divergence), denoted D_\text(P \parallel Q), is a type of statistical distance: a measure of how one probability distribution ''P'' is different fro ...

. However, to estimate these quantities, one must first either perform density estimation, or employ sophisticated space-partitioning/bias-correction strategies which are typically infeasible for high-dimensional data.L. Song. (2008
Learning via Hilbert Space Embedding of Distributions
PhD Thesis, University of Sydney. Commonly, methods for modeling complex distributions rely on parametric assumptions that may be unfounded or computationally challenging (e.g. Gaussian mixture models), while nonparametric methods like

kernel density estimation In statistics, kernel density estimation (KDE) is the application of kernel smoothing for probability density estimation, i.e., a non-parametric method to estimate the probability density function of a random variable based on '' kernels'' as ...

(Note: the smoothing kernels in this context have a different interpretation than the kernels discussed here) or

characteristic function In mathematics, the term "characteristic function" can refer to any of several distinct concepts: * The indicator function of a subset, that is the function ::\mathbf_A\colon X \to \, :which for a given subset ''A'' of ''X'', has value 1 at point ...

representation (via the

Fourier transform A Fourier transform (FT) is a mathematical transform that decomposes functions into frequency components, which are represented by the output of the transform as a function of frequency. Most commonly functions of time or space are transformed, ...

of the distribution) break down in high-dimensional settings. Methods based on the kernel embedding of distributions sidestep these problems and also possess the following advantages: # Data may be modeled without restrictive assumptions about the form of the distributions and relationships between variables # Intermediate density estimation is not needed # Practitioners may specify the properties of a distribution most relevant for their problem (incorporating prior knowledge via choice of the kernel) # If a ''characteristic'' kernel is used, then the embedding can uniquely preserve all information about a distribution, while thanks to the

kernel trick In machine learning, kernel machines are a class of algorithms for pattern analysis, whose best known member is the support-vector machine (SVM). The general task of pattern analysis is to find and study general types of relations (for example c ...

, computations on the potentially infinite-dimensional RKHS can be implemented in practice as simple

Gram The gram (originally gramme; SI unit symbol g) is a unit of mass in the International System of Units (SI) equal to one one thousandth of a kilogram. Originally defined as of 1795 as "the absolute weight of a volume of pure water equal to ...

matrix operations # Dimensionality-independent rates of convergence for the empirical kernel mean (estimated using samples from the distribution) to the kernel embedding of the true underlying distribution can be proven. # Learning algorithms based on this framework exhibit good generalization ability and finite sample convergence, while often being simpler and more effective than information theoretic methods Thus, learning via the kernel embedding of distributions offers a principled drop-in replacement for information theoretic approaches and is a framework which not only subsumes many popular methods in machine learning and statistics as special cases, but also can lead to entirely new learning algorithms.

Definitions

Let

X

denote a random variable with domain

\Omega

and distribution

P.

Given a kernel

k

\Omega \times \Omega,

the Moore–Aronszajn theorem asserts the existence of a RKHS

\mathcal

Hilbert space In mathematics, Hilbert spaces (named after David Hilbert) allow generalizing the methods of linear algebra and calculus from (finite-dimensional) Euclidean vector spaces to spaces that may be infinite-dimensional. Hilbert spaces arise natu ...

of functions

f: \Omega \to \R

equipped with inner products

\langle \cdot, \cdot \rangle_\mathcal

and norms

\,  \cdot \, _\mathcal

) in which the element

k(x,\cdot)

satisfies the reproducing property :

\forall f \in \mathcal, \forall x \in \Omega \qquad \langle f, k(x,\cdot) \rangle_\mathcal = f(x).

One may alternatively consider

k(x,\cdot)

an implicit feature mapping

\varphi(x)

from

\Omega

\mathcal

(which is therefore also called the feature space), so that

k(x, x') = \langle \varphi(x), \varphi(x')\rangle_\mathcal

can be viewed as a measure of similarity between points

x, x' \in \Omega.

While the

similarity measure In statistics and related fields, a similarity measure or similarity function or similarity metric is a real-valued function that quantifies the similarity between two objects. Although no single definition of a similarity exists, usually such mea ...

is linear in the feature space, it may be highly nonlinear in the original space depending on the choice of kernel.

Kernel embedding

The kernel embedding of the distribution

P

\mathcal

(also called the kernel mean or mean map) is given by: :

= \int_\Omega \varphi(x) \ \mathrmP(x)

P

allows a square integrable density

p

, then

\mu_X = \mathcal_k p

, where

\mathcal_k

is the

Hilbert–Schmidt integral operator In mathematics, a Hilbert–Schmidt integral operator is a type of integral transform. Specifically, given a domain (an open and connected set) Ω in ''n''- dimensional Euclidean space R''n'', a Hilbert–Schmidt kernel is a function ''k''& ...

. A kernel is ''characteristic'' if the mean embedding

\mu: \ \to \mathcal

is injective.K. Fukumizu, A. Gretton, X. Sun, and B. Schölkopf (2008)
Kernel measures of conditional independence
''Advances in Neural Information Processing Systems'' 20, MIT Press, Cambridge, MA. Each distribution can thus be uniquely represented in the RKHS and all statistical features of distributions are preserved by the kernel embedding if a characteristic kernel is used.

Empirical kernel embedding

Given

n

training examples

\

drawn

independently and identically distributed In probability theory and statistics, a collection of random variables is independent and identically distributed if each random variable has the same probability distribution as the others and all are mutually independent. This property is usua ...

(i.i.d.) from

P,

the kernel embedding of

P

can be empirically estimated as :

\widehat_X = \frac \sum_^n \varphi(x_i)

Joint distribution embedding

Y

denotes another random variable (for simplicity, assume the co-domain of

Y

is also

\Omega

with the same kernel

k

which satisfies

\langle \varphi(x) \otimes \varphi(y), \varphi(x') \otimes \varphi(y') \rangle = k(x,x') k(y,y')

), then the

joint distribution Given two random variables that are defined on the same probability space, the joint probability distribution is the corresponding probability distribution on all possible pairs of outputs. The joint distribution can just as well be considered ...

P(x,y))

can be mapped into a

tensor product In mathematics, the tensor product V \otimes W of two vector spaces and (over the same Field (mathematics), field) is a vector space to which is associated a bilinear map V\times W \to V\otimes W that maps a pair (v,w),\ v\in V, w\in W to an e ...

feature space

\mathcal \otimes \mathcal

via :

= \int_ \varphi(x) \otimes \varphi(y) \ \mathrm P(x,y)

By the equivalence between a

tensor In mathematics, a tensor is an algebraic object that describes a multilinear relationship between sets of algebraic objects related to a vector space. Tensors may map between different objects such as vectors, scalars, and even other tens ...

and a

linear map In mathematics, and more specifically in linear algebra, a linear map (also called a linear mapping, linear transformation, vector space homomorphism, or in some contexts linear function) is a mapping V \to W between two vector spaces that pr ...

, this joint embedding may be interpreted as an uncentered

cross-covariance In probability and statistics, given two stochastic processes \left\ and \left\, the cross-covariance is a function that gives the covariance of one process with the other at pairs of time points. With the usual notation \operatorname E for the ...

operator

\mathcal_: \mathcal \to \mathcal

from which the cross-covariance of functions

f,g \in \mathcal

can be computed as L. Song, J. Huang, A. J. Smola, K. Fukumizu. (2009
Hilbert space embeddings of conditional distributions
''Proc. Int. Conf. Machine Learning''. Montreal, Canada: 961–968. :

\operatorname (f(X), g(Y)) := \mathbb

(X) g(Y) An emoticon (, , rarely , ), short for "emotion icon", also known simply as an emote, is a pictorial representation of a facial expression using characters—usually punctuation marks, numbers, and letters—to express a person's feelings ...

- \mathbb

(X) An emoticon (, , rarely , ), short for "emotion icon", also known simply as an emote, is a pictorial representation of a facial expression using characters—usually punctuation marks, numbers, and letters—to express a person's feelings, ...

\mathbb (Y) \langle f , \mathcal_ g \rangle_ = \langle f \otimes g , \mathcal_ \rangle_ Given

n

pairs of training examples

\

drawn i.i.d. from

P

, we can also empirically estimate the joint distribution kernel embedding via :

\widehat_ = \frac \sum_^n \varphi(x_i) \otimes \varphi(y_i)

Conditional distribution embedding

Given a

conditional distribution In probability theory and statistics, given two jointly distributed random variables X and Y, the conditional probability distribution of Y given X is the probability distribution of Y when X is known to be a particular value; in some cases the c ...

P(y\mid x),

one can define the corresponding RKHS embedding as :

= \int_\Omega \varphi(y) \ \mathrmP(y \mid x)

Note that the embedding of

P(y\mid x)

thus defines a family of points in the RKHS indexed by the values

x

taken by conditioning variable

X

. By fixing

X

to a particular value, we obtain a single element in

\mathcal

, and thus it is natural to define the operator :

\begin \mathcal_: \mathcal \to \mathcal \\ \mathcal_ = \mathcal_ \mathcal_^ \end

which given the feature mapping of

x

outputs the conditional embedding of

Y

given

X = x.

Assuming that for all

\in \mathcal,

it can be shown that :

\mu_ = \mathcal_ \varphi(x)

This assumption is always true for finite domains with characteristic kernels, but may not necessarily hold for continuous domains. Nevertheless, even in cases where the assumption fails,

\mathcal_ \varphi(x)

may still be used to approximate the conditional kernel embedding

\mu_,

and in practice, the inversion operator is replaced with a regularized version of itself

(\mathcal_ + \lambda \mathbf)^

(where

\mathbf

denotes the

identity matrix In linear algebra, the identity matrix of size n is the n\times n square matrix with ones on the main diagonal and zeros elsewhere. Terminology and notation The identity matrix is often denoted by I_n, or simply by I if the size is immaterial ...

). Given training examples

\,

the empirical kernel conditional embedding operator may be estimated as :

\widehat_ = \boldsymbol (\mathbf + \lambda \mathbf)^ \boldsymbol^T

where

\boldsymbol = \left(\varphi(y_i),\dots, (y_n)\right), \boldsymbol = \left(\varphi(x_i),\dots, (x_n)\right)

are implicitly formed feature matrices,

\mathbf =\boldsymbol^T \boldsymbol

is the Gram matrix for samples of

X

, and

\lambda

is a regularization parameter needed to avoid

overfitting mathematical modeling, overfitting is "the production of an analysis that corresponds too closely or exactly to a particular set of data, and may therefore fail to fit to additional data or predict future observations reliably". An overfitt ...

. Thus, the empirical estimate of the kernel conditional embedding is given by a weighted sum of samples of

Y

in the feature space: :

\widehat_ = \sum_^n \beta_i (x) \varphi(y_i) = \boldsymbol \boldsymbol(x)

where

\boldsymbol(x) = (\mathbf + \lambda \mathbf)^ \mathbf_x

and

\mathbf_x = \left( k(x_1, x), \dots, k(x_n, x) \right)^T

Properties

* The expectation of any function

f

in the RKHS can be computed as an inner product with the kernel embedding: ::

\mathbb

= \langle f, \mu_X \rangle_\mathcal * In the presence of large sample sizes, manipulations of the

n \times n

Gram matrix may be computationally demanding. Through use of a low-rank approximation of the Gram matrix (such as the

incomplete Cholesky factorization In numerical analysis, an incomplete Cholesky factorization of a symmetric positive definite matrix is a sparse approximation of the Cholesky factorization. An incomplete Cholesky factorization is often used as a preconditioner for algorithms like ...

), running time and memory requirements of kernel-embedding-based learning algorithms can be drastically reduced without suffering much loss in approximation accuracy.

Convergence of empirical kernel mean to the true distribution embedding

* If

k

is defined such that

f

takes values in

,1 /math> for all f \in \mathcal with \,  f\, _\mathcal \le 1 (as is the case for the widely used

radial basis function A radial basis function (RBF) is a real-valued function \varphi whose value depends only on the distance between the input and some fixed point, either the origin, so that \varphi(\mathbf) = \hat\varphi(\left\, \mathbf\right\, ), or some other fixed ...

kernels), then with probability at least

1-\delta

: ::

\, \mu_X - \widehat_X \, _\mathcal = \sup_ \left,  \mathbb

- \frac \sum_^n f(x_i) \ \le \frac \mathbb \left \sqrt \right+ \sqrt :where

\mathcal(0,1)

denotes the unit ball in

\mathcal

and

\mathbf =(k_)

is the Gram matrix with

k_ = k(x_i, x_j).

* The rate of convergence (in RKHS norm) of the empirical kernel embedding to its distribution counterpart is

O(n^)

and does ''not'' depend on the dimension of

X

. * Statistics based on kernel embeddings thus avoid the

curse of dimensionality The curse of dimensionality refers to various phenomena that arise when analyzing and organizing data in high-dimensional spaces that do not occur in low-dimensional settings such as the three-dimensional physical space of everyday experience. Th ...

, and though the true underlying distribution is unknown in practice, one can (with high probability) obtain an approximation within

O(n^)

of the true kernel embedding based on a finite sample of size

n

. * For the embedding of conditional distributions, the empirical estimate can be seen as a ''weighted'' average of feature mappings (where the weights

\beta_i(x)

depend on the value of the conditioning variable and capture the effect of the conditioning on the kernel embedding). In this case, the empirical estimate converges to the conditional distribution RKHS embedding with rate

O\left(n^ \right)

if the regularization parameter

\lambda

is decreased as

O\left( n^ \right),

though faster rates of convergence may be achieved by placing additional assumptions on the joint distribution.

Universal kernels

* Letting

C(\mathcal)

denote the space of

continuous Continuity or continuous may refer to: Mathematics * Continuity (mathematics), the opposing concept to discreteness; common examples include ** Continuous probability distribution or random variable in probability and statistics ** Continuous g ...

bounded functions on

compact Compact as used in politics may refer broadly to a pact or treaty; in more specific cases it may refer to: * Interstate compact * Blood compact, an ancient ritual of the Philippines * Compact government, a type of colonial rule utilized in British ...

domain

\mathcal

, we call a kernel

k

''universal'' if

k(x,\cdot)

is continuous for all

x

and the RKHS induced by

k

dense Density (volumetric mass density or specific mass) is the substance's mass per unit of volume. The symbol most often used for density is ''ρ'' (the lower case Greek letter rho), although the Latin letter ''D'' can also be used. Mathematically ...

C(\mathcal)

. * If

k

induces a strictly positive definite kernel matrix for any set of distinct points, then it is a universal kernel. For example, the widely used Gaussian RBF kernel ::

k(x,x') = \exp\left(-\frac \, x-x'\, ^2 \right)

:on compact subsets of

\mathbb^d

is universal. * If

k

is shift-invariant

h(x-y)=k(x, y)

and its representation in Fourier domain is ::

h(t) = \int e^ \mu(d\omega)

:and

support Support may refer to: Arts, entertainment, and media * Supporting character Business and finance * Support (technical analysis) * Child support * Customer support * Income Support Construction * Support (structure), or lateral support, a ...

\mu

is an entire space, then

k

is universal. For example, Gaussian RBF is universal,

sinc In mathematics, physics and engineering, the sinc function, denoted by , has two forms, normalized and unnormalized.. In mathematics, the historical unnormalized sinc function is defined for by \operatornamex = \frac. Alternatively, the ...

kernel is not universal. * If

k

is universal, then it is ''characteristic'', i.e. the kernel embedding is one-to-one.

Parameter selection for conditional distribution kernel embeddings

* The empirical kernel conditional distribution embedding operator

\widehat_

can alternatively be viewed as the solution of the following regularized least squares (function-valued) regression problem ::

\min_ \sum_^n \left \, \varphi(y_i)-\mathcal \varphi(x_i) \right \, _\mathcal^2 + \lambda \, \mathcal \, _^2

:where

\, \cdot\, _

is the Hilbert–Schmidt norm. * One can thus select the regularization parameter

\lambda

by performing cross-validation based on the squared loss function of the regression problem.

Rules of probability as operations in the RKHS

This section illustrates how basic probabilistic rules may be reformulated as (multi)linear algebraic operations in the kernel embedding framework and is primarily based on the work of Song et al. The following notation is adopted: *

P(X,Y)=

joint distribution over random variables

X, Y

P(X)= \int_\Omega P(X, \mathrmy) =

marginal distribution of

X

;

P(Y)=

marginal distribution of

Y

P(Y \mid X) = \frac =

conditional distribution of

Y

given

X

with corresponding conditional embedding operator

\mathcal_

\pi(Y) =

prior distribution over

Y

Q

is used to distinguish distributions which incorporate the prior from distributions

P

which do not rely on the prior In practice, all embeddings are empirically estimated from data

\

and it assumed that a set of samples

\

may be used to estimate the kernel embedding of the prior distribution

\pi(Y)

Kernel sum rule

In probability theory, the marginal distribution of

X

can be computed by integrating out

Y

from the joint density (including the prior distribution on

Y

) :

Q(X) = \int_\Omega P(X \mid Y) \, \mathrm \pi(Y)

The analog of this rule in the kernel embedding framework states that

\mu_X^\pi,

the RKHS embedding of

Q(X)

, can be computed via :

= \mathcal_ \mu_Y^\pi

where

\mu_Y^\pi

is the kernel embedding of

\pi(Y).

In practical implementations, the kernel sum rule takes the following form :

\widehat_X^\pi = \widehat_ \widehat_Y^\pi = \boldsymbol (\mathbf + \lambda \mathbf)^ \widetilde \boldsymbol

where :

\mu_Y^\pi = \sum_^ \alpha_i \varphi(\widetilde_i)

is the empirical kernel embedding of the prior distribution,

\boldsymbol = (\alpha_1, \ldots, \alpha_ )^T,

\boldsymbol = \left(\varphi(x_1), \ldots, \varphi(x_n) \right)

, and

\mathbf, \widetilde

are Gram matrices with entries

\mathbf_ = k(y_i, y_j), \widetilde_ = k(y_i, \widetilde_j)

respectively.

Kernel chain rule

In probability theory, a joint distribution can be factorized into a product between conditional and marginal distributions :

Q(X,Y) = P(X \mid Y) \pi(Y)

The analog of this rule in the kernel embedding framework states that

\mathcal_^\pi,

the joint embedding of

Q(X,Y),

can be factorized as a composition of conditional embedding operator with the auto-covariance operator associated with

\pi(Y)

\mathcal_^\pi = \mathcal_ \mathcal_^\pi

where :

\mathcal_^\pi = \mathbb varphi(X) \otimes \varphi(Y)

\mathcal_^\pi = \mathbb varphi(Y) \otimes \varphi(Y)

In practical implementations, the kernel chain rule takes the following form :

\widehat_^\pi = \widehat_ \widehat_^\pi = \boldsymbol (\mathbf + \lambda \mathbf)^ \widetilde \operatorname(\boldsymbol) \boldsymbol^T

Kernel Bayes' rule

In probability theory, a posterior distribution can be expressed in terms of a prior distribution and a likelihood function as :

Q(Y\mid x) = \frac

where

Q(x) = \int_\Omega P(x \mid y) \, \mathrm \pi(y)

The analog of this rule in the kernel embedding framework expresses the kernel embedding of the conditional distribution in terms of conditional embedding operators which are modified by the prior distribution :

\mu_^\pi = \mathcal_^\pi \varphi(x) = \mathcal_^\pi \left ( \mathcal_^\pi \right )^ \varphi(x)

where from the chain rule: :

\mathcal_^\pi = \left( \mathcal_ \mathcal_^\pi \right)^T.

In practical implementations, the kernel Bayes' rule takes the following form :

\widehat_^\pi = \widehat_^\pi \left( \left (\widehat_ \right )^2 + \widetilde \mathbf \right)^ \widehat_^\pi \varphi(x) = \widetilde \boldsymbol^T \left( (\mathbf \mathbf)^2 + \widetilde \mathbf \right)^ \mathbf \mathbf \mathbf_x

where :

\boldsymbol = \left(\mathbf + \widetilde \mathbf \right)^ \widetilde \operatorname(\boldsymbol), \qquad \mathbf = \operatorname\left(\left(\mathbf + \widetilde \mathbf \right)^ \widetilde \boldsymbol \right).

Two regularization parameters are used in this framework:

\lambda

for the estimation of

\widehat_^\pi, \widehat_^\pi = \boldsymbol \mathbf \boldsymbol^T

and

\widetilde

for the estimation of the final conditional embedding operator :

\widehat_^\pi = \widehat_^\pi \left( \left (\widehat_^\pi \right )^2 + \widetilde \mathbf \right)^ \widehat_^\pi.

The latter regularization is done on square of

\widehat_^\pi

because

D

may not be

positive definite In mathematics, positive definiteness is a property of any object to which a bilinear form or a sesquilinear form may be naturally associated, which is positive-definite. See, in particular: * Positive-definite bilinear form * Positive-definite ...

Applications

Measuring distance between distributions

The maximum mean discrepancy (MMD) is a distance-measure between distributions

P(X)

and

Q(Y)

which is defined as the squared distance between their embeddings in the RKHS :

\text(P,Q) = \left \,  \mu_X - \mu_Y \right \, _^2

While most distance-measures between distributions such as the widely used

either require density estimation (either parametrically or nonparametrically) or space partitioning/bias correction strategies, the MMD is easily estimated as an empirical mean which is concentrated around the true value of the MMD. The characterization of this distance as the ''maximum mean discrepancy'' refers to the fact that computing the MMD is equivalent to finding the RKHS function that maximizes the difference in expectations between the two probability distributions :

\text(P,Q) = \sup_ \left( \mathbb

- \mathbb (Y)\right)

Kernel two-sample test

Given ''n'' training examples from

P(X)

and ''m'' samples from

Q(Y)

, one can formulate a test statistic based on the empirical estimate of the MMD :

& = \frac \sum_^n\sum_^n k(x_i, x_j) + \frac \sum_^m\sum_^m k(y_i, y_j) - \frac \sum_^n\sum_^m k(x_i, y_j) \end

to obtain a two-sample test of the null hypothesis that both samples stem from the same distribution (i.e.

P = Q

) against the broad alternative

P \neq Q

Density estimation via kernel embeddings

Although learning algorithms in the kernel embedding framework circumvent the need for intermediate density estimation, one may nonetheless use the empirical embedding to perform density estimation based on ''n'' samples drawn from an underlying distribution

P_X^*

. This can be done by solving the following optimization problem :

\max_ H(P_X)

subject to

\, _\mathcal \le \varepsilon

where the maximization is done over the entire space of distributions on

\Omega.

Here,

\mu_X_X /math> is the kernel embedding of the proposed density P_X and H is an entropy-like quantity (e.g.

Entropy Entropy is a scientific concept, as well as a measurable physical property, that is most commonly associated with a state of disorder, randomness, or uncertainty. The term and the concept are used in diverse fields, from classical thermodyna ...

, KL divergence, Bregman divergence). The distribution which solves this optimization may be interpreted as a compromise between fitting the empirical kernel means of the samples well, while still allocating a substantial portion of the probability mass to all regions of the probability space (much of which may not be represented in the training examples). In practice, a good approximate solution of the difficult optimization may be found by restricting the space of candidate densities to a mixture of ''M'' candidate distributions with regularized mixing proportions. Connections between the ideas underlying

Gaussian process In probability theory and statistics, a Gaussian process is a stochastic process (a collection of random variables indexed by time or space), such that every finite collection of those random variables has a multivariate normal distribution, i.e. ...

es and conditional random fields may be drawn with the estimation of conditional probability distributions in this fashion, if one views the feature mappings associated with the kernel as sufficient statistics in generalized (possibly infinite-dimensional)

exponential families In probability and statistics, an exponential family is a parametric set of probability distributions of a certain form, specified below. This special form is chosen for mathematical convenience, including the enabling of the user to calculate ...

Measuring dependence of random variables

A measure of the statistical dependence between random variables

X

and

Y

(from any domains on which sensible kernels can be defined) can be formulated based on the Hilbert–Schmidt Independence Criterion :

\text(X, Y) = \left \,  \mathcal_ - \mu_X \otimes \mu_Y \right \, _^2

and can be used as a principled replacement for

Pearson correlation In statistics, the Pearson correlation coefficient (PCC, pronounced ) ― also known as Pearson's ''r'', the Pearson product-moment correlation coefficient (PPMCC), the bivariate correlation, or colloquially simply as the correlation coefficien ...

or any other dependence measure used in learning algorithms. Most notably, HSIC can detect arbitrary dependencies (when a characteristic kernel is used in the embeddings, HSIC is zero if and only if the variables are

independent Independent or Independents may refer to: Arts, entertainment, and media Artist groups * Independents (artist group), a group of modernist painters based in the New Hope, Pennsylvania, area of the United States during the early 1930s * Independe ...

), and can be used to measure dependence between different types of data (e.g. images and text captions). Given ''n'' i.i.d. samples of each random variable, a simple parameter-free unbiased estimator of HSIC which exhibits

concentration In chemistry, concentration is the abundance of a constituent divided by the total volume of a mixture. Several types of mathematical description can be distinguished: '' mass concentration'', '' molar concentration'', '' number concentration'' ...

about the true value can be computed in

O(n(d_f^2 +d_g^2))

time, where the Gram matrices of the two datasets are approximated using

\mathbf \mathbf^T, \mathbf \mathbf^T

with

\mathbf \in \R^, \mathbf \in \R^

. The desirable properties of HSIC have led to the formulation of numerous algorithms which utilize this dependence measure for a variety of common machine learning tasks such as:

feature selection In machine learning and statistics, feature selection, also known as variable selection, attribute selection or variable subset selection, is the process of selecting a subset of relevant features (variables, predictors) for use in model construc ...

(BAHSIC ), clustering (CLUHSIC ), and

dimensionality reduction Dimensionality reduction, or dimension reduction, is the transformation of data from a high-dimensional space into a low-dimensional space so that the low-dimensional representation retains some meaningful properties of the original data, ideally ...

(MUHSIC L. Song, A. Smola, K. Borgwardt, A. Gretton. (2007)
Colored maximum variance unfolding
''Neural Information Processing Systems''.). HSIC can be extended to measure the dependence of multiple random variables. The question of when HSIC captures independence in this case has recently been studied:Zoltán Szabó, Bharath K. Sriperumbudur

''Journal of Machine Learning Research'', 19:1–29, 2018. for more than two variables * on

\R^d

: the characteristic property of the individual kernels remains an equivalent condition. * on general domains: the characteristic property of the kernel components is necessary but ''not sufficient''.

Kernel belief propagation

Belief propagation A belief is an attitude that something is the case, or that some proposition is true. In epistemology, philosophers use the term "belief" to refer to attitudes about the world which can be either true or false. To believe something is to take i ...

is a fundamental algorithm for inference in

graphical model A graphical model or probabilistic graphical model (PGM) or structured probabilistic model is a probabilistic model for which a graph expresses the conditional dependence structure between random variables. They are commonly used in probability ...

s in which nodes repeatedly pass and receive messages corresponding to the evaluation of conditional expectations. In the kernel embedding framework, the messages may be represented as RKHS functions and the conditional distribution embeddings can be applied to efficiently compute message updates. Given ''n'' samples of random variables represented by nodes in a

Markov random field In the domain of physics and probability, a Markov random field (MRF), Markov network or undirected graphical model is a set of random variables having a Markov property described by an undirected graph. In other words, a random field is said to b ...

, the incoming message to node ''t'' from node ''u'' can be expressed as :

m_(\cdot) = \sum_^n \beta_^i \varphi(x_t^i)

if it assumed to lie in the RKHS. The kernel belief propagation update message from ''t'' to node ''s'' is then given by :

\widehat_ = \left( \odot_ \mathbf_t \boldsymbol_ \right)^T (\mathbf_s + \lambda \mathbf )^ \boldsymbol_s^T \varphi(x_s)

where

\odot

denotes the element-wise vector product,

N(t) \backslash s

is the set of nodes connected to ''t'' excluding node ''s'',

\boldsymbol_ = \left(\beta_^1, \dots, \beta_^n \right)

\mathbf_t, \mathbf_s

are the Gram matrices of the samples from variables

X_t, X_s

, respectively, and

\boldsymbol_s = \left(\varphi(x_s^1),\dots, \varphi(x_s^n)\right)

is the feature matrix for the samples from

X_s

. Thus, if the incoming messages to node ''t'' are linear combinations of feature mapped samples from

X_t

, then the outgoing message from this node is also a linear combination of feature mapped samples from

X_s

. This RKHS function representation of message-passing updates therefore produces an efficient belief propagation algorithm in which the potentials are nonparametric functions inferred from the data so that arbitrary statistical relationships may be modeled.

Nonparametric filtering in hidden Markov models

In the

hidden Markov model A hidden Markov model (HMM) is a statistical Markov model in which the system being modeled is assumed to be a Markov process — call it X — with unobservable ("''hidden''") states. As part of the definition, HMM requires that there be an ob ...

(HMM), two key quantities of interest are the transition probabilities between hidden states

P(S^t \mid S^)

and the emission probabilities

P(O^t \mid S^t)

for observations. Using the kernel conditional distribution embedding framework, these quantities may be expressed in terms of samples from the HMM. A serious limitation of the embedding methods in this domain is the need for training samples containing hidden states, as otherwise inference with arbitrary distributions in the HMM is not possible. One common use of HMMs is

filtering Filter, filtering or filters may refer to: Science and technology Computing * Filter (higher-order function), in functional programming * Filter (software), a computer program to process a data stream * Filter (video), a software component th ...

in which the goal is to estimate posterior distribution over the hidden state

s^t

at time step ''t'' given a history of previous observations

h^t = (o^1, \dots, o^t)

from the system. In filtering, a belief state

P(S^ \mid h^)

is recursively maintained via a prediction step (where updates

P(S^ \mid h^t) = \mathbb (S^ \mid S^t) \mid h^t /math> are computed by marginalizing out the previous hidden state) followed by a conditioning step (where updates P(S^ \mid h^t, o^) \propto P(o^ \mid S^) P(S^ \mid h^t) are computed by applying Bayes' rule to condition on a new observation). The RKHS embedding of the belief state at time ''t+1'' can be recursively expressed as 

: \mu_ = \mathcal_^\pi \left(\mathcal_^\pi \right)^ \varphi(o^) by computing the embeddings of the prediction step via the kernel sum rule and the embedding of the conditioning step via kernel Bayes' rule . Assuming a training sample (\widetilde^1, \dots, \widetilde^T, \widetilde^1, \dots, \widetilde^T) is given, one can in practice estimate 

: \widehat_ = \sum_^T \alpha_i^t \varphi(\widetilde^t) and filtering with kernel embeddings is thus implemented recursively using the following updates for the weights \boldsymbol = (\alpha_1, \dots, \alpha_T) : \mathbf^ = \operatorname\left((G+\lambda \mathbf)^ \widetilde \boldsymbol^t \right) : \boldsymbol^ = \mathbf^ \mathbf \left( (\mathbf^ K)^2 + \widetilde \mathbf \right)^ \mathbf^ \mathbf_where \mathbf, \mathbf denote the Gram matrices of \widetilde^1, \dots, \widetilde^T and \widetilde^1, \dots, \widetilde^T respectively, \widetilde is a transfer Gram matrix defined as \widetilde_ = k(\widetilde_i, \widetilde_), and \mathbf_ = (k(\widetilde^1, o^), \dots, k(\widetilde^T, o^))^T.

Support measure machines

The support measure machine (SMM) is a generalization of the

support vector machine In machine learning, support vector machines (SVMs, also support vector networks) are supervised learning models with associated learning algorithms that analyze data for classification and regression analysis. Developed at AT&T Bell Laboratories ...

(SVM) in which the training examples are probability distributions paired with labels

\_^n, \ y_i \in \

.K. Muandet, K. Fukumizu, F. Dinuzzo, B. Schölkopf. (2012)
Learning from Distributions via Support Measure Machines
''Advances in Neural Information Processing Systems'': 10–18. SMMs solve the standard SVM dual optimization problem using the following expected kernel :

K\left(P(X), Q(Z)\right) = \langle \mu_X , \mu_Z \rangle_\mathcal = \mathbb (x,z)

which is computable in closed form for many common specific distributions

P_i

(such as the Gaussian distribution) combined with popular embedding kernels

k

(e.g. the Gaussian kernel or polynomial kernel), or can be accurately empirically estimated from i.i.d. samples

\_^n \sim P(X), \_^m \sim Q(Z)

via :

\widehat (X, Z) = \frac \sum_^n \sum_^m k(x_i, z_j)

Under certain choices of the embedding kernel

k

, the SMM applied to training examples

\_^n

is equivalent to a SVM trained on samples

\_^n

, and thus the SMM can be viewed as a ''flexible'' SVM in which a different data-dependent kernel (specified by the assumed form of the distribution

P_i

) may be placed on each training point.

Domain adaptation under covariate, target, and conditional shift

The goal of

domain adaptation Domain adaptation is a field associated with machine learning and transfer learning. This scenario arises when we aim at learning from a source data distribution a well performing model on a different (but related) target data distribution. For ...

is the formulation of learning algorithms which generalize well when the training and test data have different distributions. Given training examples

\_^n

and a test set

\_^m

where the

y_j^\text

are unknown, three types of differences are commonly assumed between the distribution of the training examples

P^\text(X,Y)

and the test distribution

P^\text(X,Y)

:K. Zhang, B. Schölkopf, K. Muandet, Z. Wang. (2013)
Domain adaptation under target and conditional shift
''Journal of Machine Learning Research, 28(3): 819–827.A. Gretton, A. Smola, J. Huang, M. Schmittfull, K. Borgwardt, B. Schölkopf. (2008). Covariate shift and local learning by distribution matching. ''In J. Quinonero-Candela, M. Sugiyama, A. Schwaighofer, N. Lawrence (eds.). Dataset shift in machine learning'', MIT Press, Cambridge, MA: 131–160. # Covariate shift in which the marginal distribution of the covariates changes across domains:

P^\text(X) \neq P^\text(X)

# Target shift in which the marginal distribution of the outputs changes across domains:

P^\text(Y) \neq P^\text(Y)

# Conditional shift in which

P(Y)

remains the same across domains, but the conditional distributions differ:

P^\text(X \mid Y) \neq P^\text(X \mid Y)

. In general, the presence of conditional shift leads to an ill-posed problem, and the additional assumption that

P(X \mid Y)

changes only under

location In geography, location or place are used to denote a region (point, line, or area) on Earth's surface or elsewhere. The term ''location'' generally implies a higher degree of certainty than ''place'', the latter often indicating an entity with an ...

scale Scale or scales may refer to: Mathematics * Scale (descriptive set theory), an object defined on a set of points * Scale (ratio), the ratio of a linear dimension of a model to the corresponding dimension of the original * Scale factor, a number ...

(LS) transformations on

X

is commonly imposed to make the problem tractable. By utilizing the kernel embedding of marginal and conditional distributions, practical approaches to deal with the presence of these types of differences between training and test domains can be formulated. Covariate shift may be accounted for by reweighting examples via estimates of the ratio

P^\text(X)/P^\text(X)

obtained directly from the kernel embeddings of the marginal distributions of

X

in each domain without any need for explicit estimation of the distributions. Target shift, which cannot be similarly dealt with since no samples from

Y

are available in the test domain, is accounted for by weighting training examples using the vector

\boldsymbol^*(\mathbf^\text)

which solves the following optimization problem (where in practice, empirical approximations must be used) :

- \mu_ \right \, _\mathcal^2

subject to

\boldsymbol(y) \ge 0, \mathbb boldsymbol(y^\text) = 1

To deal with location scale conditional shift, one can perform a LS transformation of the training points to obtain new transformed training data

\mathbf^\text = \mathbf^\text \odot \mathbf + \mathbf

(where

\odot

denotes the element-wise vector product). To ensure similar distributions between the new transformed training samples and the test data,

\mathbf,\mathbf

are estimated by minimizing the following empirical kernel embedding distance :

\left \,  \widehat_ - \widehat_ \right \, _^2 = \left \,  \widehat_ \widehat_ - \widehat_ \right \, _^2

In general, the kernel embedding methods for dealing with LS conditional shift and target shift may be combined to find a reweighted transformation of the training data which mimics the test distribution, and these methods may perform well even in the presence of conditional shifts other than location-scale changes.

Domain generalization via invariant feature representation

Given ''N'' sets of training examples sampled i.i.d. from distributions

P^(X,Y), P^(X,Y), \ldots, P^(X,Y)

, the goal of domain generalization is to formulate learning algorithms which perform well on test examples sampled from a previously unseen domain

P^*(X,Y)

where no data from the test domain is available at training time. If conditional distributions

P(Y \mid X)

are assumed to be relatively similar across all domains, then a learner capable of domain generalization must estimate a functional relationship between the variables which is robust to changes in the marginals

P(X)

. Based on kernel embeddings of these distributions, Domain Invariant Component Analysis (DICA) is a method which determines the transformation of the training data that minimizes the difference between marginal distributions while preserving a common conditional distribution shared between all training domains.K. Muandet, D. Balduzzi, B. Schölkopf. (2013
Domain Generalization Via Invariant Feature Representation
''30th International Conference on Machine Learning''. DICA thus extracts ''invariants'', features that transfer across domains, and may be viewed as a generalization of many popular dimension-reduction methods such as

kernel principal component analysis In the field of multivariate statistics, kernel principal component analysis (kernel PCA) is an extension of principal component analysis (PCA) using techniques of kernel methods. Using a kernel, the originally linear operations of PCA are performed ...

, transfer component analysis, and covariance operator inverse regression. Defining a probability distribution

\mathcal

on the RKHS

\mathcal

with :

\mathcal \left (\mu_ \right ) = \frac \qquad \text i=1,\dots, N,

DICA measures dissimilarity between domains via distributional variance which is computed as :

V_\mathcal (\mathcal) = \frac \operatorname(\mathbf) - \frac \sum_^N \mathbf_

where :

\mathbf_ = \left \langle \mu_, \mu_ \right \rangle_\mathcal

\mathbf

is a

N \times N

Gram matrix over the distributions from which the training data are sampled. Finding an

orthogonal transform In linear algebra, an orthogonal matrix, or orthonormal matrix, is a real square matrix whose columns and rows are orthonormal vectors. One way to express this is Q^\mathrm Q = Q Q^\mathrm = I, where is the transpose of and is the identity ma ...

onto a low-dimensional subspace ''B'' (in the feature space) which minimizes the distributional variance, DICA simultaneously ensures that ''B'' aligns with the bases of a central subspace ''C'' for which

Y

becomes independent of

X

given

C^T X

across all domains. In the absence of target values

Y

, an unsupervised version of DICA may be formulated which finds a low-dimensional subspace that minimizes distributional variance while simultaneously maximizing the variance of

X

(in the feature space) across all domains (rather than preserving a central subspace).

Distribution regression

In distribution regression, the goal is to regress from probability distributions to reals (or vectors). Many important

and statistical tasks fit into this framework, including multi-instance learning, and

point estimation In statistics, point estimation involves the use of sample data to calculate a single value (known as a point estimate since it identifies a point in some parameter space) which is to serve as a "best guess" or "best estimate" of an unknown popu ...

problems without analytical solution (such as hyperparameter or entropy estimation). In practice only samples from sampled distributions are observable, and the estimates have to rely on similarities computed between ''sets of points''. Distribution regression has been successfully applied for example in supervised entropy learning, and aerosol prediction using multispectral satellite images.Z. Szabó, B. Sriperumbudur, B. Póczos, A. Gretton
Learning Theory for Distribution Regression
''Journal of Machine Learning Research'', 17(152):1–40, 2016. Given

_^\ell

training data, where the

\hat := \_^

bag contains samples from a probability distribution

X_i

and the

i^\text

output label is

y_i\in \R

, one can tackle the distribution regression task by taking the embeddings of the distributions, and learning the regressor from the embeddings to the outputs. In other words, one can consider the following kernel

ridge regression Ridge regression is a method of estimating the coefficients of multiple- regression models in scenarios where the independent variables are highly correlated. It has been used in many fields including econometrics, chemistry, and engineering. Also ...

problem

(\lambda>0)

2 + \lambda \, f\, _^2 \to \min_,

where :

\mu_ = \int_\Omega k(\cdot,u) \, \mathrm \hat_i(u)= \frac \sum_^ k(\cdot, X_)

with a

k

kernel on the domain of

X_i

-s

(k:\Omega\times \Omega \to \R)

K

is a kernel on the embedded distributions, and

\mathcal(K)

is the RKHS determined by

K

. Examples for

K

include the linear kernel

\left K(\mu_P,\mu_Q) = \langle\mu_P,\mu_Q\rangle_ \right

, the Gaussian kernel

\left K(\mu_P,\mu_Q) = e^ \right

, the exponential kernel

\left K(\mu_P,\mu_Q) = e^ \right

, the Cauchy kernel

\left \mu_P-\mu_Q\right\, _^2/\sigma^2 \right)^ \right

, the generalized t-student kernel

\left \mu_P-\mu_Q\right\, _^ \right)^, (\sigma \le 2) \right

, or the inverse multiquadrics kernel

\left \mu_P-\mu_Q\right\, _^2 + \sigma^2 \right)^ \right

. The prediction on a new distribution

(\hat)

takes the simple, analytical form ::

\mathbf,

where

in \R^

in \R^

G_ = K\big(\mu_,\mu_\big)\in \R

in \R^\ell

. Under mild regularity conditions this estimator can be shown to be consistent and it can achieve the one-stage sampled (as if one had access to the true

X_i

-s) minimax optimal rate. In the

J

objective function

y_i

-s are real numbers; the results can also be extended to the case when

y_i

-s are

d

-dimensional vectors, or more generally elements of a separable

using operator-valued

K

kernels.

Example

In this simple example, which is taken from Song et al.,

X, Y

are assumed to be discrete random variables which take values in the set

\

and the kernel is chosen to be the

Kronecker delta In mathematics, the Kronecker delta (named after Leopold Kronecker) is a function of two variables, usually just non-negative integers. The function is 1 if the variables are equal, and 0 otherwise: \delta_ = \begin 0 &\text i \neq j, \\ 1 ...

function, so

k(x,x') = \delta(x,x')

. The feature map corresponding to this kernel is the

standard basis In mathematics, the standard basis (also called natural basis or canonical basis) of a coordinate vector space (such as \mathbb^n or \mathbb^n) is the set of vectors whose components are all zero, except one that equals 1. For example, in th ...

vector

\varphi(x) = \mathbf_x

. The kernel embeddings of such a distributions are thus vectors of marginal probabilities while the embeddings of joint distributions in this setting are

K\times K

matrices specifying joint probability tables, and the explicit form of these embeddings is :

= \begin P(X=1) \\ \vdots \\ P(X=K) \\ \end

\mathcal_ = \mathbb mathbf_X \otimes \mathbf_Y = ( P(X=s, Y=t))_

The conditional distribution embedding operator, :

\mathcal_ = \mathcal_ \mathcal_^,

is in this setting a conditional probability table :

\mathcal_ = ( P(Y=s \mid X=t))_

and :

\mathcal_ =\begin P(X=1) & \dots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \dots & P(X=K) \\ \end

Thus, the embeddings of the conditional distribution under a fixed value of

X

may be computed as :

\mu_ = \mathcal_ \varphi(x) = \begin P(Y=1 \mid X = x) \\ \vdots \\ P(Y=K \mid X = x) \\ \end

In this discrete-valued setting with the Kronecker delta kernel, the kernel sum rule becomes :

\underbrace_ = \underbrace_ \underbrace_

The kernel chain rule in this case is given by :

\underbrace_ = \underbrace_ \underbrace_

References

{{reflist

External links

Information Theoretical Estimators toolbox
(distribution regression demonstration). Machine learning Theory of probability distributions