operator theory In mathematics, operator theory is the study of linear operators on function spaces, beginning with differential operators and integral operators. The operators may be presented abstractly by their characteristics, such as bounded linear operat ...

, a branch of mathematics, a positive-definite kernel is a generalization of a

positive-definite function In mathematics, a positive-definite function is, depending on the context, either of two types of function. Most common usage A ''positive-definite function'' of a real variable ''x'' is a complex-valued function f: \mathbb \to \mathbb such th ...

or a

positive-definite matrix In mathematics, a symmetric matrix M with real entries is positive-definite if the real number z^\textsfMz is positive for every nonzero real column vector z, where z^\textsf is the transpose of More generally, a Hermitian matrix (that is, a co ...

. It was first introduced by James Mercer in the early 20th century, in the context of solving integral operator equations. Since then, positive-definite functions and their various analogues and generalizations have arisen in diverse parts of mathematics. They occur naturally in

Fourier analysis In mathematics, Fourier analysis () is the study of the way general functions may be represented or approximated by sums of simpler trigonometric functions. Fourier analysis grew from the study of Fourier series, and is named after Josep ...

probability theory Probability theory is the branch of mathematics concerned with probability. Although there are several different probability interpretations, probability theory treats the concept in a rigorous mathematical manner by expressing it through a set o ...

, complex function-theory,

moment problem In mathematics, a moment problem arises as the result of trying to invert the mapping that takes a measure ''μ'' to the sequences of moments :m_n = \int_^\infty x^n \,d\mu(x)\,. More generally, one may consider :m_n = \int_^\infty M_n(x) \,d ...

integral equation In mathematics, integral equations are equations in which an unknown Function (mathematics), function appears under an integral sign. In mathematical notation, integral equations may thus be expressed as being of the form: f(x_1,x_2,x_3,...,x_n ; ...

s, boundary-value problems for

partial differential equation In mathematics, a partial differential equation (PDE) is an equation which imposes relations between the various partial derivatives of a Multivariable calculus, multivariable function. The function is often thought of as an "unknown" to be sol ...

machine learning Machine learning (ML) is a field of inquiry devoted to understanding and building methods that 'learn', that is, methods that leverage data to improve performance on some set of tasks. It is seen as a part of artificial intelligence. Machine ...

embedding problem In Galois theory, a branch of mathematics, the embedding problem is a generalization of the inverse Galois problem. Roughly speaking, it asks whether a given Galois extension can be embedded into a Galois extension in such a way that the restriction ...

information theory Information theory is the scientific study of the quantification (science), quantification, computer data storage, storage, and telecommunication, communication of information. The field was originally established by the works of Harry Nyquist a ...

, and other areas. This article will discuss some of the historical and current developments of the theory of positive-definite kernels, starting with the general idea and properties before considering practical applications.

Definition

Let

\mathcal X

be a nonempty set, sometimes referred to as the index set. A

symmetric function In mathematics, a function of n variables is symmetric if its value is the same no matter the order of its arguments. For example, a function f\left(x_1,x_2\right) of two arguments is a symmetric function if and only if f\left(x_1,x_2\right) = f\l ...

K: \mathcal X \times \mathcal X \to \mathbb

is called a positive-definite (p.d.) kernel on

\mathcal X

if holds for any

x_1, \dots, x_n\in \mathcal X

, given

n\in \mathbb, c_1, \dots, c_n \in \mathbb

. In probability theory, a distinction is sometimes made between positive-definite kernels, for which equality in (1.1) implies

c_i = 0\; (\forall i)

, and positive semi-definite (p.s.d.) kernels, which do not impose this condition. Note that this is equivalent to requiring that any finite matrix constructed by pairwise evaluation,

\mathbf_ = K(x_i, x_j)

, has either entirely positive (p.d.) or nonnegative (p.s.d.)

eigenvalues In linear algebra, an eigenvector () or characteristic vector of a linear transformation is a nonzero vector that changes at most by a scalar factor when that linear transformation is applied to it. The corresponding eigenvalue, often denoted b ...

. In mathematical literature, kernels are usually complex valued functions, but in this article we assume real-valued functions, which is the common practice in applications of p.d. kernels.

Some general properties

* For a family of p.d. kernels

(K_i)_,\ \  K_i:\mathcal X\times \mathcal X\to \mathbb

** The conical sum

\sum_^n \lambda_i K_i

is p.d., given

\lambda_1,\dots,\lambda_n \ge 0

** The product

K_1^\dots K_n^

is p.d., given

a_1,\dots,a_n\in \mathbb

** The limit

K = \lim_K_n

is p.d. if the limit exists. * If

(\mathcal X_i)_^n

is a sequence of sets, and

(K_i)_^n,\ \  K_i:\mathcal X_i\times \mathcal X_i\to \mathbb

a sequence of p.d. kernels, then both

K((x_1,\dots,x_n),(y_1,\dots,y_n)) =\prod_^n K_i(x_i,y_i)

and

K((x_1,\dots,x_n),(y_1,\dots,y_n)) =\sum_^n K_i(x_i,y_i)

are p.d. kernels on

\mathcal X = \mathcal X_1 \times \dots \times \mathcal X_n

. * Let

\mathcal X_0\subset \mathcal X

. Then the restriction

K_0

K

\mathcal X_0\times \mathcal X_0

is also a p.d. kernel.

Examples of p.d. kernels

* Common examples of p.d. kernels defined on Euclidean space

\mathbb^d

include: ** Linear kernel:

K(\mathbf, \mathbf) = \mathbf^T\mathbf, \quad \mathbf,\mathbf \in\mathbb^d

. **

Polynomial kernel In machine learning, the polynomial kernel is a kernel function commonly used with support vector machines (SVMs) and other kernelized models, that represents the similarity of vectors (training samples) in a feature space over polynomials of th ...

K(\mathbf, \mathbf) =(\mathbf^T\mathbf+r)^n, \quad \mathbf,\mathbf \in\mathbb^d, r \ge 0, n \ge 1

. ** Gaussian kernel (

RBF kernel In machine learning, the radial basis function kernel, or RBF kernel, is a popular kernel function used in various kernelized learning algorithms. In particular, it is commonly used in support vector machine classification. The RBF kernel on two s ...

K(\mathbf, \mathbf) =e^, \quad \mathbf, \mathbf\in\mathbb^d, \sigma>0

. ** Laplacian kernel:

K(\mathbf, \mathbf) =e^, \quad \mathbf, \mathbf\in\mathbb^d, \alpha>0

. ** Abel kernel:

K(x,y) =e^, x,y \quad \in\mathbb, \alpha>0

. ** kernel generating

Sobolev space In mathematics, a Sobolev space is a vector space of functions equipped with a norm that is a combination of ''Lp''-norms of the function together with its derivatives up to a given order. The derivatives are understood in a suitable weak sense t ...

W^k_2(\mathbb^d)

K(x,y) = \, x-y\, _2^B_(\, x-y\, _2)

, where

B_\nu

is the Bessel function of the third kind. ** kernel generating Paley–Wiener space:

K(x,y) = \mbox(\alpha(x-y)), x,y\in\mathbb, \alpha>0

. * If

H

is a

Hilbert space In mathematics, Hilbert spaces (named after David Hilbert) allow generalizing the methods of linear algebra and calculus from (finite-dimensional) Euclidean vector spaces to spaces that may be infinite-dimensional. Hilbert spaces arise natural ...

, then its corresponding inner product

(\cdot,\cdot)_H:H\times H\to \mathbb

is a p.d. kernel. Indeed, we have

\sum_^n c_i c_j (x_i, x_j)_H = \left(\sum_^n c_i x_i, \sum_^n c_j x_j\right)_H= \left\, \sum_^n c_i x_i\right\, _H^2\ge 0

* Kernels defined on

\mathbb_+^d

and histograms: Histograms are frequently encountered in applications of real-life problems. Most observations are usually available under the form of nonnegative vectors of counts, which, if normalized, yield histograms of frequencies. It has been shown that the following family of squared metrics, respectively Jensen divergence, the

\chi

-square, Total Variation, and two variations of the Hellinger distance:

\psi_= H\left(\frac2\right)-\frac2,

\psi_= \sum_i\frac,\quad \psi_= \sum_i \left, \theta_i - \theta_i'\,

\psi_= \sum_i \left, \sqrt-\sqrt\,\psi_= \sum_i \left, \sqrt - \sqrt\^2,

can be used to define p.d. kernels using the following formula

K(\theta,\theta') = e^, \alpha>0.

History

Positive-definite kernels, as defined in (1.1), appeared first in 1909 in a paper on integral equations by James Mercer. Several other authors made use of this concept in the following two decades, but none of them explicitly used kernels

K(x,y) = f(x-y)

, i.e. p.d. functions (indeed M. Mathias and S. Bochner seem not to have been aware of the study of p.d. kernels). Mercer’s work arose from Hilbert’s paper of 1904 on

Fredholm integral equation In mathematics, the Fredholm integral equation is an integral equation whose solution gives rise to Fredholm theory, the study of Fredholm kernels and Fredholm operators. The integral equation was studied by Ivar Fredholm. A useful method to solve ...

s of the second kind: In particular, Hilbert had shown that where

K

is a continuous real symmetric kernel,

x

is continuous,

\

is a complete system of orthonormal eigenfunctions, and

\lambda_n

’s are the corresponding

of (1.2). Hilbert defined a “definite” kernel as one for which the double integral

J(x) =\int_a^b \int_a^b K(s,t)x(s)x(t)\ \mathrms\; \mathrmt

satisfies

J(x)>0

except for

x(t) = 0

. The original object of Mercer’s paper was to characterize the kernels which are definite in the sense of Hilbert, but Mercer soon found that the class of such functions was too restrictive to characterize in terms of determinants. He therefore defined a continuous real symmetric kernel

K(s,t)

to be of positive type (i.e. positive-definite) if

J(x)\ge 0

for all real continuous functions

x

,b /math>, and he proved that (1.1) is a necessary and sufficient condition for a kernel to be of positive type. Mercer then proved that for any continuous p.d. kernel the expansion K(s,t) = \sum_n \frac holds absolutely and uniformly.

At about the same time W. H. Young, motivated by a different question in the theory of integral equations, showed that for continuous kernels condition (1.1) is equivalent to J(x)\ge 0 for all x\in L^1,b /math>.

E.H. Moore  initiated the study of a very general kind of p.d. kernel. If E is an abstract set, he calls functions K(x,y) defined on E \times E “positive Hermitian matrices” if they satisfy (1.1) for all x_i\in E . Moore was interested in generalization of integral equations and showed that to each  such K there is a Hilbert space H of functions such that, for each f\in H, f(y) = (f,K(\cdot,y))_H . This property is called the reproducing property of the kernel and turns out to have importance in the solution of boundary-value problems for elliptic partial differential equations.

Another line of development in which p.d. kernels played a large role was the theory of harmonics on homogeneous spaces as begun by E. Cartan in 1929, and continued by H. Weyl and S. Ito. The most comprehensive theory of p.d. kernels in homogeneous spaces is that of M. Krein which includes as special cases the work on p.d. functions and irreducible unitary representations of locally compact groups.

In probability theory p.d. kernels arise as covariance kernels of stochastic processes.

Connection with reproducing kernel Hilbert spaces and feature maps

Positive-definite kernels provide a framework that encompasses some basic Hilbert space constructions. In the following we present a tight relationship between positive-definite kernels and two mathematical objects, namely reproducing Hilbert spaces and feature maps. Let

X

be a set,

H

a Hilbert space of functions

f:X\to \mathbb

, and

(\cdot,\cdot)_H : H \times H \to \mathbb

the corresponding inner product on

H

. For any

x\in X

the evaluation functional

e_x:H\to \mathbb

is defined by

f\mapsto e_x(f) =f(x)

. We first define a reproducing kernel Hilbert space (RKHS):

Definition: Space $H$ is called a reproducing kernel Hilbert space if the evaluation functionals are continuous.

Every RKHS has a special function associated to it, namely the reproducing kernel:

Definition: Reproducing kernel is a function $K:X\times X \to \mathbb$ such that # $K_x(\cdot)\in H, \forall x\in X$ , and # $(f,K_x) =f(x)$ , for all $f\in H$ and $x\in X$ . The latter property is called the reproducing property.

The following result shows equivalence between RKHS and reproducing kernels: Now the connection between positive definite kernels and RKHS is given by the following theorem Thus, given a positive-definite kernel

K

, it is possible to build an associated RKHS with

K

as a reproducing kernel. As stated earlier, positive definite kernels can be constructed from inner products. This fact can be used to connect p.d. kernels with another interesting object that arises in machine learning applications, namely the feature map. Let

F

be a Hilbert space, and

(\cdot,\cdot)_F

the corresponding inner product. Any map

\Phi: X\to F

is called a feature map. In this case we call

F

the feature space. It is easy to see that every feature map defines a unique p.d. kernel by

K(x,y) =(\Phi(x),\Phi(y))_F.

Indeed, positive definiteness of

K

follows from the p.d. property of the inner product. On the other hand, every p.d. kernel, and its corresponding RKHS, have many associated feature maps. For example: Let

F=H

, and

\Phi(x) = K_x

for all

x\in X

. Then

(\Phi(x),\Phi(y))_F = (K_x,K_y)_H = K(x,y)

, by the reproducing property. This suggests a new look at p.d. kernels as inner products in appropriate Hilbert spaces, or in other words p.d. kernels can be viewed as similarity maps which quantify effectively how similar two points

x

and

y

are through the value

K(x,y)

. Moreover, through the equivalence of p.d. kernels and its corresponding RKHS, every feature map can be used to construct a RKHS.

Kernels and distances

Kernel methods are often compared to distance based methods such as nearest neighbors. In this section we discuss parallels between their two respective ingredients, namely kernels

K

and distances

d

. Here by a distance function between each pair of elements of some set

X

, we mean a

metric Metric or metrical may refer to: * Metric system, an internationally adopted decimal system of measurement * An adjective indicating relation to measurement in general, or a noun describing a specific type of measurement Mathematics In mathema ...

defined on that set, i.e. any nonnegative-valued function

d

\mathcal X\times \mathcal X

which satisfies *

d(x,y) \ge 0

, and

d(x,y)=0

if and only if

x=y

, *

d(x,y) = d(y,x)

, *

d(x,z) \le d(x,y)+d(y,z)

. One link between distances and p.d. kernels is given by a particular kind of kernel, called a negative definite kernel, and defined as follows

Definition: A symmetric function $\psi:\mathcal X\times \mathcal X\to \mathbb$ is called a negative definite (n.d.) kernel on $\mathcal X$ if holds for any $n\in \mathbb, x_1, \dots, x_n\in \mathcal X,$ and $c_1, \dots, c_n \in \mathbb$ such that $\sum_^n c_i=0$ .

The parallel between n.d. kernels and distances is in the following: whenever a n.d. kernel vanishes on the set

\

, and is zero only on this set, then its square root is a distance for

\mathcal X

. At the same time each distance does not correspond necessarily to a n.d. kernel. This is only true for Hilbertian distances, where distance

d

is called Hilbertian if one can embed the metric space

(\mathcal X,d)

isometrically into some Hilbert space. On the other hand, n.d. kernels can be identified with a subfamily of p.d. kernels known as infinitely divisible kernels. A nonnegative-valued kernel

K

is said to be infinitely divisible if for every

n\in\mathbb

there exists a positive-definite kernel

K_n

such that

K=(K_n)^n

. Another link is that a p.d. kernel induces a pseudometric, where the first constraint on the distance function is loosened to allow

d(x,y) = 0

for

x \neq y

. Given a positive-definite kernel

K

, we can define a distance function as:

d(x,y) = \sqrt

Some applications

Kernels in machine learning

Positive-definite kernels, through their equivalence with reproducing kernel Hilbert spaces, are particularly important in the field of

statistical learning theory Statistical learning theory is a framework for machine learning drawing from the fields of statistics and functional analysis. Statistical learning theory deals with the statistical inference problem of finding a predictive function based on dat ...

because of the celebrated

representer theorem For computer science, in statistical learning theory, a representer theorem is any of several related results stating that a minimizer f^ of a regularized empirical risk functional defined over a reproducing kernel Hilbert space can be represente ...

which states that every minimizer function in an RKHS can be written as a linear combination of the kernel function evaluated at the training points. This is a practically useful result as it effectively simplifies the empirical risk minimization problem from an infinite dimensional to a finite dimensional optimization problem.

Kernels in probabilistic models

There are several different ways in which kernels arise in probability theory. * Nondeterministic recovery problems: Assume that we want to find the response

f(x)

of an unknown model function

f

at a new point

x

of a set

\mathcal X

, provided that we have a sample of input-response pairs

(x_i,f_i)=(x_i,f(x_i))

given by observation or experiment. The response

f_i

x_i

is not a fixed function of

x_i

but rather a realization of a real-valued random variable

Z(x_i)

. The goal is to get information about the function

E

(x_i) X, or x, is the twenty-fourth and third-to-last Letter (alphabet), letter in the Latin alphabet, used in the English alphabet, modern English alphabet, the alphabets of other western European languages and others worldwide. Its English a ...

/math> which replaces

f

in the deterministic setting. For two elements

x,y\in\mathcal X

the random variables

Z(x)

and

Z(y)

will not be uncorrelated, because if

x

is too close to

y

the random experiments described by

Z(x)

and

Z(y)

will often show similar behaviour. This is described by a covariance kernel

K(x,y)=E (x)\cdot Z(y)

. Such a kernel exists and is positive-definite under weak additional assumptions. Now a good estimate for

Z(x)

can be obtained by using kernel interpolation with the covariance kernel, ignoring the probabilistic background completely. Assume now that a noise variable

\epsilon(x)

, with zero mean and variance

\sigma^2

, is added to

x

, such that the noise is independent for different

x

and independent of

Z

there, then the problem of finding a good estimate for

f

is identical to the above one, but with a modified kernel given by

+ \sigma^2\delta_

. * Density estimation by kernels: The problem is to recover the density

f

of a multivariate distribution over a domain

\mathcal X

, from a large sample

x_1,\dots,x_n\in\mathcal X

including repetitions. Where sampling points lie dense, the true density function must take large values. A simple density estimate is possible by counting the number of samples in each cell of a grid, and plotting the resulting histogram, which yields a piecewise constant density estimate. A better estimate can be obtained by using a nonnegative translation invariant kernel

K

, with total integral equal to one, and define

f(x) = \frac 1 n \sum_^n K\left(\frach\right)

as a smooth estimate.

Numerical solution of partial differential equations

One of the greatest application areas of so-called

meshfree methods In the field of numerical analysis, meshfree methods are those that do not require connection between nodes of the simulation domain, i.e. a mesh, but are rather based on interaction of each node with all its neighbors. As a consequence, original ...

is in the numerical solution of

PDEs In mathematics, a partial differential equation (PDE) is an equation which imposes relations between the various partial derivatives of a multivariable function. The function is often thought of as an "unknown" to be solved for, similarly to h ...

. Some of the popular meshfree methods are closely related to positive-definite kernels (such as meshless local Petrov Galerkin (MLPG), Reproducing kernel particle method (RKPM) and smoothed-particle hydrodynamics (SPH)). These methods use radial basis kernel for

collocation In corpus linguistics, a collocation is a series of words or terms that co-occur more often than would be expected by chance. In phraseology, a collocation is a type of compositional phraseme, meaning that it can be understood from the words th ...

Stinespring dilation theorem

Other applications

In the literature on computer experiments and other engineering experiments one increasingly encounters models based on p.d. kernels, RBFs or

kriging In statistics, originally in geostatistics, kriging or Kriging, also known as Gaussian process regression, is a method of interpolation based on Gaussian process governed by prior covariances. Under suitable assumptions of the prior, kriging giv ...

. One such topic is

response surface methodology In statistics, response surface methodology (RSM) explores the relationships between several explanatory variables and one or more response variables. The method was introduced by George E. P. Box and K. B. Wilson in 1951. The main idea of RSM ...

. Other types of applications that boil down to data fitting are

rapid prototyping Rapid prototyping is a group of techniques used to quickly fabricate a scale model of a physical part or assembly using three-dimensional computer aided design (CAD) data. Construction of the part or assembly is usually done using 3D printin ...

and

computer graphics Computer graphics deals with generating images with the aid of computers. Today, computer graphics is a core technology in digital photography, film, video games, cell phone and computer displays, and many specialized applications. A great de ...

. Here one often uses implicit surface models to approximate or interpolate point cloud data. Applications of p.d. kernels in various other branches of mathematics are in multivariate integration, multivariate optimization, and in numerical analysis and scientific computing, where one studies fast, accurate and adaptive algorithms ideally implemented in high-performance computing environments.Gumerov, N. A. and Duraiswami, R. (2007).
Fast radial basis function interpolation via preconditioned Krylov iteration
. SIAM J. Scient. Computing 29/5, pp. 1876-1899.

References

{{DEFAULTSORT:Positive Definite Kernel Operator theory Hilbert space

Definition

Some general properties

Examples of p.d. kernels

History

Connection with reproducing kernel Hilbert spaces and feature maps

Kernels and distances

Some applications

Kernels in machine learning

Kernels in probabilistic models

Numerical solution of partial differential equations

Stinespring dilation theorem

Other applications

See also

References