Kernel methods are a well-established tool to analyze the relationship between input data and the corresponding output of a function. Kernels encapsulate the properties of functions in a computationally efficient way and allow algorithms to easily swap functions of varying complexity.
In typical machine learning algorithms, these functions produce a scalar output. Recent development of kernel methods for functions with vector-valued output is due, at least in part, to interest in simultaneously solving related problems. Kernels which capture the relationship between the problems allow them to ''borrow strength'' from each other. Algorithms of this type include
multi-task learning (also called multi-output learning or vector-valued learning), transfer learning, and co-kriging.
Multi-label classification can be interpreted as mapping inputs to (binary) coding vectors with length equal to the number of classes.
In Gaussian processes, kernels are called covariance functions. Multiple-output functions correspond to considering multiple processes. See Bayesian interpretation of regularization for the connection between the two perspectives.
History
The history of learning vector-valued functions is closely linked to transfer learning: storing knowledge gained while solving one problem and applying it to a different but related problem. The fundamental motivation for transfer learning in the field of machine learning was discussed in a NIPS-95 workshop on “Learning to Learn,” which focused on the need for lifelong machine learning methods that retain and reuse previously learned knowledge. Research on transfer learning has attracted much attention since 1995 under different names: learning to learn, lifelong learning, knowledge transfer, inductive transfer, multitask learning, knowledge consolidation, context-sensitive learning, knowledge-based inductive bias, metalearning, and incremental/cumulative learning.
[S.J. Pan and Q. Yang, "A survey on transfer learning," IEEE Transactions on Knowledge and Data Engineering, 22, 2010] Interest in learning vector-valued functions was particularly sparked by multitask learning, a framework which tries to learn multiple, possibly different tasks simultaneously.
Much of the initial research in multitask learning in the machine learning community was algorithmic in nature, and applied to methods such as neural networks, decision trees and k-nearest neighbors in the 1990s.
[Rich Caruana, "Multitask Learning," Machine Learning, 41–76, 1997] The use of probabilistic models and Gaussian processes was pioneered and largely developed in the context of geostatistics, where prediction over vector-valued output data is known as cokriging.
[J. Ver Hoef and R. Barry, "Constructing and fitting models for cokriging and multivariable spatial prediction," Journal of Statistical Planning and Inference, 69:275–294, 1998][P. Goovaerts, "Geostatistics for Natural Resources Evaluation," Oxford University Press, USA, 1997][N. Cressie, "Statistics for Spatial Data," John Wiley & Sons Inc. (Revised Edition), USA, 1993] Geostatistical approaches to multivariate modeling are mostly formulated around the linear model of coregionalization (LMC), a generative approach for developing valid covariance functions that has been used for multivariate regression and in statistics for computer emulation of expensive multivariate computer codes. The regularization and kernel theory literature for vector-valued functions followed in the 2000s.
[C.A. Micchelli and M. Pontil, "On learning vector-valued functions," Neural Computation, 17:177–204, 2005][C. Carmeli et al., "Vector valued reproducing kernel Hilbert spaces of integrable functions and Mercer theorem," Anal. Appl. (Singap.), 4(4):377–408, 2006] While the Bayesian and regularization perspectives were developed independently, they are in fact closely related.[Mauricio A. Álvarez, Lorenzo Rosasco, and Neil D. Lawrence, "Kernels for Vector-Valued Functions: A Review," Foundations and Trends in Machine Learning 4, no. 3 (2012): 195–266. doi: 10.1561/220000003. arXiv:1106.6251]
Notation
In this context, the supervised learning problem is to learn the function \mathbf{f} which best predicts vector-valued outputs \mathbf{y}_i given inputs (data) \mathbf{x}_i.
: \mathbf{f}(\mathbf{x}_i) = \mathbf{y}_i for i = 1, \ldots, N
: \mathbf{x}_i \in \mathcal{X}, an input space (e.g. \mathcal{X} = \mathbb{R}^p)
: \mathbf{y}_i \in \mathbb{R}^D
In general, each component of \mathbf{y}_i (y_{d,i}) could have different input data (\mathbf{x}_{d,i}) with different cardinality (N_d) and even different input spaces (\mathcal{X}_d).
Geostatistics literature calls this case ''heterotopic'', and uses ''isotopic'' to indicate that each component of the output vector has the same set of inputs.[Hans Wackernagel. Multivariate Geostatistics. Springer-Verlag, Heidelberg, New York, 2003.]
Here, for simplicity in the notation, we assume the number and sample space of the data for each output are the same.
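As a concrete illustration of the isotopic case assumed here, the data can be stored as one shared input array and one output array covering all components; the array names, shapes, and values below are purely illustrative:

    import numpy as np

    # Isotopic multi-output data: every output component shares the same inputs.
    N, p, D = 100, 3, 2                   # N samples, p input features, D outputs
    X = np.random.rand(N, p)              # shared inputs x_i in R^p
    Y = np.random.rand(N, D)              # vector-valued outputs y_i in R^D
    # In the heterotopic case each output d would instead carry its own pair
    # (X_d, y_d), possibly with a different number of samples N_d.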
Regularization perspective[C. Carmeli, E. De Vito, and A. Toigo. Vector valued reproducing kernel Hilbert spaces of integrable functions and Mercer theorem. Anal. Appl. (Singap.), 4(4):377–408, 2006.]
From the regularization perspective, the problem is to learn \mathbf{f}_* belonging to a reproducing kernel Hilbert space of vector-valued functions (\mathcal{H}). This is similar to the scalar case of Tikhonov regularization, with some extra care in the notation.
It is possible, though non-trivial, to show that a representer theorem also holds for Tikhonov regularization in the vector-valued setting.
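As a sketch of what this means (with \lambda > 0 a regularization parameter; the exact weighting of the data-fit term varies across references), the vector-valued Tikhonov problem and the finite expansion guaranteed by the representer theorem read:
: \min_{\mathbf{f} \in \mathcal{H}} \sum_{i=1}^N \left\| \mathbf{y}_i - \mathbf{f}(\mathbf{x}_i) \right\|_{\mathbb{R}^D}^2 + \lambda \left\| \mathbf{f} \right\|_{\mathcal{H}}^2, \qquad \mathbf{f}_*(\mathbf{x}) = \sum_{i=1}^N \mathbf{K}(\mathbf{x}, \mathbf{x}_i)\, \mathbf{c}_i, \quad \mathbf{c}_i \in \mathbb{R}^D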
Note that the matrix-valued kernel \mathbf{K} can also be defined by a scalar kernel R on the space \mathcal{X} \times \{1, \ldots, D\}. An isometry exists between the Hilbert spaces associated with these two kernels:
: (\mathbf{K}(\mathbf{x}, \mathbf{x}'))_{d, d'} = R((\mathbf{x}, d), (\mathbf{x}', d'))
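A small numerical check of this correspondence, using a separable matrix-valued kernel purely for illustration (the kernel choice, D = 2, and all variable names are assumptions, not part of the theory above): the Gram matrix of the scalar kernel R over pairs (\mathbf{x}, d) coincides with the block-partitioned Gram matrix of \mathbf{K}.

    import numpy as np

    # Matrix-valued kernel K(x, x') = k(x, x') * B, with a 1-D RBF k and D = 2 outputs.
    B = np.array([[1.0, 0.5], [0.5, 2.0]])
    k = lambda x, xp: np.exp(-0.5 * (x - xp) ** 2)

    # Equivalent scalar kernel on the augmented space X x {1, ..., D}.
    R = lambda x, d, xp, dp: k(x, xp) * B[d, dp]

    X = np.array([0.0, 0.7, 1.3])
    N, D = len(X), B.shape[0]

    # Block-partitioned Gram matrix of K (one N x N block per pair of outputs d, d').
    G_block = np.kron(B, np.array([[k(a, b) for b in X] for a in X]))

    # Gram matrix of R over all pairs (x_i, d), ordered output-major to match the blocks.
    pairs = [(x, d) for d in range(D) for x in X]
    G_scalar = np.array([[R(x, d, xp, dp) for (xp, dp) in pairs] for (x, d) in pairs])

    print(np.allclose(G_block, G_scalar))   # True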
Gaussian process perspective
The estimator of the vector-valued regularization framework can also be derived from a Bayesian viewpoint using Gaussian process methods in the case of a finite dimensional reproducing kernel Hilbert space. The derivation is similar to the scalar-valued case (see Bayesian interpretation of regularization). The vector-valued function \mathbf{f}, consisting of D outputs \{f_d\}_{d=1}^D, is assumed to follow a Gaussian process:
: \mathbf{f} \sim \mathcal{GP}(\mathbf{m}, \mathbf{K})
where \mathbf{m}: \mathcal{X} \to \mathbb{R}^D is now a vector of the mean functions \{m_d(\mathbf{x})\}_{d=1}^D for the outputs and \mathbf{K} is a positive definite matrix-valued function with entry (\mathbf{K}(\mathbf{x}, \mathbf{x}'))_{d,d'} corresponding to the covariance between the outputs f_d(\mathbf{x}) and f_{d'}(\mathbf{x}').
For a set of inputs \mathbf{X}, the prior distribution over the vector \mathbf{f}(\mathbf{X}) is given by \mathcal{N}(\mathbf{m}(\mathbf{X}), \mathbf{K}(\mathbf{X}, \mathbf{X})), where \mathbf{m}(\mathbf{X}) is a vector that concatenates the mean vectors associated to the outputs and \mathbf{K}(\mathbf{X}, \mathbf{X}) is a block-partitioned matrix. The distribution of the outputs is taken to be Gaussian:
: p(\mathbf{y} \mid \mathbf{f}, \mathbf{x}, \boldsymbol{\Sigma}) = \mathcal{N}(\mathbf{f}(\mathbf{x}), \boldsymbol{\Sigma})
where \boldsymbol{\Sigma} is a diagonal matrix with elements \{\sigma_d^2\}_{d=1}^D specifying the noise for each output. Using this form for the likelihood, the predictive distribution for a new vector \mathbf{f}_*(\mathbf{x}_*) is:
: p(\mathbf{f}_*(\mathbf{x}_*) \mid S, \mathbf{f}, \mathbf{x}_*, \boldsymbol{\phi}) = \mathcal{N}(\mathbf{f}_*(\mathbf{x}_*), \mathbf{K}_*(\mathbf{x}_*, \mathbf{x}_*))
where S is the training data, and \boldsymbol{\phi} is a set of hyperparameters for \mathbf{K}(\mathbf{x}, \mathbf{x}') and \boldsymbol{\Sigma}.
Equations for \mathbf{f}_* and \mathbf{K}_* can then be obtained:
: \mathbf{f}_*(\mathbf{x}_*) = \mathbf{K}_{\mathbf{x}_*}^\top \left( \mathbf{K}(\mathbf{X}, \mathbf{X}) + \boldsymbol{\Sigma} \right)^{-1} \bar{\mathbf{y}}
: \mathbf{K}_*(\mathbf{x}_*, \mathbf{x}_*) = \mathbf{K}(\mathbf{x}_*, \mathbf{x}_*) - \mathbf{K}_{\mathbf{x}_*}^\top \left( \mathbf{K}(\mathbf{X}, \mathbf{X}) + \boldsymbol{\Sigma} \right)^{-1} \mathbf{K}_{\mathbf{x}_*}
where \mathbf{K}_{\mathbf{x}_*} has entries (\mathbf{K}(\mathbf{x}_*, \mathbf{x}_j))_{d,d'} for j = 1, \ldots, N and d, d' = 1, \ldots, D, and \bar{\mathbf{y}} concatenates the training outputs. Note that the predictor \mathbf{f}_* is identical to the predictor derived in the regularization framework. For non-Gaussian likelihoods different methods such as Laplace approximation and variational methods are needed to approximate the estimators.
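A minimal numerical sketch of these predictive equations, assuming a separable kernel \mathbf{K}(\mathbf{x}, \mathbf{x}') = k(\mathbf{x}, \mathbf{x}') \mathbf{B} as described in the Example kernels section below; the RBF kernel, the noise level, and all function names are illustrative assumptions rather than a reference implementation.

    import numpy as np

    def rbf(X1, X2, lengthscale=1.0):
        # Scalar RBF kernel matrix between two sets of inputs.
        d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
        return np.exp(-0.5 * d2 / lengthscale**2)

    def predict(X, Y, Xs, B, noise=1e-2):
        # Posterior mean of the D outputs at test inputs Xs.
        # X: (N, p) training inputs, Y: (N, D) training outputs,
        # Xs: (M, p) test inputs, B: (D, D) symmetric PSD output matrix.
        N, D = Y.shape
        K = np.kron(B, rbf(X, X))            # (ND, ND) block-partitioned prior covariance
        Ks = np.kron(B, rbf(Xs, X))          # (MD, ND) cross-covariance with test inputs
        Sigma = noise * np.eye(N * D)        # diagonal output noise
        alpha = np.linalg.solve(K + Sigma, Y.T.reshape(-1))  # outputs stacked per task
        return (Ks @ alpha).reshape(D, -1).T                 # (M, D) predictive means

    # Example: two correlated sine curves over [0, 1].
    X = np.linspace(0, 1, 20)[:, None]
    Y = np.column_stack([np.sin(6 * X[:, 0]), np.sin(6 * X[:, 0] + 0.5)])
    B = np.array([[1.0, 0.9], [0.9, 1.0]])
    print(predict(X, Y, np.array([[0.5]]), B))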
Example kernels
Separable
A simple, but broadly applicable, class of multi-output kernels can be separated into the product of a kernel on the input-space and a kernel representing the correlations among the outputs:
: (\mathbf{K}(\mathbf{x}, \mathbf{x}'))_{d, d'} = k(\mathbf{x}, \mathbf{x}') \, k_T(d, d')
: k: scalar kernel on \mathcal{X} \times \mathcal{X}
: k_T: scalar kernel on \{1, \ldots, D\} \times \{1, \ldots, D\}
In matrix form: \mathbf{K}(\mathbf{x}, \mathbf{x}') = k(\mathbf{x}, \mathbf{x}') \mathbf{B}
where \mathbf{B} is a D \times D symmetric and positive semi-definite matrix. Note that setting \mathbf{B} to the identity matrix treats the outputs as unrelated and is equivalent to solving the scalar-output problems separately.
For a slightly more general form, adding several of these kernels yields a sum of separable kernels (SoS kernels), as illustrated in the sketch below.
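The following sketch builds the block Gram matrix of a sum of two separable kernels, \sum_q k_q(\mathbf{x}, \mathbf{x}') \mathbf{B}_q, using an RBF and a linear scalar kernel with arbitrarily chosen \mathbf{B}_q matrices (all names and values are hypothetical):

    import numpy as np

    X = np.random.rand(15, 2)                        # 15 inputs in R^2

    def rbf_gram(X, lengthscale=0.5):
        d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
        return np.exp(-0.5 * d2 / lengthscale**2)

    def linear_gram(X):
        return X @ X.T

    # Sum of separable kernels: K(x, x') = k_1(x, x') B_1 + k_2(x, x') B_2
    B1 = np.array([[1.0, 0.8], [0.8, 1.0]])
    B2 = np.array([[0.5, 0.0], [0.0, 1.5]])
    K = np.kron(B1, rbf_gram(X)) + np.kron(B2, linear_gram(X))
    print(K.shape)                                   # (30, 30), i.e. (D*N, D*N)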
From regularization literature[C.A. Micchelli and M. Pontil. On learning vector-valued functions. Neural Computation, 17:177–204, 2005.][C.A. Micchelli and M. Pontil. Kernels for multi-task learning. In Advances in Neural Information Processing Systems (NIPS). MIT Press, 2004.][T. Evgeniou, C.A. Micchelli, and M. Pontil. Learning multiple tasks with kernel methods. Journal of Machine Learning Research, 6:615–637, 2005.][L. Baldassarre, L. Rosasco, A. Barla, and A. Verri. Multi-output learning via spectral filtering. Technical report, Massachusetts Institute of Technology, 2011. MIT-CSAIL-TR-2011-004, CBCL-296.]
Derived from regularizer
One way of obtaining k_T is to specify a regularizer which limits the complexity of \mathbf{f} in a desirable way, and then derive the corresponding kernel. For certain regularizers, this kernel will turn out to be separable.
''Mixed-effect regularizer''
: R(\mathbf{f}) = A_\omega \left( C_\omega \sum_{l=1}^D \| f_l \|_k^2 + \omega D \sum_{l=1}^D \left\| f_l - \bar{f} \right\|_k^2 \right)
where:
* A_\omega = \frac{1}{2(1-\omega)(1-\omega+\omega D)}
* C_\omega = (2 - 2\omega + \omega D)
* \bar{f} = \frac{1}{D} \sum_{q=1}^D f_q
* \mathbf{K}_\omega(\mathbf{x}_1, \mathbf{x}_2) = k(\mathbf{x}_1, \mathbf{x}_2) \left( \omega \mathbf{1}_D + (1 - \omega) \mathbf{I}_D \right)
where \mathbf{1}_D is the D \times D matrix with all entries equal to 1.
This regularizer is a combination of limiting the complexity of each component of the estimator (f_l) and forcing each component of the estimator to be close to the mean of all the components. Setting \omega = 0 treats all the components as independent and is the same as solving the scalar problems separately. Setting \omega = 1 assumes all the components are explained by the same function.
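A one-line sketch of the output part of this kernel, assuming the parametrization \omega \mathbf{1}_D + (1 - \omega) \mathbf{I}_D given above (function and variable names are illustrative):

    import numpy as np

    # Mixed-effect output matrix: B = omega * ones + (1 - omega) * identity.
    def mixed_effect_B(D, omega):
        return omega * np.ones((D, D)) + (1.0 - omega) * np.eye(D)

    print(mixed_effect_B(3, 0.0))   # identity: outputs treated as independent
    print(mixed_effect_B(3, 1.0))   # all ones: every output shares one function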
''Cluster-based regularizer''
: R(\mathbf{f}) = \epsilon_1 \sum_{c=1}^r \sum_{l \in I(c)} \left\| f_l - \bar{f}_c \right\|_k^2 + \epsilon_2 \sum_{c=1}^r m_c \left\| \bar{f}_c \right\|_k^2
where:
* I(c) is the index set of components that belong to cluster c
* m_c is the cardinality of cluster c
* \bar{f}_c = \frac{1}{m_c} \sum_{q \in I(c)} f_q
* M_{l,q} = \frac{1}{m_c} if l and q both belong to cluster c (M_{l,q} = 0 otherwise)
* \mathbf{K}(\mathbf{x}_1, \mathbf{x}_2) = k(\mathbf{x}_1, \mathbf{x}_2) \mathbf{G}^\dagger
where \mathbf{G}_{l,q} = \epsilon_1 \delta_{l,q} + (\epsilon_2 - \epsilon_1) M_{l,q}
This regularizer divides the components into clusters and forces the components in each cluster to be similar.
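An illustrative construction of the corresponding output matrix from a cluster assignment, following the form of M and \mathbf{G} given above (the cluster labels and the weights eps1, eps2 are hypothetical values):

    import numpy as np

    def cluster_G(assignments, eps1, eps2):
        # M[l, q] = 1/m_c when components l and q share cluster c, 0 otherwise;
        # G = eps1 * I + (eps2 - eps1) * M as in the regularizer above.
        assignments = np.asarray(assignments)
        D = len(assignments)
        sizes = {c: np.sum(assignments == c) for c in np.unique(assignments)}
        M = np.zeros((D, D))
        for l in range(D):
            for q in range(D):
                if assignments[l] == assignments[q]:
                    M[l, q] = 1.0 / sizes[assignments[l]]
        return eps1 * np.eye(D) + (eps2 - eps1) * M

    G = cluster_G([0, 0, 1, 1, 1], eps1=1.0, eps2=0.1)
    B = np.linalg.pinv(G)            # output matrix used in K(x, x') = k(x, x') * B
    print(B.round(3))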
''Graph regularizer''
: R(\mathbf{f}) = \frac{1}{2} \sum_{l,q=1}^D \left\| f_l - f_q \right\|_k^2 M_{l,q} + \sum_{l=1}^D \left\| f_l \right\|_k^2 M_{l,l}
where \mathbf{M} is a D \times D matrix of weights encoding the similarities between the components
: \mathbf{K}(\mathbf{x}_1, \mathbf{x}_2) = k(\mathbf{x}_1, \mathbf{x}_2) \mathbf{L}^\dagger
where \mathbf{L} = \mathbf{D} - \mathbf{M}, with \mathbf{D}_{l,q} = \delta_{l,q} \left( \sum_{h=1}^D M_{l,h} + M_{l,q} \right)
Note that \mathbf{L} is the graph Laplacian. See also: graph kernel.
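A short sketch of this construction, assuming an illustrative weight matrix M with no self-loops (so the Laplacian reduces to the usual degree-minus-adjacency form):

    import numpy as np

    # Weights encoding similarities between D = 3 components (zero diagonal).
    M = np.array([[0.0, 1.0, 0.5],
                  [1.0, 0.0, 0.0],
                  [0.5, 0.0, 0.0]])
    L = np.diag(M.sum(axis=1)) - M      # graph Laplacian
    B = np.linalg.pinv(L)               # output matrix in K(x, x') = k(x, x') * B
    print(B.round(3))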
Learned from data
Several approaches to learning \mathbf{B} from data have been proposed. These include: performing a preliminary inference step to estimate \mathbf{B} from the training data, a proposal to learn \mathbf{B} and \mathbf{f} together based on the cluster regularizer,[Laurent Jacob, Francis Bach, and Jean-Philippe Vert. Clustered multi-task learning: A convex formulation. In NIPS 21, pages 745–752, 2008.] and sparsity-based approaches which assume only a few of the features are needed.[Andreas Argyriou, Theodoros Evgeniou, and Massimiliano Pontil. Convex multi-task feature learning. Machine Learning, 73(3):243–272, 2008.]
[Andreas Argyriou, Andreas Maurer, and Massimiliano Pontil. An algorithm for transfer learning in a heterogeneous environment. In ECML/PKDD (1), pages 71–85, 2008.]
From Bayesian literature
Linear model of coregionalization (LMC)
In LMC, outputs are expressed as linear combinations of independent random functions such that the resulting covariance function (over all inputs and outputs) is a valid positive semidefinite function. Assuming D outputs \{f_d(\mathbf{x})\}_{d=1}^D with \mathbf{x} \in \mathbb{R}^p, each f_d is expressed as:
: f_d(\mathbf{x}) = \sum_{q=1}^Q a_{d,q} u_q(\mathbf{x})
where a_{d,q} are scalar coefficients and the independent functions u_q(\mathbf{x}) have zero mean and covariance cov[u_q(\mathbf{x}), u_{q'}(\mathbf{x}')] = k_q(\mathbf{x}, \mathbf{x}') if q = q' and 0 otherwise. The cross covariance between any two functions f_d(\mathbf{x}) and f_{d'}(\mathbf{x}') can then be written as:
: \operatorname{cov}[f_d(\mathbf{x}), f_{d'}(\mathbf{x}')] = \sum_{q=1}^{Q} \sum_{i=1}^{R_q} a_{d,q}^{i} a_{d',q}^{i} k_q(\mathbf{x}, \mathbf{x}') = \sum_{q=1}^{Q} b_{d,d'}^{q} k_q(\mathbf{x}, \mathbf{x}')
where the functions u_q^{i}(\mathbf{x}), with q = 1, \ldots, Q and i = 1, \ldots, R_q, have zero mean and covariance cov[u_q^{i}(\mathbf{x}), u_{q'}^{i'}(\mathbf{x}')] = k_q(\mathbf{x}, \mathbf{x}') if i = i' and q = q'. But