Multiple kernel learning refers to a set of machine learning methods that use a predefined set of kernels and learn an optimal linear or non-linear combination of kernels as part of the algorithm. Reasons to use multiple kernel learning include (a) the ability to select an optimal kernel and parameters from a larger set of kernels, which reduces bias due to kernel selection and allows for more automated machine learning methods, and (b) the ability to combine data from different sources (e.g. sound and images from a video) that have different notions of similarity and thus require different kernels. Instead of creating a new kernel, multiple kernel algorithms can be used to combine kernels already established for each individual data source. Multiple kernel learning approaches have been used in many applications, such as event recognition in video, object recognition in images, and biomedical data fusion.

Algorithms

Multiple kernel learning algorithms have been developed for supervised, semi-supervised, and unsupervised learning. Most work has addressed the supervised learning case with linear combinations of kernels, although many other algorithms have been developed. The basic idea behind multiple kernel learning algorithms is to add an extra parameter to the minimization problem of the learning algorithm. As an example, consider the case of supervised learning of a linear combination of a set of $n$ kernels $K$. We introduce a new kernel $K'=\sum_{i=1}^n\beta_iK_i$, where $\beta$ is a vector of coefficients for each kernel. Because the kernels are additive (due to properties of reproducing kernel Hilbert spaces), this new function is still a kernel when the coefficients are nonnegative. For a set of data $X$ with labels $Y$, the minimization problem can then be written as

$$\min_{\beta,c}\Epsilon(Y, K'c)+R(K,c)$$

where $\Epsilon$ is an error function and $R$ is a regularization term. $\Epsilon$ is typically the square loss function (Tikhonov regularization) or the hinge loss function (for SVM algorithms), and $R$ is usually an $\ell_n$ norm or some combination of the norms (e.g. elastic net regularization). This optimization problem can then be solved by standard optimization methods. Adaptations of existing techniques such as Sequential Minimal Optimization have also been developed for multiple kernel SVM-based methods (Francis R. Bach, Gert R. G. Lanckriet, and Michael I. Jordan, 2004, "Multiple kernel learning, conic duality, and the SMO algorithm", in Proceedings of the Twenty-First International Conference on Machine Learning (ICML '04), ACM, New York, NY, USA).
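The formulation above can be illustrated with a minimal sketch. It assumes the weights $\beta$ are held fixed, the error function is the square loss, and the regularizer is the ridge penalty $\lambda c^T K' c$; under those assumptions only $c$ is solved for. The function names, the choice of base kernels, and the parameter `lam` are illustrative and not taken from the source. A full multiple kernel learning solver would also optimize $\beta$.

```python
import numpy as np

def linear_kernel(X, Z):
    """Linear kernel matrix: inner products between rows of X and rows of Z."""
    return X @ Z.T

def rbf_kernel(X, Z, gamma=1.0):
    """Gaussian (RBF) kernel matrix between rows of X and rows of Z."""
    sq_dists = (
        np.sum(X**2, axis=1)[:, None]
        + np.sum(Z**2, axis=1)[None, :]
        - 2.0 * X @ Z.T
    )
    return np.exp(-gamma * np.maximum(sq_dists, 0.0))

def combine_kernels(kernels, beta):
    """Return K' = sum_i beta_i * K_i for nonnegative weights beta."""
    beta = np.asarray(beta, dtype=float)
    if len(beta) != len(kernels) or np.any(beta < 0):
        raise ValueError("need one nonnegative weight per kernel")
    return sum(b * K for b, K in zip(beta, kernels))

def fit_kernel_ridge(K, y, lam=1e-2):
    """Minimize ||y - K c||^2 + lam * c^T K c; one minimizer is (K + lam I)^-1 y."""
    return np.linalg.solve(K + lam * np.eye(K.shape[0]), y)

# Toy data (illustrative only).
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
y = np.sin(X[:, 0])

kernels = [linear_kernel(X, X), rbf_kernel(X, X, gamma=0.5)]
K_comb = combine_kernels(kernels, beta=[0.3, 0.7])
c = fit_kernel_ridge(K_comb, y, lam=1e-2)
y_fit = K_comb @ c  # fitted values on the training points
```

Keeping $\beta$ fixed reduces the problem to ordinary kernel ridge regression on the combined kernel, which is why a single linear solve suffices here.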

Supervised learning

For supervised learning, there are many other algorithms that use different methods to learn the form of the kernel. The following categorization has been proposed by Gönen and Alpaydın (Mehmet Gönen and Ethem Alpaydın, "Multiple Kernel Learning Algorithms", Journal of Machine Learning Research 12(Jul):2211-2268, 2011).

Fixed rules approaches

Fixed rules approaches, such as the linear combination algorithm described above, use rules to set the combination of the kernels. These do not require parameterization and use rules like summation and multiplication to combine the kernels. The weighting is learned in the algorithm. Other examples of fixed rules include pairwise kernels, which are of the form

$$k((x_{1i}, x_{1j}),(x_{2i},x_{2j}))=k(x_{1i},x_{2i})k(x_{1j},x_{2j})+k(x_{1i},x_{2j})k(x_{1j},x_{2i}).$$

These pairwise approaches have been used in predicting protein-protein interactions.
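As a small sketch of a pairwise kernel, assuming the symmetric form in which a base kernel $k$ is evaluated on both matchings of the two pairs and the products are summed, the following computes the kernel between two pairs of objects. The helper names `pairwise_kernel` and `rbf` and the toy vectors are illustrative, not from the source.

```python
import numpy as np

def rbf(x, z, gamma=0.5):
    """Gaussian base kernel between two vectors."""
    d = np.asarray(x, dtype=float) - np.asarray(z, dtype=float)
    return float(np.exp(-gamma * (d @ d)))

def pairwise_kernel(k, pair_a, pair_b):
    """Kernel between two pairs: k(a1,b1)k(a2,b2) + k(a1,b2)k(a2,b1)."""
    (a1, a2), (b1, b2) = pair_a, pair_b
    return k(a1, b1) * k(a2, b2) + k(a1, b2) * k(a2, b1)

# Two pairs of toy feature vectors (illustrative only).
p = (np.array([0.0, 1.0]), np.array([1.0, 0.0]))
q = (np.array([0.5, 0.5]), np.array([1.0, 1.0]))

value = pairwise_kernel(rbf, p, q)
swapped = pairwise_kernel(rbf, (p[1], p[0]), q)  # same pair, members reordered
```

Summing over both matchings makes the result independent of the order of the two members within a pair, which suits unordered pairs such as interacting proteins.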

Heuristic approaches

These algorithms use a combination function that is parameterized. The parameters are generally defined for each individual kernel based on single-kernel performance or some computation from the kernel matrix. Examples of these include the kernel from Tanabe et al. (2008). Letting $\pi_m$ be the accuracy obtained using only $K_m$, and letting $\delta$ be a threshold less than the minimum of the single-kernel accuracies, we can define

$$\beta_m=\frac{\pi_m-\delta}{\sum_{h=1}^n(\pi_h-\delta)}$$
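A minimal sketch of this accuracy-based heuristic, assuming the weights are the threshold-shifted accuracies normalized to sum to one, $\beta_m = (\pi_m-\delta)/\sum_h(\pi_h-\delta)$. The function name `accuracy_weights` and the example accuracies are hypothetical.

```python
import numpy as np

def accuracy_weights(accuracies, delta):
    """Kernel weights from single-kernel accuracies pi_m and a threshold delta."""
    pi = np.asarray(accuracies, dtype=float)
    if delta >= pi.min():
        raise ValueError("delta must be below the smallest single-kernel accuracy")
    shifted = pi - delta
    return shifted / shifted.sum()

# Three kernels with single-kernel accuracies 0.70, 0.80 and 0.90.
beta = accuracy_weights([0.70, 0.80, 0.90], delta=0.60)
# beta is approximately [0.167, 0.333, 0.5]
```

Raising $\delta$ toward the smallest accuracy pushes the weight of the weakest kernel toward zero, so the threshold controls how sharply the heuristic separates good kernels from poor ones.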

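Other approaches use a definition of kernel similarity, such as

$$A(K_1,K_2)=\frac{\langle K_1,K_2\rangle}{\sqrt{\langle K_1,K_1\rangle\langle K_2,K_2\rangle}}$$

Using this measure, Qiu and Lane (2009) used the following heuristic to define the kernel weights:

$$\beta_m=\frac{A(K_m, YY^T)}{\sum_{h=1}^n A(K_h, YY^T)}$$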
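The alignment-based heuristic can be sketched as follows, assuming $\langle\cdot,\cdot\rangle$ is the Frobenius inner product between kernel matrices, the target kernel is $YY^T$ for labels in $\{-1,+1\}$, and each weight is the alignment with the target normalized over all kernels. The function names and the toy kernels are illustrative.

```python
import numpy as np

def alignment(K1, K2):
    """Kernel alignment: Frobenius inner product normalized by both norms."""
    return np.sum(K1 * K2) / np.sqrt(np.sum(K1 * K1) * np.sum(K2 * K2))

def alignment_weights(kernels, y):
    """Weight each kernel by its alignment with the target kernel y y^T."""
    target = np.outer(y, y)
    scores = np.array([alignment(K, target) for K in kernels])
    return scores / scores.sum()

y = np.array([1.0, 1.0, -1.0, -1.0])
kernels = [
    np.outer(y, y),  # perfectly aligned with the labels (alignment 1)
    np.eye(4),       # identity kernel, carries no label structure (alignment 0.5)
]
beta = alignment_weights(kernels, y)
# beta is approximately [0.667, 0.333]
```

For positive semidefinite kernels the alignment with $YY^T$ is nonnegative, so the resulting weights are nonnegative as well.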

Optimization approaches

These approaches solve an optimization problem to determine parameters for the kernel combination function. This has been done with similarity measures and structural risk minimization approaches. For similarity measures such as the one defined above, the problem can be formulated as follows:

$$\max_{\beta,\operatorname{tr}(K'_{tra})=1,K'\ge 0} A(K'_{tra},YY^T),$$

where $K'_{tra}$ is the kernel of the training set.

Structural risk minimization approaches that have been used include linear approaches, such as that used by Lanckriet et al. (2002). We can define the implausibility of a kernel $\omega(K)$ to be the value of the objective function after solving a canonical SVM problem. We can then solve the following minimization problem:

$$\min_{\operatorname{tr}(K'_{tra})=c}\omega(K'_{tra})$$

where $c$ is a positive constant. Many other variations exist on the same idea, with different methods of refining and solving the problem, e.g. with nonnegative weights for individual kernels and using non-linear combinations of kernels.

Bayesian approaches

Bayesian approaches put priors on the kernel parameters and learn the parameter values from the priors and the base algorithm. For example, the decision function can be written as

$$f(x)=\sum^n_{i=0}\alpha_i\sum^p_{m=1}\eta_mK_m(x_i^m,x^m)$$

Here, $\eta$ can be modeled with a Dirichlet prior and $\alpha$ can be modeled with a zero-mean Gaussian and an inverse gamma variance prior. This model is then optimized using a customized multinomial probit approach with a Gibbs sampler. These methods have been used successfully in applications such as protein fold recognition and protein homology problems.

Boosting approaches

Boosting approaches add new kernels iteratively until some stopping criterion that is a function of performance is reached. An example of this is the MARK model developed by Bennett et al. (2002):

$$f(x)=\sum_{i=1}^N\sum_{m=1}^P\alpha_i^mK_m(x_i^m,x^m)+b$$

The parameters $\alpha_i^m$ and $b$ are learned by gradient descent on a coordinate basis. In this way, each iteration of the descent algorithm identifies the best kernel column to add to the combined kernel. The model is then rerun to generate the optimal weights $\alpha_i$ and $b$.
Semisupervised learning

Semisupervised learning approaches to multiple kernel learning are similar to other extensions of supervised learning approaches. An inductive procedure has been developed that uses a log-likelihood empirical loss and group LASSO regularization with conditional expectation consensus on unlabeled data for image categorization. We can define the problem as follows. Let $L=\{(x_i,y_i)\}$ be the labeled data, and let $U=\{x_i\}$ be the set of unlabeled data. Then, we can write the decision function as follows:

$$f(x)=\alpha_0+\sum_{i=1}^{|L|}\alpha_iK_i(x)$$

The problem can be written as

$$\min_f L(f) + \lambda R(f)+\gamma\Theta(f)$$

where $L$ is the loss function (weighted negative log-likelihood in this case), $R$ is the regularization term (group LASSO in this case), and $\Theta$ is the conditional expectation consensus (CEC) penalty on unlabeled data. The CEC penalty is defined as follows. Let the marginal kernel density (MKD) for all the data be

$$g^{\pi}_m(x)=\langle\phi^{\pi}_m,\psi_m(x)\rangle$$

where $\psi_m(x)=[K_m(x_1,x),\ldots,K_m(x_L,x)]^T$ (the kernel distance between the labeled data and all of the labeled and unlabeled data) and $\phi^{\pi}_m$ is a non-negative random vector with a 2-norm of 1. The value of $\Pi$ is the number of times each kernel is projected. Expectation regularization is then performed on the MKD, resulting in a reference expectation $q^{\pi}_m(y \mid g^{\pi}_m(x))$ and model expectation $p^{\pi}_m(f(x) \mid g^{\pi}_m(x))$. Then, we define

$$\Theta=\frac{1}{\Pi} \sum^{\Pi}_{\pi=1}\sum^{M}_{m=1} D(q^{\pi}_m(y \mid g^{\pi}_m(x)) \,\|\, p^{\pi}_m(f(x) \mid g^{\pi}_m(x)))$$

where $D(Q \,\|\, P)=\sum_iQ(i)\ln\frac{Q(i)}{P(i)}$ is the Kullback-Leibler divergence. The combined minimization problem is optimized using a modified block gradient descent algorithm. For more information, see Wang et al.
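The Kullback-Leibler term is simple to compute once the reference and model expectations are available as discrete distributions. The sketch below assumes they have already been estimated and stored as arrays of shape (number of projections, number of kernels, number of classes). That layout, and the names `kl_divergence` and `cec_penalty`, are illustrative assumptions and not the procedure of Wang et al.

```python
import numpy as np

def kl_divergence(q, p):
    """D(Q || P) = sum_i Q(i) ln(Q(i) / P(i)) for discrete distributions."""
    q = np.asarray(q, dtype=float)
    p = np.asarray(p, dtype=float)
    support = q > 0                  # terms with Q(i) = 0 contribute nothing
    if np.any(p[support] == 0):
        return float("inf")          # Q puts mass where P has none
    return float(np.sum(q[support] * np.log(q[support] / p[support])))

def cec_penalty(q, p):
    """Sum the divergences over projections and kernels, then divide by
    the number of projections."""
    q = np.asarray(q, dtype=float)
    p = np.asarray(p, dtype=float)
    n_proj, n_kernels, _ = q.shape
    total = sum(
        kl_divergence(q[i, m], p[i, m])
        for i in range(n_proj)
        for m in range(n_kernels)
    )
    return total / n_proj

# One projection, two kernels, two classes (toy numbers).
q = np.array([[[0.5, 0.5], [0.5, 0.5]]])
p = np.array([[[0.5, 0.5], [0.25, 0.75]]])
theta = cec_penalty(q, p)
```

The penalty is zero exactly when the model expectation matches the reference expectation for every kernel and projection, and grows as they disagree.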

Unsupervised learning

Unsupervised multiple kernel learning algorithms have also been proposed by Zhuang et al. The problem is defined as follows. Let $U=\{x_i\}$ be a set of unlabeled data. The kernel definition is the linear combined kernel $K'=\sum_{m=1}^M\beta_mK_m$. In this problem, the data needs to be "clustered" into groups based on the kernel distances. Let $B_i$ be a group or cluster of which $x_i$ is a member. We define the loss function as $\sum^n_{i=1}\left\Vert x_i - \sum_{x_j\in B_i} K(x_i,x_j)x_j\right\Vert^2$. Furthermore, we minimize the distortion by minimizing $\sum_{i=1}^n\sum_{x_j\in B_i}K(x_i,x_j)\left\Vert x_i - x_j \right\Vert^2$. Finally, we add a regularization term to avoid overfitting. Combining these terms, we can write the minimization problem as follows:

$$\min_{\beta,B}\sum^n_{i=1}\left\Vert x_i - \sum_{x_j\in B_i} K(x_i,x_j)x_j\right\Vert^2 + \gamma_1\sum_{i=1}^n\sum_{x_j\in B_i}K(x_i,x_j)\left\Vert x_i - x_j \right\Vert^2 + \gamma_2\sum_i |B_i|$$

One formulation of this is defined as follows. Let $D\in \{0,1\}^{n\times n}$ be a matrix such that $D_{ij}=1$ means that $x_i$ and $x_j$ are neighbors. Then, $B_i=\{x_j:D_{ij}=1\}$. Note that these groups must be learned as well. Zhuang et al. solve this problem by an alternating minimization method for $K$ and the groups $B_i$. For more information, see Zhuang et al.
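As a sketch of how the three terms of this objective fit together, the function below evaluates the reconstruction loss, the distortion, and the group-size penalty for a given kernel matrix and a given 0/1 neighbor matrix $D$ with $B_i=\{x_j : D_{ij}=1\}$. It only evaluates the objective and does not perform the alternating minimization. The name `umkl_objective` and the toy inputs are hypothetical.

```python
import numpy as np

def umkl_objective(X, K, D, gamma1, gamma2):
    """Evaluate the unsupervised MKL objective for fixed K and groups.

    X: (n, d) data matrix, one point per row
    K: (n, n) kernel matrix
    D: (n, n) 0/1 matrix, D[i, j] = 1 if x_j belongs to the group B_i
    """
    X = np.asarray(X, dtype=float)
    W = np.asarray(K, dtype=float) * np.asarray(D, dtype=float)  # kernel values on neighbors only
    # Reconstruction loss: sum_i || x_i - sum_{j in B_i} K_ij x_j ||^2
    loss = np.sum((X - W @ X) ** 2)
    # Distortion: sum_i sum_{j in B_i} K_ij || x_i - x_j ||^2
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=2)
    distortion = np.sum(W * sq_dists)
    # Group-size penalty: sum_i |B_i|
    size_penalty = np.sum(D)
    return float(loss + gamma1 * distortion + gamma2 * size_penalty)

# Two one-dimensional points that are each other's only neighbor.
X = np.array([[0.0], [1.0]])
K = np.array([[1.0, 0.5], [0.5, 1.0]])
D = np.array([[0, 1], [1, 0]])
value = umkl_objective(X, K, D, gamma1=2.0, gamma2=0.1)
# loss = 1.25, distortion = 1.0, size penalty = 2, so value = 3.45
```

An alternating scheme would call a function like this while switching between updating the kernel weights with the groups fixed and updating the groups with the kernel fixed.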

Libraries

Available MKL libraries include:
SPG-GMKL: A scalable C++ MKL SVM library that can handle a million kernels.
GMKL: Generalized Multiple Kernel Learning code in MATLAB; performs $\ell_1$ and $\ell_2$ regularization for supervised learning.
(Another) GMKL: A different MATLAB MKL code that can also perform elastic net regularization.
C++ source code for a Sequential Minimal Optimization MKL algorithm; performs $p$-norm regularization.
A MATLAB code based on the SimpleMKL algorithm for MKL SVM.
MKLPy: A scikit-compliant Python framework for MKL and kernel machines with different algorithms, e.g. EasyMKL (Fabio Aiolli and Michele Donini, "EasyMKL: a scalable multiple kernel learning algorithm", Neurocomputing, 169, pp. 215-224) and others.
