Least-squares support-vector machines (LS-SVM) for statistics and in statistical modeling are least-squares versions of support-vector machines (SVM), which are a set of related supervised learning methods that analyze data and recognize patterns, and which are used for classification and regression analysis. In this version one finds the solution by solving a set of linear equations instead of the convex quadratic programming (QP) problem used for classical SVMs. Least-squares SVM classifiers were proposed by Johan Suykens and Joos Vandewalle. LS-SVMs are a class of kernel-based learning methods.
From support-vector machine to least-squares support-vector machine
Given a training set \{x_i, y_i\}_{i=1}^N with input data x_i \in \mathbb{R}^n and corresponding binary class labels y_i \in \{-1, +1\}, the SVM classifier, according to Vapnik's original formulation, satisfies the following conditions:

: w^T \varphi(x_i) + b \ge 1, \quad \text{if } y_i = +1,
: w^T \varphi(x_i) + b \le -1, \quad \text{if } y_i = -1,

which is equivalent to

: y_i \left[ w^T \varphi(x_i) + b \right] \ge 1, \quad i = 1, \ldots, N,

where \varphi(x) is the nonlinear map from the original space to the high- or infinite-dimensional feature space.
Inseparable data
In case such a separating hyperplane does not exist, we introduce so-called slack variables \xi_i such that

: y_i \left[ w^T \varphi(x_i) + b \right] \ge 1 - \xi_i, \quad \xi_i \ge 0, \quad i = 1, \ldots, N.
According to the structural risk minimization principle, the risk bound is minimized by the following minimization problem:

: \min J_1(w, \xi) = \frac{1}{2} w^T w + c \sum_{i=1}^N \xi_i,
: \text{subject to } y_i \left[ w^T \varphi(x_i) + b \right] \ge 1 - \xi_i, \quad \xi_i \ge 0, \quad i = 1, \ldots, N,

where c > 0 is a regularization constant.
To solve this problem, we could construct the Lagrangian function

: L_1(w, b, \xi; \alpha, \beta) = \frac{1}{2} w^T w + c \sum_{i=1}^N \xi_i - \sum_{i=1}^N \alpha_i \left\{ y_i \left[ w^T \varphi(x_i) + b \right] - 1 + \xi_i \right\} - \sum_{i=1}^N \beta_i \xi_i,

where \alpha_i \ge 0 and \beta_i \ge 0 (i = 1, \ldots, N) are the Lagrangian multipliers. The optimal point is a saddle point of the Lagrangian function, and then we obtain

: \frac{\partial L_1}{\partial w} = 0 \quad\rightarrow\quad w = \sum_{i=1}^N \alpha_i y_i \varphi(x_i),
: \frac{\partial L_1}{\partial b} = 0 \quad\rightarrow\quad \sum_{i=1}^N \alpha_i y_i = 0,
: \frac{\partial L_1}{\partial \xi_i} = 0 \quad\rightarrow\quad 0 \le \alpha_i \le c, \quad i = 1, \ldots, N.

By substituting w by its expression in the Lagrangian formed from the appropriate objective and constraints, we will get the following quadratic programming problem:
: \max_\alpha \; Q_1(\alpha) = -\frac{1}{2} \sum_{i,j=1}^N \alpha_i \alpha_j y_i y_j K(x_i, x_j) + \sum_{i=1}^N \alpha_i, \quad \text{subject to } \sum_{i=1}^N \alpha_i y_i = 0, \; 0 \le \alpha_i \le c,

where K(x_i, x_j) = \varphi(x_i)^T \varphi(x_j) is called the kernel function. Solving this QP problem subject to the constraints above, we will get the hyperplane in the high-dimensional space and hence the classifier in the original space.
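For illustration, the dual QP above can be handed to a generic convex QP solver. The following is a minimal sketch, not part of the original formulation: it assumes the cvxopt package, an RBF kernel and a made-up XOR data set, and it recovers the multipliers \alpha and the bias b from the Karush–Kuhn–Tucker conditions.

<syntaxhighlight lang="python">
# Sketch: classical SVM dual QP solved with cvxopt (illustrative only).
import numpy as np
from cvxopt import matrix, solvers

def rbf(X1, X2, sigma=1.0):
    # K(x, z) = exp(-||x - z||^2 / sigma^2)
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / sigma ** 2)

X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = np.array([-1., 1., 1., -1.])                      # XOR labels (toy data)
c, N = 10.0, len(y)

K = rbf(X, X)
P = matrix(np.outer(y, y) * K)                        # P_ij = y_i y_j K(x_i, x_j)
q = matrix(-np.ones(N))                               # maximizing Q_1 = minimizing -Q_1
G = matrix(np.vstack([-np.eye(N), np.eye(N)]))        # box constraints 0 <= alpha_i <= c
h = matrix(np.hstack([np.zeros(N), c * np.ones(N)]))
A = matrix(y.reshape(1, -1))                          # equality constraint sum_i alpha_i y_i = 0
b = matrix(0.0)

solvers.options['show_progress'] = False
alpha = np.array(solvers.qp(P, q, G, h, A, b)['x']).ravel()

sv = (alpha > 1e-6) & (alpha < c - 1e-6)              # unbounded support vectors
bias = np.mean(y[sv] - (alpha * y) @ K[:, sv])        # bias from the KKT conditions
decision = lambda Xnew: np.sign((alpha * y) @ rbf(X, Xnew) + bias)
</syntaxhighlight>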
Least-squares SVM formulation
The least-squares version of the SVM classifier is obtained by reformulating the minimization problem as

: \min J_2(w, b, e) = \frac{\mu}{2} w^T w + \frac{\zeta}{2} \sum_{i=1}^N e_{c,i}^2,

subject to the equality constraints

: y_i \left[ w^T \varphi(x_i) + b \right] = 1 - e_{c,i}, \quad i = 1, \ldots, N.

The least-squares SVM (LS-SVM) classifier formulation above implicitly corresponds to a regression interpretation with binary targets y_i = \pm 1.
Using y_i^2 = 1, we have

: \sum_{i=1}^N e_{c,i}^2 = \sum_{i=1}^N (y_i e_{c,i})^2 = \sum_{i=1}^N e_i^2 = \sum_{i=1}^N \left( y_i - (w^T \varphi(x_i) + b) \right)^2,

with e_i = y_i - (w^T \varphi(x_i) + b). Notice that this error would also make sense for least-squares data fitting, so the same end result holds for the regression case.
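This identity is easy to check numerically. The snippet below is a small illustrative sketch (the random targets and model outputs are hypothetical) confirming that the classification errors e_{c,i} and the regression errors e_i have identical squares whenever y_i \in \{-1, +1\}.

<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(0)
y = rng.choice([-1.0, 1.0], size=5)    # binary targets
f = rng.normal(size=5)                 # stand-in for w^T phi(x_i) + b

e_c = 1 - y * f                        # classification error: y_i [w^T phi(x_i) + b] = 1 - e_c,i
e = y - f                              # regression error: e_i = y_i - (w^T phi(x_i) + b)
assert np.allclose(e_c ** 2, e ** 2)   # holds because y_i ** 2 == 1
</syntaxhighlight>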
Hence the LS-SVM classifier formulation is equivalent to

: J_2(w, b) = \mu E_W + \zeta E_D,

with E_W = \frac{1}{2} w^T w and E_D = \frac{1}{2} \sum_{i=1}^N e_i^2 = \frac{1}{2} \sum_{i=1}^N \left( y_i - (w^T \varphi(x_i) + b) \right)^2.
Both \mu and \zeta should be considered as hyperparameters that tune the amount of regularization versus the sum squared error. The solution depends only on the ratio \gamma = \zeta / \mu; therefore, the original formulation uses only \gamma as a tuning parameter. We use both \mu and \zeta as parameters in order to provide a Bayesian interpretation of LS-SVM.
The solution of the LS-SVM regressor will be obtained after we construct the Lagrangian function. Since multiplying the objective J_2 by the positive constant 1/\mu does not change its minimizer, it is convenient to work with the equivalent objective E_W + \gamma E_D, where \gamma = \zeta / \mu:

: L_2(w, b, e; \alpha) = \frac{1}{2} w^T w + \frac{\gamma}{2} \sum_{i=1}^N e_i^2 - \sum_{i=1}^N \alpha_i \left\{ w^T \varphi(x_i) + b + e_i - y_i \right\},

where \alpha_i \in \mathbb{R} are the Lagrange multipliers. The conditions for optimality are

: \frac{\partial L_2}{\partial w} = 0 \quad\rightarrow\quad w = \sum_{i=1}^N \alpha_i \varphi(x_i),
: \frac{\partial L_2}{\partial b} = 0 \quad\rightarrow\quad \sum_{i=1}^N \alpha_i = 0,
: \frac{\partial L_2}{\partial e_i} = 0 \quad\rightarrow\quad \alpha_i = \gamma e_i, \quad i = 1, \ldots, N,
: \frac{\partial L_2}{\partial \alpha_i} = 0 \quad\rightarrow\quad y_i = w^T \varphi(x_i) + b + e_i, \quad i = 1, \ldots, N.
Elimination of w and e will yield a linear system instead of a quadratic programming problem:

: \begin{bmatrix} 0 & 1_N^T \\ 1_N & \Omega + \gamma^{-1} I_N \end{bmatrix} \begin{bmatrix} b \\ \alpha \end{bmatrix} = \begin{bmatrix} 0 \\ Y \end{bmatrix},

with Y = [y_1, \ldots, y_N]^T, 1_N = [1, \ldots, 1]^T and \alpha = [\alpha_1, \ldots, \alpha_N]^T. Here, I_N is an N \times N identity matrix, and \Omega \in \mathbb{R}^{N \times N} is the kernel matrix defined by \Omega_{ij} = \varphi(x_i)^T \varphi(x_j) = K(x_i, x_j).
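Since the whole solution reduces to this single linear system, an LS-SVM regressor fits in a few lines of linear algebra. The sketch below is an illustration under stated assumptions (an RBF kernel, arbitrary hyperparameter values and toy data), not the reference LS-SVMlab implementation; predictions follow from w = \sum_i \alpha_i \varphi(x_i) as y(x) = \sum_{i=1}^N \alpha_i K(x, x_i) + b.

<syntaxhighlight lang="python">
import numpy as np

def rbf(X1, X2, sigma=1.0):
    # Omega_ij = K(x_i, x_j) = exp(-||x_i - x_j||^2 / sigma^2)
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / sigma ** 2)

def lssvm_fit(X, y, gamma=10.0, sigma=1.0):
    """Solve [[0, 1^T], [1, Omega + I/gamma]] [b; alpha] = [0; Y]."""
    N = len(y)
    Omega = rbf(X, X, sigma)
    A = np.zeros((N + 1, N + 1))
    A[0, 1:] = 1.0                       # top row:     [0, 1_N^T]
    A[1:, 0] = 1.0                       # left column: 1_N
    A[1:, 1:] = Omega + np.eye(N) / gamma
    rhs = np.concatenate(([0.0], y))
    sol = np.linalg.solve(A, rhs)
    return sol[0], sol[1:]               # bias b, multipliers alpha

def lssvm_predict(X, alpha, b, Xnew, sigma=1.0):
    # y(x) = sum_i alpha_i K(x, x_i) + b
    return rbf(Xnew, X, sigma) @ alpha + b

# Toy usage (hypothetical data): fit a noisy sine curve.
rng = np.random.default_rng(1)
X = rng.uniform(-3, 3, size=(50, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=50)
b, alpha = lssvm_fit(X, y, gamma=10.0, sigma=1.0)
yhat = lssvm_predict(X, alpha, b, X)
</syntaxhighlight>

For classification, the same system is solved with binary targets y_i = \pm 1 and the predicted label is \operatorname{sign}(y(x)).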
Kernel function ''K''
For the kernel function ''K''(•, •) one typically has the following choices:
* Linear kernel: K(x, x_i) = x_i^T x,
* Polynomial kernel of degree d: K(x, x_i) = \left( 1 + x_i^T x / c \right)^d,
* Radial basis function (RBF) kernel: K(x, x_i) = \exp\left( -\|x - x_i\|^2 / \sigma^2 \right),
* MLP kernel: K(x, x_i) = \tanh\left( k \, x_i^T x + \theta \right),
where d, c, \sigma, k and \theta are constants. Notice that the Mercer condition holds for all c and \sigma values in the polynomial and RBF case, but not for all possible choices of k and \theta in the MLP case. The scale parameters c, \sigma and k determine the scaling of the inputs in the polynomial, RBF and MLP kernel function. This scaling is related to the bandwidth of the kernel in statistics, where it is shown that the bandwidth is an important parameter of the generalization behavior of a kernel method.
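As a quick illustration, the four kernels can be written down directly. The snippet below is a sketch; the default parameter values are arbitrary choices for the example, not recommendations.

<syntaxhighlight lang="python">
import numpy as np

def linear_kernel(x, xi):
    return xi @ x

def poly_kernel(x, xi, c=1.0, d=3):
    # (1 + x_i^T x / c)^d; the Mercer condition holds for all c > 0
    return (1.0 + xi @ x / c) ** d

def rbf_kernel(x, xi, sigma=1.0):
    # exp(-||x - x_i||^2 / sigma^2)
    return np.exp(-np.sum((x - xi) ** 2) / sigma ** 2)

def mlp_kernel(x, xi, k=1.0, theta=0.0):
    # tanh(k x_i^T x + theta); Mercer holds only for certain (k, theta)
    return np.tanh(k * (xi @ x) + theta)
</syntaxhighlight>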
Bayesian interpretation for LS-SVM
A Bayesian interpretation of the SVM has been proposed by Smola et al. They showed that the use of different kernels in SVM can be regarded as defining different prior probability distributions on the functional space, as P[f] \propto \exp\left( -\beta \|\hat{P} f\|^2 \right). Here \beta is a constant and \hat{P} is the regularization operator corresponding to the selected kernel.
A general Bayesian evidence framework was developed by MacKay,[MacKay, D. J. C. The evidence framework applied to classification networks. ''Neural Computation'', 4(5): 720–736, Sep. 1992.] who applied it to the problems of regression, feedforward neural networks and classification networks. Provided a data set D, a model M with parameter vector w and a so-called hyperparameter or regularization parameter \lambda, Bayesian inference is constructed with three levels of inference:
* In level 1, for a given value of \lambda, the first level of inference infers the posterior distribution of w by the Bayesian rule
:: p(w | D, \lambda, M) = \frac{p(D | w, M) \, p(w | \lambda, M)}{p(D | \lambda, M)}.
* The second level of inference determines the value of \lambda by maximizing
:: p(\lambda | D, M) = \frac{p(D | \lambda, M) \, p(\lambda | M)}{p(D | M)}.
* The third level of inference in the evidence framework ranks different models by examining their posterior probabilities
:: p(M | D) \propto p(D | M) \, p(M).
We can see that the Bayesian evidence framework is a unified theory for learning the model and for model selection. Kwok used the Bayesian evidence framework to interpret the formulation of SVM and model selection, and he also applied the Bayesian evidence framework to support vector regression.
Now, given the data points D = \{x_i, y_i\}_{i=1}^N and the hyperparameters \mu and \zeta of the model M, the model parameters w and b are estimated by maximizing the posterior p(w, b | D, \log\mu, \log\zeta, M). Applying Bayes' rule, we obtain

: p(w, b | D, \log\mu, \log\zeta, M) = \frac{p(D | w, b, \log\mu, \log\zeta, M) \, p(w, b | \log\mu, \log\zeta, M)}{p(D | \log\mu, \log\zeta, M)},

where p(D | \log\mu, \log\zeta, M) is a normalizing constant such that the integral over all possible w and b is equal to 1.
We assume w and b are independent of the hyperparameter \zeta and are conditionally independent, i.e., we assume

: p(w, b | \log\mu, \log\zeta, M) = p(w | \log\mu, M) \, p(b | \log\sigma_b, M).

When \sigma_b \rightarrow \infty, the distribution of b will approximate a uniform distribution. Furthermore, we assume w and b are Gaussian distributed, so we obtain the a priori distribution of w and b with \sigma_b \rightarrow \infty to be

: p(w, b | \log\mu) \propto \left( \frac{\mu}{2\pi} \right)^{n_f / 2} \exp\left( -\frac{\mu}{2} w^T w \right).

Here n_f is the dimensionality of the feature space, the same as the dimensionality of w.
The probability of D is assumed to depend only on w, b, \zeta and M. We assume that the data points are independently identically distributed (i.i.d.), so that:

: p(D | w, b, \log\zeta, M) = \prod_{i=1}^N p(x_i, y_i | w, b, \log\zeta, M).

In order to obtain the least-squares cost function, it is assumed that the probability of a data point is proportional to:

: p(x_i, y_i | w, b, \log\zeta, M) \propto p(e_i | w, b, \log\zeta, M).

A Gaussian distribution is taken for the errors e_i = y_i - (w^T \varphi(x_i) + b) as:

: p(e_i | w, b, \log\zeta, M) = \sqrt{\frac{\zeta}{2\pi}} \exp\left( -\frac{\zeta e_i^2}{2} \right).
It is assumed that w and b are determined in such a way that the class centers \hat{m}_- and \hat{m}_+ are mapped onto the targets -1 and +1, respectively. The projections w^T \varphi(x) + b of the class elements \varphi(x) follow a multivariate Gaussian distribution with variance 1/\zeta.
Combining the preceding expressions, and neglecting all constants, Bayes' rule becomes

: p(w, b | D, \log\mu, \log\zeta, M) \propto \exp\left( -\frac{\mu}{2} w^T w - \frac{\zeta}{2} \sum_{i=1}^N e_i^2 \right) = \exp\left( -J_2(w, b) \right).

The maximum posterior density estimates w_{MP} and b_{MP} are then obtained by minimizing the negative logarithm of this posterior, so we arrive back at the LS-SVM cost function J_2 given above.
References
Bibliography
* J. A. K. Suykens, T. Van Gestel, J. De Brabanter, B. De Moor, J. Vandewalle, Least Squares Support Vector Machines, World Scientific Pub. Co., Singapore, 2002.
* Suykens J. A. K., Vandewalle J., Least squares support vector machine classifiers, ''Neural Processing Letters'', vol. 9, no. 3, Jun. 1999, pp. 293–300.
* Vladimir Vapnik. ''The Nature of Statistical Learning Theory''. Springer-Verlag, 1995. {{ISBN|0-387-98780-0}}
* MacKay, D. J. C., Probable networks and plausible predictions—A review of practical Bayesian methods for supervised neural networks. ''Network: Computation in Neural Systems'', vol. 6, 1995, pp. 469–505.
External links
* www.esat.kuleuven.be/sista/lssvmlab/ "Least squares support vector machine Lab (LS-SVMlab) toolbox contains Matlab/C implementations for a number of LS-SVM algorithms".
* www.kernel-machines.org "Support Vector Machines and Kernel based methods (Smola & Schölkopf)".
* www.gaussianprocess.org "Gaussian Processes: Data modeling using Gaussian Process priors over functions for regression and classification (MacKay, Williams)".
* www.support-vector.net "Support Vector Machines and kernel based methods (Cristianini)".