machine learning Machine learning (ML) is a field of inquiry devoted to understanding and building methods that 'learn', that is, methods that leverage data to improve performance on some set of tasks. It is seen as a part of artificial intelligence. Machine ...

, support vector machines (SVMs, also support vector networks) are

supervised learning Supervised learning (SL) is a machine learning paradigm for problems where the available data consists of labelled examples, meaning that each data point contains features (covariates) and an associated label. The goal of supervised learning alg ...

models with associated learning

algorithm In mathematics and computer science, an algorithm () is a finite sequence of rigorous instructions, typically used to solve a class of specific problems or to perform a computation. Algorithms are used as specifications for performing ...

s that analyze data for classification and

regression analysis In statistical modeling, regression analysis is a set of statistical processes for estimating the relationships between a dependent variable (often called the 'outcome' or 'response' variable, or a 'label' in machine learning parlance) and one ...

. Developed at AT&T Bell Laboratories by

Vladimir Vapnik Vladimir Naumovich Vapnik (russian: Владимир Наумович Вапник; born 6 December 1936) is one of the main developers of the Vapnik–Chervonenkis theory of statistical learning, and the co-inventor of the support-vector machin ...

with colleagues (Boser et al., 1992, Guyon et al., 1993,

Cortes Cortes, Cortés, Cortês, Corts, or Cortès may refer to: People * Cortes (surname), including a list of people with the name ** Hernán Cortés (1485–1547), a Spanish conquistador Places * Cortes, Navarre, a village in the South border of ...

and

Vapnik Vladimir Naumovich Vapnik (russian: Владимир Наумович Вапник; born 6 December 1936) is one of the main developers of the Vapnik–Chervonenkis theory of statistical learning, and the co-inventor of the support-vector machin ...

, 1995, Vapnik et al., 1997) SVMs are one of the most robust prediction methods, being based on statistical learning frameworks or

VC theory VC may refer to: Military decorations * Victoria Cross, a military decoration awarded by the United Kingdom and also by certain Commonwealth nations ** Victoria Cross for Australia ** Victoria Cross (Canada) ** Victoria Cross for New Zealand * Vic ...

proposed by Vapnik (1982, 1995) and Chervonenkis (1974). Given a set of training examples, each marked as belonging to one of two categories, an SVM training algorithm builds a model that assigns new examples to one category or the other, making it a non- probabilistic

binary Binary may refer to: Science and technology Mathematics * Binary number, a representation of numbers using only two digits (0 and 1) * Binary function, a function that takes two arguments * Binary operation, a mathematical operation that ta ...

linear classifier In the field of machine learning, the goal of statistical classification is to use an object's characteristics to identify which class (or group) it belongs to. A linear classifier achieves this by making a classification decision based on the val ...

(although methods such as

Platt scaling In machine learning, Platt scaling or Platt calibration is a way of transforming the outputs of a classification model into a probability distribution over classes. The method was invented by John Platt in the context of support vector machine ...

exist to use SVM in a probabilistic classification setting). SVM maps training examples to points in space so as to maximise the width of the gap between the two categories. New examples are then mapped into that same space and predicted to belong to a category based on which side of the gap they fall. In addition to performing linear classification, SVMs can efficiently perform a non-linear classification using what is called the

kernel trick In machine learning, kernel machines are a class of algorithms for pattern analysis, whose best known member is the support-vector machine (SVM). The general task of pattern analysis is to find and study general types of relations (for example ...

, implicitly mapping their inputs into high-dimensional feature spaces. When data are unlabelled, supervised learning is not possible, and an

unsupervised learning Unsupervised learning is a type of algorithm that learns patterns from untagged data. The hope is that through mimicry, which is an important mode of learning in people, the machine is forced to build a concise representation of its world and t ...

approach is required, which attempts to find natural clustering of the data to groups, and then map new data to these formed groups. The support vector clustering algorithm, created by

Hava Siegelmann Hava Siegelmann is a professor of computer science. Her academic position is in the school of Computer Science and the Program of Neuroscience and Behavior at the University of Massachusetts Amherst; she is the director of the school's Biologica ...

and

, applies the statistics of support vectors, developed in the support vector machines algorithm, to categorize unlabeled data.

Motivation

Classifying data is a common task in

. Suppose some given data points each belong to one of two classes, and the goal is to decide which class a ''new''

data point In statistics, a unit of observation is the unit described by the data that one analyzes. A study may treat groups as a unit of observation with a country as the unit of analysis, drawing conclusions on group characteristics from data collected at ...

will be in. In the case of support vector machines, a data point is viewed as a

p

-dimensional vector (a list of

p

numbers), and we want to know whether we can separate such points with a

(p-1)

-dimensional hyperplane. This is called a

. There are many hyperplanes that might classify the data. One reasonable choice as the best hyperplane is the one that represents the largest separation, or

margin Margin may refer to: Physical or graphical edges *Margin (typography), the white space that surrounds the content of a page *Continental margin, the zone of the ocean floor that separates the thin oceanic crust from thick continental crust *Leaf ...

, between the two classes. So we choose the hyperplane so that the distance from it to the nearest data point on each side is maximized. If such a hyperplane exists, it is known as the ''

maximum-margin hyperplane In geometry, the hyperplane separation theorem is a theorem about disjoint convex sets in ''n''-dimensional Euclidean space. There are several rather similar versions. In one version of the theorem, if both these sets are closed and at least one ...

'' and the linear classifier it defines is known as a ''maximum-

margin classifier In machine learning, a margin classifier is a classifier which is able to give an associated distance from the decision boundary for each example. For instance, if a linear classifier (e.g. perceptron or linear discriminant analysis) is used, the ...

''; or equivalently, the '' perceptron of optimal stability''. More formally, a support vector machine constructs a hyperplane or set of hyperplanes in a high or infinite-dimensional space, which can be used for classification, regression, or other tasks like outliers detection. Intuitively, a good separation is achieved by the hyperplane that has the largest distance to the nearest training-data point of any class (so-called functional margin), since in general the larger the margin, the lower the generalization error of the classifier. Kernel_Machine

Whereas the original problem may be stated in a finite-dimensional space, it often happens that the sets to discriminate are not

linearly separable In Euclidean geometry, linear separability is a property of two sets of point (geometry), points. This is most easily visualized in two dimensions (the Euclidean plane) by thinking of one set of points as being colored blue and the other set of poi ...

in that space. For this reason, it was proposed that the original finite-dimensional space be mapped into a much higher-dimensional space, presumably making the separation easier in that space. To keep the computational load reasonable, the mappings used by SVM schemes are designed to ensure that

dot product In mathematics, the dot product or scalar productThe term ''scalar product'' means literally "product with a scalar as a result". It is also used sometimes for other symmetric bilinear forms, for example in a pseudo-Euclidean space. is an alge ...

s of pairs of input data vectors may be computed easily in terms of the variables in the original space, by defining them in terms of a

kernel function In operator theory, a branch of mathematics, a positive-definite kernel is a generalization of a positive-definite function or a positive-definite matrix. It was first introduced by James Mercer in the early 20th century, in the context of solving ...

k(x, y)

selected to suit the problem. The hyperplanes in the higher-dimensional space are defined as the set of points whose dot product with a vector in that space is constant, where such a set of vectors is an orthogonal (and thus minimal) set of vectors that defines a hyperplane. The vectors defining the hyperplanes can be chosen to be linear combinations with parameters

\alpha_i

of images of

feature vector In machine learning and pattern recognition, a feature is an individual measurable property or characteristic of a phenomenon. Choosing informative, discriminating and independent features is a crucial element of effective algorithms in pattern r ...

x_i

that occur in the data base. With this choice of a hyperplane, the points

x

in the

feature space In machine learning and pattern recognition, a feature is an individual measurable property or characteristic of a phenomenon. Choosing informative, discriminating and independent features is a crucial element of effective algorithms in pattern r ...

that are mapped into the hyperplane are defined by the relation

\textstyle\sum_i \alpha_i k(x_i, x) = \text.

Note that if

k(x, y)

becomes small as

y

grows further away from

x

, each term in the sum measures the degree of closeness of the test point

x

to the corresponding data base point

x_i

. In this way, the sum of kernels above can be used to measure the relative nearness of each test point to the data points originating in one or the other of the sets to be discriminated. Note the fact that the set of points

x

mapped into any hyperplane can be quite convoluted as a result, allowing much more complex discrimination between sets that are not convex at all in the original space.

Applications

SVMs can be used to solve various real-world problems: * SVMs are helpful in text and hypertext categorization, as their application can significantly reduce the need for labeled training instances in both the standard inductive and transductive settings. Some methods for

shallow semantic parsing In natural language processing, semantic role labeling (also called Semantic parsing, shallow semantic parsing or slot-filling) is the process that assigns labels to words or phrases in a sentence that indicates their semantic role in the sentence, ...

are based on support vector machines. * Classification of images can also be performed using SVMs. Experimental results show that SVMs achieve significantly higher search accuracy than traditional query refinement schemes after just three to four rounds of relevance feedback. This is also true for

image segmentation In digital image processing and computer vision, image segmentation is the process of partitioning a digital image into multiple image segments, also known as image regions or image objects ( sets of pixels). The goal of segmentation is to simpli ...

systems, including those using a modified version SVM that uses the privileged approach as suggested by Vapnik. * Classification of satellite data like SAR data using supervised SVM. * Hand-written characters can be recognized using SVM. * The SVM algorithm has been widely applied in the biological and other sciences. They have been used to classify proteins with up to 90% of the compounds classified correctly.

Permutation test A permutation test (also called re-randomization test) is an exact statistical hypothesis test making use of the proof by contradiction. A permutation test involves two or more samples. The null hypothesis is that all samples come from the same di ...

s based on SVM weights have been suggested as a mechanism for interpretation of SVM models. Support vector machine weights have also been used to interpret SVM models in the past. Posthoc interpretation of support vector machine models in order to identify features used by the model to make predictions is a relatively new area of research with special significance in the biological sciences.

History

The original SVM algorithm was invented by Vladimir N. Vapnik and Alexey Ya. Chervonenkis in 1964. In 1992, Bernhard Boser, Isabelle Guyon and

suggested a way to create nonlinear classifiers by applying the

to maximum-margin hyperplanes. The "soft margin" incarnation, as is commonly used in software packages, was proposed by

Corinna Cortes Corinna Cortes is a Danish computer scientist known for her contributions to machine learning. She is currently the Head of Google Research, New York. Cortes is a recipient of the Paris Kanellakis Theory and Practice Award for her work on theoreti ...

and Vapnik in 1993 and published in 1995.

Linear SVM

We are given a training dataset of

n

points of the form

(\mathbf_1, y_1), \ldots, (\mathbf_n, y_n),

where the

y_i

are either 1 or −1, each indicating the class to which the point

\mathbf_i

belongs. Each

\mathbf_i

is a

p

-dimensional

real Real may refer to: Currencies * Brazilian real (R$) * Central American Republic real * Mexican real * Portuguese real * Spanish real * Spanish colonial real Music Albums * ''Real'' (L'Arc-en-Ciel album) (2000) * ''Real'' (Bright album) (2010) ...

vector. We want to find the "maximum-margin hyperplane" that divides the group of points

\mathbf_i

for which

y_i = 1

from the group of points for which

y_i = -1

, which is defined so that the distance between the hyperplane and the nearest point

\mathbf_i

from either group is maximized. Any hyperplane can be written as the set of points

\mathbf

satisfying

\mathbf^\mathsf \mathbf - b = 0,

where

\mathbf

is the (not necessarily normalized) normal vector to the hyperplane. This is much like

Hesse normal form The Hesse normal form named after Otto Hesse, is an equation used in analytic geometry, and describes a line in \mathbb^2 or a plane in Euclidean space \mathbb^3 or a hyperplane in higher dimensions.John Vince: ''Geometry for Computer Graphics''. ...

, except that

\mathbf

is not necessarily a unit vector. The parameter

\tfrac

determines the offset of the hyperplane from the origin along the normal vector

\mathbf

Hard-margin

If the training data is

, we can select two parallel hyperplanes that separate the two classes of data, so that the distance between them is as large as possible. The region bounded by these two hyperplanes is called the "margin", and the maximum-margin hyperplane is the hyperplane that lies halfway between them. With a normalized or standardized dataset, these hyperplanes can be described by the equations :

\mathbf^\mathsf \mathbf - b = 1

(anything on or above this boundary is of one class, with label 1) and :

\mathbf^\mathsf \mathbf - b = -1

(anything on or below this boundary is of the other class, with label −1). Geometrically, the distance between these two hyperplanes is

\tfrac

, so to maximize the distance between the planes we want to minimize

\, \mathbf\,

. The distance is computed using the

distance from a point to a plane In Euclidean space, the distance from a point to a plane is the distance between a given point and its orthogonal projection on the plane, the perpendicular distance to the nearest point on the plane. It can be found starting with a change of varia ...

equation. We also have to prevent data points from falling into the margin, we add the following constraint: for each

i

either

\mathbf^\mathsf \mathbf_i - b \ge 1 \, , \text y_i = 1,

\mathbf^\mathsf \mathbf_i - b \le -1 \, , \text y_i = -1.

These constraints state that each data point must lie on the correct side of the margin. This can be rewritten as We can put this together to get the optimization problem:

\begin
&\underset && \, \mathbf\, _2^2\\
&\text && y_i(\mathbf^\top \mathbf_i - b) \geq 1 \quad \forall i \in \
\end

The

\mathbf

and

b

that solve this problem determine our classifier,

\mathbf \mapsto \sgn(\mathbf^\mathsf \mathbf - b)

where

\sgn(\cdot)

is the sign function. An important consequence of this geometric description is that the max-margin hyperplane is completely determined by those

\mathbf_i

that lie nearest to it. These

\mathbf_i

are called ''support vectors''.

Soft-margin

To extend SVM to cases in which the data are not linearly separable, the ''

hinge loss In machine learning, the hinge loss is a loss function used for training classifiers. The hinge loss is used for "maximum-margin" classification, most notably for support vector machines (SVMs). For an intended output and a classifier score , th ...

'' function is helpful

\max\left(0, 1 - y_i(\mathbf^\mathsf \mathbf_i - b)\right).

Note that

y_i

is the ''i''-th target (i.e., in this case, 1 or −1), and

\mathbf^\mathsf \mathbf_i - b

is the ''i''-th output. This function is zero if the constraint in is satisfied, in other words, if

\mathbf_i

lies on the correct side of the margin. For data on the wrong side of the margin, the function's value is proportional to the distance from the margin. The goal of the optimization then is to minimize

\lambda \lVert \mathbf \rVert^2 +\left frac 1 n \sum_^n \max\left(0, 1 - y_i(\mathbf^\mathsf \mathbf_i - b)\right) \right

where the parameter

\lambda > 0

determines the trade-off between increasing the margin size and ensuring that the

\mathbf_i

lie on the correct side of the margin. By deconstructing the hinge loss, this optimization problem can be massaged into the following:

\begin
&\underset &&\, \mathbf\, _2^2 + C\sum_^N \zeta_i\\
&\text && y_i(\mathbf^\top \mathbf_i - b) \geq 1 - \zeta_i, \quad \zeta_i \geq 0 \quad \forall i\in \
\end

Thus, for large values of

C

, it will behave similar to the hard-margin SVM, if the input data are linearly classifiable, but will still learn if a classification rule is viable or not. (

\lambda

is inversely related to

C

, e.g. in ''

LIBSVM LIBSVM and LIBLINEAR are two popular open source machine learning libraries, both developed at the National Taiwan University and both written in C++ though with a C API. LIBSVM implements the Sequential minimal optimization (SMO) algorithm ...

''.)

Nonlinear Kernels

The original maximum-margin hyperplane algorithm proposed by Vapnik in 1963 constructed a

. However, in 1992, Bernhard Boser, Isabelle Guyon and

suggested a way to create nonlinear classifiers by applying the

(originally proposed by Aizerman et al.) to maximum-margin hyperplanes. The resulting algorithm is formally similar, except that every

is replaced by a nonlinear

kernel Kernel may refer to: Computing * Kernel (operating system), the central component of most operating systems * Kernel (image processing), a matrix used for image convolution * Compute kernel, in GPGPU programming * Kernel method, in machine learn ...

function. This allows the algorithm to fit the maximum-margin hyperplane in a transformed

. The transformation may be nonlinear and the transformed space high-dimensional; although the classifier is a hyperplane in the transformed feature space, it may be nonlinear in the original input space. It is noteworthy that working in a higher-dimensional feature space increases the generalization error of support vector machines, although given enough samples the algorithm still performs well. Some common kernels include: * Polynomial (homogeneous):

k(\mathbf_i, \mathbf_j) = (\mathbf_i \cdot \mathbf_j)^d

. Particularly, when

d = 1

, this becomes the linear kernel. *

Polynomial In mathematics, a polynomial is an expression consisting of indeterminates (also called variables) and coefficients, that involves only the operations of addition, subtraction, multiplication, and positive-integer powers of variables. An example ...

(inhomogeneous):

k(\mathbf_i, \mathbf_j) = (\mathbf_i \cdot \mathbf_j + r)^d

. * Gaussian

radial basis function A radial basis function (RBF) is a real-valued function \varphi whose value depends only on the distance between the input and some fixed point, either the origin, so that \varphi(\mathbf) = \hat\varphi(\left\, \mathbf\right\, ), or some other fixed ...

k(\mathbf_i, \mathbf_j) = \exp\left(-\gamma \left\, \mathbf_i - \mathbf_j\right\, ^2\right)

for

\gamma > 0

. Sometimes parametrized using

\gamma = 1/(2\sigma^2)

. *

Sigmoid function A sigmoid function is a mathematical function having a characteristic "S"-shaped curve or sigmoid curve. A common example of a sigmoid function is the logistic function shown in the first figure and defined by the formula: :S(x) = \frac = \ ...

(

Hyperbolic tangent In mathematics, hyperbolic functions are analogues of the ordinary trigonometric functions, but defined using the hyperbola rather than the circle. Just as the points form a circle with a unit radius, the points form the right half of the u ...

k(\mathbf, \mathbf) = \tanh(\kappa \mathbf_i \cdot \mathbf_j + c)

for some (not every)

\kappa > 0

and

c < 0

. The kernel is related to the transform

\varphi(\mathbf_i)

by the equation

k(\mathbf_i, \mathbf_j) = \varphi(\mathbf_i) \cdot \varphi(\mathbf)

. The value is also in the transformed space, with

\mathbf = \sum_i \alpha_i y_i \varphi(\mathbf_i)

. Dot products with for classification can again be computed by the kernel trick, i.e.

\mathbf \cdot \varphi(\mathbf) = \sum_i \alpha_i y_i k(\mathbf_i, \mathbf)

Computing the SVM classifier

Computing the (soft-margin) SVM classifier amounts to minimizing an expression of the form We focus on the soft-margin classifier since, as noted above, choosing a sufficiently small value for

\lambda

yields the hard-margin classifier for linearly classifiable input data. The classical approach, which involves reducing to a

quadratic programming Quadratic programming (QP) is the process of solving certain mathematical optimization problems involving quadratic functions. Specifically, one seeks to optimize (minimize or maximize) a multivariate quadratic function subject to linear constr ...

problem, is detailed below. Then, more recent approaches such as sub-gradient descent and coordinate descent will be discussed.

Primal

Minimizing can be rewritten as a constrained optimization problem with a differentiable objective function in the following way. For each

i \in \

we introduce a variable

\zeta_i = \max\left(0, 1 - y_i(\mathbf^\mathsf \mathbf_i - b)\right)

. Note that

\zeta_i

is the smallest nonnegative number satisfying

y_i(\mathbf^\mathsf \mathbf_i - b) \geq 1 - \zeta_i.

Thus we can rewrite the optimization problem as follows

&\text y_i\left(\mathbf^\mathsf \mathbf_i - b\right) \geq 1 - \zeta_i \, \text \, \zeta_i \geq 0,\, \text i. \end

This is called the ''primal'' problem.

Dual

By solving for the Lagrangian dual of the above problem, one obtains the simplified problem

\begin
&\text\,\, f(c_1 \ldots c_n) =  \sum_^n c_i - \frac 1 2 \sum_^n\sum_^n y_i c_i(\mathbf_i^\mathsf \mathbf_j)y_j c_j, \\
&\text \sum_^n c_iy_i = 0,\,\text 0 \leq c_i \leq \frac\;\texti.
\end

This is called the ''dual'' problem. Since the dual maximization problem is a quadratic function of the

c_i

subject to linear constraints, it is efficiently solvable by

algorithms. Here, the variables

c_i

are defined such that

\mathbf = \sum_^n c_iy_i \mathbf_i.

Moreover,

c_i = 0

exactly when

\mathbf_i

lies on the correct side of the margin, and

0 < c_i <(2n\lambda)^

when

\mathbf_i

lies on the margin's boundary. It follows that

\mathbf

can be written as a linear combination of the support vectors. The offset,

b

, can be recovered by finding an

\mathbf_i

on the margin's boundary and solving

y_i(\mathbf^\mathsf \mathbf_i - b) = 1 \iff b = \mathbf^\mathsf \mathbf_i - y_i .

(Note that

y_i^=y_i

since

y_i=\pm 1

Kernel trick

Suppose now that we would like to learn a nonlinear classification rule which corresponds to a linear classification rule for the transformed data points

\varphi(\mathbf_i).

Moreover, we are given a kernel function

k

which satisfies

k(\mathbf_i, \mathbf_j) = \varphi(\mathbf_i) \cdot \varphi(\mathbf_j)

. We know the classification vector

\mathbf

in the transformed space satisfies

\mathbf = \sum_^n c_iy_i\varphi(\mathbf_i),

where, the

c_i

are obtained by solving the optimization problem

\begin
\text\,\, f(c_1 \ldots c_n) &=  \sum_^n c_i - \frac 1 2 \sum_^n\sum_^n y_ic_i(\varphi(\mathbf_i) \cdot \varphi(\mathbf_j))y_jc_j \\
                                      &=  \sum_^n c_i - \frac 1 2 \sum_^n\sum_^n y_ic_ik(\mathbf_i, \mathbf_j)y_jc_j \\
\text \sum_^n c_i y_i &= 0,\,\text 0 \leq c_i \leq \frac\;\texti.
\end

The coefficients

c_i

can be solved for using quadratic programming, as before. Again, we can find some index

i

such that

0 < c_i <(2n\lambda)^

, so that

\varphi(\mathbf_i)

lies on the boundary of the margin in the transformed space, and then solve

- y_i. \end

Finally,

- b\right).

Modern methods

Recent algorithms for finding the SVM classifier include sub-gradient descent and coordinate descent. Both techniques have proven to offer significant advantages over the traditional approach when dealing with large, sparse datasets—sub-gradient methods are especially efficient when there are many training examples, and coordinate descent when the dimension of the feature space is high.

Sub-gradient descent

Sub-gradient descent algorithms for the SVM work directly with the expression

+ \lambda \, \mathbf\, ^2.

Note that

f

is a convex function of

\mathbf

and

b

. As such, traditional

gradient descent In mathematics, gradient descent (also often called steepest descent) is a first-order iterative optimization algorithm for finding a local minimum of a differentiable function. The idea is to take repeated steps in the opposite direction of the ...

(or SGD) methods can be adapted, where instead of taking a step in the direction of the function's gradient, a step is taken in the direction of a vector selected from the function's sub-gradient. This approach has the advantage that, for certain implementations, the number of iterations does not scale with

n

, the number of data points.

Coordinate descent

Coordinate descent Coordinate descent is an optimization algorithm that successively minimizes along coordinate directions to find the minimum of a function. At each iteration, the algorithm determines a coordinate or coordinate block via a coordinate selection rule, ...

algorithms for the SVM work from the dual problem

\begin
&\text\,\, f(c_1 \ldots c_n) =  \sum_^n c_i - \frac 1 2 \sum_^n\sum_^n y_i c_i(x_i \cdot x_j)y_j c_j,\\
&\text \sum_^n c_iy_i = 0,\,\text 0 \leq c_i \leq \frac\;\texti.
\end

For each

i \in \

, iteratively, the coefficient

c_i

is adjusted in the direction of

\partial f/ \partial c_i

. Then, the resulting vector of coefficients

(c_1',\,\ldots,\,c_n')

is projected onto the nearest vector of coefficients that satisfies the given constraints. (Typically Euclidean distances are used.) The process is then repeated until a near-optimal vector of coefficients is obtained. The resulting algorithm is extremely fast in practice, although few performance guarantees have been proven.

Empirical risk minimization

The soft-margin support vector machine described above is an example of an

empirical risk minimization Empirical risk minimization (ERM) is a principle in statistical learning theory which defines a family of learning algorithms and is used to give theoretical bounds on their performance. The core idea is that we cannot know exactly how well an al ...

(ERM) algorithm for the ''

''. Seen this way, support vector machines belong to a natural class of algorithms for statistical inference, and many of its unique features are due to the behavior of the hinge loss. This perspective can provide further insight into how and why SVMs work, and allow us to better analyze their statistical properties.

Risk minimization

In supervised learning, one is given a set of training examples

X_1 \ldots X_n

with labels

y_1 \ldots y_n

, and wishes to predict

y_

given

X_

. To do so one forms a

hypothesis A hypothesis (plural hypotheses) is a proposed explanation for a phenomenon. For a hypothesis to be a scientific hypothesis, the scientific method requires that one can test it. Scientists generally base scientific hypotheses on previous obse ...

f

, such that

f(X_)

is a "good" approximation of

y_

. A "good" approximation is usually defined with the help of a '' loss function,''

\ell(y,z)

, which characterizes how bad

z

is as a prediction of

y

. We would then like to choose a hypothesis that minimizes the ''

expected risk Expected may refer to: *Expectation (epistemic) *Expected value *Expected shortfall *Expected utility hypothesis *Expected return *Expected loss ;See also *Unexpected (disambiguation) Unexpected may refer to: Film and television * ''Unexpecte ...

:''

\varepsilon(f) = \mathbb\left ell(y_, f(X_)) \right

In most cases, we don't know the joint distribution of

X_,\,y_

outright. In these cases, a common strategy is to choose the hypothesis that minimizes the ''empirical risk:''

\hat \varepsilon(f) = \frac 1 n \sum_^n \ell(y_k, f(X_k)).

Under certain assumptions about the sequence of random variables

X_k,\, y_k

(for example, that they are generated by a finite Markov process), if the set of hypotheses being considered is small enough, the minimizer of the empirical risk will closely approximate the minimizer of the expected risk as

n

grows large. This approach is called ''empirical risk minimization,'' or ERM.

Regularization and stability

In order for the minimization problem to have a well-defined solution, we have to place constraints on the set

\mathcal

of hypotheses being considered. If

\mathcal

is a normed space (as is the case for SVM), a particularly effective technique is to consider only those hypotheses

f

for which

\lVert f \rVert_ < k

. This is equivalent to imposing a ''regularization penalty''

\mathcal R(f) = \lambda_k\lVert f \rVert_

, and solving the new optimization problem

\hat f = \mathrm\min_ \hat \varepsilon(f) + \mathcal(f).

This approach is called ''

Tikhonov regularization Ridge regression is a method of estimating the coefficients of multiple-regression models in scenarios where the independent variables are highly correlated. It has been used in many fields including econometrics, chemistry, and engineering. Also ...

.'' More generally,

\mathcal(f)

can be some measure of the complexity of the hypothesis

f

, so that simpler hypotheses are preferred.

SVM and the hinge loss

Recall that the (soft-margin) SVM classifier

\hat\mathbf, b: \mathbf \mapsto \sgn(\hat\mathbf^\mathsf \mathbf - b)

is chosen to minimize the following expression:

+ \lambda \, \mathbf\, ^2.

In light of the above discussion, we see that the SVM technique is equivalent to empirical risk minimization with Tikhonov regularization, where in this case the loss function is the

\ell(y,z) = \max\left(0, 1 - yz \right).

From this perspective, SVM is closely related to other fundamental classification algorithms such as

regularized least-squares Regularized least squares (RLS) is a family of methods for solving the least-squares problem while using regularization to further constrain the resulting solution. RLS is used for two main reasons. The first comes up when the number of variables ...

and

logistic regression In statistics, the logistic model (or logit model) is a statistical model that models the probability of an event taking place by having the log-odds for the event be a linear combination of one or more independent variables. In regression a ...

. The difference between the three lies in the choice of loss function: regularized least-squares amounts to empirical risk minimization with the square-loss,

\ell_(y,z) = (y-z)^2

; logistic regression employs the log-loss,

\ell_(y,z) = \ln(1 + e^).

Target functions

The difference between the hinge loss and these other loss functions is best stated in terms of ''target functions -'' the function that minimizes expected risk for a given pair of random variables

X,\,y

. In particular, let

y_x

denote

y

conditional on the event that

X = x

. In the classification setting, we have:

y_x = \begin 1 & \text p_x \\ -1 & \text 1-p_x  \end

The optimal classifier is therefore:

f^*(x) = \begin1 & \textp_x \geq 1/2 \\ -1 & \text\end

For the square-loss, the target function is the conditional expectation function,

f_(x) = \mathbb\left_x\right /math>; For the logistic loss, it's the logit function, f_(x) = \ln\left(p_x / ()\right) . While both of these target functions yield the correct classifier, as \sgn(f_) = \sgn(f_\log) = f^*, they give us more information than we need. In fact, they give us enough information to completely describe the distribution of y_x .

On the other hand, one can check that the target function for the hinge loss is ''exactly'' f^* . Thus, in a sufficiently rich hypothesis space—or equivalently, for an appropriately chosen kernel—the SVM classifier will converge to the simplest function (in terms of \mathcal) that correctly classifies the data. This extends the geometric interpretation of SVM—for linear classification, the empirical risk is minimized by any function whose margins lie between the support vectors, and the simplest of these is the max-margin classifier.

Properties

SVMs belong to a family of generalized

s and can be interpreted as an extension of the

perceptron In machine learning, the perceptron (or McCulloch-Pitts neuron) is an algorithm for supervised learning of binary classifiers. A binary classifier is a function which can decide whether or not an input, represented by a vector of numbers, belon ...

. They can also be considered a special case of

. A special property is that they simultaneously minimize the empirical ''classification error'' and maximize the ''geometric margin''; hence they are also known as maximum

s. A comparison of the SVM to other classifiers has been made by Meyer, Leisch and Hornik.

Parameter selection

The effectiveness of SVM depends on the selection of kernel, the kernel's parameters, and soft margin parameter

\lambda

. A common choice is a Gaussian kernel, which has a single parameter ''

\gamma

''. The best combination of

\lambda

and

\gamma

is often selected by a grid search with exponentially growing sequences of

\lambda

and ''

\gamma

'', for example,

\lambda \in \

;

\gamma \in \

. Typically, each combination of parameter choices is checked using cross validation, and the parameters with best cross-validation accuracy are picked. Alternatively, recent work in Bayesian optimization can be used to select

\lambda

and ''

\gamma

'' , often requiring the evaluation of far fewer parameter combinations than grid search. The final model, which is used for testing and for classifying new data, is then trained on the whole training set using the selected parameters.

Issues

Potential drawbacks of the SVM include the following aspects: * Requires full labeling of input data * Uncalibrated

class membership probabilities In machine learning, a probabilistic classifier is a classifier that is able to predict, given an observation of an input, a probability distribution over a set of classes, rather than only outputting the most likely class that the observation sho ...

—SVM stems from Vapnik's theory which avoids estimating probabilities on finite data * The SVM is only directly applicable for two-class tasks. Therefore, algorithms that reduce the multi-class task to several binary problems have to be applied; see the multi-class SVM section. * Parameters of a solved model are difficult to interpret.

Extensions

Support vector clustering (SVC)

SVC is a similar method that also builds on kernel functions but is appropriate for unsupervised learning.

Multiclass SVM

Multiclass SVM aims to assign labels to instances by using support vector machines, where the labels are drawn from a finite set of several elements. The dominant approach for doing so is to reduce the single multiclass problem into multiple

binary classification Binary classification is the task of classifying the elements of a set into two groups (each called ''class'') on the basis of a classification rule. Typical binary classification problems include: * Medical testing to determine if a patient has c ...

problems. Common methods for such reduction include: * Building binary classifiers that distinguish between one of the labels and the rest (''one-versus-all'') or between every pair of classes (''one-versus-one''). Classification of new instances for the one-versus-all case is done by a winner-takes-all strategy, in which the classifier with the highest-output function assigns the class (it is important that the output functions be calibrated to produce comparable scores). For the one-versus-one approach, classification is done by a max-wins voting strategy, in which every classifier assigns the instance to one of the two classes, then the vote for the assigned class is increased by one vote, and finally the class with the most votes determines the instance classification. *

Directed acyclic graph In mathematics, particularly graph theory, and computer science, a directed acyclic graph (DAG) is a directed graph with no directed cycles. That is, it consists of vertices and edges (also called ''arcs''), with each edge directed from one v ...

SVM (DAGSVM) * Error-correcting output codes Crammer and Singer proposed a multiclass SVM method which casts the

multiclass classification In machine learning and statistical classification, multiclass classification or multinomial classification is the problem of classifying instances into one of three or more classes (classifying instances into one of two classes is called binary c ...

problem into a single optimization problem, rather than decomposing it into multiple binary classification problems. See also Lee, Lin and Wahba and Van den Burg and Groenen.

Transductive support vector machines

Transductive support vector machines extend SVMs in that they could also treat partially labeled data in

semi-supervised learning Weak supervision is a branch of machine learning where noisy, limited, or imprecise sources are used to provide supervision signal for labeling large amounts of training data in a supervised learning setting. This approach alleviates the burden of ...

by following the principles of transduction. Here, in addition to the training set

\mathcal

, the learner is also given a set

\mathcal^\star = \_^k

of test examples to be classified. Formally, a transductive support vector machine is defined by the following primal optimization problem: Minimize (in

\mathbf, b, \mathbf^\star

)

\frac\, \mathbf\, ^2

subject to (for any

i = 1, \dots, n

and any

j = 1, \dots, k

)

\begin
&y_i(\mathbf \cdot \mathbf_i - b) \ge 1, \\
&y^\star_j(\mathbf \cdot \mathbf^\star_j - b) \ge 1,
\end

and

y^\star_j \in \.

Transductive support vector machines were introduced by Vladimir N. Vapnik in 1998.

Structured SVM

SVMs have been generalized to

structured SVM The structured support-vector machine is a machine learning algorithm that generalizes the Support-Vector Machine (SVM) classifier. Whereas the SVM classifier supports binary classification, multiclass classification and regression, the structur ...

s, where the label space is structured and of possibly infinite size.

Regression

A version of SVM for regression was proposed in 1996 by Vladimir N. Vapnik, Harris Drucker, Christopher J. C. Burges, Linda Kaufman and Alexander J. Smola. This method is called support vector regression (SVR). The model produced by support vector classification (as described above) depends only on a subset of the training data, because the cost function for building the model does not care about training points that lie beyond the margin. Analogously, the model produced by SVR depends only on a subset of the training data, because the cost function for building the model ignores any training data close to the model prediction. Another SVM version known as least-squares support vector machine (LS-SVM) has been proposed by Suykens and Vandewalle. Training the original SVR means solving : minimize

\tfrac \, w\, ^2

: subject to

,  y_i - \langle w, x_i \rangle  - b ,  \le \varepsilon

where

x_i

is a training sample with target value

y_i

. The inner product plus intercept

\langle w, x_i \rangle + b

is the prediction for that sample, and

\varepsilon

is a free parameter that serves as a threshold: all predictions have to be within an

\varepsilon

range of the true predictions. Slack variables are usually added into the above to allow for errors and to allow approximation in the case the above problem is infeasible.

Bayesian SVM

In 2011 it was shown by Polson and Scott that the SVM admits a

Bayesian Thomas Bayes (/beɪz/; c. 1701 – 1761) was an English statistician, philosopher, and Presbyterian minister. Bayesian () refers either to a range of concepts and approaches that relate to statistical methods based on Bayes' theorem, or a followe ...

interpretation through the technique of data augmentation. In this approach the SVM is viewed as a

graphical model A graphical model or probabilistic graphical model (PGM) or structured probabilistic model is a probabilistic model for which a graph expresses the conditional dependence structure between random variables. They are commonly used in probabili ...

(where the parameters are connected via probability distributions). This extended view allows the application of

techniques to SVMs, such as flexible feature modeling, automatic

hyperparameter In Bayesian statistics, a hyperparameter is a parameter of a prior distribution; the term is used to distinguish them from parameters of the model for the underlying system under analysis. For example, if one is using a beta distribution to mo ...

tuning, and predictive uncertainty quantification. Recently, a scalable version of the Bayesian SVM was developed b
Florian Wenzel
enabling the application of Bayesian SVMs to big data. Florian Wenzel developed two different versions, a variational inference (VI) scheme for the Bayesian kernel support vector machine (SVM) and a stochastic version (SVI) for the linear Bayesian SVM.

Implementation

The parameters of the maximum-margin hyperplane are derived by solving the optimization. There exist several specialized algorithms for quickly solving the

(QP) problem that arises from SVMs, mostly relying on heuristics for breaking the problem down into smaller, more manageable chunks. Another approach is to use an

interior-point method Interior-point methods (also referred to as barrier methods or IPMs) are a certain class of algorithms that solve linear and nonlinear convex optimization problems. An interior point method was discovered by Soviet mathematician I. I. Dikin in 1 ...

that uses Newton-like iterations to find a solution of the Karush–Kuhn–Tucker conditions of the primal and dual problems. Instead of solving a sequence of broken-down problems, this approach directly solves the problem altogether. To avoid solving a linear system involving the large kernel matrix, a low-rank approximation to the matrix is often used in the kernel trick. Another common method is Platt's

sequential minimal optimization Sequential minimal optimization (SMO) is an algorithm for solving the quadratic programming (QP) problem that arises during the training of support-vector machines (SVM). It was invented by John Platt in 1998 at Microsoft Research. SMO is widely ...

(SMO) algorithm, which breaks the problem down into 2-dimensional sub-problems that are solved analytically, eliminating the need for a numerical optimization algorithm and matrix storage. This algorithm is conceptually simple, easy to implement, generally faster, and has better scaling properties for difficult SVM problems. The special case of linear support vector machines can be solved more efficiently by the same kind of algorithms used to optimize its close cousin,

; this class of algorithms includes sub-gradient descent (e.g., PEGASOS) and

coordinate descent Coordinate descent is an optimization algorithm that successively minimizes along coordinate directions to find the minimum of a function. At each iteration, the algorithm determines a coordinate or coordinate block via a coordinate selection rule, ...

(e.g., LIBLINEAR). LIBLINEAR has some attractive training-time properties. Each convergence iteration takes time linear in the time taken to read the train data, and the iterations also have a Q-linear convergence property, making the algorithm extremely fast. The general kernel SVMs can also be solved more efficiently using sub-gradient descent (e.g. P-packSVM), especially when

parallelization Parallel computing is a type of computation Computation is any type of arithmetic or non-arithmetic calculation that follows a well-defined model (e.g., an algorithm). Mechanical or electronic devices (or, historically, people) that perform ...

is allowed. Kernel SVMs are available in many machine-learning toolkits, including

MATLAB MATLAB (an abbreviation of "MATrix LABoratory") is a proprietary multi-paradigm programming language and numeric computing environment developed by MathWorks. MATLAB allows matrix manipulations, plotting of functions and data, implementa ...

SAS
SVMlight
kernlab

scikit-learn scikit-learn (formerly scikits.learn and also known as sklearn) is a free software machine learning library for the Python programming language. It features various classification, regression and clustering algorithms including support-vector ...

, Shogun,

Weka The weka, also known as the Māori hen or woodhen (''Gallirallus australis'') is a flightless bird species of the rail family. It is endemic to New Zealand. It is the only extant member of the genus '' Gallirallus''. Four subspecies are recogni ...

SharkJKernelMachines

OpenCV OpenCV (''Open Source Computer Vision Library'') is a library of programming functions mainly aimed at real-time computer vision. Originally developed by Intel, it was later supported by Willow Garage then Itseez (which was later acquired by In ...

and others. Preprocessing of data (standardization) is highly recommended to enhance accuracy of classification. There are a few methods of standardization, such as min-max, normalization by decimal scaling, Z-score. Subtraction of mean and division by variance of each feature is usually used for SVM.

References

External links

libsvm

is a popular library of SVM learners
liblinear
is a library for large linear classification including some SVMs
SVM light
is a collection of software tools for learning and classification using SVM
SVMJS live demo
is a GUI demo for

JavaScript JavaScript (), often abbreviated as JS, is a programming language that is one of the core technologies of the World Wide Web, alongside HTML and CSS. As of 2022, 98% of websites use JavaScript on the client side for webpage behavior, of ...

implementation of SVMs {{DEFAULTSORT:Support Vector Machine Classification algorithms Statistical classification

Motivation

Applications

History

Linear SVM

Hard-margin

Soft-margin

Nonlinear Kernels

Computing the SVM classifier

Primal

Dual

Kernel trick

Modern methods

Sub-gradient descent

Coordinate descent

Empirical risk minimization

Risk minimization

Regularization and stability

SVM and the hinge loss

Target functions

Properties

Parameter selection

Issues

Extensions

Support vector clustering (SVC)

Multiclass SVM

Transductive support vector machines

Structured SVM

Regression

Bayesian SVM

Implementation

See also

References

Further reading

External links