In machine learning, the hinge loss is a loss function used for training classifiers. The hinge loss is used for "maximum-margin" classification, most notably for support vector machines (SVMs). For an intended output t = \pm 1 and a classifier score y, the hinge loss of the prediction y is defined as

:\ell(y) = \max(0, 1 - t \cdot y)

Note that y should be the "raw" output of the classifier's decision function, not the predicted class label. For instance, in linear SVMs, y = \mathbf{w} \cdot \mathbf{x} + b, where (\mathbf{w}, b) are the parameters of the hyperplane and \mathbf{x} is the input variable(s). When t and y have the same sign (meaning y predicts the right class) and |y| \ge 1, the hinge loss \ell(y) = 0. When they have opposite signs, \ell(y) increases linearly with y, and similarly if |y| < 1, even if it has the same sign (correct prediction, but not by enough margin).
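
As a concrete illustration, the following minimal sketch computes the hinge loss for a linear classifier's raw score; the weight vector, bias, and data values are illustrative assumptions, not part of any particular implementation.

```python
import numpy as np

def hinge_loss(t, y):
    """Hinge loss for a true label t in {-1, +1} and a raw classifier score y."""
    return np.maximum(0.0, 1.0 - t * y)

# Illustrative linear SVM score y = w . x + b
w, b = np.array([2.0, -1.0]), 0.5
x = np.array([1.0, 1.0])
t = 1                      # intended output
y = w @ x + b              # raw decision value, not the predicted class label
print(hinge_loss(t, y))    # 0.0 here, since t * y = 1.5 >= 1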


Extensions

While binary SVMs are commonly extended to multiclass classification in a one-vs.-all or one-vs.-one fashion, it is also possible to extend the hinge loss itself for such an end. Several different variations of multiclass hinge loss have been proposed. For example, Crammer and Singer defined it for a linear classifier as

:\ell(y) = \max(0, 1 + \max_{y \ne t} \mathbf{w}_y \mathbf{x} - \mathbf{w}_t \mathbf{x}),

where t is the target label and \mathbf{w}_t and \mathbf{w}_y are the model parameters. Weston and Watkins provided a similar definition, but with a sum rather than a max:

:\ell(y) = \sum_{y \ne t} \max(0, 1 + \mathbf{w}_y \mathbf{x} - \mathbf{w}_t \mathbf{x}).

In structured prediction, the hinge loss can be further extended to structured output spaces. Structured SVMs with margin rescaling use the following variant, where \mathbf{w} denotes the SVM's parameters, \mathbf{y} the SVM's predictions, \phi the joint feature function, and \Delta the Hamming loss:

:\begin{align} \ell(\mathbf{y}) & = \max(0, \Delta(\mathbf{y}, \mathbf{t}) + \langle \mathbf{w}, \phi(\mathbf{x}, \mathbf{y}) \rangle - \langle \mathbf{w}, \phi(\mathbf{x}, \mathbf{t}) \rangle) \\ & = \max(0, \max_{\mathbf{y} \in \mathcal{Y}} \left( \Delta(\mathbf{y}, \mathbf{t}) + \langle \mathbf{w}, \phi(\mathbf{x}, \mathbf{y}) \rangle \right) - \langle \mathbf{w}, \phi(\mathbf{x}, \mathbf{t}) \rangle). \end{align}
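
A minimal sketch of the Crammer–Singer variant for a linear multiclass classifier follows; the weight matrix W (one row per class) and the toy data are illustrative assumptions.

```python
import numpy as np

def multiclass_hinge_loss(W, x, t):
    """Crammer-Singer multiclass hinge loss for a single example.

    W : (num_classes, num_features) weight matrix, one weight vector per class
    x : (num_features,) input vector
    t : integer index of the target class
    """
    scores = W @ x                          # w_y . x for every class y
    margins = 1.0 + scores - scores[t]      # 1 + w_y . x - w_t . x
    margins[t] = 0.0                        # exclude y == t from the max
    return max(0.0, margins.max())

# Illustrative usage
W = np.array([[1.0, 0.0], [0.5, 0.5], [-1.0, 1.0]])
x = np.array([2.0, 1.0])
print(multiclass_hinge_loss(W, x, t=0))     # 0.5: class 1 violates the unit margin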


Optimization

The hinge loss is a convex function, so many of the usual convex optimizers used in machine learning can work with it. It is not differentiable, but has a subgradient with respect to the model parameters \mathbf{w} of a linear SVM with score function y = \mathbf{w} \cdot \mathbf{x} that is given by

:\frac{\partial \ell}{\partial w_i} = \begin{cases} -t \cdot x_i & \text{if } t \cdot y < 1, \\ 0 & \text{otherwise}. \end{cases}

However, since the derivative of the hinge loss at ty = 1 is undefined, smoothed versions may be preferred for optimization, such as Rennie and Srebro's

:\ell(y) = \begin{cases} \frac{1}{2} - ty & \text{if} ~~ ty \le 0, \\ \frac{1}{2} (1 - ty)^2 & \text{if} ~~ 0 < ty < 1, \\ 0 & \text{if} ~~ 1 \le ty, \end{cases}

or the quadratically smoothed

:\ell_\gamma(y) = \begin{cases} \frac{1}{2\gamma} \max(0, 1 - ty)^2 & \text{if} ~~ ty \ge 1 - \gamma, \\ 1 - \frac{\gamma}{2} - ty & \text{otherwise}, \end{cases}

suggested by Zhang. The modified Huber loss L is a special case of this loss function with \gamma = 2, specifically L(t,y) = 4 \ell_2(y).
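
A minimal sketch of subgradient descent using the rule above is shown below; the learning rate, toy data, and omission of the bias term are illustrative assumptions.

```python
import numpy as np

def hinge_subgradient_step(w, x, t, lr=0.1):
    """One subgradient-descent step on the hinge loss for a single example.

    w : (num_features,) weight vector of the linear classifier
    x : (num_features,) input vector
    t : label in {-1, +1}
    """
    y = w @ x                        # raw score
    if t * y < 1:                    # margin violated: subgradient is -t * x
        w = w - lr * (-t * x)
    # otherwise the subgradient is 0 and w is unchanged
    return w

# Illustrative usage on a toy dataset
w = np.zeros(2)
X = np.array([[1.0, 2.0], [-1.0, -1.5], [2.0, 0.5]])
T = np.array([1, -1, 1])
for _ in range(100):
    for x, t in zip(X, T):
        w = hinge_subgradient_step(w, x, t)
print(w)    # weights that separate the toy data with margin at least 1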


