In machine learning and mathematical optimization, loss functions for classification are computationally feasible loss functions representing the price paid for inaccuracy of predictions in classification problems (problems of identifying which category a particular observation belongs to). Given \mathcal{X} as the space of all possible inputs (usually \mathcal{X} \subset \mathbb{R}^d), and \mathcal{Y} = \{-1,1\} as the set of labels (possible outputs), a typical goal of classification algorithms is to find a function f: \mathcal{X} \to \mathcal{Y} which best predicts a label y for a given input \vec{x}. However, because of incomplete information, noise in the measurement, or probabilistic components in the underlying process, it is possible for the same \vec{x} to generate different y. As a result, the goal of the learning problem is to minimize expected loss (also known as the risk), defined as

:I[f] = \displaystyle \int_{\mathcal{X} \times \mathcal{Y}} V(f(\vec{x}),y) \, p(\vec{x},y) \, d\vec{x} \, dy

where V(f(\vec{x}),y) is a given loss function, and p(\vec{x},y) is the probability density function of the process that generated the data, which can equivalently be written as

:p(\vec{x},y)=p(y\mid\vec{x}) \, p(\vec{x}).

Within classification, several commonly used loss functions are written solely in terms of the product of the true label y and the predicted label f(\vec{x}). Therefore, they can be defined as functions of only one variable \upsilon=y f(\vec{x}), so that V(f(\vec{x}),y) = \phi(yf(\vec{x})) = \phi(\upsilon) with a suitably chosen function \phi:\mathbb{R}\to\mathbb{R}. These are called margin-based loss functions. Choosing a margin-based loss function amounts to choosing \phi. Selection of a loss function within this framework impacts the optimal f^{*}_{\phi} which minimizes the expected risk.

In the case of binary classification, it is possible to simplify the calculation of expected risk from the integral specified above. Specifically,

:\begin{align}
I[f] & = \int_{\mathcal{X} \times \mathcal{Y}} V(f(\vec{x}),y) \, p(\vec{x},y) \, d\vec{x} \, dy \\
& = \int_{\mathcal{X}} \int_{\mathcal{Y}} \phi(yf(\vec{x})) \, p(y\mid\vec{x}) \, p(\vec{x}) \, dy \, d\vec{x} \\
& = \int_{\mathcal{X}} \left[\phi(f(\vec{x})) \, p(1\mid\vec{x}) + \phi(-f(\vec{x})) \, p(-1\mid\vec{x})\right] p(\vec{x}) \, d\vec{x} \\
& = \int_{\mathcal{X}} \left[\phi(f(\vec{x})) \, p(1\mid\vec{x}) + \phi(-f(\vec{x})) \, (1-p(1\mid\vec{x}))\right] p(\vec{x}) \, d\vec{x}
\end{align}

The second equality follows from the properties described above. The third equality follows from the fact that 1 and −1 are the only possible values for y, and the fourth because p(-1\mid x)=1-p(1\mid x). The term within brackets, \left[\phi(f(\vec{x})) \, p(1\mid\vec{x}) + \phi(-f(\vec{x})) \, (1-p(1\mid\vec{x}))\right], is known as the ''conditional risk''.

One can solve for the minimizer of I[f] by taking the functional derivative of the last equality with respect to f and setting the derivative equal to 0. Writing \eta = p(1\mid\vec{x}), this results in the following equation

:\frac{\partial\phi(f)}{\partial f}\eta + \frac{\partial\phi(-f)}{\partial f}(1-\eta)=0, \;\;\;\;\;(1)

which is also equivalent to setting the derivative of the conditional risk equal to zero.
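As a minimal illustrative sketch (the names and values below are chosen only for illustration), the conditional risk can be minimized numerically over a grid of candidate scores. The example uses the exponential loss \phi(v)=e^{-v}, whose closed-form minimizer is derived in a later section:

 import numpy as np
 
 # Conditional risk of a margin loss phi at a point with eta = p(1|x):
 #   C(eta, f) = phi(f)*eta + phi(-f)*(1 - eta)
 # Minimizing it over f solves equation (1) pointwise.
 def conditional_risk(phi, eta, f):
     return phi(f) * eta + phi(-f) * (1.0 - eta)
 
 phi = lambda v: np.exp(-v)               # exponential loss
 eta = 0.8
 fs = np.linspace(-3.0, 3.0, 100001)      # dense grid of candidate scores f
 f_star = fs[np.argmin(conditional_risk(phi, eta, fs))]
 
 # Closed-form minimizer for the exponential loss: (1/2)*log(eta/(1-eta))
 print(f_star, 0.5 * np.log(eta / (1.0 - eta)))   # both approx. 0.693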
Given the binary nature of classification, a natural selection for a loss function (assuming equal cost for false positives and false negatives) would be the 0–1 loss function (0–1 indicator function), which takes the value of 0 if the predicted classification equals that of the true class, and the value of 1 if the predicted classification does not match the true class. This selection is modeled by

:V(f(\vec{x}),y)=H(-yf(\vec{x}))

where H indicates the Heaviside step function. However, this loss function is non-convex and non-smooth, and solving for the optimal solution is an NP-hard combinatorial optimization problem. As a result, it is better to substitute surrogate loss functions which are tractable for commonly used learning algorithms, as they have convenient properties such as being convex and smooth. In addition to their computational tractability, one can show that the solutions to the learning problem using these loss surrogates allow for the recovery of the actual solution to the original classification problem. Some of these surrogates are described below.

In practice, the probability distribution p(\vec{x},y) is unknown. Consequently, utilizing a training set of n independently and identically distributed sample points

:S = \{(\vec{x}_1,y_1), \dots ,(\vec{x}_n,y_n)\}

drawn from the data sample space, one seeks to minimize empirical risk

:I_S[f] = \frac{1}{n} \sum_{i=1}^n V(f(\vec{x}_i),y_i)

as a proxy for expected risk. (See statistical learning theory for a more detailed description.)
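As a sketch of how the empirical risk is computed in practice (the scorer and data below are synthetic placeholders, not from any particular source):

 import numpy as np
 
 def empirical_risk(phi, f, X, y):
     """I_S[f] = (1/n) * sum_i phi(y_i * f(x_i)) for a margin loss phi."""
     return np.mean(phi(y * f(X)))
 
 # Toy i.i.d. sample and a fixed linear scorer f(x) = w . x.
 rng = np.random.default_rng(0)
 X = rng.normal(size=(100, 2))
 y = np.where(X[:, 0] > 0, 1.0, -1.0)     # labels in {-1, +1}
 
 w = np.array([1.0, 0.0])
 f = lambda X: X @ w
 
 hinge = lambda v: np.maximum(0.0, 1.0 - v)
 print(empirical_risk(hinge, f, X, y))    # empirical hinge risk of this scorer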


Bayes consistency

Utilizing Bayes' theorem, it can be shown that the optimal f^*_{0/1}, i.e., the one that minimizes the expected risk associated with the zero-one loss, implements the Bayes optimal decision rule for a binary classification problem and is in the form of

:f^*_{0/1}(\vec{x}) \;=\; \begin{cases} \;\;\;1 & \text{if } p(1\mid\vec{x}) > p(-1\mid\vec{x}) \\ \;\;\;0 & \text{if } p(1\mid\vec{x}) = p(-1\mid\vec{x}) \\ -1 & \text{if } p(1\mid\vec{x}) < p(-1\mid\vec{x}) \end{cases}

A loss function is said to be ''classification-calibrated'' or ''Bayes consistent'' if its optimal f^*_{\phi} is such that f^*_{0/1}(\vec{x}) = \operatorname{sgn}(f^*_{\phi}(\vec{x})) and is thus optimal under the Bayes decision rule. A Bayes consistent loss function allows us to find the Bayes optimal decision function f^*_{\phi} by directly minimizing the expected risk and without having to explicitly model the probability density functions.

For convex margin loss \phi(\upsilon), it can be shown that \phi(\upsilon) is Bayes consistent if and only if it is differentiable at 0 and \phi'(0)<0. Yet, this result does not exclude the existence of non-convex Bayes consistent loss functions. A more general result states that Bayes consistent loss functions can be generated using the following formulation

:\phi(v)=C\left[f^{-1}(v)\right]+\left(1-f^{-1}(v)\right)\,C'\left[f^{-1}(v)\right], \;\;\;\;\;(2)

where f(\eta), (0\leq \eta \leq 1), is any invertible function such that f^{-1}(-v)=1-f^{-1}(v) and C(\eta) is any differentiable strictly concave function such that C(\eta)=C(1-\eta). Table-I shows the generated Bayes consistent loss functions for some example choices of C(\eta) and f^{-1}(v). Note that the Savage and Tangent loss are not convex. Such non-convex loss functions have been shown to be useful in dealing with outliers in classification. For all loss functions generated from (2), the posterior probability p(y=1\mid\vec{x}) can be found using the invertible ''link function'' as p(y=1\mid\vec{x})=\eta=f^{-1}(v). Such loss functions, where the posterior probability can be recovered using the invertible link, are called ''proper loss functions''.

Table-I. Bayes consistent loss functions generated from (2), with the choices of C(\eta) and f^{-1}(v) used in the sections below:
* Exponential: \phi(v)=e^{-v}, \; C(\eta)=2\sqrt{\eta(1-\eta)}, \; f^{-1}(v)=\frac{e^{2v}}{1+e^{2v}}, \; f(\eta)=\frac{1}{2}\log\left(\frac{\eta}{1-\eta}\right)
* Logistic: \phi(v)=\frac{1}{\log(2)}\log(1+e^{-v}), \; C(\eta)=\frac{1}{\log(2)}\left[-\eta\log\eta-(1-\eta)\log(1-\eta)\right], \; f^{-1}(v)=\frac{e^{v}}{1+e^{v}}, \; f(\eta)=\log\left(\frac{\eta}{1-\eta}\right)
* Square: \phi(v)=(1-v)^2, \; C(\eta)=4\eta(1-\eta), \; f^{-1}(v)=\frac{1}{2}(v+1), \; f(\eta)=2\eta-1
* Savage: \phi(v)=\frac{1}{(1+e^{v})^2}, \; C(\eta)=\eta(1-\eta), \; f^{-1}(v)=\frac{e^{v}}{1+e^{v}}, \; f(\eta)=\log\left(\frac{\eta}{1-\eta}\right)
* Tangent: \phi(v)=(2\arctan(v)-1)^2, \; C(\eta)=4\eta(1-\eta), \; f^{-1}(v)=\arctan(v)+\frac{1}{2}, \; f(\eta)=\tan\left(\eta-\frac{1}{2}\right)
The sole minimizer of the expected risk, f^*_{\phi}, associated with the above generated loss functions can be directly found from equation (1) and shown to be equal to the corresponding f(\eta). This holds even for the nonconvex loss functions, which means that gradient descent based algorithms such as gradient boosting can be used to construct the minimizer.
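A minimal sketch of the generator (2), assuming the Table-I choices for the exponential loss; evaluating the generated \phi on a grid confirms that it reproduces e^{-v}:

 import numpy as np
 
 # Generator (2): phi(v) = C[g(v)] + (1 - g(v)) * C'[g(v)], with g = f^{-1}.
 C  = lambda eta: 2.0 * np.sqrt(eta * (1.0 - eta))
 Cp = lambda eta: (1.0 - 2.0 * eta) / np.sqrt(eta * (1.0 - eta))  # C'(eta)
 g  = lambda v: np.exp(2.0 * v) / (1.0 + np.exp(2.0 * v))         # f^{-1}(v)
 
 def phi(v):
     eta = g(v)
     return C(eta) + (1.0 - eta) * Cp(eta)
 
 v = np.linspace(-2.0, 2.0, 5)
 print(phi(v))        # matches np.exp(-v) up to floating-point error
 print(np.exp(-v))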


Proper loss functions, loss margin and regularization

For proper loss functions, the ''loss margin'' can be defined as \mu_{\phi}=-\frac{\phi'(0)}{\phi''(0)} and shown to be directly related to the regularization properties of the classifier. Specifically, a loss function of larger margin increases regularization and produces better estimates of the posterior probability. For example, the loss margin can be increased for the logistic loss by introducing a \gamma parameter and writing the logistic loss as \frac{1}{\gamma}\log(1+e^{-\gamma v}), where smaller 0<\gamma<1 increases the margin of the loss. It is shown that this is directly equivalent to decreasing the learning rate in gradient boosting, F_m(x) = F_{m-1}(x) + \gamma h_m(x), where decreasing \gamma improves the regularization of the boosted classifier. The theory makes it clear that when a learning rate of \gamma is used, the correct formula for retrieving the posterior probability is now \eta=f^{-1}(\gamma F(x)).

In conclusion, by choosing a loss function with larger margin (smaller \gamma) we increase regularization and improve our estimates of the posterior probability, which in turn improves the ROC curve of the final classifier.
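As a sketch of the \gamma-parameterized logistic loss and of posterior recovery through the link (the function names below are illustrative, not from any particular boosting library):

 import numpy as np
 
 def logistic_loss_gamma(v, gamma):
     # (1/gamma)*log(1 + exp(-gamma*v)); smaller gamma -> larger loss margin
     return np.log1p(np.exp(-gamma * v)) / gamma
 
 def posterior_from_score(F_x, gamma):
     # eta = f^{-1}(gamma * F(x)) with the logistic link e^v / (1 + e^v)
     return 1.0 / (1.0 + np.exp(-gamma * F_x))
 
 v = np.linspace(-4.0, 4.0, 9)
 for gamma in (1.0, 0.5, 0.1):
     print(gamma, logistic_loss_gamma(v, gamma))
 
 print(posterior_from_score(2.0, 0.5))    # approx. 0.731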


Square loss

While more commonly used in regression, the square loss function can be re-written as a function \phi(yf(\vec{x})) and utilized for classification. It can be generated using (2) and Table-I as follows

:\phi(v)=C\left[f^{-1}(v)\right]+\left(1-f^{-1}(v)\right)\,C'\left[f^{-1}(v)\right]= 4\left(\tfrac{1}{2}(v+1)\right)\left(1-\tfrac{1}{2}(v+1)\right)+\left(1-\tfrac{1}{2}(v+1)\right)\left(4-8\left(\tfrac{1}{2}(v+1)\right)\right)=(1-v)^2.

The square loss function is both convex and smooth. However, the square loss function tends to penalize outliers excessively, leading to slower convergence rates (with regards to sample complexity) than for the logistic loss or hinge loss functions. In addition, functions which yield high values of f(\vec{x}) for some x \in X will perform poorly with the square loss function, since high values of yf(\vec{x}) will be penalized severely, regardless of whether the signs of y and f(\vec{x}) match.

A benefit of the square loss function is that its structure lends itself to easy cross validation of regularization parameters. Specifically for Tikhonov regularization, one can solve for the regularization parameter using leave-one-out cross-validation in the same time as it would take to solve a single problem.

The minimizer of I[f] for the square loss function can be directly found from equation (1) as

:f^*_{\text{Square}}= 2\eta-1=2p(1\mid x)-1.
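A quick numerical check of this minimizer (a sketch; grid minimization of the conditional risk):

 import numpy as np
 
 # Square loss as a margin loss; verify f* = 2*eta - 1 by grid search
 # over the conditional risk phi(f)*eta + phi(-f)*(1 - eta).
 phi = lambda v: (1.0 - v) ** 2
 
 eta = 0.7
 fs = np.linspace(-1.5, 1.5, 300001)
 risk = phi(fs) * eta + phi(-fs) * (1.0 - eta)
 print(fs[np.argmin(risk)], 2.0 * eta - 1.0)   # both approx. 0.4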


Logistic loss

The logistic loss function can be generated using (2) and Table-I as follows

:\begin{align}
\phi(v) &= C\left[f^{-1}(v)\right]+\left(1-f^{-1}(v)\right)\,C'\left[f^{-1}(v)\right] \\
&= \frac{1}{\log(2)}\left[-\frac{e^v}{1+e^v}\log\frac{e^v}{1+e^v}-\left(1-\frac{e^v}{1+e^v}\right)\log\left(1-\frac{e^v}{1+e^v}\right)\right] + \left(1-\frac{e^v}{1+e^v}\right)\left[-\frac{1}{\log(2)}\log\left(\frac{e^v/(1+e^v)}{1-e^v/(1+e^v)}\right)\right] \\
&=\frac{1}{\log(2)}\log(1+e^{-v}).
\end{align}

The logistic loss is convex and grows linearly for negative values, which makes it less sensitive to outliers. The logistic loss is used in the LogitBoost algorithm.

The minimizer of I[f] for the logistic loss function can be directly found from equation (1) as

:f^*_{\text{Logistic}}= \log\left(\frac{\eta}{1-\eta}\right)=\log\left(\frac{p(1\mid x)}{1-p(1\mid x)}\right).

This function is undefined when p(1\mid x)=1 or p(1\mid x)=0 (tending toward ∞ and −∞ respectively), but predicts a smooth curve which grows when p(1\mid x) increases and equals 0 when p(1\mid x)=0.5.

It is easy to check that the logistic loss and binary cross-entropy loss (log loss) are in fact the same (up to a multiplicative constant \frac{1}{\log(2)}). The cross-entropy loss is closely related to the Kullback–Leibler divergence between the empirical distribution and the predicted distribution. The cross-entropy loss is ubiquitous in modern deep neural networks.
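A sketch of this equivalence, assuming the usual sigmoid parameterization p = 1/(1+e^{-f}) and the label mapping y \in \{-1,+1\} \to t \in \{0,1\}:

 import numpy as np
 
 def logistic_margin_loss(f, y):          # y in {-1, +1}
     return np.log1p(np.exp(-y * f)) / np.log(2.0)
 
 def binary_cross_entropy(f, t):          # t in {0, 1}
     p = 1.0 / (1.0 + np.exp(-f))
     return -(t * np.log(p) + (1.0 - t) * np.log(1.0 - p))
 
 f, y = 0.8, -1.0
 t = (y + 1.0) / 2.0
 # Identical once the cross entropy is rescaled by 1/log(2).
 print(logistic_margin_loss(f, y), binary_cross_entropy(f, t) / np.log(2.0))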


Exponential loss

The exponential loss function can be generated using (2) and Table-I as follows

:\phi(v)=C\left[f^{-1}(v)\right]+\left(1-f^{-1}(v)\right)\,C'\left[f^{-1}(v)\right]= 2\sqrt{\frac{e^{2v}}{1+e^{2v}}\left(1-\frac{e^{2v}}{1+e^{2v}}\right)}+\left(1-\frac{e^{2v}}{1+e^{2v}}\right)\left(\frac{1-e^{2v}}{e^{v}}\right) = e^{-v}.

The exponential loss is convex and grows exponentially for negative values, which makes it more sensitive to outliers. The exponential loss is used in the AdaBoost algorithm.

The minimizer of I[f] for the exponential loss function can be directly found from equation (1) as

:f^*_{\text{Exp}}= \frac{1}{2}\log\left(\frac{\eta}{1-\eta}\right)=\frac{1}{2}\log\left(\frac{p(1\mid x)}{1-p(1\mid x)}\right).
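A small sketch contrasting this sensitivity with the logistic loss at large negative margins:

 import numpy as np
 
 # At badly misclassified points (large negative margins), the exponential
 # loss blows up exponentially while the logistic loss grows roughly linearly.
 v = np.array([-1.0, -2.0, -5.0, -10.0])
 print(np.exp(-v))                          # approx. [2.7, 7.4, 148.4, 22026.5]
 print(np.log1p(np.exp(-v)) / np.log(2.0))  # approx. [1.9, 3.1, 7.2, 14.4]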


Savage loss

The Savage loss can be generated using (2) and Table-I as follows

:\phi(v)=C\left[f^{-1}(v)\right]+\left(1-f^{-1}(v)\right)\,C'\left[f^{-1}(v)\right]= \left(\frac{e^v}{1+e^v}\right)\left(1-\frac{e^v}{1+e^v}\right)+\left(1-\frac{e^v}{1+e^v}\right)\left(1-\frac{2e^v}{1+e^v}\right) = \frac{1}{(1+e^v)^2}.

The Savage loss is quasi-convex and is bounded for large negative values, which makes it less sensitive to outliers. The Savage loss has been used in gradient boosting and the SavageBoost algorithm.

The minimizer of I[f] for the Savage loss function can be directly found from equation (1) as

:f^*_{\text{Savage}}= \log\left(\frac{\eta}{1-\eta}\right)=\log\left(\frac{p(1\mid x)}{1-p(1\mid x)}\right).
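A sketch of this boundedness (the loss approaches 1 for large negative margins and 0 for large positive ones):

 import numpy as np
 
 # Savage loss: bounded by 1 for large negative margins, so outliers
 # contribute at most a constant to the risk.
 savage = lambda v: 1.0 / (1.0 + np.exp(v)) ** 2
 
 v = np.array([-10.0, -2.0, 0.0, 2.0, 10.0])
 print(savage(v))   # approx. [1.0, 0.78, 0.25, 0.014, 2e-9]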


Tangent loss

The Tangent loss can be generated using (2) and Table-I as follows

:\begin{align}
\phi(v) & = C\left[f^{-1}(v)\right]+\left(1-f^{-1}(v)\right)\,C'\left[f^{-1}(v)\right] \\
& = 4\left(\arctan(v)+\tfrac{1}{2}\right)\left(1-\left(\arctan(v)+\tfrac{1}{2}\right)\right)+\left(1-\left(\arctan(v)+\tfrac{1}{2}\right)\right)\left(4-8\left(\arctan(v)+\tfrac{1}{2}\right)\right)\\
& = (2\arctan(v)-1)^2.
\end{align}

The Tangent loss is quasi-convex and is bounded for large negative values, which makes it less sensitive to outliers. Interestingly, the Tangent loss also assigns a bounded penalty to data points that have been classified "too correctly". This can help prevent over-training on the data set. The Tangent loss has been used in gradient boosting, the TangentBoost algorithm, and Alternating Decision Forests.

The minimizer of I[f] for the Tangent loss function can be directly found from equation (1) as

:f^*_{\text{Tangent}}= \tan\left(\eta-\tfrac{1}{2}\right)=\tan\left(p(1\mid x)-\tfrac{1}{2}\right).
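A sketch showing both bounds, including the nonzero penalty for "too correct" points at large positive margins:

 import numpy as np
 
 # Tangent loss: bounded on both sides; it is zero at v = tan(1/2) ~ 0.546
 # and approaches (pi+1)^2 ~ 17.2 as v -> -inf and (pi-1)^2 ~ 4.6 as v -> +inf.
 tangent = lambda v: (2.0 * np.arctan(v) - 1.0) ** 2
 
 v = np.array([-10.0, 0.0, np.tan(0.5), 10.0])
 print(tangent(v))   # approx. [15.5, 1.0, 0.0, 3.77]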


Hinge loss

The hinge loss function is defined with \phi(\upsilon) = \max(0, 1-\upsilon) = [1-\upsilon]_{+}, where [a]_{+} = \max(0,a) is the positive part function.

:V(f(\vec{x}),y) = \max(0, 1-yf(\vec{x})) = [1 - yf(\vec{x})]_{+}.

The hinge loss provides a relatively tight, convex upper bound on the 0–1 indicator function. Specifically, the hinge loss equals the 0–1 indicator function when \operatorname{sgn}(f(\vec{x})) = y and |yf(\vec{x})| \geq 1. In addition, the empirical risk minimization of this loss is equivalent to the classical formulation for support vector machines (SVMs). Correctly classified points lying outside the margin boundaries of the support vectors are not penalized, whereas points within the margin boundaries or on the wrong side of the hyperplane are penalized in a linear fashion compared to their distance from the correct boundary.

While the hinge loss function is both convex and continuous, it is not smooth (is not differentiable) at yf(\vec{x})=1. Consequently, the hinge loss function cannot be used with gradient descent methods or stochastic gradient descent methods which rely on differentiability over the entire domain. However, the hinge loss does have a subgradient at yf(\vec{x})=1, which allows for the utilization of subgradient descent methods. SVMs utilizing the hinge loss function can also be solved using quadratic programming.

The minimizer of I[f] for the hinge loss function is

:f^*_{\text{Hinge}}(\vec{x}) \;=\; \begin{cases} \;\;\;1 & \text{if } p(1\mid\vec{x}) > p(-1\mid\vec{x}) \\ -1 & \text{if } p(1\mid\vec{x}) < p(-1\mid\vec{x}) \end{cases}

when p(1\mid x) \ne 0.5, which matches that of the 0–1 indicator function. This conclusion makes the hinge loss quite attractive, as bounds can be placed on the difference between expected risk and the sign of the hinge loss function. The hinge loss cannot be derived from (2) since f^*_{\text{Hinge}} is not invertible.
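A minimal sketch of subgradient descent on the empirical hinge risk for a linear scorer f(x) = w \cdot x (synthetic data; no intercept or regularizer, so this is not a full SVM solver):

 import numpy as np
 
 # Subgradient of max(0, 1 - y*(w.x)) in w: -y*x when y*(w.x) < 1, else 0
 # (0 is a valid subgradient choice at the kink y*(w.x) = 1).
 rng = np.random.default_rng(0)
 X = rng.normal(size=(200, 2))
 y = np.where(X[:, 0] + X[:, 1] > 0.0, 1.0, -1.0)
 
 w = np.zeros(2)
 lr = 0.1
 for t in range(200):
     active = y * (X @ w) < 1.0               # points inside the margin
     g = np.zeros_like(w)
     if active.any():
         g = -(X[active] * y[active][:, None]).sum(axis=0) / len(X)
     w -= lr * g
 
 print(w, (np.sign(X @ w) == y).mean())       # weights and training accuracy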


Generalized smooth hinge loss

The generalized smooth hinge loss function with parameter \alpha is defined as

:f^*_\alpha(z) \;=\; \begin{cases} \frac{\alpha}{\alpha+1} - z & \text{if } z \leq 0 \\ \frac{1}{\alpha+1}z^{\alpha+1} - z + \frac{\alpha}{\alpha+1} & \text{if } 0 < z < 1 \\ 0 & \text{if } z \geq 1 \end{cases}

where

:z = yf(\vec{x}).

It is monotonically decreasing and reaches 0 when z = 1.
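A direct sketch of this piecewise definition, with \alpha = 2 as an example value:

 import numpy as np
 
 def generalized_smooth_hinge(z, alpha=2.0):
     z = np.asarray(z, dtype=float)
     a = alpha / (alpha + 1.0)
     zc = np.clip(z, 0.0, 1.0)   # avoid fractional powers of negative numbers
     mid = zc ** (alpha + 1.0) / (alpha + 1.0) - z + a
     return np.where(z <= 0.0, a - z, np.where(z < 1.0, mid, 0.0))
 
 z = np.array([-1.0, 0.0, 0.5, 1.0, 2.0])
 print(generalized_smooth_hinge(z))   # approx. [1.667, 0.667, 0.208, 0.0, 0.0]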


See also

* Differentiable programming
* Scoring function

