In machine learning, early stopping is a form of regularization used to avoid overfitting when training a learner with an iterative method, such as gradient descent. Such methods update the learner so as to make it better fit the training data with each iteration. Up to a point, this improves the learner's performance on data outside of the training set. Past that point, however, improving the learner's fit to the training data comes at the expense of increased generalization error. Early stopping rules provide guidance as to how many iterations can be run before the learner begins to overfit. Early stopping rules have been employed in many different machine learning methods, with varying amounts of theoretical foundation.
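In practice, the most common early stopping rule is a heuristic one: monitor the error on a held-out validation set and stop once it has not improved for a fixed number of iterations (the "patience"). The sketch below illustrates this heuristic in Python; the `train_step` and `validation_loss` callables and the patience value are placeholders for whatever iterative learner is being trained, and this is distinct from the analytical stopping rules discussed later in the article.

```python
# Minimal sketch of patience-based early stopping (hypothetical
# train_step / validation_loss callables, not from the cited literature).
import copy

def fit_with_early_stopping(model, train_step, validation_loss,
                            max_iters=1000, patience=10):
    best_loss = float("inf")
    best_model = copy.deepcopy(model)
    iters_without_improvement = 0

    for t in range(max_iters):
        train_step(model)                  # one iteration, e.g. a gradient step
        val_loss = validation_loss(model)  # error on held-out data

        if val_loss < best_loss:
            best_loss = val_loss
            best_model = copy.deepcopy(model)
            iters_without_improvement = 0
        else:
            iters_without_improvement += 1
            if iters_without_improvement >= patience:
                break                      # validation error stopped improving

    return best_model
```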
Background
This section presents some of the basic machine-learning concepts required for a description of early stopping methods.
Overfitting
Machine learning algorithms train a model based on a finite set of training data. During this training, the model is evaluated based on how well it predicts the observations contained in the training set. In general, however, the goal of a machine learning scheme is to produce a model that generalizes, that is, that predicts previously unseen observations. Overfitting occurs when a model fits the data in the training set well while incurring larger generalization error.
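A simple way to observe overfitting numerically is to fit models of increasing complexity to a small noisy sample and compare the error on the training points with the error on fresh points from the same distribution. The sketch below does this with polynomial fits in NumPy; the data-generating function, noise level, and sample sizes are arbitrary choices for illustration.

```python
# Illustration of overfitting: training error keeps falling with model
# complexity while error on unseen data eventually rises.
import numpy as np

rng = np.random.default_rng(0)

def sample(n):
    x = rng.uniform(-1, 1, n)
    y = np.sin(np.pi * x) + 0.3 * rng.standard_normal(n)  # noisy target
    return x, y

x_train, y_train = sample(15)
x_test, y_test = sample(200)

for degree in (1, 3, 9, 14):
    coeffs = np.polyfit(x_train, y_train, degree)
    train_err = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_err = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"degree {degree:2d}: train MSE {train_err:.3f}, test MSE {test_err:.3f}")
```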
Regularization
Regularization, in the context of machine learning, refers to the process of modifying a learning algorithm so as to prevent overfitting. This generally involves imposing some sort of smoothness constraint on the learned model.
This smoothness may be enforced explicitly, by fixing the number of parameters in the model, or by augmenting the cost function, as in Tikhonov regularization. Tikhonov regularization, along with principal component regression and many other regularization schemes, falls under the umbrella of spectral regularization, regularization characterized by the application of a filter. Early stopping also belongs to this class of methods.
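To make the filter view concrete for linear least squares: if the design matrix has singular value decomposition X = UΣVᵀ, the ordinary least-squares solution applies the factor 1/σ to each singular value, while Tikhonov (ridge) regularization applies the filter σ/(σ² + λ), damping the small, noise-dominated directions. The sketch below compares the two; the data and the regularization parameter λ are arbitrary choices for illustration.

```python
# Ridge (Tikhonov) regression viewed as a spectral filter on the SVD of X:
# ordinary least squares applies 1/sigma to each singular value, while ridge
# applies sigma / (sigma^2 + lam), which shrinks small (noisy) directions.
import numpy as np

rng = np.random.default_rng(1)
n, d = 50, 10
X = rng.standard_normal((n, d))
y = X @ rng.standard_normal(d) + 0.5 * rng.standard_normal(n)

U, s, Vt = np.linalg.svd(X, full_matrices=False)
lam = 1.0  # arbitrary regularization strength

w_ols   = Vt.T @ ((1.0 / s) * (U.T @ y))
w_ridge = Vt.T @ ((s / (s**2 + lam)) * (U.T @ y))

# Sanity check against the usual closed-form ridge solution.
w_ridge_direct = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
print(np.allclose(w_ridge, w_ridge_direct))  # True
```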
Gradient descent methods
Gradient descent methods are first-order, iterative optimization methods. Each iteration updates an approximate solution to the optimization problem by taking a step in the direction of the negative of the gradient of the objective function. By choosing the step size appropriately, such a method can be made to converge to a local minimum of the objective function. Gradient descent is used in machine learning by defining a ''loss function'' that reflects the error of the learner on the training set and then minimizing that function.
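As an illustration, the sketch below runs plain gradient descent on the least-squares training loss of a linear model; the step size and iteration count are arbitrary, and no early stopping is applied here.

```python
# Gradient descent on the least-squares training loss of a linear model.
import numpy as np

rng = np.random.default_rng(2)
n, d = 100, 5
X = rng.standard_normal((n, d))
y = X @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n)

w = np.zeros(d)          # initialize at the origin
step_size = 0.01         # fixed step size, small enough for convergence here
for t in range(500):
    residual = X @ w - y
    grad = (2.0 / n) * (X.T @ residual)   # gradient of the mean squared error
    w -= step_size * grad

print("final training MSE:", np.mean((X @ w - y) ** 2))
```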
Early stopping based on analytical results
Early stopping in statistical learning theory
Early stopping can be used to regularize non-parametric regression problems encountered in machine learning. For a given input space, $X$, output space, $Y$, and samples drawn from an unknown probability measure, $\rho$, on $Z = X \times Y$, the goal of such problems is to approximate a ''regression function'', $f_\rho$, given by
: $f_\rho(x) = \int_Y y \, d\rho(y \mid x), \qquad x \in X,$
where $\rho(y \mid x)$ is the conditional distribution at $x$ induced by $\rho$.
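Equivalently, the regression function is the conditional expectation of the output given the input:
: $f_\rho(x) = \operatorname{E}[y \mid x], \qquad x \in X.$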
One common choice for approximating the regression function is to use functions from a reproducing kernel Hilbert space.
These spaces can be infinite-dimensional, in which case they can supply solutions that overfit training sets of arbitrary size. Regularization is, therefore, especially important for these methods. One way to regularize non-parametric regression problems is to apply an early stopping rule to an iterative procedure such as gradient descent.
The early stopping rules proposed for these problems are based on analysis of upper bounds on the generalization error as a function of the iteration number. They yield prescriptions for the number of iterations to run that can be computed prior to starting the solution process.
Example: Least-squares loss
(Adapted from Yao, Rosasco and Caponnetto, 2007.)
Let $Y = \mathbb{R}$ and $X \subset \mathbb{R}^d$. Given a set of samples
: $\mathbf{z} = \{(x_i, y_i) \in X \times Y : i = 1, \ldots, m\} \in Z^m,$
drawn independently from $\rho$, minimize the functional
: $\mathcal{E}(f) = \int_{X \times Y} (f(x) - y)^2 \, d\rho(x, y),$
where $f$ is a member of the reproducing kernel Hilbert space $\mathcal{H}$. That is, minimize the expected risk for a least-squares loss function. Since $\mathcal{E}$ depends on the unknown probability measure $\rho$, it cannot be used for computation. Instead, consider the following empirical risk
: $\mathcal{E}_{\mathbf{z}}(f) = \frac{1}{m} \sum_{i=1}^m (f(x_i) - y_i)^2.$
Let $f_t$ and $f_t^{\mathbf{z}}$ be the ''t''-th iterates of gradient descent applied to the expected and empirical risks, respectively, where both iterations are initialized at the origin and both use the step size $\gamma_t$. The $f_t$ form the ''population iteration'', which converges to $f_\rho$ but cannot be used in computation, while the $f_t^{\mathbf{z}}$ form the ''sample iteration'', which usually converges to an overfitting solution.
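For the least-squares loss, the sample iteration can be carried out explicitly: each iterate lies in the span of the kernel functions at the training points, so gradient descent reduces to an update of a coefficient vector against the kernel matrix. The code below is a rough sketch of this update under an assumed Gaussian kernel, with an arbitrary fixed step size and an arbitrary stopping iteration; it is not the specific stopping rule of Yao, Rosasco and Caponnetto.

```python
# Sample iteration for kernel least squares: gradient descent on the
# empirical risk, written in terms of coefficients on the kernel matrix.
# With f_t = sum_j c[j] * K(x_j, .), the update
#   f_{t+1} = f_t - (gamma/m) * sum_i (f_t(x_i) - y_i) * K(x_i, .)
# becomes c <- c - (gamma/m) * (K @ c - y); constant factors from the
# squared loss are absorbed into the step size.
import numpy as np

def gaussian_kernel(A, B, bandwidth=0.5):
    # Assumed kernel choice; any positive-definite kernel would do.
    sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq_dists / (2 * bandwidth**2))

def sample_iteration(x, y, step_size=0.5, num_iters=100):
    m = len(y)
    K = gaussian_kernel(x, x)
    c = np.zeros(m)                         # start at the origin, f_0 = 0
    for t in range(num_iters):              # num_iters plays the role of
        c -= (step_size / m) * (K @ c - y)  # the early-stopping time
    return c, K

# Toy data; later iterates begin to fit the noise, so num_iters regularizes.
rng = np.random.default_rng(3)
x = rng.uniform(-1, 1, (30, 1))
y = np.sin(np.pi * x[:, 0]) + 0.3 * rng.standard_normal(30)
c, K = sample_iteration(x, y, num_iters=100)
print("training MSE after early stop:", np.mean((K @ c - y) ** 2))
```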
We want to control the difference between the expected risk of the sample iteration and the minimum expected risk, that is, the expected risk of the regression function:
: $\mathcal{E}(f_t^{\mathbf{z}}) - \mathcal{E}(f_\rho).$
This difference can be rewritten as the sum of two terms: the difference in expected risk between the sample and population iterations and that between the population iteration and the regression function:
: $\mathcal{E}(f_t^{\mathbf{z}}) - \mathcal{E}(f_\rho) = \left[\mathcal{E}(f_t^{\mathbf{z}}) - \mathcal{E}(f_t)\right] + \left[\mathcal{E}(f_t) - \mathcal{E}(f_\rho)\right].$