In mathematics, statistics, finance, and computer science, particularly in machine learning and inverse problems, regularization is a process that converts the answer to a problem into a simpler one. It is often used to obtain results for ill-posed problems or to prevent overfitting.
Although regularization procedures can be divided in many ways, the following delineation is particularly helpful:
* Explicit regularization is regularization whenever one explicitly adds a term to the optimization problem. These terms could be priors, penalties, or constraints. Explicit regularization is commonly employed with ill-posed optimization problems. The regularization term, or penalty, imposes a cost on the optimization function to make the optimal solution unique.
* Implicit regularization is all other forms of regularization. This includes, for example, early stopping, using a robust loss function, and discarding outliers. Implicit regularization is essentially ubiquitous in modern machine learning approaches, including stochastic gradient descent for training deep neural networks, and ensemble methods (such as random forests and gradient boosted trees). A minimal sketch contrasting the two approaches follows this list.
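To make the delineation concrete, here is a minimal sketch (a hypothetical numpy example; the data, step size, and variable names are all assumptions for illustration) of an explicit L2 penalty added to a least-squares data term, alongside implicit regularization by stopping gradient descent early:

```python
# Sketch contrasting explicit and implicit regularization on least squares.
# Illustrative only; none of these names come from the text.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 10))
y = X @ rng.normal(size=10) + 0.1 * rng.normal(size=50)

lam = 0.5  # regularization strength

def explicit_objective(w):
    # Explicit regularization: a penalty term is added to the data term.
    data_term = np.sum((X @ w - y) ** 2)
    penalty = lam * np.sum(w ** 2)  # L2 penalty on w
    return data_term + penalty

def implicit_fit(n_iters, step=1e-3):
    # Implicit regularization: no penalty term; stopping gradient
    # descent early limits model complexity instead.
    w = np.zeros(X.shape[1])
    for _ in range(n_iters):
        w -= step * 2 * X.T @ (X @ w - y)
    return w

print(explicit_objective(np.zeros(10)))    # objective at w = 0
print(np.linalg.norm(implicit_fit(10)))    # few iterations: small-norm w
print(np.linalg.norm(implicit_fit(5000)))  # many iterations: larger norm
```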
In explicit regularization, independent of the problem or model, there is always a data term that corresponds to a likelihood of the measurement, and a regularization term that corresponds to a prior. By combining both using Bayesian statistics, one can compute a posterior that includes both information sources and therefore stabilizes the estimation process. By trading off both objectives, one chooses to be more adherent to the data or to enforce generalization (to prevent overfitting). There is a whole research branch dealing with all possible regularizations. In practice, one usually tries a specific regularization and then figures out the probability density that corresponds to it, to justify the choice. It can also be physically motivated by common sense or intuition.
In machine learning, the data term corresponds to the training data, and the regularization is either the choice of the model or modifications to the algorithm. It is always intended to reduce the generalization error, i.e., the error score of the trained model on the evaluation set rather than the training data.
One of the earliest uses of regularization is Tikhonov regularization (ridge regression), related to the method of least squares.
Classification
Empirical learning of classifiers (from a finite data set) is always an underdetermined problem, because it attempts to infer a function of any x given only examples x_1, x_2, \dots, x_n.
A regularization term (or regularizer) R(f) is added to a loss function:
: \min_f \sum_{i=1}^{n} V(f(\hat{x}_i), \hat{y}_i) + \lambda R(f)
where V is an underlying loss function that describes the cost of predicting f(x) when the label is y, such as the square loss or hinge loss; and \lambda is a parameter which controls the importance of the regularization term. R(f) is typically chosen to impose a penalty on the complexity of f. Concrete notions of complexity used include restrictions for smoothness and bounds on the vector space norm.
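As a sketch of this objective, assuming a linear model f(x) = w \cdot x, the hinge loss for V, and an L2 regularizer R(w) = \|w\|_2^2 (a hypothetical numpy example; the data and names are illustrative, not from the text):

```python
# Sketch: regularized empirical risk sum_i V(f(x_i), y_i) + lambda * R(f)
# for a linear model, with hinge loss as V and R(w) = ||w||_2^2.
import numpy as np

def hinge_loss(score, label):
    # V(f(x), y) = max(0, 1 - y * f(x)), with labels in {-1, +1}
    return max(0.0, 1.0 - label * score)

def regularized_risk(w, X, y, lam):
    data_term = sum(hinge_loss(x @ w, yi) for x, yi in zip(X, y))
    return data_term + lam * np.dot(w, w)  # lambda * R(w)

X = np.array([[1.0, 2.0], [-1.0, -0.5], [0.5, 1.5]])
y = np.array([1, -1, 1])
print(regularized_risk(np.array([0.3, -0.2]), X, y, lam=0.1))
```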
A theoretical justification for regularization is that it attempts to impose Occam's razor on the solution (as depicted in the figure above, where the green function, the simpler one, may be preferred). From a Bayesian point of view, many regularization techniques correspond to imposing certain prior distributions on model parameters.
Regularization can serve multiple purposes, including learning simpler models, inducing models to be sparse and introducing group structure into the learning problem.
The same idea arose in many fields of science. A simple form of regularization applied to integral equations (Tikhonov regularization) is essentially a trade-off between fitting the data and reducing a norm of the solution. More recently, non-linear regularization methods, including total variation regularization, have become popular.
Generalization
Regularization can be motivated as a technique to improve the generalizability of a learned model.
The goal of this learning problem is to find a function that fits or predicts the outcome (label) that minimizes the expected error over all possible inputs and labels. The expected error of a function f_n is:
: I[f_n] = \int_{X \times Y} V(f_n(x), y) \, \rho(x, y) \, dx \, dy
where X and Y are the domains of input data x and their labels y respectively.
Typically in learning problems, only a subset of input data and labels are available, measured with some noise. Therefore, the expected error is unmeasurable, and the best surrogate available is the empirical error over the n available samples:
: I_S[f_n] = \frac{1}{n} \sum_{i=1}^{n} V(f_n(\hat{x}_i), \hat{y}_i)
Without bounds on the complexity of the function space (formally, the reproducing kernel Hilbert space) available, a model will be learned that incurs zero loss on the surrogate empirical error. If measurements (e.g. of x_i) were made with noise, this model may suffer from overfitting
and display poor expected error. Regularization introduces a penalty for exploring certain regions of the function space used to build the model, which can improve generalization.
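A small illustration of this effect, under the assumption of a degree-9 polynomial feature map and synthetic noisy data (a hypothetical numpy sketch, not a prescribed experiment):

```python
# Sketch: with noisy labels and a rich hypothesis space (degree-9
# polynomials), the unpenalized fit reaches near-zero empirical error but
# typically shows worse held-out error than a slightly penalized fit.
import numpy as np

rng = np.random.default_rng(1)
x_train = np.linspace(0, 1, 12)
y_train = np.sin(2 * np.pi * x_train) + 0.3 * rng.normal(size=12)
x_test = np.linspace(0, 1, 200)
y_test = np.sin(2 * np.pi * x_test)

def design(x, degree=9):
    return np.vander(x, degree + 1)  # polynomial feature map

def fit(lam):
    A = design(x_train)
    # Penalized least squares: (A^T A + lam I)^{-1} A^T y
    return np.linalg.solve(A.T @ A + lam * np.eye(A.shape[1]), A.T @ y_train)

for lam in (0.0, 1e-3):
    w = fit(lam)
    test_mse = np.mean((design(x_test) @ w - y_test) ** 2)
    print(f"lambda={lam}: held-out MSE = {test_mse:.3f}")
```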
Tikhonov regularization
These techniques are named for Andrey Nikolayevich Tikhonov, who applied regularization to integral equations and made important contributions in many other areas.
When learning a linear function f, characterized by an unknown vector w such that f(x) = w \cdot x, one can add the L_2-norm of the vector w to the loss expression in order to prefer solutions with smaller norms. Tikhonov regularization is one of the most common forms. It is also known as ridge regression. It is expressed as:
: \min_w \sum_{i=1}^{n} V(\hat{x}_i \cdot w, \hat{y}_i) + \lambda \|w\|_2^2,
where (\hat{x}_i, \hat{y}_i), \ 1 \le i \le n, would represent samples used for training.
In the case of a general function, the norm of the function in its reproducing kernel Hilbert space is used:
: \min_f \sum_{i=1}^{n} V(f(\hat{x}_i), \hat{y}_i) + \lambda \|f\|_{\mathcal{H}}^2
As the L_2 norm is differentiable, learning can be advanced by gradient descent.
Tikhonov-regularized least squares
The learning problem with the least squares loss function and Tikhonov regularization can be solved analytically. Written in matrix form, the optimal w is the one for which the gradient of the loss function with respect to w is 0:
: \min_w \frac{1}{n} (\hat{X} w - Y)^{\mathsf{T}} (\hat{X} w - Y) + \lambda \|w\|_2^2
: \nabla_w = \frac{2}{n} \hat{X}^{\mathsf{T}} (\hat{X} w - Y) + 2 \lambda w
: 0 = \hat{X}^{\mathsf{T}} (\hat{X} w - Y) + n \lambda w \quad \text{(first-order condition)}
: w = (\hat{X}^{\mathsf{T}} \hat{X} + \lambda n I)^{-1} (\hat{X}^{\mathsf{T}} Y)
By construction of the optimization problem, other values of w give larger values for the loss function. This can be verified by examining the second derivative \nabla_{ww}.
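A minimal numpy sketch of this closed form, with a numerical check of the first-order condition (the data and variable names are assumptions for illustration):

```python
# Sketch: closed-form Tikhonov (ridge) solution
# w = (X^T X + lambda*n*I)^{-1} X^T Y, and a check that the gradient
# vanishes at the optimum.
import numpy as np

rng = np.random.default_rng(2)
n, d = 100, 5
X = rng.normal(size=(n, d))
Y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + 0.1 * rng.normal(size=n)
lam = 0.1

w = np.linalg.solve(X.T @ X + lam * n * np.eye(d), X.T @ Y)

# First-order condition: gradient of (1/n)||Xw - Y||^2 + lam*||w||^2 is ~0.
grad = (2 / n) * X.T @ (X @ w - Y) + 2 * lam * w
print(np.linalg.norm(grad))  # close to machine precision
```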
During training, this algorithm takes O(d^3 + nd^2) time. The terms correspond to the matrix inversion and calculating \hat{X}^{\mathsf{T}} \hat{X}, respectively. Testing takes O(nd) time.
Early stopping
Early stopping can be viewed as regularization in time. Intuitively, a training procedure such as gradient descent tends to learn more and more complex functions with increasing iterations. By regularizing for time, model complexity can be controlled, improving generalization.
Early stopping is implemented using one data set for training, one statistically independent data set for validation and another for testing. The model is trained until performance on the validation set no longer improves and then applied to the test set.
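A sketch of this procedure, assuming gradient descent on a least-squares objective and a simple patience rule for detecting when validation performance stops improving (a hypothetical numpy example; the patience heuristic is one of several possible stopping criteria):

```python
# Sketch of early stopping: gradient descent on a training set, with a
# held-out validation set monitored each iteration; training halts once
# validation error stops improving.
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(60, 20))
y = X[:, :3] @ np.ones(3) + 0.5 * rng.normal(size=60)
X_tr, y_tr = X[:40], y[:40]    # training set
X_val, y_val = X[40:], y[40:]  # independent validation set

w = np.zeros(20)
best_w, best_val = w.copy(), np.inf
step, patience, bad = 1e-3, 20, 0

for t in range(5000):
    w -= step * (2 / len(y_tr)) * X_tr.T @ (X_tr @ w - y_tr)
    val_err = np.mean((X_val @ w - y_val) ** 2)
    if val_err < best_val:
        best_val, best_w, bad = val_err, w.copy(), 0
    else:
        bad += 1
        if bad >= patience:  # validation no longer improves: stop
            break

print(t, best_val)  # best_w would then be applied to a test set
```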
Theoretical motivation in least squares
Consider the finite approximation of a Neumann series for an invertible matrix A where \|I - A\| < 1:
: \sum_{i=0}^{T-1} (I - A)^i \approx A^{-1}
This can be used to approximate the analytical solution of unregularized least squares, if \gamma is introduced to ensure that the norm is less than one:
: w_T = \frac{\gamma}{n} \sum_{i=0}^{T-1} \left(I - \frac{\gamma}{n} \hat{X}^{\mathsf{T}} \hat{X}\right)^i \hat{X}^{\mathsf{T}} \hat{Y}
The exact solution to the unregularized least squares learning problem minimizes the empirical error, but may fail to generalize. By limiting T, the only free parameter in the algorithm above, the problem is regularized for time, which may improve its generalization.
The algorithm above is equivalent to restricting the number of gradient descent iterations for the empirical risk
: I_s[w] = \frac{1}{2n} \left\| \hat{X} w - \hat{Y} \right\|^2_{\mathbb{R}^n}
with the gradient descent update:
: w_0 = 0
: w_{t+1} = \left(I - \frac{\gamma}{n} \hat{X}^{\mathsf{T}} \hat{X}\right) w_t + \frac{\gamma}{n} \hat{X}^{\mathsf{T}} \hat{Y}
The base case is trivial. The inductive case is proved as follows:
: w_T = \left(I - \frac{\gamma}{n} \hat{X}^{\mathsf{T}} \hat{X}\right) \frac{\gamma}{n} \sum_{i=0}^{T-2} \left(I - \frac{\gamma}{n} \hat{X}^{\mathsf{T}} \hat{X}\right)^i \hat{X}^{\mathsf{T}} \hat{Y} + \frac{\gamma}{n} \hat{X}^{\mathsf{T}} \hat{Y} = \frac{\gamma}{n} \sum_{i=0}^{T-1} \left(I - \frac{\gamma}{n} \hat{X}^{\mathsf{T}} \hat{X}\right)^i \hat{X}^{\mathsf{T}} \hat{Y}
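This equivalence can be checked numerically; the following sketch (a hypothetical numpy example with assumed dimensions and step size \gamma) compares T gradient-descent steps from w_0 = 0 against the truncated Neumann series:

```python
# Sketch: T gradient-descent steps on I_s[w] = (1/2n)||Xw - Y||^2 coincide
# with w_T = (gamma/n) * sum_{i<T} (I - (gamma/n) X^T X)^i X^T Y.
import numpy as np

rng = np.random.default_rng(4)
n, d, T = 30, 4, 25
X = rng.normal(size=(n, d))
Y = rng.normal(size=n)
gamma = 0.1  # small enough that ||I - (gamma/n) X^T X|| < 1

# T steps of gradient descent
w = np.zeros(d)
for _ in range(T):
    w = w - (gamma / n) * X.T @ (X @ w - Y)

# Truncated Neumann series
M = np.eye(d) - (gamma / n) * X.T @ X
w_neumann = (gamma / n) * sum(
    np.linalg.matrix_power(M, i) for i in range(T)) @ X.T @ Y

print(np.allclose(w, w_neumann))  # True
```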
Regularizers for sparsity
Assume that a dictionary \phi_j with dimension p is given such that a function in the function space can be expressed as:
: f(x) = \sum_{j=1}^{p} \phi_j(x) w_j
Enforcing a sparsity constraint on w can lead to simpler and more interpretable models. This is useful in many real-life applications such as computational biology. An example is developing a simple predictive test for a disease in order to minimize the cost of performing medical tests while maximizing predictive power.
A sensible sparsity constraint is the L_0 norm \|w\|_0, defined as the number of non-zero elements in w. Solving an L_0-regularized learning problem, however, has been demonstrated to be NP-hard.
The L_1 norm (see also Norms) can be used to approximate the optimal L_0 norm via convex relaxation. It can be shown that the L_1 norm induces sparsity. In the case of least squares, this problem is known as LASSO in statistics and basis pursuit in signal processing:
: \min_{w \in \mathbb{R}^p} \frac{1}{n} \left\| \hat{X} w - \hat{Y} \right\|^2 + \lambda \|w\|_1
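A sketch of solving this problem by iterative soft-thresholding (ISTA), one common proximal-gradient approach; the data and parameter choices below are assumptions for illustration:

```python
# Sketch: L1-regularized least squares (LASSO),
# (1/n)||Xw - Y||^2 + lambda*||w||_1, solved by iterative
# soft-thresholding (ISTA).
import numpy as np

rng = np.random.default_rng(5)
n, p = 80, 30
X = rng.normal(size=(n, p))
w_true = np.zeros(p)
w_true[:3] = [2.0, -3.0, 1.5]  # sparse ground truth
y = X @ w_true + 0.1 * rng.normal(size=n)

lam = 0.2
step = n / (2 * np.linalg.norm(X, 2) ** 2)  # 1 / Lipschitz constant

def soft_threshold(v, t):
    # Proximal operator of t * ||.||_1
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

w = np.zeros(p)
for _ in range(1000):
    grad = (2.0 / n) * X.T @ (X @ w - y)
    w = soft_threshold(w - step * grad, step * lam)

print(np.nonzero(w)[0])  # most coefficients are driven exactly to zero
```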
L_1 regularization can occasionally produce non-unique solutions. A simple example is provided in the figure when the space of possible solutions lies on a 45 degree line. This can be problematic for certain applications, and is overcome by combining L_1 with L_2 regularization in elastic net regularization, which takes the following form:
: \min_{w \in \mathbb{R}^p} \frac{1}{n} \left\| \hat{X} w - \hat{Y} \right\|^2 + \lambda \left( \alpha \|w\|_1 + (1 - \alpha) \|w\|_2^2 \right), \quad \alpha \in [0, 1]
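A sketch of the corresponding proximal-gradient update, where the elastic net proximal operator combines L1 soft-thresholding with L2 shrinkage (a hypothetical numpy example; the data and parameters are assumptions):

```python
# Sketch: proximal gradient for the elastic net objective
# (1/n)||Xw - Y||^2 + lambda*(alpha*||w||_1 + (1-alpha)*||w||_2^2).
import numpy as np

rng = np.random.default_rng(6)
n, p = 80, 30
X = rng.normal(size=(n, p))
w_true = np.zeros(p)
w_true[:3] = [2.0, -3.0, 1.5]  # sparse ground truth
y = X @ w_true + 0.1 * rng.normal(size=n)

lam, alpha = 0.2, 0.7
step = n / (2 * np.linalg.norm(X, 2) ** 2)  # step for (1/n)||Xw - y||^2

def prox_elastic_net(v, t):
    # L1 part: soft-threshold; L2 part: multiplicative shrinkage.
    shrunk = np.sign(v) * np.maximum(np.abs(v) - t * lam * alpha, 0.0)
    return shrunk / (1.0 + 2.0 * t * lam * (1.0 - alpha))

w = np.zeros(p)
for _ in range(1000):
    grad = (2.0 / n) * X.T @ (X @ w - y)
    w = prox_elastic_net(w - step * grad, step)

print(np.nonzero(w)[0])  # sparse, with L2 shrinkage on the survivors
```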