In mathematics, statistics, finance, and computer science, particularly in machine learning and inverse problems, regularization is a process that converts the answer to a problem into a simpler one. It is often used to obtain results for ill-posed problems or to prevent overfitting.
Although regularization procedures can be divided in many ways, the following delineation is particularly helpful:
* Explicit regularization is regularization whenever one explicitly adds a term to the optimization problem. These terms could be priors, penalties, or constraints. Explicit regularization is commonly employed with ill-posed optimization problems. The regularization term, or penalty, imposes a cost on the optimization function to make the optimal solution unique.
* Implicit regularization is all other forms of regularization. This includes, for example, early stopping, using a robust loss function, and discarding outliers. Implicit regularization is essentially ubiquitous in modern machine learning approaches, including stochastic gradient descent for training deep neural networks, and ensemble methods (such as random forests and gradient boosted trees). A minimal sketch contrasting the two approaches follows this list.
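To make the delineation concrete, here is a minimal sketch (a hypothetical numpy example; the data, step size, and variable names are all assumptions for illustration) of an explicit L2 penalty added to a least-squares data term, alongside implicit regularization by stopping gradient descent early:

```python
# Sketch contrasting explicit and implicit regularization on least squares.
# Illustrative only; none of these names come from the text.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 10))
y = X @ rng.normal(size=10) + 0.1 * rng.normal(size=50)

lam = 0.5  # regularization strength

def explicit_objective(w):
    # Explicit regularization: a penalty term is added to the data term.
    data_term = np.sum((X @ w - y) ** 2)
    penalty = lam * np.sum(w ** 2)  # L2 penalty on w
    return data_term + penalty

def implicit_fit(n_iters, step=1e-3):
    # Implicit regularization: no penalty term; stopping gradient
    # descent early limits model complexity instead.
    w = np.zeros(X.shape[1])
    for _ in range(n_iters):
        w -= step * 2 * X.T @ (X @ w - y)
    return w

print(explicit_objective(np.zeros(10)))    # objective at w = 0
print(np.linalg.norm(implicit_fit(10)))    # few iterations: small-norm w
print(np.linalg.norm(implicit_fit(5000)))  # many iterations: larger norm
```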
In explicit regularization, independent of the problem or model, there is always a data term that corresponds to a likelihood of the measurement, and a regularization term that corresponds to a prior. By combining both using Bayesian statistics, one can compute a posterior that includes both information sources and therefore stabilizes the estimation process. By trading off both objectives, one chooses to be more adherent to the data or to enforce generalization (to prevent overfitting). There is a whole research branch dealing with all possible regularizations. In practice, one usually tries a specific regularization and then figures out the probability density that corresponds to it, to justify the choice. It can also be physically motivated by common sense or intuition.
In machine learning, the data term corresponds to the training data, and the regularization is either the choice of the model or modifications to the algorithm. It is always intended to reduce the generalization error, i.e., the error score of the trained model on the evaluation set rather than the training data.
One of the earliest uses of regularization is Tikhonov regularization (ridge regression), related to the method of least squares.
Classification
Empirical learning of classifiers (from a finite data set) is always an underdetermined problem, because it attempts to infer a function of any x given only examples x_1, x_2, \dots, x_n.
A regularization term (or regularizer) R(f) is added to a loss function:
: \min_f \sum_{i=1}^{n} V(f(\hat{x}_i), \hat{y}_i) + \lambda R(f)
where V is an underlying loss function that describes the cost of predicting f(x) when the label is y, such as the square loss or hinge loss; and \lambda is a parameter which controls the importance of the regularization term. R(f) is typically chosen to impose a penalty on the complexity of f. Concrete notions of complexity used include restrictions for smoothness and bounds on the vector space norm.
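As a sketch of this objective, assuming a linear model f(x) = w \cdot x, the hinge loss for V, and an L2 regularizer R(w) = \|w\|_2^2 (a hypothetical numpy example; the data and names are illustrative, not from the text):

```python
# Sketch: regularized empirical risk sum_i V(f(x_i), y_i) + lambda * R(f)
# for a linear model, with hinge loss as V and R(w) = ||w||_2^2.
import numpy as np

def hinge_loss(score, label):
    # V(f(x), y) = max(0, 1 - y * f(x)), with labels in {-1, +1}
    return max(0.0, 1.0 - label * score)

def regularized_risk(w, X, y, lam):
    data_term = sum(hinge_loss(x @ w, yi) for x, yi in zip(X, y))
    return data_term + lam * np.dot(w, w)  # lambda * R(w)

X = np.array([[1.0, 2.0], [-1.0, -0.5], [0.5, 1.5]])
y = np.array([1, -1, 1])
print(regularized_risk(np.array([0.3, -0.2]), X, y, lam=0.1))
```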
A theoretical justification for regularization is that it attempts to impose Occam's razor on the solution (as depicted in the figure above, where the green function, the simpler one, may be preferred). From a Bayesian point of view, many regularization techniques correspond to imposing certain prior distributions on model parameters.
Regularization can serve multiple purposes, including learning simpler models, inducing models to be sparse and introducing group structure into the learning problem.
The same idea arose in many fields of science. A simple form of regularization applied to integral equations (Tikhonov regularization) is essentially a trade-off between fitting the data and reducing a norm of the solution. More recently, non-linear regularization methods, including total variation regularization, have become popular.
Generalization
Regularization can be motivated as a technique to improve the generalizability of a learned model.
The goal of this learning problem is to find a function that fits or predicts the outcome (label) that minimizes the expected error over all possible inputs and labels. The expected error of a function f_n is:
: I[f_n] = \int_{X \times Y} V(f_n(x), y) \, \rho(x, y) \, dx \, dy
where X and Y are the domains of input data x and their labels y respectively.
Typically in learning problems, only a subset of input data and labels are available, measured with some noise. Therefore, the expected error is unmeasurable, and the best surrogate available is the empirical error over the n available samples:
: I_S[f_n] = \frac{1}{n} \sum_{i=1}^{n} V(f_n(\hat{x}_i), \hat{y}_i)
Without bounds on the complexity of the function space (formally, the reproducing kernel Hilbert space) available, a model will be learned that incurs zero loss on the surrogate empirical error. If measurements (e.g. of x_i) were made with noise, this model may suffer from overfitting
and display poor expected error. Regularization introduces a penalty for exploring certain regions of the function space used to build the model, which can improve generalization.
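A small illustration of this effect, under the assumption of a degree-9 polynomial feature map and synthetic noisy data (a hypothetical numpy sketch, not a prescribed experiment):

```python
# Sketch: with noisy labels and a rich hypothesis space (degree-9
# polynomials), the unpenalized fit reaches near-zero empirical error but
# typically shows worse held-out error than a slightly penalized fit.
import numpy as np

rng = np.random.default_rng(1)
x_train = np.linspace(0, 1, 12)
y_train = np.sin(2 * np.pi * x_train) + 0.3 * rng.normal(size=12)
x_test = np.linspace(0, 1, 200)
y_test = np.sin(2 * np.pi * x_test)

def design(x, degree=9):
    return np.vander(x, degree + 1)  # polynomial feature map

def fit(lam):
    A = design(x_train)
    # Penalized least squares: (A^T A + lam I)^{-1} A^T y
    return np.linalg.solve(A.T @ A + lam * np.eye(A.shape[1]), A.T @ y_train)

for lam in (0.0, 1e-3):
    w = fit(lam)
    test_mse = np.mean((design(x_test) @ w - y_test) ** 2)
    print(f"lambda={lam}: held-out MSE = {test_mse:.3f}")
```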
Tikhonov regularization
These techniques are named for Andrey Nikolayevich Tikhonov, who applied regularization to integral equations and made important contributions in many other areas.
When learning a linear function f, characterized by an unknown vector w such that f(x) = w \cdot x, one can add the L_2-norm of the vector w to the loss expression in order to prefer solutions with smaller norms. Tikhonov regularization is one of the most common forms. It is also known as ridge regression. It is expressed as:
: \min_w \sum_{i=1}^{n} V(\hat{x}_i \cdot w, \hat{y}_i) + \lambda \|w\|_2^2,
where (\hat{x}_i, \hat{y}_i), \ 1 \le i \le n, would represent samples used for training.
In the case of a general function, the norm of the function in its reproducing kernel Hilbert space is used:
: \min_f \sum_{i=1}^{n} V(f(\hat{x}_i), \hat{y}_i) + \lambda \|f\|_{\mathcal{H}}^2
As the L_2 norm is differentiable, learning can be advanced by gradient descent.
Tikhonov-regularized least squares
The learning problem with the least squares loss function and Tikhonov regularization can be solved analytically. Written in matrix form, the optimal w is the one for which the gradient of the loss function with respect to w is 0:
: \min_w \frac{1}{n} (\hat{X} w - Y)^{\mathsf{T}} (\hat{X} w - Y) + \lambda \|w\|_2^2
: \nabla_w = \frac{2}{n} \hat{X}^{\mathsf{T}} (\hat{X} w - Y) + 2 \lambda w
: 0 = \hat{X}^{\mathsf{T}} (\hat{X} w - Y) + n \lambda w \quad \text{(first-order condition)}
: w = (\hat{X}^{\mathsf{T}} \hat{X} + \lambda n I)^{-1} (\hat{X}^{\mathsf{T}} Y)
By construction of the optimization problem, other values of w give larger values for the loss function. This can be verified by examining the second derivative \nabla_{ww}.
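A minimal numpy sketch of this closed form, with a numerical check of the first-order condition (the data and variable names are assumptions for illustration):

```python
# Sketch: closed-form Tikhonov (ridge) solution
# w = (X^T X + lambda*n*I)^{-1} X^T Y, and a check that the gradient
# vanishes at the optimum.
import numpy as np

rng = np.random.default_rng(2)
n, d = 100, 5
X = rng.normal(size=(n, d))
Y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + 0.1 * rng.normal(size=n)
lam = 0.1

w = np.linalg.solve(X.T @ X + lam * n * np.eye(d), X.T @ Y)

# First-order condition: gradient of (1/n)||Xw - Y||^2 + lam*||w||^2 is ~0.
grad = (2 / n) * X.T @ (X @ w - Y) + 2 * lam * w
print(np.linalg.norm(grad))  # close to machine precision
```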
During training, this algorithm takes O(d^3 + nd^2) time. The terms correspond to the matrix inversion and calculating \hat{X}^{\mathsf{T}} \hat{X}, respectively. Testing takes O(nd) time.
Early stopping
Early stopping can be viewed as regularization in time. Intuitively, a training procedure such as gradient descent tends to learn more and more complex functions with increasing iterations. By regularizing for time, model complexity can be controlled, improving generalization.
Early stopping is implemented using one data set for training, one statistically independent data set for validation and another for testing. The model is trained until performance on the validation set no longer improves and then applied to the test set.
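A sketch of this procedure, assuming gradient descent on a least-squares objective and a simple patience rule for detecting when validation performance stops improving (a hypothetical numpy example; the patience heuristic is one of several possible stopping criteria):

```python
# Sketch of early stopping: gradient descent on a training set, with a
# held-out validation set monitored each iteration; training halts once
# validation error stops improving.
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(60, 20))
y = X[:, :3] @ np.ones(3) + 0.5 * rng.normal(size=60)
X_tr, y_tr = X[:40], y[:40]    # training set
X_val, y_val = X[40:], y[40:]  # independent validation set

w = np.zeros(20)
best_w, best_val = w.copy(), np.inf
step, patience, bad = 1e-3, 20, 0

for t in range(5000):
    w -= step * (2 / len(y_tr)) * X_tr.T @ (X_tr @ w - y_tr)
    val_err = np.mean((X_val @ w - y_val) ** 2)
    if val_err < best_val:
        best_val, best_w, bad = val_err, w.copy(), 0
    else:
        bad += 1
        if bad >= patience:  # validation no longer improves: stop
            break

print(t, best_val)  # best_w would then be applied to a test set
```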
Theoretical motivation in least squares
Consider the finite approximation of a Neumann series for an invertible matrix A where \|I - A\| < 1:
: \sum_{i=0}^{T-1} (I - A)^i \approx A^{-1}
This can be used to approximate the analytical solution of unregularized least squares, if \gamma is introduced to ensure that the norm is less than one:
: w_T = \frac{\gamma}{n} \sum_{i=0}^{T-1} \left(I - \frac{\gamma}{n} \hat{X}^{\mathsf{T}} \hat{X}\right)^i \hat{X}^{\mathsf{T}} \hat{Y}
The exact solution to the unregularized least squares learning problem minimizes the empirical error, but may fail to generalize. By limiting T, the only free parameter in the algorithm above, the problem is regularized for time, which may improve its generalization.
The algorithm above is equivalent to restricting the number of gradient descent iterations for the empirical risk
: I_s[w] = \frac{1}{2n} \left\| \hat{X} w - \hat{Y} \right\|^2_{\mathbb{R}^n}
with the gradient descent update:
: w_0 = 0
: w_{t+1} = \left(I - \frac{\gamma}{n} \hat{X}^{\mathsf{T}} \hat{X}\right) w_t + \frac{\gamma}{n} \hat{X}^{\mathsf{T}} \hat{Y}
The base case is trivial. The inductive case is proved as follows:
: w_T = \left(I - \frac{\gamma}{n} \hat{X}^{\mathsf{T}} \hat{X}\right) \frac{\gamma}{n} \sum_{i=0}^{T-2} \left(I - \frac{\gamma}{n} \hat{X}^{\mathsf{T}} \hat{X}\right)^i \hat{X}^{\mathsf{T}} \hat{Y} + \frac{\gamma}{n} \hat{X}^{\mathsf{T}} \hat{Y} = \frac{\gamma}{n} \sum_{i=0}^{T-1} \left(I - \frac{\gamma}{n} \hat{X}^{\mathsf{T}} \hat{X}\right)^i \hat{X}^{\mathsf{T}} \hat{Y}
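This equivalence can be checked numerically; the following sketch (a hypothetical numpy example with assumed dimensions and step size \gamma) compares T gradient-descent steps from w_0 = 0 against the truncated Neumann series:

```python
# Sketch: T gradient-descent steps on I_s[w] = (1/2n)||Xw - Y||^2 coincide
# with w_T = (gamma/n) * sum_{i<T} (I - (gamma/n) X^T X)^i X^T Y.
import numpy as np

rng = np.random.default_rng(4)
n, d, T = 30, 4, 25
X = rng.normal(size=(n, d))
Y = rng.normal(size=n)
gamma = 0.1  # small enough that ||I - (gamma/n) X^T X|| < 1

# T steps of gradient descent
w = np.zeros(d)
for _ in range(T):
    w = w - (gamma / n) * X.T @ (X @ w - Y)

# Truncated Neumann series
M = np.eye(d) - (gamma / n) * X.T @ X
w_neumann = (gamma / n) * sum(
    np.linalg.matrix_power(M, i) for i in range(T)) @ X.T @ Y

print(np.allclose(w, w_neumann))  # True
```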
Regularizers for sparsity
Assume that a dictionary \phi_j with dimension p is given such that a function in the function space can be expressed as:
: f(x) = \sum_{j=1}^{p} \phi_j(x) w_j
Enforcing a sparsity constraint on w can lead to simpler and more interpretable models. This is useful in many real-life applications such as computational biology. An example is developing a simple predictive test for a disease in order to minimize the cost of performing medical tests while maximizing predictive power.
A sensible sparsity constraint is the L_0 norm \|w\|_0, defined as the number of non-zero elements in w. Solving an L_0-regularized learning problem, however, has been demonstrated to be NP-hard.
The L_1 norm (see also Norms) can be used to approximate the optimal L_0 norm via convex relaxation. It can be shown that the L_1 norm induces sparsity. In the case of least squares, this problem is known as LASSO in statistics and basis pursuit in signal processing:
: \min_{w \in \mathbb{R}^p} \frac{1}{n} \left\| \hat{X} w - \hat{Y} \right\|^2 + \lambda \|w\|_1
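A sketch of solving this problem by iterative soft-thresholding (ISTA), one common proximal-gradient approach; the data and parameter choices below are assumptions for illustration:

```python
# Sketch: L1-regularized least squares (LASSO),
# (1/n)||Xw - Y||^2 + lambda*||w||_1, solved by iterative
# soft-thresholding (ISTA).
import numpy as np

rng = np.random.default_rng(5)
n, p = 80, 30
X = rng.normal(size=(n, p))
w_true = np.zeros(p)
w_true[:3] = [2.0, -3.0, 1.5]  # sparse ground truth
y = X @ w_true + 0.1 * rng.normal(size=n)

lam = 0.2
step = n / (2 * np.linalg.norm(X, 2) ** 2)  # 1 / Lipschitz constant

def soft_threshold(v, t):
    # Proximal operator of t * ||.||_1
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

w = np.zeros(p)
for _ in range(1000):
    grad = (2.0 / n) * X.T @ (X @ w - y)
    w = soft_threshold(w - step * grad, step * lam)

print(np.nonzero(w)[0])  # most coefficients are driven exactly to zero
```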
L_1 regularization can occasionally produce non-unique solutions. A simple example is provided in the figure when the space of possible solutions lies on a 45 degree line. This can be problematic for certain applications, and is overcome by combining L_1 with L_2 regularization in elastic net regularization, which takes the following form:
: \min_{w \in \mathbb{R}^p} \frac{1}{n} \left\| \hat{X} w - \hat{Y} \right\|^2 + \lambda \left( \alpha \|w\|_1 + (1 - \alpha) \|w\|_2^2 \right), \quad \alpha \in [0, 1]
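A sketch of the corresponding proximal-gradient update, where the elastic net proximal operator combines L1 soft-thresholding with L2 shrinkage (a hypothetical numpy example; the data and parameters are assumptions):

```python
# Sketch: proximal gradient for the elastic net objective
# (1/n)||Xw - Y||^2 + lambda*(alpha*||w||_1 + (1-alpha)*||w||_2^2).
import numpy as np

rng = np.random.default_rng(6)
n, p = 80, 30
X = rng.normal(size=(n, p))
w_true = np.zeros(p)
w_true[:3] = [2.0, -3.0, 1.5]  # sparse ground truth
y = X @ w_true + 0.1 * rng.normal(size=n)

lam, alpha = 0.2, 0.7
step = n / (2 * np.linalg.norm(X, 2) ** 2)  # step for (1/n)||Xw - y||^2

def prox_elastic_net(v, t):
    # L1 part: soft-threshold; L2 part: multiplicative shrinkage.
    shrunk = np.sign(v) * np.maximum(np.abs(v) - t * lam * alpha, 0.0)
    return shrunk / (1.0 + 2.0 * t * lam * (1.0 - alpha))

w = np.zeros(p)
for _ in range(1000):
    grad = (2.0 / n) * X.T @ (X @ w - y)
    w = prox_elastic_net(w - step * grad, step)

print(np.nonzero(w)[0])  # sparse, with L2 shrinkage on the survivors
```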