Regularized least squares (RLS) is a family of methods for solving the

least-squares The method of least squares is a standard approach in regression analysis to approximate the solution of overdetermined systems (sets of equations in which there are more equations than unknowns) by minimizing the sum of the squares of the res ...

problem while using

regularization Regularization may refer to: * Regularization (linguistics) * Regularization (mathematics) * Regularization (physics) In physics, especially quantum field theory, regularization is a method of modifying observables which have singularities in ...

to further constrain the resulting solution. RLS is used for two main reasons. The first comes up when the number of variables in the linear system exceeds the number of observations. In such settings, the ordinary least-squares problem is

ill-posed The mathematical term well-posed problem stems from a definition given by 20th-century French mathematician Jacques Hadamard. He believed that mathematical models of physical phenomena should have the properties that: # a solution exists, # the sol ...

and is therefore impossible to fit because the associated optimization problem has infinitely many solutions. RLS allows the introduction of further constraints that uniquely determine the solution. The second reason for using RLS arises when the learned model suffers from poor

generalization A generalization is a form of abstraction whereby common properties of specific instances are formulated as general concepts or claims. Generalizations posit the existence of a domain or set of elements, as well as one or more common characte ...

. RLS can be used in such cases to improve the generalizability of the model by constraining it at training time. This constraint can either force the solution to be "sparse" in some way or to reflect other prior knowledge about the problem such as information about correlations between features. A

Bayesian Thomas Bayes (/beɪz/; c. 1701 – 1761) was an English statistician, philosopher, and Presbyterian minister. Bayesian () refers either to a range of concepts and approaches that relate to statistical methods based on Bayes' theorem, or a followe ...

understanding of this can be reached by showing that RLS methods are often equivalent to

priors Prior (or prioress) is an ecclesiastical title for a superior in some religious orders. The word is derived from the Latin for "earlier" or "first". Its earlier generic usage referred to any monastic superior. In abbeys, a prior would be ...

on the solution to the least-squares problem.

General formulation

Consider a learning setting given by a probabilistic space

(X \times Y, \rho(X,Y))

Y \in R

. Let

S=\_^n

denote a training set of

n

pairs i.i.d. with respect to

\rho

. Let

V:Y \times R \rightarrow

symmetric Symmetry (from grc, συμμετρία "agreement in dimensions, due proportion, arrangement") in everyday language refers to a sense of harmonious and beautiful proportion and balance. In mathematics, "symmetry" has a more precise definiti ...

positive-definite kernel function

K(x,z)

with the reproducing property: :

\langle K_x,f\rangle_=f(x),

where

K_x(z)=K(x,z)

. The RKHS for a kernel

K

consists of the complete metric space#Completion, completion of the space of functions spanned by

\left\

f(x)=\sum_^n \alpha_i K_(x),\, f\in\mathcal

, where all

\alpha_i

are real numbers. Some commonly used kernels include the linear kernel, inducing the space of linear functions: :

K(x,z)=x^T z,

the polynomial kernel, inducing the space of polynomial functions of order

d

: :

K(x,z)=(x^T z+1)^d,

and the Gaussian kernel: :

K(x,z)=e^.

Note that for an arbitrary loss function

V

, this approach defines a general class of algorithms named Tikhonov regularization. For instance, using the

hinge loss In machine learning, the hinge loss is a loss function used for training classifiers. The hinge loss is used for "maximum-margin" classification, most notably for support vector machines (SVMs). For an intended output and a classifier score , th ...

leads to the support vector machine algorithm, and using the epsilon-insensitive loss leads to

support vector regression In machine learning, support vector machines (SVMs, also support vector networks) are supervised learning models with associated learning algorithms that analyze data for classification and regression analysis. Developed at AT&T Bell Laboratories ...

Arbitrary kernel

The

representer theorem For computer science, in statistical learning theory, a representer theorem is any of several related results stating that a minimizer f^ of a regularized empirical risk functional defined over a reproducing kernel Hilbert space can be represente ...

guarantees that the solution can be written as: :

f(x) = \sum_^n c_i K(x_i,x)

for some

c \in \mathbb R^n

. The minimization problem can be expressed as: :

\min_\frac\, Y-Kc\, ^2_ + \lambda\, f\, ^2_H,

where, with some abuse of notation, the

i,j

entry of kernel matrix

K

(as opposed to kernel function

K(\cdot, \cdot)

) is

K(x_i, x_j)

. For such a function, :

\begin
& \, f\, ^2_H = \langle f,f \rangle_ =\left\langle \sum_^n c_i K(x_i,\cdot), \sum_^n c_j K(x_,\cdot) \right\rangle_H \\
=  & \sum_^n \sum_^n c_i c_j \langle K(x_i,\cdot), K(x_j,\cdot) \rangle_H = \sum_^n \sum_^n c_i c_j K(x_i,x_j) = c^T Kc,
\end

The following minimization problem can be obtained: :

\min_\frac\, Y-Kc\, ^2_ + \lambda c^T Kc

. As the sum of convex functions is convex, the solution is unique and its minimum can be found by setting the gradient w.r.t

c

0

: :

-\fracK(Y-Kc) + \lambda Kc = 0 \Rightarrow K(K+\lambda n I)c = K Y \Rightarrow c  = (K+\lambda n I)^Y,

where

c \in\mathbb R^n.

Complexity

The complexity of training is basically the cost of computing the kernel matrix plus the cost of solving the linear system which is roughly

O(n^3)

. The computation of the kernel matrix for the linear or

Gaussian kernel In mathematics, a Gaussian function, often simply referred to as a Gaussian, is a function of the base form f(x) = \exp (-x^2) and with parametric extension f(x) = a \exp\left( -\frac \right) for arbitrary real constants , and non-zero . It is ...

O(n^2 D)

. The complexity of testing is

O(n)

Prediction

The prediction at a new test point

x_

is: :

f(x_) = \sum_^n c_i K(x_i,x_) = K(X,X_)^T c

Linear kernel

For convenience a vector notation is introduced. Let

X

be an

n\times d

matrix, where the rows are input vectors, and

Y

n\times 1

vector where the entries are corresponding outputs. In terms of vectors, the kernel matrix can be written as

\operatorname K=\operatorname X\operatorname X^T

. The learning function can be written as: :

f(x_)  = \operatorname K_c
  =  x_^T \operatorname X^T c
  =  x_^T w

Here we define

w = X^T c, w \in R^d

. The objective function can be rewritten as: :

= & \frac\, y-\operatorname X\operatorname X^T c\, ^2_+\lambda c^T \operatorname X\operatorname X^T c = \frac\, y-\operatorname Xw\, ^2_+\lambda \, w\, ^2_ \end

The first term is the objective function from

ordinary least squares In statistics, ordinary least squares (OLS) is a type of linear least squares method for choosing the unknown parameters in a linear regression model (with fixed level-one effects of a linear function of a set of explanatory variables) by the ...

(OLS) regression, corresponding to the

residual sum of squares In statistics, the residual sum of squares (RSS), also known as the sum of squared estimate of errors (SSE), is the sum of the squares of residuals (deviations predicted from actual empirical values of data). It is a measure of the discrepan ...

. The second term is a regularization term, not present in OLS, which penalizes large

w

values. As a smooth finite dimensional problem is considered and it is possible to apply standard calculus tools. In order to minimize the objective function, the gradient is calculated with respect to

w

and set it to zero: :

\operatorname X^T \operatorname Xw-\operatorname X^T y+\lambda n w=0

w=(\operatorname X^T \operatorname X+\lambda n \operatorname I)^\operatorname X^T y

This solution closely resembles that of standard linear regression, with an extra term

\lambda\operatorname I

. If the assumptions of OLS regression hold, the solution

w=(\operatorname X^T\operatorname X)^\operatorname X^T y

, with

\lambda=0

, is an unbiased estimator, and is the minimum-variance linear unbiased estimator, according to the

Gauss–Markov theorem In statistics, the Gauss–Markov theorem (or simply Gauss theorem for some authors) states that the ordinary least squares (OLS) estimator has the lowest sampling variance within the class of linear unbiased estimators, if the errors in the ...

. The term

\lambda n \operatorname I

therefore leads to a biased solution; however, it also tends to reduce variance. This is easy to see, as the

covariance In probability theory and statistics, covariance is a measure of the joint variability of two random variables. If the greater values of one variable mainly correspond with the greater values of the other variable, and the same holds for the ...

matrix of the

w

-values is proportional to

(\operatorname X^T \operatorname X+\lambda n \operatorname I)^

, and therefore large values of

\lambda

will lead to lower variance. Therefore, manipulating

\lambda

corresponds to trading-off bias and variance. For problems with high-variance

w

estimates, such as cases with relatively small

n

or with correlated regressors, the optimal prediction accuracy may be obtained by using a nonzero

\lambda

, and thus introducing some bias to reduce variance. Furthermore, it is not uncommon in

machine learning Machine learning (ML) is a field of inquiry devoted to understanding and building methods that 'learn', that is, methods that leverage data to improve performance on some set of tasks. It is seen as a part of artificial intelligence. Machine ...

to have cases where

n, in which case X^T X is

rank Rank is the relative position, value, worth, complexity, power, importance, authority, level, etc. of a person or object within a ranking, such as: Level or position in a hierarchical organization * Academic rank * Diplomatic rank * Hierarchy * ...

-deficient, and a nonzero

\lambda

is necessary to compute

(\operatorname X^T \operatorname X+\lambda n \operatorname I)^

Complexity

The parameter

\lambda

controls the invertibility of the matrix

X^T X + \lambda n I

. Several methods can be used to solve the above linear system,

Cholesky decomposition In linear algebra, the Cholesky decomposition or Cholesky factorization (pronounced ) is a decomposition of a Hermitian, positive-definite matrix into the product of a lower triangular matrix and its conjugate transpose, which is useful for effici ...

being probably the method of choice, since the matrix

X^T X + \lambda n I

and

positive definite In mathematics, positive definiteness is a property of any object to which a bilinear form or a sesquilinear form may be naturally associated, which is positive-definite. See, in particular: * Positive-definite bilinear form * Positive-definite f ...

. The complexity of this method is

O(nD^2)

for training and

O(D)

for testing. The cost

O(nD^2)

is essentially that of computing

X^T X

, whereas the inverse computation (or rather the solution of the linear system) is roughly

O(D^3)

Feature maps and Mercer's theorem

In this section it will be shown how to extend RLS to any kind of reproducing kernel K. Instead of linear kernel a feature map is considered

\Phi: X \rightarrow F

for some Hilbert space

F

, called the feature space. In this case the kernel is defined as: The matrix

X

is now replaced by the new data matrix

\Phi

, where

\Phi_ = \varphi_j(x_i)

, or the

j

-th component of the

\varphi(x_i)

. :

K(x,x') = \langle \Phi(x), \Phi(x') \rangle_F.

It means that for a given training set

K = \Phi \Phi^T

. Thus, the objective function can be written as :

\min_\, Y - \Phi \Phi^T c \, ^2_ + \lambda c^T \Phi \Phi^T c.

This approach is known as the

kernel trick In machine learning, kernel machines are a class of algorithms for pattern analysis, whose best known member is the support-vector machine (SVM). The general task of pattern analysis is to find and study general types of relations (for example ...

. This technique can significantly simplify the computational operations. If

F

is high dimensional, computing

\varphi(x_i)

may be rather intensive. If the explicit form of the kernel function is known, we just need to compute and store the

n\times n

kernel matrix

\operatorname K

. In fact, the Hilbert space

F

need not be isomorphic to

\mathbb^m

, and can be infinite dimensional. This follows from

Mercer's theorem In mathematics, specifically functional analysis, Mercer's theorem is a representation of a symmetric positive-definite function on a square as a sum of a convergent sequence of product functions. This theorem, presented in , is one of the most no ...

, which states that a continuous, symmetric, positive definite kernel function can be expressed as :

K(x,z)=\sum_^\infty \sigma_i e_i(x) e_i(z)

where

e_i(x)

form an

orthonormal basis In mathematics, particularly linear algebra, an orthonormal basis for an inner product space ''V'' with finite dimension is a basis for V whose vectors are orthonormal, that is, they are all unit vectors and orthogonal to each other. For examp ...

for

\ell^2(X)

, and

\sigma_i \in\mathbb

. If feature maps is defined

\varphi(x)

with components

\varphi_i(x)=\sqrte_i(x)

, it follows that

K(x,z)=\langle\varphi(x),\varphi(z)\rangle

. This demonstrates that any kernel can be associated with a feature map, and that RLS generally consists of linear RLS performed in some possibly higher-dimensional feature space. While Mercer's theorem shows how one feature map that can be associated with a kernel, in fact multiple feature maps can be associated with a given reproducing kernel. For instance, the map

\varphi(x)=K_x

satisfies the property

K(x,z)=\langle\varphi(x), \varphi(z) \rangle

for an arbitrary reproducing kernel.

Bayesian interpretation

Least squares can be viewed as a likelihood maximization under an assumption of normally distributed residuals. This is because the exponent of the Gaussian distribution is quadratic in the data, and so is the least-squares objective function. In this framework, the regularization terms of RLS can be understood to be encoding

w

. For instance, Tikhonov regularization corresponds to a normally distributed prior on

w

that is centered at 0. To see this, first note that the OLS objective is proportional to the

log-likelihood The likelihood function (often simply called the likelihood) represents the probability of random variable realizations conditional on particular values of the statistical parameters. Thus, when evaluated on a given sample, the likelihood functi ...

function when each sampled

y^i

is normally distributed around

w^T \cdot x^i

. Then observe that a normal prior on

w

centered at 0 has a log-probability of the form :

\log P(w) = q - \alpha \sum_^d w_j^2

where

q

and

\alpha

are constants that depend on the variance of the prior and are independent of

w

. Thus, minimizing the logarithm of the likelihood times the prior is equivalent to minimizing the sum of the OLS loss function and the ridge regression regularization term. This gives a more intuitive interpretation for why

Tikhonov regularization Ridge regression is a method of estimating the coefficients of multiple-regression models in scenarios where the independent variables are highly correlated. It has been used in many fields including econometrics, chemistry, and engineering. Also ...

leads to a unique solution to the least-squares problem: there are infinitely many vectors

w

satisfying the constraints obtained from the data, but since we come to the problem with a prior belief that

w

is normally distributed around the origin, we will end up choosing a solution with this constraint in mind. Other regularization methods correspond to different priors. See the

list A ''list'' is any set of items in a row. List or lists may also refer to: People * List (surname) Organizations * List College, an undergraduate division of the Jewish Theological Seminary of America * SC Germania List, German rugby unio ...

below for more details.

Specific examples

Ridge regression (or Tikhonov regularization)

One particularly common choice for the penalty function

R

is the squared

\ell_2

norm, i.e., :

R(w) = \sum_^d w_j^2

\frac\, Y-\operatorname Xw\, ^2_2+\lambda  \sum_^d , w_j, ^2 \rightarrow  \min_

The most common names for this are called

and

ridge regression Ridge regression is a method of estimating the coefficients of multiple- regression models in scenarios where the independent variables are highly correlated. It has been used in many fields including econometrics, chemistry, and engineering. Als ...

. It admits a closed-form solution for

w

: :

w = (X^T X + \lambda I)^ X^T Y

The name ridge regression alludes to the fact that the

\lambda I

term adds positive entries along the diagonal "ridge" of the sample covariance matrix

X^T X

. When

\lambda=0

, i.e., in the case of

, the condition that

d > n

causes the sample covariance matrix

X^T X

to not have full rank and so it cannot be inverted to yield a unique solution. This is why there can be an infinitude of solutions to the

problem when

d > n

. However, when

\lambda > 0

, i.e., when ridge regression is used, the addition of

\lambda I

to the sample covariance matrix ensures that all of its eigenvalues will be strictly greater than 0. In other words, it becomes invertible, and the solution becomes unique. Compared to ordinary least squares, ridge regression is not unbiased. It accepts little bias to reduce variance and the mean square error, and helps to improve the prediction accuracy. Thus, ridge estimator yields more stable solutions by shrinking coefficients but suffers from the lack of sensitivity to the data.

Lasso regression

The least absolute selection and shrinkage (LASSO) method is another popular choice. In

lasso regression In statistics and machine learning, lasso (least absolute shrinkage and selection operator; also Lasso or LASSO) is a regression analysis method that performs both variable selection and regularization in order to enhance the prediction accuracy ...

, the lasso penalty function

R

is the

\ell_1

norm, i.e. :

R(w) = \sum_^d \left,  w_j \

\frac\, Y-\operatorname Xw\, ^2_2+\lambda  \sum_^d , w_j,  \rightarrow  \min_

Note that the lasso penalty function is convex but not strictly convex. Unlike

, this scheme does not have a convenient closed-form solution: instead, the solution is typically found using

quadratic programming Quadratic programming (QP) is the process of solving certain mathematical optimization problems involving quadratic functions. Specifically, one seeks to optimize (minimize or maximize) a multivariate quadratic function subject to linear constr ...

or more general

convex optimization Convex optimization is a subfield of mathematical optimization that studies the problem of minimizing convex functions over convex sets (or, equivalently, maximizing concave functions over convex sets). Many classes of convex optimization pr ...

methods, as well as by specific algorithms such as the

least-angle regression In statistics, least-angle regression (LARS) is an algorithm for fitting linear regression models to high-dimensional data, developed by Bradley Efron, Trevor Hastie, Iain Johnstone and Robert Tibshirani. Suppose we expect a response variable ...

algorithm. An important difference between lasso regression and Tikhonov regularization is that lasso regression forces more entries of

w

to actually equal 0 than would otherwise. In contrast, while Tikhonov regularization forces entries of

w

to be small, it does not force more of them to be 0 than would be otherwise. Thus, LASSO regularization is more appropriate than Tikhonov regularization in cases in which we expect the number of non-zero entries of

w

to be small, and Tikhonov regularization is more appropriate when we expect that entries of

w

will generally be small but not necessarily zero. Which of these regimes is more relevant depends on the specific data set at hand. Besides feature selection described above, LASSO has some limitations. Ridge regression provides better accuracy in the case

n > d

for highly correlated variables. In another case,

n < d

, LASSO selects at most

n

variables. Moreover, LASSO tends to select some arbitrary variables from group of highly correlated samples, so there is no grouping effect.

''ℓ''₀ Penalization

\frac\, Y-\operatorname Xw\, ^2_2+\lambda \, w_j\, _0 \rightarrow  \min_

The most extreme way to enforce sparsity is to say that the actual magnitude of the coefficients of

w

does not matter; rather, the only thing that determines the complexity of

w

is the number of non-zero entries. This corresponds to setting

R(w)

to be the

\ell_0

norm of

w

. This regularization function, while attractive for the sparsity that it guarantees, is very difficult to solve because doing so requires optimization of a function that is not even weakly

convex Convex or convexity may refer to: Science and technology * Convex lens, in optics Mathematics * Convex set, containing the whole line segment that joins points ** Convex polygon, a polygon which encloses a convex set of points ** Convex polytop ...

. Lasso regression is the minimal possible relaxation of

\ell_0

penalization that yields a weakly convex optimization problem.

Elastic net

For any non-negative

\lambda_1

and

\lambda_2

the objective has the following form: :

\frac\, Y-\operatorname Xw\, ^2_2+\lambda_1 \sum_^d , w_j,  + \lambda_2 \sum_^d , w_j, ^2 \rightarrow  \min_

Let

\alpha = \frac

, then the solution of the minimization problem is described as: :

\frac\, Y-\operatorname Xw\, ^2_2 \rightarrow  \min_ \text (1-\alpha)\, w\, _1 + \alpha \, w\, _2 \leq t

for some

t

. Consider

(1-\alpha)\, w\, _1 + \alpha \, w\, _2 \leq t

as an Elastic Net penalty function. When

\alpha = 1

, elastic net becomes ridge regression, whereas

\alpha = 0

it becomes Lasso.

\forall \alpha \in (0,1]

Elastic Net penalty function doesn't have the first derivative at 0 and it is strictly convex

\forall \alpha > 0

taking the properties both

and

. One of the main properties of the Elastic Net is that it can select groups of correlated variables. The difference between weight vectors of samples

x_i

and

x_j

is given by: :

, w^_i(\lambda_1, \lambda_2) -  w^_j(\lambda_1, \lambda_2),  \leq \frac\sqrt

, where

\rho_ = x_i^T x_j

. If

x_i

and

x_j

are highly correlated (

\rho_ \rightarrow 1

), the weight vectors are very close. In the case of negatively correlated samples (

\rho_ \rightarrow -1

) the samples

-x_j

can be taken. To summarize, for highly correlated variables the weight vectors tend to be equal up to a sign in the case of negative correlated variables.

Partial list of RLS methods

The following is a list of possible choices of the regularization function

R(\cdot)

, along with the name for each one, the corresponding prior if there is a simple one, and ways for computing the solution to the resulting optimization problem.

References

{{Reflist

External links

http://www.stanford.edu/~hastie/TALKS/enet_talk.pdf Regularization and Variable Selection via the Elastic Net
(presentation)
Regularized Least Squares and Support Vector Machines
(presentation)
Regularized Least Squares
presentation) Least squares Linear algebra Inverse problems

General formulation

Arbitrary kernel

Complexity

Prediction

Linear kernel

Complexity

Feature maps and Mercer's theorem

Bayesian interpretation

Specific examples

Ridge regression (or Tikhonov regularization)

Lasso regression

''ℓ''0 Penalization

Elastic net

Partial list of RLS methods

See also

References

External links

''ℓ''₀ Penalization