Spectral regularization is any of a class of

regularization Regularization may refer to: * Regularization (linguistics) * Regularization (mathematics) * Regularization (physics) * Regularization (solid modeling) * Regularization Law, an Israeli law intended to retroactively legalize settlements See also ...

techniques used in

machine learning Machine learning (ML) is a field of study in artificial intelligence concerned with the development and study of Computational statistics, statistical algorithms that can learn from data and generalise to unseen data, and thus perform Task ( ...

to control the impact of noise and prevent

overfitting In mathematical modeling, overfitting is "the production of an analysis that corresponds too closely or exactly to a particular set of data, and may therefore fail to fit to additional data or predict future observations reliably". An overfi ...

. Spectral regularization can be used in a broad range of applications, from deblurring images to classifying emails into a spam folder and a non-spam folder. For instance, in the email classification example, spectral regularization can be used to reduce the impact of noise and prevent overfitting when a machine learning system is being trained on a labeled set of emails to learn how to tell a spam and a non-spam email apart. Spectral regularization algorithms rely on methods that were originally defined and studied in the theory of ill-posed

inverse problem An inverse problem in science is the process of calculating from a set of observations the causal factors that produced them: for example, calculating an image in X-ray computed tomography, sound source reconstruction, source reconstruction in ac ...

s (for instance, see) focusing on the inversion of a linear operator (or a matrix) that possibly has a bad

condition number In numerical analysis, the condition number of a function measures how much the output value of the function can change for a small change in the input argument. This is used to measure how sensitive a function is to changes or errors in the inpu ...

or an unbounded inverse. In this context, regularization amounts to substituting the original operator by a

bounded operator In functional analysis and operator theory, a bounded linear operator is a linear transformation L : X \to Y between topological vector spaces (TVSs) X and Y that maps bounded subsets of X to bounded subsets of Y. If X and Y are normed vector ...

called the "regularization operator" that has a condition number controlled by a regularization parameter,L. Lo Gerfo, L. Rosasco, F. Odone, E. De Vito, and A. Verri. Spectral Algorithms for Supervised Learning, ''Neural Computation'', 20(7), 2008. a classical example being

Tikhonov regularization Ridge regression (also known as Tikhonov regularization, named for Andrey Tikhonov) is a method of estimating the coefficients of multiple- regression models in scenarios where the independent variables are highly correlated. It has been used in m ...

. To ensure stability, this regularization parameter is tuned based on the level of noise. The main idea behind spectral regularization is that each regularization operator can be described using spectral calculus as an appropriate filter on the eigenvalues of the operator that defines the problem, and the role of the filter is to "suppress the oscillatory behavior corresponding to small eigenvalues". Therefore, each algorithm in the class of spectral regularization algorithms is defined by a suitable filter function (which needs to be derived for that particular algorithm). Three of the most commonly used regularization algorithms for which spectral filtering is well-studied are Tikhonov regularization, Landweber iteration, and truncated singular value decomposition (TSVD). As for choosing the regularization parameter, examples of candidate methods to compute this parameter include the discrepancy principle, generalized cross validation, and the L-curve criterion.P. C. Hansen, J. G. Nagy, and D. P. O'Leary. ''Deblurring Images: Matrices, Spectra, and Filtering'', Fundamentals of Algorithms 3, SIAM, Philadelphia, 2006. It is of note that the notion of spectral filtering studied in the context of machine learning is closely connected to the literature on

function approximation In general, a function approximation problem asks us to select a function (mathematics), function among a that closely matches ("approximates") a in a task-specific way. The need for function approximations arises in many branches of applied ...

(in signal processing).

Notation

The training set is defined as

S = \

, where

X

is the

n \times d

input matrix and

Y = (y_1,\dots,y_n)

is the output vector. Where applicable, the kernel function is denoted by

k

, and the

n \times n

kernel matrix is denoted by

K

which has entries

K_ = k(x_i,x_j)

and

\mathcal

denotes the

Reproducing Kernel Hilbert Space In functional analysis, a reproducing kernel Hilbert space (RKHS) is a Hilbert space of functions in which point evaluation is a continuous linear functional. Specifically, a Hilbert space H of functions from a set X (to \mathbb or \mathbb) is ...

(RKHS) with kernel

k

. The regularization parameter is denoted by

\lambda

. ''(Note: For

g \in G

and

f \in F

, with

G

and

F

being Hilbert spaces, given a linear, continuous operator

L

, assume that

g = Lf

holds. In this setting, the direct problem would be to solve for

g

given

f

and the inverse problem would be to solve for

f

given

g

. If the solution exists, is unique and stable, the inverse problem (i.e. the problem of solving for

f

) is well-posed; otherwise, it is ill-posed.) ''

Relation to the theory of ill-posed inverse problems

The connection between the regularized least squares (RLS) estimation problem (Tikhonov regularization setting) and the theory of ill-posed inverse problems is an example of how spectral regularization algorithms are related to the theory of ill-posed inverse problems. The RLS estimator solves

\min_ \frac\sum_^ (y_i-f(x_i))^2 + \lambda \left\, f\right\, ^2_\mathcal

and the RKHS allows for expressing this RLS estimator as

f_S^\lambda (X)=\sum_^n c_i k(x,x_i)

where

(K + n\lambda I)c = Y

with

c=(c_1,\dots,c_n)

.L. Rosasco. Lecture 6 of the Lecture Notes for 9.520: Statistical Learning Theory and Applications. Massachusetts Institute of Technology, Fall 2013. Available at https://www.mit.edu/~9.520/fall13/slides/class06/class06_RLSSVM.pdf The penalization term is used for controlling smoothness and preventing overfitting. Since the solution of empirical risk minimization

\min_ \frac\sum_^(y_i-f(x_i))^2

can be written as

f_S^(X)=\sum_^n c_i k(x,x_i)

such that

Kc=Y

, adding the penalty function amounts to the following change in the system that needs to be solved:L. Rosasco. Lecture 7 of the Lecture Notes for 9.520: Statistical Learning Theory and Applications. Massachusetts Institute of Technology, Fall 2013. Available at https://www.mit.edu/~9.520/fall13/slides/class07/class07_spectral.pdf

\left\
\equiv
\biggl\.

In this learning setting, the kernel matrix can be decomposed as

K = Q\Sigma Q^T

, with

\sigma = \operatorname(\sigma_1,\dots,\sigma_n),~\sigma_1 \geq \sigma_2 \geq \cdots \geq \sigma_n \geq 0

and

q_1, \dots, q_n

are the corresponding eigenvectors. Therefore, in the initial learning setting, the following holds:

c=K^Y = Q\Sigma^Q^TY = \sum_^n \frac \langle q_i,Y \rangle q_i.

Thus, for small eigenvalues, even small perturbations in the data can lead to considerable changes in the solution. Hence, the problem is ill-conditioned, and solving this RLS problem amounts to stabilizing a possibly ill-conditioned matrix inversion problem, which is studied in the theory of ill-posed inverse problems; in both problems, a main concern is to deal with the issue of

numerical stability In the mathematical subfield of numerical analysis, numerical stability is a generally desirable property of numerical algorithms. The precise definition of stability depends on the context: one important context is numerical linear algebra, and ...

Implementation of algorithms

Each algorithm in the class of spectral regularization algorithms is defined by a suitable filter function, denoted here by

G_(\cdot)

. If the Kernel matrix is denoted by

K

, then

\lambda

should control the magnitude of the smaller eigenvalues of

G_(K)

. In a filtering setup, the goal is to find estimators

f_S^(X) := \sum_^n c_i k(x,x_i)

where

c=G_(K)Y

. To do so, a scalar filter function

G_\lambda(\sigma)

is defined using the eigen-decomposition of the kernel matrix:

G_\lambda(K)=QG_(\Sigma)Q^T,

which yields

G_(K)Y~=~\sum_^G_(\sigma_i) \langle q_i,Y\rangle q_i.

Typically, an appropriate filter function should have the following properties: # As

\lambda

goes to zero,

G_\lambda (\sigma)~\rightarrow ~1/\sigma

. # The magnitude of the (smaller) eigenvalues of

G_

is controlled by

\lambda

. While the above items give a rough characterization of the general properties of filter functions for all spectral regularization algorithms, the derivation of the filter function (and hence its exact form) varies depending on the specific regularization method that spectral filtering is applied to.

Filter function for Tikhonov regularization

In the Tikhonov regularization setting, the filter function for RLS is described below. As shown in, in this setting,

c = \left(K + n\lambda I\right)^ Y

. Thus,

c = (K+n\lambda I)^Y = Q(\Sigma+n\lambda I)^Q^T Y=\sum_^n \frac q_i.

The undesired components are filtered out using regularization: * If

\sigma \gg \lambda n

, then

\frac \sim  \frac

. * If

\sigma \ll \lambda n

, then

\frac \sim \frac

. The filter function for Tikhonov regularization is therefore defined as:

G_\lambda (\sigma) = \frac.

Filter function for Landweber iteration

The idea behind the Landweber iteration is

gradient descent Gradient descent is a method for unconstrained mathematical optimization. It is a first-order iterative algorithm for minimizing a differentiable multivariate function. The idea is to take repeated steps in the opposite direction of the gradi ...

: c⁰ := 0 for ''i'' = 1, ..., ''t'' − 1 ''c''^''i'' := ''c''^''i''−1 + ''η''(''Y'' − ''Kc''^''i''−1) end In this setting, if

n

is larger than

K

's largest eigenvalue, the above iteration converges by choosing

\eta = 2/n

as the step-size:. The above iteration is equivalent to minimizing

\frac \left\, Y - Kc\right\, _2^2

(i.e. the empirical risk) via gradient descent; using induction, it can be proved that at the

t

-th iteration, the solution is given by

c = \eta\sum_^ \left(I-\eta K\right)^i Y.

Thus, the appropriate filter function is defined by:

G_\lambda(\sigma) = \eta\sum_^ \left(I - \eta\sigma\right)^i.

It can be shown that this filter function corresponds to a truncated power expansion of

K^

; to see this, note that the relation

\sum_ x^i = 1/(1 - x)

, would still hold if

x

is replaced by a matrix; thus, if

K

(the kernel matrix), or rather

I - \eta K

, is considered, the following holds:

K^ = \eta\sum_^\infty \left(I - \eta K\right)^i \sim \eta\sum_^ \left(I - \eta K\right)^i.

In this setting, the number of iterations gives the regularization parameter; roughly speaking,

t \sim 1/\lambda

. If

t

is large, overfitting may be a concern. If

t

is small, oversmoothing may be a concern. Thus, choosing an appropriate time for early stopping of the iterations provides a regularization effect.

Filter function for TSVD

In the TSVD setting, given the eigen-decomposition

K = Q\Sigma Q^T

and using a prescribed threshold

\lambda n

, a regularized inverse can be formed for the kernel matrix by discarding all the eigenvalues that are smaller than this threshold. Thus, the filter function for TSVD can be defined as

0, & \text \end

It can be shown that TSVD is equivalent to the (unsupervised) projection of the data using (kernel)

Principal Component Analysis Principal component analysis (PCA) is a linear dimensionality reduction technique with applications in exploratory data analysis, visualization and data preprocessing. The data is linearly transformed onto a new coordinate system such that th ...

(PCA), and that it is also equivalent to minimizing the empirical risk on the projected data (without regularization). Note that the number of components kept for the projection is the only

free parameter A free parameter is a variable in a mathematical model which cannot be predicted precisely or constrained by the model and must be estimated experimentally or theoretically. A mathematical model, theory, or conjecture is more likely to be right a ...

here.

References

{{reflist Mathematical analysis Inverse problems Computer engineering