Manifold Regularization

In machine learning, manifold regularization is a technique for using the shape of a dataset to constrain the functions that should be learned on that dataset. In many machine learning problems, the data to be learned do not cover the entire input space. For example, a facial recognition system may not need to classify any possible image, but only the subset of images that contain faces. The technique of manifold learning assumes that the relevant subset of data comes from a manifold, a mathematical structure with useful properties. The technique also assumes that the function to be learned is ''smooth'': data with different labels are not likely to be close together, and so the labeling function should not change quickly in areas where there are likely to be many data points. Because of this assumption, a manifold regularization algorithm can use unlabeled data to inform where the learned function is allowed to change quickly and where it is not, using an extension of the technique of Tikhonov regularization. Manifold regularization algorithms can extend supervised learning algorithms in semi-supervised learning and transductive learning settings, where unlabeled data are available. The technique has been used for applications including medical imaging, geographical imaging, and object recognition.


Manifold regularizer


Motivation

Manifold regularization is a type of regularization, a family of techniques that reduces overfitting and ensures that a problem is well-posed by penalizing complex solutions. In particular, manifold regularization extends the technique of Tikhonov regularization as applied to reproducing kernel Hilbert spaces (RKHSs). Under standard Tikhonov regularization on RKHSs, a learning algorithm attempts to learn a function f from among a hypothesis space of functions \mathcal{H}. The hypothesis space is an RKHS, meaning that it is associated with a kernel K, and so every candidate function f has a norm \left\| f \right\|_K, which represents the complexity of the candidate function in the hypothesis space. When the algorithm considers a candidate function, it takes its norm into account in order to penalize complex functions. Formally, given a set of labeled training data (x_1, y_1), \ldots, (x_\ell, y_\ell) with x_i \in X, y_i \in Y and a loss function V, a learning algorithm using Tikhonov regularization will attempt to solve the expression

: \underset{f \in \mathcal{H}}{\operatorname{arg\,min}} \, \frac{1}{\ell} \sum_{i=1}^{\ell} V(f(x_i), y_i) + \gamma \left\| f \right\|_K^2

where \gamma is a hyperparameter that controls how much the algorithm will prefer simpler functions over functions that fit the data better.

Manifold regularization adds a second regularization term, the ''intrinsic regularizer'', to the ''ambient regularizer'' used in standard Tikhonov regularization. Under the manifold assumption in machine learning, the data in question do not come from the entire input space X, but instead from a nonlinear manifold M \subset X. The geometry of this manifold, the intrinsic space, is used to determine the regularization norm.
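
For illustration, the following Python sketch evaluates the Tikhonov-regularized objective for a candidate function represented in an RKHS; the Gaussian kernel, the squared loss, and all function names here are illustrative choices rather than part of the formulation above.

 import numpy as np

 def gaussian_kernel(A, B, sigma=1.0):
     # K(x, z) = exp(-||x - z||^2 / (2 sigma^2)); rows of A and B are data points.
     sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
     return np.exp(-sq_dists / (2 * sigma ** 2))

 def tikhonov_objective(alpha, K, y, gamma):
     # A candidate f is represented as f(x) = sum_i alpha_i K(x_i, x), so its values
     # at the training points are K @ alpha and its RKHS norm satisfies
     # ||f||_K^2 = alpha^T K alpha.
     f_vals = K @ alpha
     data_fit = np.mean((f_vals - y) ** 2)   # empirical loss (1/l) sum V(f(x_i), y_i)
     penalty = gamma * (alpha @ K @ alpha)   # gamma ||f||_K^2
     return data_fit + penalty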


Laplacian norm

There are many possible choices for the intrinsic regularizer \left\| f \right\|_I. Many natural choices involve the gradient on the manifold \nabla_M, which can provide a measure of how smooth a target function is. A smooth function should change slowly where the input data are dense; that is, the gradient \nabla_M f(x) should be small where the ''marginal probability density'' \mathcal{P}_X(x), the probability density of a randomly drawn data point appearing at x, is large. This gives one appropriate choice for the intrinsic regularizer:

: \left\| f \right\|_I^2 = \int_M \left\| \nabla_M f(x) \right\|^2 \, d\mathcal{P}_X(x)

In practice, this norm cannot be computed directly because the marginal distribution \mathcal{P}_X is unknown, but it can be estimated from the provided data.


Graph-based approach to the Laplacian norm

When the distances between input points are interpreted as a graph, the Laplacian matrix of the graph can help to estimate the marginal distribution. Suppose that the input data include \ell labeled examples (pairs of an input x and a label y) and u unlabeled examples (inputs without associated labels). Define W to be a matrix of edge weights for the graph, where W_{ij} is a distance-based weight between the data points x_i and x_j. Define D to be the diagonal matrix with D_{ii} = \sum_{j=1}^{\ell+u} W_{ij} and L to be the Laplacian matrix D - W. Then, as the number of data points \ell + u increases, L converges to the Laplace–Beltrami operator \Delta_M, which is the divergence of the gradient \nabla_M. Then, if \mathbf{f} is the vector of the values of f at the data points, \mathbf{f} = (f(x_1), \ldots, f(x_{\ell+u})), the intrinsic norm can be estimated:

: \left\| f \right\|_I^2 = \frac{1}{(\ell + u)^2} \mathbf{f}^\mathsf{T} L \mathbf{f}

As the number of data points \ell + u increases, this empirical definition of \left\| f \right\|_I^2 converges to the definition when \mathcal{P}_X is known.
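
For illustration, a Python sketch of this graph-based estimate; the Gaussian weighting of the edges and the function names are illustrative assumptions, since the construction above only requires some choice of edge weights W.

 import numpy as np

 def graph_laplacian(X, sigma=1.0):
     # Build edge weights W from pairwise distances (here with a Gaussian weighting,
     # one common but not mandatory choice), then return L = D - W.
     sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
     W = np.exp(-sq_dists / (2 * sigma ** 2))
     np.fill_diagonal(W, 0.0)                # no self-edges
     D = np.diag(W.sum(axis=1))              # D_ii = sum_j W_ij
     return D - W

 def intrinsic_norm_estimate(f_vals, L):
     # Empirical estimate ||f||_I^2 = (1 / (l+u)^2) f^T L f, with f_vals the values
     # of f at all labeled and unlabeled points.
     n = len(f_vals)
     return (f_vals @ L @ f_vals) / n ** 2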


Solving the regularization problem with the graph-based approach

Using the weights \gamma_A and \gamma_I for the ambient and intrinsic regularizers, the final expression to be solved becomes:

: \underset{f \in \mathcal{H}}{\operatorname{arg\,min}} \, \frac{1}{\ell} \sum_{i=1}^{\ell} V(f(x_i), y_i) + \gamma_A \left\| f \right\|_K^2 + \frac{\gamma_I}{(\ell + u)^2} \mathbf{f}^\mathsf{T} L \mathbf{f}

As with other kernel methods, \mathcal{H} may be an infinite-dimensional space, so if the regularization expression cannot be solved explicitly, it is impossible to search the entire space for a solution. Instead, a representer theorem shows that under certain conditions on the choice of the norm \left\| f \right\|_I, the optimal solution f^* must be a linear combination of the kernel centered at each of the input points: for some weights \alpha_i,

: f^*(x) = \sum_{i=1}^{\ell + u} \alpha_i K(x_i, x)

Using this result, it is possible to search for the optimal solution f^* by searching the finite-dimensional space defined by the possible choices of \alpha_i.
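
For illustration, the following Python sketch searches this finite-dimensional space of coefficients \alpha_i numerically for a generic loss function; the optimizer, the helper names, and the example squared loss are illustrative choices (closed-form solutions for specific losses are given below).

 import numpy as np
 from scipy.optimize import minimize

 def fit_manifold_regularized(K, L, y_labeled, loss, gamma_A, gamma_I):
     # K and L are the (l+u) x (l+u) kernel and graph Laplacian matrices; the first
     # l points are labeled. loss(predictions, labels) returns the summed loss.
     n = K.shape[0]
     l = len(y_labeled)

     def objective(alpha):
         f_vals = K @ alpha                          # f at every point (representer theorem)
         data_fit = loss(f_vals[:l], y_labeled) / l
         ambient = gamma_A * (alpha @ K @ alpha)     # gamma_A ||f||_K^2
         intrinsic = gamma_I / n ** 2 * (f_vals @ L @ f_vals)
         return data_fit + ambient + intrinsic

     result = minimize(objective, np.zeros(n), method="L-BFGS-B")
     return result.x                                 # expansion coefficients alpha

 # Example: squared loss, which gives a numerically solved Laplacian RLS.
 squared_loss = lambda predictions, labels: ((predictions - labels) ** 2).sum()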


Functional approach to the Laplacian norm

The idea behind the graph Laplacian is to use neighboring points to estimate the Laplacian. This method is akin to local averaging methods, which are known to scale poorly in high-dimensional problems. Indeed, the graph Laplacian is known to suffer from the curse of dimensionality. Fortunately, it is possible to leverage the expected smoothness of the function to be estimated through more advanced functional analysis. This method consists of estimating the Laplacian operator using derivatives of the kernel, written \partial_{x_j} K(x_i, x), where \partial_{x_j} denotes the partial derivative with respect to the ''j''-th coordinate of the first variable. This second approach to the Laplacian norm can be related to meshfree methods, which contrast with the finite difference method for PDEs.
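
For illustration, a Python sketch of the kernel derivatives such methods rely on, shown for a Gaussian kernel; the kernel choice and function name are illustrative assumptions, and the construction of the resulting regularizer is method-specific and not shown.

 import numpy as np

 def gaussian_kernel_partials(x, z, sigma=1.0):
     # For the Gaussian kernel K(x, z) = exp(-||x - z||^2 / (2 sigma^2)), the partial
     # derivative with respect to the j-th coordinate of the first argument is
     # dK/dx_j = -(x_j - z_j) / sigma^2 * K(x, z).
     diff = np.asarray(x, dtype=float) - np.asarray(z, dtype=float)
     k = np.exp(-(diff @ diff) / (2 * sigma ** 2))
     partials = -diff / sigma ** 2 * k
     return k, partials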


Applications

Manifold regularization can extend a variety of algorithms that can be expressed using Tikhonov regularization, by choosing an appropriate loss function V and hypothesis space \mathcal{H}. Two commonly used examples are the families of support vector machines and regularized least squares algorithms. (Regularized least squares includes the ridge regression algorithm; the related algorithms of LASSO and elastic net regularization can be expressed as support vector machines.) The extended versions of these algorithms are called Laplacian Regularized Least Squares (abbreviated LapRLS) and Laplacian Support Vector Machines (LapSVM), respectively.


Laplacian Regularized Least Squares (LapRLS)

Regularized least squares (RLS) is a family of regression algorithms: algorithms that predict a value y = f(x) for their inputs x, with the goal that the predicted values should be close to the true labels for the data. In particular, RLS is designed to minimize the mean squared error between the predicted values and the true labels, subject to regularization. Ridge regression is one form of RLS; in general, RLS is the same as ridge regression combined with the kernel method. The problem statement for RLS results from choosing the loss function V in Tikhonov regularization to be the mean squared error:

: f^* = \underset{f \in \mathcal{H}}{\operatorname{arg\,min}} \, \frac{1}{\ell} \sum_{i=1}^{\ell} (f(x_i) - y_i)^2 + \gamma \left\| f \right\|_K^2

Thanks to the representer theorem, the solution can be written as a weighted sum of the kernel evaluated at the data points:

: f^*(x) = \sum_{i=1}^{\ell} \alpha_i^* K(x_i, x)

and solving for \alpha^* gives

: \alpha^* = (K + \gamma \ell I)^{-1} Y

where K is defined to be the kernel matrix, with K_{ij} = K(x_i, x_j), and Y is the vector of data labels.

Adding a Laplacian term for manifold regularization gives the Laplacian RLS statement:

: f^* = \underset{f \in \mathcal{H}}{\operatorname{arg\,min}} \, \frac{1}{\ell} \sum_{i=1}^{\ell} (f(x_i) - y_i)^2 + \gamma_A \left\| f \right\|_K^2 + \frac{\gamma_I}{(\ell + u)^2} \mathbf{f}^\mathsf{T} L \mathbf{f}

The representer theorem for manifold regularization again gives

: f^*(x) = \sum_{i=1}^{\ell + u} \alpha_i^* K(x_i, x)

and this yields an expression for the vector \alpha^*. Letting K be the kernel matrix as above, Y be the vector of data labels (padded with zeros for the unlabeled examples), and J be the (\ell + u) \times (\ell + u) block matrix \begin{pmatrix} I_\ell & 0 \\ 0 & 0_u \end{pmatrix}:

: \alpha^* = \underset{\alpha \in \mathbb{R}^{\ell + u}}{\operatorname{arg\,min}} \, \frac{1}{\ell} (Y - J K \alpha)^\mathsf{T} (Y - J K \alpha) + \gamma_A \alpha^\mathsf{T} K \alpha + \frac{\gamma_I}{(\ell + u)^2} \alpha^\mathsf{T} K L K \alpha

with a solution of

: \alpha^* = \left( J K + \gamma_A \ell I + \frac{\gamma_I \ell}{(\ell + u)^2} L K \right)^{-1} Y

LapRLS has been applied to problems including sensor networks, medical imaging, object detection, spectroscopy, document classification, drug-protein interactions, and compressing images and videos.
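
For illustration, a Python sketch of this closed-form LapRLS solution; the function names are illustrative, and Y is taken to be the label vector padded with zeros for the unlabeled examples, as assumed above.

 import numpy as np

 def lap_rls(K, L, y_labeled, gamma_A, gamma_I):
     # Closed-form LapRLS coefficients following the expression in the text:
     # alpha* = (J K + gamma_A l I + gamma_I l / (l+u)^2 L K)^{-1} Y,
     # with Y the label vector padded by zeros for the unlabeled points.
     n = K.shape[0]                  # n = l + u
     l = len(y_labeled)
     J = np.zeros((n, n))
     J[:l, :l] = np.eye(l)           # block matrix selecting the labeled points
     Y = np.zeros(n)
     Y[:l] = y_labeled
     A = J @ K + gamma_A * l * np.eye(n) + gamma_I * l / n ** 2 * (L @ K)
     return np.linalg.solve(A, Y)    # expansion coefficients alpha*

 def lap_rls_predict(alpha, K_new):
     # K_new[i, j] = K(x_j, x_new_i) for training points x_j, so that
     # f(x_new_i) = sum_j alpha_j K(x_j, x_new_i).
     return K_new @ alpha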


Laplacian Support Vector Machines (LapSVM)

Support vector machines (SVMs) are a family of algorithms often used for classifying data into two or more groups, or ''classes''. Intuitively, an SVM draws a boundary between classes so that the closest labeled examples to the boundary are as far away as possible. This can be expressed directly as a quadratic program, but it is also equivalent to Tikhonov regularization with the hinge loss function, V(f(x), y) = \max(0, 1 - yf(x)):

: f^* = \underset{f \in \mathcal{H}}{\operatorname{arg\,min}} \, \frac{1}{\ell} \sum_{i=1}^{\ell} \max(0, 1 - y_i f(x_i)) + \gamma \left\| f \right\|_K^2

Adding the intrinsic regularization term to this expression gives the LapSVM problem statement:

: f^* = \underset{f \in \mathcal{H}}{\operatorname{arg\,min}} \, \frac{1}{\ell} \sum_{i=1}^{\ell} \max(0, 1 - y_i f(x_i)) + \gamma_A \left\| f \right\|_K^2 + \frac{\gamma_I}{(\ell + u)^2} \mathbf{f}^\mathsf{T} L \mathbf{f}

Again, the representer theorem allows the solution to be expressed in terms of the kernel evaluated at the data points:

: f^*(x) = \sum_{i=1}^{\ell + u} \alpha_i^* K(x_i, x)

\alpha can be found by writing the problem as a quadratic program and solving the dual problem. Letting K be the kernel matrix as above, J be the \ell \times (\ell + u) matrix \begin{pmatrix} I_\ell & 0 \end{pmatrix} that selects the labeled examples, and Y be the \ell \times \ell diagonal matrix of labels, the solution can be shown to be

: \alpha = \left( 2 \gamma_A I + 2 \frac{\gamma_I}{(\ell + u)^2} L K \right)^{-1} J^\mathsf{T} Y \beta^*

where \beta^* is the solution to the dual problem

: \begin{align} \beta^* = \underset{\beta \in \mathbb{R}^{\ell}}{\max} \quad & \sum_{i=1}^{\ell} \beta_i - \frac{1}{2} \beta^\mathsf{T} Q \beta \\ \text{subject to} \quad & \sum_{i=1}^{\ell} \beta_i y_i = 0 \\ & 0 \le \beta_i \le \frac{1}{\ell}, \quad i = 1, \ldots, \ell \end{align}

and Q is defined by

: Q = Y J K \left( 2 \gamma_A I + 2 \frac{\gamma_I}{(\ell + u)^2} L K \right)^{-1} J^\mathsf{T} Y

LapSVM has been applied to problems including geographical imaging, medical imaging, face recognition, machine maintenance, and brain–computer interfaces.
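
For illustration, a Python sketch that solves this dual with a general-purpose constrained optimizer; the function names are illustrative assumptions, and a dedicated quadratic-programming solver would normally be used in place of the generic optimizer.

 import numpy as np
 from scipy.optimize import minimize

 def lap_svm(K, L, y_labeled, gamma_A, gamma_I):
     # Solve the LapSVM dual above, with J the l x (l+u) selector of labeled points
     # and Y the diagonal matrix of labels, as in the problem statement.
     n = K.shape[0]                              # n = l + u
     l = len(y_labeled)
     y = np.asarray(y_labeled, dtype=float)      # labels in {-1, +1}
     J = np.zeros((l, n))
     J[:, :l] = np.eye(l)
     Y = np.diag(y)
     M_inv = np.linalg.inv(2 * gamma_A * np.eye(n) + 2 * gamma_I / n ** 2 * (L @ K))
     Q = Y @ J @ K @ M_inv @ J.T @ Y

     # Dual: maximize sum(beta) - 1/2 beta^T Q beta, i.e. minimize its negation.
     objective = lambda b: 0.5 * b @ Q @ b - b.sum()
     constraints = [{"type": "eq", "fun": lambda b: b @ y}]
     bounds = [(0.0, 1.0 / l)] * l
     res = minimize(objective, np.zeros(l), method="SLSQP",
                    bounds=bounds, constraints=constraints)
     beta = res.x
     return M_inv @ J.T @ Y @ beta               # expansion coefficients alpha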


Limitations

* Manifold regularization assumes that data with different labels are not likely to be close together. This assumption is what allows the technique to draw information from unlabeled data, but it only applies to some problem domains. Depending on the structure of the data, it may be necessary to use a different semi-supervised or transductive learning algorithm.
* In some datasets, the intrinsic norm of a function \left\| f \right\|_I can be very close to the ambient norm \left\| f \right\|_K: for example, if the data consist of two classes that lie on perpendicular lines, the intrinsic norm will be equal to the ambient norm. In this case, unlabeled data have no effect on the solution learned by manifold regularization, even if the data fit the algorithm's assumption that the separator should be smooth. Approaches related to co-training have been proposed to address this limitation.
* If there are a very large number of unlabeled examples, the kernel matrix K becomes very large, and a manifold regularization algorithm may become prohibitively slow to compute. Online algorithms and sparse approximations of the manifold may help in this case.


See also

* Manifold learning
* Manifold hypothesis
* Semi-supervised learning
* Transduction (machine learning)
* Spectral graph theory
* Reproducing kernel Hilbert space
* Tikhonov regularization
* Differential geometry


References

{{Reflist}}


External links


Software

* The ManifoldLearn library and the Primal LapSVM library implement LapRLS and LapSVM in MATLAB.
* The Dlib library for C++ includes a linear manifold regularization function.