Statistical learning theory

Statistical learning theory is a framework for machine learning drawing from the fields of statistics and functional analysis. It deals with the statistical inference problem of finding a predictive function based on data, and it has led to successful applications in fields such as computer vision, speech recognition, and bioinformatics.


Introduction

The goals of learning are understanding and prediction. Learning falls into many categories, including supervised learning, unsupervised learning, online learning, and reinforcement learning. From the perspective of statistical learning theory, supervised learning is best understood. Supervised learning involves learning from a training set of data. Every point in the training set is an input-output pair, where the input maps to an output. The learning problem consists of inferring the function that maps between the input and the output, such that the learned function can be used to predict the output from future input.

Depending on the type of output, supervised learning problems are either problems of regression or problems of classification. If the output takes a continuous range of values, it is a regression problem. Using Ohm's law as an example, a regression could be performed with voltage as input and current as output. The regression would find the functional relationship between voltage and current to be linear, with the resistance R as the constant of proportionality, such that

:V = I R

A code sketch of this regression appears at the end of this section. Classification problems are those for which the output will be an element from a discrete set of labels. Classification is very common for machine learning applications. In facial recognition, for instance, a picture of a person's face would be the input, and the output label would be that person's name. The input would be represented by a large multidimensional vector whose elements represent pixels in the picture. After learning a function based on the training set data, that function is validated on a test set of data, data that did not appear in the training set.
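Returning to the Ohm's-law example above, the following is a minimal sketch of the regression setting: it fits the linear relationship from noisy (voltage, current) pairs with least squares. The data values, the noise level, and the true resistance of 2 ohms are invented for illustration.

```python
import numpy as np

# Hypothetical training set: voltage inputs and noisy current readings
# for a resistor with true resistance R = 2.0 ohms (I = V / R).
rng = np.random.default_rng(0)
voltage = np.linspace(1.0, 10.0, 20)                          # inputs
current = voltage / 2.0 + rng.normal(0, 0.05, voltage.size)   # outputs

# Least-squares fit of the linear model I = V * (1/R), no intercept term.
slope, *_ = np.linalg.lstsq(voltage[:, None], current, rcond=None)
print(f"estimated resistance R = {1.0 / slope[0]:.3f} ohms")
```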


Formal description

Take X to be the vector space of all possible inputs, and Y to be the vector space of all possible outputs. Statistical learning theory takes the perspective that there is some unknown probability distribution over the product space Z = X \times Y, i.e. there exists some unknown p(z) = p(\vec{x}, y). The training set is made up of n samples from this probability distribution, and is notated

:S = \{(\vec{x}_1, y_1), \dots, (\vec{x}_n, y_n)\} = \{\vec{z}_1, \dots, \vec{z}_n\}

Every \vec{x}_i is an input vector from the training data, and y_i is the output that corresponds to it.

In this formalism, the inference problem consists of finding a function f: X \to Y such that f(\vec{x}) \sim y. Let \mathcal{H} be a space of functions f: X \to Y called the hypothesis space. The hypothesis space is the space of functions the algorithm will search through. Let V(f(\vec{x}), y) be the loss function, a metric for the difference between the predicted value f(\vec{x}) and the actual value y. The expected risk is defined to be

:I[f] = \displaystyle \int_{X \times Y} V(f(\vec{x}), y)\, p(\vec{x}, y)\, d\vec{x}\, dy

The target function, the best possible function f that can be chosen, is given by the f that satisfies

:f = \inf_{h \in \mathcal{H}} I[h]

Because the probability distribution p(\vec{x}, y) is unknown, a proxy measure for the expected risk must be used. This measure is based on the training set, a sample from this unknown probability distribution. It is called the empirical risk

:I_S[f] = \frac{1}{n} \displaystyle \sum_{i=1}^n V(f(\vec{x}_i), y_i)

A learning algorithm that chooses the function f_S that minimizes the empirical risk is called empirical risk minimization.
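The sketch below makes the formalism concrete: it computes the empirical risk I_S over a toy training set and selects, by empirical risk minimization, the best function from a small finite hypothesis space. The data, the candidate slopes, and the use of the square loss are all assumptions made for the example.

```python
import numpy as np

def empirical_risk(f, X, y, V):
    """I_S[f] = (1/n) * sum_i V(f(x_i), y_i) over the training set."""
    return np.mean([V(f(x), yi) for x, yi in zip(X, y)])

# Hypothetical training data, roughly following y = 2x.
X = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 3.9, 6.2, 7.8])

# A tiny hypothesis space of linear functions f(x) = w * x.
hypotheses = [lambda x, w=w: w * x for w in (1.0, 1.5, 2.0, 2.5)]
square_loss = lambda pred, actual: (pred - actual) ** 2

# Empirical risk minimization: pick the hypothesis with the lowest I_S.
f_S = min(hypotheses, key=lambda f: empirical_risk(f, X, y, square_loss))
print("chosen slope:", f_S(1.0))  # expected: 2.0
```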


Loss functions

The choice of loss function is a determining factor on the function f_S that will be chosen by the learning algorithm. The loss function also affects the convergence rate for an algorithm. It is important for the loss function to be
convex Convex or convexity may refer to: Science and technology * Convex lens, in optics Mathematics * Convex set, containing the whole line segment that joins points ** Convex polygon, a polygon which encloses a convex set of points ** Convex polytope ...
. Different loss functions are used depending on whether the problem is one of regression or one of classification.


Regression

The most common loss function for regression is the square loss function (also known as the L2-norm). This familiar loss function is used in ordinary least squares regression. The form is:

:V(f(\vec{x}), y) = (y - f(\vec{x}))^2

The absolute value loss (also known as the L1-norm) is also sometimes used:

:V(f(\vec{x}), y) = |y - f(\vec{x})|
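A brief sketch of the two regression losses, with hypothetical residuals to show how differently they penalize large errors:

```python
import numpy as np

def square_loss(pred, actual):
    # V(f(x), y) = (y - f(x))^2
    return (actual - pred) ** 2

def absolute_loss(pred, actual):
    # V(f(x), y) = |y - f(x)|
    return np.abs(actual - pred)

# The square loss penalizes a residual of 3 nine times more heavily than
# a residual of 1; the absolute loss penalizes it only three times more.
for loss in (square_loss, absolute_loss):
    print(loss.__name__, loss(0.0, 1.0), loss(0.0, 3.0))
```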


Classification

In some sense the 0-1
indicator function In mathematics, an indicator function or a characteristic function of a subset of a set is a function that maps elements of the subset to one, and all other elements to zero. That is, if is a subset of some set , one has \mathbf_(x)=1 if x\i ...
is the most natural loss function for classification. It takes the value 0 if the predicted output is the same as the actual output, and it takes the value 1 if the predicted output is different from the actual output. For binary classification with Y = \, this is: :V(f(\vec),y) = \theta(- y f(\vec)) where \theta is the
Heaviside step function The Heaviside step function, or the unit step function, usually denoted by or (but sometimes , or ), is a step function, named after Oliver Heaviside (1850–1925), the value of which is zero for negative arguments and one for positive argum ...
.
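A minimal sketch of the 0-1 loss written exactly as \theta(-y f(\vec{x})), assuming the convention \theta(0) = 0, i.e. a prediction of exactly zero counts as correct:

```python
import numpy as np

def zero_one_loss(pred, actual):
    # V(f(x), y) = theta(-y * f(x)), with theta the Heaviside step:
    # 0 when the signs of f(x) and y agree, 1 when they disagree.
    # The second argument fixes theta(0) = 0 (assumed convention).
    return np.heaviside(-actual * pred, 0.0)

print(zero_one_loss(pred=0.8, actual=1))    # correct sign -> 0.0
print(zero_one_loss(pred=-0.3, actual=1))   # wrong sign   -> 1.0
```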


Regularization

In machine learning problems, a major problem that arises is that of
overfitting mathematical modeling, overfitting is "the production of an analysis that corresponds too closely or exactly to a particular set of data, and may therefore fail to fit to additional data or predict future observations reliably". An overfitt ...
. Because learning is a prediction problem, the goal is not to find a function that most closely fits the (previously observed) data, but to find one that will most accurately predict output from future input.
Empirical risk minimization Empirical risk minimization (ERM) is a principle in statistical learning theory which defines a family of learning algorithms and is used to give theoretical bounds on their performance. The core idea is that we cannot know exactly how well an al ...
runs this risk of overfitting: finding a function that matches the data exactly but does not predict future output well. Overfitting is symptomatic of unstable solutions; a small perturbation in the training set data would cause a large variation in the learned function. It can be shown that if the stability for the solution can be guaranteed, generalization and consistency are guaranteed as well.
Regularization Regularization may refer to: * Regularization (linguistics) * Regularization (mathematics) * Regularization (physics) In physics, especially quantum field theory, regularization is a method of modifying observables which have singularities in ...
can solve the overfitting problem and give the problem stability. Regularization can be accomplished by restricting the hypothesis space \mathcal. A common example would be restricting \mathcal to linear functions: this can be seen as a reduction to the standard problem of
linear regression In statistics, linear regression is a linear approach for modelling the relationship between a scalar response and one or more explanatory variables (also known as dependent and independent variables). The case of one explanatory variable is call ...
. \mathcal could also be restricted to polynomial of degree p, exponentials, or bounded functions on L1. Restriction of the hypothesis space avoids overfitting because the form of the potential functions are limited, and so does not allow for the choice of a function that gives empirical risk arbitrarily close to zero. One example of regularization is
Tikhonov regularization Ridge regression is a method of estimating the coefficients of multiple-regression models in scenarios where the independent variables are highly correlated. It has been used in many fields including econometrics, chemistry, and engineering. Also ...
. This consists of minimizing :\frac \displaystyle \sum_^n V(f(\vec_i),y_i) + \gamma \, f\, _^2 where \gamma is a fixed and positive parameter, the regularization parameter. Tikhonov regularization ensures existence, uniqueness, and stability of the solution.Tomaso Poggio, Lorenzo Rosasco, et al. ''Statistical Learning Theory and Applications'', 2012
Class 2
/ref>
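To make Tikhonov regularization concrete, the sketch below minimizes the regularized empirical square loss for linear functions f(\vec{x}) = \vec{w} \cdot \vec{x}, where the penalty \gamma \|\vec{w}\|^2 yields a closed-form ridge solution. The data and the value of \gamma are invented for the example.

```python
import numpy as np

def ridge_fit(X, y, gamma):
    """Minimize (1/n) * ||X w - y||^2 + gamma * ||w||^2.

    Setting the gradient to zero gives the closed-form Tikhonov solution
    w = (X^T X / n + gamma * I)^{-1} (X^T y / n).
    """
    n, d = X.shape
    A = X.T @ X / n + gamma * np.eye(d)
    return np.linalg.solve(A, X.T @ y / n)

# Hypothetical data; gamma > 0 keeps A positive definite, so the
# solution exists, is unique, and varies smoothly with the data.
rng = np.random.default_rng(1)
X = rng.normal(size=(50, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + rng.normal(0, 0.1, size=50)
print(ridge_fit(X, y, gamma=0.1))  # close to w_true for small gamma
```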


See also

* Reproducing kernel Hilbert spaces are a useful choice for \mathcal{H}.
* Proximal gradient methods for learning


References

* Tomaso Poggio, Lorenzo Rosasco, et al., ''Statistical Learning Theory and Applications'', 2012, Class 2.