Statistical learning theory

Statistical learning theory is a framework for machine learning drawing from the fields of statistics and functional analysis. It deals with the statistical inference problem of finding a predictive function based on data, and it has led to successful applications in fields such as computer vision, speech recognition, and bioinformatics.


Introduction

The goals of learning are understanding and prediction. Learning falls into many categories, including supervised learning, unsupervised learning, online learning, and reinforcement learning. From the perspective of statistical learning theory, supervised learning is best understood. Supervised learning involves learning from a training set of data. Every point in the training set is an input-output pair, where the input maps to an output. The learning problem consists of inferring the function that maps between the input and the output, such that the learned function can be used to predict the output from future input.

Depending on the type of output, supervised learning problems are either problems of regression or problems of classification. If the output takes a continuous range of values, it is a regression problem. Using Ohm's law as an example, a regression could be performed with voltage as input and current as output. The regression would find the functional relationship between voltage and current to be linear, with the resistance R as the constant of proportionality, such that

:V = I R

A code sketch of this regression appears at the end of this section. Classification problems are those for which the output will be an element from a discrete set of labels. Classification is very common for machine learning applications. In facial recognition, for instance, a picture of a person's face would be the input, and the output label would be that person's name. The input would be represented by a large multidimensional vector whose elements represent pixels in the picture. After learning a function based on the training set data, that function is validated on a test set of data, data that did not appear in the training set.
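Returning to the Ohm's-law example above, the following is a minimal sketch of the regression setting: it fits the linear relationship from noisy (voltage, current) pairs with least squares. The data values, the noise level, and the true resistance of 2 ohms are invented for illustration.

```python
import numpy as np

# Hypothetical training set: voltage inputs and noisy current readings
# for a resistor with true resistance R = 2.0 ohms (I = V / R).
rng = np.random.default_rng(0)
voltage = np.linspace(1.0, 10.0, 20)                          # inputs
current = voltage / 2.0 + rng.normal(0, 0.05, voltage.size)   # outputs

# Least-squares fit of the linear model I = V * (1/R), no intercept term.
slope, *_ = np.linalg.lstsq(voltage[:, None], current, rcond=None)
print(f"estimated resistance R = {1.0 / slope[0]:.3f} ohms")
```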


Formal description

Take X to be the vector space of all possible inputs, and Y to be the vector space of all possible outputs. Statistical learning theory takes the perspective that there is some unknown probability distribution over the product space Z = X \times Y, i.e. there exists some unknown p(z) = p(\vec{x}, y). The training set is made up of n samples from this probability distribution, and is notated

:S = \{(\vec{x}_1, y_1), \dots, (\vec{x}_n, y_n)\} = \{\vec{z}_1, \dots, \vec{z}_n\}

Every \vec{x}_i is an input vector from the training data, and y_i is the output that corresponds to it.

In this formalism, the inference problem consists of finding a function f: X \to Y such that f(\vec{x}) \sim y. Let \mathcal{H} be a space of functions f: X \to Y called the hypothesis space. The hypothesis space is the space of functions the algorithm will search through. Let V(f(\vec{x}), y) be the loss function, a metric for the difference between the predicted value f(\vec{x}) and the actual value y. The expected risk is defined to be

:I[f] = \displaystyle \int_{X \times Y} V(f(\vec{x}), y)\, p(\vec{x}, y)\, d\vec{x}\, dy

The target function, the best possible function f that can be chosen, is given by the f that satisfies

:f = \inf_{h \in \mathcal{H}} I[h]

Because the probability distribution p(\vec{x}, y) is unknown, a proxy measure for the expected risk must be used. This measure is based on the training set, a sample from this unknown probability distribution. It is called the empirical risk

:I_S[f] = \frac{1}{n} \displaystyle \sum_{i=1}^n V(f(\vec{x}_i), y_i)

A learning algorithm that chooses the function f_S that minimizes the empirical risk is called empirical risk minimization.
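The sketch below makes the formalism concrete: it computes the empirical risk I_S over a toy training set and selects, by empirical risk minimization, the best function from a small finite hypothesis space. The data, the candidate slopes, and the use of the square loss are all assumptions made for the example.

```python
import numpy as np

def empirical_risk(f, X, y, V):
    """I_S[f] = (1/n) * sum_i V(f(x_i), y_i) over the training set."""
    return np.mean([V(f(x), yi) for x, yi in zip(X, y)])

# Hypothetical training data, roughly following y = 2x.
X = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 3.9, 6.2, 7.8])

# A tiny hypothesis space of linear functions f(x) = w * x.
hypotheses = [lambda x, w=w: w * x for w in (1.0, 1.5, 2.0, 2.5)]
square_loss = lambda pred, actual: (pred - actual) ** 2

# Empirical risk minimization: pick the hypothesis with the lowest I_S.
f_S = min(hypotheses, key=lambda f: empirical_risk(f, X, y, square_loss))
print("chosen slope:", f_S(1.0))  # expected: 2.0
```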


Loss functions

The choice of loss function is a determining factor on the function f_S that will be chosen by the learning algorithm. The loss function also affects the convergence rate for an algorithm. It is important for the loss function to be
convex Convex or convexity may refer to: Science and technology * Convex lens, in optics Mathematics * Convex set, containing the whole line segment that joins points ** Convex polygon, a polygon which encloses a convex set of points ** Convex polytope ...
. Different loss functions are used depending on whether the problem is one of regression or one of classification.


Regression

The most common loss function for regression is the square loss function (also known as the L2-norm). This familiar loss function is used in ordinary least squares regression. The form is:

:V(f(\vec{x}), y) = (y - f(\vec{x}))^2

The absolute value loss (also known as the L1-norm) is also sometimes used:

:V(f(\vec{x}), y) = |y - f(\vec{x})|
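A brief sketch of the two regression losses, with hypothetical residuals to show how differently they penalize large errors:

```python
import numpy as np

def square_loss(pred, actual):
    # V(f(x), y) = (y - f(x))^2
    return (actual - pred) ** 2

def absolute_loss(pred, actual):
    # V(f(x), y) = |y - f(x)|
    return np.abs(actual - pred)

# The square loss penalizes a residual of 3 nine times more heavily than
# a residual of 1; the absolute loss penalizes it only three times more.
for loss in (square_loss, absolute_loss):
    print(loss.__name__, loss(0.0, 1.0), loss(0.0, 3.0))
```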


Classification

In some sense the 0-1
indicator function In mathematics, an indicator function or a characteristic function of a subset of a set is a function that maps elements of the subset to one, and all other elements to zero. That is, if is a subset of some set , one has \mathbf_(x)=1 if x\i ...
is the most natural loss function for classification. It takes the value 0 if the predicted output is the same as the actual output, and it takes the value 1 if the predicted output is different from the actual output. For binary classification with Y = \, this is: :V(f(\vec),y) = \theta(- y f(\vec)) where \theta is the
Heaviside step function The Heaviside step function, or the unit step function, usually denoted by or (but sometimes , or ), is a step function, named after Oliver Heaviside (1850–1925), the value of which is zero for negative arguments and one for positive argum ...
.
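A minimal sketch of the 0-1 loss written exactly as \theta(-y f(\vec{x})), assuming the convention \theta(0) = 0, i.e. a prediction of exactly zero counts as correct:

```python
import numpy as np

def zero_one_loss(pred, actual):
    # V(f(x), y) = theta(-y * f(x)), with theta the Heaviside step:
    # 0 when the signs of f(x) and y agree, 1 when they disagree.
    # The second argument fixes theta(0) = 0 (assumed convention).
    return np.heaviside(-actual * pred, 0.0)

print(zero_one_loss(pred=0.8, actual=1))    # correct sign -> 0.0
print(zero_one_loss(pred=-0.3, actual=1))   # wrong sign   -> 1.0
```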


Regularization

In machine learning problems, a major problem that arises is that of
overfitting mathematical modeling, overfitting is "the production of an analysis that corresponds too closely or exactly to a particular set of data, and may therefore fail to fit to additional data or predict future observations reliably". An overfitt ...
. Because learning is a prediction problem, the goal is not to find a function that most closely fits the (previously observed) data, but to find one that will most accurately predict output from future input.
Empirical risk minimization Empirical risk minimization (ERM) is a principle in statistical learning theory which defines a family of learning algorithms and is used to give theoretical bounds on their performance. The core idea is that we cannot know exactly how well an al ...
runs this risk of overfitting: finding a function that matches the data exactly but does not predict future output well. Overfitting is symptomatic of unstable solutions; a small perturbation in the training set data would cause a large variation in the learned function. It can be shown that if the stability for the solution can be guaranteed, generalization and consistency are guaranteed as well.
Regularization Regularization may refer to: * Regularization (linguistics) * Regularization (mathematics) * Regularization (physics) In physics, especially quantum field theory, regularization is a method of modifying observables which have singularities in ...
can solve the overfitting problem and give the problem stability. Regularization can be accomplished by restricting the hypothesis space \mathcal. A common example would be restricting \mathcal to linear functions: this can be seen as a reduction to the standard problem of
linear regression In statistics, linear regression is a linear approach for modelling the relationship between a scalar response and one or more explanatory variables (also known as dependent and independent variables). The case of one explanatory variable is call ...
. \mathcal could also be restricted to polynomial of degree p, exponentials, or bounded functions on L1. Restriction of the hypothesis space avoids overfitting because the form of the potential functions are limited, and so does not allow for the choice of a function that gives empirical risk arbitrarily close to zero. One example of regularization is
Tikhonov regularization Ridge regression is a method of estimating the coefficients of multiple-regression models in scenarios where the independent variables are highly correlated. It has been used in many fields including econometrics, chemistry, and engineering. Also ...
. This consists of minimizing :\frac \displaystyle \sum_^n V(f(\vec_i),y_i) + \gamma \, f\, _^2 where \gamma is a fixed and positive parameter, the regularization parameter. Tikhonov regularization ensures existence, uniqueness, and stability of the solution.Tomaso Poggio, Lorenzo Rosasco, et al. ''Statistical Learning Theory and Applications'', 2012
Class 2
/ref>
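To make Tikhonov regularization concrete, the sketch below minimizes the regularized empirical square loss for linear functions f(\vec{x}) = \vec{w} \cdot \vec{x}, where the penalty \gamma \|\vec{w}\|^2 yields a closed-form ridge solution. The data and the value of \gamma are invented for the example.

```python
import numpy as np

def ridge_fit(X, y, gamma):
    """Minimize (1/n) * ||X w - y||^2 + gamma * ||w||^2.

    Setting the gradient to zero gives the closed-form Tikhonov solution
    w = (X^T X / n + gamma * I)^{-1} (X^T y / n).
    """
    n, d = X.shape
    A = X.T @ X / n + gamma * np.eye(d)
    return np.linalg.solve(A, X.T @ y / n)

# Hypothetical data; gamma > 0 keeps A positive definite, so the
# solution exists, is unique, and varies smoothly with the data.
rng = np.random.default_rng(1)
X = rng.normal(size=(50, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + rng.normal(0, 0.1, size=50)
print(ridge_fit(X, y, gamma=0.1))  # close to w_true for small gamma
```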


See also

* Reproducing kernel Hilbert spaces are a useful choice for \mathcal{H}.
* Proximal gradient methods for learning


References

* Tomaso Poggio, Lorenzo Rosasco, et al., ''Statistical Learning Theory and Applications'', 2012, Class 2.