For supervised learning applications in machine learning and statistical learning theory, generalization error (Mohri, M., Rostamizadeh, A., Talwalkar, A. (2018), ''Foundations of Machine Learning'', 2nd ed., Boston: MIT Press), also known as the out-of-sample error (Abu-Mostafa, Y. S., Magdon-Ismail, M., and Lin, H.-T. (2012), ''Learning from Data'', AMLBook Press) or the risk, is a measure of how accurately an algorithm is able to predict outcomes for previously unseen data. As learning algorithms are evaluated on finite samples, the evaluation of a learning algorithm may be sensitive to sampling error. As a result, measurements of prediction error on the current data may not provide much information about the algorithm's predictive ability on new, unseen data. The generalization error can be minimized by avoiding overfitting in the learning algorithm. The performance of machine learning algorithms is commonly visualized by learning curve plots that show estimates of the generalization error throughout the learning process.


Definition

In a learning problem, the goal is to develop a function f_n(\vec{x}) that predicts output values y for each input datum \vec{x}. The subscript n indicates that the function f_n is developed based on a data set of n data points. The generalization error or expected loss or risk I[f] of a particular function f over all possible values of \vec{x} and y is the expected value of the loss function V(f):

: I[f] = \int_{X \times Y} V(f(\vec{x}), y) \, \rho(\vec{x}, y) \, d\vec{x} \, dy,

where \rho(\vec{x}, y) is the unknown joint probability distribution for \vec{x} and y.

Without knowing the joint probability distribution \rho, it is impossible to compute I[f]. Instead, we can compute the error on sample data, which is called the empirical error (or empirical risk). Given n data points, the empirical error of a candidate function f is:

: I_n[f] = \frac{1}{n} \sum_{i=1}^n V(f(\vec{x}_i), y_i)

An algorithm is said to generalize if:

: \lim_{n \rightarrow \infty} I[f] - I_n[f] = 0

Of particular importance is the generalization error I[f_n] of the data-dependent function f_n that is found by a learning algorithm based on the sample. Again, for an unknown probability distribution, I[f_n] cannot be computed. Instead, the aim of many problems in statistical learning theory is to bound or characterize the difference of the generalization error and the empirical error in probability:

: P_n^G = P(I[f_n] - I_n[f_n] \leq \epsilon) \geq 1 - \delta_n

That is, the goal is to characterize the probability 1 - \delta_n that the generalization error is less than the empirical error plus some error bound \epsilon (generally dependent on \delta and n). For many types of algorithms, it has been shown that an algorithm has generalization bounds if it meets certain stability criteria. Specifically, if an algorithm is symmetric (the order of inputs does not affect the result), has bounded loss and meets two stability conditions, it will generalize. The first stability condition, leave-one-out cross-validation stability, says that to be stable, the prediction error for each data point when leave-one-out cross-validation is used must converge to zero as n \rightarrow \infty. The second condition, expected-to-leave-one-out error stability (also known as hypothesis stability if operating in the L_1 norm), is met if the prediction on a left-out data point does not change when a single data point is removed from the training data set. These two conditions are formalized in the following subsections.
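As a concrete illustration of the definitions of I[f] and I_n[f] above, the following sketch (the linear data model, the choice of squared loss, and all variable names are illustrative assumptions, not part of the article) compares the empirical risk on a small sample with a Monte Carlo estimate of the true risk for a known joint distribution:

```python
# Compare empirical risk I_n[f] with a Monte Carlo estimate of the risk I[f].
# Assumed data model (illustrative only): y = 2x + Gaussian noise, x ~ U[0, 1].
import numpy as np

rng = np.random.default_rng(0)

def sample(n):
    x = rng.uniform(0.0, 1.0, n)
    y = 2.0 * x + rng.normal(0.0, 0.1, n)
    return x, y

def squared_loss(pred, y):
    return (pred - y) ** 2

f = lambda x: 2.0 * x  # candidate predictor (here: the true regression function)

# Empirical risk I_n[f] on a sample of n = 20 points.
x_tr, y_tr = sample(20)
emp_risk = squared_loss(f(x_tr), y_tr).mean()

# Monte Carlo approximation of I[f]: average the loss over a huge fresh sample.
x_big, y_big = sample(1_000_000)
true_risk = squared_loss(f(x_big), y_big).mean()

print(emp_risk, true_risk)
```

Because f is the true regression function here, both quantities approach the noise variance, but the empirical risk computed from only 20 points fluctuates around it; this fluctuation is the sampling error discussed above.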


Leave-one-out cross-validation Stability

An algorithm L has CVloo stability if for each n there exist a \beta_{CV}^{(n)} and a \delta_{CV}^{(n)} such that:

: \forall i \in \{1, \dots, n\}, \; \mathbb{P}_S\left\{ \left| V(f_{S^{|i}}, z_i) - V(f_S, z_i) \right| \leq \beta_{CV}^{(n)} \right\} \geq 1 - \delta_{CV}^{(n)}

and \beta_{CV}^{(n)} and \delta_{CV}^{(n)} go to zero as n goes to infinity. Here z_i = (\vec{x}_i, y_i), f_S is the function learned from the training set S, and f_{S^{|i}} is the function learned from S with the i-th point removed.
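The CVloo condition can be probed numerically for a simple learner. In the sketch below (the choice of learner and loss are illustrative assumptions) the learning algorithm is the empirical mean, so f_S predicts the constant mean of the training outputs under squared loss, and the largest leave-one-out loss change shrinks as n grows:

```python
# Measure the CVloo quantity |V(f_{S^{|i}}, z_i) - V(f_S, z_i)| for the
# empirical-mean learner under squared loss, and watch it shrink with n.
import numpy as np

rng = np.random.default_rng(1)

def max_cvloo_gap(n):
    y = rng.normal(0.0, 1.0, n)        # training outputs in the sample S
    f_S = y.mean()                     # learned constant predictor
    gaps = []
    for i in range(n):
        f_loo = np.delete(y, i).mean() # retrain on S with z_i removed
        gaps.append(abs((f_loo - y[i]) ** 2 - (f_S - y[i]) ** 2))
    return max(gaps)

g_small, g_large = max_cvloo_gap(50), max_cvloo_gap(5000)
print(g_small, g_large)
```

Removing one point moves the mean by O(1/n), so the loss change at the held-out point decays on the order of 1/n, which is the behavior a vanishing \beta_{CV}^{(n)} requires.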


Expected-leave-one-out error Stability

An algorithm L has Eloo_{err} stability if for each n there exist a \beta_{EL}^{(n)} and a \delta_{EL}^{(n)} such that:

: \forall i \in \{1, \dots, n\}, \; \mathbb{P}_S\left\{ \left| I[f_S] - \frac{1}{n} \sum_{i=1}^n V(f_{S^{|i}}, z_i) \right| \leq \beta_{EL}^{(n)} \right\} \geq 1 - \delta_{EL}^{(n)}

with \beta_{EL}^{(n)} and \delta_{EL}^{(n)} going to zero for n \rightarrow \infty. For leave-one-out stability in the L_1 norm, this is the same as hypothesis stability:

: \mathbb{E}_{S,z}\left[ \left| V(f_S, z) - V(f_{S^{|i}}, z) \right| \right] \leq \beta_H^{(n)}

with \beta_H^{(n)} going to zero as n goes to infinity.
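Hypothesis stability can likewise be estimated by Monte Carlo. The sketch below (again using the empirical-mean learner and squared loss as illustrative assumptions) estimates \beta_H^{(n)} = \mathbb{E}_{S,z}|V(f_S, z) - V(f_{S^{|i}}, z)| and checks that it decreases with n:

```python
# Monte Carlo estimate of the hypothesis-stability constant beta_H for the
# empirical-mean learner: average |V(f_S, z) - V(f_{S^{|i}}, z)| over random
# samples S and fresh points z.  By symmetry it suffices to remove point 0.
import numpy as np

rng = np.random.default_rng(3)

def beta_H(n, trials=300):
    diffs = []
    for _ in range(trials):
        y = rng.normal(0.0, 1.0, n)   # training outputs in S
        z = rng.normal(0.0, 1.0)      # fresh test point drawn from rho
        f_S = y.mean()
        f_loo = y[1:].mean()          # S with the first point removed
        diffs.append(abs((f_S - z) ** 2 - (f_loo - z) ** 2))
    return float(np.mean(diffs))

b_small, b_large = beta_H(20), beta_H(2000)
print(b_small, b_large)
```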


Algorithms with proven stability

A number of algorithms have been proven to be stable and as a result have bounds on their generalization error. A list of these algorithms and the papers that proved stability is available in the article Stability (learning theory), under "Algorithms that are stable".


Relation to overfitting

The concepts of generalization error and overfitting are closely related. Overfitting occurs when the learned function f_S becomes sensitive to the noise in the sample. As a result, the function will perform well on the training set but not perform well on other data from the joint probability distribution of x and y. Thus, the more overfitting occurs, the larger the generalization error.

The amount of overfitting can be tested using cross-validation methods, which split the sample into simulated training samples and testing samples. The model is then trained on a training sample and evaluated on the testing sample. The testing sample is previously unseen by the algorithm and so represents a random sample from the joint probability distribution of x and y. This test sample allows us to approximate the expected error, and as a result to approximate a particular form of the generalization error.

Many algorithms exist to prevent overfitting. The minimization algorithm can penalize more complex functions (known as Tikhonov regularization), or the hypothesis space can be constrained, either explicitly in the form of the functions or by adding constraints to the minimization function (Ivanov regularization).

The approach to finding a function that does not overfit is at odds with the goal of finding a function that is sufficiently complex to capture the particular characteristics of the data. This is known as the bias–variance tradeoff. Keeping a function simple to avoid overfitting may introduce a bias in the resulting predictions, while allowing it to be more complex leads to overfitting and a higher variance in the predictions. It is impossible to minimize both simultaneously.
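The tradeoff can be demonstrated with a small experiment. In the sketch below (the sinusoidal data model and the particular polynomial degrees are illustrative assumptions), polynomials of increasing degree are fit by least squares; training error keeps falling with degree, while the held-out error, a proxy for the generalization error, first falls and then rises as the model starts fitting noise:

```python
# Train vs. held-out error of least-squares polynomial fits of growing degree:
# low degree underfits (high bias), very high degree overfits (high variance).
import numpy as np

rng = np.random.default_rng(2)

def make_data(n):
    x = rng.uniform(-1.0, 1.0, n)
    y = np.sin(3.0 * x) + rng.normal(0.0, 0.2, n)
    return x, y

x_tr, y_tr = make_data(30)      # small training sample
x_te, y_te = make_data(1000)    # held-out sample, unseen during fitting

def errors(degree):
    coef = np.polyfit(x_tr, y_tr, degree)
    mse = lambda x, y: float(np.mean((np.polyval(coef, x) - y) ** 2))
    return mse(x_tr, y_tr), mse(x_te, y_te)

results = {d: errors(d) for d in (1, 5, 20)}
for d, (tr, te) in results.items():
    print(f"degree {d:2d}: train MSE {tr:.3f}   held-out MSE {te:.3f}")
```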




Further reading

* Mohri, M., Rostamizadeh, A., Talwalkar, A. (2018), ''Foundations of Machine Learning'', 2nd ed., Boston: MIT Press.
* Moody, J.E. (1992), "The Effective Number of Parameters: An Analysis of Generalization and Regularization in Nonlinear Learning Systems", in Moody, J.E., Hanson, S.J., and Lippmann, R.P., ''Advances in Neural Information Processing Systems'' 4, 847–854.
* White, H. (1992b), ''Artificial Neural Networks: Approximation and Learning Theory'', Blackwell.

