Empirical risk minimization (ERM) is a principle in statistical learning theory which defines a family of learning algorithms and is used to give theoretical bounds on their performance. The core idea is that we cannot know exactly how well an algorithm will work in practice (the true "risk") because we don't know the true distribution of data that the algorithm will work on, but we can instead measure its performance on a known set of training data (the "empirical" risk).


Background

Consider the following situation, which is a general setting of many supervised learning problems. We have two spaces of objects X and Y and would like to learn a function \ h: X \to Y (often called ''hypothesis'') which outputs an object y \in Y, given x \in X. To do so, we have at our disposal a ''training set'' of n examples \ (x_1, y_1), \ldots, (x_n, y_n) where x_i \in X is an input and y_i \in Y is the corresponding response that we wish to get from h(x_i).

To put it more formally, we assume that there is a joint probability distribution P(x, y) over X and Y, and that the training set consists of n instances \ (x_1, y_1), \ldots, (x_n, y_n) drawn i.i.d. from P(x, y). Note that the assumption of a joint probability distribution allows us to model uncertainty in predictions (e.g. from noise in the data), because y is not a deterministic function of x but rather a random variable with conditional distribution P(y \mid x) for a fixed x.

We also assume that we are given a non-negative real-valued loss function L(\hat{y}, y) which measures how different the prediction \hat{y} of a hypothesis is from the true outcome y. The risk associated with hypothesis h(x) is then defined as the expectation of the loss function:

: R(h) = \mathbf{E}[L(h(x), y)] = \int L(h(x), y)\,dP(x, y).

A loss function commonly used in theory is the 0–1 loss function:

: L(\hat{y}, y) = \begin{cases} 1 & \text{if } \hat{y} \ne y, \\ 0 & \text{if } \hat{y} = y. \end{cases}

The ultimate goal of a learning algorithm is to find a hypothesis h^* among a fixed class of functions \mathcal{H} for which the risk R(h) is minimal:

: h^* = \underset{h \in \mathcal{H}}{\operatorname{arg\,min}}\, R(h).

For classification problems, the Bayes classifier is defined to be the classifier minimizing the risk defined with the 0–1 loss function.
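In particular, under the 0–1 loss the risk of a classifier is simply its probability of misclassification, and the Bayes classifier predicts the most probable label given the input (a standard identity, stated here for illustration):

: R(h) = \mathbf{E}\!\left[\mathbf{1}\{h(x) \ne y\}\right] = P(h(x) \ne y), \qquad h^*(x) = \underset{y \in Y}{\operatorname{arg\,max}}\; P(y \mid x).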


Empirical risk minimization

In general, the risk R(h) cannot be computed because the distribution P(x, y) is unknown to the learning algorithm (this situation is referred to as agnostic learning). However, we can compute an approximation, called ''empirical risk'', by averaging the loss function on the training set; more formally, computing the expectation with respect to the empirical measure:

: R_\text{emp}(h) = \frac{1}{n} \sum_{i=1}^n L(h(x_i), y_i).

The empirical risk minimization principle states that the learning algorithm should choose a hypothesis \hat{h} which minimizes the empirical risk:

: \hat{h} = \underset{h \in \mathcal{H}}{\operatorname{arg\,min}}\, R_\text{emp}(h).

Thus the learning algorithm defined by the ERM principle consists in solving the above optimization problem.
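As a concrete illustration of the principle (not part of the original article), the following minimal Python sketch performs ERM by brute force over a small, hypothetical hypothesis class of one-dimensional threshold classifiers, using the 0–1 loss; the synthetic data and all names are illustrative assumptions.

<syntaxhighlight lang="python">
import numpy as np

def zero_one_loss(y_pred, y_true):
    """0-1 loss: 1 for a misclassification, 0 for a correct prediction."""
    return (y_pred != y_true).astype(float)

def empirical_risk(h, xs, ys, loss=zero_one_loss):
    """Average loss of hypothesis h over the training set."""
    preds = np.array([h(x) for x in xs])
    return loss(preds, ys).mean()

def erm(hypotheses, xs, ys):
    """Return the hypothesis in the (finite) class with the lowest empirical risk."""
    return min(hypotheses, key=lambda h: empirical_risk(h, xs, ys))

# Toy data: 1-D inputs with labels determined by a threshold, plus label noise.
rng = np.random.default_rng(0)
xs = rng.uniform(0, 1, size=100)
ys = (xs > 0.6).astype(int)
ys ^= rng.random(100) < 0.1          # flip 10% of labels to simulate noise

# Hypothesis class: threshold classifiers h_t(x) = 1[x > t].
hypotheses = [lambda x, t=t: int(x > t) for t in np.linspace(0, 1, 21)]

h_hat = erm(hypotheses, xs, ys)
print("empirical risk of ERM solution:", empirical_risk(h_hat, xs, ys))
</syntaxhighlight>

Brute-force search is only feasible here because the hypothesis class is tiny and finite; in practice the minimization is carried out by an optimization algorithm suited to the chosen class and loss, as discussed below.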


Properties


Computational complexity

Empirical risk minimization for a classification problem with a 0–1 loss function is known to be an NP-hard problem, even for a relatively simple class of functions such as linear classifiers (Feldman, Guruswami, Raghavendra and Wu 2009; see the paper and references therein). Nevertheless, it can be solved efficiently when the minimal empirical risk is zero, i.e., when the data is linearly separable. In practice, machine learning algorithms cope with this issue either by employing a convex approximation to the 0–1 loss function (like the hinge loss for SVM), which is easier to optimize, or by imposing assumptions on the distribution P(x, y) (and thus ceasing to be agnostic learning algorithms, to which the above result applies).
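To make the convex-surrogate idea concrete, here is a minimal, hypothetical Python sketch (not from the article) that minimizes the average hinge loss of a linear classifier by subgradient descent on synthetic data; an actual SVM would additionally include a regularization term.

<syntaxhighlight lang="python">
import numpy as np

def hinge_loss(w, b, X, y):
    """Average hinge loss max(0, 1 - y*(w.x + b)) over the sample; labels y are in {-1, +1}."""
    margins = y * (X @ w + b)
    return np.maximum(0.0, 1.0 - margins).mean()

def fit_linear_hinge(X, y, lr=0.1, epochs=200):
    """Minimize the empirical hinge loss with plain subgradient descent (no regularization)."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        margins = y * (X @ w + b)
        active = margins < 1.0                        # points contributing a subgradient
        grad_w = -(y[active, None] * X[active]).sum(axis=0) / n
        grad_b = -y[active].sum() / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Toy 2-D data, roughly linearly separable.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)

w, b = fit_linear_hinge(X, y)
err = np.mean(np.sign(X @ w + b) != y)                # empirical 0-1 risk of the learned classifier
print("hinge loss:", hinge_loss(w, b, X, y), " 0-1 error:", err)
</syntaxhighlight>

The hinge loss upper-bounds the 0–1 loss, so driving down the (convex) surrogate also tends to drive down the empirical misclassification rate, while keeping the optimization problem tractable.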


See also

* Maximum likelihood estimation
* M-estimator


References

* V. Feldman, V. Guruswami, P. Raghavendra and Yi Wu (2009). ''Agnostic Learning of Monomials by Halfspaces is Hard.''


Further reading

* {{cite book |last=Vapnik |first=V. |author-link=Vladimir Vapnik |title=The Nature of Statistical Learning Theory |publisher=Springer-Verlag |series=Information Science and Statistics |year=2000 |isbn=978-0-387-98780-4}}