Empirical risk minimization (ERM) is a principle in statistical learning theory which defines a family of learning algorithms and is used to give theoretical bounds on their performance. The core idea is that we cannot know exactly how well an algorithm will work in practice (the true "risk") because we don't know the true distribution of data that the algorithm will work on, but we can instead measure its performance on a known set of training data (the "empirical" risk).
Background
Consider the following situation, which is a general setting of many supervised learning problems. We have two spaces of objects $X$ and $Y$ and would like to learn a function $h : X \to Y$ (often called ''hypothesis'') which outputs an object $y \in Y$, given $x \in X$. To do so, we have at our disposal a ''training set'' of $n$ examples $(x_1, y_1), \ldots, (x_n, y_n)$, where $x_i \in X$ is an input and $y_i \in Y$ is the corresponding response that we wish to get from $h(x_i)$.
To put it more formally, we assume that there is a joint probability distribution $P(x, y)$ over $X$ and $Y$, and that the training set consists of $n$ instances $(x_1, y_1), \ldots, (x_n, y_n)$ drawn i.i.d. from $P(x, y)$. Note that the assumption of a joint probability distribution allows us to model uncertainty in predictions (e.g. from noise in data) because $y$ is not a deterministic function of $x$ but rather a random variable with conditional distribution $P(y \mid x)$ for a fixed $x$.
We also assume that we are given a non-negative real-valued loss function $L(\hat{y}, y)$ which measures how different the prediction $\hat{y}$ of a hypothesis is from the true outcome $y$. The risk associated with hypothesis $h(x)$ is then defined as the expectation of the loss function:
: $R(h) = \mathbf{E}[L(h(x), y)] = \int L(h(x), y)\, dP(x, y).$
A loss function commonly used in theory is the 0-1 loss function:
: $L(\hat{y}, y) = I(\hat{y} \neq y)$, where $I(\cdot)$ is the indicator function.
The ultimate goal of a learning algorithm is to find a hypothesis $h^*$ among a fixed class of functions $\mathcal{H}$ for which the risk $R(h)$ is minimal:
: $h^* = \underset{h \in \mathcal{H}}{\operatorname{arg\,min}}\, R(h).$
For classification problems, the Bayes classifier is defined to be the classifier minimizing the risk defined with the 0–1 loss function.
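For illustration, the following sketch estimates the true risk under the 0–1 loss by Monte Carlo sampling for a toy distribution in which $P(x, y)$ is known, and compares the Bayes classifier with a trivial constant rule. The particular distribution, function names, and sample sizes are illustrative assumptions, not part of the formal setting above.
<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(0)

# Toy joint distribution P(x, y): x ~ Uniform(0, 1) and P(y = 1 | x) = x,
# so the Bayes classifier under the 0-1 loss predicts 1 exactly when
# P(y = 1 | x) > 1/2, i.e. when x > 0.5.
def sample(n):
    x = rng.uniform(0.0, 1.0, size=n)
    y = (rng.uniform(0.0, 1.0, size=n) < x).astype(int)
    return x, y

def risk_01(h, n=200_000):
    """Monte Carlo estimate of the true risk R(h) = E[1{h(x) != y}]."""
    x, y = sample(n)
    return np.mean(h(x) != y)

bayes = lambda x: (x > 0.5).astype(int)          # Bayes classifier
constant = lambda x: np.ones_like(x, dtype=int)  # always predict 1

print(f"risk of Bayes classifier ~ {risk_01(bayes):.3f}")    # approx. 0.25
print(f"risk of constant rule    ~ {risk_01(constant):.3f}") # approx. 0.50
</syntaxhighlight>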
Empirical risk minimization
In general, the risk $R(h)$ cannot be computed because the distribution $P(x, y)$ is unknown to the learning algorithm (this situation is referred to as agnostic learning). However, we can compute an approximation, called ''empirical risk'', by averaging the loss function on the training set; more formally, computing the expectation with respect to the empirical measure:
: $R_{\text{emp}}(h) = \frac{1}{n} \sum_{i=1}^{n} L(h(x_i), y_i).$
The empirical risk minimization principle states that the learning algorithm should choose a hypothesis $\hat{h}$ which minimizes the empirical risk:
: $\hat{h} = \underset{h \in \mathcal{H}}{\operatorname{arg\,min}}\, R_{\text{emp}}(h).$
Thus the learning algorithm defined by the ERM principle consists in solving the above optimization problem.
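The following sketch applies the ERM principle directly: it draws a training set from the toy distribution used above, takes $\mathcal{H}$ to be a small class of threshold classifiers, and returns the hypothesis with the smallest empirical 0–1 risk. The choice of hypothesis class and the grid of thresholds are illustrative assumptions, not prescribed by the principle itself.
<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(1)

# Training set drawn i.i.d. from the toy distribution used above:
# x ~ Uniform(0, 1), P(y = 1 | x) = x.
n = 500
x = rng.uniform(0.0, 1.0, size=n)
y = (rng.uniform(0.0, 1.0, size=n) < x).astype(int)

# Hypothesis class H: threshold classifiers h_t(x) = 1{x > t}.
thresholds = np.linspace(0.0, 1.0, 101)

def empirical_risk(t):
    """R_emp(h_t) = (1/n) * sum_i 1{h_t(x_i) != y_i} under the 0-1 loss."""
    return np.mean((x > t).astype(int) != y)

# ERM: choose the hypothesis in H with the smallest empirical risk.
risks = np.array([empirical_risk(t) for t in thresholds])
t_hat = thresholds[np.argmin(risks)]
print(f"ERM threshold: {t_hat:.2f}, empirical risk: {risks.min():.3f}")
</syntaxhighlight>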
Properties
Computational complexity
Empirical risk minimization for a classification problem with a 0-1 loss function is known to be an NP-hard problem, even for a relatively simple class of functions such as linear classifiers. [V. Feldman, V. Guruswami, P. Raghavendra and Yi Wu (2009). ''Agnostic Learning of Monomials by Halfspaces is Hard.''] (See the paper and references therein.) Nevertheless, it can be solved efficiently when the minimal empirical risk is zero, i.e., the data is linearly separable.
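As a sketch of the linearly separable case, the classical perceptron algorithm is one procedure that drives the empirical 0–1 risk to zero in finitely many updates when a linear separator exists. The data-generating weights and the iteration cap below are illustrative assumptions, not part of the result cited above.
<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(2)

# Linearly separable toy data: labels are the sign of a fixed linear rule.
n, d = 200, 2
X = rng.normal(size=(n, d))
w_true, b_true = np.array([2.0, -1.0]), 0.5
y = np.sign(X @ w_true + b_true)

# Perceptron updates: on separable data the empirical 0-1 risk reaches
# zero after finitely many mistakes.
w, b = np.zeros(d), 0.0
for _ in range(1000):
    mistakes = 0
    for xi, yi in zip(X, y):
        if yi * (xi @ w + b) <= 0:   # misclassified (or on the boundary)
            w += yi * xi
            b += yi
            mistakes += 1
    if mistakes == 0:                # zero empirical risk: stop
        break

print("empirical 0-1 risk:", np.mean(np.sign(X @ w + b) != y))
</syntaxhighlight>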
In practice, machine learning algorithms cope with this issue either by employing a convex approximation to the 0–1 loss function (like the hinge loss for SVM), which is easier to optimize, or by imposing assumptions on the distribution $P(x, y)$ (thus ceasing to be agnostic learning algorithms, to which the above hardness result applies).
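Below is a minimal sketch of the convex-surrogate approach, assuming a simple subgradient-descent routine on the empirical hinge loss written directly in NumPy rather than any particular SVM library; the learning rate, step count, and toy data are illustrative assumptions.
<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(3)

# Noisy, not perfectly separable data with labels in {-1, +1}.
n, d = 500, 2
X = rng.normal(size=(n, d))
y = np.sign(X @ np.array([1.5, -2.0]) + 0.3 * rng.normal(size=n))

def hinge_risk(w, b):
    """Empirical hinge loss, a convex surrogate for the 0-1 loss."""
    return np.mean(np.maximum(0.0, 1.0 - y * (X @ w + b)))

def zero_one_risk(w, b):
    return np.mean(np.sign(X @ w + b) != y)

# Subgradient descent on the empirical hinge risk.
w, b, lr = np.zeros(d), 0.0, 0.1
for _ in range(500):
    active = y * (X @ w + b) < 1.0   # points with nonzero hinge loss
    grad_w = -(y[active, None] * X[active]).sum(axis=0) / n
    grad_b = -y[active].sum() / n
    w -= lr * grad_w
    b -= lr * grad_b

print(f"hinge risk: {hinge_risk(w, b):.3f}, 0-1 risk: {zero_one_risk(w, b):.3f}")
</syntaxhighlight>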
See also
* Maximum likelihood estimation
* M-estimator
References
Further reading
* {{cite book |last=Vapnik |first=V. |author-link=Vladimir Vapnik |title=The Nature of Statistical Learning Theory |publisher=Springer-Verlag |series=Information Science and Statistics |year=2000 |isbn=978-0-387-98780-4}}