Empirical Risk Minimization

Empirical risk minimization (ERM) is a principle in statistical learning theory which defines a family of learning algorithms and is used to give theoretical bounds on their performance. The core idea is that we cannot know exactly how well an algorithm will work in practice (the true "risk") because we don't know the true distribution of data that the algorithm will work on, but we can instead measure its performance on a known set of training data (the "empirical" risk).


Background

Consider the following situation, which is a general setting of many supervised learning problems. We have two spaces of objects X and Y and would like to learn a function \ h: X \to Y (often called ''hypothesis'') which outputs an object y \in Y, given x \in X. To do so, we have at our disposal a ''training set'' of n examples \ (x_1, y_1), \ldots, (x_n, y_n) where x_i \in X is an input and y_i \in Y is the corresponding response that we wish to get from h(x_i).

To put it more formally, we assume that there is a joint probability distribution P(x, y) over X and Y, and that the training set consists of n instances \ (x_1, y_1), \ldots, (x_n, y_n) drawn i.i.d. from P(x, y). Note that the assumption of a joint probability distribution allows us to model uncertainty in predictions (e.g. from noise in data) because y is not a deterministic function of x but rather a random variable with conditional distribution P(y \mid x) for a fixed x.

We also assume that we are given a non-negative real-valued loss function L(\hat{y}, y) which measures how different the prediction \hat{y} of a hypothesis is from the true outcome y. The risk associated with hypothesis h(x) is then defined as the expectation of the loss function:

: R(h) = \mathbf{E}[L(h(x), y)] = \int L(h(x), y)\,dP(x, y).

A loss function commonly used in theory is the 0–1 loss function:

: L(\hat{y}, y) = \begin{cases} 1 & \mbox{if } \hat{y} \ne y \\ 0 & \mbox{if } \hat{y} = y \end{cases}.

The ultimate goal of a learning algorithm is to find a hypothesis h^* among a fixed class of functions \mathcal{H} for which the risk R(h) is minimal:

: h^* = \underset{h \in \mathcal{H}}{\operatorname{arg\,min}}\, R(h).

For classification problems, the Bayes classifier is defined to be the classifier minimizing the risk defined with the 0–1 loss function.
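To make these definitions concrete, the following Python sketch computes the true risk under the 0–1 loss for a small, purely hypothetical discrete joint distribution P(x, y), and recovers the Bayes classifier by enumerating every hypothesis in a tiny class and keeping the risk minimizer:

 # Toy illustration: true risk under the 0-1 loss for a known discrete P(x, y),
 # and the Bayes classifier obtained by minimizing that risk.
 from itertools import product

 # Hypothetical joint distribution P(x, y) on X = {0, 1, 2}, Y = {0, 1}.
 P = {
     (0, 0): 0.30, (0, 1): 0.05,
     (1, 0): 0.10, (1, 1): 0.15,
     (2, 0): 0.05, (2, 1): 0.35,
 }
 X_vals, Y_vals = (0, 1, 2), (0, 1)

 def zero_one_loss(y_pred, y_true):
     """0-1 loss: 1 if the prediction is wrong, 0 otherwise."""
     return 1.0 if y_pred != y_true else 0.0

 def risk(h):
     """True risk R(h) = E[L(h(x), y)] under the known distribution P."""
     return sum(p * zero_one_loss(h[x], y) for (x, y), p in P.items())

 # Enumerate every deterministic hypothesis h: X -> Y; the minimizer is the Bayes classifier.
 hypotheses = [dict(zip(X_vals, labels)) for labels in product(Y_vals, repeat=len(X_vals))]
 bayes = min(hypotheses, key=risk)
 print("Bayes classifier:", bayes)   # predicts the more probable label for each x
 print("Bayes risk:", round(risk(bayes), 3))

Such exhaustive enumeration is feasible only because X, Y and the hypothesis class here are tiny; the point is simply to make the notions of risk and Bayes classifier concrete when P(x, y) is known exactly.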


Empirical risk minimization

In general, the risk R(h) cannot be computed because the distribution P(x, y) is unknown to the learning algorithm (this situation is referred to as agnostic learning). However, we can compute an approximation, called ''empirical risk'', by averaging the loss function on the training set; more formally, computing the expectation with respect to the empirical measure:

: R_\text{emp}(h) = \frac{1}{n} \sum_{i=1}^n L(h(x_i), y_i).

The empirical risk minimization principle states that the learning algorithm should choose a hypothesis \hat{h} which minimizes the empirical risk:

: \hat{h} = \underset{h \in \mathcal{H}}{\operatorname{arg\,min}}\, R_\text{emp}(h).

Thus the learning algorithm defined by the ERM principle consists in solving the above optimization problem.
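As a minimal illustrative sketch, assuming a purely hypothetical one-dimensional training sample and the class of threshold classifiers h_t(x) = 1 if x >= t and 0 otherwise, ERM with the 0–1 loss amounts to returning the candidate threshold with the lowest empirical risk:

 # Minimal ERM sketch: over a small class of threshold classifiers,
 # pick the hypothesis with the lowest empirical risk (average 0-1 loss).

 # Hypothetical training sample (x_i, y_i), assumed drawn i.i.d. from some unknown P(x, y).
 train = [(0.2, 0), (0.5, 0), (0.9, 1), (1.1, 0), (1.4, 1), (2.0, 1), (2.3, 1)]

 def empirical_risk(predict, sample):
     """R_emp(h) = (1/n) * sum_i L(h(x_i), y_i) with the 0-1 loss."""
     return sum(1 for x, y in sample if predict(x) != y) / len(sample)

 def make_threshold(t):
     return lambda x: 1 if x >= t else 0

 # Candidate thresholds: midpoints between consecutive training inputs.
 xs = sorted(x for x, _ in train)
 thresholds = [(a + b) / 2 for a, b in zip(xs, xs[1:])]

 # ERM: choose the hypothesis minimizing the empirical risk.
 best_t = min(thresholds, key=lambda t: empirical_risk(make_threshold(t), train))
 h_hat = make_threshold(best_t)
 print("chosen threshold:", best_t)
 print("empirical risk of h_hat:", empirical_risk(h_hat, train))

Because no threshold separates this particular sample perfectly, the minimal empirical risk is nonzero; the returned hypothesis is the best in the class on this sample, which need not coincide with the minimizer of the true risk R(h).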


Properties


Computational complexity

Empirical risk minimization for a classification problem with a 0–1 loss function is known to be an NP-hard problem even for a relatively simple class of functions such as linear classifiers (Feldman, Guruswami, Raghavendra and Wu, 2009; see the paper and references therein). Nevertheless, it can be solved efficiently when the minimal empirical risk is zero, i.e., when the data are linearly separable. In practice, machine learning algorithms cope with this issue either by employing a convex approximation to the 0–1 loss function (such as the hinge loss for SVM), which is easier to optimize, or by imposing assumptions on the distribution P(x, y) (and thus ceasing to be agnostic learning algorithms, to which the above result applies).
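As a rough sketch of the convex-surrogate workaround (the synthetic data, step size and iteration count are all hypothetical), the code below minimizes the average hinge loss max(0, 1 - y(w·x + b)) of a linear classifier by subgradient descent, with labels in {-1, +1}:

 # Sketch: replace the intractable 0-1 loss with the convex hinge loss
 # and minimize its average over the sample by batch subgradient descent.
 import random

 random.seed(0)

 # Hypothetical linearly separable data: the label is the sign of x1 + x2.
 data = []
 for _ in range(200):
     x1, x2 = random.uniform(-1, 1), random.uniform(-1, 1)
     data.append(((x1, x2), 1 if x1 + x2 > 0 else -1))
 n = len(data)

 w, b, lr = [0.0, 0.0], 0.0, 0.1
 for epoch in range(100):
     gw, gb = [0.0, 0.0], 0.0
     for (x1, x2), y in data:
         if y * (w[0] * x1 + w[1] * x2 + b) < 1:  # hinge loss is active here
             gw[0] -= y * x1                      # subgradient of the hinge: -y * x
             gw[1] -= y * x2
             gb -= y
     w[0] -= lr * gw[0] / n
     w[1] -= lr * gw[1] / n
     b -= lr * gb / n

 errors = sum(1 for (x1, x2), y in data if y * (w[0] * x1 + w[1] * x2 + b) <= 0)
 print("training 0-1 error:", errors / n)

The surrogate objective is convex in (w, b), so simple first-order methods suffice; the resulting classifier is then evaluated with the original 0–1 loss, which is the quantity one actually cares about.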


See also

* Maximum likelihood estimation
* M-estimator


References

* V. Feldman, V. Guruswami, P. Raghavendra and Yi Wu (2009). ''Agnostic Learning of Monomials by Halfspaces is Hard''.

Further reading

* Vapnik, V. (2000). ''The Nature of Statistical Learning Theory''. Information Science and Statistics. Springer-Verlag. ISBN 978-0-387-98780-4.