In
mathematical optimization and
decision theory
, a loss function or cost function (sometimes also called an error function)
is a function that maps an
event
or values of one or more variables onto a
real number
intuitively representing some "cost" associated with the event. An
optimization problem
seeks to minimize a loss function. An objective function is either a loss function or its opposite (in specific domains, variously called a
reward function
, a
profit function
, a
utility function
, a
fitness function
, etc.), in which case it is to be maximized. The loss function could include terms from several levels of the hierarchy.
In statistics, typically a loss function is used for
parameter estimation
, and the event in question is some function of the difference between estimated and true values for an instance of data. The concept, as old as
Laplace, was reintroduced in statistics by
Abraham Wald
in the middle of the 20th century. In the context of
economics
, for example, this is usually
economic cost
or
regret
. In
classification, it is the penalty for an incorrect classification of an example. In
actuarial science, it is used in an insurance context to model benefits paid over premiums, particularly since the works of
Harald Cramér
in the 1920s. In
optimal control
, the loss is the penalty for failing to achieve a desired value. In
financial risk management, the function is mapped to a monetary loss.
Examples
Regret
Leonard J. Savage argued that, when using non-Bayesian methods such as
minimax, the loss function should be based on the idea of ''
regret
'', i.e., the loss associated with a decision should be the difference between the consequences of the best decision that could have been made had the underlying circumstances been known and the decision that was in fact taken before they were known.
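As a rough illustration of regret-based loss, the sketch below builds a regret table from a small payoff table and then applies the minimax criterion to it. The payoff numbers and the names `payoffs`, `regret_table` and `minimax_regret` are hypothetical and chosen only for this example.

```python
# Hypothetical payoff table: outer keys are decisions, inner keys are states of nature.
payoffs = {
    "umbrella":    {"rain": 5, "sun": 3},
    "no_umbrella": {"rain": 0, "sun": 6},
}

def regret_table(payoffs):
    """Regret = best payoff achievable in a state minus the payoff actually received."""
    states = list(next(iter(payoffs.values())).keys())
    best = {s: max(row[s] for row in payoffs.values()) for s in states}
    return {d: {s: best[s] - row[s] for s in states} for d, row in payoffs.items()}

def minimax_regret(payoffs):
    """Pick the decision whose worst-case regret is smallest."""
    regrets = regret_table(payoffs)
    return min(regrets, key=lambda d: max(regrets[d].values()))

regrets = regret_table(payoffs)
choice = minimax_regret(payoffs)
```

With these numbers the regret of a decision is zero in the state where it is the best available choice, and the minimax criterion compares only the largest regret in each row.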
Quadratic loss function
The use of a
quadratic loss function is common, for example when using
least squares techniques. It is often more mathematically tractable than other loss functions because of the properties of
variance
s, as well as being symmetric: an error above the target causes the same loss as the same magnitude of error below the target. If the target is ''t'', then a quadratic loss function is
:<math>\lambda(x) = C (t - x)^2</math>
for some constant ''C''; the value of the constant makes no difference to a decision, and can be ignored by setting it equal to 1. This is also known as the squared error loss (SEL).
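A minimal sketch of the two properties just described, symmetry and irrelevance of the constant, is given below. It assumes a positive constant ''C''; the function name `quadratic_loss` and the candidate values are illustrative only.

```python
def quadratic_loss(x, t, C=1.0):
    """Quadratic loss C * (t - x)**2 for a value x and a target t."""
    return C * (t - x) ** 2

target = 2.0
candidates = [0.0, 1.0, 1.5, 2.5, 4.0]

# Symmetry: equal-sized errors above and below the target cost the same.
symmetric = quadratic_loss(target + 0.5, target) == quadratic_loss(target - 0.5, target)

# Scaling by a positive constant does not change which candidate is best.
best_c1 = min(candidates, key=lambda x: quadratic_loss(x, target, C=1.0))
best_c7 = min(candidates, key=lambda x: quadratic_loss(x, target, C=7.0))
```

Here 1.5 and 2.5 are tied, and `min` returns the first of them for either value of the constant.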
Many common
statistic
s, including
t-test
s,
regression models,
design of experiments
, and much else, use
least squares methods applied using
linear regression theory, which is based on the quadratic loss function.
The quadratic loss function is also used in
linear-quadratic optimal control problems. In these problems, even in the absence of uncertainty, it may not be possible to achieve the desired values of all target variables. Often loss is expressed as a
quadratic form in the deviations of the variables of interest from their desired values; this approach is
tractable because it results in linear
first-order condition
s. In the context of
stochastic control, the expected value of the quadratic form is used.
0-1 loss function
In
statistics
and
decision theory
, a frequently used loss function is the ''0-1 loss function''
:<math>L(\hat{y}, y) = I(\hat{y} \ne y),</math>
where ''I'' is the
indicator function.
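The 0-1 loss can be sketched in a few lines. In the example below the labels are made up, and averaging the loss over the examples gives the misclassification rate.

```python
def zero_one_loss(y_pred, y_true):
    """Return 1 when the prediction differs from the true value, 0 otherwise."""
    return 1 if y_pred != y_true else 0

# Hypothetical predictions and true labels for four examples.
predicted = ["cat", "dog", "dog", "cat"]
actual = ["cat", "dog", "cat", "cat"]

losses = [zero_one_loss(p, a) for p, a in zip(predicted, actual)]
error_rate = sum(losses) / len(losses)  # average 0-1 loss
```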
Constructing loss and objective functions
In many applications, objective functions, including loss functions as a particular case, are determined by the problem formulation. In other situations, the decision maker’s preference must be elicited and represented by a scalar-valued function (also called a
utility
function) in a form suitable for optimization — the problem that
Ragnar Frisch
highlighted in his Nobel Prize lecture.
The existing methods for constructing objective functions are collected in the proceedings of two dedicated conferences.
In particular,
Andranik Tangian
showed that the most usable objective functions — quadratic and additive — are determined by a few indifference points. He used this property in the models for constructing these objective functions from either
ordinal or
cardinal data that were elicited through computer-assisted interviews with decision makers.
Among other things, he constructed objective functions to optimally distribute budgets for 16 Westphalian universities
and the European subsidies for equalizing unemployment rates among 271 German regions.
Expected loss
In some contexts, the value of the loss function itself is a random quantity because it depends on the outcome of a random variable ''X''.
Statistics
Both
frequentist
and
Bayesian
statistical theory involve making a decision based on the
expected value
of the loss function; however, this quantity is defined differently under the two paradigms.
Frequentist expected loss
We first define the expected loss in the frequentist context. It is obtained by taking the expected value with respect to the probability distribution, ''P''<sub>''θ''</sub>, of the observed data, ''X''. This is also referred to as the risk function of the decision rule ''δ'' and the parameter ''θ''. Here the decision rule depends on the outcome of ''X''. The risk function is given by:
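:<math>R(\theta, \delta) = \operatorname{E}_\theta L\big(\theta, \delta(X)\big) = \int_X L\big(\theta, \delta(x)\big) \, \mathrm{d}P_\theta(x).</math>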
Here, ''θ'' is a fixed but possibly unknown state of nature, ''X'' is a vector of observations stochastically drawn from a
population
, E<sub>''θ''</sub> is the expectation over all population values of ''X'', ''dP''<sub>''θ''</sub> is a
probability measure over the event space of ''X'' (parametrized by ''θ'') and the integral is evaluated over the entire
support of ''X''.
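For a discrete model the risk integral reduces to a finite sum, which can be evaluated exactly. The sketch below assumes, purely for illustration, that ''X'' is the number of successes in ''n'' Bernoulli trials with success probability ''θ'', that the decision rule is the sample proportion, and that the loss is squared error. The function names are hypothetical.

```python
from math import comb

def squared_error(theta, estimate):
    return (theta - estimate) ** 2

def sample_proportion(x, n):
    """Decision rule: estimate theta by the observed fraction of successes."""
    return x / n

def risk(theta, n, rule, loss):
    """R(theta, delta) = sum over x of L(theta, delta(x)) * P_theta(X = x),
    with X ~ Binomial(n, theta) (an assumption of this sketch)."""
    return sum(
        loss(theta, rule(x, n)) * comb(n, x) * theta**x * (1 - theta) ** (n - x)
        for x in range(n + 1)
    )

r = risk(0.3, 10, sample_proportion, squared_error)
```

Under these assumptions the result agrees, up to floating-point rounding, with the variance of the sample proportion, ''θ''(1 − ''θ'')/''n''.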
Bayesian expected loss
In a Bayesian approach, the expectation is calculated using the
posterior distribution
''π''<sup>*</sup> of the parameter ''θ'':
:<math>\rho(\pi^*, a) = \int_\Theta L(\theta, a) \, \mathrm{d}\pi^*(\theta).</math>
One should then choose the action ''a''<sup>*</sup> which minimises the expected loss. Although this will result in choosing the same action as would be chosen using the frequentist risk, the emphasis of the Bayesian approach is that one is only interested in choosing the optimal action under the actual observed data, whereas choosing the actual frequentist optimal decision rule, which is a function of all possible observations, is a much more difficult problem.
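A minimal sketch of choosing the action with the smallest posterior expected loss follows. It assumes a hypothetical discrete posterior over three parameter values, a small set of candidate actions, and squared-error loss; none of these come from the text above.

```python
# Hypothetical posterior: parameter value -> posterior probability (sums to 1).
posterior = {0.2: 0.5, 0.5: 0.3, 0.8: 0.2}
actions = [0.2, 0.41, 0.5, 0.8]  # candidate estimates of theta

def squared_loss(theta, action):
    return (theta - action) ** 2

def posterior_expected_loss(action, posterior, loss):
    """Discrete analogue of the integral of L(theta, a) against the posterior."""
    return sum(p * loss(theta, action) for theta, p in posterior.items())

best_action = min(
    actions, key=lambda a: posterior_expected_loss(a, posterior, squared_loss)
)
posterior_mean = sum(theta * p for theta, p in posterior.items())
```

With squared-error loss the selected action coincides with the posterior mean (0.41 here), which is why that value was included among the candidates.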
Examples in statistics
* For a scalar parameter ''θ'', a decision function whose output <math>\hat\theta</math> is an estimate of ''θ'', and a quadratic loss function (
squared error loss) <math>L(\theta, \hat\theta) = (\theta - \hat\theta)^2,</math>
the risk function becomes the
mean squared error
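of the estimate, <math>R(\theta, \hat\theta) = \operatorname{E}_\theta \left[ (\theta - \hat\theta)^2 \right].</math>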
An estimator found by minimizing the mean squared error estimates the posterior distribution's mean.
* In
density estimation
, the unknown parameter is
probability density
itself. The loss function is typically chosen to be a
norm
in an appropriate
function space. For example, for
''L''<sup>2</sup> norm, <math>L(f, \hat f) = \|f - \hat f\|_2^2,</math>
the risk function becomes the
mean integrated squared error, <math>R(f, \hat f) = \operatorname{E} \left( \|f - \hat f\|_2^2 \right).</math>
Economic choice under uncertainty
In economics, decision-making under uncertainty is often modelled using the
von Neumann–Morgenstern utility function
of the uncertain variable of interest, such as end-of-period wealth. Since the value of this variable is uncertain, so is the value of the utility function; it is the expected value of utility that is maximized.
Decision rules
A
decision rule
makes a choice using an optimality criterion. Some commonly used criteria are:
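* Minimax: Choose the decision rule with the lowest worst loss, that is, minimize the worst-case (maximum possible) loss: <math>\underset{\delta}{\operatorname{arg\,min}} \ \max_{\theta \in \Theta} \ R(\theta, \delta).</math>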
* Invariance: Choose the decision rule which satisfies an invariance requirement.
* Choose the decision rule with the lowest average loss (i.e. minimize the expected value of the loss function): <math>\underset{\delta}{\operatorname{arg\,min}} \operatorname{E}_{\theta \in \Theta} [R(\theta, \delta)] = \underset{\delta}{\operatorname{arg\,min}} \ \int_{\theta \in \Theta} R(\theta, \delta) \, p(\theta) \, d\theta.</math>
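The minimax and average-loss criteria can select different rules from the same risk table. The sketch below uses a hypothetical table of risks for two decision rules under two states of nature, together with an assumed prior over those states for the average-loss criterion.

```python
# Hypothetical risks R(theta, delta) for two decision rules and two states.
risk = {
    "rule_A": {"theta1": 1.0, "theta2": 4.0},
    "rule_B": {"theta1": 3.0, "theta2": 3.0},
}
# Assumed prior weights p(theta) used for the average-loss criterion.
prior = {"theta1": 0.9, "theta2": 0.1}

# Minimax: smallest worst-case risk.
minimax_rule = min(risk, key=lambda d: max(risk[d].values()))

# Lowest average loss: smallest prior-weighted risk.
average_rule = min(risk, key=lambda d: sum(prior[t] * risk[d][t] for t in prior))
```

Here the minimax criterion picks `rule_B` (worst case 3 rather than 4), while the average-loss criterion picks `rule_A` (weighted risk 1.3 rather than 3.0), because the prior puts most weight on the state where `rule_A` does well.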
Selecting a loss function
Sound statistical practice requires selecting an estimator consistent with the actual acceptable variation experienced in the context of a particular applied problem. Thus, in the applied use of loss functions, selecting which statistical method to use to model an applied problem depends on knowing the losses that will be experienced from being wrong under the problem's particular circumstances.
A common example involves estimating "
location
". Under typical statistical assumptions, the
mean
or average is the statistic for estimating location that minimizes the expected loss experienced under the
squared-error loss function, while the
median is the estimator that minimizes expected loss experienced under the absolute-difference loss function. Still different estimators would be optimal under other, less common circumstances.
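The link between the loss function and the resulting location estimate can be checked numerically on a sample. The sketch below uses a made-up data set and a simple grid search (an illustrative choice, not an efficient method) to find the value that minimizes the summed squared loss and the value that minimizes the summed absolute loss.

```python
from statistics import mean, median

data = [1.0, 2.0, 3.0, 4.0, 100.0]        # hypothetical sample with one large value
grid = [i / 10 for i in range(0, 1001)]   # candidate locations 0.0, 0.1, ..., 100.0

# Candidate minimizing the total squared-error loss.
best_squared = min(grid, key=lambda c: sum((x - c) ** 2 for x in data))

# Candidate minimizing the total absolute-difference loss.
best_absolute = min(grid, key=lambda c: sum(abs(x - c) for x in data))
```

For this sample the squared-loss minimizer equals the sample mean (22.0) and the absolute-loss minimizer equals the sample median (3.0).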
In economics, when an agent is
risk neutral
, the objective function is simply expressed as the expected value of a monetary quantity, such as profit, income, or end-of-period wealth. For
risk-averse
or
risk-loving
agents, loss is measured as the negative of a
utility function
, and the objective function to be optimized is the expected value of utility.
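The difference between the two objectives can be sketched with a simple lottery. The example below assumes a logarithmic utility function as one possible model of a risk-averse agent; the wealth figures and names are hypothetical.

```python
import math

# Lotteries as lists of (probability, end-of-period wealth); values are made up.
gamble = [(0.5, 50.0), (0.5, 150.0)]
sure_thing = [(1.0, 95.0)]

def expected(lottery, utility=lambda w: w):
    """Expected utility; with the default identity utility this is expected wealth."""
    return sum(p * utility(w) for p, w in lottery)

# Risk-neutral agent: compares expected wealth (100 versus 95).
risk_neutral_prefers_gamble = expected(gamble) > expected(sure_thing)

# Risk-averse agent (assumed log utility): compares expected utility.
risk_averse_prefers_gamble = expected(gamble, math.log) > expected(sure_thing, math.log)
```

The risk-neutral agent takes the gamble because its expected wealth is higher, while the agent with log utility prefers the certain amount.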
Other measures of cost are possible, for example
mortality or
morbidity
in the field of
public health
or
safety engineering
.
For most
optimization algorithm
s, it is desirable to have a loss function that is globally
continuous
and
differentiable
.
Two very commonly used loss functions are the
squared loss, <math>L(a) = a^2</math>, and the
absolute loss, <math>L(a) = |a|</math>. However, the absolute loss has the disadvantage that it is not differentiable at <math>a = 0</math>. The squared loss has the disadvantage that it tends to be dominated by
outliers: when summing over a set of ''a'' values (as in <math>\sum_{i=1}^n L(a_i)</math>), the final sum tends to be the result of a few particularly large ''a''-values, rather than an expression of the average ''a''-value.
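The effect of an outlier on the two losses can be seen by computing what share of the total loss a single large error contributes. The error values below are invented for illustration.

```python
# Hypothetical errors; the last one is an outlier.
errors = [1.0, -1.0, 2.0, -2.0, 20.0]

squared = [a ** 2 for a in errors]
absolute = [abs(a) for a in errors]

# Share of the total loss contributed by the outlier under each loss.
outlier_share_squared = squared[-1] / sum(squared)      # 400 / 410
outlier_share_absolute = absolute[-1] / sum(absolute)   # 20 / 26
```

Under squared loss the outlier accounts for roughly 98% of the total, compared with roughly 77% under absolute loss.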
The choice of a loss function is not arbitrary. It is very restrictive and sometimes the loss function may be characterized by its desirable properties. Among the choice principles are, for example, the requirement of completeness of the class of symmetric statistics in the case of
i.i.d.
observations, the principle of complete information, and some others.
W. Edwards Deming
and
Nassim Nicholas Taleb argue that empirical reality, not nice mathematical properties, should be the sole basis for selecting loss functions, and real losses often are not mathematically nice and are not differentiable, continuous, symmetric, etc. For example, a person who arrives before a plane gate closure can still make the plane, but a person who arrives after cannot, a discontinuity and asymmetry which makes arriving slightly late much more costly than arriving slightly early. In drug dosing, the cost of too little drug may be lack of efficacy, while the cost of too much may be tolerable toxicity, another example of asymmetry. Traffic, pipes, beams, ecologies, climates, etc. may tolerate increased load or stress with little noticeable change up to a point, then become backed up or break catastrophically. These situations, Deming and Taleb argue, are common in real-life problems, perhaps more common than classical smooth, continuous, symmetric, differentiable cases.
See also
* Bayesian regret
* Loss functions for classification
* Discounted maximum loss
* Hinge loss
* Scoring rule
* Statistical risk