In statistics and machine learning, the bias–variance tradeoff is the property of a model that the variance of the parameter estimated across samples can be reduced by increasing the bias in the estimated parameters.
The bias–variance dilemma or bias–variance problem is the conflict in trying to simultaneously minimize these two sources of error that prevent supervised learning algorithms from generalizing beyond their training set:
* The ''bias'' error is an error from erroneous assumptions in the learning algorithm. High bias can cause an algorithm to miss the relevant relations between features and target outputs (underfitting).
* The ''variance'' is an error from sensitivity to small fluctuations in the training set. High variance may result from an algorithm modeling the random noise in the training data (overfitting).
The bias–variance decomposition is a way of analyzing a learning algorithm's expected generalization error with respect to a particular problem as a sum of three terms: the squared bias, the variance, and a quantity called the ''irreducible error'', resulting from noise in the problem itself.
Motivation
[Figure panels: (1) bias low, variance low; (2) bias high, variance low; (3) bias low, variance high; (4) bias high, variance high.]
The bias–variance tradeoff is a central problem in supervised learning. Ideally, one wants to choose a model that both accurately captures the regularities in its training data and generalizes well to unseen data. Unfortunately, it is typically impossible to do both simultaneously. High-variance learning methods may be able to represent their training set well but are at risk of overfitting to noisy or unrepresentative training data. In contrast, algorithms with high bias typically produce simpler models that may fail to capture important regularities (i.e. underfit) in the data.
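The contrast between underfitting and overfitting can be sketched numerically. The following is a minimal illustration rather than anything taken from the cited literature: it assumes a hypothetical quadratic ground truth `true_f`, Gaussian noise, and polynomial models fitted with NumPy's `polyfit`. All names and settings are illustrative choices.

```python
import numpy as np

def true_f(x):
    # Hypothetical ground-truth relationship (an assumption for this sketch).
    return x ** 2

def average_errors(degree, n_repeats=200, n_train=15, n_test=200, sigma=0.2, seed=0):
    """Average train and test MSE of a polynomial of the given degree,
    taken over many independently drawn training sets."""
    rng = np.random.default_rng(seed)  # same seed => same data for every degree
    train_mse, test_mse = [], []
    for _ in range(n_repeats):
        x_tr = rng.uniform(-1, 1, n_train)
        y_tr = true_f(x_tr) + rng.normal(0, sigma, n_train)
        x_te = rng.uniform(-1, 1, n_test)
        y_te = true_f(x_te) + rng.normal(0, sigma, n_test)
        coeffs = np.polyfit(x_tr, y_tr, degree)  # least-squares polynomial fit
        train_mse.append(np.mean((np.polyval(coeffs, x_tr) - y_tr) ** 2))
        test_mse.append(np.mean((np.polyval(coeffs, x_te) - y_te) ** 2))
    return float(np.mean(train_mse)), float(np.mean(test_mse))

# Degree 1 is too rigid, degree 2 matches the assumed truth, degree 10 is very flexible.
results = {d: average_errors(d) for d in (1, 2, 10)}
for d, (tr, te) in results.items():
    print(f"degree {d:2d}: train MSE = {tr:.4f}, test MSE = {te:.4f}")
```

Under these assumptions the training error can only shrink as the degree grows, whereas the test error is expected to be smallest for the model whose flexibility matches the data.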
It is a common fallacy to assume that complex models must have high variance. High-variance models are 'complex' in some sense, but the reverse need not be true.
In addition, one has to be careful how to define complexity. In particular, the number of parameters used to describe the model is a poor measure of complexity. This is illustrated by an example adapted from the literature: the model <math>f_{a,b}(x) = a\sin(bx)</math> has only two parameters (<math>a, b</math>) but it can interpolate any number of points by oscillating with a high enough frequency, resulting in both a high bias and high variance.
An analogy can be made to the relationship between accuracy and precision. Accuracy is a description of bias and can intuitively be improved by selecting from only local information. Consequently, a sample will appear accurate (i.e. have low bias) under the aforementioned selection conditions, but may result in underfitting. In other words, test data may not agree as closely with training data, which would indicate imprecision and therefore inflated variance. A graphical example would be a straight line fit to data exhibiting quadratic behavior overall.

Precision is a description of variance and generally can only be improved by selecting information from a comparatively larger space. The option to select many data points over a broad sample space is the ideal condition for any analysis. However, intrinsic constraints (whether physical, theoretical, computational, etc.) will always play a limiting role. The limiting case where only a finite number of data points are selected over a broad sample space may result in improved precision and lower variance overall, but may also result in an overreliance on the training data (overfitting). This means that test data would also not agree as closely with the training data, but in this case the cause is inaccuracy or high bias. To borrow from the previous example, the graphical representation would appear as a high-order polynomial fit to the same data exhibiting quadratic behavior. Note that error in each case is measured the same way, but the reason ascribed to the error differs depending on the balance between bias and variance. To mitigate how much information is used from neighboring observations, a model can be smoothed via explicit regularization, such as shrinkage.
Bias–variance decomposition of mean squared error
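Suppose that we have a training set consisting of a set of points <math>x_1, \dots, x_n</math> and real values <math>y_i</math> associated with each point <math>x_i</math>. We assume that there is a function <math>f(x)</math> such that <math>y = f(x) + \varepsilon</math>, where the noise, <math>\varepsilon</math>, has zero mean and variance <math>\sigma^2</math>.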
We want to find a function <math>\hat{f}(x; D)</math> that approximates the true function <math>f(x)</math> as well as possible, by means of some learning algorithm based on a training dataset (sample) <math>D = \{(x_1, y_1), \dots, (x_n, y_n)\}</math>. We make "as well as possible" precise by measuring the mean squared error between <math>y</math> and <math>\hat{f}(x; D)</math>: we want <math>(y - \hat{f}(x; D))^2</math> to be minimal, both for <math>x_1, \dots, x_n</math> ''and for points outside of our sample''. Of course, we cannot hope to do so perfectly, since the <math>y_i</math> contain noise <math>\varepsilon</math>; this means we must be prepared to accept an ''irreducible error'' in any function we come up with.
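Finding an <math>\hat{f}</math> that generalizes to points outside of the training set can be done with any of the countless algorithms used for supervised learning. It turns out that whichever function <math>\hat{f}</math> we select, we can decompose its expected error on an unseen sample <math>x</math> as follows:
:<math>\operatorname{E}_{D,\varepsilon}\Big[\big(y - \hat{f}(x; D)\big)^2\Big] = \Big(\operatorname{Bias}_D\big[\hat{f}(x; D)\big]\Big)^2 + \operatorname{Var}_D\big[\hat{f}(x; D)\big] + \sigma^2</math>
where
:<math>\operatorname{Bias}_D\big[\hat{f}(x; D)\big] = \operatorname{E}_D\big[\hat{f}(x; D)\big] - f(x)</math>
and
:<math>\operatorname{Var}_D\big[\hat{f}(x; D)\big] = \operatorname{E}_D\Big[\big(\operatorname{E}_D[\hat{f}(x; D)] - \hat{f}(x; D)\big)^2\Big].</math>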
Thus, since <math>\varepsilon</math> and <math>\hat{f}</math> are independent, we can write (abbreviating <math>f = f(x)</math> and <math>\hat{f} = \hat{f}(x; D)</math>, and dropping the subscripts on the expectation operators)
:<math>
\begin{align}
\text{MSE} = \operatorname{E}\big[(y - \hat{f})^2\big]
& = \operatorname{E}\big[(f + \varepsilon - \hat{f})^2\big] \\[5pt]
& = \operatorname{E}\big[(f + \varepsilon - \hat{f} + \operatorname{E}[\hat{f}] - \operatorname{E}[\hat{f}])^2\big] \\[5pt]
& = \operatorname{E}\big[(f - \operatorname{E}[\hat{f}])^2\big] + \operatorname{E}[\varepsilon^2] + \operatorname{E}\big[(\operatorname{E}[\hat{f}] - \hat{f})^2\big] \\
& \quad + 2\operatorname{E}\big[(f - \operatorname{E}[\hat{f}])\varepsilon\big] + 2\operatorname{E}\big[\varepsilon(\operatorname{E}[\hat{f}] - \hat{f})\big] + 2\operatorname{E}\big[(\operatorname{E}[\hat{f}] - \hat{f})(f - \operatorname{E}[\hat{f}])\big] \\[5pt]
& = (f - \operatorname{E}[\hat{f}])^2 + \operatorname{E}[\varepsilon^2] + \operatorname{E}\big[(\operatorname{E}[\hat{f}] - \hat{f})^2\big] \\
& \quad + 2(f - \operatorname{E}[\hat{f}])\operatorname{E}[\varepsilon] + 2\operatorname{E}[\varepsilon]\operatorname{E}\big[\operatorname{E}[\hat{f}] - \hat{f}\big] + 2\operatorname{E}\big[\operatorname{E}[\hat{f}] - \hat{f}\big](f - \operatorname{E}[\hat{f}]) \\[5pt]
& = (f - \operatorname{E}[\hat{f}])^2 + \operatorname{E}[\varepsilon^2] + \operatorname{E}\big[(\operatorname{E}[\hat{f}] - \hat{f})^2\big] \\[5pt]
& = (f - \operatorname{E}[\hat{f}])^2 + \operatorname{Var}[\varepsilon] + \operatorname{Var}\big[\hat{f}\big] \\[5pt]
& = \operatorname{Bias}[\hat{f}]^2 + \operatorname{Var}[\varepsilon] + \operatorname{Var}\big[\hat{f}\big] \\[5pt]
& = \operatorname{Bias}[\hat{f}]^2 + \sigma^2 + \operatorname{Var}\big[\hat{f}\big].
\end{align}
</math>
The three cross terms vanish because <math>\operatorname{E}[\varepsilon] = 0</math> and <math>\operatorname{E}\big[\operatorname{E}[\hat{f}] - \hat{f}\big] = 0</math>.
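Finally, the MSE loss function (or negative log-likelihood) is obtained by taking the expectation value over <math>x \sim P</math>:
:<math>\text{MSE} = \operatorname{E}_x\bigg\{\operatorname{Bias}_D\big[\hat{f}(x; D)\big]^2 + \operatorname{Var}_D\big[\hat{f}(x; D)\big]\bigg\} + \sigma^2.</math>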
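The decomposition can be checked by simulation. The sketch below is an illustrative Monte Carlo experiment under assumed settings (a hypothetical true function `f`, Gaussian noise with standard deviation `sigma`, and a cubic polynomial as the learner). It draws many training sets, estimates the squared bias and the variance of the fitted function on a grid of test points, and compares their sum plus the noise variance with a directly measured expected squared error.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 0.3                        # noise standard deviation (assumed)
n_train, n_trials, degree = 30, 2000, 3

def f(x):
    # Hypothetical true function, chosen only for illustration.
    return np.sin(np.pi * x)

x_test = np.linspace(-1, 1, 50)    # points x at which the error is evaluated
preds = np.empty((n_trials, x_test.size))
sq_err = np.empty((n_trials, x_test.size))

for t in range(n_trials):
    # Draw a fresh training set D and fit f_hat(x; D) by least squares.
    x_tr = rng.uniform(-1, 1, n_train)
    y_tr = f(x_tr) + rng.normal(0, sigma, n_train)
    coeffs = np.polyfit(x_tr, y_tr, degree)
    preds[t] = np.polyval(coeffs, x_test)
    # Fresh noisy targets y = f(x) + eps at the test points.
    y_te = f(x_test) + rng.normal(0, sigma, x_test.size)
    sq_err[t] = (y_te - preds[t]) ** 2

# Averages over training sets (axis 0), then over the test points.
bias_sq = float(np.mean((preds.mean(axis=0) - f(x_test)) ** 2))
variance = float(np.mean(preds.var(axis=0)))
noise = sigma ** 2
direct_mse = float(sq_err.mean())

print(f"bias^2 + variance + sigma^2 = {bias_sq + variance + noise:.4f}")
print(f"directly estimated MSE      = {direct_mse:.4f}")
```

The two printed quantities should agree up to Monte Carlo error; the remaining gap comes only from the finite number of noise draws.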
Approaches
Dimensionality reduction and feature selection can decrease variance by simplifying models. Similarly, a larger training set tends to decrease variance. Adding features (predictors) tends to decrease bias, at the expense of introducing additional variance. Learning algorithms typically have some tunable parameters that control bias and variance; for example,
* Linear and generalized linear models can be regularized to decrease their variance at the cost of increasing their bias.
* In artificial neural networks, the variance increases and the bias decreases as the number of hidden units increases, although this classical assumption has been the subject of recent debate. As in GLMs, regularization is typically applied.
* In ''k''-nearest neighbor models, a high value of ''k'' leads to high bias and low variance (see below).
* In instance-based learning, regularization can be achieved by varying the mixture of prototypes and exemplars.
* In decision trees, the depth of the tree determines the variance. Decision trees are commonly pruned to control variance.
One way of resolving the trade-off is to use mixture models and ensemble learning. For example, boosting combines many "weak" (high bias) models in an ensemble that has lower bias than the individual models, while bagging combines "strong" learners in a way that reduces their variance.
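The variance-reducing effect of bagging can be illustrated with a small simulation. This is a sketch under assumptions that are not part of the article: the "strong" learner is a one-nearest-neighbour regressor, the true function `f` is hypothetical, the training inputs are held fixed, and only the label noise is redrawn between trials. The helper names `one_nn_predict` and `bagged_predict` are made up for this example.

```python
import numpy as np

def f(x):
    # Hypothetical true function (an assumption for this sketch).
    return np.sin(np.pi * x)

def one_nn_predict(x_train, y_train, x_query):
    """1-nearest-neighbour regression: return the label of the closest training point."""
    nearest = np.abs(x_query[:, None] - x_train[None, :]).argmin(axis=1)
    return y_train[nearest]

def bagged_predict(x_train, y_train, x_query, n_models, rng):
    """Average 1-NN predictions over bootstrap resamples of the training set."""
    n = x_train.size
    total = np.zeros(x_query.size)
    for _ in range(n_models):
        boot = rng.integers(0, n, size=n)   # sample indices with replacement
        total += one_nn_predict(x_train[boot], y_train[boot], x_query)
    return total / n_models

rng = np.random.default_rng(0)
sigma, n_trials = 0.5, 300
x_train = np.linspace(-1, 1, 40)            # fixed design
x_query = np.linspace(-0.9, 0.9, 20)

single = np.empty((n_trials, x_query.size))
bagged = np.empty((n_trials, x_query.size))
for t in range(n_trials):
    y_train = f(x_train) + rng.normal(0, sigma, x_train.size)
    single[t] = one_nn_predict(x_train, y_train, x_query)
    bagged[t] = bagged_predict(x_train, y_train, x_query, n_models=25, rng=rng)

# Variance of the prediction across training sets, averaged over query points.
var_single = float(single.var(axis=0).mean())
var_bagged = float(bagged.var(axis=0).mean())
print(f"variance of single 1-NN: {var_single:.4f}")
print(f"variance of bagged 1-NN: {var_bagged:.4f}")
```

In this setup the bagged predictor effectively averages over several nearby labels, so its variance should come out clearly below that of the single learner, possibly at the price of a somewhat larger bias.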
Model validation methods such as cross-validation can be used to tune models so as to optimize the trade-off.
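As a rough illustration of tuning by cross-validation, the following sketch selects a polynomial degree by 5-fold cross-validation on synthetic data. The data-generating function, noise level, candidate degrees, and the helper `k_fold_cv_mse` are all assumptions made for the example.

```python
import numpy as np

def k_fold_cv_mse(x, y, degree, n_folds=5):
    """Mean validation MSE of a polynomial of the given degree under k-fold CV."""
    folds = np.array_split(np.arange(x.size), n_folds)
    fold_mse = []
    for val_idx in folds:
        train_mask = np.ones(x.size, dtype=bool)
        train_mask[val_idx] = False          # hold out the current fold
        coeffs = np.polyfit(x[train_mask], y[train_mask], degree)
        resid = np.polyval(coeffs, x[val_idx]) - y[val_idx]
        fold_mse.append(np.mean(resid ** 2))
    return float(np.mean(fold_mse))

rng = np.random.default_rng(0)
# Inputs are drawn in random order, so contiguous index folds are random folds.
x = rng.uniform(-1, 1, 60)
y = np.sin(np.pi * x) + rng.normal(0, 0.2, x.size)   # hypothetical data

degrees = list(range(0, 9))
cv_errors = {d: k_fold_cv_mse(x, y, d) for d in degrees}
best_degree = min(cv_errors, key=cv_errors.get)

for d in degrees:
    print(f"degree {d}: CV MSE = {cv_errors[d]:.4f}")
print("selected degree:", best_degree)
```

Low degrees are penalized for their bias and high degrees for their variance, so the selected degree is expected to lie in between.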
''k''-nearest neighbors
In the case of ''k''-nearest neighbors regression, when the expectation is taken over the possible labeling of a fixed training set, a closed-form expression exists that relates the bias–variance decomposition to the parameter ''k'':
:<math>\operatorname{E}\left[(y - \hat{f}(x))^2 \mid X = x\right] = \left( f(x) - \frac{1}{k}\sum_{i=1}^k f(N_i(x)) \right)^2 + \frac{\sigma^2}{k} + \sigma^2</math>
where <math>N_1(x), \dots, N_k(x)</math> are the ''k'' nearest neighbors of ''x'' in the training set. The bias (first term) is a monotone rising function of ''k'', while the variance (second term) drops off as ''k'' is increased. In fact, under "reasonable assumptions" the bias of the first-nearest neighbor (1-NN) estimator vanishes entirely as the size of the training set approaches infinity.
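The three terms of this expression can be evaluated directly when the true function is known. The sketch below does so for an assumed setting (a hypothetical `f`, an evenly spaced training design, and a single query point `x0`); the function name `knn_error_terms` is made up for this example.

```python
import numpy as np

def f(x):
    # Hypothetical true function (an assumption for this sketch).
    return np.sin(np.pi * x)

def knn_error_terms(x_train, x0, k, sigma):
    """Return (squared bias, variance, irreducible error) of k-NN regression
    at x0 for a fixed training design and noise variance sigma**2."""
    order = np.argsort(np.abs(x_train - x0))   # training points sorted by distance
    neighbours = x_train[order[:k]]            # N_1(x0), ..., N_k(x0)
    bias_sq = (f(x0) - np.mean(f(neighbours))) ** 2
    variance = sigma ** 2 / k
    return float(bias_sq), float(variance), float(sigma ** 2)

x_train = np.linspace(-1, 1, 41)
x0, sigma = 0.5, 0.3

terms = {k: knn_error_terms(x_train, x0, k, sigma) for k in (1, 5, 15, 41)}
for k, (b, v, s) in terms.items():
    print(f"k = {k:2d}: bias^2 = {b:.4f}, variance = {v:.4f}, total = {b + v + s:.4f}")
```

With these assumed settings the variance term falls as 1/''k'', while averaging over ever more distant neighbours pulls the prediction away from the true value at the query point.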
Applications
In regression
The bias–variance decomposition forms the conceptual basis for regression regularization methods such as Lasso and ridge regression. Regularization methods introduce bias into the regression solution that can reduce variance considerably relative to the ordinary least squares (OLS) solution. Although the OLS solution provides unbiased regression estimates, the lower-variance solutions produced by regularization techniques can provide superior MSE performance.
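The effect can be seen in a small simulation comparing OLS with ridge regression on a strongly correlated design. Everything here is an assumption chosen for illustration: the design matrix, the true coefficients `beta`, the noise level, and the penalty `lam`. Ridge is computed from its closed-form solution, and the bias, variance, and MSE refer to the coefficient estimates.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, sigma, lam, n_trials = 30, 5, 1.0, 1.0, 2000

# Fixed, strongly correlated design: each column is a shared factor plus small noise.
shared = rng.normal(size=(n, 1))
X = shared + 0.1 * rng.normal(size=(n, p))
beta = np.array([1.0, -1.0, 2.0, 0.0, 0.5])   # hypothetical true coefficients

XtX = X.T @ X
ols_est = np.empty((n_trials, p))
ridge_est = np.empty((n_trials, p))
for t in range(n_trials):
    # Redraw only the noise; the design stays fixed.
    y = X @ beta + rng.normal(0, sigma, n)
    Xty = X.T @ y
    ols_est[t] = np.linalg.solve(XtX, Xty)                        # OLS estimate
    ridge_est[t] = np.linalg.solve(XtX + lam * np.eye(p), Xty)    # ridge estimate

def summarize(estimates):
    """Squared bias, variance, and MSE of coefficient estimates (summed over coefficients)."""
    bias_sq = float(np.sum((estimates.mean(axis=0) - beta) ** 2))
    variance = float(np.sum(estimates.var(axis=0)))
    return bias_sq, variance, bias_sq + variance

ols_bias_sq, ols_var, ols_mse = summarize(ols_est)
ridge_bias_sq, ridge_var, ridge_mse = summarize(ridge_est)
print(f"OLS:   bias^2 = {ols_bias_sq:.3f}, variance = {ols_var:.3f}, MSE = {ols_mse:.3f}")
print(f"ridge: bias^2 = {ridge_bias_sq:.3f}, variance = {ridge_var:.3f}, MSE = {ridge_mse:.3f}")
```

Whether ridge wins depends on the penalty and on how ill-conditioned the design is; in this deliberately collinear example the reduction in variance is expected to outweigh the bias it introduces.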
In classification
The bias–variance decomposition was originally formulated for least-squares regression. For the case of classification under the 0-1 loss (misclassification rate), it is possible to find a similar decomposition. Alternatively, if the classification problem can be phrased as probabilistic classification, then the expected squared error of the predicted probabilities with respect to the true probabilities can be decomposed as before.
It has been argued that as training data increases, the variance of learned models will tend to decrease, and hence that as training data quantity increases, error is minimized by methods that learn models with lesser bias, and that conversely, for smaller training data quantities it is ever more important to minimize variance.
In reinforcement learning
Even though the bias–variance decomposition does not directly apply in reinforcement learning, a similar tradeoff can also characterize generalization. When an agent has limited information on its environment, the suboptimality of an RL algorithm can be decomposed into the sum of two terms: a term related to an asymptotic bias and a term due to overfitting. The asymptotic bias is directly related to the learning algorithm (independently of the quantity of data) while the overfitting term comes from the fact that the amount of data is limited.
In human learning
While widely discussed in the context of machine learning, the bias–variance dilemma has been examined in the context of human cognition, most notably by Gerd Gigerenzer and co-workers in the context of learned heuristics. They have argued (see references below) that the human brain resolves the dilemma in the case of the typically sparse, poorly characterised training sets provided by experience by adopting high-bias/low-variance heuristics. This reflects the fact that a zero-bias approach has poor generalisability to new situations, and also unreasonably presumes precise knowledge of the true state of the world. The resulting heuristics are relatively simple, but produce better inferences in a wider variety of situations.
Geman et al. argue that the bias–variance dilemma implies that abilities such as generic object recognition cannot be learned from scratch, but require a certain degree of "hard wiring" that is later tuned by experience. This is because model-free approaches to inference require impractically large training sets if they are to avoid high variance.
See also
* Accuracy and precision
* Bias of an estimator
* Double descent
* Gauss–Markov theorem
* Hyperparameter optimization
* Law of total variance
* Minimum-variance unbiased estimator
* Model selection
* Regression model validation
* Supervised learning
References
External links
* MLU-Explain: The Bias Variance Tradeoff, an interactive visualization of the bias–variance tradeoff in LOESS regression and ''k''-nearest neighbors.