Bias and variance contributing to total error

In statistics and machine learning, the bias–variance tradeoff describes the relationship between a model's complexity, the accuracy of its predictions, and how well it can make predictions on previously unseen data that were not used to train the model. In general, as the number of tunable parameters in a model increases, it becomes more flexible and can better fit a training data set. That is, the model has lower error, or lower bias. However, for more flexible models, there tends to be greater variance in the model fit each time a new set of samples is drawn to create a new training data set. It is said that there is greater variance in the model's estimated parameters.

The bias–variance dilemma or bias–variance problem is the conflict in trying to simultaneously minimize these two sources of error that prevent supervised learning algorithms from generalizing beyond their training set:
* The ''bias'' error is an error from erroneous assumptions in the learning algorithm. High bias can cause an algorithm to miss the relevant relations between features and target outputs (underfitting).
* The ''variance'' is an error from sensitivity to small fluctuations in the training set. High variance may result from an algorithm modeling the random noise in the training data (overfitting).

The bias–variance decomposition is a way of analyzing a learning algorithm's expected generalization error with respect to a particular problem as a sum of three terms: the bias, the variance, and a quantity called the ''irreducible error'', resulting from noise in the problem itself.

Motivation

[Figure: four panels illustrating high bias with low variance, high bias with high variance, low bias with low variance, and low bias with high variance.]

The bias–variance tradeoff is a central problem in supervised learning. Ideally, one wants to choose a model that both accurately captures the regularities in its training data and generalizes well to unseen data. Unfortunately, it is typically impossible to do both simultaneously. High-variance learning methods may be able to represent their training set well but are at risk of overfitting to noisy or unrepresentative training data. In contrast, algorithms with high bias typically produce simpler models that may fail to capture important regularities in the data (i.e. underfit).

It is a common fallacy to assume that complex models must have high variance. High-variance models are "complex" in some sense, but the reverse need not be true. In addition, one has to be careful how to define complexity. In particular, the number of parameters used to describe the model is a poor measure of complexity. This is illustrated by the following adapted example: the model $f_{a,b}(x) = a\sin(bx)$ has only two parameters ($a, b$) but it can interpolate any number of points by oscillating with a high enough frequency, resulting in both a high bias and high variance.

An analogy can be made to the relationship between accuracy and precision. Accuracy is one way of quantifying bias and can intuitively be improved by selecting from only local information. Consequently, a sample will appear accurate (i.e. have low bias) under the aforementioned selection conditions, but may result in underfitting. In other words, test data may not agree as closely with training data, which would indicate imprecision and therefore inflated variance. A graphical example would be a straight line fit to data exhibiting quadratic behavior overall.

Precision is a description of variance and generally can only be improved by selecting information from a comparatively larger space. The option to select many data points over a broad sample space is the ideal condition for any analysis. However, intrinsic constraints (whether physical, theoretical, computational, etc.) will always play a limiting role. The limiting case where only a finite number of data points are selected over a broad sample space may result in improved precision and lower variance overall, but may also result in an overreliance on the training data (overfitting). This means that test data would also not agree as closely with the training data, but in this case the reason is inaccuracy or high bias. To borrow from the previous example, the graphical representation would appear as a high-order polynomial fit to the same data exhibiting quadratic behavior.

Note that error in each case is measured the same way, but the reason ascribed to the error is different depending on the balance between bias and variance. To mitigate how much information is used from neighboring observations, a model can be smoothed via explicit regularization, such as shrinkage.
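The contrast between a straight line and a high-order polynomial fitted to quadratic data can be sketched numerically. The following is a minimal illustration, not a reference implementation: the ground-truth function `true_f`, the noise level `sigma`, the sample sizes, and the polynomial degrees are all assumptions chosen for the example.

```python
import numpy as np
from numpy.polynomial import polynomial as P

def fit_and_score(degree, x_train, y_train, x_test, y_test):
    """Least-squares polynomial fit; returns (train MSE, test MSE)."""
    coeffs = P.polyfit(x_train, y_train, degree)
    train_mse = float(np.mean((y_train - P.polyval(x_train, coeffs)) ** 2))
    test_mse = float(np.mean((y_test - P.polyval(x_test, coeffs)) ** 2))
    return train_mse, test_mse

rng = np.random.default_rng(0)            # fixed seed, for reproducibility only

def true_f(x):
    """Hypothetical quadratic ground truth (an assumption of this sketch)."""
    return 1.0 + 2.0 * x ** 2

sigma = 0.3                               # assumed noise standard deviation

x_train = np.linspace(-1.0, 1.0, 20)
y_train = true_f(x_train) + rng.normal(0.0, sigma, x_train.size)
x_test = rng.uniform(-1.0, 1.0, 500)
y_test = true_f(x_test) + rng.normal(0.0, sigma, x_test.size)

# Degree 1 is too rigid for quadratic data, degree 2 matches it,
# degree 9 has enough freedom to chase the noise in 20 points.
scores = {d: fit_and_score(d, x_train, y_train, x_test, y_test) for d in (1, 2, 9)}
for d, (train_mse, test_mse) in scores.items():
    print(f"degree {d}: train MSE = {train_mse:.3f}, test MSE = {test_mse:.3f}")
```

Training error can only go down as the degree grows, because each polynomial family contains the previous one. The test error is the quantity that reveals whether the extra flexibility helped or hurt.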

Bias–variance decomposition of mean squared error

Suppose that we have a training set consisting of a set of points $x_1, \dots, x_n$ and real-valued labels $y_i$ associated with the points $x_i$. We assume that the data is generated by a function $f(x)$ such that $y = f(x) + \varepsilon$, where the noise, $\varepsilon$, has zero mean and variance $\sigma^2$. That is, $y_i = f(x_i) + \varepsilon_i$, where $\varepsilon_i$ is a noise sample.

We want to find a function $\hat{f}(x;D)$ that approximates the true function $f(x)$ as well as possible, by means of some learning algorithm based on a training dataset (sample) $D = \{(x_1, y_1), \dots, (x_n, y_n)\}$. We make "as well as possible" precise by measuring the mean squared error between $y$ and $\hat{f}(x;D)$: we want $(y - \hat{f}(x;D))^2$ to be minimal, both for $x_1, \dots, x_n$ ''and for points outside of our sample''. Of course, we cannot hope to do so perfectly, since the $y_i$ contain noise $\varepsilon$; this means we must be prepared to accept an ''irreducible error'' in any function we come up with.

Finding an $\hat{f}$ that generalizes to points outside of the training set can be done with any of the countless algorithms used for supervised learning. It turns out that whichever function $\hat{f}$ we select, we can decompose its expected error on an unseen sample $x$ (''i.e. conditional on x'') as follows:

$$\mathbb{E}_{D,\varepsilon}\Big[\big(y - \hat{f}(x;D)\big)^2\Big] = \Big(\operatorname{Bias}_D\big[\hat{f}(x;D)\big]\Big)^2 + \operatorname{Var}_D\big[\hat{f}(x;D)\big] + \sigma^2$$

where

$$\begin{aligned}
\operatorname{Bias}_D\big[\hat{f}(x;D)\big] &= \mathbb{E}_D\big[\hat{f}(x;D) - f(x)\big] \\
&= \mathbb{E}_D\big[\hat{f}(x;D)\big] - f(x) \\
&= \mathbb{E}_D\big[\hat{f}(x;D)\big] - \mathbb{E}_{y|x}\big[y(x)\big]
\end{aligned}$$

and

$$\operatorname{Var}_D\big[\hat{f}(x;D)\big] = \mathbb{E}_D\Big[\big(\mathbb{E}_D[\hat{f}(x;D)] - \hat{f}(x;D)\big)^2\Big]$$

and

$$\sigma^2 = \mathbb{E}_y\Big[\big(y - \underbrace{f(x)}_{\mathbb{E}_{y|x}[y]}\big)^2\Big]$$

The expectation ranges over different choices of the training set $D = \{(x_1, y_1), \dots, (x_n, y_n)\}$, all sampled from the same joint distribution $P(x,y)$, which can in practice be approximated, for example, via bootstrapping. The three terms represent:
* the square of the ''bias'' of the learning method, which can be thought of as the error caused by the simplifying assumptions built into the method. E.g., when approximating a non-linear function $f(x)$ using a learning method for linear models, there will be error in the estimates $\hat{f}(x)$ due to this assumption;
* the ''variance'' of the learning method, or, intuitively, how much the learning method $\hat{f}(x)$ will move around its mean;
* the irreducible error $\sigma^2$.

Since all three terms are non-negative, the irreducible error forms a lower bound on the expected error on unseen samples.

The more complex the model $\hat{f}(x)$ is, the more data points it will capture, and the lower the bias will be. However, complexity will make the model "move" more to capture the data points, and hence its variance will be larger.
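The three terms of the decomposition can be estimated by simulation when the data-generating process is known. The sketch below is only an illustration under stated assumptions: the target `true_f`, the noise level `sigma`, the uniform input distribution, and the use of polynomial least squares as the learning method are all hypothetical choices, and the expectation over $D$ is approximated by drawing many independent training sets.

```python
import numpy as np
from numpy.polynomial import polynomial as P

def decompose(degree, true_f, sigma, x0, n_train=30, n_sets=500, seed=0):
    """Monte Carlo estimate of bias^2 and variance of a polynomial fit at x0."""
    rng = np.random.default_rng(seed)
    preds = np.empty((n_sets, x0.size))
    for s in range(n_sets):
        # Draw a fresh training set D from the same distribution each time.
        x = rng.uniform(-1.0, 1.0, n_train)
        y = true_f(x) + rng.normal(0.0, sigma, n_train)
        preds[s] = P.polyval(x0, P.polyfit(x, y, degree))
    mean_pred = preds.mean(axis=0)                 # approximates E_D[f_hat(x; D)]
    bias_sq = (mean_pred - true_f(x0)) ** 2        # squared bias at each x0
    variance = preds.var(axis=0)                   # approximates Var_D[f_hat(x; D)]
    err_vs_f = ((preds - true_f(x0)) ** 2).mean(axis=0)  # E_D[(f_hat - f)^2]
    return bias_sq, variance, err_vs_f

def true_f(x):
    """Hypothetical smooth target (an assumption of this sketch)."""
    return np.sin(np.pi * x)

sigma = 0.3                                        # assumed noise standard deviation
x0 = np.linspace(-0.9, 0.9, 7)                     # unseen query points

results = {d: decompose(d, true_f, sigma, x0) for d in (1, 3, 7)}
for d, (bias_sq, variance, _) in results.items():
    total = bias_sq.mean() + variance.mean() + sigma ** 2
    print(f"degree {d}: bias^2 = {bias_sq.mean():.4f}, "
          f"variance = {variance.mean():.4f}, expected error = {total:.4f}")
```

The returned `err_vs_f` equals `bias_sq + variance` up to floating-point error, which is the empirical counterpart of the identity above before the noise term $\sigma^2$ is added.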

Derivation

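The derivation of the bias–variance decomposition for squared error proceeds as follows. For convenience, we drop the $D$ subscript in the following lines, such that $\hat{f}(x;D) = \hat{f}(x)$.

Let us write the mean squared error of our model:

$$\begin{aligned}
\text{MSE} &\triangleq \mathbb{E}\Big[\big(y - \hat{f}(x)\big)^2\Big] \\
&= \mathbb{E}\Big[\big(f(x) + \varepsilon - \hat{f}(x)\big)^2\Big] && \text{since } y \triangleq f(x) + \varepsilon \\
&= \mathbb{E}\Big[\big(f(x) - \hat{f}(x)\big)^2\Big] + 2\,\mathbb{E}\Big[\big(f(x) - \hat{f}(x)\big)\varepsilon\Big] + \mathbb{E}\big[\varepsilon^2\big]
\end{aligned}$$

We can show that the second term of this equation is null:

$$\begin{aligned}
\mathbb{E}\Big[\big(f(x) - \hat{f}(x)\big)\varepsilon\Big] &= \mathbb{E}\big[f(x) - \hat{f}(x)\big]\,\mathbb{E}\big[\varepsilon\big] && \text{since the noise at the unseen point is independent of } \hat{f}(x) \\
&= 0 && \text{since } \mathbb{E}\big[\varepsilon\big] = 0
\end{aligned}$$

Moreover, the third term of this equation is nothing but $\sigma^2$, the variance of $\varepsilon$.

Let us now expand the remaining term:

$$\begin{aligned}
\mathbb{E}\Big[\big(f(x) - \hat{f}(x)\big)^2\Big] &= \mathbb{E}\Big[\big(f(x) - \mathbb{E}[\hat{f}(x)] + \mathbb{E}[\hat{f}(x)] - \hat{f}(x)\big)^2\Big] \\
&= \mathbb{E}\Big[\big(f(x) - \mathbb{E}[\hat{f}(x)]\big)^2\Big] + 2\,\mathbb{E}\Big[\big(f(x) - \mathbb{E}[\hat{f}(x)]\big)\big(\mathbb{E}[\hat{f}(x)] - \hat{f}(x)\big)\Big] + \mathbb{E}\Big[\big(\mathbb{E}[\hat{f}(x)] - \hat{f}(x)\big)^2\Big]
\end{aligned}$$

We show that:

$$\begin{aligned}
\mathbb{E}\Big[\big(f(x) - \mathbb{E}[\hat{f}(x)]\big)^2\Big] &= \mathbb{E}\big[f(x)^2\big] - 2\,\mathbb{E}\Big[f(x)\,\mathbb{E}[\hat{f}(x)]\Big] + \mathbb{E}\Big[\mathbb{E}[\hat{f}(x)]^2\Big] \\
&= f(x)^2 - 2\,f(x)\,\mathbb{E}[\hat{f}(x)] + \mathbb{E}[\hat{f}(x)]^2 \\
&= \Big(f(x) - \mathbb{E}[\hat{f}(x)]\Big)^2
\end{aligned}$$

This last series of equalities comes from the fact that $f(x)$ is not a random variable, but a fixed, deterministic function of $x$. Therefore, $\mathbb{E}\big[f(x)\big] = f(x)$. Similarly $\mathbb{E}\big[f(x)^2\big] = f(x)^2$, and $\mathbb{E}\Big[f(x)\,\mathbb{E}[\hat{f}(x)]\Big] = f(x)\,\mathbb{E}\Big[\mathbb{E}[\hat{f}(x)]\Big] = f(x)\,\mathbb{E}\big[\hat{f}(x)\big]$. Using the same reasoning, we can expand the second term and show that it is null:

$$\begin{aligned}
\mathbb{E}\Big[\big(f(x) - \mathbb{E}[\hat{f}(x)]\big)\big(\mathbb{E}[\hat{f}(x)] - \hat{f}(x)\big)\Big] &= \mathbb{E}\Big[f(x)\,\mathbb{E}[\hat{f}(x)] - f(x)\,\hat{f}(x) - \mathbb{E}[\hat{f}(x)]^2 + \mathbb{E}[\hat{f}(x)]\,\hat{f}(x)\Big] \\
&= f(x)\,\mathbb{E}[\hat{f}(x)] - f(x)\,\mathbb{E}[\hat{f}(x)] - \mathbb{E}[\hat{f}(x)]^2 + \mathbb{E}[\hat{f}(x)]^2 \\
&= 0
\end{aligned}$$

Finally, we plug our derivations back into the original equation, and identify each term:

$$\begin{aligned}
\text{MSE} &= \Big(f(x) - \mathbb{E}[\hat{f}(x)]\Big)^2 + \mathbb{E}\Big[\big(\mathbb{E}[\hat{f}(x)] - \hat{f}(x)\big)^2\Big] + \sigma^2 \\
&= \operatorname{Bias}\big[\hat{f}(x)\big]^2 \, + \, \operatorname{Var}\big[\hat{f}(x)\big] \, + \, \sigma^2
\end{aligned}$$

The MSE loss function (or negative log-likelihood) is then obtained by taking the expectation value over $x \sim P$:

$$\text{MSE} = \mathbb{E}_x\bigg\{\operatorname{Bias}_D\big[\hat{f}(x;D)\big]^2 + \operatorname{Var}_D\big[\hat{f}(x;D)\big]\bigg\} + \sigma^2.$$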
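As a worked instance of the result, consider a deliberately simple setting that is not taken from the text above and is assumed purely for illustration: the true function is a constant, $f(x) = \mu$, the $n$ training noise samples are independent, and the estimator is a shrunken sample mean $\hat{f} = c\,\bar{y}$ with a hypothetical shrinkage factor $c \in [0, 1]$.

```latex
% Assumed setting (illustrative only): f(x) = \mu, y_i = \mu + \varepsilon_i with
% independent noise, \bar{y} = \frac{1}{n}\sum_i y_i, and \hat{f} = c\,\bar{y}.
\begin{align}
\operatorname{Bias}\big[\hat{f}\big] &= \mathbb{E}\big[c\,\bar{y}\big] - \mu = (c - 1)\,\mu, \\
\operatorname{Var}\big[\hat{f}\big]  &= c^2 \operatorname{Var}\big[\bar{y}\big] = \frac{c^2 \sigma^2}{n}, \\
\text{MSE} &= (c - 1)^2 \mu^2 + \frac{c^2 \sigma^2}{n} + \sigma^2 .
\end{align}
% Setting the derivative with respect to c to zero:
%   2 (c - 1) \mu^2 + 2 c \sigma^2 / n = 0
%   \Rightarrow c^{*} = \frac{\mu^2}{\mu^2 + \sigma^2 / n} \le 1 .
```

With $c = 1$ the estimator is unbiased, yet for any finite $n$ and $\sigma^2 > 0$ the minimizing factor $c^{*}$ is strictly below 1: accepting a small bias buys a larger reduction in variance.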

Approaches

Dimensionality reduction and feature selection can decrease variance by simplifying models. Similarly, a larger training set tends to decrease variance. Adding features (predictors) tends to decrease bias, at the expense of introducing additional variance. Learning algorithms typically have some tunable parameters that control bias and variance; for example,
* Linear and generalized linear models can be regularized to decrease their variance at the cost of increasing their bias.
* In artificial neural networks, the variance increases and the bias decreases as the number of hidden units increases, although this classical assumption has been the subject of recent debate. As in GLMs, regularization is typically applied.
* In ''k''-nearest neighbor models, a high value of ''k'' leads to high bias and low variance (see below).
* In instance-based learning, regularization can be achieved by varying the mixture of prototypes and exemplars.
* In decision trees, the depth of the tree determines the variance. Decision trees are commonly pruned to control variance.

One way of resolving the trade-off is to use mixture models and ensemble learning. For example, boosting combines many "weak" (high bias) models in an ensemble that has lower bias than the individual models, while bagging combines "strong" learners in a way that reduces their variance.

Model validation methods such as cross-validation can be used to tune models so as to optimize the trade-off.
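Tuning a complexity parameter by cross-validation can be sketched as follows. This is a minimal hand-rolled example, not any particular library's procedure: the quadratic data-generating process, the noise level, the fold count, and the use of polynomial degree as the tunable parameter are assumptions made for illustration.

```python
import numpy as np
from numpy.polynomial import polynomial as P

def kfold_cv_mse(x, y, degree, n_folds=5):
    """Mean held-out squared error of a polynomial fit under k-fold CV."""
    indices = np.arange(x.size)
    fold_errors = []
    for held_out in np.array_split(indices, n_folds):
        train = np.setdiff1d(indices, held_out)
        coeffs = P.polyfit(x[train], y[train], degree)
        pred = P.polyval(x[held_out], coeffs)
        fold_errors.append(np.mean((y[held_out] - pred) ** 2))
    return float(np.mean(fold_errors))

rng = np.random.default_rng(1)
# Inputs are drawn in random order, so contiguous index folds are random subsets.
x = rng.uniform(-1.0, 1.0, 60)
y = x ** 2 + rng.normal(0.0, 0.1, x.size)   # hypothetical quadratic process

cv_scores = {d: kfold_cv_mse(x, y, d) for d in range(1, 9)}
best_degree = min(cv_scores, key=cv_scores.get)
print("CV error by degree:", {d: round(v, 4) for d, v in cv_scores.items()})
print("selected degree:", best_degree)
```

The held-out error stands in for the expected error on unseen data, so the selected degree is the one whose combined bias and variance is estimated to be smallest for this sample size.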

''k''-nearest neighbors

In the case of ''k''-nearest neighbors regression, when the expectation is taken over the possible labeling of a fixed training set, a closed-form expression exists that relates the bias–variance decomposition to the parameter ''k'':

$$\mathbb{E}\left[\big(y - \hat{f}(x)\big)^2 \mid X = x\right] = \left( f(x) - \frac{1}{k}\sum_{i=1}^k f(N_i(x)) \right)^2 + \frac{\sigma^2}{k} + \sigma^2$$

where $N_1(x), \dots, N_k(x)$ are the ''k'' nearest neighbors of ''x'' in the training set. The bias (first term) is a monotone rising function of ''k'', while the variance (second term) drops off as ''k'' is increased. In fact, under "reasonable assumptions" the bias of the first-nearest neighbor (1-NN) estimator vanishes entirely as the size of the training set approaches infinity.
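The closed-form expression can be evaluated directly when the true function is known. The sketch below assumes one-dimensional inputs, a hypothetical target `f`, an assumed noise level `sigma`, and an evenly spaced training grid; the helper name `knn_error_terms` is introduced here for illustration only.

```python
import numpy as np

def knn_error_terms(x0, x_train, f, sigma, k):
    """Squared bias, variance and noise of k-NN regression at query x0."""
    order = np.argsort(np.abs(x_train - x0))     # neighbours by distance (1-D inputs)
    neighbours = x_train[order[:k]]
    bias_sq = (f(x0) - np.mean(f(neighbours))) ** 2
    variance = sigma ** 2 / k                     # averaging k independent noise terms
    return float(bias_sq), float(variance), float(sigma ** 2)

def f(x):
    """Hypothetical true function (an assumption of this sketch)."""
    return x ** 2

sigma = 0.2                                       # assumed noise standard deviation
x_train = np.linspace(0.0, 1.0, 51)               # fixed training inputs
x0 = 0.5                                          # query point

knn_terms = {k: knn_error_terms(x0, x_train, f, sigma, k) for k in (1, 5, 21, 51)}
for k, (bias_sq, variance, noise) in knn_terms.items():
    print(f"k = {k:2d}: bias^2 = {bias_sq:.5f}, variance = {variance:.5f}, "
          f"total = {bias_sq + variance + noise:.5f}")
```

Small ''k'' keeps the neighbours close to the query, so the bias term stays small while the variance term $\sigma^2 / k$ is large; large ''k'' averages over distant points and reverses the balance.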

Applications

In regression

The bias–variance decomposition forms the conceptual basis for regression regularization methods such as LASSO and ridge regression. Regularization methods introduce bias into the regression solution that can reduce variance considerably relative to the ordinary least squares (OLS) solution. Although the OLS solution provides unbiased regression estimates, the lower-variance solutions produced by regularization techniques can provide superior MSE performance.
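The effect of regularization on coefficient estimates can be illustrated with the closed-form ridge solution. The sketch below is an assumption-laden toy: the design matrix, the true coefficients `beta_true`, the unit noise level, and the penalty `lam = 10.0` are all hypothetical, and both estimators are fitted on the same simulated noise draws so that they are compared on identical data.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge solution (X^T X + lam I)^{-1} X^T y; lam = 0 gives OLS."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_features), X.T @ y)

rng = np.random.default_rng(2)
beta_true = np.array([1.0, -2.0, 0.5])            # hypothetical true coefficients
X = rng.normal(size=(40, 3))                      # fixed design matrix
noise = rng.normal(0.0, 1.0, size=(300, X.shape[0]))  # shared noise draws

def coefficient_stats(lam):
    """Empirical mean and variance of fitted coefficients over the noise draws."""
    fits = np.array([ridge_fit(X, X @ beta_true + e, lam) for e in noise])
    return fits.mean(axis=0), fits.var(axis=0)

ols_mean, ols_var = coefficient_stats(lam=0.0)
ridge_mean, ridge_var = coefficient_stats(lam=10.0)

print("true coefficients:", beta_true)
print("OLS   mean:", ols_mean.round(3), " total variance:", ols_var.sum().round(4))
print("ridge mean:", ridge_mean.round(3), " total variance:", ridge_var.sum().round(4))
```

The OLS averages sit close to `beta_true`, while the ridge averages are pulled toward zero (bias) and scatter less from one noise draw to the next (lower variance). Whether the trade pays off in MSE depends on the penalty and the signal-to-noise ratio.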

In classification

The bias–variance decomposition was originally formulated for least-squares regression. For the case of classification under the 0-1 loss (misclassification rate), it is possible to find a similar decomposition, with the caveat that the variance term becomes dependent on the target label. Alternatively, if the classification problem can be phrased as probabilistic classification, then the expected cross-entropy can instead be decomposed to give bias and variance terms with the same semantics but taking a different form.

It has been argued that as training data increases, the variance of learned models will tend to decrease, and hence that as training data quantity increases, error is minimised by methods that learn models with lesser bias, and that conversely, for smaller training data quantities it is ever more important to minimise variance.

In reinforcement learning

Even though the bias–variance decomposition does not directly apply in reinforcement learning, a similar tradeoff can also characterize generalization. When an agent has limited information on its environment, the suboptimality of an RL algorithm can be decomposed into the sum of two terms: a term related to an asymptotic bias and a term due to overfitting. The asymptotic bias is directly related to the learning algorithm (independently of the quantity of data) while the overfitting term comes from the fact that the amount of data is limited.

In Monte Carlo methods

While in traditional Monte Carlo methods the bias is typically zero, modern approaches, such as Markov chain Monte Carlo, are only asymptotically unbiased, at best. Convergence diagnostics can be used to control bias via burn-in removal, but due to a limited computational budget, a bias–variance trade-off arises. This leads to a wide range of approaches in which a controlled bias is accepted if it allows a dramatic reduction in the variance, and hence in the overall estimation error.

In human learning

While widely discussed in the context of machine learning, the bias–variance dilemma has been examined in the context of human cognition, most notably by Gerd Gigerenzer and co-workers in the context of learned heuristics. They have argued that the human brain resolves the dilemma in the case of the typically sparse, poorly characterized training sets provided by experience by adopting high-bias/low-variance heuristics. This reflects the fact that a zero-bias approach has poor generalizability to new situations, and also unreasonably presumes precise knowledge of the true state of the world. The resulting heuristics are relatively simple, but produce better inferences in a wider variety of situations.

Geman et al. argue that the bias–variance dilemma implies that abilities such as generic object recognition cannot be learned from scratch, but require a certain degree of "hard wiring" that is later tuned by experience. This is because model-free approaches to inference require impractically large training sets if they are to avoid high variance.

References

External links

MLU-Explain: The Bias Variance Tradeoff. An interactive visualization of the bias–variance tradeoff in LOESS Regression and K-Nearest Neighbors.