In statistics and machine learning, the bias–variance tradeoff is the property of a model that the variance of the parameter estimated across samples can be reduced by increasing the bias in the estimated parameters. The bias–variance dilemma or bias–variance problem is the conflict in trying to simultaneously minimize these two sources of error that prevent supervised learning algorithms from generalizing beyond their training set:

* The ''bias'' error is an error from erroneous assumptions in the learning algorithm. High bias can cause an algorithm to miss the relevant relations between features and target outputs (underfitting).
* The ''variance'' is an error from sensitivity to small fluctuations in the training set. High variance may result from an algorithm modeling the random noise in the training data (overfitting).

The bias–variance decomposition is a way of analyzing a learning algorithm's expected generalization error with respect to a particular problem as a sum of three terms: the bias, the variance, and a quantity called the ''irreducible error'', resulting from noise in the problem itself.


Motivation

Figure: four panels illustrating the combinations bias low/variance low, bias high/variance low, bias low/variance high, and bias high/variance high.
The bias–variance tradeoff is a central problem in supervised learning. Ideally, one wants to choose a model that both accurately captures the regularities in its training data and also generalizes well to unseen data. Unfortunately, it is typically impossible to do both simultaneously. High-variance learning methods may be able to represent their training set well but are at risk of overfitting to noisy or unrepresentative training data. In contrast, algorithms with high bias typically produce simpler models that may fail to capture important regularities (i.e. underfit) in the data.

It is an often-made fallacy to assume that complex models must have high variance; high-variance models are 'complex' in some sense, but the reverse need not be true. In addition, one has to be careful how to define complexity: in particular, the number of parameters used to describe the model is a poor measure of complexity. This is illustrated by the following example: the model f_{a,b}(x) = a\sin(bx) has only two parameters (a, b), but it can interpolate any number of points by oscillating with a high enough frequency, resulting in both a high bias and high variance.

An analogy can be made to the relationship between accuracy and precision. Accuracy is a description of bias and can intuitively be improved by selecting from only local information. Consequently, a sample will appear accurate (i.e. have low bias) under the aforementioned selection conditions, but may result in underfitting. In other words, test data may not agree as closely with training data, which would indicate imprecision and therefore inflated variance. A graphical example would be a straight-line fit to data exhibiting quadratic behavior overall.

Precision is a description of variance and generally can only be improved by selecting information from a comparatively larger space. The option to select many data points over a broad sample space is the ideal condition for any analysis. However, intrinsic constraints (whether physical, theoretical, computational, etc.) will always play a limiting role. The limiting case where only a finite number of data points are selected over a broad sample space may result in improved precision and lower variance overall, but may also result in an overreliance on the training data (overfitting). This means that test data would also not agree as closely with the training data, but in this case the reason is inaccuracy or high bias. To borrow from the previous example, the graphical representation would appear as a high-order polynomial fit to the same data exhibiting quadratic behavior. Note that error in each case is measured the same way, but the reason ascribed to the error differs depending on the balance between bias and variance.

To mitigate how much information is used from neighboring observations, a model can be smoothed via explicit regularization, such as shrinkage.
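The underfitting/overfitting contrast described above can be shown numerically. The following sketch is a hypothetical illustration (not part of the original text) that assumes NumPy is available: it fits polynomials of increasing degree to noisy samples of a quadratic function, where the straight line exhibits high bias (large error on both training and test data) and the high-degree polynomial exhibits high variance (small training error but inflated test error).

<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(0)

def sample(n):
    # Noisy observations of a quadratic ground truth, as in the example above.
    x = rng.uniform(-3, 3, n)
    y = 0.5 * x**2 + rng.normal(0, 1, n)
    return x, y

x_train, y_train = sample(15)
x_test, y_test = sample(200)

for degree in (1, 2, 9):
    coeffs = np.polyfit(x_train, y_train, degree)   # least-squares polynomial fit
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"degree {degree}: train MSE {train_mse:.2f}, test MSE {test_mse:.2f}")
</syntaxhighlight>

Typically the degree-1 fit underfits (high error everywhere), the degree-2 fit matches the data-generating process, and the degree-9 fit tracks the training noise and generalizes worse, though the exact numbers depend on the random seed.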


Bias–variance decomposition of mean squared error

Suppose that we have a training set consisting of a set of points x_1, \dots, x_n and real values y_i associated with each point x_i. We assume that there is a function f(x) such that y = f(x) + \varepsilon, where the noise, \varepsilon, has zero mean and variance \sigma^2.

We want to find a function \hat{f}(x;D) that approximates the true function f(x) as well as possible, by means of some learning algorithm based on a training dataset (sample) D = \{(x_1, y_1), \dots, (x_n, y_n)\}. We make "as well as possible" precise by measuring the mean squared error between y and \hat{f}(x;D): we want (y - \hat{f}(x;D))^2 to be minimal, both for x_1, \dots, x_n ''and for points outside of our sample''. Of course, we cannot hope to do so perfectly, since the y_i contain noise \varepsilon; this means we must be prepared to accept an ''irreducible error'' in any function we come up with.

Finding an \hat{f} that generalizes to points outside of the training set can be done with any of the countless algorithms used for supervised learning. It turns out that whichever function \hat{f} we select, we can decompose its expected error on an unseen sample x as follows:

: \operatorname{E}_{D, \varepsilon} \Big[\big(y - \hat{f}(x;D)\big)^2\Big] = \Big(\operatorname{Bias}_D\big[\hat{f}(x;D)\big]\Big)^2 + \operatorname{Var}_D\big[\hat{f}(x;D)\big] + \sigma^2

where

: \operatorname{Bias}_D\big[\hat{f}(x;D)\big] = \operatorname{E}_D\big[\hat{f}(x;D) - f(x)\big] = \operatorname{E}_D\big[\hat{f}(x;D)\big] - \operatorname{E}\big[f(x)\big]

and

: \operatorname{Var}_D\big[\hat{f}(x;D)\big] = \operatorname{E}_D\Big[\big(\operatorname{E}_D[\hat{f}(x;D)] - \hat{f}(x;D)\big)^2\Big].

The expectation ranges over different choices of the training set D = \{(x_1, y_1), \dots, (x_n, y_n)\}, all sampled from the same joint distribution P(x, y), which can for example be done via bootstrapping. The three terms represent:

* the square of the ''bias'' of the learning method, which can be thought of as the error caused by the simplifying assumptions built into the method. E.g., when approximating a non-linear function f(x) using a learning method for linear models, there will be error in the estimates \hat{f}(x) due to this assumption;
* the ''variance'' of the learning method, or, intuitively, how much the learning method \hat{f}(x) will move around its mean;
* the irreducible error \sigma^2.

Since all three terms are non-negative, the irreducible error forms a lower bound on the expected error on unseen samples. The more complex the model \hat{f}(x) is, the more data points it will capture, and the lower the bias will be. However, complexity will make the model "move" more to capture the data points, and hence its variance will be larger.
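The decomposition can be checked empirically at a fixed query point by repeatedly redrawing the training set D. The sketch below is illustrative only (the cubic-polynomial learner, sine target, and all constants are arbitrary choices, and NumPy is assumed): it estimates the squared bias and the variance of the learner across many training sets and compares their sum plus \sigma^2 with a direct Monte Carlo estimate of the expected squared error.

<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(1)
sigma = 0.5
f = np.sin                       # true function f(x)
x0 = 1.0                         # fixed query point
n_train, n_datasets, degree = 30, 2000, 3

preds = np.empty(n_datasets)
for i in range(n_datasets):
    # Draw a fresh training set D and fit a cubic polynomial to it.
    x = rng.uniform(0, np.pi, n_train)
    y = f(x) + rng.normal(0, sigma, n_train)
    preds[i] = np.polyval(np.polyfit(x, y, degree), x0)

bias_sq = (preds.mean() - f(x0)) ** 2        # (E_D[fhat(x0;D)] - f(x0))^2
variance = preds.var()                       # Var_D[fhat(x0;D)]
direct = np.mean((f(x0) + rng.normal(0, sigma, n_datasets) - preds) ** 2)

print("bias^2 + variance + sigma^2 =", bias_sq + variance + sigma**2)
print("Monte Carlo expected error  =", direct)   # the two should roughly agree
</syntaxhighlight>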


Derivation

The derivation of the bias–variance decomposition for squared error proceeds as follows. For notational convenience, we abbreviate f = f(x) and \hat{f} = \hat{f}(x;D), and we drop the D subscript on our expectation operators.

First, recall that, by definition, for any random variable X, we have

: \operatorname{Var}[X] = \operatorname{E}[X^2] - \operatorname{E}[X]^2.

Rearranging, we get:

: \operatorname{E}[X^2] = \operatorname{Var}[X] + \operatorname{E}[X]^2.

Since f is deterministic, i.e. independent of D,

: \operatorname{E}[f] = f.

Thus, given y = f + \varepsilon and \operatorname{E}[\varepsilon] = 0 (because \varepsilon is noise), we have \operatorname{E}[y] = \operatorname{E}[f + \varepsilon] = \operatorname{E}[f] = f.

Also, since \operatorname{Var}[\varepsilon] = \sigma^2,

: \operatorname{Var}[y] = \operatorname{E}\big[(y - \operatorname{E}[y])^2\big] = \operatorname{E}\big[(y - f)^2\big] = \operatorname{E}\big[(f + \varepsilon - f)^2\big] = \operatorname{E}[\varepsilon^2] = \operatorname{Var}[\varepsilon] + \operatorname{E}[\varepsilon]^2 = \sigma^2 + 0^2 = \sigma^2.

Thus, since \varepsilon and \hat{f} are independent, we can write

: \begin{align}
\text{MSE} = \operatorname{E}\big[(y - \hat{f})^2\big] & = \operatorname{E}\big[(f + \varepsilon - \hat{f})^2\big] \\
& = \operatorname{E}\big[(f + \varepsilon - \hat{f} + \operatorname{E}[\hat{f}] - \operatorname{E}[\hat{f}])^2\big] \\
& = \operatorname{E}\big[(f - \operatorname{E}[\hat{f}])^2\big] + \operatorname{E}[\varepsilon^2] + \operatorname{E}\big[(\operatorname{E}[\hat{f}] - \hat{f})^2\big] + 2\operatorname{E}\big[(f - \operatorname{E}[\hat{f}])\varepsilon\big] + 2\operatorname{E}\big[\varepsilon(\operatorname{E}[\hat{f}] - \hat{f})\big] + 2\operatorname{E}\big[(\operatorname{E}[\hat{f}] - \hat{f})(f - \operatorname{E}[\hat{f}])\big] \\
& = (f - \operatorname{E}[\hat{f}])^2 + \operatorname{E}[\varepsilon^2] + \operatorname{E}\big[(\operatorname{E}[\hat{f}] - \hat{f})^2\big] + 2(f - \operatorname{E}[\hat{f}])\operatorname{E}[\varepsilon] + 2\operatorname{E}[\varepsilon]\operatorname{E}\big[\operatorname{E}[\hat{f}] - \hat{f}\big] + 2\operatorname{E}\big[\operatorname{E}[\hat{f}] - \hat{f}\big](f - \operatorname{E}[\hat{f}]) \\
& = (f - \operatorname{E}[\hat{f}])^2 + \operatorname{E}[\varepsilon^2] + \operatorname{E}\big[(\operatorname{E}[\hat{f}] - \hat{f})^2\big] \\
& = (f - \operatorname{E}[\hat{f}])^2 + \operatorname{Var}[\varepsilon] + \operatorname{Var}\big[\hat{f}\big] \\
& = \operatorname{Bias}\big[\hat{f}\big]^2 + \operatorname{Var}[\varepsilon] + \operatorname{Var}\big[\hat{f}\big] \\
& = \operatorname{Bias}\big[\hat{f}\big]^2 + \sigma^2 + \operatorname{Var}\big[\hat{f}\big]
\end{align}

The cross-terms vanish because \operatorname{E}[\varepsilon] = 0 and because f - \operatorname{E}[\hat{f}] is deterministic, so \operatorname{E}\big[\operatorname{E}[\hat{f}] - \hat{f}\big](f - \operatorname{E}[\hat{f}]) = 0.

Finally, the MSE loss function (or negative log-likelihood) is obtained by taking the expectation over x \sim P:

: \text{MSE} = \operatorname{E}_x\Big\{\operatorname{Bias}_D\big[\hat{f}(x;D)\big]^2 + \operatorname{Var}_D\big[\hat{f}(x;D)\big]\Big\} + \sigma^2.
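As a quick numerical sanity check of the identities used above (a throwaway sketch with arbitrary constants, assuming NumPy), one can verify that \operatorname{E}[X^2] = \operatorname{Var}[X] + \operatorname{E}[X]^2 and that \operatorname{Var}[y] = \sigma^2 when y = f + \varepsilon with deterministic f:

<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(6)
sigma, f_x = 0.5, 2.0                          # noise level and f(x) at a fixed point
eps = rng.normal(0, sigma, 1_000_000)
y = f_x + eps                                  # y = f + epsilon

# E[y^2] should match Var[y] + E[y]^2
print(np.mean(y**2), np.var(y) + np.mean(y)**2)

# Var[y] should match sigma^2, since f is deterministic
print(np.var(y), sigma**2)
</syntaxhighlight>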


Approaches

Dimensionality reduction and feature selection can decrease variance by simplifying models. Similarly, a larger training set tends to decrease variance. Adding features (predictors) tends to decrease bias, at the expense of introducing additional variance. Learning algorithms typically have some tunable parameters that control bias and variance; for example,

* Linear and generalized linear models can be regularized to decrease their variance at the cost of increasing their bias.
* In artificial neural networks, the variance increases and the bias decreases as the number of hidden units increases, although this classical assumption has been the subject of recent debate. As in GLMs, regularization is typically applied.
* In ''k''-nearest neighbor models, a high value of ''k'' leads to high bias and low variance (see below).
* In instance-based learning, regularization can be achieved by varying the mixture of prototypes and exemplars.
* In decision trees, the depth of the tree determines the variance. Decision trees are commonly pruned to control variance.

One way of resolving the trade-off is to use mixture models and ensemble learning. For example, boosting combines many "weak" (high-bias) models in an ensemble that has lower bias than the individual models, while bagging combines "strong" learners in a way that reduces their variance. Model validation methods such as cross-validation can be used to tune models so as to optimize the trade-off.
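As an illustration of tuning such a parameter, the sketch below is hypothetical (it assumes scikit-learn is installed, and the degree-10 polynomial features and the candidate alpha values are arbitrary choices): it uses cross-validation to pick the ridge penalty that best balances bias against variance.

<syntaxhighlight lang="python">
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(2)
X = rng.uniform(-2, 2, (60, 1))
y = np.sin(2 * X[:, 0]) + rng.normal(0, 0.3, 60)

# Larger alpha means more bias and less variance; pick the value with the best CV error.
for alpha in (1e-4, 1e-2, 1.0, 100.0):
    model = make_pipeline(PolynomialFeatures(degree=10), Ridge(alpha=alpha))
    score = cross_val_score(model, X, y, cv=5, scoring="neg_mean_squared_error").mean()
    print(f"alpha={alpha:g}: CV MSE {-score:.3f}")
</syntaxhighlight>

Very small alpha values tend to overfit the degree-10 features (high variance), very large values over-shrink the fit (high bias), and intermediate values usually give the lowest cross-validated error.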


''k''-nearest neighbors

In the case of ''k''-nearest neighbors regression, when the expectation is taken over the possible labeling of a fixed training set, a closed-form expression exists that relates the bias–variance decomposition to the parameter ''k'':

: \operatorname{E}\big[(y - \hat{f}(x))^2 \mid X = x\big] = \left( f(x) - \frac{1}{k}\sum_{i=1}^k f(N_i(x)) \right)^2 + \frac{\sigma^2}{k} + \sigma^2

where N_1(x), \dots, N_k(x) are the ''k'' nearest neighbors of x in the training set. The bias (first term) is a monotone rising function of ''k'', while the variance (second term) drops off as ''k'' is increased. In fact, under "reasonable assumptions" the bias of the first-nearest neighbor (1-NN) estimator vanishes entirely as the size of the training set approaches infinity.
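This closed-form expression can be checked by simulation. In the sketch below (illustrative only, assuming NumPy; the sine target and all constants are arbitrary), the design points are held fixed and only the labels are redrawn, so the empirical variance of the ''k''-NN prediction should approach \sigma^2/k and the empirical bias should match the first term.

<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(3)
sigma, n_train, n_labelings = 0.3, 50, 2000
f = np.sin
x_train = rng.uniform(0, np.pi, n_train)    # fixed design points
x0 = 1.5                                    # query point

for k in (1, 5, 20):
    # Indices of the k nearest neighbors of x0 (fixed, since the design is fixed).
    nn = np.argsort(np.abs(x_train - x0))[:k]
    preds = np.empty(n_labelings)
    for i in range(n_labelings):
        y = f(x_train) + rng.normal(0, sigma, n_train)   # fresh labels only
        preds[i] = y[nn].mean()                          # k-NN regression estimate
    bias_sq = (f(x0) - f(x_train)[nn].mean()) ** 2       # squared first term
    print(f"k={k}: bias^2 {bias_sq:.4f}, "
          f"variance {preds.var():.4f} (theory {sigma**2 / k:.4f})")
</syntaxhighlight>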


Applications


In regression

The bias–variance decomposition forms the conceptual basis for regression regularization methods such as Lasso and ridge regression. Regularization methods introduce bias into the regression solution that can reduce variance considerably relative to the ordinary least squares (OLS) solution. Although the OLS solution provides unbiased regression estimates, the lower-variance solutions produced by regularization techniques can provide superior MSE performance.
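The following sketch is a hypothetical illustration (assuming NumPy; the design, true coefficients, and penalty value are arbitrary choices): it simulates a strongly collinear regression problem and compares the average squared estimation error of OLS and ridge coefficients across many datasets, where the ridge estimates are biased but typically have much lower total error.

<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(4)
n, p, sigma, n_sims = 40, 5, 1.0, 1000
beta = np.array([1.0, 0.5, -0.5, 0.0, 0.0])   # true coefficients (arbitrary)
alpha = 5.0                                    # ridge penalty (arbitrary choice)

def fit(X, y, lam):
    # Ridge solution via the normal equations; lam = 0 gives ordinary least squares.
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

err_ols, err_ridge = [], []
for _ in range(n_sims):
    # Strongly correlated predictors make the OLS estimates high-variance.
    z = rng.normal(size=(n, 1))
    X = 0.95 * z + 0.05 * rng.normal(size=(n, p))
    y = X @ beta + rng.normal(0, sigma, n)
    err_ols.append(np.sum((fit(X, y, 0.0) - beta) ** 2))
    err_ridge.append(np.sum((fit(X, y, alpha) - beta) ** 2))

print(f"OLS coefficient MSE:   {np.mean(err_ols):.3f}")
print(f"Ridge coefficient MSE: {np.mean(err_ridge):.3f}")   # usually smaller despite bias
</syntaxhighlight>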


In classification

The bias–variance decomposition was originally formulated for least-squares regression. For the case of classification under the 0-1 loss (misclassification rate), it is possible to find a similar decomposition. Alternatively, if the classification problem can be phrased as probabilistic classification, then the expected squared error of the predicted probabilities with respect to the true probabilities can be decomposed as before. It has been argued that as training data increases, the variance of learned models will tend to decrease, and hence that as training data quantity increases, error is minimized by methods that learn models with lesser bias, and that conversely, for smaller training data quantities it is ever more important to minimize variance.


In reinforcement learning

Even though the bias–variance decomposition does not directly apply in reinforcement learning, a similar tradeoff can also characterize generalization. When an agent has limited information on its environment, the suboptimality of an RL algorithm can be decomposed into the sum of two terms: a term related to an asymptotic bias and a term due to overfitting. The asymptotic bias is directly related to the learning algorithm (independently of the quantity of data), while the overfitting term comes from the fact that the amount of data is limited.


In human learning

While widely discussed in the context of machine learning, the bias–variance dilemma has been examined in the context of human cognition, most notably by Gerd Gigerenzer and co-workers in the context of learned heuristics. They have argued (see references below) that the human brain resolves the dilemma in the case of the typically sparse, poorly characterised training sets provided by experience by adopting high-bias/low-variance heuristics. This reflects the fact that a zero-bias approach has poor generalisability to new situations, and also unreasonably presumes precise knowledge of the true state of the world. The resulting heuristics are relatively simple, but produce better inferences in a wider variety of situations. Geman et al. argue that the bias–variance dilemma implies that abilities such as generic object recognition cannot be learned from scratch, but require a certain degree of "hard wiring" that is later tuned by experience. This is because model-free approaches to inference require impractically large training sets if they are to avoid high variance.


See also

* Accuracy and precision
* Bias of an estimator
* Double descent
* Gauss–Markov theorem
* Hyperparameter optimization
* Law of total variance
* Minimum-variance unbiased estimator
* Model selection
* Regression model validation
* Supervised learning


References


External links


MLU-Explain: The Bias Variance Tradeoff
— An interactive visualization of the bias-variance tradeoff in LOESS Regression and K-Nearest Neighbors.