Explained variation

In statistics, explained variation measures the proportion to which a mathematical model accounts for the variation (dispersion) of a given data set. Often, variation is quantified as variance; then, the more specific term explained variance can be used. The complementary part of the total variation is called unexplained or residual variation.


Definition in terms of information gain


Information gain by better modelling

Following Kent (1983), we use the Fraser information (Fraser 1965)
:F(\theta) = \int \mathrm{d}r\,g(r)\,\ln f(r;\theta)
where g(r) is the probability density of a random variable R, and f(r;\theta) with \theta\in\Theta_i (i=0,1) are two families of parametric models. Model family 0 is the simpler one, with a restricted parameter space \Theta_0\subset\Theta_1. Parameters are determined by maximum likelihood estimation,
:\theta_i = \operatorname{argmax}_{\theta\in\Theta_i} F(\theta).
The information gain of model 1 over model 0 is written as
:\Gamma(\theta_1:\theta_0) = 2[F(\theta_1)-F(\theta_0)]\,,
where a factor of 2 is included for convenience. Γ is always nonnegative; it measures the extent to which the best model of family 1 is better than the best model of family 0 in explaining ''g''(''r'').
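
As a concrete illustration (not part of the cited sources), the following Python sketch estimates F(θ) by the sample average of ln f(r;θ) and computes Γ for two nested Gaussian families; the choice of families, the true density g, and all names are assumptions made for this example.

 import numpy as np
 from scipy.stats import norm
 
 rng = np.random.default_rng(0)
 r = rng.normal(loc=0.5, scale=1.0, size=10_000)   # sample from the true density g(r)
 
 def fraser_information(data, mu, sigma):
     # Sample-average estimate of F(theta) = integral dr g(r) ln f(r; theta)
     return np.mean(norm.logpdf(data, loc=mu, scale=sigma))
 
 # Family 0 (restricted): Gaussian with mean fixed at 0; MLE of sigma
 sigma0 = np.sqrt(np.mean(r ** 2))
 F0 = fraser_information(r, 0.0, sigma0)
 
 # Family 1 (full): Gaussian with free mean and sigma; usual MLEs
 F1 = fraser_information(r, np.mean(r), np.std(r))
 
 gamma = 2.0 * (F1 - F0)   # information gain of model 1 over model 0
 print(gamma)              # nonnegative; larger means family 1 explains g better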


Information gain by a conditional model

Assume a two-dimensional random variable R=(X,Y) where ''X'' shall be considered as an explanatory variable, and ''Y'' as a dependent variable. Models of family 1 "explain" ''Y'' in terms of ''X'',
:f(y\mid x;\theta),
whereas in family 0, ''X'' and ''Y'' are assumed to be independent. We define the randomness of ''Y'' by D(Y)=\exp[-2F(\theta_0)], and the randomness of ''Y'', given ''X'', by D(Y\mid X)=\exp[-2F(\theta_1)]. Then,
:\rho_C^2 = 1-D(Y\mid X)/D(Y)
can be interpreted as the proportion of the data dispersion which is "explained" by ''X''.
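
A minimal Python sketch of this construction, assuming a linear-Gaussian conditional model for family 1 and an unconditional Gaussian for family 0 (both choices are illustrative, not prescribed by the text):

 import numpy as np
 from scipy.stats import norm
 
 rng = np.random.default_rng(1)
 n = 10_000
 x = rng.normal(size=n)
 y = 1.5 * x + rng.normal(scale=0.8, size=n)      # Y depends linearly on X
 
 # Family 0: X and Y independent, so Y is modelled alone as N(mu, sigma^2)
 F0 = np.mean(norm.logpdf(y, loc=np.mean(y), scale=np.std(y)))
 
 # Family 1: conditional linear-Gaussian model f(y | x; theta)
 slope, intercept = np.polyfit(x, y, 1)           # least-squares fit = Gaussian MLE
 resid = y - (intercept + slope * x)
 F1 = np.mean(norm.logpdf(y, loc=intercept + slope * x, scale=np.std(resid)))
 
 D_y = np.exp(-2.0 * F0)                # randomness of Y
 D_y_given_x = np.exp(-2.0 * F1)        # randomness of Y given X
 print(1.0 - D_y_given_x / D_y)         # rho_C^2: proportion "explained" by X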


Special cases and generalized usage


Linear regression

The fraction of variance unexplained is an established concept in the context of linear regression. The usual definition of the coefficient of determination is based on the fundamental concept of explained variance.
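
For concreteness, here is a short Python sketch of that usual definition, R² = 1 − SS_res/SS_tot; the function name and arguments are our own:

 import numpy as np
 
 def r_squared(y, y_pred):
     # R^2 = 1 - (residual sum of squares) / (total sum of squares),
     # i.e. one minus the fraction of variance unexplained
     ss_res = np.sum((y - y_pred) ** 2)
     ss_tot = np.sum((y - np.mean(y)) ** 2)
     return 1.0 - ss_res / ss_tot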


Correlation coefficient as measure of explained variance

Let ''X'' be a random vector, and ''Y'' a random variable that is modeled by a normal distribution with centre \mu=\Psi^{\mathrm T}X. In this case, the above-derived proportion of explained variation \rho_C^2 equals the squared correlation coefficient R^2. Note the strong model assumptions: the centre of the ''Y'' distribution must be a linear function of ''X'', and for any given ''x'', the ''Y'' distribution must be normal. In other situations, it is generally not justified to interpret R^2 as the proportion of explained variance.
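
This equality is easy to check numerically. The sketch below (an illustration under exactly the stated assumptions, with arbitrary example parameters) compares the squared sample correlation with 1 − var(residuals)/var(''Y'') for an ordinary least-squares fit:

 import numpy as np
 
 rng = np.random.default_rng(2)
 x = rng.normal(size=5_000)
 y = 2.0 * x + 1.0 + rng.normal(size=5_000)   # linear centre, normal noise
 
 r = np.corrcoef(x, y)[0, 1]                  # sample correlation coefficient
 slope, intercept = np.polyfit(x, y, 1)       # ordinary least-squares fit
 resid = y - (intercept + slope * x)
 explained = 1.0 - np.var(resid) / np.var(y)  # proportion of explained variance
 
 print(r ** 2, explained)   # identical (up to floating point) for an OLS fit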


In principal component analysis

Explained variance is routinely used in principal component analysis. The relation to the Fraser–Kent information gain remains to be clarified.
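
In PCA, the "explained variance" of a component is its eigenvalue's share of the total variance. A minimal sketch (the data and covariance matrix are arbitrary examples):

 import numpy as np
 
 rng = np.random.default_rng(3)
 X = rng.multivariate_normal(mean=[0.0, 0.0, 0.0],
                             cov=[[4.0, 2.0, 0.0],
                                  [2.0, 3.0, 0.0],
                                  [0.0, 0.0, 1.0]],
                             size=2_000)
 
 eigvals = np.linalg.eigvalsh(np.cov(X, rowvar=False))[::-1]  # largest first
 
 # Proportion of total variance carried by each principal component
 ratio = eigvals / eigvals.sum()
 print(ratio, np.cumsum(ratio))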


Criticism

As the fraction of "explained variance" equals the squared correlation coefficient R^2, it shares all the disadvantages of the latter: it reflects not only the quality of the regression, but also the distribution of the independent (conditioning) variables. In the words of one critic: "Thus R^2 gives the 'percentage of variance explained' by the regression, an expression that, for most social scientists, is of doubtful meaning but great rhetorical value. If this number is large, the regression gives a good fit, and there is little point in searching for additional variables. Other regression equations on different data sets are said to be less satisfactory or less powerful if their R^2 is lower. Nothing about R^2 supports these claims". And, after constructing an example where R^2 is enhanced just by jointly considering data from two different populations: "'Explained variance' explains nothing."
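
The pooling effect mentioned above is easy to reproduce. The following sketch (our own construction, not the critic's original example) mixes two populations with identical weak ''X''–''Y'' relationships but well-separated centres:

 import numpy as np
 
 rng = np.random.default_rng(4)
 
 def r2(x, y):
     slope, intercept = np.polyfit(x, y, 1)
     resid = y - (intercept + slope * x)
     return 1.0 - np.var(resid) / np.var(y)
 
 # Two populations: same weak X-Y relationship, shifted centres
 x1 = rng.normal(0.0, 1.0, 1_000)
 y1 = 0.2 * x1 + rng.normal(size=1_000)
 x2 = rng.normal(8.0, 1.0, 1_000)
 y2 = 0.2 * x2 + rng.normal(size=1_000) + 8.0
 
 print(r2(x1, y1), r2(x2, y2))          # small within each population (~0.04)
 print(r2(np.concatenate([x1, x2]),     # large for the pooled data (~0.9),
          np.concatenate([y1, y2])))    # although the regression is no better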


See also

* Analysis of variance
* Variance reduction
* Variance-based sensitivity analysis


References

* Fraser, D. A. S. (1965). "On information in statistics". ''Annals of Mathematical Statistics'' 36: 890–896.
* Kent, J. T. (1983). "Information gain and a general measure of correlation". ''Biometrika'' 70 (1): 163–173.


External links


Explained and Unexplained Variance on a graph