Mallows's Cp

In statistics, Mallows's ''Cp'', named for Colin Lingwood Mallows, is used to assess the fit of a regression model that has been estimated using ordinary least squares. It is applied in the context of model selection, where a number of predictor variables are available for predicting some outcome, and the goal is to find the best model involving a subset of these predictors. A small value of C_p means that the model is relatively precise. Mallows's ''Cp'' is "essentially equivalent" to the Akaike information criterion (AIC) in the case of linear regression. This equivalence is only asymptotic; Akaike notes that ''Cp'' requires some subjective judgment in the choice of \hat\sigma^2.


Definition and properties

Mallows's ''Cp'' addresses the issue of overfitting, in which model selection statistics such as the residual sum of squares always get smaller as more variables are added to a model. Thus, if we aim to select the model giving the smallest residual sum of squares, the model including all variables would always be selected. Instead, the ''Cp'' statistic calculated on a sample of data estimates the sum of squared prediction errors (SSPE) as its population target

: E\sum_i (\hat{Y}_i - E(Y_i\mid X_i))^2/\sigma^2,

where \hat{Y}_i is the fitted value from the regression model for the ''i''th case, E(Y_i\mid X_i) is the expected value for the ''i''th case, and \sigma^2 is the error variance (assumed constant across the cases). The mean squared prediction error (MSPE) will not automatically get smaller as more variables are added. The optimum model under this criterion is a compromise influenced by the sample size, the effect sizes of the different predictors, and the degree of collinearity between them.

If ''P'' regressors are selected from a set of ''K'' > ''P'', the ''Cp'' statistic for that particular set of regressors is defined as

: C_p = \frac{SSE_p}{S^2} - N + 2(P+1),

where
* SSE_p = \sum_{i=1}^N (Y_i - \hat{Y}_{pi})^2 is the error sum of squares for the model with ''P'' regressors,
* \hat{Y}_{pi} is the predicted value of the ''i''th observation of ''Y'' from the model with ''P'' regressors,
* S^2 is the estimate of the residual variance after regression on the complete set of ''K'' regressors, \frac{1}{N-K-1}\sum_{i=1}^N (Y_i - \hat{Y}_i)^2,
* and ''N'' is the sample size.
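As a minimal sketch of this definition in Python (the function name ``mallows_cp`` and the simulated data are illustrative, and an intercept is assumed in every fitted model):

```python
import numpy as np

def mallows_cp(y, X_subset, X_full):
    """Mallows's Cp = SSE_p / S^2 - N + 2(P + 1) for a subset model.

    Illustrative sketch, not a library routine: X_subset and X_full are
    design matrices without an intercept column; one is added below.
    """
    N = len(y)
    P = X_subset.shape[1]            # regressors in the candidate subset
    K = X_full.shape[1]              # regressors in the full model

    def sse(X):
        # Ordinary least squares with an intercept; return the error sum of squares.
        A = np.column_stack([np.ones(N), X])
        beta, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ beta
        return resid @ resid

    S2 = sse(X_full) / (N - K - 1)   # residual variance from the full K-regressor model
    return sse(X_subset) / S2 - N + 2 * (P + 1)

# Simulated example: the third predictor is pure noise.
rng = np.random.default_rng(0)
N = 100
X = rng.normal(size=(N, 3))
y = 1.0 * X[:, 0] + 2.0 * X[:, 1] + rng.normal(size=N)

cp_full = mallows_cp(y, X, X)        # for the full model, Cp = K + 1 exactly
cp_sub = mallows_cp(y, X[:, :2], X)  # Cp for the true two-predictor subset
```

Note that for the full model itself, SSE_K / S^2 = N - K - 1 by construction, so its ''Cp'' is always exactly ''K'' + 1 regardless of the data.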


Alternative definition

Given a linear model

: Y = \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p + \varepsilon,

where
* \beta_0, \ldots, \beta_p are coefficients for the predictor variables X_1, \ldots, X_p,
* \varepsilon represents the error term,

an alternative version of ''Cp'' can be defined as

: C_p = \frac{1}{N}(\operatorname{RSS} + 2p\hat{\sigma}^2),

where
* RSS is the residual sum of squares on a training set of data,
* ''p'' is the number of predictors,
* and \hat{\sigma}^2 refers to an estimate of the variance associated with each response in the linear model (estimated on a model containing all predictors).

Note that this version of ''Cp'' does not give equivalent values to the earlier version, but the model with the smallest ''Cp'' from this definition will also be the model with the smallest ''Cp'' from the earlier definition.


Limitations

The ''Cp'' criterion suffers from two main limitations (Giraud, C. (2015), ''Introduction to high-dimensional statistics'', Chapman & Hall/CRC):
# the ''Cp'' approximation is only valid for large sample sizes;
# ''Cp'' cannot handle complex collections of models, as in the variable selection (or feature selection) problem.


Practical use

The ''Cp'' statistic is often used as a stopping rule for various forms of stepwise regression. Mallows proposed the statistic as a criterion for selecting among many alternative subset regressions. Under a model not suffering from appreciable lack of fit (bias), ''Cp'' has expectation nearly equal to ''P''; otherwise the expectation is roughly ''P'' plus a positive bias term. Nevertheless, even though it has expectation greater than or equal to ''P'', there is nothing to prevent ''Cp'' < ''P'' or even ''Cp'' < 0 in extreme cases. It is suggested that one should choose a subset that has ''Cp'' approaching ''P'' from above, for a list of subsets ordered by increasing ''P''. In practice, the positive bias can be adjusted for by selecting a model from the ordered list of subsets such that ''Cp'' < 2''P''. Since the sample-based ''Cp'' statistic is an estimate of the MSPE, using ''Cp'' for model selection does not completely guard against overfitting. For instance, it is possible that the selected model will be one for which the sample ''Cp'' was a particularly severe underestimate of the MSPE. Model selection statistics such as ''Cp'' are generally not used blindly; rather, information about the field of application, the intended use of the model, and any known biases in the data are taken into account in the process of model selection.
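The ''Cp'' < 2''P'' rule of thumb can be sketched as follows, assuming an exhaustive walk over subsets ordered by increasing ''P'' (``select_subset`` is a hypothetical helper for illustration, not Mallows's own procedure):

```python
import numpy as np
from itertools import combinations

def mallows_cp(y, X_subset, X_full):
    # C_p = SSE_p / S^2 - N + 2(P + 1); S^2 from the full K-regressor model.
    def sse(X):
        A = np.column_stack([np.ones(len(y)), X])  # OLS with intercept
        beta, *_ = np.linalg.lstsq(A, y, rcond=None)
        r = y - A @ beta
        return r @ r
    N, P, K = len(y), X_subset.shape[1], X_full.shape[1]
    S2 = sse(X_full) / (N - K - 1)
    return sse(X_subset) / S2 - N + 2 * (P + 1)

def select_subset(y, X_full):
    """Walk subsets in order of increasing size P and return the first
    whose Cp falls below 2P (the bias-adjusted rule of thumb)."""
    K = X_full.shape[1]
    for P in range(1, K + 1):
        for cols in combinations(range(K), P):
            cp = mallows_cp(y, X_full[:, list(cols)], X_full)
            if cp < 2 * P:
                return cols, cp
    # The full model always satisfies Cp = K + 1 < 2K for K >= 2.
    return tuple(range(K)), mallows_cp(y, X_full, X_full)

# Simulated example: only the first of four predictors matters.
rng = np.random.default_rng(2)
N = 120
X = rng.normal(size=(N, 4))
y = 3.0 * X[:, 0] + rng.normal(size=N)
cols, cp = select_subset(y, X)
```

Exhaustive enumeration is used here for clarity; with many predictors, stepwise procedures apply the same stopping rule without visiting every subset.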


See also

* Goodness of fit: Regression analysis
* Coefficient of determination


References


Further reading

* {{cite book , last1=Judge , first1=George G. , first2=William E. , last2=Griffiths , first3=R. Carter , last3=Hill , first4=Tsoung-Chao , last4=Lee , year=1980 , title=The Theory and Practice of Econometrics , location=New York , publisher=Wiley , pages=417–423 , isbn=978-0-471-05938-7 }}