In statistics, the variance inflation factor (VIF) is the ratio (quotient) of the variance of a parameter estimate in a model that includes multiple other terms (parameters) to the variance of the estimate in a model constructed using only that one term. It quantifies the severity of multicollinearity in an ordinary least squares regression analysis. It provides an index that measures how much the variance (the square of the estimate's standard deviation) of an estimated regression coefficient is increased because of collinearity. Cuthbert Daniel claims to have invented the concept behind the variance inflation factor, but did not come up with the name.


Definition

Consider the following linear model with ''k'' independent variables:

: ''Y'' = ''β''0 + ''β''1 ''X''1 + ''β''2 ''X''2 + ... + ''β''''k'' ''X''''k'' + ''ε''.

The standard error of the estimate of ''β''''j'' is the square root of the (''j'' + 1)-th diagonal element of ''s''2(''X''′''X'')−1, where ''s'' is the root mean squared error (RMSE) (note that RMSE2 is a consistent estimator of the true variance of the error term, \sigma^2); ''X'' is the regression design matrix, i.e. a matrix such that ''X''''i'',''j''+1 is the value of the ''j''-th independent variable for the ''i''-th case or observation, and such that ''X''''i'',1, the predictor associated with the intercept term, equals 1 for all ''i''. It turns out that the square of this standard error, the estimated variance of the estimate of ''β''''j'', can be equivalently expressed as:

: \widehat{\operatorname{var}}(\hat\beta_j) = \frac{s^2}{(n-1)\widehat{\operatorname{var}}(X_j)}\cdot \frac{1}{1-R_j^2},

where ''R''''j''2 is the multiple ''R''2 for the regression of ''X''''j'' on the other covariates (a regression that does not involve the response variable ''Y''). This identity separates the influences of several distinct factors on the variance of the coefficient estimate:
* ''s''2: greater scatter in the data around the regression surface leads to proportionately more variance in the coefficient estimates
* ''n'': greater sample size results in proportionately less variance in the coefficient estimates
* \widehat{\operatorname{var}}(X_j): greater variability in a particular covariate leads to proportionately less variance in the corresponding coefficient estimate

The remaining term, 1 / (1 − ''R''''j''2), is the VIF. It reflects all other factors that influence the uncertainty in the coefficient estimates. The VIF equals 1 when the vector ''X''''j'' is orthogonal to each column of the design matrix for the regression of ''X''''j'' on the other covariates. By contrast, the VIF is greater than 1 when the vector ''X''''j'' is not orthogonal to all columns of that design matrix. Finally, note that the VIF is invariant to the scaling of the variables (that is, we could scale each variable ''X''''j'' by a constant ''c''''j'' without changing the VIF).

To derive the identity, start from

: \widehat{\operatorname{var}}(\hat\beta_j) = s^2 \left[(X^T X)^{-1}\right]_{jj}.

Now let r = X^T X and, without loss of generality, reorder the columns of ''X'' so that the first column is X_j:

: r^{-1} = \begin{bmatrix} r_{11} & r_{12} \\ r_{21} & r_{22} \end{bmatrix}^{-1},
: r_{11} = X_j^T X_j,\quad r_{12} = X_j^T X_{-j},\quad r_{21} = X_{-j}^T X_j,\quad r_{22} = X_{-j}^T X_{-j}.

By using the Schur complement, the element in the first row and first column of r^{-1} is

: r^{-1}_{11} = \left[r_{11} - r_{12}\, r_{22}^{-1}\, r_{21}\right]^{-1}.

Then we have

: \begin{align}
\widehat{\operatorname{var}}(\hat\beta_j) &= s^2 \left[(X^T X)^{-1}\right]_{11} = s^2\, r^{-1}_{11} \\
&= s^2 \left[X_j^T X_j - X_j^T X_{-j} (X_{-j}^T X_{-j})^{-1} X_{-j}^T X_j\right]^{-1} \\
&= s^2 \left[X_j^T X_j - X_j^T X_{-j} (X_{-j}^T X_{-j})^{-1} (X_{-j}^T X_{-j}) (X_{-j}^T X_{-j})^{-1} X_{-j}^T X_j\right]^{-1} \\
&= s^2 \left[X_j^T X_j - \hat\beta_{*j}^T (X_{-j}^T X_{-j}) \hat\beta_{*j}\right]^{-1} \\
&= s^2 \frac{1}{\mathrm{RSS}_j} \\
&= \frac{s^2}{(n-1)\widehat{\operatorname{var}}(X_j)}\cdot \frac{1}{1-R_j^2}.
\end{align}

Here \hat\beta_{*j} is the vector of coefficients from the regression of the dependent variable X_j on the covariates X_{-j}, and \mathrm{RSS}_j is the corresponding residual sum of squares.
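The identity above can also be checked numerically. The following Python sketch (the simulated data, variable names, and use of NumPy are illustrative assumptions, not part of the original text) fits a small regression and compares the diagonal element of s^2(X^T X)^{-1} with the factored expression:

 # Numerical check of the variance decomposition, assuming NumPy is available.
 import numpy as np
 
 rng = np.random.default_rng(0)
 n, k = 200, 3
 X = rng.normal(size=(n, k))
 X[:, 1] += 0.8 * X[:, 0]                          # induce some collinearity
 y = 1.0 + X @ np.array([2.0, -1.0, 0.5]) + rng.normal(size=n)
 
 D = np.column_stack([np.ones(n), X])              # design matrix with intercept
 beta_hat, *_ = np.linalg.lstsq(D, y, rcond=None)
 resid = y - D @ beta_hat
 s2 = resid @ resid / (n - D.shape[1])             # s^2 = RMSE^2
 cov = s2 * np.linalg.inv(D.T @ D)                 # s^2 (X'X)^{-1}
 
 j = 0                                             # examine beta_1
 others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
 gamma, *_ = np.linalg.lstsq(others, X[:, j], rcond=None)
 e_j = X[:, j] - others @ gamma                    # residuals of X_j on the rest
 r2_j = 1 - (e_j @ e_j) / ((X[:, j] - X[:, j].mean()) ** 2).sum()
 
 lhs = cov[j + 1, j + 1]                           # estimated var of beta_hat_j
 rhs = s2 / ((n - 1) * X[:, j].var(ddof=1)) * 1 / (1 - r2_j)
 print(lhs, rhs)                                   # agree up to rounding error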


Calculation and analysis

We can calculate ''k'' different VIFs (one for each ''X''''i'') in three steps:


Step one

First we run an ordinary least squares regression that has ''X''''i'' as a function of all the other explanatory variables in the first equation. If ''i'' = 1, for example, the equation would be

: X_1 = \alpha_0 + \alpha_2 X_2 + \alpha_3 X_3 + \cdots + \alpha_k X_k + e,

where \alpha_0 is a constant and ''e'' is the error term.
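A minimal Python sketch of this auxiliary regression follows; the data array X (with columns X_1, ..., X_k) and the helper name auxiliary_fit are assumptions made for illustration.

 # Step one sketch: regress column i of X on a constant plus all other columns.
 import numpy as np
 
 def auxiliary_fit(X, i):
     """Return the fitted coefficients and residuals of X_i regressed on the rest."""
     target = X[:, i]
     others = np.column_stack([np.ones(X.shape[0]), np.delete(X, i, axis=1)])
     alpha, *_ = np.linalg.lstsq(others, target, rcond=None)
     return alpha, target - others @ alpha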


Step two

Then, calculate the VIF for \hat\beta_i with the following formula:

: \mathrm{VIF}_i = \frac{1}{1 - R_i^2},

where ''R''''i''2 is the coefficient of determination of the regression equation in step one, with X_i on the left hand side and all other predictor variables (all the other ''X'' variables) on the right hand side.
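Continuing the sketch from step one (and reusing the assumed auxiliary_fit helper), the VIF for X_i can be computed from the auxiliary regression's R_i^2:

 # Step two sketch: R_i^2 of the auxiliary regression, then VIF_i = 1 / (1 - R_i^2).
 def vif(X, i):
     _, residuals = auxiliary_fit(X, i)            # helper defined in the step-one sketch
     target = X[:, i]
     r_squared = 1 - (residuals @ residuals) / ((target - target.mean()) ** 2).sum()
     return 1.0 / (1.0 - r_squared)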


Step three

Analyze the magnitude of multicollinearity by considering the size of \operatorname{VIF}(\hat\beta_i). A rule of thumb is that if \operatorname{VIF}(\hat\beta_i) > 10 then multicollinearity is high (a cutoff of 5 is also commonly used). However, there is no value of the VIF greater than 1 at which the variance of the slopes of predictors is not inflated at all. As a result, including two or more variables in a multiple regression that are not orthogonal (i.e. have correlation ≠ 0) will alter each other's slope, the SE of the slope, and the ''p''-value, because there is shared variance between the predictors that cannot be uniquely attributed to any one of them. Some software instead calculates the tolerance, which is simply the reciprocal of the VIF. The choice of which to use is a matter of personal preference.
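As a sketch of this step (reusing the assumed vif helper from step two; the cutoff of 10 is the rule of thumb mentioned above, not a fixed standard), one might report every VIF together with its tolerance:

 # Step three sketch: report each VIF, its tolerance (1/VIF), and a rule-of-thumb flag.
 def vif_report(X, cutoff=10.0):
     for i in range(X.shape[1]):
         v = vif(X, i)                             # helper defined in the step-two sketch
         flag = "high" if v > cutoff else "acceptable"
         print(f"X_{i + 1}: VIF = {v:.2f}, tolerance = {1.0 / v:.3f} ({flag})")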


Interpretation

The square root of the variance inflation factor indicates how much larger the standard error becomes compared to the case in which that variable had zero correlation with the other predictor variables in the model. For example, if the variance inflation factor of a predictor variable were 5.27 (√5.27 ≈ 2.3), the standard error for the coefficient of that predictor variable would be 2.3 times larger than if that predictor variable had zero correlation with the other predictor variables.
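A small self-contained sketch of this arithmetic (the function name se_inflation is purely illustrative):

 # Interpretation sketch: a VIF implies a sqrt(VIF) inflation of the standard error.
 import math
 
 def se_inflation(vif_value):
     """Factor by which the coefficient's SE exceeds the zero-correlation case."""
     return math.sqrt(vif_value)
 
 print(se_inflation(5.27))   # ~2.30, matching the worked example above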


Implementation

* vif function in the car R package
* ols_vif_tol function in the olsrr R package
* PROC REG in the SAS System
* variance_inflation_factor function in the statsmodels Python package
* estat vif in Stata
* an addon for GRASS GIS
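For instance, the statsmodels function listed above can be used roughly as follows; the toy DataFrame and the induced collinearity are illustrative assumptions.

 # Usage sketch for statsmodels' variance_inflation_factor(exog_array, column_index).
 import numpy as np
 import pandas as pd
 import statsmodels.api as sm
 from statsmodels.stats.outliers_influence import variance_inflation_factor
 
 rng = np.random.default_rng(1)
 df = pd.DataFrame(rng.normal(size=(100, 3)), columns=["x1", "x2", "x3"])
 df["x2"] += 0.9 * df["x1"]                        # make x1 and x2 collinear
 
 exog = sm.add_constant(df)                        # design matrix with an intercept column
 vifs = {col: variance_inflation_factor(exog.values, i)
         for i, col in enumerate(exog.columns)}
 print(vifs)                                       # the "const" entry is usually ignored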




Further reading

* Zuur, A. F.; Ieno, E. N.; Elphick, C. S. (2010). "A protocol for data exploration to avoid common statistical problems". ''Methods in Ecology and Evolution''. 1: 3–14. doi:10.1111/j.2041-210X.2009.00001.x.


See also

* Design effect