In statistics, the variance inflation factor (VIF) is the ratio (quotient) of the variance of a parameter estimate in a model that includes multiple other terms (parameters) to the variance of that estimate in a model constructed using only one term. It quantifies the severity of multicollinearity in an ordinary least squares regression analysis. It provides an index that measures how much the variance (the square of the estimate's standard deviation) of an estimated regression coefficient is increased because of collinearity.
Cuthbert Daniel claims to have invented the concept behind the variance inflation factor, but did not come up with the name.
Definition
Consider the following linear model with $k$ independent variables:

: $Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_k X_k + \varepsilon.$
The standard error of the estimate of $\beta_j$ is the square root of the $(j+1)$-th diagonal element of $s^2(X^\mathsf{T}X)^{-1}$, where $s$ is the root mean squared error (RMSE); note that $\mathrm{RMSE}^2$ is a consistent estimator of $\sigma^2$, the true variance of the error term. $X$ is the regression design matrix: a matrix such that $X_{i,\,j+1}$ is the value of the $j$-th independent variable for the $i$-th case or observation, and such that $X_{i,\,1}$, the predictor associated with the intercept term, equals 1 for all $i$. It turns out that the square of this standard error, the estimated variance of the estimate of $\beta_j$, can be equivalently expressed as:
: $\widehat{\operatorname{var}}(\hat\beta_j) = \frac{s^2}{(n-1)\,\widehat{\operatorname{var}}(X_j)} \cdot \frac{1}{1 - R_j^2},$
where $R_j^2$ is the multiple $R^2$ for the regression of $X_j$ on the other covariates (a regression that does not involve the response variable $Y$), and $\widehat{\operatorname{var}}(X_j)$ is the sample variance of $X_j$. This identity separates the influences of several distinct factors on the variance of the coefficient estimate:
* $s^2$: greater scatter in the data around the regression surface leads to proportionately more variance in the coefficient estimates
* $n$: greater sample size results in proportionately less variance in the coefficient estimates
* $\widehat{\operatorname{var}}(X_j)$: greater variability in a particular covariate leads to proportionately less variance in the corresponding coefficient estimate
The remaining term, $1/(1 - R_j^2)$, is the VIF. It reflects all other factors that influence the uncertainty in the coefficient estimates. The VIF equals 1 when the vector $X_j$ is orthogonal to each column of the design matrix for the regression of $X_j$ on the other covariates. By contrast, the VIF is greater than 1 when the vector $X_j$ is not orthogonal to all columns of the design matrix for the regression of $X_j$ on the other covariates. Finally, note that the VIF is invariant to the scaling of the variables (that is, we could scale each variable $X_j$ by a constant $c_j$ without changing the VIF).
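The identity above is easy to check numerically. The following is a minimal sketch in Python (NumPy only; the synthetic data and all variable names are assumptions for illustration) that computes both sides for one coefficient and confirms they agree:

```python
import numpy as np

# Synthetic data (illustrative): three predictors, two of them correlated.
rng = np.random.default_rng(0)
n = 200
X = rng.normal(size=(n, 3))
X[:, 2] += 0.8 * X[:, 1]                      # induce collinearity
y = 1.0 + X @ np.array([0.5, -1.0, 2.0]) + rng.normal(size=n)

D = np.column_stack([np.ones(n), X])          # design matrix, intercept first
beta, *_ = np.linalg.lstsq(D, y, rcond=None)
resid = y - D @ beta
s2 = resid @ resid / (n - D.shape[1])         # s^2 = RMSE^2

col = 1                                       # column of X_1 in D
# Left-hand side: estimated variance of beta_1 from s^2 (X'X)^{-1}.
lhs = s2 * np.linalg.inv(D.T @ D)[col, col]

# R_1^2 from regressing X_1 on the intercept and the other predictors.
xj = D[:, col]
others = np.delete(D, col, axis=1)
gamma, *_ = np.linalg.lstsq(others, xj, rcond=None)
r = xj - others @ gamma
R2 = 1.0 - (r @ r) / np.sum((xj - xj.mean()) ** 2)

# Right-hand side: s^2 / ((n-1) var(X_1)) times the VIF, 1/(1 - R_1^2).
rhs = s2 / ((n - 1) * xj.var(ddof=1)) / (1.0 - R2)
assert np.isclose(lhs, rhs)
print(f"var(beta_1) = {lhs:.6f}, VIF_1 = {1 / (1 - R2):.3f}")
```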
To see where this factor comes from, start from the estimated covariance matrix of the coefficient estimates,

: $\widehat{\operatorname{var}}(\hat\beta) = s^2 (X^\mathsf{T}X)^{-1}.$

Now let $X_{-j}$ denote the design matrix with the column $X_j$ removed and, without losing generality, reorder the columns of $X$ to set the first column to be $X_j$:

: $X = [X_j,\ X_{-j}].$
By using
Schur complement In linear algebra and the theory of matrices, the Schur complement of a block matrix is defined as follows.
Suppose ''p'', ''q'' are nonnegative integers, and suppose ''A'', ''B'', ''C'', ''D'' are respectively ''p'' × ''p'', ''p'' × ''q'', ''q'' ...
, the element in the first row and first column in
is,
:
Then we have

: $\begin{aligned}
\widehat{\operatorname{var}}(\hat\beta_j) &= s^2 \left[(X^\mathsf{T}X)^{-1}\right]_{11} \\
&= \frac{s^2}{X_j^\mathsf{T}X_j - X_j^\mathsf{T}X_{-j}\left(X_{-j}^\mathsf{T}X_{-j}\right)^{-1}X_{-j}^\mathsf{T}X_j} \\
&= \frac{s^2}{X_j^\mathsf{T}X_j - X_j^\mathsf{T}X_{-j}\hat\beta_{*j}} \\
&= \frac{s^2}{\mathrm{RSS}_j}.
\end{aligned}$

Here $\hat\beta_{*j} = \left(X_{-j}^\mathsf{T}X_{-j}\right)^{-1}X_{-j}^\mathsf{T}X_j$ is the coefficient vector from the regression of the dependent variable $X_j$ over the covariates $X_{-j}$, and $\mathrm{RSS}_j$ is the corresponding residual sum of squares. Since $\mathrm{RSS}_j = (n-1)\,\widehat{\operatorname{var}}(X_j)\,(1 - R_j^2)$, this recovers the expression for $\widehat{\operatorname{var}}(\hat\beta_j)$ given above.
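The last equality can be checked directly. Below is a minimal sketch in Python (NumPy only; the synthetic data are an assumption for illustration) confirming that the first diagonal element of $(X^\mathsf{T}X)^{-1}$ equals $1/\mathrm{RSS}_j$:

```python
import numpy as np

# Synthetic data (illustrative): X_{-j} holds the intercept and two covariates.
rng = np.random.default_rng(1)
n = 100
X_minus_j = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
x_j = 0.6 * X_minus_j[:, 1] + rng.normal(size=n)   # correlated with the others
X = np.column_stack([x_j, X_minus_j])              # X = [X_j, X_{-j}]

first_diag = np.linalg.inv(X.T @ X)[0, 0]

# RSS_j from regressing X_j on X_{-j}.
beta_star, *_ = np.linalg.lstsq(X_minus_j, x_j, rcond=None)
rss_j = np.sum((x_j - X_minus_j @ beta_star) ** 2)

assert np.isclose(first_diag, 1.0 / rss_j)
```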
Calculation and analysis
We can calculate $k$ different VIFs (one for each $X_i$) in three steps:
Step one
First we run an ordinary least squares regression that has $X_i$ as a function of all the other explanatory variables in the first equation.

If $i = 1$, for example, the equation would be

: $X_1 = \alpha_0 + \alpha_2 X_2 + \alpha_3 X_3 + \cdots + \alpha_k X_k + e,$

where $\alpha_0$ is a constant and $e$ is the error term.
Step two
Then, calculate the VIF for $\hat\beta_i$ with the following formula:

: $\mathrm{VIF}_i = \frac{1}{1 - R_i^2},$
where $R_i^2$ is the coefficient of determination of the regression equation in step one, with $X_i$ on the left-hand side and all other predictor variables (all the other $X$ variables) on the right-hand side.
Step three
Analyze the magnitude of multicollinearity by considering the size of $\operatorname{VIF}(\hat\beta_i)$. A rule of thumb is that multicollinearity is high if $\operatorname{VIF}(\hat\beta_i) > 10$ (a cutoff of 5 is also commonly used). However, there is no value of the VIF greater than 1 at which the variance of the slopes of predictors is not inflated. As a result, including two or more variables in a multiple regression that are not orthogonal (i.e., that have nonzero correlation) will alter each other's slope, the standard error of the slope, and the ''p''-value, because there is shared variance between the predictors that cannot be uniquely attributed to any one of them.
Some software instead calculates the tolerance, which is simply the reciprocal of the VIF. The choice of which to use is a matter of personal preference.
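As a concrete illustration of the three steps, here is a minimal sketch in Python (NumPy only; the function name vif and the synthetic data are assumptions for illustration, not a library API):

```python
import numpy as np

def vif(X: np.ndarray) -> np.ndarray:
    """Return one VIF per column of the (n, k) predictor matrix X."""
    n, k = X.shape
    out = np.empty(k)
    for i in range(k):
        # Step one: OLS of X_i on an intercept plus all other predictors.
        xi = X[:, i]
        others = np.column_stack([np.ones(n), np.delete(X, i, axis=1)])
        coef, *_ = np.linalg.lstsq(others, xi, rcond=None)
        resid = xi - others @ coef
        # Step two: R_i^2 of that regression, then VIF_i = 1 / (1 - R_i^2).
        r2 = 1.0 - resid @ resid / np.sum((xi - xi.mean()) ** 2)
        out[i] = 1.0 / (1.0 - r2)
    return out

# Step three: inspect the magnitudes against the rule-of-thumb cutoff.
rng = np.random.default_rng(2)
X = rng.normal(size=(500, 3))
X[:, 2] = X[:, 0] + 0.3 * rng.normal(size=500)   # strong collinearity
for i, v in enumerate(vif(X), start=1):
    print(f"VIF_{i} = {v:.2f}" + ("  <- high (> 10)" if v > 10 else ""))
```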
Interpretation
The square root of the variance inflation factor indicates how much larger the standard error is, compared with what it would be if that variable were uncorrelated with the other predictor variables in the model.
Example
If the variance inflation factor of a predictor variable were 5.27 ($\sqrt{5.27} \approx 2.3$), this means that the standard error for the coefficient of that predictor variable is 2.3 times as large as it would be if that predictor variable were uncorrelated with the other predictor variables.
Implementation
* vif function in the car R package
* ols_vif_tol function in the olsrr R package
* PROC REG in the SAS System
* variance_inflation_factor function in the statsmodels Python package (see the sketch after this list)
* estat vif in Stata
* an addon for GRASS GIS
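As an example, the statsmodels function listed above can be used as follows (a brief sketch; the data here are synthetic, and the exog array must contain the constant column, whose index-0 entry is skipped):

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Synthetic predictors (illustrative).
rng = np.random.default_rng(3)
X = rng.normal(size=(200, 3))
exog = sm.add_constant(X)        # design matrix with intercept column

vifs = [variance_inflation_factor(exog, i) for i in range(1, exog.shape[1])]
print(vifs)                      # one VIF per predictor
```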
Further reading
* Zuur, A.F.; Ieno, E.N.; Elphick, C.S. (2010). "A protocol for data exploration to avoid common statistical problems". ''Methods in Ecology and Evolution''. 1: 3–14. doi:10.1111/j.2041-210X.2009.00001.x.
See also
* Design effect