In statistics, the variance inflation factor (VIF) is the ratio (quotient) of the variance of a parameter estimate when fitting a full model that includes other parameters to the variance of the parameter estimate if the model is fit with only the parameter on its own. The VIF provides an index that measures how much the variance (the square of the estimate's standard deviation) of an estimated regression coefficient is increased because of collinearity.
Cuthbert Daniel claims to have invented the concept behind the variance inflation factor, but did not come up with the name.
Definition
Consider the following linear model with ''k'' independent variables:
: $Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_k X_k + \varepsilon.$
The standard error of the estimate of $\beta_j$ is the square root of the (''j'' + 1)th diagonal element of $s^2(X'X)^{-1}$, where ''s'' is the root mean squared error (RMSE); note that RMSE$^2$ is a consistent estimator of $\sigma^2$, the true variance of the error term. ''X'' is the regression design matrix, i.e., a matrix such that $X_{i,\,j+1}$ is the value of the ''j''th independent variable for the ''i''th case or observation, and such that $X_{i,1}$, the predictor vector associated with the intercept term, equals 1 for all ''i''. It turns out that the square of this standard error, the estimated variance of the estimate of $\beta_j$, can be equivalently expressed as
: $\widehat{\operatorname{var}}(\hat\beta_j) = \frac{s^2}{(n-1)\,\widehat{\operatorname{var}}(X_j)} \cdot \frac{1}{1 - R_j^2},$
where $R_j^2$ is the multiple $R^2$ for the regression of $X_j$ on the other covariates (a regression that does not involve the response variable ''Y''), and $\hat\beta_j$ are the coefficient estimates, that is, the estimates of $\beta_j$. This identity separates the influences of several distinct factors on the variance of the coefficient estimate:
* $s^2$: greater scatter in the data around the regression surface leads to proportionately more variance in the coefficient estimates
* ''n'': greater sample size results in proportionately less variance in the coefficient estimates
* $\widehat{\operatorname{var}}(X_j)$: greater variability in a particular covariate leads to proportionately less variance in the corresponding coefficient estimate
The remaining term, $1/(1 - R_j^2)$, is the VIF. It reflects all other factors that influence the uncertainty in the coefficient estimates. The VIF equals 1 when the vector $X_j$ is orthogonal to each column of the design matrix for the regression of $X_j$ on the other covariates. By contrast, the VIF is greater than 1 when the vector $X_j$ is not orthogonal to all columns of that design matrix. Finally, note that the VIF is invariant to the scaling of the variables: we could scale each variable $X_j$ by a constant $c_j$ without changing the VIF.
To see where this identity comes from, start from the covariance matrix of the parameter estimates,
: $\widehat{\operatorname{var}}(\hat\beta) = s^2 (X'X)^{-1}.$
Now let $X_{-j}$ denote the design matrix with the column $X_j$ removed and, without loss of generality, reorder the columns of ''X'' to set the first column to be $X_j$:
: $X = [X_j,\ X_{-j}],$
: $X'X = \begin{bmatrix} X_j'X_j & X_j'X_{-j} \\ X_{-j}'X_j & X_{-j}'X_{-j} \end{bmatrix}.$
By using the Schur complement, the element in the first row and first column of $(X'X)^{-1}$ is
: $\left[(X'X)^{-1}\right]_{11} = \left[ X_j'X_j - X_j'X_{-j}\,(X_{-j}'X_{-j})^{-1}X_{-j}'X_j \right]^{-1}.$
Then we have
: $\widehat{\operatorname{var}}(\hat\beta_j) = s^2 \left[(X'X)^{-1}\right]_{11} = \frac{s^2}{X_j'X_j - X_j'X_{-j}\,\hat\alpha_j} = \frac{s^2}{\mathrm{RSS}_j} = \frac{s^2}{(n-1)\,\widehat{\operatorname{var}}(X_j)} \cdot \frac{1}{1-R_j^2}.$
Here $\hat\alpha_j = (X_{-j}'X_{-j})^{-1}X_{-j}'X_j$ is the vector of coefficients from the regression of the dependent variable $X_j$ on the covariates $X_{-j}$, and $\mathrm{RSS}_j$ is the corresponding residual sum of squares.
Calculation and analysis
We can calculate ''k'' different VIFs (one for each $X_i$) in three steps:
Step one
First we run an ordinary least squares regression that has $X_i$ as a function of all the other explanatory variables in the first equation.
If ''i'' = 1, for example, the equation would be
: $X_1 = \alpha_0 + \alpha_2 X_2 + \alpha_3 X_3 + \cdots + \alpha_k X_k + e,$
where $\alpha_0$ is a constant and $e$ is the error term.
Step two
Then, calculate the VIF for $\hat\beta_i$ with the following formula:
: $\mathrm{VIF}_i = \frac{1}{1 - R_i^2},$
where $R_i^2$ is the coefficient of determination of the regression equation in step one, with $X_i$ on the left-hand side and all other predictor variables (all the other ''X'' variables) on the right-hand side.
Step three
Analyze the magnitude of multicollinearity by considering the size of $\mathrm{VIF}_i$. A rule of thumb is that if $\mathrm{VIF}_i > 10$ then multicollinearity is high (a cutoff of 5 is also commonly used). However, there is no value of the VIF greater than 1 under which the variance of the slopes of predictors is not inflated. As a result, including two or more variables in a multiple regression that are not orthogonal (i.e., have correlation ≠ 0) will alter each other's slope, the standard error of the slope, and the ''P''-value, because there is shared variance between the predictors that cannot be uniquely attributed to any one of them.
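The three steps can be checked numerically against the identity $\widehat{\operatorname{var}}(\hat\beta_j) = s^2\,\mathrm{VIF}_j / \mathrm{SST}_j$, where $\mathrm{SST}_j = (n-1)\,\widehat{\operatorname{var}}(X_j)$. A sketch on simulated data (all variable names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
x1 = rng.normal(size=n)
x2 = 0.8 * x1 + 0.6 * rng.normal(size=n)   # deliberately correlated with x1
y = 1.0 + 2.0 * x1 - 1.0 * x2 + rng.normal(size=n)

# Step one: regress x1 on the other explanatory variables (here just x2).
Z = np.column_stack([np.ones(n), x2])
g, *_ = np.linalg.lstsq(Z, x1, rcond=None)
r2 = 1.0 - np.sum((x1 - Z @ g) ** 2) / np.sum((x1 - x1.mean()) ** 2)

# Step two: the VIF from that auxiliary R^2.
vif1 = 1.0 / (1.0 - r2)

# Step three: the VIF really is the inflation in var(beta1_hat).
X = np.column_stack([np.ones(n), x1, x2])
b, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ b
s2 = resid @ resid / (n - X.shape[1])             # s^2 = RMSE^2
var_beta1 = (s2 * np.linalg.inv(X.T @ X))[1, 1]   # estimated var of x1's slope
sst1 = np.sum((x1 - x1.mean()) ** 2)              # (n - 1) * var(x1)
assert np.isclose(var_beta1, s2 * vif1 / sst1)    # the identity holds
```

Because x1 and x2 were generated to be correlated, the VIF for x1's slope comes out well above 1.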
Some software instead calculates the tolerance, which is just the reciprocal of the VIF. The choice of which to use is a matter of personal preference.
Interpretation
The square root of the variance inflation factor indicates how much larger the standard error is, compared with what it would be if that variable had zero correlation with the other predictor variables in the model.
Example
If the variance inflation factor of a predictor variable were 5.27 (√5.27 ≈ 2.3), this means that the standard error for the coefficient of that predictor variable is 2.3 times as large as it would be if that predictor variable had zero correlation with the other predictor variables.
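The arithmetic behind this example is just a square root, so plain Python suffices:

```python
import math

vif = 5.27
# sqrt(VIF) is the factor by which the standard error is larger than it
# would be with zero correlation among the predictors
se_inflation = math.sqrt(vif)
print(round(se_inflation, 1))  # 2.3
```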
Implementation
* vif function in the car R package
* ols_vif_tol function in the olsrr R package
* PROC REG in the SAS System
* variance_inflation_factor function in the statsmodels Python package
* estat vif in Stata
* addon for GRASS GIS
* vif (non-categorical) and gvif (categorical data) functions in the StatsModels Julia package
References
Further reading
* {{cite journal |last1=Zuur |first1=A.F. |last2=Ieno |first2=E.N. |last3=Elphick |first3=C.S. |year=2010 |title=A protocol for data exploration to avoid common statistical problems |journal=Methods in Ecology and Evolution |volume=1 |pages=3–14 |doi=10.1111/j.2041-210X.2009.00001.x |s2cid=18814132 |doi-access=free}}
See also
* Design effect