In statistics, multicollinearity (also collinearity) is a phenomenon in which one predictor variable in a multiple regression model can be linearly predicted from the others with a substantial degree of accuracy. In this situation, the coefficient estimates of the multiple regression may change erratically in response to small changes in the model or the data. Multicollinearity does not reduce the predictive power or reliability of the model as a whole, at least within the sample data set; it only affects calculations regarding individual predictors. That is, a multivariate regression model with collinear predictors can indicate how well the entire bundle of predictors predicts the outcome variable, but it may not give valid results about any individual predictor, or about which predictors are redundant with respect to others.

Note that in statements of the assumptions underlying regression analyses such as ordinary least squares, the phrase "no multicollinearity" usually refers to the absence of perfect multicollinearity, which is an exact (non-stochastic) linear relation among the predictors. In such a case, the design matrix X has less than full rank, and therefore the moment matrix X^\mathsf{T}X cannot be inverted. Under these circumstances, for a general linear model y = X\beta + \epsilon, the ordinary least squares estimator \hat{\beta}_{OLS} = (X^\mathsf{T}X)^{-1}X^\mathsf{T}y does not exist.

In any case, multicollinearity is a characteristic of the design matrix, not the underlying statistical model. Multicollinearity leads to non-identifiable parameters.
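
To make the perfect-multicollinearity case concrete, here is a minimal numerical sketch (NumPy; the variables and data are illustrative, not taken from any source cited here): when one column of the design matrix is an exact linear combination of the others, X has less than full column rank and X^\mathsf{T}X cannot be inverted.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100

# x2 is an exact linear function of x1, so the design matrix is rank-deficient.
x1 = rng.normal(size=n)
x2 = 2.0 + 3.0 * x1
X = np.column_stack([np.ones(n), x1, x2])   # columns: intercept, x1, x2

print(np.linalg.matrix_rank(X))        # expected: 2, i.e. less than full column rank (3)
XtX = X.T @ X
print(np.linalg.matrix_rank(XtX))      # expected: 2, so XtX has no (exact) inverse
# np.linalg.inv(XtX) would either raise LinAlgError or return a numerically
# meaningless result, which is why the OLS estimator is not defined here.
```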


Definition

Collinearity is a linear association between two explanatory variables. Two variables are perfectly collinear if there is an exact linear relationship between them. For example, X_1 and X_2 are perfectly collinear if there exist parameters \lambda_0 and \lambda_1 such that, for all observations ''i'',

: X_{2i} = \lambda_0 + \lambda_1 X_{1i} .

Multicollinearity refers to a situation in which explanatory variables in a multiple regression model are highly linearly related. There is perfect multicollinearity if, for example as in the equation above, the correlation between two independent variables equals 1 or −1. In practice, perfect multicollinearity in a data set is rare. More commonly, the issue of multicollinearity arises when there is an approximate linear relationship among two or more independent variables.

Mathematically, a set of variables is perfectly multicollinear if there exist one or more exact linear relationships among some of the variables. That is, for all observations ''i'',

: \lambda_0 + \lambda_1 X_{1i} + \lambda_2 X_{2i} + \cdots + \lambda_k X_{ki} = 0 ,

where the \lambda_j are constants and X_{ji} is the ''i''-th observation on the ''j''-th explanatory variable.

To explore one issue caused by multicollinearity, consider the process of attempting to obtain estimates for the parameters of the multiple regression equation

: Y_i = \beta_0 + \beta_1 X_{1i} + \cdots + \beta_k X_{ki} + \varepsilon_i .

The ordinary least squares estimates involve inverting the matrix X^\mathsf{T}X, where

: X = \begin{bmatrix} 1 & X_{11} & \cdots & X_{k1} \\ \vdots & \vdots & & \vdots \\ 1 & X_{1N} & \cdots & X_{kN} \end{bmatrix}

is an N \times (k+1) matrix, where N is the number of observations, k is the number of explanatory variables, and N \ge k+1. If there is an exact linear relationship (perfect multicollinearity) among the independent variables, then at least one of the columns of X is a linear combination of the others, and so the rank of X (and therefore of X^\mathsf{T}X) is less than k+1, and the matrix X^\mathsf{T}X will not be invertible.

Perfect multicollinearity is fairly common when working with raw datasets, which frequently contain redundant information. Once redundancies are identified and removed, however, nearly multicollinear variables often remain due to correlations inherent in the system being studied. In such a case, the exact linear relationship above may be modified to include an error term v_i:

: \lambda_0 + \lambda_1 X_{1i} + \lambda_2 X_{2i} + \cdots + \lambda_k X_{ki} + v_i = 0 .

In this case, there is no exact linear relationship among the variables, but the X_j variables are nearly perfectly multicollinear if the variance of v_i is small for some set of values for the \lambda's. In this case, the matrix X^\mathsf{T}X has an inverse, but it is ill-conditioned, so that a given computer algorithm may or may not be able to compute an approximate inverse; if it can, the resulting computed inverse may be highly sensitive to slight variations in the data (due to magnified effects of either rounding error or slight variations in the sampled data points) and so may be inaccurate or sample-dependent.


Detection

The following are indicators that multicollinearity may be present in a model:
# Large changes in the estimated regression coefficients occur when a predictor variable is added or deleted.
# Insignificant regression coefficients for the affected variables occur in the multiple regression, despite a rejection of the joint hypothesis that those coefficients are all zero (using an ''F''-test).
# If a multivariable regression finds an insignificant coefficient of a particular explanator, yet a simple linear regression of the explained variable on this explanatory variable shows its coefficient to be significantly different from zero, this situation indicates multicollinearity in the multivariable regression.
# Some authors have suggested a formal detection-tolerance or the variance inflation factor (VIF) for multicollinearity:
#: \mathrm{tolerance} = 1 - R_j^2, \quad \mathrm{VIF} = \frac{1}{\mathrm{tolerance}},
#: where R_j^2 is the coefficient of determination of a regression of explanator ''j'' on all the other explanators. A tolerance of less than 0.20 or 0.10, a VIF of 5 or 10 and above, or both, indicates a multicollinearity problem (a computational sketch follows this list).
# Farrar–Glauber test: If the variables are found to be orthogonal, there is no multicollinearity; if the variables are not orthogonal, then at least some degree of multicollinearity is present. C. Robert Wichers has argued that the Farrar–Glauber partial correlation test is ineffective in that a given partial correlation may be compatible with different multicollinearity patterns. The Farrar–Glauber test has also been criticized by other researchers.
# Condition number test: The standard measure of ill-conditioning in a matrix is the condition index. This determines whether the inversion of the matrix is numerically unstable with finite-precision numbers (standard computer floats and doubles), indicating the potential sensitivity of the computed inverse to small changes in the original matrix. The condition number is computed by finding the square root of the maximum eigenvalue divided by the minimum eigenvalue of the design matrix. If the condition number is above 30, the regression may have severe multicollinearity; multicollinearity exists if, in addition, two or more of the variables related to the high condition number have high proportions of variance explained. One advantage of this method is that it also shows which variables are causing the problem.
# Perturbing the data: Multicollinearity can be detected by adding random noise to the data, re-running the regression many times, and seeing how much the coefficients change.
# Construction of a correlation matrix among the explanatory variables yields indications as to the likelihood that any given couplet of right-hand-side variables is creating multicollinearity problems. Correlation values (off-diagonal elements) of at least 0.4 are sometimes interpreted as indicating a multicollinearity problem. This procedure is, however, highly problematic and cannot be recommended. Intuitively, correlation describes a bivariate relationship, whereas collinearity is a multivariate phenomenon.
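
As a sketch of the tolerance/VIF computation (item 4) and the condition number test (item 6), using only NumPy; the helper function, data, and variable names are illustrative, and the thresholds are the ones quoted in the list above:

```python
import numpy as np

def vif_and_tolerance(X):
    """Tolerance and VIF for each column of X (X should not include the constant column)."""
    n, k = X.shape
    ones = np.ones((n, 1))
    results = []
    for j in range(k):
        target = X[:, j]
        others = np.hstack([ones, np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)   # regress column j on the rest
        resid = target - others @ coef
        r2 = 1.0 - resid.var() / target.var()                    # R_j^2
        tol = 1.0 - r2
        results.append((tol, 1.0 / tol))
    return results

rng = np.random.default_rng(2)
x1 = rng.normal(size=500)
x2 = x1 + rng.normal(scale=0.05, size=500)     # strongly collinear with x1
x3 = rng.normal(size=500)                      # unrelated to the others
X = np.column_stack([x1, x2, x3])

for j, (tol, vif) in enumerate(vif_and_tolerance(X), start=1):
    print(f"x{j}: tolerance = {tol:.4f}, VIF = {vif:.1f}")

# Condition number of the design matrix (intercept column included), i.e. the square
# root of the ratio of the largest to the smallest eigenvalue of X'X.
D = np.column_stack([np.ones(len(x1)), X])
print("condition number:", np.linalg.cond(D))
```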


Consequences

One consequence of a high degree of multicollinearity is that, even if the matrix X^\mathsf{T}X is invertible, a computer algorithm may be unsuccessful in obtaining an approximate inverse, and if it does obtain one, the inverse may be numerically inaccurate. But even when X^\mathsf{T}X can be accurately inverted, the following consequences arise.

The usual interpretation of a regression coefficient is that it estimates the effect of a one-unit change in an independent variable, X_1, holding the other variables constant. In the presence of multicollinearity, this estimate tends to be less precise than if the predictors were uncorrelated with one another. If X_1 is highly correlated with another independent variable X_2 in the given data set, then X_1 and X_2 have a particular linear stochastic relationship in the set. In other words, changes in X_1 are ''not'' independent of changes in X_2. This correlation creates an imprecise estimate of the effect of independent changes in X_1.

In some sense, the collinear variables contain the same information about the dependent variable. If nominally "different" measures quantify the same phenomenon, then they are redundant. Alternatively, if the variables are accorded different names and perhaps employ different numeric measurement scales but are highly correlated with each other, then they suffer from redundancy.

One of the features of multicollinearity is that the standard errors of the affected coefficients tend to be large. In this case, the test of the hypothesis that the coefficient is equal to zero may lead to a failure to reject a false null hypothesis of no effect of the explanator, a type II error. Another issue with multicollinearity is that small changes to the input data can lead to large changes in the model, even resulting in changes in the sign of parameter estimates.

A principal danger of such data redundancy is overfitting in regression analysis models. The best regression models are those in which the predictor variables each correlate highly with the dependent variable (outcome) but correlate only minimally with each other. Such a model is often called "low noise" and will be statistically robust (that is, it will predict reliably across numerous samples of variable sets drawn from the same statistical population).

So long as the underlying specification is correct, multicollinearity does not bias results; it just produces large standard errors for the related independent variables. More importantly, the usual use of regression is to take coefficients from the model and then apply them to other data. Since multicollinearity causes imprecise estimates of coefficient values, the resulting out-of-sample predictions will also be imprecise. And if the pattern of multicollinearity in the new data differs from that in the data that was fitted, such extrapolation may introduce large errors in the predictions.

However, if the underlying specification is anything less than complete and correct, multicollinearity amplifies misspecification biases. Even though not often recognized in methods texts, this is a common problem in the social sciences, where a complete, correct specification of an OLS regression model is rarely known and at least some relevant variables will be unobservable. As a result, the estimated coefficients of correlated independent variables in an OLS regression will be biased by multicollinearity. As the correlation approaches one, the coefficient estimates will misleadingly tend toward infinite magnitudes in opposite directions, even if the variables' true effects are small and of the same sign.
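
A small Monte Carlo sketch of the imprecision described above (NumPy; the correlation level, coefficients, and sample size are arbitrary illustrative choices): individual coefficient estimates for two nearly collinear predictors vary widely and often take opposite signs, while their sum is estimated stably.

```python
import numpy as np

rng = np.random.default_rng(3)
n, reps = 100, 2000
rho = 0.99                                      # target correlation between x1 and x2
b1_hat, b2_hat = [], []

for _ in range(reps):
    x1 = rng.normal(size=n)
    x2 = rho * x1 + np.sqrt(1 - rho**2) * rng.normal(size=n)
    y = 1.0 + 0.5 * x1 + 0.5 * x2 + rng.normal(size=n)
    X = np.column_stack([np.ones(n), x1, x2])
    b = np.linalg.lstsq(X, y, rcond=None)[0]
    b1_hat.append(b[1])
    b2_hat.append(b[2])

b1_hat, b2_hat = np.array(b1_hat), np.array(b2_hat)
print("sd of beta1_hat:            ", b1_hat.std())              # large: imprecise individually
print("sd of beta1_hat + beta2_hat:", (b1_hat + b2_hat).std())   # much smaller: stable jointly
print("fraction with opposite signs:", np.mean(b1_hat * b2_hat < 0))
```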


Remedies

# Avoid the dummy variable trap: including a dummy variable for every category (e.g., summer, autumn, winter, and spring) together with a constant term in the regression guarantees perfect multicollinearity.
# Use independent subsets of data for estimation, and then apply those estimates to the whole data set. This may result in a slightly higher variance than that of the subsets, but the expectation of the coefficient values should be the same. Observe how much the coefficient values vary.
# Leave the model as is, despite multicollinearity. The presence of multicollinearity doesn't affect the efficiency of extrapolating the fitted model to new data, provided that the predictor variables follow the same pattern of multicollinearity in the new data as in the data on which the regression model is based.
# Drop one of the variables. An explanatory variable may be dropped to produce a model with significant coefficients. However, this loses information. Omission of a relevant variable results in biased coefficient estimates for the remaining explanatory variables that are correlated with the dropped variable.
# Obtain more data, if possible. This is the preferred solution. More data can produce more precise parameter estimates (with lower standard errors), as seen from the formula in variance inflation factor for the variance of the estimate of a regression coefficient in terms of the sample size and the degree of multicollinearity.
# Mean-center the predictor variables. Generating polynomial terms (i.e., for x_1, terms such as x_1^2, x_1^3, etc.) or interaction terms (i.e., x_1 \times x_2, etc.) can cause some multicollinearity if the variable in question has a limited range (e.g., [2, 4]). Mean-centering will eliminate this special kind of multicollinearity. However, in general, this has no effect. It can be useful in overcoming problems arising from rounding and other computational steps if a carefully designed computer program is not used.
# Standardize the independent variables. This may help reduce a false flagging of a condition index above 30.
# It has also been suggested that, by using the Shapley value, a game theory tool, the model could account for the effects of multicollinearity. The Shapley value assigns a value to each predictor and assesses all possible combinations of importance.
# Use Tikhonov regularization (also known as ridge regression); a minimal sketch follows this list.
# Use principal component regression.
# Use partial least squares regression.
# If the correlated explanators are different lagged values of the same underlying explanator, then a distributed lag technique can be used, imposing a general structure on the relative values of the coefficients to be estimated.
# Treat highly linearly related variables as a group and study their group effects (see discussion below) instead of their individual effects. At the group level, multicollinearity is not a problem, so no remedies are needed.
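
As a minimal sketch of the ridge-regression remedy mentioned above, using the closed-form Tikhonov estimator with NumPy; the penalty value, the standardization step, and the simulated data are illustrative choices rather than recommendations:

```python
import numpy as np

def ridge(X, y, lam):
    """Closed-form Tikhonov/ridge estimate: (X'X + lam * I)^{-1} X'y.
    Assumes the columns of X and y are centered, so no intercept column is used."""
    k = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(k), X.T @ y)

rng = np.random.default_rng(4)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.05, size=n)       # nearly collinear pair
y = 0.5 * x1 + 0.5 * x2 + rng.normal(size=n)

# Standardize so a single penalty is comparable across predictors.
X = np.column_stack([x1, x2])
X = (X - X.mean(axis=0)) / X.std(axis=0)
y = y - y.mean()

print("OLS:  ", ridge(X, y, lam=0.0))          # often erratic, nearly offsetting estimates
print("ridge:", ridge(X, y, lam=10.0))         # shrunken, more stable estimates
```

Increasing the penalty trades a small amount of bias for a potentially large reduction in the variance of the collinear coefficient estimates.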


Multicollinearity and group effects

Strongly correlated predictor variables appear naturally as a group. Their collective impact on the response variable can be measured by group effects. For a group of predictor variables \{X_1, X_2, \dots, X_q\}, a group effect is defined as a linear combination of their parameters:

: \xi(\mathbf{w}) = w_1\beta_1 + w_2\beta_2 + \dots + w_q\beta_q ,

where \mathbf{w} = (w_1, w_2, \dots, w_q)^\intercal is a weight vector satisfying \sum_{j=1}^q |w_j| = 1. It has an interpretation as the expected change in the response variable Y when the variables in the group X_1, X_2, \dots, X_q change by the amounts w_1, w_2, \dots, w_q, respectively, at the same time, with variables not in the group held constant. Group effects generalize the individual effects in that (1) if q = 1, then the group effect reduces to an individual effect, and (2) if w_i = 1 and w_j = 0 for j \neq i, then the group effect also reduces to an individual effect.

A group effect is said to be meaningful if the underlying simultaneous changes of the q variables represented by the weight vector (w_1, w_2, \dots, w_q)^\intercal are probable. When \{X_1, X_2, \dots, X_q\} is a group of strongly correlated variables, \beta_1 = \xi(\mathbf{w}_1) is not meaningful as a group effect, since its underlying simultaneous changes, represented by \mathbf{w}_1 = (1, 0, \dots, 0)^\intercal \in \mathbb{R}^q, are not probable. This is because, due to their strong correlations, it is unlikely that the other variables in the group will remain unchanged when X_1 increases by one unit. This observation also applies to the parameters of the other variables in the group.

For strongly correlated predictor variables, group effects that are not meaningful, such as the \beta_i's, cannot be accurately estimated by least squares regression. On the other hand, meaningful group effects can be accurately estimated by least squares regression. This shows that strongly correlated predictor variables should be handled as a group, and that multicollinearity is not a problem at the group level. For a discussion on how to identify meaningful group effects, see linear regression.
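
A sketch contrasting an individual coefficient with a meaningful group effect, here using the averaging weight vector \mathbf{w} = (1/2, 1/2) for a pair of strongly correlated predictors (NumPy; the simulated data and the choice of weights are illustrative):

```python
import numpy as np

rng = np.random.default_rng(5)
n = 300
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.1, size=n)        # {x1, x2} form a strongly correlated group
y = 1.0 + 0.3 * x1 + 0.7 * x2 + rng.normal(size=n)

X = np.column_stack([np.ones(n), x1, x2])
XtX_inv = np.linalg.inv(X.T @ X)
beta = XtX_inv @ X.T @ y                       # OLS estimates
resid = y - X @ beta
sigma2 = resid @ resid / (n - X.shape[1])
cov_beta = sigma2 * XtX_inv                    # estimated covariance of the OLS estimator

# Individual effect beta_1 versus the "average" group effect with w = (1/2, 1/2).
w = np.array([0.0, 0.5, 0.5])                  # weights on (intercept, x1, x2); |w_1| + |w_2| = 1
se_beta1 = np.sqrt(cov_beta[1, 1])
se_group = np.sqrt(w @ cov_beta @ w)

print("beta1_hat:   ", beta[1], "  SE:", se_beta1)     # large SE: not precisely estimable alone
print("group effect:", w @ beta, "  SE:", se_group)    # much smaller SE
```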


Occurrence


Survival analysis

Multicollinearity may represent a serious issue in survival analysis. The problem is that time-varying covariates may change their value over the timeline of the study. A special procedure is recommended to assess the impact of multicollinearity on the results.


Interest rates for different terms to maturity

In various situations, it might be hypothesized that multiple interest rates of various terms to maturity all influence some economic decision, such as the amount of money or some other financial asset to hold, or the amount of fixed investment spending to engage in. In this case, including these various interest rates will in general create a substantial multicollinearity problem, because interest rates tend to move together. If each of the interest rates has its own separate effect on the dependent variable, it can be extremely difficult to separate out their effects.


Common factors

The bias-amplifying combination of multicollinearity and misspecification may occur when studies attempt to tease out the effects of two independent variables that (1) are linked by a substantive common factor, and (2) contain unobservable but substantive components (not mere error terms) that are orthogonal to the common factor and that affect the dependent variable separately from any effect of the common factor. For example, studies sometimes include the same variable twice in a regression, measured at two different points in time. A time-invariant factor common to both variables causes the multicollinearity, while the unobservable nature of the common factor or the time-specific orthogonal components causes the misspecification. The same structure may apply to other substantive variable pairs with a common factor, such as two types of knowledge, intelligence, conflict, or financial measures (such as the interest rates mentioned above).

The two main implications of the presence of such common factors among the independent variables of a regression analysis are that, as the correlation of the independent variables approaches one due to a sizeable common factor, (1) their coefficient estimates will misleadingly tend toward infinite magnitudes in opposite directions, even if the variables' true effects are small and of the same sign, and (2) the magnitudes of the biased coefficients will be amplified at a similar pace to the standard errors, and therefore t-statistics may remain artificially large. Counter-intuitive type I errors are a likely result, rather than the type II errors typically associated with multicollinearity.

To convince readers that this form of multicollinearity is not biasing results, studies should not merely "drop" one of the collinear variables. Rather, they should present separate regression results with each of the collinear variables in isolation, followed by a regression that contains both variables. Consistent coefficient signs and magnitudes across these specifications represent strong evidence that common-factor multicollinearity is not biasing results.
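
A sketch of the comparison the passage recommends, fitting each collinear variable in isolation and then jointly, and comparing coefficient signs and magnitudes (NumPy; the common-factor data-generating process is illustrative):

```python
import numpy as np

def ols(X, y):
    """Plain least-squares fit; returns the coefficient vector."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

rng = np.random.default_rng(6)
n = 500
common = rng.normal(size=n)                    # the shared common factor
x1 = common + 0.3 * rng.normal(size=n)         # two measures driven largely by the same factor
x2 = common + 0.3 * rng.normal(size=n)
y = 0.4 * x1 + 0.4 * x2 + rng.normal(size=n)

ones = np.ones(n)
b_x1_only = ols(np.column_stack([ones, x1]), y)
b_x2_only = ols(np.column_stack([ones, x2]), y)
b_both    = ols(np.column_stack([ones, x1, x2]), y)

# Compare signs and magnitudes across the three specifications, as the text suggests.
print("y ~ x1      :", b_x1_only[1])
print("y ~ x2      :", b_x2_only[1])
print("y ~ x1 + x2 :", b_both[1], b_both[2])
```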


Extension

The concept of ''lateral collinearity'' expands the traditional view of multicollinearity to include collinearity between explanatory and criterion (i.e., explained) variables, in the sense that they may be measuring almost the same thing as each other.


See also

* Ill-conditioned matrix
* Linear independence



External links

* Earliest Uses: The entry on Multicollinearity has some historical information.