In statistics, and in particular in regression analysis, leverage is a measure of how far the independent variable values of an observation are from those of the other observations. ''High-leverage points'', if any, are outliers with respect to the independent variables. That is, high-leverage points have no neighboring points in $\mathbb{R}^{p}$ space, where $p$ is the number of independent variables in a regression model. This makes the fitted model likely to pass close to a high-leverage observation. Hence high-leverage points have the potential to cause large changes in the parameter estimates when they are deleted, i.e., to be influential points. Although an influential point will typically have high leverage, a high-leverage point is not necessarily an influential point. The leverage is typically defined as the diagonal elements of the hat matrix.
Definition and interpretations
Consider the linear regression model $y_i = \boldsymbol{x}_i^{\top}\boldsymbol{\beta} + \varepsilon_i$, $i = 1, 2, \ldots, n$. That is, $\boldsymbol{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}$, where $\mathbf{X}$ is the $n \times p$ design matrix whose rows correspond to the observations and whose columns correspond to the independent or explanatory variables. The ''leverage score'' for the $i$th observation of the independent variables, $\boldsymbol{x}_i$, is given as:
: $h_{ii} = [\mathbf{H}]_{ii} = \boldsymbol{x}_i^{\top}(\mathbf{X}^{\top}\mathbf{X})^{-1}\boldsymbol{x}_i$,
the $i$th diagonal element of the ortho-projection matrix (''a.k.a.'' hat matrix) $\mathbf{H} = \mathbf{X}(\mathbf{X}^{\top}\mathbf{X})^{-1}\mathbf{X}^{\top}$.
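Thus the $i$th leverage score can be viewed as the 'weighted' distance from $\boldsymbol{x}_i$ to the mean of the $\boldsymbol{x}_i$'s (see its relation with Mahalanobis distance). It can also be interpreted as the degree by which the $i$th measured (dependent) value (i.e., $y_i$) influences the $i$th fitted (predicted) value (i.e., $\widehat{y}_i$): mathematically,
: $h_{ii} = \frac{\partial \widehat{y}_i}{\partial y_i}$.
Hence, the leverage score is also known as the observation self-sensitivity or self-influence. Using the fact that $\widehat{\boldsymbol{y}} = \mathbf{H}\boldsymbol{y}$, with $\mathbf{H}$ the hat matrix (i.e., the prediction $\widehat{\boldsymbol{y}}$ is the ortho-projection of $\boldsymbol{y}$ onto the range space of $\mathbf{X}$), in the above expression, we get $h_{ii} = [\mathbf{H}]_{ii}$. Note that this leverage depends on the values of the explanatory variables $(\mathbf{X})$ of all observations but not on any of the values of the dependent variables $(y_i)$.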
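As a minimal sketch of the definition, the leverages can be computed directly from a design matrix with NumPy. The helper name `leverage_scores` and the small data set are illustrative assumptions, not part of any particular library; the last few lines check the self-sensitivity interpretation by perturbing one response value.

```python
import numpy as np

def leverage_scores(X):
    """Return the diagonal of H = X (X^T X)^{-1} X^T for a full-column-rank X."""
    X = np.asarray(X, dtype=float)
    # Solve (X^T X) B = X^T rather than forming the inverse explicitly.
    B = np.linalg.solve(X.T @ X, X.T)           # shape (p, n)
    # h_ii = x_i^T (X^T X)^{-1} x_i = sum_k X[i, k] * B[k, i]
    return np.einsum("ik,ki->i", X, B)

# Hypothetical data: intercept column plus one predictor; the last point lies far from the rest.
x = np.array([1.0, 2.0, 3.0, 4.0, 20.0])
X = np.column_stack([np.ones_like(x), x])
h = leverage_scores(X)

# Self-sensitivity check: perturbing y_i by delta moves the fitted value by h_ii * delta.
y = np.array([1.1, 1.9, 3.2, 3.8, 9.0])
H = X @ np.linalg.solve(X.T @ X, X.T)
delta = 1e-3
y_perturbed = y.copy()
y_perturbed[4] += delta
sensitivity = ((H @ y_perturbed)[4] - (H @ y)[4]) / delta
```

Because the fitted values are linear in $\boldsymbol{y}$, the finite-difference `sensitivity` agrees with `h[4]` up to rounding error, and the isolated point has by far the largest leverage.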
Properties
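# The leverage $h_{ii}$ is a number between 0 and 1, $0 \leq h_{ii} \leq 1$. Proof: Note that $\mathbf{H}$ is an idempotent matrix ($\mathbf{H}^2 = \mathbf{H}$) and symmetric ($h_{ij} = h_{ji}$). Thus, by using the fact that $[\mathbf{H}^2]_{ii} = [\mathbf{H}]_{ii}$, we have $h_{ii} = h_{ii}^2 + \sum_{j \neq i} h_{ij}^2$. Since we know that $\sum_{j \neq i} h_{ij}^2 \geq 0$, we have $h_{ii} \geq h_{ii}^2 \implies 0 \leq h_{ii} \leq 1$.
# The sum of the leverages is equal to the number of parameters $(p)$ in $\boldsymbol{\beta}$ (including the intercept). Proof: $\sum_{i=1}^{n} h_{ii} = \operatorname{Tr}(\mathbf{H}) = \operatorname{Tr}\left(\mathbf{X}(\mathbf{X}^{\top}\mathbf{X})^{-1}\mathbf{X}^{\top}\right) = \operatorname{Tr}\left(\mathbf{X}^{\top}\mathbf{X}(\mathbf{X}^{\top}\mathbf{X})^{-1}\right) = \operatorname{Tr}(\mathbf{I}_p) = p$.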
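Both properties can be checked numerically. The following sketch uses randomly generated data (the seed, sizes, and variable names are arbitrary assumptions) and verifies symmetry, idempotence, the identity used in the proof of property 1, and the trace identity of property 2.

```python
import numpy as np

rng = np.random.default_rng(0)           # fixed seed; purely illustrative data
n, p = 30, 4
# Intercept column plus p - 1 random predictors.
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])

H = X @ np.linalg.solve(X.T @ X, X.T)    # hat matrix
h = np.diag(H)

is_symmetric = np.allclose(H, H.T)
is_idempotent = np.allclose(H @ H, H)
# Identity from the proof of property 1: h_ii = h_ii^2 + sum_{j != i} h_ij^2,
# i.e. each row's sum of squares equals the corresponding diagonal entry.
row_sq_sums = (H ** 2).sum(axis=1)
in_unit_interval = bool(np.all((h >= 0) & (h <= 1)))
# Property 2: the leverages sum to the number of parameters p.
trace_H = h.sum()
```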
Determination of outliers in X using leverages
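A large leverage $h_{ii}$ corresponds to an $\boldsymbol{x}_i$ that is extreme. A common rule is to identify any $\boldsymbol{x}_i$ whose leverage value $h_{ii}$ is more than twice the mean leverage $\bar{h} = \frac{1}{n}\sum_{i=1}^{n} h_{ii} = \frac{p}{n}$ (see property 2 above). That is, if $h_{ii} > 2\frac{p}{n}$, then $\boldsymbol{x}_i$ is considered an outlier. Some statisticians prefer the threshold $3p/n$ instead of $2p/n$.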
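The rule of thumb translates into a one-line comparison. In this sketch the function name `high_leverage_mask`, its `multiple` parameter, and the data are hypothetical choices for illustration only.

```python
import numpy as np

def high_leverage_mask(X, multiple=2.0):
    """Flag rows of X whose leverage exceeds multiple * p / n; also return the leverages."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    h = np.einsum("ij,ji->i", X, np.linalg.solve(X.T @ X, X.T))
    return h > multiple * p / n, h

# Hypothetical data: nine evenly spaced predictor values and one distant value.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 30.0])
X = np.column_stack([np.ones_like(x), x])    # n = 10, p = 2, mean leverage 0.2

flagged_2, h = high_leverage_mask(X, multiple=2.0)   # threshold 2p/n = 0.4
flagged_3, _ = high_leverage_mask(X, multiple=3.0)   # threshold 3p/n = 0.6
```

Here only the last observation is flagged under either threshold; on real data the two cut-offs can disagree, and which one to use is a judgment call.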
Relation to Mahalanobis distance
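Leverage is closely related to the Mahalanobis distance. Specifically, for some $n \times p$ matrix $\mathbf{X}$, the squared Mahalanobis distance of $\boldsymbol{x}_i$ (where $\boldsymbol{x}_i^{\top}$ is the $i$th row of $\mathbf{X}$) from the mean vector $\widehat{\boldsymbol{\mu}} = \frac{1}{n}\sum_{i=1}^{n} \boldsymbol{x}_i$ of length $p$ is $D^2(\boldsymbol{x}_i) = (\boldsymbol{x}_i - \widehat{\boldsymbol{\mu}})^{\top}\mathbf{S}^{-1}(\boldsymbol{x}_i - \widehat{\boldsymbol{\mu}})$, where $\mathbf{S} = \frac{1}{n-1}\sum_{i=1}^{n} (\boldsymbol{x}_i - \widehat{\boldsymbol{\mu}})(\boldsymbol{x}_i - \widehat{\boldsymbol{\mu}})^{\top}$ is the estimated covariance matrix of the $\boldsymbol{x}_i$'s. This is related to the leverage $h_{ii}$ of the hat matrix of $\mathbf{X}$ after appending a column vector of 1's to it. The relationship between the two is:
: $D^2(\boldsymbol{x}_i) = (n - 1)\left(h_{ii} - \tfrac{1}{n}\right)$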
This relationship enables us to decompose leverage into meaningful components so that some sources of high leverage can be investigated analytically.
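The relationship can be checked numerically. This sketch assumes the sample covariance with denominator $n - 1$ (NumPy's default for `np.cov`) and computes the leverage from the design matrix with an intercept column appended; the random data and variable names are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 25, 3
Z = rng.normal(size=(n, k))                  # predictors without the intercept column

# Squared Mahalanobis distance of each row from the sample mean.
mu = Z.mean(axis=0)
S = np.cov(Z, rowvar=False)                  # sample covariance, denominator n - 1
centered = Z - mu
d2 = np.einsum("ij,jk,ik->i", centered, np.linalg.inv(S), centered)

# Leverage from the design matrix with a column of 1's appended.
X = np.column_stack([np.ones(n), Z])
h = np.einsum("ij,ji->i", X, np.linalg.solve(X.T @ X, X.T))

rhs = (n - 1) * (h - 1.0 / n)                # should match d2 elementwise
```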
Relation to influence functions
In a regression context, we combine leverage and influence functions to compute the degree to which estimated coefficients would change if we removed a single data point. Denoting the regression residuals as $\widehat{e}_i = y_i - \boldsymbol{x}_i^{\top}\widehat{\boldsymbol{\beta}}$, one can compare the estimated coefficient $\widehat{\boldsymbol{\beta}}$ to the leave-one-out estimated coefficient $\widehat{\boldsymbol{\beta}}^{(-i)}$ using the formula
: $\widehat{\boldsymbol{\beta}} - \widehat{\boldsymbol{\beta}}^{(-i)} = \frac{(\mathbf{X}^{\top}\mathbf{X})^{-1}\boldsymbol{x}_i \widehat{e}_i}{1 - h_{ii}}$
Young (2019) uses a version of this formula after residualizing controls. To gain intuition for this formula, note that $\frac{\partial \widehat{\boldsymbol{\beta}}}{\partial y_i} = (\mathbf{X}^{\top}\mathbf{X})^{-1}\boldsymbol{x}_i$ captures the potential for an observation to affect the regression parameters, and therefore $(\mathbf{X}^{\top}\mathbf{X})^{-1}\boldsymbol{x}_i \widehat{e}_i$ captures the actual influence of that observation's deviation from its fitted value on the regression parameters. The formula then divides by $(1 - h_{ii})$ to account for the fact that we remove the observation rather than adjusting its value, reflecting the fact that removal changes the distribution of covariates more when applied to high-leverage observations (i.e., those with outlying covariate values). Similar formulas arise when applying general formulas for statistical influence functions in the regression context.
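A short sketch can compare the closed-form leave-one-out expression with a brute-force refit. The simulated data, the coefficient values, and the choice of which observation to drop are all arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 20
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.3, size=n)

XtX_inv = np.linalg.inv(X.T @ X)
beta = XtX_inv @ X.T @ y                      # full-sample OLS estimate
resid = y - X @ beta                          # residuals e_i
h = np.einsum("ij,jk,ik->i", X, XtX_inv, X)   # leverages h_ii

i = 7  # arbitrary observation to drop
# Closed form: beta - beta^{(-i)} = (X^T X)^{-1} x_i e_i / (1 - h_ii)
shortcut = (XtX_inv @ X[i]) * resid[i] / (1.0 - h[i])

# Brute force: refit the regression without row i.
keep = np.arange(n) != i
beta_loo, *_ = np.linalg.lstsq(X[keep], y[keep], rcond=None)
brute_force = beta - beta_loo
```

The two vectors agree up to rounding, so all $n$ leave-one-out coefficient changes can be obtained from a single fit rather than $n$ refits.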
Effect on residual variance
If we are in an ordinary least squares setting with fixed $\mathbf{X}$ and homoscedastic regression errors $\varepsilon_i$, that is, $\boldsymbol{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}$ with $\operatorname{Var}(\boldsymbol{\varepsilon}) = \sigma^2\mathbf{I}$, then the $i$th regression residual, $e_i = y_i - \widehat{y}_i$, has variance
: $\operatorname{Var}(e_i) = (1 - h_{ii})\sigma^2$.
In other words, an observation's leverage score determines the degree of noise in the model's misprediction of that observation, with higher leverage leading to less noise. This follows from the fact that $\mathbf{I} - \mathbf{H}$ is idempotent and symmetric and $\widehat{\boldsymbol{y}} = \mathbf{H}\boldsymbol{y}$; hence, $\operatorname{Var}(\boldsymbol{e}) = \operatorname{Var}((\mathbf{I} - \mathbf{H})\boldsymbol{y}) = (\mathbf{I} - \mathbf{H})\operatorname{Var}(\boldsymbol{y})(\mathbf{I} - \mathbf{H})^{\top} = \sigma^2(\mathbf{I} - \mathbf{H})^2 = \sigma^2(\mathbf{I} - \mathbf{H})$.
The corresponding studentized residual, that is, the residual adjusted for its observation-specific estimated residual variance, is then
: $t_i = \frac{e_i}{\widehat{\sigma}\sqrt{1 - h_{ii}}}$
where $\widehat{\sigma}$ is an appropriate estimate of $\sigma$.
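The variance formula can be illustrated by simulation. The sketch below holds a hypothetical design fixed, draws many response vectors with homoscedastic normal errors, and compares the empirical residual variances with $(1 - h_{ii})\sigma^2$. It also computes internally studentized residuals for one draw, using $\widehat{\sigma}^2 = \sum_i e_i^2 / (n - p)$ as one possible choice of estimate; all sizes and parameter values are assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)
n, sigma = 12, 2.0
x = np.linspace(0.0, 1.0, n)
x[-1] = 5.0                                   # one high-leverage design point
X = np.column_stack([np.ones(n), x])
H = X @ np.linalg.solve(X.T @ X, X.T)
h = np.diag(H)
M = np.eye(n) - H                             # residuals are e = (I - H) y

# Simulate many response vectors for the same fixed X.
reps = 20000
beta = np.array([1.0, 3.0])
Y = X @ beta + rng.normal(scale=sigma, size=(reps, n))
E = Y @ M                                     # M is symmetric, so row r equals (I - H) y_r
empirical_var = E.var(axis=0)
theoretical_var = (1.0 - h) * sigma ** 2

# Internally studentized residuals for the first simulated data set.
e = E[0]
p = X.shape[1]
sigma_hat = np.sqrt(e @ e / (n - p))
t = e / (sigma_hat * np.sqrt(1.0 - h))
```

The residual at the high-leverage point has a much smaller variance than the others, which is why raw residuals alone can hide a poorly fitted high-leverage observation.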
Partial leverage
Partial leverage (PL) is a measure of the contribution of the individual independent variables to the total leverage of each observation. That is, PL is a measure of how $h_{ii}$ changes as a variable is added to the regression model. It is computed as:
: $(\mathrm{PL}_j)_i = \frac{\left(\mathbf{X}_{j \bullet [j]}\right)_i^2}{\sum_{k=1}^{n} \left(\mathbf{X}_{j \bullet [j]}\right)_k^2}$
where $j$ is the index of the independent variable, $i$ is the index of the observation, and $\mathbf{X}_{j \bullet [j]}$ are the residuals from regressing $\mathbf{X}_j$ against the remaining independent variables. Note that the partial leverage is the leverage of the $i$th point in the partial regression plot for the $j$th variable. Data points with large partial leverage for an independent variable can exert undue influence on the selection of that variable in automatic regression model building procedures.
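A minimal sketch of the computation follows, assuming the remaining columns include the intercept. The function names `partial_leverage` and `hat_diag` and the random data are hypothetical. The final lines check that the partial leverage equals the increase in $h_{ii}$ when the variable is added to the model.

```python
import numpy as np

def partial_leverage(X, j):
    """Partial leverage of every observation for column j of X.

    Column j is regressed on the remaining columns of X (assumed to include
    the intercept column); the squared residuals are normalised to sum to one.
    """
    X = np.asarray(X, dtype=float)
    others = np.delete(X, j, axis=1)
    coef, *_ = np.linalg.lstsq(others, X[:, j], rcond=None)
    r = X[:, j] - others @ coef
    return r ** 2 / np.sum(r ** 2)

def hat_diag(A):
    """Diagonal of the hat matrix of A."""
    return np.einsum("ij,ji->i", A, np.linalg.solve(A.T @ A, A.T))

rng = np.random.default_rng(4)
n = 15
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])

pl = partial_leverage(X, j=2)
# Change in leverage when column 2 is added to a model containing the other columns.
increase = hat_diag(X) - hat_diag(np.delete(X, 2, axis=1))
```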
Software implementations
Many programs and statistics packages, such as R and Python, include implementations of leverage.
See also
* Projection matrix – whose main diagonal entries are the leverages of the observations
* Mahalanobis distance – a (scaled) measure of leverage of a datum
* Partial leverage
* Cook's distance – a measure of changes in regression coefficients when an observation is deleted
* DFFITS
* Outlier – observations with extreme ''Y'' values
* Degrees of freedom (statistics), the sum of leverage scores