DFFIT and DFFITS ("difference in fit(s)") are diagnostics, first proposed in 1980, that show how influential a point is in a statistical regression.
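DFFIT is the change in the predicted value for a point, obtained when that point is left out of the regression:
:<math>\text{DFFIT} = \widehat{y}_i - \widehat{y}_{i(i)}</math>
where <math>\widehat{y}_i</math> and <math>\widehat{y}_{i(i)}</math> are the predictions for point ''i'' with and without point ''i'' included in the regression.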
DFFITS is the Studentized DFFIT, where Studentization is achieved by dividing by the estimated standard deviation of the fit at that point:
:<math>\text{DFFITS} = \frac{\text{DFFIT}}{s_{(i)} \sqrt{h_{ii}}}</math>
where <math>s_{(i)}</math> is the standard error estimated without the point in question, and <math>h_{ii}</math> is the leverage for the point.
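The two definitions above can be checked by brute force: refit the model without point ''i'' and compare predictions. The following is a minimal sketch, assuming a simple linear regression with an intercept and one predictor (so two parameters); the function names and the pure-Python implementation are illustrative choices, not part of the original proposal.

```python
import math

def fit_line(x, y):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
    return ybar - b * xbar, b

def leverage(x, i):
    """Leverage h_ii of point i for a one-predictor model with intercept."""
    n = len(x)
    xbar = sum(x) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    return 1.0 / n + (x[i] - xbar) ** 2 / sxx

def dffit_and_dffits(x, y, i):
    """Return (DFFIT_i, DFFITS_i) by explicitly refitting without point i."""
    x, y = list(x), list(y)
    a, b = fit_line(x, y)
    x_del, y_del = x[:i] + x[i + 1:], y[:i] + y[i + 1:]
    a_del, b_del = fit_line(x_del, y_del)

    # DFFIT: prediction at x[i] with the point minus prediction without it.
    dffit = (a + b * x[i]) - (a_del + b_del * x[i])

    # s_(i): residual standard error of the fit that excludes point i,
    # with (n - 1) - 2 degrees of freedom for this two-parameter model.
    sse_del = sum((yj - (a_del + b_del * xj)) ** 2
                  for xj, yj in zip(x_del, y_del))
    s_del = math.sqrt(sse_del / (len(x_del) - 2))

    return dffit, dffit / (s_del * math.sqrt(leverage(x, i)))
```

For example, `dffit_and_dffits([1.0, 2.0, 3.0, 4.0, 10.0], [1.1, 1.9, 3.2, 3.9, 6.0], 4)` evaluates the influence of the last, high-leverage point. The sketch needs at least four observations and a deleted fit with nonzero residual variance.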
DFFITS also equals the product of the externally Studentized residual (<math>t_{i(i)}</math>) and the leverage factor (<math>\sqrt{h_{ii}/(1-h_{ii})}</math>):
:<math>\text{DFFITS} = t_{i(i)} \sqrt{\frac{h_{ii}}{1-h_{ii}}}</math>
Thus, for low leverage points, DFFITS is expected to be small, whereas as the leverage approaches 1, the distribution of the DFFITS value widens without bound.
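The product form means DFFITS can be obtained from a single fit, with no refitting. The sketch below assumes a simple linear regression with intercept (two parameters) and uses the standard leave-one-out identity that the deleted residual sum of squares equals the full one minus <math>e_i^2/(1-h_{ii})</math>; the function name is hypothetical.

```python
import math

def dffits_closed_form(x, y):
    """DFFITS for every point of y = a + b*x, computed from one fit."""
    n, p = len(x), 2  # p = 2 parameters: intercept and slope (assumed model)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
    a = ybar - b * xbar
    resid = [yi - (a + b * xi) for xi, yi in zip(x, y)]
    sse = sum(e * e for e in resid)

    out = []
    for xi, e in zip(x, resid):
        h = 1.0 / n + (xi - xbar) ** 2 / sxx           # leverage h_ii
        # Deleted variance s_(i)^2 from the leave-one-out identity.
        s2_del = (sse - e * e / (1.0 - h)) / (n - p - 1)
        t = e / math.sqrt(s2_del * (1.0 - h))          # externally Studentized residual
        out.append(t * math.sqrt(h / (1.0 - h)))       # DFFITS_i
    return out
```

Because the leverage factor multiplies the Studentized residual, a point with leverage near 1 has its residual amplified heavily, which matches the widening distribution described above.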
For a perfectly balanced experimental design (such as a factorial design or balanced partial factorial design), the leverage for each point is ''p''/''n'', the number of parameters divided by the number of points. This means that the DFFITS values will be distributed (in the Gaussian case) as <math>\sqrt{\frac{p}{n-p}} \approx \sqrt{\frac{p}{n}}</math> times a ''t'' variate. Therefore, the authors suggest investigating those points with DFFITS greater than <math>2\sqrt{\frac{p}{n}}</math> in absolute value.
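As an illustration of the screening rule, the sketch below flags points whose DFFITS magnitude exceeds the suggested cutoff. The function names are hypothetical, and the DFFITS values are assumed to have been computed already.

```python
import math

def dffits_cutoff(p, n):
    """Suggested screening threshold 2*sqrt(p/n) for p parameters, n points."""
    return 2.0 * math.sqrt(p / n)

def flag_influential(dffits, p):
    """Indices of points whose |DFFITS| exceeds the cutoff."""
    cutoff = dffits_cutoff(p, len(dffits))
    return [i for i, d in enumerate(dffits) if abs(d) > cutoff]
```

With two parameters and 50 points the cutoff is 0.4, so the rule becomes stricter as the sample grows. The threshold is a prompt for closer inspection of a point rather than a formal test.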
Although the raw values resulting from the equations are different, Cook's distance and DFFITS are conceptually identical, and there is a closed-form formula to convert one value to the other.
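One way to write the conversion, derived here from the usual ordinary least squares definitions rather than quoted from the source, is <math>D_i = \text{DFFITS}_i^2 \, s_{(i)}^2 / (p \, s^2)</math>, where <math>s</math> is the standard error from the full fit and ''p'' is the number of parameters. A minimal sketch with hypothetical function names:

```python
import math

def cooks_distance_from_dffits(dffits, s_deleted, s_full, p):
    """Cook's distance D_i from DFFITS_i.

    s_deleted: standard error estimated without point i, s_(i)
    s_full:    standard error estimated from all points, s
    p:         number of regression parameters
    """
    return dffits ** 2 * s_deleted ** 2 / (p * s_full ** 2)

def dffits_from_cooks_distance(cooks_d, s_deleted, s_full, p, sign=1.0):
    """Inverse conversion; Cook's distance carries no sign, so the
    caller supplies the sign of the residual (assumed +1 by default)."""
    return sign * math.sqrt(cooks_d * p) * s_full / s_deleted
```

The conversion is not a fixed constant across points, because the deleted standard error differs for each observation.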
Development
Previously, when assessing a dataset before running a linear regression, the possibility of outliers would be assessed using histograms and scatterplots. Both methods of assessing data points were subjective, and there was little way of knowing how much leverage each potential outlier had on the results. This led to a variety of quantitative measures, including DFFIT and DFBETA.