In statistics, a regression diagnostic is one of a set of procedures available for regression analysis that seek to assess the validity of a model in any of a number of different ways. This assessment may be an exploration of the model's underlying statistical assumptions, an examination of the structure of the model by considering formulations that have fewer, more or different explanatory variables, or a study of subgroups of observations, looking for those that are either poorly represented by the model (outliers) or that have a relatively large effect on the regression model's predictions. A regression diagnostic may take the form of a graphical result, informal quantitative results or a formal statistical hypothesis test, each of which provides guidance for further stages of a regression analysis.
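The three forms of diagnostic output mentioned above can be illustrated with a minimal sketch, assuming Python with numpy and statsmodels installed; the simulated data and variable names are illustrative only and are not part of the original article.

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.stattools import jarque_bera

# Simulated data: one explanatory variable with a known linear relationship.
rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 1.0 + 2.0 * x + rng.normal(scale=0.5, size=200)

X = sm.add_constant(x)          # design matrix with an intercept column
res = sm.OLS(y, X).fit()        # ordinary least squares fit

# Graphical diagnostic: normal Q-Q plot of the residuals.
sm.qqplot(res.resid, line="45", fit=True)

# Informal quantitative diagnostics: residual summary figures.
print(res.resid.mean(), res.resid.std(), res.rsquared)

# Formal hypothesis test: Jarque-Bera test of residual normality.
jb_stat, jb_pvalue, skew, kurtosis = jarque_bera(res.resid)
print(jb_pvalue)                # a small p-value suggests non-normal errors
```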


Introduction

Regression diagnostics have often been developed, or were initially proposed, in the context of linear regression or, more particularly, ordinary least squares. This means that many formally defined diagnostics are available only for these contexts.


Assessing assumptions

;Distribution of model errors
* Normal probability plot
;Homoscedasticity
* Goldfeld–Quandt test
* Breusch–Pagan test
* Park test
* White test
;Correlation of model errors
* Breusch–Godfrey test
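Most of the checks listed above have ready-made implementations in statsmodels; the sketch below assumes the fitted results object `res`, design matrix `X` and response `y` from the earlier example, and omits the Park test, which has no built-in and would require an auxiliary regression of the log squared residuals.

```python
import statsmodels.api as sm
from statsmodels.stats.diagnostic import (
    het_breuschpagan, het_goldfeldquandt, het_white, acorr_breusch_godfrey)

# Distribution of model errors: normal probability (Q-Q) plot of residuals.
sm.qqplot(res.resid, line="45", fit=True)

# Homoscedasticity tests: each returns a test statistic and p-value;
# a small p-value points to heteroscedastic errors.
gq_stat, gq_pvalue, _ = het_goldfeldquandt(y, X)
bp_stat, bp_pvalue, bp_f, bp_f_pvalue = het_breuschpagan(res.resid, X)
w_stat, w_pvalue, w_f, w_f_pvalue = het_white(res.resid, X)

# Correlation of model errors: Breusch-Godfrey test for serial
# correlation up to the chosen lag order.
bg_stat, bg_pvalue, bg_f, bg_f_pvalue = acorr_breusch_godfrey(res, nlags=2)

print(gq_pvalue, bp_pvalue, w_pvalue, bg_pvalue)
```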


Assessing model structure

;Adequacy of existing explanatory variables
* Partial residual plot
* Ramsey RESET test
* F test, for use when there are replicated observations, so that a comparison can be made between the lack-of-fit sum of squares and the pure-error sum of squares, under the assumption that the model errors are homoscedastic and have a normal distribution
;Adding or dropping explanatory variables
* Partial regression plot
* Student's t test for testing inclusion of a single explanatory variable, or the F test for testing inclusion of a group of variables, both under the assumption that the model errors are homoscedastic and have a normal distribution
;Change of model structure between groups of observations
* Structural break test
** Chow test
;Comparing model structures
* PRESS statistic
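A sketch of some of these structure checks for the illustrative model above, again assuming statsmodels: the RESET helper `linear_reset` is only available in recent statsmodels versions, the nested-model F test uses `compare_f_test`, and the PRESS statistic is computed by hand from the hat-matrix diagonal because there is no single built-in for it.

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import linear_reset

# Ramsey RESET test: do powers of the fitted values add explanatory power?
reset_result = linear_reset(res, power=2, use_f=True)
print(reset_result.pvalue)

# F test for adding a group of variables: fit the restricted model
# (intercept only) and compare it with the full model.
res_restricted = sm.OLS(y, np.ones_like(y)).fit()
f_stat, f_pvalue, df_diff = res.compare_f_test(res_restricted)
print(f_pvalue)

# PRESS statistic (predicted residual sum of squares), using the
# leave-one-out identity e_i / (1 - h_i) for ordinary least squares.
h = res.get_influence().hat_matrix_diag
press = np.sum((res.resid / (1.0 - h)) ** 2)
print(press)
```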


Important groups of observations

;Outliers
;Influential observations
* Leverage, partial leverage
* DFFITS
* Cook's distance
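The influence measures listed above can be obtained from the same fitted results object; this is a sketch under the assumption that statsmodels is available, and the cut-offs used to flag observations are common rules of thumb rather than part of the article.

```python
import numpy as np

# Influence diagnostics come from the fitted OLS results object.
infl = res.get_influence()

leverage = infl.hat_matrix_diag          # diagonal of the hat matrix
dffits, dffits_threshold = infl.dffits   # DFFITS values and a suggested cut-off
cooks_d, cooks_pvalues = infl.cooks_distance

n, p = res.nobs, res.df_model + 1        # observations and fitted parameters

# Rules of thumb for flagging observations worth a closer look.
high_leverage = np.where(leverage > 2 * p / n)[0]
influential_dffits = np.where(np.abs(dffits) > dffits_threshold)[0]
influential_cooks = np.where(cooks_d > 4 / n)[0]

print(high_leverage, influential_dffits, influential_cooks)
```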

