In statistics, a regression diagnostic is one of a set of procedures available for regression analysis that seek to assess the validity of a model in any of a number of different ways. This assessment may be an exploration of the model's underlying statistical assumptions, an examination of the structure of the model by considering formulations that have fewer, more or different explanatory variables, or a study of subgroups of observations, looking for those that are either poorly represented by the model (outliers) or that have a relatively large effect on the regression model's predictions. A regression diagnostic may take the form of a graphical result, informal quantitative results or a formal statistical hypothesis test, each of which provides guidance for further stages of a regression analysis.
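The three forms of diagnostic output mentioned above can be illustrated with a minimal sketch, assuming Python with numpy and statsmodels installed; the simulated data and variable names are illustrative only and are not part of the original article.

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.stattools import jarque_bera

# Simulated data: one explanatory variable with a known linear relationship.
rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 1.0 + 2.0 * x + rng.normal(scale=0.5, size=200)

X = sm.add_constant(x)          # design matrix with an intercept column
res = sm.OLS(y, X).fit()        # ordinary least squares fit

# Graphical diagnostic: normal Q-Q plot of the residuals.
sm.qqplot(res.resid, line="45", fit=True)

# Informal quantitative diagnostics: residual summary figures.
print(res.resid.mean(), res.resid.std(), res.rsquared)

# Formal hypothesis test: Jarque-Bera test of residual normality.
jb_stat, jb_pvalue, skew, kurtosis = jarque_bera(res.resid)
print(jb_pvalue)                # a small p-value suggests non-normal errors
```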


Introduction

Regression diagnostics have often been developed, or were initially proposed, in the context of linear regression or, more particularly, ordinary least squares. This means that many formally defined diagnostics are available only for these contexts.


Assessing assumptions

;Distribution of model errors
* Normal probability plot
;Homoscedasticity
* Goldfeld–Quandt test
* Breusch–Pagan test
* Park test
* White test
;Correlation of model errors
* Breusch–Godfrey test
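Most of the checks listed above have ready-made implementations in statsmodels; the sketch below assumes the fitted results object `res`, design matrix `X` and response `y` from the earlier example, and omits the Park test, which has no built-in and would require an auxiliary regression of the log squared residuals.

```python
import statsmodels.api as sm
from statsmodels.stats.diagnostic import (
    het_breuschpagan, het_goldfeldquandt, het_white, acorr_breusch_godfrey)

# Distribution of model errors: normal probability (Q-Q) plot of residuals.
sm.qqplot(res.resid, line="45", fit=True)

# Homoscedasticity tests: each returns a test statistic and p-value;
# a small p-value points to heteroscedastic errors.
gq_stat, gq_pvalue, _ = het_goldfeldquandt(y, X)
bp_stat, bp_pvalue, bp_f, bp_f_pvalue = het_breuschpagan(res.resid, X)
w_stat, w_pvalue, w_f, w_f_pvalue = het_white(res.resid, X)

# Correlation of model errors: Breusch-Godfrey test for serial
# correlation up to the chosen lag order.
bg_stat, bg_pvalue, bg_f, bg_f_pvalue = acorr_breusch_godfrey(res, nlags=2)

print(gq_pvalue, bp_pvalue, w_pvalue, bg_pvalue)
```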


Assessing model structure

;Adequacy of existing explanatory variables
* Partial residual plot
* Ramsey RESET test
* F test, for use when there are replicated observations, so that a comparison can be made between the lack-of-fit sum of squares and the pure-error sum of squares, under the assumption that the model errors are homoscedastic and have a normal distribution
;Adding or dropping explanatory variables
* Partial regression plot
* Student's t test for testing inclusion of a single explanatory variable, or the F test for testing inclusion of a group of variables, both under the assumption that the model errors are homoscedastic and have a normal distribution
;Change of model structure between groups of observations
* Structural break test
** Chow test
;Comparing model structures
* PRESS statistic
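A sketch of some of these structure checks for the illustrative model above, again assuming statsmodels: the RESET helper `linear_reset` is only available in recent statsmodels versions, the nested-model F test uses `compare_f_test`, and the PRESS statistic is computed by hand from the hat-matrix diagonal because there is no single built-in for it.

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import linear_reset

# Ramsey RESET test: do powers of the fitted values add explanatory power?
reset_result = linear_reset(res, power=2, use_f=True)
print(reset_result.pvalue)

# F test for adding a group of variables: fit the restricted model
# (intercept only) and compare it with the full model.
res_restricted = sm.OLS(y, np.ones_like(y)).fit()
f_stat, f_pvalue, df_diff = res.compare_f_test(res_restricted)
print(f_pvalue)

# PRESS statistic (predicted residual sum of squares), using the
# leave-one-out identity e_i / (1 - h_i) for ordinary least squares.
h = res.get_influence().hat_matrix_diag
press = np.sum((res.resid / (1.0 - h)) ** 2)
print(press)
```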


Important groups of observations

;Outliers
;Influential observations
* Leverage, partial leverage
* DFFITS
* Cook's distance
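The influence measures listed above can be obtained from the same fitted results object; this is a sketch under the assumption that statsmodels is available, and the cut-offs used to flag observations are common rules of thumb rather than part of the article.

```python
import numpy as np

# Influence diagnostics come from the fitted OLS results object.
infl = res.get_influence()

leverage = infl.hat_matrix_diag          # diagonal of the hat matrix
dffits, dffits_threshold = infl.dffits   # DFFITS values and a suggested cut-off
cooks_d, cooks_pvalues = infl.cooks_distance

n, p = res.nobs, res.df_model + 1        # observations and fitted parameters

# Rules of thumb for flagging observations worth a closer look.
high_leverage = np.where(leverage > 2 * p / n)[0]
influential_dffits = np.where(np.abs(dffits) > dffits_threshold)[0]
influential_cooks = np.where(cooks_d > 4 / n)[0]

print(high_leverage, influential_dffits, influential_cooks)
```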

