In statistics, a studentized residual is the quotient obtained by dividing a residual by an estimate of its standard deviation. It is a form of a Student's ''t''-statistic, with the estimate of error varying between points.
This is an important technique in the detection of outliers. It is one of several techniques named in honor of William Sealey Gosset, who wrote under the pseudonym ''Student''. Dividing a statistic by a sample standard deviation is called studentizing, in analogy with standardizing and normalizing.
Motivation
The key reason for studentizing is that, in regression analysis of a multivariate distribution, the variances of the ''residuals'' at different input variable values may differ, even if the variances of the ''errors'' at these different input variable values are equal. The issue is the difference between errors and residuals in statistics, particularly the behavior of residuals in regressions.
Consider the simple linear regression model
: <math>Y = \alpha_0 + \alpha_1 X + \varepsilon.</math>
Given a random sample (<math>X_i</math>, <math>Y_i</math>), ''i'' = 1, ..., ''n'', each pair (<math>X_i</math>, <math>Y_i</math>) satisfies
: <math>Y_i = \alpha_0 + \alpha_1 X_i + \varepsilon_i,</math>
where the ''errors'' <math>\varepsilon_i</math> are independent and all have the same variance <math>\sigma^2</math>. The residuals are not the true errors, but ''estimates'', based on the observable data. When the method of least squares is used to estimate <math>\alpha_0</math> and <math>\alpha_1</math>, then the residuals <math>\widehat{\varepsilon}_i</math>, unlike the errors <math>\varepsilon_i</math>, cannot be independent since they satisfy the two constraints
: <math>\sum_{i=1}^n \widehat{\varepsilon}_i = 0</math>
and
: <math>\sum_{i=1}^n \widehat{\varepsilon}_i X_i = 0.</math>
(Here <math>\varepsilon_i</math> is the ''i''th error, and <math>\widehat{\varepsilon}_i</math> is the ''i''th residual.)
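The two constraints can be checked numerically. The following is a minimal sketch, assuming NumPy is available; the sample size, the coefficient values, and all variable names are illustrative assumptions rather than part of the model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: n points from Y = a0 + a1 * X + error.
n = 20
a0_true, a1_true, sigma = 1.0, 2.0, 0.5  # illustrative values only
x = rng.uniform(0.0, 10.0, size=n)
y = a0_true + a1_true * x + rng.normal(0.0, sigma, size=n)

# Least-squares estimates of the intercept and slope.
X = np.column_stack([np.ones(n), x])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)

# Residuals: observed values minus fitted values.
residuals = y - X @ coef

# The two constraints satisfied by least-squares residuals.
sum_resid = residuals.sum()
sum_resid_x = (residuals * x).sum()
print(sum_resid, sum_resid_x)  # both vanish up to floating-point rounding
```

Because the ''n'' residuals obey two linear constraints, any ''n'' − 2 of them determine the remaining two, which is why they cannot be independent.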
The residuals, unlike the errors, ''do not all have the same variance:'' the variance decreases as the corresponding ''x''-value gets farther from the average ''x''-value. This is not a feature of the data itself, but of the regression fitting values at the ends of the domain more closely. It is also reflected in the influence functions of various data points on the regression coefficients: endpoints have more influence. This can also be seen because the residuals at endpoints depend greatly on the slope of a fitted line, while the residuals at the middle are relatively insensitive to the slope. The fact that ''the variances of the residuals differ,'' even though ''the variances of the true errors are all equal'' to each other, is the ''principal reason'' for the need for studentization.
It is not simply a matter of the population parameters (mean and standard deviation) being unknown – it is that ''regressions'' yield ''different residual distributions'' at ''different data points,'' unlike ''point estimators'' of univariate distributions, which share a ''common distribution'' for residuals.
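The unequal residual variances can be illustrated by simulation. The sketch below is a Monte Carlo check under stated assumptions: NumPy is available, the design points are fixed and evenly spaced, and the coefficient values, error standard deviation, and number of simulations are hypothetical choices made only for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# Fixed, evenly spaced design points (an illustrative choice).
x = np.linspace(0.0, 10.0, 11)
n = x.size
X = np.column_stack([np.ones(n), x])

a0_true, a1_true, sigma = 1.0, 2.0, 1.0  # hypothetical values
n_sims = 20000

# Each row of `errors` is one simulated sample of equal-variance errors.
errors = rng.normal(0.0, sigma, size=(n_sims, n))
Y = a0_true + a1_true * x + errors  # shape (n_sims, n)

# Fit every simulated sample at once: lstsq accepts a 2-D right-hand side.
coef, *_ = np.linalg.lstsq(X, Y.T, rcond=None)  # shape (2, n_sims)
residuals = Y.T - X @ coef  # shape (n, n_sims)

# Empirical variance of the residual at each design point.
resid_var = residuals.var(axis=1)
for xi, v in zip(x, resid_var):
    print(f"x = {xi:4.1f}  residual variance ~ {v:.3f}")
```

Although every simulated error has the same variance, the printed residual variances should be smallest at the two ends of the ''x''-range and largest near the average ''x''-value.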
Background
For this simple model, the design matrix is
: <math>X = \begin{bmatrix} 1 & x_1 \\ \vdots & \vdots \\ 1 & x_n \end{bmatrix}</math>