Studentized residual

In statistics, a studentized residual is the dimensionless ratio resulting from the division of a residual by an estimate of its standard deviation, both expressed in the same units. It is a form of a Student's ''t''-statistic, with the estimate of error varying between points. This is an important technique in the detection of outliers. It is one of several concepts named in honor of William Sealy Gosset, who wrote under the pseudonym "Student" (e.g., Student's distribution). Dividing a statistic by a sample standard deviation is called ''studentizing'', in analogy with ''standardizing'' and ''normalizing''.


Motivation

The key reason for studentizing is that, in regression analysis of a multivariate distribution, the variances of the ''residuals'' at different input variable values may differ, even if the variances of the ''errors'' at these different input variable values are equal. The issue is the difference between errors and residuals in statistics, particularly the behavior of residuals in regressions.

Consider the simple linear regression model
: Y = \alpha_0 + \alpha_1 X + \varepsilon.
Given a random sample (X_i, Y_i), ''i'' = 1, ..., ''n'', each pair (X_i, Y_i) satisfies
: Y_i = \alpha_0 + \alpha_1 X_i + \varepsilon_i,
where the ''errors'' \varepsilon_i are independent and all have the same variance \sigma^2. The residuals are not the true errors, but ''estimates'', based on the observable data. When the method of least squares is used to estimate \alpha_0 and \alpha_1, then the residuals \widehat{\varepsilon}, unlike the errors \varepsilon, cannot be independent since they satisfy the two constraints
: \sum_{i=1}^n \widehat{\varepsilon}_i = 0
and
: \sum_{i=1}^n \widehat{\varepsilon}_i x_i = 0.
(Here \varepsilon_i is the ''i''th error, and \widehat{\varepsilon}_i is the ''i''th residual.)

The residuals, unlike the errors, ''do not all have the same variance:'' the variance decreases as the corresponding ''x''-value gets farther from the average ''x''-value. This is not a feature of the data itself, but of the regression better fitting values at the ends of the domain. It is also reflected in the influence functions of various data points on the regression coefficients: endpoints have more influence. This can also be seen because the residuals at endpoints depend greatly on the slope of a fitted line, while the residuals at the middle are relatively insensitive to the slope.

The fact that ''the variances of the residuals differ,'' even though ''the variances of the true errors are all equal'' to each other, is the ''principal reason'' for the need for studentization. It is not simply a matter of the population parameters (mean and standard deviation) being unknown; it is that ''regressions'' yield ''different residual distributions'' at ''different data points,'' unlike ''point estimators'' of univariate distributions, which share a ''common distribution'' for residuals.
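
The two constraints and the unequal residual variances can be checked numerically. The following is a minimal sketch in plain Python, assuming a small made-up data set; the variable names are illustrative, and the variance factor uses the standard expression for simple linear regression, 1 − 1/n − (x_i − x̄)² / Σ_j (x_j − x̄)².

```python
# Illustrative data (made up for this sketch); the last x-value lies far from the mean.
x = [1.0, 2.0, 3.0, 4.0, 10.0]
y = [1.2, 1.9, 3.2, 3.8, 10.5]
n = len(x)

x_bar = sum(x) / n
y_bar = sum(y) / n
s_xx = sum((xi - x_bar) ** 2 for xi in x)
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

# Least-squares estimates of alpha_1 (slope) and alpha_0 (intercept).
alpha1_hat = s_xy / s_xx
alpha0_hat = y_bar - alpha1_hat * x_bar

residuals = [yi - (alpha0_hat + alpha1_hat * xi) for xi, yi in zip(x, y)]

# The two constraints that prevent the residuals from being independent.
sum_resid = sum(residuals)
sum_resid_x = sum(e * xi for e, xi in zip(residuals, x))

# Var(residual_i) / sigma^2 = 1 - 1/n - (x_i - x_bar)^2 / s_xx
variance_factor = [1 - 1 / n - (xi - x_bar) ** 2 / s_xx for xi in x]

print(sum_resid, sum_resid_x)  # both zero up to rounding error
print(variance_factor)         # smallest for the x-value farthest from x_bar
```

Both sums vanish up to floating-point rounding, and the variance factor is smallest at the point farthest from the average ''x''-value, even though the errors are assumed to share one variance.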


Background

For this simple model, the design matrix is
: X = \left[\begin{matrix} 1 & x_1 \\ \vdots & \vdots \\ 1 & x_n \end{matrix}\right]
and the hat matrix ''H'' is the matrix of the orthogonal projection onto the column space of the design matrix:
: H = X(X^T X)^{-1}X^T.
The leverage h_{ii} is the ''i''th diagonal entry in the hat matrix. The variance of the ''i''th residual is
: \operatorname{var}(\widehat{\varepsilon}_i) = \sigma^2(1 - h_{ii}).
In case the design matrix ''X'' has only two columns (as in the example above), this is equal to
: \operatorname{var}(\widehat{\varepsilon}_i) = \sigma^2\left( 1 - \frac1n - \frac{(x_i - \bar x)^2}{\sum_{j=1}^n (x_j - \bar x)^2} \right).
In the case of an arithmetic mean, the design matrix ''X'' has only one column (a vector of ones), and this is simply:
: \operatorname{var}(\widehat{\varepsilon}_i) = \sigma^2\left( 1 - \frac1n \right).
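
As an illustration, the leverages can be read off the diagonal of the hat matrix and compared with the closed form for the two-column case. This is a sketch assuming NumPy is available and using made-up inputs; it is not tied to any particular statistics package.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 10.0])  # illustrative inputs
n = x.size

# Design matrix with an intercept column and one regressor column.
X = np.column_stack([np.ones(n), x])

# Hat matrix H = X (X^T X)^{-1} X^T; solve() avoids forming the inverse explicitly.
H = X @ np.linalg.solve(X.T @ X, X.T)
leverage = np.diag(H)

# Closed form for the two-column case:
# h_ii = 1/n + (x_i - x_bar)^2 / sum_j (x_j - x_bar)^2
x_bar = x.mean()
leverage_closed_form = 1.0 / n + (x - x_bar) ** 2 / np.sum((x - x_bar) ** 2)

# Intercept-only model (arithmetic mean): every leverage equals 1/n.
ones = np.ones((n, 1))
H_mean = ones @ np.linalg.solve(ones.T @ ones, ones.T)

print(leverage)
print(leverage_closed_form)
print(np.diag(H_mean))
```

Multiplying 1 − h_{ii} by \sigma^2 then gives the residual variances stated above.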


Calculation

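Given the definitions above, the Studentized residual is then
: t_i = {\widehat{\varepsilon}_i \over \widehat{\sigma} \sqrt{1 - h_{ii}}}
where h_{ii} is the leverage, and \widehat{\sigma} is an appropriate estimate of ''σ'' (see below). In the case of a mean, this is equal to:
: t_i = {\widehat{\varepsilon}_i \over \widehat{\sigma} \sqrt{(n-1)/n}}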
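
For the case of a mean, the calculation reduces to a few lines. The sketch below assumes made-up sample values and takes \widehat{\sigma} to be the usual estimate with ''n'' − 1 degrees of freedom (the internal estimate); both choices are assumptions for illustration.

```python
import math

sample = [4.1, 5.0, 5.2, 4.8, 9.9]  # illustrative values
n = len(sample)
mean = sum(sample) / n
residuals = [v - mean for v in sample]

# sigma_hat with n - m degrees of freedom, where m = 1 (only the mean is estimated).
sigma_hat = math.sqrt(sum(e ** 2 for e in residuals) / (n - 1))

# For a mean, h_ii = 1/n, so sqrt(1 - h_ii) = sqrt((n - 1) / n).
t = [e / (sigma_hat * math.sqrt((n - 1) / n)) for e in residuals]
print(t)
```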


Internal and external studentization

The usual estimate of \sigma^2 is
: \widehat{\sigma}^2 = {1 \over n-m} \sum_{j=1}^n \widehat{\varepsilon}_j^{\,2},
where ''m'' is the number of parameters in the model (2 in our example). But if the ''i''th case is suspected of being improbably large, then it would also not be normally distributed. Hence it is prudent to exclude the ''i''th observation from the process of estimating the variance when one is considering whether the ''i''th case may be an outlier, and instead use
: \widehat{\sigma}_{(i)}^2 = {1 \over n-m-1} \sum_{j=1,\, j \ne i}^n \widehat{\varepsilon}_j^{\,*2},
based on all the residuals ''except'' the suspect ''i''th residual. Here the asterisk emphasizes that the \widehat{\varepsilon}_j^{\,*} (j \ne i) for suspect ''i'' are computed with the ''i''th case excluded.

If the estimate \widehat{\sigma}^2 ''includes'' the ''i''th case, then the result is called the ''internally studentized'' residual, t_i (also known as the ''standardized residual''; see "Regression Deletion Diagnostics" in the R documentation). If the estimate \widehat{\sigma}_{(i)}^2 is used instead, ''excluding'' the ''i''th case, then it is called the ''externally studentized'' residual, t_{i(i)}.
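
The difference between the two estimates can be sketched as follows. The code assumes NumPy and made-up data, and the function name `studentized_residuals` is hypothetical. The external variance estimate is obtained by refitting the model with the ''i''th case left out, which follows the definition directly and is not meant to be efficient.

```python
import numpy as np

def studentized_residuals(X, y):
    """Return (internal, external) studentized residuals for a least-squares fit."""
    n, m = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    h = np.diag(X @ np.linalg.solve(X.T @ X, X.T))  # leverages h_ii

    # Internal: sigma^2 estimated from all n residuals.
    sigma2 = resid @ resid / (n - m)
    internal = resid / np.sqrt(sigma2 * (1.0 - h))

    # External: refit without case i, estimate sigma^2 from the other n - 1 cases.
    external = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i
        beta_i, *_ = np.linalg.lstsq(X[keep], y[keep], rcond=None)
        resid_star = y[keep] - X[keep] @ beta_i
        sigma2_i = resid_star @ resid_star / (n - m - 1)
        external[i] = resid[i] / np.sqrt(sigma2_i * (1.0 - h[i]))
    return internal, external

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([1.1, 2.0, 2.9, 4.2, 9.0, 6.1])  # the fifth value is a made-up outlier
X = np.column_stack([np.ones_like(x), x])
internal, external = studentized_residuals(X, y)
print(internal)
print(external)
```

In this example the suspect case inflates the internal variance estimate, so its internally studentized residual is much smaller in magnitude than its externally studentized one.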


Distribution

If the errors are independent and normally distributed with expected value 0 and variance \sigma^2, then the probability distribution of the ''i''th externally studentized residual t_{i(i)} is a Student's t-distribution with ''n'' − ''m'' − 1 degrees of freedom, and can range from -\infty to +\infty.

On the other hand, the internally studentized residuals are in the range 0 \pm \sqrt{\nu}, where ''ν'' = ''n'' − ''m'' is the number of residual degrees of freedom. If t_i represents the internally studentized residual, and again assuming that the errors are independent identically distributed Gaussian variables, then (Allen J. Pope (1976), "The statistics of residuals and the detection of outliers", U.S. Dept. of Commerce, National Oceanic and Atmospheric Administration, National Ocean Survey, Geodetic Research and Development Laboratory, 136 pages, eq. (6))
: t_i \sim \sqrt{\nu} \, {t \over \sqrt{t^2 + \nu - 1}}
where ''t'' is a random variable distributed as Student's t-distribution with ''ν'' − 1 degrees of freedom. In fact, this implies that t_i^2 / \nu follows the beta distribution ''B''(1/2, (''ν'' − 1)/2). The distribution above is sometimes referred to as the tau distribution; it was first derived by Thompson in 1935.

When ''ν'' = 3, the internally studentized residuals are uniformly distributed between -\sqrt{3} and +\sqrt{3}. If there is only one residual degree of freedom, the above formula for the distribution of internally studentized residuals doesn't apply. In this case, the t_i are all either +1 or −1, with 50% chance for each.

The standard deviation of the distribution of internally studentized residuals is always 1, but this does not imply that the standard deviation of all the t_i of a particular experiment is 1. For instance, the internally studentized residuals when fitting a straight line going through (0, 0) to the points (1, 4), (2, −1), (2, −1) are \sqrt{2},\ -\sqrt{5}/5,\ -\sqrt{5}/5, and the standard deviation of these is not 1.

Note that any pair of studentized residuals t_i and t_j (where i \neq j) are NOT i.i.d. They have the same distribution, but are not independent, because the residuals are constrained to sum to 0 and to be orthogonal to the design matrix.
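
The worked example with the line through the origin can be reproduced directly. The sketch below uses plain Python and assumes the model y = βx with a single parameter (''m'' = 1), so that the leverage is x_i² / Σ_j x_j²; the variable names are illustrative.

```python
import math

# Points from the example; the model is y = beta * x (a line through the origin), so m = 1.
x = [1.0, 2.0, 2.0]
y = [4.0, -1.0, -1.0]
n, m = len(x), 1
nu = n - m  # residual degrees of freedom

s_xx = sum(xi ** 2 for xi in x)
beta_hat = sum(xi * yi for xi, yi in zip(x, y)) / s_xx
residuals = [yi - beta_hat * xi for xi, yi in zip(x, y)]
leverages = [xi ** 2 / s_xx for xi in x]  # h_ii for a single-column design

sigma_hat = math.sqrt(sum(e ** 2 for e in residuals) / nu)
t = [e / (sigma_hat * math.sqrt(1 - h)) for e, h in zip(residuals, leverages)]

mean_t = sum(t) / n
sd_t = math.sqrt(sum((ti - mean_t) ** 2 for ti in t) / (n - 1))
print(t)     # compare with sqrt(2), -sqrt(5)/5, -sqrt(5)/5
print(sd_t)  # sample standard deviation of the three values, which differs from 1
```

All three values also stay within the bound 0 \pm \sqrt{\nu} stated above, here with ''ν'' = 2.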


Software implementations

Many programs and statistics packages, such as R and Python, include implementations of the studentized residual.


See also

* Cook's distance – a measure of changes in regression coefficients when an observation is deleted
* Grubbs's test
* Normalization (statistics)
* Samuelson's inequality
* Standard score
* William Sealy Gosset

