The partition of sums of squares is a concept that permeates much of inferential statistics and descriptive statistics. More properly, it is the partitioning of sums of squared deviations or errors. Mathematically, the sum of squared deviations is an unscaled, or unadjusted, measure of dispersion (also called variability). When scaled for the number of degrees of freedom, it estimates the variance, or spread of the observations about their mean value. Partitioning of the sum of squared deviations into various components allows the overall variability in a dataset to be ascribed to different types or sources of variability, with the relative importance of each being quantified by the size of each component of the overall sum of squares.


Background

The distance from any point in a collection of data to the mean of the data is the deviation. This can be written as y_i - \overline{y}, where y_i is the ith data point and \overline{y} is the estimate of the mean. If all such deviations are squared, then summed, as in \sum_{i=1}^n\left(y_i-\overline{y}\,\right)^2, this gives the "sum of squares" for these data.

When more data are added to the collection, the sum of squares will increase, except in unlikely cases such as the new data being equal to the mean. So usually the sum of squares will grow with the size of the data collection. That is a manifestation of the fact that it is unscaled.

In many cases, the number of degrees of freedom is simply the number of data points in the collection, minus one. We write this as ''n'' − 1, where ''n'' is the number of data points.

Scaling (also known as normalizing) means adjusting the sum of squares so that it does not grow as the size of the data collection grows. This is important when we want to compare samples of different sizes, such as a sample of 100 people compared to a sample of 20 people. If the sum of squares were not normalized, its value would always be larger for the sample of 100 people than for the sample of 20 people. To scale the sum of squares, we divide it by the degrees of freedom, i.e., calculate the sum of squares per degree of freedom, or variance. Standard deviation, in turn, is the square root of the variance.

The above describes how the sum of squares is used in descriptive statistics; see the article on total sum of squares for an application of this broad principle to inferential statistics.
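To make the scaling concrete, here is a minimal Python sketch (the sample values are invented for illustration, not taken from the article) that computes the sum of squared deviations, divides by the ''n'' − 1 degrees of freedom to obtain the sample variance, and takes its square root to obtain the standard deviation.

```python
import numpy as np

y = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])  # illustrative data

n = len(y)
y_bar = y.mean()

# Unscaled measure of dispersion: the sum of squared deviations from the mean.
sum_of_squares = np.sum((y - y_bar) ** 2)

# Scaling by the degrees of freedom (n - 1) gives the sample variance.
variance = sum_of_squares / (n - 1)

# The standard deviation is the square root of the variance.
std_dev = np.sqrt(variance)

print(sum_of_squares, variance, std_dev)
```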


Partitioning the sum of squares in linear regression

Theorem. Given a linear regression model y_i = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip} + \varepsilon_i ''including a constant'' \beta_0, based on a sample (y_i, x_{i1}, \ldots, x_{ip}), \, i = 1, \ldots, n containing ''n'' observations, the total sum of squares \mathrm{TSS} = \sum_{i=1}^n (y_i - \bar{y})^2 can be partitioned as follows into the explained sum of squares (ESS) and the residual sum of squares (RSS):

:\mathrm{TSS} = \mathrm{ESS} + \mathrm{RSS},

where this equation is equivalent to each of the following forms:

:\begin{align}
\left\| y - \bar{y} \mathbf{1} \right\|^2 &= \left\| \hat{y} - \bar{y} \mathbf{1} \right\|^2 + \left\| \hat{\varepsilon} \right\|^2, \quad \mathbf{1} = (1, 1, \ldots, 1)^T, \\
\sum_{i=1}^n (y_i - \bar{y})^2 &= \sum_{i=1}^n (\hat{y}_i - \bar{y})^2 + \sum_{i=1}^n (y_i - \hat{y}_i)^2, \\
\sum_{i=1}^n (y_i - \bar{y})^2 &= \sum_{i=1}^n (\hat{y}_i - \bar{y})^2 + \sum_{i=1}^n \hat{\varepsilon}_i^2,
\end{align}

where \hat{y}_i is the value estimated by the regression line having \hat{\beta}_0, \hat{\beta}_1, \ldots, \hat{\beta}_p as the estimated coefficients.
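As a numerical sanity check of the theorem (not part of the original derivation), the following Python sketch fits a small synthetic regression whose design matrix includes a column of ones and verifies that TSS equals ESS + RSS up to floating-point rounding; the sample size, coefficients, and noise level are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])  # constant column included
beta_true = np.array([1.0, 2.0, -0.5])
y = X @ beta_true + rng.normal(scale=0.3, size=n)

# Ordinary least squares fit and fitted values.
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ beta_hat

tss = np.sum((y - y.mean()) ** 2)      # total sum of squares
ess = np.sum((y_hat - y.mean()) ** 2)  # explained sum of squares
rss = np.sum((y - y_hat) ** 2)         # residual sum of squares

print(tss, ess + rss)                  # the two values agree up to rounding
assert np.isclose(tss, ess + rss)
```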


Proof

:\begin{align}
\sum_{i=1}^n (y_i - \overline{y})^2 &= \sum_{i=1}^n (y_i - \overline{y} + \hat{y}_i - \hat{y}_i)^2 = \sum_{i=1}^n \left((\hat{y}_i - \bar{y}) + \underbrace{(y_i - \hat{y}_i)}_{\hat{\varepsilon}_i}\right)^2 \\
&= \sum_{i=1}^n \left((\hat{y}_i - \bar{y})^2 + 2 \hat{\varepsilon}_i (\hat{y}_i - \bar{y}) + \hat{\varepsilon}_i^2\right) \\
&= \sum_{i=1}^n (\hat{y}_i - \bar{y})^2 + \sum_{i=1}^n \hat{\varepsilon}_i^2 + 2 \sum_{i=1}^n \hat{\varepsilon}_i (\hat{y}_i - \bar{y}) \\
&= \sum_{i=1}^n (\hat{y}_i - \bar{y})^2 + \sum_{i=1}^n \hat{\varepsilon}_i^2 + 2 \sum_{i=1}^n \hat{\varepsilon}_i (\hat{\beta}_0 + \hat{\beta}_1 x_{i1} + \cdots + \hat{\beta}_p x_{ip} - \overline{y}) \\
&= \sum_{i=1}^n (\hat{y}_i - \bar{y})^2 + \sum_{i=1}^n \hat{\varepsilon}_i^2 + 2 (\hat{\beta}_0 - \overline{y}) \underbrace{\sum_{i=1}^n \hat{\varepsilon}_i}_{0} + 2 \hat{\beta}_1 \underbrace{\sum_{i=1}^n \hat{\varepsilon}_i x_{i1}}_{0} + \cdots + 2 \hat{\beta}_p \underbrace{\sum_{i=1}^n \hat{\varepsilon}_i x_{ip}}_{0} \\
&= \sum_{i=1}^n (\hat{y}_i - \bar{y})^2 + \sum_{i=1}^n \hat{\varepsilon}_i^2 = \mathrm{ESS} + \mathrm{RSS}
\end{align}

The requirement that the model include a constant, or equivalently that the design matrix contain a column of ones, ensures that \sum_{i=1}^n \hat{\varepsilon}_i = 0, i.e. \hat{\varepsilon}^T\mathbf{1} = 0.

The proof can also be expressed in vector form, as follows:

:\begin{align}
SS_\text{total} = \Vert \mathbf{y} - \bar{y}\mathbf{1} \Vert^2 &= \Vert \mathbf{y} - \bar{y}\mathbf{1} + \hat{\mathbf{y}} - \hat{\mathbf{y}} \Vert^2, \\
&= \Vert \left( \hat{\mathbf{y}} - \bar{y}\mathbf{1} \right) + \left( \mathbf{y} - \hat{\mathbf{y}} \right) \Vert^2, \\
&= \Vert \hat{\mathbf{y}} - \bar{y}\mathbf{1} \Vert^2 + \Vert \hat{\varepsilon} \Vert^2 + 2 \hat{\varepsilon}^T \left( \hat{\mathbf{y}} - \bar{y}\mathbf{1} \right), \\
&= SS_\text{regression} + SS_\text{error} + 2 \hat{\varepsilon}^T \left( X\hat{\beta} - \bar{y}\mathbf{1} \right), \\
&= SS_\text{regression} + SS_\text{error} + 2 \left( \hat{\varepsilon}^T X \right)\hat{\beta} - 2\bar{y}\underbrace{\hat{\varepsilon}^T \mathbf{1}}_{0}, \\
&= SS_\text{regression} + SS_\text{error}.
\end{align}

The elimination of terms in the last line used the fact that

: \hat{\varepsilon}^T X = \left( \mathbf{y} - \hat{\mathbf{y}} \right)^T X = \mathbf{y}^T (I - X(X^T X)^{-1} X^T)^T X = \mathbf{y}^T (X^T - X^T)^T = \mathbf{0}.
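The step that makes the cross terms vanish, namely that the least-squares residual vector is orthogonal to every column of the design matrix (and in particular \hat{\varepsilon}^T\mathbf{1} = 0 when a constant is included), can also be checked numerically. A rough, self-contained Python sketch with invented data:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 40
X = np.column_stack([np.ones(n), rng.normal(size=n), rng.normal(size=n)])
y = X @ np.array([0.5, 1.0, -2.0]) + rng.normal(scale=0.2, size=n)

beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
residuals = y - X @ beta_hat

# Residuals are orthogonal to each column of X: eps_hat^T X = 0.
print(residuals @ X)        # ~ [0, 0, 0] up to rounding
# With a constant column this includes eps_hat^T 1 = 0, so the residuals sum to zero.
print(residuals.sum())      # ~ 0
assert np.allclose(residuals @ X, 0.0, atol=1e-8)
```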


Further partitioning

Note that the residual sum of squares can be further partitioned as the lack-of-fit sum of squares plus the sum of squares due to pure error.
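Where the data contain replicate observations at the same predictor values, this further split can be computed directly: the pure-error part is the variation of the replicates about their group means, and the lack-of-fit part is the remainder of the residual sum of squares. A minimal sketch, assuming a straight-line fit and made-up replicated data:

```python
import numpy as np

# Replicated observations at a few x values (illustrative numbers).
x = np.array([1.0, 1.0, 2.0, 2.0, 3.0, 3.0, 4.0, 4.0])
y = np.array([1.1, 1.3, 2.0, 2.4, 2.8, 3.5, 3.9, 4.4])

X = np.column_stack([np.ones_like(x), x])      # straight-line model with intercept
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
rss = np.sum((y - X @ beta_hat) ** 2)          # residual sum of squares

# Pure-error SS: squared deviations of replicates about their own group mean.
pure_error = sum(np.sum((y[x == level] - y[x == level].mean()) ** 2)
                 for level in np.unique(x))

lack_of_fit = rss - pure_error                 # remainder attributed to lack of fit
print(rss, pure_error, lack_of_fit)
```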


See also

* Inner-product space
** Hilbert space
*** Euclidean space
* Expected mean squares
** Orthogonality
** Orthonormal basis
*** Orthogonal complement, the closed subspace orthogonal to a set (especially a subspace)
*** Orthomodular lattice of the subspaces of an inner-product space
*** Orthogonal projection
** Pythagorean theorem that the sum of the squared norms of orthogonal summands equals the squared norm of the sum
* Least squares
* Mean squared error
* Squared deviations

