Lack-of-fit sum of squares
In statistics, a sum of squares due to lack of fit, or more tersely a lack-of-fit sum of squares, is one of the components of a partition of the sum of squares of residuals in an analysis of variance, used in the numerator in an F-test of the null hypothesis that a proposed model fits well. The other component is the pure-error sum of squares.

The pure-error sum of squares is the sum of squared deviations of each value of the dependent variable from the average value over all observations sharing its independent variable value(s). These are errors that could never be avoided by any predictive equation that assigns a predicted value for the dependent variable as a function of the value(s) of the independent variable(s). The remainder of the residual sum of squares is attributed to lack of fit of the model, since it would be mathematically possible to eliminate these errors entirely.
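For instance (a purely hypothetical illustration): if the independent variable value ''x'' = 2 is observed twice, with responses ''y'' = 5 and ''y'' = 7, the local average is 6, so these two observations contribute

: (5 - 6)^2 + (7 - 6)^2 = 2

to the pure-error sum of squares, no matter what predictive equation is used.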


Principle

In order for the lack-of-fit sum of squares to differ from the sum of squares of residuals, there must be more than one value of the response variable for at least one of the values of the set of predictor variables. For example, consider fitting a line

: y = \alpha x + \beta \,

by the method of least squares. One takes as estimates of ''α'' and ''β'' the values that minimize the sum of squares of residuals, i.e., the sum of squares of the differences between the observed ''y''-value and the fitted ''y''-value. To have a lack-of-fit sum of squares that differs from the residual sum of squares, one must observe more than one ''y''-value for each of one or more of the ''x''-values. One then partitions the "sum of squares due to error", i.e., the sum of squares of residuals, into two components:

: sum of squares due to error = (sum of squares due to "pure" error) + (sum of squares due to lack of fit).

The sum of squares due to "pure" error is the sum of squares of the differences between each observed ''y''-value and the average of all ''y''-values corresponding to the same ''x''-value.

The sum of squares due to lack of fit is the ''weighted'' sum of squares of differences between each average of ''y''-values corresponding to the same ''x''-value and the corresponding fitted ''y''-value, the weight in each case being simply the number of observed ''y''-values for that ''x''-value. Because it is a property of least-squares regression that the vector whose components are "pure errors" and the vector of lack-of-fit components are orthogonal to each other, the following equality holds:

: \begin{align}
&\sum (\text{observed value} - \text{fitted value})^2 && \text{(error)} \\
&\qquad = \sum (\text{observed value} - \text{local average})^2 && \text{(pure error)} \\
&\qquad\qquad {} + \sum \text{weight} \times (\text{local average} - \text{fitted value})^2 && \text{(lack of fit)}
\end{align}

Hence the residual sum of squares has been completely decomposed into two components.
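As a minimal numerical sketch (using a small made-up data set, with NumPy's polyfit standing in for the least-squares fit), the following Python code computes the residual, pure-error and lack-of-fit sums of squares and confirms that the first equals the sum of the other two:

<syntaxhighlight lang="python">
import numpy as np

# Made-up data: repeated observations of y at each x value (values are arbitrary)
x = np.array([1.0, 1.0, 2.0, 2.0, 2.0, 3.0, 3.0, 4.0, 4.0, 4.0])
y = np.array([1.2, 1.9, 2.8, 3.1, 2.5, 4.2, 3.9, 5.4, 4.8, 5.1])

# Least-squares fit of y = alpha*x + beta (polyfit returns slope, intercept)
alpha, beta = np.polyfit(x, y, 1)
fitted = alpha * x + beta

# Sum of squares due to error (residual sum of squares)
ss_error = np.sum((y - fitted) ** 2)

# Pure-error SS: deviations of each y from the local average at the same x
# Lack-of-fit SS: squared gaps between local averages and fitted values,
# each weighted by the number of y-values at that x
ss_pure = 0.0
ss_lack = 0.0
for xi in np.unique(x):
    group = y[x == xi]
    local_avg = group.mean()
    ss_pure += np.sum((group - local_avg) ** 2)
    ss_lack += len(group) * (local_avg - (alpha * xi + beta)) ** 2

print(ss_error)              # equals ss_pure + ss_lack up to rounding
print(ss_pure + ss_lack)
</syntaxhighlight>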


Mathematical details

Consider fitting a line with one predictor variable. Define ''i'' as an index of each of the ''n'' distinct ''x'' values, ''j'' as an index of the response variable observations for a given ''x'' value, and ''n''''i'' as the number of ''y'' values associated with the ''i''th ''x'' value. The value of each response variable observation can be represented by

: Y_{ij} = \alpha x_i + \beta + \varepsilon_{ij}, \qquad i = 1, \dots, n, \quad j = 1, \dots, n_i.

Let

: \widehat\alpha, \widehat\beta \,

be the least squares estimates of the unobservable parameters ''α'' and ''β'' based on the observed values of ''x''''i'' and ''Y''''ij''.

Let

: \widehat Y_i = \widehat\alpha x_i + \widehat\beta \,

be the fitted values of the response variable. Then

: \widehat\varepsilon_{ij} = Y_{ij} - \widehat Y_i \,

are the residuals, which are observable estimates of the unobservable values of the error term ''ε''''ij''. Because of the nature of the method of least squares, the whole vector of residuals, with

: N = \sum_{i=1}^n n_i

scalar components, necessarily satisfies the two constraints

: \sum_{i=1}^n \sum_{j=1}^{n_i} \widehat\varepsilon_{ij} = 0 \,

: \sum_{i=1}^n \left( x_i \sum_{j=1}^{n_i} \widehat\varepsilon_{ij} \right) = 0. \,

It is thus constrained to lie in an (''N'' − 2)-dimensional subspace of R''N'', i.e. there are ''N'' − 2 "degrees of freedom for error".

Now let

: \overline{Y}_{i\bullet} = \frac{1}{n_i} \sum_{j=1}^{n_i} Y_{ij}

be the average of all ''Y''-values associated with the ''i''th ''x''-value.

We partition the sum of squares due to error into two components:

: \begin{align}
& \sum_{i=1}^n \sum_{j=1}^{n_i} \widehat\varepsilon_{ij}^{\,2} = \sum_{i=1}^n \sum_{j=1}^{n_i} \left( Y_{ij} - \widehat Y_i \right)^2 \\
& = \underbrace{\sum_{i=1}^n \sum_{j=1}^{n_i} \left( Y_{ij} - \overline{Y}_{i\bullet} \right)^2}_{\text{(sum of squares due to pure error)}}
  + \underbrace{\sum_{i=1}^n n_i \left( \overline{Y}_{i\bullet} - \widehat Y_i \right)^2}_{\text{(sum of squares due to lack of fit)}}
\end{align}


Probability distributions


Sums of squares

Suppose the error terms ''ε''''ij'' are independent and normally distributed with expected value 0 and variance ''σ''2. We treat ''x''''i'' as constant rather than random. Then the response variables ''Y''''ij'' are random only because the errors ''ε''''ij'' are random.

It can be shown to follow that if the straight-line model is correct, then the sum of squares due to error divided by the error variance,

: \frac{1}{\sigma^2} \sum_{i=1}^n \sum_{j=1}^{n_i} \widehat\varepsilon_{ij}^{\,2}

has a chi-squared distribution with ''N'' − 2 degrees of freedom.

Moreover, given the total number of observations ''N'', the number of levels of the independent variable ''n'', and the number of parameters in the model ''p'':

* The sum of squares due to pure error, divided by the error variance ''σ''2, has a chi-squared distribution with ''N'' − ''n'' degrees of freedom;
* The sum of squares due to lack of fit, divided by the error variance ''σ''2, has a chi-squared distribution with ''n'' − ''p'' degrees of freedom (here ''p'' = 2, as there are two parameters in the straight-line model);
* The two sums of squares are probabilistically independent.
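As a rough simulation check (hypothetical design and parameter values), repeatedly generating data from a true straight line and scaling the two sums of squares by ''σ''2 should give sample means close to the chi-squared degrees of freedom ''N'' − ''n'' and ''n'' − ''p'' stated above:

<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical design: n = 4 levels of x, N = 10 observations, p = 2 parameters
x = np.array([1.0, 1.0, 2.0, 2.0, 2.0, 3.0, 3.0, 4.0, 4.0, 4.0])
n, N, p = len(np.unique(x)), len(x), 2
alpha_true, beta_true, sigma = 1.5, 0.5, 0.8        # arbitrary true values

pure_scaled, lack_scaled = [], []
for _ in range(20000):
    # Data generated from the correct straight-line model
    y = alpha_true * x + beta_true + rng.normal(0.0, sigma, size=N)
    a, b = np.polyfit(x, y, 1)
    ss_pure = ss_lack = 0.0
    for xi in np.unique(x):
        group = y[x == xi]
        local_avg = group.mean()
        ss_pure += np.sum((group - local_avg) ** 2)
        ss_lack += len(group) * (local_avg - (a * xi + b)) ** 2
    pure_scaled.append(ss_pure / sigma**2)
    lack_scaled.append(ss_lack / sigma**2)

# A chi-squared variable's mean equals its degrees of freedom
print(np.mean(pure_scaled), "vs", N - n)            # expect about N - n = 6
print(np.mean(lack_scaled), "vs", n - p)            # expect about n - p = 2
</syntaxhighlight>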


The test statistic

It then follows that the statistic

: \begin{align}
F & = \frac{\text{lack-of-fit sum of squares} / (n - p)}{\text{pure-error sum of squares} / (N - n)} \\[6pt]
  & = \frac{N - n}{n - p} \cdot \frac{\text{lack-of-fit sum of squares}}{\text{pure-error sum of squares}}
\end{align}

has an F-distribution with the corresponding number of degrees of freedom in the numerator and the denominator, provided that the model is correct. If the model is wrong, then the probability distribution of the denominator is still as stated above, and the numerator and denominator are still independent. But the numerator then has a noncentral chi-squared distribution, and consequently the quotient as a whole has a non-central F-distribution.

One uses this F-statistic to test the null hypothesis that the linear model is correct. Since the non-central F-distribution is stochastically larger than the (central) F-distribution, one rejects the null hypothesis if the F-statistic is larger than the critical F value. The critical value corresponds to the cumulative distribution function of the F distribution with ''x'' equal to the desired confidence level, and degrees of freedom ''d''1 = (''n'' − ''p'') and ''d''2 = (''N'' − ''n'').

The assumptions of normal distribution of errors and independence can be shown to entail that this lack-of-fit test is the likelihood-ratio test of this null hypothesis.
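A minimal end-to-end sketch of the test, assuming the same kind of made-up data as above and using scipy.stats for the F distribution:

<syntaxhighlight lang="python">
import numpy as np
from scipy import stats

# Made-up observations with repeated x values
x = np.array([1.0, 1.0, 2.0, 2.0, 2.0, 3.0, 3.0, 4.0, 4.0, 4.0])
y = np.array([1.2, 1.9, 2.8, 3.1, 2.5, 4.2, 3.9, 5.4, 4.8, 5.1])

a, b = np.polyfit(x, y, 1)                  # straight-line fit, p = 2 parameters
N, n, p = len(x), len(np.unique(x)), 2

ss_pure = ss_lack = 0.0
for xi in np.unique(x):
    group = y[x == xi]
    local_avg = group.mean()
    ss_pure += np.sum((group - local_avg) ** 2)
    ss_lack += len(group) * (local_avg - (a * xi + b)) ** 2

# F = (lack-of-fit SS / (n - p)) / (pure-error SS / (N - n))
F = (ss_lack / (n - p)) / (ss_pure / (N - n))
p_value = stats.f.sf(F, n - p, N - n)       # upper-tail probability
critical = stats.f.ppf(0.95, n - p, N - n)  # critical value at the 5% level

print(F, critical, p_value)
# Reject the hypothesis that the straight line fits well if F > critical
</syntaxhighlight>

A large F (equivalently, a small p-value) indicates that the averages of the ''y''-values at each ''x'' deviate from the fitted line by more than the pure error alone can explain, i.e. lack of fit.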


See also

* Fraction of variance unexplained
* Goodness of fit
* Linear regression

