Weighted Least Squares

Weighted least squares (WLS), also known as weighted linear regression, is a generalization of ordinary least squares and linear regression in which knowledge of the unequal variance of observations (''heteroscedasticity'') is incorporated into the regression. WLS is also a specialization of generalized least squares, in which all the off-diagonal entries of the covariance matrix of the errors are zero.


Formulation

The fit of a model to a data point is measured by its residual, r_i, defined as the difference between a measured value of the dependent variable, y_i, and the value predicted by the model, f(x_i, \boldsymbol\beta):
r_i(\boldsymbol\beta) = y_i - f(x_i, \boldsymbol\beta).

If the errors are uncorrelated and have equal variance, then the function
S(\boldsymbol\beta) = \sum_i r_i(\boldsymbol\beta)^2
is minimised at \hat{\boldsymbol\beta}, such that \frac{\partial S}{\partial \beta_j}(\hat{\boldsymbol\beta}) = 0.

The Gauss–Markov theorem shows that, when this is so, \hat{\boldsymbol\beta} is a best linear unbiased estimator (BLUE). If, however, the measurements are uncorrelated but have different uncertainties, a modified approach might be adopted. Aitken showed that when a weighted sum of squared residuals is minimized, \hat{\boldsymbol\beta} is the BLUE if each weight is equal to the reciprocal of the variance of the measurement:
S = \sum_{i=1}^n W_{ii}\, r_i^2, \qquad W_{ii} = \frac{1}{\sigma_i^2}.

The gradient equations for this sum of squares are
-2\sum_i W_{ii}\,\frac{\partial f(x_i, \boldsymbol\beta)}{\partial \beta_j}\, r_i = 0, \quad j = 1, \ldots, m,
which, in a linear least squares system, give the modified normal equations
\sum_{i=1}^n \sum_{k=1}^m X_{ij} W_{ii} X_{ik} \hat\beta_k = \sum_{i=1}^n X_{ij} W_{ii} y_i, \quad j = 1, \ldots, m.

When the observational errors are uncorrelated and the weight matrix, W = \Omega^{-1}, is diagonal, these may be written as
\mathbf{X}^\textsf{T} W \mathbf{X} \hat{\boldsymbol\beta} = \mathbf{X}^\textsf{T} W \mathbf{y}.
If the errors are correlated, the resulting estimator is the BLUE if the weight matrix is equal to the inverse of the variance-covariance matrix of the observations.

When the errors are uncorrelated, it is convenient to simplify the calculations to factor the weight matrix as w_{ii} = \sqrt{W_{ii}}. The normal equations can then be written in the same form as ordinary least squares,
\mathbf{X}'^\textsf{T} \mathbf{X}' \hat{\boldsymbol\beta} = \mathbf{X}'^\textsf{T} \mathbf{y}',
where we define the following scaled matrix and vector:
\mathbf{X}' = \operatorname{diag}(\mathbf{w})\, \mathbf{X}, \qquad \mathbf{y}' = \operatorname{diag}(\mathbf{w})\, \mathbf{y} = \mathbf{y} \oslash \boldsymbol\sigma.
This is a type of whitening transformation; the last expression involves an entrywise division.

For non-linear least squares systems a similar argument shows that the normal equations should be modified as follows:
\left(\mathbf{J}^\textsf{T} W \mathbf{J}\right) \Delta\boldsymbol\beta = \mathbf{J}^\textsf{T} W\, \Delta\mathbf{y},
where \mathbf{J} is the Jacobian of the model with respect to the parameters.

Note that for empirical tests, the appropriate W is not known for sure and must be estimated. For this, feasible generalized least squares (FGLS) techniques may be used; in this case it is specialized for a diagonal covariance matrix, thus yielding a feasible weighted least squares solution.

If the uncertainty of the observations is not known from external sources, then the weights could be estimated from the given observations. This can be useful, for example, to identify outliers. After the outliers have been removed from the data set, the weights should be reset to one.
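
As a concrete illustration of the whitening transformation above, the following is a minimal NumPy sketch (not part of the original article): it assumes a design matrix ''X'', observations ''y'' and per-observation standard deviations ''sigma'' (hypothetical names), takes the weights as the reciprocal variances, rescales the rows, and solves the resulting ordinary least squares problem.

<syntaxhighlight lang="python">
import numpy as np

def wls_via_whitening(X, y, sigma):
    """Weighted least squares by rescaling rows, then solving an OLS problem.

    X     : (n, m) design matrix
    y     : (n,)   observed values
    sigma : (n,)   standard deviation of each observation
    """
    w = 1.0 / sigma                   # w_ii = sqrt(W_ii) = 1 / sigma_i
    X_scaled = X * w[:, None]         # X' = diag(w) X
    y_scaled = y * w                  # y' = diag(w) y = y (entrywise divided by) sigma
    beta_hat, *_ = np.linalg.lstsq(X_scaled, y_scaled, rcond=None)
    return beta_hat
</syntaxhighlight>

Solving the whitened system with a least-squares routine, rather than forming the normal equations explicitly, is numerically preferable when the design matrix is ill-conditioned.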


Motivation

In some cases the observations may be weighted; for example, they may not be equally reliable. In this case, one can minimize the weighted sum of squares:
\underset{\boldsymbol\beta}{\operatorname{arg\,min}}\, \sum_{i=1}^n w_i \left| y_i - \sum_{j=1}^m X_{ij}\beta_j \right|^2 = \underset{\boldsymbol\beta}{\operatorname{arg\,min}}\, \left\| W^{\frac{1}{2}}\left(\mathbf{y} - X\boldsymbol\beta\right)\right\|^2,
where ''w''''i'' > 0 is the weight of the ''i''th observation, and ''W'' is the diagonal matrix of such weights.

The weights should, ideally, be equal to the reciprocal of the variance of the measurement. (This implies that the observations are uncorrelated. If the observations are correlated, the expression S = \sum_k \sum_j r_k W_{kj} r_j applies. In this case the weight matrix should ideally be equal to the inverse of the variance-covariance matrix of the observations.)

The normal equations are then:
\left(X^\textsf{T} W X\right)\hat{\boldsymbol\beta} = X^\textsf{T} W \mathbf{y}.
This method is used in iteratively reweighted least squares.
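
The weighted normal equations above can also be solved directly. The following sketch, again under the assumption of a design matrix ''X'', observations ''y'' and standard deviations ''sigma'' (hypothetical names), forms W = diag(1/sigma^2) and solves the linear system; it should agree with the whitening-based solution shown earlier.

<syntaxhighlight lang="python">
import numpy as np

def wls_normal_equations(X, y, sigma):
    """Solve (X^T W X) beta = X^T W y with W = diag(1 / sigma**2)."""
    W = np.diag(1.0 / sigma**2)        # weights: reciprocal of the measurement variances
    XtW = X.T @ W
    beta_hat = np.linalg.solve(XtW @ X, XtW @ y)
    return beta_hat
</syntaxhighlight>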


Solution


Parameter errors and correlation

The estimated parameter values are linear combinations of the observed values:
\hat{\boldsymbol\beta} = \left(X^\textsf{T} W X\right)^{-1} X^\textsf{T} W \mathbf{y}.
Therefore, an expression for the estimated variance-covariance matrix of the parameter estimates can be obtained by error propagation from the errors in the observations. Let the variance-covariance matrix for the observations be denoted by ''M'' and that of the estimated parameters by M^\beta. Then
M^\beta = \left(X^\textsf{T} W X\right)^{-1} X^\textsf{T} W M W^\textsf{T} X \left(X^\textsf{T} W^\textsf{T} X\right)^{-1}.
When W = M^{-1}, this simplifies to
M^\beta = \left(X^\textsf{T} W X\right)^{-1}.
When unit weights are used (W = I, the identity matrix), it is implied that the experimental errors are uncorrelated and all equal: M = \sigma^2 I, where \sigma^2 is the ''a priori'' variance of an observation. In any case, \sigma^2 is approximated by the reduced chi-squared \chi^2_\nu:
M^\beta = \chi^2_\nu \left(X^\textsf{T} W X\right)^{-1}, \qquad \chi^2_\nu = S/\nu,
where ''S'' is the minimum value of the weighted objective function:
S = r^\textsf{T} W r = \left\| W^{\frac{1}{2}}\left(\mathbf{y} - X\hat{\boldsymbol\beta}\right)\right\|^2.
The denominator, \nu = n - m, is the number of degrees of freedom; see effective degrees of freedom for generalizations for the case of correlated observations.

In all cases, the variance of the parameter estimate \hat\beta_i is given by M^\beta_{ii}, and the covariance between the parameter estimates \hat\beta_i and \hat\beta_j is given by M^\beta_{ij}. The standard deviation is the square root of the variance, \sigma_i = \sqrt{M^\beta_{ii}}, and the correlation coefficient is given by \rho_{ij} = M^\beta_{ij}/(\sigma_i \sigma_j). These error estimates reflect only random errors in the measurements. The true uncertainty in the parameters is larger due to the presence of systematic errors, which, by definition, cannot be quantified. Note that even though the observations may be uncorrelated, the parameters are typically correlated.
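
The error-propagation formulas above translate directly into code. The following is a minimal sketch for the case W = M^{-1}, using the same hypothetical names ''X'', ''y'', ''sigma'' and a previously fitted ''beta_hat''; it computes the reduced chi-squared, the parameter covariance matrix, the standard errors and the correlation coefficients.

<syntaxhighlight lang="python">
import numpy as np

def wls_parameter_errors(X, y, sigma, beta_hat):
    """Estimated covariance, standard errors and correlations of the parameters."""
    n, m = X.shape
    W = np.diag(1.0 / sigma**2)
    r = y - X @ beta_hat                        # residuals
    S = r @ W @ r                               # minimum weighted objective function
    chi2_nu = S / (n - m)                       # reduced chi-squared, nu = n - m
    M_beta = chi2_nu * np.linalg.inv(X.T @ W @ X)
    se = np.sqrt(np.diag(M_beta))               # standard deviations sigma_i
    rho = M_beta / np.outer(se, se)             # correlation coefficients rho_ij
    return M_beta, se, rho
</syntaxhighlight>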


Parameter confidence limits

It is often ''assumed'', for want of any concrete evidence but often appealing to the central limit theorem (see Normal distribution#Occurrence and applications), that the error on each observation belongs to a normal distribution with a mean of zero and standard deviation \sigma. Under that assumption the following probabilities can be derived for a single scalar parameter estimate in terms of its estimated standard error se_\beta:
* 68% that the interval \hat\beta \pm se_\beta encompasses the true coefficient value
* 95% that the interval \hat\beta \pm 2 se_\beta encompasses the true coefficient value
* 99% that the interval \hat\beta \pm 2.5 se_\beta encompasses the true coefficient value

The assumption is not unreasonable when ''n'' ≫ ''m''. If the experimental errors are normally distributed, the parameters will belong to a Student's t-distribution with ''n'' − ''m'' degrees of freedom. When ''n'' ≫ ''m'', Student's t-distribution approximates a normal distribution. Note, however, that these confidence limits cannot take systematic error into account. Also, parameter errors should be quoted to one significant figure only, as they are subject to sampling error.

When the number of observations is relatively small, Chebyshev's inequality can be used for an upper bound on probabilities, regardless of any assumptions about the distribution of experimental errors: the maximum probabilities that a parameter will be more than 1, 2, or 3 standard deviations away from its expectation value are 100%, 25% and 11% respectively.
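
Under the normal-error assumption, an interval for a single coefficient can be computed from the Student's t-distribution with n − m degrees of freedom. A minimal SciPy sketch (hypothetical names; ''se_j'' would come from the parameter-error sketch above):

<syntaxhighlight lang="python">
from scipy import stats

def confidence_interval(beta_hat_j, se_j, nu, level=0.95):
    """Two-sided confidence interval for one coefficient.

    beta_hat_j : estimated coefficient
    se_j       : its estimated standard error
    nu         : degrees of freedom, n - m
    """
    t = stats.t.ppf(0.5 + level / 2, nu)   # critical value; close to 2 for 95% when nu is large
    return beta_hat_j - t * se_j, beta_hat_j + t * se_j
</syntaxhighlight>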


Residual values and correlation

The residuals are related to the observations by
\hat{\mathbf{r}} = \mathbf{y} - X \hat{\boldsymbol\beta} = \mathbf{y} - H \mathbf{y} = (I - H) \mathbf{y},
where ''H'' is the idempotent matrix known as the hat matrix,
H = X \left(X^\textsf{T} W X\right)^{-1} X^\textsf{T} W,
and ''I'' is the identity matrix. The variance-covariance matrix of the residuals, M^\mathbf{r}, is given by
M^\mathbf{r} = (I - H) M (I - H)^\textsf{T}.
Thus the residuals are correlated, even if the observations are not. When W = M^{-1},
M^\mathbf{r} = (I - H) M.

The sum of weighted residual values is equal to zero whenever the model function contains a constant term. Left-multiply the expression for the residuals by X^\textsf{T} W:
X^\textsf{T} W \hat{\mathbf{r}} = X^\textsf{T} W \mathbf{y} - X^\textsf{T} W X \hat{\boldsymbol\beta} = X^\textsf{T} W \mathbf{y} - \left(X^\textsf{T} W X\right) \left(X^\textsf{T} W X\right)^{-1} X^\textsf{T} W \mathbf{y} = \mathbf{0}.
Say, for example, that the first term of the model is a constant, so that X_{i1} = 1 for all ''i''. In that case it follows that
\sum_{i=1}^n X_{i1} W_i \hat r_i = \sum_{i=1}^n W_i \hat r_i = 0.
Thus, in the motivational example above, the fact that the sum of residual values is equal to zero is not accidental, but is a consequence of the presence of the constant term, α, in the model.

If experimental error follows a normal distribution, then, because of the linear relationship between residuals and observations, so should the residuals; but since the observations are only a sample of the population of all possible observations, the residuals should belong to a Student's t-distribution. Studentized residuals are useful in making a statistical test for an outlier when a particular residual appears to be excessively large.
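
The residual diagnostics above can be computed directly. A minimal sketch for the case W = M^{-1} with uncorrelated errors, again using the hypothetical names ''X'', ''y'', ''sigma''; it forms the hat matrix, the residuals, their covariance (I − H)M, and the weighted residual sums X^T W r, which should vanish when the model contains a constant column.

<syntaxhighlight lang="python">
import numpy as np

def wls_residual_diagnostics(X, y, sigma):
    """Hat matrix, residuals and residual covariance for a WLS fit with W = M^{-1}."""
    n = len(y)
    M = np.diag(sigma**2)                          # observation covariance (uncorrelated errors)
    W = np.diag(1.0 / sigma**2)
    H = X @ np.linalg.inv(X.T @ W @ X) @ X.T @ W   # hat matrix (idempotent)
    r = (np.eye(n) - H) @ y                        # residuals r = (I - H) y
    M_r = (np.eye(n) - H) @ M                      # residual covariance when W = M^{-1}
    weighted_sums = X.T @ W @ r                    # should be (numerically) zero
    return H, r, M_r, weighted_sums
</syntaxhighlight>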


See also

* Iteratively reweighted least squares
* Heteroscedasticity-consistent standard errors
* Weighted mean

