Weighted Least Squares

Weighted least squares (WLS), also known as weighted linear regression, is a generalization of ordinary least squares and linear regression in which knowledge of the variance of observations is incorporated into the regression. WLS is also a specialization of generalized least squares.


Introduction

A special case of generalized least squares called weighted least squares can be used when all the off-diagonal entries of Ω, the covariance matrix of the residuals, are null; the variances of the observations (along the covariance matrix diagonal) may still be unequal (heteroscedasticity).

The fit of a model to a data point is measured by its residual, r_i, defined as the difference between a measured value of the dependent variable, y_i, and the value predicted by the model, f(x_i, \boldsymbol\beta):

: r_i(\boldsymbol\beta) = y_i - f(x_i, \boldsymbol\beta).

If the errors are uncorrelated and have equal variance, then the function

: S(\boldsymbol\beta) = \sum_i r_i(\boldsymbol\beta)^2

is minimised at \hat{\boldsymbol\beta}, such that \frac{\partial S}{\partial \beta_j}(\hat{\boldsymbol\beta}) = 0 for each parameter \beta_j.

The Gauss–Markov theorem shows that, when this is so, \hat{\boldsymbol\beta} is a best linear unbiased estimator (BLUE). If, however, the measurements are uncorrelated but have different uncertainties, a modified approach might be adopted. Aitken showed that when a weighted sum of squared residuals is minimized, \hat{\boldsymbol\beta} is the BLUE if each weight is equal to the reciprocal of the variance of the measurement:

: S = \sum_{i=1}^n W_{ii} r_i^2, \qquad W_{ii} = \frac{1}{\sigma_i^2}.

The gradient equations for this sum of squares are

: -2 \sum_i W_{ii} \frac{\partial f(x_i, \boldsymbol\beta)}{\partial \beta_j} r_i = 0, \quad j = 1, \ldots, m,

which, in a linear least squares system, give the modified normal equations

: \sum_{i=1}^n \sum_{k=1}^m X_{ij} W_{ii} X_{ik} \hat\beta_k = \sum_{i=1}^n X_{ij} W_{ii} y_i, \quad j = 1, \ldots, m.

When the observational errors are uncorrelated and the weight matrix, W = Ω^{-1}, is diagonal, these may be written in matrix form as

: \left(X^\textsf{T} W X\right) \hat{\boldsymbol\beta} = X^\textsf{T} W \mathbf{y}.

If the errors are correlated, the resulting estimator is the BLUE if the weight matrix is equal to the inverse of the variance-covariance matrix of the observations.

When the errors are uncorrelated, it is convenient to simplify the calculations to factor the weight matrix as w_{ii} = \sqrt{W_{ii}}. The normal equations can then be written in the same form as ordinary least squares:

: \left(X'^\textsf{T} X'\right) \hat{\boldsymbol\beta} = X'^\textsf{T} \mathbf{y}',

where we define the following scaled matrix and vector:

: \begin{align} X' &= \operatorname{diag}(\mathbf{w})\, X, \\ \mathbf{y}' &= \operatorname{diag}(\mathbf{w})\, \mathbf{y} = \mathbf{y} \oslash \boldsymbol\sigma. \end{align}

This is a type of whitening transformation; the last expression involves an entrywise division.
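
To make the rescaling concrete, here is a minimal sketch in Python with NumPy (the function name wls_via_scaling is illustrative, not a library routine): each row is scaled by the square root of its weight, after which an ordinary least-squares solver yields the weighted estimate.

```python
import numpy as np

def wls_via_scaling(X, y, w):
    """Solve a weighted least squares problem by rescaling to OLS.

    Multiplying each row of X and each entry of y by sqrt(w_i) turns the
    weighted problem into an ordinary one, as in the whitening
    transformation above.  w holds the diagonal weights W_ii, ideally the
    reciprocals of the observation variances.
    """
    sw = np.sqrt(np.asarray(w, dtype=float))
    X_scaled = X * sw[:, None]   # X' = diag(sqrt(W)) X
    y_scaled = y * sw            # y' = diag(sqrt(W)) y
    beta, *_ = np.linalg.lstsq(X_scaled, y_scaled, rcond=None)
    return beta
```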
For non-linear least squares systems a similar argument shows that the normal equations should be modified as follows:

: \left(J^\textsf{T} W J\right) \boldsymbol{\Delta\beta} = J^\textsf{T} W\, \boldsymbol{\Delta y},

where J is the Jacobian of the model with respect to the parameters, J_{ij} = \partial f(x_i, \boldsymbol\beta)/\partial \beta_j. Note that for empirical tests, the appropriate W is not known for sure and must be estimated. For this, feasible generalized least squares (FGLS) techniques may be used; in this case the FGLS estimator is specialized for a diagonal covariance matrix, thus yielding a feasible weighted least squares solution.

If the uncertainty of the observations is not known from external sources, then the weights could be estimated from the given observations. This can be useful, for example, to identify outliers. After the outliers have been removed from the data set, the weights should be reset to one.
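
When the weights must be estimated from the data themselves, one illustrative scheme (an assumption for this sketch, not a method prescribed above) is to model the error variance from the squared residuals of an initial unweighted fit and then refit with the implied weights:

```python
import numpy as np

def feasible_wls(X, y, n_iter=3):
    """A naive feasible weighted least squares sketch.

    Starts from an OLS fit, models the error variance from the squared
    residuals (here by regressing log r_i^2 on the regressors, one common
    but by no means unique choice), and refits with weights equal to the
    reciprocals of the estimated variances.
    """
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)   # initial OLS fit
    for _ in range(n_iter):
        r = y - X @ beta
        # model log(r^2) as a linear function of the regressors
        gamma, *_ = np.linalg.lstsq(X, np.log(r**2 + 1e-12), rcond=None)
        w = np.exp(-(X @ gamma))                   # estimated 1 / variance
        sw = np.sqrt(w)
        beta, *_ = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)
    return beta
```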


Motivation

In some cases the observations may be weighted; for example, they may not be equally reliable. In this case, one can minimize the weighted sum of squares:

: \underset{\boldsymbol\beta}{\operatorname{arg\,min}}\ \sum_{i=1}^n w_i \left| y_i - \sum_{j=1}^m X_{ij}\beta_j \right|^2 = \underset{\boldsymbol\beta}{\operatorname{arg\,min}}\ \left\| W^{\frac{1}{2}} \left(\mathbf{y} - X\boldsymbol\beta\right) \right\|^2,

where w_i > 0 is the weight of the ith observation, and W is the diagonal matrix of such weights.

The weights should, ideally, be equal to the reciprocal of the variance of the measurement. (This implies that the observations are uncorrelated. If the observations are correlated, the expression S = \sum_k \sum_j r_k W_{kj} r_j applies. In this case the weight matrix should ideally be equal to the inverse of the variance-covariance matrix of the observations.) The normal equations are then:

: \left(X^\textsf{T} W X\right) \hat{\boldsymbol\beta} = X^\textsf{T} W \mathbf{y}.

This method is used in iteratively reweighted least squares.
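
A short sketch of solving these normal equations directly, on synthetic heteroscedastic data (the noise model and all variable names are assumptions for illustration):

```python
import numpy as np

# Fit y = a + b x where each observation has a different, known noise level.
rng = np.random.default_rng(0)
n = 100
X = np.column_stack([np.ones(n), rng.uniform(0, 10, n)])  # design matrix [1, x]
sigma = 0.5 + 0.2 * X[:, 1]            # assumed known noise level per observation
beta_true = np.array([2.0, 1.5])
y = X @ beta_true + rng.normal(0.0, sigma)

w = 1.0 / sigma**2                     # weights = reciprocal variances
XtW = X.T * w                          # X^T W, exploiting the diagonal W
beta_hat = np.linalg.solve(XtW @ X, XtW @ y)   # (X^T W X) beta = X^T W y
print(beta_hat)                        # close to beta_true
```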


Parameter errors and correlation

The estimated parameter values are linear combinations of the observed values:

: \hat{\boldsymbol\beta} = \left(X^\textsf{T} W X\right)^{-1} X^\textsf{T} W \mathbf{y}.

Therefore, an expression for the estimated variance-covariance matrix of the parameter estimates can be obtained by error propagation from the errors in the observations. Let the variance-covariance matrix for the observations be denoted by M and that of the estimated parameters by M^β. Then

: M^\beta = \left(X^\textsf{T} W X\right)^{-1} X^\textsf{T} W M W^\textsf{T} X \left(X^\textsf{T} W^\textsf{T} X\right)^{-1}.

When W = M^{-1}, this simplifies to

: M^\beta = \left(X^\textsf{T} W X\right)^{-1}.

When unit weights are used (W = I, the identity matrix), it is implied that the experimental errors are uncorrelated and all equal: M = σ^2 I, where σ^2 is the ''a priori'' variance of an observation. In any case, σ^2 is approximated by the reduced chi-squared \chi^2_\nu:

: \begin{align} M^\beta &= \chi^2_\nu \left(X^\textsf{T} W X\right)^{-1}, \\ \chi^2_\nu &= S/\nu, \end{align}

where S is the minimum value of the weighted objective function:

: S = \mathbf{r}^\textsf{T} W \mathbf{r} = \left\| W^{\frac{1}{2}} \left(\mathbf{y} - X\hat{\boldsymbol\beta}\right) \right\|^2.

The denominator, \nu = n - m, is the number of degrees of freedom; see effective degrees of freedom for generalizations for the case of correlated observations.

In all cases, the variance of the parameter estimate \hat\beta_i is given by M^\beta_{ii} and the covariance between the parameter estimates \hat\beta_i and \hat\beta_j is given by M^\beta_{ij}. The standard deviation is the square root of the variance, \sigma_i = \sqrt{M^\beta_{ii}}, and the correlation coefficient is given by \rho_{ij} = M^\beta_{ij}/(\sigma_i \sigma_j). These error estimates reflect only random errors in the measurements. The true uncertainty in the parameters is larger due to the presence of systematic errors, which, by definition, cannot be quantified. Note that even though the observations may be uncorrelated, the parameters are typically correlated.
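
The formulas above translate into a few lines of NumPy; this sketch (function name illustrative) returns the estimates together with their standard errors and correlation matrix, using the reduced chi-squared as the variance estimate:

```python
import numpy as np

def wls_parameter_errors(X, y, w):
    """Parameter estimates, standard errors and correlations for WLS.

    Uses M^beta = chi2_nu * (X^T W X)^{-1} with chi2_nu = S / (n - m),
    i.e. the a-priori variance is approximated by the reduced chi-squared.
    """
    n, m = X.shape
    XtW = X.T * w
    XtWX_inv = np.linalg.inv(XtW @ X)
    beta = XtWX_inv @ (XtW @ y)
    r = y - X @ beta
    S = np.sum(w * r**2)               # minimum weighted objective function
    chi2_nu = S / (n - m)              # reduced chi-squared, S / nu
    M_beta = chi2_nu * XtWX_inv        # estimated parameter covariance matrix
    se = np.sqrt(np.diag(M_beta))      # standard deviations of the estimates
    corr = M_beta / np.outer(se, se)   # correlation matrix of the estimates
    return beta, se, corr
```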


Parameter confidence limits

It is often ''assumed'', for want of any concrete evidence but often appealing to the central limit theorem (see Normal distribution#Occurrence and applications), that the error on each observation belongs to a normal distribution with a mean of zero and standard deviation \sigma. Under that assumption the following probabilities can be derived for a single scalar parameter estimate in terms of its estimated standard error se_\beta:

: 68% that the interval \hat\beta \pm se_\beta encompasses the true coefficient value
: 95% that the interval \hat\beta \pm 2 se_\beta encompasses the true coefficient value
: 99% that the interval \hat\beta \pm 2.5 se_\beta encompasses the true coefficient value

The assumption is not unreasonable when n ≫ m. If the experimental errors are normally distributed the parameters will belong to a Student's t-distribution with n − m degrees of freedom. When n ≫ m Student's t-distribution approximates a normal distribution. Note, however, that these confidence limits cannot take systematic error into account. Also, parameter errors should be quoted to one significant figure only, as they are subject to sampling error.

When the number of observations is relatively small, Chebyshev's inequality can be used for an upper bound on probabilities, regardless of any assumptions about the distribution of experimental errors: the maximum probabilities that a parameter will be more than 1, 2, or 3 standard deviations away from its expectation value are 100%, 25% and 11% respectively.
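
A small helper for such intervals, assuming SciPy is available (the function name is illustrative):

```python
from scipy.stats import t

def confidence_interval(beta_hat, se_beta, nu, level=0.95):
    """Two-sided confidence interval for one coefficient.

    Uses Student's t with nu = n - m degrees of freedom; for large nu the
    multiplier approaches the normal-theory values quoted above (roughly
    1, 2 and 2.5 standard errors for 68%, 95% and 99%).
    """
    tcrit = t.ppf(0.5 + level / 2.0, df=nu)
    return beta_hat - tcrit * se_beta, beta_hat + tcrit * se_beta
```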


Residual values and correlation

The residuals are related to the observations by

: \hat{\mathbf{r}} = \mathbf{y} - X \hat{\boldsymbol\beta} = \mathbf{y} - H \mathbf{y} = (I - H) \mathbf{y},

where H is the idempotent matrix known as the hat matrix:

: H = X \left(X^\textsf{T} W X\right)^{-1} X^\textsf{T} W,

and I is the identity matrix. The variance-covariance matrix of the residuals, M^\mathbf{r}, is given by

: M^\mathbf{r} = (I - H) M (I - H)^\textsf{T}.

Thus the residuals are correlated, even if the observations are not. When W = M^{-1},

: M^\mathbf{r} = (I - H) M.

The sum of weighted residual values is equal to zero whenever the model function contains a constant term. Left-multiply the expression for the residuals by X^\textsf{T} W:

: X^\textsf{T} W \hat{\mathbf{r}} = X^\textsf{T} W \mathbf{y} - X^\textsf{T} W X \hat{\boldsymbol\beta} = X^\textsf{T} W \mathbf{y} - \left(X^\textsf{T} W X\right) \left(X^\textsf{T} W X\right)^{-1} X^\textsf{T} W \mathbf{y} = \mathbf{0}.

Say, for example, that the first term of the model is a constant, so that X_{i1} = 1 for all i. In that case it follows that

: \sum_{i=1}^n X_{i1} W_i \hat{r}_i = \sum_{i=1}^n W_i \hat{r}_i = 0.

Thus the fact that the sum of weighted residual values is equal to zero is not accidental, but is a consequence of the presence of a constant term in the model.

If experimental error follows a normal distribution, then, because of the linear relationship between residuals and observations, so should the residuals; but since the observations are only a sample of the population of all possible observations, the residuals should belong to a Student's t-distribution. Studentized residuals are useful in making a statistical test for an outlier when a particular residual appears to be excessively large.
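
The hat matrix and a simple internal studentization can be sketched as follows (assuming a diagonal W = M^{-1}, so the residual variances are the diagonal of (I − H)M; the function names are illustrative):

```python
import numpy as np

def weighted_hat_matrix(X, w):
    """Hat matrix H = X (X^T W X)^{-1} X^T W for a diagonal weight vector w."""
    XtW = X.T * w
    return X @ np.linalg.solve(XtW @ X, XtW)

def studentized_residuals(X, y, w):
    """Residuals divided by their estimated standard deviations.

    Assumes W = M^{-1} with M = diag(1/w), so that M^r = (I - H) M and the
    variance of residual i is (1 - H_ii) / w_i.  In practice one often
    rescales further by the reduced chi-squared.
    """
    H = weighted_hat_matrix(X, w)
    r = y - H @ y                        # residuals (I - H) y
    var_r = (1.0 - np.diag(H)) / w       # diagonal of (I - H) M
    return r / np.sqrt(var_r)
```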


See also

* Iteratively reweighted least squares
* Heteroscedasticity-consistent standard errors
* Weighted mean

