Weighted least squares (WLS), also known as weighted linear regression, is a generalization of ordinary least squares and linear regression in which knowledge of the variance of observations is incorporated into the regression. WLS is also a specialization of generalized least squares.
Introduction
A special case of generalized least squares called weighted least squares can be used when all the off-diagonal entries of Ω, the covariance matrix of the residuals, are null; the variances of the observations (along the covariance matrix diagonal) may still be unequal (heteroscedasticity).
The fit of a model to a data point is measured by its residual, r_i, defined as the difference between a measured value of the dependent variable, y_i, and the value predicted by the model, f(x_i, \boldsymbol\beta):
:r_i(\boldsymbol\beta) = y_i - f(x_i, \boldsymbol\beta).
If the errors are uncorrelated and have equal variance, then the function
:S(\boldsymbol\beta) = \sum_i r_i^2(\boldsymbol\beta)
is minimised at \hat{\boldsymbol\beta}, such that \frac{\partial S}{\partial \beta_j}(\hat{\boldsymbol\beta}) = 0.
The Gauss–Markov theorem shows that, when this is so, \hat{\boldsymbol\beta} is a best linear unbiased estimator (BLUE). If, however, the measurements are uncorrelated but have different uncertainties, a modified approach might be adopted.
Aitken showed that when a weighted sum of squared residuals is minimized, \hat{\boldsymbol\beta} is the BLUE if each weight is equal to the reciprocal of the variance of the measurement:
:S = \sum_{i=1}^n W_{ii} r_i^2, \qquad W_{ii} = \frac{1}{\sigma_i^2}.
The gradient equations for this sum of squares are
:-2 \sum_i W_{ii} r_i \frac{\partial f(x_i, \boldsymbol\beta)}{\partial \beta_j} = 0, \qquad j = 1, \ldots, m,
which, in a linear least squares system, give the modified normal equations
:\sum_{i=1}^n \sum_{k=1}^m X_{ij} W_{ii} X_{ik} \hat\beta_k = \sum_{i=1}^n X_{ij} W_{ii} y_i, \qquad j = 1, \ldots, m.
When the observational errors are uncorrelated and the weight matrix, ''W'' = Ω^−1, is diagonal, these may be written as
:\mathbf{X}^\mathsf{T} \mathbf{W} \mathbf{X} \hat{\boldsymbol\beta} = \mathbf{X}^\mathsf{T} \mathbf{W} \mathbf{y}.
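To make the diagonal-weight normal equations concrete, here is a minimal NumPy sketch; the names `wls_fit`, `X`, `y`, and `sigma2` are illustrative, not from the text:

```python
import numpy as np

def wls_fit(X, y, sigma2):
    """Solve the weighted normal equations (X^T W X) beta = X^T W y
    for diagonal weights w_i = 1 / sigma_i^2."""
    w = 1.0 / sigma2                       # reciprocal variances as weights
    XtW = X.T * w                          # X^T W without forming the n-by-n W
    return np.linalg.solve(XtW @ X, XtW @ y)

# Illustrative use: straight-line fit with known, unequal variances.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 10.0, 50)
sigma2 = 0.1 + 0.05 * x**2                 # heteroscedastic error variances
y = 2.0 + 0.5 * x + rng.normal(0.0, np.sqrt(sigma2))
X = np.column_stack([np.ones_like(x), x])  # design matrix with a constant term
print(wls_fit(X, y, sigma2))               # close to the true [2.0, 0.5]
```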
If the errors are correlated, the resulting estimator is the BLUE if the weight matrix is equal to the inverse of the variance-covariance matrix of the observations.
When the errors are uncorrelated, it is convenient to simplify the calculations to factor the weight matrix as \mathbf{W} = \sqrt{\mathbf{W}}^\mathsf{T} \sqrt{\mathbf{W}}. The normal equations can then be written in the same form as ordinary least squares:
:\left(\mathbf{X}'^\mathsf{T} \mathbf{X}'\right) \hat{\boldsymbol\beta} = \mathbf{X}'^\mathsf{T} \mathbf{y}',
where we define the following scaled matrix and vector:
:\mathbf{X}' = \sqrt{\mathbf{W}} \mathbf{X}, \qquad \mathbf{y}' = \sqrt{\mathbf{W}} \mathbf{y} = \mathbf{y} \oslash \boldsymbol\sigma.
This is a type of whitening transformation; the last expression involves an entrywise division.
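The whitening form lends itself to a numerically friendlier implementation: scale the rows of X and the entries of y by 1/σ_i and hand the result to an ordinary least-squares solver. A sketch, reusing the hypothetical names from the previous example:

```python
import numpy as np

def wls_via_whitening(X, y, sigma2):
    """Whitening: X' = sqrt(W) X, y' = sqrt(W) y, then ordinary least squares."""
    s = np.sqrt(sigma2)
    Xs = X / s[:, None]        # divide each row of X by sigma_i
    ys = y / s                 # entrywise division of y by sigma
    beta, *_ = np.linalg.lstsq(Xs, ys, rcond=None)
    return beta
```

Solving the scaled system with `lstsq` (QR/SVD based) avoids explicitly forming X^T W X, whose condition number is the square of that of X.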
For non-linear least squares systems a similar argument shows that the normal equations should be modified as follows:
:\left(\mathbf{J}^\mathsf{T} \mathbf{W} \mathbf{J}\right) \Delta \boldsymbol\beta = \mathbf{J}^\mathsf{T} \mathbf{W} \Delta \mathbf{y},
where \mathbf{J} is the Jacobian of the model function with respect to the parameters and \Delta\mathbf{y} is the vector of current residuals.
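A minimal Gauss–Newton sketch of this modified iteration, assuming user-supplied functions `f(x, beta)` for the model and `jac(x, beta)` for its Jacobian (both hypothetical, with no step control for brevity):

```python
import numpy as np

def gauss_newton_wls(f, jac, x, y, sigma2, beta0, n_iter=20):
    """Repeatedly solve (J^T W J) dbeta = J^T W (y - f(x, beta))."""
    beta = np.asarray(beta0, dtype=float)
    w = 1.0 / sigma2
    for _ in range(n_iter):
        r = y - f(x, beta)     # current residuals (Delta y above)
        J = jac(x, beta)       # n-by-m Jacobian of the model function
        JtW = J.T * w
        beta = beta + np.linalg.solve(JtW @ J, JtW @ r)
    return beta
```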
Note that for empirical tests, the appropriate W is not known for sure and must be estimated. For this, feasible generalized least squares (FGLS) techniques may be used; in this case it is specialized for a diagonal covariance matrix, thus yielding a feasible weighted least squares solution.
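One common feasible-WLS recipe (a sketch of one possibility, not a method prescribed by the text) fits OLS first and then models the error variance from the squared residuals, for example through a regression of log r_i^2 on the same design matrix:

```python
import numpy as np

def feasible_wls(X, y):
    """Two-step feasible WLS: OLS fit, then weights from a model of log r_i^2."""
    beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)
    r = y - X @ beta_ols
    # Assumed variance model: log(r_i^2) linear in the regressors.
    gamma, *_ = np.linalg.lstsq(X, np.log(r**2 + 1e-12), rcond=None)
    sigma2_hat = np.exp(X @ gamma)          # estimated observation variances
    XtW = X.T / sigma2_hat
    return np.linalg.solve(XtW @ X, XtW @ y)
```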
If the uncertainty of the observations is not known from external sources, then the weights could be estimated from the given observations. This can be useful, for example, to identify outliers. After the outliers have been removed from the data set, the weights should be reset to one.
Motivation
In some cases the observations may be weighted; for example, they may not be equally reliable. In this case, one can minimize the weighted sum of squares:
:\underset{\boldsymbol\beta}{\operatorname{arg\,min}}\, \sum_{i=1}^n w_i \left| y_i - \sum_{j=1}^m X_{ij} \beta_j \right|^2 = \underset{\boldsymbol\beta}{\operatorname{arg\,min}}\, \left\| \mathbf{W}^{\frac{1}{2}} \left(\mathbf{y} - \mathbf{X} \boldsymbol\beta\right) \right\|^2,
where ''w_i'' > 0 is the weight of the ''i''th observation, and ''W'' is the diagonal matrix of such weights.
The weights should, ideally, be equal to the reciprocal of the variance of the measurement. (This implies that the observations are uncorrelated. If the observations are correlated, the expression S = \sum_k \sum_j r_k W_{kj} r_j applies. In this case the weight matrix should ideally be equal to the inverse of the variance-covariance matrix of the observations.)
The normal equations are then:
:\mathbf{X}^\mathsf{T} \mathbf{W} \mathbf{X} \hat{\boldsymbol\beta} = \mathbf{X}^\mathsf{T} \mathbf{W} \mathbf{y}.
This method is used in iteratively reweighted least squares.
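As a sketch of how IRLS reuses the weighted normal equations for a p-norm objective (the parameter `p` and damping floor `eps` are illustrative choices):

```python
import numpy as np

def irls(X, y, p=1.0, n_iter=50, eps=1e-8):
    """Minimize sum_i |y_i - (X beta)_i|^p by iteratively reweighted
    least squares with weights w_i = |r_i|^(p-2)."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)   # OLS starting point
    for _ in range(n_iter):
        r = y - X @ beta
        w = np.maximum(np.abs(r), eps) ** (p - 2)  # floor avoids divide-by-zero
        XtW = X.T * w
        beta = np.linalg.solve(XtW @ X, XtW @ y)
    return beta
```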
Parameter errors and correlation
The estimated parameter values are linear combinations of the observed values
:\hat{\boldsymbol\beta} = \left(\mathbf{X}^\mathsf{T} \mathbf{W} \mathbf{X}\right)^{-1} \mathbf{X}^\mathsf{T} \mathbf{W} \mathbf{y}.
Therefore, an expression for the estimated variance-covariance matrix of the parameter estimates can be obtained by error propagation from the errors in the observations. Let the variance-covariance matrix for the observations be denoted by ''M'' and that of the estimated parameters by ''M''^''β''. Then
:\mathbf{M}^\beta = \left(\mathbf{X}^\mathsf{T} \mathbf{W} \mathbf{X}\right)^{-1} \mathbf{X}^\mathsf{T} \mathbf{W} \mathbf{M} \mathbf{W}^\mathsf{T} \mathbf{X} \left(\mathbf{X}^\mathsf{T} \mathbf{W} \mathbf{X}\right)^{-1}.
When ''W'' = ''M''^−1, this simplifies to
:\mathbf{M}^\beta = \left(\mathbf{X}^\mathsf{T} \mathbf{W} \mathbf{X}\right)^{-1}.
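In code, the simplified case ''W'' = ''M''^−1 with a diagonal observation covariance gives the parameter covariance in one line (hypothetical names as before):

```python
import numpy as np

def parameter_covariance(X, sigma2):
    """M^beta = (X^T W X)^{-1} for W = M^{-1} with M = diag(sigma2)."""
    XtW = X.T / sigma2
    return np.linalg.inv(XtW @ X)
```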
When unit weights are used (''W'' = ''I'', the identity matrix), it is implied that the experimental errors are uncorrelated and all equal: ''M'' = ''σ''^2 ''I'', where ''σ''^2 is the ''a priori'' variance of an observation. In any case, ''σ''^2 is approximated by the reduced chi-squared \chi_\nu^2:
:\chi_\nu^2 = \frac{S}{n - m},
:\mathbf{M}^\beta = \chi_\nu^2 \left(\mathbf{X}^\mathsf{T} \mathbf{W} \mathbf{X}\right)^{-1},
where ''S'' is the minimum value of the weighted objective function:
:S = \hat{\mathbf{r}}^\mathsf{T} \mathbf{W} \hat{\mathbf{r}} = \sum_{i=1}^n W_{ii} \hat{r}_i^2.
The denominator, ''n'' − ''m'', is the number of degrees of freedom; see effective degrees of freedom for generalizations for the case of correlated observations.
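A short sketch of the reduced chi-squared estimate and the correspondingly scaled parameter covariance (names illustrative):

```python
import numpy as np

def reduced_chi_squared(X, y, beta, sigma2):
    """Return chi2_nu = S / (n - m) and M^beta = chi2_nu (X^T W X)^{-1}."""
    n, m = X.shape
    r = y - X @ beta
    S = np.sum(r**2 / sigma2)              # S = r^T W r for diagonal W
    chi2_nu = S / (n - m)
    XtW = X.T / sigma2
    return chi2_nu, chi2_nu * np.linalg.inv(XtW @ X)
```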
In all cases, the variance of the parameter estimate \hat\beta_j is given by M^\beta_{jj} and the covariance between the parameter estimates \hat\beta_j and \hat\beta_k is given by M^\beta_{jk}. The standard deviation is the square root of variance, \sigma_j = \sqrt{M^\beta_{jj}}, and the correlation coefficient is given by \rho_{jk} = M^\beta_{jk} / (\sigma_j \sigma_k). These error estimates reflect only random errors in the measurements. The true uncertainty in the parameters is larger due to the presence of systematic errors, which, by definition, cannot be quantified.
Note that even though the observations may be uncorrelated, the parameters are typically correlated.
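Extracting the standard deviations and correlation coefficients from any parameter covariance matrix, such as the one computed above, is a short sketch:

```python
import numpy as np

def stddev_and_correlation(cov):
    """sigma_j = sqrt(M^beta_jj); rho_jk = M^beta_jk / (sigma_j sigma_k)."""
    sigma = np.sqrt(np.diag(cov))
    return sigma, cov / np.outer(sigma, sigma)
```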
Parameter confidence limits
It is often ''assumed'', for want of any concrete evidence but often appealing to the central limit theorem (see Normal distribution#Occurrence and applications), that the error on each observation belongs to a normal distribution with a mean of zero and standard deviation ''σ''. Under that assumption the following probabilities can be derived for a single scalar parameter estimate in terms of its estimated standard error \sigma_\beta (given above):
: 68% that the interval \hat\beta \pm \sigma_\beta encompasses the true coefficient value
: 95% that the interval \hat\beta \pm 2\sigma_\beta encompasses the true coefficient value
: 99% that the interval \hat\beta \pm 2.5\sigma_\beta encompasses the true coefficient value
The assumption is not unreasonable when ''n'' ≫ ''m''. If the experimental errors are normally distributed the parameters will belong to a Student's t-distribution with ''n'' − ''m'' degrees of freedom. When ''n'' ≫ ''m'' Student's t-distribution approximates a normal distribution. Note, however, that these confidence limits cannot take systematic error into account. Also, parameter errors should be quoted to one significant figure only, as they are subject to sampling error.
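A sketch of such intervals using Student's t-distribution with ''n'' − ''m'' degrees of freedom; it leans on SciPy's `scipy.stats.t` for the quantiles, and the names are illustrative:

```python
import numpy as np
from scipy.stats import t

def confidence_intervals(beta, cov, n, m, level=0.95):
    """Two-sided intervals beta_j +/- (t quantile, n-m dof) * sigma_j."""
    sigma = np.sqrt(np.diag(cov))
    half = t.ppf(1.0 - (1.0 - level) / 2.0, df=n - m) * sigma
    return np.column_stack([beta - half, beta + half])
```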
When the number of observations is relatively small, Chebyshev's inequality can be used for an upper bound on probabilities, regardless of any assumptions about the distribution of experimental errors: the maximum probabilities that a parameter will be more than 1, 2, or 3 standard deviations away from its expectation value are 100%, 25% and 11% respectively.
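These percentages are simply the Chebyshev bound 1/''k''^2 evaluated at ''k'' = 1, 2, 3:
:P\left(\left|\hat\beta_j - \operatorname{E}[\hat\beta_j]\right| \ge k\sigma_j\right) \le \frac{1}{k^2} = 1,\ \tfrac{1}{4},\ \tfrac{1}{9} \approx 100\%,\ 25\%,\ 11\% \qquad (k = 1, 2, 3).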
Residual values and correlation
The residuals are related to the observations by
:\hat{\mathbf{r}} = \mathbf{y} - \mathbf{X} \hat{\boldsymbol\beta} = \mathbf{y} - \mathbf{H} \mathbf{y} = (\mathbf{I} - \mathbf{H}) \mathbf{y},
where ''H'' is the idempotent matrix known as the hat matrix:
:\mathbf{H} = \mathbf{X} \left(\mathbf{X}^\mathsf{T} \mathbf{W} \mathbf{X}\right)^{-1} \mathbf{X}^\mathsf{T} \mathbf{W},
and ''I'' is the identity matrix. The variance-covariance matrix of the residuals, ''M''^''r'', is given by
:\mathbf{M}^r = (\mathbf{I} - \mathbf{H}) \mathbf{M} (\mathbf{I} - \mathbf{H})^\mathsf{T}.
Thus the residuals are correlated, even if the observations are not.
When ''W'' = ''M''^−1,
:\mathbf{M}^r = (\mathbf{I} - \mathbf{H}) \mathbf{M}.
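A sketch computing the hat matrix and the residual covariance for the diagonal case ''M'' = diag(σ_i^2), ''W'' = ''M''^−1 (names illustrative; forming the full n-by-n hat matrix is fine for small data sets):

```python
import numpy as np

def residual_covariance(X, sigma2):
    """H = X (X^T W X)^{-1} X^T W and M^r = (I - H) M for W = M^{-1}."""
    XtW = X.T / sigma2
    H = X @ np.linalg.inv(XtW @ X) @ XtW
    return (np.eye(X.shape[0]) - H) @ np.diag(sigma2)
```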
The sum of weighted residual values is equal to zero whenever the model function contains a constant term. Left-multiply the expression for the residuals by ''X''^T ''W''^T:
:\mathbf{X}^\mathsf{T} \mathbf{W}^\mathsf{T} \hat{\mathbf{r}} = \mathbf{X}^\mathsf{T} \mathbf{W}^\mathsf{T} \mathbf{y} - \mathbf{X}^\mathsf{T} \mathbf{W}^\mathsf{T} \mathbf{X} \hat{\boldsymbol\beta} = \mathbf{0}.
Say, for example, that the first term of the model is a constant, so that X_{i1} = 1 for all ''i''. In that case it follows that
:\sum_{i=1}^n W_{ii} \hat{r}_i = 0.
Thus, in the motivational example above, the fact that the sum of residual values is equal to zero is not accidental, but is a consequence of the presence of the constant term, α, in the model.
If experimental error follows a normal distribution, then, because of the linear relationship between residuals and observations, so should residuals, but since the observations are only a sample of the population of all possible observations, the residuals should belong to a Student's t-distribution. Studentized residuals are useful in making a statistical test for an outlier when a particular residual appears to be excessively large.
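A minimal sketch of internally studentized residuals, dividing each residual by the square root of the corresponding diagonal element of ''M''^''r'' (it reuses the hypothetical `residual_covariance` sketch above):

```python
import numpy as np

def studentized_residuals(X, y, beta, sigma2):
    """Divide each residual by its estimated standard deviation,
    the square root of the i-th diagonal of M^r."""
    r = y - X @ beta
    Mr = residual_covariance(X, sigma2)    # sketch defined earlier
    return r / np.sqrt(np.diag(Mr))
```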
See also
* Iteratively reweighted least squares
* Heteroscedasticity-consistent standard errors
* Weighted mean