
In statistics, ordinary least squares (OLS) is a type of linear least squares method for choosing the unknown parameters in a linear regression model (with fixed level-one effects of a linear function of a set of explanatory variables) by the principle of least squares: minimizing the sum of the squares of the differences between the observed dependent variable (values of the variable being observed) in the input dataset and the output of the (linear) function of the independent variable. Some sources consider OLS to be linear regression.
Geometrically, this is seen as the sum of the squared distances, parallel to the axis of the dependent variable, between each data point in the set and the corresponding point on the regression surface—the smaller the differences, the better the model fits the data. The resulting
estimator can be expressed by a simple formula, especially in the case of a simple linear regression, in which there is a single regressor on the right side of the regression equation.
The OLS estimator is consistent for the level-one fixed effects when the regressors are exogenous and there is no perfect multicollinearity (rank condition), and consistent for the variance estimate of the residuals when the regressors have finite fourth moments. By the Gauss–Markov theorem, it is optimal in the class of linear unbiased estimators when the errors are homoscedastic and serially uncorrelated. Under these conditions, the method of OLS provides minimum-variance mean-unbiased estimation when the errors have finite variances. Under the additional assumption that the errors are normally distributed with zero mean, OLS is the maximum likelihood estimator and outperforms any non-linear unbiased estimator.
Linear model
Suppose the data consists of ''n'' observations $\{x_i, y_i\}_{i=1}^{n}$. Each observation $i$ includes a scalar response $y_i$ and a column vector $x_i$ of ''p'' parameters (regressors), i.e., $x_i = [x_{i1}, x_{i2}, \dots, x_{ip}]^{\mathrm T}$. In a linear regression model, the response variable, $y_i$, is a linear function of the regressors:

: $y_i = \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_p x_{ip} + \varepsilon_i,$

or in vector form,

: $y_i = x_i^{\mathrm T} \beta + \varepsilon_i,$

where $x_i$, as introduced previously, is a column vector of the $i$-th observation of all the explanatory variables; $\beta$ is a $p \times 1$ vector of unknown parameters; and the scalar $\varepsilon_i$ represents unobserved random variables (errors) of the $i$-th observation. $\varepsilon_i$ accounts for the influences upon the responses $y_i$ from sources other than the explanatory variables $x_i$. This model can also be written in matrix notation as

: $y = X\beta + \varepsilon,$

where $y$ and $\varepsilon$ are $n \times 1$ vectors of the response variables and the errors of the ''n'' observations, and $X$ is an $n \times p$ matrix of regressors, also sometimes called the design matrix, whose row $i$ is $x_i^{\mathrm T}$ and contains the $i$-th observations on all the explanatory variables.
Typically, a constant term is included in the set of regressors $X$, say, by taking $x_{i1} = 1$ for all $i = 1, \dots, n$. The coefficient $\beta_1$ corresponding to this regressor is called the ''intercept''. Without the intercept, the fitted line is forced to cross the origin when $x_i = \vec 0$.
Regressors do not have to be independent for estimation to be consistent; e.g., they may be non-linearly dependent. Short of perfect multicollinearity, parameter estimates may still be consistent; however, as multicollinearity rises, the standard error around such estimates increases and reduces the precision of such estimates. When there is perfect multicollinearity, it is no longer possible to obtain unique estimates for the coefficients of the related regressors; estimation for these parameters cannot converge (thus, it cannot be consistent).
As a concrete example where regressors are non-linearly dependent yet estimation may still be consistent, we might suspect the response depends linearly both on a value and its square, in which case we would include one regressor whose value is just the square of another regressor. In that case, the model would be ''quadratic'' in the second regressor, but nonetheless is still considered a ''linear'' model because the model ''is'' still linear in the parameters ($\beta$).
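As an illustrative sketch (not part of the original exposition), the following Python/NumPy snippet builds such a design matrix, with a column of ones for the intercept and a squared copy of a regressor; the data and variable names are invented for this example.

<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=20)  # a single observed regressor (made-up data)
y = 1.0 + 2.0 * x - 0.5 * x**2 + rng.normal(scale=0.1, size=20)

# Design matrix: intercept column, x, and x**2. The model is quadratic
# in x but linear in the parameters beta, so OLS still applies.
X = np.column_stack([np.ones_like(x), x, x**2])

beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta_hat)  # approximately [1.0, 2.0, -0.5]
</syntaxhighlight>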
Matrix/vector formulation
Consider an overdetermined system

: $\sum_{j=1}^{p} x_{ij}\beta_j = y_i, \qquad i = 1, 2, \dots, n,$

of ''n'' linear equations in ''p'' unknown coefficients $\beta_1, \beta_2, \dots, \beta_p$, with $n > p$. This can be written in matrix form as

: $X\beta = y,$

where

: $X = \begin{bmatrix} x_{11} & x_{12} & \cdots & x_{1p} \\ x_{21} & x_{22} & \cdots & x_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ x_{n1} & x_{n2} & \cdots & x_{np} \end{bmatrix}, \qquad \beta = \begin{bmatrix} \beta_1 \\ \beta_2 \\ \vdots \\ \beta_p \end{bmatrix}, \qquad y = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix}.$

(Note: for a linear model as above, not all elements in $X$ contain information on the data points. The first column is populated with ones, $x_{i1} = 1$. Only the other columns contain actual data. So here ''p'' is equal to the number of regressors plus one.)
Such a system usually has no exact solution, so the goal is instead to find the coefficients $\beta$ which fit the equations "best", in the sense of solving the quadratic minimization problem

: $\hat\beta = \underset{\beta}{\operatorname{arg\,min}}\, S(\beta),$

where the objective function ''S'' is given by

: $S(\beta) = \sum_{i=1}^{n} \left| y_i - \sum_{j=1}^{p} x_{ij}\beta_j \right|^2 = \left\| y - X\beta \right\|^2.$

A justification for choosing this criterion is given in Properties below. This minimization problem has a unique solution, provided that the ''p'' columns of the matrix $X$ are linearly independent, given by solving the so-called ''normal equations'':

: $(X^{\mathrm T} X)\,\hat\beta = X^{\mathrm T} y.$

The matrix $X^{\mathrm T} X$ is known as the ''normal matrix'' or Gram matrix and the matrix $X^{\mathrm T} y$ is known as the moment matrix of regressand by regressors. Finally, $\hat\beta$ is the coefficient vector of the least-squares hyperplane, expressed as

: $\hat\beta = (X^{\mathrm T} X)^{-1} X^{\mathrm T} y,$

or

: $\hat\beta = \beta + (X^{\mathrm T} X)^{-1} X^{\mathrm T} \varepsilon.$
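A minimal NumPy sketch of solving the normal equations directly, on invented data (in practice a QR- or SVD-based solver such as numpy.linalg.lstsq is preferred for numerical stability):

<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(1)
n, p = 100, 3
# Design matrix with an intercept column of ones.
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
beta_true = np.array([2.0, -1.0, 0.5])
y = X @ beta_true + rng.normal(scale=0.1, size=n)

# Normal equations: (X^T X) beta_hat = X^T y.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(beta_hat)  # close to beta_true
</syntaxhighlight>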
Estimation
Suppose ''b'' is a "candidate" value for the parameter vector ''β''. The quantity $y_i - x_i^{\mathrm T} b$, called the ''residual'' for the ''i''-th observation, measures the vertical distance between the data point $(x_i, y_i)$ and the hyperplane $y = x^{\mathrm T} b$, and thus assesses the degree of fit between the actual data and the model. The ''sum of squared residuals'' (''SSR'') (also called the ''error sum of squares'' (''ESS'') or ''residual sum of squares'' (''RSS'')) is a measure of the overall model fit:

: $S(b) = \sum_{i=1}^{n} (y_i - x_i^{\mathrm T} b)^2 = (y - Xb)^{\mathrm T}(y - Xb),$

where ''T'' denotes the matrix transpose, and the rows of ''X'', denoting the values of all the independent variables associated with a particular value of the dependent variable, are $X_i = x_i^{\mathrm T}$. The value of ''b'' which minimizes this sum is called the OLS estimator for ''β''. The function ''S''(''b'') is quadratic in ''b'' with positive-definite Hessian, and therefore this function possesses a unique global minimum at $b = \hat\beta$, which can be given by the explicit formula

: $\hat\beta = \underset{b \in \mathbb{R}^p}{\operatorname{arg\,min}}\, S(b) = (X^{\mathrm T} X)^{-1} X^{\mathrm T} y.$
The product ''N'' = ''X''T ''X'' is a Gram matrix, and its inverse, ''Q'' = ''N''−1, is the ''cofactor matrix'' of ''β'', closely related to its covariance matrix, ''C''''β''.
The matrix (''X''T ''X'')−1 ''X''T = ''Q'' ''X''T is called the Moore–Penrose pseudoinverse matrix of ''X''. This formulation highlights the point that estimation can be carried out if, and only if, there is no perfect multicollinearity between the explanatory variables (which would cause the Gram matrix to have no inverse).
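To illustrate the role of the pseudoinverse and the rank condition, a hedged sketch on invented data (the variable names are chosen for this example only):

<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(2)
X = np.column_stack([np.ones(50), rng.normal(size=(50, 2))])
y = X @ np.array([1.0, 2.0, 3.0]) + rng.normal(size=50)

# Moore-Penrose pseudoinverse: beta_hat = pinv(X) @ y, which equals
# (X^T X)^{-1} X^T y when the columns of X are linearly independent.
beta_hat = np.linalg.pinv(X) @ y
print(beta_hat)

# Perfect multicollinearity: duplicating a column makes the Gram
# matrix X^T X singular, so the explicit inverse no longer exists.
X_bad = np.column_stack([X, X[:, 1]])
print(np.linalg.matrix_rank(X_bad.T @ X_bad))  # 3 < 4: rank deficient
</syntaxhighlight>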
Prediction
After we have estimated ''β'', the ''fitted values'' (or ''predicted values'') from the regression will be

: $\hat y = X\hat\beta = Py,$

where ''P'' = ''X''(''X''T''X'')−1''X''T is the ''projection matrix'' onto the space ''V'' spanned by the columns of ''X''. This matrix ''P'' is also sometimes called the ''hat matrix'' because it "puts a hat" onto the variable ''y''. Another matrix, closely related to ''P'', is the ''annihilator'' matrix $M = I_n - P$; this is a projection matrix onto the space orthogonal to ''V''. Both matrices ''P'' and ''M'' are symmetric and idempotent (meaning that $P^2 = P$ and $M^2 = M$), and relate to the data matrix ''X'' via the identities $PX = X$ and $MX = 0$. Matrix ''M'' creates the ''residuals'' from the regression:

: $\hat\varepsilon = y - \hat y = y - X\hat\beta = My = M\varepsilon.$

The variances of the predicted values are found in the main diagonal of the variance-covariance matrix of predicted values:

: $\widehat{\operatorname{Var}}(\hat y) = s^2 P,$

where ''P'' is the projection matrix and ''s''2 is the sample variance.
The full matrix is very large; its diagonal elements can be calculated individually as:

: $\widehat{\operatorname{Var}}(\hat y_i) = s^2\, X_i (X^{\mathrm T} X)^{-1} X_i^{\mathrm T},$

where $X_i$ is the ''i''-th row of matrix ''X''.
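A short NumPy sketch of these projection identities, on invented data (forming ''P'' explicitly is only practical for small ''n''; it is shown here purely for illustration):

<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(3)
X = np.column_stack([np.ones(30), rng.normal(size=(30, 2))])
y = X @ np.array([0.5, 1.5, -2.0]) + rng.normal(size=30)

P = X @ np.linalg.inv(X.T @ X) @ X.T  # projection ("hat") matrix
M = np.eye(len(y)) - P                # annihilator matrix

y_hat = P @ y   # fitted values
resid = M @ y   # residuals

# Idempotence: P^2 = P and M^2 = M (up to rounding error).
assert np.allclose(P @ P, P) and np.allclose(M @ M, M)
# P X = X and M X = 0.
assert np.allclose(P @ X, X) and np.allclose(M @ X, 0)
</syntaxhighlight>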
Sample statistics
Using these residuals we can estimate the sample variance ''s''2 using the ''reduced chi-squared'' statistic:

: $s^2 = \frac{\hat\varepsilon^{\mathrm T}\hat\varepsilon}{n-p} = \frac{(My)^{\mathrm T} My}{n-p} = \frac{y^{\mathrm T} M y}{n-p} = \frac{S(\hat\beta)}{n-p}, \qquad \hat\sigma^2 = \frac{n-p}{n}\, s^2.$

The denominator, ''n''−''p'', is the statistical degrees of freedom. The first quantity, ''s''2, is the OLS estimate for ''σ''2, whereas the second, $\hat\sigma^2$, is the MLE estimate for ''σ''2. The two estimators are quite similar in large samples; the first estimator is always unbiased, while the second estimator is biased but has a smaller mean squared error. In practice ''s''2 is used more often, since it is more convenient for hypothesis testing. The square root of ''s''2 is called the ''regression standard error'', ''standard error of the regression'', or ''standard error of the equation''.
It is common to assess the goodness-of-fit of the OLS regression by comparing how much the initial variation in the sample can be reduced by regressing onto ''X''. The ''coefficient of determination'' ''R''2 is defined as a ratio of "explained" variance to the "total" variance of the dependent variable ''y'', in the cases where the regression sum of squares equals the sum of squares of residuals:

: $R^2 = \frac{\sum (\hat y_i - \overline{y})^2}{\sum (y_i - \overline{y})^2} = \frac{y^{\mathrm T} P^{\mathrm T} L P y}{y^{\mathrm T} L y} = 1 - \frac{y^{\mathrm T} M y}{y^{\mathrm T} L y} = 1 - \frac{\mathrm{RSS}}{\mathrm{TSS}},$

where TSS is the ''total sum of squares'' for the dependent variable, $L = I_n - \frac{1}{n} J_n$, and $J_n$ is an ''n''×''n'' matrix of ones. ($L$ is a centering matrix which is equivalent to regression on a constant; it simply subtracts the mean from a variable.) In order for ''R''2 to be meaningful, the matrix ''X'' of data on regressors must contain a column vector of ones to represent the constant whose coefficient is the regression intercept. In that case, ''R''2 will always be a number between 0 and 1, with values close to 1 indicating a good degree of fit.
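A brief illustrative computation of ''s''2, the MLE variance estimate, and ''R''2 in NumPy, on invented data:

<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(4)
n, p = 200, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(size=n)

beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta_hat

s2 = resid @ resid / (n - p)    # unbiased OLS estimate of sigma^2
sigma2_mle = resid @ resid / n  # biased MLE estimate, smaller MSE

tss = np.sum((y - y.mean()) ** 2)  # total sum of squares
r2 = 1.0 - (resid @ resid) / tss   # coefficient of determination
print(s2, sigma2_mle, r2)
</syntaxhighlight>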
Simple linear regression model
If the data matrix ''X'' contains only two variables, a constant and a scalar regressor $x_i$, then this is called the "simple regression model". This case is often considered in beginner statistics classes, as it provides much simpler formulas, even suitable for manual calculation. The parameters are commonly denoted as (''α'', ''β''):

: $y_i = \alpha + \beta x_i + \varepsilon_i.$

The least squares estimates in this case are given by simple formulas

: $\hat\beta = \frac{\sum_{i=1}^{n} x_i y_i - \frac{1}{n}\sum_{i=1}^{n} x_i \sum_{j=1}^{n} y_j}{\sum_{i=1}^{n} x_i^2 - \frac{1}{n}\left(\sum_{i=1}^{n} x_i\right)^2} = \frac{\operatorname{Cov}[x, y]}{\operatorname{Var}[x]}, \qquad \hat\alpha = \overline{y} - \hat\beta\,\overline{x}.$
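These two formulas translate directly into a few lines of NumPy; the data below is invented for illustration:

<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(5)
x = rng.normal(size=50)
y = 3.0 + 1.5 * x + rng.normal(scale=0.2, size=50)

# Slope: Cov[x, y] / Var[x]; intercept: mean(y) - slope * mean(x).
# bias=True makes np.cov use the same 1/n normalization as np.var.
beta_hat = np.cov(x, y, bias=True)[0, 1] / np.var(x)
alpha_hat = y.mean() - beta_hat * x.mean()
print(alpha_hat, beta_hat)  # approximately 3.0 and 1.5
</syntaxhighlight>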
Alternative derivations
In the previous section the least squares estimator $\hat\beta$ was obtained as a value that minimizes the sum of squared residuals of the model. However, it is also possible to derive the same estimator from other approaches. In all cases the formula for the OLS estimator remains the same: $\hat\beta = (X^{\mathrm T} X)^{-1} X^{\mathrm T} y$; the only difference is in how we interpret this result.
Projection
For mathematicians, OLS is an approximate solution to an overdetermined system of linear equations $X\beta \approx y$, where ''β'' is the unknown. Assuming the system cannot be solved exactly (the number of equations ''n'' is much larger than the number of unknowns ''p''), we are looking for a solution that could provide the smallest discrepancy between the right- and left-hand sides. In other words, we are looking for the solution that satisfies

: $\hat\beta = \underset{\beta}{\operatorname{arg\,min}}\, \| y - X\beta \|,$

where $\|\cdot\|$ is the standard ''L''2 norm in the ''n''-dimensional Euclidean space $\mathbb{R}^n$. The predicted quantity ''Xβ'' is just a certain linear combination of the vectors of regressors. Thus, the residual vector $y - X\beta$ will have the smallest length when ''y'' is projected orthogonally onto the linear subspace spanned by the columns of ''X''. The OLS estimator $\hat\beta$ in this case can be interpreted as the coefficients of vector decomposition of $\hat y = Py$ along the basis of ''X''.
In other words, the gradient equations at the minimum can be written as:

: $(y - X\hat\beta)^{\mathrm T} X = 0.$

A geometrical interpretation of these equations is that the vector of residuals, $y - X\hat\beta$, is orthogonal to the column space of ''X'', since the dot product $(y - X\hat\beta) \cdot Xv$ is equal to zero for ''any'' conformal vector, $v$. This means that $y - X\hat\beta$ is the shortest of all possible vectors $y - X\beta$, that is, the variance of the residuals is the minimum possible.
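This orthogonality condition is easy to verify numerically; a minimal sketch on invented data:

<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(6)
X = np.column_stack([np.ones(40), rng.normal(size=(40, 2))])
y = X @ np.array([1.0, -1.0, 2.0]) + rng.normal(size=40)

beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta_hat

# Gradient / orthogonality condition: X^T (y - X beta_hat) = 0,
# i.e. the residual vector is orthogonal to every column of X.
print(X.T @ resid)  # numerically zero
</syntaxhighlight>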
Introducing $\hat\gamma$ and a matrix ''K'' with the assumption that a matrix $[X\ K]$ is non-singular and $K^{\mathrm T} X = 0$ (cf. orthogonal projections), the residual vector should satisfy the following equation:

: $\hat\varepsilon := y - X\hat\beta = K\hat\gamma.$