In statistics, ordinary least squares (OLS) is a type of linear least squares method for choosing the unknown parameters in a linear regression model (with fixed level-one effects of a linear function of a set of explanatory variables) by the principle of least squares: minimizing the sum of the squares of the differences between the observed dependent variable (values of the variable being observed) in the input dataset and the output of the (linear) function of the independent variable. Geometrically, this is seen as the sum of the squared distances, parallel to the axis of the dependent variable, between each data point in the set and the corresponding point on the regression surface; the smaller the differences, the better the model fits the data. The resulting estimator can be expressed by a simple formula, especially in the case of a simple linear regression, in which there is a single regressor on the right side of the regression equation.

The OLS estimator is consistent for the level-one fixed effects when the regressors are exogenous and there is no perfect multicollinearity (rank condition), and consistent for the variance estimate of the residuals when the regressors have finite fourth moments. By the Gauss–Markov theorem, it is optimal in the class of linear unbiased estimators when the errors are homoscedastic and serially uncorrelated. Under these conditions, the method of OLS provides minimum-variance mean-unbiased estimation when the errors have finite variances. Under the additional assumption that the errors are normally distributed with zero mean, OLS is the maximum likelihood estimator and outperforms any non-linear unbiased estimator.
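To make the least-squares idea concrete, here is a minimal sketch in Python with NumPy that fits a simple linear regression by minimizing the sum of squared vertical differences; the data values and variable names are made up purely for illustration, not taken from any source.

<syntaxhighlight lang="python">
import numpy as np

# Made-up data: one regressor x and an observed response y.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# Simple linear regression y = b0 + b1*x + error: choose (b0, b1) to minimize
# the sum of squared vertical differences between y and the fitted line.
X = np.column_stack([np.ones_like(x), x])        # design matrix with an intercept column
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)  # least-squares coefficients

print("intercept, slope:", beta_hat)
print("sum of squared residuals:", np.sum((y - X @ beta_hat) ** 2))
</syntaxhighlight>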


Linear model

Suppose the data consists of n observations \left\{ \mathbf{x}_i, y_i \right\}_{i=1}^{n}. Each observation i includes a scalar response y_i and a column vector \mathbf{x}_i of p parameters (regressors), i.e., \mathbf{x}_i = \left[ x_{i1}, x_{i2}, \dots, x_{ip} \right]^\mathsf{T}. In a linear regression model, the response variable, y_i, is a linear function of the regressors:

:y_i = \beta_1\ x_{i1} + \beta_2\ x_{i2} + \cdots + \beta_p\ x_{ip} + \varepsilon_i,

or in vector form,

:y_i = \mathbf{x}_i^\mathsf{T} \boldsymbol{\beta} + \varepsilon_i, \,

where \mathbf{x}_i, as introduced previously, is a column vector of the i-th observation of all the explanatory variables; \boldsymbol{\beta} is a p \times 1 vector of unknown parameters; and the scalar \varepsilon_i represents unobserved random variables (errors) of the i-th observation. \varepsilon_i accounts for the influences upon the responses y_i from sources other than the explanatory variables \mathbf{x}_i. This model can also be written in matrix notation as

:\mathbf{y} = \mathbf{X} \boldsymbol{\beta} + \boldsymbol{\varepsilon}, \,

where \mathbf{y} and \boldsymbol{\varepsilon} are n \times 1 vectors of the response variables and the errors of the n observations, and \mathbf{X} is an n \times p matrix of regressors, also sometimes called the design matrix, whose row i is \mathbf{x}_i^\mathsf{T} and contains the i-th observations on all the explanatory variables.

As a rule, the constant term is always included in the set of regressors \mathbf{X}, say, by taking x_{i1} = 1 for all i = 1, \dots, n. The coefficient \beta_1 corresponding to this regressor is called the ''intercept''. Regressors do not have to be independent: there can be any desired relationship between the regressors (so long as it is not a linear relationship). For instance, we might suspect the response depends linearly both on a value and its square, in which case we would include one regressor whose value is just the square of another regressor. In that case, the model would be ''quadratic'' in the second regressor, but it is nonetheless still considered a ''linear'' model because the model ''is'' still linear in the parameters (\boldsymbol{\beta}).
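As a sketch of this setup (Python with NumPy; the data values and variable names are hypothetical), the following builds a design matrix with a constant column, a regressor, and its square, matching the quadratic-in-the-regressor example just described.

<syntaxhighlight lang="python">
import numpy as np

# Hypothetical data: n = 6 observations of one underlying variable z and a response y.
z = np.array([0.5, 1.0, 1.5, 2.0, 2.5, 3.0])
y = np.array([1.2, 1.9, 3.1, 4.8, 7.1, 9.6])

# Design matrix X (n x p): a column of ones for the intercept (x_i1 = 1),
# the variable itself, and its square. The model is quadratic in z but
# still linear in the parameters beta.
X = np.column_stack([np.ones_like(z), z, z ** 2])   # p = 3; row i is x_i transposed
print(X.shape)                                      # (6, 3)

# In matrix notation the model reads y = X @ beta + eps for some unknown beta.
</syntaxhighlight>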


Matrix/vector formulation

Consider an overdetermined system

:\sum_{j=1}^{p} X_{ij} \beta_j = y_i,\ (i = 1, 2, \dots, n),

of n linear equations in p unknown coefficients, \beta_1, \beta_2, \dots, \beta_p, with n > p. (Note: for a linear model as above, not all elements in \mathbf{X} contain information on the data points. The first column is populated with ones, X_{i1} = 1. Only the other columns contain actual data. So here p is equal to the number of regressors plus one.) This can be written in matrix form as

:\mathbf{X} \boldsymbol{\beta} = \mathbf{y},

where

:\mathbf{X} = \begin{bmatrix} X_{11} & X_{12} & \cdots & X_{1p} \\ X_{21} & X_{22} & \cdots & X_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ X_{n1} & X_{n2} & \cdots & X_{np} \end{bmatrix}, \qquad \boldsymbol{\beta} = \begin{bmatrix} \beta_1 \\ \beta_2 \\ \vdots \\ \beta_p \end{bmatrix}, \qquad \mathbf{y} = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix}.

Such a system usually has no exact solution, so the goal is instead to find the coefficients \boldsymbol{\beta} which fit the equations "best", in the sense of solving the quadratic minimization problem

:\hat{\boldsymbol{\beta}} = \underset{\boldsymbol{\beta}}{\operatorname{arg\,min}}\, S(\boldsymbol{\beta}),

where the objective function S is given by

:S(\boldsymbol{\beta}) = \sum_{i=1}^{n} \biggl| y_i - \sum_{j=1}^{p} X_{ij} \beta_j \biggr|^2 = \bigl\| \mathbf{y} - \mathbf{X} \boldsymbol{\beta} \bigr\|^2.

A justification for choosing this criterion is given in Properties below. This minimization problem has a unique solution, provided that the p columns of the matrix \mathbf{X} are linearly independent, given by solving the so-called ''normal equations'':

:\left( \mathbf{X}^\operatorname{T} \mathbf{X} \right) \hat{\boldsymbol{\beta}} = \mathbf{X}^\operatorname{T} \mathbf{y}\ .

The matrix \mathbf{X}^\operatorname{T} \mathbf{X} is known as the ''normal matrix'' or Gram matrix, and the matrix \mathbf{X}^\operatorname{T} \mathbf{y} is known as the moment matrix of regressand by regressors. Finally, \hat{\boldsymbol{\beta}} is the coefficient vector of the least-squares hyperplane, expressed as

:\hat{\boldsymbol{\beta}} = \left( \mathbf{X}^\operatorname{T} \mathbf{X} \right)^{-1} \mathbf{X}^\operatorname{T} \mathbf{y},

or

:\hat{\boldsymbol{\beta}} = \boldsymbol{\beta} + \left( \mathbf{X}^\operatorname{T} \mathbf{X} \right)^{-1} \mathbf{X}^\operatorname{T} \boldsymbol{\varepsilon}.
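A short sketch, assuming simulated data and NumPy, of solving the normal equations directly and cross-checking the result against a general least-squares solver:

<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])  # overdetermined: n > p
beta_true = np.array([1.0, 2.0, -0.5])
y = X @ beta_true + 0.1 * rng.normal(size=n)

# Normal equations: (X^T X) beta_hat = X^T y
beta_normal = np.linalg.solve(X.T @ X, X.T @ y)

# Same minimizer via a QR/SVD-based least-squares solver
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(np.allclose(beta_normal, beta_lstsq))   # True
</syntaxhighlight>

In practice, forming and solving the normal equations explicitly can be numerically ill-conditioned, so QR- or SVD-based solvers such as np.linalg.lstsq are usually preferred even though both routes target the same minimizer.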


Estimation

Suppose ''b'' is a "candidate" value for the parameter vector ''β''. The quantity ''y''i − ''x''iT''b'', called the residual for the ''i''-th observation, measures the vertical distance between the data point (''x''i, ''y''i) and the hyperplane ''y'' = ''x''T''b'', and thus assesses the degree of fit between the actual data and the model. The sum of squared residuals (SSR) (also called the error sum of squares (ESS) or residual sum of squares (RSS)) is a measure of the overall model fit:

:S(b) = \sum_{i=1}^{n} (y_i - x_i^\mathrm{T} b)^2 = (y - Xb)^\mathrm{T}(y - Xb),

where ''T'' denotes the matrix transpose, and the rows of ''X'', denoting the values of all the independent variables associated with a particular value of the dependent variable, are ''X''i = ''x''iT. The value of ''b'' which minimizes this sum is called the OLS estimator for ''β''. The function ''S''(''b'') is quadratic in ''b'' with positive-definite Hessian, and therefore this function possesses a unique global minimum at b = \hat\beta, which can be given by the explicit formula:

:\hat\beta = \operatorname{arg\,min}_{b \in \mathbb{R}^p} S(b) = (X^\mathrm{T} X)^{-1} X^\mathrm{T} y\ .

The product ''N'' = ''X''T''X'' is a Gram matrix and its inverse, ''Q'' = ''N''–1, is the ''cofactor matrix'' of ''β'', closely related to its covariance matrix, ''C''''β''. The matrix (''X''T''X'')–1''X''T = ''QX''T is called the Moore–Penrose pseudoinverse matrix of ''X''. This formulation highlights the point that estimation can be carried out if, and only if, there is no perfect multicollinearity between the explanatory variables (which would cause the Gram matrix to have no inverse).

After we have estimated ''β'', the fitted values (or predicted values) from the regression will be

:\hat{y} = X\hat\beta = Py,

where ''P'' = ''X''(''X''T''X'')−1''X''T is the projection matrix onto the space ''V'' spanned by the columns of ''X''. This matrix ''P'' is also sometimes called the hat matrix because it "puts a hat" onto the variable ''y''. Another matrix, closely related to ''P'', is the ''annihilator'' matrix ''M'' = ''I'' − ''P'' (with ''I'' the ''n''×''n'' identity matrix); this is a projection matrix onto the space orthogonal to ''V''. Both matrices ''P'' and ''M'' are symmetric and idempotent (meaning that ''P''2 = ''P'' and ''M''2 = ''M''), and relate to the data matrix ''X'' via the identities ''PX'' = ''X'' and ''MX'' = 0. Matrix ''M'' creates the residuals from the regression:

:\hat\varepsilon = y - \hat{y} = y - X\hat\beta = My = M(X\beta + \varepsilon) = (MX)\beta + M\varepsilon = M\varepsilon.

Using these residuals we can estimate the value of ''σ''2 using the reduced chi-squared statistic:

:s^2 = \frac{\hat\varepsilon^\mathrm{T} \hat\varepsilon}{n - p} = \frac{(My)^\mathrm{T} My}{n - p} = \frac{y^\mathrm{T} M^\mathrm{T} M y}{n - p} = \frac{y^\mathrm{T} M y}{n - p} = \frac{S(\hat\beta)}{n - p}, \qquad \hat\sigma^2 = \frac{n - p}{n}\; s^2.

The denominator, ''n'' − ''p'', is the statistical degrees of freedom. The first quantity, ''s''2, is the OLS estimate for ''σ''2, whereas the second, \scriptstyle\hat\sigma^2, is the MLE estimate for ''σ''2. The two estimators are quite similar in large samples; the first estimator is always unbiased, while the second estimator is biased but has a smaller mean squared error. In practice ''s''2 is used more often, since it is more convenient for hypothesis testing. The square root of ''s''2 is called the regression standard error, standard error of the regression, or standard error of the equation.

It is common to assess the goodness-of-fit of the OLS regression by comparing how much the initial variation in the sample can be reduced by regressing onto ''X''. The coefficient of determination ''R''2 is defined as the ratio of the "explained" variance to the "total" variance of the dependent variable ''y''.
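The estimator, the projection and annihilator matrices, and the two variance estimates can be reproduced in a few lines; the following Python/NumPy sketch uses simulated data and illustrative names only.

<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(1)
n, p = 30, 4
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
y = X @ np.array([0.5, 1.0, -1.0, 2.0]) + rng.normal(size=n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)     # OLS estimator
P = X @ np.linalg.inv(X.T @ X) @ X.T             # projection ("hat") matrix
M = np.eye(n) - P                                # annihilator matrix

y_fit = P @ y                                    # fitted values  y_hat = P y
resid = M @ y                                    # residuals      eps_hat = M y

# P and M are symmetric and idempotent; P X = X and M X = 0 (up to rounding)
assert np.allclose(P @ P, P) and np.allclose(M @ X, 0)

s2 = resid @ resid / (n - p)                     # unbiased estimate of sigma^2
sigma2_mle = resid @ resid / n                   # MLE estimate of sigma^2
print(beta_hat, s2, sigma2_mle)
</syntaxhighlight>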