Proofs Involving Ordinary Least Squares

The purpose of this page is to provide supplementary materials for the ordinary least squares article, reducing the load of the main article with mathematics and improving its accessibility, while at the same time retaining the completeness of exposition.


Derivation of the normal equations

Define the ''i''th residual to be

:r_i = y_i - \sum_{j=1}^{n} X_{ij}\beta_j.

Then the objective ''S'' can be rewritten

:S = \sum_{i=1}^{m} r_i^2.

Given that ''S'' is convex, it is minimized when its gradient vector is zero. (This follows by definition: if the gradient vector is not zero, there is a direction in which we can move to decrease the objective further – see maxima and minima.) The elements of the gradient vector are the partial derivatives of ''S'' with respect to the parameters:

:\frac{\partial S}{\partial \beta_j} = 2\sum_{i=1}^{m} r_i \frac{\partial r_i}{\partial \beta_j} \qquad (j=1,2,\dots,n).

The derivatives are

:\frac{\partial r_i}{\partial \beta_j} = -X_{ij}.

Substitution of the expressions for the residuals and the derivatives into the gradient equations gives

:\frac{\partial S}{\partial \beta_j} = 2\sum_{i=1}^{m} \left( y_i - \sum_{k=1}^{n} X_{ik}\beta_k \right)(-X_{ij}) \qquad (j=1,2,\dots,n).

Thus if \widehat\beta minimizes ''S'', we have

:2\sum_{i=1}^{m} \left( y_i - \sum_{k=1}^{n} X_{ik}\widehat\beta_k \right)(-X_{ij}) = 0 \qquad (j=1,2,\dots,n).

Upon rearrangement, we obtain the normal equations:

:\sum_{i=1}^{m}\sum_{k=1}^{n} X_{ij}X_{ik}\widehat\beta_k = \sum_{i=1}^{m} X_{ij}y_i \qquad (j=1,2,\dots,n).

The normal equations are written in matrix notation as

:(\mathbf{X}^\mathrm{T} \mathbf{X})\,\widehat{\boldsymbol\beta} = \mathbf{X}^\mathrm{T} \mathbf{y}

(where ''X''<sup>T</sup> is the matrix transpose of ''X''). The solution of the normal equations yields the vector \widehat{\boldsymbol\beta} of the optimal parameter values.
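The normal equations above can be verified numerically. The following sketch uses NumPy on synthetic data (the design matrix, true coefficients, and noise level are illustrative assumptions, not part of the derivation): it solves (X^T X)\widehat\beta = X^T y and confirms that the gradient of ''S'' vanishes at \widehat\beta.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 50, 3
X = rng.standard_normal((m, n))
beta_true = np.array([1.0, -2.0, 0.5])          # arbitrary illustrative coefficients
y = X @ beta_true + 0.1 * rng.standard_normal(m)

# Solve the normal equations (X^T X) beta_hat = X^T y
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Gradient of S at beta_hat should be zero: -2 X^T (y - X beta_hat)
grad = -2 * X.T @ (y - X @ beta_hat)
print(np.allclose(grad, 0))  # True (up to floating-point tolerance)
```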


Derivation directly in terms of matrices

The normal equations can be derived directly from a matrix representation of the problem as follows. The objective is to minimize

:S(\boldsymbol\beta) = \bigl\| \mathbf{y} - \mathbf{X}\boldsymbol\beta \bigr\|^2 = (\mathbf{y} - \mathbf{X}\boldsymbol\beta)^{\rm T}(\mathbf{y} - \mathbf{X}\boldsymbol\beta) = \mathbf{y}^{\rm T}\mathbf{y} - \boldsymbol\beta^{\rm T}\mathbf{X}^{\rm T}\mathbf{y} - \mathbf{y}^{\rm T}\mathbf{X}\boldsymbol\beta + \boldsymbol\beta^{\rm T}\mathbf{X}^{\rm T}\mathbf{X}\boldsymbol\beta.

Here (\boldsymbol\beta^{\rm T}\mathbf{X}^{\rm T}\mathbf{y})^{\rm T} = \mathbf{y}^{\rm T}\mathbf{X}\boldsymbol\beta has dimension 1×1 (the number of columns of \mathbf{y}), so it is a scalar and equal to its own transpose, hence \boldsymbol\beta^{\rm T}\mathbf{X}^{\rm T}\mathbf{y} = \mathbf{y}^{\rm T}\mathbf{X}\boldsymbol\beta and the quantity to minimize becomes

:S(\boldsymbol\beta) = \mathbf{y}^{\rm T}\mathbf{y} - 2\boldsymbol\beta^{\rm T}\mathbf{X}^{\rm T}\mathbf{y} + \boldsymbol\beta^{\rm T}\mathbf{X}^{\rm T}\mathbf{X}\boldsymbol\beta.

Differentiating this with respect to \boldsymbol\beta and equating to zero to satisfy the first-order conditions gives

:-\mathbf{X}^{\rm T}\mathbf{y} + (\mathbf{X}^{\rm T}\mathbf{X})\boldsymbol\beta = 0,

which is equivalent to the above-given normal equations. A sufficient condition for satisfaction of the second-order conditions for a minimum is that \mathbf{X} have full column rank, in which case \mathbf{X}^{\rm T}\mathbf{X} is positive definite.
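The matrix form of the gradient, -2\mathbf{X}^{\rm T}\mathbf{y} + 2\mathbf{X}^{\rm T}\mathbf{X}\boldsymbol\beta, can be checked against central finite differences. A minimal sketch, assuming NumPy and an arbitrary synthetic problem:

```python
import numpy as np

rng = np.random.default_rng(1)
m, n = 30, 4
X = rng.standard_normal((m, n))
y = rng.standard_normal(m)
beta = rng.standard_normal(n)          # an arbitrary evaluation point

S = lambda b: float((y - X @ b) @ (y - X @ b))

# Analytic gradient from the derivation: dS/dbeta = -2 X^T y + 2 X^T X beta
grad_analytic = -2 * X.T @ y + 2 * X.T @ X @ beta

# Central finite differences as an independent check
eps = 1e-6
grad_numeric = np.array([
    (S(beta + eps * e) - S(beta - eps * e)) / (2 * eps)
    for e in np.eye(n)
])
print(np.allclose(grad_analytic, grad_numeric, atol=1e-4))  # True
```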


Derivation without calculus

When \mathbf{X}^{\rm T}\mathbf{X} is positive definite, the formula for the minimizing value of \boldsymbol\beta can be derived without the use of derivatives. The quantity

:S(\boldsymbol\beta) = \mathbf{y}^{\rm T}\mathbf{y} - 2\boldsymbol\beta^{\rm T}\mathbf{X}^{\rm T}\mathbf{y} + \boldsymbol\beta^{\rm T}\mathbf{X}^{\rm T}\mathbf{X}\boldsymbol\beta

can be written as

:\langle \boldsymbol\beta, \boldsymbol\beta \rangle - 2\langle \boldsymbol\beta, (\mathbf{X}^{\rm T}\mathbf{X})^{-1}\mathbf{X}^{\rm T}\mathbf{y} \rangle + \langle (\mathbf{X}^{\rm T}\mathbf{X})^{-1}\mathbf{X}^{\rm T}\mathbf{y},\,(\mathbf{X}^{\rm T}\mathbf{X})^{-1}\mathbf{X}^{\rm T}\mathbf{y} \rangle + C,

where C depends only on \mathbf{y} and \mathbf{X}, and \langle \cdot, \cdot \rangle is the inner product defined by

:\langle x, y \rangle = x^{\rm T}(\mathbf{X}^{\rm T}\mathbf{X})\,y.

It follows that S(\boldsymbol\beta) is equal to

:\langle \boldsymbol\beta - (\mathbf{X}^{\rm T}\mathbf{X})^{-1}\mathbf{X}^{\rm T}\mathbf{y},\ \boldsymbol\beta - (\mathbf{X}^{\rm T}\mathbf{X})^{-1}\mathbf{X}^{\rm T}\mathbf{y} \rangle + C

and therefore minimized exactly when

:\boldsymbol\beta - (\mathbf{X}^{\rm T}\mathbf{X})^{-1}\mathbf{X}^{\rm T}\mathbf{y} = 0.
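The completing-the-square identity S(\boldsymbol\beta) = \langle \boldsymbol\beta - \widehat\beta, \boldsymbol\beta - \widehat\beta \rangle + C, with C = S(\widehat\beta), is easy to confirm numerically; a sketch with NumPy (synthetic data, illustrative sizes):

```python
import numpy as np

rng = np.random.default_rng(2)
m, n = 40, 3
X = rng.standard_normal((m, n))
y = rng.standard_normal(m)
A = X.T @ X                                 # the matrix defining the inner product
beta_hat = np.linalg.solve(A, X.T @ y)

S = lambda b: float((y - X @ b) @ (y - X @ b))

# S(beta) = <beta - beta_hat, beta - beta_hat>_A + C, with C = S(beta_hat)
for _ in range(5):
    beta = rng.standard_normal(n)           # arbitrary test points
    d = beta - beta_hat
    assert np.isclose(S(beta), d @ A @ d + S(beta_hat))
print("completing-the-square identity holds")
```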


Generalization for complex equations

In general, the coefficients of the matrices \mathbf{X}, \boldsymbol\beta and \mathbf{y} can be complex. By using a Hermitian transpose instead of a simple transpose, it is possible to find a vector \widehat{\boldsymbol\beta} which minimizes S(\boldsymbol\beta), just as for the real matrix case. In order to get the normal equations we follow a similar path as in previous derivations:

:\displaystyle S(\boldsymbol\beta) = \langle \mathbf{y} - \mathbf{X}\boldsymbol\beta,\ \mathbf{y} - \mathbf{X}\boldsymbol\beta \rangle = \langle \mathbf{y}, \mathbf{y} \rangle - \overline{\langle \mathbf{X}\boldsymbol\beta, \mathbf{y} \rangle} - \langle \mathbf{X}\boldsymbol\beta, \mathbf{y} \rangle + \langle \mathbf{X}\boldsymbol\beta, \mathbf{X}\boldsymbol\beta \rangle = \mathbf{y}^{\rm T}\overline{\mathbf{y}} - \boldsymbol\beta^\dagger \mathbf{X}^\dagger \mathbf{y} - \mathbf{y}^\dagger \mathbf{X}\boldsymbol\beta + \boldsymbol\beta^{\rm T}\mathbf{X}^{\rm T}\overline{\mathbf{X}}\,\overline{\boldsymbol\beta},

where \dagger stands for Hermitian transpose. We should now take derivatives of S(\boldsymbol\beta) with respect to each of the coefficients \beta_j, but first we separate real and imaginary parts to deal with the conjugate factors in the above expression. For the \beta_j we have

:\beta_j = \beta_j^R + i\beta_j^I

and the derivatives change into

:\frac{\partial S}{\partial \beta_j} = \frac{\partial S}{\partial \beta_j^R}\frac{\partial \beta_j^R}{\partial \beta_j} + \frac{\partial S}{\partial \beta_j^I}\frac{\partial \beta_j^I}{\partial \beta_j} = \frac{\partial S}{\partial \beta_j^R} - i\frac{\partial S}{\partial \beta_j^I} \qquad (j=1,2,3,\ldots,n).

After rewriting S(\boldsymbol\beta) in the summation form and writing \beta_j explicitly, we can calculate both partial derivatives with result:

:\begin{align}
\frac{\partial S}{\partial \beta_j^R} = {} & -\sum_{i=1}^m \Big(\overline{X}_{ij} y_i + \overline{y}_i X_{ij}\Big) + 2\sum_{i=1}^m X_{ij}\overline{X}_{ij}\beta_j^R + \sum_{i=1}^m \sum_{k\neq j}^n \Big(X_{ij}\overline{X}_{ik}\overline{\beta}_k + \beta_k X_{ik}\overline{X}_{ij}\Big), \\
& -i\frac{\partial S}{\partial \beta_j^I} = \sum_{i=1}^m \Big(\overline{X}_{ij} y_i - \overline{y}_i X_{ij}\Big) - 2i\sum_{i=1}^m X_{ij}\overline{X}_{ij}\beta_j^I + \sum_{i=1}^m \sum_{k\neq j}^n \Big(X_{ij}\overline{X}_{ik}\overline{\beta}_k - \beta_k X_{ik}\overline{X}_{ij}\Big),
\end{align}

which, after adding them together and comparing to zero (the minimization condition for \widehat{\boldsymbol\beta}) yields

:\sum_{i=1}^m X_{ij}\overline{y}_i = \sum_{i=1}^m \sum_{k=1}^n X_{ij}\overline{X}_{ik}\overline{\widehat\beta}_k \qquad (j=1,2,3,\ldots,n).

In matrix form:

:\mathbf{X}^{\rm T}\overline{\mathbf{y}} = \mathbf{X}^{\rm T}\overline{\big(\mathbf{X}\widehat{\boldsymbol\beta}\big)} \quad \text{or} \quad \big(\mathbf{X}^\dagger\mathbf{X}\big)\widehat{\boldsymbol\beta} = \mathbf{X}^\dagger\mathbf{y}.
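The complex-valued normal equations \big(\mathbf{X}^\dagger\mathbf{X}\big)\widehat{\boldsymbol\beta} = \mathbf{X}^\dagger\mathbf{y} can be exercised numerically as well; a sketch assuming NumPy, with an arbitrary complex design matrix and illustrative coefficients:

```python
import numpy as np

rng = np.random.default_rng(3)
m, n = 20, 3
X = rng.standard_normal((m, n)) + 1j * rng.standard_normal((m, n))
beta_true = np.array([1 + 2j, -1j, 0.5])    # arbitrary complex coefficients
y = X @ beta_true + 0.01 * (rng.standard_normal(m) + 1j * rng.standard_normal(m))

# Normal equations with the Hermitian transpose: (X^dagger X) beta = X^dagger y
Xh = X.conj().T
beta_hat = np.linalg.solve(Xh @ X, Xh @ y)

# Same minimizer as NumPy's complex least squares
print(np.allclose(beta_hat, np.linalg.lstsq(X, y, rcond=None)[0]))  # True
```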


Least squares estimator for ''β''

Using matrix notation, the sum of squared residuals is given by

:S(\beta) = (y - X\beta)^T(y - X\beta).

Since this is a quadratic expression, the vector which gives the global minimum may be found via matrix calculus by differentiating with respect to the vector \beta (using denominator layout) and setting equal to zero:

:0 = \frac{dS}{d\beta}(\widehat\beta) = \frac{d}{d\beta}\bigg(y^Ty - \beta^TX^Ty - y^TX\beta + \beta^TX^TX\beta\bigg)\bigg|_{\beta=\widehat\beta} = -2X^Ty + 2X^TX\widehat\beta

By assumption matrix ''X'' has full column rank, and therefore ''X''<sup>T</sup>''X'' is invertible and the least squares estimator for ''β'' is given by

:\widehat\beta = (X^TX)^{-1}X^Ty
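The closed form \widehat\beta = (X^TX)^{-1}X^Ty can be reproduced directly; in numerical practice one solves the linear system or uses an orthogonal factorization rather than forming the inverse explicitly. A sketch (NumPy, synthetic data):

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.standard_normal((100, 5))
y = rng.standard_normal(100)

beta_textbook = np.linalg.inv(X.T @ X) @ X.T @ y   # literal textbook formula
beta_solve = np.linalg.solve(X.T @ X, X.T @ y)     # solve the system instead
beta_lstsq = np.linalg.lstsq(X, y, rcond=None)[0]  # SVD-based least squares

print(np.allclose(beta_textbook, beta_lstsq), np.allclose(beta_solve, beta_lstsq))
```

All three agree here because X is well conditioned; the factorization-based routes are preferred when X^TX is nearly singular.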


Unbiasedness and variance of \widehat\beta

Plug ''y'' = ''Xβ'' + ''ε'' into the formula for \widehat\beta and then use the law of total expectation:

:\begin{align}
\operatorname{E}[\widehat\beta] &= \operatorname{E}\Big[(X^TX)^{-1}X^T(X\beta + \varepsilon)\Big] \\
&= \beta + \operatorname{E}\Big[(X^TX)^{-1}X^T\varepsilon\Big] \\
&= \beta + \operatorname{E}\Big[\operatorname{E}\big[(X^TX)^{-1}X^T\varepsilon \mid X\big]\Big] \\
&= \beta + \operatorname{E}\Big[(X^TX)^{-1}X^T\operatorname{E}[\varepsilon \mid X]\Big] \\
&= \beta,
\end{align}

where E[''ε'' | ''X''] = 0 by assumptions of the model. Since the expected value of \widehat\beta equals the parameter it estimates, \beta, it is an unbiased estimator of \beta.

For the variance, let the covariance matrix of \varepsilon be \operatorname{E}[\varepsilon\varepsilon^T] = \sigma^2 I (where I is the m \times m identity matrix), and let X be a known constant. Then,

:\begin{align}
\operatorname{E}[(\widehat\beta - \beta)(\widehat\beta - \beta)^T] &= \operatorname{E}\Big[\big((X^TX)^{-1}X^T\varepsilon\big)\big((X^TX)^{-1}X^T\varepsilon\big)^T\Big] \\
&= \operatorname{E}\Big[(X^TX)^{-1}X^T\varepsilon\varepsilon^TX(X^TX)^{-1}\Big] \\
&= (X^TX)^{-1}X^T\operatorname{E}\big[\varepsilon\varepsilon^T\big]X(X^TX)^{-1} \\
&= (X^TX)^{-1}X^T\sigma^2 I\, X(X^TX)^{-1} \\
&= \sigma^2(X^TX)^{-1}X^TX(X^TX)^{-1} \\
&= \sigma^2(X^TX)^{-1},
\end{align}

where we used the fact that \widehat\beta - \beta is just an affine transformation of \varepsilon by the matrix (X^TX)^{-1}X^T.

For a simple linear regression model, where \beta = [\beta_0, \beta_1]^T (\beta_0 is the ''y''-intercept and \beta_1 is the slope), one obtains

:\begin{align}
\sigma^2(X^TX)^{-1} &= \sigma^2\left(\begin{pmatrix} 1 & 1 & \cdots \\ x_1 & x_2 & \cdots \end{pmatrix}\begin{pmatrix} 1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \end{pmatrix}\right)^{-1} \\
&= \sigma^2\left(\sum_{i=1}^m \begin{pmatrix} 1 & x_i \\ x_i & x_i^2 \end{pmatrix}\right)^{-1} \\
&= \sigma^2\begin{pmatrix} m & \sum x_i \\ \sum x_i & \sum x_i^2 \end{pmatrix}^{-1} \\
&= \sigma^2 \cdot \frac{1}{m\sum x_i^2 - \left(\sum x_i\right)^2}\begin{pmatrix} \sum x_i^2 & -\sum x_i \\ -\sum x_i & m \end{pmatrix} \\
&= \sigma^2 \cdot \frac{1}{m\sum (x_i - \bar{x})^2}\begin{pmatrix} \sum x_i^2 & -\sum x_i \\ -\sum x_i & m \end{pmatrix} \\
\operatorname{Var}(\widehat\beta_1) &= \frac{\sigma^2}{\sum_{i=1}^m (x_i - \bar{x})^2}.
\end{align}
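The variance formula \operatorname{Var}(\widehat\beta_1) = \sigma^2 / \sum (x_i - \bar{x})^2 can be checked by Monte Carlo, holding X fixed and redrawing the errors; a sketch assuming NumPy (sample size, σ, and coefficients are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(5)
m, sigma = 25, 0.7
x = rng.standard_normal(m)
X = np.column_stack([np.ones(m), x])    # intercept and slope columns
beta = np.array([2.0, -1.0])            # arbitrary true coefficients

# Many error draws with X held fixed; fit every replication at once
reps = 20000
Y = (X @ beta)[:, None] + sigma * rng.standard_normal((m, reps))
B = np.linalg.lstsq(X, Y, rcond=None)[0]   # shape (2, reps)
slopes = B[1]

# Theoretical Var(beta1_hat) = sigma^2 / sum((x_i - xbar)^2)
var_theory = sigma**2 / np.sum((x - x.mean())**2)
print(slopes.var(), var_theory)   # agree up to Monte Carlo error
```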


Expected value and biasedness of \widehat\sigma^

First we will plug the expression for ''y'' into the estimator, and use the fact that ''X''′''M'' = ''MX'' = 0 (matrix ''M'' projects onto the space orthogonal to ''X''):

:\widehat\sigma^2 = \tfrac{1}{n} y'My = \tfrac{1}{n}(X\beta + \varepsilon)'M(X\beta + \varepsilon) = \tfrac{1}{n}\varepsilon'M\varepsilon

Now we can recognize ''ε''′''Mε'' as a 1×1 matrix; such a matrix is equal to its own trace. This is useful because by properties of the trace operator, tr(''AB'') = tr(''BA''), and we can use this to separate the disturbance ''ε'' from the matrix ''M'', which is a function of the regressors ''X'':

:\operatorname{E}\,\widehat\sigma^2 = \tfrac{1}{n}\operatorname{E}\big[\operatorname{tr}(\varepsilon'M\varepsilon)\big] = \tfrac{1}{n}\operatorname{E}\big[\operatorname{tr}(M\varepsilon\varepsilon')\big]

Using the law of iterated expectation this can be written as

:\operatorname{E}\,\widehat\sigma^2 = \tfrac{1}{n}\operatorname{tr}\Big(\operatorname{E}\big[M\,\operatorname{E}[\varepsilon\varepsilon' \mid X]\big]\Big) = \tfrac{1}{n}\operatorname{tr}\big(\operatorname{E}[\sigma^2 MI]\big) = \tfrac{1}{n}\sigma^2\operatorname{E}\big[\operatorname{tr}\,M\big]

Recall that ''M'' = ''I'' − ''P'', where ''P'' is the projection onto the linear space spanned by the columns of matrix ''X''. By properties of a projection matrix, it has ''p'' = rank(''X'') eigenvalues equal to 1, and all other eigenvalues are equal to 0. The trace of a matrix is equal to the sum of its characteristic values, thus tr(''P'') = ''p'', and tr(''M'') = ''n'' − ''p''. Therefore,

:\operatorname{E}\,\widehat\sigma^2 = \frac{n-p}{n}\sigma^2

Since the expected value of \widehat\sigma^2 does not equal the parameter it estimates, \sigma^2, it is a biased estimator of \sigma^2. Note that in the later section “Maximum likelihood” we show that under the additional assumption that errors are distributed normally, the estimator \widehat\sigma^2 is proportional to a chi-squared distribution with ''n'' – ''p'' degrees of freedom, from which the formula for the expected value would immediately follow. However, the result we have shown in this section is valid regardless of the distribution of the errors, and thus has importance on its own.
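The bias factor (''n'' − ''p'')/''n'' can likewise be seen in simulation; a sketch assuming NumPy, with arbitrary choices of ''n'', ''p'', and σ:

```python
import numpy as np

rng = np.random.default_rng(6)
n, p, sigma = 12, 3, 1.5
X = rng.standard_normal((n, p))
beta = rng.standard_normal(p)

reps = 40000
Y = (X @ beta)[:, None] + sigma * rng.standard_normal((n, reps))
B = np.linalg.lstsq(X, Y, rcond=None)[0]       # OLS fit per replication
resid = Y - X @ B
sigma2_hat = (resid**2).sum(axis=0) / n        # divides by n, hence biased

print(sigma2_hat.mean(), (n - p) / n * sigma**2)   # close
```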


Consistency and asymptotic normality of \widehat\beta

Estimator \widehat\beta can be written as

:\widehat\beta = \big(\tfrac{1}{n}X'X\big)^{-1}\tfrac{1}{n}X'y = \beta + \big(\tfrac{1}{n}X'X\big)^{-1}\tfrac{1}{n}X'\varepsilon = \beta\; + \;\bigg(\frac{1}{n}\sum_{i=1}^n x_ix'_i\bigg)^{\!-1}\bigg(\frac{1}{n}\sum_{i=1}^n x_i\varepsilon_i\bigg)

We can use the law of large numbers to establish that

:\frac{1}{n}\sum_{i=1}^n x_ix'_i\ \xrightarrow{p}\ \operatorname{E}[x_ix_i'] = Q_{xx}, \qquad \frac{1}{n}\sum_{i=1}^n x_i\varepsilon_i\ \xrightarrow{p}\ \operatorname{E}[x_i\varepsilon_i] = 0

By Slutsky's theorem and the continuous mapping theorem these results can be combined to establish consistency of the estimator \widehat\beta:

:\widehat\beta\ \xrightarrow{p}\ \beta + Q_{xx}^{-1}\cdot 0 = \beta

The central limit theorem tells us that

:\frac{1}{\sqrt{n}}\sum_{i=1}^n x_i\varepsilon_i\ \xrightarrow{d}\ \mathcal{N}\big(0,\,V\big), \quad \text{where } V = \operatorname{Var}[x_i\varepsilon_i] = \operatorname{E}[\varepsilon_i^2 x_ix'_i] = \operatorname{E}\big[\operatorname{E}[\varepsilon_i^2 \mid x_i]\,x_ix'_i\big] = \sigma^2 Q_{xx}

Applying Slutsky's theorem again we'll have

:\sqrt{n}(\widehat\beta - \beta) = \bigg(\frac{1}{n}\sum_{i=1}^n x_ix'_i\bigg)^{\!-1}\bigg(\frac{1}{\sqrt{n}}\sum_{i=1}^n x_i\varepsilon_i\bigg)\ \xrightarrow{d}\ Q_{xx}^{-1}\cdot\mathcal{N}\big(0, \sigma^2 Q_{xx}\big) = \mathcal{N}\big(0, \sigma^2 Q_{xx}^{-1}\big)
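The limit \sqrt{n}(\widehat\beta - \beta) \xrightarrow{d} \mathcal{N}(0, \sigma^2 Q_{xx}^{-1}) can be illustrated by simulation; a sketch assuming NumPy, with an arbitrary choice of Q_{xx} for the toy design:

```python
import numpy as np

rng = np.random.default_rng(7)
n, sigma = 500, 1.0
Qxx = np.array([[1.0, 0.3], [0.3, 2.0]])   # assumed E[x_i x_i'] for this example
L = np.linalg.cholesky(Qxx)
beta = np.array([1.0, -1.0])

reps = 3000
devs = np.empty((reps, 2))
for r in range(reps):
    x = rng.standard_normal((n, 2)) @ L.T          # regressors with E[x x'] = Qxx
    y = x @ beta + sigma * rng.standard_normal(n)
    beta_hat = np.linalg.lstsq(x, y, rcond=None)[0]
    devs[r] = np.sqrt(n) * (beta_hat - beta)

# Sample covariance of sqrt(n)(beta_hat - beta) vs. the limit sigma^2 Qxx^{-1}
print(np.cov(devs.T))
print(sigma**2 * np.linalg.inv(Qxx))
```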


Maximum likelihood approach

Maximum likelihood estimation is a generic technique for estimating the unknown parameters in a statistical model by constructing a log-likelihood function corresponding to the joint distribution of the data, then maximizing this function over all possible parameter values. In order to apply this method, we have to make an assumption about the distribution of y given X so that the log-likelihood function can be constructed. The connection of maximum likelihood estimation to OLS arises when this distribution is modeled as a multivariate normal.

Specifically, assume that the errors ε have multivariate normal distribution with mean 0 and variance matrix ''σ''<sup>2</sup>''I''. Then the distribution of ''y'' conditionally on ''X'' is

:y \mid X\ \sim\ \mathcal{N}(X\beta,\ \sigma^2 I)

and the log-likelihood function of the data will be

:\begin{align}
\mathcal{L}(\beta, \sigma^2 \mid X) &= \ln\bigg(\frac{1}{(2\pi)^{n/2}(\sigma^2)^{n/2}}\, e^{-\frac{1}{2\sigma^2}(y - X\beta)'(y - X\beta)}\bigg) \\
&= -\frac{n}{2}\ln 2\pi - \frac{n}{2}\ln\sigma^2 - \frac{1}{2\sigma^2}(y - X\beta)'(y - X\beta)
\end{align}

Differentiating this expression with respect to ''β'' and ''σ''<sup>2</sup> we'll find the ML estimates of these parameters:

:\begin{align}
\frac{\partial\mathcal{L}}{\partial\beta} &= -\frac{1}{2\sigma^2}\Big(-2X'y + 2X'X\beta\Big) = 0 \quad\Rightarrow\quad \widehat\beta = (X'X)^{-1}X'y \\
\frac{\partial\mathcal{L}}{\partial\sigma^2} &= -\frac{n}{2}\frac{1}{\sigma^2} + \frac{1}{2\sigma^4}(y - X\beta)'(y - X\beta) = 0 \quad\Rightarrow\quad \widehat\sigma^2 = \frac{1}{n}(y - X\widehat\beta)'(y - X\widehat\beta) = \frac{1}{n}S(\widehat\beta)
\end{align}

We can check that this is indeed a maximum by looking at the Hessian matrix of the log-likelihood function.
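That the closed-form estimates maximize the likelihood (rather than merely zeroing its gradient) can be spot-checked numerically; a sketch assuming NumPy, comparing the log-likelihood at (\widehat\beta, \widehat\sigma^2) against random perturbations of it:

```python
import numpy as np

rng = np.random.default_rng(8)
n, p = 60, 2
X = rng.standard_normal((n, p))
y = X @ np.array([1.0, 2.0]) + rng.standard_normal(n)   # illustrative data

def loglik(b, s2):
    r = y - X @ b
    return -n/2 * np.log(2*np.pi) - n/2 * np.log(s2) - (r @ r) / (2*s2)

beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
sigma2_hat = ((y - X @ beta_hat)**2).sum() / n   # ML divides by n, not n - p

# No perturbation should beat the closed-form ML estimates
best = loglik(beta_hat, sigma2_hat)
for _ in range(200):
    b = beta_hat + 0.1 * rng.standard_normal(p)
    s2 = sigma2_hat * np.exp(0.1 * rng.standard_normal())
    assert loglik(b, s2) <= best
print("closed-form estimates maximize the likelihood")
```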


Finite-sample distribution

Since we have assumed in this section that the distribution of error terms is known to be normal, it becomes possible to derive the explicit expressions for the distributions of estimators \widehat\beta and \widehat\sigma^2:

:\widehat\beta = (X'X)^{-1}X'y = (X'X)^{-1}X'(X\beta + \varepsilon) = \beta + (X'X)^{-1}X'\mathcal{N}(0, \sigma^2 I)

so that by the affine transformation properties of the multivariate normal distribution

:\widehat\beta \mid X\ \sim\ \mathcal{N}(\beta,\ \sigma^2(X'X)^{-1}).

Similarly the distribution of \widehat\sigma^2 follows from

:\begin{align}
\widehat\sigma^2 &= \tfrac{1}{n}\big(y - X(X'X)^{-1}X'y\big)'\big(y - X(X'X)^{-1}X'y\big) \\
&= \tfrac{1}{n}(My)'My \\
&= \tfrac{1}{n}(X\beta + \varepsilon)'M(X\beta + \varepsilon) \\
&= \tfrac{1}{n}\varepsilon'M\varepsilon,
\end{align}

where M = I - X(X'X)^{-1}X' is the symmetric projection matrix onto the subspace orthogonal to ''X'', and thus ''MX'' = ''X''′''M'' = 0. We have argued before that this matrix has rank ''n'' – ''p'', and thus by properties of the chi-squared distribution,

:\tfrac{n}{\sigma^2}\widehat\sigma^2 \mid X = (\varepsilon/\sigma)'M(\varepsilon/\sigma)\ \sim\ \chi^2_{n-p}

Moreover, the estimators \widehat\beta and \widehat\sigma^2 turn out to be independent (conditional on ''X''), a fact which is fundamental for construction of the classical t- and F-tests. The independence can be easily seen from the following: the estimator \widehat\beta represents coefficients of the vector decomposition of \widehat{y} = X\widehat\beta = Py = X\beta + P\varepsilon by the basis of columns of ''X'', and as such \widehat\beta is a function of ''Pε''. At the same time, the estimator \widehat\sigma^2 is the squared norm of the vector ''Mε'' divided by ''n'', and thus this estimator is a function of ''Mε''. Now, the random variables (''Pε'', ''Mε'') are jointly normal as a linear transformation of ''ε'', and they are also uncorrelated because ''PM'' = 0. By properties of the multivariate normal distribution, this means that ''Pε'' and ''Mε'' are independent, and therefore the estimators \widehat\beta and \widehat\sigma^2 will be independent as well.
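Both finite-sample facts – that n\widehat\sigma^2/\sigma^2 \sim \chi^2_{n-p} and that \widehat\beta and \widehat\sigma^2 are independent – can be observed in simulation; a sketch assuming NumPy (small ''n'' and ''p'' chosen arbitrarily):

```python
import numpy as np

rng = np.random.default_rng(9)
n, p, sigma = 10, 3, 1.0
X = rng.standard_normal((n, p))
beta = rng.standard_normal(p)

reps = 20000
Y = (X @ beta)[:, None] + sigma * rng.standard_normal((n, reps))
B = np.linalg.lstsq(X, Y, rcond=None)[0]      # beta_hat per replication
s2 = ((Y - X @ B)**2).sum(axis=0) / n         # sigma2_hat per replication

# n * sigma2_hat / sigma^2 ~ chi^2_{n-p}: mean n - p, variance 2(n - p)
z = n * s2 / sigma**2
print(z.mean(), z.var())                      # near n - p and 2(n - p)

# Independence: components of beta_hat are uncorrelated with sigma2_hat
print(np.corrcoef(B[0], s2)[0, 1])            # near 0
```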


Derivation of simple linear regression estimators

We look for \widehat\alpha and \widehat\beta that minimize the sum of squared errors (SSE):

:\min_{\widehat\alpha,\,\widehat\beta} \,\operatorname{SSE}\left(\widehat\alpha, \widehat\beta\right) \equiv \min_{\widehat\alpha,\,\widehat\beta} \sum_{i=1}^n \left(y_i - \widehat\alpha - \widehat\beta x_i\right)^2

To find a minimum take partial derivatives with respect to \widehat\alpha and \widehat\beta:

:\begin{align}
&\frac{\partial}{\partial\widehat\alpha}\left(\operatorname{SSE}\left(\widehat\alpha, \widehat\beta\right)\right) = -2\sum_{i=1}^n \left(y_i - \widehat\alpha - \widehat\beta x_i\right) = 0 \\
\Rightarrow\ &\sum_{i=1}^n \left(y_i - \widehat\alpha - \widehat\beta x_i\right) = 0 \\
\Rightarrow\ &\sum_{i=1}^n y_i = \sum_{i=1}^n \widehat\alpha + \widehat\beta\sum_{i=1}^n x_i \\
\Rightarrow\ &\sum_{i=1}^n y_i = n\widehat\alpha + \widehat\beta\sum_{i=1}^n x_i \\
\Rightarrow\ &\frac{1}{n}\sum_{i=1}^n y_i = \widehat\alpha + \frac{1}{n}\widehat\beta\sum_{i=1}^n x_i \\
\Rightarrow\ &\bar{y} = \widehat\alpha + \widehat\beta\bar{x}
\end{align}

Before taking the partial derivative with respect to \widehat\beta, substitute the previous result for \widehat\alpha:

:\min_{\widehat\beta} \sum_{i=1}^n \left[y_i - \left(\bar{y} - \widehat\beta\bar{x}\right) - \widehat\beta x_i\right]^2 = \min_{\widehat\beta} \sum_{i=1}^n \left[\left(y_i - \bar{y}\right) - \widehat\beta\left(x_i - \bar{x}\right)\right]^2

Now, take the derivative with respect to \widehat\beta:

:\begin{align}
&\frac{\partial}{\partial\widehat\beta}\left(\operatorname{SSE}\left(\widehat\alpha, \widehat\beta\right)\right) = -2\sum_{i=1}^n \left[\left(y_i - \bar{y}\right) - \widehat\beta\left(x_i - \bar{x}\right)\right]\left(x_i - \bar{x}\right) = 0 \\
\Rightarrow\ &\sum_{i=1}^n \left(y_i - \bar{y}\right)\left(x_i - \bar{x}\right) - \widehat\beta\sum_{i=1}^n \left(x_i - \bar{x}\right)^2 = 0 \\
\Rightarrow\ &\widehat\beta = \frac{\sum_{i=1}^n \left(y_i - \bar{y}\right)\left(x_i - \bar{x}\right)}{\sum_{i=1}^n \left(x_i - \bar{x}\right)^2} = \frac{\widehat{\operatorname{Cov}}(x, y)}{\widehat{\operatorname{Var}}(x)}
\end{align}

And finally substitute \widehat\beta to determine \widehat\alpha:

:\widehat\alpha = \bar{y} - \widehat\beta\bar{x}
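The closed forms for \widehat\alpha and \widehat\beta agree with a generic degree-1 polynomial fit; a sketch assuming NumPy (synthetic data, illustrative coefficients):

```python
import numpy as np

rng = np.random.default_rng(10)
x = rng.uniform(0, 10, 30)
y = 3.0 + 0.5 * x + rng.standard_normal(30)   # arbitrary intercept and slope

xbar, ybar = x.mean(), y.mean()
# beta_hat = sample covariance / sample variance; alpha_hat from ybar = alpha + beta*xbar
beta_hat = np.sum((x - xbar) * (y - ybar)) / np.sum((x - xbar)**2)
alpha_hat = ybar - beta_hat * xbar

# Same result as a generic degree-1 polynomial least squares fit
slope, intercept = np.polyfit(x, y, 1)
print(np.allclose([alpha_hat, beta_hat], [intercept, slope]))  # True
```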