In statistics, principal component regression (PCR) is a regression analysis technique that is based on principal component analysis (PCA). More specifically, PCR is used for estimating the unknown regression coefficients in a standard linear regression model.

In PCR, instead of regressing the dependent variable on the explanatory variables directly, the principal components of the explanatory variables are used as regressors. One typically uses only a subset of all the principal components for regression, making PCR a kind of regularized procedure and also a type of shrinkage estimator. Often the principal components with higher variances (the ones based on eigenvectors corresponding to the higher eigenvalues of the sample variance-covariance matrix of the explanatory variables) are selected as regressors. However, for the purpose of predicting the outcome, the principal components with low variances may also be important, in some cases even more important.

One major use of PCR lies in overcoming the multicollinearity problem which arises when two or more of the explanatory variables are close to being collinear (Dodge, Y. (2003) ''The Oxford Dictionary of Statistical Terms'', OUP). PCR can aptly deal with such situations by excluding some of the low-variance principal components in the regression step. In addition, by usually regressing on only a subset of all the principal components, PCR can result in dimension reduction through substantially lowering the effective number of parameters characterizing the underlying model. This can be particularly useful in settings with high-dimensional covariates. Also, through appropriate selection of the principal components to be used for regression, PCR can lead to efficient prediction of the outcome based on the assumed model.


The principle

The PCR method may be broadly divided into three major steps (a minimal code sketch follows the list):

1. Perform PCA on the observed data matrix for the explanatory variables to obtain the principal components, and then (usually) select a subset, based on some appropriate criteria, of the principal components so obtained for further use.
2. Now regress the observed vector of outcomes on the selected principal components as covariates, using ordinary least squares regression (linear regression) to get a vector of estimated regression coefficients (with dimension equal to the number of selected principal components).
3. Now transform this vector back to the scale of the actual covariates, using the selected PCA loadings (the eigenvectors corresponding to the selected principal components) to get the final PCR estimator (with dimension equal to the total number of covariates) for estimating the regression coefficients characterizing the original model.
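As a concrete illustration, here is a minimal sketch of these three steps in Python with NumPy. It assumes the data matrix and outcome vector are already centered (as required in the next section); the function name and the use of the SVD to obtain the loadings are illustrative choices, not part of the method's definition.

```python
import numpy as np

def pcr_fit(X, y, k):
    """Minimal PCR sketch: X is the centered n x p data matrix, y the
    centered outcome vector, k the number of principal components kept."""
    # Step 1: PCA of X via its singular value decomposition; the first k
    # right singular vectors are the selected PCA loadings.
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    V_k = Vt[:k].T                 # p x k matrix of loadings
    W_k = X @ V_k                  # n x k matrix of principal components

    # Step 2: ordinary least squares regression of y on the k components.
    gamma_hat, *_ = np.linalg.lstsq(W_k, y, rcond=None)

    # Step 3: transform back to the scale of the original covariates.
    return V_k @ gamma_hat         # p-dimensional PCR estimator of beta
```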


Details of the method

Data representation: Let \mathbf{Y}_{n \times 1} = \left(y_1,\ldots,y_n\right)^T denote the vector of observed outcomes and \mathbf{X}_{n \times p} = \left(\mathbf{x}_1,\ldots,\mathbf{x}_n\right)^T denote the corresponding data matrix of observed covariates where n and p denote the size of the observed sample and the number of covariates respectively, with n \geq p. Each of the n rows of \mathbf{X} denotes one set of observations for the p-dimensional covariate and the respective entry of \mathbf{Y} denotes the corresponding observed outcome.

Data pre-processing: Assume that \mathbf{Y} and each of the p columns of \mathbf{X} have already been centered so that all of them have zero empirical means. This centering step is crucial (at least for the columns of \mathbf{X}) since PCR involves the use of PCA on \mathbf{X} and PCA is sensitive to centering of the data.

Underlying model: Following centering, the standard Gauss–Markov linear regression model for \mathbf{Y} on \mathbf{X} can be represented as: \mathbf{Y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}, where \boldsymbol{\beta} \in \mathbb{R}^p denotes the unknown parameter vector of regression coefficients and \boldsymbol{\varepsilon} denotes the vector of random errors with \operatorname{E}\left(\boldsymbol{\varepsilon}\right) = \mathbf{0} and \operatorname{Var}\left(\boldsymbol{\varepsilon}\right) = \sigma^2 I_{n \times n} for some unknown variance parameter \sigma^2 > 0.

Objective: The primary goal is to obtain an efficient estimator \widehat{\boldsymbol{\beta}} for the parameter \boldsymbol{\beta}, based on the data. One frequently used approach for this is ordinary least squares regression which, assuming \mathbf{X} has full column rank, gives the unbiased estimator: \widehat{\boldsymbol{\beta}}_{\mathrm{ols}} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{Y} of \boldsymbol{\beta}. PCR is another technique that may be used for the same purpose of estimating \boldsymbol{\beta}.

PCA step: PCR starts by performing a PCA on the centered data matrix \mathbf{X}. For this, let \mathbf{X} = U \Delta V^T denote the singular value decomposition of \mathbf{X} where \Delta_{p \times p} = \operatorname{diag}\left[\delta_1,\ldots,\delta_p\right] with \delta_1 \geq \cdots \geq \delta_p \geq 0 denoting the non-negative singular values of \mathbf{X}, while the columns of U_{n \times p} = \left[\mathbf{u}_1,\ldots,\mathbf{u}_p\right] and V_{p \times p} = \left[\mathbf{v}_1,\ldots,\mathbf{v}_p\right] are both orthonormal sets of vectors denoting the left and right singular vectors of \mathbf{X} respectively.

The principal components: V \Lambda V^T gives a spectral decomposition of \mathbf{X}^T \mathbf{X} where \Lambda_{p \times p} = \operatorname{diag}\left[\lambda_1,\ldots,\lambda_p\right] = \operatorname{diag}\left[\delta_1^2,\ldots,\delta_p^2\right] = \Delta^2 with \lambda_1 \geq \cdots \geq \lambda_p \geq 0 denoting the non-negative eigenvalues (also known as the principal values) of \mathbf{X}^T \mathbf{X}, while the columns of V denote the corresponding orthonormal set of eigenvectors. Then, \mathbf{X}\mathbf{v}_j and \mathbf{v}_j respectively denote the j^{\text{th}} principal component and the j^{\text{th}} principal component direction (or PCA loading) corresponding to the j^{\text{th}} largest principal value \lambda_j for each j \in \{1,\ldots,p\}.

Derived covariates: For any k \in \{1,\ldots,p\}, let V_k denote the p \times k matrix with orthonormal columns consisting of the first k columns of V. Let W_k = \mathbf{X}V_k = \left[\mathbf{X}\mathbf{v}_1,\ldots,\mathbf{X}\mathbf{v}_k\right] denote the n \times k matrix having the first k principal components as its columns. W_k may be viewed as the data matrix obtained by using the transformed covariates \mathbf{x}_i^k = V_k^T \mathbf{x}_i \in \mathbb{R}^k instead of using the original covariates \mathbf{x}_i \in \mathbb{R}^p for all 1 \leq i \leq n.

The PCR estimator: Let \widehat{\boldsymbol{\gamma}}_k = (W_k^T W_k)^{-1} W_k^T \mathbf{Y} \in \mathbb{R}^k denote the vector of estimated regression coefficients obtained by ordinary least squares regression of the response vector \mathbf{Y} on the data matrix W_k. Then, for any k \in \{1,\ldots,p\}, the final PCR estimator of \boldsymbol{\beta} based on using the first k principal components is given by: \widehat{\boldsymbol{\beta}}_k = V_k \widehat{\boldsymbol{\gamma}}_k \in \mathbb{R}^p.
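These definitions can be checked numerically. The following sketch, using simulated centered data (all variable names are illustrative), verifies that W_k^T W_k = \operatorname{diag}\left(\lambda_1,\ldots,\lambda_k\right) and computes \widehat{\boldsymbol{\gamma}}_k and \widehat{\boldsymbol{\beta}}_k:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, k = 100, 5, 3
X = rng.standard_normal((n, p)); X -= X.mean(axis=0)   # centered covariates
y = rng.standard_normal(n); y -= y.mean()              # centered outcome

U, s, Vt = np.linalg.svd(X, full_matrices=False)       # X = U Delta V^T
V_k = Vt[:k].T                                         # first k columns of V
W_k = X @ V_k                                          # first k principal components

# The components are orthogonal: W_k^T W_k = diag(lambda_1, ..., lambda_k).
print(np.allclose(W_k.T @ W_k, np.diag(s[:k] ** 2)))   # True

gamma_k = np.linalg.solve(W_k.T @ W_k, W_k.T @ y)      # OLS on the components
beta_k = V_k @ gamma_k                                 # the PCR estimator
```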


Fundamental characteristics and applications of the PCR estimator


Two basic properties

The fitting process for obtaining the PCR estimator involves regressing the response vector on the derived data matrix W_k which has orthogonal columns for any k \in \{1,\ldots,p\} since the principal components are mutually orthogonal to each other. Thus in the regression step, performing a multiple linear regression jointly on the k selected principal components as covariates is equivalent to carrying out k independent simple linear regressions (or univariate regressions) separately on each of the k selected principal components as a covariate.

When all the principal components are selected for regression so that k = p, then the PCR estimator is equivalent to the ordinary least squares estimator. Thus, \widehat{\boldsymbol{\beta}}_p = \widehat{\boldsymbol{\beta}}_{\mathrm{ols}}. This is easily seen from the fact that W_p = \mathbf{X}V_p = \mathbf{X}V and also observing that V is an orthogonal matrix.
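Both properties are straightforward to verify numerically; a sketch under a simulated-data setup analogous to the one above:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, k = 100, 5, 3
X = rng.standard_normal((n, p)); X -= X.mean(axis=0)
y = rng.standard_normal(n); y -= y.mean()
_, s, Vt = np.linalg.svd(X, full_matrices=False)
W_k = X @ Vt[:k].T

# Joint OLS on the k components equals k separate univariate regressions,
# because the columns of W_k are mutually orthogonal.
gamma_joint = np.linalg.solve(W_k.T @ W_k, W_k.T @ y)
gamma_univ = np.array([(W_k[:, j] @ y) / (W_k[:, j] @ W_k[:, j]) for j in range(k)])
print(np.allclose(gamma_joint, gamma_univ))            # True

# Using all p components reproduces the ordinary least squares estimator.
V = Vt.T
beta_p = V @ np.linalg.lstsq(X @ V, y, rcond=None)[0]
beta_ols = np.linalg.lstsq(X, y, rcond=None)[0]
print(np.allclose(beta_p, beta_ols))                   # True
```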


Variance reduction

For any k \in \{1,\ldots,p\}, the variance of \widehat{\boldsymbol{\beta}}_k is given by

: \operatorname{Var}(\widehat{\boldsymbol{\beta}}_k) = \sigma^2 \; V_k (W_k^T W_k)^{-1} V_k^T = \sigma^2 \; V_k \; \operatorname{diag}\left(\lambda_1^{-1},\ldots,\lambda_k^{-1}\right) V_k^T = \sigma^2 \sum_{j=1}^k \frac{\mathbf{v}_j \mathbf{v}_j^T}{\lambda_j}.

In particular:

: \operatorname{Var}(\widehat{\boldsymbol{\beta}}_p) = \operatorname{Var}(\widehat{\boldsymbol{\beta}}_{\mathrm{ols}}) = \sigma^2 \sum_{j=1}^p \frac{\mathbf{v}_j \mathbf{v}_j^T}{\lambda_j}.

Hence for all k \in \{1,\ldots,p\} we have:

: \operatorname{Var}(\widehat{\boldsymbol{\beta}}_{\mathrm{ols}}) - \operatorname{Var}(\widehat{\boldsymbol{\beta}}_k) = \sigma^2 \sum_{j=k+1}^p \frac{\mathbf{v}_j \mathbf{v}_j^T}{\lambda_j}.

Thus, for all k \in \{1,\ldots,p\} we have:

: \operatorname{Var}(\widehat{\boldsymbol{\beta}}_{\mathrm{ols}}) - \operatorname{Var}(\widehat{\boldsymbol{\beta}}_k) \succeq 0

where A \succeq 0 indicates that a square symmetric matrix A is non-negative definite. Consequently, any given linear form of the PCR estimator has a lower variance compared to that of the same linear form of the ordinary least squares estimator.
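A quick numerical confirmation of the variance formula and of the non-negative definiteness of the difference; the value \sigma^2 = 1 below is purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, k = 60, 4, 2
X = rng.standard_normal((n, p)); X -= X.mean(axis=0)
_, s, Vt = np.linalg.svd(X, full_matrices=False)
V, lam = Vt.T, s ** 2
sigma2 = 1.0                                           # illustrative error variance

# Var(beta_hat_k) = sigma^2 * sum_{j<=k} v_j v_j^T / lambda_j.
var_k = sigma2 * sum(np.outer(V[:, j], V[:, j]) / lam[j] for j in range(k))
var_ols = sigma2 * sum(np.outer(V[:, j], V[:, j]) / lam[j] for j in range(p))
print(np.allclose(var_ols, sigma2 * np.linalg.inv(X.T @ X)))    # True

# The difference is non-negative definite: all its eigenvalues are >= 0.
print(np.linalg.eigvalsh(var_ols - var_k).min() >= -1e-10)      # True
```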


Addressing multicollinearity

Under multicollinearity, two or more of the covariates are highly correlated, so that one can be linearly predicted from the others with a non-trivial degree of accuracy. Consequently, the columns of the data matrix \mathbf{X} that correspond to the observations for these covariates tend to become linearly dependent and therefore, \mathbf{X} tends to become rank deficient, losing its full column rank structure. More quantitatively, one or more of the smaller eigenvalues of \mathbf{X}^T\mathbf{X} get(s) very close or become(s) exactly equal to 0 under such situations. The variance expressions above indicate that these small eigenvalues have the maximum inflation effect on the variance of the least squares estimator, thereby destabilizing the estimator significantly when they are close to 0. This issue can be effectively addressed through using a PCR estimator obtained by excluding the principal components corresponding to these small eigenvalues.
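The destabilizing effect is easy to exhibit. In the illustrative simulation below, two covariates are nearly collinear, so one eigenvalue of \mathbf{X}^T\mathbf{X} is close to zero; the summed coefficient variances (the trace of the variance matrix, with \sigma^2 = 1) then explode for OLS but not for a PCR estimator that drops the offending component:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200
x1 = rng.standard_normal(n)
x2 = x1 + 1e-3 * rng.standard_normal(n)     # nearly collinear with x1
X = np.column_stack([x1, x2, rng.standard_normal(n)])
X -= X.mean(axis=0)

_, s, _ = np.linalg.svd(X, full_matrices=False)
lam = s ** 2                                # the smallest eigenvalue is nearly zero
# Trace of the coefficient variance matrix (sigma^2 = 1): OLS keeps every
# 1/lambda_j term; PCR with k = 2 drops the unstable smallest one.
print("OLS:", (1 / lam).sum())              # huge, dominated by 1/lambda_3
print("PCR, k=2:", (1 / lam[:2]).sum())     # moderate
```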


Dimension reduction

PCR may also be used for performing dimension reduction. To see this, let L_k denote any p \times k matrix having orthonormal columns, for any k \in \{1,\ldots,p\}. Suppose now that we want to approximate each of the covariate observations \mathbf{x}_i through the rank k linear transformation L_k \mathbf{z}_i for some \mathbf{z}_i \in \mathbb{R}^k (1 \leq i \leq n). Then, it can be shown that

: \sum_{i=1}^n \left\| \mathbf{x}_i - L_k \mathbf{z}_i \right\|^2

is minimized at L_k = V_k, the matrix with the first k principal component directions as columns, and \mathbf{z}_i = \mathbf{x}_i^k = V_k^T\mathbf{x}_i, the corresponding k dimensional derived covariates. Thus the k dimensional principal components provide the best linear approximation of rank k to the observed data matrix \mathbf{X}. The corresponding reconstruction error is given by:

: \sum_{i=1}^n \left\| \mathbf{x}_i - V_k\mathbf{x}_i^k \right\|^2 = \begin{cases} \sum_{j=k+1}^p \lambda_j & 1 \leqslant k < p \\ 0 & k = p \end{cases}

Thus any potential dimension reduction may be achieved by choosing k, the number of principal components to be used, through appropriate thresholding on the cumulative sum of the eigenvalues of \mathbf{X}^T\mathbf{X}. Since the smaller eigenvalues do not contribute significantly to the cumulative sum, the corresponding principal components may continue to be dropped as long as the desired threshold limit is not exceeded. The same criterion may also be used for addressing the multicollinearity issue whereby the principal components corresponding to the smaller eigenvalues may be ignored as long as the threshold limit is maintained.
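The reconstruction-error identity and the eigenvalue-thresholding rule can both be sketched in a few lines; the 95% threshold below is an arbitrary illustrative choice:

```python
import numpy as np

rng = np.random.default_rng(3)
n, p, k = 50, 6, 3
X = rng.standard_normal((n, p)); X -= X.mean(axis=0)
_, s, Vt = np.linalg.svd(X, full_matrices=False)
lam = s ** 2

# Rank-k reconstruction error equals the sum of the trailing eigenvalues.
V_k = Vt[:k].T
err = ((X - X @ V_k @ V_k.T) ** 2).sum()
print(np.isclose(err, lam[k:].sum()))                  # True

# Threshold rule: keep the smallest k whose components retain, say, 95%
# of the cumulative eigenvalue sum.
ratio = np.cumsum(lam) / lam.sum()
k_chosen = int(np.searchsorted(ratio, 0.95)) + 1       # smallest such k
```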


Regularization effect

Since the PCR estimator typically uses only a subset of all the principal components for regression, it can be viewed as some sort of a regularized procedure. More specifically, for any 1 \leqslant k < p, the PCR estimator \widehat{\boldsymbol{\beta}}_k denotes the regularized solution to the following constrained minimization problem:

: \min_{\boldsymbol{\beta}_*} \left\| \mathbf{Y} - \mathbf{X}\boldsymbol{\beta}_* \right\|^2 \quad \text{subject to} \quad \boldsymbol{\beta}_* \perp \{\mathbf{v}_{k+1},\ldots,\mathbf{v}_p\}.

The constraint may be equivalently written as:

: V_{(p-k)}^T\boldsymbol{\beta}_* = \mathbf{0}, \quad \text{where} \quad V_{(p-k)} = \left[\mathbf{v}_{k+1},\ldots,\mathbf{v}_p\right].

Thus, when only a proper subset of all the principal components are selected for regression, the PCR estimator so obtained is based on a hard form of regularization that constrains the resulting solution to the column space of the selected principal component directions, and consequently restricts it to be orthogonal to the excluded directions.
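The hard constraint is easy to confirm numerically: the fitted PCR coefficient vector is exactly orthogonal to the excluded loading directions (an illustrative simulation):

```python
import numpy as np

rng = np.random.default_rng(4)
n, p, k = 80, 5, 2
X = rng.standard_normal((n, p)); X -= X.mean(axis=0)
y = rng.standard_normal(n); y -= y.mean()

_, _, Vt = np.linalg.svd(X, full_matrices=False)
V_k, V_rest = Vt[:k].T, Vt[k:].T            # kept and excluded directions
beta_k = V_k @ np.linalg.lstsq(X @ V_k, y, rcond=None)[0]
# The PCR solution satisfies V_{(p-k)}^T beta_k = 0 exactly.
print(np.allclose(V_rest.T @ beta_k, 0))    # True
```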


Optimality of PCR among a class of regularized estimators

Given the constrained minimization problem as defined above, consider the following generalized version of it:

: \min_{\boldsymbol{\beta}_*} \left\| \mathbf{Y} - \mathbf{X}\boldsymbol{\beta}_* \right\|^2 \quad \text{subject to} \quad L_{(p-k)}^T\boldsymbol{\beta}_* = \mathbf{0}

where L_{(p-k)} denotes any full column rank matrix of order p \times (p-k) with 1 \leqslant k < p. Let \widehat{\boldsymbol{\beta}}_L denote the corresponding solution. Thus

: \widehat{\boldsymbol{\beta}}_L = \arg \min_{\boldsymbol{\beta}_*} \left\| \mathbf{Y} - \mathbf{X}\boldsymbol{\beta}_* \right\|^2 \quad \text{subject to} \quad L_{(p-k)}^T\boldsymbol{\beta}_* = \mathbf{0}.

Then the optimal choice of the restriction matrix L_{(p-k)} for which the corresponding estimator \widehat{\boldsymbol{\beta}}_L achieves the minimum prediction error is given by:

: L^*_{(p-k)} = V_{(p-k)} \Lambda_{(p-k)}^{1/2}, \quad \text{where} \quad \Lambda_{(p-k)}^{1/2} = \operatorname{diag}\left(\lambda_{k+1}^{1/2},\ldots,\lambda_p^{1/2}\right).

Quite clearly, the resulting optimal estimator \widehat{\boldsymbol{\beta}}_{L^*} is then simply given by the PCR estimator \widehat{\boldsymbol{\beta}}_k based on the first k principal components.
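The final claim can be verified with a generic constrained least squares solver. In the sketch below (an illustrative simulation; note that rescaling the columns of the restriction matrix does not change the constraint's null space, so the choice of diagonal scaling is immaterial for the resulting estimator), the estimator under L^*_{(p-k)} coincides with the PCR estimator:

```python
import numpy as np

def null_space_basis(A, tol=1e-10):
    """Orthonormal basis of the null space of A, computed via the SVD."""
    _, sv, vh = np.linalg.svd(A)
    rank = int((sv > tol).sum())
    return vh[rank:].T

def constrained_ls(X, y, L):
    """Least squares for beta subject to L^T beta = 0: parametrize
    beta = B z with the columns of B spanning the null space of L^T,
    then solve for z by ordinary least squares."""
    B = null_space_basis(L.T)
    z, *_ = np.linalg.lstsq(X @ B, y, rcond=None)
    return B @ z

rng = np.random.default_rng(6)
n, p, k = 70, 5, 2
X = rng.standard_normal((n, p)); X -= X.mean(axis=0)
y = rng.standard_normal(n); y -= y.mean()
_, s, Vt = np.linalg.svd(X, full_matrices=False)
V_k, V_rest = Vt[:k].T, Vt[k:].T

L_star = V_rest * s[k:]                     # V_{(p-k)} Lambda_{(p-k)}^{1/2}
beta_constrained = constrained_ls(X, y, L_star)
beta_pcr = V_k @ np.linalg.lstsq(X @ V_k, y, rcond=None)[0]
print(np.allclose(beta_constrained, beta_pcr))        # True
```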


Efficiency

Since the ordinary least squares estimator is unbiased for \boldsymbol{\beta}, we have

: \operatorname{MSE}(\widehat{\boldsymbol{\beta}}_{\mathrm{ols}}) = \operatorname{Var}(\widehat{\boldsymbol{\beta}}_{\mathrm{ols}}),

where MSE denotes the mean squared error. Now, if for some k \in \{1,\ldots,p\}, we additionally have: V_{(p-k)}^T\boldsymbol{\beta} = \mathbf{0}, then the corresponding \widehat{\boldsymbol{\beta}}_k is also unbiased for \boldsymbol{\beta} and therefore

: \operatorname{MSE}(\widehat{\boldsymbol{\beta}}_k) = \operatorname{Var}(\widehat{\boldsymbol{\beta}}_k).

We have already seen that

: \forall j \in \{1,\ldots,p\}: \quad \operatorname{Var}(\widehat{\boldsymbol{\beta}}_{\mathrm{ols}}) - \operatorname{Var}(\widehat{\boldsymbol{\beta}}_j) \succeq 0,

which then implies:

: \operatorname{MSE}(\widehat{\boldsymbol{\beta}}_{\mathrm{ols}}) - \operatorname{MSE}(\widehat{\boldsymbol{\beta}}_k) \succeq 0

for that particular k. Thus in that case, the corresponding \widehat{\boldsymbol{\beta}}_k would be a more efficient estimator of \boldsymbol{\beta} compared to \widehat{\boldsymbol{\beta}}_{\mathrm{ols}}, based on using the mean squared error as the performance criterion. In addition, any given linear form of the corresponding \widehat{\boldsymbol{\beta}}_k would also have a lower mean squared error compared to that of the same linear form of \widehat{\boldsymbol{\beta}}_{\mathrm{ols}}.

Now suppose that for a given k \in \{1,\ldots,p\}, V_{(p-k)}^T\boldsymbol{\beta} \neq \mathbf{0}. Then the corresponding \widehat{\boldsymbol{\beta}}_k is biased for \boldsymbol{\beta}. However, since

: \forall k \in \{1,\ldots,p\}: \quad \operatorname{Var}(\widehat{\boldsymbol{\beta}}_{\mathrm{ols}}) - \operatorname{Var}(\widehat{\boldsymbol{\beta}}_k) \succeq 0,

it is still possible that \operatorname{MSE}(\widehat{\boldsymbol{\beta}}_{\mathrm{ols}}) - \operatorname{MSE}(\widehat{\boldsymbol{\beta}}_k) \succeq 0, especially if k is such that the excluded principal components correspond to the smaller eigenvalues, thereby resulting in lower bias.

In order to ensure efficient estimation and prediction performance of PCR as an estimator of \boldsymbol{\beta}, Park (1981) proposes the following guideline for selecting the principal components to be used for regression: Drop the j^{\text{th}} principal component if and only if \lambda_j < (p\sigma^2)/\boldsymbol{\beta}^T\boldsymbol{\beta}. Practical implementation of this guideline of course requires estimates for the unknown model parameters \sigma^2 and \boldsymbol{\beta}. In general, they may be estimated using the unrestricted least squares estimates obtained from the original full model. Park (1981) however provides a slightly modified set of estimates that may be better suited for this purpose.

Unlike the criterion based on the cumulative sum of the eigenvalues of \mathbf{X}^T\mathbf{X}, which is probably more suited for addressing the multicollinearity problem and for performing dimension reduction, the above criterion actually attempts to improve the prediction and estimation efficiency of the PCR estimator by involving both the outcome as well as the covariates in the process of selecting the principal components to be used in the regression step. Alternative approaches with similar goals include selection of the principal components based on cross-validation or the Mallows's Cp criterion. Often, the principal components are also selected based on their degree of association with the outcome.
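A minimal sketch of Park's guideline using simple plug-in estimates from the unrestricted full-model least squares fit; Park's own, slightly modified estimates are not reproduced here:

```python
import numpy as np

def park_selection(X, y):
    """Keep the j-th principal component iff
    lambda_j >= p * sigma^2 / (beta^T beta), with sigma^2 and beta replaced
    by ordinary full-model OLS estimates (an illustrative plug-in choice)."""
    n, p = X.shape
    beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta_ols
    sigma2_hat = (resid @ resid) / (n - p)             # residual variance estimate
    _, s, _ = np.linalg.svd(X, full_matrices=False)
    lam = s ** 2
    threshold = p * sigma2_hat / (beta_ols @ beta_ols)
    return np.flatnonzero(lam >= threshold)            # indices of retained components
```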


Shrinkage effect of PCR

In general, PCR is essentially a shrinkage estimator that usually retains the high variance principal components (corresponding to the higher eigenvalues of \mathbf{X}^T\mathbf{X}) as covariates in the model and discards the remaining low variance components (corresponding to the lower eigenvalues of \mathbf{X}^T\mathbf{X}). Thus it exerts a discrete shrinkage effect on the low variance components, nullifying their contribution completely in the original model. In contrast, the ridge regression estimator exerts a smooth shrinkage effect through the regularization parameter (or the tuning parameter) inherently involved in its construction. While it does not completely discard any of the components, it exerts a shrinkage effect over all of them in a continuous manner so that the extent of shrinkage is higher for the low variance components and lower for the high variance components. Frank and Friedman (1993) conclude that for the purpose of prediction itself, the ridge estimator, owing to its smooth shrinkage effect, is perhaps a better choice compared to the PCR estimator having a discrete shrinkage effect.

In addition, the principal components are obtained from the eigen-decomposition of \mathbf{X}^T\mathbf{X} that involves the observations for the explanatory variables only. Therefore, the resulting PCR estimator obtained from using these principal components as covariates need not necessarily have satisfactory predictive performance for the outcome. A somewhat similar estimator that tries to address this issue through its very construction is the partial least squares (PLS) estimator. Similar to PCR, PLS also uses derived covariates of lower dimensions. However unlike PCR, the derived covariates for PLS are obtained based on using both the outcome as well as the covariates. While PCR seeks the high variance directions in the space of the covariates, PLS seeks the directions in the covariate space that are most useful for the prediction of the outcome.

In 2006, a variant of the classical PCR known as supervised PCR was proposed. In a spirit similar to that of PLS, it attempts at obtaining derived covariates of lower dimensions based on a criterion that involves both the outcome as well as the covariates. The method starts by performing a set of p simple linear regressions (or univariate regressions) wherein the outcome vector is regressed separately on each of the p covariates taken one at a time. Then, for some m \in \{1,\ldots,p\}, the first m covariates that turn out to be the most correlated with the outcome (based on the degree of significance of the corresponding estimated regression coefficients) are selected for further use. A conventional PCR, as described earlier, is then performed, but now it is based on only the n \times m data matrix corresponding to the observations for the selected covariates. The number of covariates used: m \in \{1,\ldots,p\} and the subsequent number of principal components used: k \in \{1,\ldots,m\} are usually selected by cross-validation.
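A sketch of supervised PCR along these lines, where — as one simple illustrative choice of significance measure — covariates are ranked by the absolute value of their univariate correlation with the outcome (ranking by univariate t-statistics would be an equivalent alternative):

```python
import numpy as np

def supervised_pcr(X, y, m, k):
    """Supervised PCR sketch: screen the m covariates most associated with
    the centered outcome y in univariate regressions, then run conventional
    PCR with k components on the reduced n x m data matrix."""
    # Univariate association scores, proportional to the absolute
    # correlation of each centered covariate with the centered outcome.
    scores = np.abs(X.T @ y) / np.linalg.norm(X, axis=0)
    keep = np.argsort(scores)[::-1][:m]                # top-m covariates
    X_m = X[:, keep]

    # Conventional PCR on the selected covariates only.
    _, _, Vt = np.linalg.svd(X_m, full_matrices=False)
    V_k = Vt[:k].T
    gamma, *_ = np.linalg.lstsq(X_m @ V_k, y, rcond=None)
    return keep, V_k @ gamma      # selected indices and their coefficients
```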


Generalization to kernel settings

The classical PCR method as described above is based on classical PCA and considers a linear regression model for predicting the outcome based on the covariates. However, it can be easily generalized to a kernel machine setting whereby the regression function need not necessarily be linear in the covariates, but instead it can belong to the reproducing kernel Hilbert space associated with any arbitrary (possibly non-linear), symmetric positive-definite kernel. The linear regression model turns out to be a special case of this setting when the kernel function is chosen to be the linear kernel.

In general, under the kernel machine setting, the vector of covariates is first mapped into a high-dimensional (potentially infinite-dimensional) feature space characterized by the kernel function chosen. The mapping so obtained is known as the feature map and each of its coordinates, also known as the feature elements, corresponds to one feature (may be linear or non-linear) of the covariates. The regression function is then assumed to be a linear combination of these feature elements. Thus, the underlying regression model in the kernel machine setting is essentially a linear regression model with the understanding that instead of the original set of covariates, the predictors are now given by the vector (potentially infinite-dimensional) of feature elements obtained by transforming the actual covariates using the feature map.

However, the kernel trick actually enables us to operate in the feature space without ever explicitly computing the feature map. It turns out that it is only sufficient to compute the pairwise inner products among the feature maps for the observed covariate vectors, and these inner products are simply given by the values of the kernel function evaluated at the corresponding pairs of covariate vectors. The pairwise inner products so obtained may therefore be represented in the form of a n \times n symmetric non-negative definite matrix, also known as the kernel matrix.

PCR in the kernel machine setting can now be implemented by first appropriately centering this kernel matrix (K, say) with respect to the feature space and then performing a kernel PCA on the centered kernel matrix (K', say) whereby an eigendecomposition of K' is obtained. Kernel PCR then proceeds by (usually) selecting a subset of all the eigenvectors so obtained and then performing a standard linear regression of the outcome vector on these selected eigenvectors. The eigenvectors to be used for regression are usually selected using cross-validation. The estimated regression coefficients (having the same dimension as the number of selected eigenvectors) along with the corresponding selected eigenvectors are then used for predicting the outcome for a future observation. In machine learning, this technique is also known as ''spectral regression''.

Clearly, kernel PCR has a discrete shrinkage effect on the eigenvectors of K', quite similar to the discrete shrinkage effect of classical PCR on the principal components, as discussed earlier. However, the feature map associated with the chosen kernel could potentially be infinite-dimensional, and hence the corresponding principal components and principal component directions could be infinite-dimensional as well. Therefore, these quantities are often practically intractable under the kernel machine setting. Kernel PCR essentially works around this problem by considering an equivalent dual formulation based on using the spectral decomposition of the associated kernel matrix. Under the linear regression model (which corresponds to choosing the kernel function as the linear kernel), this amounts to considering a spectral decomposition of the corresponding n \times n kernel matrix \mathbf{X}\mathbf{X}^T and then regressing the outcome vector on a selected subset of the eigenvectors of \mathbf{X}\mathbf{X}^T so obtained. It can be easily shown that this is the same as regressing the outcome vector on the corresponding principal components (which are finite-dimensional in this case), as defined in the context of the classical PCR. Thus, for the linear kernel, the kernel PCR based on a dual formulation is exactly equivalent to the classical PCR based on a primal formulation. However, for arbitrary (and possibly non-linear) kernels, this primal formulation may become intractable owing to the infinite dimensionality of the associated feature map. Thus classical PCR becomes practically infeasible in that case, but kernel PCR based on the dual formulation still remains valid and computationally scalable.


See also

* Principal component analysis
* Partial least squares regression
* Ridge regression
* Canonical correlation
* Deming regression
* Total sum of squares

