In statistics, principal component regression (PCR) is a regression analysis technique that is based on principal component analysis (PCA). More specifically, PCR is used for estimating the unknown regression coefficients in a standard linear regression model.
In PCR, instead of regressing the dependent variable on the explanatory variables directly, the principal components of the explanatory variables are used as regressors. One typically uses only a subset of all the principal components for regression, making PCR a kind of regularized procedure and also a type of shrinkage estimator.
Often the principal components with higher variances (the ones based on eigenvectors corresponding to the higher eigenvalues of the sample variance-covariance matrix of the explanatory variables) are selected as regressors. However, for the purpose of predicting the outcome, the principal components with low variances may also be important, in some cases even more important.
One major use of PCR lies in overcoming the multicollinearity problem which arises when two or more of the explanatory variables are close to being collinear. [Dodge, Y. (2003) ''The Oxford Dictionary of Statistical Terms'', OUP.] PCR can aptly deal with such situations by excluding some of the low-variance principal components in the regression step. In addition, by usually regressing on only a subset of all the principal components, PCR can result in dimension reduction through substantially lowering the effective number of parameters characterizing the underlying model. This can be particularly useful in settings with high-dimensional covariates. Also, through appropriate selection of the principal components to be used for regression, PCR can lead to efficient prediction of the outcome based on the assumed model.
The principle
The PCR method may be broadly divided into three major steps:
: 1. Perform PCA on the observed data matrix for the explanatory variables to obtain the principal components, and then (usually) select a subset, based on some appropriate criteria, of the principal components so obtained for further use.
: 2. Now regress the observed vector of outcomes on the selected principal components as covariates, using ordinary least squares (OLS) regression to get a vector of estimated regression coefficients (with dimension equal to the number of selected principal components).
: 3. Now transform this vector back to the scale of the actual covariates, using the selected PCA loadings (the eigenvectors corresponding to the selected principal components) to get the final PCR estimator (with dimension equal to the total number of covariates) for estimating the regression coefficients characterizing the original model.
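For concreteness, the three steps above can be sketched in NumPy on synthetic data. This is a minimal illustration, not a reference implementation; the variable names (such as `beta_pcr`) and the choices n = 100, p = 5, k = 2 are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, k = 100, 5, 2              # sample size, covariates, components kept

X = rng.standard_normal((n, p))
X -= X.mean(axis=0)              # center the covariates (crucial for PCA)
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 0.0]) + 0.1 * rng.standard_normal(n)
y -= y.mean()                    # center the outcome

# Step 1: PCA of the centered data matrix via its SVD, X = U diag(s) Vt
U, s, Vt = np.linalg.svd(X, full_matrices=False)
V_k = Vt[:k].T                   # first k PCA loadings (p x k)
W_k = X @ V_k                    # first k principal components (n x k)

# Step 2: OLS regression of the outcome on the selected components
gamma_k, *_ = np.linalg.lstsq(W_k, y, rcond=None)

# Step 3: transform back to the scale of the original covariates
beta_pcr = V_k @ gamma_k         # final PCR estimator (length p)
```

Keeping all p components in step 1 would reproduce the ordinary least squares solution exactly.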
Details of the method
Data representation: Let $\mathbf{Y} \in \mathbb{R}^{n}$ denote the vector of observed outcomes and $\mathbf{X} \in \mathbb{R}^{n \times p}$ denote the corresponding data matrix of observed covariates, where $n$ and $p$ denote the size of the observed sample and the number of covariates respectively, with $n \geq p$. Each of the $n$ rows of $\mathbf{X}$ denotes one set of observations for the $p$-dimensional covariate and the respective entry of $\mathbf{Y}$ denotes the corresponding observed outcome.
Data pre-processing: Assume that $\mathbf{Y}$ and each of the $p$ columns of $\mathbf{X}$ have already been centered so that all of them have zero empirical means. This centering step is crucial (at least for the columns of $\mathbf{X}$) since PCR involves the use of PCA on $\mathbf{X}$, and PCA is sensitive to centering of the data.
Underlying model: Following centering, the standard Gauss–Markov linear regression model for $\mathbf{Y}$ on $\mathbf{X}$ can be represented as:
: $\mathbf{Y} = \mathbf{X}\boldsymbol\beta + \boldsymbol\varepsilon,$
where $\boldsymbol\beta \in \mathbb{R}^{p}$ denotes the unknown parameter vector of regression coefficients and $\boldsymbol\varepsilon$ denotes the vector of random errors with $\operatorname{E}(\boldsymbol\varepsilon) = \mathbf{0}$ and $\operatorname{Var}(\boldsymbol\varepsilon) = \sigma^{2} I_{n}$ for some unknown variance parameter $\sigma^{2} > 0$.
Objective: The primary goal is to obtain an efficient estimator $\widehat{\boldsymbol\beta}$ for the parameter $\boldsymbol\beta$, based on the data. One frequently used approach for this is ordinary least squares regression which, assuming $\mathbf{X}$ has full column rank, gives the unbiased estimator
: $\widehat{\boldsymbol\beta}_{\mathrm{ols}} = (\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T}\mathbf{Y}$
of $\boldsymbol\beta$. PCR is another technique that may be used for the same purpose of estimating $\boldsymbol\beta$.
PCA step: PCR starts by performing a PCA on the centered data matrix $\mathbf{X}$. For this, let $\mathbf{X} = U \Delta V^{T}$ denote the singular value decomposition of $\mathbf{X}$, where $\Delta = \operatorname{diag}(\delta_{1}, \ldots, \delta_{p})$ with $\delta_{1} \geq \cdots \geq \delta_{p} \geq 0$ denoting the non-negative singular values of $\mathbf{X}$, while the columns of $U \in \mathbb{R}^{n \times p}$ and $V \in \mathbb{R}^{p \times p}$ are both orthonormal sets of vectors denoting the left and right singular vectors of $\mathbf{X}$ respectively.
The principal components: $V \Lambda V^{T}$ gives a spectral decomposition of $\mathbf{X}^{T}\mathbf{X}$, where $\Lambda = \operatorname{diag}(\lambda_{1}, \ldots, \lambda_{p}) = \Delta^{2}$ with $\lambda_{1} \geq \cdots \geq \lambda_{p} \geq 0$ denoting the non-negative eigenvalues (also known as the principal values) of $\mathbf{X}^{T}\mathbf{X}$, while the columns of $V$ denote the corresponding orthonormal set of eigenvectors. Then, $\mathbf{X}\mathbf{v}_{j}$ and $\mathbf{v}_{j}$ respectively denote the $j$th principal component and the $j$th principal component direction (or PCA loading) corresponding to the $j$th largest principal value $\lambda_{j}$, for each $j \in \{1, \ldots, p\}$.
Derived covariates: For any $k \in \{1, \ldots, p\}$, let $V_{k}$ denote the $p \times k$ matrix with orthonormal columns consisting of the first $k$ columns of $V$. Let $W_{k} = \mathbf{X} V_{k}$ denote the $n \times k$ matrix having the first $k$ principal components as its columns. $W_{k}$ may be viewed as the data matrix obtained by using the transformed covariates $V_{k}^{T}\mathbf{x}_{i} \in \mathbb{R}^{k}$ instead of using the original covariates $\mathbf{x}_{i} \in \mathbb{R}^{p}$.
The PCR estimator: Let $\widehat{\boldsymbol\gamma}_{k} = (W_{k}^{T} W_{k})^{-1} W_{k}^{T}\mathbf{Y}$ denote the vector of estimated regression coefficients obtained by ordinary least squares regression of the response vector $\mathbf{Y}$ on the data matrix $W_{k}$. Then, for any $k \in \{1, \ldots, p\}$, the final PCR estimator of $\boldsymbol\beta$ based on using the first $k$ principal components is given by:
: $\widehat{\boldsymbol\beta}_{k} = V_{k}\widehat{\boldsymbol\gamma}_{k}.$
Fundamental characteristics and applications of the PCR estimator
Two basic properties
The fitting process for obtaining the PCR estimator involves regressing the response vector on the derived data matrix $W_{k}$, which has orthogonal columns for any $k \in \{1, \ldots, p\}$ since the principal components are mutually orthogonal. Thus, in the regression step, performing a multiple linear regression jointly on the $k$ selected principal components as covariates is equivalent to carrying out $k$ independent simple linear regressions (or univariate regressions) separately on each of the $k$ selected principal components as a covariate.
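This equivalence is easy to verify numerically. The following NumPy sketch (synthetic data, arbitrary choice k = 3) compares the joint multiple regression on the components with k separate univariate fits:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, k = 60, 4, 3
X = rng.standard_normal((n, p))
X -= X.mean(axis=0)
y = rng.standard_normal(n)
y -= y.mean()

_, _, Vt = np.linalg.svd(X, full_matrices=False)
W = X @ Vt[:k].T                       # first k principal components

# Joint multiple regression on all k components at once
gamma_joint, *_ = np.linalg.lstsq(W, y, rcond=None)

# k separate simple (univariate) regressions, one per component
gamma_sep = np.array([(w @ y) / (w @ w) for w in W.T])

# The coefficient vectors coincide because the columns of W are orthogonal
assert np.allclose(gamma_joint, gamma_sep)
```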
When all the principal components are selected for regression, so that $k = p$, the PCR estimator is equivalent to the ordinary least squares estimator: $\widehat{\boldsymbol\beta}_{p} = \widehat{\boldsymbol\beta}_{\mathrm{ols}}$. This is easily seen from the fact that $W_{p} = \mathbf{X} V$ and by observing that $V$ is an orthogonal matrix.
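Likewise, the $k = p$ case can be checked directly with a toy NumPy sketch (synthetic data, arbitrary sizes):

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 50, 4
X = rng.standard_normal((n, p))
X -= X.mean(axis=0)
y = rng.standard_normal(n)
y -= y.mean()

_, _, Vt = np.linalg.svd(X, full_matrices=False)
V = Vt.T
W = X @ V                                  # all p principal components

gamma, *_ = np.linalg.lstsq(W, y, rcond=None)
beta_pcr_full = V @ gamma                  # PCR with k = p
beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)

assert np.allclose(beta_pcr_full, beta_ols)  # PCR with all components = OLS
```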
Variance reduction
For any $k \in \{1, \ldots, p\}$, the variance of $\widehat{\boldsymbol\beta}_{k}$ is given by
: $\operatorname{Var}(\widehat{\boldsymbol\beta}_{k}) = \sigma^{2}\, V_{k} \Lambda_{k}^{-1} V_{k}^{T} = \sigma^{2} \sum_{j=1}^{k} \frac{\mathbf{v}_{j}\mathbf{v}_{j}^{T}}{\delta_{j}^{2}}.$
In particular:
: $\operatorname{Var}(\widehat{\boldsymbol\beta}_{p}) = \operatorname{Var}(\widehat{\boldsymbol\beta}_{\mathrm{ols}}) = \sigma^{2} \sum_{j=1}^{p} \frac{\mathbf{v}_{j}\mathbf{v}_{j}^{T}}{\delta_{j}^{2}}.$
Hence for all $k \in \{1, \ldots, p\}$ we have:
: $\operatorname{Var}(\widehat{\boldsymbol\beta}_{\mathrm{ols}}) - \operatorname{Var}(\widehat{\boldsymbol\beta}_{k}) = \sigma^{2} \sum_{j=k+1}^{p} \frac{\mathbf{v}_{j}\mathbf{v}_{j}^{T}}{\delta_{j}^{2}}.$
Thus, for all $k \in \{1, \ldots, p\}$ we have:
: $\operatorname{Var}(\widehat{\boldsymbol\beta}_{\mathrm{ols}}) - \operatorname{Var}(\widehat{\boldsymbol\beta}_{k}) \succeq 0,$
where $A \succeq 0$ indicates that a square symmetric matrix $A$ is non-negative definite. Consequently, any given linear form of the PCR estimator has a lower variance compared to that of the same linear form of the ordinary least squares estimator.
Addressing multicollinearity
Under multicollinearity, two or more of the covariates are highly correlated, so that one can be linearly predicted from the others with a non-trivial degree of accuracy. Consequently, the columns of the data matrix $\mathbf{X}$ that correspond to the observations for these covariates tend to become linearly dependent, and therefore $\mathbf{X}$ tends to become rank deficient, losing its full column rank structure. More quantitatively, one or more of the smaller eigenvalues of $\mathbf{X}^{T}\mathbf{X}$ get(s) very close to, or become(s) exactly equal to, $0$ under such situations. The variance expressions above indicate that these small eigenvalues have the maximum inflation effect on the variance of the least squares estimator, thereby destabilizing the estimator significantly when they are close to $0$. This issue can be effectively addressed by using a PCR estimator obtained by excluding the principal components corresponding to these small eigenvalues.
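A small simulation illustrates the effect. Two covariates below are nearly collinear, so one eigenvalue of $\mathbf{X}^{T}\mathbf{X}$ is tiny and the OLS variance matrix blows up, while a PCR estimator that drops the offending component stays stable (the noise level $\sigma^{2} = 1$ and the cutoff k = 2 are assumptions of this sketch):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200
z = rng.standard_normal(n)
# Two nearly collinear covariates plus one independent covariate
X = np.column_stack([z, z + 1e-3 * rng.standard_normal(n),
                     rng.standard_normal(n)])
X -= X.mean(axis=0)

sigma2 = 1.0
eigvals = np.linalg.eigvalsh(X.T @ X)        # one eigenvalue is tiny

# Variance of OLS: sigma^2 (X'X)^{-1}; the tiny eigenvalue inflates it
var_ols = sigma2 * np.linalg.inv(X.T @ X)

# Variance of PCR dropping the smallest-eigenvalue component (k = 2)
_, s, Vt = np.linalg.svd(X, full_matrices=False)
V_k = Vt[:2].T
var_pcr = sigma2 * V_k @ np.diag(1.0 / s[:2] ** 2) @ V_k.T

print(np.trace(var_ols), np.trace(var_pcr))  # OLS variance is vastly larger
```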
Dimension reduction
PCR may also be used for performing dimension reduction. To see this, let $L_{k}$ denote any $p \times k$ matrix having orthonormal columns, for any $k \in \{1, \ldots, p\}$. Suppose now that we want to approximate each of the covariate observations $\mathbf{x}_{i}$ through the rank $k$ linear transformation $L_{k}\mathbf{z}_{i}$ for some $\mathbf{z}_{i} \in \mathbb{R}^{k}$ ($1 \leq i \leq n$).
Then, it can be shown that
: $\sum_{i=1}^{n} \|\mathbf{x}_{i} - L_{k}\mathbf{z}_{i}\|^{2}$
is minimized at $L_{k} = V_{k}$, the matrix with the first $k$ principal component directions as columns, and $\mathbf{z}_{i} = V_{k}^{T}\mathbf{x}_{i}$, the corresponding $k$-dimensional derived covariates. Thus the $k$-dimensional principal components provide the best linear approximation of rank $k$ to the observed data matrix $\mathbf{X}$.
The corresponding reconstruction error is given by:
: $\sum_{i=1}^{n} \|\mathbf{x}_{i} - V_{k} V_{k}^{T}\mathbf{x}_{i}\|^{2} = \sum_{j=k+1}^{p} \lambda_{j}.$
Thus any desired dimension reduction may be achieved by choosing $k$, the number of principal components to be used, through appropriate thresholding on the cumulative sum of the eigenvalues of $\mathbf{X}^{T}\mathbf{X}$. Since the smaller eigenvalues do not contribute significantly to the cumulative sum, the corresponding principal components may continue to be dropped as long as the desired threshold limit is not exceeded. The same criteria may also be used for addressing the multicollinearity issue, whereby the principal components corresponding to the smaller eigenvalues may be ignored as long as the threshold limit is maintained.
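Such a thresholding rule is straightforward to implement. The NumPy sketch below picks the smallest k whose components retain at least 95% of the cumulative eigenvalue sum (the 95% target and the synthetic variance profile are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(4)
n, p = 100, 6
# Covariates with rapidly decaying scales, so a few components dominate
X = rng.standard_normal((n, p)) * np.array([3.0, 2.0, 1.0, 0.3, 0.1, 0.05])
X -= X.mean(axis=0)

# Eigenvalues of X'X in decreasing order, and their cumulative fractions
eigvals = np.sort(np.linalg.eigvalsh(X.T @ X))[::-1]
explained = np.cumsum(eigvals) / eigvals.sum()

# Smallest k whose leading components retain at least 95% of the total
k = int(np.searchsorted(explained, 0.95) + 1)
```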
Regularization effect
Since the PCR estimator typically uses only a subset of all the principal components for regression, it can be viewed as a kind of regularized procedure. More specifically, for any $k \in \{1, \ldots, p\}$, the PCR estimator $\widehat{\boldsymbol\beta}_{k}$ denotes the regularized solution to the following constrained minimization problem:
: $\min_{\boldsymbol\beta^{*}} \|\mathbf{Y} - \mathbf{X}\boldsymbol\beta^{*}\|^{2} \quad \text{subject to} \quad \boldsymbol\beta^{*} \in \operatorname{span}(\mathbf{v}_{1}, \ldots, \mathbf{v}_{k}).$
The constraint may be equivalently written as:
: $V_{k} V_{k}^{T} \boldsymbol\beta^{*} = \boldsymbol\beta^{*},$
where $V_{k} V_{k}^{T}$ denotes the matrix of orthogonal projection onto the column space of $V_{k}$.
Thus, when only a proper subset of all the principal components is selected for regression, the PCR estimator so obtained is based on a hard form of regularization that constrains the resulting solution to the column space of the selected principal component directions, and consequently restricts it to be orthogonal to the excluded directions.
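The hard constraint can be verified directly on a fitted PCR estimator: projecting it onto the selected directions leaves it unchanged, and it has zero coordinates along every excluded direction (a toy NumPy check on synthetic data):

```python
import numpy as np

rng = np.random.default_rng(8)
n, p, k = 70, 5, 2
X = rng.standard_normal((n, p))
X -= X.mean(axis=0)
y = rng.standard_normal(n)
y -= y.mean()

_, s, Vt = np.linalg.svd(X, full_matrices=False)
V_k = Vt[:k].T
gamma, *_ = np.linalg.lstsq(X @ V_k, y, rcond=None)
beta_k = V_k @ gamma                  # PCR estimator with k components

# Projecting onto the selected directions leaves the solution unchanged ...
assert np.allclose(V_k @ V_k.T @ beta_k, beta_k)
# ... and it is orthogonal to every excluded principal direction
assert np.allclose(Vt[k:] @ beta_k, 0)
```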
Optimality of PCR among a class of regularized estimators
Given the constrained minimization problem as defined above, consider the following generalized version of it:
: $\min_{\boldsymbol\beta^{*}} \|\mathbf{Y} - \mathbf{X}\boldsymbol\beta^{*}\|^{2} \quad \text{subject to} \quad \boldsymbol\beta^{*} = L\boldsymbol\eta^{*} \text{ for some } \boldsymbol\eta^{*} \in \mathbb{R}^{k},$
where $L$ denotes any full column rank matrix of order $p \times k$ with $1 \leq k \leq p$.
Let $\widehat{\boldsymbol\beta}_{L}$ denote the corresponding solution. Thus
: $\widehat{\boldsymbol\beta}_{L} = L\,(L^{T}\mathbf{X}^{T}\mathbf{X}L)^{-1} L^{T}\mathbf{X}^{T}\mathbf{Y}.$
Then the optimal choice of the restriction matrix $L$, for which the corresponding estimator $\widehat{\boldsymbol\beta}_{L}$ achieves the minimum prediction error
: $\operatorname{E}\,\|\mathbf{X}\boldsymbol\beta - \mathbf{X}\widehat{\boldsymbol\beta}_{L}\|^{2},$
is given by:
: $L^{*} = V_{k}.$
Quite clearly, the resulting optimal estimator $\widehat{\boldsymbol\beta}_{L^{*}}$ is then simply given by the PCR estimator $\widehat{\boldsymbol\beta}_{k}$ based on the first $k$ principal components.
Efficiency
Since the ordinary least squares estimator is unbiased for $\boldsymbol\beta$, we have
: $\operatorname{MSE}(\widehat{\boldsymbol\beta}_{\mathrm{ols}}) = \operatorname{Var}(\widehat{\boldsymbol\beta}_{\mathrm{ols}}),$
where MSE denotes the mean squared error. Now, if for some $k \in \{1, \ldots, p\}$, we additionally have $V_{k} V_{k}^{T}\boldsymbol\beta = \boldsymbol\beta$, then the corresponding $\widehat{\boldsymbol\beta}_{k}$ is also unbiased for $\boldsymbol\beta$ and therefore
: $\operatorname{MSE}(\widehat{\boldsymbol\beta}_{k}) = \operatorname{Var}(\widehat{\boldsymbol\beta}_{k}).$
We have already seen that
: $\operatorname{Var}(\widehat{\boldsymbol\beta}_{\mathrm{ols}}) - \operatorname{Var}(\widehat{\boldsymbol\beta}_{k}) \succeq 0,$
which then implies:
: $\operatorname{MSE}(\widehat{\boldsymbol\beta}_{\mathrm{ols}}) - \operatorname{MSE}(\widehat{\boldsymbol\beta}_{k}) \succeq 0$
for that particular $k$. Thus, in that case, the corresponding $\widehat{\boldsymbol\beta}_{k}$ would be a more efficient estimator of $\boldsymbol\beta$ compared to $\widehat{\boldsymbol\beta}_{\mathrm{ols}}$, based on using the mean squared error as the performance criterion. In addition, any given linear form of the corresponding $\widehat{\boldsymbol\beta}_{k}$ would also have a lower mean squared error compared to that of the same linear form of $\widehat{\boldsymbol\beta}_{\mathrm{ols}}$.
Now suppose that, for a given $k$, $V_{k} V_{k}^{T}\boldsymbol\beta \neq \boldsymbol\beta$. Then the corresponding $\widehat{\boldsymbol\beta}_{k}$ is biased for $\boldsymbol\beta$. However, since
: $\operatorname{MSE}(\widehat{\boldsymbol\beta}_{k}) = \operatorname{Var}(\widehat{\boldsymbol\beta}_{k}) + \operatorname{Bias}(\widehat{\boldsymbol\beta}_{k})\operatorname{Bias}(\widehat{\boldsymbol\beta}_{k})^{T},$
it is still possible that $\operatorname{MSE}(\widehat{\boldsymbol\beta}_{\mathrm{ols}}) - \operatorname{MSE}(\widehat{\boldsymbol\beta}_{k}) \succeq 0$, especially if $k$ is such that the excluded principal components correspond to the smaller eigenvalues, thereby resulting in lower bias.
In order to ensure efficient estimation and prediction performance of PCR as an estimator of $\boldsymbol\beta$, Park (1981) proposes the following guideline for selecting the principal components to be used for regression: drop the $j$th principal component if and only if $\lambda_{j} < p\sigma^{2}/\boldsymbol\beta^{T}\boldsymbol\beta$. Practical implementation of this guideline of course requires estimates for the unknown model parameters $\sigma^{2}$ and $\boldsymbol\beta$. In general, they may be estimated using the unrestricted least squares estimates obtained from the original full model. Park (1981), however, provides a slightly modified set of estimates that may be better suited for this purpose.
Unlike the criteria based on the cumulative sum of the eigenvalues of $\mathbf{X}^{T}\mathbf{X}$, which is probably more suited for addressing the multicollinearity problem and for performing dimension reduction, the above criterion actually attempts to improve the prediction and estimation efficiency of the PCR estimator by involving both the outcome as well as the covariates in the process of selecting the principal components to be used in the regression step. Alternative approaches with similar goals include selection of the principal components based on cross-validation or Mallows's Cp criterion. Often, the principal components are also selected based on their degree of association with the outcome.
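A cross-validated choice of k can be coded in a few lines. The sketch below (5 folds, synthetic data, and the helper name `pcr_fit` are all arbitrary choices of this illustration) scores each candidate k by out-of-fold squared error:

```python
import numpy as np

def pcr_fit(X, y, k):
    """PCR estimator using the first k principal components of X."""
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    V_k = Vt[:k].T
    gamma, *_ = np.linalg.lstsq(X @ V_k, y, rcond=None)
    return V_k @ gamma

rng = np.random.default_rng(9)
n, p = 100, 6
X = rng.standard_normal((n, p))
X -= X.mean(axis=0)
y = X @ np.array([2.0, -1.0, 0, 0, 0, 0]) + 0.5 * rng.standard_normal(n)
y -= y.mean()

# 5-fold cross-validation over the number of components k
folds = np.array_split(rng.permutation(n), 5)
cv_err = []
for k in range(1, p + 1):
    err = 0.0
    for test in folds:
        train = np.setdiff1d(np.arange(n), test)
        beta = pcr_fit(X[train], y[train], k)
        err += np.sum((y[test] - X[test] @ beta) ** 2)
    cv_err.append(err / n)
best_k = int(np.argmin(cv_err) + 1)   # k with smallest out-of-fold error
```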
Shrinkage effect of PCR
In general, PCR is essentially a shrinkage estimator that usually retains the high-variance principal components (corresponding to the higher eigenvalues of $\mathbf{X}^{T}\mathbf{X}$) as covariates in the model and discards the remaining low-variance components (corresponding to the lower eigenvalues of $\mathbf{X}^{T}\mathbf{X}$). Thus it exerts a discrete shrinkage effect on the low-variance components, nullifying their contribution completely in the original model. In contrast, the ridge regression estimator exerts a smooth shrinkage effect through the regularization parameter (or the tuning parameter) inherently involved in its construction. While it does not completely discard any of the components, it exerts a shrinkage effect over all of them in a continuous manner, so that the extent of shrinkage is higher for the low-variance components and lower for the high-variance components. Frank and Friedman (1993) conclude that for the purpose of prediction itself, the ridge estimator, owing to its smooth shrinkage effect, is perhaps a better choice compared to the PCR estimator having a discrete shrinkage effect.
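The contrast between the two shrinkage profiles can be made explicit. Along the $j$th principal direction, ridge multiplies the OLS coordinate by the smooth factor $\lambda_{j}/(\lambda_{j}+\alpha)$, while PCR multiplies it by 1 (kept) or 0 (dropped). The NumPy sketch below verifies both factors (the penalty $\alpha = 10$ and k = 3 are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(5)
n, p, k, alpha = 80, 5, 3, 10.0
X = rng.standard_normal((n, p))
X -= X.mean(axis=0)
y = rng.standard_normal(n)
y -= y.mean()

U, s, Vt = np.linalg.svd(X, full_matrices=False)
V, lam = Vt.T, s ** 2                       # loadings and eigenvalues of X'X

beta_ols = V @ np.diag(1 / s) @ (U.T @ y)
beta_ridge = np.linalg.solve(X.T @ X + alpha * np.eye(p), X.T @ y)
beta_pcr = V[:, :k] @ ((U[:, :k].T @ y) / s[:k])

# Ridge shrinks every principal coordinate of the OLS solution smoothly
assert np.allclose(V.T @ beta_ridge, lam / (lam + alpha) * (V.T @ beta_ols))
# PCR keeps the first k coordinates unchanged and zeroes out the rest
assert np.allclose((V.T @ beta_pcr)[:k], (V.T @ beta_ols)[:k])
assert np.allclose((V.T @ beta_pcr)[k:], 0)
```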
In addition, the principal components are obtained from the eigen-decomposition of $\mathbf{X}^{T}\mathbf{X}$, which involves the observations for the explanatory variables only. Therefore, the resulting PCR estimator obtained from using these principal components as covariates need not necessarily have satisfactory predictive performance for the outcome. A somewhat similar estimator that tries to address this issue through its very construction is the partial least squares (PLS) estimator. Similar to PCR, PLS also uses derived covariates of lower dimensions. However, unlike PCR, the derived covariates for PLS are obtained based on using both the outcome as well as the covariates. While PCR seeks the high-variance directions in the space of the covariates, PLS seeks the directions in the covariate space that are most useful for the prediction of the outcome.
In 2006, a variant of the classical PCR known as supervised PCR was proposed. In a spirit similar to that of PLS, it attempts to obtain derived covariates of lower dimensions based on a criterion that involves both the outcome as well as the covariates. The method starts by performing a set of $p$ simple linear regressions (or univariate regressions) wherein the outcome vector is regressed separately on each of the $p$ covariates taken one at a time. Then, for some $m \in \{1, \ldots, p\}$, the $m$ covariates that turn out to be the most correlated with the outcome (based on the degree of significance of the corresponding estimated regression coefficients) are selected for further use. A conventional PCR, as described earlier, is then performed, but now it is based only on the $n \times m$ data matrix corresponding to the observations for the selected covariates. The number of covariates used ($m$) and the subsequent number of principal components used ($k$) are usually selected by cross-validation.
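A bare-bones version of this screening-then-PCR recipe might look as follows. The screening statistic (absolute correlation-type score) and the fixed choices m = 4 and k = 2 are illustrative assumptions; in practice both would be tuned by cross-validation:

```python
import numpy as np

rng = np.random.default_rng(6)
n, p, m, k = 120, 10, 4, 2
X = rng.standard_normal((n, p))
X -= X.mean(axis=0)
# Outcome driven by the first two covariates only
y = X[:, 0] - 2 * X[:, 1] + 0.5 * rng.standard_normal(n)
y -= y.mean()

# Screening step: rank covariates by their univariate association with y
# (|X_j' y| / ||X_j|| is proportional to the absolute sample correlation)
scores = np.abs(X.T @ y) / np.linalg.norm(X, axis=0)
selected = np.argsort(scores)[::-1][:m]

# Conventional PCR on the n x m matrix of screened covariates only
Xs = X[:, selected]
_, s, Vt = np.linalg.svd(Xs, full_matrices=False)
V_k = Vt[:k].T
gamma, *_ = np.linalg.lstsq(Xs @ V_k, y, rcond=None)
beta_selected = V_k @ gamma     # coefficients on the m screened covariates
```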
Generalization to kernel settings
The classical PCR method as described above is based on classical PCA and considers a linear regression model for predicting the outcome based on the covariates. However, it can be easily generalized to a kernel machine setting whereby the regression function need not necessarily be linear in the covariates, but instead can belong to the reproducing kernel Hilbert space associated with any arbitrary (possibly non-linear) symmetric positive-definite kernel. The linear regression model turns out to be a special case of this setting when the kernel function is chosen to be the linear kernel.
In general, under the kernel machine setting, the vector of covariates is first mapped into a high-dimensional (potentially infinite-dimensional) feature space characterized by the chosen kernel function. The mapping so obtained is known as the feature map, and each of its coordinates, also known as the feature elements, corresponds to one feature (which may be linear or non-linear) of the covariates. The regression function is then assumed to be a linear combination of these feature elements. Thus, the underlying regression model in the kernel machine setting is essentially a linear regression model, with the understanding that instead of the original set of covariates, the predictors are now given by the (potentially infinite-dimensional) vector of feature elements obtained by transforming the actual covariates using the feature map.
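The relation between a kernel and its feature map can be made concrete with a small example. The following NumPy sketch uses the degree-2 homogeneous polynomial kernel on R^2, whose explicit feature map is known in closed form; it verifies that the inner product of the mapped vectors equals the kernel evaluated directly. The specific test points are arbitrary.

```python
import numpy as np

# Degree-2 homogeneous polynomial kernel on R^2: k(x, z) = (x . z)^2.
# Its explicit feature map is phi(x) = (x1^2, sqrt(2) x1 x2, x2^2),
# so that <phi(x), phi(z)> = k(x, z).

def phi(x):
    x1, x2 = x
    return np.array([x1 ** 2, np.sqrt(2) * x1 * x2, x2 ** 2])

def k(x, z):
    return float(np.dot(x, z)) ** 2

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])

print(np.dot(phi(x), phi(z)))  # inner product in feature space: 1.0
print(k(x, z))                 # kernel evaluated directly: also 1.0
```

Each coordinate of phi is one feature element (here, all non-linear functions of the covariates), and a regression function linear in these elements is quadratic in the original covariates.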
However, the kernel trick enables us to operate in the feature space without ever explicitly computing the feature map. It turns out that it suffices to compute the pairwise inner products among the feature maps for the observed covariate vectors, and these inner products are simply given by the values of the kernel function evaluated at the corresponding pairs of covariate vectors. The pairwise inner products so obtained may therefore be represented in the form of a symmetric non-negative definite matrix, also known as the kernel matrix.
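Building such a kernel matrix requires only pairwise kernel evaluations, never the feature map itself. A minimal NumPy sketch, using the Gaussian (RBF) kernel on arbitrary toy data (the bandwidth parameter gamma is an assumption for illustration), and checking the symmetry and non-negative definiteness claimed above:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(6, 3))   # 6 covariate vectors in R^3 (toy data)

def rbf_kernel_matrix(X, gamma=0.5):
    """Gaussian (RBF) kernel matrix: K[i, j] = exp(-gamma * ||x_i - x_j||^2)."""
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T   # pairwise squared distances
    return np.exp(-gamma * np.clip(d2, 0.0, None))

K = rbf_kernel_matrix(X)

# The kernel matrix is symmetric and non-negative definite.
print(np.allclose(K, K.T))                    # True
print(np.linalg.eigvalsh(K).min() >= -1e-10)  # True (up to round-off)
```

The RBF kernel's feature space is infinite-dimensional, yet the 6 x 6 kernel matrix above is all that downstream kernel methods need.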
PCR in the kernel machine setting can now be implemented by first appropriately centering this kernel matrix (K, say) with respect to the feature space and then performing a kernel PCA on the centered kernel matrix (K', say), whereby an eigendecomposition of K' is obtained. Kernel PCR then proceeds by (usually) selecting a subset of all the eigenvectors so obtained and then performing a standard linear regression of the outcome vector on these selected eigenvectors. The eigenvectors to be used for regression are usually selected using cross-validation. The estimated regression coefficients (having the same dimension as the number of selected eigenvectors), along with the corresponding selected eigenvectors, are then used for predicting the outcome for a future observation. In machine learning, this technique is also known as ''spectral regression''.
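The kernel PCR steps above (center the kernel matrix, eigendecompose, regress on the leading eigenvectors) can be sketched in NumPy. The data, kernel bandwidth, and number of retained eigenvectors m are illustrative assumptions; in practice m would be chosen by cross-validation as described.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 60
X = rng.uniform(-2, 2, size=(n, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=n)   # non-linear signal (toy data)

# RBF kernel matrix K.
d2 = (X[:, None, 0] - X[None, :, 0]) ** 2
K = np.exp(-0.5 * d2)

# Center K with respect to the feature space:
# K' = K - J K - K J + J K J, where J is the n x n matrix with entries 1/n.
J = np.full((n, n), 1.0 / n)
Kc = K - J @ K - K @ J + J @ K @ J

# Eigendecomposition of the centered kernel matrix K'.
vals, vecs = np.linalg.eigh(Kc)
order = np.argsort(vals)[::-1]
vals, vecs = vals[order], vecs[:, order]

# Kernel PCR: regress the centered outcome on the top m eigenvectors.
m = 8
E = vecs[:, :m]
yc = y - y.mean()
beta = E.T @ yc            # eigenvectors are orthonormal, so lstsq reduces to a dot product
fitted = y.mean() + E @ beta

print("in-sample MSE:", np.mean((y - fitted) ** 2))
```

Note that the regression coefficient vector beta has dimension m, the number of selected eigenvectors, matching the description above.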
Clearly, kernel PCR has a discrete shrinkage effect on the eigenvectors of K', quite similar to the discrete shrinkage effect of classical PCR on the principal components, as discussed earlier. However, the feature map associated with the chosen kernel could potentially be infinite-dimensional, and hence the corresponding principal components and principal component directions could be infinite-dimensional as well. Therefore, these quantities are often practically intractable under the kernel machine setting. Kernel PCR essentially works around this problem by considering an equivalent dual formulation based on using the
spectral decomposition of the associated kernel matrix. Under the linear regression model (which corresponds to choosing the kernel function as the linear kernel), this amounts to considering a spectral decomposition of the corresponding kernel matrix and then regressing the outcome vector on a selected subset of its eigenvectors. It can be easily shown that this is the same as regressing the outcome vector on the corresponding principal components (which are finite-dimensional in this case), as defined in the context of classical PCR. Thus, for the linear kernel, the kernel PCR based on a dual formulation is exactly equivalent to the classical PCR based on a primal formulation. However, for arbitrary (and possibly non-linear) kernels, this primal formulation may become intractable owing to the infinite dimensionality of the associated feature map. Thus classical PCR becomes practically infeasible in that case, but kernel PCR based on the dual formulation remains valid and computationally scalable.
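The primal-dual equivalence for the linear kernel can be checked numerically: the top-k principal component scores of the centered data matrix span the same subspace as the top-k eigenvectors of the linear kernel matrix, so regressing on either yields identical fitted values. A NumPy sketch on arbitrary synthetic data:

```python
import numpy as np

rng = np.random.default_rng(3)
n, p, k = 30, 5, 3
X = rng.normal(size=(n, p))
y = rng.normal(size=n)

Xc = X - X.mean(axis=0)
yc = y - y.mean()

# Primal: regress yc on the top-k principal component scores of Xc.
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
W = Xc @ Vt[:k].T                                   # n x k score matrix
fit_primal = W @ np.linalg.lstsq(W, yc, rcond=None)[0]

# Dual: regress yc on the top-k eigenvectors of the linear kernel matrix Xc Xc^T.
Kmat = Xc @ Xc.T
vals, vecs = np.linalg.eigh(Kmat)
E = vecs[:, np.argsort(vals)[::-1][:k]]             # top-k eigenvectors
fit_dual = E @ (E.T @ yc)                           # E has orthonormal columns

print(np.allclose(fit_primal, fit_dual))  # True: identical fitted values
```

The equivalence holds because the score matrix W equals the top-k eigenvectors of Xc Xc^T scaled by the singular values, so both regressions project yc onto the same column space.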
See also
* Principal component analysis
* Partial least squares regression
* Ridge regression
* Canonical correlation
* Deming regression
* Total sum of squares