Sliced inverse regression (or SIR) is a tool for dimensionality reduction in the field of multivariate statistics.
In statistics, regression analysis is a method of studying the relationship between a response variable <math>y</math> and its input variable <math>x</math>, which is a <math>p</math>-dimensional vector. There are several approaches in the category of regression. For example, parametric methods include multiple linear regression, and non-parametric methods include local smoothing.
As the number of observations needed by local smoothing methods grows exponentially with the dimension <math>p</math>, reducing the number of dimensions first can make the computation feasible. Dimensionality reduction aims to achieve this by retaining only the most important directions of the data. SIR uses the inverse regression curve, <math>E(x \mid y)</math>, to perform a weighted principal component analysis.
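As an illustration, here is a minimal NumPy sketch of the procedure, not a reference implementation: the function name <code>sir_directions</code> and the default <code>n_slices</code> are our own choices, and the steps (standardize <math>x</math>, slice the data by <math>y</math>, average within each slice, and eigen-decompose the weighted covariance of the slice means) follow the standard SIR recipe described above.

<syntaxhighlight lang="python">
import numpy as np

def sir_directions(x, y, n_slices=10, k=1):
    """Sketch of sliced inverse regression (illustrative, not canonical).

    x : (n, p) array of predictors; y : (n,) response.
    Returns k estimated EDR directions as the columns of a (p, k) array.
    """
    n, p = x.shape

    # 1. Standardize x: z = Sigma^{-1/2} (x - mean(x)).
    xc = x - x.mean(axis=0)
    cov = np.cov(xc, rowvar=False)
    evals, evecs = np.linalg.eigh(cov)
    inv_sqrt = evecs @ np.diag(evals ** -0.5) @ evecs.T
    z = xc @ inv_sqrt

    # 2. Slice the data by the order of y and average z within each
    #    slice -- a crude estimate of the inverse regression curve E(z | y).
    slices = np.array_split(np.argsort(y), n_slices)
    means = np.array([z[idx].mean(axis=0) for idx in slices])
    weights = np.array([len(idx) for idx in slices]) / n

    # 3. Weighted PCA: eigen-decompose the weighted covariance of the
    #    slice means and keep the k leading eigenvectors.
    v = (means * weights[:, None]).T @ means
    evals_v, evecs_v = np.linalg.eigh(v)
    lead = evecs_v[:, np.argsort(evals_v)[::-1][:k]]

    # 4. Map the directions back to the original x scale.
    return inv_sqrt @ lead

# Toy usage: y depends on x only through b'x with b proportional to (1, 1, 0, 0).
rng = np.random.default_rng(0)
x = rng.normal(size=(500, 4))
y = (x @ np.array([1.0, 1.0, 0.0, 0.0])) ** 3 + 0.1 * rng.normal(size=500)
print(sir_directions(x, y))  # up to sign and scale, roughly (1, 1, 0, 0)
</syntaxhighlight>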
Model
Given a response variable <math>y</math> and a (random) vector <math>x \in \mathbb{R}^p</math> of explanatory variables, SIR is based on the model

:<math>y = f(\beta_1^\top x, \ldots, \beta_k^\top x, \varepsilon) \qquad (1)</math>

where <math>\beta_1, \ldots, \beta_k</math> are unknown projection vectors, <math>k</math> is an unknown number smaller than <math>p</math>, <math>f</math> is an unknown function on <math>\mathbb{R}^{k+1}</math> as it only depends on <math>k+1</math> arguments, and <math>\varepsilon</math> is a random variable representing error with <math>E(\varepsilon \mid x) = 0</math> and a finite variance <math>\sigma^2</math>. The model describes an ideal solution, where <math>y</math> depends on <math>x \in \mathbb{R}^p</math> only through a <math>k</math>-dimensional subspace; i.e., one can reduce the dimension of the explanatory variables from <math>p</math> to a smaller number <math>k</math> without losing any information.
An equivalent version of <math>(1)</math> is: the conditional distribution of <math>y</math> given <math>x</math> depends on <math>x</math> only through the <math>k</math>-dimensional random vector <math>(\beta_1^\top x, \ldots, \beta_k^\top x)</math>. It is assumed that this reduced vector is as informative as the original <math>x</math> in explaining <math>y</math>.
The unknown <math>\beta_i</math>'s are called the ''effective dimension reducing directions'' (EDR-directions). The space spanned by these vectors is called the ''effective dimension reducing space'' (EDR-space).
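To make the subspace assumption concrete, the following hedged simulation uses illustrative choices of the EDR directions, the link function <math>f</math>, and the dimensions (<math>p = 5</math>, <math>k = 2</math>); none of these are prescribed by the model. It verifies numerically that perturbing <math>x</math> orthogonally to the EDR space leaves <math>y</math> unchanged.

<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(0)
n, p = 1000, 5

# Illustrative EDR directions (k = 2 < p = 5): the first two coordinate axes.
b1 = np.array([1.0, 0.0, 0.0, 0.0, 0.0])
b2 = np.array([0.0, 1.0, 0.0, 0.0, 0.0])

x = rng.normal(size=(n, p))
eps = 0.1 * rng.normal(size=n)

# y depends on x only through the projections b1'x and b2'x.
f = lambda u, v, e: np.sin(u) + v ** 2 + e
y = f(x @ b1, x @ b2, eps)

# Perturbing x orthogonally to span(b1, b2) does not change y at all.
x_pert = x.copy()
x_pert[:, 2:] += rng.normal(size=(n, p - 2))
print(np.allclose(y, f(x_pert @ b1, x_pert @ b2, eps)))  # True
</syntaxhighlight>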
Relevant linear algebra background
Given vectors <math>b_1, \ldots, b_k \in \mathbb{R}^p</math>, the set of all linear combinations of these vectors, <math>V = \operatorname{span}(b_1, \ldots, b_k) = \left\{ \sum_{i=1}^k \alpha_i b_i : \alpha_i \in \mathbb{R} \right\}</math>, is called a linear subspace and is therefore itself a vector space. The equation says that the vectors <math>b_1, \ldots, b_k</math> span <math>V</math>, but the vectors that span a space are not unique.

The dimension of <math>V</math> is equal to the maximum number of linearly independent vectors in <math>V</math>. A set of <math>p</math> linearly independent vectors of <math>\mathbb{R}^p</math> makes up a basis of <math>\mathbb{R}^p</math>. The dimension of a vector space is unique, but the basis itself is not: several bases can span the same space. Linearly dependent vectors can also span a space, but the dimension of their span is smaller than the number of vectors; for example, two linearly dependent vectors in the plane span only the straight line on which they both lie.
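A quick numerical illustration of these facts (NumPy, with arbitrarily chosen vectors): the rank of the matrix whose columns are the given vectors equals the dimension of their span, and a different basis can span the same subspace.

<syntaxhighlight lang="python">
import numpy as np

# Three vectors in R^3; b3 is a linear combination of b1 and b2, so the
# three vectors are linearly dependent and span only a 2-dimensional subspace.
b1 = np.array([1.0, 0.0, 0.0])
b2 = np.array([0.0, 1.0, 0.0])
b3 = b1 + 2 * b2

print(np.linalg.matrix_rank(np.column_stack([b1, b2, b3])))  # 2

# A different basis spans the same subspace: stacking the new basis vectors
# next to the old ones does not increase the rank.
c1, c2 = b1 + b2, b1 - b2
print(np.linalg.matrix_rank(np.column_stack([b1, b2, c1, c2])))  # still 2
</syntaxhighlight>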
Inverse regression
Computing the inverse regression curve (IR) means that instead of looking for
* <math>E(y \mid x)</math>, the usual (forward) regression of the response on the <math>p</math>-dimensional explanatory variable,
one computes
* <math>E(x \mid y)</math>, the ''inverse'' regression curve, which consists of <math>p</math> one-dimensional regressions of the components of <math>x</math> on the scalar response <math>y</math>.
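As an illustration, the following hedged NumPy fragment (with a toy model of our own choosing, in which <math>y</math> depends on <math>x</math> only through its first coordinate) estimates the inverse regression curve by averaging <math>x</math> within slices of <math>y</math>. Only the first coordinate of the slice means varies systematically, tracing the single EDR direction.

<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(1)
n, p = 2000, 4
x = rng.normal(size=(n, p))
y = x[:, 0] + 0.1 * rng.normal(size=n)  # y depends on x only via x[:, 0]

# Estimate E(x | y) on 10 slices: sort by y, then average x within each slice.
slices = np.array_split(np.argsort(y), 10)
slice_means = np.array([x[idx].mean(axis=0) for idx in slices])

# Each row approximates E(x | y) for one slice; only the first coordinate
# changes systematically from slice to slice.
print(np.round(slice_means, 2))
</syntaxhighlight>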