Projection Pursuit Regression

In statistics, projection pursuit regression (PPR) is a statistical model developed by Jerome H. Friedman and Werner Stuetzle as an extension of additive models. It adapts the additive model by first projecting the data matrix of explanatory variables in the optimal direction before applying smoothing functions to these explanatory variables.


Model overview

The model consists of linear combinations of ridge functions: non-linear transformations of linear combinations of the explanatory variables. The basic model takes the form

:y_i = \beta_0 + \sum_{j=1}^r f_j(\beta_j^\mathsf{T} x_i) + \varepsilon_i ,

where ''x_i'' is a 1 × ''p'' row of the design matrix containing the explanatory variables for example ''i'', ''y_i'' is a 1 × 1 prediction, \{\beta_j\} is a collection of ''r'' vectors (each a unit vector of length ''p'') which contain the unknown parameters, \{f_j\} is a collection of ''r'' initially unknown smooth functions that map from ℝ → ℝ, and ''r'' is a hyperparameter. Good values for ''r'' can be determined through cross-validation or a forward stage-wise strategy which stops when the model fit cannot be significantly improved. As ''r'' approaches infinity and with an appropriate set of functions \{f_j\}, the PPR model is a universal estimator, as it can approximate any continuous function in ℝ^''p''.
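Once \beta_0, the directions \beta_j and the ridge functions f_j are known, the prediction formula is straightforward to evaluate. A minimal sketch in Python with NumPy, using arbitrary illustrative directions and placeholder ridge functions rather than fitted ones:

```python
import numpy as np

def ppr_predict(X, beta0, betas, ridge_fns):
    """X: (n, p) data matrix; betas: r unit vectors of length p;
    ridge_fns: r callables mapping R -> R (the ridge functions f_j)."""
    yhat = np.full(X.shape[0], beta0, dtype=float)
    for beta, f in zip(betas, ridge_fns):
        yhat += f(X @ beta)          # project onto beta_j, then apply f_j
    return yhat

# usage with r = 2 hypothetical terms in p = 3 dimensions
X = np.array([[1.0, 0.0, 2.0], [0.5, 1.0, 0.0]])
betas = [np.array([1.0, 0.0, 0.0]), np.array([0.0, 0.6, 0.8])]
ridge_fns = [np.tanh, lambda t: t ** 2]   # placeholders, not fitted smoothers
pred = ppr_predict(X, beta0=0.1, betas=betas, ridge_fns=ridge_fns)
print(pred)
```

Each term contributes only through the scalar projection \beta_j^\mathsf{T} x_i, which is what makes the model a sum of one-dimensional functions.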


Model estimation

For a given set of data \{(y_i, x_i)\}_{i=1}^{n}, the goal is to minimize the error function

:\min_{f_j, \beta_j} S = \sum_{i=1}^n \left[ y_i - \sum_{j=1}^r f_j(\beta_j^\mathsf{T} x_i) \right]^2

over the functions f_j and vectors \beta_j. No method exists for solving over all variables at once, but the problem can be solved via alternating optimization. First, consider each (f_j, \beta_j) pair individually: let all other parameters be fixed, and find a "residual", the variance of the output not accounted for by those other parameters, given by

:r_i = y_i - \sum_{l \neq j} f_l(\beta_l^\mathsf{T} x_i) .

The task of minimizing the error function now reduces to solving

:\min_{f_j, \beta_j} S' = \sum_{i=1}^n \left[ r_i - f_j(\beta_j^\mathsf{T} x_i) \right]^2

for each ''j'' in turn. Typically new (f_j, \beta_j) pairs are added to the model in a forward stage-wise fashion.
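The forward stage-wise strategy can be sketched as follows. This is an illustrative toy, not Friedman and Stuetzle's algorithm: the single-term fitter stands in for the real one by trying random unit directions and smoothing each projection with a cubic polynomial (`np.polyfit`), keeping whichever direction best explains the current residual.

```python
import numpy as np

rng = np.random.default_rng(0)

def fit_one_term(X, resid, n_dirs=200, deg=3):
    """Crude single-term fitter: random direction search + cubic smoother."""
    best = None
    for _ in range(n_dirs):
        beta = rng.normal(size=X.shape[1])
        beta /= np.linalg.norm(beta)              # unit direction
        t = X @ beta                              # 1-D projection of the data
        coefs = np.polyfit(t, resid, deg)         # polynomial smoother stand-in
        sse = np.sum((np.polyval(coefs, t) - resid) ** 2)
        if best is None or sse < best[0]:
            best = (sse, beta, coefs)
    return best[1], best[2]

def fit_ppr(X, y, r=2):
    beta0 = y.mean()
    resid = y - beta0
    terms = []
    for _ in range(r):                            # forward stage-wise loop:
        beta, coefs = fit_one_term(X, resid)      # fit a new pair to the residual,
        resid = resid - np.polyval(coefs, X @ beta)  # then update the residual
        terms.append((beta, coefs))
    return beta0, terms

# toy data: y depends on a single projection of x
X = rng.normal(size=(200, 4))
y = np.sin(X @ np.array([0.5, 0.5, 0.5, 0.5])) + 0.05 * rng.normal(size=200)
beta0, terms = fit_ppr(X, y, r=2)
final_resid = y - beta0
for beta, coefs in terms:
    final_resid = final_resid - np.polyval(coefs, X @ beta)
print("residual variance:", final_resid.var(), "vs y variance:", y.var())
```

Each stage only ever sees the residual left by earlier stages, which is exactly the r_i reduction described above.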
Aside: Previously-fitted pairs can be readjusted after new fit-pairs are determined by an algorithm known as backfitting, which entails reconsidering a previous pair, recalculating the residual given how other pairs have changed, refitting to account for that new information, and then cycling through all fit-pairs this way until parameters converge. This process typically results in a model that performs better with fewer fit-pairs, though it takes longer to train, and it is usually possible to achieve the same performance by skipping backfitting and simply adding more fits to the model (increasing ''r'').
Solving the simplified error function to determine an (f_j, \beta_j) pair can itself be done with alternating optimization: first a random \beta_j is used to project X into 1-D space, and then the optimal f_j describing the relationship between that projection and the residuals is found by any one-dimensional scatter-plot regression method. Then, holding f_j constant and assuming it is once differentiable, the optimal updated weights \beta_j can be found via the Gauss–Newton method, a quasi-Newton method in which the part of the Hessian involving the second derivative is discarded. To derive this, first Taylor expand

:f_j(\beta_j^\mathsf{T} x_i) \approx f_j(\beta_{j,\text{old}}^\mathsf{T} x_i) + \dot f_j(\beta_{j,\text{old}}^\mathsf{T} x_i)\,(\beta_j^\mathsf{T} x_i - \beta_{j,\text{old}}^\mathsf{T} x_i) ,

then plug the expansion back into the simplified error function S' and rearrange algebraically into the form

:\min_{\beta_j} S' \approx \sum_{i=1}^n \underbrace{\dot f_j(\beta_{j,\text{old}}^\mathsf{T} x_i)^2}_{w_i} \left[ \underbrace{\left( \beta_{j,\text{old}}^\mathsf{T} x_i + \frac{r_i - f_j(\beta_{j,\text{old}}^\mathsf{T} x_i)}{\dot f_j(\beta_{j,\text{old}}^\mathsf{T} x_i)} \right)}_{\hat b_i} - \beta_j^\mathsf{T} x_i \right]^2 .

This is a weighted least squares problem. Solving for all weights w_i and putting them in a diagonal matrix W, stacking all the new targets \hat b_i into a vector \vec{\hat b}, and using the full data matrix X instead of a single example x_i gives the optimal \beta_j in closed form:

:\hat\beta_j = \underset{\beta_j}{\arg\min} \Big\| \vec{\hat b} - X\beta_j \Big\|_W^2 = (X^\mathsf{T} W X)^{-1} X^\mathsf{T} W \vec{\hat b} .

Use this updated \beta_j to find a new projection of X and refit f_j to the new scatter plot, then use the new f_j to update \beta_j, and continue this alternation until (f_j, \beta_j) converges. It has been shown that the convergence rate, the bias and the variance are affected by the estimation of \beta_j and f_j.
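The Gauss–Newton update can be sketched as follows, under the assumption that the current smoother is a cubic polynomial fitted with `np.polyfit` (so its derivative comes in closed form from `np.polyder`); the weights w_i, the adjusted targets \hat b_i and the weighted least-squares solve mirror the derivation above. The toy data and starting direction are illustrative, not from any real data set.

```python
import numpy as np

def gauss_newton_step(X, resid, beta_old, coefs):
    """One Gauss-Newton update of beta_j for a polynomial smoother `coefs`."""
    t = X @ beta_old                             # current 1-D projection
    f = np.polyval(coefs, t)                     # f_j at the projection
    fdot = np.polyval(np.polyder(coefs), t)      # derivative of f_j
    w = fdot ** 2                                # weights w_i
    b_hat = t + (resid - f) / fdot               # adjusted targets b_hat_i
    W = np.diag(w)
    # closed-form weighted least squares: (X^T W X)^{-1} X^T W b_hat
    beta_new = np.linalg.solve(X.T @ W @ X, X.T @ W @ b_hat)
    return beta_new / np.linalg.norm(beta_new)   # keep beta_j a unit vector

# toy check: residuals generated from a known direction and ridge function
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 3))
resid = (X @ np.array([0.6, 0.8, 0.0])) ** 3     # true ridge: t -> t^3
beta = np.array([1.0, 0.0, 0.0])                 # deliberately wrong start
for _ in range(5):                               # alternate: refit f_j, update beta_j
    coefs = np.polyfit(X @ beta, resid, 3)
    beta = gauss_newton_step(X, resid, beta, coefs)
print(beta)
```

After a few alternations the direction should explain the residual far better than the starting guess did, which is the convergence behaviour the text describes.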


Discussion

The PPR model takes the form of a basic additive model but with the additional \beta_j component, so each f_j is fitted to a scatter plot of \beta_j^\mathsf{T} x_i against the residual (unexplained variance) during training rather than to the raw inputs themselves. This constrains the problem of finding each f_j to low dimension, making it solvable with common least squares or spline fitting methods and sidestepping the curse of dimensionality during training. Because f_j is applied to a projection of X, the result looks like a "ridge" orthogonal to the projection dimension, so \{f_j\} are often called "ridge functions". The directions \beta_j are chosen to optimize the fit of their corresponding ridge functions. Note that because PPR attempts to fit projections of the data, it can be difficult to interpret the fitted model as a whole, since each input variable has been accounted for in a complex and multifaceted way. This can make the model more useful for prediction than for understanding the data, though visualizing individual ridge functions and considering which projections the model is discovering can yield some insight.
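The ridge property is easy to verify numerically: moving x along any direction orthogonal to \beta_j leaves f_j(\beta_j^\mathsf{T} x) unchanged. The direction and function below are arbitrary illustrative choices:

```python
import numpy as np

# f(beta . x) is constant along directions orthogonal to beta, so its graph
# over the input space forms a "ridge". beta and f are arbitrary examples.
beta = np.array([0.6, 0.8])
ortho = np.array([-0.8, 0.6])             # beta @ ortho == 0
f = np.tanh

x = np.array([1.0, 2.0])
v1 = f(beta @ x)                          # value at x
v2 = f(beta @ (x + 5.0 * ortho))          # value after sliding along the ridge
print(v1, v2)                             # the two values agree
```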


Advantages of PPR estimation

*It uses univariate regression functions instead of their multivariate form, thus effectively dealing with the curse of dimensionality.
*Univariate regression allows for simple and efficient estimation.
*Relative to generalized additive models, PPR can estimate a much richer class of functions.
*Unlike local averaging methods (such as k-nearest neighbors), PPR can ignore variables with low explanatory power.


Disadvantages of PPR estimation

*PPR requires examining an M-dimensional parameter space in order to estimate \beta_j.
*One must select the smoothing parameter for f_j.
*The model is often difficult to interpret.


Extensions of PPR

*Alternate smoothers, such as the radial function, harmonic function and additive function, have been suggested; their performance varies with the data sets used.
*Alternate optimization criteria have been used as well, such as standard absolute deviations and mean absolute deviations.
*Ordinary least squares can be used to simplify calculations, as the data often do not have strong non-linearities.
*Sliced Inverse Regression (SIR) has been used to choose the direction vectors for PPR.
*Generalized PPR combines regular PPR with iteratively reweighted least squares (IRLS) and a link function to estimate binary data.


PPR vs neural networks (NN)

Both projection pursuit regression and neural network models project the input vector onto a one-dimensional subspace and then apply a nonlinear transformation to the result, with the transformed values then added in a linear fashion. Thus both follow the same steps to overcome the curse of dimensionality. The main difference is that the functions f_j being fitted in PPR can be different for each combination of input variables and are estimated one at a time and then updated with the weights, whereas in a neural network these are all specified upfront and estimated simultaneously. Thus, PPR estimation is more straightforward than for neural networks, and the transformations of variables in PPR are data driven, whereas in a neural network these transformations are fixed.
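The structural similarity and the key difference can be seen by writing both forward passes side by side. The parameter values below are arbitrary illustrative choices, not fitted ones: a real single-hidden-layer network would learn W, a and c while keeping the sigmoid fixed, whereas PPR learns the betas together with each f_j.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def nn_forward(x, W, a, c):
    # single hidden layer: sum_j a_j * sigmoid(w_j . x + c_j);
    # the nonlinearity is fixed, only W, a, c would be learned
    return a @ sigmoid(W @ x + c)

def ppr_forward(x, beta0, betas, fs):
    # PPR: beta_0 + sum_j f_j(beta_j . x); each f_j is itself learned
    return beta0 + sum(f(b @ x) for b, f in zip(betas, fs))

x = np.array([0.3, -1.2])
W = np.array([[1.0, 0.5], [-0.5, 1.0]])
a = np.array([2.0, -1.0])
c = np.zeros(2)
betas = [np.array([1.0, 0.5]), np.array([-0.5, 1.0])]
fs = [np.tanh, lambda t: t ** 2]          # stand-ins for learned smoothers
print(nn_forward(x, W, a, c), ppr_forward(x, 0.0, betas, fs))
```

Both models are sums of nonlinear functions of projections; only the origin of the nonlinearity differs.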


See also

*Projection pursuit


References

*Friedman, J.H. and Stuetzle, W. (1981) "Projection Pursuit Regression", Journal of the American Statistical Association, 76, 817–823.
*Hand, D., Mannila, H. and Smyth, P. (2001) Principles of Data Mining. MIT Press.
*Hall, P. (1988) "Estimating the direction in which a data set is the most interesting", Probab. Theory Related Fields, 80, 51–77.
*Hastie, T.J., Tibshirani, R.J. and Friedman, J.H. (2009) The Elements of Statistical Learning: Data Mining, Inference and Prediction. Springer. {{ISBN, 978-0-387-84857-0}}
*Klinke, S. and Grassmann, J. (2000) "Projection Pursuit Regression", in Smoothing and Regression: Approaches, Computation and Application, ed. Schimek, M.G. Wiley Interscience.
*Lingjarde, O.C. and Liestol, K. (1998) "Generalized Projection Pursuit Regression", SIAM Journal of Scientific Computing, 20, 844–857.