Backfitting Algorithm

In statistics, the backfitting algorithm is a simple iterative procedure used to fit a generalized additive model. It was introduced in 1985 by Leo Breiman and Jerome Friedman along with generalized additive models. In most cases, the backfitting algorithm is equivalent to the Gauss–Seidel method, an algorithm used for solving a certain linear system of equations.


Algorithm

Additive models are a class of non-parametric regression models of the form:

: Y_i = \alpha + \sum_{j=1}^p f_j(X_{ij}) + \epsilon_i

where each X_1, X_2, \ldots, X_p is a variable in our p-dimensional predictor X, and Y is our outcome variable. \epsilon represents our inherent error, which is assumed to have mean zero. The f_j represent unspecified smooth functions of a single X_j. Given the flexibility in the f_j, we typically do not have a unique solution: \alpha is left unidentifiable, as one can add any constant to any of the f_j and subtract this value from \alpha. It is common to rectify this by constraining

: \sum_{i=1}^N f_j(X_{ij}) = 0 for all j

leaving

: \alpha = 1/N \sum_{i=1}^N y_i

necessarily.

The backfitting algorithm is then:

 Initialize \hat{\alpha} = 1/N \sum_{i=1}^N y_i, \; \hat{f}_j \equiv 0, \; \forall j
 Do until the \hat{f}_j converge:
   For each predictor ''j'':
     (a) \hat{f}_j \leftarrow \text{Smooth}\left[ \lbrace y_i - \hat{\alpha} - \sum_{k \neq j} \hat{f}_k(x_{ik}) \rbrace_1^N \right] (backfitting step)
     (b) \hat{f}_j \leftarrow \hat{f}_j - 1/N \sum_{i=1}^N \hat{f}_j(x_{ij}) (mean centering of estimated function)

where \text{Smooth} is our smoothing operator. This is typically chosen to be a cubic spline smoother but can be any other appropriate fitting operation, such as:

* local polynomial regression
* kernel smoothing methods
* more complex operators, such as surface smoothers for second and higher-order interactions

In theory, step (b) in the algorithm is not needed, as the function estimates are constrained to sum to zero. However, due to numerical issues this might become a problem in practice.
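As a concrete illustration, the following is a minimal Python/NumPy sketch of the loop above. It uses a Gaussian (Nadaraya–Watson) kernel smoother in place of the cubic spline smoother mentioned in the text; the function names, bandwidth, and simulated data are purely illustrative and not part of the original algorithm description.

import numpy as np

def kernel_smooth(x, r, bandwidth=0.3):
    """Nadaraya-Watson smoother: fit residuals r against predictor x
    and return fitted values at the observed x (a linear smoother)."""
    d = (x[:, None] - x[None, :]) / bandwidth
    w = np.exp(-0.5 * d ** 2)                 # Gaussian kernel weights
    return (w @ r) / w.sum(axis=1)

def backfit(X, y, n_iter=20, tol=1e-6):
    """Backfitting for an additive model y = alpha + sum_j f_j(X[:, j]) + eps."""
    n, p = X.shape
    alpha = y.mean()                          # alpha-hat = mean of y
    f = np.zeros((n, p))                      # f_j-hat initialised to 0
    for _ in range(n_iter):
        f_old = f.copy()
        for j in range(p):
            # (a) smooth the partial residuals against X[:, j]
            resid = y - alpha - f[:, [k for k in range(p) if k != j]].sum(axis=1)
            f[:, j] = kernel_smooth(X[:, j], resid)
            # (b) mean-centre the estimated function
            f[:, j] -= f[:, j].mean()
        if np.max(np.abs(f - f_old)) < tol:   # stop once the f_j stabilise
            break
    return alpha, f

# Illustrative use on simulated data
rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(200, 2))
y = 1.0 + np.sin(X[:, 0]) + 0.5 * X[:, 1] ** 2 + 0.1 * rng.standard_normal(200)
alpha_hat, f_hat = backfit(X, y)
print(alpha_hat, np.abs(y - alpha_hat - f_hat.sum(axis=1)).mean())

In practice one would substitute a spline or local polynomial smoother and a more careful convergence criterion; the structure of the double loop is unchanged.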


Motivation

If we consider the problem of minimizing the expected squared error:

: \min E\left[ Y - \left( \alpha + \sum_{j=1}^p f_j(X_j) \right) \right]^2

there exists a unique solution by the theory of projections, given by:

: f_i(X_i) = E\left[ Y - \left( \alpha + \sum_{j \neq i} f_j(X_j) \right) \Big| X_i \right]

for ''i'' = 1, 2, ..., ''p''. This gives the matrix interpretation:

: \begin{pmatrix} I & P_1 & \cdots & P_1 \\ P_2 & I & \cdots & P_2 \\ \vdots & & \ddots & \vdots \\ P_p & \cdots & P_p & I \end{pmatrix} \begin{pmatrix} f_1(X_1)\\ f_2(X_2)\\ \vdots \\ f_p(X_p) \end{pmatrix} = \begin{pmatrix} P_1 Y\\ P_2 Y\\ \vdots \\ P_p Y \end{pmatrix}

where P_i(\cdot) = E(\cdot \mid X_i). In this context we can imagine a smoother matrix, S_i, which approximates our P_i and gives an estimate, S_i Y, of E(Y \mid X_i):

: \begin{pmatrix} I & S_1 & \cdots & S_1 \\ S_2 & I & \cdots & S_2 \\ \vdots & & \ddots & \vdots \\ S_p & \cdots & S_p & I \end{pmatrix} \begin{pmatrix} f_1\\ f_2\\ \vdots \\ f_p \end{pmatrix} = \begin{pmatrix} S_1 Y\\ S_2 Y\\ \vdots \\ S_p Y \end{pmatrix}

or in abbreviated form

: \hat{S} f = QY

An exact solution of this system is infeasible to calculate for large ''np'', so the iterative technique of backfitting is used. We take initial guesses f_j^{(0)} and update each \hat{f}_j in turn to be the smoothed fit for the residuals of all the others:

: \hat{f}_j \leftarrow \text{Smooth}\left[ \lbrace y_i - \hat{\alpha} - \sum_{k \neq j} \hat{f}_k(x_{ik}) \rbrace_1^N \right]

Looking at the abbreviated form, it is easy to see the backfitting algorithm as equivalent to the Gauss–Seidel method for linear smoothing operators ''S''.
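The equivalence to Gauss–Seidel can be checked numerically. The following is a small sketch, assuming arbitrary contractive random matrices standing in for linear smoother matrices (in practice these would come from a spline or kernel smoother); it solves the block system for p = 3 directly and compares the result against backfitting sweeps.

import numpy as np

# Small numeric check of the matrix formulation, here for p = 3.
rng = np.random.default_rng(1)
n, p = 5, 3
S = [0.1 * rng.random((n, n)) for _ in range(p)]   # stand-ins for smoother matrices
Y = rng.random(n)
I = np.eye(n)

# Block system: row j has I on the diagonal, S_j in every off-diagonal block,
# and right-hand side S_j Y.
S_hat = np.block([[I if j == k else S[j] for k in range(p)] for j in range(p)])
QY = np.concatenate([S[j] @ Y for j in range(p)])
f_direct = np.linalg.solve(S_hat, QY)

# Backfitting = Gauss-Seidel sweeps: refit f_j to the residual of all the others,
# always using the most recently updated estimates.
f = [np.zeros(n) for _ in range(p)]
for _ in range(200):
    for j in range(p):
        f[j] = S[j] @ (Y - sum(f[k] for k in range(p) if k != j))

print(np.allclose(np.concatenate(f), f_direct))  # True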


Explicit derivation for two dimensions

Following Härdle et al. (2004), we can formulate the backfitting algorithm explicitly for the two-dimensional case. We have:

: f_1 = S_1(Y - f_2), \quad f_2 = S_2(Y - f_1)

If we denote \hat{f}_1^{(i)} as the estimate of f_1 in the ''i''th updating step, the backfitting steps are

: \hat{f}_1^{(i)} = S_1[Y - \hat{f}_2^{(i-1)}], \quad \hat{f}_2^{(i)} = S_2[Y - \hat{f}_1^{(i)}]

By induction we get

: \hat{f}_1^{(i)} = Y - \sum_{\alpha=0}^{i-1}(S_1 S_2)^\alpha (I - S_1)Y - (S_1 S_2)^{i-1} S_1 \hat{f}_2^{(0)}

and

: \hat{f}_2^{(i)} = S_2 \sum_{\alpha=0}^{i-1}(S_1 S_2)^\alpha (I - S_1)Y + S_2 (S_1 S_2)^{i-1} S_1 \hat{f}_2^{(0)}

If we set \hat{f}_2^{(0)} = 0 then we get

: \hat{f}_1^{(i)} = Y - S_2^{-1}\hat{f}_2^{(i)} = \left[ I - \sum_{\alpha=0}^{i-1}(S_1 S_2)^\alpha (I - S_1) \right] Y

: \hat{f}_2^{(i)} = \left[ S_2 \sum_{\alpha=0}^{i-1}(S_1 S_2)^\alpha (I - S_1) \right] Y

where we have solved for \hat{f}_1^{(i)} directly from f_2 = S_2(Y - f_1). We have convergence if \| S_1 S_2 \| < 1. In this case, letting \hat{f}_1^{(i)}, \hat{f}_2^{(i)} \xrightarrow{i \to \infty} \hat{f}_1^{(\infty)}, \hat{f}_2^{(\infty)}:

: \hat{f}_1^{(\infty)} = Y - S_2^{-1}\hat{f}_2^{(\infty)} = Y - (I - S_1 S_2)^{-1} (I - S_1) Y

: \hat{f}_2^{(\infty)} = S_2 (I - S_1 S_2)^{-1} (I - S_1) Y

We can check that this is a solution to the problem, i.e. that \hat{f}_1^{(i)} and \hat{f}_2^{(i)} converge to f_1 and f_2 correspondingly, by plugging these expressions into the original equations.
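The closed-form limits above can also be verified numerically. The sketch below again assumes two arbitrary contractive matrices in place of real smoother matrices and compares the backfitting sweeps, started from \hat{f}_2^{(0)} = 0, with the limiting expressions.

import numpy as np

# Numerical check of the closed-form two-dimensional limits.
rng = np.random.default_rng(2)
n = 6
S1 = 0.1 * rng.random((n, n))      # stand-in for a linear smoother matrix
S2 = 0.1 * rng.random((n, n))
Y = rng.random(n)
I = np.eye(n)

assert np.linalg.norm(S1 @ S2, 2) < 1          # convergence condition

# Backfitting sweeps starting from f2 = 0
f1, f2 = np.zeros(n), np.zeros(n)
for _ in range(100):
    f1 = S1 @ (Y - f2)
    f2 = S2 @ (Y - f1)

# Closed-form limits derived above
f1_inf = Y - np.linalg.inv(I - S1 @ S2) @ (I - S1) @ Y
f2_inf = S2 @ np.linalg.inv(I - S1 @ S2) @ (I - S1) @ Y

print(np.allclose(f1, f1_inf), np.allclose(f2, f2_inf))  # True True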


Issues

The choice of when to stop the algorithm is arbitrary, and it is hard to know a priori how long reaching a specific convergence threshold will take. Also, the final model depends on the order in which the predictor variables X_i are fit. Furthermore, the solution found by the backfitting procedure is non-unique: if b is a vector such that \hat{S}b = 0 from above and \hat{f} is a solution, then \hat{f} + \alpha b is also a solution for any \alpha \in \mathbb{R}. A modification of the backfitting algorithm involving projections onto the eigenspace of ''S'' can remedy this problem.


Modified algorithm

We can modify the backfitting algorithm to make it easier to provide a unique solution. Let \mathcal{V}_1(S_i) be the space spanned by all the eigenvectors of S_i that correspond to eigenvalue 1. Then any b satisfying \hat{S}b = 0 has b_i \in \mathcal{V}_1(S_i) \; \forall i = 1, \dots, p and \sum_{i=1}^p b_i = 0. Now if we take A to be a matrix that projects orthogonally onto \mathcal{V}_1(S_1) + \dots + \mathcal{V}_1(S_p), we get the following modified backfitting algorithm:

 Initialize \hat{\alpha} = 1/N \sum_{i=1}^N y_i, \; \hat{f}_j \equiv 0 \; \forall j, \; \hat{f}_+ = \hat{\alpha} + \hat{f}_1 + \dots + \hat{f}_p
 Do until the \hat{f}_j converge:
   Regress y - \hat{f}_+ onto the space \mathcal{V}_1(S_1) + \dots + \mathcal{V}_1(S_p), setting a = A(Y - \hat{f}_+)
   For each predictor ''j'':
     Apply the backfitting update to (Y - a) using the smoothing operator (I - A_i)S_i, yielding new estimates for \hat{f}_j
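The new ingredient here is the orthogonal projector A onto \mathcal{V}_1(S_1) + \dots + \mathcal{V}_1(S_p). The following is a hypothetical sketch of how such a projector could be built when the smoother matrices are symmetric, so that the eigenvalue-1 eigenspaces can be read off an eigendecomposition; the helper names and tolerances are illustrative and not part of the algorithm as stated above.

import numpy as np

def eigenspace_one(S, tol=1e-8):
    """Orthonormal basis of the eigenvalue-1 eigenspace of a symmetric smoother S."""
    w, V = np.linalg.eigh(S)                 # eigendecomposition of S
    return V[:, np.abs(w - 1.0) < tol]       # keep eigenvectors with eigenvalue ~1

def projector_onto_sum(smoothers):
    """Orthogonal projector A onto V_1(S_1) + ... + V_1(S_p)."""
    n = smoothers[0].shape[0]
    B = np.hstack([eigenspace_one(S) for S in smoothers])
    if B.shape[1] == 0:                      # no eigenvalue-1 directions at all
        return np.zeros((n, n))
    U, s, _ = np.linalg.svd(B, full_matrices=False)
    U = U[:, s > 1e-10]                      # orthonormal basis of the combined span
    return U @ U.T                           # projection onto that span

# Example: a smoother that reproduces constants has the constant vector in V_1,
# so the projector reduces to projection onto constants.
n = 4
S = np.full((n, n), 1.0 / n)                 # simple averaging smoother
A = projector_onto_sum([S, S])
print(np.allclose(A, np.full((n, n), 1.0 / n)))  # True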


References

* Härdle, Wolfgang; et al. (June 9, 2004). "Backfitting". http://sfb649.wiwi.hu-berlin.de/fedc_homepage/xplore/ebooks/html/spm/spmhtmlnode37.html. Archived from the original on 2015-05-10 at https://web.archive.org/web/20150510225240/http://sfb649.wiwi.hu-berlin.de/fedc_homepage/xplore/ebooks/html/spm/spmhtmlnode37.html. Retrieved 2015-08-19.


External links


R Package for GAM backfitting