In statistics, a generalized additive model (GAM) is a generalized linear model in which the linear predictor depends linearly on unknown smooth functions of some predictor variables, and interest focuses on inference about these smooth functions. GAMs were originally developed by Trevor Hastie and Robert Tibshirani to blend properties of generalized linear models with additive models. They can be interpreted as the discriminative generalization of the naive Bayes generative model.

The model relates a univariate response variable, ''Y'', to some predictor variables, ''x''_''i''. An exponential family distribution is specified for ''Y'' (for example normal, binomial or Poisson distributions) along with a link function ''g'' (for example the identity or log functions) relating the expected value of ''Y'' to the predictor variables via a structure such as :

g(\operatorname{E}(Y))=\beta_0 + f_1(x_1) + f_2(x_2)+ \cdots + f_m(x_m).\,\!

The functions ''f''_''i'' may be functions with a specified parametric form (for example a polynomial, or an un-penalized regression spline of a variable) or may be specified non-parametrically, or semi-parametrically, simply as 'smooth functions', to be estimated by non-parametric means. So a typical GAM might use a scatterplot smoothing function, such as a locally weighted mean, for ''f''₁(''x''₁), and then use a factor model for ''f''₂(''x''₂). This flexibility to allow non-parametric fits, with relaxed assumptions on the actual relationship between response and predictor, provides the potential for better fits to data than purely parametric models, but arguably with some loss of interpretability.
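As a minimal illustration of the model structure above, the following Python sketch evaluates the mean of a toy GAM with a log link. The component functions (a sine curve and a quadratic), the intercept value and the function name are hypothetical choices made only for illustration.

```python
import math

def gam_mean(x1, x2, beta0=0.5):
    """Evaluate E(Y) for a toy GAM: log E(Y) = beta0 + f1(x1) + f2(x2)."""
    f1 = math.sin(x1)        # hypothetical smooth term
    f2 = 0.1 * x2 ** 2       # hypothetical parametric (polynomial) term
    eta = beta0 + f1 + f2    # additive predictor on the link scale
    return math.exp(eta)     # inverse of the log link

# On the link scale the effect of x2 does not depend on x1.
print(math.log(gam_mean(1.0, 2.0)) - math.log(gam_mean(1.0, 0.0)))
print(math.log(gam_mean(-2.0, 2.0)) - math.log(gam_mean(-2.0, 0.0)))
```

Both printed differences equal the change in the hypothetical ''f''₂ between the two values of ''x''₂, which is what additivity on the link scale means.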

Theoretical background

It had been known since the 1950s (via the Kolmogorov–Arnold representation theorem) that any multivariate continuous function could be represented as sums and compositions of univariate functions. :

f(\vec x) = \sum_{q=0}^{2n} \Phi_q\left(\sum_{p=1}^{n} \phi_{q,p}(x_p)\right)

Unfortunately, though the Kolmogorov–Arnold representation theorem asserts the existence of a function of this form, it gives no mechanism whereby one could be constructed. Certain constructive proofs exist, but they tend to require highly complicated (i.e. fractal) functions, and thus are not suitable for modeling approaches. Therefore, the generalized additive model drops the outer sum, and demands instead that the function belong to a simpler class. :

f(\vec x) = \Phi\left(\sum_{p=1}^{n} \phi_p(x_p)\right)

where

\Phi

is a smooth monotonic function. Writing

g

for the inverse of

\Phi

, this is traditionally written as :

g(f(\vec x))=\sum_i f_i(x_i)

. When this function is approximating the expectation of some observed quantity, it could be written as :

g(\operatorname{E}(Y))=\beta_0 + f_1(x_1) + f_2(x_2)+ \cdots + f_m(x_m).\,\!

This is the standard formulation of a generalized additive model. It was then shown that the backfitting algorithm will always converge for these functions.

Generality

The GAM model class is quite broad, given that ''smooth function'' is a rather broad category. For example, a covariate

x_j

may be multivariate and the corresponding

f_j

a smooth function of several variables, or

f_j

might be the function mapping the level of a factor to the value of a random effect. Another example is a varying coefficient (geographic regression) term such as

z_jf_j(x_j)

where

z_j

and

x_j

are both covariates. Or if

x_j(t)

is itself an observation of a function, we might include a term such as

\int f_j(t)x_j(t)dt

(sometimes known as a signal regression term).

f_j

could also be a simple parametric function as might be used in any generalized linear model. The model class has been generalized in several directions, notably beyond exponential family response distributions, beyond modelling of only the mean and beyond univariate data.

GAM fitting methods

The original GAM fitting method estimated the smooth components of the model using non-parametric smoothers (for example smoothing splines or local linear regression smoothers) via the backfitting algorithm. Backfitting works by iterative smoothing of partial residuals and provides a very general modular estimation method capable of using a wide variety of smoothing methods to estimate the

f_j(x_j)

terms. A disadvantage of backfitting is that it is difficult to integrate with the estimation of the degree of smoothness of the model terms, so that in practice the user must set these, or select between a modest set of pre-defined smoothing levels. If the

f_j(x_j)

are represented using smoothing splines then the degree of smoothness can be estimated as part of model fitting using generalized cross validation, or by restricted maximum likelihood (REML, sometimes known as 'GML') which exploits the duality between spline smoothers and Gaussian random effects. This full spline approach carries an

O(n^3)

computational cost, where

n

is the number of observations for the response variable, rendering it somewhat impractical for moderately large datasets. More recent methods have addressed this computational cost either by up front reduction of the size of the basis used for smoothing (rank reduction) or by finding sparse representations of the smooths using Markov random fields, which are amenable to the use of sparse matrix methods for computation. These more computationally efficient methods use GCV (or AIC or similar) or REML or take a fully Bayesian approach for inference about the degree of smoothness of the model components. Estimating the degree of smoothness via REML can be viewed as an empirical Bayes method. An alternative approach with particular advantages in high dimensional settings is to use boosting, although this typically requires bootstrapping for uncertainty quantification. GAMs fit using bagging and boosting have been found to generally outperform GAMs fit using spline methods.
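The backfitting idea described in this section (iterative smoothing of partial residuals) can be sketched in a few lines of Python. This is a minimal illustration for a Gaussian response with an identity link, not the original implementation: the moving-average smoother, the fixed window width, the fixed iteration count and all function names are assumptions made for the example.

```python
def running_mean_smoother(x, r, window=11):
    """Smooth values r against x with a centred moving average over x-ordered neighbours."""
    order = sorted(range(len(x)), key=lambda i: x[i])
    half = window // 2
    fitted = [0.0] * len(x)
    for pos, i in enumerate(order):
        lo, hi = max(0, pos - half), min(len(x), pos + half + 1)
        neighbours = [r[order[k]] for k in range(lo, hi)]
        fitted[i] = sum(neighbours) / len(neighbours)
    return fitted

def backfit(y, covariates, n_iter=20, window=11):
    """Estimate y ~ beta0 + sum_j f_j(x_j) by cycling over the smooth terms."""
    n, m = len(y), len(covariates)
    beta0 = sum(y) / n
    f = [[0.0] * n for _ in range(m)]
    for _ in range(n_iter):
        for j, xj in enumerate(covariates):
            # Partial residuals: remove the intercept and every other term.
            partial = [y[i] - beta0 - sum(f[k][i] for k in range(m) if k != j)
                       for i in range(n)]
            fj = running_mean_smoother(xj, partial, window)
            mean_fj = sum(fj) / n
            f[j] = [v - mean_fj for v in fj]  # centre each term for identifiability
    return beta0, f
```

The window width plays the role of the smoothness level here, and it has to be set by the user, which illustrates the disadvantage of backfitting noted above.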

The rank reduced framework

Many modern implementations of GAMs and their extensions are built around the reduced rank smoothing approach, because it allows well founded estimation of the smoothness of the component smooths at comparatively modest computational cost, and also facilitates implementation of a number of model extensions in a way that is more difficult with other methods. At its simplest the idea is to replace the unknown smooth functions in the model with basis expansions :

f_j(x_j) = \sum_{k=1}^{K_j} \beta_{jk} b_{jk}(x_j)

where the

b_{jk}(x_j)

are known basis functions, usually chosen for good approximation theoretic properties (for example B splines or reduced rank thin plate splines), and the

\beta_{jk}

are coefficients to be estimated as part of model fitting. The basis dimension

K_j

is chosen to be sufficiently large that we expect it to overfit the data to hand (thereby avoiding bias from model over-simplification), but small enough to retain computational efficiency. If

p = \sum_j K_j

then the computational cost of model estimation this way will be

O(np^2)

. Notice that the

f_j

are only identifiable to within an intercept term (we could add any constant to

f_1

while subtracting it from

f_2

without changing the model predictions at all), so identifiability constraints have to be imposed on the smooth terms to remove this ambiguity. Sharpest inference about the

f_j

is generally obtained by using the sum-to-zero constraints :

\sum_i f_j(x_{ji}) = 0

i.e. by insisting that the sum of each

f_j

evaluated at its observed covariate values should be zero. Such linear constraints can most easily be imposed by reparametrization at the basis setup stage, so below it is assumed that this has been done. Having replaced all the

f_j

in the model with such basis expansions we have turned the GAM into a generalized linear model (GLM), with a model matrix that simply contains the basis functions evaluated at the observed

x_j

values. However, because the basis dimensions,

K_j

, have been chosen to be somewhat larger than is believed to be necessary for the data, the model is over-parameterized and will overfit the data if estimated as a regular GLM. The solution to this problem is to penalize departure from smoothness in the model fitting process, controlling the weight given to the smoothing penalties using smoothing parameters. For example, consider the situation in which all the smooths are univariate functions. Writing all the parameters in one vector,

\beta

, suppose that

D(\beta)

is the deviance (twice the difference between saturated log likelihood and the model log likelihood) for the model. Minimizing the deviance by the usual iteratively re-weighted least squares would result in overfit, so we seek

\beta

to minimize :

D(\beta) + \sum_j \lambda_j \int f^{\prime\prime}_j(x)^2 dx

where the integrated square second derivative penalties serve to penalize wiggliness (lack of smoothness) of the

f_j

during fitting, and the smoothing parameters

\lambda_j

control the tradeoff between model goodness of fit and model smoothness. In the example

\lambda_j \to \infty

would ensure that the estimate of

f_j(x_j)

would be a straight line in

x_j

. Given the basis expansion for each

f_j

the wiggliness penalties can be expressed as quadratic forms in the model coefficients. That is, we can write :

\int f^{\prime\prime}_j(x)^2 dx = \beta^T_j \bar S_j \beta_j = \beta^T S_j \beta

, where

\bar S_j

is a matrix of known coefficients computable from the penalty and basis,

\beta_j

is the vector of coefficients for

f_j

, and

S_j

is just

\bar S_j

padded with zeros so that the second equality holds and we can write the penalty in terms of the full coefficient vector

\beta

. Many other smoothing penalties can be written in the same way, and given the smoothing parameters the model fitting problem now becomes :

\hat \beta = \text{argmin}_\beta \{ D(\beta) + \sum_j \lambda_j \beta^T S_j \beta \}

, which can be found using a penalized version of the usual iteratively reweighted least squares (IRLS) algorithm for GLMs: the algorithm is unchanged except that the sum of quadratic penalties is added to the working least squares objective at each iteration of the algorithm. Penalization has several effects on inference, relative to a regular GLM. For one thing the estimates are subject to some smoothing bias, which is the price that must be paid for limiting estimator variance by penalization. However, if smoothing parameters are selected appropriately the (squared) smoothing bias introduced by penalization should be less than the reduction in variance that it produces, so that the net effect is a reduction in mean square estimation error, relative to not penalizing. A related effect of penalization is that the notion of degrees of freedom of a model has to be modified to account for the penalties' action in reducing the coefficients' freedom to vary. For example, if

W

is the diagonal matrix of IRLS weights at convergence, and

X

is the GAM model matrix, then the model effective degrees of freedom is given by

\text{trace}(F)

where :

F = (X^T WX + \sum_j \lambda_j S_j)^{-1}X^T WX

, is the effective degrees of freedom matrix. In fact summing just the diagonal elements of

F

corresponding to the coefficients of

f_j

gives the effective degrees of freedom for the estimate of

f_j

.
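The penalized fit and the effective degrees of freedom can be illustrated for a single smooth with a Gaussian response and identity link, in which case W is the identity matrix and one penalized least squares solve replaces the IRLS iteration. The sketch below makes several simplifying assumptions that are not part of the framework above: a piecewise-linear "tent" basis on evenly spaced knots, a second-difference penalty on the coefficients standing in for the integrated squared second derivative, no identifiability constraint, and a small hand-written linear solver. All names are hypothetical.

```python
def solve(A, B):
    """Solve A X = B by Gaussian elimination with partial pivoting (B is a matrix)."""
    n, m = len(A), len(B[0])
    M = [list(A[i]) + list(B[i]) for i in range(n)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            mult = M[r][c] / M[c][c]
            for k in range(c, n + m):
                M[r][k] -= mult * M[c][k]
    X = [[0.0] * m for _ in range(n)]
    for r in reversed(range(n)):
        for k in range(m):
            s = M[r][n + k] - sum(M[r][c] * X[c][k] for c in range(r + 1, n))
            X[r][k] = s / M[r][r]
    return X

def tent_basis(x, knots):
    """Piecewise-linear basis functions, one per evenly spaced knot, evaluated at x."""
    h = knots[1] - knots[0]
    return [max(0.0, 1.0 - abs(x - t) / h) for t in knots]

def second_diff_penalty(K):
    """S = D^T D, where D takes second differences of the coefficient vector."""
    S = [[0.0] * K for _ in range(K)]
    for r in range(K - 2):
        d = {r: 1.0, r + 1: -2.0, r + 2: 1.0}
        for a, da in d.items():
            for b, db in d.items():
                S[a][b] += da * db
    return S

def penalized_fit(x, y, knots, lam):
    """Return (beta_hat, edf) with F = (X^T X + lam S)^{-1} X^T X and edf = trace(F)."""
    K = len(knots)
    X = [tent_basis(xi, knots) for xi in x]
    XtX = [[sum(row[a] * row[b] for row in X) for b in range(K)] for a in range(K)]
    Xty = [[sum(row[a] * yi for row, yi in zip(X, y))] for a in range(K)]
    S = second_diff_penalty(K)
    A = [[XtX[a][b] + lam * S[a][b] for b in range(K)] for a in range(K)]
    beta = [row[0] for row in solve(A, Xty)]
    F = solve(A, XtX)
    return beta, sum(F[a][a] for a in range(K))
```

With no penalty the effective degrees of freedom equal the basis dimension. As the smoothing parameter grows they approach 2, the dimension of the straight-line space that a second-difference penalty leaves unpenalized.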

Bayesian smoothing priors

Smoothing bias complicates interval estimation for these models, and the simplest approach turns out to involve a Bayesian approach. Understanding this Bayesian view of smoothing also helps to understand the REML and full Bayes approaches to smoothing parameter estimation. At some level smoothing penalties are imposed because we believe smooth functions to be more probable than wiggly ones, and if that is true then we might as well formalize this notion by placing a prior on model wiggliness. A very simple prior might be :

\pi(\beta) \propto \exp\{-\beta^T \sum_j \lambda_j S_j \beta/(2\phi)\}

(where

\phi

is the GLM scale parameter introduced only for later convenience), but we can immediately recognize this as a multivariate normal prior with mean

0

and precision matrix

S_\lambda = \sum_j \lambda_j S_j/\phi

. Since the penalty allows some functions through unpenalized (straight lines, given the example penalties),

S_\lambda

is rank deficient, and the prior is actually improper, with a covariance matrix given by the Moore–Penrose pseudoinverse of

S_\lambda

(the impropriety corresponds to ascribing infinite variance to the unpenalized components of a smooth). Now if this prior is combined with the GLM likelihood, we find that the posterior mode for

\beta

is exactly the

\hat \beta

found above by penalized IRLS. Furthermore, we have the large sample result that :

\beta | y \sim N (\hat \beta, (X^T WX + \sum_j \lambda_j S_j)^{-1}\phi).

which can be used to produce confidence/credible intervals for the smooth components,

f_j

. The Gaussian smoothness priors are also the basis for fully Bayesian inference with GAMs, as well as methods estimating GAMs as mixed models that are essentially empirical Bayes methods.
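The claim that the posterior mode coincides with the penalized estimate can be checked in a few lines. The derivation below is a sketch under one stated assumption: the log likelihood is written as minus the deviance divided by 2φ, up to a constant, which matches the deviance defined earlier when φ = 1 or when the deviance is taken on the unscaled scale.

```latex
\begin{aligned}
\log \pi(\beta \mid y)
  &= \log f(y \mid \beta) + \log \pi(\beta) + \text{const} \\
  &= -\frac{D(\beta)}{2\phi}
     - \frac{1}{2\phi} \sum_j \lambda_j \beta^T S_j \beta + \text{const} \\
  &= -\frac{1}{2\phi} \Big\{ D(\beta) + \sum_j \lambda_j \beta^T S_j \beta \Big\} + \text{const}
\end{aligned}
```

Since φ is positive, maximizing this log posterior over β is the same as minimizing the penalized deviance, so the posterior mode is the penalized IRLS estimate.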

Smoothing parameter estimation

So far we have treated estimation and inference given the smoothing parameters,

\lambda

, but these also need to be estimated. One approach is to take a fully Bayesian approach, defining priors on the (log) smoothing parameters, and using stochastic simulation or high order approximation methods to obtain information about the posterior of the model coefficients. An alternative is to select the smoothing parameters to optimize a prediction error criterion such as generalized cross validation (GCV) or the Akaike information criterion (AIC). Finally we may choose to maximize the marginal likelihood (REML) obtained by integrating the model coefficients,

\beta

out of the joint density of

\beta,y

, :

\hat \lambda = \text{argmax}_\lambda \int f(y|\beta,\lambda)\pi(\beta|\lambda) d \beta

. Since

f(y|\beta,\lambda)

is just the likelihood of

\beta

, we can view this as choosing

\lambda

to maximize the average likelihood of random draws from the prior. The preceding integral is usually analytically intractable but can be approximated to quite high accuracy using

Laplace's method

. Smoothing parameter inference is the most computationally taxing part of model estimation/inference. For example, to optimize a GCV or marginal likelihood typically requires numerical optimization via a Newton or Quasi-Newton method, with each trial value for the (log) smoothing parameter vector requiring a penalized IRLS iteration to evaluate the corresponding

\hat \beta

alongside the other ingredients of the GCV score or Laplace approximate marginal likelihood (LAML). Furthermore, to obtain the derivatives of the GCV or LAML, required for optimization, involves implicit differentiation to obtain the derivatives of

\hat \beta

with respect to the log smoothing parameters, and this requires some care if efficiency and numerical stability are to be maintained.
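Smoothing parameter selection by GCV can be sketched in a setting simple enough to need no matrix algebra. The example below assumes a Gaussian response observed on a regular midpoint grid on [0, 1] and a cosine basis, which is orthogonal on that grid and for which the integrated squared second derivative penalty is diagonal, so each coefficient is shrunk separately. It also replaces the Newton-type optimization described above with a plain search over a grid of candidate values. The basis choice, grid search and function names are assumptions made for illustration. Note that this basis leaves only constants unpenalized, not straight lines.

```python
import math

def cosine_gcv(y, lam, K=20):
    """Penalized cosine-basis fit on a midpoint grid; returns (fitted, edf, gcv)."""
    n = len(y)
    x = [(i + 0.5) / n for i in range(n)]
    fitted = [0.0] * n
    edf = 0.0
    for k in range(K):
        b = [math.cos(k * math.pi * xi) for xi in x]
        btb = sum(v * v for v in b)        # n for k = 0, n / 2 otherwise
        pen = 0.5 * (k * math.pi) ** 4     # integral of (b_k'')^2 over [0, 1]
        coef = sum(bi * yi for bi, yi in zip(b, y)) / (btb + lam * pen)
        fitted = [f + coef * bi for f, bi in zip(fitted, b)]
        edf += btb / (btb + lam * pen)     # diagonal element of F for this term
    rss = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    return fitted, edf, n * rss / (n - edf) ** 2

def select_lambda(y, grid):
    """Pick the candidate smoothing parameter with the smallest GCV score."""
    return min(grid, key=lambda lam: cosine_gcv(y, lam)[2])
```

A typical call would pass a log-spaced grid, for example `select_lambda(y, [10.0 ** p for p in range(-10, 3)])`, reflecting the fact that smoothing parameters are usually optimized on the log scale.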

Software

Backfit GAMs were originally provided by the gam function in S, now ported to the R language as the gam package. The SAS proc GAM also provides backfit GAMs. The recommended package in R for GAMs is mgcv (''mixed GAM computational vehicle''), which is based on the reduced rank approach with automatic smoothing parameter selection. The SAS proc GAMPL is an alternative implementation. In Python, there is the InterpretML package, which implements a bagging and boosting approach. There are many alternative packages. Examples include the R packages mboost, which implements a boosting approach; gss, which provides the full spline smoothing methods; VGAM, which provides vector GAMs; and gamlss, which provides generalized additive models for location, scale and shape. BayesX and its R interface provide GAMs and extensions via MCMC and penalized likelihood methods. The INLA software implements a fully Bayesian approach based on Markov random field representations exploiting sparse matrix methods.

As an example of how models can be estimated in practice with software, consider R package mgcv. Suppose that our R workspace contains vectors ''y'', ''x'' and ''z'' and we want to estimate the model :

y_i = \beta_0 + f_1(x_i) + f_2(z_i) + \epsilon_i \text{ where } \epsilon_i \sim N(0,\sigma^2).

Within R we could issue the commands

library(mgcv)   # load the package
b = gam(y ~ s(x) + s(z))

In common with most R modelling functions gam expects a model formula to be supplied, specifying the model structure to fit. The response variable is given to the left of the ~ while the specification of the linear predictor is given to the right. gam sets up bases and penalties for the smooth terms, estimates the model including its smoothing parameters and, in standard R fashion, returns a ''fitted model object'', which can then be interrogated using various helper functions, such as summary, plot, predict, and AIC.

This simple example has used several default settings which it is important to be aware of. For example a Gaussian distribution and identity link have been assumed, and the smoothing parameter selection criterion was GCV. Also the smooth terms were represented using `penalized thin plate regression splines', and the basis dimension for each was set to 10 (implying a maximum of 9 degrees of freedom after identifiability constraints have been imposed). A second example illustrates how we can control these things. Suppose that we want to estimate the model :

y_i \sim \text{Poi}(\mu_i) \text{ where } \log \mu_i = \beta_0 + \beta_1 x_i + f_1(t_i) + f_2(v_i,w_i).

using REML smoothing parameter selection, and we expect

f_1

to be a relatively complicated function which we would like to model with a penalized cubic regression spline. For

f_2

we also have to decide whether

v

and

w

are naturally on the same scale so that an isotropic smoother such as a thin plate spline is appropriate (specified via `s(v,w)'), or whether they are really on different scales so that we need separate smoothing penalties and smoothing parameters for

v

and

w

as provided by a tensor product smoother. Suppose we opted for the latter in this case, then the following R code would estimate the model

b1 = gam(y ~ x + s(t,bs="cr",k=100) + te(v,w),family=poisson,method="REML")

which uses a basis size of 100 for the smooth of

t

. The specification of distribution and link function uses the `family' objects that are standard when fitting GLMs in R or S. Note that Gaussian random effects can also be added to the linear predictor. These examples are only intended to give a very basic flavour of the way that GAM software is used; for more detail refer to the software documentation for the various packages.

Model checking

As with any statistical model it is important to check the model assumptions of a GAM. Residual plots should be examined in the same way as for any GLM. That is, deviance residuals (or other standardized residuals) should be examined for patterns that might suggest a substantial violation of the independence or mean-variance assumptions of the model. This will usually involve plotting the standardized residuals against fitted values and covariates to look for mean-variance problems or missing pattern, and may also involve examining correlograms (ACFs) and/or variograms of the residuals to check for violation of independence. If the model mean-variance relationship is correct then scaled residuals should have roughly constant variance. Note that since GLMs and GAMs can be estimated using quasi-likelihood

, it follows that details of the distribution of the residuals beyond the mean-variance relationship are of relatively minor importance. One issue that is more common with GAMs than with other GLMs is a danger of falsely concluding that data are zero inflated. The difficulty arises when data contain many zeroes that can be modelled by a Poisson or binomial with a very low expected value: the flexibility of the GAM structure will often allow representation of a very low mean over some region of covariate space, but the distribution of standardized residuals will fail to look anything like the approximate normality that introductory GLM classes teach us to expect, even if the model is perfectly correct. The one extra check that GAMs introduce is the need to check that the degrees of freedom chosen are appropriate. This is particularly acute when using methods that do not automatically estimate the smoothness of model components. When using methods with automatic smoothing parameter selection then it is still necessary to check that the choice of basis dimension was not restrictively small, although if the effective degrees of freedom of a term estimate is comfortably below its basis dimension then this is unlikely. In any case, checking

f_j(x_j)

is based on examining pattern in the residuals with respect to

x_j

. This can be done using partial residuals overlaid on the plot of

\hat f_j(x_j)

, or using permutation of the residuals to construct tests for residual pattern (as in the `gam.check' function in R package `mgcv').
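A permutation check for leftover pattern in the residuals with respect to one covariate can be sketched as follows. This is only loosely in the spirit of the residual permutation idea mentioned above and is not the mgcv implementation: the test statistic (mean squared difference between residuals that are neighbours in covariate order), the number of permutations and the function names are assumptions made for illustration.

```python
import random

def neighbour_diff_stat(x, resid):
    """Mean squared difference between residuals adjacent in the ordering of x."""
    order = sorted(range(len(x)), key=lambda i: x[i])
    r = [resid[i] for i in order]
    return sum((a - b) ** 2 for a, b in zip(r, r[1:])) / (len(r) - 1)

def permutation_pvalue(x, resid, n_perm=500, seed=0):
    """Share of random reorderings whose statistic is as small as the observed one."""
    rng = random.Random(seed)
    observed = neighbour_diff_stat(x, resid)
    shuffled = list(resid)
    count = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if neighbour_diff_stat(x, shuffled) <= observed:
            count += 1
    return (count + 1) / (n_perm + 1)
```

A small p-value says that residuals at neighbouring covariate values are more alike than would be expected if no pattern remained, which may indicate that the basis dimension for that term was too small.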

Model selection

When smoothing parameters are estimated as part of model fitting then much of what would traditionally count as model selection has been absorbed into the fitting process: the smoothing parameters estimation has already selected between a rich family of models of different functional complexity. However smoothing parameter estimation does not typically remove a smooth term from the model altogether, because most penalties leave some functions un-penalized (e.g. straight lines are unpenalized by the spline derivative penalty given above). So the question of whether a term should be in the model at all remains. One simple approach to this issue is to add an extra penalty to each smooth term in the GAM, which penalizes the components of the smooth that would otherwise be unpenalized (and only those). Each extra penalty has its own smoothing parameter and estimation then proceeds as before, but now with the possibility that terms will be completely penalized to zero. In high dimensional settings then it may make more sense to attempt this task using the

lasso or elastic net regularization

. Boosting also performs term selection automatically as part of fitting. An alternative is to use traditional stepwise regression methods for model selection. This is also the default method when smoothing parameters are not estimated as part of fitting, in which case each smooth term is usually allowed to take one of a small set of pre-defined smoothness levels within the model, and these are selected between in a stepwise fashion. Stepwise methods operate by iteratively comparing models with or without particular model terms (or possibly with different levels of term complexity), and require measures of model fit or term significance in order to decide which model to select at each stage. For example, we might use p-values for testing each term for equality to zero to decide on candidate terms for removal from a model, and we might compare Akaike information criterion (AIC) values for alternative models.

P-value computation for smooths is not straightforward, because of the effects of penalization, but approximations are available. AIC can be computed in two ways for GAMs. The marginal AIC is based on the marginal likelihood (see above) with the model coefficients integrated out. In this case the AIC penalty is based on the number of smoothing parameters (and any variance parameters) in the model. However, because of the well known fact that REML is not comparable between models with different fixed effects structures, we cannot usually use such an AIC to compare models with different smooth terms (since their un-penalized components act like fixed effects). Basing AIC on the marginal likelihood in which only the penalized effects are integrated out is possible (the number of un-penalized coefficients now gets added to the parameter count for the AIC penalty), but this version of the marginal likelihood suffers from the tendency to oversmooth that provided the original motivation for developing REML. Given these problems GAMs are often compared using the conditional AIC, in which the model likelihood (not marginal likelihood) is used in the AIC, and the parameter count is taken as the effective degrees of freedom of the model. Naive versions of the conditional AIC have been shown to be much too likely to select larger models in some circumstances, a difficulty attributable to neglect of smoothing parameter uncertainty when computing the effective degrees of freedom; however, correcting the effective degrees of freedom for this problem restores reasonable performance.
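The naive conditional AIC described above is simple to compute once a model's log likelihood and effective degrees of freedom are available. The sketch below assumes a Gaussian response with the variance set to its maximum likelihood value, and it omits both the scale parameter from the parameter count and the correction for smoothing parameter uncertainty. The function names and the example numbers are hypothetical.

```python
import math

def gaussian_loglik(rss, n):
    """Maximized Gaussian log likelihood, with the variance estimated as rss / n."""
    return -0.5 * n * (math.log(2 * math.pi * rss / n) + 1)

def conditional_aic(loglik, edf):
    """Naive conditional AIC: the parameter count is the effective degrees of freedom."""
    return -2.0 * loglik + 2.0 * edf

# Hypothetical comparison of two fits to the same 100 observations.
aic_smooth = conditional_aic(gaussian_loglik(rss=12.0, n=100), edf=6.3)
aic_linear = conditional_aic(gaussian_loglik(rss=15.0, n=100), edf=2.0)
print(aic_smooth < aic_linear)
```

The model with the smaller value is preferred. As noted above, this uncorrected version can favour larger models too often.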

Caveats

Overfitting can be a problem with GAMs, especially if there is un-modelled residual auto-correlation or un-modelled overdispersion

. Cross-validation can be used to detect and/or reduce overfitting problems with GAMs (or other statistical methods), and software often allows the level of penalization to be increased to force smoother fits. Estimating very large numbers of smoothing parameters is also likely to be statistically challenging, and there are known tendencies for prediction error criteria (GCV, AIC etc.) to occasionally undersmooth substantially, particularly at moderate sample sizes, with REML being somewhat less problematic in this regard. Where appropriate, simpler models such as GLMs may be preferable to GAMs unless GAMs improve predictive ability substantially (in validation sets) for the application in question.

External links

gam, an R package for GAMs by backfitting.
statsmodels.gam, a Python module in statsmodels.
InterpretML, a Python package for fitting GAMs via bagging and boosting.
an R package for GAMs using penalized regression splines.
an R package for boosting including additive models.
an R package for smoothing spline ANOVA.
INLA, software for Bayesian Inference with GAMs and more.
software for MCMC and penalized likelihood approaches to GAMs.
Doing magic and analyzing seasonal time series with GAM in R
GAM: The Predictive Modeling Silver Bullet
Building GAM by projection descent
