Gradient boosting is a machine learning technique used in regression and classification tasks, among others. It gives a prediction model in the form of an ensemble of weak prediction models, which are typically decision trees. When a decision tree is the weak learner, the resulting algorithm is called gradient-boosted trees; it usually outperforms random forest. A gradient-boosted trees model is built in a stage-wise fashion as in other boosting methods, but it generalizes the other methods by allowing optimization of an arbitrary differentiable loss function.
History
The idea of gradient boosting originated in the observation by Leo Breiman that boosting can be interpreted as an optimization algorithm on a suitable cost function. Explicit regression gradient boosting algorithms were subsequently developed by Jerome H. Friedman, simultaneously with the more general functional gradient boosting perspective of Llew Mason, Jonathan Baxter, Peter Bartlett and Marcus Frean.
The latter two papers introduced the view of boosting algorithms as iterative ''functional gradient descent'' algorithms, that is, algorithms that optimize a cost function over function space by iteratively choosing a function (weak hypothesis) that points in the negative gradient direction. This functional gradient view of boosting has led to the development of boosting algorithms in many areas of machine learning and statistics beyond regression and classification.
Informal introduction
(This section follows the exposition of gradient boosting by Cheng Li.)
Like other boosting methods, gradient boosting combines weak "learners" into a single strong learner in an iterative fashion. It is easiest to explain in the least-squares regression setting, where the goal is to "teach" a model $F$ to predict values of the form $\hat{y} = F(x)$ by minimizing the mean squared error $\tfrac{1}{n}\sum_i (\hat{y}_i - y_i)^2$, where $i$ indexes over some training set of size $n$ of actual values of the output variable $y$:
* $\hat{y}_i$ = the predicted value $F(x_i)$
* $y_i$ = the observed value
* $n$ = the number of samples in $y$
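As a minimal numerical illustration of this objective (the array names and values below are illustrative assumptions, not part of the original exposition):

```python
import numpy as np

# Hypothetical observed values y_i and predictions y_hat_i for n = 4 samples.
y = np.array([3.0, -0.5, 2.0, 7.0])
y_hat = np.array([2.5, 0.0, 2.0, 8.0])

# Mean squared error: (1/n) * sum_i (y_hat_i - y_i)^2
mse = np.mean((y_hat - y) ** 2)
print(mse)  # 0.375
```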
Now, let us consider a gradient boosting algorithm with $M$ stages. At each stage $m$ ($1 \le m \le M$) of gradient boosting, suppose some imperfect model $F_m$ (for low $m$, this model may simply return $\hat{y}_i = \bar{y}$, where the RHS is the mean of $y$). In order to improve $F_m$, our algorithm should add some new estimator, $h_m(x)$. Thus,
: $F_{m+1}(x_i) = F_m(x_i) + h_m(x_i) = y_i$
or, equivalently,
: $h_m(x_i) = y_i - F_m(x_i)$.
Therefore, gradient boosting will fit $h_m$ to the ''residual'' $y_i - F_m(x_i)$. As in other boosting variants, each $F_{m+1}$ attempts to correct the errors of its predecessor $F_m$. A generalization of this idea to loss functions other than squared error, and to classification and ranking problems, follows from the observation that residuals $h_m(x_i)$ for a given model are proportional to the negative gradients of the mean squared error (MSE) loss function (with respect to $F(x_i)$):
: $L_{\rm MSE} = \frac{1}{n} \sum_{i=1}^{n} \left(y_i - F(x_i)\right)^2$
: $-\frac{\partial L_{\rm MSE}}{\partial F(x_i)} = \frac{2}{n}\left(y_i - F(x_i)\right) = \frac{2}{n}\, h_m(x_i)$.
So, gradient boosting could be specialized to a gradient descent algorithm, and generalizing it entails "plugging in" a different loss and its gradient.
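As a rough illustration of this residual-fitting loop, here is a minimal from-scratch sketch. It assumes scikit-learn's DecisionTreeRegressor as the weak learner, a squared-error loss, and an illustrative shrinkage (learning rate); it is a simplified sketch of the idea, not the exact algorithm given later in this article.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gradient_boost_fit(X, y, n_stages=100, learning_rate=0.1, max_depth=2):
    """Least-squares gradient boosting sketch: each stage fits a small tree
    to the current residuals y - F_m(x), i.e. (up to a constant factor) the
    negative gradient of the squared-error loss, and adds it to the ensemble."""
    f0 = np.mean(y)                               # stage 0: constant model, the mean of y
    pred = np.full(len(y), f0)
    trees = []
    for _ in range(n_stages):
        residuals = y - pred                      # pseudo-residuals for squared error
        tree = DecisionTreeRegressor(max_depth=max_depth)
        tree.fit(X, residuals)                    # h_m approximates the residuals
        pred += learning_rate * tree.predict(X)   # F_{m+1} = F_m + learning_rate * h_m
        trees.append(tree)
    return f0, trees

def gradient_boost_predict(X, f0, trees, learning_rate=0.1):
    pred = np.full(X.shape[0], f0)
    for tree in trees:
        pred += learning_rate * tree.predict(X)
    return pred

# Illustrative usage on synthetic data (assumed, not from the article):
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=200)
f0, trees = gradient_boost_fit(X, y)
print(np.mean((gradient_boost_predict(X, f0, trees) - y) ** 2))  # training MSE
```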
Algorithm
In many supervised learning problems there is an output variable $y$ and a vector of input variables $x$, related to each other via some joint probability distribution. The goal is to find some function $\hat{F}(x)$ that best approximates the output variable from the values of the input variables. This is formalized by introducing some loss function $L(y, F(x))$ and minimizing it in expectation:
: $\hat{F} = \underset{F}{\arg\min}\, \mathbb{E}_{x,y}[L(y, F(x))]$.
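In practice the joint distribution of $(x, y)$ is unknown, so the expectation is typically approximated by an average over the $n$ training examples, with the search restricted to some tractable class of functions $\mathcal{H}$ (a hedged restatement of the objective, not a formula from the original text):
: $\hat{F} \approx \underset{F \in \mathcal{H}}{\arg\min}\, \frac{1}{n} \sum_{i=1}^{n} L(y_i, F(x_i))$.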