Gradient boosting is a machine learning technique used in regression and classification tasks, among others. It gives a prediction model in the form of an ensemble of weak prediction models, which are typically decision trees. When a decision tree is the weak learner, the resulting algorithm is called gradient-boosted trees; it usually outperforms random forest.
A gradient-boosted trees model is built in a stage-wise fashion as in other boosting methods, but it generalizes the other methods by allowing optimization of an arbitrary differentiable loss function.
History
The idea of gradient boosting originated in the observation by
Leo Breiman that boosting can be interpreted as an optimization algorithm on a suitable cost function.
Explicit regression gradient boosting algorithms were subsequently developed by Jerome H. Friedman, simultaneously with the more general functional gradient boosting perspective of Llew Mason, Jonathan Baxter, Peter Bartlett and Marcus Frean.
The latter two papers introduced the view of boosting algorithms as iterative ''functional gradient descent'' algorithms; that is, algorithms that optimize a cost function over function space by iteratively choosing a function (weak hypothesis) that points in the negative gradient direction. This functional gradient view of boosting has led to the development of boosting algorithms in many areas of machine learning and statistics beyond regression and classification.
Informal introduction
(This section follows the exposition of gradient boosting by Cheng.)
Like other boosting methods, gradient boosting combines weak "learners" into a single strong learner in an iterative fashion. It is easiest to explain in the
least-squares regression setting, where the goal is to "teach" a model $F$ to predict values of the form $\hat{y} = F(x)$ by minimizing the mean squared error $\tfrac{1}{n}\sum_i (\hat{y}_i - y_i)^2$, where $i$ indexes over some training set of size $n$ of actual values of the output variable $y$:
* $\hat{y}_i =$ the predicted value $F(x_i)$
* $y_i =$ the observed value
* $n =$ the number of samples in $y$
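As a concrete illustration of this objective, the following minimal sketch (hypothetical data and variable names, using NumPy) evaluates the mean squared error for a set of predictions:
<syntaxhighlight lang="python">
import numpy as np

# Hypothetical example: observed targets y_i and model predictions F(x_i).
y = np.array([3.0, -0.5, 2.0, 7.0])      # observed values y_i
y_hat = np.array([2.5, 0.0, 2.0, 8.0])   # predicted values F(x_i)

n = len(y)                                # number of samples
mse = np.sum((y_hat - y) ** 2) / n        # (1/n) * sum_i (y_hat_i - y_i)^2
print(mse)                                # 0.375
</syntaxhighlight>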
Now, let us consider a gradient boosting algorithm with $M$ stages. At each stage $m$ ($1 \le m \le M$) of gradient boosting, suppose some imperfect model $F_m$ (for low $m$, this model may simply return $\hat{y}_i = \bar{y}$, where the RHS is the mean of $y$). In order to improve $F_m$, our algorithm should add some new estimator, $h_m(x)$. Thus,
: $F_{m+1}(x_i) = F_m(x_i) + h_m(x_i) = y_i$
or, equivalently,
: $h_m(x_i) = y_i - F_m(x_i)$.
Therefore, gradient boosting will fit $h_m$ to the ''residual'' $y_i - F_m(x_i)$. As in other boosting variants, each $F_{m+1}$ attempts to correct the errors of its predecessor $F_m$. A generalization of this idea to loss functions other than squared error, and to classification and ranking problems, follows from the observation that residuals $h_m(x_i)$ for a given model are proportional to the negative gradients of the mean squared error (MSE) loss function (with respect to $F(x_i)$):
: $L_{\rm MSE} = \frac{1}{n} \sum_{i=1}^{n} \left(y_i - F(x_i)\right)^2$
: $- \frac{\partial L_{\rm MSE}}{\partial F(x_i)} = \frac{2}{n}\left(y_i - F(x_i)\right) = \frac{2}{n} h_m(x_i)$.
So, gradient boosting could be specialized to a gradient descent algorithm, and generalizing it entails "plugging in" a different loss and its gradient.
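To make the residual-fitting loop concrete, here is a minimal sketch of least-squares gradient boosting in Python, assuming scikit-learn's DecisionTreeRegressor as the weak learner. The function names, the toy data, and the shrinkage factor learning_rate (a common practical refinement not used in the derivation above) are illustrative assumptions, not a reference implementation.
<syntaxhighlight lang="python">
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gradient_boost_ls(X, y, n_stages=100, learning_rate=0.1, max_depth=2):
    """Least-squares gradient boosting sketch: each stage fits a small tree
    to the current residuals y - F_m(x), which are (up to a constant factor)
    the negative gradients of the squared-error loss."""
    f0 = float(np.mean(y))                  # stage 0: constant model, the mean of y
    prediction = np.full(len(y), f0)
    trees = []
    for _ in range(n_stages):
        residuals = y - prediction          # pseudo-residuals that h_m should approximate
        tree = DecisionTreeRegressor(max_depth=max_depth)
        tree.fit(X, residuals)              # h_m is fitted to the residuals
        prediction += learning_rate * tree.predict(X)   # F_{m+1} = F_m + nu * h_m
        trees.append(tree)
    return f0, trees

def boosted_predict(f0, trees, X, learning_rate=0.1):
    pred = np.full(X.shape[0], f0)
    for tree in trees:
        pred += learning_rate * tree.predict(X)
    return pred

# Toy usage: recover a noisy sine curve.
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 6.0, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)
f0, trees = gradient_boost_ls(X, y)
print(np.mean((boosted_predict(f0, trees, X) - y) ** 2))  # training MSE shrinks with more stages
</syntaxhighlight>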
Algorithm
In many
supervised learning problems there is an output variable $y$ and a vector of input variables $x$, related to each other with some probabilistic distribution. The goal is to find some function $\hat{F}(x)$ that best approximates the output variable from the values of input variables. This is formalized by introducing some loss function $L(y, F(x))$ and minimizing it in expectation:
: $\hat{F} = \underset{F}{\arg\min} \, \mathbb{E}_{x,y}[L(y, F(x))]$.
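In practice the expectation over the joint distribution is unknown, so it is approximated by the average loss over a training sample. The sketch below (illustrative names; squared-error loss chosen only as an example of a differentiable loss) shows this empirical surrogate evaluated for a constant model:
<syntaxhighlight lang="python">
import numpy as np

def squared_error(y, f):
    """L(y, F(x)) = (y - F(x))^2, one common differentiable choice of loss."""
    return (y - f) ** 2

def empirical_risk(loss, y, predictions):
    """Approximate E_{x,y}[L(y, F(x))] by the mean loss over a training sample."""
    return np.mean(loss(y, predictions))

# Example: evaluate a constant model F(x) = mean(y), the usual starting point.
y = np.array([1.0, 2.0, 4.0, 7.0])
constant_prediction = np.full_like(y, np.mean(y))
print(empirical_risk(squared_error, y, constant_prediction))  # 5.25
</syntaxhighlight>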