Gradient boosting is a
machine learning technique based on
boosting in a functional space, where the target is ''pseudo-residuals'' instead of
residuals as in traditional boosting. It gives a prediction model in the form of an
ensemble of weak prediction models, i.e., models that make very few assumptions about the data, which are typically simple
decision trees.
When a decision tree is the weak learner, the resulting algorithm is called gradient-boosted trees; it usually outperforms
random forest.
As with other
boosting methods, a gradient-boosted trees model is built in stages, but it generalizes the other methods by allowing optimization of an arbitrary
differentiable loss function.
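In practice, gradient-boosted trees are usually trained with an off-the-shelf library rather than implemented by hand. As an illustration only, the sketch below uses scikit-learn's GradientBoostingRegressor on a synthetic dataset; the dataset and hyperparameter values are arbitrary choices for demonstration, not recommendations.
<syntaxhighlight lang="python">
# Illustrative use of gradient-boosted trees via scikit-learn's
# GradientBoostingRegressor; data and hyperparameters are arbitrary.
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

X, y = make_regression(n_samples=1000, n_features=10, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = GradientBoostingRegressor(
    n_estimators=200,   # number of boosting stages
    learning_rate=0.1,  # shrinkage applied to each stage's contribution
    max_depth=3,        # depth of each weak learner (a small decision tree)
)
model.fit(X_train, y_train)
print(mean_squared_error(y_test, model.predict(X_test)))
</syntaxhighlight>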
History
The idea of gradient boosting originated in the observation by
Leo Breiman
that boosting can be interpreted as an optimization algorithm on a suitable cost function.
Explicit regression gradient boosting algorithms were subsequently developed by
Jerome H. Friedman
(in 1999 and later in 2001) simultaneously with the more general functional gradient boosting perspective of Llew Mason, Jonathan Baxter, Peter Bartlett and Marcus Frean.
The latter two papers introduced the view of boosting algorithms as iterative ''functional gradient descent'' algorithms, that is, algorithms that optimize a cost function over function space by iteratively choosing a function (weak hypothesis) that points in the negative gradient direction. This functional gradient view of boosting has led to the development of boosting algorithms in many areas of machine learning and statistics beyond regression and classification.
Informal introduction
(This section follows the exposition by Cheng Li.)
Like other boosting methods, gradient boosting combines weak "learners" into a single strong learner iteratively. It is easiest to explain in the
least-squares regression setting, where the goal is to teach a model <math>F</math> to predict values of the form <math>\hat{y} = F(x)</math> by minimizing the mean squared error <math>\tfrac{1}{n}\sum_i (\hat{y}_i - y_i)^2</math>, where <math>i</math> indexes over some training set of size <math>n</math> of actual values of the output variable <math>y</math>:
* <math>\hat{y}_i =</math> the predicted value <math>F(x_i)</math>
* <math>y_i =</math> the observed value
* <math>n =</math> the number of samples in <math>y</math>
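As a concrete illustration of this notation, the short snippet below computes the mean squared error for a few made-up observed values <math>y_i</math> and predictions <math>\hat{y}_i</math> (so <math>n = 4</math>).
<syntaxhighlight lang="python">
# Tiny numeric illustration of the mean squared error defined above,
# with made-up observed values y and predictions y_hat (n = 4).
import numpy as np

y = np.array([3.0, -0.5, 2.0, 7.0])      # observed values y_i
y_hat = np.array([2.5, 0.0, 2.0, 8.0])   # predicted values F(x_i)
mse = np.mean((y_hat - y) ** 2)          # (1/n) * sum((y_hat_i - y_i)^2)
print(mse)                               # 0.375
</syntaxhighlight>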
If the algorithm has <math>M</math> stages, at each stage <math>m</math> (<math>1 \le m \le M</math>), suppose some imperfect model <math>F_m</math> (for low <math>m</math>, this model may simply predict <math>\hat{y}_i</math> to be <math>\bar{y}</math>, the mean of <math>y</math>). In order to improve <math>F_m</math>, our algorithm should add some new estimator, <math>h_m(x)</math>. Thus,
: <math>F_{m+1}(x_i) = F_m(x_i) + h_m(x_i) = y_i</math>
or, equivalently,
: <math>h_m(x_i) = y_i - F_m(x_i)</math>.
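This update can be turned directly into a short program. The sketch below is a minimal from-scratch version of least-squares gradient boosting, assuming small scikit-learn regression trees as the weak learners and arbitrary synthetic data; it starts from the mean of <math>y</math> and repeatedly fits <math>h_m</math> to the current residuals.
<syntaxhighlight lang="python">
# Minimal sketch of least-squares gradient boosting: start from the mean of y
# and repeatedly fit a small regression tree h_m to the residuals y - F_m(x).
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

M = 50                                  # number of boosting stages
F = np.full_like(y, y.mean())           # F_0: predict the mean of y
trees = []                              # kept so new points can later be scored as F_0 + sum of h_m(x)
for m in range(M):
    residuals = y - F                   # h_m is fit to y_i - F_m(x_i)
    h = DecisionTreeRegressor(max_depth=2).fit(X, residuals)
    trees.append(h)
    F = F + h.predict(X)                # F_{m+1}(x_i) = F_m(x_i) + h_m(x_i)

print(np.mean((y - F) ** 2))            # training MSE shrinks as M grows
</syntaxhighlight>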
Therefore, gradient boosting will fit <math>h_m</math> to the ''residual'' <math>y_i - F_m(x_i)</math>. As in other boosting variants, each <math>F_{m+1}</math> attempts to correct the errors of its predecessor <math>F_m</math>. A generalization of this idea to loss functions other than squared error, and to classification and ranking problems, follows from the observation that residuals <math>h_m(x_i)</math> for a given model are proportional to the negative gradients of the mean squared error (MSE) loss function (with respect to <math>F(x_i)</math>):
: <math>L_{\rm MSE} = \frac{1}{n}\sum_{i=1}^n \left(y_i - F(x_i)\right)^2</math>
: <math>-\frac{\partial L_{\rm MSE}}{\partial F(x_i)} = \frac{2}{n}\left(y_i - F(x_i)\right) = \frac{2}{n}h_m(x_i)</math>.
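This proportionality can be checked numerically. The snippet below compares the analytic expression <math>\tfrac{2}{n}(y_i - F(x_i))</math> with a finite-difference estimate of the negative gradient of the MSE loss, using made-up values for <math>y</math> and <math>F</math>.
<syntaxhighlight lang="python">
# Numeric check: the negative gradient of the MSE loss with respect to F(x_i)
# equals (2/n)(y_i - F(x_i)), i.e. it is proportional to the residual.
import numpy as np

y = np.array([3.0, -0.5, 2.0, 7.0])
F = np.array([2.5, 0.0, 2.0, 8.0])
n = len(y)

loss = lambda F_vals: np.mean((y - F_vals) ** 2)

analytic = (2.0 / n) * (y - F)          # analytic negative gradient
eps = 1e-6
numeric = np.empty(n)                   # finite-difference estimate
for i in range(n):
    Fp = F.copy()
    Fp[i] += eps
    numeric[i] = -(loss(Fp) - loss(F)) / eps

print(np.allclose(analytic, numeric, atol=1e-4))   # True
</syntaxhighlight>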
So, gradient boosting could be generalized to a
gradient descent algorithm by plugging in a different loss and its gradient.
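To make that generalization concrete, the sketch below repeats the boosting loop but fits each weak learner to the negative gradient of the loss (the ''pseudo-residuals'') instead of the raw residuals, using the absolute-error loss <math>L(y, F) = |y - F|</math> as an example; the step size and data are arbitrary illustrative choices.
<syntaxhighlight lang="python">
# Sketch of generalized gradient boosting: at each stage, fit the weak learner
# to the negative gradient of the chosen loss (the pseudo-residuals).
# Example loss: L(y, F) = |y - F|, whose negative gradient is sign(y - F).
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(1)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

def neg_gradient(y, F):
    # -dL/dF for the absolute-error loss
    return np.sign(y - F)

M, nu = 100, 0.1                         # stages and step size (arbitrary)
F = np.full_like(y, np.median(y))        # a sensible start for absolute error
for m in range(M):
    pseudo_residuals = neg_gradient(y, F)
    h = DecisionTreeRegressor(max_depth=2).fit(X, pseudo_residuals)
    F = F + nu * h.predict(X)            # small step in the negative-gradient direction

print(np.mean(np.abs(y - F)))            # training absolute error decreases with M
</syntaxhighlight>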
Algorithm
Many
supervised learning problems involve an output variable <math>y</math> and a vector of input variables <math>x</math>, related to each other with some probabilistic distribution. The goal is to find some function <math>\hat{F}(x)</math> that best approximates the output variable from the values of input variables. This is formalized by introducing some loss function <math>L(y, F(x))</math> and minimizing it in expectation:
: <math>\hat{F} = \underset{F}{\arg\min}\, \mathbb{E}_{x,y}[L(y, F(x))]</math>.