Gradient boosting is a machine learning technique used in regression and classification tasks, among others. It gives a prediction model in the form of an ensemble of weak prediction models, which are typically decision trees. When a decision tree is the weak learner, the resulting algorithm is called gradient-boosted trees; it usually outperforms random forest.
A gradient-boosted trees model is built in a stage-wise fashion as in other boosting methods, but it generalizes the other methods by allowing optimization of an arbitrary differentiable loss function.
History
The idea of gradient boosting originated in the observation by Leo Breiman that boosting can be interpreted as an optimization algorithm on a suitable cost function. Explicit regression gradient boosting algorithms were subsequently developed by Jerome H. Friedman, simultaneously with the more general functional gradient boosting perspective of Llew Mason, Jonathan Baxter, Peter Bartlett and Marcus Frean.
The latter two papers introduced the view of boosting algorithms as iterative ''functional gradient descent'' algorithms: algorithms that optimize a cost function over function space by iteratively choosing a function (weak hypothesis) that points in the negative gradient direction. This functional gradient view of boosting has led to the development of boosting algorithms in many areas of machine learning and statistics beyond regression and classification.
Informal introduction
(This section follows the exposition of gradient boosting by Cheng.)
Like other boosting methods, gradient boosting combines weak "learners" into a single strong learner in an iterative fashion. It is easiest to explain in the least-squares regression setting, where the goal is to "teach" a model $F$ to predict values of the form $\hat{y} = F(x)$ by minimizing the mean squared error $\tfrac{1}{n}\sum_i (\hat{y}_i - y_i)^2$, where $i$ indexes over some training set of size $n$ of actual values of the output variable $y$:
* $\hat{y}_i =$ the predicted value $F(x_i)$
* $y_i =$ the observed value
* $n =$ the number of samples in $y$
Now, let us consider a gradient boosting algorithm with $M$ stages. At each stage $m$ ($1 \le m \le M$) of gradient boosting, suppose some imperfect model $F_m$ (for low $m$, this model may simply return $\hat{y}_i = \bar{y}$, where the RHS is the mean of $y$). In order to improve $F_m$, our algorithm should add some new estimator, $h_m(x)$. Thus,
:$F_{m+1}(x_i) = F_m(x_i) + h_m(x_i) = y_i$
or, equivalently,
:$h_m(x_i) = y_i - F_m(x_i)$.
Therefore, gradient boosting will fit $h_m$ to the ''residual'' $y_i - F_m(x_i)$. As in other boosting variants, each $F_{m+1}$ attempts to correct the errors of its predecessor $F_m$. A generalization of this idea to loss functions other than squared error, and to classification and ranking problems, follows from the observation that residuals $h_m(x_i)$ for a given model are proportional to the negative gradients of the mean squared error (MSE) loss function (with respect to $F(x_i)$):
:$L_{\rm MSE} = \frac{1}{n} \sum_{i=1}^n \left(y_i - F(x_i)\right)^2$
:$-\frac{\partial L_{\rm MSE}}{\partial F(x_i)} = \frac{2}{n}\left(y_i - F(x_i)\right) = \frac{2}{n} h_m(x_i)$.
So, gradient boosting could be specialized to a gradient descent algorithm, and generalizing it entails "plugging in" a different loss and its gradient.
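The residual-fitting loop described above can be sketched in a few lines of Python. This is only an illustrative sketch, not a production implementation: the synthetic data, the stage count $M$, the shrinkage rate $\nu$, and the use of depth-1 regression "stumps" as weak learners are all hypothetical choices made for the example.

```python
# Illustrative sketch of least-squares gradient boosting with stump learners.
import numpy as np

def fit_stump(x, r):
    # Brute-force the split threshold t minimizing squared error against r.
    best_sse, best = np.inf, None
    for t in np.unique(x)[:-1]:          # exclude max so both sides are nonempty
        left, right = r[x <= t], r[x > t]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if sse < best_sse:
            best_sse, best = sse, (t, left.mean(), right.mean())
    return best

def stump_predict(x, stump):
    t, left_value, right_value = stump
    return np.where(x <= t, left_value, right_value)

rng = np.random.default_rng(0)
x = rng.uniform(0, 6, 200)
y = np.sin(x) + rng.normal(scale=0.1, size=200)

M, nu = 50, 0.3                          # stages and shrinkage (learning rate)
F = np.full_like(y, y.mean())            # F_0 simply returns the mean of y
stumps = []
for m in range(M):
    residual = y - F                     # h_m is fit to the residual y_i - F_m(x_i)
    stump = fit_stump(x, residual)
    F = F + nu * stump_predict(x, stump) # F_{m+1} = F_m + nu * h_m
    stumps.append(stump)
```

Each stage fits the weak learner to the current residuals (the negative MSE gradient, up to a constant factor), so the training error decreases monotonically; the shrinkage factor `nu` is the usual regularization that trades training speed for generalization.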
Algorithm
In many supervised learning problems there is an output variable $y$ and a vector of input variables $x$, related to each other with some probabilistic distribution. The goal is to find some function $\hat{F}(x)$ that best approximates the output variable from the values of input variables. This is formalized by introducing some loss function $L(y, F(x))$ and minimizing it in expectation:
:$\hat{F} = \underset{F}{\arg\min} \, \mathbb{E}_{x,y}[L(y, F(x))]$.
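In practice the joint distribution of $(x, y)$ is unknown, so this expectation cannot be evaluated exactly; the standard step, stated here for concreteness, is to minimize the empirical average of the loss over the $n$ training samples instead:
:$\hat{F} = \underset{F}{\arg\min} \, \frac{1}{n} \sum_{i=1}^n L(y_i, F(x_i))$.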