Gradient boosting is a machine learning technique used in regression and classification tasks, among others. It gives a prediction model in the form of an ensemble of weak prediction models, which are typically decision trees. When a decision tree is the weak learner, the resulting algorithm is called gradient-boosted trees; it usually outperforms random forest.
A gradient-boosted trees model is built in a stage-wise fashion as in other boosting methods, but it generalizes the other methods by allowing optimization of an arbitrary differentiable loss function.
History
The idea of gradient boosting originated in the observation by Leo Breiman that boosting can be interpreted as an optimization algorithm on a suitable cost function. Explicit regression gradient boosting algorithms were subsequently developed by Jerome H. Friedman, simultaneously with the more general functional gradient boosting perspective of Llew Mason, Jonathan Baxter, Peter Bartlett and Marcus Frean.
The latter two papers introduced the view of boosting algorithms as iterative ''functional gradient descent'' algorithms: algorithms that optimize a cost function over function space by iteratively choosing a function (weak hypothesis) that points in the negative gradient direction. This functional gradient view of boosting has led to the development of boosting algorithms in many areas of machine learning and statistics beyond regression and classification.
Informal introduction
(This section follows the exposition of gradient boosting by Cheng.)
Like other boosting methods, gradient boosting combines weak "learners" into a single strong learner in an iterative fashion. It is easiest to explain in the least-squares regression setting, where the goal is to "teach" a model $F$ to predict values of the form $\hat{y} = F(x)$ by minimizing the mean squared error $\tfrac{1}{n}\sum_i (\hat{y}_i - y_i)^2$, where $i$ indexes over some training set of size $n$ of actual values of the output variable $y$:
* $\hat{y}_i =$ the predicted value $F(x_i)$
* $y_i =$ the observed value
* $n =$ the number of samples in $y$
Now, let us consider a gradient boosting algorithm with $M$ stages. At each stage $m$ ($1 \le m \le M$) of gradient boosting, suppose some imperfect model $F_m$ (for low $m$, this model may simply return $\hat{y}_i = \bar{y}$, where the RHS is the mean of $y$). In order to improve $F_m$, our algorithm should add some new estimator, $h_m(x)$. Thus,
:$F_{m+1}(x_i) = F_m(x_i) + h_m(x_i) = y_i$
or, equivalently,
:$h_m(x_i) = y_i - F_m(x_i)$.
Therefore, gradient boosting will fit $h_m$ to the ''residual'' $y_i - F_m(x_i)$. As in other boosting variants, each $F_{m+1}$ attempts to correct the errors of its predecessor $F_m$. A generalization of this idea to loss functions other than squared error, and to classification and ranking problems, follows from the observation that residuals $h_m(x_i)$ for a given model are proportional to the negative gradients of the mean squared error (MSE) loss function (with respect to $F(x_i)$):
:$L_{\rm MSE} = \frac{1}{n} \sum_{i=1}^n \left(y_i - F(x_i)\right)^2$
:$-\frac{\partial L_{\rm MSE}}{\partial F(x_i)} = \frac{2}{n}\left(y_i - F(x_i)\right) = \frac{2}{n} h_m(x_i)$.
So, gradient boosting could be specialized to a gradient descent algorithm, and generalizing it entails "plugging in" a different loss and its gradient.
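The residual-fitting loop described above can be sketched in a few lines of Python. This is only an illustrative sketch, not a production implementation: the synthetic data, the stage count $M$, the shrinkage rate $\nu$, and the use of depth-1 regression "stumps" as weak learners are all hypothetical choices made for the example.

```python
# Illustrative sketch of least-squares gradient boosting with stump learners.
import numpy as np

def fit_stump(x, r):
    # Brute-force the split threshold t minimizing squared error against r.
    best_sse, best = np.inf, None
    for t in np.unique(x)[:-1]:          # exclude max so both sides are nonempty
        left, right = r[x <= t], r[x > t]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if sse < best_sse:
            best_sse, best = sse, (t, left.mean(), right.mean())
    return best

def stump_predict(x, stump):
    t, left_value, right_value = stump
    return np.where(x <= t, left_value, right_value)

rng = np.random.default_rng(0)
x = rng.uniform(0, 6, 200)
y = np.sin(x) + rng.normal(scale=0.1, size=200)

M, nu = 50, 0.3                          # stages and shrinkage (learning rate)
F = np.full_like(y, y.mean())            # F_0 simply returns the mean of y
stumps = []
for m in range(M):
    residual = y - F                     # h_m is fit to the residual y_i - F_m(x_i)
    stump = fit_stump(x, residual)
    F = F + nu * stump_predict(x, stump) # F_{m+1} = F_m + nu * h_m
    stumps.append(stump)
```

Each stage fits the weak learner to the current residuals (the negative MSE gradient, up to a constant factor), so the training error decreases monotonically; the shrinkage factor `nu` is the usual regularization that trades training speed for generalization.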
Algorithm
In many supervised learning problems there is an output variable $y$ and a vector of input variables $x$, related to each other with some probabilistic distribution. The goal is to find some function $\hat{F}(x)$ that best approximates the output variable from the values of input variables. This is formalized by introducing some loss function $L(y, F(x))$ and minimizing it in expectation:
:$\hat{F} = \underset{F}{\arg\min} \, \mathbb{E}_{x,y}[L(y, F(x))]$.
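In practice the joint distribution of $(x, y)$ is unknown, so this expectation cannot be evaluated exactly; the standard step, stated here for concreteness, is to minimize the empirical average of the loss over the $n$ training samples instead:
:$\hat{F} = \underset{F}{\arg\min} \, \frac{1}{n} \sum_{i=1}^n L(y_i, F(x_i))$.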