AdaBoost, short for ''Adaptive Boosting'', is a statistical classification meta-algorithm formulated by Yoav Freund and Robert Schapire in 1995, who won the 2003 Gödel Prize for their work. It can be used in conjunction with many other types of learning algorithms to improve performance. The output of the other learning algorithms ('weak learners') is combined into a weighted sum that represents the final output of the boosted classifier. Usually, AdaBoost is presented for binary classification, although it can be generalized to multiple classes or bounded intervals on the real line.
AdaBoost is adaptive in the sense that subsequent weak learners are tweaked in favor of those instances misclassified by previous classifiers. In some problems it can be less susceptible to the overfitting problem than other learning algorithms. The individual learners can be weak, but as long as the performance of each one is slightly better than random guessing, the final model can be proven to converge to a strong learner.
Although AdaBoost is typically used to combine weak base learners (such as decision stumps), it has been shown that it can also effectively combine strong base learners (such as deep decision trees), producing an even more accurate model.
Every learning algorithm tends to suit some problem types better than others, and typically has many different parameters and configurations to adjust before it achieves optimal performance on a dataset. AdaBoost (with decision trees as the weak learners) is often referred to as the best out-of-the-box classifier. When used with decision tree learning, information gathered at each stage of the AdaBoost algorithm about the relative 'hardness' of each training sample is fed into the tree-growing algorithm, such that later trees tend to focus on harder-to-classify examples.
Training
AdaBoost refers to a particular method of training a boosted classifier. A boosted classifier is a classifier of the form
:F_T(x) = \sum_{t=1}^{T} f_t(x)
where each f_t is a weak learner that takes an object x as input and returns a value indicating the class of the object. For example, in the two-class problem, the sign of the weak learner's output identifies the predicted object class and the absolute value gives the confidence in that classification. Similarly, the T-th classifier is positive if the sample is in a positive class and negative otherwise.

Each weak learner produces an output hypothesis h(x_i) which fixes a prediction for each sample in the training set. At each iteration m, a weak learner is selected and assigned a coefficient \alpha_m such that the total training error E_m of the resulting m-stage boosted classifier is minimized.
:E_m = \sum_{i=1}^{N} E\left[F_{m-1}(x_i) + \alpha_m h(x_i)\right]

Suppose each sample x_i has an associated class y_i \in \{-1, 1\} and each weak classifier outputs k_m(x_i) \in \{-1, 1\}. Writing the m-stage classifier as C_m(x_i) = C_{m-1}(x_i) + \alpha_m k_m(x_i) and taking the error function to be the exponential loss, the total error is
:E = \sum_{i=1}^{N} e^{-y_i C_m(x_i)} = \sum_{i=1}^{N} w_i^{(m)} e^{-y_i \alpha_m k_m(x_i)},
where w_i^{(1)} = 1 and w_i^{(m)} = e^{-y_i C_{m-1}(x_i)} for m > 1. Splitting the sum between the points correctly classified by k_m (those with y_i = k_m(x_i)) and those misclassified,
:E = \sum_{y_i = k_m(x_i)} w_i^{(m)} e^{-\alpha_m} + \sum_{y_i \neq k_m(x_i)} w_i^{(m)} e^{\alpha_m} = \sum_{i=1}^{N} w_i^{(m)} e^{-\alpha_m} + \sum_{y_i \neq k_m(x_i)} w_i^{(m)} \left(e^{\alpha_m} - e^{-\alpha_m}\right).
Since the only part of the right-hand side that depends on k_m is \sum_{y_i \neq k_m(x_i)} w_i^{(m)}, the k_m that minimizes E is the one that minimizes \sum_{y_i \neq k_m(x_i)} w_i^{(m)} (assuming that \alpha_m > 0), i.e. the weak classifier with the lowest weighted error (with weights w_i^{(m)} = e^{-y_i C_{m-1}(x_i)}).
To determine the desired weight \alpha_m that minimizes E with the k_m that we just determined, we differentiate:
:\frac{dE}{d\alpha_m} = \frac{d\left(\sum_{y_i = k_m(x_i)} w_i^{(m)} e^{-\alpha_m} + \sum_{y_i \neq k_m(x_i)} w_i^{(m)} e^{\alpha_m}\right)}{d\alpha_m}
Setting this to zero and solving for \alpha_m yields:
:\alpha_m = \frac{1}{2} \ln\left(\frac{\sum_{y_i = k_m(x_i)} w_i^{(m)}}{\sum_{y_i \neq k_m(x_i)} w_i^{(m)}}\right)
We calculate the weighted error rate of the weak classifier to be \epsilon_m = \sum_{y_i \neq k_m(x_i)} w_i^{(m)} / \sum_{i=1}^{N} w_i^{(m)}, so it follows that:
:\alpha_m = \frac{1}{2} \ln\left(\frac{1 - \epsilon_m}{\epsilon_m}\right)
which is the negative logit function multiplied by 0.5. Due to the convexity of E as a function of \alpha_m, this new expression for \alpha_m gives the global minimum of the loss function.
Note: This derivation only applies when k_m(x_i) \in \{-1, 1\}, though it can be a good starting guess in other cases, such as when the weak learner is biased (k_m(x) \in \{a, b\}, a \neq -b), has multiple leaves (k_m(x) \in \{a, b, \dots, n\}) or is some other function k_m(x) \in \mathbb{R}.
Thus we have derived the AdaBoost algorithm: At each iteration, choose the classifier k_m that minimizes the total weighted error \sum_{y_i \neq k_m(x_i)} w_i^{(m)}, use this to calculate the error rate \epsilon_m = \sum_{y_i \neq k_m(x_i)} w_i^{(m)} / \sum_{i=1}^{N} w_i^{(m)}, use this to calculate the weight \alpha_m = \frac{1}{2} \ln\left(\frac{1 - \epsilon_m}{\epsilon_m}\right), and finally use this to improve the boosted classifier C_{m-1} to C_m = C_{m-1} + \alpha_m k_m.
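The closed-form stage weight can be checked numerically. The sketch below (toy weights and a brute-force grid search, both chosen purely for illustration) verifies that \alpha_m = \frac{1}{2}\ln\left(\frac{1 - \epsilon_m}{\epsilon_m}\right) attains the minimum of the stage-wise exponential loss:

```python
import math

def exp_loss(alpha, weights, correct):
    # E(alpha) = sum of w_i * exp(-alpha) over correctly classified points
    #          + sum of w_i * exp(+alpha) over misclassified points
    return sum(w * math.exp(-alpha if c else alpha)
               for w, c in zip(weights, correct))

# Toy weights and a hypothetical weak learner's per-sample correctness.
weights = [0.1, 0.2, 0.3, 0.4]
correct = [True, True, False, True]   # one misclassified point

eps = sum(w for w, c in zip(weights, correct) if not c) / sum(weights)
alpha_star = 0.5 * math.log((1 - eps) / eps)   # closed-form minimizer

# Brute-force check: no alpha on a fine grid achieves a lower loss.
best = min(exp_loss(a / 1000.0, weights, correct) for a in range(-3000, 3001))
assert abs(exp_loss(alpha_star, weights, correct) - best) < 1e-6
```

Because the loss is convex in \alpha_m, the stationary point found by the derivative above is the unique global minimum, which is what the grid search confirms.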
Statistical understanding of boosting
Boosting is a form of linear regression in which the features of each sample x_i are the outputs of some weak learner h applied to x_i.
While regression tries to fit F(x) to y(x) as precisely as possible without loss of generalization, typically using least squares error E(f) = (y(x) - f(x))^2, the AdaBoost error function E(f) = e^{-y(x)f(x)} takes into account the fact that only the sign of the final result is used; thus |F(x)| can be far larger than 1 without increasing the error. However, the error for sample x_i increases exponentially as -y(x_i)f(x_i) increases, resulting in excessive weights being assigned to outliers.
One feature of the choice of exponential error function is that the error of the final additive model is the product of the error of each stage, that is,
:e^{-y \sum_t f_t(x)} = \prod_t e^{-y f_t(x)}.
Thus it can be seen that the weight update in the AdaBoost algorithm is equivalent to recalculating the error on F_t(x) after each stage.
There is a lot of flexibility allowed in the choice of loss function. As long as the loss function is monotonic and continuously differentiable, the classifier is always driven toward purer solutions.
Zhang (2004) provides a loss function based on least squares, a modified Huber loss function:
:\phi(y, f(x)) = \begin{cases} -4 y f(x) & \text{if } y f(x) < -1, \\ (y f(x) - 1)^2 & \text{if } -1 \le y f(x) \le 1, \\ 0 & \text{if } y f(x) > 1. \end{cases}
This function is better behaved than LogitBoost for f(x) close to 1 or −1, does not penalise 'overconfident' predictions (y f(x) > 1), unlike unmodified least squares, and only penalises samples misclassified with confidence greater than 1 linearly, as opposed to quadratically or exponentially, making it less susceptible to the effects of outliers.
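The piecewise loss above can be transcribed directly. The sketch below (the function name and spot-check values are illustrative choices) shows its three regimes: linear growth for confident mistakes, quadratic behaviour near the decision boundary, and zero penalty for overconfident correct predictions:

```python
def phi(y, fx):
    # Zhang's modified-Huber-style loss, piecewise in the margin y*f(x).
    m = y * fx
    if m < -1:
        return -4 * m          # linear penalty for confident mistakes
    if m <= 1:
        return (m - 1) ** 2    # quadratic penalty near the boundary
    return 0.0                 # no penalty for 'overconfident' correct predictions

assert phi(1, -2.0) == 8.0     # margin -2: linear region, -4 * (-2)
assert phi(1, 0.0) == 1.0      # margin 0: quadratic region, (0 - 1)^2
assert phi(1, 2.0) == 0.0      # margin 2: zero region
```

Note the pieces meet continuously at the breakpoints: at margin −1 both the linear and quadratic branches give 4, and at margin 1 both the quadratic and zero branches give 0.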
Boosting as gradient descent
Boosting can be seen as minimization of a convex loss function over a convex set of functions. Specifically, the loss being minimized by AdaBoost is the exponential loss
:\sum_i \phi(i, y, f) = \sum_i e^{-y_i f(x_i)},
whereas LogitBoost performs logistic regression, minimizing
:\sum_i \phi(i, y, f) = \sum_i \ln\left(1 + e^{-y_i f(x_i)}\right).
In the gradient descent analogy, the output of the classifier for each training point is considered a point \left(F_t(x_1), \dots, F_t(x_n)\right) in n-dimensional space, where each axis corresponds to a training sample, each weak learner h(x) corresponds to a vector of fixed orientation and length, and the goal is to reach the target point (y_1, \dots, y_n) (or any region where the value of the loss function E_T(x_1, \dots, x_n) is less than the value at that point) in the fewest steps. Thus AdaBoost algorithms perform either Cauchy (find h(x) with the steepest gradient, choose \alpha to minimize test error) or Newton (choose some target point, find \alpha h(x) that brings F_t closest to that point) optimization of training error.
Example algorithm (Discrete AdaBoost)
With:
* Samples x_1 \dots x_n
* Desired outputs y_1 \dots y_n, y \in \{-1, 1\}
* Initial weights w_{1,1} \dots w_{n,1} set to \frac{1}{n}
* Error function E(f(x), y, i) = e^{-y_i f(x_i)}
* Weak learners h\colon x \rightarrow \{-1, 1\}
For t in 1 \dots T:
* Choose h_t(x):
** Find weak learner h_t(x) that minimizes \epsilon_t, the weighted sum error for misclassified points \epsilon_t = \sum_{h_t(x_i) \neq y_i} w_{i,t}
** Choose \alpha_t = \frac{1}{2} \ln\left(\frac{1 - \epsilon_t}{\epsilon_t}\right)
* Add to ensemble:
** F_t(x) = F_{t-1}(x) + \alpha_t h_t(x)
* Update weights:
** w_{i,t+1} = w_{i,t} e^{-y_i \alpha_t h_t(x_i)} for i in 1 \dots n
** Renormalize w_{i,t+1} such that \sum_i w_{i,t+1} = 1
** (Note: It can be shown that \sum_{h_t(x_i) = y_i} w_{i,t+1} = \sum_{h_t(x_i) \neq y_i} w_{i,t+1} at every step, which can simplify the calculation of the new weights.)
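The steps above can be sketched as a minimal implementation. The weak learners here are 1-D threshold stumps, and \epsilon_t is clamped away from 0 and 1 to keep the logarithm finite; both choices are illustrative, not part of the algorithm statement above:

```python
import math

def train_adaboost(X, y, T):
    """Discrete AdaBoost on 1-D samples with threshold stumps.
    X: list of floats, y: list of labels in {-1, 1}."""
    n = len(X)
    w = [1.0 / n] * n                          # initial weights 1/n
    ensemble = []                              # list of (alpha, threshold, sign)
    thresholds = sorted(set(X))                # candidate stump thresholds
    for _ in range(T):
        best = None
        for thr in thresholds:
            for sign in (1, -1):
                # Stump: h(x) = sign * (+1 if x > thr else -1)
                preds = [sign * (1 if x > thr else -1) for x in X]
                eps = sum(wi for wi, p, yi in zip(w, preds, y) if p != yi)
                if best is None or eps < best[0]:
                    best = (eps, thr, sign, preds)
        eps, thr, sign, preds = best
        eps = min(max(eps, 1e-10), 1 - 1e-10)  # illustrative guard against log(0)
        alpha = 0.5 * math.log((1 - eps) / eps)
        ensemble.append((alpha, thr, sign))
        # Weight update w_i <- w_i * exp(-y_i * alpha * h(x_i)), then renormalize
        w = [wi * math.exp(-yi * alpha * p) for wi, p, yi in zip(w, preds, y)]
        total = sum(w)
        w = [wi / total for wi in w]
    return ensemble

def predict(ensemble, x):
    # Final classification is the sign of the weighted vote F_T(x).
    score = sum(a * s * (1 if x > thr else -1) for a, thr, s in ensemble)
    return 1 if score >= 0 else -1

# Toy data: positives above 0.5 (illustrative).
X = [0.1, 0.2, 0.3, 0.6, 0.7, 0.9]
y = [-1, -1, -1, 1, 1, 1]
model = train_adaboost(X, y, T=3)
```

On this separable toy set a single stump already classifies everything correctly, so the remaining rounds simply reinforce it; on harder data the reweighting step is what forces later stumps toward the misclassified points.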
Variants
Real AdaBoost
The output of decision trees is a class probability estimate p(x) = P(y=1 \mid x), the probability that x is in the positive class. Friedman, Hastie and Tibshirani derive an analytical minimizer for e^{-y f_t(x)} for some fixed p(x) (typically chosen using weighted least squares error):
:f_t(x) = \frac{1}{2} \ln\left(\frac{p(x)}{1 - p(x)}\right).
Thus, rather than multiplying the output of the entire tree by some fixed value, each leaf node is changed to output half the logit transform of its previous value.
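The leaf transform can be sketched as follows; the clamp eps is an illustrative numerical guard to keep the logarithm finite for leaves with p near 0 or 1, not part of the derivation:

```python
import math

def real_adaboost_leaf_value(p, eps=1e-10):
    """Half the logit of a leaf's class-probability estimate p = P(y=1 | x),
    as in the analytical minimizer above."""
    p = min(max(p, eps), 1 - eps)   # clamp p away from 0 and 1
    return 0.5 * math.log(p / (1 - p))

assert real_adaboost_leaf_value(0.5) == 0.0   # uninformative leaf contributes nothing
assert real_adaboost_leaf_value(0.9) > 0      # confident positive leaf votes positive
```

The transform is antisymmetric about p = 0.5, so leaves equally confident in opposite classes contribute equal and opposite amounts to the ensemble score.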
LogitBoost
LogitBoost represents an application of established logistic regression techniques to the AdaBoost method. Rather than minimizing error with respect to y, weak learners are chosen to minimize the (weighted least-squares) error of f_t(x) with respect to
:z_t = \frac{y^* - p_t(x)}{p_t(x)(1 - p_t(x))},
where
:p_t(x) = \frac{e^{F_{t-1}(x)}}{e^{F_{t-1}(x)} + e^{-F_{t-1}(x)}},
:w_t = p_t(x)(1 - p_t(x)),
:y^* = \frac{y + 1}{2}.
That is, z_t is the Newton–Raphson approximation of the minimizer of the log-likelihood error at stage t, and the weak learner f_t is chosen as the learner that best approximates z_t by weighted least squares.
As p approaches either 1 or 0, the value of p_t(x_i)(1 - p_t(x_i)) becomes very small and the ''z'' term, which is large for misclassified samples, can become numerically unstable due to machine precision rounding errors. This can be overcome by enforcing some limit on the absolute value of ''z'' and the minimum value of ''w''.
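A minimal sketch of these working responses, assuming illustrative clamp values for ''z'' and ''w'' (the text above does not prescribe specific limits):

```python
import math

def logitboost_targets(F, y, z_max=4.0, w_min=1e-6):
    """Per-sample Newton-step working response z and weight w for LogitBoost,
    following the formulas above. F: current ensemble outputs F_{t-1}(x_i),
    y: labels in {-1, 1}. z_max and w_min are illustrative stability clamps."""
    zs, ws = [], []
    for f, yi in zip(F, y):
        p = math.exp(f) / (math.exp(f) + math.exp(-f))
        y_star = (yi + 1) / 2           # map {-1, 1} -> {0, 1}
        w = max(p * (1 - p), w_min)     # floor on w avoids division blow-up
        z = (y_star - p) / w
        z = max(min(z, z_max), -z_max)  # cap |z| for near-saturated samples
        zs.append(z)
        ws.append(w)
    return zs, ws

# At F = 0, p = 1/2 and w = 1/4, so z = (y* - 1/2) / (1/4) = +/-2.
zs, ws = logitboost_targets([0.0, 0.0], [1, -1])
```

The weak learner at stage t would then be fit to the pairs (x_i, z_i) by least squares with sample weights w_i.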
Gentle AdaBoost
While previous boosting algorithms choose f_t greedily, minimizing the overall test error as much as possible at each step, GentleBoost features a bounded step size. f_t is chosen to minimize \sum_i w_{t,i} \left(y_i - f_t(x_i)\right)^2, and no further coefficient is applied. Thus, in the case where a weak learner exhibits perfect classification performance, GentleBoost chooses f_t(x) = \alpha_t h_t(x) exactly equal to y, while steepest descent algorithms try to set \alpha_t = \infty. Empirical observations about the good performance of GentleBoost appear to back up Schapire and Singer's remark that allowing excessively large values of \alpha can lead to poor generalization performance.
Early Termination
A technique for speeding up processing of boosted classifiers, early termination refers to only testing each potential object with as many layers of the final classifier as are necessary to meet some confidence threshold, speeding up computation for cases where the class of the object can easily be determined. One such scheme is the object detection framework introduced by Viola and Jones: in an application with significantly more negative samples than positive, a cascade of separate boost classifiers is trained, the output of each stage biased such that some acceptably small fraction of positive samples is mislabeled as negative, and all samples marked as negative after each stage are discarded. If 50% of negative samples are filtered out by each stage, only a very small number of objects will pass through the entire classifier, reducing computation effort. This method has since been generalized, with a formula provided for choosing optimal thresholds at each stage to achieve some desired false positive and false negative rate.
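The cascade idea can be sketched as follows; the stage functions and thresholds are hypothetical stand-ins for trained boosted classifiers, not the Viola–Jones implementation:

```python
def cascade_classify(stage_scores, thresholds):
    """Evaluate a cascade of boosted stages with early termination.

    stage_scores: list of functions, each returning one stage's boosted
    score for a candidate x; thresholds: per-stage cutoffs, biased so few
    true positives are rejected. A candidate is labeled negative as soon
    as any stage scores it below its threshold, so later (more expensive)
    stages never run for easy negatives. Illustrative sketch only."""
    def classify(x):
        for score, thr in zip(stage_scores, thresholds):
            if score(x) < thr:
                return -1   # rejected early; remaining stages are skipped
        return 1            # survived every stage: labeled positive
    return classify

# Two toy 'stages' that count how often they are evaluated.
calls = [0, 0]
def stage0(x):
    calls[0] += 1
    return x
def stage1(x):
    calls[1] += 1
    return x

clf = cascade_classify([stage0, stage1], [0.5, 0.5])
```

Running `clf` on an easy negative touches only the first stage, which is exactly the computational saving the cascade is designed to provide.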
In the field of statistics, where AdaBoost is more commonly applied to problems of moderate dimensionality, early stopping is used as a strategy to reduce overfitting. A validation set of samples is separated from the training set, performance of the classifier on the samples used for training is compared to performance on the validation samples, and training is terminated if performance on the validation sample is seen to decrease even as performance on the training set continues to improve.
Totally corrective algorithms
For steepest descent versions of AdaBoost, where \alpha_t is chosen at each layer ''t'' to minimize test error, the next layer added is said to be ''maximally independent'' of layer ''t'': it is unlikely that a weak learner ''t''+1 will be chosen that is similar to learner ''t''. However, there remains the possibility that ''t''+1 produces similar information to some other earlier layer. Totally corrective algorithms, such as LPBoost, optimize the value of every coefficient after each step, such that new layers added are always maximally independent of every previous layer. This can be accomplished by backfitting, linear programming or some other method.
Pruning
Pruning is the process of removing poorly performing weak classifiers to reduce the memory and execution-time cost of the boosted classifier. The simplest methods, which can be particularly effective in conjunction with totally corrective training, are weight- or margin-trimming: when the coefficient, or the contribution to the total test error, of some weak classifier falls below a certain threshold, that classifier is dropped. Margineantu & Dietterich suggested an alternative criterion for trimming: weak classifiers should be selected such that the diversity of the ensemble is maximized. If two weak learners produce very similar outputs, efficiency can be improved by removing one of them and increasing the coefficient of the remaining weak learner.
See also
* Bootstrap aggregating
* CoBoosting
* BrownBoost
* Gradient boosting
References
Further reading
* The original paper of Yoav Freund and Robert E. Schapire, in which AdaBoost is first introduced.
* On the margin explanation of boosting algorithm.
* On the doubt about margin explanation of boosting.
{{DEFAULTSORT:AdaBoost}}
Classification algorithms
Ensemble learning