machine learning Machine learning (ML) is a field of study in artificial intelligence concerned with the development and study of Computational statistics, statistical algorithms that can learn from data and generalise to unseen data, and thus perform Task ( ...

(ML), boosting is an ensemble

metaheuristic In computer science and mathematical optimization, a metaheuristic is a higher-level procedure or heuristic designed to find, generate, tune, or select a heuristic (partial search algorithm) that may provide a sufficiently good solution to an op ...

for primarily reducing bias (as opposed to variance). It can also improve the

stability Stability may refer to: Mathematics *Stability theory, the study of the stability of solutions to differential equations and dynamical systems ** Asymptotic stability ** Exponential stability ** Linear stability **Lyapunov stability ** Marginal s ...

and accuracy of ML

classification Classification is the activity of assigning objects to some pre-existing classes or categories. This is distinct from the task of establishing the classes themselves (for example through cluster analysis). Examples include diagnostic tests, identif ...

and regression algorithms. Hence, it is prevalent in

supervised learning In machine learning, supervised learning (SL) is a paradigm where a Statistical model, model is trained using input objects (e.g. a vector of predictor variables) and desired output values (also known as a ''supervisory signal''), which are often ...

for converting weak learners to strong learners. The concept of boosting is based on the question posed by Kearns and Valiant (1988, 1989):Michael Kearns(1988)
''Thoughts on Hypothesis Boosting''
Unpublished manuscript (Machine Learning class project, December 1988) "Can a set of weak learners create a single strong learner?" A weak learner is defined as a classifier that is only slightly correlated with the true classification. A strong learner is a classifier that is arbitrarily well-correlated with the true classification. Robert Schapire answered the question in the affirmative in a paper published in 1990. This has had significant ramifications in machine learning and

statistics Statistics (from German language, German: ', "description of a State (polity), state, a country") is the discipline that concerns the collection, organization, analysis, interpretation, and presentation of data. In applying statistics to a s ...

, most notably leading to the development of boosting. Initially, the ''hypothesis boosting problem'' simply referred to the process of turning a weak learner into a strong learner. Algorithms that achieve this quickly became known as "boosting". Freund and Schapire's arcing (Adapt tve Resampling and Combining), as a general technique, is more or less synonymous with boosting.

Algorithms

While boosting is not algorithmically constrained, most boosting algorithms consist of iteratively learning weak classifiers with respect to a distribution and adding them to a final strong classifier. When they are added, they are weighted in a way that is related to the weak learners' accuracy. After a weak learner is added, the data weights are readjusted, known as "re-

weighting The process of frequency weighting involves emphasizing the contribution of particular aspects of a phenomenon (or of a set of data) over others to an outcome or result; thereby highlighting those aspects in comparison to others in the analy ...

". Misclassified input data gain a higher weight and examples that are classified correctly lose weight. Thus, future weak learners focus more on the examples that previous weak learners misclassified. Ensemble Boosting

There are many boosting algorithms. The original ones, proposed by Robert Schapire (a

recursive Recursion occurs when the definition of a concept or process depends on a simpler or previous version of itself. Recursion is used in a variety of disciplines ranging from linguistics to logic. The most common application of recursion is in m ...

majority gate formulation), and Yoav Freund (boost by majority),Llew Mason, Jonathan Baxter, Peter Bartlett, and Marcus Frean (2000); ''Boosting Algorithms as Gradient Descent'', in S. A. Solla, T. K. Leen, and K.-R. Muller, editors, ''Advances in Neural Information Processing Systems'' 12, pp. 512-518, MIT Press were not adaptive and could not take full advantage of the weak learners. Schapire and Freund then developed

AdaBoost AdaBoost (short for Adaptive Boosting) is a statistical classification meta-algorithm formulated by Yoav Freund and Robert Schapire in 1995, who won the 2003 Gödel Prize for their work. It can be used in conjunction with many types of learnin ...

, an adaptive boosting algorithm that won the prestigious

Gödel Prize The Gödel Prize is an annual prize for outstanding papers in the area of theoretical computer science, given jointly by the European Association for Theoretical Computer Science (EATCS) and the Association for Computing Machinery Special Inter ...

. Only algorithms that are provable boosting algorithms in the

probably approximately correct learning In computational learning theory, probably approximately correct (PAC) learning is a framework for mathematical analysis of machine learning. It was proposed in 1984 by Leslie Valiant.L. Valiant. A theory of the learnable.' Communications of the ...

formulation can accurately be called ''boosting algorithms''. Other algorithms that are similar in spirit to boosting algorithms are sometimes called "leveraging algorithms", although they are also sometimes incorrectly called boosting algorithms. The main variation between many boosting algorithms is their method of

training data In machine learning, a common task is the study and construction of algorithms that can learn from and make predictions on data. Such algorithms function by making data-driven predictions or decisions, through building a mathematical model from ...

points and

hypotheses A hypothesis (: hypotheses) is a proposed explanation for a phenomenon. A scientific method, scientific hypothesis must be based on observations and make a testable and reproducible prediction about reality, in a process beginning with an educ ...

is very popular and the most significant historically as it was the first algorithm that could adapt to the weak learners. It is often the basis of introductory coverage of boosting in university machine learning courses. There are many more recent algorithms such as LPBoost, TotalBoost, BrownBoost, xgboost, MadaBoost, LogitBoost, and others. Many boosting algorithms fit into the AnyBoost framework, which shows that boosting performs

gradient descent Gradient descent is a method for unconstrained mathematical optimization. It is a first-order iterative algorithm for minimizing a differentiable multivariate function. The idea is to take repeated steps in the opposite direction of the gradi ...

in a

function space In mathematics, a function space is a set of functions between two fixed sets. Often, the domain and/or codomain will have additional structure which is inherited by the function space. For example, the set of functions from any set into a ve ...

using a

convex Convex or convexity may refer to: Science and technology * Convex lens, in optics Mathematics * Convex set, containing the whole line segment that joins points ** Convex polygon, a polygon which encloses a convex set of points ** Convex polytop ...

cost function.

Object categorization in computer vision

Given images containing various known objects in the world, a classifier can be learned from them to automatically

classify Classification is the activity of assigning objects to some pre-existing classes or categories. This is distinct from the task of establishing the classes themselves (for example through cluster analysis). Examples include diagnostic tests, identif ...

the objects in future images. Simple classifiers built based on some image feature of the object tend to be weak in categorization performance. Using boosting methods for object categorization is a way to unify the weak classifiers in a special way to boost the overall ability of categorization.

Problem of object categorization

Object categorization is a typical task of

computer vision Computer vision tasks include methods for image sensor, acquiring, Image processing, processing, Image analysis, analyzing, and understanding digital images, and extraction of high-dimensional data from the real world in order to produce numerical ...

that involves determining whether or not an image contains some specific category of object. The idea is closely related with recognition, identification, and detection. Appearance based object categorization typically contains

feature extraction Feature may refer to: Computing * Feature recognition, could be a hole, pocket, or notch * Feature (computer vision), could be an edge, corner or blob * Feature (machine learning), in statistics: individual measurable properties of the phenome ...

learning Learning is the process of acquiring new understanding, knowledge, behaviors, skills, value (personal and cultural), values, Attitude (psychology), attitudes, and preferences. The ability to learn is possessed by humans, non-human animals, and ...

a classifier, and applying the classifier to new examples. There are many ways to represent a category of objects, e.g. from shape analysis, bag of words models, or local descriptors such as SIFT, etc. Examples of supervised classifiers are

Naive Bayes classifier In statistics, naive (sometimes simple or idiot's) Bayes classifiers are a family of " probabilistic classifiers" which assumes that the features are conditionally independent, given the target class. In other words, a naive Bayes model assumes th ...

support vector machine In machine learning, support vector machines (SVMs, also support vector networks) are supervised max-margin models with associated learning algorithms that analyze data for classification and regression analysis. Developed at AT&T Bell Laborato ...

s, mixtures of Gaussians, and

neural networks A neural network is a group of interconnected units called neurons that send signals to one another. Neurons can be either Cell (biology), biological cells or signal pathways. While individual neurons are simple, many of them together in a netwo ...

. However, research has shown that object categories and their locations in images can be discovered in an unsupervised manner as well.

Status quo for object categorization

The recognition of object categories in images is a challenging problem in

, especially when the number of categories is large. This is due to high intra class variability and the need for generalization across variations of objects within the same category. Objects within one category may look quite different. Even the same object may appear unalike under different viewpoint, scale, and illumination. Background clutter and partial occlusion add difficulties to recognition as well. Humans are able to recognize thousands of object types, whereas most of the existing

object recognition Object recognition – technology in the field of computer vision for finding and identifying objects in an image or video sequence. Humans recognize a multitude of objects in images with little effort, despite the fact that the image of the ...

systems are trained to recognize only a few, e.g. human faces,

car A car, or an automobile, is a motor vehicle with wheels. Most definitions of cars state that they run primarily on roads, seat one to eight people, have four wheels, and mainly transport people rather than cargo. There are around one billio ...

s, simple objects, etc. Research has been very active on dealing with more categories and enabling incremental additions of new categories, and although the general problem remains unsolved, several multi-category objects detectors (for up to hundreds or thousands of categories) have been developed. One means is by

feature Feature may refer to: Computing * Feature recognition, could be a hole, pocket, or notch * Feature (computer vision), could be an edge, corner or blob * Feature (machine learning), in statistics: individual measurable properties of the phenome ...

sharing and boosting.

Boosting for binary categorization

AdaBoost can be used for face detection as an example of binary categorization. The two categories are faces versus background. The general algorithm is as follows: #Form a large set of simple features #Initialize weights for training images #For T rounds ##Normalize the weights ##For available features from the set, train a classifier using a single feature and evaluate the training error ##Choose the classifier with the lowest error ##Update the weights of the training images: increase if classified wrongly by this classifier, decrease if correctly #Form the final strong classifier as the linear combination of the T classifiers (coefficient larger if training error is small) After boosting, a classifier constructed from 200 features could yield a 95% detection rate under a

10^

false positive rate. Another application of boosting for binary categorization is a system that detects pedestrians using patterns of motion and appearance. This work is the first to combine both motion information and appearance information as features to detect a walking person. It takes a similar approach to the Viola-Jones object detection framework.

Boosting for multi-class categorization

Compared with binary categorization, multi-class categorization looks for common features that can be shared across the categories at the same time. They turn to be more generic

edge Edge or EDGE may refer to: Technology Computing * Edge computing, a network load-balancing system * Edge device, an entry point to a computer network * Adobe Edge, a graphical development application * Microsoft Edge, a web browser developed by ...

like features. During learning, the detectors for each category can be trained jointly. Compared with training separately, it generalizes better, needs less training data, and requires fewer features to achieve the same performance. The main flow of the algorithm is similar to the binary case. What is different is that a measure of the joint training error shall be defined in advance. During each iteration the algorithm chooses a classifier of a single feature (features that can be shared by more categories shall be encouraged). This can be done via converting

multi-class classification In machine learning and statistical classification, multiclass classification or multinomial classification is the problem of classifying instances into one of three or more classes (classifying instances into one of two classes is called binary ...

into a binary one (a set of categories versus the rest), or by introducing a penalty error from the categories that do not have the feature of the classifier. In the paper "Sharing visual features for multiclass and multiview object detection", A. Torralba et al. used GentleBoost for boosting and showed that when training data is limited, learning via sharing features does a much better job than no sharing, given same boosting rounds. Also, for a given performance level, the total number of features required (and therefore the run time cost of the classifier) for the feature sharing detectors, is observed to scale approximately

logarithm In mathematics, the logarithm of a number is the exponent by which another fixed value, the base, must be raised to produce that number. For example, the logarithm of to base is , because is to the rd power: . More generally, if , the ...

ically with the number of class, i.e., slower than

linear In mathematics, the term ''linear'' is used in two distinct senses for two different properties: * linearity of a '' function'' (or '' mapping''); * linearity of a '' polynomial''. An example of a linear function is the function defined by f(x) ...

growth in the non-sharing case. Similar results are shown in the paper "Incremental learning of object detectors using a visual shape alphabet", yet the authors used

for boosting.

Convex vs. non-convex boosting algorithms

Boosting algorithms can be based on

or non-convex optimization algorithms. Convex algorithms, such as

and LogitBoost, can be "defeated" by random noise such that they can't learn basic and learnable combinations of weak hypotheses.P. Long and R. Servedio. 25th International Conference on Machine Learning (ICML), 2008, pp. 608--615. This limitation was pointed out by Long & Servedio in 2008. However, by 2009, multiple authors demonstrated that boosting algorithms based on non-convex optimization, such as BrownBoost, can learn from noisy datasets and can specifically learn the underlying classifier of the Long–Servedio dataset.

Implementations

scikit-learn scikit-learn (formerly scikits.learn and also known as sklearn) is a free and open-source machine learning library for the Python programming language. It features various classification, regression and clustering algorithms including support ...

, an open source machine learning library for Python * Orange, a free data mining software suite, modul
Orange.ensemble
* Weka is a machine learning set of tools that offers variate implementations of boosting algorithms like AdaBoost and LogitBoost * R packag
GBM
(Generalized Boosted Regression Models) implements extensions to Freund and Schapire's AdaBoost algorithm and Friedman's gradient boosting machine.
jboost
AdaBoost, LogitBoost, RobustBoost, Boostexter and alternating decision trees * R packag

Applies Multiclass AdaBoost.M1, AdaBoost-SAMME and Bagging * R packag

An implementation of gradient boosting for linear and tree-based models.

Notes

References

External links

* Robert E. Schapire (2003)
''The Boosting Approach to Machine Learning: An Overview''
MSRI (Mathematical Sciences Research Institute) Workshop on Nonlinear Estimation and Classification
Boosting: Foundations and Algorithms by Robert E. Schapire and Yoav Freund
{{Authority control Classification algorithms Ensemble learning Learning in computer vision Object recognition and categorization