Bootstrap aggregating
   HOME

TheInfoList



OR:

Bootstrap aggregating, also called bagging (from bootstrap aggregating), is a
machine learning ensemble In statistics and machine learning, ensemble methods use multiple learning algorithms to obtain better predictive performance than could be obtained from any of the constituent learning algorithms alone. Unlike a statistical ensemble in stati ...
meta-algorithm In computer science and mathematical optimization, a metaheuristic is a higher-level procedure or heuristic designed to find, generate, or select a heuristic (partial search algorithm) that may provide a sufficiently good solution to an optimizat ...
designed to improve the stability and accuracy of
machine learning Machine learning (ML) is a field of inquiry devoted to understanding and building methods that 'learn', that is, methods that leverage data to improve performance on some set of tasks. It is seen as a part of artificial intelligence. Machine ...
algorithms used in
statistical classification In statistics, classification is the problem of identifying which of a set of categories (sub-populations) an observation (or observations) belongs to. Examples are assigning a given email to the "spam" or "non-spam" class, and assigning a diagn ...
and regression. It also reduces
variance In probability theory and statistics, variance is the expectation of the squared deviation of a random variable from its population mean or sample mean. Variance is a measure of dispersion, meaning it is a measure of how far a set of numbe ...
and helps to avoid
overfitting mathematical modeling, overfitting is "the production of an analysis that corresponds too closely or exactly to a particular set of data, and may therefore fail to fit to additional data or predict future observations reliably". An overfitt ...
. Although it is usually applied to
decision tree A decision tree is a decision support tool that uses a tree-like model of decisions and their possible consequences, including chance event outcomes, resource costs, and utility. It is one way to display an algorithm that only contains con ...
methods, it can be used with any type of method. Bagging is a special case of the model averaging approach.


Description of the technique

Given a standard
training set In machine learning, a common task is the study and construction of algorithms that can learn from and make predictions on data. Such algorithms function by making data-driven predictions or decisions, through building a mathematical model from ...
D of size ''n'', bagging generates ''m'' new training sets D_i, each of size ''n′'', by sampling from ''D'' uniformly and
with replacement In statistics, quality assurance, and survey methodology, sampling is the selection of a subset (a statistical sample) of individuals from within a statistical population to estimate characteristics of the whole population. Statisticians attempt ...
. By sampling with replacement, some observations may be repeated in each D_i. If ''n ''=''n'', then for large ''n'' the set D_i is expected to have the fraction (1 - 1/'' e'') (≈63.2%) of the unique examples of ''D'', the rest being duplicates. This kind of sample is known as a bootstrap sample. Sampling with replacement ensures each bootstrap is independent from its peers, as it does not depend on previous chosen samples when sampling. Then, ''m'' models are fitted using the above ''m'' bootstrap samples and combined by averaging the output (for regression) or voting (for classification). Bagging leads to "improvements for unstable procedures", which include, for example,
artificial neural networks Artificial neural networks (ANNs), usually simply called neural networks (NNs) or neural nets, are computing systems inspired by the biological neural networks that constitute animal brains. An ANN is based on a collection of connected units ...
, classification and regression trees, and subset selection in
linear regression In statistics, linear regression is a linear approach for modelling the relationship between a scalar response and one or more explanatory variables (also known as dependent and independent variables). The case of one explanatory variable is cal ...
. Bagging was shown to improve preimage learning. On the other hand, it can mildly degrade the performance of stable methods such as K-nearest neighbors.


Process of the algorithm


Key Terms

There are three types of datasets in bootstrap aggregating. These are the original, bootstrap, and out-of-bag datasets. Each section below will explain how each dataset is made except for the original dataset. The original dataset is whatever information is given.


Creating the bootstrap dataset

The bootstrap dataset is made by randomly picking objects from the original dataset. Also, it must be the same size as the original dataset. However, the difference is that the bootstrap dataset can have duplicate objects. Here is simple example to demonstrate how it works along with the illustration below: Suppose the original dataset is a group of 12 people. These guys are Emily, Jessie, George, Constantine, Lexi, Theodore, John, James, Rachel, Anthony, Ellie, and Jamal. By randomly picking a group of names, let us say our bootstrap dataset had James, Ellie, Constantine, Lexi, John, Constantine, Theodore, Constantine, Anthony, Lexi, Constantine, and Theodore. In this case, the bootstrap sample contained four duplicates for Constantine, and two duplicates for Lexi, and Theodore.


Creating the out-of-bag dataset

The out-of-bag dataset represents the remaining people who were not in the bootstrap dataset. It can be calculated by taking the difference between the original and the bootstrap datasets. In this case, the remaining samples who were not selected are Emily, Jessie, George, Rachel, and Jamal. Keep in mind that since both datasets are sets, when taking the difference the duplicate names are ignored in the bootstrap dataset. The illustration below shows how the math is done:


Importance

Creating the bootstrap and out-of-bag datasets is crucial since it is used to test the accuracy of a random forest algorithm. For example, a model that produces 50 trees using the bootstrap/out-of-bag datasets will have a better accuracy than if it produced 10 trees. Since the algorithm generates multiple trees and therefore multiple datasets the chance that an object is left out of the bootstrap dataset is low. The next few sections talk about how the random forest algorithm works in more detail.


Creation of Decision Trees

The next step of the algorithm involves the generation of
decision tree A decision tree is a decision support tool that uses a tree-like model of decisions and their possible consequences, including chance event outcomes, resource costs, and utility. It is one way to display an algorithm that only contains con ...
s from the bootstrapped dataset. To achieve this, the process examines each gene/feature and determines for how many samples the feature's presence or absence yields a positive or negative result. This information is then used to compute a confusion matrix, which lists the true positives, false positives, true negatives, and false negatives of the feature when used as a classifier. These features are then ranked according to various classification metrics based on their confusion matrices. Some common metrics include estimate of positive correctness (calculated by subtracting false positives from true positives), measure of "goodness", and information gain. These features are then used to partition the samples into two sets: those who possess the top feature, and those who do not. The diagram below shows a decision tree of depth two being used to classify data. For example, a data point that exhibits Feature 1, but not Feature 2, will be given a "No". Another point that does not exhibit Feature 1, but does exhibit Feature 3, will be given a "Yes". This process is repeated recursively for successive levels of the tree until the desired depth is reached. At the very bottom of the tree, samples that test positive for the final feature are generally classified as positive, while those that lack the feature are classified as negative. These trees are then used as predictors to classify new data.


Random Forests

The next part of the algorithm involves introducing yet another element of variability amongst the bootstrapped trees. In addition to each tree only examining a bootstrapped set of samples, only a small but consistent number of unique features are considered when ranking them as classifiers. This means that each tree only knows about the data pertaining to a small constant number of features, and a variable number of samples that is less than or equal to that of the original dataset. Consequently, the trees are more likely to return a wider array of answers, derived from more diverse knowledge. This results in a
random forest Random forests or random decision forests is an ensemble learning method for classification, regression and other tasks that operates by constructing a multitude of decision trees at training time. For classification tasks, the output of ...
, which possesses numerous benefits over a single decision tree generated without randomness. In a random forest, each tree "votes" on whether or not to classify a sample as positive based on its features. The sample is then classified based on majority vote. An example of this is given in the diagram below, where the four trees in a random forest vote on whether or not a patient with mutations A, B, F, and G has cancer. Since three out of four trees vote yes, the patient is then classified as cancer positive. Because of their properties, random forests are considered one of the most accurate data mining algorithms, are less likely to
overfit mathematical modeling, overfitting is "the production of an analysis that corresponds too closely or exactly to a particular set of data, and may therefore fail to fit to additional data or predict future observations reliably". An overfitt ...
their data, and run quickly and efficiently even for large datasets. They are primarily useful for classification as opposed to regression, which attempts to draw observed connections between statistical variables in a dataset. This makes random forests particularly useful in such fields as banking, healthcare, the stock market, and
e-commerce E-commerce (electronic commerce) is the activity of electronically buying or selling of products on online services or over the Internet. E-commerce draws on technologies such as mobile commerce, electronic funds transfer, supply chain manag ...
where it is important to be able to predict future results based on past data. One of their applications would be as a useful tool for predicting cancer based on genetic factors, as seen in the above example. There are several important factors to consider when designing a random forest. If the trees in the random forests are too deep, overfitting can still occur due to over-specificity. If the forest is too large, the algorithm may become less efficient due to an increased runtime. Random forests also do not generally perform well when given sparse data with little variability. However, they still have numerous advantages over similar data classification algorithms such as
neural network A neural network is a network or circuit of biological neurons, or, in a modern sense, an artificial neural network, composed of artificial neurons or nodes. Thus, a neural network is either a biological neural network, made up of biological ...
s, as they are much easier to interpret and generally require less data for training. As an integral component of random forests, bootstrap aggregating is very important to classification algorithms, and provides a critical element of variability that allows for increased accuracy when analyzing new data, as discussed below.


Improving Random Forests and Bagging

While the techniques described above utilize
random forest Random forests or random decision forests is an ensemble learning method for classification, regression and other tasks that operates by constructing a multitude of decision trees at training time. For classification tasks, the output of ...
s and bagging (otherwise known as bootstrapping), there are certain techniques that can be used in order to improve their execution and voting time, their prediction accuracy, and their overall performance. The following are key steps in creating an efficient random forest: # Specify the maximum depth of trees: Instead of allowing your random forest to continue until all nodes are pure, it is better to cut it off at a certain point in order to further decrease chances of overfitting. # Prune the dataset: Using an extremely large dataset may prove to create results that is less indicative of the data provided than a smaller set that more accurately represents what is being focused on. #* Continue pruning the data at each node split rather than just in the original bagging process. # Decide on accuracy or speed: Depending on the desired results, increasing or decreasing the number of trees within the forest can help. Increasing the number of trees generally provides more accurate results while decreasing the number of trees will provide quicker results.


Algorithm (classification)

For classification, use a
training set In machine learning, a common task is the study and construction of algorithms that can learn from and make predictions on data. Such algorithms function by making data-driven predictions or decisions, through building a mathematical model from ...
D, Inducer I and the number of bootstrap samples m as input. Generate a classifier C^* as output # Create m new training sets D_i, from D with replacement # Classifier C_i is built from each set D_i using I to determine the classification of set D_i # Finally classifier C^* is generated by using the previously created set of classifiers C_i on the original data set D, the classification predicted most often by the sub-classifiers C_i is the final classification
for i = 1 to m 
C*(x) = argmax #         (most often predicted label y)
         y∈Y   


Example: ozone data

To illustrate the basic principles of bagging, below is an analysis on the relationship between
ozone Ozone (), or trioxygen, is an inorganic molecule with the chemical formula . It is a pale blue gas with a distinctively pungent smell. It is an allotrope of oxygen that is much less stable than the diatomic allotrope , breaking down in the l ...
and temperature (data from Rousseeuw and Leroy (1986), analysis done in R). The relationship between temperature and ozone appears to be nonlinear in this data set, based on the scatter plot. To mathematically describe this relationship,
LOESS Loess (, ; from german: Löss ) is a clastic, predominantly silt-sized sediment that is formed by the accumulation of wind-blown dust. Ten percent of Earth's land area is covered by loess or similar deposits. Loess is a periglacial or aeoli ...
smoothers (with bandwidth 0.5) are used. Rather than building a single smoother for the complete data set, 100 bootstrap samples were drawn. Each sample is composed of a random subset of the original data and maintains a semblance of the master set’s distribution and variability. For each bootstrap sample, a LOESS smoother was fit. Predictions from these 100 smoothers were then made across the range of the data. The black lines represent these initial predictions. The lines lack agreement in their predictions and tend to overfit their data points: evident by the wobbly flow of the lines. center By taking the average of 100 smoothers, each corresponding to a subset of the original data set, we arrive at one bagged predictor (red line). The red line's flow is stable and does not overly conform to any data point(s).


Advantages and disadvantages

Advantages: * Many weak learners aggregated typically outperform a single learner over the entire set, and has less overfit * Removes variance in high-variance low-bias weak learner * Can be performed in
parallel Parallel is a geometric term of location which may refer to: Computing * Parallel algorithm * Parallel computing * Parallel metaheuristic * Parallel (software), a UNIX utility for running programs in parallel * Parallel Sysplex, a cluster o ...
, as each separate bootstrap can be processed on its own before combination Disadvantages: * For weak learner with high bias, bagging will also carry high bias into its aggregate * Loss of interpretability of a model. * Can be computationally expensive depending on the data set


History

The concept of bootstrap aggregating is derived from the concept of bootstrapping which was developed by Bradley Efron. Bootstrap aggregating was proposed by
Leo Breiman Leo Breiman (January 27, 1928 – July 5, 2005) was a distinguished statistician at the University of California, Berkeley. He was the recipient of numerous honors and awards, and was a member of the United States National Academy of Sciences ...
who also coined the abbreviated term "bagging" (bootstrap aggregating). Breiman developed the concept of bagging in 1994 to improve classification by combining classifications of randomly generated training sets. He argued, "If perturbing the learning set can cause significant changes in the predictor constructed, then bagging can improve accuracy".


See also

* Boosting (meta-algorithm) *
Bootstrapping (statistics) Bootstrapping is any test or metric that uses random sampling with replacement (e.g. mimicking the sampling process), and falls under the broader class of resampling methods. Bootstrapping assigns measures of accuracy (bias, variance, confidenc ...
*
Cross-validation (statistics) Cross-validation, sometimes called rotation estimation or out-of-sample testing, is any of various similar model validation techniques for assessing how the results of a statistical analysis will generalize to an independent data set. Cross-va ...
* Out-of-bag error *
Random forest Random forests or random decision forests is an ensemble learning method for classification, regression and other tasks that operates by constructing a multitude of decision trees at training time. For classification tasks, the output of ...
* Random subspace method (attribute bagging) * Resampled efficient frontier * Predictive analysis: Classification and regression trees


References


Further reading

* * * * {{cite book , first1=Bradley , last1=Boehmke , first2=Brandon , last2=Greenwell , chapter=Bagging , pages=191–202 , title=Hands-On Machine Learning with R , publisher=Chapman & Hall , year=2019 , isbn=978-1-138-49568-5 Ensemble learning Machine learning algorithms Computational statistics