Random forests or random decision forests is an

ensemble learning In statistics and machine learning, ensemble methods use multiple learning algorithms to obtain better predictive performance than could be obtained from any of the constituent learning algorithms alone. Unlike a statistical ensemble in statist ...

method for

classification Classification is a process related to categorization, the process in which ideas and objects are recognized, differentiated and understood. Classification is the grouping of related facts into classes. It may also refer to: Business, organizat ...

regression Regression or regressions may refer to: Science * Marine regression, coastal advance due to falling sea level, the opposite of marine transgression * Regression (medicine), a characteristic of diseases to express lighter symptoms or less extent ( ...

and other tasks that operates by constructing a multitude of

decision trees A decision tree is a decision support tool that uses a tree-like model of decisions and their possible consequences, including chance event outcomes, resource costs, and utility. It is one way to display an algorithm that only contains cond ...

at training time. For classification tasks, the output of the random forest is the class selected by most trees. For regression tasks, the mean or average prediction of the individual trees is returned. Random decision forests correct for decision trees' habit of

overfitting mathematical modeling, overfitting is "the production of an analysis that corresponds too closely or exactly to a particular set of data, and may therefore fail to fit to additional data or predict future observations reliably". An overfitt ...

to their

training set In machine learning, a common task is the study and construction of algorithms that can learn from and make predictions on data. Such algorithms function by making data-driven predictions or decisions, through building a mathematical model from ...

. Random forests generally outperform

, but their accuracy is lower than gradient boosted trees. However, data characteristics can affect their performance. The first algorithm for random decision forests was created in 1995 by

Tin Kam Ho Tin Kam Ho () is a computer scientist at IBM Research with contributions to machine learning, data mining, and classification. Ho is noted for introducing random decision forests in 1995, and for her pioneering work in ensemble learning and data c ...

using the random subspace method, which, in Ho's formulation, is a way to implement the "stochastic discrimination" approach to classification proposed by Eugene Kleinberg. An extension of the algorithm was developed by

Leo Breiman Leo Breiman (January 27, 1928 – July 5, 2005) was a distinguished statistician at the University of California, Berkeley. He was the recipient of numerous honors and awards, and was a member of the United States National Academy of Sciences. ...

and Adele Cutler, who registered "Random Forests" as a

trademark A trademark (also written trade mark or trade-mark) is a type of intellectual property consisting of a recognizable sign, design, or expression that identifies products or services from a particular source and distinguishes them from oth ...

in 2006 (, owned by Minitab, Inc.). The extension combines Breiman's " bagging" idea and random selection of features, introduced first by Ho and later independently by Amit and Geman in order to construct a collection of decision trees with controlled variance. Random forests are frequently used as "blackbox" models in businesses, as they generate reasonable predictions across a wide range of data while requiring little configuration.

History

The general method of random decision forests was first proposed by Ho in 1995. Ho established that forests of trees splitting with oblique hyperplanes can gain accuracy as they grow without suffering from overtraining, as long as the forests are randomly restricted to be sensitive to only selected feature dimensions. A subsequent work along the same lines concluded that other splitting methods behave similarly, as long as they are randomly forced to be insensitive to some feature dimensions. Note that this observation of a more complex classifier (a larger forest) getting more accurate nearly monotonically is in sharp contrast to the common belief that the complexity of a classifier can only grow to a certain level of accuracy before being hurt by overfitting. The explanation of the forest method's resistance to overtraining can be found in Kleinberg's theory of stochastic discrimination. The early development of Breiman's notion of random forests was influenced by the work of Amit and Geman who introduced the idea of searching over a random subset of the available decisions when splitting a node, in the context of growing a single

tree In botany, a tree is a perennial plant with an elongated stem, or trunk, usually supporting branches and leaves. In some usages, the definition of a tree may be narrower, including only woody plants with secondary growth, plants that are ...

. The idea of random subspace selection from Ho was also influential in the design of random forests. In this method a forest of trees is grown, and variation among the trees is introduced by projecting the training data into a randomly chosen subspace before fitting each tree or each node. Finally, the idea of randomized node optimization, where the decision at each node is selected by a randomized procedure, rather than a deterministic optimization was first introduced by

Thomas G. Dietterich Thomas G. Dietterich is emeritus professor of computer science at Oregon State University. He is one of the pioneers of the field of machine learning. He served as executive editor of ''Machine Learning (journal)'' (1992–98) and helped co-found ...

. The proper introduction of random forests was made in a paper by

. This paper describes a method of building a forest of uncorrelated trees using a

CART A cart or dray (Australia and New Zealand) is a vehicle designed for transport, using two wheels and normally pulled by one or a pair of draught animals. A handcart is pulled or pushed by one or more people. It is different from the flatbed tr ...

like procedure, combined with randomized node optimization and bagging. In addition, this paper combines several ingredients, some previously known and some novel, which form the basis of the modern practice of random forests, in particular: # Using out-of-bag error as an estimate of the

generalization error For supervised learning applications in machine learning and statistical learning theory, generalization error (also known as the out-of-sample error or the risk) is a measure of how accurately an algorithm is able to predict outcome values for pre ...

. # Measuring variable importance through permutation. The report also offers the first theoretical result for random forests in the form of a bound on the

which depends on the strength of the trees in the forest and their

correlation In statistics, correlation or dependence is any statistical relationship, whether causal or not, between two random variables or bivariate data. Although in the broadest sense, "correlation" may indicate any type of association, in statisti ...

Algorithm

Preliminaries: decision tree learning

Decision trees are a popular method for various machine learning tasks. Tree learning "come closest to meeting the requirements for serving as an off-the-shelf procedure for data mining", say Hastie ''et al.'', "because it is invariant under scaling and various other transformations of feature values, is robust to inclusion of irrelevant features, and produces inspectable models. However, they are seldom accurate". In particular, trees that are grown very deep tend to learn highly irregular patterns: they

overfit mathematical modeling, overfitting is "the production of an analysis that corresponds too closely or exactly to a particular set of data, and may therefore fail to fit to additional data or predict future observations reliably". An overfitt ...

their training sets, i.e. have low bias, but very high variance. Random forests are a way of averaging multiple deep decision trees, trained on different parts of the same training set, with the goal of reducing the variance. This comes at the expense of a small increase in the bias and some loss of interpretability, but generally greatly boosts the performance in the final model. Forests are like the pulling together of decision tree algorithm efforts. Taking the teamwork of many trees thus improving the performance of a single random tree. Though not quite similar, forests give the effects of a k-fold cross validation.

Bagging

The training algorithm for random forests applies the general technique of

bootstrap aggregating Bootstrap aggregating, also called bagging (from bootstrap aggregating), is a machine learning ensemble meta-algorithm designed to improve the stability and accuracy of machine learning algorithms used in statistical classification and regress ...

, or bagging, to tree learners. Given a training set = , ..., with responses = , ..., , bagging repeatedly (''B'' times) selects a random sample with replacement of the training set and fits trees to these samples: : For = 1, ..., : :# Sample, with replacement, training examples from , ; call these , . :# Train a classification or regression tree on , . After training, predictions for unseen samples can be made by averaging the predictions from all the individual regression trees on : :

\hat = \frac \sum_^Bf_b (x')

or by taking the in the case of classification trees. This bootstrapping procedure leads to better model performance because it decreases the

variance In probability theory and statistics, variance is the expectation of the squared deviation of a random variable from its population mean or sample mean. Variance is a measure of dispersion, meaning it is a measure of how far a set of number ...

of the model, without increasing the bias. This means that while the predictions of a single tree are highly sensitive to noise in its training set, the average of many trees is not, as long as the trees are not correlated. Simply training many trees on a single training set would give strongly correlated trees (or even the same tree many times, if the training algorithm is deterministic); bootstrap sampling is a way of de-correlating the trees by showing them different training sets. Additionally, an estimate of the uncertainty of the prediction can be made as the standard deviation of the predictions from all the individual regression trees on : :

\sigma = \sqrt.

The number of samples/trees, , is a free parameter. Typically, a few hundred to several thousand trees are used, depending on the size and nature of the training set. An optimal number of trees can be found using cross-validation, or by observing the '' out-of-bag error'': the mean prediction error on each training sample , using only the trees that did not have in their bootstrap sample. The training and test error tend to level off after some number of trees have been fit.

From bagging to random forests

The above procedure describes the original bagging algorithm for trees. Random forests also include another type of bagging scheme: they use a modified tree learning algorithm that selects, at each candidate split in the learning process, a random subset of the features. This process is sometimes called "feature bagging". The reason for doing this is the correlation of the trees in an ordinary bootstrap sample: if one or a few features are very strong predictors for the response variable (target output), these features will be selected in many of the trees, causing them to become correlated. An analysis of how bagging and random subspace projection contribute to accuracy gains under different conditions is given by Ho. Typically, for a classification problem with features, (rounded down) features are used in each split. For regression problems the inventors recommend (rounded down) with a minimum node size of 5 as the default. In practice, the best values for these parameters should be tuned on a case-to-case basis for every problem.

ExtraTrees

Adding one further step of randomization yields ''extremely randomized trees'', or ExtraTrees. While similar to ordinary random forests in that they are an ensemble of individual trees, there are two main differences: first, each tree is trained using the whole learning sample (rather than a bootstrap sample), and second, the top-down splitting in the tree learner is randomized. Instead of computing the locally ''optimal'' cut-point for each feature under consideration (based on, e.g.,

information gain Information is an abstract concept that refers to that which has the power to inform. At the most fundamental level information pertains to the interpretation of that which may be sensed. Any natural process that is not completely random, ...

or the Gini impurity), a ''random'' cut-point is selected. This value is selected from a uniform distribution within the feature's empirical range (in the tree's training set). Then, of all the randomly generated splits, the split that yields the highest score is chosen to split the node. Similar to ordinary random forests, the number of randomly selected features to be considered at each node can be specified. Default values for this parameter are

\sqrt

for classification and

p

for regression, where

p

is the number of features in the model.

Properties

Variable importance

Random forests can be used to rank the importance of variables in a regression or classification problem in a natural way. The following technique was described in Breiman's original paper and is implemented in the R package ''randomForest''. The first step in measuring the variable importance in a data set

\mathcal_n =\_^n

is to fit a random forest to the data. During the fitting process the out-of-bag error for each data point is recorded and averaged over the forest (errors on an independent test set can be substituted if bagging is not used during training). To measure the importance of the

j

-th feature after training, the values of the

j

-th feature are permuted among the training data and the out-of-bag error is again computed on this perturbed data set. The importance score for the

j

-th feature is computed by averaging the difference in out-of-bag error before and after the permutation over all trees. The score is normalized by the standard deviation of these differences. Features which produce large values for this score are ranked as more important than features which produce small values. The statistical definition of the variable importance measure was given and analyzed by Zhu ''et al.'' This method of determining variable importance has some drawbacks. For data including categorical variables with different number of levels, random forests are biased in favor of those attributes with more levels. Methods such as

partial permutation In combinatorial mathematics, a partial permutation, or sequence without repetition, on a finite set ''S'' is a bijection between two specified subsets of ''S''. That is, it is defined by two subsets ''U'' and ''V'' of equal size, and a one-to-on ...

s and growing unbiased trees can be used to solve the problem. If the data contain groups of correlated features of similar relevance for the output, then smaller groups are favored over larger groups.

Relationship to nearest neighbors

A relationship between random forests and the -nearest neighbor algorithm (-NN) was pointed out by Lin and Jeon in 2002. It turns out that both can be viewed as so-called ''weighted neighborhoods schemes''. These are models built from a training set

\_^n

that make predictions

\hat

for new points by looking at the "neighborhood" of the point, formalized by a weight function : :

\hat = \sum_^n W(x_i, x') \, y_i.

Here,

W(x_i, x')

is the non-negative weight of the 'th training point relative to the new point in the same tree. For any particular , the weights for points

x_i

must sum to one. Weight functions are given as follows: * In -NN, the weights are

W(x_i, x') = \frac

if is one of the points closest to , and zero otherwise. * In a tree,

W(x_i, x') = \frac

if is one of the points in the same leaf as , and zero otherwise. Since a forest averages the predictions of a set of trees with individual weight functions

W_j

, its predictions are :

\hat = \frac\sum_^m\sum_^n W_(x_i, x') \, y_i = \sum_^n\left(\frac\sum_^m W_(x_i, x')\right) \, y_i.

This shows that the whole forest is again a weighted neighborhood scheme, with weights that average those of the individual trees. The neighbors of in this interpretation are the points

x_i

sharing the same leaf in any tree

j

. In this way, the neighborhood of depends in a complex way on the structure of the trees, and thus on the structure of the training set. Lin and Jeon show that the shape of the neighborhood used by a random forest adapts to the local importance of each feature.

Unsupervised learning with random forests

As part of their construction, random forest predictors naturally lead to a dissimilarity measure among the observations. One can also define a random forest dissimilarity measure between unlabeled data: the idea is to construct a random forest predictor that distinguishes the "observed" data from suitably generated synthetic data. The observed data are the original unlabeled data and the synthetic data are drawn from a reference distribution. A random forest dissimilarity can be attractive because it handles mixed variable types very well, is invariant to monotonic transformations of the input variables, and is robust to outlying observations. The random forest dissimilarity easily deals with a large number of semi-continuous variables due to its intrinsic variable selection; for example, the "Addcl 1" random forest dissimilarity weighs the contribution of each variable according to how dependent it is on other variables. The random forest dissimilarity has been used in a variety of applications, e.g. to find clusters of patients based on tissue marker data.

Variants

Instead of decision trees, linear models have been proposed and evaluated as base estimators in random forests, in particular

multinomial logistic regression In statistics, multinomial logistic regression is a statistical classification, classification method that generalizes logistic regression to multiclass classification, multiclass problems, i.e. with more than two possible discrete outcomes. T ...

and

naive Bayes classifier In statistics, naive Bayes classifiers are a family of simple " probabilistic classifiers" based on applying Bayes' theorem with strong (naive) independence assumptions between the features (see Bayes classifier). They are among the simplest Baye ...

s. In cases that the relationship between the predictors and the target variable is linear, the base learners may have an equally high accuracy as the ensemble learner.

Kernel random forest

In machine learning, kernel random forests (KeRF) establish the connection between random forests and

kernel method In machine learning, kernel machines are a class of algorithms for pattern analysis, whose best known member is the support-vector machine (SVM). The general task of pattern analysis is to find and study general types of relations (for example ...

s. By slightly modifying their definition, random forests can be rewritten as

s, which are more interpretable and easier to analyze.

History

was the first person to notice the link between random forest and

kernel methods In machine learning, kernel machines are a class of algorithms for pattern analysis, whose best known member is the support-vector machine (SVM). The general task of pattern analysis is to find and study general types of relations (for example c ...

. He pointed out that random forests which are grown using i.i.d. random vectors in the tree construction are equivalent to a kernel acting on the true margin. Lin and Jeon established the connection between random forests and adaptive nearest neighbor, implying that random forests can be seen as adaptive kernel estimates. Davies and Ghahramani proposed Random Forest Kernel and show that it can empirically outperform state-of-art kernel methods. Scornet first defined KeRF estimates and gave the explicit link between KeRF estimates and random forest. He also gave explicit expressions for kernels based on centered random forest and uniform random forest, two simplified models of random forest. He named these two KeRFs Centered KeRF and Uniform KeRF, and proved upper bounds on their rates of consistency.

Notations and definitions

Preliminaries: Centered forests

Centered forest is a simplified model for Breiman's original random forest, which uniformly selects an attribute among all attributes and performs splits at the center of the cell along the pre-chosen attribute. The algorithm stops when a fully binary tree of level

k

is built, where

k \in\mathbb

is a parameter of the algorithm.

Uniform forest

Uniform forest is another simplified model for Breiman's original random forest, which uniformly selects a feature among all features and performs splits at a point uniformly drawn on the side of the cell, along the preselected feature.

From random forest to KeRF

Given a training sample

\mathcal_n =\_^n

p\times\mathbb

-valued independent random variables distributed as the independent prototype pair

(\mathbf, Y)

, where

\infty

. We aim at predicting the response

Y

, associated with the random variable

\mathbf

, by estimating the regression function

m(\mathbf)=\operatorname \mid \mathbf = \mathbf /math>. A random regression forest is an ensemble of M randomized regression trees. Denote m_n(\mathbf,\mathbf_j) the predicted value at point \mathbf by the j -th tree, where \mathbf_1,\ldots,\mathbf_M are independent random variables, distributed as a generic random variable \mathbf, independent of the sample \mathcal_n . This random variable can be used to describe the randomness induced by node splitting and the sampling procedure for tree construction. The trees are combined to form the finite forest estimate m_(\mathbf,\Theta_1,\ldots,\Theta_M) = \frac\sum_^M m_n(\mathbf,\Theta_j) .
For regression trees, we have m_n = \sum_^n\frac, where A_n(\mathbf,\Theta_j) is the cell containing \mathbf, designed with randomness \Theta_j and dataset \mathcal_n, and N_n(\mathbf, \Theta_j) = \sum_^n \mathbf_.

Thus random forest estimates satisfy, for all \mathbf\in,1 d, m_(\mathbf, \Theta_1,\ldots,\Theta_M) =\frac\sum_^M \left(\sum_^n\frac\right) . Random regression forest has two levels of averaging, first over the samples in the target cell of a tree, then over all trees. Thus the contributions of observations that are in cells with a high density of data points are smaller than that of observations which belong to less populated cells. In order to improve the random forest methods and compensate the misestimation, Scornet defined KeRF by

: \tilde_(\mathbf, \Theta_1,\ldots,\Theta_M) = \frac\sum_^M\sum_^n Y_i\mathbf_, which is equal to the mean of the Y_i's falling in the cells containing \mathbf in the forest. If we define the connection function of the M finite forest as K_(\mathbf, \mathbf) = \frac \sum_^M \mathbf_, i.e. the proportion of cells shared between \mathbf and \mathbf, then almost surely we have \tilde_(\mathbf, \Theta_1,\ldots,\Theta_M) =
\frac, which defines the KeRF.

Centered KeRF

The construction of Centered KeRF of level

k

is the same as for centered forest, except that predictions are made by

\tilde_(\mathbf, \Theta_1,\ldots,\Theta_M)

, the corresponding kernel function, or connection function is :

d. \end

Uniform KeRF

Uniform KeRF is built in the same way as uniform forest, except that predictions are made by

\tilde_(\mathbf, \Theta_1,\ldots,\Theta_M)

, the corresponding kernel function, or connection function is :

K_k^(\mathbf,\mathbf) =
\sum_
\frac\left(\frac\right)^k
\prod_^d\left(1-, x_m, \sum_^\frac\right) \text \mathbf\in,1 d.

Properties

Relation between KeRF and random forest

Predictions given by KeRF and random forests are close if the number of points in each cell is controlled:

Assume that there exist sequences $(a_n),(b_n)$ such that, almost surely, : $a_n\leq N_n(\mathbf,\Theta)\leq b_n \text a_n\leq \frac 1 M \sum_^M N_n \leq b_n.$ Then almost surely, : $, m_(\mathbf) - \tilde_(\mathbf), \le\frac \tilde_(\mathbf).$

Relation between infinite KeRF and infinite random forest

When the number of trees

M

goes to infinity, then we have infinite random forest and infinite KeRF. Their estimates are close if the number of observations in each cell is bounded:

Assume that there exist sequences $(\varepsilon_n), (a_n),(b_n)$ such that, almost surely * $\ge 1,$ * $\ge 1-\varepsilon_n/2,$ * $\le b_n\mid \mathcal_n\ge 1-\varepsilon_n/2,$ Then almost surely, : $, m_(\mathbf)-\tilde_(\mathbf), \le
\frac\tilde_(\mathbf) + n \varepsilon_n \left( \max_ Y_i \right).$

Consistency results

Assume that

Y = m(\mathbf) + \varepsilon

, where

\varepsilon

is a centered Gaussian noise, independent of

\mathbf

, with finite variance

\sigma^2<\infty

. Moreover,

\mathbf

is uniformly distributed on

,1 d

and

m

Lipschitz Lipschitz, Lipshitz, or Lipchitz, is an Ashkenazi Jewish (Yiddish/German-Jewish) surname. The surname has many variants, including: Lifshitz (Lifschitz), Lifshits, Lifshuts, Lefschetz; Lipschitz, Lipshitz, Lipshits, Lopshits, Lipschutz (Lipschütz ...

. Scornet proved upper bounds on the rates of consistency for centered KeRF and uniform KeRF.

Consistency of centered KeRF

Providing

k\rightarrow\infty

and

n/2^k\rightarrow\infty

, there exists a constant

C_1>0

such that, for all

n

2 \le C_1 n^(\log n)^2

Consistency of uniform KeRF

Providing

k\rightarrow\infty

and

n/2^k\rightarrow\infty

, there exists a constant

C>0

such that,

2\le Cn^(\log n)^2

Disadvantages

While random forests often achieve higher accuracy than a single decision tree, they sacrifice the intrinsic

interpretability In mathematical logic, interpretability is a relation between formal theories that expresses the possibility of interpreting or translating one into the other. Informal definition Assume ''T'' and ''S'' are formal theories. Slightly simplified, '' ...

present in decision trees. Decision trees are among a fairly small family of machine learning models that are easily interpretable along with linear models, rule-based models, and

attention Attention is the behavioral and cognitive process of selectively concentrating on a discrete aspect of information, whether considered subjective or objective, while ignoring other perceivable information. William James (1890) wrote that "Att ...

-based models. This interpretability is one of the most desirable qualities of decision trees. It allows developers to confirm that the model has learned realistic information from the data and allows end-users to have trust and confidence in the decisions made by the model. For example, following the path that a decision tree takes to make its decision is quite trivial, but following the paths of tens or hundreds of trees is much harder. To achieve both performance and interpretability, some model compression techniques allow transforming a random forest into a minimal "born-again" decision tree that faithfully reproduces the same decision function. If it is established that the predictive attributes are linearly correlated with the target variable, using random forest may not enhance the accuracy of the base learner. Furthermore, in problems with multiple categorical variables, random forest may not be able to increase the accuracy of the base learner.

References

External links

Random Forests classifier description
(Leo Breiman's site)
Liaw, Andy & Wiener, Matthew "Classification and Regression by randomForest" R News (2002) Vol. 2/3 p. 18
(Discussion of the use of the random forest package for R) Classification algorithms Ensemble learning Decision trees Decision theory Computational statistics

History

Algorithm

Preliminaries: decision tree learning

Bagging

From bagging to random forests

ExtraTrees

Properties

Variable importance

Relationship to nearest neighbors

Unsupervised learning with random forests

Variants

Kernel random forest

History

Notations and definitions

Preliminaries: Centered forests

Uniform forest

From random forest to KeRF

Centered KeRF

Uniform KeRF

Properties

Relation between KeRF and random forest

Relation between infinite KeRF and infinite random forest

Consistency results

Consistency of centered KeRF

Consistency of uniform KeRF

Disadvantages

See also

References

Further reading

External links