Random forest


Random forests or random decision forests are an ensemble learning method for classification, regression and other tasks that operates by constructing a multitude of decision trees at training time. For classification tasks, the output of the random forest is the class selected by most trees. For regression tasks, the mean prediction of the individual trees is returned. [1] [2] Random decision forests correct for decision trees' habit of overfitting to their training set. [3] :587–588


The first algorithm for random decision forests was created in 1995 by Tin Kam Ho [1] using the random subspace method, [2] which, in Ho's formulation, is a way to implement the "stochastic discrimination" approach to classification proposed by Eugene Kleinberg. [4] [5] [6]

An extension of the algorithm was developed by Leo Breiman [7] and Adele Cutler, [8] who registered [9] "Random Forests" as a trademark in 2006 (as of 2019, owned by Minitab, Inc.). [10] The extension combines Breiman's "bagging" idea and random selection of features, introduced first by Ho [1] and later independently by Amit and Geman [11] in order to construct a collection of decision trees with controlled variance.

History

The general method of random decision forests was first proposed by Ho in 1995. [1] Ho established that forests of trees splitting with oblique hyperplanes can gain accuracy as they grow without suffering from overtraining, as long as the forests are randomly restricted to be sensitive to only selected feature dimensions. A subsequent work along the same lines [2] concluded that other splitting methods behave similarly, as long as they are randomly forced to be insensitive to some feature dimensions. Note that this observation of a more complex classifier (a larger forest) getting more accurate nearly monotonically is in sharp contrast to the common belief that the complexity of a classifier can only grow to a certain level of accuracy before being hurt by overfitting. The explanation of the forest method's resistance to overtraining can be found in Kleinberg's theory of stochastic discrimination. [4] [5] [6]

The early development of Breiman's notion of random forests was influenced by the work of Amit and Geman, [11] who introduced the idea of searching over a random subset of the available decisions when splitting a node, in the context of growing a single tree. The idea of random subspace selection from Ho [2] was also influential in the design of random forests: in this method a forest of trees is grown, and variation among the trees is introduced by projecting the training data into a randomly chosen subspace before fitting each tree or each node. Finally, the idea of randomized node optimization, where the decision at each node is selected by a randomized procedure rather than a deterministic optimization, was first introduced by Thomas G. Dietterich. [12]

The proper introduction of random forests was made in a paper by Leo Breiman. [7] This paper describes a method of building a forest of uncorrelated trees using a CART-like procedure, combined with randomized node optimization and bagging. In addition, this paper combines several ingredients, some previously known and some novel, which form the basis of the modern practice of random forests, in particular:

  1. Using out-of-bag error as an estimate of the generalization error.
  2. Measuring variable importance through permutation.

The report also offers the first theoretical result for random forests in the form of a bound on the generalization error which depends on the strength of the trees in the forest and their correlation.

Algorithm

Preliminaries: decision tree learning

Decision trees are a popular method for various machine learning tasks. Tree learning "come[s] closest to meeting the requirements for serving as an off-the-shelf procedure for data mining", say Hastie et al., "because it is invariant under scaling and various other transformations of feature values, is robust to inclusion of irrelevant features, and produces inspectable models. However, they are seldom accurate". [3] :352

In particular, trees that are grown very deep tend to learn highly irregular patterns: they overfit their training sets, i.e. have low bias, but very high variance. Random forests are a way of averaging multiple deep decision trees, trained on different parts of the same training set, with the goal of reducing the variance. [3] :587–588 This comes at the expense of a small increase in the bias and some loss of interpretability, but generally greatly boosts the performance in the final model.

Bagging

Illustration of training a Random Forest model. The training dataset (in this case, of 250 rows and 100 columns) is randomly sampled with replacement n times. Then, a decision tree is trained on each sample. Finally, for prediction, the results of all n trees are aggregated to produce a final decision.

The training algorithm for random forests applies the general technique of bootstrap aggregating, or bagging, to tree learners. Given a training set X = x1, ..., xn with responses Y = y1, ..., yn, bagging repeatedly (B times) selects a random sample with replacement of the training set and fits trees to these samples:

For b = 1, ..., B:
  1. Sample, with replacement, n training examples from X, Y; call these Xb, Yb.
  2. Train a classification or regression tree fb on Xb, Yb.

After training, predictions for unseen samples x' can be made by averaging the predictions from all the individual regression trees on x': $\hat{f} = \frac{1}{B} \sum_{b=1}^{B} f_b(x'),$

or by taking the plurality vote in the case of classification trees.
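As a minimal sketch of the procedure above (not a reference implementation), the following Python code fits B regression trees on bootstrap samples and aggregates their predictions; the helper names bagged_trees and predict_bagged are invented for this illustration, and scikit-learn's DecisionTreeRegressor stands in for an arbitrary tree learner.

    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    def bagged_trees(X, y, B=100, random_state=0):
        """Fit B regression trees, each on a bootstrap sample of (X, y)."""
        rng = np.random.default_rng(random_state)
        n = len(X)
        trees = []
        for _ in range(B):
            idx = rng.integers(0, n, size=n)                              # step 1: sample n rows with replacement
            trees.append(DecisionTreeRegressor().fit(X[idx], y[idx]))     # step 2: fit a tree on the sample
        return trees

    def predict_bagged(trees, X_new):
        """Average the per-tree predictions; a plurality vote would replace the mean for classification."""
        preds = np.stack([t.predict(X_new) for t in trees])               # shape (B, n_new)
        return preds.mean(axis=0), preds.std(axis=0)                      # bagged prediction and its spread

The second return value anticipates the uncertainty estimate discussed below: the standard deviation of the individual tree predictions.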

This bootstrapping procedure leads to better model performance because it decreases the variance of the model, without increasing the bias. This means that while the predictions of a single tree are highly sensitive to noise in its training set, the average of many trees is not, as long as the trees are not correlated. Simply training many trees on a single training set would give strongly correlated trees (or even the same tree many times, if the training algorithm is deterministic); bootstrap sampling is a way of de-correlating the trees by showing them different training sets.

Additionally, an estimate of the uncertainty of the prediction can be made as the standard deviation of the predictions from all the individual regression trees on x': $\sigma = \sqrt{\frac{\sum_{b=1}^{B} \left( f_b(x') - \hat{f} \right)^2}{B - 1}}.$

The number of samples/trees, B, is a free parameter. Typically, a few hundred to several thousand trees are used, depending on the size and nature of the training set. An optimal number of trees B can be found using cross-validation, or by observing the out-of-bag error: the mean prediction error on each training sample xi, using only the trees that did not have xi in their bootstrap sample. [13] The training and test error tend to level off after some number of trees have been fit.
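The out-of-bag idea can be sketched in the same style; this is an illustrative approximation (squared-error loss, regression trees) rather than a canonical implementation, and the function name oob_error is made up here.

    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    def oob_error(X, y, B=200, random_state=0):
        """Mean squared error of each training row, predicted only by trees that never saw it."""
        rng = np.random.default_rng(random_state)
        n = len(X)
        pred_sum = np.zeros(n)        # running sum of out-of-bag predictions per row
        pred_cnt = np.zeros(n)        # number of trees for which the row was out of bag
        for _ in range(B):
            idx = rng.integers(0, n, size=n)
            oob = np.setdiff1d(np.arange(n), idx)       # rows left out of this bootstrap sample
            if oob.size == 0:
                continue
            tree = DecisionTreeRegressor().fit(X[idx], y[idx])
            pred_sum[oob] += tree.predict(X[oob])
            pred_cnt[oob] += 1
        covered = pred_cnt > 0
        return np.mean((pred_sum[covered] / pred_cnt[covered] - y[covered]) ** 2)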

From bagging to random forests

The above procedure describes the original bagging algorithm for trees. Random forests differ in only one way from this general scheme: they use a modified tree learning algorithm that selects, at each candidate split in the learning process, a random subset of the features. This process is sometimes called "feature bagging". The reason for doing this is the correlation of the trees in an ordinary bootstrap sample: if one or a few features are very strong predictors for the response variable (target output), these features will be selected in many of the B trees, causing them to become correlated. An analysis of how bagging and random subspace projection contribute to accuracy gains under different conditions is given by Ho. [14]

Typically, for a classification problem with p features, √p (rounded down) features are used in each split. [3] :592 For regression problems the inventors recommend p/3 (rounded down) with a minimum node size of 5 as the default. [3] :592 In practice, the best values for these parameters should be tuned on a case-by-case basis for every problem. [3] :592
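As a hedged illustration of these rules of thumb, the snippet below computes the per-split feature counts for p = 100 and shows one way to express them through scikit-learn's max_features parameter; library defaults vary between versions and should be checked rather than assumed.

    import math
    from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

    p = 100                                     # total number of features
    print(math.floor(math.sqrt(p)))             # 10 candidate features per split (classification heuristic)
    print(math.floor(p / 3))                    # 33 candidate features per split (regression heuristic)

    clf = RandomForestClassifier(max_features="sqrt")   # floor(sqrt(p)) features considered at each split
    reg = RandomForestRegressor(max_features=1/3)       # roughly p/3 features considered at each split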

ExtraTrees

Adding one further step of randomization yields extremely randomized trees, or ExtraTrees. While similar to ordinary random forests in that they are an ensemble of individual trees, there are two main differences: first, each tree is trained using the whole learning sample (rather than a bootstrap sample), and second, the top-down splitting in the tree learner is randomized. Instead of computing the locally optimal cut-point for each feature under consideration (based on, e.g., information gain or the Gini impurity), a random cut-point is selected. This value is selected from a uniform distribution within the feature's empirical range (in the tree's training set). Then, of all the randomly generated splits, the split that yields the highest score is chosen to split the node. Similar to ordinary random forests, the number of randomly selected features to be considered at each node can be specified. Default values for this parameter are √p for classification and p for regression, where p is the number of features in the model. [15]
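A minimal sketch of the extra-trees split rule for regression, assuming variance reduction as the score (the original method also covers classification criteria); the function name random_split is invented for this example.

    import numpy as np

    def random_split(X, y, n_candidate_features, rng):
        """Draw one random cut-point per candidate feature and keep the best-scoring split."""
        best = None
        for j in rng.choice(X.shape[1], size=n_candidate_features, replace=False):
            lo, hi = X[:, j].min(), X[:, j].max()
            if lo == hi:
                continue                                  # feature is constant in this node
            cut = rng.uniform(lo, hi)                     # random cut-point in the feature's empirical range
            left = X[:, j] <= cut
            if left.all() or not left.any():
                continue
            # variance reduction achieved by the split (higher is better)
            score = y.var() - (left.mean() * y[left].var() + (~left).mean() * y[~left].var())
            if best is None or score > best[0]:
                best = (score, j, cut)
        return best      # (score, feature index, cut-point), or None if no valid split exists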

Random forests for high-dimensional data

The basic Random Forest procedure may not work well in situations where there are a large number of features but only a small proportion of these features are informative with respect to sample classification. This can be addressed by encouraging the procedure to focus mainly on features and trees that are informative. Some methods for accomplishing this are:

- Prefiltering: Eliminate features that are mostly just noise. [16] [17]

- Enriched Random Forest (ERF): Use weighted random sampling instead of simple random sampling at each node of each tree, giving greater weight to features that appear to be more informative. [18] [19]

- Tree Weighted Random Forest (TWRF): Weight trees so that trees exhibiting better accuracy are assigned higher weights. [20] [21]

Properties

Variable importance

Random forests can be used to rank the importance of variables in a regression or classification problem in a natural way. The following technique was described in Breiman's original paper [7] and is implemented in the R package randomForest. [8]

Permutation Importance

The first step in measuring the variable importance in a data set is to fit a random forest to the data. During the fitting process the out-of-bag error for each data point is recorded and averaged over the forest (errors on an independent test set can be substituted if bagging is not used during training).

To measure the importance of the j-th feature after training, the values of the j-th feature are permuted in the out-of-bag samples and the out-of-bag error is again computed on this perturbed data set. The importance score for the j-th feature is computed by averaging the difference in out-of-bag error before and after the permutation over all trees. The score is normalized by the standard deviation of these differences.

Features which produce large values for this score are ranked as more important than features which produce small values. The statistical definition of the variable importance measure was given and analyzed by Zhu et al. [22]
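The permutation idea can be sketched as follows. For brevity, this version permutes each feature on a held-out validation set and measures the drop in accuracy, rather than tracking the per-tree out-of-bag samples of Breiman's formulation, so it should be read as an approximation of the procedure described above; the function name is invented here.

    import numpy as np

    def permutation_importance_scores(model, X_valid, y_valid, n_repeats=10, random_state=0):
        """Average drop in accuracy when each column of X_valid is shuffled."""
        rng = np.random.default_rng(random_state)
        baseline = np.mean(model.predict(X_valid) == y_valid)
        scores = np.zeros(X_valid.shape[1])
        for j in range(X_valid.shape[1]):
            drops = []
            for _ in range(n_repeats):
                X_perm = X_valid.copy()
                X_perm[:, j] = rng.permutation(X_perm[:, j])   # break the link between feature j and the target
                drops.append(baseline - np.mean(model.predict(X_perm) == y_valid))
            scores[j] = np.mean(drops)
        return scores                                          # larger drop => more important feature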

This method of determining variable importance has some drawbacks.

  • For data including categorical variables with different numbers of levels, random forests are biased in favor of attributes with more levels. Methods such as partial permutations [23] [24] [25] and growing unbiased trees [26] [27] can be used to solve the problem.
  • If the data contain groups of correlated features of similar relevance for the output, then smaller groups are favored over larger groups. [28]
  • Additionally, the permutation procedure may fail to identify important features when there are collinear features. In this case permuting groups of correlated features together is a remedy. [29]

Mean Decrease in Impurity Feature Importance

This feature importance for random forests is the default implementation in scikit-learn and R. It is described in the book "Classification and Regression Trees" by Leo Breiman. [30] Variables whose splits produce large decreases in impurity are considered important: [31]

$\text{unnormalized average importance}(x) = \frac{1}{n_T} \sum_{i=1}^{n_T} \sum_{\substack{\text{node } j \in T_i \\ \text{split variable}(j) = x}} p_{T_i}(j)\,\Delta i_{T_i}(j),$

where $x$ indicates a feature, $n_T$ is the number of trees in the forest, $T_i$ indicates tree $i$, $p_{T_i}(j) = \frac{n_j}{n}$ is the fraction of samples reaching node $j$, and $\Delta i_{T_i}(j)$ is the change in impurity in tree $i$ at node $j$. As impurity measure for samples falling in a node, e.g. the Gini impurity, the information entropy (for classification) or the mean squared error (for regression) can be used.

The normalized importance is then obtained by normalizing over all features, so that the sum of normalized feature importances is 1.
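As a brief hedged example, scikit-learn exposes this impurity-based measure through the feature_importances_ attribute of a fitted forest; the synthetic data below, in which only the first feature carries signal, is used purely for illustration.

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier

    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 5))
    y = (X[:, 0] + 0.1 * rng.normal(size=500) > 0).astype(int)   # only feature 0 is informative

    forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
    print(forest.feature_importances_)   # normalized mean decrease in impurity, summing to 1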

The scikit-learn default implementation of Mean Decrease in Impurity Feature Importance is susceptible to misleading feature importances: [29]

  • the importance measure prefers high cardinality features
  • it uses training statistics and therefore does not "reflect the ability of feature to be useful to make predictions that generalize to the test set" [32]

Relationship to nearest neighbors

A relationship between random forests and the k-nearest neighbor algorithm (k-NN) was pointed out by Lin and Jeon in 2002. [33] It turns out that both can be viewed as so-called weighted neighborhoods schemes. These are models built from a training set $\{(x_i, y_i)\}_{i=1}^{n}$ that make predictions $\hat{y}$ for new points x' by looking at the "neighborhood" of the point, formalized by a weight function W: $\hat{y} = \sum_{i=1}^{n} W(x_i, x')\, y_i.$

Here, $W(x_i, x')$ is the non-negative weight of the i-th training point relative to the new point x' in the same tree. For any particular x', the weights for points $x_i$ must sum to one. Weight functions are given as follows:

  • In k-NN, the weights are $W(x_i, x') = \frac{1}{k}$ if $x_i$ is one of the k points closest to x', and zero otherwise.
  • In a tree, $W(x_i, x') = \frac{1}{k'}$ if $x_i$ is one of the k' points in the same leaf as x', and zero otherwise.

Since a forest averages the predictions of a set of m trees with individual weight functions $W_j$, its predictions are $\hat{y} = \frac{1}{m} \sum_{j=1}^{m} \sum_{i=1}^{n} W_j(x_i, x')\, y_i = \sum_{i=1}^{n} \left( \frac{1}{m} \sum_{j=1}^{m} W_j(x_i, x') \right) y_i.$

This shows that the whole forest is again a weighted neighborhood scheme, with weights that average those of the individual trees. The neighbors of x' in this interpretation are the points $x_i$ sharing the same leaf in any tree $j$. In this way, the neighborhood of x' depends in a complex way on the structure of the trees, and thus on the structure of the training set. Lin and Jeon show that the shape of the neighborhood used by a random forest adapts to the local importance of each feature. [33]
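A small numerical sketch of this interpretation, assuming scikit-learn's apply() method for retrieving leaf indices and bootstrap sampling switched off so that each tree's leaf value is exactly the mean of the training responses in that leaf; under these assumptions the weighted sum reproduces the forest prediction.

    import numpy as np
    from sklearn.ensemble import RandomForestRegressor

    rng = np.random.default_rng(0)
    X = rng.uniform(size=(200, 3))
    y = X[:, 0] + rng.normal(scale=0.1, size=200)

    forest = RandomForestRegressor(n_estimators=50, bootstrap=False, random_state=0).fit(X, y)
    x_new = rng.uniform(size=(1, 3))

    train_leaves = forest.apply(X)      # (n_train, n_trees) leaf index of each training point in each tree
    new_leaves = forest.apply(x_new)    # (1, n_trees) leaf index of the query point in each tree

    # W[i] = average over trees of 1{same leaf as x'} / (size of that leaf)
    same_leaf = train_leaves == new_leaves
    W = (same_leaf / same_leaf.sum(axis=0)).mean(axis=1)

    print(W @ y, forest.predict(x_new)[0])   # the two numbers agree up to floating-point error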

Unsupervised learning with random forests

As part of their construction, random forest predictors naturally lead to a dissimilarity measure among the observations. One can also define a random forest dissimilarity measure between unlabeled data: the idea is to construct a random forest predictor that distinguishes the "observed" data from suitably generated synthetic data. [7] [34] The observed data are the original unlabeled data and the synthetic data are drawn from a reference distribution. A random forest dissimilarity can be attractive because it handles mixed variable types very well, is invariant to monotonic transformations of the input variables, and is robust to outlying observations. The random forest dissimilarity easily deals with a large number of semi-continuous variables due to its intrinsic variable selection; for example, the "Addcl 1" random forest dissimilarity weighs the contribution of each variable according to how dependent it is on other variables. The random forest dissimilarity has been used in a variety of applications, e.g. to find clusters of patients based on tissue marker data. [35]
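A hedged sketch of the idea, in the spirit of the "Addcl 1" construction: synthetic observations are generated by permuting each column independently, which preserves the marginal distributions while destroying the dependencies between features. Converting proximity to dissimilarity as 1 − proximity is one common choice, and the function name is invented here.

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier

    def rf_dissimilarity(X, n_estimators=500, random_state=0):
        """1 minus the fraction of trees in which two observations share a leaf."""
        rng = np.random.default_rng(random_state)
        # Synthetic data: permute each column independently to break inter-feature dependence.
        X_synth = np.column_stack([rng.permutation(X[:, j]) for j in range(X.shape[1])])
        X_all = np.vstack([X, X_synth])
        y_all = np.r_[np.ones(len(X)), np.zeros(len(X_synth))]   # observed vs synthetic labels

        forest = RandomForestClassifier(n_estimators=n_estimators,
                                        random_state=random_state).fit(X_all, y_all)
        leaves = forest.apply(X)                                  # leaf indices of the observed data only
        proximity = np.mean(leaves[:, None, :] == leaves[None, :, :], axis=2)
        return 1.0 - proximity          # an (n, n) matrix usable by e.g. hierarchical clustering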

Variants

Instead of decision trees, linear models have been proposed and evaluated as base estimators in random forests, in particular multinomial logistic regression and naive Bayes classifiers. [36] [37] [38] In cases where the relationship between the predictors and the target variable is linear, the base learners may have an accuracy as high as that of the ensemble learner. [39] [36]

Kernel random forest

In machine learning, kernel random forests (KeRF) establish the connection between random forests and kernel methods. By slightly modifying their definition, random forests can be rewritten as kernel methods, which are more interpretable and easier to analyze. [40]

History

Leo Breiman [41] was the first person to notice the link between random forest and kernel methods. He pointed out that random forests which are grown using i.i.d. random vectors in the tree construction are equivalent to a kernel acting on the true margin. Lin and Jeon [42] established the connection between random forests and adaptive nearest neighbors, implying that random forests can be seen as adaptive kernel estimates. Davies and Ghahramani [43] proposed the Random Forest Kernel and showed that it can empirically outperform state-of-the-art kernel methods. Scornet [40] first defined KeRF estimates and gave the explicit link between KeRF estimates and random forests. He also gave explicit expressions for kernels based on the centered random forest [44] and the uniform random forest, [45] two simplified models of random forest. He named these two KeRFs Centered KeRF and Uniform KeRF, and proved upper bounds on their rates of consistency.

Notations and definitions

Preliminaries: Centered forests

Centered forest [44] is a simplified model of Breiman's original random forest, which uniformly selects an attribute among all attributes and performs splits at the center of the cell along the pre-chosen attribute. The algorithm stops when a fully binary tree of level $k$ is built, where $k \in \mathbb{N}$ is a parameter of the algorithm.
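A compact sketch of the leaf cells of one centered tree of level k, assuming the input space is [0, 1]^d as in the analyses cited here; storing cell boundaries is enough to determine which cell a query point falls into.

    import numpy as np

    def centered_cells(d, k, rng, lower=None, upper=None):
        """Recursively split [0,1]^d at cell centers along random coordinates, down to level k."""
        if lower is None:
            lower, upper = np.zeros(d), np.ones(d)
        if k == 0:
            return [(lower, upper)]                  # a leaf cell of the centered tree
        j = rng.integers(d)                          # attribute chosen uniformly at random
        mid = (lower[j] + upper[j]) / 2              # split at the center of the cell
        left_upper, right_lower = upper.copy(), lower.copy()
        left_upper[j], right_lower[j] = mid, mid
        return (centered_cells(d, k - 1, rng, lower, left_upper)
                + centered_cells(d, k - 1, rng, right_lower, upper))

    cells = centered_cells(d=2, k=3, rng=np.random.default_rng(0))
    print(len(cells))    # 2**k = 8 leaf cells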

Uniform forest

Uniform forest [45] is another simplified model for Breiman's original random forest, which uniformly selects a feature among all features and performs splits at a point uniformly drawn on the side of the cell, along the preselected feature.

From random forest to KeRF

Given a training sample $\mathcal{D}_n = ((\mathbf{X}_1, Y_1), \ldots, (\mathbf{X}_n, Y_n))$ of $[0,1]^p \times \mathbb{R}$-valued independent random variables distributed as the independent prototype pair $(\mathbf{X}, Y)$, where $\operatorname{E}[Y^2] < \infty$, we aim at predicting the response $Y$, associated with the random variable $\mathbf{X}$, by estimating the regression function $m(\mathbf{x}) = \operatorname{E}[Y \mid \mathbf{X} = \mathbf{x}]$. A random regression forest is an ensemble of $M$ randomized regression trees. Denote by $m_n(\mathbf{x}, \Theta_j)$ the predicted value at point $\mathbf{x}$ by the $j$-th tree, where $\Theta_1, \ldots, \Theta_M$ are independent random variables, distributed as a generic random variable $\Theta$, independent of the sample $\mathcal{D}_n$. This random variable can be used to describe the randomness induced by node splitting and the sampling procedure for tree construction. The trees are combined to form the finite forest estimate $m_{M,n}(\mathbf{x}, \Theta_1, \ldots, \Theta_M) = \frac{1}{M} \sum_{j=1}^{M} m_n(\mathbf{x}, \Theta_j)$. For regression trees, we have $m_n(\mathbf{x}, \Theta_j) = \sum_{i=1}^{n} \frac{Y_i \mathbf{1}_{\mathbf{X}_i \in A_n(\mathbf{x}, \Theta_j)}}{N_n(\mathbf{x}, \Theta_j)}$, where $A_n(\mathbf{x}, \Theta_j)$ is the cell containing $\mathbf{x}$, designed with randomness $\Theta_j$ and dataset $\mathcal{D}_n$, and $N_n(\mathbf{x}, \Theta_j) = \sum_{i=1}^{n} \mathbf{1}_{\mathbf{X}_i \in A_n(\mathbf{x}, \Theta_j)}$ is the number of training points falling in that cell.

Thus random forest estimates satisfy, for all $\mathbf{x} \in [0,1]^d$, $m_{M,n}(\mathbf{x}, \Theta_1, \ldots, \Theta_M) = \frac{1}{M} \sum_{j=1}^{M} \left( \sum_{i=1}^{n} \frac{Y_i \mathbf{1}_{\mathbf{X}_i \in A_n(\mathbf{x}, \Theta_j)}}{N_n(\mathbf{x}, \Theta_j)} \right)$. Random regression forest has two levels of averaging, first over the samples in the target cell of a tree, then over all trees. Thus the contributions of observations that are in cells with a high density of data points are smaller than that of observations which belong to less populated cells. In order to improve the random forest methods and compensate the misestimation, Scornet [40] defined KeRF by

$\tilde{m}_{M,n}(\mathbf{x}, \Theta_1, \ldots, \Theta_M) = \frac{1}{\sum_{j=1}^{M} N_n(\mathbf{x}, \Theta_j)} \sum_{j=1}^{M} \sum_{i=1}^{n} Y_i \mathbf{1}_{\mathbf{X}_i \in A_n(\mathbf{x}, \Theta_j)},$

which is equal to the mean of the $Y_i$'s falling in the cells containing $\mathbf{x}$ in the forest. If we define the connection function of the $M$ finite forest as $K_{M,n}(\mathbf{x}, \mathbf{z}) = \frac{1}{M} \sum_{j=1}^{M} \mathbf{1}_{\mathbf{z} \in A_n(\mathbf{x}, \Theta_j)}$, i.e. the proportion of cells shared between $\mathbf{x}$ and $\mathbf{z}$, then almost surely we have $\tilde{m}_{M,n}(\mathbf{x}, \Theta_1, \ldots, \Theta_M) = \frac{\sum_{i=1}^{n} Y_i K_{M,n}(\mathbf{x}, \mathbf{X}_i)}{\sum_{\ell=1}^{n} K_{M,n}(\mathbf{x}, \mathbf{X}_\ell)}$, which defines the KeRF.
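A hedged numerical sketch of the two estimates, reusing scikit-learn leaf indices as a stand-in for the cells A_n(x, Θ_j) (because the library trees are grown on bootstrap samples, the forest_estimate below only approximates m_{M,n}): the classical forest averages per-cell means across trees, while the KeRF pools all responses falling in the query's cells before taking a single mean.

    import numpy as np
    from sklearn.ensemble import RandomForestRegressor

    rng = np.random.default_rng(0)
    X = rng.uniform(size=(300, 2))
    y = X[:, 0] + rng.normal(scale=0.1, size=300)

    forest = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)
    x_new = rng.uniform(size=(1, 2))

    train_leaves = forest.apply(X)           # cells of the training points, one column per tree
    query_leaves = forest.apply(x_new)       # cell of the query point in each tree
    in_cell = train_leaves == query_leaves   # indicator 1{X_i in A_n(x, Theta_j)}, shape (n, M)

    per_tree_means = (in_cell * y[:, None]).sum(axis=0) / in_cell.sum(axis=0)
    forest_estimate = per_tree_means.mean()                         # average of per-cell means
    kerf_estimate = (in_cell * y[:, None]).sum() / in_cell.sum()    # single pooled mean (KeRF)
    print(forest_estimate, kerf_estimate)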

Centered KeRF

The construction of Centered KeRF of level $k$ is the same as for centered forest, except that predictions are made by $\tilde{m}_{M,n}(\mathbf{x}, \Theta_1, \ldots, \Theta_M)$; the corresponding kernel function, or connection function, is

$K_k^{cc}(\mathbf{x}, \mathbf{z}) = \sum_{\substack{k_1, \ldots, k_d \\ \sum_{j=1}^{d} k_j = k}} \frac{k!}{k_1! \cdots k_d!} \left( \frac{1}{d} \right)^{k} \prod_{j=1}^{d} \mathbf{1}_{\lceil 2^{k_j} x_j \rceil = \lceil 2^{k_j} z_j \rceil}, \quad \text{for all } \mathbf{x}, \mathbf{z} \in [0,1]^d.$

Uniform KeRF

Uniform KeRF is built in the same way as uniform forest, except that predictions are made by $\tilde{m}_{M,n}(\mathbf{x}, \Theta_1, \ldots, \Theta_M)$; the corresponding kernel function, or connection function, is

$K_k^{uf}(\mathbf{0}, \mathbf{x}) = \sum_{\substack{k_1, \ldots, k_d \\ \sum_{j=1}^{d} k_j = k}} \frac{k!}{k_1! \cdots k_d!} \left( \frac{1}{d} \right)^{k} \prod_{m=1}^{d} \left( 1 - |x_m| \sum_{j=0}^{k_m - 1} \frac{(-\ln |x_m|)^j}{j!} \right), \quad \text{for all } \mathbf{x} \in [0,1]^d.$

Properties

Relation between KeRF and random forest

Predictions given by KeRF and random forests are close if the number of points in each cell is controlled:

Assume that there exist sequences $(a_n), (b_n)$ such that, almost surely,

$a_n \leq N_n(\mathbf{x}, \Theta) \leq b_n \quad \text{and} \quad a_n \leq \frac{1}{M} \sum_{m=1}^{M} N_n(\mathbf{x}, \Theta_m) \leq b_n.$

Then almost surely,

$|m_{M,n}(\mathbf{x}) - \tilde{m}_{M,n}(\mathbf{x})| \leq \frac{b_n - a_n}{a_n} \, \tilde{m}_{M,n}(\mathbf{x}).$

Relation between infinite KeRF and infinite random forest

When the number of trees $M$ grows to infinity, we obtain the infinite random forest and the infinite KeRF. Their estimates are close if the number of observations in each cell is bounded:

Assume that there exist sequences $(\varepsilon_n), (a_n), (b_n)$ such that, almost surely,

$\operatorname{E}[N_n(\mathbf{x}, \Theta)] \geq 1,$
$\operatorname{P}[a_n \leq N_n(\mathbf{x}, \Theta) \leq b_n \mid \mathcal{D}_n] \geq 1 - \varepsilon_n / 2,$
$\operatorname{P}[a_n \leq \operatorname{E}_{\Theta}[N_n(\mathbf{x}, \Theta)] \leq b_n \mid \mathcal{D}_n] \geq 1 - \varepsilon_n / 2.$

Then almost surely,

$|m_{\infty,n}(\mathbf{x}) - \tilde{m}_{\infty,n}(\mathbf{x})| \leq \frac{b_n - a_n}{a_n} \, \tilde{m}_{\infty,n}(\mathbf{x}) + n \varepsilon_n \left( \max_{1 \leq i \leq n} Y_i \right).$

Consistency results

Assume that $Y = m(\mathbf{X}) + \varepsilon$, where $\varepsilon$ is a centered Gaussian noise, independent of $\mathbf{X}$, with finite variance $\sigma^2 < \infty$. Moreover, $\mathbf{X}$ is uniformly distributed on $[0,1]^d$ and $m$ is Lipschitz. Scornet [40] proved upper bounds on the rates of consistency for centered KeRF and uniform KeRF.

Consistency of centered KeRF

Providing $k \rightarrow \infty$ and $n / 2^k \rightarrow \infty$, there exists a constant $C_1 > 0$ such that, for all $n$, $\operatorname{E}\left[\tilde{m}_n^{cc}(\mathbf{X}) - m(\mathbf{X})\right]^2 \leq C_1 n^{-1/(3 + d \log 2)} (\log n)^2.$

Consistency of uniform KeRF

Providing $k \rightarrow \infty$ and $n / 2^k \rightarrow \infty$, there exists a constant $C > 0$ such that $\operatorname{E}\left[\tilde{m}_n^{uf}(\mathbf{X}) - m(\mathbf{X})\right]^2 \leq C n^{-2/(6 + 3 d \log 2)} (\log n)^2.$

Disadvantages

While random forests often achieve higher accuracy than a single decision tree, they sacrifice the intrinsic interpretability present in decision trees. Decision trees are among a fairly small family of machine learning models that are easily interpretable along with linear models, rule-based models, and attention-based models. This interpretability is one of the most desirable qualities of decision trees. It allows developers to confirm that the model has learned realistic information from the data and allows end-users to have trust and confidence in the decisions made by the model. [36] [3] For example, following the path that a decision tree takes to make its decision is quite trivial, but following the paths of tens or hundreds of trees is much harder. To achieve both performance and interpretability, some model compression techniques allow transforming a random forest into a minimal "born-again" decision tree that faithfully reproduces the same decision function. [36] [46] [47] If it is established that the predictive attributes are linearly correlated with the target variable, using random forest may not enhance the accuracy of the base learner. [36] [39] Furthermore, in problems with multiple categorical variables, random forest may not be able to increase the accuracy of the base learner. [48]


References

  1. Ho, Tin Kam (1995). Random Decision Forests (PDF). Proceedings of the 3rd International Conference on Document Analysis and Recognition, Montreal, QC, 14–16 August 1995. pp. 278–282. Archived from the original (PDF) on 17 April 2016. Retrieved 5 June 2016.
  2. Ho TK (1998). "The Random Subspace Method for Constructing Decision Forests" (PDF). IEEE Transactions on Pattern Analysis and Machine Intelligence. 20 (8): 832–844. doi:10.1109/34.709601. S2CID 206420153.
  3. Hastie, Trevor; Tibshirani, Robert; Friedman, Jerome (2008). The Elements of Statistical Learning (2nd ed.). Springer. ISBN 0-387-95284-5.
  4. Kleinberg E (1990). "Stochastic Discrimination" (PDF). Annals of Mathematics and Artificial Intelligence. 1 (1–4): 207–239. CiteSeerX 10.1.1.25.6750. doi:10.1007/BF01531079. S2CID 206795835. Archived from the original (PDF) on 2018-01-18.
  5. Kleinberg E (1996). "An Overtraining-Resistant Stochastic Modeling Method for Pattern Recognition". Annals of Statistics. 24 (6): 2319–2349. doi:10.1214/aos/1032181157. MR 1425956.
  6. Kleinberg E (2000). "On the Algorithmic Implementation of Stochastic Discrimination" (PDF). IEEE Transactions on PAMI. 22 (5): 473–490. CiteSeerX 10.1.1.33.4131. doi:10.1109/34.857004. S2CID 3563126. Archived from the original (PDF) on 2018-01-18.
  7. Breiman L (2001). "Random Forests". Machine Learning. 45 (1): 5–32. Bibcode:2001MachL..45....5B. doi:10.1023/A:1010933404324.
  8. Liaw A (16 October 2012). "Documentation for R package randomForest" (PDF). Retrieved 15 March 2013.
  9. U.S. trademark registration number 3185828, registered 2006/12/19.
  10. "RANDOM FORESTS Trademark of Health Care Productivity, Inc. - Registration Number 3185828 - Serial Number 78642027 :: Justia Trademarks".
  11. Amit Y, Geman D (1997). "Shape quantization and recognition with randomized trees" (PDF). Neural Computation. 9 (7): 1545–1588. CiteSeerX 10.1.1.57.6069. doi:10.1162/neco.1997.9.7.1545. S2CID 12470146. Archived from the original (PDF) on 2018-02-05. Retrieved 2008-04-01.
  12. Dietterich, Thomas (2000). "An Experimental Comparison of Three Methods for Constructing Ensembles of Decision Trees: Bagging, Boosting, and Randomization". Machine Learning. 40 (2): 139–157. doi:10.1023/A:1007607513941.
  13. Gareth James; Daniela Witten; Trevor Hastie; Robert Tibshirani (2013). An Introduction to Statistical Learning. Springer. pp. 316–321.
  14. Ho, Tin Kam (2002). "A Data Complexity Analysis of Comparative Advantages of Decision Forest Constructors" (PDF). Pattern Analysis and Applications. 5 (2): 102–112. doi:10.1007/s100440200009. S2CID 7415435. Archived from the original (PDF) on 2016-04-17. Retrieved 2015-11-13.
  15. Geurts P, Ernst D, Wehenkel L (2006). "Extremely randomized trees" (PDF). Machine Learning. 63: 3–42. doi:10.1007/s10994-006-6226-1.
  16. Dessi, N.; Milia, G.; Pes, B. (2013). "Enhancing random forests performance in microarray data classification". Conference paper. pp. 99–103. doi:10.1007/978-3-642-38326-7_15.
  17. Ye, Y.; Li, H.; Deng, X.; Huang, J. (2008). "Feature weighting random forest for detection of hidden web search interfaces". Journal of Computational Linguistics and Chinese Language Processing. 13: 387–404.
  18. Amaratunga, D.; Cabrera, J.; Lee, Y.S. (2008). "Enriched Random Forest". Bioinformatics. 24: 2010–2014.
  19. Ghosh D, Cabrera J (2022). "Enriched random forest for high dimensional genomic data". IEEE/ACM Transactions on Computational Biology and Bioinformatics. 19 (5): 2817–2828. doi:10.1109/TCBB.2021.3089417.
  20. Winham, Stacey; Freimuth, Robert; Biernacka, Joanna (2013). "A weighted random forests approach to improve predictive performance". Statistical Analysis and Data Mining. 6. doi:10.1002/sam.11196.
  21. Li, H. B.; Wang, W.; Ding, H. W.; Dong, J. (2010). "Trees weighting random forest method for classifying high-dimensional noisy data". Paper presented at the 2010 IEEE 7th International Conference on E-Business Engineering, 10–12 November 2010.
  22. Zhu R, Zeng D, Kosorok MR (2015). "Reinforcement Learning Trees". Journal of the American Statistical Association. 110 (512): 1770–1784. doi:10.1080/01621459.2015.1036994. PMC 4760114. PMID 26903687.
  23. Deng, H.; Runger, G.; Tuv, E. (2011). "Bias of importance measures for multi-valued attributes and solutions". Proceedings of the 21st International Conference on Artificial Neural Networks (ICANN). pp. 293–300.
  24. Altmann A, Toloşi L, Sander O, Lengauer T (May 2010). "Permutation importance: a corrected feature importance measure". Bioinformatics. 26 (10): 1340–7. doi:10.1093/bioinformatics/btq134. PMID 20385727.
  25. Piryonesi S. Madeh; El-Diraby Tamer E. (2020-06-01). "Role of Data Analytics in Infrastructure Asset Management: Overcoming Data Size and Quality Problems". Journal of Transportation Engineering, Part B: Pavements. 146 (2): 04020022. doi:10.1061/JPEODX.0000175. S2CID 216485629.
  26. Strobl C, Boulesteix A, Augustin T (2007). "Unbiased split selection for classification trees based on the Gini index" (PDF). Computational Statistics & Data Analysis. 52: 483–501. CiteSeerX 10.1.1.525.3178. doi:10.1016/j.csda.2006.12.030.
  27. Painsky A, Rosset S (2017). "Cross-Validated Variable Selection in Tree-Based Methods Improves Predictive Performance". IEEE Transactions on Pattern Analysis and Machine Intelligence. 39 (11): 2142–2153. arXiv:1512.03444. doi:10.1109/tpami.2016.2636831. PMID 28114007. S2CID 5381516.
  28. Tolosi L, Lengauer T (July 2011). "Classification with correlated features: unreliability of feature ranking and solutions". Bioinformatics. 27 (14): 1986–94. doi:10.1093/bioinformatics/btr300. PMID 21576180.
  29. "Beware Default Random Forest Importances". explained.ai. Retrieved 2023-10-25.
  30. Breiman, Leo (2017-10-25). Classification and Regression Trees. New York: Routledge. doi:10.1201/9781315139470. ISBN 978-1-315-13947-0.
  31. Ortiz-Posadas, Martha Refugio (2020-02-29). Pattern Recognition Techniques Applied to Biomedical Problems. Springer Nature. ISBN 978-3-030-38021-2.
  32. Scikit-learn documentation. https://scikit-learn.org/stable/auto_examples/inspection/plot_permutation_importance.html. Retrieved 31 August 2023.
  33. Lin, Yi; Jeon, Yongho (2002). Random forests and adaptive nearest neighbors (Technical report). Technical Report No. 1055. University of Wisconsin. CiteSeerX 10.1.1.153.9168.
  34. Shi, T.; Horvath, S. (2006). "Unsupervised Learning with Random Forest Predictors". Journal of Computational and Graphical Statistics. 15 (1): 118–138. CiteSeerX 10.1.1.698.2365. doi:10.1198/106186006X94072. JSTOR 27594168. S2CID 245216.
  35. Shi T, Seligson D, Belldegrun AS, Palotie A, Horvath S (April 2005). "Tumor classification by tissue microarray profiling: random forest clustering applied to renal cell carcinoma". Modern Pathology. 18 (4): 547–57. doi:10.1038/modpathol.3800322. PMID 15529185.
  36. Piryonesi, S. Madeh; El-Diraby, Tamer E. (2021-02-01). "Using Machine Learning to Examine Impact of Type of Performance Indicator on Flexible Pavement Deterioration Modeling". Journal of Infrastructure Systems. 27 (2): 04021005. doi:10.1061/(ASCE)IS.1943-555X.0000602. ISSN 1076-0342. S2CID 233550030.
  37. Prinzie, A.; Van den Poel, D. (2008). "Random Forests for multiclass classification: Random MultiNomial Logit". Expert Systems with Applications. 34 (3): 1721–1732. doi:10.1016/j.eswa.2007.01.029.
  38. Prinzie, Anita (2007). "Random Multiclass Classification: Generalizing Random Forests to Random MNL and Random NB". In Roland Wagner; Norman Revell; Günther Pernul (eds.). Database and Expert Systems Applications: 18th International Conference, DEXA 2007, Regensburg, Germany, September 3–7, 2007, Proceedings. Lecture Notes in Computer Science. Vol. 4653. pp. 349–358. doi:10.1007/978-3-540-74469-6_35. ISBN 978-3-540-74467-2.
  39. Smith, Paul F.; Ganesh, Siva; Liu, Ping (2013-10-01). "A comparison of random forest regression and multiple linear regression for prediction in neuroscience". Journal of Neuroscience Methods. 220 (1): 85–91. doi:10.1016/j.jneumeth.2013.08.024. PMID 24012917. S2CID 13195700.
  40. Scornet, Erwan (2015). "Random forests and kernel methods". arXiv:1502.03836 [math.ST].
  41. Breiman, Leo (2000). "Some infinity theory for predictor ensembles". Technical Report 579, Statistics Dept. UCB.
  42. Lin, Yi; Jeon, Yongho (2006). "Random forests and adaptive nearest neighbors". Journal of the American Statistical Association. 101 (474): 578–590. CiteSeerX 10.1.1.153.9168. doi:10.1198/016214505000001230. S2CID 2469856.
  43. Davies, Alex; Ghahramani, Zoubin (2014). "The Random Forest Kernel and other kernels for big data from random partitions". arXiv:1402.4293 [stat.ML].
  44. Breiman L, Ghahramani Z (2004). "Consistency for a simple model of random forests". Statistical Department, University of California at Berkeley. Technical Report (670). CiteSeerX 10.1.1.618.90.
  45. Arlot S, Genuer R (2014). "Analysis of purely random forests bias". arXiv:1407.3939 [math.ST].
  46. Sagi, Omer; Rokach, Lior (2020). "Explainable decision forest: Transforming a decision forest into an interpretable tree". Information Fusion. 61: 124–138. doi:10.1016/j.inffus.2020.03.013. S2CID 216444882.
  47. Vidal, Thibaut; Schiffer, Maximilian (2020). "Born-Again Tree Ensembles". International Conference on Machine Learning. PMLR. 119: 9743–9753. arXiv:2003.11132.
  48. Piryonesi, Sayed Madeh (November 2019). The Application of Data Analytics to Asset Management: Deterioration and Climate Change Adaptation in Ontario Roads (Doctoral dissertation).

Further reading