Random Forests
Random forests were introduced by Leo Breiman (his seminal paper, "Random Forests," appeared in Machine Learning in 2001, following a 1999 technical report). A random forest is similar to a classifier created by bagging
fully-grown CART decision trees, but an additional source of variation, beyond the use of different bootstrap samples
of the training data, is introduced when creating the ensemble of trees (the forest) which vote to make the predictions.
As each tree of the forest is grown, not every predictor is considered each time a split is made. Rather, each time
a split is to be made, a randomly selected subset of the predictors is considered, and the best possible split which
can be created using one of the randomly selected predictors is used. (If there are p predictors in all, one
can round the square root of p to an integer and specify that this number of predictors be considered for every
split. A second and a third random forest classifier can then be created using about half this number, and about
twice this number, of predictors in the randomly chosen subsets considered each time a split is made. Finally, one
can use whichever of these three random forest classifiers has the smallest estimated misclassification rate.)
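As a concrete sketch of this strategy (not the Salford Systems tool itself), here is how one might compare the three choices in Python with scikit-learn; the data set and all parameter values below are made up for illustration:

    import math
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier

    # Synthetic stand-in data: 1000 cases, p = 25 predictors.
    X, y = make_classification(n_samples=1000, n_features=25, n_informative=10,
                               random_state=0)
    p = X.shape[1]
    base = round(math.sqrt(p))  # round(sqrt(25)) = 5 predictors per split

    # Try sqrt(p), about half of it, and about twice it, and keep the forest
    # with the best out-of-bag estimate of accuracy.
    for m in (max(1, base // 2), base, min(p, 2 * base)):
        rf = RandomForestClassifier(n_estimators=500, max_features=m,
                                    oob_score=True, random_state=0)
        rf.fit(X, y)
        print("m =", m, " OOB accuracy =", round(rf.oob_score_, 3))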
The manual describes how RandomForests can be used to cluster (see p. 13) and how it can be used to identify
outliers (see pp. 14-15). However, due to lack of time, I'm not going to cover those applications in this class.
why it (often) works
If different variables are used to create the initial split in a CART classification tree, then the rest of
the tree will be grown differently --- perhaps quite differently. (It's good to keep in mind that
the initial split chosen by the
greedy CART algorithm may not be the one which will lead to the best tree in the end.)
With a random forest, many differences are encouraged because the trees are grown from different bootstrap samples,
and because randomness is involved in every splitting decision.
Undoubtedly, some trees in the forest are better classifiers than others, and it may well be the case that most of
the trees are worse classifiers than a tree grown from the original training data with no randomness in the
splitting, making instead the splits which appear best when all of the predictors are considered for every split.
However, it turns out that the prediction rule that results from having every tree in the forest vote is often
appreciably better than that of the single CART tree grown using the original training data and no randomness.
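A hedged illustration of this claim (scikit-learn and synthetic data standing in for CART and a real training set; nothing here comes from the Salford products):

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    X, y = make_classification(n_samples=2000, n_features=20, n_informative=8,
                               random_state=1)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=1)

    # One fully-grown deterministic tree vs. a voting forest of 500 trees.
    tree = DecisionTreeClassifier(random_state=1).fit(X_tr, y_tr)
    forest = RandomForestClassifier(n_estimators=500, random_state=1).fit(X_tr, y_tr)

    print("single tree test accuracy:", tree.score(X_te, y_te))
    print("forest test accuracy:     ", forest.score(X_te, y_te))

On most runs of this sort, the forest's test accuracy comes out noticeably higher than the single tree's.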
When a CART tree is created from a data set, there should always be the fear that some "patterns" which appear
to be important for prediction (due to the specific observations in the data set) aren't really important. Such
misleading patterns may have been discovered only due to a certain combination of weak split decisions coming out a
certain way --- if new data from the same source could be used to grow another tree, it may be that these weak
decisions don't play out the way that they did originally, and these misleading patterns may not appear in the final
tree. (With slight differences in the data, and different decisions made to partition the data in the early stages
of the tree-growing process, the somewhat freak occurrence of events that originally led to the misleading patterns
may be avoided.) When a large number of trees are grown to create the random forest, with each one being grown
sensibly, though not necessarily optimally, the hope is that the really important patterns will show up in a large
number of the trees, and freak misleading patterns won't appear a large number of times.
When these trees vote, the important patterns which occur repeatedly will prevail, and the freakish misleading
patterns will be
washed out by the stronger truths.
Also, because of the randomness in the forest growing, some trees may be really good over certain subsets of input
values, and weak, but not terribly misleading, over other subsets of values, while other trees may perform really
well on different subsets of values.
When a large number of such trees are used jointly to make predictions, it is often the case that for most of the
values of x in the (predictor) measurement space, there are more trees which rather strongly support
the best prediction than there are trees which rather strongly support a nonoptimal prediction. Even if most
trees don't strongly favor any decision for a particular set of input values,
if a weighted voting is done, the optimal prediction will tend to win.
(Note: The Salford Systems literature does not appear to provide the details of how the weighted voting is
done, but I think that it is reasonable to guess that if a tree only slightly favors class A over class
B, then its vote for A won't count as much as a vote for B from a tree which rather strongly
favors class B over class A.) In the end, success will be achieved as
long as not too many trees strongly indicate a bad prediction, and CART trees are generally
fairly good about this --- when they favor a nonoptimal decision, it is often the case that it is a close call that
went the wrong way, as opposed to being a strong indication of support for the bad decision.
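One plausible reading of this kind of weighted voting is "soft" voting, in which each tree contributes its estimated class probabilities rather than a single hard vote; scikit-learn's random forest averages per-tree probabilities in essentially this way. A sketch (synthetic data, scikit-learn names):

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier

    X, y = make_classification(n_samples=500, n_features=10, random_state=2)
    rf = RandomForestClassifier(n_estimators=100, random_state=2).fit(X, y)

    # Each tree reports class probabilities at x, so a tree that only slightly
    # favors class A casts a weaker vote than one that strongly favors it.
    per_tree = np.stack([t.predict_proba(X[:5]) for t in rf.estimators_])
    avg = per_tree.mean(axis=0)       # average over the 100 trees
    print(avg.argmax(axis=1))         # class chosen by the averaged "votes"
    print(rf.predict(X[:5]))          # the forest's own prediction agrees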
Too many predictors of little value (perhaps just "noise" variables) can work against the random forest approach,
since some of the subsets of variables selected for a split may contain no good variables. However, if there
is a lot of data, so that the trees are grown deep, then some weak (perhaps rather meaningless) splits may not harm
things too much, particularly if the forest is grown large, because the votes from a large number of somewhat weak
trees can still wind up resulting in the optimal prediction with high probability.
(Note: While noise variables can be harmful to random forests, they are also harmful to many other classification
methods,
such as LDA, QDA, and nearest neighbors. In general, CART, MARS, and RandomForests tend to work
relatively well when there are a lot of variables, even if many of the variables have little predictive power.
This is because they selectively add good variables in a stepwise manner, unlike methods like LDA, QDA, and nearest
neighbors, which tend to use all of the variables equally. Even amongst stepwise procedures, the Salford Systems
products can have an advantage due to their ability to deal with a nonhomogeneous response. Methods which use global
fits can be hurt when a predictor is only strong over certain parts of the measurement space.)
Still, it may be better to try to weed out some low-power predictors using variable importance results from
CART or a previous RandomForests run (or some other classification method).
using the software
I'm not going to attempt to cover everything here. I advise that you skim through the manual on your own, and
identify which parts of it should be read more carefully.
Below, all page numbers refer to the manual that comes with the software.
reading in data
- As with CART and MARS, there are two modes for reading in data.
- With a .csv ASCII file, use Mode 1 (DBMS/Copy off). (Spaces, tabs, or semicolons can be used as the
delimiter instead of commas, but use the same delimiter throughout. A loose Python analog appears after this list.)
- With an Excel, SAS, SPSS, etc. file, use Mode 2 (DBMS/Copy on).
- P. 20 gives some dos and don'ts. (P. 22 gives some examples.)
- Variable names must start with a letter.
- Some special symbols aren't allowed.
- With an ASCII file, names for categorical variables must end with $. (See p. 21 for more information.)
- P. 25 gives information about reading Excel files. Below are a few highlights.
- Excel files being read must not be currently open in Excel.
- The first row must contain legal variable names.
- Missing values must be represented as blank cells --- don't use a symbol.
- Use cut-and-paste techniques to replace all formulas in the spreadsheet with actual values.
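A loose analog of these points in Python with pandas (not the Salford reader; the file names below are made up): one consistent delimiter per ASCII file, variable names in the first row, and blank cells read as missing values.

    import pandas as pd

    # Mode-1-style ASCII input: one consistent delimiter throughout the file.
    df_csv = pd.read_csv("train.csv")                  # comma-delimited
    df_semi = pd.read_csv("train_semi.txt", sep=";")   # semicolon-delimited

    # Mode-2-style binary input: an Excel sheet whose first row holds legal
    # variable names; blank cells come in as NaN (requires openpyxl for .xlsx).
    df_xls = pd.read_excel("train.xlsx")

    print(df_csv.head())
    print(df_csv.isna().sum())        # count of missing values per variable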
setting up to build classifiers
class weights (p. 30)
The default is to boost up smaller classes to make them as important as larger ones.
Left at the default, a case will be classified as A, as opposed to B, if the RF model determines that
at the given x, A has a higher likelihood of occurring than it does when not conditioned on x, even though
B may be more likely to occur than A at x.
While this may be what you want, I would guess that in a lot of cases it isn't.
If all types of misclassifications are penalized equally, then you want to predict a case at x to be
B instead of A if B is more likely to occur at x than A is.
To make this happen, you should go to the Class Wts box and click on UNIT.
(Note:
The terminology is perhaps confusing, but UNIT weights means that no weighting occurs, so the actual class
counts are used in a straightforward manner to arrive at the predictions. BALANCED reweights the cases.
One purpose of doing this is to identify "hot spots" where a certain class occurs with a greater probability than it
does in the general population (though this is really only guaranteed to hold if there are only two classes). It
could also be used if it is thought that future cases to be classified are equally likely to be of each of the
possible classes, but the classes aren't equally represented in the training sample. But I would think that expecting
cases to be equally likely with regard to the possible classes is rare, and that in most situations where the
training data doesn't reflect the classes properly, weights should be adjusted using SPECIFY (see the top of p. 45).)
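A hedged scikit-learn analog of the UNIT vs. BALANCED distinction (the option names below are scikit-learn's, and correspond only roughly to Salford's):

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier

    # An imbalanced two-class problem: roughly 90% class 0, 10% class 1.
    X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=3)

    # "UNIT"-like: no reweighting; raw class counts drive the votes.
    rf_unit = RandomForestClassifier(n_estimators=300, class_weight=None,
                                     random_state=3).fit(X, y)

    # "BALANCED"-like: boost the smaller class to equal importance, which tends
    # to flag "hot spots" of the rare class at the cost of more false alarms.
    rf_bal = RandomForestClassifier(n_estimators=300, class_weight="balanced",
                                    random_state=3).fit(X, y)

    print("fraction predicted as the rare class, UNIT-like:    ",
          (rf_unit.predict(X) == 1).mean())
    print("fraction predicted as the rare class, BALANCED-like:",
          (rf_bal.predict(X) == 1).mean())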
estimating error rates (pp. 30-31)
The manual doesn't make it very clear, but the testing referred to is for the determination of error rates.
(The trees used are fully grown --- no testing is done in the tree-growing process.)
The default is out-of-bag testing/estimation, which should be fine. (When a bootstrap sample is drawn, the
probability that a particular case isn't included is about 0.368. So if 500 trees are grown, the case is expected
to not be used in growing about 184 of them. The trees not involving a particular case are used to vote on its
class, and a comparison of the predicted class and the actual class can be used to obtain an estimate of the accuracy
of the random forest.)
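The arithmetic, and an out-of-bag estimate in scikit-learn (synthetic data; the 0.368 comes from (1 - 1/n)^n approaching e^(-1)):

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier

    n = 1000
    p_out = (1 - 1 / n) ** n      # ~ e^(-1) ~ 0.368: chance a case misses a bootstrap sample
    print(p_out, 500 * p_out)     # each case is out-of-bag for ~184 of 500 trees

    X, y = make_classification(n_samples=n, random_state=4)
    rf = RandomForestClassifier(n_estimators=500, oob_score=True,
                                random_state=4).fit(X, y)
    print("out-of-bag accuracy estimate:", rf.oob_score_)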
setting some important controls (pp. 31-32)
In the Rnd Forests box you should set the number of predictors to be considered for each split --- the
strategy based on the square root of the number of predictors (see above) is a decent starting point.
You can leave the others set on the defaults, except if you have a rather large data file (of training data) you may
want to decrease the number of trees in the forest from 500 to perhaps 100 or 200.
looking at the results and saving the classifier produced
There isn't a lot to look at. But one can look at the estimated error rates in several places, and one can look to
see whether or not it appears that convergence has occurred. P. 35 describes the alternative method
of evaluating variable importance (oddly referred to as the standard method), which seems pretty nifty. (To
assess a variable's importance, its values are scrambled, and then the accuracy of the forest is estimated using the
out-of-bag method. The variable's importance can be measured by how much the scrambling of the values affects the
accuracy of the forest.)
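This scramble-and-remeasure idea is now commonly called permutation importance. A sketch with scikit-learn (its helper scores on held-out data rather than out-of-bag cases, so this is only an analog of what the Salford software does):

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.inspection import permutation_importance
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=1000, n_features=10, n_informative=4,
                               random_state=5)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=5)
    rf = RandomForestClassifier(n_estimators=300, random_state=5).fit(X_tr, y_tr)

    # For each variable: scramble its column, re-score the forest, and record
    # the drop in accuracy; large drops mark important variables.
    result = permutation_importance(rf, X_te, y_te, n_repeats=10, random_state=5)
    for i in result.importances_mean.argsort()[::-1]:
        print("feature", i, "mean accuracy drop:", round(result.importances_mean[i], 3))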
To save a random forest classifier (to make predictions with), click on the Save
Grove button, and provide a name and location for the Grove file.
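The Grove file is Salford's own format. A rough Python analog for saving a fitted forest to make predictions with later is joblib serialization (the file name below is made up):

    import joblib
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier

    X, y = make_classification(random_state=6)
    rf = RandomForestClassifier(random_state=6).fit(X, y)

    joblib.dump(rf, "forest.joblib")           # save the fitted classifier
    rf_loaded = joblib.load("forest.joblib")   # reload it later
    print(rf_loaded.predict(X[:3]))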