Random Forests
Random forests were introduced by Leo Breiman (his seminal paper, "Random Forests," appeared in Machine Learning in 2001, following a 1999 technical report). A random forest is similar to a classifier created by bagging
fully-grown CART decision trees, but an additional source of variation, beyond the use of different bootstrap samples
of the training data, is introduced when creating the ensemble of trees (the forest) which vote to make the predictions.
As each tree of the forest is grown, not every predictor is considered each time a split is made. Rather, each time
a split is to be made, a randomly selected subset of the predictors is considered, and the best possible split which
can be created using one of the randomly selected predictors is used. (If there are p predictors in all, one
can round the square root of p to an integer and specify that this number of predictors be considered for every
split. A second and a third random forest classifier can then be created using about half this number, and about
twice this number, of predictors in the randomly chosen subsets considered each time a split is made. Finally, one
can use whichever of these three random forest classifiers has the smallest estimated misclassification rate.)
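As a concrete sketch of this strategy (not the Salford Systems tool itself), here is how one might compare the three choices in Python with scikit-learn; the data set and all parameter values below are made up for illustration:

    import math
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier

    # Synthetic stand-in data: 1000 cases, p = 25 predictors.
    X, y = make_classification(n_samples=1000, n_features=25, n_informative=10,
                               random_state=0)
    p = X.shape[1]
    base = round(math.sqrt(p))  # round(sqrt(25)) = 5 predictors per split

    # Try sqrt(p), about half of it, and about twice it, and keep the forest
    # with the best out-of-bag estimate of accuracy.
    for m in (max(1, base // 2), base, min(p, 2 * base)):
        rf = RandomForestClassifier(n_estimators=500, max_features=m,
                                    oob_score=True, random_state=0)
        rf.fit(X, y)
        print("m =", m, " OOB accuracy =", round(rf.oob_score_, 3))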
The manual describes how RandomForests can be used to cluster (see p. 13) and how it can be used to identify
outliers (see pp. 14-15). However, due to lack of time, I'm not going to cover those applications in this class.
why it (often) works
If different variables are used to create the initial split in a CART classification tree, then the rest of
the tree will be grown differently --- perhaps quite differently. (It's good to keep in mind that
the initial split chosen by the
greedy CART algorithm may not be the one which will lead to the best tree in the end.)
With a random forest, many differences are encouraged because the trees are grown from different bootstrap samples,
and because randomness is involved in every splitting decision.
Undoubtedly, some trees in the forest are better classifiers than others, and it may well be the case that most of
the trees are worse classifiers than a tree grown from the original training data with no randomness in the
splitting, making instead the splits which appear best when all of the predictors are considered for every split.
However, it turns out that the prediction rule that results from having every tree in the forest vote is often
appreciably better than that of the single CART tree grown using the original training data and no randomness.
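A hedged illustration of this claim (scikit-learn and synthetic data standing in for CART and a real training set; nothing here comes from the Salford products):

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    X, y = make_classification(n_samples=2000, n_features=20, n_informative=8,
                               random_state=1)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=1)

    # One fully-grown deterministic tree vs. a voting forest of 500 trees.
    tree = DecisionTreeClassifier(random_state=1).fit(X_tr, y_tr)
    forest = RandomForestClassifier(n_estimators=500, random_state=1).fit(X_tr, y_tr)

    print("single tree test accuracy:", tree.score(X_te, y_te))
    print("forest test accuracy:     ", forest.score(X_te, y_te))

On most runs of this sort, the forest's test accuracy comes out noticeably higher than the single tree's.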
When a CART tree is created from a data set, there should always be the fear that some "patterns" which appear
to be important for prediction (due to the specific observations in the data set) aren't really important. Such
misleading patterns may have been discovered only due to a certain combination of weak split decisions coming out a
certain way --- if new data from the same source could be used to grow another tree, it may be that these weak
decisions don't play out the way that they did originally, and these misleading patterns may not appear in the final
tree. (With slight differences in the data, and different decisions made to partition the data in the early stages
of the tree-growing process, the somewhat freak occurrence of events that originally led to the misleading patterns
may be avoided.) When a large number of trees are grown to create the random forest, with each one being grown
sensibly, though not necessarily optimally, the hope is that the really important patterns will show up in a large
number of the trees, and freak misleading patterns won't appear a large number of times.
When these trees vote, the important patterns which occur repeatedly will prevail, and the freakish misleading
patterns will be
washed out by the stronger truths.
Also, because of the randomness in the forest growing, some trees may be really good over certain subsets of input
values, and weak, but not terribly misleading, over other subsets of values, while other trees may perform really
well on different subsets of values.
When a large number of such trees are used jointly to make predictions, it is often the case that for most of the
values of x in the (predictor) measurement space, there are more trees which rather strongly support
the best prediction than there are trees which rather strongly support a nonoptimal prediction. Even if most
trees don't strongly favor any decision for a particular set of input values,
if a weighted voting is done, the optimal prediction will tend to win.
(Note: The Salford Systems literature does not appear to provide the details of how the weighted voting is
done, but I think that it is reasonable to guess that if a tree only slightly favors class A over class
B, then its vote for A won't count as much as a vote for B from a tree which rather strongly
favors class B over class A.) In the end, success will be achieved as
long as not too many trees strongly indicate a bad prediction, and CART trees are generally
fairly good about this --- when they favor a nonoptimal decision, it is often the case that it is a close call that
went the wrong way, as opposed to being a strong indication of support for the bad decision.
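One plausible reading of this kind of weighted voting is "soft" voting, in which each tree contributes its estimated class probabilities rather than a single hard vote; scikit-learn's random forest averages per-tree probabilities in essentially this way. A sketch (synthetic data, scikit-learn names):

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier

    X, y = make_classification(n_samples=500, n_features=10, random_state=2)
    rf = RandomForestClassifier(n_estimators=100, random_state=2).fit(X, y)

    # Each tree reports class probabilities at x, so a tree that only slightly
    # favors class A casts a weaker vote than one that strongly favors it.
    per_tree = np.stack([t.predict_proba(X[:5]) for t in rf.estimators_])
    avg = per_tree.mean(axis=0)       # average over the 100 trees
    print(avg.argmax(axis=1))         # class chosen by the averaged "votes"
    print(rf.predict(X[:5]))          # the forest's own prediction agrees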
Too many predictors of little value (perhaps just "noise" variables) can work against the random forest approach,
since some of the subsets of variables selected for a split may contain no good variables. However, if there
is a lot of data, so that the trees are grown deep, then some weak (perhaps rather meaningless) splits may not harm
things too much, particularly if the forest is grown large, because the votes from a large number of somewhat weak
trees can still wind up resulting in the optimal prediction with high probability.
(Note: While noise variables can be harmful to random forests, they are also harmful to many other classification
methods,
such as LDA, QDA, and nearest neighbors. In general, CART, MARS, and RandomForests tend to work
relatively well when there are a lot of variables, even if many of the variables have little predictive power.
This is because they selectively add good variables in a stepwise manner, unlike methods like LDA, QDA, and nearest
neighbors, which tend to use all of the variables equally. Even amongst stepwise procedures, the Salford Systems
products can have an advantage due to their ability to deal with a nonhomogeneous response. Methods which use global
fits can be hurt when a predictor is only strong over certain parts of the measurement space.)
Still, it may be better to try to weed out some low-power predictors using variable importance results from
CART or a previous RandomForests run (or some other classification method).
using the software
I'm not going to attempt to cover everything here. I advise that you skim through the manual on your own, and
identify which parts of it should be read more carefully.
Below, all page numbers refer to the manual that comes with the software.
reading in data
- As with CART and MARS, there are two modes for reading in data.
- With a .csv ASCII file, use Mode 1 (DBMS/Copy off). (Spaces, tabs, or semicolons can be used as the
delimiter instead of commas, but use the same delimiter throughout. A loose Python analog appears after this list.)
- With an Excel, SAS, SPSS, etc. file, use Mode 2 (DBMS/Copy on).
- P. 20 gives some dos and don'ts. (P. 22 gives some examples.)
- Variable names must start with a letter.
- Some special symbols aren't allowed.
- With an ASCII file, names for categorical variables must end with $. (See p. 21 for more information.)
- P. 25 gives information about reading Excel files. Below are a few highlights.
- Excel files being read must not be currently open in Excel.
- The first row must contain legal variable names.
- Missing values must be represented as blank cells --- don't use a symbol.
- Use cut-and-paste techniques to replace all formulas in the spreadsheet with actual values.
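A loose analog of these points in Python with pandas (not the Salford reader; the file names below are made up): one consistent delimiter per ASCII file, variable names in the first row, and blank cells read as missing values.

    import pandas as pd

    # Mode-1-style ASCII input: one consistent delimiter throughout the file.
    df_csv = pd.read_csv("train.csv")                  # comma-delimited
    df_semi = pd.read_csv("train_semi.txt", sep=";")   # semicolon-delimited

    # Mode-2-style binary input: an Excel sheet whose first row holds legal
    # variable names; blank cells come in as NaN (requires openpyxl for .xlsx).
    df_xls = pd.read_excel("train.xlsx")

    print(df_csv.head())
    print(df_csv.isna().sum())        # count of missing values per variable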
setting up to build classifiers
class weights (p. 30)
The default is to boost up smaller classes to make them as important as larger ones.
Left at the default, a case will be classified as A, as opposed to B, if the RF model determines that
at the given x, A has a higher likelihood of occurring than it does when not conditioned on x, even though
B may be more likely to occur than A at x.
While this may be what you want, I would guess that in a lot of cases it isn't.
If all types of misclassifications are penalized equally, then you want to predict a case at x to be
B instead of A if B is more likely to occur at x than A is.
To make this happen, you should go to the Class Wts box and click on UNIT.
(Note:
The terminology is perhaps confusing, but UNIT weights means that no weighting occurs, so the actual class
counts are used in a straightforward manner to arrive at the predictions. BALANCED reweights the cases.
One purpose of doing this is to identify "hot spots" where a certain class occurs with a greater probability than it
does in the general population (though this is really only guaranteed to hold if there are only two classes). It
could also be used if it is thought that future cases to be classified are equally likely to be of each of the
possible classes, but the classes aren't equally represented in the training sample. But I would think that expecting
cases to be equally likely with regard to the possible classes is rare, and that in most situations where the
training data doesn't reflect the classes properly, weights should be adjusted using SPECIFY (see the top of p. 45).)
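A hedged scikit-learn analog of the UNIT vs. BALANCED distinction (the option names below are scikit-learn's, and correspond only roughly to Salford's):

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier

    # An imbalanced two-class problem: roughly 90% class 0, 10% class 1.
    X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=3)

    # "UNIT"-like: no reweighting; raw class counts drive the votes.
    rf_unit = RandomForestClassifier(n_estimators=300, class_weight=None,
                                     random_state=3).fit(X, y)

    # "BALANCED"-like: boost the smaller class to equal importance, which tends
    # to flag "hot spots" of the rare class at the cost of more false alarms.
    rf_bal = RandomForestClassifier(n_estimators=300, class_weight="balanced",
                                    random_state=3).fit(X, y)

    print("fraction predicted as the rare class, UNIT-like:    ",
          (rf_unit.predict(X) == 1).mean())
    print("fraction predicted as the rare class, BALANCED-like:",
          (rf_bal.predict(X) == 1).mean())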
estimating error rates (pp. 30-31)
The manual doesn't make it very clear, but the testing referred to is for the determination of error rates.
(The trees used are fully grown --- no testing is done in the tree-growing process.)
The default is out-of-bag testing/estimation, which should be fine. (When a bootstrap sample is drawn, the
probability that a particular case isn't included is about 0.368. So if 500 trees are grown, the case is expected
to not be used in growing about 184 of them. The trees not involving a particular case are used to vote on its
class, and a comparison of the predicted class and the actual class can be used to obtain an estimate of the accuracy
of the random forest.)
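The arithmetic, and an out-of-bag estimate in scikit-learn (synthetic data; the 0.368 comes from (1 - 1/n)^n approaching e^(-1)):

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier

    n = 1000
    p_out = (1 - 1 / n) ** n      # ~ e^(-1) ~ 0.368: chance a case misses a bootstrap sample
    print(p_out, 500 * p_out)     # each case is out-of-bag for ~184 of 500 trees

    X, y = make_classification(n_samples=n, random_state=4)
    rf = RandomForestClassifier(n_estimators=500, oob_score=True,
                                random_state=4).fit(X, y)
    print("out-of-bag accuracy estimate:", rf.oob_score_)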
setting some important controls (pp. 31-32)
In the Rnd Forests box you should set the number of predictors to be considered for each split --- the
strategy based on the square root of the number of predictors (see above) is a decent starting point.
You can leave the others set on the defaults, except if you have a rather large data file (of training data) you may
want to decrease the number of trees in the forest from 500 to perhaps 100 or 200.
looking at the results and saving the classifier produced
There isn't a lot to look at. But one can look at the estimated error rates in several places, and one can look to
see whether or not it appears that convergence has occurred. P. 35 describes the alternative method
of evaluating variable importance (oddly referred to as the standard method), which seems pretty nifty. (To
assess a variable's importance, its values are scrambled, and then the accuracy of the forest is estimated using the
out-of-bag method. The variable's importance can be measured by how much the scrambling of the values affects the
accuracy of the forest.)
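This scramble-and-remeasure idea is now commonly called permutation importance. A sketch with scikit-learn (its helper scores on held-out data rather than out-of-bag cases, so this is only an analog of what the Salford software does):

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.inspection import permutation_importance
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=1000, n_features=10, n_informative=4,
                               random_state=5)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=5)
    rf = RandomForestClassifier(n_estimators=300, random_state=5).fit(X_tr, y_tr)

    # For each variable: scramble its column, re-score the forest, and record
    # the drop in accuracy; large drops mark important variables.
    result = permutation_importance(rf, X_te, y_te, n_repeats=10, random_state=5)
    for i in result.importances_mean.argsort()[::-1]:
        print("feature", i, "mean accuracy drop:", round(result.importances_mean[i], 3))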
To save a random forest classifier (to make predictions with), click on the Save
Grove button, and provide a name and location for the Grove file.
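The Grove file is Salford's own format. A rough Python analog for saving a fitted forest to make predictions with later is joblib serialization (the file name below is made up):

    import joblib
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier

    X, y = make_classification(random_state=6)
    rf = RandomForestClassifier(random_state=6).fit(X, y)

    joblib.dump(rf, "forest.joblib")           # save the fitted classifier
    rf_loaded = joblib.load("forest.joblib")   # reload it later
    print(rf_loaded.predict(X[:3]))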