Bagging



Bootstrap aggregation, or bagging, is a method for improving the accuracy of predictions in classification and regression settings which was proposed by Leo Breiman in the mid 1990s. It is one of several perturb and combine / ensemble / committee of experts methods which have become popular, because of their effectiveness, over the past 15 years. With these methods, a classification or regression method is applied to various perturbations of the original data set, and the results are combined to obtain a single classifier or regression model. While an appreciable increase in accuracy is often achieved, the downside is that the final prediction rule is typically so complex that it is very hard to gain a good understanding of how the explanatory variables relate to the response.

With bagging, the perturbations of the data are obtained by drawing many bootstrap samples from the original data (or one can use subbagging (aka half-bagging) and draw samples of size n/2 without replacement). A specific method for classification or regression is applied to each of the randomly drawn data sets. Then the results are combined, by averaging for regression and simple voting for classification, to obtain the final prediction rule. The process of combining the results typically decreases the variability (compared to a single application of the chosen prediction method on the original data) without increasing bias, and this leads to improved prediction accuracy.
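To make the recipe concrete, here is a minimal sketch in Python (an added illustration, not part of the original discussion); it assumes scikit-learn is available, uses made-up toy data, and takes trees as the base method. Note that scikit-learn's BaggingClassifier combines the member trees by averaging their estimated class probabilities, which behaves like voting.

import numpy as np
from sklearn.tree import DecisionTreeRegressor, DecisionTreeClassifier
from sklearn.ensemble import BaggingRegressor, BaggingClassifier

rng = np.random.RandomState(0)
X = rng.uniform(-2, 2, size=(200, 3))
y_num = X[:, 0] ** 2 + X[:, 1] + rng.normal(scale=0.5, size=200)   # numerical response
y_cls = (X[:, 0] + X[:, 1] > 0).astype(int)                        # class labels (0/1)

# Regression: average the predictions of 100 trees, each fit to a bootstrap sample.
bag_reg = BaggingRegressor(DecisionTreeRegressor(), n_estimators=100,
                           bootstrap=True, random_state=0).fit(X, y_num)

# Classification: combine 100 trees fit to bootstrap samples.
bag_clf = BaggingClassifier(DecisionTreeClassifier(), n_estimators=100,
                            bootstrap=True, random_state=0).fit(X, y_cls)

# Subbagging ("half-bagging"): samples of size n/2 drawn without replacement.
subbag_reg = BaggingRegressor(DecisionTreeRegressor(), n_estimators=100,
                              max_samples=0.5, bootstrap=False, random_state=0).fit(X, y_num)

print(bag_reg.predict(X[:3]), bag_clf.predict(X[:3]), subbag_reg.predict(X[:3]))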

While bagging is commonly applied to classification and regression trees, it can be used with many other methods of classification and regression. It is a general technique, and is not a specific method for classification and regression trees --- to say that one did classification by bagging is not adequate, since it should be stated what was bagged (e.g., bagging fully-grown classification trees obtained from CART). The Salford Systems literature refers to Breiman's "Bagger" as a classifier (it seems), but I think this is nonstandard. (My guess is that the Bagger refers to bagging fully-grown classification trees obtained from CART, but I don't believe that this use of Bagger is widespread.)


regression

I think that it is easier to understand how bagging works for regression than to understand why it works for classification. (Note: Bagging doesn't always lead to improvement, but it typically does. It seldom hurts accuracy, and when it does, the harm tends to be slight.)

Let Y(x) denote the value of the numerical response variable which will be observed at a given set of input values, x, and let φ(x) be the prediction of Y(x) which results from applying a specific regression method to a training sample to obtain a prediction rule. (The specific regression method might be using CART to obtain a regression tree using a certain selection criterion, or it could be using a specific stepwise regression method to select an additive regression model, fit using least squares, using only 1st-order and 2nd-order polynomial terms.) Viewing φ(x) as a random variable (a function of the training sample, with the training sample regarded as random and x regarded as a fixed set of input values), let μ(x) denote the (unknown) mean of φ(x). It can be shown (see p. 318 of my book chapter) that
E( [ Y(x) - φ(x) ]^2 ) ≥ E( [ Y(x) - μ(x) ]^2 ),
with the inequality being strict in the typical case in which Var(φ(x)) > 0. This gives us that if
μ(x) = E(φ(x))
could be used as a predictor, it would have a smaller MSPE than does φ(x).
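One way to see why the inequality holds (an added sketch of the standard argument, under the usual assumption that the training sample, and hence φ(x), is independent of the future response Y(x) being predicted):

\begin{align*}
E\!\left([Y(x) - \varphi(x)]^2\right)
  &= E\!\left([Y(x) - \mu(x)]^2\right)
     - 2\,E\!\left[(Y(x) - \mu(x))(\varphi(x) - \mu(x))\right]
     + \operatorname{Var}(\varphi(x)) \\
  &= E\!\left([Y(x) - \mu(x)]^2\right) + \operatorname{Var}(\varphi(x))
  \;\ge\; E\!\left([Y(x) - \mu(x)]^2\right),
\end{align*}

where the cross term vanishes because φ(x) is independent of Y(x) and has mean μ(x), and the inequality is strict whenever Var(φ(x)) > 0.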

Of course, we typically don't know the joint distribution of all of the variables (including the response variable), and so it is not realistic to think that μ(x) can be determined exactly. If we did know the joint distribution, then even though an exact determination of μ(x) might be too difficult, we could obtain a Monte Carlo estimate --- we could just generate many new samples of the same size as the training sample, apply the regression method to each generated sample, and average all of the values of φ(x) obtained to estimate its mean. The strong law of large numbers gives us that for a large enough number of Monte Carlo trials, the estimated mean will be close to the true mean with high probability.
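As a toy illustration of the Monte Carlo idea (an assumed setup, not from the original discussion): pretend for a moment that the joint distribution is known, so that new training samples can be generated at will. The Python sketch below takes scikit-learn regression trees as the chosen regression method and approximates μ(x) at a single fixed x.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(1)
n = 100                       # training-sample size
x0 = np.array([[0.5]])        # the fixed input value x at which mu(x) is wanted

def draw_training_sample():
    # The "known" joint distribution: X uniform on (-1, 1), Y = X^2 plus normal noise.
    X = rng.uniform(-1, 1, size=(n, 1))
    y = X[:, 0] ** 2 + rng.normal(scale=0.3, size=n)
    return X, y

preds = []
for _ in range(2000):                            # many Monte Carlo trials
    X, y = draw_training_sample()                # a fresh sample from the true distribution
    phi = DecisionTreeRegressor().fit(X, y)      # the chosen regression method
    preds.append(phi.predict(x0)[0])             # phi(x) for this training sample

print(np.mean(preds))                            # Monte Carlo estimate of mu(x) = E(phi(x))
print(np.var(preds))                             # and the variability of phi(x) across samples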

In practice, we cannot generate new samples from the true (but unknown) joint distribution, but we can generate new samples from an estimate of the unknown joint distribution --- when we create the bootstrap samples we are sampling from the empirical distribution based on the training sample, and this is an estimate of the true joint distribution. Collectively, these estimates of μ(x), for all x of interest, constitute the bagged prediction rule. As long as the empirical distribution from the training sample is sufficiently representative of the true joint distribution of the response and explanatory variables, the bagged prediction rule should be an improvement over the unbagged prediction rule based on the original training sample alone. Using a greater number of bootstrap samples will result in the sample means being based on a larger number of independent observations, and will lower the variance.
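Continuing the same toy setup (again, just an assumed illustration), the only change needed in practice is that the new samples are bootstrap samples drawn from the single training sample we actually have:

import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(2)
X = rng.uniform(-1, 1, size=(100, 1))            # the one training sample available
y = X[:, 0] ** 2 + rng.normal(scale=0.3, size=100)
n = len(y)

B = 100                                          # number of bootstrap samples
trees = []
for _ in range(B):
    idx = rng.randint(0, n, size=n)              # draw n cases with replacement
    trees.append(DecisionTreeRegressor().fit(X[idx], y[idx]))

def bagged_predict(x_new):
    # Averaging the B predictions is the regression version of "combining the results".
    return np.mean([t.predict(x_new) for t in trees], axis=0)

print(bagged_predict(np.array([[0.5]])))         # the bagged prediction at x = 0.5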

To understand when bagging might lead to appreciable improvements in regression settings, note that on p. 318 of my book chapter, it can be seen that the relative improvement from bagging can be appreciable when the variance of the prediction rule is large relative to the overall MSPE. This suggests that more improvement should be expected with an unstable regression method, such as using CART, than with a stable regression method, like using OLS to fit a regression model when the correct form of the model is known and the error term distribution is nice. (CART is unstable, which leads to high variance, since a slight change in the data can result in close calls in split decisions coming out differently, and if one or more of the initial splits is done using a different variable, the rest of the growth of the tree can be greatly affected.)


classification

For a specific source of data and a specific sample size, a classification method is said to be unbiased at x if, over the random selection of the training sample, the method predicts the most probable class for x with a greater probability than it predicts any other class. A classification method is said to be biased at x otherwise.
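In symbols (an added restatement; the notation below does not appear in the original): writing φ_L(x) for the class predicted at x by the method trained on a random training sample L, and j*(x) for the most probable class at x, the method is unbiased at x when

\[
P\bigl(\varphi_L(x) = j^{*}(x)\bigr) \;>\; P\bigl(\varphi_L(x) = j\bigr)
\quad \text{for every class } j \neq j^{*}(x),
\]

where the probabilities are taken over the random selection of the training sample L.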

If many classifiers are constructed from independent training samples drawn from the joint distribution underlying the response and explanatory variables, and at each x of interest, voting is done to arrive at the predicted class, then if the number of classifiers is large enough, with high probability the voting should result in the most probable class being predicted if the classification method is unbiased at x. (Note: The most probable class is the one predicted by the Bayes classifier.) This means that for the part of the measurement space on which the classification method is unbiased, the voting scheme should well approximate the optimal Bayes classifier. For the part of the measurement space on which the classification method is biased, the voting scheme will with high probability result in a prediction that differs from the prediction of the optimal Bayes classifier. But if there is a high probability that a case to be classified will correspond to the part of the measurement space on which the classification method is unbiased, then the voting scheme should result in an error rate nearly as low as that of the optimal Bayes classifier.

In practice, the voting scheme as described above cannot be done if independent samples cannot be drawn from the actual underlying population/distribution. But if the empirical distribution based on the training sample isn't too different from the actual distribution, then bootstrap samples can play the role of the independent samples from the actual distribution, and the bagged classifier may have an accuracy similar to that of the Bayes classifier.

Using bagging with a stable classification method like LDA may lead to little improvement, but bagging classification trees created by CART may produce considerable improvement because of the instability of classification trees. When bagging trees, they should generally be fully grown and unpruned. Fully grown trees tend to be more variable in their predictions than smaller trees (meaning that sample to sample variability in training data can more easily lead to variability in the predictions when the trees are more complex), but they also tend to be less biased. When bagged, the voting counters the problems due to variability as long as the bias isn't great, and so larger trees tend to work better with bagging.
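Here is a sketch of the voting scheme with bagged trees (assumed toy data; the scikit-learn trees below are grown out fully, with no depth limit and no pruning, in line with the advice above):

import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(3)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] + rng.normal(scale=0.5, size=200) > 0).astype(int)
n = len(y)

B = 100
trees = []
for _ in range(B):
    idx = rng.randint(0, n, size=n)                        # bootstrap sample
    # No depth limit and no pruning: each tree is grown out fully.
    trees.append(DecisionTreeClassifier().fit(X[idx], y[idx]))

def bagged_classify(x_new):
    # Simple voting: each tree casts one vote, and the majority class is predicted.
    votes = np.array([t.predict(x_new) for t in trees])    # shape (B, number of new cases)
    return (votes.mean(axis=0) > 0.5).astype(int)          # majority vote for 0/1 labels

print(bagged_classify(np.array([[0.2, -0.1], [1.5, 1.0]])))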


some more information

Generally, between 25 and 100 bootstrap samples are adequate. Using too many bags shouldn't hurt things, and so if the data set isn't so large that speed becomes a major concern, it may be best to go with 100.

When bagging, one doesn't want to use a validation sample or cross-validation to optimize each member of the ensemble of predictors. Rather, one tends to overfit the members of the ensemble, creating low bias but high variance predictors, and then relies on the bagging process to reduce the ill effects of the high variance while taking advantage of the low bias. (The lack of need to fine-tune each member of the ensemble partially offsets the long running time which may be encountered due to having to construct a predictor from each bootstrap sample.) With trees, the one chosen by cross-validation tends to be decent with respect to both bias and variance. But when we bag trees, we want to use larger trees which have a smaller bias and a larger variance, and then we let the bagging decrease the ill effects due to high variance. Instead of balancing the bias-variance tradeoff, we can just focus on the bias, and let the bagging procedure take care of the high variance.
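As a rough illustration of the contrast (scikit-learn parameter names; a sketch, not a prescription): a single tree would typically have its cost-complexity pruning penalty tuned by cross-validation, while the trees inside the bag are left fully grown and untuned.

import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import BaggingRegressor
from sklearn.model_selection import GridSearchCV

rng = np.random.RandomState(4)
X = rng.uniform(-1, 1, size=(200, 2))
y = X[:, 0] ** 2 - X[:, 1] + rng.normal(scale=0.3, size=200)

# A single tree: choose the pruning penalty ccp_alpha by cross-validation.
single_tree = GridSearchCV(DecisionTreeRegressor(),
                           {"ccp_alpha": [0.0, 0.001, 0.01, 0.1]}, cv=5).fit(X, y)

# Bagged trees: no tuning of the individual members; each tree is left fully grown,
# and the averaging is relied on to deal with the resulting high variance.
bagged_trees = BaggingRegressor(DecisionTreeRegressor(), n_estimators=100,
                                random_state=0).fit(X, y)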


final remarks

A key concept with bagging is to bag low bias (which typically means relatively high variance) predictors. This is because, in principle, bagging will nearly eliminate the variance problems while not harming the low bias, leaving one with the best of it (to borrow a poker phrase). In practice, bagging can indeed work very well (but perhaps other perturb and combine techniques like boosting and random forests are generally better). When bagging fails to improve performance, it may be because there wasn't a lot of room for improvement (if performance was already close to the Bayes limit), or it could be that the Achilles' heel was a sample size that was too small (don't ask --- the answer is it depends ...), which is often the key limiting factor with procedures based on bootstrapping. If the empirical distribution based on the training sample isn't close enough to the true underlying distribution, then bagging may not perform well. Of course, other methods may not perform well either in such a case!