Bagging
Bootstrap aggregation, or bagging, is a method for improving the accuracy of predictions in classification and
regression settings which was proposed by Leo Breiman in the mid 1990s. It is one of several perturb and combine
/ ensemble / committee of experts methods which have become popular, because of their effectiveness,
over the past 15 years. With these methods, a classification or regression method is applied to various
perturbations of the original data set, and the results are combined to obtain a single classifier or regression
model. While an appreciable increase in accuracy is often achieved, on the downside we have that
the final prediction rule is typically so
complex that it is very hard to gain a good understanding of how the explanatory variables relate to the response.
With bagging, the perturbations of the data are obtained by drawing many bootstrap samples from the original data (or
one can use subbagging (aka half-bagging) and draw samples of size n/2 without replacement).
A specific method for classification or regression is applied to each of the randomly drawn data sets. Then the
results are combined, by averaging for regression and simple voting for classification, to obtain the final
prediction rule. The process of combining the results typically decreases the variability (compared to a single
application of
the chosen prediction method on the original data) without increasing bias, and this leads to improved prediction
accuracy.
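To make the procedure concrete, here is a rough sketch of the bagging loop in Python. Everything in it is an
illustrative choice of mine (the names bag_predict, make_model, and n_boot in particular), and any
scikit-learn-style estimator with fit/predict could serve as the base method:

    import numpy as np

    def bag_predict(X_train, y_train, X_new, make_model, n_boot=100, classify=False, seed=None):
        """Minimal bagging loop: fit make_model() to n_boot bootstrap samples and combine
        the predictions by averaging (regression) or by simple voting (classification)."""
        rng = np.random.default_rng(seed)
        n = len(y_train)
        all_preds = []
        for _ in range(n_boot):
            idx = rng.integers(0, n, size=n)             # one bootstrap sample: n draws with replacement
            model = make_model().fit(X_train[idx], y_train[idx])
            all_preds.append(model.predict(X_new))
        all_preds = np.asarray(all_preds)                # shape: (n_boot, number of new cases)
        if classify:
            # simple voting (assumes classes are coded as non-negative integers)
            return np.apply_along_axis(lambda col: np.bincount(col.astype(int)).argmax(), 0, all_preds)
        return all_preds.mean(axis=0)                    # averaging for regression

For instance, bag_predict(X, y, X_new, DecisionTreeRegressor) would bag fully-grown regression trees from
scikit-learn, while bag_predict(X, y, X_new, DecisionTreeClassifier, classify=True) would bag classification
trees and combine them by voting.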
While bagging is commonly applied to classification and regression trees, it can be used with many other methods of
classification and regression. It is a general technique,
and is not a specific method for classification and regression trees --- to say that one did
classification by bagging is not adequate, since it should be stated what was bagged (e.g., bagging fully-grown
classification trees obtained from CART). The Salford Systems literature refers to Breiman's "Bagger" as a
classifier (it seems), but I think this is nonstandard. (My guess is that the Bagger refers to bagging fully-grown
classification trees obtained from CART, but I don't believe that this use of Bagger is widespread.)
regression
I think that it is easier to understand how bagging works for regression, than to understand why it works for
classification. (Note: Bagging doesn't always lead to improvement, but it typically does. It seldom hurts
accuracy, and when it does, the harm tends to be slight.)
Let Y(x) denote the value of the numerical response variable which will be observed at a given set of input values,
x, and let φ(x) be the prediction of Y(x) which results from applying a specific regression method to a training
sample to obtain a prediction rule.
(The specific regression method might be using CART to obtain a regression tree using a certain selection
criterion, or it could be using a specific stepwise regression method to select an additive regression model, fit
using least squares, using only 1st-order and 2nd-order polynomial terms.)
Viewing φ(x) as a random variable (thinking of it as a function of the learning sample, and viewing the learning
sample as random, and not thinking of it as a function of x, which is considered to be some fixed set of input
values), let μ(x) denote the (unknown) mean of φ(x).
It can be shown (see p. 318 of my book chapter) that
    E( [Y(x) - φ(x)]² )  ≥  E( [Y(x) - μ(x)]² ),
with the inequality being strict in the typical case having Var(φ(x)) > 0.
This gives us that if μ(x) = E(φ(x)) could be used as a predictor, it would have a smaller MSPE (mean squared
prediction error) than does φ(x).
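To see why the inequality holds, here is a sketch of the usual argument, assuming (as is standard in this setting)
that the case to be predicted is independent of the training sample, so that Y(x) and φ(x) are independent:

    E( [Y(x) - φ(x)]² ) = E( [ (Y(x) - μ(x)) - (φ(x) - μ(x)) ]² )
                        = E( [Y(x) - μ(x)]² ) - 2 E[ (Y(x) - μ(x)) (φ(x) - μ(x)) ] + E( [φ(x) - μ(x)]² )
                        = E( [Y(x) - μ(x)]² ) + Var(φ(x)),

since E(φ(x) - μ(x)) = 0 and independence makes the expectation of the product in the cross term equal to the
product of the expectations, which is zero. So the excess of the MSPE of φ(x) over the MSPE of μ(x) is exactly
Var(φ(x)), which is also why (as discussed below) the relative improvement available from bagging is largest when
the variance of the prediction rule is a large fraction of the overall MSPE.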
Of course, we typically don't know the joint distribution of all of the variables (including the response variable),
and so it is not realistic to think that μ(x) can be determined exactly. If we did know the joint distribution, then
even though an exact determination of μ(x) may be too difficult, we could obtain a Monte Carlo estimate --- we could
just generate many new samples of the same size as the training sample, apply the regression method to each generated
sample, and average all of the values of φ(x) obtained to estimate its mean. The strong law of large numbers gives us
that for a large enough number of Monte Carlo trials, the estimated mean will be close to the true mean with high
probability.
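As an illustration of this idealized Monte Carlo scheme, the sketch below repeatedly draws fresh training samples
from a known distribution, fits the regression method to each, and averages the values of φ(x) at a fixed x. The
simulate_sample function, the sine-plus-noise data-generating process, and the use of scikit-learn's
DecisionTreeRegressor are all made-up stand-ins for "a known joint distribution" and "a specific regression method":

    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    rng = np.random.default_rng(0)

    def simulate_sample(n=200):
        """Draw a fresh (X, y) sample from a known joint distribution
        (here an arbitrary nonlinear model with Gaussian noise)."""
        X = rng.uniform(-2, 2, size=(n, 1))
        y = np.sin(2 * X[:, 0]) + rng.normal(scale=0.3, size=n)
        return X, y

    x0 = np.array([[0.7]])        # the fixed set of input values x at which we estimate mu(x)
    preds = []
    for _ in range(1000):         # Monte Carlo trials: a brand-new training sample each time
        X, y = simulate_sample()
        phi = DecisionTreeRegressor().fit(X, y)
        preds.append(phi.predict(x0)[0])

    print(np.mean(preds))         # approximates mu(x0) = E(phi(x0)) by the strong law of large numbers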
In practice, we cannot generate new samples from the true (but unknown) joint distribution, but we can generate new
samples from an estimate of the unknown joint distribution --- when we create the bootstrap samples we are sampling
from the empirical distribution based on the training sample, and this is an estimate of the true joint distribution.
Collectively, these estimates of μ(x), for all x of interest, constitute the bagged prediction rule. As long as the
empirical distribution from the training sample is sufficiently representative of the true joint distribution of the
response and explanatory variables, the bagged prediction rule should be an improvement over the unbagged prediction
rule just based on the original training sample. Using a greater number of bootstrap samples will result in the
sample means being based on a larger number of independent observations, and will lower the variance.
To understand when bagging might lead to appreciable improvements in regression settings, note (see p. 318 of
my book chapter) that the relative improvement from bagging can be
appreciable when the variance of the prediction rule is large relative to the overall MSPE. This suggests that more
improvement should be expected with an unstable regression method, such as using CART, than with a stable regression
method, like using OLS to fit a regression model when the correct form of the model is known and the error term
distribution is nice. (CART is unstable, which leads to high variance, since a slight change in the data can
result in close calls in split decisions coming out differently, and if one or more of the initial splits is done
using a different variable, it could be that the rest of the growth of the tree is greatly affected.)
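The sketch below is one way to see this contrast empirically. All of it is an illustrative setup of my own: a linear
data-generating process, so that OLS with the correct model form is a stable method, while a fully-grown regression
tree is an unstable one; bagging should help the tree noticeably and the linear fit hardly at all:

    import numpy as np
    from sklearn.linear_model import LinearRegression
    from sklearn.tree import DecisionTreeRegressor

    rng = np.random.default_rng(1)

    def make_data(n=100):
        X = rng.uniform(-2, 2, size=(n, 2))
        y = 1.0 + 2.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(scale=1.0, size=n)  # true model is linear
        return X, y

    def bagged(make_model, X, y, X_test, n_boot=100):
        preds = np.zeros((n_boot, len(X_test)))
        for b in range(n_boot):
            idx = rng.integers(0, len(y), size=len(y))   # bootstrap sample of the training cases
            preds[b] = make_model().fit(X[idx], y[idx]).predict(X_test)
        return preds.mean(axis=0)                        # average over the bootstrap fits

    X, y = make_data()
    X_test, y_test = make_data(5000)
    for name, make_model in [("OLS (stable)", LinearRegression),
                             ("fully-grown tree (unstable)", DecisionTreeRegressor)]:
        single = make_model().fit(X, y).predict(X_test)
        bag = bagged(make_model, X, y, X_test)
        print(name,
              "single MSPE:", round(np.mean((y_test - single) ** 2), 3),
              "bagged MSPE:", round(np.mean((y_test - bag) ** 2), 3))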
classification
For a specific source of data and a specific sample size, a classification method is said to be unbiased at x if,
over the random selection of the training sample, the method predicts the most probable class for x with a
probability greater than the probability with which it predicts any other class. A classification method is said
to be biased at x otherwise.
If many classifiers are constructed from independent training samples drawn from the joint distribution underlying
the response and explanatory variables, and at each x of interest voting is done to arrive at the predicted class,
then, provided the number of classifiers is large enough and the classification method is unbiased at x, with high
probability the voting will result in the most probable class being predicted. (Note: The most probable class is
the one predicted by the Bayes classifier.)
This means that for the part of the measurement space on which the classification method is unbiased, the voting
scheme should well approximate the optimal Bayes classifier.
For the part of the measurement space on which the classification method is biased, the voting
scheme will with high probability result in a prediction that differs from the prediction of the optimal Bayes classifier.
But if there is a high probability that a case to be classified will correspond to the part of the measurement space
on which the classification method is unbiased, then the voting scheme should result in an error rate nearly as low
as that of the optimal Bayes classifier.
In practice, the voting scheme as described above cannot
be done if independent samples cannot be drawn from the actual underlying
population/distribution. But if the empirical distribution based on the training sample isn't too different from the
actual distribution, then bootstrap samples can play the role of the independent samples from the actual
distribution, and the bagged classifier may have an accuracy similar to that of the Bayes classifier.
Using bagging with a stable classification method like LDA may lead to little improvement, but bagging classification trees
created by CART may produce considerable improvement because of the instability of classification trees.
When bagging trees, they should generally be fully grown and unpruned. Fully grown trees tend to be more variable in
their predictions than smaller trees (meaning that sample to sample variability in training data can more easily lead to
variability in the predictions when the trees are more complex), but they also tend to be less biased. When bagged,
the voting counters the problems due to variability as long as the bias isn't great, and so larger trees tend to work
better with bagging.
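Here is a minimal sketch of bagging fully-grown classification trees with simple voting. The two-class synthetic
data, the choice of 100 bootstrap samples, and the use of scikit-learn's DecisionTreeClassifier are illustrative
choices of mine, not a prescription:

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    rng = np.random.default_rng(2)

    # toy two-class problem: class 1 when a point falls inside a circle, plus some label noise
    X = rng.uniform(-1, 1, size=(300, 2))
    y = ((X ** 2).sum(axis=1) < 0.4).astype(int)
    y = np.where(rng.random(300) < 0.1, 1 - y, y)

    X_new = rng.uniform(-1, 1, size=(1000, 2))

    votes = np.zeros((len(X_new), 2))
    for _ in range(100):                                   # 100 bootstrap samples
        idx = rng.integers(0, len(y), size=len(y))         # sample n cases with replacement
        tree = DecisionTreeClassifier()                    # fully grown, unpruned (no max_depth set)
        tree.fit(X[idx], y[idx])
        pred = tree.predict(X_new)
        votes[np.arange(len(X_new)), pred] += 1            # tally one vote per tree

    bagged_class = votes.argmax(axis=1)                    # simple (unweighted) majority vote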
some more information
Generally, between 25 and 100 bootstrap samples are adequate. Using too many bags shouldn't hurt things, and so if
the data set isn't so large that speed becomes a major concern, it may be best to go with 100.
When bagging, one doesn't want to use a validation sample or cross-validation to optimize each member of the ensemble
of predictors. Rather one tends to overfit the members of the ensemble, creating low bias but high variance
predictors, and then relies on the bagging process to reduce the ill effects of the high variance while taking
advantage of the low bias. (The lack of need to fine tune each member of the ensemble partially offsets the
long running time which may be encountered due to having to construct a predictor from each bootstrap sample.)
With trees, the one chosen by cross-validation tends to be decent with respect to both bias and variance. But when we
bag trees, we want to use larger trees which have a smaller bias and a larger variance, and then we let the bagging
decrease the ill effects due to high variance. Instead of balancing the bias-variance tradeoff, we can just focus on
the bias, and let the bagging procedure take care of the high variance.
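If scikit-learn happens to be available, this advice (roughly 100 bags, base trees grown fully rather than tuned by
cross-validation) corresponds to something like the following sketch; the dataset is just a placeholder:

    from sklearn.datasets import make_classification
    from sklearn.ensemble import BaggingClassifier
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    X, y = make_classification(n_samples=500, random_state=0)    # placeholder data
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    bagger = BaggingClassifier(
        DecisionTreeClassifier(),   # max_depth=None by default: fully grown, unpruned trees
        n_estimators=100,           # roughly 100 bootstrap samples; more shouldn't hurt, just slower
    )
    bagger.fit(X_train, y_train)
    print(bagger.score(X_test, y_test))   # test-set accuracy of the bagged classifier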
final remarks
A key concept with bagging is to bag low bias (which typically means relatively high variance) predictors. This is because,
in principle, bagging will nearly eliminate the variance problems while not harming the low bias, leaving one with the
best of it (to borrow a poker phrase). In practice, bagging can indeed work very well (but perhaps other
perturb and combine techniques like boosting and random forests are generally better). When bagging fails to improve
performance, it may be because there wasn't a lot of room for improvement (if performance was already close to the
Bayes limit), or it could be that the Achilles' heel was a sample size that was too small (how small is too small?
don't ask --- the answer is it depends ...); a training sample that is too small is often the key limiting factor
with procedures based on bootstrapping. If the
empirical distribution based on the training sample isn't close enough to the true underlying distribution, then
bagging may not perform well. Of course, other methods may not perform well either in such a case!