Boosting
Boosting, like bagging, is a
perturb and combine (P&C) method which can be used for classification by applying a chosen classification
method to various
perturbations of the original data set, and then combining the results to obtain a single classifier.
(Note:
There are a variety of ways in which boosting can be done, and some variants of boosting can be used with regression
methods. But due to lack of time, for this presentation I'm going to restrict attention to classification in
two-class settings.)
However, there are some key differences in how the perturbing and combining is done.
- With boosting, the samples used for the various steps are not all obtained in the same way --- rather,
incorrectly classified
cases for a step are given increased weight during the next step. (With bagging, bootstrap samples are drawn for
each step, in the same way for all of the steps, and independently of one another.)
- With boosting, the results are combined using weights. (With bagging, a simple voting/averaging is done, with
each member of the ensemble of classifiers having the same weight.)
Thus, rather than rely on random perturbations and a simple way of combining the results, boosting is an iterative
procedure which creates the
perturbations adaptively, and uses weighting in both the perturbing and the combining. Breiman considers
boosting to be a way to do arcing (adaptive resampling and combining).
(Note:
Breiman developed some other ways to do arcing, but these alternative ways, along with the term arcing, do not
seem to have gained much popularity.)
Another difference between boosting and bagging is that bagging seems to work best with complex models which
have low bias, whereas boosting is typically applied to weak learners, which result in high bias / low
variance classifiers to be combined. A weak two-class classifier is one which is only slightly better than
random guessing, perhaps having a misclassification rate just under 50%. An example of a weak learner is a decision tree
method which limits growth to just a small number of terminal nodes. Often, two-node trees, called stumps, are used.
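For instance, a stump can be obtained from ordinary tree software simply by allowing a single split. The sketch below is a minimal illustration, assuming scikit-learn is available (the synthetic data set is just for demonstration and is not from the chapter):

    from sklearn.datasets import make_classification
    from sklearn.tree import DecisionTreeClassifier

    X, y = make_classification(n_samples=200, n_features=10, random_state=0)
    stump = DecisionTreeClassifier(max_depth=1)   # one split, i.e. two terminal nodes
    stump.fit(X, y)
    # On a hard problem the accuracy of a single stump is typically only
    # modestly better than the 50% obtainable by guessing.
    print(stump.score(X, y))
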
Schapire, in 1990, introduced the predecessor to later boosting algorithms developed by him and others. His original
method combined the results of three two-class classifiers, created from different training samples, by simple
majority voting. Freund
extended Schapire's initial method by combining the results of a larger number of weak learners. Then in 1996,
Freund and Schapire introduced the AdaBoost algorithm, which quickly became very popular.
Some have indicated that boosting is one of the most powerful machine/statistical learning ideas to have been
introduced during the 1990s, and it has been suggested that the application of boosting to classification trees
results in classifiers which are generally competitive with classifiers obtained using any other method. A
particularly nice thing about boosting and bagging is that they can be used very successfully with simple
"off-the-shelf" classifiers (as opposed to needing to carefully tune and tweak the classifiers).
(Note:
Some neural net classifiers that perform very well are ones that have been carefully tuned, perhaps
making use of architectures which incorporate subject matter knowledge.)
But like
classifiers obtained using support vector machines and neural networks, classifiers constructed using boosting and
bagging tend to be very complex and do not make it easy to gain insight about the underlying phenomenon.
AdaBoost
Boosting using the AdaBoost algorithm is by far the most popular form of boosting. There are two versions of AdaBoost.
AdaBoost.M1 starts by having the base learner (the classification method which is used during
each step) applied to the training data, with each case in the training sample given the same weight, 1/N.
The weights used for each subsequent step depend upon the
weighted resubstitution error rate of the classifier created in the immediately
preceding step, with the cases misclassified on a given step being
given greater weight on the next step.
Specifically, let I(m,n) equal 1 if the nth case is misclassified on the mth step and equal 0 otherwise, and let
w(m,n) be the weight of the nth case for the mth step, where the weights are positive and sum to 1. The weighted
resubstitution error rate for the mth step, e(m), is then
e(m) = Σ_n w(m,n) I(m,n).
The weights for the (m+1)th step are obtained from the weights for the mth step by multiplying the weights of the
correctly classified cases by e(m)/(1 - e(m)), which should be less than 1, and then multiplying the entire set of
values by the constant greater than 1 which makes the N values sum to 1. (This procedure downweights the cases which
were correctly classified, and gives increased weight to the cases which were misclassified.)
For the second step, there will only be two values used for the
weights, but
as the iterations continue, the cases which are often correctly
classified can have weights much smaller than the cases which are the
hardest to classify correctly.
If the classification method of the weak learner does not allow weighted
cases, then from the second step onwards, a random sample of the
original learning sample cases is drawn to serve as the learning sample
for that step, with independent selections made using the weights as the
selection probabilities.
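A minimal numpy rendering of this reweighting, and of the weighted resampling fall-back just described, might look as follows (the function and variable names are mine, chosen for illustration only):

    import numpy as np

    def update_weights(w, misclassified):
        # One AdaBoost.M1 reweighting step, following the description above:
        # w is the current vector of positive weights summing to 1, and
        # misclassified is a boolean array marking the cases the current
        # classifier got wrong.
        e_m = np.sum(w[misclassified])              # weighted resubstitution error e(m)
        w_new = w.copy()
        w_new[~misclassified] *= e_m / (1.0 - e_m)  # downweight the correctly classified cases
        return w_new / np.sum(w_new)                # rescale so the N weights again sum to 1

    def weighted_resample(X, y, w, seed=0):
        # For base learners that cannot handle case weights: draw a sample of
        # size N with replacement, using the weights as selection probabilities.
        rng = np.random.default_rng(seed)
        idx = rng.choice(len(y), size=len(y), replace=True, p=w)
        return X[idx], y[idx]
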
The fact that the cases misclassified on a given step are given increased
weight on the next step typically leads to a different fit of the base
learner, which correctly classifies some of the misclassified cases from
the previous step, contributing to a low correlation between the predictions of one step and the next. Amit,
Blanchard, and Wilder (1999)
(see my
book chapter
for information about the references)
provide some details which indicate that the AdaBoost algorithm attempts to
produce low correlation between the predictions of the fitted base
classifiers, and some believe that this phenomenon is an important
factor in the success of AdaBoost.
Assuming that the weighted error rate for each step is no greater than 0.5, the boosting procedure creates M
classifiers using the iteratively adjusted weights. Values such as 10 and 100 have been used for M, but in many
cases it will be better to use a larger value, such as 200, since performance often continues to improve as M
approaches 200 or 400, and it is rare that performance will suffer, due to overfitting, if M is made too large
(although overfitting can occur if there is a lot of noise). The M classifiers are then combined to obtain a single
classifier to use.
If, on any step, the error rate exceeds 0.5 (which should not occur with only two classes, but could occur if there
are more than two classes), the iterations are terminated, and only the classifiers having error rates less than 0.5
are combined. In either case, the classifiers are combined by using weighted voting to assign the predicted class
for any input x, with the classifier created on the mth step being given weight
log((1 - e(m))/e(m)).
This gives the greatest weights to the classifiers having the lowest weighted resubstitution error rates.
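Continuing that illustrative sketch, the weighted voting for a two-class problem (with the classes coded as 0 and 1, an arbitrary choice) could be written as:

    import numpy as np

    def combined_prediction(preds, errors):
        # preds:  length-M array of class labels (0 or 1) predicted at a point x
        #         by the M fitted classifiers
        # errors: length-M array of weighted resubstitution error rates e(m)
        alpha = np.log((1.0 - errors) / errors)     # weight log((1 - e(m))/e(m)) per classifier
        vote_1 = np.sum(alpha[preds == 1])
        vote_0 = np.sum(alpha[preds == 0])
        return 1 if vote_1 > vote_0 else 0
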
Increasing the number of iterations for AdaBoost.M1 drives the training
error rate of the composite classifier to zero. The error
rate for an independent test sample need not approach zero, but in many
cases it can get very close to the smallest possible error rate.
Breiman's discussion in Friedman et al. (2000) makes the point that
after a large number of iterations, if one examines the weighted
proportion of times that each of the learning sample cases has been
misclassified, the values tend to be all about the same. So instead of stating
that AdaBoost concentrates on the hard to classify cases, as some have
done, it is more accurate to state that AdaBoost places (nearly) equal
importance on correctly classifying each of the cases in the learning
sample. HTF shows that AdaBoost.M1 is equivalent to
forward stagewise additive modeling using an exponential loss function.
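(To spell out what that means, the exponential loss referred to here is L(y, f(x)) = exp(-y f(x)), with the two
classes coded as y = -1 and y = +1 and with f(x) being the weighted sum of the fitted base classifiers; HTF show
that minimizing this loss one base classifier at a time, keeping the earlier terms fixed, leads to the same case
reweighting and classifier weighting as AdaBoost.M1.)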
AdaBoost.M2 is equivalent to AdaBoost.M1 if there are just two
classes, but appears to be better than AdaBoost.M1
if there are more than two
classes (because for AdaBoost.M1 to work well the weak learners should
have weighted resubstitution error rates less than 0.5, and this may be
rather difficult to achieve if there are many classes).
The weak learners used with AdaBoost.M2 assign to each case a set of
plausibility values for the possible classes. They also make use
of a loss function that can assign different penalties to different
types of misclassifications. The loss function values used are updated for each
iteration to place more emphasis on avoiding the same types of
misclassifications that occurred most frequently in the previous step.
After the iterations have been completed, for each x of interest the sequence of classifiers is used to produce a
weighted average of the plausibilities for each class, and the class corresponding to the largest of these values is
taken to be the prediction for x. (See Freund and Schapire (1996) and Freund and Schapire (1997) for details
pertaining to AdaBoost.M2.)
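The final combination step can be pictured roughly as follows; this is a schematic sketch rather than Freund and Schapire's exact formulation, and the array shapes and names are assumptions made for illustration:

    import numpy as np

    def predict_class(plausibilities, alpha):
        # plausibilities: array of shape (M, K), the plausibility values for the
        #                 K classes returned by each of the M weak learners at x
        # alpha:          length-M array of nonnegative classifier weights
        combined = np.average(plausibilities, axis=0, weights=alpha)
        return int(np.argmax(combined))             # class with the largest weighted average

    # Made-up example with M = 3 weak learners and K = 2 classes:
    plaus = np.array([[0.7, 0.3],
                      [0.4, 0.6],
                      [0.8, 0.2]])
    print(predict_class(plaus, np.array([1.2, 0.5, 0.9])))   # prints 0
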
when and why boosting works
As to how boosting typically achieves such a low misclassification rate,
the opinions are many and varied.
I'll just give a little bit of information here, and let you go to my
book chapter for additional information and references.
Friedman et al. (2000) points out that
when used with a rather simple base classification method which is stable but tends to produce classifiers
which are highly biased,
such as stumps, in some situations (e.g., if one variable is a dominant predictor),
the success of boosting is due much more to bias
reduction
than it is to variance reduction, and this makes boosting
fundamentally different from bagging, despite the fact that both methods
combine classifiers based on various perturbations of the learning
sample.
When used with a rather complex base classification method that has low bias, but is unstable,
such as carefully grown
and pruned trees, boosting
(and arcing in general), like bagging,
in many cases can dramatically lower the error rate of the
classification method.
Stated simply, this is due in part to the fact that unstable methods have
relatively large variances, and boosting decreases the variance without
increasing the bias.
In some settings, and used with some methods, boosting doesn't work very well.
Breiman (1998) shows that when applied to linear discriminant analysis (LDA),
which is a fairly stable method of classification, neither bagging nor boosting
has any appreciable effect.
In some cases
boosting can make a classifier appreciably worse, but this does not seem
to happen in a large percentage of settings which have been investigated,
and when
it does happen, perhaps a small learning sample size is a contributing
factor.
Several studies have indicated that boosting can be outperformed by other methods when
the classes have significant overlap. Nevertheless, it is often the case that some
form of boosted classifier can do nearly as well as, or better than, any
other classifier.
Although it does not seem to be fully understood
why boosting works as well as it does, there is no question about the
fact that it can work very well. In various comparisons with bagging,
boosting generally, but not always, does better, and leading researchers
in the field of machine learning tend to favor boosting. (It should be
noted that some of the comparisons of boosting to bagging are a bit
misleading because they use stumps as the base classifier for both
boosting and bagging, and while stumps, being weak learners, can work
very well with boosting, they can be too stable and have too great
a bias to work really well with bagging. Trees larger than stumps tend to
work better with bagging, and so a fairer comparison would be to compare
bagging fairly large trees, boosting stumps, and boosting slightly larger (than a stump) trees.)
what to boost
Hastie et al. (2001) claims that using trees with between four and eight
terminal nodes works well in most cases, and that performance is fairly
insensitive to the particular choice made from this range.
Friedman, Hastie, and Tibshirani (2000) gives
an example in which boosting stumps does appreciably worse than just
using a single large tree, but boosting eight node trees produced an
error rate that is less than 25 percent of the error rate of the single
large tree, showing that the choice of what is used as the base
classifier can make a large difference, and that the popular choice of
boosting stumps can be
far from optimal. However, they also give another example for which
stumps are superior to larger trees. As a general strategy, one might choose
stumps if it is suspected that effects are additive, and choose larger
trees if one anticipates interactions for which adjustments should be
made.
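To make the comparison concrete, here is a hedged sketch of how one might try both choices on a problem, assuming a reasonably recent version of scikit-learn (in which AdaBoostClassifier takes an estimator argument); it is meant only to illustrate the kind of experiment discussed above, not to reproduce the cited results:

    from sklearn.datasets import make_classification
    from sklearn.ensemble import AdaBoostClassifier
    from sklearn.model_selection import cross_val_score
    from sklearn.tree import DecisionTreeClassifier

    X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

    for name, base in [("stumps", DecisionTreeClassifier(max_depth=1)),
                       ("8-node trees", DecisionTreeClassifier(max_leaf_nodes=8))]:
        booster = AdaBoostClassifier(estimator=base, n_estimators=200, random_state=0)
        score = cross_val_score(booster, X, y, cv=5).mean()
        print(name, round(1 - score, 3))   # estimated misclassification rate
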
It's interesting that smallish trees tend to work rather well with boosting, because they can be
unstable in addition to
being highly biased. (The instability would be due to the fact that with a lot of predictors, sometimes one variable is
chosen over another one to be the splitter in a very close call, so that data sets which are only slightly different
can produce very different trees. (It can be noted that in some situations a stump can be stable, and in other
situations it can have high variance.) The high bias is due to the small number of nodes --- often the decision boundary cannot
be complex enough to result in a low bias classifier.) Perhaps the instability encourages a wide variety of trees in
the ensemble, and the adjustable case weights serve to reduce the bias, while the averaging of the results reduces the
variance. (It may be that the failure of boosting to work well with LDA is due to the great stability of classification
by LDA
--- maybe the classifiers in the ensemble are generally too similar to one another.)
bag-boosting
With bag-boosting, one boosts bagged stumps (or perhaps some other type of tree). This is an extremely
computer-intensive procedure.
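As a rough illustration of what bag-boosting involves (again assuming scikit-learn, and that its boosting and bagging estimators compose in the expected way; this is a sketch, not a prescription):

    from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
    from sklearn.tree import DecisionTreeClassifier

    # The "base learner" handed to the boosting procedure is itself a bagged
    # ensemble of stumps, so every boosting iteration fits 50 stumps.
    bagged_stumps = BaggingClassifier(estimator=DecisionTreeClassifier(max_depth=1),
                                      n_estimators=50)
    bag_booster = AdaBoostClassifier(estimator=bagged_stumps, n_estimators=100)
    # Fitting bag_booster to a learning sample would grow 100 x 50 = 5000 stumps,
    # which is why the procedure is so computer-intensive.
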