Boosting



Boosting, like bagging, is a perturb and combine (P&C) method which can be used for classification by applying a chosen classification method to various perturbations of the original data set, and then combining the results to obtain a single classifier. (Note: There are a variety of ways in which boosting can be done, and some variants of boosting can be used with regression methods. But due to lack of time, for this presentation I'm going to restrict attention to classification in two-class settings.) However, there are some key differences in how the perturbing and combining are done. Rather than relying on random perturbations and a simple way of combining the results, boosting is an iterative procedure which creates the perturbations adaptively, and uses weighting in both the perturbing and the combining. Breiman considers boosting to be a way to do arcing (adaptive resampling and combining). (Note: Breiman developed some other ways to do arcing, but these alternate ways, along with the term arcing, don't seem to have gained much popularity.)

Another difference between boosting and bagging is that bagging seems to work best with complex models which have low bias, whereas boosting is typically applied to weak learners, which produce high-bias / low-variance classifiers to be combined. A weak two-class classifier may be only slightly better than random guessing, perhaps having a misclassification rate just below 50%. An example of a weak learner is a decision tree method which limits growth to just a small number of terminal nodes. Often, two-node trees, called stumps, are used.
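To make the notion of a stump concrete, here is a small sketch (in Python, with illustrative names not taken from any library) of fitting a two-node tree by exhaustive search over features and split points:

```python
# Minimal sketch of a decision stump (two-node tree) for two-class
# labels coded as -1 and +1. All names here are illustrative.

def fit_stump(X, y):
    """Exhaustively search every feature and threshold for the split
    that minimizes the number of misclassified cases."""
    p = len(X[0])
    best = None  # (error count, feature, threshold, sign)
    for j in range(p):
        for thr in sorted({row[j] for row in X}):
            for sign in (+1, -1):
                # predict +sign if x[j] <= thr, else -sign
                errors = sum(1 for xi, yi in zip(X, y)
                             if (sign if xi[j] <= thr else -sign) != yi)
                if best is None or errors < best[0]:
                    best = (errors, j, thr, sign)
    _, j, thr, sign = best
    return lambda x: sign if x[j] <= thr else -sign

# Usage: these points can be separated on feature 0 at some threshold.
X = [[0.1, 5.0], [0.4, 2.0], [0.6, 7.0], [0.9, 1.0]]
y = [+1, +1, -1, -1]
stump = fit_stump(X, y)
```

Even this crude search is enough for the boosting illustrations that follow; a real implementation would of course use a library tree routine.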


Schapire, in 1990, introduced the predecessor to later boosting algorithms developed by him and others. His original method combined the results of three two-class classifiers, created from different training samples, by simple majority voting. Freund extended Schapire's initial method by combining the results of a larger number of weak learners. Then in 1996, Freund and Schapire introduced the AdaBoost algorithm, which quickly became very popular.

Some have indicated that boosting is one of the most powerful machine/statistical learning ideas to have been introduced during the 1990s, and it has been suggested that the application of boosting to classification trees results in classifiers which are generally competitive with classifiers obtained using any other method. A particularly nice thing about boosting and bagging is that they can be used very successfully with simple "off-the-shelf" classifiers (as opposed to needing to carefully tune and tweak the classifiers). (Note: Some neural net classifiers that perform very well are ones that have been carefully tuned, perhaps making use of architectures which incorporate subject matter knowledge.) But like classifiers obtained using support vector machines and neural networks, classifiers constructed using boosting and bagging tend to be very complex and do not make it easy to gain insight about the underlying phenomenon.


AdaBoost

Boosting using the AdaBoost algorithm is by far the most popular form of boosting. There are two versions of AdaBoost: AdaBoost.M1 and AdaBoost.M2.


AdaBoost.M1 starts by having the base learner (the classification method which is used during each step) applied to the training data, with each case in the training sample given the same weight, 1/N. The weights used for each subsequent step depend upon the weighted resubstitution error rate of the classifier created in the immediately preceding step, with the cases being misclassified on a given step being given greater weight on the next step.

Specifically, let I(m,n) equal 1 if the nth case is misclassified on the mth step and equal 0 otherwise, and let w(m,n) be the weight of the nth case for the mth step, where the weights are positive and sum to 1. The weighted resubstitution error rate for the mth step, e(m), is then
e(m) = Σ_n w(m,n) I(m,n).
The weights for the (m+1)th step are obtained from the weights for the mth step by multiplying the weights of the correctly classified cases by e(m)/(1 - e(m)), which should be less than 1, and then multiplying the entire set of values by the constant, greater than 1, which makes the N values sum to 1. (This procedure downweights the cases which were correctly classified, and gives increased weight to the cases which were misclassified.) For the second step there will be only two distinct values among the weights, but as the iterations continue, the cases which are often correctly classified can have weights much smaller than the cases which are the hardest to classify correctly. If the classification method of the weak learner does not allow weighted cases, then from the second step onwards, a random sample of the original learning sample cases is drawn to serve as the learning sample for that step, with independent selections made using the weights as the selection probabilities.
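The reweighting step just described can be sketched as follows; the helper name and the 0/1 misclassification flags are illustrative, not from any library:

```python
def update_weights(w, misclassified):
    """One AdaBoost.M1 reweighting step.
    w: current weights (positive, summing to 1)
    misclassified: 0/1 flags, 1 if the case was misclassified on this step"""
    e = sum(wi for wi, mi in zip(w, misclassified) if mi)  # weighted error e(m)
    assert 0 < e < 0.5, "AdaBoost.M1 assumes a weighted error below 0.5"
    # downweight the correctly classified cases by e/(1 - e) ...
    w_new = [wi * (e / (1 - e)) if not mi else wi
             for wi, mi in zip(w, misclassified)]
    # ... then renormalize so the weights again sum to 1
    total = sum(w_new)
    return [wi / total for wi in w_new]

# Example: 4 cases with equal weight 1/4; only case 0 is misclassified,
# so e(m) = 0.25, and case 0's weight grows while the others shrink.
w = update_weights([0.25] * 4, [1, 0, 0, 0])
```

Note that after the update the misclassified cases always carry total weight 1/2, which is what forces the next fitted classifier to attend to them.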

The fact that the cases misclassified on a given step are given increased weight on the next step typically leads to a different fit of the base learner, one which correctly classifies some of the cases misclassified on the previous step, contributing to a low correlation between the predictions of one step and the next. Amit, Blanchard, and Wilder (1999) (see my book chapter for information about the references) provide some details which indicate that the AdaBoost algorithm attempts to produce low correlation between the predictions of the fitted base classifiers, and some believe that this phenomenon is an important factor in the success of AdaBoost.

Assuming that the weighted error rate for each step is no greater than 0.5, the boosting procedure creates M classifiers using the iteratively adjusted weights. Values such as 10 and 100 have been used for M, but in many cases it will be better to use a larger value, such as 200, since performance often continues to improve as M approaches 200 or 400, and it is rare for performance to suffer from overfitting if M is made too large (although overfitting can occur if there is a lot of noise). The M classifiers are then combined to obtain a single classifier to use. If, on any step, the error rate exceeds 0.5 (which should not occur with only two classes, but could occur if there are more than two classes), the iterations are terminated, and only the classifiers having error rates less than 0.5 are combined. In either case, the classifiers are combined by using weighted voting to assign the predicted class for any input x, with the classifier created on the mth step being given weight log((1 - e(m))/e(m)). This gives the greatest weights to the classifiers having the lowest weighted resubstitution error rates. Increasing the number of iterations for AdaBoost.M1 drives the training error rate of the composite classifier to zero. The error rate for an independent test sample need not approach zero, but in many cases it can get very close to the smallest possible error rate.
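Putting the pieces together, the whole AdaBoost.M1 loop and the final weighted vote can be sketched as below. The weak learner here is a crude exhaustive-search stump, and all names are illustrative rather than taken from any library:

```python
import math

def fit_weighted_stump(X, y, w):
    """Fit a two-node tree (stump) to weighted data; labels are -1 or +1."""
    best = None  # (weighted error, feature, threshold, sign)
    for j in range(len(X[0])):
        for thr in sorted({row[j] for row in X}):
            for sign in (+1, -1):
                err = sum(wi for xi, yi, wi in zip(X, y, w)
                          if (sign if xi[j] <= thr else -sign) != yi)
                if best is None or err < best[0]:
                    best = (err, j, thr, sign)
    err, j, thr, sign = best
    return err, (lambda x: sign if x[j] <= thr else -sign)

def adaboost_m1(X, y, M):
    N = len(X)
    w = [1.0 / N] * N                  # every case starts with weight 1/N
    ensemble = []                      # pairs (log((1 - e)/e), classifier)
    for _ in range(M):
        e, clf = fit_weighted_stump(X, y, w)
        if e >= 0.5:                   # terminate if the weak learner is too weak
            break
        if e == 0:                     # a perfect classifier: keep it and stop
            ensemble.append((1.0, clf))
            break
        ensemble.append((math.log((1 - e) / e), clf))
        # downweight the correctly classified cases by e/(1 - e), renormalize
        w = [wi * (e / (1 - e)) if clf(xi) == yi else wi
             for wi, xi, yi in zip(w, X, y)]
        total = sum(w)
        w = [wi / total for wi in w]
    def predict(x):                    # weighted vote over the ensemble
        score = sum(alpha * clf(x) for alpha, clf in ensemble)
        return +1 if score >= 0 else -1
    return predict

# Usage: the positive class sits in the middle of the predictor's range,
# so no single stump is perfect, but a few boosting rounds fit it exactly.
X = [[1], [2], [3], [4], [5], [6]]
y = [-1, -1, +1, +1, -1, -1]
predict = adaboost_m1(X, y, M=3)
```

The example is deliberately tiny: each round's stump misclassifies a different pair of cases, and the log((1 - e)/e) voting weights combine the three imperfect stumps into a composite classifier with zero resubstitution error.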

Breiman's discussion in Friedman et al. (2000) makes the point that after a large number of iterations, if one examines the weighted proportion of times that each of the learning sample cases has been misclassified, the values tend to be all about the same. So instead of stating that AdaBoost concentrates on the hard to classify cases, as some have done, it is more accurate to state that AdaBoost places (nearly) equal importance on correctly classifying each of the cases in the learning sample. Hastie, Tibshirani, and Friedman (HTF) show that AdaBoost.M1 is equivalent to forward stagewise additive modeling using an exponential loss function.
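This equivalence can be written down concisely. Coding the two classes as -1 and +1, letting G_m denote the classifier fitted on the mth step, and using the voting weights defined above, the composite classifier is the sign of an additive fit built greedily to minimize the exponential loss:

```latex
% Exponential loss view of AdaBoost.M1 (classes coded as y \in \{-1, +1\})
L(y, f(x)) = \exp\bigl(-y\, f(x)\bigr),
\qquad
f(x) = \sum_{m=1}^{M} \alpha_m G_m(x),
\qquad
\alpha_m = \log\frac{1 - e(m)}{e(m)}.
```

(HTF's stagewise coefficients are actually half of these voting weights, but since a positive constant does not change the sign of f(x), the resulting classifications are the same.)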


AdaBoost.M2 is equivalent to AdaBoost.M1 if there are just two classes, but appears to be better than AdaBoost.M1 if there are more than two classes (because for AdaBoost.M1 to work well the weak learners should have weighted resubstitution error rates less than 0.5, and this may be rather difficult to achieve if there are many classes). The weak learners used with AdaBoost.M2 assign to each case a set of plausibility values for the possible classes. They also make use of a loss function that can assign different penalties to different types of misclassifications. The loss function values used are updated for each iteration to place more emphasis on avoiding the same types of misclassifications that occurred most frequently in the previous step. After the iterations have been completed, for each x of interest the sequence of classifiers is used to produce a weighted average of the plausibilities for each class, and the class corresponding to the largest of these values is taken to be the prediction for x. (See Freund and Schapire (1996) and Freund and Schapire (1997) for details pertaining to AdaBoost.M2.)


when and why boosting works

As to how boosting typically achieves such a low misclassification rate, the opinions are many and varied. I'll just give a little bit of information here, and let you go to my book chapter for additional information and references.

Friedman et al. (2000) point out that when boosting is used with a rather simple base classification method which is stable but tends to produce highly biased classifiers, such as stumps, its success in some situations (e.g., if one variable is a dominant predictor) is due much more to bias reduction than to variance reduction. This makes boosting fundamentally different from bagging, despite the fact that both methods combine classifiers based on various perturbations of the learning sample.

When used with a rather complex base classification method that has low bias, but is unstable, such as carefully grown and pruned trees, boosting (and arcing in general), like bagging, in many cases can dramatically lower the error rate of the classification method. Stated simply, this is due in part to the fact that unstable methods have relatively large variances, and boosting decreases the variance without increasing the bias.

In some settings, and with some methods, boosting doesn't work very well. Breiman (1998) shows that when applied to linear discriminant analysis (LDA), which is a fairly stable method of classification, neither bagging nor boosting has any appreciable effect. In some cases boosting can make a classifier appreciably worse, but this does not seem to happen in a large percentage of the settings which have been investigated, and when it does happen, a small learning sample size may be a contributing factor. Several studies have indicated that boosting can be outperformed when the classes have significant overlap. Nevertheless, it is often the case that some form of boosted classifier can do nearly as well as, or better than, any other classifier.

Although it does not seem to be fully understood why boosting works as well as it does, there is no question about the fact that it can work very well. In various comparisons with bagging, boosting generally, but not always, does better, and leading researchers in the field of machine learning tend to favor boosting. (It should be noted that some of the comparisons of boosting to bagging are a bit misleading because they use stumps as the base classifier for both boosting and bagging, and while stumps, being weak learners, can work very well with boosting, they can be too stable and have too great a bias to work really well with bagging. Trees larger than stumps tend to work better with bagging, and so a fairer comparison would be to compare bagging fairly large trees, boosting stumps, and boosting slightly larger (than a stump) trees.)


what to boost

Hastie et al. (2001) claim that using trees with between four and eight terminal nodes works well in most cases, and that performance is fairly insensitive to the particular choice made from this range. Friedman, Hastie, and Tibshirani (2000) give an example in which boosting stumps does appreciably worse than just using a single large tree, but boosting eight-node trees produces an error rate less than 25 percent of the error rate of the single large tree, showing that the choice of base classifier can make a large difference, and that the popular choice of boosting stumps can be far from optimal. However, they also give another example for which stumps are superior to larger trees. As a general strategy, one might choose stumps if it is suspected that the effects are additive, and choose larger trees if one anticipates interactions for which adjustments should be made.

It's interesting that smallish trees tend to work rather well with boosting, because they can be unstable in addition to being highly biased. The instability is due to the fact that with a lot of predictors, one variable is sometimes chosen over another to be the splitter in a very close call, so that data sets which are only slightly different can produce very different trees. (In some situations a stump can be stable, and in others it can have high variance.) The high bias is due to the small number of nodes: often the decision boundary cannot be complex enough to result in a low-bias classifier. Perhaps the instability encourages a wide variety of trees in the ensemble, and the adjustable case weights serve to reduce the bias, while the averaging of the results reduces the variance. (It may be that the failure of boosting to work well with LDA is due to the great stability of classification by LDA: maybe the classifiers in the ensemble are generally too similar to one another.)


bag-boosting

With bag-boosting, one boosts bagged stumps (or perhaps some other type of tree). This is an extremely computer-intensive procedure.