Some Notes Pertaining to Ch. 17 of E&T



This chapter focuses on estimation of prediction error. (Previously, the focus has been on methods for estimating measures related to the accuracy of estimators of distributional quantities.) Good estimates of prediction error (aka generalization error) are important for model selection --- in choosing a model from amongst a set of possibilities, it seems sensible to choose one which makes good predictions. E&T consider a number of different methods for estimating prediction error, including cross-validation, and give some information about their strengths and weaknesses.

In regression settings, the prediction error is usually taken to be the mean squared prediction error (MSPE), given by (17.1), but occasionally some other measure, such as the mean absolute prediction error, is used. In (17.1), we take the first of the two y values (the one without the hat) to be a random variable (it would have been better to use Y instead of y) corresponding to a new observation of the same phenomenon the data arose from. If the original data can be viewed as having been iid observations from some multivariate distribution (i.e., the values of the predictor variables come along with the random observation of the response variable), then the second y value (the one with the hat) in (17.1) is also random, since the randomly observed predictor values are plugged into the equation representing the fitted model (which was obtained using the original data). If the original data is due to a designed experiment in which the predictor values were fixed, then things are less clear. One possibility is to suppose that a new observation will be made using predictor values equal to those of a randomly selected experimental unit, choosing from the set of n experimental units actually used.
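
For reference, with squared-error loss the quantity being estimated has the general form below (my notation, not necessarily E&T's exact symbols in (17.1)):

    \mathrm{MSPE} \;=\; E\!\left[\left(y_{\mathrm{new}} - \hat{y}_{\mathrm{new}}\right)^{2}\right],

where the expectation is over the new observation (and, depending on the setting, over the randomness in the fitted model as well).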

In classification settings, the prediction error is usually taken to be the expected misclassification cost. In the commonly occurring situation in which all types of misclassification are assigned unit loss, the expected misclassification cost is often referred to as the expected misclassification rate (or just misclassification rate), which is just the probability that a random new case will be misclassified by the classifier constructed from the data. (In class I'll briefly discuss how classifiers are partitions of the measurement space of the predictor variables. This web page from the class I taught last summer, based on the book by Hastie, Tibshirani, and Friedman (HTF), contains some information about classification.)
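
As a small illustration (not from E&T; the labels and cost matrix are made up), here is a minimal Python sketch of estimating expected misclassification cost from cases with known labels; with unit costs it reduces to the misclassification rate.

    import numpy as np

    def estimated_misclassification_cost(true_labels, predicted_labels, cost=None):
        """Average cost of the predictions; with unit costs this is the misclassification rate."""
        true_labels = np.asarray(true_labels)
        predicted_labels = np.asarray(predicted_labels)
        if cost is None:
            # unit loss for every type of misclassification
            return np.mean(true_labels != predicted_labels)
        # cost[i, j] = cost of predicting class j when the true class is i
        return np.mean([cost[i, j] for i, j in zip(true_labels, predicted_labels)])

    # toy example: two classes, unit loss
    y_true = np.array([0, 1, 1, 0, 1])
    y_pred = np.array([0, 1, 0, 0, 0])
    print(estimated_misclassification_cost(y_true, y_pred))   # 0.4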



Sec. 17.2 uses the hormone data from Ch. 9 to create a simple setting in which good estimates of prediction error could be used for model selection. The issue is whether a simple regression model using time (the number of hours worn) as the single predictor is better or worse than the more complex analysis of covariance model which uses the factor lot as a predictor in addition to the numerical variable time.

If we use the average of the squared residuals, given by (17.3) on p. 239, as a crude estimate of the MSPE for each model, the more complex model "wins." But each of the resubstitution estimates of prediction error (aka apparent error estimates) could be optimistic, and thus misleadingly low, and furthermore the simpler model, with the larger resubstitution estimate of prediction error, can sometimes be the better predicting model.
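
A small numerical sketch (simulated data, not the hormone data) of why the resubstitution estimate (17.3) always favors the larger model: the average squared residual cannot increase when columns are added to a least squares fit, even when the added term is pure noise.

    import numpy as np

    rng = np.random.default_rng(0)
    n = 27
    time = rng.uniform(50, 400, size=n)               # stand-in for hours worn
    lot = rng.integers(0, 3, size=n)                  # stand-in for the lot factor (3 lots)
    y = 34 - 0.05 * time + rng.normal(0, 3, size=n)   # response; lot has no real effect here

    def apparent_error(X, y):
        """RSE/n: the average squared residual from a least squares fit, as in (17.3)."""
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)
        resid = y - X @ beta
        return np.mean(resid ** 2)

    X_simple = np.column_stack([np.ones(n), time])                 # p = 2
    X_ancova = np.column_stack([np.ones(n), time,
                                lot == 1, lot == 2])               # p = 4 (two lot dummies)
    print(apparent_error(X_simple, y))   # always >= the value below
    print(apparent_error(X_ancova, y))   # smaller, even though lot is pure noise here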

That each resubstitution estimate should be considered to be optimistic follows from the fact that the models are fit using least squares. It's safe to assume that the estimated parameter values are not exactly correct, but they are the values that minimize the sum of the squared residuals (called the residual squared error (RSE) in E&T), and so the sum of the squared differences between the observed responses and their corresponding true mean values will be greater than the sum of the squared residuals. Replacing the fitted values in (17.3) by the true mean values would give us an estimate from an unbiased estimator of the MSPE, and so using (17.3) as is corresponds to using an estimate which is smaller than that which would result if an unbiased estimator were used.

Dividing RSE by n - p instead of n would partially correct for the bias. (I think it's better to state that p is the number of unknown parameters of the linear model (not counting any parameters associated with the error term) than to state that p is the number of predictor variables, as E&T do on p. 239. So for the simple linear regression model p = 2, and for the analysis of covariance model p = 4.) Viewed as a random variable, this adjusted version of (17.3) would be an unbiased estimator of the error term variance, but even if the fitted linear model gives unbiased estimates of the mean value of the response (given the predictor values), this web page (from the class I taught last summer) shows that the MSPE is larger than the error term variance, and so we still don't have an unbiased estimator for the MSPE. Plus, it should be noted that in many cases the model used won't give us an unbiased estimator for the mean response (given the predictor values).
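
To make the last point concrete, for a correctly specified linear model fit by least squares, with a new independent response observed at each of the original n design points, standard calculations (not specific to E&T) give

    E\!\left[\tfrac{\mathrm{RSE}}{n}\right] = \sigma^2\!\left(1 - \tfrac{p}{n}\right), \qquad
    E\!\left[\tfrac{\mathrm{RSE}}{n-p}\right] = \sigma^2, \qquad
    E[\text{average squared prediction error}] = \sigma^2\!\left(1 + \tfrac{p}{n}\right),

so the apparent error is optimistic by 2p\sigma^2/n on average, and even the unbiased estimator of the error term variance falls short of the prediction error by p\sigma^2/n.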



Sec. 17.3 focuses on cross-validation, but first covers the concept of a test sample estimate of prediction error. A test sample is a sample of complete cases that come from the same source as the cases you wish to predict with the fitted model (except that the response values will be unknown for the cases you actually want predictions for), but were not used to fit the model. If one plugs the predictor values of the test sample cases into the fitted model, and compares the predictions obtained with the known response values of the test sample cases --- using the average of the squared differences for regression models and the proportion of misclassified cases for classifiers --- then one has an estimate of prediction error that comes from an unbiased estimation method.
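
Here is a minimal sketch of the test sample estimate for a regression model, using simulated data and holding out roughly 1/3 of the cases: fit on the training cases, predict the test cases, and average the squared differences.

    import numpy as np

    rng = np.random.default_rng(1)
    n = 90
    x = rng.uniform(0, 10, size=n)
    y = 2 + 1.5 * x + rng.normal(0, 2, size=n)

    # hold out roughly 1/3 of the complete cases as a test sample
    test = rng.permutation(n) < n // 3
    train = ~test

    X = np.column_stack([np.ones(n), x])
    beta, *_ = np.linalg.lstsq(X[train], y[train], rcond=None)   # fit on training cases only

    pred = X[test] @ beta
    test_sample_mspe = np.mean((y[test] - pred) ** 2)            # unbiased estimation method
    print(test_sample_mspe)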

A bad thing about using a test sample is that any available complete cases that are set aside for the test sample are ones which cannot be used to fit the model. Typically, the more data used to fit a model the better the fit is expected to be, and so often one hates to set aside a portion of the data just for the purpose of obtaining good estimates of prediction error to use for model selection. (In some cases where the data is plentiful, test samples are used, with a typical fraction of the available complete cases used for the test sample being about 1/3. With regression models, sometimes the remaining 2/3 of the available data is still sufficient to fit a model well enough so that most of the prediction error is due to the irreducible error of the error term.)

Cross-validation addresses the bad point referred to above by allowing each available complete case to be used to obtain the fit of a model, while still arriving at an estimate of prediction error using a method that is nearly unbiased. K-fold cross-validation is described on p. 240 of E&T, and I'll go over this material in class. Values commonly used for K are 5 and 10. E&T indicate that 2 is a value which is commonly used, but given today's good computers, I don't see 2 being used very often. Typically, the estimated prediction error which results from 10-fold cross-validation is a bit greater than it should be because each prediction used in the error estimate comes from a fit based on only 90% of the data that is used to fit the final model.
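
A bare-bones sketch of K-fold cross-validation for a least squares regression fit (simulated data, K = 10): each case is predicted from a fit that excludes its fold, and the squared prediction errors are averaged.

    import numpy as np

    rng = np.random.default_rng(2)
    n, K = 100, 10
    x = rng.uniform(0, 10, size=n)
    y = 2 + 1.5 * x + rng.normal(0, 2, size=n)
    X = np.column_stack([np.ones(n), x])

    folds = np.array_split(rng.permutation(n), K)   # randomly partition the cases into K folds
    sq_errors = np.empty(n)
    for fold in folds:
        train = np.setdiff1d(np.arange(n), fold)
        beta, *_ = np.linalg.lstsq(X[train], y[train], rcond=None)  # fit without the fold
        sq_errors[fold] = (y[fold] - X[fold] @ beta) ** 2           # predict the held-out cases
    cv_estimate = np.mean(sq_errors)
    print(cv_estimate)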

If n is not too large, sometimes n-fold cross-validation is done. With least squares fits of linear models, some clever mathematics eliminates the need to do the cross-validation explicitly. The leave-one-out residuals are sometimes referred to as PRESS (for PREdiction Sum of Squares) residuals. They can be used to create a PRESS statistic which can be used to compare different models obtained from the same data.
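
The clever mathematics referred to above is the identity that, for a least squares fit, the leave-one-out residual for case i equals the ordinary residual divided by 1 - h_ii, where h_ii is the ith diagonal element of the hat matrix, so no refitting is needed. A sketch (simulated data for the usage example):

    import numpy as np

    def press_statistic(X, y):
        """PRESS = sum of squared leave-one-out (PRESS) residuals, computed without refitting."""
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)
        resid = y - X @ beta
        H = X @ np.linalg.solve(X.T @ X, X.T)        # hat matrix (fine for modest p)
        h = np.diag(H)
        press_resid = resid / (1 - h)                # leave-one-out (PRESS) residuals
        return np.sum(press_resid ** 2)

    # example with simulated data; PRESS/n is the n-fold cross-validation estimate of MSPE
    rng = np.random.default_rng(5)
    x = rng.uniform(0, 10, size=40)
    y = 1 + 2 * x + rng.normal(0, 1, size=40)
    X = np.column_stack([np.ones(40), x])
    print(press_statistic(X, y) / 40)

Smaller PRESS values favor a model when comparing different models fit to the same data.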



Sec. 17.4 covers Cp (given by (17.8) on p. 242 of E&T) and some similar estimates of prediction error. The adjusted residual squared error, given by (17.7), makes a more severe adjustment to correct for bias than do RSE/n ((17.3) on p. 239) and RSE/(n-p) (described in the last paragraph of Sec. 17.2), which are both biased low.

The Cp statistic adjusts the apparent error, RSE/n, to make it approximately unbiased for the true MSPE. (Note: Despite this last fact, a few small studies that I've done indicate that Cp can be quite unreliable if the sample size is smallish.) A Taylor series argument can be used to show that (17.7) and (17.8) are equivalent to a first order of approximation, but since (17.7) doesn't seem to be used much, and (17.8) can be better justified, I would choose to use (17.8) over (17.7).
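
A sketch of the Cp calculation in the spirit of (17.8): the apparent error plus a penalty of 2 p sigma2_hat / n, with sigma2_hat estimated once and reused for every model being compared. (I'm writing the formula from memory; check (17.8) for E&T's exact expression.)

    import numpy as np

    def cp_estimate(X, y, sigma2_hat):
        """E&T-style Cp: apparent error RSE/n plus the penalty 2*p*sigma2_hat/n."""
        n, p = X.shape
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)
        rse = np.sum((y - X @ beta) ** 2)
        return rse / n + 2 * p * sigma2_hat / n

    # sigma2_hat should come from one fixed, roughly unbiased fit (e.g., RSE/(n - p)
    # from the largest model under consideration) and be used for all Cp values.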

Cp is frequently used and relied on by many users of ordinary least squares multiple regression. But not all Cp proponents use it the same way, and some rely on Cp values that are not obtained in the best way. Some use Cp to identify the simplest model which is nearly unbiased, whereas others (including me, when I use Cp) use it to identify the model which should give the best predictions. When comparing various models using Cp it is important that the same estimate of the error term variance is used for all of the Cp values. This web page (from the class I taught last summer) contains some information about Cp and its uses. (It's important to note that E&T do not use the usual definition for Cp --- theirs is not the same as, but is equivalent to, Mallows' Cp as it is usually expressed in regression books.)

Cp is a special case of Akaike's information criterion (AIC). (Note: I won't attempt to cover AIC in general at this point.) It adds a "penalty" to the apparent error in an attempt to adjust for the fact that the more terms one uses in a model the more optimistic the apparent error is. The Bayesian information criterion (BIC) (aka Schwarz's criterion), given by (17.9), is very similar to Cp, but it applies a larger penalty for each term used in the model. Except in really small sample size situations, BIC penalizes each term more than does Cp.
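
Writing both criteria in the apparent-error-plus-penalty form that I believe E&T use in (17.8) and (17.9), the per-model penalties are

    C_p:\ \frac{2\,p\,\hat{\sigma}^2}{n} \qquad\text{versus}\qquad \mathrm{BIC}:\ \frac{(\log n)\,p\,\hat{\sigma}^2}{n},

so BIC imposes the larger penalty whenever log n > 2, i.e., for n >= 8.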

Compared to cross-validation, Cp and BIC are simpler ways to estimate prediction error, but in a lot of situations they aren't as useful or as good. With some newer methods for doing regression, like tree-structured regression (e.g., as done by CART), adaptive splines (e.g., as done by MARS), local regression, projection pursuit regression, and neural network regression, the number of parameters estimated doesn't have a firm meaning. For example, with tree-structured regression, each prediction may only involve a few variables, but since the same variables aren't used for all predictions, a lot of different variables can be used in all. Furthermore, to arrive at the variables used, and the way they are used, many variables may have been considered, and any time many variables are considered for use, the effective number of parameters is greater than the number actually used in the final model. It's also the case that to use Cp and BIC, one has to have a good estimate of the error term variance. While sometimes this can be obtained by using the residuals from a slightly overfit model, at other times one may not have a decent enough model to use. If an appreciably biased model is used, the bias will inflate the estimate of the error term variance, and the Cp and BIC statistics will penalize having a lot of terms way too much.



Sec. 17.5 explains how cross-validation can be used to estimate the expected misclassification cost for a tree-structured classifier constructed using CART, and how cross-validation can be used to select the final classifier from a sequence of candidates. While the use of cross-validation with CART provides a good example to illustrate cross-validation in a nontrivial application, it's somewhat unfortunate that constructing a classification tree with CART is a rather complex procedure. Since it would take about 2.5 to 3 hours to explain CART somewhat thoroughly, in the interest of making the best use of the time during our last class meeting prior to the exam, I will give a simplified presentation of how CART works in class. Those wanting more details can read a book chapter that I wrote which covers CART. One can also go through a presentation about CART which I have given to numerous classes and seminar groups over the years. (Note: There is a mistake in the corner of p. 23 --- it should be 3*99 = 297 instead of (99)^3. Also, on the very bottom of p. 40, and on p. 42, I express concern about tie situations that could occur in the pruning process. However, now I know that tie situations can be easily dealt with --- the key is that more than a single pair of nodes can be pruned at one time.) Additional insight may be obtained by going through this Interactive Walkabout for CART. (It takes 30 minutes or so to go through it carefully.) (Note: Here is the home page for Salford Systems --- the company that sells CART, MARS, TreeNet, and other products.)

Here are some specific comments about the material in Sec. 17.5.

Sec. 17.6 and Sec. 17.7 deal with bootstrap estimates of prediction error. Given the lack of time as we near the end of the course, I won't go into the grubby details, but rather I'll just give an overview of how bootstrapping can be used to estimate prediction error.

A real-world estimate of prediction error could be based on how well the model created using the original real-world data predicts the response values of an independent real-world data set. If we set aside a separate test sample, this simple scheme could be applied to get an estimate of prediction error. If we don't want to set aside a portion of the available data to serve as an independent test sample, we can use bootstrapping as follows. We use all of the available real-world data to obtain the empirical distribution --- the bootstrap-world estimate of the unknown real-world distribution. From the bootstrap-world distribution, we can obtain many bootstrap samples and construct a model from each one. We can also generate independent test samples and obtain estimated prediction errors for the bootstrap-world models. The average of these estimates for the bootstrap-world models can serve as the estimated prediction error for the model constructed from the real-world data.

Since the prediction error we're interested in is an expected value (e.g., the mean squared prediction error), rather than draw independent test samples to estimate this expected value in the bootstrap world, we should simply use the original data as the test sample, since it gives equal weight to each possible observation from the bootstrap-world distribution. Although this simple translation of the test sample approach to the bootstrap world may seem to be without snags, it doesn't perform very well. Even though each step of the procedure in the bootstrap world faithfully mimics the real world (ideal) test sample approach, we wind up applying a test sample in the bootstrap world (the original data set) which typically has tremendous overlap with the sample from which the bootstrap-world model is constructed. (On the average, a bootstrap sample will contain roughly 63% of the original observations at least once, since the chance that a given observation is left out of a bootstrap sample is (1 - 1/n)^n, which is approximately e^(-1) = 0.368.) This overlap may cause the prediction error estimates for overfit models to be too low.
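
A sketch of the simple bootstrap estimate just described, using simulated data: fit a model to each bootstrap sample, evaluate it on the full original data set, and average. The overlap between each bootstrap sample and the "test sample" is what makes this estimate too optimistic for overfit models.

    import numpy as np

    rng = np.random.default_rng(3)
    n, B = 50, 200
    x = rng.uniform(0, 10, size=n)
    y = 2 + 1.5 * x + rng.normal(0, 2, size=n)
    X = np.column_stack([np.ones(n), x])

    errs = np.empty(B)
    for b in range(B):
        idx = rng.integers(0, n, size=n)                  # a bootstrap sample of the cases
        beta, *_ = np.linalg.lstsq(X[idx], y[idx], rcond=None)
        errs[b] = np.mean((y - X @ beta) ** 2)            # original data as the test sample
    simple_boot_estimate = np.mean(errs)
    print(simple_boot_estimate)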

Another approach is to estimate the optimism associated with the crude apparent estimate of error, and adjust the apparent estimate accordingly --- that is, bias-correct the apparent estimate. If we plug the observations in a bootstrap sample back into the model created from that bootstrap sample, we get the bootstrap-world analog of the apparent error estimate in the real world. Such values are given in the 2nd column of Table 17.1 on p. 248. The values in the first column of Table 17.1 should be considered superior estimates of error since they use the actual distribution underlying the bootstrap world instead of the empirical distribution of just a single bootstrap sample. The difference between the average of the column 1 values and the average of the column 2 values provides us with an estimate of the optimism of the apparent error. In the real world, the true error of interest (using the notation indicated by Table 17.1) is err(x,F). The relationship between it and the real-world estimate of apparent error is analogous to the relationship between the 1st and 2nd columns of Table 17.1. So if we apply the estimate of optimism obtained from Table 17.1 to the apparent error of the real world we get an optimism-adjusted estimate of error.
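
A sketch of the optimism-correction idea on simulated data: for each bootstrap sample, compute the column 1 quantity (the bootstrap-sample model's error on the original data) and the column 2 quantity (its apparent error on its own bootstrap sample), average the differences to estimate the optimism, and add that to the real-world apparent error.

    import numpy as np

    rng = np.random.default_rng(3)
    n, B = 50, 200
    x = rng.uniform(0, 10, size=n)
    y = 2 + 1.5 * x + rng.normal(0, 2, size=n)
    X = np.column_stack([np.ones(n), x])

    def fit_and_apparent_error(Xs, ys):
        beta, *_ = np.linalg.lstsq(Xs, ys, rcond=None)
        return beta, np.mean((ys - Xs @ beta) ** 2)

    beta_full, apparent_err = fit_and_apparent_error(X, y)       # real-world apparent error

    optimism = np.empty(B)
    for b in range(B):
        idx = rng.integers(0, n, size=n)                         # bootstrap sample
        beta_b, app_b = fit_and_apparent_error(X[idx], y[idx])   # column 2 quantity
        err_on_original = np.mean((y - X @ beta_b) ** 2)         # column 1 quantity
        optimism[b] = err_on_original - app_b
    optimism_corrected = apparent_err + np.mean(optimism)
    print(optimism_corrected)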

The .632 bootstrap estimate of error is another attempt to adjust for the optimism of the apparent error. A flaw with the original straightforward bootstrap estimate of prediction error (the one based on using the bootstrap world to approximate the real-world test sample estimate) is the great overlap between the set of observations serving as the bootstrap-world test sample and the observations in the bootstrap sample used to obtain a fitted model, and so an approach to correcting for this problem would be to not use the part of the bootstrap-world test sample which overlaps with a particular bootstrap sample in estimating how well the model based on that bootstrap sample makes predictions. (It can be said that we should just use the out-of-sample observations to obtain the estimated prediction error.) It can be shown that using this idea to make adjustments in a very straightforward manner overcorrects for the optimism of the apparent error. It turns out that scaling down the seemingly obvious adjustment is the correct thing to do, and using 0.632 as the scaling factor leads to the .632 estimator given by (17.24) on p. 253. (Note: A precise theoretical explanation for the 0.632 scaling factor is not given by E&T. Nor is it given in a more advanced book written by Hastie, Tibshirani, and Friedman. It's not a simple justification, and I won't attempt it here.)
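
A sketch of the .632 estimator in the form I believe (17.24) takes: 0.368 times the apparent error plus 0.632 times the out-of-sample error, where the latter averages each case's squared prediction errors over only those bootstrap-sample fits that excluded the case. (Simulated data; check (17.24) for E&T's exact expression.)

    import numpy as np

    rng = np.random.default_rng(4)
    n, B = 50, 200
    x = rng.uniform(0, 10, size=n)
    y = 2 + 1.5 * x + rng.normal(0, 2, size=n)
    X = np.column_stack([np.ones(n), x])

    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    apparent_err = np.mean((y - X @ beta) ** 2)

    sq_err_sum = np.zeros(n)     # accumulated out-of-sample squared errors per case
    out_count = np.zeros(n)      # number of bootstrap fits that excluded each case
    for b in range(B):
        idx = rng.integers(0, n, size=n)
        out = np.setdiff1d(np.arange(n), idx)                    # cases left out of this sample
        beta_b, *_ = np.linalg.lstsq(X[idx], y[idx], rcond=None)
        sq_err_sum[out] += (y[out] - X[out] @ beta_b) ** 2
        out_count[out] += 1
    eps0 = np.mean(sq_err_sum[out_count > 0] / out_count[out_count > 0])
    err_632 = 0.368 * apparent_err + 0.632 * eps0
    print(err_632)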