Some Notes Pertaining to Ch. 9 of E&T



Let me know if you have any questions about the basic regression material given on p. 106 of E&T --- I may go over this stuff rather quickly in class, or skip it completely if there aren't any questions. However, if you haven't had STAT 656, or a course similar to it, don't worry about (9.9) on p. 106 and (9.10) on p. 107. (Note: E&T take vectors to be row vectors, while in most regression books they are column vectors. E&T's β vector is a column vector because it's defined using a transpose (see first line of p. 106). This explains why (9.3) on the bottom of p. 105 doesn't contain a transpose. Since E&T set things up differently from most regression books, some of their formulas may appear to be at odds with those you can find in other books.)

Note that E&T use p for the number of unknown parameters, including the intercept if there is one. (Sec. 9.6 considers two models not having intercepts.) Even though there is just a single explanatory variable, p could be 3; e.g., if the model were a quadratic polynomial regression model including an intercept. (In such a case we can say that the explanatory variable and its square are two different predictors/regressors.)
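To see how a single explanatory variable can give p = 3, here is a small R sketch (the data are made up purely for illustration):

```r
# Quadratic polynomial regression: one explanatory variable, but p = 3
# unknown parameters (intercept, linear coefficient, quadratic coefficient).
set.seed(1)
x <- 1:20
y <- 2 + 0.5*x - 0.03*x^2 + rnorm(20, sd = 0.5)  # fabricated data
fit <- lm(y ~ x + I(x^2))
length(coef(fit))  # 3, so p = 3 even with a single explanatory variable
```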



Sec. 9.3 deals with a regression setting for which there are two potential explanatory variables, z, which is numerical, and L, which is a categorical variable having three categories. (Note: I'm using the notation given at the top of p. 108 of E&T.) Here is a data file I created which contains the hormone data from Table 9.1 on p. 107 of E&T.

As a first step, E&T fit the simple regression model given by (9.11) on p. 108. The first portion of this R code pertains to this simple regression model --- results in agreement with Table 9.2 on p. 110 of E&T, as well as plots similar to those in Fig. 9.1, can be obtained with the code. (Notes: (1) I'm not going to go over why (9.17) is correct, since it would take too much time to derive it. It's one of the many results covered in STAT 656. (2) One doesn't need to use matrix results to arrive at formulas for the standard errors of the coefficient estimators. In STAT 554 I derive the variance of the slope estimator in a simple regression model without using any matrix results.)
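The core of such a fit is short. The sketch below assumes the hormone data have been read into a data frame named hormone with columns amount and hours (these names are my guesses based on the description of Table 9.1, not something taken from E&T's code):

```r
# Simple regression fit along the lines of (9.11) on p. 108.
# 'hormone' is assumed to be a data frame with columns amount and hours.
fit1 <- lm(amount ~ hours, data = hormone)
# Estimates and their formula-based standard errors, as in Table 9.2:
summary(fit1)$coefficients
```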

E&T also consider an analysis of covariance model in Sec. 9.3 --- they use both the numerical variable, z (hours), and the factor, L (lot), to explain the observed values of the response variable (amount). (Note: One can deal with such a set of explanatory variables using multiple regression methods, using "dummy variables" to account for the lot differences. But some software packages allow you to bypass the step of forming the dummy variables yourself, and allow you to specify models having a mixture of numerical variables and categorical factors as explanatory variables.) This R code also fits the analysis of covariance model and produces results shown in Table 9.3 on p. 110 of E&T.
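In R, specifying the mixture of a numerical variable and a factor is straightforward, since lm creates the dummy variables for a factor automatically. A minimal sketch, again assuming a data frame hormone with columns amount, hours, and lot:

```r
# Analysis of covariance fit: numerical variable (hours) plus factor (lot).
# Making lot a factor tells lm() to build the dummy variables itself.
hormone$lot <- factor(hormone$lot)
fit2 <- lm(amount ~ hours + lot, data = hormone)
summary(fit2)  # estimates and standard errors, as in Table 9.3
```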



Sec. 9.4 explains how to bootstrap the residuals to obtain estimates of the standard errors of the coefficient estimators. It can be noted that the "approximate errors" given by (9.23) on p. 111 of E&T are just the residuals from the regression fit done on the original data, and the first term on the right hand side of the equation given in (9.26) on p. 111 is just the ith fitted value (aka predicted value) from the regression fit done on the original data. (Note: I think it would be more sensible to work with the analysis of covariance model rather than the simple regression model, since initial results strongly suggest that the more complex model is needed, and so the residuals from the simple regression model reflect an error term distribution of a model that just isn't adequate. But I guess E&T wanted to keep things somewhat simple when introducing the concept of resampling from the residuals to do bootstrapping.)
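The bootstrapping-the-residuals scheme of Sec. 9.4 can be sketched in a few lines of R. This is my own hedged sketch, not E&T's code; it assumes a data frame hormone with columns amount and hours, and B = 2000 is an arbitrary choice:

```r
# Bootstrapping the residuals for the simple regression model (Sec. 9.4).
set.seed(632)
fit1 <- lm(amount ~ hours, data = hormone)
res <- resid(fit1)           # the "approximate errors" of (9.23)
fitted.vals <- fitted(fit1)  # first term on the r.h.s. of (9.26)
B <- 2000
boot.coefs <- replicate(B, {
  # Build a bootstrap response: fitted values plus resampled residuals.
  y.star <- fitted.vals + sample(res, replace = TRUE)
  coef(lm(y.star ~ hormone$hours))
})
apply(boot.coefs, 1, sd)  # bootstrap estimates of the standard errors
```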

The bootstrap estimates of the standard errors converge (as the number of bootstrap samples increases) to values that we can obtain without bootstrapping, and so there is no real point in bootstrapping the residuals if one is using ordinary least squares (OLS) regression. But the fact that the method works in the OLS regression setting may give us some hope that it can work okay in other regression settings as well (i.e., if another regression method is being used). This R code also uses the bootstrapping-the-residuals method to estimate the standard errors of the simple regression coefficient estimators.



Sec. 9.5 points out that one could also bootstrap complete cases, as was done in Ch. 7. So there are (at least) two different ways to use bootstrapping in regression settings.

If the values of the predictor variables should be considered to be fixed (e.g., if they were determined by a planned experimental design) and if the error term random variables can be thought of as being iid, then bootstrapping the residuals is the proper choice. But if for each case the explanatory variables and the response can be thought of as being an observation from a multivariate distribution, and it's not clear that the error term random variables can be thought of as being iid, then bootstrapping cases is more appropriate. (When we bootstrap cases and the original data exhibits a pattern of heteroscedasticity, the bootstrap regression data sets will generally be similar in this regard.) Of course these two situations don't cover all possibilities. (The last portion of Sec. 9.7 deals with another possibility: having fixed values for the explanatory variable, but not having the error term random variables being iid --- they don't all have the same variance.) This R code also uses the bootstrapping complete cases method to estimate the standard errors of the simple regression coefficient estimators.
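Bootstrapping complete cases amounts to resampling the rows of the data frame, so that each bootstrap data set keeps the pairing between explanatory variables and response (and hence preserves any heteroscedasticity pattern). A hedged R sketch, under the same assumptions about the hormone data frame as above:

```r
# Bootstrapping complete cases: resample rows of the data frame.
set.seed(554)
n <- nrow(hormone)
B <- 2000
boot.coefs <- replicate(B, {
  d.star <- hormone[sample(n, replace = TRUE), ]  # resampled cases
  coef(lm(amount ~ hours, data = d.star))
})
apply(boot.coefs, 1, sd)  # bootstrap estimates of the standard errors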



Here is some of the data in Table 9.4 on p. 116 of E&T. (Notes: (1) The value in the last column for plate 6 in Table 9.4 is incorrect. (2) I intentionally put the data for plate 13 after the data for plate 14 because I wanted to have the questionable case at the bottom of the data file for convenience --- it makes it easier to omit the questionable case when doing analysis with R.) This R code can be used to produce some of the results given in Sec. 9.6 and Table 9.5.



Sec. 9.7 covers least median of squares (LMS) regression and the breakdown point of an estimator. I'll discuss these topics in class, as well as asymptotic breakdown points. (Note: E&T use the term breakdown instead of breakdown point; the latter, I believe, is more commonly used.)

Since the dose values used in the cell survival experiment should be considered to be fixed --- set by the investigator --- bootstrapping the residuals would be strongly favored over bootstrapping the cases if it weren't for the fact that there appears to be heteroscedasticity in the error term random variables. A method which attempts to correctly adjust for the heteroscedasticity while respecting the experimental design is described on the bottom half of p. 120 of E&T. A problem with using such a model is that it isn't clear that the adjustment for heteroscedasticity is correct. A particular problem with bootstrapping with this model, by bootstrapping the adjusted residuals, is that the adjusted residuals may provide a poor approximation of the distribution governing the ε terms due to the fit being way off (and the sample size being very small). I think using M-regression would be much better than LMS regression since it is generally a better method unless one really needs a breakdown point close to 50%. Also, M-regression doesn't make use of random numbers. This R code can be used to obtain estimated standard errors for the LMS coefficient estimators using two different bootstrap methods --- bootstrapping complete cases, and using the model given by (9.42) on p. 120 of E&T and bootstrapping the adjusted residuals. (Note: I really don't like the use of LMS regression. In order to see how unstable the LMS fitting method is, click the first Submit button you come to on this U. Minn. web page and then look at the image that's created (after you wait for R to run). The nature of the fit is highly influenced by the particular points included in each bootstrap sample.)
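For those who want to experiment, LMS fits are available in R via lqs() from the MASS package. The sketch below pairs it with case resampling; the data frame name cell and the column names logsurv and dose are my placeholders for the cell survival data, not names taken from E&T:

```r
# LMS regression via lqs() from the MASS package, with case resampling.
# Note: lqs() itself relies on random subsampling, so refitting the very
# same data twice need not reproduce the same coefficient estimates ---
# one aspect of the instability discussed above.
library(MASS)
set.seed(120)
n <- nrow(cell)  # 'cell' is assumed to hold the cell survival data
B <- 1000
boot.coefs <- replicate(B, {
  d.star <- cell[sample(n, replace = TRUE), ]
  coef(lqs(logsurv ~ dose, data = d.star, method = "lms"))
})
apply(boot.coefs, 1, sd)  # case-resampling standard error estimates
```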