Some Notes Pertaining to Ch. 16 of E&T



The last part of Ch. 15 dealt with bootstrap hypothesis tests based on bootstrap confidence intervals. This chapter covers bootstrap methods specifically designed for doing tests of hypotheses. Also, the chapter will show that bootstrap tests of hypotheses can be done in some situations for which a permutation test doesn't exist.



Sec. 16.2 considers bootstrap hypothesis testing methods for the general two-sample problem. Letting t denote the difference in sample means, an expression for the p-value is given by (16.1) on p. 220. To obtain the probability under the null hypothesis of identical distributions, we note that if the null hypothesis is true, all n + m observations came from the same distribution, and we can use the empirical distribution obtained from the combined sample to estimate the unknown common distribution. In practice, the probability for the p-value is estimated by doing bootstrap resampling. This R code can be used to carry out such a test.
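
Here is a minimal sketch of such a test (my own code, not the linked code; the data are the mouse data from p. 11 of E&T, and the seed and number of replicates are arbitrary):

  # Bootstrap two-sample test of H0: identical distributions, using the
  # difference in sample means as the test statistic (see (16.1)).
  z <- c(94, 197, 16, 38, 99, 141, 23)           # treatment group
  y <- c(52, 104, 146, 10, 51, 30, 40, 27, 46)   # control group
  t.obs <- mean(z) - mean(y)
  combined <- c(z, y)
  n <- length(z); m <- length(y)
  set.seed(1)
  t.star <- replicate(9999, {
    zstar <- sample(combined, n, replace = TRUE)  # both bootstrap samples are
    ystar <- sample(combined, m, replace = TRUE)  # drawn from the combined data
    mean(zstar) - mean(ystar)
  })
  mean(t.star >= t.obs)   # approximate p-value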

The bottom portion of p. 221 describes an alternative bootstrap test: it uses Student's two-sample t statistic instead of the simple difference in sample means. For the mouse data from p. 11 of E&T, my R code produces a p-value of 0.1415 for this bootstrap test, which is pretty close to the p-value of 0.1406 obtained from Student's t test. The simpler bootstrap test produced a p-value of 0.1266, and the permutation test from Ch. 15 produced a p-value of 0.1410 (the exact p-value obtained using StatXact). To determine whether to report a p-value of 0.14 or 0.13, a Monte Carlo study could be done to judge the accuracy and power of the tests in situations that are hopefully similar to the one being dealt with --- in the end we could report the p-value which came from the test that the Monte Carlo study indicates has the best accuracy and power characteristics. (I'll discuss this more in class. We want to consider accuracy for test sizes ranging from 0.05 to 0.25. By not simply selecting the smallest of the p-values produced by tests done on the original data, but instead going with the test that the Monte Carlo study indicates is the most appropriate, we avoid "cheating.")
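
A sketch of this studentized variant (again my own code, so with a different seed and replicate count the p-value will not match 0.1415 exactly):

  # Same resampling scheme, but with Student's two-sample t statistic.
  z <- c(94, 197, 16, 38, 99, 141, 23)
  y <- c(52, 104, 146, 10, 51, 30, 40, 27, 46)
  combined <- c(z, y); n <- length(z); m <- length(y)
  t.obs <- t.test(z, y, var.equal = TRUE)$statistic
  set.seed(1)
  t.star <- replicate(9999, {
    zstar <- sample(combined, n, replace = TRUE)
    ystar <- sample(combined, m, replace = TRUE)
    t.test(zstar, ystar, var.equal = TRUE)$statistic
  })
  mean(t.star >= t.obs)   # approximate p-value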

The last portion of Sec. 16.2 deals with using bootstrapping to do a test about the means without assuming that both distributions have the same shape. To do such a test we should not use the empirical distribution based on the combined sample to represent the null hypothesis distributions underlying the two samples. Instead we should resample from each of the two samples separately, but we need to do so in such a way that the null hypothesis is respected. To do this we can shift the two empirical distributions (based on the two observed samples) so that they each have a mean equal to the sample mean of the combined sample (so that the two shifted empirical distributions have the same mean), and resample from these shifted empirical distributions. Then we can either do a bootstrap test based on the difference in two sample means, or we can do a test based on Welch's statistic (since there is no reason to assume that the two underlying distributions have the same variance). Note that by being creative in how we do the bootstrap resampling, we can do a test about the means without assuming that the distributions have the same shape. (Note: Despite what E&T indicate in Sec. 16.3, if we're willing to make certain assumptions, we can sometimes interpret a two-sample permutation test as being such a test about the means.)

This R code also does a bootstrap test based on Welch's statistic.
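
A minimal sketch of this approach (my own code, not the linked code):

  # Bootstrap test about the means only: shift each sample so that both
  # shifted empirical distributions have mean equal to the combined sample
  # mean, resample from each separately, and use Welch's statistic.
  z <- c(94, 197, 16, 38, 99, 141, 23)
  y <- c(52, 104, 146, 10, 51, 30, 40, 27, 46)
  xbar <- mean(c(z, y))
  z0 <- z - mean(z) + xbar   # shifted treatment values
  y0 <- y - mean(y) + xbar   # shifted control values
  welch <- function(a, b)
    (mean(a) - mean(b)) / sqrt(var(a)/length(a) + var(b)/length(b))
  t.obs <- welch(z, y)
  set.seed(1)
  t.star <- replicate(9999,
    welch(sample(z0, replace = TRUE), sample(y0, replace = TRUE)))
  mean(t.star >= t.obs)   # approximate p-value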



Sec. 16.4 considers a bootstrap procedure for a one-sample test about a distribution mean.

I don't like the way E&T describe and do things on pp. 224 and 225. On p. 224 they have "want to test whether the mean of the treatment group in Table 2.1 was 129.0 as well" --- but it's clear that the sample mean of the treatment group is not 129.0, since it rounds to 86.9, and so perhaps they should have referred to the mean of the distribution underlying the treatment group. Also, on the bottom of p. 224 they state the null hypothesis as a simple hypothesis (that the mean equals 129.0) and don't state an alternative. Then on p. 225 they do a lower-tailed test without stating why they are doing a lower-tailed test. If it's because the observed sample mean is less than 129.0, then that's bad statistical practice if they hadn't decided to do a lower-tailed test before examining the data. Finally, it bugs me that they report the result given by (16.12) on p. 225, since the t test result given by (16.13) seems much more appropriate. (To assume that the t statistic has a normal distribution when the sample size is only 7 doesn't seem good.)

To do a bootstrap test, one shifts the empirical distribution to make it have a mean of 129.0 (which can be done by subtracting the sample mean and adding 129.0 to each observation). Then one resamples from the shifted empirical distribution to obtain bootstrap samples from which replicates of the test statistic (which is just Student's one-sample t statistic) are computed. The empirical distribution of these replicates serves as an estimate of the null hypothesis sampling distribution of the test statistic. The observed value of the test statistic (the t statistic computed from the observed sample) is compared to the quantiles of the estimated null distribution to obtain an approximate p-value. This R code can be used to perform the test pertaining to Sec. 16.4.
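
A minimal sketch of this procedure (my own code; the treatment data and the null value 129.0 are from E&T, while the seed and replicate count are arbitrary):

  # One-sample bootstrap test of H0: mu = 129.0, using Student's t statistic.
  x <- c(94, 197, 16, 38, 99, 141, 23)   # treatment group, Table 2.1
  mu0 <- 129.0
  x0 <- x - mean(x) + mu0                # shifted data: mean is now mu0
  t.obs <- (mean(x) - mu0) / (sd(x) / sqrt(length(x)))
  set.seed(1)
  t.star <- replicate(9999, {
    xs <- sample(x0, replace = TRUE)
    (mean(xs) - mu0) / (sd(xs) / sqrt(length(xs)))
  })
  mean(t.star <= t.obs)   # lower-tailed p-value, as in E&T's example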

The bottom of p. 226 and the top of p. 227 give us another way to obtain the replicates. Obtaining the replicates using (16.18) is easier, and the results are equivalent to those obtained using the method described above (see the 3rd line on p. 227), but explaining it as was done above makes it easier to understand the logic behind the test procedure.

The very last portion of Sec. 16.4 indicates that doing a test using bootstrap t confidence intervals is compatible with the testing procedure described in the section. I'll explain this in class. (Note: There's a tiny thing that makes the two testing procedures not exactly compatible.)

In some situations, instead of shifting the empirical distribution to make it have a certain mean, it may be better to rescale it in order to achieve a certain mean. If the sample can be thought to be observations of iid nonnegative random variables, a simple shifting may result in the shifted empirical distribution assigning positive probability to negative values, but this can be avoided if the observations in the sample are divided by the sample mean and then multiplied by the null hypothesis mean.
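
A small numerical illustration (hypothetical values, chosen only to show the problem):

  x <- c(2, 3, 10)     # nonnegative observations with sample mean 5
  mu0 <- 1             # null hypothesis mean below the sample mean
  x - mean(x) + mu0    # shifting gives -2 -1 6: negative values
  x / mean(x) * mu0    # rescaling gives 0.4 0.6 2.0: still nonnegative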



Sec. 16.5 deals with an out-of-the-ordinary hypothesis testing situation: testing to see if there is significant evidence that a density has more than one mode.

(16.19) gives a Gaussian kernel density estimate (which I'll discuss in class). Fig. 16.2 shows how the shape of the density estimate, and the number of modes of the density estimate, changes as the smoothing parameter (aka window size) changes. Because of this, and because it's not clear what the value of the smoothing parameter should be, it's not clear how many modes the density underlying the data has. But it might be informative to do a test of the null hypothesis that there is just one mode against the alternative that there are more than one mode.
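
The estimate in (16.19) can be computed with R's density() function, whose bw argument is the standard deviation of the Gaussian kernel; here is a sketch that counts the modes of the estimate for several values of h (the mode-counting helper and the sample are my own, for illustration):

  # Count the local maxima of a Gaussian kernel density estimate with
  # smoothing parameter h, evaluated on a fine grid.
  n.modes <- function(x, h) {
    y <- density(x, bw = h, kernel = "gaussian", n = 1024)$y
    sum(y[2:1023] > y[1:1022] & y[2:1023] > y[3:1024])
  }
  set.seed(1)
  x <- c(rnorm(50, 0), rnorm(50, 3))   # hypothetical two-component sample
  sapply(c(0.1, 0.25, 0.5, 1, 2), function(h) n.modes(x, h))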

A test statistic can be h1, the smallest value of the smoothing parameter which will produce a density estimate with just one mode. (Note: The number of modes is a nonincreasing function of the smoothing parameter, h.) A large value of h1 means that a lot of smoothing has to be done in order to create a density estimate with just one mode, and can be taken to be evidence against the null hypothesis that there is just one mode.
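
Since the number of modes is a nonincreasing step function of h, h1 can be found by bisection; this sketch assumes the n.modes() helper from the block above:

  # Smallest h (to within tol) giving a unimodal density estimate.
  h1 <- function(x, lo = 1e-4, hi = 10 * sd(x), tol = 1e-5) {
    while (hi - lo > tol) {
      mid <- (lo + hi) / 2
      if (n.modes(x, mid) == 1) hi <- mid else lo <- mid
    }
    hi
  }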

For the stamp data dealt with in Sec. 16.5, the value of h1 is about 0.0068. The p-value corresponds to the probability that a sample of size 485 from a unimodal distribution would require a value of h1 as large as 0.0068 to yield a unimodal density estimate. But to firm up the bootstrap test, it needs to be decided how to generate new samples from a unimodal distribution having characteristics similar to those of the unknown distribution underlying the data. One reasonable possibility (and the one that E&T consider) is to use a density estimate obtained from the original data having just one mode. (Another possibility would be to use stamp data from nearby years, thought to be years when a philatelic mixture (see p. 227) did not occur.) Since one can create many different density estimates from the stamp data that have just one mode, and we need to select something specific, we can use the one obtained using the smallest value of the smoothing parameter that yields a unimodal density estimate, since it will be one which is not oversmoothed and which most closely corresponds to the empirical distribution. (Note: Picking a null model so close to distributions corresponding to the alternative hypothesis (since h = h1 almost gives us a bimodal density) makes our test conservative, since a null model based on a larger value of h would be more prominently unimodal, and so samples from it may not need a lot of smoothing to be applied to obtain unimodal density estimates.)

Because random variables corresponding to a Gaussian kernel density estimate have a variance that is larger than the variance of the associated empirical distribution, rather than sample from the unimodal density estimate obtained from the data, it's better to sample from a rescaled version of the density estimate (the one created from observations rescaled according to (16.22) on p. 231). This will make the bootstrap world distribution have the right variance, which is important, since clearly, in addition to the shape of the density, the value of h1 depends on how spread out the sample observations are.
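
A sketch of such a sampler, based on my reading of (16.22): draw values from the data, add Gaussian noise with standard deviation h, and shrink toward the resample mean so that the sampled distribution has variance close to that of the data.

  # One smoothed-bootstrap sample of size length(x) from the rescaled
  # Gaussian kernel density estimate with smoothing parameter h.
  rsmooth <- function(x, h) {
    xstar <- sample(x, replace = TRUE)
    mstar <- mean(xstar)
    mstar + (xstar - mstar + h * rnorm(length(x))) / sqrt(1 + h^2 / var(x))
  }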

Having generated bootstrap samples corresponding to the null hypothesis of a unimodal distribution, to carry out the bootstrap test we need to know how many of them have an h1 value as large as the one obtained from the observed data. (If none of them do, then it means that data from a unimodal distribution doesn't have to be smoothed as much as the observed data to obtain a unimodal density estimate, which strongly suggests that the observed data did not come from a unimodal distribution. If a lot of them do, then it's plausible that the observed data came from a unimodal distribution.) Rather than determine a value of h1 for each bootstrap sample, since what we really need to know is whether the value of h1 is greater than or equal to 0.0068, we can just determine if a density estimate based on a smoothing parameter value of 0.0068 is unimodal. If it is, then the value of h1 for the bootstrap sample is less than or equal to 0.0068. If it's not, then h1 must exceed 0.0068. (It can be noted that there is a slight problem with this shortcut method, concerning the nature of the inequalities, but for the sake of simplicity I'll overlook it.)
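
Putting the pieces together, a rough sketch of the whole test (it assumes n.modes(), h1(), and rsmooth() from the sketches above; 'stamp' is a placeholder for the 485 thickness measurements):

  h1.obs <- h1(stamp)   # about 0.0068 for the stamp data
  set.seed(1)
  # For each null sample, check whether the density estimate at h = h1.obs
  # is still multimodal, i.e. whether that sample's h1 exceeds h1.obs.
  exceeds <- replicate(500, n.modes(rsmooth(stamp, h1.obs), h1.obs) > 1)
  mean(exceeds)         # approximate p-value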

This R code can be used to perform the test described in Sec. 16.5.