Some Notes Pertaining to Ch. 15 of E&T



Sec. 15.1 points out that Fisher introduced permutation tests in the 1930s. Pitman was also a major contributor, and in fact the two-sample permutation test emphasized in the first portion of Ch. 15 is known as Pitman's permutation test, while Fisher's permutation test is one which is applied to matched-pairs data, and it can sometimes be used for a one-sample test of location. (See Ch. 1 of Miller's Beyond ANOVA: Basics of Applied Statistics for information about Fisher's permutation test.)

Sec. 15.1 also indicates that originally permutation tests served to justify the use of normal theory procedures in nonnormal settings, and were not practical for everyday use. This is because they require a lot of computation --- the reference distribution that is used to obtain a permutation test p-value cannot be found in a book's appendices ... it has to be constructed using the particular set of observed values at hand.

Now some software can be used to perform permutation tests, with StatXact being the software that I prefer to use for such tests. But StatXact won't do permutation tests based on statistics other than the sample mean, and so it'll be useful to read about such tests in E&T and figure out how they can be done using R. Also, it'll be nice to understand how permutation tests relate to bootstrap tests: while the latter part of Ch. 15 introduces bootstrap hypothesis testing methods, some of the bootstrap testing procedures covered in Ch. 16 are more similar to permutation tests.



Sec. 15.2 covers the basics of the two-sample permutation test (aka Pitman's permutation test). I'll give some comments about this test, and about E&T's presentation of it, below.
  1. It's really a test of the general two-sample problem, for which the null hypothesis is that the two distributions are identical (expressed by (15.2) on p. 202), and the alternative hypothesis is the general alternative that the two distributions are not identical. In order to use it as a test about the means, certain assumptions have to be made. E.g., if we can believe that either the two distributions are identical or that one is stochastically larger than the other, then the test can be interpreted as being a test about the means.
  2. The achieved significance level (ASL) given by E&T on p. 203 is what most books refer to as the p-value (or P value). I will use these interchangeably.
  3. I do not like that E&T use "accept H0" --- most statisticians (but for some reason not these two eminent ones) frown on the use of "accept the null hypothesis" and prefer to simply state that "the null hypothesis is not rejected."
  4. I think it's interesting that from a null hypothesis point of view one has equally-likely outcomes for the test statistic --- so the rejection region is not formed by those outcomes which are the least compatible with the null hypothesis, but rather it's formed by those outcomes which are most compatible with the alternative that the means are different. (Note: Even though the permutation test may be used to test the null hypothesis of identical distributions against the general alternative (that the distributions are not identical), the rejection region is based on compatibility with the distribution means being different.)
  5. For the mouse data considered in Sec. 15.2, StatXact can be used to obtain an exact p-value for a one-sided permutation test, with the result being 0.1410. This R code (sketched after this list) can be used to obtain an estimated exact p-value of 0.1432. (E&T's estimated exact p-value is 0.132 --- it's off because it was only based on 1000 trials. Using 40,000 trials like I did produced a result much closer to the exact p-value obtained from StatXact.) Student's two-sample t test gives a p-value of 0.1406.
  6. In addition to giving a point estimate for the exact p-value, R can also be used to give a confidence interval for the exact p-value. (See my R code for details.)
  7. There are two ways to justify the permutation test: one based on the viewpoint that the data was obtained from an experiment that used random assignment of experimental units to the treatment groups, and the other based on a conditioning argument. (I'll discuss both viewpoints in class.)
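
Here is a minimal sketch (my code, not E&T's) of how such a Monte Carlo permutation test can be done in R, along with a binomial confidence interval for the exact p-value. The data are the mouse survival times given in E&T; the variable names are mine:

  # Monte Carlo two-sample permutation test for the mouse data of E&T
  z <- c(94, 197, 16, 38, 99, 141, 23)          # treatment group survival times
  y <- c(52, 104, 146, 10, 51, 30, 40, 27, 46)  # control group survival times
  obs <- mean(z) - mean(y)                      # observed difference in sample means

  set.seed(1)
  B <- 40000                                    # number of random permutations
  pooled <- c(z, y)
  n <- length(z)
  stats <- replicate(B, {
    perm <- sample(pooled)                      # random relabeling of the pooled data
    mean(perm[1:n]) - mean(perm[-(1:n)])
  })

  mean(stats >= obs)                            # estimated exact one-sided p-value
  binom.test(sum(stats >= obs), B)$conf.int     # confidence interval for the exact p-value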



Sec. 15.3 pertains to two-sample permutation tests based on statistics other than the sample mean. The difference in sample medians or trimmed means can replace the difference in the sample means to obtain tests having different power characteristics. This R code can also be used to obtain an estimated exact p-value based on the difference in sample medians, as in the sketch below. (Neither StatXact, nor any other software package that I know of, has such a test on its menu of choices. But it wasn't hard to create R code for the test.)
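
Here is the median version, a sketch reusing the objects defined in the earlier code (for a trimmed-mean test, one could instead swap in something like mean(x, trim = 0.25)):

  # same permutation scheme, with the difference in sample medians as the statistic
  obs.med <- median(z) - median(y)
  stats.med <- replicate(B, {
    perm <- sample(pooled)
    median(perm[1:n]) - median(perm[-(1:n)])
  })
  mean(stats.med >= obs.med)                    # estimated exact one-sided p-value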

The top part of p. 212 describes a test based on the ratio of estimated variances. It's important to realize that this test is just another test of the general two-sample problem, and not really a test about variances, although its power will depend on how the variances differ.
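
The same scheme accommodates the variance-ratio statistic; this is just my sketch of the idea, not code from E&T:

  # permutation test using the ratio of sample variances as the statistic
  obs.var <- var(z) / var(y)
  stats.var <- replicate(B, {
    perm <- sample(pooled)
    var(perm[1:n]) / var(perm[-(1:n)])
  })
  mean(stats.var >= obs.var)                    # one-sided p-value (large ratios extreme)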

Pages 213-214 address the misleadingly small p-values that can result if one tries several different tests and selects the smallest p-value obtained from the tests to use. If it is decided to use the strategy of performing several types of permutation tests and reporting the smallest p-value obtained, dishonest and misleading results can be avoided if the test statistic given by (15.27) is used. (I'll explain this test in class.)
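
One honest way to carry out such a combined test in R is sketched below. This is my illustration of the general idea --- using the minimum p-value over several statistics as the combined statistic --- and not necessarily the exact form of (15.27):

  # combine several permutation tests without cherry-picking: compute each
  # candidate statistic on every permutation, convert each to a permutation
  # p-value, and treat the minimum p-value itself as the test statistic
  stat.fns <- list(
    means   = function(a, b) mean(a) - mean(b),
    medians = function(a, b) median(a) - median(b)
  )
  perm.stats <- sapply(stat.fns, function(f)
    replicate(B, {
      perm <- sample(pooled)
      f(perm[1:n], perm[-(1:n)])
    }))
  obs.stats <- sapply(stat.fns, function(f) f(z, y))
  p.obs  <- sapply(1:2, function(k) mean(perm.stats[, k] >= obs.stats[k]))
  p.perm <- sapply(1:2, function(k) rank(-perm.stats[, k]) / B)  # each permutation's own p-value
  mean(apply(p.perm, 1, min) <= min(p.obs))     # honest combined p-value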



Sec. 15.4 introduces one type of bootstrap hypothesis test: using bootstrap confidence intervals to perform a test of hypotheses. I don't particularly like this type of test, especially for a one-sided test, because bootstrap confidence interval methods don't behave accurately in some situations.

Recall that if one has an accurate 90% confidence interval procedure, one can perform a size 0.10 test of H0: θ = 0 against the alternative that θ does not equal 0 by rejecting the null hypothesis if the confidence interval does not contain 0. (This is because if the null hypothesis is true the confidence interval will cover 0 with probability 0.9, and so the probability of a false rejection of the null hypothesis is 0.1.) If one of the endpoints of a 90% confidence interval is equal to 0, we can claim that the p-value is 0.10. If a 90% confidence interval doesn't cover 0, but a 95% confidence interval does, then the p-value must be between 0.05 and 0.10. If a 92% confidence interval has an endpoint of 0, we can claim that the p-value is 0.08.

For one-sided tests, if we have an accurate 90% confidence interval method that (for the situation at hand) misses to the left with the same probability that it misses to the right, then if an endpoint equals 0, the p-value is either 0.05 or 0.95. A problem with doing tests of hypotheses this way is that a lot of times confidence interval methods just aren't accurate enough. But if we choose to perform a test using this scheme, and we're doing a test against the alternative that θ > 0, we can reject with a size 0.05 test if the lower confidence bound of a 90% confidence interval exceeds 0. Furthermore, the p-value is equal to α if the lower bound of a 1 - 2α confidence interval equals 0. If we base the test on a percentile confidence interval, the p-value is just the proportion of replicates that are less than 0. For a two-sample test about the means, the replicates are differences in the sample means obtained from the bootstrap samples. It's a bit more complicated if BCa intervals are used, but p. 216 of E&T describes the procedure.
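
Here is a minimal sketch of the percentile version of this bootstrap test for the mouse data (reusing z, y, and B from the permutation test sketch); the BCa version adjusts the cutoff as described on p. 216:

  # bootstrap test via the percentile interval: resample each group
  # independently and see how often the replicated difference in sample
  # means falls below 0 (testing against the alternative that the
  # treatment mean exceeds the control mean)
  set.seed(1)
  boot.diffs <- replicate(B, {
    mean(sample(z, replace = TRUE)) - mean(sample(y, replace = TRUE))
  })
  mean(boot.diffs < 0)                          # bootstrap p-value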

Even though the difference in two sample means can be used for both a two-sample permutation test and a test of hypotheses based on a bootstrap confidence interval, the nature of these tests differs. The permutation test is a test of the null hypothesis of identical distributions against the general alternative, whereas the bootstrap test can be viewed as a test about the distribution means.