Some Comments about Chapter 5 of Hollander & Wolfe



On p. 141, H&W state "Section 5.4 contains a distribution-free test of the general hypothesis that two populations are identical in all respects." Really, with the exception of the jackknife procedure of Sec. 5.2, all of the tests of Ch. 4 and Ch. 5 have as the null hypothesis that the two distributions are identical, and a small p-value is evidence that the distributions are not the same. To make a statement about the means, medians, or variances requires further assumptions. For example, with a shift model, or a scale model for nonnegative random variables, a small p-value can be taken as strong evidence that the means, medians, and quantiles differ (and in the case of the scale model, that the variances differ --- the shift model dictates that the variances are equal). Although p. 141 suggests that the test of Sec. 5.1 is a test about the scale parameters of two distributions having equal medians, it should be kept in mind that the p-value is determined using the null hypothesis model of identical distributions.


Section 5.1

To continue along the lines started above: even if the medians are equal, if the distributions aren't members of the same location-scale family, the A-B test can have high power to yield a small p-value in some cases where the variances are equal, and low power to produce a small p-value in some cases where the variances differ. Except under a rather strict set of circumstances that may not often occur, the A-B test is best not thought of as a test of variances.

There are several other tests that are similar in spirit to the A-B test. StatXact includes some such tests, and I'll mention those during class. Perhaps the A-B test gets favored status here because Hollander has a connection to Bradley (the B of A-B).
p. 142
The location-scale model is imposed. If we believe in that model, and if we believe that the medians are equal, then the test of Sec. 5.1 can be viewed as a test about the scale parameters (or a test about the variances if the variances exist). But I think you'll find it rare that you should believe in a location-scale model and equal medians. I think the A-B test should be primarily viewed as a test of the null hypothesis that two distributions are identical against the general alternative that they differ, having sensitivity to differences in dispersion more so than differences in location, or in having one distribution stochastically larger than the other one. If a small p-value is obtained, then you have evidence that the distributions differ, and you can perhaps characterize the differences by estimating quantiles.
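To make the discussion above concrete, here is a minimal sketch of running the Ansari-Bradley test in SciPy (scipy.stats.ansari implements the A-B statistic, with an exact null distribution for small samples and a normal approximation otherwise). The data below are made up purely for illustration, not taken from H&W.

```python
# Illustrative sketch of the Ansari-Bradley test; the data are invented.
from scipy.stats import ansari

x = [1.83, 0.50, 1.62, 2.48, 1.68, 1.88, 1.55, 3.06, 1.30]
y = [0.88, 0.65, 0.60, 2.05, 1.06, 1.29, 1.07, 3.14, 1.28]

stat, p = ansari(x, y)
print(f"A-B statistic = {stat}, p-value = {p:.4f}")
```

Remember the point above: a small p-value here is evidence that the two distributions differ; interpreting it as evidence about the scale parameters alone requires the location-scale model with equal medians.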
p. 145 (top half of page)
The mean and variance results shown are just special cases of general results I presented during the 4th lecture.
p. 146
The adjustment for ties result is just the application of the general result I presented during the 4th lecture to the midrank adjusted set of Ansari-Bradley scores.
p. 147
Viewing the two samples as being iid under the null hypothesis of no difference seems okay if the serum was divided to make 40 experimental units and these were randomly assigned to the two methods. But the book indicates that 20 "duplicate analyses were made" and that suggests matched pairs, which means that they shouldn't be viewed as two independent samples.
p. 152 (first paragraph)
p. 152, Comment 8
I just don't see the point of the mean and variance computations for each test statistic for one or more specific small sample situations, when general results that apply to any two-sample linear rank statistics and any sample sizes are so easy to obtain.
p. 153, Comment 9
For those with a knowledge of basic results from survey sampling, (i) and (ii) are ways to arrive at the general mean and variance results that I presented during our 4th class.
p. 154 (near bottom of page)
H&W make the asymptotic normality seem rather trivial by appealing to sampling results, but if one wants to firmly establish the asymptotic normality, it's not all that simple to do!
pp. 155-156, Comment 11
This is a good example to illustrate the exact distribution based on midranks that StatXact uses. (For all but the smallest sample size cases, working with the exact distribution would be quite time consuming without software such as StatXact.)
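As a sketch of what StatXact is doing under the hood, here is one way (my own illustration, not StatXact's implementation) to compute an exact two-sided permutation p-value for the A-B statistic with midrank scores, by enumerating all C(m+n, m) assignments of the pooled scores to the first sample. For untied ranks r, the A-B score min(r, N+1-r) equals (N+1)/2 - |r - (N+1)/2|, and applying that formula to midranks gives the midrank-adjusted scores. The data are invented, and this brute-force enumeration is feasible only for small samples.

```python
# Exact permutation p-value for the Ansari-Bradley statistic with midrank
# scores (my own illustrative sketch; feasible only for small m + n).
from itertools import combinations
import numpy as np
from scipy.stats import rankdata

def ab_exact_pvalue(x, y):
    z = np.concatenate([x, y])
    N = len(z)
    r = rankdata(z)                              # midranks under ties
    scores = (N + 1) / 2 - np.abs(r - (N + 1) / 2)
    m = len(x)
    c_obs = scores[:m].sum()                     # observed A-B statistic
    mean_c = m * scores.mean()                   # null mean of the statistic
    count = total = 0
    for idx in combinations(range(N), m):        # all C(N, m) splits
        c = scores[list(idx)].sum()
        total += 1
        if abs(c - mean_c) >= abs(c_obs - mean_c) - 1e-12:
            count += 1
    return count / total                         # two-sided p-value

x = [1.1, 2.3, 0.4, 3.5]
y = [1.8, 2.0, 1.6, 2.2, 1.9]
print(ab_exact_pvalue(x, y))
```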
p. 156, Comment 12
I'm not going to emphasize the confidence interval and point estimator, since they only make sense if one has a location-scale model with either known medians or equal medians.
p. 156, Comment 13
The ploy of subtracting the sample median is generally not a good idea. To view this as a test about the variances, one would still have to assume that both distributions belong to the same location-scale family. It can be noted that the only reference given for the asymptotically distribution-free result is an unpublished Ph.D. dissertation, which suggests to me that it isn't such a useful result.
p. 157, Comment 15
It will be interesting to compare the method described at the end of the comment to the Lepage test of Sec. 5.3. Note that one could always get a p-value at least as small by doing both the W-M-W test and the Ansari-Bradley test, and taking the smaller of the two p-values. So the described method would only be of interest to statisticians who value ethical practice. (I'll try to remember to discuss this during class --- how to get a p-value from the scheme, and why it's "cheating" to do both tests and take the smaller of the two p-values.)
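To see in code why taking the smaller of the two p-values is cheating: min(p1, p2) is anticonservative, since either test alone could produce it. One honest (if conservative) repair is a Bonferroni-style correction that reports 2*min(p1, p2), capped at 1. This sketch is my own illustration with invented data, not a method from H&W.

```python
# Sketch (my own illustration): running both the W-M-W and A-B tests and
# reporting min(p1, p2) inflates the size of the combined procedure.
# A Bonferroni-style correction, 2 * min(p1, p2) capped at 1, is valid.
from scipy.stats import mannwhitneyu, ansari

x = [10.1, 7.3, 12.6, 2.4, 6.1, 8.5, 8.8, 9.4, 10.2, 9.8]
y = [6.6, 7.2, 7.6, 7.9, 8.0, 8.1, 8.2, 8.3, 8.4, 8.7]

p_wmw = mannwhitneyu(x, y, alternative="two-sided").pvalue
p_ab = ansari(x, y).pvalue
p_combined = min(1.0, 2.0 * min(p_wmw, p_ab))    # valid combined p-value
print(p_wmw, p_ab, p_combined)
```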

Section 5.2

This procedure is omitted from a lot of books on nonparametric statistics. It's not distribution-free, but since no parametric model is assumed, it can be referred to as being nonparametric. (Jackknifing is discussed in GMU's course on computational statistics, although I don't think that course covers jackknife hypothesis testing and interval estimation methods. This summer, I covered the method described in H&W when I taught an advanced topics course.)

For doing two-sample tests about variances, the jackknife procedure and the APF (aka Box-Anderson) test seem to be the best general methods available. As to which is better, it's not clear to me from the results I've seen reported in the literature.

I'll offer some specific comments about the text below.
p. 158
H&W indicate that a location-scale model is assumed. I don't think that is absolutely necessary, but I do believe that the test will be more accurate if the two distributions belong to the same location-scale family.
p. 161 (top half of page)
The suggestion given in H&W is that for small sample sizes, one should use the t critical values with m+n-2 df, corresponding to Student's two-sample t statistic, instead of the standard normal critical values. But if you look at the test statistic given by (5.35) on p. 160 (with the various parts of it defined on pp. 159-160), it has the form of Welch's two-sample statistic rather than Student's, and so I think using the df prescribed for Welch's procedure may be better; I went the Welch route when creating my Minitab macro.

Looking in the literature, one can see that Miller indicates that the proper df to use isn't completely clear. But others who have studied the procedure have generally adopted the Student's t df, and some have altered the test statistic from what is given by (5.35) to be of the form of Student's two-sample t statistic. In at least one case it appears that this was done to more easily extend the scheme to handle more than two samples: the jackknife test statistic is made to resemble the one-way ANOVA F statistic, which reduces to being equivalent to Student's two-sample t in the case of only two samples. (Note: For more than two samples, I'd consider extending the scheme by basing the test statistic on Welch's F statistic for the heteroscedastic case. This statistic reduces to Welch's two-sample statistic, and thus matches (5.35).)

It can be noted that in studies done to compare the performances of variance testing procedures, when the sample sizes were equal, the jackknife method using m+n-2 df was anticonservative in some cases. If the df from Welch's two-sample statistic had been used, fewer rejections would have resulted, and the problem of anticonservative behavior would have been eliminated or reduced. The anticonservativeness problem also exists for unequal sample sizes when m+n-2 df are used, and my guess is that the Welch df may improve things even more in that setting. (Note: An excellent project for an M.S. student, whether it be for a thesis or just picking up some credits by doing an independent project in the summer, would be to compare several varieties of the jackknife variance testing procedure in order to determine which one performs best.)
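Here is a minimal sketch of the jackknife variance test as I read it: form the pseudo-values of log(sample variance) for each sample, then compare their means with a Welch-type statistic, using the Welch-Satterthwaite df rather than m+n-2. The function names and data are my own inventions for illustration, not H&W's notation.

```python
# Sketch of the jackknife test of equality of variances, using pseudo-values
# of log(s^2) and the Welch-Satterthwaite df (my own illustrative coding).
import numpy as np
from scipy.stats import t as t_dist

def log_var_pseudovalues(x):
    """Pseudo-values n*theta_hat - (n-1)*theta_hat_(-i) for theta = log(s^2)."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    theta_full = np.log(np.var(x, ddof=1))
    loo = np.array([np.log(np.var(np.delete(x, i), ddof=1)) for i in range(n)])
    return n * theta_full - (n - 1) * loo

def jackknife_variance_test(x, y):
    px, py = log_var_pseudovalues(x), log_var_pseudovalues(y)
    m, n = len(px), len(py)
    vx, vy = np.var(px, ddof=1) / m, np.var(py, ddof=1) / n
    q = (px.mean() - py.mean()) / np.sqrt(vx + vy)   # Welch-type statistic
    # Welch-Satterthwaite df instead of m + n - 2:
    df = (vx + vy) ** 2 / (vx ** 2 / (m - 1) + vy ** 2 / (n - 1))
    p = 2 * t_dist.sf(abs(q), df)                    # two-sided p-value
    return q, df, p

x = [10.2, 9.1, 12.3, 8.7, 11.5, 10.9, 9.8, 13.1, 10.4, 11.0]
y = [10.0, 10.3, 9.9, 10.1, 10.2, 9.8, 10.4, 10.0, 9.7, 10.3]
q, df, p = jackknife_variance_test(x, y)
print(f"q = {q:.3f}, df = {df:.2f}, p = {p:.4f}")
```

Swapping the df line for m+n-2 recovers the H&W prescription, which makes the two choices easy to compare in a simulation study.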
p. 164 (top half of page)
When my Minitab macro is applied to the data for Example 5.2, the resulting value of the jackknife test statistic matches the value of 1.36 given by H&W, but the p-values do not match since I used the Welch's test df formula (aka the df resulting from a Satterthwaite approximation) instead of m+n-2 df.
p. 165, Comment 19
The alternative expressions for the sample variances are used in my Minitab macro. Using them allows one to easily compute the whole set of values given by (5.25) on p. 159.
p. 165, Comment 21
The motivation given for the jackknife procedure by H&W isn't very good. In class, I'll try to explain how the assumed (near) independence of the pseudo-values (which are defined, but not referred to as pseudo-values, 6 lines from the bottom of p. 165, right before the jackknife estimator is defined) can be used to create a test statistic that resembles Welch's two-sample statistic.
pp. 166-167, Comment 24
I'm not going to cover the related interval and point estimators, even though they follow fairly simply from the development of the test procedure. (Note: I have no idea why H&W chose to use a 94.52% confidence bound.)
p. 167, Comment 26
Note that the actual size of a nominal size 0.05 F test can be as small as 0.0056 and as large as 0.166. (Note: H&W use level instead of size, but size is the better choice.)

Section 5.3

As with the test of Sec. 5.1, the Lepage test is a test of the null hypothesis of identical distributions against the general alternative, but it has some sensitivity to both differences in dispersion and differences in location.

The Lepage test is somewhat similar to the test indicated near the end of Comment 15, but has different power characteristics. Neither dominates the other with regard to power.

The Lepage test may result in a smaller p-value than either of its "component" tests. For example, with a given data set, both the W-M-W test and the A-B test could result in p-values near 0.05, and the Lepage test could yield a p-value near 0.02.

I'll offer some specific comments about the text below.
p. 170 (near middle of page)
Note that to use the tables you have to use the scores of the larger of the two samples to compute the test statistic.
p. 170, Large-Sample Approximation
Since the chi-square distribution with 2 df is an exponential distribution, a simple formula can be given for the p-value (since a simple formula exists for upper-tail probabilities for exponential distributions). Since the Lepage test is not on StatXact or, to my knowledge, any other software, and since the tables only cover a small number of small sample size situations, it will be good to know how to easily get the asymptotic p-value.
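Since the chi-square distribution with 2 df is the exponential distribution with mean 2, its survival function is exp(-x/2), so the asymptotic p-value for an observed Lepage statistic D is just exp(-D/2). A one-line sketch:

```python
# Asymptotic p-value for the Lepage statistic: the chi-square distribution
# with 2 df is exponential with mean 2, so P(D >= d) = exp(-d/2).
import math

def lepage_asymptotic_pvalue(d):
    return math.exp(-d / 2.0)

print(lepage_asymptotic_pvalue(5.99))  # about 0.05, as expected for the
                                       # 0.95 quantile of chi-square(2)
```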

Section 5.4

The two-sample Kolmogorov-Smirnov test is a test of the null hypothesis of identical distributions against the general alternative. Like the Wald-Wolfowitz test, the K-S test is an omnibus test. If a shift model holds, or a scale model for nonnegative random variables holds, or if two distributions have the same median but differ in scale, then other two-sample tests typically do better (i.e., yield smaller p-values when a small p-value is warranted), but for some data sets the K-S test can be a good test to use. (The data set of Example 5.4 provides an example of such a data set.)
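For reference, the two-sample K-S test is available in SciPy as scipy.stats.ks_2samp, which (like StatXact) reports the unscaled statistic max|F_m - G_n| rather than the (5.70) version with the mn/d factor. The data below are invented for illustration.

```python
# Illustrative sketch of the two-sample K-S test; the data are invented.
# SciPy reports the statistic as max|F_m - G_n|, a value in [0, 1].
from scipy.stats import ks_2samp

x = [10.1, 7.3, 12.6, 2.4, 6.1, 8.5, 8.8, 9.4, 10.2, 9.8]
y = [6.6, 7.2, 7.6, 7.9, 8.0, 8.1, 8.2, 8.3, 8.4, 8.7]

res = ks_2samp(x, y)
print(res.statistic, res.pvalue)
```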

I'll offer some specific comments about the text below.
p. 179
Not all books and statistical software packages define the test statistic as H&W do in (5.70). For example, StatXact omits the mn/d factor. (So for the data of Example 5.4, StatXact supplies a test statistic value of 0.6.)
p. 181
StatXact gives an asymptotic p-value of about 0.0546, which differs from the large-sample approximation result given in Example 5.4. However, if one notes that the value of (5.73) is about 1.34164, and uses interpolation with Table 11 as opposed to rounding and looking up the value corresponding to 1.34, then one obtains a value in agreement with the StatXact output.
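The interpolation can be avoided entirely by evaluating the asymptotic K-S survival function directly: Q(lambda) = 2 * sum over k >= 1 of (-1)^(k-1) * exp(-2 k^2 lambda^2), which is the series behind tables like Table 11. A quick sketch (my own coding):

```python
# Asymptotic two-sample K-S survival function:
#   Q(lam) = 2 * sum_{k>=1} (-1)^(k-1) * exp(-2 * k^2 * lam^2)
import math

def kolmogorov_sf(lam, terms=100):
    return 2.0 * sum((-1) ** (k - 1) * math.exp(-2.0 * k * k * lam * lam)
                     for k in range(1, terms + 1))

print(round(kolmogorov_sf(1.34164), 4))  # about 0.0546, matching StatXact
```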
p. 184, Comment 41
The one-sample K-S test is briefly described. I hope to cover it when I present some material on goodness-of-fit tests during the last lecture of the semester.

Section 5.5

The main thing to note from this section is that the A-B test is typically a poor choice, from a power perspective, compared to the APF test and the jackknife test. In particular, note the low efficiency (from the table on p. 187) of the A-B test relative to the APF test for the normal and light-tailed uniform distribution cases. (Of course, it should be kept in mind that these are asymptotic results, and things may differ for small samples.) But an advantage of the A-B test is that it is exact, whereas the APF test and the jackknife test are approximate (and may do poorly, with regard to accuracy (validity), when the sample sizes are small).

H&W don't describe the APF test. I covered it in the class I taught this summer, and I sometimes find time to mention it when I teach STAT 554. Perhaps I can quickly present the APF test in class, since I think it is a good test for you to know about.