Some Comments about Chapter 5 of Hollander & Wolfe
On p. 141, H&W state "Section 5.4 contains a distribution-free test
of the general hypothesis that two populations are identical in all respects."
Really, with the exception of the jackknife procedure of Sec. 5.2, all
of the tests of Ch. 4 and Ch. 5 have as the null hypothesis that the two
distributions are identical, and a small p-value is evidence that the
distributions are not the same. To make a statement about the means,
medians, or variances requires further assumptions. For example, with a
shift model, or a scale model for nonnegative random variables, a small
p-value can be taken as being strong evidence that the means, medians,
and quantiles differ (and in the case of the scale model, that the
variances differ --- the shift model dictates that the variances are
equal). Although p. 141 suggests that the test of Sec. 5.1 is a test
about the scale parameters of two distributions having equal medians, it
should be kept in mind that the p-value is determined using the
null hypothesis model of identical
distributions.
Section 5.1
To continue along the lines started above: even if the medians are
equal, if the distributions aren't members of the same location-scale
family, the A-B test can have high power to yield a small p-value in
some cases for which the variances are equal, and low power to produce
a small p-value in some cases for which the variances differ. So except
under a rather strict set of circumstances that may not often occur,
the A-B test is best not thought of as a test of variances.
There are several other tests that are similar in spirit to the A-B
test. StatXact includes some such tests, and I'll mention those
during class. Perhaps the A-B test gets favored status here because
Hollander has a connection to Bradley (the B of A-B).
- p. 142
- The location-scale model is imposed. If we believe in that model,
and if we believe that the medians are equal, then the test of Sec. 5.1
can be viewed as a test about the scale parameters (or a test about the
variances if the variances exist). But I think you'll find it rare that
you should believe in a location-scale model and equal medians.
I think the A-B test should be primarily viewed as a test of
the null hypothesis that two distributions are identical against the
general alternative that they differ, having sensitivity to differences
in dispersion more so than differences in location, or in having one
distribution stochastically larger than the other one. If a small p-value
is obtained, then you have evidence that the distributions differ, and
you can perhaps characterize the differences by estimating quantiles.
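As a quick illustration of using the test this way (the data below are made-up; I'm assuming scipy's implementation, scipy.stats.ansari, which uses the exact null distribution for small untied samples):

```python
from scipy import stats

# Hypothetical measurements: y looks considerably more dispersed than x.
x = [9.8, 10.1, 9.9, 10.2, 10.0, 9.7, 10.3]
y = [8.9, 11.2, 9.1, 10.9, 8.7, 11.4, 10.6]

# A small p-value is evidence that the distributions differ, with the
# A-B scores making the test especially sensitive to dispersion.
stat, p = stats.ansari(x, y)
print(stat, p)
```

If the p-value is small, one can then go on to characterize how the distributions differ, e.g., by estimating quantiles of each.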
- p. 145 (top half of page)
- The mean and variance results shown are just special cases of
general results I presented during the 4th lecture.
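Those general results for a two-sample linear rank statistic (the null mean and variance of the sum of the scores attached to one sample) are easy to code; here is a sketch (the function and variable names are mine), checked on the Ansari-Bradley scores:

```python
import numpy as np

def null_moments(scores, m):
    """Null mean and variance of S = the sum of the scores attached to the
    first sample of size m, when all assignments of the N combined scores
    to the two samples are equally likely."""
    a = np.asarray(scores, dtype=float)
    N = len(a)
    abar = a.mean()
    mean = m * abar
    var = m * (N - m) / (N * (N - 1)) * np.sum((a - abar) ** 2)
    return mean, var

# Check against the Ansari-Bradley scores 1, 2, ..., 2, 1 (no ties):
N, m = 10, 5
ab_scores = np.minimum(np.arange(1, N + 1), N - np.arange(1, N + 1) + 1)
print(null_moments(ab_scores, m))  # mean 15.0, variance 50/9 = 5.555...
```

The printed values agree with H&W's special-case formulas E0 = m(N+2)/4 and var0 = mn(N+2)(N-2)/(48(N-1)) for N even.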
- p. 146
- The adjustment for ties result is just the application of the
general result
I presented during the 4th lecture to the midrank adjusted set of
Ansari-Bradley scores.
- p. 147
- Viewing the two samples as being iid under the null hypothesis of
no difference seems okay if the serum was divided to make 40
experimental units and these were randomly assigned to the two methods.
But the book indicates that 20 "duplicate analyses were made" and that
suggests matched pairs, which means that they shouldn't be viewed as two
independent samples.
- p. 152 (first paragraph)
- The first sentence makes sense only if it is accepted that the two
distributions belong to the same location-scale family. For example,
one can have two distributions having equal medians and variances, but
differing in shape.
- While the F test does not require equal medians, it does
require normality (or very near normality --- there is little
robustness to deviations from normality).
- The last part of the paragraph provides a good example of why the
test shouldn't be viewed as a test about variances --- if the supports of
two distributions are disjoint intervals, the value of the test
statistic will just be a function of the sample sizes and not the
observed data.
- p. 152, Comment 8
- I just don't see the point of the mean and variance computations
for each test statistic for one or more specific small sample
situations, when general results that apply to any two-sample linear
rank statistics and any sample sizes are so easy to obtain.
- p. 153, Comment 9
- For those with a knowledge of basic results from survey sampling,
(i) and (ii) are ways to arrive at the general mean and
variance results that I presented during our 4th class.
- p. 154 (near bottom of page)
- H&W make the asymptotic normality seem rather trivial by appealing to
sampling results, but if one wants to firmly establish the asymptotic
normality, it's not all that simple to do!
- pp. 155-156, Comment 11
- This is a good example to illustrate the exact distribution based
on midranks that StatXact uses. (For all but the smallest sample
size cases, working with the exact distribution would be quite
time consuming without software such as StatXact.)
- p. 156, Comment 12
- I'm not going to emphasize the confidence interval and point
estimator, since they only
make sense if one has a location-scale model with either known
medians or equal medians.
- p. 156, Comment 13
- The ploy of subtracting the sample median is generally not a good
one. To view the result as a test about the variances, one would still
have to assume that both distributions are of the same location-scale
family. It can be noted that the only reference given for the
asymptotically distribution-free result is an unpublished Ph.D.
dissertation, which suggests to me that it isn't such a useful result.
- p. 157, Comment 15
- It will be interesting to compare the method described at the end of
the comment to the Lepage test of Sec. 5.3. Note that one could always
get a p-value at least as small by doing both the W-M-W test and the
Ansari-Bradley test, and taking the smaller of the two p-values. So the
described method would only be of interest to statisticians who value
ethical practice. (I'll try to remember to discuss this during class
--- how to get a p-value from the scheme, and why it's "cheating" to do
both tests and take the smaller of the two p-values.)
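To see why taking the smaller of the two p-values is cheating, here is a small simulation sketch (my own setup, not from H&W): under the null hypothesis of identical distributions, the W-M-W and A-B rank statistics are approximately independent, so rejecting whenever the smaller p-value is below 0.05 gives a type I error rate near 1 - 0.95^2, i.e., close to 0.10 rather than 0.05.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_rep, m, n = 2000, 60, 60
rejections = 0
for _ in range(n_rep):
    x = rng.standard_normal(m)
    y = rng.standard_normal(n)   # the null is true: identical distributions
    p_wmw = stats.mannwhitneyu(x, y).pvalue
    p_ab = stats.ansari(x, y).pvalue
    rejections += min(p_wmw, p_ab) < 0.05
print(rejections / n_rep)   # near 0.10, well above the nominal 0.05
```

A legitimate combined procedure has to account for doing two tests, which is what the method described in the comment (and the Lepage test of Sec. 5.3) is about.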
Section 5.2
This procedure is omitted from a lot of books on nonparametric
statistics. It's not distribution-free, but since no parametric model
is assumed, it can be referred to as being nonparametric. (Jackknifing
is discussed in GMU's course on computational statistics, although I
don't think that course covers jackknife hypothesis testing and interval
estimation methods. This summer, I covered the method described in H&W
when I taught an advanced topics course.)
For doing two-sample tests about variances, the jackknife procedure and
the APF (aka Box-Anderson) test seem to be the best general methods
available. As to which is better, it's not clear to me from the results
I've seen reported in the literature.
I'll offer some specific comments
about the text below.
- p. 158
- H&W indicate that a location-scale model is assumed. I don't think
that is absolutely necessary, but I do believe that the test will be
more accurate if the two distributions belong to the same location-scale
family.
- p. 161 (top half of page)
- The suggestion given in H&W is that for small sample sizes, one
should use the t critical values, with m+n-2 df, corresponding to
Student's two-sample t statistic, instead of the standard normal
critical values. But if you look at the test statistic given by (5.35)
on p. 160 (with the various parts of it defined on pp. 159-160), it has
the form of Welch's two-sample statistic instead of Student's two-sample
statistic, and so I think using the df prescribed for Welch's procedure
may be better, and I went the Welch route when creating my
Minitab macro. Upon looking in the literature, one can see that
Miller indicates that the proper df to use isn't completely clear. But
others who have studied the procedure have generally adopted the
Student's t df, and some have altered the test statistic from
what is given by (5.35) to be of the form of Student's two-sample
t statistic. In at least one case it appears that this was done
to more easily extend the scheme to handle more than two samples: the
jackknife test statistic is made to resemble the one-way ANOVA F
statistic, and this reduces to being equivalent to Student's two-sample
t in the case of only two samples. (Note: For more than two
samples, I'd consider extending the scheme by basing the test statistic
on Welch's F statistic for the heteroscedastic case. This
statistic reduces to Welch's two-sample statistic, and thus matches
(5.35).) It can be noted that in studies done to compare the
performances of variance testing procedures, when the sample sizes were
equal, the jackknife method using m+n-2 df was anticonservative
in some cases. If the df from Welch's two-sample statistic had been
used, fewer rejections would have resulted, and the problem of
anticonservative behavior would have been eliminated or reduced. The
anticonservativeness problem also exists for unequal sample sizes when
m+n-2 df are used, and my guess is that the Welch method may
improve things even more in that setting. (Note: An excellent project
for an M.S. student (whether it be for a thesis, or just picking up some
credits by doing an independent project in the summer) would be to
compare several varieties of the jackknife variance testing procedure in
order to determine which one performs best.)
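A minimal sketch of the Welch-df variant discussed above (the function names are mine; the pseudo-values are those of the log of the sample variance, as in Comment 21):

```python
import numpy as np
from scipy import stats

def log_var_pseudovalues(x):
    """Leave-one-out jackknife pseudo-values of log(sample variance)."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    full = np.log(np.var(x, ddof=1))
    loo = np.array([np.log(np.var(np.delete(x, i), ddof=1))
                    for i in range(n)])
    return n * full - (n - 1) * loo

def jackknife_variance_test(x, y):
    """Treat the pseudo-values as (nearly) independent observations and
    apply Welch's two-sample t test, so the df come from a Satterthwaite
    approximation rather than being fixed at m + n - 2."""
    u = log_var_pseudovalues(x)
    v = log_var_pseudovalues(y)
    return stats.ttest_ind(u, v, equal_var=False)
```

Using equal_var=True instead would give the Student's-df version that others have generally adopted.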
- p. 164 (top half of page)
- When my Minitab macro is applied to the data for Example
5.2, the resulting value of the jackknife test statistic matches the
value of 1.36 given by H&W, but the p-values do not match since I used
the Welch's test df formula (aka the df resulting from a Satterthwaite
approximation) instead of m+n-2 df.
- p. 165, Comment 19
- The alternative expressions for the sample variances are used in my
Minitab macro. Using them allows one to easily compute the whole
set of values given by (5.25) on p. 159.
- p. 165, Comment 21
- The motivation given for the jackknife procedure by H&W isn't very
good. In class, I'll try to explain how the assumed (near) independence
of the pseudo-values (which are defined, but not referred to as
pseudo-values, 6 lines from the bottom of p. 165, right before the
jackknife estimator is defined) can be used to create a test statistic
that resembles Welch's two-sample statistic.
- pp. 166-167, Comment 24
- I'm not going to cover the related interval and point estimators,
even though they follow fairly simply from the development of the test
procedure. (Note: I have no idea why H&W chose to use a 94.52%
confidence bound.)
- p. 167, Comment 26
- Note that the actual size of a nominal size 0.05 F
test can be as small as 0.0056 and as large as 0.166. (Note: H&W use
level instead of size, but size is the better choice.)
Section 5.3
As with the test of Sec. 5.1, the Lepage test is a test of the null
hypothesis of identical distributions against the general alternative,
but it has some sensitivity to both differences in dispersion and
differences in location.
The Lepage test is somewhat similar to the test indicated near the end
of Comment 15, but has different power characteristics. Neither
dominates the other with regard to power.
The Lepage test may result in a smaller p-value than either of its
"component" tests. For example, with a given data set,
both the W-M-W test and the A-B test could result in p-values near
0.05, and the Lepage test could yield a p-value near 0.02.
I'll offer some specific comments
about the text below.
- p. 170 (near middle of page)
- Note that to use the tables you have to use the scores of the larger
of the two samples to compute the test statistic.
- p. 170, Large-Sample Approximation
- Since the chi-square distribution with 2 df is an exponential
distribution, a simple formula can be given for the p-value (since a
simple formula exists for upper-tail probabilities for exponential
distributions). Since the Lepage test is not in StatXact or,
to my knowledge, any other software, and since the tables only cover a
small number of small sample size situations, it will be good to know
how to easily get the asymptotic p-value.
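The formula in question, as a one-liner (assuming the value of the Lepage statistic has already been computed):

```python
import math

def lepage_asymptotic_pvalue(d):
    """Upper-tail probability of a chi-square distribution with 2 df,
    i.e., an exponential distribution with mean 2: P(D > d) = exp(-d/2)."""
    return math.exp(-d / 2.0)

print(lepage_asymptotic_pvalue(5.99))  # about 0.05, the usual cutoff
```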
Section 5.4
The two-sample Kolmogorov-Smirnov test is a test of the null hypothesis
of identical distributions against the general alternative. Like the
Wald-Wolfowitz test, the K-S test is an omnibus test. If a shift
model holds, or a scale model for nonnegative random variables holds, or if
two distributions have the same median but differ in scale, then other
two-sample tests typically do better (i.e., yield smaller p-values when a
small p-value is warranted), but for some data sets the K-S test can be
a good test to use. (The data set of Example 5.4 provides an example of
such a data set.)
I'll offer some specific comments
about the text below.
- p. 179
- Not all books and statistical software packages define the test
statistic as H&W do in (5.70). For example, StatXact omits the
mn/d factor. (So for the data of Example 5.4, StatXact
supplies a test statistic value of 0.6.)
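For instance, scipy's ks_2samp follows the StatXact convention (the data below are made-up, chosen so the samples have disjoint supports):

```python
from scipy import stats

# With disjoint supports the empirical CDFs separate completely, so
# max|F_m - G_n| hits its ceiling of 1; the reported statistic is that
# maximum itself, without the mn/d scaling of H&W's (5.70).
x = [1.1, 2.3, 3.5, 4.0]
y = [5.2, 6.1, 7.7, 8.4]
res = stats.ks_2samp(x, y)
print(res.statistic)  # 1.0
```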
- p. 181
- StatXact gives an asymptotic p-value of about 0.0546, which
differs from the large-sample approximation result given in Example 5.4.
However, if one notes that the value of (5.73) is about 1.34164, and
uses interpolation with Table 11 as opposed to rounding and looking up
the value corresponding to 1.34, then one obtains a value in agreement
with the StatXact output.
- p. 184, Comment 41
- The one-sample K-S test is briefly described. I hope to cover it
when I present some material on goodness-of-fit tests during the last
lecture of the semester.
Section 5.5
The main thing to note from this section is that the A-B test is
typically a poor choice, from a power perspective, compared to the APF
test and the jackknife
test. In particular, note the low efficiency (from the table on p. 187)
of the A-B test relative to the APF test for the normal and light-tailed
uniform distribution cases. (Of course, it should be kept in mind that
these are asymptotic results, and things may differ for small samples.)
But an advantage of the A-B test is that it is exact, whereas the APF
test and the jackknife test are approximate (and may do poorly, with
regard to accuracy (validity), when the sample sizes are small).
H&W don't describe the APF test. I covered it in the class I taught this
summer, and
I sometimes find time to mention it when I teach STAT 554. Perhaps I can
quickly present the APF test in class, since I think it is a good test
for you to know about.