Some Comments about Chapter 6 of Hollander & Wolfe
Assumption A3 stipulates a shift model. If we have a shift model,
and determine that a pair of distributions differ, then we can conclude that
the means, medians, and quantiles all differ for that pair of
distributions. However, I don't think a shift model is very realistic in
most k sample settings, and so in general concluding that two
distributions are different does not allow us to draw a conclusion about
the means or medians. To make matters more fuzzy, for some of the tests
included in this chapter, a small p-value will just mean that at least
one of the distributions is different, and the test doesn't provide any
specific information about which distributions are different from which
other distributions.
The X_ij notation indicated at the top of page 190 is
not what I'm used to (and so in class I may reverse things --- either
intentionally or unintentionally). H&W use the j (2nd
index) to indicate the group, and use the i (1st index) to
indicate the observation number within a group. The usual convention
for one-way designs is to reverse the roles, and let the 1st
index indicate the group, and the 2nd index indicate the
observation number within the group.
Section 6.1
H&W focus on the Kruskal-Wallis test in this section, but during
class I'll also describe some similar tests that can be done using
StatXact.
I'll offer some specific comments
about the text below.
- p. 191, (6.4)
- The notation used in (6.4) is not what I'm used to. I prefer to
use a dot to replace the observation number index to indicate summation
over that index (giving a sum for the group/sample), and then put a bar
over the indicated sum to represent the group average (sample mean).
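Written in the convention I prefer, (6.4) takes the form below. This is a sketch in my notation (first index for the group, dot for summation over the observation index, bar for the average), not H&W's rendering:

```latex
% R_{ij}: combined-sample rank of the j-th observation in group i
% R_{i.} = \sum_j R_{ij} (group rank sum), \bar{R}_{i.} = R_{i.} / n_i
H = \frac{12}{N(N+1)} \sum_{i=1}^{k} n_i
    \left( \bar{R}_{i\cdot} - \frac{N+1}{2} \right)^{2}
```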
- p. 191, Large-Sample Approximation
- The tables below compare the approximate values of
P(H >= h) to the exact values, for various values of h.
My guess is that outside the set of small sample sizes covered
by Table A.12 of H&W, if all of the sample sizes are at least 5 or
6, then the chi-square approximation does "okay" for
approximating p-values close to 0.05, and perhaps 0.025, but that it can
be off by a factor of 2 or more (and by as much as a factor of 5 in some
cases I'll guess) when used to approximate smaller p-values, with the
approximate p-values being larger than they should be, which suggests
diminished power to reject the null hypothesis in
cases for which it should be rejected. This serves to indicate that
StatXact should be used to perform the Kruskal-Wallis test in
small sample settings not covered by the table.
(I'll also guess that if all of the sample sizes are at least 10, then
the approximation does okay except for really small p-values (say
approximating p-values less than 0.001).)
sample sizes of 5, 5, 5

      h     exact    approx
  5.660    0.0509    0.0590
  6.740    0.0248    0.0344
  7.980    0.0105    0.0185
  8.780    0.0050    0.0124
  9.920    0.0010    0.0070

sample sizes of 6, 6, 6

      h     exact    approx
  5.719    0.0502    0.0573
  6.889    0.0249    0.0319
  8.222    0.0099    0.0164
  9.088    0.0050    0.0106
 10.819    0.0010    0.0045

sample sizes of 7, 7, 7

      h     exact    approx
  5.766    0.0506    0.0560
  6.954    0.0245    0.0309
  8.378    0.0099    0.0152
  9.373    0.0049    0.0092
 11.288    0.0010    0.0035

sample sizes of 8, 8, 8

      h     exact    approx
  5.805    0.0497    0.0549
  6.995    0.0249    0.0303
  8.465    0.0099    0.0145
  9.495    0.0049    0.0087
 11.625    0.0010    0.0030

sample sizes of 4, 4, 4, 4

      h     exact    approx
  7.213    0.0507    0.0654
  8.228    0.0248    0.0415
  9.287    0.0100    0.0257
  9.971    0.0049    0.0188
 11.338    0.0010    0.0100

sample sizes of 3, 3, 3, 3, 3

      h     exact    approx
  8.333    0.0496    0.0801
  9.200    0.0250    0.0563
 10.200    0.0099    0.0372
 10.733    0.0049    0.0297
 11.633    0.0010    0.0203
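The approximate values in the tables above are upper-tail chi-square probabilities with k - 1 degrees of freedom. A minimal sketch of the computation (the h values below are taken from the tables above):

```python
from scipy.stats import chi2

def kw_approx_pvalue(h, k):
    """Chi-square approximation to P(H >= h) for the
    Kruskal-Wallis statistic with k samples (k - 1 df)."""
    return chi2.sf(h, k - 1)

# three samples of size 5 (df = 2); the table gives exact 0.0509
print(round(kw_approx_pvalue(5.660, 3), 4))  # → 0.059
# five samples of size 3 (df = 4); the table gives exact 0.0496
print(round(kw_approx_pvalue(8.333, 5), 4))  # → 0.0801
```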
- p. 191, Large-Sample Approximation
- I think it's horrible that H&W don't supply a proper table of
critical values for chi-square distributions, since one may lose
appreciable accuracy when using Chart A.2. (Note that on p. 193,
H&W get a value of 0.64 from the chart when the value should be 0.68
(see Minitab output on p. 194).)
- pp. 191-192, Ties
- Both Minitab and StatXact use the correction for ties
given by (6.8) when performing an approximate version of the test.
- pp. 192-193, Example 6.1
- If you try to do this one using StatXact 5, you'll find that it
will produce an exact p-value (very quickly, on my machine), and so there is
no need to do a Monte Carlo approximation. (It used to be that
StatXact could not even do an exact computation of a p-value from
a Kruskal-Wallis test with
samples as small as these, but the latest version of StatXact can
do more than earlier versions could.)
You can also try some of the other tests for k independent
samples on StatXact's menu. You should get p-values to match
those given below.
Example 6.1           exact     approx
Kruskal-Wallis        0.7108    0.6800
normal scores         0.5795    0.5484
Savage scores         0.3704    0.3338
permutation/ANOVA     0.5807    0.5484
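For those without StatXact, the kind of p-value that StatXact's Monte Carlo option produces can be approximated by brute-force permutation resampling. A minimal sketch with made-up samples (not the Example 6.1 data):

```python
import numpy as np
from scipy.stats import kruskal

# hypothetical samples (NOT the Example 6.1 data)
samples = [[2.1, 3.4, 1.9], [2.8, 4.0, 3.6], [1.2, 2.5, 2.2]]

h_obs, p_asym = kruskal(*samples)  # chi-square approximation

# Monte Carlo permutation p-value: shuffle the pooled observations,
# re-split into groups of the original sizes, and recompute H
rng = np.random.default_rng(0)
pooled = np.concatenate(samples)
cuts = np.cumsum([len(s) for s in samples])[:-1]
reps, count = 2000, 0
for _ in range(reps):
    h_perm, _ = kruskal(*np.split(rng.permutation(pooled), cuts))
    count += h_perm >= h_obs
print(p_asym, (count + 1) / (reps + 1))  # asymptotic vs Monte Carlo
```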
- p. 194, Comment 1
- One example for which the more general setting can be used
is an analysis of the randomness of selections of numbered balls from an
urn, and in particular, in an analysis of the draft lottery data (which
I may go over in class at some point this semester).
- p. 194, Comment 2
- I don't know why, in the first displayed equation, the
(N-1)!/N! isn't replaced by 1/N.
- pp. 196-197, Comment 8
- This comment indicates how StatXact gets an exact p-value.
- p. 198, Comment 11
- To me, the Behrens-Fisher problem refers to the problem of
doing accurate tests about means of normal distributions when it
cannot be assumed that the variances are all the same. For nonnormal
distributions, I'd use the phrase
generalized Behrens-Fisher problem. The Rust-Fligner test
referred to isn't very useful because of the requirement of symmetry, and
due to its questionable accuracy.
- p. 198, Comment 12
- Steel and Dwass (who didn't work together, but arrived at the same
test at about the same time while working separately) developed a
pairwise version of the K-W test in 1960, and so one might find it odd
(but, on the other hand, entirely consistent with H&W's pattern of
favoring people from FSU and OSU) that Fligner's 1985 paper is
highlighted here.
- pp. 199-200, Problem 6.4
- I encourage you to try this data set with StatXact. The
sample sizes are too large to get exact p-values for the Kruskal-Wallis
test, the normal scores test, the Savage scores test, and the
permutation (ANOVA) test, and so you need to use the Monte Carlo option,
or else settle for an asymptotic result. (If you try to obtain an exact
p-value you'll get an indication that the sample sizes are too large.)
I've put a description of
how to use StatXact's Monte Carlo option
here.
You can get an exact p-value using StatXact's (k-sample)
median test (aka Mood's test) with this data. The exact p-value is
about 1.2 E-6, which is smaller than the asymptotic p-value of 4.3 E-5
from the K-W test. (I'll discuss StatXact's k-sample
median test, normal scores test, Savage scores test, and permutation
test in class on Oct. 3.)
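The k-sample median test mentioned above also has an asymptotic (chi-square) analogue in scipy. A minimal sketch with made-up samples (not the Problem 6.4 data); note that scipy.stats.median_test gives only the asymptotic p-value, not StatXact's exact one:

```python
from scipy.stats import median_test

# hypothetical samples (NOT the Problem 6.4 data)
g1 = [12, 15, 11, 19, 14]
g2 = [22, 25, 21, 18, 24]
g3 = [31, 28, 35, 30, 27]

# classifies each observation as above or not above the grand
# median and runs a chi-square test on the resulting 2 x k table
stat, p, grand_median, table = median_test(g1, g2, g3)
print(grand_median, round(p, 4))  # grand median is 22.0
```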
Section 6.2
H&W focus on the Jonckheere-Terpstra test in this section, but there is
another procedure which is similar that can also be done using
StatXact, and I'll discuss it during class as well.
I'll offer some specific comments
about the text below.
- pp. 205-206, Example 6.2
- There is no need to use the Monte Carlo option --- StatXact 5
can be used to obtain an exact p-value. The exact p-value is about
0.0210. This differs from the value obtained from Table A.13 of H&W due
to the ties --- because of the ties, the table in H&W cannot be used to
obtain an exact p-value (unless one chose to do a conservative test or
had some other scheme for breaking ties). StatXact's asymptotic
p-value of 0.0207 matches the approximate p-value given in H&W (because
both use the same scheme). (Note: A continuity correction sometimes
improves the normal approximation, but in this case it does not.
(StatXact does not employ a continuity correction, and neither
does H&W.))
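The J-T statistic itself is easy to compute directly as a sum of pairwise Mann-Whitney counts. A minimal sketch with made-up samples (not the Example 6.2 data), counting ties as 1/2:

```python
import itertools

def jt_statistic(samples):
    """Jonckheere-Terpstra statistic: for each ordered pair of
    samples (u < v), count the (x, y) pairs with x from sample u
    below y from sample v, counting ties as 1/2, and sum."""
    j = 0.0
    for a, b in itertools.combinations(samples, 2):
        for x in a:
            for y in b:
                j += (x < y) + 0.5 * (x == y)
    return j

# hypothetical samples, listed in the hypothesized increasing order
print(jt_statistic([[1, 3], [2, 5], [4, 6]]))  # → 10.0
```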
- p. 206, Comment 16
- If it can be believed that either two distributions are identical,
or that one is stochastically larger than the other if they differ, and
that there is a natural ordering of the means/distributions if they are
not all the same, then the J-T test can be interpreted as a test of a
monotone alternative involving the distribution means (similar to the
Abelson-Tukey test based on an assumption of normality).
Section 6.3
I had never seen anything about the tests covered in this section until
I opened my H&W book when I received it in August. So it is safe to
assume that the material in this section is not mainstream stuff.
Since the tests of this section are awkward to perform (they
aren't included in statistical software packages), with the test of
6.3.B being especially awkward, and since the test of 6.3.B isn't
fully developed except for a limited number of small
sample size situations, I'll focus on the test of 6.3.A, and give you an
exercise or two pertaining to it, but not place much importance on
mastering 6.3.B at this time.
An important thing to realize is that when using these tests, like when
using the tests of Sec. 6.2, there is no built in protection against
getting a small p-value with high probability when the alternative
hypothesis is not true. The interpretation of a small p-value is
what it is meant to be for these tests if either the null hypothesis or
the alternative hypothesis is true, but if something else is true, then
a small p-value could result, and so some care should be taken when
interpreting a small p-value to be strong evidence supporting the
alternative hypothesis. For example, for the test of 6.3.A, if
the true peak is at p - 1 or p + 1, as opposed to the
hypothesized p, then enough of the
"component" Mann-Whitney statistics may be "leaning the right way" to
produce an overall test statistic value that leads to a rejection of the
null hypothesis.
I'll offer some specific comments
about the text below.
- p. 213 (last line)
- The tables only cover equal sample size situations, with sample
sizes ranging from 2 to 5, and so they are of limited usefulness.
To get some feel for how accurate the normal approximation described on
p. 214 is, one could compare the approximated upper-tail probabilities
to the exact values for a case in which the sample sizes equal 5 (the
largest sample sizes for which an exact distribution is readily
available). I'll design a homework exercise along these lines.
- p. 214, Large-Sample Approximation
- Hopefully you realize that (6.36) indicates that the p-value is
just the upper-tail probability --- the area under the standard normal
pdf corresponding to values greater than or equal to the observed value
of A*_p.
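In code, (6.36) is just an upper-tail standard normal probability. A minimal sketch (1.92 is the critical value used in Example 6.3 on p. 215):

```python
from scipy.stats import norm

def mack_wolfe_approx_pvalue(a_star):
    """Normal-approximation p-value per (6.36): the upper-tail
    standard normal probability at the standardized statistic."""
    return norm.sf(a_star)

# the critical value 1.92 of Example 6.3 corresponds to the level:
print(round(mack_wolfe_approx_pvalue(1.92), 4))  # → 0.0274
```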
- p. 215, Example 6.3
- I have no idea why they chose to consider a test having an
approximate level of 0.0274, which results in a critical value of 1.92.
Why not use one of the more common values, like 0.05 or 0.025? Or why
not just determine the p-value?
- p. 217, Comment 26
- I don't think that the Mack-Wolfe test is included in any
readily available statistical software. That, combined with the fact
that umbrella alternatives aren't commonly considered, makes the
Mack-Wolfe test a seldom-used test.
- p. 221, Comment 30
- I may go through the derivation of the null expected value in
class.
- pp. 223-224, Comment 31
- Note that one doesn't have to believe that a shift model holds to
use the Mack-Wolfe test. One can use the Mack-Wolfe test to test the
null hypothesis that all of the distributions are the same against the
umbrella alternative that either the "adjacent distributions" are the
same, or they differ in that one is stochastically larger than the other
in the "direction" indicated by the alternative.
Along these lines, I disagree with Comment 36 on p. 230 and
Comment 4 on p. 195 --- one needs equal variances under the null
hypothesis (since the null hypothesis is one of identical
distributions), but one doesn't have to believe that the variances are
equal (or even that the distributions have the same general shape) if
the alternative is true (provided that my "stochastically larger
assumption" for "adjacent distributions" is believed). However, if one
wants to do tests involving
umbrella alternatives pertaining to the means/medians, and allow for the
possibility that the variances aren't the same under the null hypothesis of
equal means/medians, then it needs to be kept in mind that the tests presented
in Sec. 6.3 are not valid.
- p. 227, Procedure
- I'm not very comfortable with Mack and Wolfe's suggestions
(the last 3 sentences before Ties) for
handling cases not covered by the tables. My guess is that more studies
need to be done to confirm that their recommendations are good.
- pp. 232-233, Comment 42
- Be sure to take note of Comment 42, since it extends the
usefulness of the tests of Sec. 6.3.
Section 6.4
The test of this section is based on a very simple idea, and so we won't
spend much time discussing this section. A nice thing about the test is
that
it can be performed using StatXact
(by doing a two-sample
W-M-W test). A bad thing about the
test is that if a rejection is obtained, the test doesn't identify any
of the treatments as being (statistically) significantly better
(stochastically larger, or stochastically smaller, as the case may be)
than the control (see Comment 53 on p. 238).
A test that can be used to identify particular
treatments as being significantly better than the control is given in
Sec. 6.7. However, if several of the treatments are just mildly better
than the control, the test of Sec. 6.7 can have low power for rejecting
the null hypothesis, while the test of this section can be more powerful
in such cases, since it uses the combined strength of the evidence
involving all of the samples.
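Since the test of this section reduces to a two-sample W-M-W test, it can be carried out with any W-M-W routine by pooling the treatment samples. A minimal sketch with made-up data (not data from H&W):

```python
import numpy as np
from scipy.stats import mannwhitneyu

# hypothetical control and treatment samples (NOT data from H&W)
control = [5.1, 4.8, 5.6, 5.0]
treatments = [[5.9, 6.1, 5.7], [5.4, 6.3, 5.8], [6.0, 5.5, 6.2]]

# Fligner-Wolfe: two-sample W-M-W of the pooled treatment
# observations against the control sample, one-sided for
# "treatment observations stochastically larger than control"
pooled = np.concatenate(treatments)
u, p = mannwhitneyu(pooled, control, alternative='greater')
print(u, round(p, 4))
```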
For the test of this section, one could take the null hypothesis to be
that all of the distributions are identical (that all of the
observations are governed by the same distribution), or one could in
fact, in the case of seeking evidence that at least one of the
treatments has a distribution that is stochastically larger than the
control distribution (and thus has a larger mean), take the null
hypothesis to be that each of the treatment distributions is either
identical to the control distribution, or is stochastically smaller than
the control distribution. (But if too many of the treatment distributions are
stochastically smaller than the control distribution, the power to find
that at least one of them is stochastically larger will be greatly
diminished.) Either way, I don't have to assume that a shift model
holds. Since I don't want to be limited by believing assumption
A3 (the shift model assumption), I can ignore Comment
48 on p. 237 as well. I also don't agree with Comment 51 on p.
238, although, as I noted above, the power to claim that at least one of
the treatments is better (worse) than the control is diminished if any
of the treatments is actually worse (better) than the control.
I'll offer another specific comment
about the text below.
- p. 235
- Don't worry about the comments on the top portion of the page
pertaining to the use of the tables, since we can use StatXact to
perform the test. (I'll put instructions for doing so on my
StatXact web page.)
Section 6.5
The first paragraph of this section contains "the multiple comparison
procedure of the section would generally be applied to one-way layout
data after rejection of H0 (6.2) with the
Kruskal-Wallis procedure from Section 6.1." While that may (or may not)
be true in common practice, the Steel-Dwass procedure of this section
need not be used that way --- it can be used in place of the
Kruskal-Wallis test, and in a lot of cases will be more powerful than
the K-W test (and in a lot of cases the K-W test will be the more
powerful of the two). In some cases you may reject with the K-W test,
and then apply the Steel-Dwass procedure and find that none of the
distributions are significantly different from any of the other
distributions. In other cases, you won't reject with the K-W test but
then find that the Steel-Dwass test indicates that not all of the
distributions are identical. So while I'm sure that some people use the
Steel-Dwass procedure as a follow-up to the K-W test when they obtain a
rejection, the two procedures are not guaranteed to yield results that
are in agreement.
I'll offer some specific comments
about the text below.
- p. 241, (6.62)
- This is equivalent to what I present in STAT 554, only in the
numerator I put the M-W U statistic minus its null expected
value.
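A sketch of the standardized quantity I describe (the M-W U statistic minus its null expected value, over its null standard deviation), with made-up samples and no tie correction:

```python
import math
from scipy.stats import mannwhitneyu

def standardized_u(x, y):
    """(U - mn/2) / sqrt(mn(m + n + 1)/12): the Mann-Whitney U
    statistic minus its null expected value, over its null
    standard deviation (no tie correction in this sketch)."""
    m, n = len(x), len(y)
    u = mannwhitneyu(x, y, alternative='two-sided').statistic
    return (u - m * n / 2) / math.sqrt(m * n * (m + n + 1) / 12)

# hypothetical pair of samples
print(round(standardized_u([1.2, 3.1, 2.5], [4.0, 5.2, 3.8]), 3))  # → -1.964
```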
- p. 241, (6.63)
- I think the "otherwise decide" part is bad. Just because there
isn't strong evidence to indicate a difference, it doesn't mean there is
strong evidence that the distributions are the same. If one applies
(6.63) as is stated in the book, then one can conclude that
distributions 1 and 2 are the same, that distributions 2 and 3 are the
same, but that distributions 1 and 3 are different (which violates the
transitive property that we learn in grammar school).
- p. 241, Large-Sample Approximation
- The critical values given in Table A.17 are quantiles of the
studentized range distributions having k and infinity degrees of
freedom. Studentized range distributions are related to maximum
differences from (studentized) pairwise comparisons of sample means from normal
distributions. The 2nd degree of freedom relates to the amount of
"uncertainty" in the variance estimate. The limiting distribution
"infinity case" corresponds to no uncertainty in the variance estimate,
and that's appropriate when used with a rank procedure even when the
sample sizes are small, since the exact variance is just a function of the
sample sizes and is not uncertain at all. It can also be noted that the
normality assumption is only approximately met --- the asymptotic
normality is being relied upon, and with small sample sizes we need to
use the tables for the exact null sampling distribution instead of the
studentized range critical values. Table A.17 is really nice since it
contains critical values for the studentized range distributions that
aren't that easy to come by.
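The "infinity degrees of freedom" case is just the distribution of the range of k iid standard normals (no uncertainty in the variance estimate), so its quantiles are easy to check by simulation. A minimal sketch for k = 3, where the tabled upper 5% point is about 3.31:

```python
import numpy as np

# studentized range with infinite denominator df = range of k iid
# N(0,1) draws; Monte Carlo sketch of its upper 5% point for k = 3
rng = np.random.default_rng(1)
draws = rng.standard_normal((200_000, 3))
ranges = draws.max(axis=1) - draws.min(axis=1)
print(np.quantile(ranges, 0.95))  # close to the tabled value of about 3.31
```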
- p. 241, Ties
- If one uses StatXact to compute the z-scores, then
the proper adjustment for ties is done automatically. Look at my
StatXact web page to see how
StatXact can be used to help perform the Steel-Dwass procedure
(even though StatXact doesn't do the procedure in full).
- p. 244, Comment 57
- This comment in the text is particularly important. It suggests,
as I do above, that the Steel-Dwass procedure can be used to do a test,
and need not be viewed as a follow-up to a rejection obtained with a K-W
test.
- p. 244, Comment 58
- The main point of this comment is very important, but it may be
too well hidden in the rest of the content of the comment to fully
register! The point is that the larger k is, the more different
any two samples have to be in order to be identified as being
significantly different using the Steel-Dwass procedure. One needs to
keep in mind that when a whole bunch of random samples are compared in a
pairwise fashion, by chance two of them may appear to be somewhat
different even though the underlying distributions are all the same ---
the more comparisons that are done, the more likely it is to obtain an
unusual result (i.e., two samples appearing to come from different
distributions) just by chance. To protect against making a type I error
when doing so many comparisons, the "differentness threshhold" has to be
pretty high. A consequence of this is that some odd things can happen.
For example, you may have three samples, and determine that all three
underlying distributions are different using the Steel-Dwass procedure
applied to the three samples. Then later you may add 10 more samples
that come from the same distribution as the third sample. When the Steel-Dwass procedure is applied to the 13 samples, it may be that there isn't
strong evidence to conclude that any of them differ from any of the
others, and so in particular, one cannot conclude that samples 1, 2 and
3 come from different distributions, even though that was the conclusion
reached when the Steel-Dwass procedure was applied to just those three
samples. (When the S-D procedure was applied to only three samples,
then only three pairwise comparisons are necessary, and each comparison
can indicate different distributions, even when the S-D procedure is
taking into account that three comparisons are being made. But when the
S-D procedure is applied to 13 samples, then 78 (13 choose 2) pairwise
comparisons are being done, and when the S-D procedure takes into
account that when 78 pairwise comparisons are done, then at least one of
them can be fairly unusual (suggesting a difference) just by chance even
though all 13 distributions are identical, the apparent differences
originally suggested from the first three samples are no longer strongly
suggestive of distribution differences.)
- p. 246, Comment 60
- This comment indicates why Critchlow and Fligner should perhaps
also be worked into the name of the procedure that for years I've
referred to as the Steel-Dwass procedure. They extended the method
developed by Steel and Dwass to work with unequal sample sizes.
- pp. 247-248, Comment 63
- The joint ranking approach is what I refer to as the rank
analog to the Tukey-Kramer test in STAT 554. It is also included in
Miller's Beyond ANOVA: Basics of Applied Statistics. Since H&W
don't give any tables for this test (which I find odd, given the nature
of the rest of H&W, since Wolfe helped to further the development of the
procedure), I won't emphasize it. (My STAT 554 class notes, as well as
Miller's Beyond ANOVA, and another book on simultaneous
inferences written by Miller, give a large sample approximate version of
the test that only requires the same studentized range critical values
as the S-D-C-F procedure, but H&W gives references for exact tables if
one really wants to use this test with small samples.) My experience
indicates that in a lot of situations, this joint ranking approach
yields a result very similar to the S-D-C-F procedure, and so perhaps
not a lot is lost in not routinely trying this procedure too.
Section 6.6
The section pertains to a procedure that is not commonly used, even
though it seems like a natural follow-up to a J-T test. (Note that it
need not only be used as a follow-up to a J-T test --- one could use it
in place of a J-T test.)
I'll offer another specific comment
about the text below.
- p. 252, Example 6.7
- Note the lack of agreement of the J-T test and the Hayter-Stone
multiple comparison procedure of this section. The J-T test indicates
that not all of the distributions are the same, but the H-S procedure
doesn't identify any pair of them as being different.
Section 6.7
The section pertains to a procedure that is not commonly used, even
though it seems like a natural follow-up to the Fligner-Wolfe test of
Sec. 6.4. (Note that it
need not only be used as a follow-up to a F-W test --- one could use it
in place of a F-W test, as is described in Comment 73 on p. 258.)
I'll offer some specific comments
about the text below.
- p. 254, (last line)
- Note that with this procedure you jointly rank the combined sample
of all of the treatment and control observations (from 1 to N,
as is done for the K-W test),
instead of doing pairwise ranking as is required for the procedures of
Sec. 6.2, Sec. 6.3, Sec. 6.5, and Sec. 6.6.
- p. 257, Comment 71
- Note that this comment extends the usefulness of the test. Another
way to do the test for the "reversed direction" would be to just
multiply each observation by -1 and do the test as is described in the
main part of the section.
- p. 258, Comment 74
- This comment pertains to a somewhat weird aspect of the procedure.
Section 6.8
I'm not going to cover this section in class. The estimates only make
sense if one can assume a k-sample shift model, and I don't think
that this is usually the case.
Section 6.9
I'm not going to cover this section in class. The estimates only make
sense if one can assume a k-sample shift model, and I don't think
that this is usually the case.
Section 6.10
I'm not going to cover this section in class --- I'll let you read over
it on your own and ask questions if desired. (Note: Sections
that I skip, like this one, will not be covered by the final exam.)