Some Comments about Chapter 6 of Hollander & Wolfe
Assumption A3 stipulates a shift model. If we have a shift model,
and determine that a pair of distributions differ, then we can conclude that
the means, medians, and quantiles all differ for that pair of
distributions. However, I don't think a shift model is very realistic in
most k sample settings, and so in general concluding that two
distributions are different does not allow us to draw a conclusion about
the means or medians. To make matters more fuzzy, for some of the tests
included in this chapter, a small p-value will just mean that at least
one of the distributions is different, and the test doesn't provide any
specific information about which distributions are different from which
other distributions.
The X_ij notation indicated at the top of page 190 is
not what I'm used to (and so in class I may reverse things --- either
intentionally or unintentionally). H&W use the j (2nd
index) to indicate the group, and use the i (1st index) to
indicate the observation number within a group. The usual convention
for one-way designs is to reverse the roles, and let the 1st
index indicate the group, and the 2nd index indicate the
observation number within the group.
Section 6.1
H&W focus on the Kruskal-Wallis test in this section, but during
class I'll also describe some similar tests that can be done using
StatXact.
I'll offer some specific comments
about the text below.
- p. 191, (6.4)
- The notation used in (6.4) is not what I'm used to. I prefer to
use a dot to replace the observation number index to indicate summation
over that index (giving a sum for the group/sample), and then put a bar
over the indicated sum to represent the group average (sample mean).
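Written in the convention I prefer, (6.4) takes the form below. This is a sketch in my notation (first index for the group, dot for summation over the observation index, bar for the average), not H&W's rendering:

```latex
% R_{ij}: combined-sample rank of the j-th observation in group i
% R_{i.} = \sum_j R_{ij} (group rank sum), \bar{R}_{i.} = R_{i.} / n_i
H = \frac{12}{N(N+1)} \sum_{i=1}^{k} n_i
    \left( \bar{R}_{i\cdot} - \frac{N+1}{2} \right)^{2}
```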
- p. 191, Large-Sample Approximation
- The tables below compare the approximate values of
P(H >= h) to the exact values, for various values of h.
My guess is that outside the set of small sample sizes covered
by Table A.12 of H&W, if all of the sample sizes are at least 5 or
6, then the chi-square approximation does "okay" for
approximating p-values close to 0.05, and perhaps 0.025, but that it can
be off by a factor of 2 or more (and by as much as a factor of 5 in some
cases I'll guess) when used to approximate smaller p-values, with the
approximate p-values being larger than they should be, which suggests
diminished power to reject the null hypothesis in
cases for which it should be rejected. This serves to indicate that
StatXact should be used to perform the Kruskal-Wallis test in
small sample settings not covered by the table.
(I'll also guess that if all of the sample sizes are at least 10, then
the approximation does okay except for really small p-values (say
approximating p-values less than 0.001).)
sample sizes of 5, 5, 5

      h     exact    approx
  5.660    0.0509    0.0590
  6.740    0.0248    0.0344
  7.980    0.0105    0.0185
  8.780    0.0050    0.0124
  9.920    0.0010    0.0070

sample sizes of 6, 6, 6

      h     exact    approx
  5.719    0.0502    0.0573
  6.889    0.0249    0.0319
  8.222    0.0099    0.0164
  9.088    0.0050    0.0106
 10.819    0.0010    0.0045

sample sizes of 7, 7, 7

      h     exact    approx
  5.766    0.0506    0.0560
  6.954    0.0245    0.0309
  8.378    0.0099    0.0152
  9.373    0.0049    0.0092
 11.288    0.0010    0.0035

sample sizes of 8, 8, 8

      h     exact    approx
  5.805    0.0497    0.0549
  6.995    0.0249    0.0303
  8.465    0.0099    0.0145
  9.495    0.0049    0.0087
 11.625    0.0010    0.0030

sample sizes of 4, 4, 4, 4

      h     exact    approx
  7.213    0.0507    0.0654
  8.228    0.0248    0.0415
  9.287    0.0100    0.0257
  9.971    0.0049    0.0188
 11.338    0.0010    0.0100

sample sizes of 3, 3, 3, 3, 3

      h     exact    approx
  8.333    0.0496    0.0801
  9.200    0.0250    0.0563
 10.200    0.0099    0.0372
 10.733    0.0049    0.0297
 11.633    0.0010    0.0203
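The approximate values in the tables above are upper-tail chi-square probabilities with k - 1 degrees of freedom. A minimal sketch of the computation (the h values below are taken from the tables above):

```python
from scipy.stats import chi2

def kw_approx_pvalue(h, k):
    """Chi-square approximation to P(H >= h) for the
    Kruskal-Wallis statistic with k samples (k - 1 df)."""
    return chi2.sf(h, k - 1)

# three samples of size 5 (df = 2); the table gives exact 0.0509
print(round(kw_approx_pvalue(5.660, 3), 4))  # → 0.059
# five samples of size 3 (df = 4); the table gives exact 0.0496
print(round(kw_approx_pvalue(8.333, 5), 4))  # → 0.0801
```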
- p. 191, Large-Sample Approximation
- I think it's horrible that H&W don't supply a proper table of
critical values for chi-square distributions, since one may lose
appreciable accuracy when using Chart A.2. (Note that on p. 193,
H&W get a value of 0.64 from the chart when the value should be 0.68
(see Minitab output on p. 194).)
- pp. 191-192, Ties
- Both Minitab and StatXact use the correction for ties
given by (6.8) when performing an approximate version of the test.
- pp. 192-193, Example 6.1
- If you try to do this one using StatXact 5, you'll find that it
will produce an exact p-value (very quickly, on my machine), and so there is
no need to do a Monte Carlo approximation. (It used to be that
StatXact could not even do an exact computation of a p-value from
a Kruskal-Wallis test with
samples as small as these, but the latest version of StatXact can
do more than earlier versions could.)
You can also try some of the other tests for k independent
samples on StatXact's menu. You should get p-values to match
those given below.
Example 6.1           exact     approx
Kruskal-Wallis        0.7108    0.6800
normal scores         0.5795    0.5484
Savage scores         0.3704    0.3338
permutation/ANOVA     0.5807    0.5484
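For those without StatXact, the kind of p-value that StatXact's Monte Carlo option produces can be approximated by brute-force permutation resampling. A minimal sketch with made-up samples (not the Example 6.1 data):

```python
import numpy as np
from scipy.stats import kruskal

# hypothetical samples (NOT the Example 6.1 data)
samples = [[2.1, 3.4, 1.9], [2.8, 4.0, 3.6], [1.2, 2.5, 2.2]]

h_obs, p_asym = kruskal(*samples)  # chi-square approximation

# Monte Carlo permutation p-value: shuffle the pooled observations,
# re-split into groups of the original sizes, and recompute H
rng = np.random.default_rng(0)
pooled = np.concatenate(samples)
cuts = np.cumsum([len(s) for s in samples])[:-1]
reps, count = 2000, 0
for _ in range(reps):
    h_perm, _ = kruskal(*np.split(rng.permutation(pooled), cuts))
    count += h_perm >= h_obs
print(p_asym, (count + 1) / (reps + 1))  # asymptotic vs Monte Carlo
```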
- p. 194, Comment 1
- One example for which the more general setting can be used
is an analysis of the randomness of selections of numbered balls from an
urn, and in particular, in an analysis of the draft lottery data (which
I may go over in class at some point this semester).
- p. 194, Comment 2
- I don't know why, in the first displayed equation, the
(N-1)!/N! isn't replaced by 1/N.
- pp. 196-197, Comment 8
- This comment indicates how StatXact gets an exact p-value.
- p. 198, Comment 11
- To me, the Behrens-Fisher problem refers to the problem of
doing accurate tests about means of normal distributions when it
cannot be assumed that the variances are all the same. For nonnormal
distributions, I'd use the phrase
generalized Behrens-Fisher problem. The Rust-Fligner test
referred to isn't very useful because of the requirement of symmetry, and
due to its questionable accuracy.
- p. 198, Comment 12
- Steel and Dwass (who didn't work together, but arrived at the same
test at about the same time while working separately) developed a
pairwise version of the K-W test in 1960, and so one might find it odd
(but, on the other hand, entirely consistent with H&W's pattern of
favoring people from FSU and OSU) that Fligner's 1985 paper is
highlighted here.
- pp. 199-200, Problem 6.4
- I encourage you to try this data set with StatXact. The
sample sizes are too large to get exact p-values for the Kruskal-Wallis
test, the normal scores test, the Savage scores test, and the
permutation (ANOVA) test, and so you need to use the Monte Carlo option,
or else settle for an asymptotic result. (If you try to obtain an exact
p-value you'll get an indication that the sample sizes are too large.)
I've put a description of
how to use StatXact's Monte Carlo option
here.
You can get an exact p-value using StatXact's (k-sample)
median test (aka Mood's test) with this data. The exact p-value is
about 1.2 E-6, which is smaller than the asymptotic p-value of 4.3 E-5
from the K-W test. (I'll discuss StatXact's k-sample
median test, normal scores test, Savage scores test, and permutation
test in class on Oct. 3.)
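The k-sample median test mentioned above also has an asymptotic (chi-square) analogue in scipy. A minimal sketch with made-up samples (not the Problem 6.4 data); note that scipy.stats.median_test gives only the asymptotic p-value, not StatXact's exact one:

```python
from scipy.stats import median_test

# hypothetical samples (NOT the Problem 6.4 data)
g1 = [12, 15, 11, 19, 14]
g2 = [22, 25, 21, 18, 24]
g3 = [31, 28, 35, 30, 27]

# classifies each observation as above or not above the grand
# median and runs a chi-square test on the resulting 2 x k table
stat, p, grand_median, table = median_test(g1, g2, g3)
print(grand_median, round(p, 4))  # grand median is 22.0
```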
Section 6.2
H&W focus on the Jonckheere-Terpstra test in this section, but there is
another procedure which is similar that can also be done using
StatXact, and I'll discuss it during class as well.
I'll offer some specific comments
about the text below.
- pp. 205-206, Example 6.2
- There is no need to use the Monte Carlo option --- StatXact 5
can be used to obtain an exact p-value. The exact p-value is about
0.0210. This differs from the value obtained from Table A.13 of H&W due
to the ties --- because of the ties, the table in H&W cannot be used to
obtain an exact p-value (unless one chose to do a conservative test or
had some other scheme for breaking ties). StatXact's asymptotic
p-value of 0.0207 matches the approximate p-value given in H&W (because
both use the same scheme). (Note: A continuity correction sometimes
improves the normal approximation, but in this case it does not.
(StatXact does not employ a continuity correction, and neither
does H&W.))
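The J-T statistic itself is easy to compute directly as a sum of pairwise Mann-Whitney counts. A minimal sketch with made-up samples (not the Example 6.2 data), counting ties as 1/2:

```python
import itertools

def jt_statistic(samples):
    """Jonckheere-Terpstra statistic: for each ordered pair of
    samples (u < v), count the (x, y) pairs with x from sample u
    below y from sample v, counting ties as 1/2, and sum."""
    j = 0.0
    for a, b in itertools.combinations(samples, 2):
        for x in a:
            for y in b:
                j += (x < y) + 0.5 * (x == y)
    return j

# hypothetical samples, listed in the hypothesized increasing order
print(jt_statistic([[1, 3], [2, 5], [4, 6]]))  # → 10.0
```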
- p. 206, Comment 16
- If it can be believed that either two distributions are identical,
or that one is stochastically larger than the other if they differ, and
that there is a natural ordering of the means/distributions if they are
not all the same, then the J-T test can be interpreted as a test of a
monotone alternative involving the distribution means (similar to the
Abelson-Tukey test based on an assumption of normality).
Section 6.3
I had never seen anything about the tests covered in this section until
I opened my H&W book when I received it in August. So it is safe to
assume that the material in this section is not mainstream stuff.
Since the tests of this section are awkward to perform (they
aren't included in statistical software packages), with the test of
6.3.B being especially awkward, and since the test of 6.3.B isn't
fully developed except for a limited number of small
sample size situations, I'll focus on the test of 6.3.A, and give you an
exercise or two pertaining to it, but not place much importance on
mastering 6.3.B at this time.
An important thing to realize is that when using these tests, like when
using the tests of Sec. 6.2, there is no built in protection against
getting a small p-value with high probability when the alternative
hypothesis is not true. The interpretation of a small p-value is
what it is meant to be for these tests if either the null hypothesis or
the alternative hypothesis is true, but if something else is true, then
a small p-value could result, and so some care should be taken when
interpreting a small p-value to be strong evidence supporting the
alternative hypothesis. For example, for the test of 6.3.A, if
the true peak is at p - 1 or p + 1, as opposed to the
hypothesized p, then enough of the
"component" Mann-Whitney statistics may be "leaning the right way" to
produce an overall test statistic value that leads to a rejection of the
null hypothesis.
I'll offer some specific comments
about the text below.
- p. 213 (last line)
- The tables only cover equal sample size situations, with sample
sizes ranging from 2 to 5, and so they are of limited usefulness.
To get some feel for how accurate the normal approximation described on
p. 214 is, one could compare the approximated upper-tail probabilities
to the exact values for a case in which the sample sizes equal 5 (the
largest sample sizes for which an exact distribution is readily
available). I'll design a homework exercise along these lines.
- p. 214, Large-Sample Approximation
- Hopefully you realize that (6.36) indicates that the p-value is
just the upper-tail probability --- the area under the standard normal
pdf corresponding to values greater than or equal to the observed value
of A*_p.
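In code, (6.36) is just an upper-tail standard normal probability. A minimal sketch (1.92 is the critical value used in Example 6.3 on p. 215):

```python
from scipy.stats import norm

def mack_wolfe_approx_pvalue(a_star):
    """Normal-approximation p-value per (6.36): the upper-tail
    standard normal probability at the standardized statistic."""
    return norm.sf(a_star)

# the critical value 1.92 of Example 6.3 corresponds to the level:
print(round(mack_wolfe_approx_pvalue(1.92), 4))  # → 0.0274
```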
- p. 215, Example 6.3
- I have no idea why they chose to consider a test having an
approximate level of 0.0274, which results in a critical value of 1.92.
Why not use one of the more common values, like 0.05 or 0.025? Or why
not just determine the p-value?
- p. 217, Comment 26
- I don't think that the Mack-Wolfe test is included in any
readily available statistical software. That, combined with the fact
that umbrella alternatives aren't commonly considered, makes the
Mack-Wolfe test a seldom-used test.
- p. 221, Comment 30
- I may go through the derivation of the null expected value in
class.
- pp. 223-224, Comment 31
- Note that one doesn't have to believe that a shift model holds to
use the Mack-Wolfe test. One can use the Mack-Wolfe test to test the
null hypothesis that all of the distributions are the same against the
umbrella alternative that either the "adjacent distributions" are the
same, or they differ in that one is stochastically larger than the other
in the "direction" indicated by the alternative.
Along these lines, I disagree with Comment 36 on p. 230 and
Comment 4 on p. 195 --- one needs equal variances under the null
hypothesis (since the null hypothesis is one of identical
distributions), but one doesn't have to believe that the variances are
equal (or even that the distributions have the same general shape) if
the alternative is true (provided that my "stochastically larger
assumption" for "adjacent distributions" is believed). However, if one
wants to do tests involving
umbrella alternatives pertaining to the means/medians, and allow for the
possibility that the variances aren't the same under the null hypothesis of
equal means/medians, then it needs to be kept in mind that the tests presented
in Sec. 6.3 are not valid.
- p. 227, Procedure
- I'm not very comfortable with Mack and Wolfe's suggestions
(the last 3 sentences before Ties) for
handling cases not covered by the tables. My guess is that more studies
need to be done to confirm that their recommendations are good.
- pp. 232-233, Comment 42
- Be sure to take note of Comment 42, since it extends the
usefulness of the tests of Sec. 6.3.
Section 6.4
The test of this section is based on a very simple idea, and so we won't
spend much time discussing this section. A nice thing about the test is
that
it can be performed using StatXact
(by doing a two-sample
W-M-W test). A bad thing about the
test is that if a rejection is obtained, the test doesn't identify any
of the treatments as being (statistically) significantly better
(stochastically larger, or stochastically smaller, as the case may be)
than the control (see Comment 53 on p. 238).
A test that can be used to identify particular
treatments as being significantly better than the control is given in
Sec. 6.7. However, if several of the treatments are just mildly better
than the control, the test of Sec. 6.7 can have low power for rejecting
the null hypothesis, while the test of this section can be more powerful
in such cases, since it uses the combined strength of the evidence
involving all of the samples.
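Since the test of this section reduces to a two-sample W-M-W test, it can be carried out with any W-M-W routine by pooling the treatment samples. A minimal sketch with made-up data (not data from H&W):

```python
import numpy as np
from scipy.stats import mannwhitneyu

# hypothetical control and treatment samples (NOT data from H&W)
control = [5.1, 4.8, 5.6, 5.0]
treatments = [[5.9, 6.1, 5.7], [5.4, 6.3, 5.8], [6.0, 5.5, 6.2]]

# Fligner-Wolfe: two-sample W-M-W of the pooled treatment
# observations against the control sample, one-sided for
# "treatment observations stochastically larger than control"
pooled = np.concatenate(treatments)
u, p = mannwhitneyu(pooled, control, alternative='greater')
print(u, round(p, 4))
```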
For the test of this section, one could take the null hypothesis to be
that all of the distributions are identical (that all of the
observations are governed by the same distribution), or one could in
fact, in the case of seeking evidence that at least one of the
treatments has a distribution that is stochastically larger than the
control distribution (and thus has a larger mean), take the null
hypothesis to be that each of the treatment distributions is either
identical to the control distribution, or is stochastically smaller than
the control distribution. (But if too many of the treatment distributions are
stochastically smaller than the control distribution, the power to find
that at least one of them is stochastically larger will be greatly
diminished.) Either way, I don't have to assume that a shift model
holds. Since I don't want to be limited by believing assumption
A3 (the shift model assumption), I can ignore Comment
48 on p. 237 as well. I also don't agree with Comment 51 on p.
238, although, as I noted above, the power to claim that at least one of
the treatments is better (worse) than the control is diminished if any
of the treatments is actually worse (better) than the control.
I'll offer another specific comment
about the text below.
- p. 235
- Don't worry about the comments on the top portion of the page
pertaining to the use of the tables, since we can use StatXact to
perform the test. (I'll put instructions for doing so on my
StatXact web page.)
Section 6.5
The first paragraph of this section contains "the multiple comparison
procedure of the section would generally be applied to one-way layout
data after rejection of H0 (6.2) with the
Kruskal-Wallis procedure from Section 6.1." While that may (or may not)
be true in common practice, the Steel-Dwass procedure of this section
need not be used that way --- it can be used in place of the
Kruskal-Wallis test, and in a lot of cases will be more powerful than
the K-W test (and in a lot of cases the K-W test will be the more
powerful of the two). In some cases you may reject with the K-W test,
and then apply the Steel-Dwass procedure and find that none of the
distributions are significantly different from any of the other
distributions. In other cases, you won't reject with the K-W test but
then find that the Steel-Dwass test indicates that not all of the
distributions are identical. So while I'm sure that some people use the
Steel-Dwass procedure as a follow-up to the K-W test when they obtain a
rejection, the two procedures are not guaranteed to yield results that
are in agreement.
I'll offer some specific comments
about the text below.
- p. 241, (6.62)
- This is equivalent to what I present in STAT 554, only in the
numerator I put the M-W U statistic minus its null expected
value.
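A sketch of the standardized quantity I describe (the M-W U statistic minus its null expected value, over its null standard deviation), with made-up samples and no tie correction:

```python
import math
from scipy.stats import mannwhitneyu

def standardized_u(x, y):
    """(U - mn/2) / sqrt(mn(m + n + 1)/12): the Mann-Whitney U
    statistic minus its null expected value, over its null
    standard deviation (no tie correction in this sketch)."""
    m, n = len(x), len(y)
    u = mannwhitneyu(x, y, alternative='two-sided').statistic
    return (u - m * n / 2) / math.sqrt(m * n * (m + n + 1) / 12)

# hypothetical pair of samples
print(round(standardized_u([1.2, 3.1, 2.5], [4.0, 5.2, 3.8]), 3))  # → -1.964
```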
- p. 241, (6.63)
- I think the "otherwise decide" part is bad. Just because there
isn't strong evidence to indicate a difference, it doesn't mean there is
strong evidence that the distributions are the same. If one applies
(6.63) as is stated in the book, then one can conclude that
distributions 1 and 2 are the same, that distributions 2 and 3 are the
same, but that distributions 1 and 3 are different (which violates the
transitive property that we learn in grammar school).
- p. 241, Large-Sample Approximation
- The critical values given in Table A.17 are quantiles of the
studentized range distributions having k and infinity degrees of
freedom. Studentized range distributions are related to maximum
differences from (studentized) pairwise comparisons of sample means from normal
distributions. The 2nd degree of freedom relates to the amount of
"uncertainty" in the variance estimate. The limiting distribution
"infinity case" corresponds to no uncertainty in the variance estimate,
and that's appropriate when used with a rank procedure even when the
sample sizes are small, since the exact variance is just a function of the
sample sizes and is not uncertain at all. It can also be noted that the
normality assumption is only approximately met --- the asymptotic
normality is being relied upon, and with small sample sizes we need to
use the tables for the exact null sampling distribution instead of the
studentized range critical values. Table A.17 is really nice since it
contains critical values for the studentized range distributions that
aren't that easy to come by.
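The "infinity degrees of freedom" case is just the distribution of the range of k iid standard normals (no uncertainty in the variance estimate), so its quantiles are easy to check by simulation. A minimal sketch for k = 3, where the tabled upper 5% point is about 3.31:

```python
import numpy as np

# studentized range with infinite denominator df = range of k iid
# N(0,1) draws; Monte Carlo sketch of its upper 5% point for k = 3
rng = np.random.default_rng(1)
draws = rng.standard_normal((200_000, 3))
ranges = draws.max(axis=1) - draws.min(axis=1)
print(np.quantile(ranges, 0.95))  # close to the tabled value of about 3.31
```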
- p. 241, Ties
- If one uses StatXact to compute the z-scores, then
the proper adjustment for ties is done automatically. Look at my
StatXact web page to see how
StatXact can be used to help perform the Steel-Dwass procedure
(even though StatXact doesn't do the procedure in full).
- p. 244, Comment 57
- This comment in the text is particularly important. It suggests,
as I do above, that the Steel-Dwass procedure can be used to do a test,
and need not be viewed as a follow-up to a rejection obtained with a K-W
test.
- p. 244, Comment 58
- The main point of this comment is very important, but it may be
too well hidden in the rest of the content of the comment to fully
register! The point is that the larger k is, the more different
any two samples have to be in order to be identified as being
significantly different using the Steel-Dwass procedure. One needs to
keep in mind that when a whole bunch of random samples are compared in a
pairwise fashion, by chance two of them may appear to be somewhat
different even though the underlying distributions are all the same ---
the more comparisons that are done, the more likely it is to obtain an
unusual result (i.e., two samples appearing to come from different
distributions) just by chance. To protect against making a type I error
when doing so many comparisons, the "differentness threshhold" has to be
pretty high. A consequence of this is that some odd things can happen.
For example, you may have three samples, and determine that all three
underlying distributions are different using the Steel-Dwass procedure
applied to the three samples. Then later you may add 10 more samples
that come from the same distribution as the third sample. When the Steel-Dwass procedure is applied to the 13 samples, it may be that there isn't
strong evidence to conclude that any of them differ from any of the
others, and so in particular, one cannot conclude that samples 1, 2 and
3 come from different distributions, even though that was the conclusion
reached when the Steel-Dwass procedure was applied to just those three
samples. (When the S-D procedure was applied to only three samples,
then only three pairwise comparisons are necessary, and each comparison
can indicate different distributions, even when the S-D procedure is
taking into account that three comparisons are being made. But when the
S-D procedure is applied to 13 samples, then 78 (13 choose 2) pairwise
comparisons are being done, and when the S-D procedure takes into
account that when 78 pairwise comparisons are done, then at least one of
them can be fairly unusual (suggesting a difference) just by chance even
though all 13 distributions are identical, the apparent differences
originally suggested from the first three samples are no longer strongly
suggestive of distribution differences.)
- p. 246, Comment 60
- This comment indicates why Critchlow and Fligner should perhaps
also be worked into the name of the procedure that for years I've
referred to as the Steel-Dwass procedure. They extended the method
developed by Steel and Dwass to work with unequal sample sizes.
- pp. 247-248, Comment 63
- The joint ranking approach is what I refer to as the rank
analog to the Tukey-Kramer test in STAT 554. It is also included in
Miller's Beyond ANOVA: Basics of Applied Statistics. Since H&W
don't give any tables for this test (which I find odd, given the nature
of the rest of H&W, since Wolfe helped to further the development of the
procedure), I won't emphasize it. (My STAT 554 class notes, as well as
Miller's Beyond ANOVA, and another book on simultaneous
inferences written by Miller, give a large sample approximate version of
the test that only requires the same studentized range critical values
as the S-D-C-F procedure, but H&W gives references for exact tables if
one really wants to use this test with small samples.) My experience
indicates that in a lot of situations, this joint ranking approach
yields a result very similar to the S-D-C-F procedure, and so perhaps
not a lot is lost in not routinely trying this procedure too.
Section 6.6
The section pertains to a procedure that is not commonly used, even
though it seems like a natural follow-up to a J-T test. (Note that it
need not only be used as a follow-up to a J-T test --- one could use it
in place of a J-T test.)
I'll offer another specific comment
about the text below.
- p. 252, Example 6.7
- Note the lack of agreement of the J-T test and the Hayter-Stone
multiple comparison procedure of this section. The J-T test indicates
that not all of the distributions are the same, but the H-S procedure
doesn't identify any pair of them as being different.
Section 6.7
The section pertains to a procedure that is not commonly used, even
though it seems like a natural follow-up to the Fligner-Wolfe test of
Sec. 6.4. (Note that it
need not only be used as a follow-up to a F-W test --- one could use it
in place of a F-W test, as is described in Comment 73 on p. 258.)
I'll offer some specific comments
about the text below.
- p. 254, (last line)
- Note that with this procedure you jointly rank the combined sample
of all of the treatment and control observations (from 1 to N,
as is done for the K-W test),
instead of doing pairwise ranking as is required for the procedures of
Sec. 6.2, Sec. 6.3, Sec. 6.5, and Sec. 6.6.
- p. 257, Comment 71
- Note that this comment extends the usefulness of the test. Another
way to do the test for the "reversed direction" would be to just
multiply each observation by -1 and do the test as is described in the
main part of the section.
- p. 258, Comment 74
- This comment pertains to a somewhat weird aspect of the procedure.
Section 6.8
I'm not going to cover this section in class. The estimates only make
sense if one can assume a k-sample shift model, and I don't think
that this is usually the case.
Section 6.9
I'm not going to cover this section in class. The estimates only make
sense if one can assume a k-sample shift model, and I don't think
that this is usually the case.
Section 6.10
I'm not going to cover this section in class --- I'll let you read over
it on your own and ask questions if desired. (Note: Sections
that I skip, like this one, will not be covered by the final exam.)