Some Comments about Chapter 7 of Hollander & Wolfe
This is a long chapter, with a lot of sections. I think that it will
make sense to cover the first part of the chapter (through Sec. 7.4)
rather thoroughly, while also inserting Quade's test (which we can do
using StatXact), and then focus on only several of the remaining
sections of the chapter.
As usual, H&W include an
Assumption A3 which stipulates a shift model.
Not only do they assume a shift model, but the model statement (given
about midway between (7.1) and (7.2) on p. 272) imposes an additive
structure --- with no allowance for interaction effects, the order
of the distribution medians and the distances
between distribution medians are the same for each block. My guess is
that a lot of times, data in a two-way layout should not be assumed to
follow a simple additive shift model.
The notation in the book isn't what I like to see --- they use rows (1st
index) for the block, and columns (2nd index) for the treatment (which
is opposite of what is typically done with two-way ANOVA models for
mixed effects). I guess I'll try to follow H&W's system for the
indexing in order to
match the book, but I won't use their
R_j and
R_.j notation --- instead I'm going to use the
convention of the dot indicating summation, and then adding a bar to
indicate an average. Also note that they use n to denote the
number of blocks, instead of the number of observations per cell as is
usually done with a two-way ANOVA. I guess I'll try to go with their
notation here. (At least one other nonparametric statistics book uses
n this way, and also reverses the rows and columns from the usual ANOVA
presentation. It's annoying.)
The blocks can be viewed as being either a fixed effect or a random
effect.
Section 7.1
Pages 272 and 273 give the basics about Friedman's test, which is the
one nonparametric test associated with a two-way layout which is
typically covered in a basic course on applied statistics.
I'll offer some specific comments
about the text below.
- p. 274
- H&W suggest that if there are ties, then compute the adjusted test
statistic, S', and compare it to the values in Table
A.22. Of course a better thing would be to use StatXact, which
handles ties as described in Comment 9. If one doesn't have
access to StatXact, an alternative to H&W's approximate scheme
would be to break the ties conservatively (to minimize the value of the
test statistic, and maximize the p-value), assigning only integer ranks,
and then go to Table A.22.
- p. 276, Example 7.1
- With n = 22 blocks, the chi-square approximation ought to do
okay, even if the p-value is in the neighborhood of 0.005. Still,
rather than depend on Minitab's approximate p-value
(Minitab uses the chi-square approximation regardless of what
n is), you might as well enter the data into StatXact and
obtain an exact p-value.
StatXact yields an approximate p-value of 0.0038, which is in
agreement with Minitab's 0.004. StatXact's exact p-value
is 0.0031. So while the chi-square approximation isn't horrible in this
case, it doesn't even get one significant digit correct.
(Note: If one is going to use Minitab, there is a better way to get
the values into C1 and C2.
To get the values into C1 enter the set C1 command, and
then enter 22(1) 22(2) 22(3) at the DATA> prompt (and then
end at the next DATA> prompt).
To get the values into C2 enter the set C2 command, and
then enter 3(1:22) at the DATA> prompt (and then
end at the next DATA> prompt).)
I'll also point out that if one incorrectly treated the data like 3
independent samples and did a K-W test, the resulting p-value is about
0.40 (both Monte Carlo estimate of the exact value, and the chi-square
approximation). Typically, the p-value which results from ignoring the
two-way design and incorrectly doing a K-W test is larger (but whether it is
larger or smaller, it is incorrect, because one doesn't have k
independent samples).
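Since the Example 7.1 data isn't reproduced here, the sketch below uses a small hypothetical data set; it checks the Friedman statistic formula against scipy's friedmanchisquare, which (like Minitab) reports the chi-square-approximation p-value regardless of n.

```python
from scipy.stats import friedmanchisquare, rankdata
import numpy as np

# Hypothetical two-way layout: rows = blocks (H&W's first index),
# columns = treatments (H&W's second index).
data = np.array([[1.2, 2.3, 3.1],
                 [2.2, 3.0, 1.9],
                 [1.1, 3.4, 2.8],
                 [1.5, 2.1, 3.3]])
n, k = data.shape

# Rank within each block, then sum the ranks for each treatment (R_.j).
ranks = np.apply_along_axis(rankdata, 1, data)
R = ranks.sum(axis=0)

# Friedman's statistic: S = 12/(n k (k+1)) * sum_j R_.j^2 - 3 n (k+1).
S = 12.0 / (n * k * (k + 1)) * np.sum(R**2) - 3 * n * (k + 1)

# scipy expects one sequence per treatment; it applies the chi-square
# approximation with k-1 df (regardless of n, like Minitab).
stat, p = friedmanchisquare(*data.T)
```

With n = 22 blocks, as in Example 7.1, the approximate p-value should be close to the exact one, but for small n an exact calculation (e.g., with StatXact) is still preferable.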
- p. 276, Comment 1
- Note that no allowance for interactions gives us an additive model.
For the Rounding First Base example (Example 7.1),
this means that it should be believed that the same method is best for
all of the players, and the same method is worst for all of the players.
The inclusion of an interaction term in the model would allow for some
players to do best with one method, and other players to do best with
another method. Of course, nonzero interaction terms make it more
difficult to draw a conclusion about which method is best. But I think
it makes sense to allow for interactions. If they were found to be
nonsignificant (say with an ANOVA F test, based on an assumption
of iid normal error terms), then one could feel better about adopting an
additive model and concluding that there may be an overall best method.
But if the test for interactions is significant, one should conclude
that the same method need not be best for every player (and more careful
testing should be done to help determine which method is actually the
best one for each player).
- p. 277, Comment 4
- This is an important comment, since it explains the motivation for
the test statistic. (In the first displayed equation, the
(k-1)!/k! can be replaced by 1/k (which can be
viewed as a simplification of the expression obtained using
combinatorics, or which can be viewed as resulting from a simple
symmetry argument --- under the null hypothesis, R_ij
is equally likely to take the values 1, 2, ..., k, and so each of
these values will occur with probability 1/k).) Comment 5
also provides some motivation for the test statistic --- it's a rank
version of the normal theory two-way ANOVA F test.
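The 1/k symmetry claim in Comment 4 is easy to check by brute force; here is a minimal enumeration (my own illustration, not from H&W) confirming that under the null hypothesis each position in a block holds each rank with probability 1/k:

```python
from itertools import permutations
from fractions import Fraction

k = 4
perms = list(permutations(range(1, k + 1)))  # all k! equally-likely rank orderings

# For every position j and rank r, the fraction of orderings with
# perm[j] == r is (k-1)!/k! = 1/k.
for j in range(k):
    for r in range(1, k + 1):
        p = Fraction(sum(1 for perm in perms if perm[j] == r), len(perms))
        assert p == Fraction(1, k)
```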
- pp. 277-278, Comment 6
- Assumption A3' indicates the more general null
hypothesis --- that the k observations (one from each treatment)
in each block arise from iid random variables, but that the distribution
of the random variables can be different from block to block. So
basically, we have a null hypothesis of no treatment differences (but
there can be block differences). If a
small p-value results, then there is strong evidence of differences in
the treatment distributions for at least some of the blocks. (With this
test applied to data from a randomized block design, it's hard to reach
firm conclusions about the treatments --- even if one assumes a shift
model (a common error term distribution for each cell), a small p-value
only implies that not all of the treatment effects are the same, but
the test doesn't provide any indication of which treatments are
different from which other treatments --- it could be that they are all
different, or it could be that only one differs from the rest, which are
the same, with the one that is different being either smaller or larger
than the others.)
- p. 278, Comment 8
- This is a good example of the type of thing that I refer to as a
"brute force" derivation of a null sampling distribution.
- p. 281, Comment 11
- StatXact gives us some alternatives not included in H&W.
One can consider Quade's test, and possibly stratified two-sample tests (if there
are just two treatments).
Section 7.2
This is a two-way layout version of the test that I referred to in class
as being a competitor to the J-T test --- the one which can be done using
StatXact's Linear-by-linear Association test, applied to
data in a one-way layout. Since that Ch. 6 type of test is so similar
to Page's test for a monotone alternative in a two-way layout, I often
refer to the one-way version as being a test similar to Page's test.
The test described in this section can be done using
StatXact's Page test. (I'll post some comments about
using StatXact for Page's test on my
StatXact web page.)
I'll offer some specific comments
about the text below.
- p. 284
- The alternative given by (7.9) is for a "one-sided" monotone
alternative. One could also consider a "two-sided" monotone
alternative, to be used in cases for which it is reasonable to think
that if there are differences between treatments, then they will be of a
monotone nature, but it isn't clear if the values will monotonically
increase or decrease. (E.g., it might be thought that if a drug will
have an effect, it will either monotonically increase or decrease with
increasing dosage, but it is not known what the direction will be if
differences are observed. Often the two-sided alternative makes sense
when considering "side-effects." For example, if the drug is supposed
to lower blood pressure, then a one-sided alternative could be
considered to see if it is effective in doing so, and if greater dosages
correspond to larger decreases (but not necessarily of a strictly
monotone nature, or else a regression analysis may be more sensible).
But one could also be concerned if the blood pressure medication
affected something else (a side effect of sorts). For example, it could
increase or decrease the amount of saliva in the mouth. In such a
case, the desired outcome may be one of no change, but one could test
for a difference, and anticipate that either a monotonically increasing
or decreasing effect will be observed if there is any effect.)
- p. 286, Large-Sample Approximation
- Since L is integer-valued, one might expect that a continuity
correction of 1/2 would generally
improve the normal approximation. If we consider Example 7.5, using the
table in the back of H&W, one can only determine that the exact p-value
is between 0.0014 and 0.0041. However, StatXact can be used to
determine that the exact p-value is about 0.0025 (and you can use this
value as a check to make sure that you're doing Page's test correctly
using StatXact). The normal
approximation without a continuity correction yields a p-value of about
0.0040, and the normal
approximation with a continuity correction yields a p-value of about
0.0047. So, in this case, the continuity correction made the
approximation worse. In another case that I considered in my
course last summer, for which the p-value was also small, the continuity
correction made the approximation worse. But I think it is the case that
when the p-value isn't real small, a continuity correction can improve
matters. (It can be noted that for k = 2, Page's test reduces to
the sign test, and it is known that a continuity correction generally
improves the normal approximation for a sign test (except for some cases
for which the p-value is rather extreme).)
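To make the continuity-correction comparison concrete, here is a sketch on hypothetical data (I'm not reproducing Example 7.5). It uses the standard no-ties null moments of L: mean nk(k+1)^2/4 and variance nk^2(k+1)^2(k-1)/144.

```python
import numpy as np
from scipy.stats import rankdata, norm

# Hypothetical two-way layout (rows = blocks, columns = treatments,
# with the treatments in their hypothesized increasing order).
data = np.array([[1.0, 2.0, 3.5],
                 [0.8, 2.1, 1.9],
                 [1.2, 2.4, 3.0]])
n, k = data.shape

ranks = np.apply_along_axis(rankdata, 1, data)
R = ranks.sum(axis=0)                        # treatment rank sums R_.j
L = np.sum(np.arange(1, k + 1) * R)          # Page's L = sum_j j * R_.j

mean_L = n * k * (k + 1) ** 2 / 4
var_L = n * k**2 * (k + 1) ** 2 * (k - 1) / 144   # no-ties null variance

z_plain = (L - mean_L) / np.sqrt(var_L)
z_cc = (L - 0.5 - mean_L) / np.sqrt(var_L)   # continuity correction of 1/2

p_plain = norm.sf(z_plain)                   # upper-tail p-values
p_cc = norm.sf(z_cc)
```

Note that the corrected p-value is always at least as large as the uncorrected one for an upper-tailed test, so whether the correction helps depends on which side of the exact p-value the uncorrected approximation lands.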
- p. 286, Ties
- StatXact is the best way to deal with ties --- it does
things as is indicated by Comment 19 on pp. 289-290. For this
test, H&W don't give the adjustment for ties for the null variance that makes
the normal approximation perform better (see Comment 21 on p. 292
for more about the normal approximation in the presence of ties (in
particular, how it's conservative when there are ties)), and of course the tables in the
back of the book are derived for the case of no ties.
- pp. 286-287, Example 7.2
- From what I can gather from H&W, and also going to the original
source (the classic book by Cochran and Cox), the data is from a
randomized block experiment, which had three plots of land divided
into five smaller subplots, with the 5 levels of potash randomly
assigned to the five subplots within each large plot. (I suppose that
it was thought that other characteristics of the land could
influence the strength of the cotton, and that a randomized block design
would be better than a one-way layout in which 15 small plots of land
were randomly assigned to the five treatment levels. I'll discuss this
in class.) Four measurements were made for each of the 15 cells, but
since Page's test only uses one observation per cell, the four
measurements for each cell were averaged to yield just one value per
cell.
- p. 287, Comment 14
- This is an important comment, since it explains the motivation for
the test statistic.
- p. 287, Comment 15
- This comment acknowledges that the test can still be a
distribution-free test even if a shift model doesn't hold.
- p. 288, Comment 17
- Two facts contribute to the (k!)^n = 36
possibilities indicated on p. 288 being equally-likely under the null
hypothesis of no differences due to treatments:
- if the null hypothesis of no differences due to treatments is true,
then the k! orderings of the ranks are equally-likely for each
block (which follows from the fact that if the null hypothesis is true,
then the random variables corresponding to the k observations for
the block are iid);
- and the ordering that results for a block is independent of
the orderings for all of the other blocks.
So for the case at hand, we have 3! * 3! = 6*6 = 36 equally-likely
possibilities, and in general we have (k!)^n equally-likely
possibilities under the null hypothesis.
I'll also point out that to get an upper-tailed test p-value the "brute
force" way, one can save some work by arranging the 36 equally-likely
possibilities in a 6 by 6 table, and noting that (when done in a
sensible way) one has symmetry. Also, (when the table is constructed in
a sensible way (I'll explain this in class)) one can note that the largest
values of the test statistic occur in one corner of the table, and so it
is typically not necessary to fill out the entire table to determine an
upper-tail probability. (StatXact can obtain an upper-tail
probability by using similar "tricks" to avoid consideration of all
(k!)^n equally-likely
possibilities.)
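The brute-force approach can be automated; the sketch below (function name my own) enumerates all (k!)^n equally-likely within-block rank arrangements and computes an exact upper-tail probability for Page's L. For n = 2 and k = 3 there are 36 arrangements, and the maximum value L = 28 occurs only when both blocks are perfectly ordered, giving P(L >= 28) = 1/36.

```python
from itertools import permutations, product

def exact_page_upper_tail(observed_L, n, k):
    """Exact P(L >= observed_L) by enumerating all (k!)^n equally-likely
    within-block rank arrangements under the null hypothesis."""
    count = total = 0
    for blocks in product(permutations(range(1, k + 1)), repeat=n):
        # L = sum over blocks of sum_j j * (rank in position j)
        L = sum(j * r for block in blocks
                      for j, r in enumerate(block, start=1))
        total += 1
        count += (L >= observed_L)
    return count / total

# n = 2 blocks, k = 3 treatments: (3!)^2 = 36 arrangements.
p = exact_page_upper_tail(28, n=2, k=3)   # 1/36
```

This is only feasible for small n and k, which is exactly why StatXact's branch-pruning "tricks" (or a Monte Carlo estimate) matter for larger designs.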
- pp. 289-290, Comment 19
- In class, I'll go over the determination of the p-value (the null
probability that the test statistic exceeds the observed value of 23)
for the example addressed on p. 290.
- p. 292, Comment 22
- We'll encounter Spearman's rank order correlation coefficient in Ch.
8.
Section 7.3
This is a two-way layout version of
the one-way layout procedure described in Comment 63 in Sec. 6.5
(p. 247). Like the main procedure covered in Sec. 6.5 (the S-D-C-F
multiple comparison procedure for one-way layouts), the procedure of Sec.
7.3, which I guess we can refer to as the W-N-M-T procedure, can be used
to identify which treatments are significantly different from one
another, and can be used as a test of the null hypothesis of no differences
against the general alternative that there are some differences. But
the procedure described in this section is more similar to the procedure
described in Comment 63 on p. 247 because they both use
joint ranks, whereas the S-D-C-F procedure does pairwise
comparisons for which the other k-2 treatments have no influence in
the determination of whether there is significant evidence that a given
pair of treatments differ. (More on this is given in Comment 30
on p. 299.)
The first paragraph of Sec. 7.3 describes the procedure as one which
would typically be applied after a rejection is obtained with
Friedman's test (in order to determine which treatments differ from
which other treatments), but as the last paragraph of Comment 24
indicates, the procedure can be used as a test (which is a competitor to
Friedman's test).
I'll offer some specific comments
about the text below.
- p. 295, Procedure
- Another way to describe the critical value,
r_alpha,
that is equivalent to (7.26), is to say that
the null hypothesis probability that at least one of the k choose
2 absolute differences of rank sums is at least
r_alpha is alpha.
- p. 295, Procedure
- StatXact isn't helpful in doing this procedure, and so we'll
have to make use of Table A.24 of H&W or the asymptotic approximation
based on the studentized range distribution. Unfortunately, Table A.24
only gives a few probabilities for each combination of k and
n covered, and so it won't be (easily) possible to always produce
exact p-values (and we may have to do something like report that 0.008 <
p-value < 0.032). To obtain the needed rank sums, one can note (see p.
276) that Minitab's friedman command gives these, but at
times a hand calculation may be quicker than putting the data into
Minitab.
- p. 296, Large-Sample Approximation
- As usual, I don't agree with the "otherwise decide" part of
statements like (7.27). Just because there is not strong evidence to
conclude that there is a difference, it doesn't mean that there is no
difference. There could be a somewhat mild difference. Also, it can be
noted that a strict interpretation of the rule (7.27) can lead to
inconsistencies. For example, as is the case with Example 7.3, one can
conclude that methods 1 and 2 are equivalent, that methods 2 and 3 are
equivalent, but that methods 1 and 3 differ.
- p. 296, Ties
- The use of midranks doesn't correspond to the assumptions
underlying the table of the exact values (Table A.24). Also, it could
produce a value, such as 21.5, which is not included in the table.
- pp. 296-297, Example 7.3
- To determine at what level of significance it can be concluded that
method 2 differs from method 3, one can divide the absolute difference
in the rank sums, 15, by the square root of 22(3)(4)/12 (i.e., the square root
of 22), and compare the resulting value to the critical values of the
studentized range distribution with 3 and infinity df (which are
given in Table A.17). The resulting value is about 3.198, and using the
2nd row of entries in Table A.17 on p. 669, it can be concluded that
methods 2 and 3 are different at level 0.10, but not at 0.05.
Using the last 2 lines of p. 296 and the first 2 lines of p. 297, it can
be concluded that the p-value which results from a test of the null
hypothesis of no differences due to methods against the general
alternative that there are some differences satisfies 0.001 < p-value <
0.005. The exact p-value from Friedman's test is about 0.0031, and so
both tests give about the same result.
The last paragraph of the example considers the reduced data set
comprised of just the first 15 cases. The largest absolute difference
in rank sums is 15. Since Table A.24 includes exact results for the
k = 3 and n = 15 case, it can be used to determine that
the p-value of the test of the null hypothesis of no differences due to
methods against the general
alternative satisfies 0.010 < p-value < 0.028. To see what the
large-sample approximation yields, we divide the largest absolute
difference of 15 by the square root of nk(k+1)/12
(the square root of 15*3*4/12 = 15), obtaining a value of about 3.873,
which can be compared to the entries in the 2nd row of Table
A.17, which suggests that 0.010 < p-value < 0.025, in close agreement
with the result from the table of the exact distribution.
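Both large-sample calculations in Example 7.3 are short enough to verify with a few lines (the tail probabilities still have to come from Table A.17 or a studentized range routine):

```python
import math

# Example 7.3, full data: n = 22 blocks, k = 3 methods, absolute
# rank-sum difference of 15; divisor is sqrt(n k (k+1) / 12).
n, k, diff = 22, 3, 15
q_full = diff / math.sqrt(n * k * (k + 1) / 12)    # 15 / sqrt(22), about 3.198

# Reduced data (first 15 cases): n = 15, largest difference 15.
n_red = 15
q_red = diff / math.sqrt(n_red * k * (k + 1) / 12)  # 15 / sqrt(15), about 3.873

# Each q is compared to the studentized range distribution with k and
# infinity degrees of freedom (Table A.17 in H&W).
```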
- p. 299, Comment 30
- One way to avoid letting the k-2 other treatment results
influence the determination of whether or not there is statistically
significant evidence to conclude that a given pair differ is to do
paired-sample tests on each of the k choose 2 pairs of samples, and then
combine the results using Boole's inequality to obtain a conservative
"overall"
p-value of a test of the null hypothesis of no differences due to
treatments against the general alternative. For small k and
n an exact (nonconservative) test could be performed using this
general scheme, but I don't know of any tables for such a test (and have
never heard anyone propose such a test).
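The Boole's-inequality scheme amounts to a Bonferroni adjustment: do a paired-sample test on each of the k choose 2 pairs and multiply the smallest p-value by the number of pairs. A sketch on hypothetical data, using Wilcoxon signed rank tests as the paired-sample tests (Comment 30 doesn't commit to a particular paired-sample test, so this choice is mine):

```python
from itertools import combinations
import numpy as np
from scipy.stats import wilcoxon

# Hypothetical two-way layout: rows = blocks, columns = treatments.
rng = np.random.default_rng(0)
data = rng.normal(size=(12, 3)) + np.array([0.0, 0.2, 1.0])
k = data.shape[1]

# Paired-sample test on each of the k choose 2 pairs of treatments.
pvals = [wilcoxon(data[:, i], data[:, j]).pvalue
         for i, j in combinations(range(k), 2)]

# Boole's inequality gives a conservative "overall" p-value.
n_pairs = len(pvals)                  # k choose 2
p_overall = min(1.0, n_pairs * min(pvals))
```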
Section 7.4
This is a two-way layout version of
the one-way layout procedure described in Sec. 6.7.
Like the procedure covered in Sec. 6.7,
the procedure of Sec.
7.4, which I guess we can refer to as the N-W-W-M procedure, can be used
to identify which treatments are significantly better than the control
(which could be a commonly used treatment that new treatments are
compared to, or it could be no treatment, or even a placebo).
It can also be used as a test of the null hypothesis of no differences
against the alternative that there are some differences (see the
2nd paragraph of Comment 32 on p. 303), and if certain
assumptions are made (like a shift model holds, or that if a treatment
differs from the control it corresponds to an upward shifting of
probability mass), then conclusions can be stated in terms of means
and/or medians.
I'll offer some specific comments
about the text below.
- p. 301, Procedure
- Another way to describe the critical value,
r*_alpha,
that is equivalent to (7.29), is to say that
the null hypothesis probability that at least one of the k - 1
differences of rank sums (of the form treatment minus
control) is at least
r*_alpha is alpha.
- p. 301, Procedure
- StatXact isn't helpful in doing this procedure, and so we'll
have to make use of Table A.25 of H&W or the asymptotic approximation.
To obtain the needed rank sums, one can note (see p.
276) that Minitab's friedman command gives these, but at
times a hand calculation may be quicker than putting the data into
Minitab.
- p. 301, Large-Sample Approximation
- As usual, I don't agree with the "otherwise decide" part of
statements like (7.30). Just because there is not strong evidence to
conclude that there is a positive difference, it doesn't mean that there is no
difference. There could be a somewhat mild positive difference, or
there could be a negative difference.
Note that to use Table A.21, you use k - 1 for l.
- p. 301, Ties
- The use of midranks doesn't correspond to the assumptions
underlying the table of the exact values (Table A.25). Also, it could
produce a value, such as 21.5, which is not included in the table.
- p. 302, Example 7.4
- For the illustration of the large-sample method at the bottom of p.
302, one can divide the largest difference, 6, by the square root of
nk(k+1)/6, and use that as a test statistic value to
compare against the critical values given in the l = 2 column of
the rho = 1/2 table on pp. 691-692. The test statistic value is
6 divided by the square root of 18*3*4/6 (the square root of 36, which
equals 6). So the test statistic value is 1.
Using the x = 1.00 value of the l = 2 column, the p-value
is found to be 1 - 0.74520 = 0.25480, which is in the ballpark of the
0.2859 value from the table of the exact distribution. (Note: I have no
clue as to why the k = 6 value is given at the bottom of p. 302
--- I suspect someone got mixed up at some point and thought that it was
relevant.)
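The Table A.21 value used above (0.74520 for x = 1.00 in the l = 2, rho = 1/2 table) is just an equicorrelated multivariate normal probability, so it can be reproduced numerically as a check (scipy's multivariate_normal cdf uses numerical integration, so expect agreement to only about 4 digits):

```python
import numpy as np
from scipy.stats import multivariate_normal

# P(Z1 <= 1, Z2 <= 1) for a standard bivariate normal with correlation 1/2:
# the l = 2, rho = 1/2, x = 1.00 entry of H&W's Table A.21.
cov = np.array([[1.0, 0.5],
                [0.5, 1.0]])
table_val = multivariate_normal.cdf([1.0, 1.0], mean=[0.0, 0.0], cov=cov)

# The corresponding p-value for Example 7.4's test statistic value of 1:
p_value = 1.0 - table_val
```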
Section 7.5
I'm not going to cover this section in class. The estimates only make
sense if one can assume an additive model with iid error terms (i.e.,
the error term distribution is the same for each cell, and the
relationships between the medians is the same for each block, giving us a
shift model), and I don't believe
that this is usually the case.
Section 7.6
I think that this is a relatively important section, since such BIBD
data is sometimes encountered, and it may be unwise to rely on normal
theory ANOVA methods (since it's very hard to check all of the
assumptions with such sparse data ... and I suspect that a lot of the
time the assumptions are rather severely violated).
I'll offer some specific comments
about the text below.
- p. 310 (1st line)
- In class, I'll establish the restriction that is given
(solving Problem 59 on p. 315 of H&W), but I challenge you
to try to establish it on your own first.
- p. 310, Procedure
- In class, I'll show the equality of the two expressions for
d given in (7.43)
(solving Problem 60 on p. 315 of H&W).
We've encountered similar things previously
in Ch. 6 and Ch. 7, so this one time I'll show you how to go about
establishing something like (7.43). Also, I'll derive the mean of
D under the null hypothesis.
- p. 311, Large-Sample Approximation
- Note that the chi-square approximation is conservative,
particularly out in the tail. It can be shown that the null variance of
D is less than 2(k-1), which is the variance of a
chi-square distribution with k-1 df, indicating that large
values of D, when the null hypothesis is true, occur less
frequently than would be the case if D in fact had a chi-square
distribution with
k-1 df. As usual, the conservativeness of the
approximation under the null hypothesis translates into larger than
deserved approximate p-values which corresponds to reduced power when
the alternative hypothesis is true. For many cases not covered by the
exact tables, a good Monte Carlo estimate of the exact p-value would be the
next best thing, but this would require quite a bit of effort.
(StatXact does not do the test covered by this section.)
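A Monte Carlo estimate of an exact p-value is straightforward to program: permute the observations independently within each block and recompute the statistic. The sketch below is generic (any within-block rank statistic can be plugged in; since I'm not reproducing H&W's formula for D here, Friedman's S serves as the stand-in statistic):

```python
import numpy as np
from scipy.stats import rankdata

def friedman_S(data):
    """Friedman's statistic from a blocks-by-treatments array (a stand-in;
    any within-block rank statistic, such as Durbin's D, could be used)."""
    n, k = data.shape
    R = np.apply_along_axis(rankdata, 1, data).sum(axis=0)
    return 12.0 / (n * k * (k + 1)) * np.sum(R**2) - 3 * n * (k + 1)

def mc_pvalue(data, stat=friedman_S, reps=20000, seed=0):
    """Monte Carlo p-value: permute independently within each block."""
    rng = np.random.default_rng(seed)
    observed = stat(data)
    perm = data.copy()
    count = 0
    for _ in range(reps):
        for row in perm:
            rng.shuffle(row)            # permute within the block
        count += (stat(perm) >= observed)
    return (count + 1) / (reps + 1)     # add-one keeps the estimate positive
```

The add-one convention in the last line keeps the estimated p-value from ever being exactly zero, which matches how permutation p-values are usually reported.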
- p. 312, Comment 45
- Note that a shift model arrangement is not necessary --- the null
sampling distribution is valid as long as the s
observations in each block are iid, and so we can view the null
hypothesis as one of no difference between treatments in any of the blocks,
but allowing the common response distribution for the treatments
to differ from block to block in ways other than a shift.
Basically, the generalization of the shift model allows for the error
term distribution to differ from block to block. In the more general
setting, a small p-value should be interpreted as being strong evidence
of some differences between at least some of the treatments in at least
some of the blocks. (With a shift model assumption, the interpretation
is that there are some differences between at least some of the
treatments, but that the differences are the same no matter what block
is considered. The shift model is addressed in Comment 48 on p.
313.)
- p. 313, Comment 48
- Note that A3' corresponds to the more general setting I
referred to in my preceding comment: when the null hypothesis is true,
the treatments have a common distribution in each block, but the
distribution need not be the same in each block. But A3' imposes
an additive structure on the medians under the alternative hypothesis.
That is, if there are differences between treatments, the spacing of
the medians is the same within each block (with only the error term
distributions differing from block to block). I think that the additive
model assumption isn't very realistic in a lot of cases, since it
doesn't allow for larger differences in some blocks than in others (and
even no differences in some of the blocks).
- pp. 313-314, Comment 49
- This is a good example of the type of thing that I refer to as a
"brute force" derivation of a null sampling distribution.
- pp. 314-315, Comment 51
- This is a good example of the type of thing that I refer to as a
"brute force" derivation of a null sampling distribution in the presence
of ties. If StatXact included the test of this section (Durbin's
test, for short), I feel safe in assuming that it would make use of this
type of exact sampling distribution to obtain p-values. Given that
StatXact doesn't do the test, and given that the chi-square
approximation need not be good for small designs, and given that I worked
through some small design examples of exact null sampling distributions
in the presence of ties for some other tests in Ch. 7 and showed that
ties can change the distribution quite a bit from the case with no ties,
I think it would be wise to go to the extra trouble to obtain an exact
p-value in some small design cases if there are ties (but perhaps
not if the approximate p-value is not at all smallish).
Section 7.7
The first paragraph of Sec. 7.7 describes the procedure as one which
would typically be applied after a rejection is obtained with
the test of Sec. 7.6 (in order to determine which treatments differ from
which other treatments), but it can also be viewed as a
competitor to the test of Sec. 7.6. To perform a test of the null
hypothesis of no differences between treatments against the general
alternative, we can reject at level alpha if any of the k
choose 2 pairs of treatments are deemed to be significantly different by
the criterion given in (7.46) on p. 317.
I'll offer some specific comments
about the text below.
- p. 317, Procedure
- Note that the procedure is approximate, instead of exact.
- p. 318, Comment 55
- The conservative procedure may have low power to detect somewhat
mild differences. Also, it may be rare that you can apply it due to the
extremely limited number of designs covered by Table A.26.
Section 7.8
Since I think that other topics that we can cover this fall may be more
useful, I'm not going to cover this section in class. (Ch. 7 is a
rather long chapter, containing many procedures. I think that on the whole,
it'll be better to not spend too much time on Ch. 7, so that we can have
more time for other chapters, and for procedures that we can do on
StatXact which are not covered in H&W. Performing the test of Sec. 7.8
is made difficult due to the lack of tables for the exact distribution
(with the exception of a relatively small number of cases).)
Section 7.9
I will present the basics of this test in class, but I don't have
anything to make note of here.
Section 7.10
I will present the basics of this procedure in class.
As usual, although this method is a multiple comparison procedure, I
don't think that one has to view it as a follow-up procedure to be
applied only after the rejection of the null hypothesis by another procedure.
Instead, it can be used to test the null hypothesis of no treatment
differences within blocks against the general alternative that there are
some differences between treatments (in at least some of the blocks).
I'll offer some specific comments
about the text below.
- p. 340, Procedure
- The procedure described on p. 340 is an approximate one which
should work okay as long as n isn't too small (but unfortunately,
no guidelines are presented to help us decide if n is large
enough). To perform a test of the null hypothesis of no treatment
differences against the general alternative, one can consider all
k choose two pairs of samples and see if any of them correspond
to a significant difference at some specified level. Alternatively, to
get a p-value, one can divide the largest absolute difference (the largest
S_u minus the smallest
S_u) by the square root factor on the right side of the
inequality in (7.75), and compare the value of this test statistic to
the quantiles given in Table A.17.
- p. 342, Comment 82
- The alternative presented is a conservative procedure which may be
rather weak in power, which leads to the recommendation that the
approximate procedure described on p. 340 be used "whenever the number of blocks is
reasonably large."
Unfortunately,
no guidelines are presented to help us decide if n is large
enough.
Section 7.11
I will spend a relatively long time discussing this section during class.
Also, I'll work through a simple example, step by step. In addition,
I'll give the results from the application of Friedman's test and
Quade's test to the same simple example data set. Then, I'll give the
results from applying all three tests (Friedman's test, Quade's test,
and Doksum's test (the test of Sec. 7.11)) to Woody Wardward's base
running data (on p. 274).
I'll offer some specific comments
about the text below.
- p. 344, Procedure
- I'll stick with H&W's notation when I present this section during
class. My presentation will be easier to follow if you spend some time
getting comfortable with all of the notation prior to my lecture on this
section. For example, it should be kept in mind that
H_u. assumes a large value when the treatment u
observations are generally smaller than the observations from the other
treatments.
- p. 344, Procedure
- I don't think that the text makes it clear why the variance given
by (7.82) and (7.83) corresponds to a suitable part of the denominator
for the sum of the squared deviations that is the heart of the test
statistic. (My advice is not to worry about this matter for now.)
Section 7.12
I will present this test in class, but I don't have
anything to make note of here.
Section 7.13
Since I think that other topics that we can cover this fall may be more
useful, I'm not going to cover this section in class. (Ch. 7 is a
rather long chapter, containing many procedures. I think that on the whole,
it'll be better to not spend too much time on Ch. 7, so that we can have
more time for other chapters, and for procedures that we can do on
StatXact which are not covered in H&W.)
Section 7.14
Since I think that other topics that we can cover this fall may be more
useful, I'm not going to cover this section in class. (Ch. 7 is a
rather long chapter, containing many procedures. I think that on the whole,
it'll be better to not spend too much time on Ch. 7, so that we can have
more time for other chapters, and for procedures that we can do on
StatXact which are not covered in H&W.)
Section 7.15
Since I think that other topics that we can cover this fall may be more
useful, I'm not going to cover this section in class. (Ch. 7 is a
rather long chapter, containing many procedures. I think that on the whole,
it'll be better to not spend too much time on Ch. 7, so that we can have
more time for other chapters, and for procedures that we can do on
StatXact which are not covered in H&W.)
Section 7.16
I will briefly mention some things from this section when I cover Sec.
7.11, but I don't intend to get into the details of this section.