Some Comments about Chapter 4 of Hollander & Wolfe


I'll generalize the coverage of this chapter in several ways. Note that assumption A3 states that the distributions underlying the data should be continuous. Although ties resulting from discrete distributions used to be a bother, with StatXact they don't necessarily cause us any grief --- StatXact can deal with ties in a fair and exact way. The stated assumption of no ties made an exact approach feasible in the years prior to StatXact. But in practice ties used to be a problem: even though the phenomenon underlying the data may be continuous, the distribution from which the observations come is always discrete due to limitations in the precision of measurement, and with discrete distributions ties can occur with positive probability. (The beauty of the continuous distributions assumption is that with probability 1 there will be no ties.)


Section 4.1

For the tests of this section (noting that while most of the section deals with the rank sum test, the normal scores test is introduced in Comment 12 on p. 121), and also most of the tests that will be covered in Ch. 5, the null hypothesis is that both samples arose from the same distribution, with all of the random variables being independent (so that we have m+n iid random variables). When a small p-value is obtained from any of these tests, we can say that there is statistically significant evidence that the samples did not come from m+n iid random variables. But if we take independence as a given (that we have m+n independent random variables whether or not they have the same distribution), and further take as a given that the Xi all have the same distribution and the Yj all have the same distribution (and this is what is commonly done), then a small p-value can be taken as being significant evidence (supporting the general alternative) that the two distributions differ. To say anything further, perhaps a statement about the means or medians, we have to impose additional assumptions. In Sec. 4.1, H&W simplify things greatly by restricting attention to what is known as the shift model, but to me the shift model isn't realistic in a lot of situations. I'll discuss the shift model and other possibilities in class.

The main test in this section is sometimes referred to as the W-M-W test or the M-W-W test. As noted on p. 11 of Ch. 1, Wilcoxon's 1945 paper presented the rank sum test for the equal sample size case. The 1947 paper by Mann and Whitney presented a version of the test that was more general (not just for when m = n), and their test statistic had a different form, but since the Mann-Whitney version of the test is equivalent to the Wilcoxon version of the test (in that no matter which version of the test is used, the p-value will always be the same (if there are no ties, or if there are ties and the ties are treated in equivalent ways when performing the two tests)), it's best to think of them as the same test. Interestingly, while Minitab has a mann command, the value of the test statistic produced corresponds to the Wilcoxon version of the test. StatXact emphasizes the Wilcoxon version of the test, but some statistical software uses the Mann-Whitney version of the test. Similarly, some books have tables for the Wilcoxon version and other books have tables for the Mann-Whitney version. (It would be somewhat silly and wasteful for a book to have tables for both, since the two tests are equivalent.)
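If you want to check the equivalence numerically, here is a little Python sketch (the data values are just made up for illustration, and the use of SciPy is my own choice --- nothing below is from H&W). The Wilcoxon rank sum W and the Mann-Whitney U differ only by the constant n(n+1)/2, so any table or exact algorithm for one immediately gives the other.

    import numpy as np
    from scipy.stats import rankdata, mannwhitneyu

    x = np.array([1.2, 3.4, 2.2, 5.1])       # hypothetical "X" sample (size m)
    y = np.array([2.8, 4.0, 6.3, 3.9, 7.7])  # hypothetical "Y" sample (size n)
    m, n = len(x), len(y)

    ranks = rankdata(np.concatenate([x, y])) # joint ranks (midranks if there were ties)
    W = ranks[m:].sum()                      # Wilcoxon version: sum of the Y-sample ranks
    U = W - n * (n + 1) / 2                  # Mann-Whitney version: W minus its smallest possible value

    # SciPy reports U; with no ties an exact p-value is available.
    U_scipy, p = mannwhitneyu(y, x, alternative='two-sided', method='exact')
    print(W, U, U_scipy, p)                  # U and U_scipy should agree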

I'll offer some specific comments about the text below.
p. 107 (top half of page)
One way for Y to have the same distribution as X + Delta is for the treatment to have the exact same effect (change the value by the same amount) for all subjects/units. But I find this to not be realistic for a lot of treatments --- maybe a treatment can have no effect on some subjects, and have effects of varying amounts on other subjects. (I would think that life would be so much easier for doctors (and patients) if every treatment/medication had exactly the same effect on all people. But that just doesn't seem realistic, does it?) One case in which a shift model could plausibly hold is when there is just one subject, and repeated measurements are made on the subject prior to a treatment, and repeated measurements are made on the subject after the treatment, and it can be assumed that the sole source of variation in the pretreatment measurements is measurement error, the sole source of variation in the posttreatment measurements is measurement error, and that the m+n measurement errors can be thought to be observations of iid random variables. (H&W briefly touch on an alternative to the simple shift model by introducing the location-shift function in Comment 13 on p. 122, but they don't give a lot of information about this, and in Sections 4.2 and 4.3, attention is refocused on the simple shift model given by (4.2).)
p. 107 (bottom half of page)
Note that the two samples can result from a control group and a treatment group, from two different treatment groups, or from the same treatment being applied to two different populations (e.g., men and women). Note that while in (4.3) the ranks for the treatment group subjects are being summed to be the test statistic, on p. 108 it can be seen that to use the tables (which is not necessary, given that you have StatXact handy), one has to sum the ranks for the smaller of the two samples, whether it be the treatment sample or the control sample. Also note that to use the table in H&W, both sample sizes have to be less than or equal to 10. If one sample size is 5 and the other is 11, we can't make use of the table, and at the same time it could be that the normal approximation should not be trusted (but this situation causes us no worry if StatXact is handy, since it can easily do an exact computation of the p-value when the sample sizes are 5 and 11). It can be noted that the table in H&W can be annoying to use for lower-tailed tests and two-tailed tests. You may prefer to use the tables I distributed in STAT 554 (assuming that you have those). But of course, why use a table at all? In order to give you a better "feel" for the test, in class I'll describe the construction of tables using a recursive method.
p. 108, Large-Sample Approximation
As was the case for the signed-rank test of Chapter 3 (but not the sign test of Ch. 3), establishing the asymptotic normality of the rank sum test is difficult (and uses probability theory beyond the prerequisites of this course) because the rank sum statistic is not a sum of iid random variables. In class I will give a derivation of the expected value (given by (4.7)), and I will generalize to cover other two-sample rank tests based on other scores. (Perhaps they should be referred to as two-sample score tests, but typically they are referred to as rank tests even if the scores are not integer ranks.) I will also give a formula for the null sampling distribution variance of a two-sample linear rank statistic based on general scores. (If one plugs in midrank adjusted integer scores, the value of the variance will match the values which result from (4.13) and (4.14) on p. 109.) It should be noted that H&W do not employ the continuity correction for their normal approximation (and neither does StatXact). Over the years, I've found that the continuity correction improves the normal approximation more often than not, but with StatXact handy, there is little need for a normal approximation. Minitab includes the continuity correction in its normal approximation for this test. (Note: Minitab does not do an exact version of the test --- the mann command results in a normal approximation no matter how small the sample sizes are.)
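Here is a minimal Python sketch of the normal approximation, with and without the continuity correction, using the sample sizes and observed rank sum from Example 4.1 for concreteness (the code itself is mine, not H&W's).

    from math import sqrt
    from scipy.stats import norm

    m, n, W = 10, 5, 30                  # n is the size of the sample whose ranks are summed
    mean_W = n * (m + n + 1) / 2         # null expected value of W (this is (4.7))
    var_W = m * n * (m + n + 1) / 12     # null variance of W when there are no ties

    z_plain = (W - mean_W) / sqrt(var_W)
    p_lower_plain = norm.cdf(z_plain)            # lower-tailed p-value, no continuity correction

    z_cc = (W + 0.5 - mean_W) / sqrt(var_W)      # continuity correction for a lower-tailed test
    p_lower_cc = norm.cdf(z_cc)

    print(p_lower_plain, p_lower_cc)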
pp. 110-111, Example 4.1
Things got messed up when the StatXact output on p. 111 was put into the book --- it'd be impossible to get the exact output that is shown. To get something close to the output, you could put ten 2s and five 1s down the first column of the CaseData spreadsheet, and then put the sample of ten values followed by the sample of five values down the second column. Then pull down the Statistics menu, go down to Two Independent Samples and select Wilcoxon-Mann-Whitney. Finally, click VAR1 into the Population box, VAR2 into the Response box, select Exact, and click OK. The Observed value of 30.00 on the output is the sum of the ranks for the sample of size five (the sample coded with 1s). The Mean value corresponds to (4.8) from H&W, putting n equal to 5 and m equal to 10. (Recall, in H&W, n is the sample size of the sample yielding the ranks which are summed to get the test statistic.) The rest of your output should match what is shown in H&W, except that in two places GE should be LE. (Note: If you had coded the sample of size ten with 1s and the sample of size five with 2s, you would get a different value for the test statistic (the Observed value would be 90.00, the sum of the ranks for the sample of size ten --- the sample coded with 1s).) Note that with StatXact one doesn't specify which of the two possible one-tailed tests to perform --- it just reports the p-value for the one-tailed test that yields the smaller p-value. For example, with the coding of the sample of size five with 1s and the sample of size ten with 2s, the exact One-sided P-value is .1272 and it corresponds to Pr{Test Statistic .LE. Observed}, which is the probability under the null hypothesis that the sum of the ranks for the sample of size five assumes a value less than or equal to 30. This is the p-value for the alternative hypothesis that the permeability is less for 12-26 weeks, or equivalently, that the permeability is greater at term. If one wanted the p-value for the other possible alternative, that the permeability is greater for 12-26 weeks, then the p-value corresponds to the null probability that the test statistic is greater than or equal to 30. Letting W denote the test statistic, the desired probability is P(W >= 30). This can be obtained from the StatXact output by noting that it is equal to 1 - P(W <= 29), which is equal to 1 - P(W <= 30) + P(W = 30), which explains the presence of the Pr{Test Statistic .EQ. Observed} line in the StatXact output.
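The tail identity at the end of the preceding paragraph is easy to check by brute force, since with m = 10 and n = 5 there are only C(15,5) = 3003 equally likely assignments of ranks under the null hypothesis. Here is a Python sketch (my own, not StatXact output).

    from itertools import combinations

    m, n, w_obs = 10, 5, 30
    N = m + n
    sums = [sum(c) for c in combinations(range(1, N + 1), n)]  # all equally likely rank sums
    total = len(sums)

    p_le = sum(s <= w_obs for s in sums) / total  # Pr{W <= 30}: one one-sided p-value
    p_eq = sum(s == w_obs for s in sums) / total  # Pr{W = 30}
    p_ge = sum(s >= w_obs for s in sums) / total  # Pr{W >= 30}: the other one-sided p-value

    assert abs(p_ge - (1 - p_le + p_eq)) < 1e-12  # Pr{W >= w} = 1 - Pr{W <= w} + Pr{W = w}
    print(p_le, p_eq, p_ge)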
p. 113, Comment 1
Notice how simple the motivation is.
p. 113, Comment 3
As I indicated above, I will explain how to find the null distribution of the test statistic in class, for general m and n, using a recursive relationship.
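To give a rough idea of the kind of recursion involved (the details I give in class may be organized a bit differently), here is a Python sketch. If c(N, n, w) denotes the number of n-subsets of the ranks {1, ..., N} having sum w, then c(N, n, w) = c(N-1, n, w) + c(N-1, n-1, w-N), according to whether or not rank N is in the subset.

    from functools import lru_cache
    from math import comb

    @lru_cache(maxsize=None)
    def c(N, n, w):
        # number of n-subsets of {1, ..., N} with sum w
        if n == 0:
            return 1 if w == 0 else 0
        if N < n or w < 0:
            return 0
        return c(N - 1, n, w) + c(N - 1, n - 1, w - N)  # rank N excluded or included

    def null_pmf(m, n):
        # exact null pmf of W, the rank sum of the sample of size n
        N, total = m + n, comb(m + n, n)
        lo, hi = n * (n + 1) // 2, n * (2 * m + n + 1) // 2
        return {w: c(N, n, w) / total for w in range(lo, hi + 1)}

    pmf = null_pmf(10, 5)
    print(sum(p for w, p in pmf.items() if w <= 30))  # exact Pr{W <= 30}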
p. 114, Comment 4
I plan to generalize this material some in class.
pp. 115-116, Comment 5
This is how StatXact handles ties (in an exact way), so try to understand this. (It may be helpful to put a subscript of A on one of the 3.5 values, and put a subscript of B on the other one.)
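Here is a small Python sketch of this conditional treatment of ties (the data are made up, with a single pair of tied 3.5 values, like in the Comment): assign midranks, keep the observed set of midranks fixed, and enumerate all ways of assigning n of them to the second sample.

    import numpy as np
    from itertools import combinations
    from scipy.stats import rankdata

    x = np.array([1.0, 3.5, 2.0, 4.1])       # hypothetical data with one tie (3.5)
    y = np.array([3.5, 5.0, 6.2])
    m, n = len(x), len(y)

    midranks = rankdata(np.concatenate([x, y]))  # tied values share the average of their ranks
    W_obs = midranks[m:].sum()

    combos = list(combinations(range(m + n), n)) # all equally likely "Y" positions
    count_le = sum(midranks[list(idx)].sum() <= W_obs for idx in combos)
    p_lower = count_le / len(combos)             # exact lower-tailed p-value, conditional on the ties
    print(W_obs, p_lower)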
pp. 119-120, Comment 9
Don't worry so much about this material --- H&W don't really derive the power results. Note that to use the formula, one needs to know something about the underlying pdf.
p. 120, Comment 10
Don't worry so much about this material.
p. 120, Comment 11
Two distributions can have the same mean and/or median, but the sampling distribution of the test statistic need not match the distribution under the null hypothesis of identical distributions. So a small p-value from the test should not be taken as representing statistically significant evidence against the hypothesis of equal means (or the null hypothesis of equal medians), unless it can be assumed that if the means (medians) are equal then the distributions are identical.
pp. 121-122, Comment 12
The scores being summed in (4.26) are the Van der Waerden normal scores, which for large N = m+n approximate the expected values of the order statistics from a sample of size N from a standard normal distribution. The test based on these scores is typically called the normal scores test, and to execute it in StatXact, you can do exactly as for the rank sum test, only select Normal Scores, instead of Wilcoxon-Mann-Whitney, from the choices corresponding to Two Independent Samples from the Statistics menu. The test corresponding to the test statistic given by c1 on p. 122 is also referred to as the normal scores test, and it is based on the exact expected values of the order statistics from a sample of size N from a standard normal distribution. I don't think this test is included on any statistical software package --- to do so would require storing the needed expected values for each sample size. (I don't know of any statistical software package except StatXact that includes the Van der Waerden (normal scores) test, but please check out any statistical software that you have access to and let me know if you find the Van der Waerden test, and if so, note if a normal approximation version is employed.)
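For concreteness, here is a Python sketch of the Van der Waerden scores and the resulting statistic (made-up data). The permutation p-value could then be obtained exactly as for the rank sum statistic, just with these scores in place of the integer ranks.

    import numpy as np
    from scipy.stats import rankdata, norm

    x = np.array([1.2, 3.4, 2.2, 5.1, 0.8])
    y = np.array([2.8, 4.0, 6.3, 3.9, 7.7])
    m, n = len(x), len(y)
    N = m + n

    ranks = rankdata(np.concatenate([x, y]))
    scores = norm.ppf(ranks / (N + 1))   # Van der Waerden scores: Phi^{-1}(R_i / (N + 1))
    T_obs = scores[m:].sum()             # normal scores statistic for the Y sample
    print(T_obs)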
p. 122, Comment 13
Note that while the model considered here allows for the treatment to not have the exact same effect on all subjects/units, it dictates that the effect will be the same for all units corresponding to the same value of X, and I don't think that's enough of a generalization.
p. 123, Comment 14
The main point to be made here is that if P(Xi < Yj) = 1/2, then one cannot expect the rank sum test to reject with high probability even though the two distributions may not be identical. (Understanding why this is so is perhaps most easily done by examining the normal approximation version of the Mann-Whitney version of the test.) It can be noted that some books refer to Xi as being stochastically smaller than Yj if P(Xi < Yj) > 1/2, but this does not correspond to the usual definition of one distribution being stochastically smaller than another distribution.
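A quick simulation can make the point concrete. In the Python sketch below (my own choices of distributions and sample sizes), the two distributions have the same center but very different spreads, so P(Xi < Yj) = 1/2, and the rejection rate of the two-sided test stays low instead of approaching 1.

    import numpy as np
    from scipy.stats import mannwhitneyu

    rng = np.random.default_rng(0)
    m, n, reps, alpha = 20, 20, 2000, 0.05
    rejections = 0
    for _ in range(reps):
        x = rng.normal(0.0, 1.0, m)   # standard deviation 1
        y = rng.normal(0.0, 4.0, n)   # standard deviation 4: different distribution, same median
        rejections += mannwhitneyu(x, y, alternative='two-sided').pvalue < alpha
    print(rejections / reps)          # stays far below 1, even though the distributions differ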

Section 4.2

It should be noted that the Hodges-Lehmann point estimator for the shift parameter only makes completely good sense if the shift model described on p. 107 holds (and I question whether it is often good to make such an assumption). However, I suppose that if it appeared that a shift model nearly held, and there was concern about outliers, then the estimator may suitably serve as an approximate estimator of the difference in means --- it'd be a trade-off, accepting some approximation in what is being estimated in exchange for some protection against the ill effects of outliers. (The resistance to outliers is suggested by Comment 16 on p. 128.)

I'll offer some specific comments about the text below.
p. 126 & p. 128, Comment 15
Make sure that you understand the Hodges-Lehmann estimation scheme.
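As a quick reminder of the scheme, here is a Python sketch (with made-up data): form all mn pairwise differences Yj - Xi and take their median.

    import numpy as np

    x = np.array([1.2, 3.4, 2.2, 5.1])
    y = np.array([2.8, 4.0, 6.3, 3.9, 7.7])

    diffs = np.subtract.outer(y, x).ravel()  # all m*n pairwise differences Y_j - X_i
    hl_estimate = np.median(diffs)           # two-sample Hodges-Lehmann estimate of the shift
    print(hl_estimate)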
p. 128, Comment 17
The alternative point estimator introduced here is fairly interesting, as is the way of choosing between it and the main estimator dealt with in Sec. 4.2: the widths of confidence intervals associated with the two different point estimators are used to suggest which estimator may be the more accurate of the two. Notice how the widths of the confidence intervals associated with the two point estimators used in the alternative estimator of Comment 17 are converted into standard error estimates for those two estimators, by assuming that the estimators are approximately normally distributed (I'll go over this in class), and how those two standard error estimates are then combined in the right way to obtain an estimate of the standard error of the alternative estimator.
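The width-to-standard-error step is simple enough to show in a couple of lines of Python (the interval endpoints below are made up): if an estimator is approximately normal, a level 1 - alpha confidence interval has width of roughly 2 z_{alpha/2} times the standard error, so dividing the width by 2 z_{alpha/2} gives a rough standard error estimate.

    from scipy.stats import norm

    lower, upper, alpha = 1.3, 4.1, 0.05    # hypothetical confidence interval for an estimator
    z = norm.ppf(1 - alpha / 2)
    se_estimate = (upper - lower) / (2 * z) # width / (2 * z_{alpha/2})
    print(se_estimate)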
pp. 129-131, Comment 18
The first four paragraphs of Comment 18 are rather interesting. They suggest that, in many cases, P(Xi < Yj) may be a more meaningful thing to estimate than the amount of shift between the distributions. The rest of the Comment concerns a confidence interval for the estimand --- I think it's okay not to be concerned about the details of this interval estimator. Note that the distribution of the point estimator, U/mn, is more complicated than the distribution of the sample proportion based on iid Bernoulli trials, because U is not a sum of independent random variables.
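Computing the point estimate is trivial, as the Python sketch below shows (made-up data): U counts the pairs with Xi < Yj, and U/(mn) estimates P(Xi < Yj).

    import numpy as np

    x = np.array([1.2, 3.4, 2.2, 5.1])
    y = np.array([2.8, 4.0, 6.3, 3.9, 7.7])
    m, n = len(x), len(y)

    U = np.sum(np.subtract.outer(y, x) > 0)  # number of pairs with X_i < Y_j
    p_hat = U / (m * n)                      # estimate of Pr(X < Y)
    print(U, p_hat)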

Section 4.3

It should be noted that the confidence interval for the shift parameter only makes completely good sense if the shift model described on p. 107 holds (and I question whether it is often good to make such an assumption). However, I suppose that if it appeared that a shift model nearly held, and there was concern about outliers, then the interval may suitably serve as an approximate confidence interval for the difference in means --- it'd be a trade-off, giving up precision in the coverage probability for some protection against the ill effects of outliers.

I'll offer some specific comments about the text below.
p. 133 (top half of page)
Note that StatXact and Minitab both give exact confidence intervals, and that the large sample approximation scheme described in the book produces an interval that does not match the exact interval (but of course the sample sizes are only 5 and 10 for the example). My guess is that the approximation scheme does quite well in general, for sample sizes that are not really small, but that one may not have to use the approximation often due to the availability of software to produce exact intervals.
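For what it's worth, here is a Python sketch of an interval of this general type: the endpoints are order statistics of the mn pairwise differences, with the index chosen via a normal approximation. The data are made up, and the rounding/indexing convention here may differ slightly from the one H&W use, so treat this as an illustration rather than a reproduction of their scheme (or of the exact intervals from StatXact or Minitab).

    import numpy as np
    from math import floor, sqrt
    from scipy.stats import norm

    x = np.array([1.2, 3.4, 2.2, 5.1])       # hypothetical data
    y = np.array([2.8, 4.0, 6.3, 3.9, 7.7])
    m, n, alpha = len(x), len(y), 0.05

    diffs = np.sort(np.subtract.outer(y, x).ravel())            # ordered pairwise differences
    z = norm.ppf(1 - alpha / 2)
    C = max(floor(m * n / 2 - z * sqrt(m * n * (m + n + 1) / 12)), 1)  # approximate index
    lower, upper = diffs[C - 1], diffs[m * n - C]               # D(C) and D(mn + 1 - C)
    print(lower, upper)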
p. 133, Comment 20
While the alternative estimators referred to in Comment 20 are somewhat interesting, there seems to be no reason to ever choose them over the more commonly used estimators emphasized in Sections 4.2 and 4.3.
p. 133, Comment 21
It should be noted that even if the shift model holds, one should not necessarily use the point estimator of Sec. 4.2 and the interval estimator of Sec. 4.3. Since the shift amount is the same as the difference in the means, the difference in the medians, and the difference in matching quantiles other than the medians, there are lots of ways to estimate the shift amount. In choosing between estimators, it's nice to know something about their performances, and the estimated standard errors of point estimators provide one way to compare expected performance. The quantity described in this comment is a good way to estimate the standard error of the Hodges-Lehmann point estimator introduced in Sec. 4.2. It should also provide a decent indication of how the related confidence interval may perform with the distributions under consideration.

Section 4.4

I don't think that the test covered by this section is a particularly good test, and I'm not going to emphasize it. During the early 1990s, Kelly J. Buchanan (now Kelly Thomas) did an excellent M.S. thesis for me, and she compared the performances of many test procedures for the generalized Behrens-Fisher problem. Based on research done for her thesis, Kelly and I concluded that the Fligner-Policello test wasn't very trustworthy, since it could have an inflated type I error rate, appreciably greater than the nominal level of the test. From results in the literature, it appeared that a test by Sen, which like the Fligner-Policello test is a modification of the W-M-W test, is a bit better. Although it may have been somewhat unfair to dismiss the Fligner-Policello test so quickly, it was not included in Kelly's study, but she did include Sen's test, a modification of Sen's test by Fung, Yuen's test based on trimmed means, and several other procedures. It was found that Yuen's test was better than Fung's modification of Sen's test, which was better than Sen's test. So if there is no reason to think that Fligner and Policello's test works better than Sen's test, we can conclude that there is no good reason to choose to use the Fligner and Policello test. Kelly also investigated some bootstrap methods that I helped her develop, and at least one of them appeared to be promising for certain types of situations. My guess is that with advances in bootstrap methods over the past 10 years, one might be able to come up with a bootstrap method that would pretty much dominate Fligner and Policello's test. But right now, if I were asked to do a test about the medians/means of two symmetric distributions having unequal variances, I would choose to rely on Yuen's test. (Note: The test by Yuen which I am referring to is from her 1974 Biometrika paper, "The two-sample trimmed t for unequal population variances", and not the test based on trimmed means from her 1973 paper with Dixon, which is for when homoscedasticity can be assumed.)
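If you want to try Yuen's 1974 test without specialized software, my understanding is that SciPy's ttest_ind gained a trim argument (around version 1.7) that performs a trimmed (Yuen-type) t test, and with equal_var=False it does not assume equal variances --- but check the documentation of your SciPy version before relying on this sketch (the data below are simulated placeholders).

    import numpy as np
    from scipy.stats import ttest_ind

    rng = np.random.default_rng(1)
    x = rng.normal(0.0, 1.0, 25)    # hypothetical samples with unequal spreads
    y = rng.normal(0.5, 3.0, 15)

    result = ttest_ind(x, y, equal_var=False, trim=0.2)  # 20% trimming from each tail
    print(result.statistic, result.pvalue)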

I'll offer some specific comments about the text below.
p. 135
I think that H&W should indicate that they are dealing with the generalized Behrens-Fisher problem, and that the Behrens-Fisher problem should be used to refer to tests and confidence intervals for the difference in the means of two normal distributions having unequal variances.
p. 136
The table only covers cases for which both sample sizes are less than or equal to 12. Upon looking at the entries in the table, it can be noted that the 0.1 critical values are fairly close to z0.1 when both sample sizes are at least 8, but that the other tabulated critical values for the Fligner-Policello test are greater than the corresponding standard normal critical values. This suggests that for sample sizes just beyond the ranges covered by the table, the normal approximation may be appreciably anticonservative unless one is doing a level 0.1 test.

Section 4.5

Many of the results presented in this section match those presented in Sec. 3.11, which pertains to the efficiencies of paired replicates and one-sample location procedures. In fact, we can use the results of Sec. 3.11 to give us information about how the two-sample median test (which I will present in class) performs --- we can use (3.118) on p. 105 (which pertains to the asymptotic relative efficiency of the sign test with respect to the one-sample t test) to see the asymptotic relative efficiency of the two-sample median test with respect to Student's two-sample t test. We can also use the results of Sec. 4.5 to give us information about how the one-sample normal scores test performs --- we can use the second table on p. 140 (which pertains to the asymptotic relative efficiency of the two-sample normal scores test with respect to the W-M-W test) to see the asymptotic relative efficiency of the one-sample normal scores test with respect to the signed-rank test.

I'll offer some specific comments about the text below.
p. 140
One can multiply the corresponding entries in the two tables on p. 140 to obtain the asymptotic relative efficiencies of the two-sample normal scores test with respect to Student's two-sample t test (and the values obtained are also the asymptotic relative efficiencies of the one-sample normal scores test with respect to Student's one-sample t test). For example, when the distributions underlying the data are normal, the ARE of the normal scores test with respect to the t test is about 1.047*0.955, which rounds to 1.00 (and in fact a precise calculation would yield the value 1 exactly). When the distributions underlying the data are logistic, the ARE of the normal scores test with respect to the t test is about 0.955*1.097, which is about 1.05.
p. 140
Note the very interesting fact that the ARE of the two-sample normal scores test based on the exact expected value normal scores with respect to Student's two-sample t test is greater than or equal to 1 for all underlying distributions. A similar story holds if we consider the test based on Van der Waerden scores, or if we consider one-sample normal scores tests.