Some Comments about Chapter 3 of Hollander & Wolfe
Section 3.1
I get annoyed with H&W. The "main part" of each section is
"cookbookish" and then details are given in the Comments at the
end of the section. But the comments tend to address selective details,
and I feel as though a nice description of the procedures, with good
motivation and discussion, is omitted. Perhaps our discussion in the
classroom will fill in some gaps. Meanwhile, I'll offer some comments
about the text below.
- p. 36, Assumptions
- While each individual in the population may have a unique
distribution for the Y-X difference, if the sample of
Zi is obtained by randomly drawing from the
population, we can think of the
Zi
as being identically distributed (having as their common distribution
the mixture distribution obtained by giving weight 1/N, where
N is the population size, to each of the N
Y-X difference
distributions). I suppose a good thing about the test is that if the
treatment has a constant effect on each member of the population, one
doesn't really need a random sample --- all that is needed, besides
independence, is that
after an individual from the population is selected, the measured
difference will be governed by a distribution which is symmetric about
theta. One of the claims sometimes made about nonparametric
tests is that random samples are not really needed for the tests to
be meaningful. To explore this notion further, suppose that the
treatment does nothing, and that all of the difference distributions are
then symmetric about 0. As long as we have independence, the
signed-rank test statistic's distribution should follow its usual null
distribution, no matter how the individuals (subjects) were selected.
But from my point of view, if a rejection of the null hypothesis of no
treatment effect is obtained, all I have is evidence of some sort of a
treatment effect. Not being a believer (for most settings) in a
constant treatment effect, if the data can be viewed as a random sample,
I could use the data to make inferences about the nature of the
treatment effect for the population which the random
sample represents. (E.g., I could estimate the median change and/or the
mean change, or perhaps the proportion of individuals in the population
for which the change will be at least as large as 5.) But if I cannot
view the data as a random sample, I would be hesitant to use it to make
an inference about a larger population --- all I would feel confident in
believing is that I have strong evidence that the treatment has some
effect on at least a portion of the population. If I can believe in a
constant treatment effect (which I might be able to do if my
observations are just repeated measurements on the same treated unit,
and the variation is just due to measurement error, which has the same
distribution before and after the treatment), then I suppose I can go
onwards from a rejection of the null hypothesis and use the methods of
Sections 3.2 & 3.3 to estimate the value of the treatment effect,
theta. (Perhaps I should discuss all of this "philosophical
stuff" in class. There are some subtle points involved, and it may take
you a while to get comfortable with everything.)
- p. 37, a & b
- Note that t_alpha is an integer. For each
n, only certain
values are possible for alpha (not necessarily 0.05, 0.025, 0.01,
and 0.005). If, instead of using a test having a preset size, one just
wants the p-value, then for an upper-tail test you can look up the value
of t+ in the x column of Table A.4 for the
right value of n, and the p-value is the probability given in the
next column over. For a lower-tail test, you look up
n(n+1)/2 - t+ in the x column.
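If you'd rather not flip to Table A.4, the table can be reproduced by brute-force enumeration. Below is a minimal sketch (the function name is mine, and full enumeration is only practical for small n) of the exact upper- and lower-tail p-values for the signed-rank statistic.

```python
from itertools import product

def exact_signed_rank_pvalue(t_plus, n, tail="upper"):
    """Exact null p-value for T+ by enumerating all 2^n equally likely
    sign patterns on the ranks 1, ..., n (the same probabilities that
    Table A.4 tabulates); only practical for small n."""
    ranks = range(1, n + 1)
    count = 0
    for signs in product((0, 1), repeat=n):
        t = sum(r for r, s in zip(ranks, signs) if s)
        if (tail == "upper" and t >= t_plus) or (tail == "lower" and t <= t_plus):
            count += 1
    return count / 2 ** n

# The lower-tail p-value can also be obtained from the upper tail via the
# symmetry noted above: P(T+ <= t) = P(T+ >= n(n+1)/2 - t).
```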
- p. 40
- Here are some comments about Example 3.1.
- If one wants to round the p-value to two significant digits
(and even an exact p-value can benefit from rounding, since it rarely
matters whether a p-value is 0.0195 or 0.0196), the value 0.0195 by itself
doesn't tell us how to round (i.e., whether it should
be 0.019 or 0.020). I don't like the default in StatXact, which is to
round the p-value to the nearest ten thousandth. Sometimes one waits a
while to get a p-value, and then StatXact prints out 0.0000. To avoid
being disappointed if this were to happen, I suggest that, before
doing any tests, you click on Options on the top bar of the main
window of StatXact, and then select Global. For Display of
Numeric Output, select Exponential, which will result in
p-values given in scientific notation, always displaying some
significant digits (and never giving all zeros). Before closing, go
down to the bottom and put a check beside Save Global Parameters
Permanently (which will keep the scientific notation while you
continue to use StatXact for additional tests during your
session, but unfortunately will not be remembered after you shut down
StatXact, and so you have to change it each time you start up
StatXact). Now when the exact signed-rank test is run, one can
see that the one-sided p-value rounds to 0.019531, which means that if
we want two significant digits, it should be 0.020, and not 0.019.
- As H&W point out, Minitab uses a normal approximation with a
continuity correction. So the 5 in T* becomes 5.5,
and
T* becomes -2.014, and one obtains an approximate
p-value of about 0.022. In this case, the continuity correction made
the approximation worse. I find that this is sometimes the case when
one should not even be using an approximation. That is, if n is quite
small, the approximation without the continuity correction may be better
than the approximation with it, but of course an exact p-value is better
than any approximation, and there is no good reason not to report an
exact p-value. I find that the continuity correction can also make the
approximation worse when the p-value is very small, but in most cases that
I've investigated, with a sample size larger than 20, the continuity
correction made the normal approximation better.
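To see the effect of the continuity correction for yourself, here is a small sketch (the function name is mine) of the lower-tail normal approximation; with t_plus = 5 and n = 9 it reproduces the -2.073 / -2.014 z values and the roughly 0.019 / 0.022 p-values discussed above.

```python
from math import sqrt
from scipy.stats import norm

def signed_rank_lower_p(t_plus, n, continuity=True):
    """Lower-tail normal approximation for the signed-rank statistic
    (no zeros or ties); the continuity correction adds 1/2 to t_plus."""
    mean = n * (n + 1) / 4.0
    sd = sqrt(n * (n + 1) * (2 * n + 1) / 24.0)
    cc = 0.5 if continuity else 0.0
    z = (t_plus + cc - mean) / sd
    return z, norm.cdf(z)

print(signed_rank_lower_p(5, 9, continuity=False))  # z ~ -2.073, p ~ 0.019
print(signed_rank_lower_p(5, 9, continuity=True))   # z ~ -2.014, p ~ 0.022
```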
- An exact permutation test yields a one-tailed p-value of about
0.014, which is smaller than the p-value from the signed-rank test.
I'll go over the permutation test in class since H&W doesn't cover it.
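As a preview, here is a rough sketch of the exact one-sample (paired) permutation test as I'll present it: under the null hypothesis each difference is equally likely to have been +|d| or -|d|, and the sum of the differences serves as the statistic. The function name is mine, and full enumeration is only feasible for small n.

```python
from itertools import product

def paired_permutation_pvalue(diffs, tail="upper"):
    """Exact paired permutation test: enumerate all 2^n sign assignments to
    the |differences| and compare the resulting sums to the observed sum."""
    observed = sum(diffs)
    n = len(diffs)
    count = 0
    for signs in product((1, -1), repeat=n):
        t = sum(s * abs(d) for s, d in zip(signs, diffs))
        if (tail == "upper" and t >= observed) or (tail == "lower" and t <= observed):
            count += 1
    return count / 2 ** n
```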
- pp. 41-42
- Here are some comments about Example 3.2.
- In the book, a conservative approach (which is good if you don't
have StatXact) is used to obtain the "exact" p-value of 0.039 (which is
the p-value corresponding to a test statistic value of 62, instead of
the value of 62.5). StatXact can be used to arrive at an exact p-value
based on the midrank scheme. (The exact method using midranks is
explained in Comment 11 on pp. 46-48.) Below I will compare
various p-values, obtained using different methods.
- 0.0334 (StatXact, using midranks)
- 0.039 (conservative scheme using table, looking up 62)
- 0.0386 (same as above, if we had a more accurate table)
- 0.0326 (normal approx. using 62.5, variance corr., but no
cont. corr.)
- 0.036 (Minitab's normal approx. using 62.5,
w/ a 1/2 cont. corr.)
This illustrates that if one doesn't have StatXact, the normal
approximation based on the value of 62.5 comes closer to the exact value
of about 0.033 (the StatXact result) than does the conservative value of
0.039 obtained from the table.
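For those without StatXact, here is a sketch of the midrank normal approximation (the one labeled "variance corr." above). The tie correction subtracts sum(t^3 - t)/48 from the usual null variance, where the t's are the sizes of the groups of tied absolute differences; the function is my own, not the book's, so check it against Comment 11 before relying on it.

```python
import numpy as np
from scipy.stats import norm, rankdata

def signed_rank_midrank_z(diffs, continuity=False):
    """Signed-rank statistic computed from midranks, with the tie-corrected
    null variance; zero differences are assumed to have been removed."""
    d = np.asarray(diffs, dtype=float)
    ranks = rankdata(np.abs(d))                       # midranks of |d|
    t_plus = ranks[d > 0].sum()
    n = d.size
    mean = n * (n + 1) / 4.0
    var = n * (n + 1) * (2 * n + 1) / 24.0
    _, counts = np.unique(np.abs(d), return_counts=True)
    var -= ((counts ** 3 - counts).sum()) / 48.0      # tie correction
    cc = 0.5 if continuity else 0.0
    z = (t_plus - np.sign(t_plus - mean) * cc - mean) / np.sqrt(var)
    return t_plus, z, norm.sf(abs(z))                 # one-sided p, observed direction
```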
- I suspect the set of 12 differences is not a random sample. Having
to find pairs of workers with very similar characteristics would further
complicate the difficult task of drawing a random sample.
- p. 44, Comment 7
- The key to obtaining the null distribution of the signed-rank
statistic (and easily obtaining the null mean and variance) is to use
the fact that the null distribution of the test statistic is the same as
the sum of the Vi, with the
Vi being independent, having the distribution given near
the bottom of the page. I'll explain the equivalence in class. (I'll
start with an example of a randomized experiment, and then provide an
argument for the more general case.)
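The V_i representation also gives a painless way to compute the entire exact null distribution by convolution, which is essentially what I'll do on the board. A small sketch (my own code, not from H&W):

```python
import numpy as np

def signed_rank_null_pmf(n):
    """Exact null pmf of T+ built as the distribution of a sum of independent
    V_i, where V_i equals i or 0, each with probability 1/2."""
    m = n * (n + 1) // 2
    pmf = np.zeros(m + 1)
    pmf[0] = 1.0
    for i in range(1, n + 1):
        shifted = np.zeros_like(pmf)
        shifted[i:] = pmf[:m + 1 - i]          # the "V_i = i" branch
        pmf = (pmf + shifted) / 2.0            # each branch has probability 1/2
    return pmf                                  # pmf[t] = P(T+ = t)

# sanity check against Example 3.1: sum(signed_rank_null_pmf(9)[:6]) gives
# P(T+ <= 5) = 10/512 = 0.019531..., the exact one-sided p-value noted earlier
```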
- p. 45, Comment 7
- The special case of the central limit theorem which pertains to iid
random variables cannot be used to establish the asymptotic normality of
the signed-rank statistic, since the statistic cannot be viewed as a sum
of iid random variables. A more general version of the central limit
theorem, such as would be covered in a Ph.D.-level probability class, is
needed. Because of this, I won't establish the asymptotic normality in
class.
- p. 45, Comment 8
- H&W point out that the symmetry is verified for the n = 3
case in the book, but it's easy to see why the null distribution is
symmetric for any value of n. The key is to note that there is a
1:1 correspondence between sets of integer ranks that sum to x,
and sets of integer ranks that sum to n(n+1)/2 - x. (Ask me
about this in class if you don't understand why the null sampling
distribution is symmetric.)
- p. 46, Comment 9
- I'll discuss dealing with values of 0 in class. (With a randomized
experiment, it's clear that they should be ignored (as H&W suggests in
general). The StatXact manual also suggests an alternative
strategy.)
- p. 49, Comment 13
- Note that H&W use beta for the probability of a type II
error, instead of for power, and so 1 - beta is the power.
Section 3.2
I'll describe the general scheme for Hodges-Lehmann estimators associated
with nonparametric tests in class. (The estimator covered in this
section is a special case of a class of estimators.)
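As a preview of that discussion, the estimator of this section is the median of all n(n+1)/2 Walsh averages. A minimal sketch (my own helper function):

```python
import numpy as np

def hodges_lehmann(z):
    """Hodges-Lehmann estimator tied to the signed-rank test: the median of
    the Walsh averages (Z_i + Z_j)/2 over all pairs with i <= j."""
    z = np.asarray(z, dtype=float)
    i, j = np.triu_indices(z.size)
    return float(np.median((z[i] + z[j]) / 2.0))
```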
Section 3.3
H&W doesn't give a lot of explanation for the confidence interval
covered by this section. (They don't show why it would have the stated
coverage probability.) If there is time, I may say a little more
about this interval during the 3rd lecture, but I won't take time
to do so during the 2nd lecture --- I think other topics are more
important to discuss.
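For those who want to see where the stated coverage comes from without waiting, here is a sketch based on the usual duality with the signed-rank test: take the largest k with 2*P(T+ <= k) <= alpha, and use the (k+1)-th smallest and (k+1)-th largest Walsh averages as endpoints, which gives exact coverage 1 - 2*P(T+ <= k) for continuous data. The code is mine, so check the indexing against the book's formula before trusting it.

```python
import numpy as np

def walsh_ci(z, alpha=0.05):
    """Distribution-free CI for theta from the ordered Walsh averages, with
    exact coverage 1 - 2*P(T+ <= k) computed from the signed-rank null pmf."""
    z = np.asarray(z, dtype=float)
    n = z.size
    i, j = np.triu_indices(n)
    walsh = np.sort((z[i] + z[j]) / 2.0)
    # exact null cdf of T+ (same convolution idea as the V_i representation)
    pmf = np.zeros(n * (n + 1) // 2 + 1)
    pmf[0] = 1.0
    for r in range(1, n + 1):
        shifted = np.zeros_like(pmf)
        shifted[r:] = pmf[:pmf.size - r]
        pmf = (pmf + shifted) / 2.0
    cdf = np.cumsum(pmf)
    k = int(np.searchsorted(cdf, alpha / 2.0, side="right")) - 1
    if k < 0:
        raise ValueError("confidence level not achievable for this n")
    coverage = 1.0 - 2.0 * cdf[k]
    return walsh[k], walsh[walsh.size - 1 - k], coverage
```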
- p. 58, Comment 23
- Yet another estimator is suggested. Since no indication of when
this may be a good estimator to use is supplied, I think it's safe to
ignore/forget this estimator.
Section 3.4
I'll give my comments below, starting with one pertaining to the top
portion of p. 60, which is actually not part of Sec. 3.4 (and doesn't
seem to be part of any section --- H&W tend to have little introduction
portions like this, with the top portion of p. 60 serving as an
introduction of sorts to Sections 3.4, 3.5, & 3.6).
- p. 60, top portion
- At first, the fact that B2 doesn't require that all of the
differences have the same distribution may seem somewhat liberating.
But, since it would be very odd for a treatment to have a variable
effect, yet have the median difference be the same for each individual,
it seems like B2 would only hold if the treatment had the exact
same effect --- causing a constant change, theta, in all
individuals --- and the variability was just random measurement error
(or perhaps some sort of natural fluctuation of the phenomenon being
measured). From such a viewpoint, assumption B2 in effect imposes
a rather severe restriction on the nature of the treatment effect, and
the extra generality is far from liberating. I would prefer to be able to view the
observed differences as being a random sample (from a common
distribution), since in such a case theta can represent the
median difference caused by the treatment (when applied to everyone in
the population), and we don't have to assume that the treatment had
the same effect on all individuals.
- p. 60, (3.39)
- We can view (3.39) as being of the same general form as (3.3), each
being a sum of nonrandom
scores multiplied by iid indicator random
variables. For the signed-rank test, the scores are the integers, 1, 2,
3, ..., n, and for the sign test, each score has the same value
of 1. To create another nonparametric test, we only need to specify a
different set of scores. I'll do this in class when I introduce the
one-sample / paired-samples normal scores test.
(Annoyingly, neither H&W nor
StatXact include the one-sample / paired-samples normal scores
test, although both include the two-sample version, aka Van der
Waerden's test.) The one-sample / paired-samples permutation test can
also be written in the same general form as (3.3), except that the scores
differ from data set to data set, since they are the absolute values of the
observations. (If the null hypothesis value isn't 0, one first subtracts the
null hypothesis value from each observation and then takes absolute
values.)
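To make the "scores times indicators" structure concrete, here is a small sketch (my own helper, not in H&W) showing that one function covers the signed-rank, sign, and permutation statistics just by swapping the score vector; a normal scores test would use yet another set of scores.

```python
import numpy as np

def score_statistic(diffs, scores):
    """Statistic of the form (3.3)/(3.39): fixed scores, attached to the ranks
    of the |differences|, summed over the positive differences (no zeros or
    exact ties assumed)."""
    d = np.asarray(diffs, dtype=float)
    order = np.argsort(np.abs(d))                 # positions of |d| in rank order
    psi = (d[order] > 0).astype(float)            # indicator of a positive difference
    return float(np.dot(np.asarray(scores, dtype=float), psi))

# scores 1, 2, ..., n                   -> the signed-rank statistic T+
# scores 1, 1, ..., 1                   -> the sign statistic B
# scores = sorted absolute differences  -> the paired permutation statistic
```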
- p. 61, (3.45)
- For the normal approximation of the sign test, a continuity
correction typically improves the accuracy.
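A sketch of what that looks like (upper tail; B is the number of positive differences, which is Bin(n, 1/2) under the null; the function name is mine):

```python
from math import sqrt
from scipy.stats import norm

def sign_test_upper_p(b, n, continuity=True):
    """Upper-tail normal approximation to the sign test; the 1/2 continuity
    correction usually brings it closer to the exact binomial p-value."""
    cc = 0.5 if continuity else 0.0
    return norm.sf((b - cc - n / 2.0) / sqrt(n / 4.0))
```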
- p. 62, Ties
- With data from a matched-pairs randomized experiment, it definitely
makes sense to ignore the zero values when testing for a treatment
effect. (I'll explain this in class.)
- p. 64, Comment 26
- The difference between A2 and B2 is that B2
doesn't specify that the distributions are symmetric. B2' allows
for the possibility that the distributions are not continuous (but does
require that P(Zi = theta0) = 0,
since it must be that
P(Zi < theta0) =
P(Zi > theta0) = 0.5).
- pp. 68-69 (bottom of p. 68, top of p. 69)
- If a continuity correction is used, the approximate power is 0.5,
which is pretty close to the exact power. (Note: If the cont. corr.
isn't used, someone could do the computation another way and arrive at
0.6425, instead of 0.3575, as the approximate power. So not only does
the cont. corr. improve the accuracy here, it also gives a unique
value for the approximate power, which is not the case if the correction
isn't used.)
- p. 69, (3.56)
- I might derive (3.56) in class.
Section 3.5
- p. 72
- Even though the sample median has a connection with the sign test
(it's the Hodges-Lehmann est'r associated with the sign test), I think
it's important to keep in mind that the sample median is not a
particularly good estimator of the median for most distributions. The
Harrell-Davis estimator is typically a better choice (although it is not
resistant to gross outliers, like the sample median is, and so one must
watch out for them). Also, in many small
sample size situations, various trimmed means and M-estimators work
better than the sample median (and sometimes the sample mean is better),
even though these alternatives aren't necessarily consistent estimators
of the distribution median unless the distribution is symmetric.
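If you want to try these alternatives, both are already available in scipy; the data below are just illustrative random numbers, not anything from the text.

```python
import numpy as np
from scipy.stats import trim_mean
from scipy.stats.mstats import hdmedian

rng = np.random.default_rng(0)
z = rng.standard_normal(15)          # illustrative data only

print(np.median(z))                  # the sample median
print(float(hdmedian(z)))            # Harrell-Davis estimate of the median
print(trim_mean(z, 0.2))             # a 20% trimmed mean
```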
- pp. 72-73, Comment 38
- One can call the sample median the Hodges-Lehmann estimator
associated with the sign test, but since there are so many different
Hodges-Lehmann estimators (not all associated with nonparametric
procedures --- there is an H-L estimator for the error term variance in
an ANOVA model), I think it's better to simply refer to the estimator
dealt with in this section as the sample median.
- p. 73, Comment 43
- My guess is that in most situations, many other estimators will
outperform a quasimedian. I doubt that I would ever choose to use a
quasimedian in practice (unless I was trying to estimate the median of a
uniform distribution, with both endpoints of the support being unknown,
in which case I would use the average of the sample minimum and the
sample maximum, which is an extreme example of a quasimedian).
- pp. 73-74, Comment 44
- Estimators having the form given in (3.61) are called
L-estimators. Sample medians, the sample mean, trimmed means,
and quasimedians are all L-estimators.
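In code, the form (3.61) is just a fixed weight vector applied to the order statistics; the helper below is mine, for illustration only.

```python
import numpy as np

def l_estimator(z, weights):
    """L-estimator: a fixed linear combination of the order statistics."""
    return float(np.dot(weights, np.sort(np.asarray(z, dtype=float))))

# sample mean:    weights all equal to 1/n
# sample median:  weight 1 on the middle order statistic (odd n)
# trimmed mean:   equal weights on the central order statistics, 0 elsewhere
```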
- p. 74, Comment 45
- It is true that for some distributions, with an odd sample size,
one is better off deleting
the last observation in the sample (not the ordered sample), and
computing the sample median from a sample having an even sample size.
I believe that this is true for a uniform distribution, but not
all distributions. (Indeed, in Hodges and Lehmann's 1967 paper,
in their TABLE 1.4, there is an example which indicates that it isn't
always true for normal distributions, since the variance of the sample
median for the n = 19 case is smaller than the variance of
the sample median for the n = 18 case.)
- p. 74, Comment 46
- I've never seen this estimator for the asymptotic standard error of
the sample median given anywhere else, although I've seen several
alternatives suggested in various places. None of the alternatives that
I've investigated work well when the sample size is smallish, and I
suspect that the same is true for this estimator (after all, it is
referred to as an estimator of the asymptotic standard deviation).
Section 3.6
Note that only certain coverage probabilities are possible for exact
intervals (with the choices
depending on the sample size, and not necessarily including 0.9, 0.95,
and 0.99). Minitab's sint command will produce an
approximate interval having any approximate coverage probability,
using a nonlinear interpolation method. Pages 87-88 of Rand Wilcox's
Introduction to Robust Estimation and Hypothesis Testing (Academic
Press, 1997) describe such a method, provide references, and indicate
that some researchers have supported the use of the method. But when
n is rather small, I prefer to go with an exact interval, even
though I may have to use a nonstandard coverage probability.
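Here is a minimal sketch of the exact interval I have in mind (my own code; it does not reproduce Minitab's interpolation): take the (k+1)-th smallest and (k+1)-th largest order statistics, with k chosen from the Bin(n, 1/2) distribution, and the achieved coverage comes out as 1 - 2*P(B <= k).

```python
import numpy as np
from scipy.stats import binom

def exact_median_ci(z, alpha=0.05):
    """Exact order-statistic CI for the median: [Z_(k+1), Z_(n-k)], where k is
    the largest integer with 2*P(B <= k) <= alpha and B ~ Bin(n, 1/2)."""
    z = np.sort(np.asarray(z, dtype=float))
    n = z.size
    k = int(binom.ppf(alpha / 2.0, n, 0.5))
    while k >= 0 and 2.0 * binom.cdf(k, n, 0.5) > alpha:
        k -= 1                                  # back off until coverage >= 1 - alpha
    if k < 0:
        raise ValueError("confidence level not achievable for this n")
    coverage = 1.0 - 2.0 * binom.cdf(k, n, 0.5)
    return z[k], z[n - 1 - k], coverage
```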
- p. 78, Comment 52
- The estimators described here are the same as the quasimedians
described in Comment 43.
- p. 78, Comment 54
- Note that if the distribution is discrete, one can use the same
formulas (as in the continuous case) to obtain exact confidence interval
endpoints, only in the discrete case the interval endpoints are part of
the interval --- that is, we need to write the interval as a closed
interval instead of an open interval. If the phenomenon being measured
has a continuous distribution, but limited precision in measuring, or
excessive rounding, creates tied values, then one should perhaps express
a confidence interval as a closed interval. For example, with
z_(L) the lower confidence bound and z_(U) the upper confidence bound,
if z_(L+1) equals z_(L), or z_(U-1) equals z_(U), then I would write the
interval as [z_(L), z_(U)], and not (z_(L), z_(U)).
Section 3.7
There isn't a lot new in this section (given that one has gone through
Sections 3.4, 3.5, and 3.6). During the 3rd lecture, I'll extend
the coverage to deal with making inferences about quantiles other than
the median.
- p. 80
- Example 3.8 seems a bit silly --- I don't think it's
pertinent to do a two-sided test using 81.3035 as the null hypothesis
value, since 81.3035 seems to be an estimate obtained from previously
obtained data, and so it seems to be more of a two-sample problem than a
one-sample problem.
- p. 81
- The sentence before the Comments section points out that the
"data provides an example in which the populations of the Z
observations are probably not the same." I guess they mean that since the
experimental conditions were not the same from time to time, the
distribution of the measurement error should not be assumed to be
the same. This seems okay to think, but then why should we think that
each distribution is symmetric about the same value? (Why should we
assume that the error distribution is symmetric? Why should we even
think that the expected value of the measurements is the value that
we're trying to estimate?) One of the nice things about doing a test on
the differences of paired data is that it can be hoped that the
differencing operation cancels out any measurement bias (and also
ensures symmetry if there is no treatment effect).
Section 3.8
- p. 84
- See p. 61 if you have forgotten about the
b_alpha,1/2 notation.
- p. 85
- The approximate confidence interval example right before the
Comments section is silly. In my opinion, the beauty of the
approximate confidence interval formula is that one can use it to obtain
an approximate 95% or 99% confidence interval when an exact interval is
not possible.
- p. 85, Comment 57
- I'll cover making inferences about other quantiles in class.
Section 3.9
As far as I know,
the test described in this section isn't included on StatXact or any
other major statistical software package. Since it's a pain to perform unless
n is rather small, and since the test has little power to detect
asymmetry unless n is fairly large, I can't imagine that this
test gets a lot of use (and in fact, I'll guess that it gets
practically no use at all). If a distribution is skewed, it's usually
the case that the skewness can be detected with graphical methods unless
n is rather small. But, I suppose that when n is large enough for the test
to have decent power, it can be used to partially
confirm the apparent
skewness. However, I think that cases for which one would like to have
evidence of symmetry are more common than cases where one seeks evidence
of skewness, and the test is of little value for providing evidence of
symmetry, since a failure to reject the null hypothesis could be due to
low power to detect skewness.
My guess is that while most books on nonparametric statistics omit this
test, it is included here because one of the authors is partially
credited with its development. I'm not going to emphasize this test
because we don't have an easy way to perform it, and it just doesn't
seem that useful to me.
The first published article about this type of "triples test" for
asymmetry appeared in 1978, and was written by Davis and Quade. The
article about this test by Randles, Fligner, Policello, and Wolfe
appeared in 1980, but was originally submitted to the journal in July of
1977. So it may be the case that both teams of authors developed the
same test at about the same time, with the Davis and Quade article
appearing first in a journal that typically had a shorter time lag
between article submission and article publication. It can be
noted that there was a 23 month gap between the time the Randles et al.
article was first submitted and when the revision (that eventually
appeared in
print) was submitted. I'll guess that most of that time was taken by
the journal's editors and the article's referees to process the original
submission. Since the 1980 article contained additional information
about the test, I have no problem with giving the two sets of authors
equal billing.
It can be noted that the test described in Sec. 3.9 apparently can be
anticonservative (i.e., if the nominal size of the test is 0.05, the
actual probability of a type I error may exceed 0.05). Randles et al.
recommend using t critical values (using n df) instead of
standard normal critical values to help curb the anticonservative
behavior. This ploy helps (using the larger t critical values
will result in fewer type I errors), but the test may still be
anticonservative for certain parent distributions of the data. Another
test for asymmetry is based on the asymptotic normality of the sample
skewness. A Monte Carlo study done by Randles et al. indicates that
this competitor test has fewer problems with anticonservativeness, but also
rejects less often than the "triples test" when the alternative hypothesis is
true (i.e., it has lower power). So it seems to come down to a choice
between a more powerful test that is less accurate with respect to its
stated level, and a test which seems to misbehave
less under the null hypothesis but has lower power. Some would
argue strongly that the test that better respects the nominal type I
error rate should be chosen, while others would choose the more powerful
test as long as they felt that the actual type I error rate wasn't too
inflated above the nominal level of the test. This second viewpoint
wouldn't be so bad if we could characterize the types of situations in
which the "triples test" badly misbehaves, be able to identify when those
situations arise in practice, and avoid using the "triples test" in such
cases.
Below I'll summarize some results presented in the Randles et
al. paper. T designates the "triples test" based on standard
normal critical values,
T* designates the "triples test" based on
t_n critical values, and S designates the
competitor test based on the asymptotic normality of the sample
skewness.
The first two tables below show estimated type I error rates for the three
tests, for six different symmetric parent distributions.
The kurtosis of the distributions increases as one goes from the
1st distribution to the 5th distribution (I may add the
exact kurtosis values later), and the kurtosis does not exist (which
suggests really heavy tails) for the 6th distribution.
  n = 20            T       T*      S
  distribution 1    0.071   0.055   0.038
  distribution 2    0.045   0.034   0.023
  distribution 3    0.064   0.047   0.037
  distribution 4    0.067   0.053   0.041
  distribution 5    0.068   0.056   0.041
  distribution 6    0.103   0.083   0.056

  n = 30            T       T*      S
  distribution 1    0.079   0.065   0.040
  distribution 2    0.050   0.037   0.023
  distribution 3    0.058   0.048   0.032
  distribution 4    0.061   0.049   0.032
  distribution 5    0.065   0.056   0.035
  distribution 6    0.089   0.080   0.020
T* should be clearly preferred to T, since both
can be anticonservative, but T*'s performance isn't as
bad. One can note that while
T* can have an inflated type I error rate, it isn't
badly inflated except for the extremely heavy-tailed parent
distribution, and for this distribution S also has an inflated
type I error rate when n = 20. It might be worthwhile to adjust the "triples test"
a bit more, perhaps using t critical values with n-1 or
n-2 df (since the slightly larger critical values will dampen the
anticonservativeness). Another idea would be to use bootstrapping to
improve the test based on the sample skewness --- if it could be made to
be less conservative in cases for which it is conservative, then its
power should be improved in those cases.
The next two tables show estimated powers (against a particular
alternative) for the two most accurate
tests, for fourteen different skewed parent distributions.
The skewness is given for eight of the distributions.
  n = 20             skewness   T*      S
  distribution 7     0.50       0.222   0.147
  distribution 8     1.50       0.625   0.301
  distribution 9     0.90       0.190   0.125
  distribution 10    1.50       0.286   0.179
  distribution 11    0.80       0.060   0.044
  distribution 12    2.00       0.129   0.090
  distribution 13    3.16       0.769   0.323
  distribution 14    3.88       0.793   0.324
  distribution 15    --         0.248   0.192
  distribution 16    --         0.446   0.384
  distribution 17    --         0.219   0.153
  distribution 18    --         0.624   0.367
  distribution 19    --         0.140   0.088
  distribution 20    --         0.304   0.100

  n = 30             skewness   T*      S
  distribution 7     0.50       0.341   0.262
  distribution 8     1.50       0.817   0.452
  distribution 9     0.90       0.325   0.175
  distribution 10    1.50       0.478   0.222
  distribution 11    0.80       0.072   0.038
  distribution 12    2.00       0.209   0.089
  distribution 13    3.16       0.924   0.380
  distribution 14    3.88       0.940   0.357
  distribution 15    --         0.345   0.287
  distribution 16    --         0.606   0.579
  distribution 17    --         0.393   0.232
  distribution 18    --         0.846   0.562
  distribution 19    --         0.164   0.026
  distribution 20    --         0.396   0.048
One can see that the power values are generally considerably higher for
the the "triples test", and so it would be nice to find a way to correct
it's anticonservativeness problem. Note that for the smallish sample
sizes considered in the Monte Carlo study, the power can be rather low
when the skewness is less than 3 (and so while it's perhaps the better
of the two tests, it can have disappointingly low power, since a
skewness of 3 is fairly large in my opinion, and one might think that
smaller skewnesses should be detected with higher probability than some
of the low powers observed).
Below are some more comments.
- p. 93, Comment 64
- This scheme seems a lot easier than the one described on p. 88.
One just needs to compare the middle value of each triple to the average
of the smallest and largest values.
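In code, the per-triple computation amounts to the following sketch; this only shows the counting scheme (the full test also needs the variance estimate from Sec. 3.9), and the directional convention and the 1/3 scaling of the book's f* are my reading of it, so double-check against p. 93.

```python
from itertools import combinations

def triples_count(z):
    """Sum over all triples of +1 if the middle value lies below the average
    of the smallest and largest (suggesting right skewness), -1 if above."""
    total = 0
    for triple in combinations(z, 3):
        lo, mid, hi = sorted(triple)
        if mid < (lo + hi) / 2.0:
            total += 1
        elif mid > (lo + hi) / 2.0:
            total -= 1
    return total
```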
- p. 93, Comment 65
- A more theoretical course than STAT 657 would spend a bit of time
on the general theory of U-statistics. But such a course would
be more appropriate for Ph.D. students who want to do research on
U-statistics, and develop a new test procedure. The focus of
STAT 657 is on the correct application of existing test procedures.
Note that here
U-statistics refer to a large class of statistics, and not what
is sometimes referred to as the Mann-Whitney
U statistic, which is just one member of the more general class
of U-statistics.
Section 3.10
As far as I know,
the test described in this section isn't included on StatXact or any
other major statistical software package. Since it's a pain to perform unless
n is rather small,
I can't imagine that this
test gets a lot of use (and in fact, I'll guess that it gets
practically no use at all).
The null hypothesis under consideration in this section implies that the
distribution of Xi - Yi is symmetric about 0,
and we can shoot down that hypothesis with easier-to-perform tests such
as the
signed-rank test, the sign test, the normal scores test, and the permutation
test.
For example, using the data from Problem 3.113 on p. 104 of the text
(which is the data in Table 3.3 on p. 50), one gets a p-value of 0.125
using the test from Sec. 3.10, but one can get that same p-value much
more easily using the sign test, and one can get the smaller p-value of about
0.1094 using the signed-rank test. (For the data in Example 3.11 on p.
97, one can get smaller p-values using the sign test, the signed-rank
test, and the normal scores test, than one can get with the test from
Sec. 3.10. But none of the p-values are very small, and so perhaps
the differences in their values don't mean a lot.)
Section 3.11
I'll work through an example in class to show how to obtain results like
those given by (3.116) on p. 104 and (3.118) on p. 105. (I will assign
a homework problem or two that will instruct you to find similar results
for other distributions.)