Some Comments about Chapter 8 of Hollander & Wolfe



The setting here is a bit different from the types of things that we've been considering in previous chapters. In Ch. 8 it is assumed that the sample arises from n iid random variables from a bivariate distribution, and the basic test considered is of the null hypothesis that Xi is independent of Yi (i = 1, 2, ..., n) against the general alternative of lack of independence. The exact null sampling distributions which correspond to the tables in the back of H&W assume no ties (which will be the case with probability 1 if the observations truly arise from a continuous distribution), but by using StatXact we can deal (in an exact way) with data sets in which ties occur.


Section 8.1

StatXact will do this test. (Click here to see how to easily do Kendall's test using StatXact. But note that StatXact screws up the asymptotic version of the test. This shouldn't bother us a lot, since it's typically better to go with a Monte Carlo estimate of the exact p-value in cases for which an exact p-value cannot be obtained (due to n being too large), but unless one is aware of StatXact's error, the difference between the asymptotic p-value and the others may cause one to worry that something is messed up (and it is). Click here to learn about what StatXact has wrong.)

I'll offer some specific comments about the text below.
p.364
Rather than view expression (8.2) as the definition of tau, I think that it is better to view pc - pd as the definition of tau, where pc and pd are the probabilities of concordance and discordance. (See Comment 3 on p. 369 to learn what is meant by concordant and discordant relationships.)
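To make the pc - pd view concrete, here is a minimal sketch of the sample analogue: the difference between the proportions of concordant and discordant pairs. (The function name and data are my own, for illustration only.)

```python
from itertools import combinations

def kendall_tau(x, y):
    """Sample analogue of tau = p_c - p_d: the difference between the
    proportions of concordant and discordant pairs (assumes no ties)."""
    n = len(x)
    concordant = discordant = 0
    for i, j in combinations(range(n), 2):
        s = (x[i] - x[j]) * (y[i] - y[j])
        if s > 0:
            concordant += 1
        elif s < 0:
            discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)

# A perfectly monotone-increasing sample gives tau = 1.
print(kendall_tau([1, 2, 3, 4], [10, 20, 30, 40]))  # 1.0
```

With every pair concordant, pc is estimated as 1 and pd as 0, so the statistic attains its maximum of 1.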
p.365, Procedure
When doing a one-tailed test using (8.7) as the alternative, it needs to be kept in mind that a failure to reject shouldn't be taken as strong evidence that the null hypothesis of independence is true, since a failure to reject could be a very likely event if there is negative correlation (or even weak positive correlation).
p.365, Procedure
Note that Table A.30 is in two main parts. If n choose 2 is even, the value of n is included in the first part of the table, and if n choose 2 is odd, the value of n is included in the second part of the table. (See Comment 9 on p. 372 for related remarks.) Of course, if we use StatXact then we won't have to worry about using the tables.
p.365, Procedure
That the lower-tailed test critical values indicated in (8.9) are just the additive inverses of the related upper-tailed test critical values follows from the symmetry of the null sampling distribution about 0. (See Comment 8 on p. 372.)
p. 366, Ties
(error in book) In my copy of the book, (8.17) is incomplete. The last part of (8.17) should be if (d - b)(c - a) < 0.
p. 367, Example 8.1
On p. 367 it is indicated that the panel scores arise from a discrete distribution because the only possible values are in increments of 1/80 from 1 to 6 (but since data values are rounded to the nearest tenth, one can say that the only possible values are in increments of 1/10 from 1 to 6). Of course, the Hunter L values are also really from a discrete distribution if the only possible values are in increments of 1/10. There are no tie situations in the small data set, but in principle a tie could have occurred. The nice thing about StatXact is that it can deal with ties in an exact manner.
p. 368, Example 8.1
StatXact produces an exact p-value of about 0.060, which matches the value from Table A.30. (Click here to see how to easily do Kendall's test using StatXact.) If a continuity correction of 1 is used, then the approximate p-value is about 0.059, which is very close to the exact p-value. (Note that the appropriate continuity correction is 1, as opposed to 1/2, because the null sampling distribution only has positive probability assigned to even integers.) H&W report an approximate p-value of 0.0475, which results from not using a continuity correction. (A continuity correction usually improves things except for some cases where the probability being approximated is rather extreme.) StatXact screws up the asymptotic version of the test. This shouldn't bother us a lot, since it's typically better to go with a Monte Carlo estimate of the exact p-value in cases for which an exact p-value cannot be obtained (due to n being too large), but unless one is aware of StatXact's error, the difference between the asymptotic p-value and the others may cause one to worry that something is messed up (and it is). (Click here to learn about what StatXact has wrong.)
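The normal approximation (with and without the continuity correction of 1) can be sketched as follows, using the null variance Var(K) = n(n-1)(2n+5)/18. The values K = 16 and n = 9 below are my inference from the p-values quoted above, not taken directly from the book's computation.

```python
from math import erfc, sqrt

def kendall_K_approx_pvalue(K, n, continuity_correction=1):
    """Upper-tail normal approximation for Kendall's K statistic.
    Under independence, E(K) = 0 and Var(K) = n(n-1)(2n+5)/18; a
    correction of 1 (not 1/2) is used because the support of K
    steps by 2."""
    sd = sqrt(n * (n - 1) * (2 * n + 5) / 18)
    z = (K - continuity_correction) / sd
    return 0.5 * erfc(z / sqrt(2))  # P(Z >= z) for standard normal Z

# Values consistent with the p-values quoted above (K = 16, n = 9):
print(kendall_K_approx_pvalue(16, 9, 0))  # no correction; compare H&W's 0.0475
print(kendall_K_approx_pvalue(16, 9, 1))  # corrected; compare the exact 0.060
```

The corrected value lands much closer to the exact p-value, illustrating why the correction is usually worth using.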
pp. 369-370, Comment 5
The "trick" presented is convenient for a hand calculation, but with StatXact we won't need to bother with such things. Alternatively, the value of the test statistic can be determined by making a scatter plot of the ordered pairs, and carefully counting the number of concordant and discordant pairs of ordered pairs (but unless n is rather small, this can be tedious to do).
p. 370, Comment 6
Note that to obtain the exact sampling distribution, one just needs to consider n!, and not (n!)(n!), equally-likely outcomes.
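For small n this is easy to verify by brute force: fix the x-ranks as 1, ..., n and enumerate the n! equally-likely orderings of the y-ranks. (A sketch; names are mine.)

```python
from itertools import combinations, permutations
from collections import Counter

def exact_null_distribution_K(n):
    """Exact null distribution of Kendall's K: fix the x-ranks and
    treat all n! orderings of the y-ranks as equally likely (there
    is no need to consider (n!)(n!) outcomes)."""
    dist = Counter()
    for perm in permutations(range(n)):
        K = sum(1 if perm[j] > perm[i] else -1
                for i, j in combinations(range(n), 2))
        dist[K] += 1
    return dist

dist = exact_null_distribution_K(4)
# Symmetric about 0, and since C(4,2) = 6 is even, K takes only
# even values here (which is why Table A.30 is split by parity).
print(sorted(dist.items()))
```

The symmetry about 0 is also what justifies using additive inverses for the lower-tailed critical values.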
pp. 372-374, Comment 10
I'm not going to take class time to derive (8.23) on p. 373, which H&W indicates involves "considerable tedious calculation." Likewise, I won't discuss (8.25) and (8.26). (This semester, I'm doing a few derivations of a similar sort, but not taking time to derive all of the results presented in H&W.)
pp. 374-375, Comment 11
Prior to StatXact, discussing the merits of the possibilities considered would have been more important, but now I am content to use StatXact's sensible method of dealing with ties, which I will explain in class.
p. 376, Comment 14
Mann's test for trend is a nifty use of the null sampling distribution of Kendall's test. I'll discuss it during class. (For rejecting the null hypothesis that a sample arose from n iid random variables, Mann's test can be considerably more powerful than dichotomizing the data and then applying the runs test for binary outcomes. To use Mann's test, or any other test for nonrandomness, one needs to have that the observations making up the sample have a natural ordering to them (e.g., they could be time-ordered, or spatially-ordered). So, counting the "variable" that gives the ordering, it's like we still have two variables --- but one of them need not be random for Mann's test of trend.)
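A sketch of the Mann statistic: it is just Kendall's K computed between the time index and the observed series. (Function name and data are mine, for illustration.)

```python
from itertools import combinations

def mann_trend_K(series):
    """Mann's test statistic for trend: Kendall's K between the
    time index 1..n and the observed series (no ties assumed).
    Large positive values suggest an upward trend."""
    K = 0
    for i, j in combinations(range(len(series)), 2):  # i < j in time
        diff = series[j] - series[i]
        K += 1 if diff > 0 else (-1 if diff < 0 else 0)
    return K

# A strictly increasing series attains the maximum K = n(n-1)/2,
# the strongest possible evidence of an upward trend.
print(mann_trend_K([1.2, 1.9, 2.4, 3.3, 4.0]))  # 10
```

The observed K is then referred to the same null sampling distribution used for Kendall's test.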
pp. 377-378, Problem 1
I don't have anything terribly important to add here --- I just thought that I would point out the rather sad situation the experimental units found themselves in. The dogs were force-fed tapeworms collected from sheep carcasses, and then apparently killed and autopsied 20 days later in order to determine an effect of the force-feeding.
p. 379, Problem 7
It's interesting here that there doesn't seem to have been a method to determine which twin corresponds to the x measurement, and which twin corresponds to the y measurement. I guess a random assignment would be a fair way to do it.

Section 8.2

This is a very short section. I may discuss it in class prior to discussing Sec. 8.1.

In order to refresh your memory of what the estimand is, you may want to go back and read the first two comments (2 and 3) at the top of p. 369.

Section 8.3

Note that StatXact's confidence interval for tau doesn't match the one presented in this section. If you read Comment 28 on p. 386, you'll see that over the years a variety of standard error estimates for the nonnull case (when we compute a confidence interval for tau, we certainly don't assume that the null hypothesis of independence is true (because independence implies that tau is equal to 0)) have been proposed. My guess is that StatXact is using one of the other standard error estimates in its confidence interval (which is approximate --- as is the one in Sec. 8.3 of H&W --- and not exact).

I'll offer some specific comments about the text below.
p. 383, Procedure
The equation on the line of the text right above (8.38) can be used to check the values of the Ci, provided that one obtains the value of K some other way. ((8.42) on p. 385 can also be used as a check.)
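If I'm reading the check correctly, each concordant (or discordant) pair contributes to two of the Ci, so the Ci should sum to 2K. A small sketch of that arithmetic check (names and data are mine):

```python
from itertools import combinations

def C_values_and_K(x, y):
    """C_i = (# concordant) - (# discordant) pairs involving
    observation i; K is Kendall's statistic.  Each pair contributes
    to two C_i's, so the C_i should sum to 2K -- a handy check."""
    n = len(x)
    C = [0] * n
    K = 0
    for i, j in combinations(range(n), 2):
        s = (x[i] - x[j]) * (y[i] - y[j])
        sign = 1 if s > 0 else (-1 if s < 0 else 0)
        C[i] += sign
        C[j] += sign
        K += sign
    return C, K

C, K = C_values_and_K([1, 3, 2, 5, 4], [2, 1, 4, 3, 5])
print(C, K, sum(C) == 2 * K)
```

So if one obtains K another way, any discrepancy in sum(Ci) = 2K flags an error in the Ci.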
p. 385, Comment 23
Provided that n is small enough so that overplotting doesn't cause a problem, a scatter plot can be used to determine the values of the Ci, by expressing Ci as a difference in the numbers of concordant and discordant bivariate observations (ordered pairs) involving the ith observation.
p. 385, Comment 25
H&W don't supply us with the needed critical values, and so one may need to get them from the referenced journal article.
p. 386, Comment 26
This comment seems extremely silly to me. We can always elect to throw away part of the data to cut down on the work, but most of the time we choose not to because it would lead to a decrease in accuracy --- and that would often be the case here. The Samara-Randles confidence intervals of this section are based on an assumption that the estimated standard error of the point estimator is equal to the true standard error, and this would usually not be the case, but would typically be closer to being the case as the number of observations used to estimate the standard error is increased.
p. 386, Comment 28
Note that the confidence intervals presented in this section are the ones of Samara and Randles.
p. 386, Comment 29
By using the confidence interval to perform a test (or equivalently, making a test statistic out of the asymptotically normal pivot), one can do a test of the null hypothesis that tau equals 0 against the alternative that tau is not equal to 0, as opposed to using Kendall's test to test the null hypothesis of independence against the general alternative of lack of independence.
pp. 386-387, Comment 30
In the interest of saving time for other topics, I'm going to skip the material on partial correlation coefficients.

Section 8.4

Those of you who took my summer course this past summer got a decent introduction to bootstrapping, which included an explanation of the reasoning behind the percentile confidence interval method. Why the percentile method should work (and sometimes it doesn't work all that well) doesn't come across from just reading the presentation given in H&W. Unfortunately, there just won't be time in our nonparametric statistics course to spend a week or more on bootstrapping basics and a justification of the percentile confidence interval. But I will cover a lot of the bootstrap material that is presented in H&W (keeping it at the level of the H&W presentation, and skipping some of the more advanced material referred to in the Comments).

Bootstrapping can be used to estimate the bias and the standard error of estimators, obtain confidence intervals and perform hypothesis tests, estimate average prediction error, and other things as well. My Summer 2002 Advanced Topics course spent several weeks on bootstrapping and jackknifing. The web site for that course contains some information and links that may be of interest if you want to learn more about bootstrapping. (In my opinion, reading the book An Introduction to the Bootstrap by Efron and Tibshirani is by far the best way to learn about bootstrapping basics.) You may be able to develop a better idea of how bootstrapping works by reading my description of bootstrap bias estimation.

I'll offer some specific comments about the text below.
p. 388
In the first paragraph of the section, tau is referred to as "the population measure of association defined by (8.2)" --- which is okay, only I think that viewing pc - pd as the definition of tau is better (where pc and pd are the probabilities of concordance and discordance).
p. 388, Procedure
Some (see p. 389) replace B by B + 1 in (8.51), and then use a value like 999, 1999, or 4999 for B, so that k is an integer.
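The whole percentile procedure (using B + 1 with B = 999 so that k = alpha/2 * (B + 1) is an integer) can be sketched as follows; note the symmetry this gives: the lower bound is the kth ordered value from the bottom and the upper bound is the kth from the top. (Function names and data are mine, for illustration.)

```python
import random
from itertools import combinations

def kendall_tau(x, y):
    num = sum((1 if s > 0 else -1 if s < 0 else 0)
              for i, j in combinations(range(len(x)), 2)
              for s in [(x[i] - x[j]) * (y[i] - y[j])])
    return num / (len(x) * (len(x) - 1) / 2)

def percentile_ci(x, y, B=999, alpha=0.10, seed=0):
    """Percentile bootstrap CI for tau: resample the (x_i, y_i)
    pairs with replacement, recompute tau each time, and take
    order statistics of the B bootstrap values."""
    rng = random.Random(seed)
    n = len(x)
    taus = []
    for _ in range(B):
        idx = [rng.randrange(n) for _ in range(n)]
        taus.append(kendall_tau([x[i] for i in idx], [y[i] for i in idx]))
    taus.sort()
    k = round(alpha / 2 * (B + 1))  # = 50 for B = 999, alpha = 0.10
    return taus[k - 1], taus[B - k]  # kth from bottom, kth from top

x = [1.1, 2.0, 2.9, 3.5, 4.8, 5.1, 6.3, 7.2, 8.4]
y = [1.4, 1.9, 3.3, 3.1, 4.5, 5.9, 5.8, 7.7, 8.1]
print(percentile_ci(x, y))
```

Since resampling pairs can duplicate observations, ties appear in the bootstrap samples even when the original data have none; the tau function above simply scores tied pairs as 0.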
p. 389, Example 8.4
Note that Appendix B, which contains S-Plus functions, is found in the Solutions Manual, and not in the main text. (One had to purchase the solutions manual separately.)
p. 389, Example 8.4
Near the bottom of the page it is pointed out that the bootstrap confidence interval differs from the Samara-Randles interval of Sec. 8.3. With only 9 data points, I wouldn't expect good accuracy in either case. If I had to choose which interval is more trustworthy, I'd go with the Samara-Randles interval in this case, relying on the approximate normality and the estimated standard error, as opposed to thinking that 9 observations provides a good estimate of the cdf of the unknown bivariate distribution. It bothers me a bit that the bootstrap interval is centered on 0.347, which is appreciably different from the point estimate of about 0.444. (While the center of a good bootstrap interval need not equal the point estimate, I'm uncomfortable with the rather large difference in this case.)
p. 390, Comment 31
This is a special case of the general result developed in the first half of Comment 32 (see p. 391).
p. 391, Comment 32
Notice that here H&W are using B + 1, instead of B, to determine k (as compared to (8.51) on p. 388). (Note: Rather than "use the largest integer that is less than or equal to" one should just use a value for B so that multiplying B + 1 by alpha/2 produces an integer.) The use of B + 1 is related to some bootstrap hypothesis testing methods --- one can control the type I error rate at exact test sizes like 0.01 and 0.05 by using values for B such as 999 and 4999. Plus, using B + 1 results in a nice symmetry for the confidence bounds: if the lower confidence bound is the kth ordered value from the "bottom", the upper confidence bound is the kth ordered value from the "top."
p. 392, Comment 33
I'm not going to discuss the details of the BCa method in class (since there isn't even time to discuss the justification of the simpler percentile method). But be sure to note (see last paragraph on p. 392) that the BCa method is generally superior to the percentile method. (S-Plus functions make computing a BCa interval about as simple as computing a percentile interval.) Another bootstrap confidence interval method, known as the bootstrap t, is sometimes a good choice (but in the application considered in Sec. 8.4, I'd prefer the BCa interval).
p. 393, Comment 35
The uniform distribution example considered is a well known case of where the bootstrap method fails. It should be pointed out that there is no need to rely on the bootstrap in such a case. A parametric model is being considered, and one can use STAT 652 techniques to deal with it. Bootstrapping is most often used in cases for which one doesn't know which parametric model to assume.
p. 393, Comment 36
Note that using StatXact's Monte Carlo option violates Gleser's "first law of applied statistics." I don't think that too many statisticians worry too much about the introduction of randomness into statistical procedures when the randomness should create only slight differences when a large enough number of Monte Carlo trials or bootstrap samples are used. I like to round approximate p-values and confidence bounds in order to not reflect too much accuracy. If the p-value rounds to 0.050, then it provides a good measure of the strength of the evidence against the null hypothesis, but I wouldn't feel comfortable arriving at a drastically different conclusion or decision depending on whether the approximate p-value is 0.049, 0.050, or 0.051. Similarly with confidence bounds; I don't care if the upper bound is 0.71, 0.72, or 0.73, since they all suggest about the same thing. In light of the fact that a competing method, which may be about as good, could produce a bound of 0.86, it seems foolish to treat 0.722 as being different from 0.72 (or even 0.73).
p. 393, Comment 37
I think that Julian Simon should be at least mentioned in any discussion of the origins of the bootstrap. While Efron is credited with inventing the bootstrap, Simon was doing similar stuff earlier, and I suspect that sometimes when his name is omitted, it is due to "bad blood."

Section 8.5

StatXact will do this test. (When covering this section of H&W in class, I may also present another test (an exact permutation version of a test based on Pearson's correlation coefficient, that does not require an assumption of bivariate normality) that is included on StatXact.)

I'll offer some specific comments about the text below.
p. 394, Hypothesis
I don't like the phrase "testing for independence" applied to the tests of Ch. 8, because we're really testing for lack of independence. (When we reject, we can make a claim of statistically significant evidence for lack of independence, but when we fail to reject, we don't have strong evidence for independence.) I also don't like that H&W refer to Kendall's tau as "the correlation coefficient" since commonly we refer to Pearson's product-moment correlation coefficient as the correlation coefficient.
p. 394, Procedure
It can be shown that (8.63) is just Pearson's product-moment correlation coefficient with the x and y values replaced by their ranks. (See Comment 40 on p. 398.)
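This identity is easy to check numerically: with no ties, the classical formula rs = 1 - 6*sum(d^2)/(n(n^2-1)) agrees exactly with Pearson's coefficient applied to the rank vectors. (A sketch using only the standard library; names and data are mine.)

```python
from math import sqrt

def ranks(v):
    """Midranks (handles ties, though the identity below assumes none)."""
    order = sorted(range(len(v)), key=lambda i: v[i])
    r = [0.0] * len(v)
    i = 0
    while i < len(v):
        j = i
        while j + 1 < len(v) and v[order[j + 1]] == v[order[i]]:
            j += 1
        mid = (i + j) / 2 + 1  # average of 1-based positions i+1..j+1
        for k in range(i, j + 1):
            r[order[k]] = mid
        i = j + 1
    return r

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return sxy / (sx * sy)

def spearman_no_ties(x, y):
    """Classical formula r_s = 1 - 6*sum(d^2)/(n(n^2-1)); no ties."""
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

x = [1.3, 2.2, 0.4, 5.0, 3.1, 4.6, 2.8]
y = [2.1, 1.0, 0.7, 4.4, 4.9, 3.5, 2.6]
print(abs(spearman_no_ties(x, y) - pearson(ranks(x), ranks(y))) < 1e-9)
```

When ties are present, the classical sum-of-d-squared formula no longer applies, but Pearson's coefficient computed on midranks still does.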
p. 394, Procedure
When doing a one-tailed test using (8.66) as the alternative, it needs to be kept in mind that a failure to reject shouldn't be taken as strong evidence that the null hypothesis of independence is true, since a failure to reject could be a very likely event if there is negative correlation (or even weak positive correlation).
p. 394, Procedure
Note that Table A.31 only gives approximate critical values for certain levels, and so it's not a great table. Of course, if we use StatXact then we won't have to worry about using the tables.
p. 394, Procedure
That the lower-tailed test critical values indicated in (8.68) are just the additive inverses of the related upper-tailed test critical values follows from the symmetry of the null sampling distribution about 0. (See Comment 43 on pp. 400-401.)
pp. 396-397, Example 8.5
StatXact can be used to obtain an exact p-value of about 0.0456, and so the conclusion reached from the use of Table A.31 (that the p-value is between 0.05 and 0.1) is wrong (due to the lack of adjustment for the ties). It can be noted that the normal approximation p-value of about 0.0436 given at the bottom of p. 397 isn't too far from the exact p-value, even though n is only 7. StatXact reports 0.0400 as the asymptotic p-value. The output refers to a t distribution with 5 df, and I have no idea where that is coming from. The StatXact asymptotic output is odd in a couple of ways. For one, an asymptotic p-value should refer to one obtained from a method that should be accurate as n is very large. Since the standard normal distribution is the limiting distribution for the family of t distributions, why not use it for the large sample approximation? If something like a t distribution with n-2 df provides a better approximation for smallish n, then I would refer to the corresponding p-value as an approximate p-value, and not an asymptotic p-value. But I don't know why the t distribution would be more appropriate, since when the fixed midranks are permuted, there is no uncertainty in the variance (i.e., we can determine the appropriate variance exactly, and don't have to estimate it from the data, as is usually the case when one uses a t distribution). Another odd thing is that if one uses the observed value of rs* and gets an upper-tail probability from the t distribution with 5 df, the result is 0.07354, a value larger than the upper-tail probability from the standard normal distribution, and not the value 0.0400 that StatXact reports, which is smaller than the normal approximation p-value.
(My current belief is that the fine folks at Cytel have some things wrong --- but I think that StatXact's exact p-value is correct, and if n is too large to obtain an exact p-value, a Monte Carlo approximation of the exact p-value should be used instead of the asymptotic p-value in most cases. (What I believe Cytel has wrong is similar to what's wrong with the asymptotic results related to Kendall's statistic: they aren't using the null variance, which can be determined exactly, but are instead using some other estimate of the variance.))
p. 398, Comment 39
(errors in book) H&W have it wrong --- Minitab's corr command does indeed provide the ties corrected value of rs*. (H&W make the same incorrect claim in Problem 50 on p. 406.) Also, one does not have to "manually obtain the separate rank vectors" --- one can just apply Minitab's rank command to the two data vectors (columns).
p. 405, Comment 48
When Spearman's statistic is used to test for trend as described in the comment, the test is sometimes referred to as the Daniels test for trend.

Section 8.6

Hoeffding's test is not nearly as commonly used as Kendall's test and Spearman's test (which, in a relative sense, are somewhat popular nonparametric tests). Because it is designed to have decent power against a wider range of alternatives, it isn't expected to be as powerful as Kendall's or Spearman's tests when the lack of independence is such that small values of Y tend to occur with small values of X, and large values of Y tend to occur with large values of X. But Kendall's and Spearman's tests can have rather low power to identify dependence in some cases, and in such cases Hoeffding's test may have much greater power.

As has often been the case, H&W don't provide a lot of motivation for the test statistic. Some motivation is provided by Comment 50 on p. 412, but it's a bit sketchy. In class, I'll show how the information in Comment 50 leads to (8.89) on p. 409 as being a sensible test statistic. We can view the procedure described on p. 408 as being a fine-tuning of the same general scheme. (I won't attempt to justify the details of (8.84), (8.85), (8.86), and (8.87).)

I'll offer some specific comments about the text below.
p. 412, Comment 53
In the two examples that I've tried the various approximation methods on, this large sample approximation based on D has worked better than the ones based on B. For the data of Example 8.6, the approximation based on D results in an approximate p-value of about 0.0215 (which I obtained from Table A.33 using linear interpolation (which upon inspection of the table entries, seems to be reasonable to use)). This approximate p-value is close to the one obtained using B (see the bottom of p. 411), and not close to the one obtained from the table of the exact null sampling distribution (Table A.32). (Note: Because of the tie situations, the p-value from Table A.32 can only be considered to be an approximate p-value.) For another example, consider the data from Table 8.1 on p. 367. For this data, n = 9, and so maybe the slightly larger sample size will make the approximate p-values closer to the exact p-value. (Because there are no tie situations, the p-value obtained from Table A.32 is an exact p-value.) Below I'll give the values of the components of D and B. (I encourage you to check your understanding of the procedure by making sure that you can obtain the values given below.)

 i  Ri  Si  ci  N1(i)  N2(i)  N3(i)  N4(i)
 1   3   2   1    2      0      1      6
 2   6   4   3    4      0      2      3
 3   1   1   0    1      0      0      8
 4   8   8   6    7      1      1      0
 5   4   5   2    3      2      1      3
 6   2   7   1    2      5      0      2
 7   7   9   6    7      2      0      0
 8   5   3   2    3      0      2      4
 9   9   6   5    6      0      3      0
Using the values given above, one can obtain the values of D and B, and from them the corresponding approximate p-values. While none of the approximate p-values is close to the exact p-value, the one based on D is the most accurate one. For this data, Kendall's test (doing a two-tailed test using the general alternative of lack of independence, and so doubling the value of 0.060 given on p. 368) yields a p-value of 0.1194 (from StatXact), and Spearman's test yields an exact p-value of about 0.0968 (from StatXact, since the crummy table in H&W can only supply us with the information that the p-value is about 0.1).
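My reading of the quantities in the table above is that ci counts the points strictly below and to the left of point i, and that N1 through N4 are quadrant counts about point i (with N1 counting weakly, so it includes the point itself). A sketch that reproduces the table from the ranks (the function name is mine):

```python
def quadrant_counts(R, S):
    """For each i:
      c_i = #{j : R_j <  R_i and S_j <  S_i}
      N1  = #{j : R_j <= R_i and S_j <= S_i}  (includes the point itself)
      N2  = #{j : R_j >  R_i and S_j <  S_i}
      N3  = #{j : R_j <  R_i and S_j >  S_i}
      N4  = #{j : R_j >  R_i and S_j >  S_i}
    (my reading of the table above; with no ties, N1 = c_i + 1)."""
    rows = []
    for r, s in zip(R, S):
        c  = sum(rj < r and sj < s for rj, sj in zip(R, S))
        n1 = sum(rj <= r and sj <= s for rj, sj in zip(R, S))
        n2 = sum(rj > r and sj < s for rj, sj in zip(R, S))
        n3 = sum(rj < r and sj > s for rj, sj in zip(R, S))
        n4 = sum(rj > r and sj > s for rj, sj in zip(R, S))
        rows.append((c, n1, n2, n3, n4))
    return rows

# Ranks for the Table 8.1 data, as listed in the table above
R = [3, 6, 1, 8, 4, 2, 7, 5, 9]
S = [2, 4, 1, 8, 5, 7, 9, 3, 6]
for row in quadrant_counts(R, S):
    print(row)
```

Running this reproduces every row of the table, which is a convenient way to check your understanding of the procedure.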

Section 8.7

I don't plan to say anything about this short section in class.