Some Comments about Chapter 8 of Hollander & Wolfe
The setting here is a bit different from the types of things that we've
been considering in previous chapters. In Ch. 8 it is assumed that the
sample arises from n iid random variables from a bivariate distribution,
and the basic test considered is of the null hypothesis that
Xi is independent of
Yi (i = 1, 2, ..., n) against the
general alternative of lack of independence. The exact null sampling
distributions which correspond to the tables in the back of H&W assume
no ties (which will be the case with probability 1 if the observations
truly arise from a continuous distribution), but by using
StatXact we can deal (in an exact way) with data sets in which ties
occur.
Section 8.1
StatXact will do this test.
(Click here to see how to easily do
Kendall's test using StatXact. But note that
StatXact screws up the asymptotic version of the test, which
shouldn't bother us a lot since it's typically better to go with a Monte
Carlo estimate of the exact p-value in cases for which an exact p-value
cannot be obtained (due to n being too large), but unless one is
aware of StatXact's error, the difference between the asymptotic
p-value and the others may cause one to worry that something is messed
up (and it is).
Click here to learn about what
StatXact has wrong.)
I'll offer some specific comments
about the text below.
- p.364
- Rather than view expression (8.2) as the definition of tau,
I think that it is better to view pc - pd as the definition of tau,
where pc and pd are the probabilities of concordance and
discordance. (See Comment 3 on p. 369 to learn what is meant by
concordant and discordant relationships.)
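The pc - pd view of tau suggests the plug-in estimator directly: count the concordant and discordant pairs of observations and divide their difference by the number of pairs. A minimal sketch (function name and toy data are my own, and no ties are assumed):

```python
from itertools import combinations

def kendall_tau_hat(pairs):
    """Plug-in estimate of tau = pc - pd: the difference between the
    proportions of concordant and discordant pairs of observations."""
    n = len(pairs)
    k = 0  # concordant minus discordant count (Kendall's K)
    for (x1, y1), (x2, y2) in combinations(pairs, 2):
        if (x2 - x1) * (y2 - y1) > 0:
            k += 1          # concordant: both coordinates move the same way
        elif (x2 - x1) * (y2 - y1) < 0:
            k -= 1          # discordant
    return k / (n * (n - 1) / 2)

# toy data: 2 concordant pairs, 1 discordant pair, so tau_hat = 1/3
print(kendall_tau_hat([(1, 1), (2, 3), (3, 2)]))  # prints 0.3333333333333333
```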
- p.365, Procedure
- When doing a one-tailed test using (8.7) as the alternative, it
needs to be kept in mind that a failure to reject shouldn't be taken as
strong evidence that the null hypothesis of independence is true, since
a failure to reject could be a very likely event if there is negative
correlation (or even weak positive correlation).
- p.365, Procedure
- Note that Table A.30 is in two main parts. If n choose 2 is
even, the value of n is included in the first part of the table,
and if
n choose 2 is
odd, the value of n is included in the second part of the table.
(See Comment 9 on p. 372 for related remarks.) Of course, if we
use StatXact then we won't have to worry about using the tables.
- p.365, Procedure
- That the lower-tailed test critical values indicated in (8.9) are
just the additive inverses of the related upper-tailed test critical
values follows from the symmetry of the null sampling distribution about
0.
(See Comment 8 on p. 372.)
- p. 366, Ties
- (error in book)
In my copy of the book, (8.17) is incomplete. The last part of (8.17)
should be: if (d - b)(c - a) < 0.
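The completed scoring rule amounts to a three-valued sign function on pairs of ordered pairs: +1 for concordance, -1 for discordance, and 0 when there is a tie in either coordinate. A sketch (the function name is mine, not H&W's notation):

```python
def Q(pair1, pair2):
    """Scoring for Kendall's statistic in the presence of ties:
    +1 for a concordant pair of ordered pairs, -1 for a discordant
    pair, and 0 when there is a tie in either coordinate."""
    (a, b), (c, d) = pair1, pair2
    prod = (d - b) * (c - a)
    if prod > 0:
        return 1
    if prod < 0:
        return -1
    return 0   # tied in x and/or y
```

Summing Q over all pairs of ordered pairs gives the ties-adjusted statistic.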
- p. 367, Example 8.1
- On p. 367 it is indicated that the panel scores arise from a
discrete distribution because the only possible values are in increments
of 1/80 from 1 to 6 (but since data values are rounded to the nearest
tenth, one can say that the only possible values are in increments of
1/10 from 1 to 6). Of course, the Hunter L values are also
really from a discrete distribution if the only possible values are in
increments of 1/10. There are no tie situations in the small data set,
but in principle a tie could have occurred. The nice thing about
StatXact is that it can deal with ties in an exact manner.
- p. 368, Example 8.1
- StatXact produces an exact p-value of about 0.060, which
matches the value from Table A.30.
(Click here to see how to easily do
Kendall's test using StatXact.)
If a continuity correction of 1 is
used, then the approximate p-value is about 0.059, which is very close
to the exact p-value. (Note that the appropriate continuity correction
is 1, as opposed to 1/2, because the null sampling distribution only
has positive probability assigned to even integers.) H&W report an
approximate p-value of 0.0475, which results from not using a continuity
correction. (A continuity correction usually improves things except for
some cases where the probability being approximated is rather extreme.)
As noted in the Section 8.1 comments above, StatXact screws up the
asymptotic version of the test, so don't be alarmed if the asymptotic
p-value disagrees with the others --- something is indeed messed up,
but it's StatXact's asymptotic result, not the exact p-value.
(Click here to learn about what
StatXact has wrong.)
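The approximation with and without the continuity correction is easy to reproduce, using the null mean 0 and the null variance n(n-1)(2n+5)/18 of K. The value of K isn't quoted in these comments; K = 16 (with n = 9) reproduces all three p-values reported above, so I use it below purely for illustration rather than as a value taken from the book:

```python
from math import erfc, sqrt

def kendall_upper_p(k, n, cc=0.0):
    """Normal-approximation upper-tail p-value for Kendall's K,
    using the null variance n(n-1)(2n+5)/18 and an optional
    continuity correction cc (1 is appropriate here, since the null
    distribution of K puts probability only on integers of one
    parity, spaced 2 apart)."""
    var = n * (n - 1) * (2 * n + 5) / 18
    z = (k - cc) / sqrt(var)
    return 0.5 * erfc(z / sqrt(2))   # P(Z >= z) for standard normal Z

n, k = 9, 16   # assumed values consistent with Example 8.1's p-values
print(round(kendall_upper_p(k, n, cc=0.0), 4))  # about 0.0476 (H&W's 0.0475)
print(round(kendall_upper_p(k, n, cc=1.0), 4))  # about 0.0589, close to the exact 0.060
```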
- pp. 369-370, Comment 5
- The "trick" presented is convenient for a hand calculation, but
with StatXact we won't need to bother with such things.
Alternatively, the value of the test statistic can be determined by
making a scatter plot of the ordered pairs, and carefully counting the
number of concordant and discordant pairs of ordered pairs (but unless
n is rather small, this can be tedious to do).
- p. 370, Comment 6
- Note that to obtain the exact sampling distribution, one just needs
to consider n!,
and not (n!)(n!), equally-likely outcomes.
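Comment 6's point can be made concrete by enumerating the n! equally-likely pairings for a small n: hold the x-ranks fixed and permute only the y-ranks. A toy check of my own that also verifies the symmetry about 0 and the parity fact used above for the continuity correction (for n = 5, n choose 2 = 10 is even, so K is always even):

```python
from itertools import combinations, permutations
from collections import Counter

def K(x, y):
    """Kendall's K = (# concordant) - (# discordant) pairs.
    Assumes distinct ranks, so no products are zero."""
    return sum(
        (1 if (x[j] - x[i]) * (y[j] - y[i]) > 0 else -1)
        for i, j in combinations(range(len(x)), 2)
    )

n = 5
x = tuple(range(1, n + 1))
# Under the null, all n! pairings of the y-ranks with the fixed
# x-ranks are equally likely; there is no need to permute both margins.
dist = Counter(K(x, y) for y in permutations(range(1, n + 1)))

assert all(k % 2 == 0 for k in dist)            # parity: K is even since C(5,2) = 10 is even
assert all(dist[k] == dist[-k] for k in dist)   # symmetry about 0
print(sorted(dist.items()))
```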
- pp. 372-374, Comment 10
- I'm not going to take class time to derive (8.23) on p. 373, which
H&W indicate involves "considerable tedious calculation." Likewise, I
won't discuss (8.25) and (8.26). (This semester, I'm doing a few
derivations of a similar sort, but not taking time to derive all of the
results presented in H&W.)
- pp. 374-375, Comment 11
- Prior to StatXact, discussing the merits of the possibilities
considered would have been more important, but now I am content to use
StatXact's sensible method of dealing with ties, which I will
explain in class.
- p. 376, Comment 14
- Mann's test for trend is a nifty use of the null sampling
distribution of Kendall's test. I'll discuss it during class.
(For rejecting the null hypothesis that a sample arose from n iid
random variables, Mann's test can be considerably more powerful than
dichotomizing the data and then applying the runs test for binary
outcomes. To use Mann's test, or any other test for nonrandomness, one
needs to have that the observations making up the sample have a natural
ordering to them (e.g., they could be time-ordered, or
spatially-ordered). So, counting the "variable" that gives the ordering,
it's like we still have two variables --- but one of them need not be
random for Mann's test of trend.)
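Mann's test just applies Kendall's statistic with the natural ordering (e.g., the time index) playing the role of x. A sketch of my own, with the exact permutation p-value estimated by Monte Carlo (under the iid null, every ordering of the observations is equally likely):

```python
import random
from itertools import combinations

def kendall_k(x, y):
    """Kendall's K for paired data (ties score 0)."""
    k = 0
    for i, j in combinations(range(len(x)), 2):
        prod = (x[j] - x[i]) * (y[j] - y[i])
        k += 1 if prod > 0 else (-1 if prod < 0 else 0)
    return k

def mann_trend_p(series, n_mc=20000, seed=0):
    """Upper-tailed Mann test for an increasing trend: K between the
    time index and the observations, with a Monte Carlo estimate of
    the exact permutation p-value."""
    rng = random.Random(seed)
    t = list(range(len(series)))
    k_obs = kendall_k(t, series)
    y = list(series)
    hits = 0
    for _ in range(n_mc):
        rng.shuffle(y)                  # iid null: all orderings equally likely
        if kendall_k(t, y) >= k_obs:
            hits += 1
    return k_obs, hits / n_mc

# toy series with a clear upward drift (data values are mine)
k_obs, p = mann_trend_p([1.2, 1.9, 1.7, 2.8, 3.1, 3.0, 4.4, 4.9])
```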
- pp. 377-378, Problem 1
- I don't have anything terribly important to add here --- I just
thought that I would point out the rather sad situation the experimental
units found themselves in. The dogs were force-fed tape worms collected
from sheep carcasses, and then apparently killed and autopsied 20 days
later in order to determine an effect of the force-feeding.
- p. 379, Problem 7
- It's interesting here that there doesn't seem to have been a method
to determine which twin corresponds to the x measurement, and
which twin corresponds to the y measurement. I guess a random
assignment would be a fair way to do it.
Section 8.2
This is a very short section. I may discuss it in class prior to
discussing Sec. 8.1.
In order to refresh your memory of what the estimand is, you may want to
go back and read the first two comments (2 and 3) at the top of p. 369.
Section 8.3
Note that StatXact's confidence interval for tau doesn't
match the one presented in this section. If you read Comment 28
on p. 386, you'll see that over the years a variety of standard error
estimates for the nonnull case (when we compute a confidence interval
for tau,
we certainly don't assume that the null hypothesis of independence is
true (because independence implies that tau is equal to 0))
have been proposed. My guess is that StatXact is using one of
the other standard error estimates in its confidence interval (which is
approximate --- as is the one in Sec. 8.3 of H&W --- and not exact).
I'll offer some specific comments
about the text below.
- p. 383, Procedure
- The equation on the line of the text right above (8.38) can be used
to check the values of the Ci, provided that one
obtains the value of K some other way. ((8.42) on p. 385 can
also be used as a check.)
- p. 385, Comment 23
- Provided that n is small enough so that overplotting doesn't
cause a problem, a scatter plot can be used to determine the values of
the
Ci,
by expressing
Ci as a difference in the numbers of concordant and
discordant bivariate observations (ordered pairs) involving the ith
observation.
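The Ci are easy to script, and they come with a built-in check: since every pair of observations contributes its concordance/discordance score to exactly two of the Ci, the Ci must sum to 2K. I believe that identity is what the check mentioned above amounts to, but I'm hedging on whether it is exactly the equation H&W print. A sketch under that assumption (toy data of my choosing):

```python
from itertools import combinations

def sign(v):
    return (v > 0) - (v < 0)

def c_values(pairs):
    """C_i = (concordant count) - (discordant count) over the n - 1
    pairs of observations that involve observation i."""
    n = len(pairs)
    c = [0] * n
    for i, j in combinations(range(n), 2):
        q = sign((pairs[j][0] - pairs[i][0]) * (pairs[j][1] - pairs[i][1]))
        c[i] += q   # each pair of observations contributes its score
        c[j] += q   # to exactly two of the C_i
    return c

pairs = [(1, 2), (2, 1), (3, 4), (4, 3), (5, 5)]
c = c_values(pairs)
k = sum(sign((b[0] - a[0]) * (b[1] - a[1])) for a, b in combinations(pairs, 2))
assert sum(c) == 2 * k   # the check: the C_i must sum to twice K
```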
- p. 385, Comment 25
- H&W don't supply us with the needed critical values, and so one
may need to get them from the referenced journal article.
- p. 386, Comment 26
- This comment seems extremely silly to me. We can always
elect to throw away part of the data to cut down on the work, but most
of the time we choose not to because it would lead to a decrease in
accuracy --- and that would often be the case here. The Samara-Randles
confidence intervals of this section are based on an assumption that the
estimated standard error of the point estimator is equal to the true
standard error, and this would usually not be the case, but would be
typically closer to being the case as the number of observations used to
estimate the standard error is increased.
- p. 386, Comment 28
- Note that the confidence intervals presented in this section are
the ones of Samara and Randles.
- p. 386, Comment 29
- By using the confidence interval to perform a test (or
equivalently, making a test statistic out of the asymptotically normal
pivot), one can do a test of the null hypothesis that tau equals 0
against the alternative that tau is not equal to 0, as opposed to using
Kendall's test to test
the null hypothesis of independence
against the general alternative of lack of independence.
- pp. 386-387, Comment 30
- In the interest of saving time for other topics, I'm going to skip
the material on partial correlation coefficients.
Section 8.4
Those of you who took my summer course this past summer got a decent
introduction to bootstrapping, which included an explanation of the
reasoning behind the
percentile confidence interval method.
Why the percentile method should work (and sometimes it doesn't work all
that well) doesn't come across from just reading the presentation given
in H&W. Unfortunately, there just won't be time in our nonparametric
statistics course to spend a week or more on bootstrapping basics and a
justification of the percentile confidence interval. But I will cover
a lot of the bootstrap material that is presented in H&W (keeping it at
the level of the H&W presentation, and skipping some of the more
advanced material referred to in the Comments).
Bootstrapping can be used to estimate the bias and the standard error of
estimators, obtain confidence intervals and perform hypothesis tests,
estimate average prediction error, and other things as well.
My
Summer 2002 Advanced Topics course spent several weeks on
bootstrapping and jackknifing. The
web site for that course
contains some information and links that may be of interest if you want
to learn more about bootstrapping. (In my opinion, reading the book
An Introduction to the Bootstrap by Efron and Tibshirani is by
far the best way to learn about bootstrapping basics.) You may be able
to develop a better idea of how bootstrapping works by reading
my description of bootstrap bias estimation.
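The basic bias-estimation idea is short enough to sketch: resample the data with replacement, recompute the estimator on each resample, and use mean(theta*) - theta_hat as the bias estimate. The estimator and toy data below are my own; the plug-in variance (dividing by n) is known to be biased downward, so the bootstrap bias estimate should come out negative:

```python
import random

def bootstrap_bias(data, estimator, B=2000, seed=0):
    """Bootstrap bias estimate: the average of the estimator over B
    resamples (drawn with replacement), minus the estimate computed
    from the original sample."""
    rng = random.Random(seed)
    theta_hat = estimator(data)
    n = len(data)
    boot = [estimator([rng.choice(data) for _ in range(n)]) for _ in range(B)]
    return sum(boot) / B - theta_hat

def plugin_var(xs):
    """Plug-in (divide-by-n) variance estimator, biased downward."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

data = [2.1, 3.5, 0.7, 4.2, 5.9, 1.1, 3.3, 2.8]
print(bootstrap_bias(data, plugin_var))   # negative, roughly -plugin_var/n
```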
I'll offer some specific comments
about the text below.
- p. 388
- In the first paragraph of the section, tau is referred to as
"the population measure of association defined by (8.2)" --- which is
okay, only I think that viewing pc - pd as the definition of tau is
better (where pc and pd are the probabilities of concordance and
discordance).
- p. 388, Procedure
- Some (see p. 389) replace B by B + 1 in (8.51),
and then use a value like 999, 1999, or 4999 for B,
so that k is an integer.
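With B = 999 and alpha = 0.05, k = (B + 1) * alpha/2 = 25 is an integer, so the bounds are the 25th ordered bootstrap replicate from the bottom and the 25th from the top. A sketch using Kendall's tau (ties scored 0) as the estimator, with toy data of my own:

```python
import random
from itertools import combinations

def tau_hat(pairs):
    """Kendall's tau estimate; tied pairs score 0."""
    n = len(pairs)
    k = sum(
        (1 if (c - a) * (d - b) > 0 else (-1 if (c - a) * (d - b) < 0 else 0))
        for (a, b), (c, d) in combinations(pairs, 2)
    )
    return k / (n * (n - 1) / 2)

def percentile_ci(pairs, B=999, alpha=0.05, seed=0):
    """Bootstrap percentile interval: resample the (x, y) pairs, and
    take the k-th ordered replicate from each end, with
    k = (B + 1) * alpha / 2 (an integer by choice of B)."""
    rng = random.Random(seed)
    n = len(pairs)
    boot = sorted(tau_hat([rng.choice(pairs) for _ in range(n)]) for _ in range(B))
    k = (B + 1) * alpha / 2
    assert k == int(k), "choose B so that (B + 1) * alpha / 2 is an integer"
    k = int(k)
    return boot[k - 1], boot[B - k]   # k-th value from the bottom and from the top

pairs = [(1, 2), (2, 1), (3, 3), (4, 6), (5, 4), (6, 7), (7, 5), (8, 8), (9, 9)]
lo, hi = percentile_ci(pairs)
```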
- p. 389, Example 8.4
- Note that Appendix B, which contains S-Plus functions, is
found in the Solutions Manual, and not in the main text. (One
had to purchase the solutions manual separately.)
- p. 389, Example 8.4
- Near the bottom of the page it is pointed out that the bootstrap
confidence interval differs from the Samara-Randles interval of Sec.
8.3. With only 9 data points, I wouldn't expect good
accuracy in either case. If I had to choose which interval is more
trustworthy, I'd go with the Samara-Randles interval in this case,
relying on the approximate normality and the estimated standard error, as
opposed to thinking that 9 observations provides a good estimate of the
cdf of the unknown bivariate distribution. It bothers me a bit that the
bootstrap interval is centered on 0.347, which is appreciably different
from the point estimate of about 0.444. (While the center of a good
bootstrap interval need not equal the point estimate, I'm uncomfortable
with the rather large difference in this case.)
- p. 390, Comment 31
- This is a special case of the general result developed in
the first half of Comment 32 (see p. 391).
- p. 391, Comment 32
- Notice that here H&W are using B + 1, instead of B,
to determine k (as compared to (8.51) on p. 388). (Note: Rather
than "use the largest integer that is less than or equal to" one should
just use a value for B so that multiplying B + 1 by
alpha/2 produces an integer.) The use of B + 1 is related
to some bootstrap hypothesis testing methods --- one can control the
type I error rate at exact test sizes like 0.01 and 0.05 by using values
for B such as 999 and 4999. Plus, using B + 1 results in
a nice symmetry for the confidence bounds: if the lower confidence bound
is the kth ordered value from the "bottom", the
upper confidence bound
is the kth ordered value from the "top."
- p. 392, Comment 33
- I'm not going to discuss the details of the BCa
method in class (since there isn't even time to discuss the
justification of the simpler percentile method). But be sure to note
(see last paragraph on p. 392) that the
BCa method is generally superior to the percentile
method. (S-Plus functions make computing a
BCa interval about as simple as computing a percentile
interval.) Another bootstrap confidence interval method, known as the
bootstrap t, is sometimes a good choice (but in the application
considered in Sec. 8.4, I'd prefer the
BCa interval).
- p. 393, Comment 35
- The uniform distribution example considered is a well known case of
where the bootstrap method fails. It should be pointed out that there
is no need to rely on the bootstrap in such a case. A parametric model
is being considered, and one can use STAT 652 techniques to deal with
it. Bootstrapping is most often used in cases for which one doesn't know
which parametric model to assume.
- p. 393, Comment 36
- Note that using StatXact's Monte Carlo option violates
Gleser's "first law of applied statistics." I don't think that too many
statisticians worry too much about the introduction of randomness into
statistical procedures when the randomness should create only slight
differences when a large enough number of Monte Carlo trials or
bootstrap samples are used. I like to round approximate p-values and
confidence bounds in order to not reflect too much accuracy. If the
p-value rounds to 0.050, then it provides a good measure of the strength
of the evidence against the null hypothesis, but I wouldn't feel
comfortable arriving at a drastically different conclusion or decision
depending on whether the approximate p-value is 0.049, 0.050, or 0.051. Similarly
with confidence bounds; I don't care if the upper bound is 0.71, 0.72,
or 0.73, since they all suggest about the same thing. In light of the
fact that a competing method, which may be about as good, could produce
a bound of 0.86, it seems foolish to treat 0.722 as being different from
0.72 (or even 0.73).
- p. 393, Comment 37
- I think that Julian Simon should be at least mentioned in any
discussion of the origins of the bootstrap. While Efron is credited
with inventing the bootstrap, Simon was doing similar stuff
earlier, and I suspect that sometimes when his name is omitted, it
is due to "bad blood."
Section 8.5
StatXact will do this test. (When covering this section of H&W
in class, I may also present another test (an exact permutation version
of a test based on Pearson's correlation coefficient, that does not
require an assumption of bivariate normality) that is included on
StatXact.)
I'll offer some specific comments
about the text below.
- p. 394, Hypothesis
- I don't like the phrase "testing for independence" applied to the
tests of Ch. 8, because we're really testing for lack of independence.
(When we reject, we can make a claim of statistically significant
evidence for lack of independence, but when we fail to reject, we don't
have strong evidence for independence.)
I also don't like that H&W refer to Kendall's tau as "the correlation
coefficient" since commonly we refer to Pearson's product-moment
correlation coefficient as the correlation coefficient.
- p. 394, Procedure
- It can be shown that (8.63) is just
Pearson's product-moment
correlation coefficient with the x and y values replaced
by their ranks. (See Comment 40 on p. 398.)
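This is easy to confirm numerically: rank each variable (using midranks for ties) and feed the ranks to the ordinary Pearson formula; with no ties, the result matches the familiar 1 - 6*sum(d_i^2)/(n(n^2 - 1)) shortcut. A sketch (the helper names and toy data are mine):

```python
from math import sqrt

def midranks(xs):
    """Ranks, with tied values replaced by midranks."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0.0] * len(xs)
    i = 0
    while i < len(xs):
        j = i
        while j + 1 < len(xs) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        mid = (i + j) / 2 + 1          # average of positions i+1 .. j+1
        for t in range(i, j + 1):
            ranks[order[t]] = mid
        i = j + 1
    return ranks

def pearson(x, y):
    """Ordinary product-moment correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def spearman(x, y):
    """Spearman's coefficient: Pearson applied to the (mid)ranks."""
    return pearson(midranks(x), midranks(y))

# no-ties check against the sum-of-squared-rank-differences shortcut
x = [3.1, 1.2, 4.8, 2.2, 5.5]
y = [2.0, 1.1, 3.9, 4.4, 5.2]
rx, ry = midranks(x), midranks(y)
n = len(x)
shortcut = 1 - 6 * sum((a - b) ** 2 for a, b in zip(rx, ry)) / (n * (n * n - 1))
assert abs(spearman(x, y) - shortcut) < 1e-12
```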
- p. 394, Procedure
- When doing a one-tailed test using (8.66) as the alternative, it
needs to be kept in mind that a failure to reject shouldn't be taken as
strong evidence that the null hypothesis of independence is true, since
a failure to reject could be a very likely event if there is negative
correlation (or even weak positive correlation).
- p. 394, Procedure
- Note that Table A.31 only gives approximate critical values for
certain levels, and so it's not a great table.
Of course, if we
use StatXact then we won't have to worry about using the tables.
- p. 394, Procedure
- That the lower-tailed test critical values indicated in (8.68) are
just the additive inverses of the related upper-tailed test critical
values follows from the symmetry of the null sampling distribution about
0.
(See Comment 43 on pp. 400-401.)
- pp. 396-397, Example 8.5
- StatXact can be used to obtain an exact p-value of about
0.0456, and so the conclusion reached from the use of Table A.31 (that
the p-value is between 0.05 and 0.1) is wrong (due to the lack of
adjustment for the ties). It can be noted that the normal approximation
p-value of about 0.0436 given at the bottom of p. 397 isn't too far from
the exact p-value, even though n is only 7.
StatXact reports 0.0400 as the asymptotic p-value. The output
refers to a t distribution with 5 df, and I have no idea
where that is coming from. The StatXact asymptotic output is odd
in a couple of ways. For one, an asymptotic p-value should refer to one
obtained from a method that should be accurate as n is very
large.
Since the standard normal distribution is the limiting distribution for
the family of t distributions, why not use it for the large
sample approximation? If something like a t distribution with
n-2 df provides a better approximation for smallish
n, then I would refer to the corresponding p-value as an
approximate p-value, and not an asymptotic p-value. But I don't know
why the t distribution would be more appropriate, since when the
fixed midranks are permuted, there is no uncertainty in the variance
(i.e., we can determine the appropriate variance exactly, and don't have
to estimate it from the data, as is usually the case when one uses a
t distribution). Another odd thing is that if one uses the
observed value of rs* and gets an
upper-tail probability from the t distribution with 5 df,
the result is 0.07354, a value larger than the upper-tail probability
from the standard normal distribution, and not the value 0.0400 that
StatXact reports, which is smaller than the normal approximation
p-value. (My current belief is that the fine folks at Cytel have some
things wrong --- but I think that StatXact's exact p-value is
correct, and if n is too large to obtain an exact p-value, a
Monte Carlo approximation of the exact p-value should be used instead of
the asymptotic p-value in most cases. (What I believe Cytel has wrong
is similar to what's wrong with the asymptotic results related to
Kendall's statistic: they aren't using the null variance, which can be
determined exactly, but are
instead using some other estimate of the variance.))
- p. 398, Comment 39
- (errors in book)
H&W have it wrong --- Minitab's corr command does indeed
provide the ties corrected value of
rs*. (H&W make the same incorrect claim in
Problem 50 on p. 406.) Also, one does not have to
"manually obtain the separate rank vectors" --- one can just apply
Minitab's rank command to the two data vectors (columns).
- p. 405, Comment 48
- When Spearman's statistic is used to test for trend as described in
the comment, the test is sometimes referred to as the Daniels test for
trend.
Section 8.6
Hoeffding's test is not nearly as commonly used as Kendall's test and
Spearman's test (which, in a relative sense, are somewhat popular
nonparametric tests). Because it can have decent power with a wider range of
alternatives, it isn't expected to be as powerful as Kendall's or
Spearman's tests when the lack of independence can be characterized as
being such that small values of Y tend to occur with small values
of X, and
large values of Y tend to occur with large values
of X.
But Kendall's and Spearman's tests can have rather low power to identify
dependence in some cases, and in such cases Hoeffding's test may have
much greater power.
As has often been the case, H&W don't provide a lot of motivation for
the test statistic. Some motivation is provided by Comment 50 on
p. 412, but it's a bit sketchy. In class, I'll show how the information
in Comment 50 leads to (8.89) on p. 409 as being a sensible test
statistic. We can view the procedure described on p. 408 as being a
fine-tuning of the same general scheme. (I won't attempt to justify the
details of (8.84), (8.85), (8.86), and (8.87).)
I'll offer some specific comments
about the text below.
- p. 412, Comment 53
- In the two examples that I've tried the various approximation
methods on, this large sample approximation based on D has worked
better than the ones based on B. For the data of Example 8.6,
the approximation based on D results in an approximate p-value of
about 0.0215 (which I obtained from Table A.33 using linear
interpolation (which upon inspection of the table entries, seems to be
reasonable to use)). This approximate p-value is close to the one
obtained using B (see the bottom of p. 411), and not close to the
one obtained from the table of the exact null sampling distribution
(Table A.32). (Note: Because of the tie situations, the p-value from
Table A.32 can only be considered to be an approximate p-value.) For
another example, consider the data from Table 8.1 on p. 367. For this
data, n = 9, and so maybe the slightly larger sample size will
make the approximate p-values closer to the exact p-value. (Because
there are no tie situations, the p-value obtained from Table A.32 is
an exact p-value.) Below I'll give the values of the components of
D and B. (I encourage you to check your understanding of
the procedure by making sure that you can obtain the values given
below.)
   i | Ri | Si | ci | N1(i) | N2(i) | N3(i) | N4(i)
   1 |  3 |  2 |  1 |   2   |   0   |   1   |   6
   2 |  6 |  4 |  3 |   4   |   0   |   2   |   3
   3 |  1 |  1 |  0 |   1   |   0   |   0   |   8
   4 |  8 |  8 |  6 |   7   |   1   |   1   |   0
   5 |  4 |  5 |  2 |   3   |   2   |   1   |   3
   6 |  2 |  7 |  1 |   2   |   5   |   0   |   2
   7 |  7 |  9 |  6 |   7   |   2   |   0   |   0
   8 |  5 |  3 |  2 |   3   |   0   |   2   |   4
   9 |  9 |  6 |  5 |   6   |   0   |   3   |   0
Using the values given above, one can obtain that
- the value of Q is 4780,
- the value of R is 608,
- the value of S is 90,
- the value of D is 24/7560 = 0.00317460,
- the exact p-value is 0.1311,
- the value of B is 0.00951752,
- the value of the test statistic based on nB is
4.17192, and the approximate p-value obtained from Table A.33 using
linear interpolation is about 0.011,
- the value of the test statistic based on (n - 1)B is
3.70833, and the approximate p-value obtained from Table A.33 using
linear interpolation is about 0.018,
- the value of the test statistic based on D is
2.74446, and the approximate p-value obtained from Table A.33 using
linear interpolation is about 0.056.
So while none of the approximate p-values is close to the exact p-value,
the one based on D is the most accurate one. For this data,
Kendall's test (doing a two-tailed test using the general alternative of
lack of independence, and so doubling the value of 0.060 given on p.
368) yields a p-value of 0.1194 (from StatXact), and Spearman's test
yields an exact p-value of about 0.0968 (from StatXact, since the
crummy table in H&W can only supply us with the information that the
p-value is about 0.1).
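The quantities above can be reproduced from the (Ri, Si) rank pairs alone. In the sketch below, the combination D = [Q - 2(n-2)R + (n-2)(n-3)S] / [n(n-1)(n-2)(n-3)(n-4)] and the quadrant-count form of B are written to match the numbers given above (I'm inferring those forms from the values, not quoting H&W's displayed equations; note also that N1 counts with "less than or equal to," so it includes the point itself):

```python
# rank pairs (R_i, S_i) from the table above (data of Table 8.1, n = 9)
rs = [(3, 2), (6, 4), (1, 1), (8, 8), (4, 5), (2, 7), (7, 9), (5, 3), (9, 6)]
n = len(rs)

# c_i = number of observations j with R_j < R_i and S_j < S_i
c = [sum(1 for rj, sj in rs if rj < ri and sj < si) for ri, si in rs]

Q = sum((r - 1) * (r - 2) * (s - 1) * (s - 2) for r, s in rs)
R = sum((r - 2) * (s - 2) * ci for (r, s), ci in zip(rs, c))
S = sum(ci * (ci - 1) for ci in c)
D = (Q - 2 * (n - 2) * R + (n - 2) * (n - 3) * S) / (
    n * (n - 1) * (n - 2) * (n - 3) * (n - 4)
)

# quadrant counts (N1 includes the point itself, via <=) and the
# Blum-Kiefer-Rosenblatt-type statistic B
B = 0.0
for ri, si in rs:
    n1 = sum(1 for rj, sj in rs if rj <= ri and sj <= si)
    n2 = sum(1 for rj, sj in rs if rj > ri and sj <= si)
    n3 = sum(1 for rj, sj in rs if rj <= ri and sj > si)
    n4 = sum(1 for rj, sj in rs if rj > ri and sj > si)
    B += (n1 * n4 - n2 * n3) ** 2
B /= n ** 5

print(Q, R, S)        # 4780 608 90
print(round(D, 8))    # 0.0031746
print(round(B, 8))    # 0.00951752
```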
Section 8.7
I don't plan to say anything about this short section in class.