Some Comments about Chapter 4 of Hollander & Wolfe
I'll generalize the coverage of this chapter in several ways.
- I'll discuss how we can use the tests of Sec. 4.1 in situations
for which we don't assume that the distributions differ only by a location
shift if they differ.
- I'll present ways of doing tests using scores other than those
corresponding to the rank sum test and normal scores test. (We can use
StatXact to perform such tests, and we can do approximate
versions of tests based on other scores by making use of general
formulas for the mean and variance of two-sample linear rank statistics that
I'll present.) One example of a test which can be done which is not
included on the menus of StatXact is the two-sample median test.
(Although the two-sample median test can be done on StatXact
using commands for tests based on categorical data, it's awkward since one
has to do a bit of work outside of StatXact to figure out what
values should be entered into the contingency table. If you already have
the data entered as "case data" it'd be nice to be able to do the
two-sample median test along with other two-sample tests, and so I'll
show you how to do it using the two-sample permutation test command.)
- I'll cover the Savage scores test and the two-sample permutation
test which are included on StatXact.
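As a preview of the two-sample median test, here is a minimal Python sketch of the idea (the function is my own illustration, not a StatXact command): score each pooled observation 1 if it exceeds the pooled median and 0 otherwise, and then do an exact permutation test on those scores.

```python
import itertools
from statistics import median

def median_test_pvalue(x, y):
    """Two-sample median test carried out as a permutation test.

    Each pooled observation is scored 1 if it exceeds the pooled
    median and 0 otherwise; the statistic is the sum of the scores
    in the first sample, and the two-sided p-value comes from
    complete enumeration of all ways to label m of the m+n pooled
    observations as sample 1 (feasible for small samples)."""
    pooled = list(x) + list(y)
    med = median(pooled)
    scores = [1 if v > med else 0 for v in pooled]
    m = len(x)
    observed = sum(scores[:m])              # sample-1 count above pooled median
    mean_stat = sum(scores) * m / len(pooled)
    count_extreme = 0
    total = 0
    for idx in itertools.combinations(range(len(pooled)), m):
        stat = sum(scores[i] for i in idx)
        # two-sided: at least as far from the permutation mean as observed
        if abs(stat - mean_stat) >= abs(observed - mean_stat) - 1e-12:
            count_extreme += 1
        total += 1
    return count_extreme / total
```

With the scores reduced to 0s and 1s, the enumeration reproduces the usual hypergeometric-based median test, which is why no separate contingency-table work is needed.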
Note that assumption A3 states that the distributions underlying
the data should be continuous. Although ties resulting from discrete
distributions used to be a bother, with StatXact they don't
necessarily cause us any grief --- StatXact can deal with ties in
a fair and exact way. The stated assumption of no ties made an
exact approach feasible in the years prior to StatXact, but in
practice ties used to be a problem since even though the phenomenon
underlying the data was continuous, the distribution from which the
observations come is always discrete due to limitations in the
precision of measuring things, and with discrete distributions, ties can
occur with positive probability. (The beauty of the continuous
distributions assumption is that with probability 1 there will be no
ties.)
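The standard way of dealing with ties is to assign midranks, and a short Python sketch makes the convention concrete (my own illustration of the usual midrank scheme):

```python
def midranks(values):
    """Assign midranks: tied observations each receive the average
    of the integer ranks they would jointly occupy."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(values):
        j = i
        # extend j over the block of observations tied with position i
        while j + 1 < len(values) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + 1 + j + 1) / 2           # average of positions i+1, ..., j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks
```

For example, the data 3, 1, 3, 2 receive midranks 3.5, 1, 3.5, 2, since the two 3s share ranks 3 and 4.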
Section 4.1
For the tests of this section (noting that while most of the section
deals with the rank sum test, the normal scores test is introduced in
Comment 12 on p. 121), and also most of the tests that will be
covered in Ch. 5, the null hypothesis is that both samples arose from
the same distribution, with all of the random variables being
independent (so that we have m+n iid random variables). When a
small p-value is obtained from any of these tests, we can say that there
is statistically significant evidence that the samples did not come from
m+n iid random variables, but if we take independence as a given
(that we have m+n independent random variables whether or not
they have the same distribution), and further take as a given that the
Xi all have the same distribution and the
Yj all have the same distribution (and this is what is
commonly done), then a small
p-value can be taken as being significant evidence (supporting
the general alternative) that the two
distributions differ. To say anything further, perhaps a statement
about the means or medians, we have to impose additional assumptions.
In Sec. 4.1, H&W simplify things greatly by restricting attention to
what is known as the shift model, but to me the shift model isn't
realistic in a lot of situations. I'll discuss the shift model and
other possibilities in class.
The main test in this section is sometimes referred to as the W-M-W test
or the M-W-W test. As noted on p. 11 of Ch. 1, Wilcoxon's 1945 paper
presented the rank sum test for the equal sample size case. The 1947
paper by Mann and Whitney presented a version of the test that was more
general (not just for when m = n), and their test statistic had a
different form, but since the Mann-Whitney version of the test is
equivalent to the Wilcoxon version of the test (in that no matter
which version of the test is used, the p-value will always be the same
(if there are no ties, or if there are ties and the ties are treated in
equivalent ways when performing the two tests)), it's best to think of
them as the same test. Interestingly, while Minitab has a
mann command, the value of the
test statistic produced corresponds to the
Wilcoxon
version of the test. StatXact emphasizes the Wilcoxon version of the
test, but some statistical software uses the Mann-Whitney version of the
test. Similarly, some books have tables for the Wilcoxon version and
other books have tables for the Mann-Whitney version. (It would be
somewhat silly and wasteful for a book to have tables for both, since
the two tests are equivalent.)
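The equivalence of the two versions is easy to verify numerically; here is a small Python sketch (my own illustration, assuming no ties), where W is the rank sum of the y sample and U is the Mann-Whitney count:

```python
def wilcoxon_rank_sum(x, y):
    """Rank sum W of the y sample within the pooled data (no ties assumed)."""
    pooled = sorted(list(x) + list(y))
    return sum(pooled.index(v) + 1 for v in y)

def mann_whitney_u(x, y):
    """Mann-Whitney count U: number of pairs with y_j > x_i (no ties assumed)."""
    return sum(1 for xi in x for yj in y if yj > xi)

# With n = len(y), the two statistics are linked by
#     W = U + n*(n+1)/2,
# a deterministic relationship, so either version yields the same p-value.
```

Since the two statistics differ by the constant n(n+1)/2, their null distributions are just shifts of one another, which is the sense in which the tests are the same.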
I'll offer some specific comments
about the text below.
- p. 107 (top half of page)
- One way for Y to have the same distribution as X +
Delta is for the treatment to have the exact same effect (change the
value by the same amount) for all subjects/units. But I find this to
not be realistic for a lot of treatments --- maybe a treatment can have
no effect on some subjects, and have effects of varying
amounts on other subjects. (I would think that life would be so much
easier for doctors (and patients) if every treatment/medication had
exactly the same effect on all people. But that just doesn't seem
realistic, does it?) One case in which a shift model could plausibly
hold is when there is just one subject, and repeated measurements are
made on the subject prior to a treatment, and repeated measurements are
made on a subject after a treatment, and it can be assumed that the sole
source of variation in the pretreatment measurements is measurement
error, the sole source of variation in the posttreatment measurements is
measurement error, and that the m+n measurement errors can be
thought to be observations of iid random variables.
(H&W briefly touch on an alternative to the simple shift model by
introducing the location-shift function in Comment 13 on
p. 122, but they don't give a lot of information about this, and
in Sections 4.2 and 4.3,
attention is refocused on the simple shift model given by (4.2).)
- p. 107 (bottom half of page)
- Note that the two samples can result from a control group and a
treatment group, from two different treatment groups, or from the same
treatment being applied to two different populations (e.g., men and
women). Note that while in (4.3) the ranks for the treatment group subjects
are being summed to be the test statistic, on p. 108 it can be seen that
to use the tables (which is not necessary, given that you have
StatXact handy), one has to sum the ranks for the smaller of the
two samples, whether it be the treatment sample or the control sample.
Also note that to use the table in H&W, both sample sizes have to be
less than or equal to 10. If one sample size is 5 and the other is 11,
we can't make use of the table, and at the same time it could be that
the normal approximation should not be trusted (but this situation causes us
no worry if StatXact is handy, since it can easily do an exact
computation of the p-value when the sample sizes are 5 and 11). It can
be noted that the table in H&W can be annoying to use for lower-tailed tests
and two-tailed tests. You may prefer to use the tables I distributed in
STAT 554 (assuming that you have those). But of course, why use a table
at all? In order to give you a better "feel" for the test, in class I'll
describe the construction of tables using a recursive method.
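As a preview of the recursive method, here is a Python sketch of one standard recursion for the null distribution of the rank sum statistic (my own illustration; in class I'll go through the recursion in detail):

```python
from functools import lru_cache

def rank_sum_counts(m, n):
    """Null distribution of the rank sum W of the size-n sample:
    returns {w: number of n-subsets of {1, ..., m+n} with sum w}.

    The recursion conditions on the largest rank m+n: it is either
    in the other sample, giving c(m-1, n, w), or in the size-n
    sample, giving c(m, n-1, w - (m+n))."""
    @lru_cache(maxsize=None)
    def c(mm, nn, w):
        if nn == 0:
            return 1 if w == 0 else 0
        if mm == 0:
            # only ranks 1..nn remain, so all must be taken
            return 1 if w == nn * (nn + 1) // 2 else 0
        return c(mm - 1, nn, w) + c(mm, nn - 1, w - (mm + nn))
    lo = n * (n + 1) // 2                      # smallest possible rank sum
    hi = sum(range(m + 1, m + n + 1))          # largest possible rank sum
    return {w: c(m, n, w) for w in range(lo, hi + 1) if c(m, n, w) > 0}
```

Dividing each count by C(m+n, n) gives the exact null probabilities, which is in essence how the tables are built.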
- p. 108, Large-Sample Approximation
- As was the case for the signed-rank test of Chapter 3 (but not the
sign test of Ch. 3), establishing the asymptotic normality of the
rank sum test is difficult (and uses probability theory beyond the
prerequisites of this course) because the rank sum statistic is not a
sum of iid random variables. In class I will give a derivation of the
expected value (given by (4.7)), and I will generalize to cover other
two-sample rank tests based on other scores. (Perhaps they should be
referred to as two-sample score tests, but typically they are
referred to as rank tests even if the scores are not integer ranks.)
I will also give a formula for the null sampling distribution variance
of a two-sample linear rank statistic based on general scores. (If one
plugs in midrank adjusted integer scores, the value of the variance will
match the values which result from (4.13) and (4.14) on p. 109.)
It should be noted that H&W do not employ the continuity correction for
their normal approximation (and neither does StatXact).
Over the years, I've found that the continuity correction improves the
normal approximation more often than not, but with StatXact
handy, there is little need for a normal approximation. Minitab
includes the continuity correction in its normal approximation for this
test. (Note: Minitab does not do an exact version of the test
--- the mann command results in a normal approximation no matter
how small the sample sizes are.)
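The normal approximation, with the continuity correction as an option, can be sketched in Python as follows (my own illustration, using the null mean n(N+1)/2 and null variance mn(N+1)/12 with N = m+n):

```python
from math import sqrt
from statistics import NormalDist

def rank_sum_normal_approx(w, m, n, continuity=True):
    """Lower-tail normal approximation to P(W <= w) for the null
    distribution of the rank sum W of the size-n sample.

    Uses E[W] = n(N+1)/2 and Var[W] = mn(N+1)/12, where N = m+n.
    The 0.5 continuity correction is optional: H&W and StatXact
    omit it, while Minitab includes it."""
    N = m + n
    mean = n * (N + 1) / 2
    var = m * n * (N + 1) / 12
    cc = 0.5 if continuity else 0.0
    z = (w + cc - mean) / sqrt(var)
    return NormalDist().cdf(z)
```

For m = 10, n = 5, and w = 30 (the Example 4.1 setting), the approximation with the continuity correction lands closer to the exact value of about .127 than the uncorrected version does.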
- pp. 110-111, Example 4.1
- Things got messed up when the StatXact output on p. 111 was
put into the book --- it'd be impossible to get the exact output that is
shown. To get something close to the output, you could put ten 2s and
five 1s down the first column of the CaseData spreadsheet, and then
put the sample of ten values followed by the sample of five values down
the second column. Then pull down the Statistics menu, go down
to Two Independent Samples and select
Wilcoxon-Mann-Whitney. Finally, click VAR1 into the
Population box,
VAR2 into the
Response box, select Exact, and click OK. The
Observed value of 30.00 on the output is the sum of the ranks for
the sample of size five (the sample coded with 1s). The Mean
value corresponds to (4.8) from H&W, putting n equal to 5
and m equal to 10. (Recall, in H&W, n is the
sample size of the sample yielding the ranks which are summed to get
the test statistic.) The rest of your output should match what is shown
in H&W, only in two places GE should be LE. (Note: If you
had coded the sample of size ten with 1s and the sample of size five
with 2s, you would get a different value for the test statistic (the
Observed value would be 90.00, the sum of the ranks for the
sample of size ten --- the sample coded with 1s).) Note that with
StatXact one doesn't specify which of the two possible one-tailed
tests to perform --- it just reports the p-value for the one-tailed test
that yields the smaller p-value. For example, with the coding of the
sample of size five with 1s and the sample of size ten with 2s, the
exact One-sided P-value is .1272 and it corresponds to
Pr{Test Statistic .LE. Observed}, which is the probability under
the null hypothesis that the sum of the ranks for the sample of size
five assumes a value less than or equal to 30. This is the p-value for
the alternative hypothesis that the permeability is less for 12-26
weeks, or equivalently, that the permeability is greater at term. If
one wanted the p-value for the other possible alternative, that the
permeability is greater for 12-26 weeks, then the p-value corresponds to
the null probability that the test statistics is greater than or equal
to 30. Letting W denote the test statistic, the desired
probability is P(W >= 30). This can be obtained from the
StatXact output by noting that it is equal to 1 - P(W <=
29), which is equal to 1 - P(W <= 30) + P(W = 30), which
explains the presence of the
Pr{Test Statistic .EQ. Observed} line in the StatXact
output.
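The exact one-sided p-value of .1272 quoted above can be verified directly by complete enumeration in Python, which is feasible here since there are only C(15,5) = 3003 possible assignments of the ranks to the sample of size five:

```python
from itertools import combinations
from math import comb

# Exact P(W <= 30) under the null, where W is the sum of the ranks
# of the size-5 sample among all 15 pooled observations: enumerate
# every 5-subset of the ranks {1, ..., 15}.
favorable = sum(1 for s in combinations(range(1, 16), 5) if sum(s) <= 30)
p_value = favorable / comb(15, 5)      # C(15,5) = 3003 equally likely subsets
print(round(p_value, 4))               # → 0.1272
```

This is exactly the computation StatXact performs (by more clever means) when Exact is selected.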
- p. 113, Comment 1
- Notice how simple the motivation is.
- p. 113, Comment 3
- As I indicated above, I will explain how to find the null
distribution of the test statistic in class, for general m and
n, using a recursive
relationship.
- p. 114, Comment 4
- I plan to generalize this material somewhat in class.
- pp. 115-116, Comment 5
- This is how StatXact handles ties (in an exact way), so try
to understand this. (It may be helpful to put a subscript of A
on one of the 3.5 values, and put a subscript of B on the other
one.)
- pp. 119-120, Comment 9
- Don't worry so much about this material --- H&W don't really derive
the power results. Note that to use the formula, one needs to know
something about the underlying pdf.
- p. 120, Comment 10
- Don't worry so much about this material.
- p. 120, Comment 11
- Two distributions can have the same mean and/or median, but the
sampling distribution of the test statistic need not match the
distribution under the null hypothesis of identical distributions.
So a small p-value from the test should not be taken as representing
statistically significant evidence against the hypothesis of equal means
(or the null hypothesis of equal medians), unless it can be assumed that
if the means (medians) are equal then the distributions are identical.
- pp. 121-122, Comment 12
- The scores being summed in (4.26) are the Van der Waerden normal
scores, which for large N = m+n approximate the expected values of the
order statistics from a sample of size N from a standard normal
distribution. The test based on these scores is typically called the
normal scores test, and to execute it in StatXact, you can do
exactly as for the rank sum test, only select Normal Scores,
instead of Wilcoxon-Mann-Whitney, from the choices corresponding
to Two Independent Samples from the Statistics menu. The
test corresponding to the test statistic given by c1
on p. 122 is also referred to as the normal scores test, and it is based
on the exact expected values of the order statistics from a sample of
size N from a standard normal distribution. I don't think this
test is included on any statistical software package --- to do so would
require storing the needed expected values for each sample size.
(I don't know of any
statistical software package except StatXact that includes
the Van der Waerden (normal scores) test, but please check out any
statistical software that you have access to and let me know if you find
the Van der Waerden test, and if so, note if a normal approximation
version is employed.)
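The Van der Waerden scores are simple to compute; here is a minimal Python sketch (my own illustration, assuming no ties):

```python
from statistics import NormalDist

def van_der_waerden_statistic(x, y):
    """Sum of Van der Waerden scores for the y sample: the pooled
    observation with rank i among N = m+n receives the score
    Phi^{-1}(i / (N+1)), where Phi^{-1} is the standard normal
    quantile function (no ties assumed)."""
    pooled = sorted(list(x) + list(y))
    N = len(pooled)
    inv = NormalDist().inv_cdf
    return sum(inv((pooled.index(v) + 1) / (N + 1)) for v in y)
```

Note that the full set of N scores sums to zero by the symmetry of the standard normal quantiles, so the statistic for the x sample is just the negative of the statistic for the y sample.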
- p. 122, Comment 13
- Note that while the model considered here allows for the treatment
to not have the exact same effect on all subjects/units, it dictates
that the effect will be the same for all units corresponding to the same
value of X, and I don't think that's enough of a generalization.
- p. 123, Comment 14
- The main point to be made here is that if P(Xi
< Yj) = 1/2, then one cannot expect the rank sum test
to reject with high probability even though the two distributions may
not be identical. (Understanding why this is so is perhaps most easily
done by examining the normal approximation version of the Mann-Whitney
version of the test.)
It can be noted that some books refer to Xi
as being stochastically smaller than Yj
if P(Xi
< Yj) > 1/2, but this does not correspond to the usual
definition of one distribution being stochastically smaller than
another distribution.
Section 4.2
It should be noted that the Hodges-Lehmann point estimator
for the shift parameter
only makes completely good sense if the shift model described on p. 107
holds (and I question whether it is often good to make such an
assumption). However, I suppose that if it appeared that a shift model
nearly held, and there was concern about outliers, then the interval may
suitably serve as an approximate confidence interval for the difference in
means --- it'd be a trade-off, giving up precision in the coverage
probability for some protection against the ill effects of outliers.
(The resistance to outliers is suggested by Comment 16 on p.
128.)
I'll offer some specific comments
about the text below.
- p. 126 & p. 128, Comment 15
- Make sure that you understand the Hodges-Lehmann estimation scheme.
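As a concrete reference point, the Hodges-Lehmann estimation scheme amounts to taking the median of all mn pairwise differences; a minimal Python sketch (my own illustration):

```python
from statistics import median

def hodges_lehmann_shift(x, y):
    """Hodges-Lehmann point estimate of the shift Delta in the model
    where Y is distributed as X + Delta: the median of all m*n
    pairwise differences y_j - x_i."""
    return median(yj - xi for yj in y for xi in x)
```

For example, with x = 1, 2, 3 and y = 4, 6, 8, the nine pairwise differences are 1, 2, 3, 3, 4, 5, 5, 6, 7, and the estimate is their median, 4.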
- p. 128, Comment 17
- The alternative point estimator introduced here is fairly
interesting, as is the way of choosing between it and the main estimator
dealt with in Sec. 4.2: the widths of the confidence intervals associated
with the two point estimators are used to suggest which estimator may be
the more accurate of the two. Notice how the widths of confidence intervals
associated with the two point estimators that are used in the
alternative estimator introduced in Comment 17 are used to
estimate the standard error of those two estimators, by assuming that they
are approximately normally distributed (I'll go over this in class), and
then those two standard error estimates are combined in the right way to obtain an
estimate of the standard error of the alternative estimator.
- pp. 129-131, Comment 18
- The first four paragraphs of Comment 18 are rather
interesting. They suggest that, in many cases,
P(Xi
< Yj) may be a more meaningful thing to estimate than the
amount of shift between the distributions. The rest of the
Comment concerns a confidence interval for the estimand --- I
think it's okay not to be concerned about the details of this interval
estimator. Note that the distribution of the point estimator, U/mn,
is more complicated than the distribution of the sample proportion based
on iid Bernoulli trials, because U is not a sum of independent
random variables.
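The point estimator U/(mn) is simple to compute directly; a minimal Python sketch (my own illustration):

```python
def prob_x_less_than_y_hat(x, y):
    """Point estimate U/(mn) of P(X < Y): the proportion of the m*n
    pairs (x_i, y_j) for which x_i < y_j."""
    u = sum(1 for xi in x for yj in y if xi < yj)
    return u / (len(x) * len(y))
```

As the Comment notes, although this looks like a sample proportion, the mn indicator variables being averaged are not independent (each x_i appears in n of the pairs), which is what complicates the distribution theory for the interval estimator.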
Section 4.3
It should be noted that the confidence interval for the shift parameter
only makes completely good sense if the shift model described on p. 107
holds (and I question whether it is often good to make such an
assumption). However, I suppose that if it appeared that a shift model
nearly held, and there was concern about outliers, then the interval may
suitably serve as an approximate confidence interval for the difference in
means --- it'd be a trade-off, giving up precision in the coverage
probability for some protection against the ill effects of outliers.
I'll offer some specific comments
about the text below.
- p. 133 (top half of page)
- Note that StatXact and Minitab both give exact
confidence intervals, and that the large sample approximation scheme
described in the book produces an interval that does not match the exact
interval (but of course the sample sizes are only 5 and 10 for the
example). My guess is that the approximation scheme does quite well in
general, for sample sizes that are not really small, but that one may
not have to use the approximation often due to the availability of
software to produce exact intervals.
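The exact interval is built from the ordered pairwise differences; here is a minimal Python sketch (my own illustration, where I take the order-statistic index k as an input rather than computing it from the null distribution of the Mann-Whitney statistic, which is where a table or software comes in):

```python
def shift_confidence_interval(x, y, k):
    """Confidence interval for the shift Delta based on the ordered
    pairwise differences d_(1) <= ... <= d_(mn): the interval runs
    from the k-th smallest difference to the k-th largest, i.e.
    [d_(k), d_(mn+1-k)], with k chosen (from the null distribution
    of the Mann-Whitney statistic) to give the desired exact
    coverage."""
    diffs = sorted(yj - xi for yj in y for xi in x)
    mn = len(diffs)
    return diffs[k - 1], diffs[mn - k]      # 1-based order statistics
```

Smaller k gives a wider interval and higher coverage; the exact coverage attainable is limited to the discrete set of values determined by the null distribution, which is one reason the normal approximation for choosing k is convenient.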
- p. 133, Comment 20
- While the alternative estimators referred to in Comment 20
are somewhat interesting, there seems to be no reason to ever
choose them over the more commonly used estimators emphasized in Sections
4.2 and 4.3.
- p. 133, Comment 21
- It should be noted that even if the shift model holds, one should
not necessarily use the point estimator of Sec. 4.2 and the interval
estimator of Sec. 4.3. Since the shift amount is the same as the
difference in the means, the difference in the medians, and the
difference in matching quantiles other than the medians, there are lots
of ways to estimate the shift amount. In choosing between estimators,
it's nice to know something about their performances, and the estimated
standard
errors of point estimators provide one way to compare expected
performance. The quantity described in this comment is a good way to
estimate the standard error of the Hodges-Lehmann point estimator
introduced in Sec. 4.2. It should also provide a decent indication of
how the related confidence interval may perform with the distributions
under consideration.
Section 4.4
I don't think that the test covered by this section is a particularly
good test, and I'm not going to emphasize it. During the early
1990s, Kelly J. Buchanan (now Kelly Thomas) did an excellent M.S. thesis
for me, and she compared the performances of many test procedures for
the generalized Behrens-Fisher problem. Based on research done
for her thesis, Kelly and I concluded that the Fligner-Policello test
wasn't very trustworthy, since it could have an inflated
type I error rate, appreciably greater than the nominal level of the
test. From results in the literature, it appeared that a test by Sen,
which like the Fligner-Policello test is a modification of the W-M-W
test, is a bit better. Although it may have been somewhat unfair to
dismiss the Fligner-Policello test so quickly, it was not included in
Kelly's study, but she did include Sen's test, a modification of Sen's
test by Fung, Yuen's test based on trimmed means, and several other
procedures. It was found that Yuen's test was better than Fung's
modification to Sen's test, which was better than Sen's test. So if
there is no reason to think that Fligner and Policello's test works
better than Sen's test, we can conclude that there is no good reason to
choose to use the Fligner and Policello
test. Kelly also investigated some bootstrap methods that I helped her
develop, and at least one of them appeared to be promising for certain
types of situations. My guess is that with advances in bootstrap methods
over the past 10 years, one might be able to come up with a
bootstrap method that would pretty much dominate Fligner and Policello's
test. But right now, if I was asked to do a test about the
medians/means of two symmetric distributions having unequal variances, I
would choose to rely on Yuen's test. (Note: The test by Yuen which I am
referring to is from her 1974 Biometrika paper, "The two-sample
trimmed t for unequal population variances" and not the test
based on trimmed means from her 1973 paper with Dixon which is for when
homoscedasticity can be assumed.)
I'll offer some specific comments
about the text below.
- p. 135
- I think that H&W should indicate that they are dealing with the
generalized Behrens-Fisher problem, and that the Behrens-Fisher
problem should be used to refer to tests and confidence intervals
for the difference in the means of two normal distributions having
unequal variances.
- p. 136
- The table only covers cases for which both sample sizes are less
than or equal to 12. Upon looking at the entries in the table, it can
be noted that the 0.1 critical values are fairly close to
z0.1 when both sample sizes are at least 8, but that
the other tabulated critical values for the Fligner-Policello test are
greater than the corresponding standard normal critical values. This
suggests that for sample sizes just beyond the ranges covered by the
table, the normal approximation may be appreciably anticonservative
unless one is doing a level 0.1 test.
Section 4.5
Many of the results presented in this section match those presented in
Sec. 3.11, which pertains to the efficiencies of paired replicates and
one-sample location procedures. In fact, we can use the results of Sec.
3.11 to give us information about how the two-sample median test (which
I will present in class) performs --- we can use (3.118) on p. 105
(which pertains to the asymptotic relative efficiency of the sign test
with respect to the one-sample t test) to see the
asymptotic relative efficiency of the two-sample median test
with respect to Student's two-sample t test.
We can also use the results of Sec.
4.5 to give us information about how the one-sample normal scores test
performs --- we can use the second table on p. 140
(which pertains to the asymptotic relative efficiency of the two-sample
normal scores test
with respect to the W-M-W test) to see the
asymptotic relative efficiency of the one-sample normal scores test
with respect to the signed-rank test.
I'll offer some specific comments
about the text below.
- p. 140
- One can multiply the corresponding entries in the two tables on p.
140 to obtain the asymptotic relative efficiencies of the two-sample
normal scores test with respect to Student's two-sample t test
(and the values obtained are also the asymptotic relative efficiencies of
the one-sample
normal scores test with respect to Student's one-sample t test).
For example, when the distributions underlying the data are normal, the
ARE of the normal scores test with respect to the t test is about
1.047*0.955, which rounds to 1.00 (and in fact a precise
calculation would yield the value 1 exactly).
When the distributions underlying the data are logistic, the
ARE of the normal scores test with respect to the t test is about
0.955*1.097, which is about 1.05.
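The arithmetic above is just the multiplicative (chain) property of AREs, and is quick to check:

```python
# Multiplying the tabulated AREs (values as quoted above):
# ARE(normal scores, t) = ARE(normal scores, W-M-W) * ARE(W-M-W, t)
normal_case = 1.047 * 0.955      # normal underlying distributions
logistic_case = 0.955 * 1.097    # logistic underlying distributions
print(round(normal_case, 2), round(logistic_case, 2))
```

The normal-case product comes out to about 1.00, consistent with the exact value of 1, and the logistic-case product is about 1.05.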
- p. 140
- Note the very interesting fact that the ARE of the two-sample
normal scores test based on the exact expected value normal scores with
respect to Student's two-sample t test is greater than or equal
to 1 for all underlying distributions.
A similar story holds if we consider the test based on Van der Waerden
scores, or if we consider one sample normal scores tests.