Some Comments about Chapter 9 of Samuels & Witmer
Section 9.1
- (p. 347, 1st paragraph) If you designed the experiment, then of
course you'll know whether you paired or not. But if you are given
data, it may be a bit less obvious. To determine whether you have
paired data or two independent samples, ask yourself (I'm going to
use x and y to denote the observations in the two samples) whether
y1 is more closely related to x1 than it is to x2 (or whether x3 is
more closely related to y3 than it is to y7). If the answer is yes,
then the data should be viewed as matched pairs data.
- (pp. 347-348) The sentence that begins on the bottom of p. 347 and
continues on p. 348 is typically true for this type of matched pairs
experiment, but it's not a key thing that you should focus on in the initial
stages of studying the analysis of matched pairs data.
Section 9.2
- (p. 349, 2nd line) I think it's often better to think of it as
making an inference about the mean difference (the mean of the
difference distribution) or the mean change, especially if the two
measurements of a pair are from the same experimental unit.
Note that Condition 1 on p. 353 indicates that the
differences should be such that they can be regarded as a random sample.
This sample would be from the population/distribution of differences,
and the mean of the difference distribution is often the focus.
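To make the point about the difference distribution concrete, here is a
small Python sketch (my own illustration, with made-up numbers, not
anything from S&W) showing that the paired t analysis is nothing more
than a one-sample t analysis applied to the differences:

    import numpy as np
    from scipy import stats

    x = np.array([12.1, 10.8, 13.5, 11.2, 12.9])  # hypothetical 1st measurements
    y = np.array([11.4, 10.1, 13.7, 10.5, 12.0])  # hypothetical 2nd measurements
    d = y - x                                     # within-pair differences

    paired = stats.ttest_rel(y, x)                # paired t test
    onesamp = stats.ttest_1samp(d, 0.0)           # one-sample t test on the d values
    print(paired.pvalue, onesamp.pvalue)          # the two p-values are identical

(The two calls give identical results, which is why one can focus on the
distribution of the differences.)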
- (pp. 351-353, Example 9.5) This example is typical: if pairing
is ignored, one usually gets less evidence of a difference.
- (p. 353, Conditions for Validity of Student's t Analysis)
Referring to Condition 2, one never expects to have exact
normality, and so the test is really always approximate. If the
difference distribution is heavy-tailed and symmetric, the test is
conservative, meaning that reported p-values will be larger than they
should be (and so there is no danger of having an inflated type I error
rate if testing at a specific level). If the
difference distribution is light-tailed and symmetric, the test is
slightly
anticonservative, meaning that reported p-values will be slightly smaller than they
should be (and so one can have a slightly inflated type I error
rate if testing at a specific level, but unless the sample size is less
than 12 or so, I wouldn't worry about this). Skewness is what can cause
serious problems --- with one-tailed tests, the type I error rate can be
off by a factor of around 2 or 3 (or more) if the sample size is small
and the skewness is strong, but in other cases the test can be quite
conservative (and this raises concerns about low power).
If one is doing a two-tailed test to determine if there is evidence of a
treatment effect (without needing to state anything specifically about the
mean difference), then there is no need to be concerned about
distribution skewness: if the null hypothesis of no treatment
effect is true, then the sampling distribution of the test statistic is
guaranteed to be symmetric about 0, and if distribution skewness exists
and contributes to a rejection of the null hypothesis, that's fine,
since distribution skewness indicates that there is a treatment effect.
When testing for a treatment effect (as opposed to doing a test about
the mean difference), the only concern about validity is when the
difference distribution appears to be light-tailed and the sample size is
really small.
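The claims above about skewness are easy to check by simulation. Here is
a rough Python sketch (my own, under an assumed strongly right-skewed
difference distribution, namely a shifted exponential with mean 0, and
n = 10) estimating the two one-tailed type I error rates of the t test
at nominal level 0.05:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(1)
    n, reps = 10, 100_000
    crit = stats.t.ppf(0.95, n - 1)
    lower = upper = 0
    for _ in range(reps):
        d = rng.exponential(1.0, n) - 1.0            # mean 0, strong right skew
        t = d.mean() / (d.std(ddof=1) / np.sqrt(n))
        lower += t < -crit                           # lower-tailed rejection
        upper += t > crit                            # upper-tailed rejection
    print(lower / reps, upper / reps)

One tail comes out well above 0.05 and the other well below it,
consistent with the remarks above.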
- (p. 354, 1st 3 lines) If one does a normality plot (aka probit
plot), I don't think a histogram or any of the other things referred to
is going to contribute anything to the assessment of approximate
normality.
- (pp. 354-355, Example 9.6) Most of the examples in S&W are
decent --- this one is just plain silly. It seems hard to view the data
as a random sample. Even if we could, why would we want to do a test
about the mean difference? My guess is that the distance the person is
before the animal starts to run depends on the distance from the tree,
and it may be kind of interesting to study that relationship, but I
don't see much interest in the hypothesis that the difference is zero.
Section 9.3
- (p. 358) The first paragraph of this section is a good one to
focus on --- it should remind you of some of the points made previously
in Ch. 8 and earlier in this chapter.
- (p. 360, Example 9.10) I guess the main point of the
experiment is to determine if differences exist, and perhaps
characterize them. Estimation of the mean or median difference may not
be so important, since the inference would apply to a distribution
pertaining to the specific laboratory conditions used. So it might be
risky to try to generalize anything about the magnitude of the mean
difference, but knowing that there is a difference between the two
strains in the lab setting might suggest that there could well
be a difference in other settings as well. (If no significant
difference is observed in the lab, then perhaps it's reasonable to
think that the growth rates are the same (in general) for the two
strains.)
- (p. 361, Purposes of Pairing) The first paragraph of this
section is good --- it serves as another reminder of some important points.
- (p. 361, Purposes of Pairing, 1st sentence of 2nd paragraph)
That randomization controls bias may be a bit misleading --- perhaps
better to put that it absorbs bias, or accounts for bias in a fair way.
Randomization doesn't reduce or eliminate the effects of bias altogether
--- the bias adds to the experimental noise --- but it reduces the adverse
effect of bias on the validity of inferences.
- (p. 362, 1st paragraph) It is important to not use the observed
values of the response variable of interest to create the pairs ---
indeed, the pairing should occur before the responses are observed (by
the person doing the pairing).
- (p. 362, Randomized Pairs Design Versus Completely Randomized
Design) When in doubt, perhaps best to pair --- typically, inferences can only
be hurt a little by pairing when pairing isn't called for, but they can
be hurt more by not pairing when pairing is appropriate. For example,
in Example 9.11 I'd recommend pairing unless prior experience has
indicated location differences aren't very important --- I'd worry that
location differences may be appreciable, and too much experimental noise
may exist if pairing isn't done.
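To see why, consider the following rough simulation (a sketch under an
assumed model, not anything from S&W): each pair has its own location
effect, the treatment adds a constant, and we compare how often the
paired and unpaired t tests reject at level 0.05.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(2)
    n, reps, effect = 12, 5_000, 1.0
    rej_paired = rej_unpaired = 0
    for _ in range(reps):
        pair = rng.normal(0.0, 3.0, n)                 # large pair-to-pair variation
        x = pair + rng.normal(0.0, 1.0, n)             # control responses
        y = pair + effect + rng.normal(0.0, 1.0, n)    # treatment responses
        rej_paired += stats.ttest_rel(y, x).pvalue < 0.05
        rej_unpaired += stats.ttest_ind(y, x).pvalue < 0.05
    print(rej_paired / reps, rej_unpaired / reps)

With pair effects this large, the paired analysis has far higher power;
if the pair effects are made negligible, the two analyses come out
nearly the same, which is the "hurt a little" versus "hurt more"
trade-off described above.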
Section 9.4
The sign test is always valid as a test about the median
of a continuous distribution. Also, when working with matched pairs of
treatment and control observations, it's valid as a test of the null
hypothesis of no treatment effect against the general alternative
(of some sort of a treatment effect), and can be used with matched pairs
of observations corresponding to two treatments to test the null
hypothesis of no difference between treatments against the
general alternative (of some sort of difference).
- (pp. 364-365, Example 9.12) On p. 364, S&W points out that
with the sign test it's sometimes possible to do a test even though
censoring or truncation has occurred. (It's truncation if
values cannot be observed beyond a certain fixed point --- e.g., if a
scale can only measure up to 300 pounds, one would know that an object,
or subject, weighs more than 300 pounds, but it cannot be determined how
much more. It's censoring if the limit for observable values isn't
fixed, but varies --- e.g., in measuring survival times (times to death
or failure), if something is still okay after 526 days it can be
concluded that the survival time is at least 526 days ... if another
experimental unit left the study for some reason (not related to
survival time) after 30 days, all we know is that the survival time of
that unit is greater than 30 days.) I don't like the way the
alternative hypothesis is worded on p. 365 --- it's in one sense too
vague, and in another sense too specific. One could test for a general
treatment effect using a two-tailed test, or one could test to determine
if the median of the difference distribution is greater than 0, which is
equivalent to testing to determine if there is evidence that the
majority of subjects would benefit from close compatibility. Also, on
p. 365, the test statistic is introduced. Some books use S to
denote the test statistic (and other books use K). Usually the
test statistic is defined to be the number of positive differences (in a
matched pairs setting, so equal to S&W's N+), or the number of
observations greater than some specified value, say 0, 100, or some
other number (when doing a test about the median of a distribution
using a single sample of observations). Finally, it's
ridiculous that S&W doesn't include a table to use to get p-values for a
sign test. I'll give you a table from which you can obtain the p-value
as follows: look up the entry corresponding to a statistic value of 8
in the n = 11 part of the table, which gives the null probability that
the test statistic assumes a value less than or equal to 8, and subtract
that value from 1 to get the null probability that the test statistic
assumes a value greater than or equal to 9, which is the desired p-value.
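In case you want to check the table (or skip it), the lookup just
described is an ordinary binomial calculation, since under the null
hypothesis the test statistic has a binomial(n, 1/2) distribution. A
quick Python sketch using the numbers above (n = 11, observed statistic
equal to 9):

    from scipy.stats import binom

    n = 11
    p_le_8 = binom.cdf(8, n, 0.5)   # table entry: null P(statistic <= 8)
    p_value = 1 - p_le_8            # null P(statistic >= 9), about 0.0327
    print(p_value)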
- (pp. 366-367, Example 9.13) I don't like the silly tables
of critical values for the sign test included in S&W --- better to get a
p-value (using software, or proper tables). Using SPSS one can obtain
an exact (although rounded) p-value of 0.019 --- I encourage you to
obtain this result using SPSS. (Once again, I don't care
for the way the alternative hypothesis is stated.)
- (p. 367, Bracketing the p-value) There is no need to
do this --- one should report the p-value to 2 significant digits.
Plus, as the footnote points out, the bracketing may not be entirely
correct.
- (pp. 367-368, Example 9.14) I've never seen any other book
use the "folded" distribution. (Most books simply (!!!) use a
binomial distribution --- no need to complicate matters.) See if you
can get the value 0.1719 from the table I'm supplying you with.
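As a check on your table work, 0.1719 is an ordinary binomial(n, 1/2)
tail probability. One combination of numbers that produces it (a guess
at the example's setup on my part, so treat it as illustrative only) is
n = 10 with an observed statistic of 7:

    from scipy.stats import binom

    # P(statistic >= 7) when n = 10: 176/1024 = 0.171875, i.e., 0.1719
    print(1 - binom.cdf(6, 10, 0.5))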
- (pp. 368-369, Example 9.15) I've never seen any other book
use the term "P-value of the data" (and I hope I don't see or hear you
use that term, since from the same data set one can do 10 different
tests and get 10 different p-values --- indicating that the data doesn't
have one particular p-value ... rather, a test applied to a data set
results in a p-value).
- (p. 369, Applicability of the Sign Test) For a test for a
treatment effect, the sign test often supplies a p-value larger than you
can get using the t test or the signed-rank test, and if the
difference distribution is approximately normal, the power of the sign
test can be much lower than the power of the t test. S&W claim
that the signed-rank test "is more difficult to carry out" but using
SPSS one can do the signed-rank test at the same time as doing the sign
test.
- (p. 369, Example 9.16) I don't think the setting described
here is a good one in which to employ the sign test.
Section 9.5
Our main use for the Wilcoxon signed-rank test (not to be confused with
the Wilcoxon rank sum test, which is for two independent samples)
will be for testing for the presence of a treatment effect with data
from a matched-pairs experiment. The signed-rank test is always valid
in such a setting. If one assumes that the distribution underlying the
data (whether it be the distribution of the differences from matched
pairs, or the distribution of independent observations of some
phenomenon) is symmetric, then the signed-rank test can be safely
interpreted as a test about the mean/median of the symmetric
distribution. But using the test in this way when the distribution is not
symmetric can lead to false rejections of the null hypothesis with high
probability, and so one should worry that skewness can cause
misbehavior. (When used as a test for a treatment effect, the
signed-rank test is always valid, and one doesn't have to worry about
apparent skewness --- any skewness would be evidence of a treatment
effect, and if the skewness contributes to a rejection of the null
hypothesis of no treatment effect, it would not be a false rejection.)
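For those not using SPSS, the signed-rank test is easy to run in Python
as well. A minimal sketch (the differences are made up):

    import numpy as np
    from scipy import stats

    d = np.array([1.8, -0.4, 2.6, 0.9, -1.1, 3.0, 0.7, 1.5])  # hypothetical d values
    res = stats.wilcoxon(d)    # signed-rank test, two-sided by default
    print(res.statistic, res.pvalue)

(The two-sided default matches the "test for a treatment effect" use
emphasized above.)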
- (p. 372, 1st paragraph) It is not true that the signed-rank test
is always more powerful than the sign test --- for some heavy-tailed
distributions the sign test can be more powerful. (Plus, the sign test
can sometimes be applied in cases for which the signed-rank test cannot be
considered a valid test --- e.g., for tests about the median of a skewed
distribution.)
- (pp. 372-373, Example 9.17) In step 5, the test statistic is
defined in a nonstandard way --- the usual definition of the test
statistic is the sum of the ranks for the positive
observations/differences ... what S&W denotes by
W+, and what some other books denote by
T+, or some other symbol. The use of the table and
bracketing the p-value, as described in step 6, is nonstandard ---
better to just obtain the value of the test statistic and use the table
I supplied in class to get a p-value when n <= 20, or just let
SPSS produce an approximate p-value otherwise. (SPSS uses a normal
approximation to produce p-values for the signed-rank test.
Unfortunately, the approximation used is not the best one in most cases
--- usually better to employ an approximation that incorporates a
continuity correction.)
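If you want the better approximation, it's simple enough to compute by
hand or in Python. A sketch (the function name is my own) using the fact
that under the null hypothesis W+ has mean n(n+1)/4 and variance
n(n+1)(2n+1)/24, with the statistic moved half a unit toward the mean as
the continuity correction:

    import math
    from scipy.stats import norm

    def signed_rank_upper_p(w_plus, n):
        """Approximate null P(W+ >= w_plus), with a continuity correction."""
        mean = n * (n + 1) / 4
        sd = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
        return norm.sf((w_plus - 0.5 - mean) / sd)

    print(signed_rank_upper_p(50, 12))   # e.g., n = 12, observed W+ = 50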
- (pp. 373-374, Bracketing the p-value)
As stated above, this is just silly --- better to report an exact
p-value (perhaps rounded), or an approximation to the exact p-value.
- (p. 374, Directional Alternative)
Usually, one does a two-tailed test with the signed-rank statistic.
Using the table I supplied in class, one can do a lower-tailed test, an
upper-tailed test, or a two-tailed test, as long as n <= 20.
SPSS always outputs the (approximate) p-value for a two-tailed test.
Denoting the outputted p-value by p, the (approximate) p-value
for a one-tailed test will either be p/2 or
1 - p/2, depending upon whether or not the value of the test statistic
is on the side of n(n+1)/4 that most supports the rejection
of the null hypothesis in favor of the alternative.
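A tiny helper (hypothetical, just encoding the rule above) for an
upper-tailed test:

    def one_tailed_p(p_two, w_plus, n):
        # Use p/2 if W+ is on the side of n(n+1)/4 supporting the
        # alternative (above it, for an upper-tailed test); else 1 - p/2.
        return p_two / 2 if w_plus > n * (n + 1) / 4 else 1 - p_two / 2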
- (p. 374, Treatment of Zeros)
SPSS, and most other statistical software, ignores observations of 0,
as is described in S&W.
- (p. 374, Treatment of Ties)
SPSS, and most other statistical software, uses the mid-rank method,
as is described in S&W. This is fine if one is going to use the normal
approximation to obtain an approximate p-value. But mid-ranks can cause
a problem when using a table to get an exact p-value,
because the use of mid-ranks can result in
a value for the test statistic which is not an integer, and is not in
the table. If ties are encountered when assigning ranks and the sample
size is small, the best thing to do would be to use StatXact
(software that is great for doing exact nonparametric tests); another
alternative would be to break all ties in such a way as to maximize the
p-value. (This is a conservative approach that makes it tougher to get a
rejection, but if a rejection (or, more generally, a small p-value) is
obtained, it can be taken seriously, without being viewed as
questionable in any way.)
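Here is a quick Python illustration (made-up numbers) of how mid-ranks
on the absolute differences can produce a non-integer value of W+:

    import numpy as np
    from scipy.stats import rankdata

    d = np.array([1.2, -1.2, 0.5, -0.8, 2.0, 3.1])
    ranks = rankdata(np.abs(d))   # the tied 1.2s share mid-rank (3 + 4)/2 = 3.5
    w_plus = ranks[d > 0].sum()   # 3.5 + 1 + 5 + 6 = 15.5, not an integer
    print(ranks, w_plus)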
- (p. 374, Applicability of the Wilcoxon Signed-Rank Test, 1st
paragraph)
I prefer to think of the signed-rank test as a test of the null
hypothesis of no treatment effect against the general alternative
(of some sort of treatment effect), and not as a test about the mean
--- since skewness can make the test unreliable as a test about the
mean.
- (p. 375, 1st paragraph) The confidence interval referred to
certainly won't be emphasized in STAT 535. It's only reliable as a
confidence interval for the mean/median if the distribution is
symmetric, and since we never know if we have symmetry (the sample
skewness isn't a reliable measure of asymmetry), it isn't very useful.
- (p. 375, last paragraph) There are several things wrong here.
First of all, if we required exact normality in order to use the
t test, it wouldn't ever be used for analyzing real-world data.
Also, it's not true that the signed-rank test is always more powerful
than the sign test, and it's not true that the sign test is always the
least powerful of the three methods --- in some settings the sign test
is more powerful than both the t test and the signed-rank test.
Section 9.6
- (pp. 377-378, Example 9.18) This is an interesting example
--- involving two independent samples of matched-pairs differences.
Perhaps the experiment could be improved by creating matched pairs at
the higher level --- that is, matching members of the biofeedback group
to members of the control group, using age, weight, initial
blood pressure, overall fitness, and perhaps other characteristics to
create the pairs. In such a case, the final data used would be
differences of differences, and with this data any of the methods from
Ch. 9 could be employed for a test for a treatment effect, or one could
find an appropriate test to do a test about the mean or median
difference.
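A sketch of that higher-level matching idea (the numbers are invented):
each biofeedback subject is paired with a similar control subject, the
within-subject blood pressure changes are differenced across the pair,
and the resulting values are analyzed as a single sample.

    import numpy as np
    from scipy import stats

    biofeedback_change = np.array([-8.0, -11.0, -6.0, -14.0, -9.0])  # hypothetical
    control_change     = np.array([-4.0,  -5.0, -7.0,  -6.0, -2.0])  # hypothetical
    d = biofeedback_change - control_change   # difference of differences, per pair

    print(stats.ttest_1samp(d, 0.0).pvalue)   # t test for a treatment effect
    print(stats.wilcoxon(d).pvalue)           # signed-rank test alternative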
Section 9.7
(I don't have any comments to add at the present time.)