Some Comments about Chapter 9 of Samuels & Witmer
Section 9.1
- (p. 347, 1st paragraph) If you designed the experiment, then of
course you'll know whether you paired or not. But if you are given
data, it may be a bit less obvious. To determine whether you have
paired data or two independent samples, ask yourself (I'm going to
use x and y to denote the observations in the two samples) whether
y1 is more closely related to x1 than it is to x2 (or whether x3 is
more closely related to y3 than it is to y7). If the answer is yes,
then the data should be viewed as matched pairs data.
- (pp. 347-348) The sentence that begins on the bottom of p. 347 and
continues on p. 348 is typically true for this type of matched pairs
experiment, but it's not a key thing that you should focus on in the initial
stages of studying the analysis of matched pairs data.
Section 9.2
- (p. 349, 2nd line) I think it's often better to think of it as
making an inference about the mean difference (the mean of the
difference distribution) or the mean change, especially if the two
measurements of a pair are from the same experimental unit.
Note that Condition 1 on p. 353 indicates that the
differences should be such that they can be regarded as a random sample.
This sample would be from the population/distribution of differences,
and the mean of the difference distribution is often the focus.
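To make the point about the difference distribution concrete, here is a
small Python sketch (my own illustration, with made-up numbers, not
anything from S&W) showing that the paired t analysis is nothing more
than a one-sample t analysis applied to the differences:

    import numpy as np
    from scipy import stats

    x = np.array([12.1, 10.8, 13.5, 11.2, 12.9])  # hypothetical 1st measurements
    y = np.array([11.4, 10.1, 13.7, 10.5, 12.0])  # hypothetical 2nd measurements
    d = y - x                                     # within-pair differences

    paired = stats.ttest_rel(y, x)                # paired t test
    onesamp = stats.ttest_1samp(d, 0.0)           # one-sample t test on the d values
    print(paired.pvalue, onesamp.pvalue)          # the two p-values are identical

(The two calls give identical results, which is why one can focus on the
distribution of the differences.)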
- (pp. 351-353, Example 9.5) This example is typical: if pairing
is ignored, one usually gets less evidence of a difference.
- (p. 353, Conditions for Validity of Student's t Analysis)
Referring to Condition 2, one never expects to have exact
normality, and so the test is really always approximate. If the
difference distribution is heavy-tailed and symmetric, the test is
conservative, meaning that reported p-values will be larger than they
should be (and so there is no danger of having an inflated type I error
rate if testing at a specific level). If the
difference distribution is light-tailed and symmetric, the test is
slightly
anticonservative, meaning that reported p-values will be slightly smaller than they
should be (and so one can have a slightly inflated type I error
rate if testing at a specific level, but unless the sample size is less
than 12 or so, I wouldn't worry about this). Skewness is what can cause
serious problems --- with one-tailed tests, the type I error rate can be
off by a factor of around 2 or 3 (or more) if the sample size is small
and the skewness is strong, but in other cases the test can be quite
conservative (and this raises concerns about low power).
If one is doing a two-tailed test to determine if there is evidence of a
treatment effect (without needing to state anything specifically about the
mean difference), then there is no need to be concerned about
distribution skewness: if the null hypothesis of no treatment
effect is true, then the sampling distribution of the test statistic is
guaranteed to be symmetric about 0, and if distribution skewness exists
and contributes to a rejection of the null hypothesis, that's fine,
since distribution skewness indicates that there is a treatment effect.
When testing for a treatment effect (as opposed to doing a test about
the mean difference), the only concern about validity is when the
difference distribution appears to be light-tailed and the sample size is
really small.
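The claims above about skewness are easy to check by simulation. Here is
a rough Python sketch (my own, under an assumed strongly right-skewed
difference distribution, namely a shifted exponential with mean 0, and
n = 10) estimating the two one-tailed type I error rates of the t test
at nominal level 0.05:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(1)
    n, reps = 10, 100_000
    crit = stats.t.ppf(0.95, n - 1)
    lower = upper = 0
    for _ in range(reps):
        d = rng.exponential(1.0, n) - 1.0            # mean 0, strong right skew
        t = d.mean() / (d.std(ddof=1) / np.sqrt(n))
        lower += t < -crit                           # lower-tailed rejection
        upper += t > crit                            # upper-tailed rejection
    print(lower / reps, upper / reps)

One tail comes out well above 0.05 and the other well below it,
consistent with the remarks above.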
- (p. 354, 1st 3 lines) If one does a normality plot (aka probit
plot), I don't think a histogram or any of the other things referred to
is going to contribute anything to the assessment of approximate
normality.
- (pp. 354-355, Example 9.6) Most of the examples in S&W are
decent --- this one is just plain silly. It seems hard to view the data
as a random sample. Even if we could, why would we want to do a test
about the mean difference? My guess is that the distance the person is
before the animal starts to run depends on the distance from the tree,
and it may be kind of interesting to study that relationship, but I
don't see much interest in the hypothesis that the difference is zero.
Section 9.3
- (p. 358) The first paragraph of this section is a good one to
focus on --- it should remind you of some of the points made previously
in Ch. 8 and earlier in this chapter.
- (p. 360, Example 9.10) I guess the main point of the
experiment is to determine if differences exist, and perhaps
characterize them. Estimation of the mean or median difference may not
be so important, since the inference would apply to a distribution
pertaining to the specific laboratory conditions used. So it might be
risky to try to generalize anything about the magnitude of the mean
difference, but knowing that there is a difference between the two
strains in the lab setting might suggest that there could well
be a difference in other settings as well. (If no significant
difference is observed in the lab, then perhaps it's reasonable to
think that the growth rates are the same (in general) for the two
strains.)
- (p. 361, Purposes of Pairing) The first paragraph of this
section is good --- it serves as another reminder of some important points.
- (p. 361, Purposes of Pairing, 1st sentence of 2nd paragraph)
That randomization controls bias may be a bit misleading --- perhaps
better to put that it absorbs bias, or accounts for bias in a fair way.
Randomization doesn't reduce or eliminate the effects of bias altogether
--- the bias adds to the experimental noise --- but it reduces the adverse
effect of bias on the validity of inferences.
- (p. 362, 1st paragraph) It is important to not use the observed
values of the response variable of interest to create the pairs ---
indeed, the pairing should occur before the responses are observed (by
the person doing the pairing).
- (p. 362, Randomized Pairs Design Versus Completely Randomized
Design) When in doubt, perhaps best to pair --- typically, inferences can only
be hurt a little by pairing when pairing isn't called for, but they can
be hurt more by not pairing when pairing is appropriate. For example,
in Example 9.11 I'd recommend pairing unless prior experience has
indicated location differences aren't very important --- I'd worry that
location differences may be appreciable, and too much experimental noise
may exist if pairing isn't done.
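To see why, consider the following rough simulation (a sketch under an
assumed model, not anything from S&W): each pair has its own location
effect, the treatment adds a constant, and we compare how often the
paired and unpaired t tests reject at level 0.05.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(2)
    n, reps, effect = 12, 5_000, 1.0
    rej_paired = rej_unpaired = 0
    for _ in range(reps):
        pair = rng.normal(0.0, 3.0, n)                 # large pair-to-pair variation
        x = pair + rng.normal(0.0, 1.0, n)             # control responses
        y = pair + effect + rng.normal(0.0, 1.0, n)    # treatment responses
        rej_paired += stats.ttest_rel(y, x).pvalue < 0.05
        rej_unpaired += stats.ttest_ind(y, x).pvalue < 0.05
    print(rej_paired / reps, rej_unpaired / reps)

With pair effects this large, the paired analysis has far higher power;
if the pair effects are made negligible, the two analyses come out
nearly the same, which is the "hurt a little" versus "hurt more"
trade-off described above.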
Section 9.4
The sign test is always valid as a test about the median
of a continuous distribution. Also, when working with matched pairs of
treatment and control observations, it's valid as a test of the null
hypothesis of no treatment effect against the general alternative
(of some sort of a treatment effect), and can be used with matched pairs
of observations corresponding to two treatments to test the null
hypothesis of no difference between treatments against the
general alternative (of some sort of difference).
- (pp. 364-365, Example 9.12) On p. 364, S&W points out that
with the sign test it's sometimes possible to do a test even though
censoring or truncation has occurred. (It's truncation if
values cannot be observed beyond a certain fixed point --- e.g., if a
scale can only measure up to 300 pounds, one would know that an object,
or subject, weighs more than 300 pounds, but it cannot be determined how
much more. It's censoring if the limit for observable values isn't
fixed, but varies --- e.g., in measuring survival times (times to death
or failure), if something is still okay after 526 days it can be
concluded that the survival time is at least 526 days ... if another
experimental unit left the study for some reason (not related to
survival time) after 30 days, all we know is that the survival time of
that unit is greater than 30 days.) I don't like the way the
alternative hypothesis is worded on p. 365 --- it's in one sense too
vague, and in another sense too specific. One could test for a general
treatment effect using a two-tailed test, or one could test to determine
if the median of the difference distribution is greater than 0, which is
equivalent to testing to determine if there is evidence that the
majority of subjects would benefit from close compatibility. Also, on
p. 365, the test statistic is introduced. Some books use S to
denote the test statistic (and other books use K). Usually the
test statistic is defined to be the number of positive differences (in a
matched pairs setting, so equal to S&W's N+), or the number of
observations greater than some specified value, say 0, 100, or some
other number (when doing a test about the median of a distribution
using a single sample of observations). Finally, it's
ridiculous that S&W doesn't include a table to use to get p-values for a
sign test. I'll give you a table from which you can obtain the p-value
as follows: look up the entry corresponding to a statistic value of 8
in the n = 11 part of the table, which gives the null probability that
the test statistic assumes a value less than or equal to 8, and subtract
that value from 1 to get the null probability that the test statistic
assumes a value greater than or equal to 9, which is the desired p-value.
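In case you want to check the table (or skip it), the lookup just
described is an ordinary binomial calculation, since under the null
hypothesis the test statistic has a binomial(n, 1/2) distribution. A
quick Python sketch using the numbers above (n = 11, observed statistic
equal to 9):

    from scipy.stats import binom

    n = 11
    p_le_8 = binom.cdf(8, n, 0.5)   # table entry: null P(statistic <= 8)
    p_value = 1 - p_le_8            # null P(statistic >= 9), about 0.0327
    print(p_value)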
- (pp. 366-367, Example 9.13) I don't like the silly tables
of critical values for the sign test included in S&W --- better to get a
p-value (using software, or proper tables). Using SPSS one can obtain
an exact (although rounded) p-value of 0.019 --- I encourage you to
obtain this result using SPSS. (Once again, I don't care
for the way the alternative hypothesis is stated.)
- (p. 367, Bracketing the p-value) There is no need to
do this --- one should report the p-value to 2 significant digits.
Plus, as the footnote points out, the bracketing may not be entirely
correct.
- (pp. 367-368, Example 9.14) I've never seen any other book
use the "folded" distribution. (Most books simply (!!!) use a
binomial distribution --- no need to complicate matters.) See if you
can get the value 0.1719 from the table I'm supplying you with.
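As a check on your table work, 0.1719 is an ordinary binomial(n, 1/2)
tail probability. One combination of numbers that produces it (a guess
at the example's setup on my part, so treat it as illustrative only) is
n = 10 with an observed statistic of 7:

    from scipy.stats import binom

    # P(statistic >= 7) when n = 10: 176/1024 = 0.171875, i.e., 0.1719
    print(1 - binom.cdf(6, 10, 0.5))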
- (pp. 368-369, Example 9.15) I've never seen any other book
use the term "P-value of the data" (and I hope I don't see or hear you
use that term, since from the same data set one can do 10 different
tests and get 10 different p-values --- indicating that the data doesn't
have one particular p-value ... rather, a test applied to a data set
results in a p-value).
- (p. 369, Applicability of the Sign Test) For a test for a
treatment effect, the sign test often supplies a p-value larger than you
can get using the t test or the signed-rank test, and if the
difference distribution is approximately normal, the power of the sign
test can be much lower than the power of the t test. S&W claim
that the signed-rank test "is more difficult to carry out" but using
SPSS one can do the signed-rank test at the same time as doing the sign
test.
- (p. 369, Example 9.16) I don't think the setting described
here is a good one in which to employ the sign test.
Section 9.5
Our main use for the Wilcoxon signed-rank test (not to be confused with
the Wilcoxon rank sum test, which is for two independent samples)
will be for testing for the presence of a treatment effect with data
from a matched-pairs experiment. The signed-rank test is always valid
in such a setting. If one assumes that the distribution underlying the
data (whether it be the distribution of the differences from matched
pairs, or the distribution of independent observations of some
phenomenon) is symmetric, then the signed-rank test can be safely
interpreted as a test about the mean/median of the symmetric
distribution. But using the test in this way when the distribution is not
symmetric can lead to false rejections of the null hypothesis with high
probability, and so one should worry that skewness can cause
misbehavior. (When used as a test for a treatment effect, the
signed-rank test is always valid, and one doesn't have to worry about
apparent skewness --- any skewness would be evidence of a treatment
effect, and if the skewness contributes to a rejection of the null
hypothesis of no treatment effect, it would not be a false rejection.)
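For those not using SPSS, the signed-rank test is easy to run in Python
as well. A minimal sketch (the differences are made up):

    import numpy as np
    from scipy import stats

    d = np.array([1.8, -0.4, 2.6, 0.9, -1.1, 3.0, 0.7, 1.5])  # hypothetical d values
    res = stats.wilcoxon(d)    # signed-rank test, two-sided by default
    print(res.statistic, res.pvalue)

(The two-sided default matches the "test for a treatment effect" use
emphasized above.)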
- (p. 372, 1st paragraph) It is not true that the signed-rank test
is always more powerful than the sign test --- for some heavy-tailed
distributions the sign test can be more powerful. (Plus, the sign test
can sometimes be applied in cases for which the signed-rank test cannot be
considered a valid test --- e.g., for tests about the median of a skewed
distribution.)
- (pp. 372-373, Example 9.17) In step 5, the test statistic is
defined in a nonstandard way --- the usual definition of the test
statistic is the sum of the ranks for the positive
observations/differences ... what S&W denotes by
W+, and what some other books denote by
T+, or some other symbol. The use of the table and
bracketing the p-value, as described in step 6, is nonstandard ---
better to just obtain the value of the test statistic and use the table
I supplied in class to get a p-value when n <= 20, or just let
SPSS produce an approximate p-value otherwise. (SPSS uses a normal
approximation to produce p-values for the signed-rank test.
Unfortunately, the approximation used is not the best one in most cases
--- usually better to employ an approximation that incorporates a
continuity correction.)
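If you want the better approximation, it's simple enough to compute by
hand or in Python. A sketch (the function name is my own) using the fact
that under the null hypothesis W+ has mean n(n+1)/4 and variance
n(n+1)(2n+1)/24, with the statistic moved half a unit toward the mean as
the continuity correction:

    import math
    from scipy.stats import norm

    def signed_rank_upper_p(w_plus, n):
        """Approximate null P(W+ >= w_plus), with a continuity correction."""
        mean = n * (n + 1) / 4
        sd = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
        return norm.sf((w_plus - 0.5 - mean) / sd)

    print(signed_rank_upper_p(50, 12))   # e.g., n = 12, observed W+ = 50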
- (pp. 373-374, Bracketing the p-value)
As stated above, this is just silly --- better to report an exact
p-value (perhaps rounded), or an approximation to the exact p-value.
- (p. 374, Directional Alternative)
Usually, one does a two-tailed test with the signed-rank statistic.
Using the table I supplied in class, one can do a lower-tailed test, an
upper-tailed test, or a two-tailed test, as long as n <= 20.
SPSS always outputs the (approximate) p-value for a two-tailed test.
Denoting the outputted p-value by p, the (approximate) p-value
for a one-tailed test will either be p/2 or
1 - p/2, depending upon whether or not the value of the test statistic
is on the side of n(n+1)/4 that most supports the rejection
of the null hypothesis in favor of the alternative.
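A tiny helper (hypothetical, just encoding the rule above) for an
upper-tailed test:

    def one_tailed_p(p_two, w_plus, n):
        # Use p/2 if W+ is on the side of n(n+1)/4 supporting the
        # alternative (above it, for an upper-tailed test); else 1 - p/2.
        return p_two / 2 if w_plus > n * (n + 1) / 4 else 1 - p_two / 2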
- (p. 374, Treatment of Zeros)
SPSS, and most other statistical software, ignores observations of 0,
as is described in S&W.
- (p. 374, Treatment of Ties)
SPSS, and most other statistical software, uses the mid-rank method,
as is described in S&W. This is fine if one is going to use the normal
approximation to obtain an approximate p-value. But mid-ranks can cause
a problem when using a table to get an exact p-value,
because the use of mid-ranks can result in
a value for the test statistic which is not an integer, and is not in
the table. If ties are encountered when assigning ranks and the sample
size is small, the best thing to do would be to use StatXact
(software that is great for doing exact nonparametric tests); another
alternative would be to break all ties in such a way as to maximize the
p-value. (This is a conservative approach that makes it tougher to get a
rejection, but if a rejection (or, more generally, a small p-value) is
obtained, it can be taken seriously, without being viewed as
questionable in any way.)
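Here is a quick Python illustration (made-up numbers) of how mid-ranks
on the absolute differences can produce a non-integer value of W+:

    import numpy as np
    from scipy.stats import rankdata

    d = np.array([1.2, -1.2, 0.5, -0.8, 2.0, 3.1])
    ranks = rankdata(np.abs(d))   # the tied 1.2s share mid-rank (3 + 4)/2 = 3.5
    w_plus = ranks[d > 0].sum()   # 3.5 + 1 + 5 + 6 = 15.5, not an integer
    print(ranks, w_plus)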
- (p. 374, Applicability of the Wilcoxon Signed-Rank Test, 1st
paragraph)
I prefer to think of the signed-rank test as a test of the null
hypothesis of no treatment effect against the general alternative
(of some sort of treatment effect), and not as a test about the mean
--- since skewness can make the test unreliable as a test about the
mean.
- (p. 375, 1st paragraph) The confidence interval referred to
certainly won't be emphasized in STAT 535. It's only reliable as a
confidence interval for the mean/median if the distribution is
symmetric, and since we never know if we have symmetry (the sample
skewness isn't a reliable measure of asymmetry), it isn't very useful.
- (p. 375, last paragraph) There are several things wrong here.
First of all, if we required exact normality in order to use the
t test, it wouldn't ever be used for analyzing real-world data.
Also, it's not true that the signed-rank test is always more powerful
than the sign test, and it's not true that the sign test is always the
least powerful of the three methods --- in some settings the sign test
is more powerful than both the t test and the signed-rank test.
Section 9.6
- (pp. 377-378, Example 9.18) This is an interesting example
--- involving two independent samples of matched-pairs differences.
Perhaps the experiment could be improved by creating matched pairs at
the higher level --- that is, matching members of the biofeedback group
to members of the control group, using age, weight, initial
blood pressure, overall fitness, and perhaps other characteristics to
create the pairs. In such a case, the final data used would be
differences of differences, and with this data any of the methods from
Ch. 9 could be employed for a test for a treatment effect, or one could
find an appropriate test to do a test about the mean or median
difference.
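A sketch of that higher-level matching idea (the numbers are invented):
each biofeedback subject is paired with a similar control subject, the
within-subject blood pressure changes are differenced across the pair,
and the resulting values are analyzed as a single sample.

    import numpy as np
    from scipy import stats

    biofeedback_change = np.array([-8.0, -11.0, -6.0, -14.0, -9.0])  # hypothetical
    control_change     = np.array([-4.0,  -5.0, -7.0,  -6.0, -2.0])  # hypothetical
    d = biofeedback_change - control_change   # difference of differences, per pair

    print(stats.ttest_1samp(d, 0.0).pvalue)   # t test for a treatment effect
    print(stats.wilcoxon(d).pvalue)           # signed-rank test alternative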
Section 9.7
(I don't have any comments to add at the present time.)