Some Comments about Chapter 7 of Samuels & Witmer
Section 7.1
- (pp. 219-220, Example 7.1) The histograms, and the near
equality of the two sample standard deviations, suggest that a shift
model might be appropriate. With a shift model, it is assumed that
the two densities have exactly the same shape, but that one may be
shifted relative to the other. Having looked at a lot of data sets over
the years, I think that most of the time a shift model is not
appropriate --- usually if the means of two distributions differ, other
features, like variance and degree of skewness, differ as well.
A nice thing about being able to assume a shift model is that if it is
also the case that the
sample sizes are equal, or not too different, then some statistical
procedures based on an assumption of normality are fairly robust against
violations of the normality assumption.
- (p. 220, Example 7.2) The two samples here exhibit a
pattern that is relatively common --- they have the same general shape
(both skewed in the same direction), but have different variances.
(I wonder how they injected the flies.)
- (p. 221, Notation) It is common to use x for one
sample, and y for the other.
Section 7.2
- (p. 222, about middle of page) The difference in sample means is
a point estimate for the difference in distribution means.
- (p. 222, Example 7.3) The first paragraph of the example
refers to sample sizes of 8 and 7, but the table shows sample sizes of 7
and 5. The chapter notes on p. 645 don't provide any explanation for
the differences in sample sizes.
- (p. 223, gray box) The actual standard error for the difference
in the sample means would be the expression given in the box with the
sample variances replaced by the true variances. (The actual standard
error wouldn't need to be given by a special definition --- it results
from the general definition of the standard error of a statistic.)
The expression given
in the gray box is the estimated standard error.
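To make the distinction concrete, here is a minimal sketch in Python
(the sample values are made up purely for illustration) of the estimated
standard error given in the gray box:

    import numpy as np

    # hypothetical samples, for illustration only
    x = np.array([10.2, 11.1, 9.8, 10.6, 10.9])
    y = np.array([9.1, 9.9, 10.4, 8.8, 9.5, 9.7])

    n1, n2 = len(x), len(y)
    s1, s2 = x.std(ddof=1), y.std(ddof=1)   # sample standard deviations

    # estimated standard error of (xbar - ybar): plug in the sample variances
    se_hat = np.sqrt(s1**2 / n1 + s2**2 / n2)
    print(se_hat)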
- (p. 223, about 60% of the way down the page) The sentence that
begins "Whether we add ..." states a good point: both sample means have
some "noise" associated with them, and the "nosie" doesn't cancel just
because of the subtraction of one sample mean from the other.
- (p. 225, last paragraph) I agree that it's generally good to
allow for the variances being unequal. If one assumes
heteroscedasticity (unequal variances) and the variances are really
equal, little harm is done --- the method which allows for unequal
variances isn't optimal if the variances are equal, but it generally
produces about the same result as the slightly better method which
is based on an assumption of equal variances. But if one assumes
homoscedasticity (equal variances) and the variances really differ, then
the method which is based on an assumption of equal variances can
produce an appreciably inferior result. Still, the pooled
estimate of the variance (see the sketch below) is appropriate for some
settings in which there is reason to believe that the variances are
really equal. For example,
if the only source of variation is measurement error (i.e., no natural
population variability), and the same measuring instrument is used for
both samples, then it may be reasonable to assume that the true
distribution variances are the same.
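For reference, here is a similar sketch (the same kind of hypothetical
samples as above) of the pooled estimate of the variance from p. 224,
together with the estimated standard error that Student's two-sample t
procedure uses:

    import numpy as np

    # hypothetical samples, for illustration only
    x = np.array([10.2, 11.1, 9.8, 10.6, 10.9])
    y = np.array([9.1, 9.9, 10.4, 8.8, 9.5, 9.7])

    n1, n2 = len(x), len(y)
    s1_sq, s2_sq = x.var(ddof=1), y.var(ddof=1)   # sample variances

    # pooled estimate of the (assumed common) variance
    sp_sq = ((n1 - 1) * s1_sq + (n2 - 1) * s2_sq) / (n1 + n2 - 2)

    # estimated standard error used by Student's two-sample t procedure
    se_pooled = np.sqrt(sp_sq * (1 / n1 + 1 / n2))
    print(sp_sq, se_pooled)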
Section 7.3
- (p. 227) SPSS will compute the df given by expression (7.1)
for us. When I compute a confidence interval based on the degrees of
freedom given by (7.1) and the estimated standard error given in the
gray box on p. 223, I say that I am using Welch's method. (The
footnote on p. 227 also refers to Satterthwaite's method. Welch is the
one who proposed using the estimated standard error given on p. 223 for
two-sample tests and confidence intervals for the difference in means.
One can say that the degrees of freedom formula results from
Satterthwaite's method --- Satterthwaite developed a general scheme to
determine the appropriate degrees of freedom for settings in which
variances are not assumed to be equal.)
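For anyone who wants to check the arithmetic behind (7.1), here is a
minimal sketch of the Welch-Satterthwaite degrees of freedom, again with
made-up sample values:

    import numpy as np

    # hypothetical samples, for illustration only
    x = np.array([10.2, 11.1, 9.8, 10.6, 10.9])
    y = np.array([9.1, 9.9, 10.4, 8.8, 9.5, 9.7])

    n1, n2 = len(x), len(y)
    v1, v2 = x.var(ddof=1) / n1, y.var(ddof=1) / n2   # s_i^2 / n_i

    # Welch-Satterthwaite approximate degrees of freedom (expression (7.1))
    df = (v1 + v2)**2 / (v1**2 / (n1 - 1) + v2**2 / (n2 - 1))
    print(df)   # typically not an integer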
- (pp. 228-229, Example 7.7) I recommend that you try
to duplicate the results of this example using SPSS. A first step is to
read the data into the Data Editor. (I find that I always have to try
twice to read the data from the CD that came with the book --- the
first time produces some sort of complaint, but when I try again to open
the data file it works. Is anyone else having this problem? Is anyone
not having this problem?) Once the data is in, I like to look at
some plots and summary statistics before using any inference methods.
Going down from the Analyze menu, I stop on Descriptive
Statistics and then across to select Explore.
I click height into the Dependent List and group
into the Factor List, and then click OK. This produces
some summary statistics for each of the two samples, and also some
plots. With such small sample sizes, it's a bit worrisome that the
estimated skewnesses differ by more than 1 (one is positive, and the
other is negative), but at the same time, with such small sample sizes
these estimates may be way off. (Note that small sample sizes are bad
--- not only is it hard to check to determine if there are serious
violations of the assumption of (approximate) normality, but if there is
a problem then the smaller the sample sizes, the greater any adverse
effects will be.) Next I want to examine some normality plots. To make
the desired plots, I first need to mouse the 8 control values into the
3rd column of the data editor, and mouse the 7 treatment group values
into the 4th column of the data editor. Then I can make a normality
plot of the data in the 3rd column, and then one of the data in the 4th
column. Since neither plot looks horribly nonnormal, it is perhaps okay
to use some inference procedures designed for samples from normal
distributions. Next,
going down from the Analyze menu, I stop on Compare
Means and then across to select Independent-Samples T Test,
which will produce a confidence interval based on Welch's method, as
well as one based on Student's method, which uses the pooled estimate of
an assumed common variance in the estimated standard error of the
difference in the two sample means. (The one based on Welch's method is
generally preferred, but with SPSS you get them both at the same time.)
Next I click height into the Test Variable(s) box, and
group into the Grouping Variable box. The latter activity
seems to create confusion for SPSS --- two question marks (??) appear
in the box. In order to move ahead with the analysis, one needs to
click on Define Groups, and then type control and
ancy into the two boxes for the groups. (Note that
control and ancy are the two labels for the cases in the
2nd column of the data editor (for the group variable).) Now clicking
Continue and then OK produces the desired output. 95%
confidence intervals are formed by default --- if some other confidence
level is desired, then one would have to click on Options and
make a change before clicking OK. Note that the two intervals
produced are nearly identical --- and are identical upon proper
rounding. Since this will not always be the case, note that the bottom
interval outputted is the one corresponding to Welch's method. (A way to
cross-check the two tests outside of SPSS is sketched at the end of this
comment.)
(Question: Does the 14 day time period start when the seed is
first planted, or when the plant comes up out of the soil?)
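If you'd like that cross-check, both tests can be run in a few lines of
Python using scipy. The numbers below are hypothetical stand-ins for the
height data, not the values from the book; equal_var=False gives Welch's
method, and equal_var=True gives Student's pooled-variance method:

    import numpy as np
    from scipy import stats

    # hypothetical stand-ins for the control and ancy (treatment) heights
    control = np.array([10.0, 13.2, 19.8, 19.3, 21.2, 13.9, 20.3, 9.6])
    ancy = np.array([13.2, 19.5, 11.0, 5.8, 12.8, 7.1, 7.7])

    # Welch's test (does not assume equal variances)
    t_w, p_w = stats.ttest_ind(control, ancy, equal_var=False)

    # Student's two-sample t test (pooled variance)
    t_s, p_s = stats.ttest_ind(control, ancy, equal_var=True)

    print("Welch:  t = %.3f, two-tailed p = %.3f" % (t_w, p_w))
    print("Pooled: t = %.3f, two-tailed p = %.3f" % (t_s, p_s))

Confidence intervals can then be assembled by hand from the estimated
standard errors and degrees of freedom sketched in the Section 7.2 and
7.3 comments above.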
- (p. 231) Make sure that you understand the use of the rounding rule
which leads
to the (2, 191) interval near the top of the page --- I think it's bad
to indicate more accuracy than is warranted.
- (p. 231, Conditions for Validity) One never expects exact
normality. Also, certain types of nonnormality aren't of much concern.
For the confidence interval procedure covered in this section, the main
concern is whether or not the skewnesses appreciably differ if the
sample sizes are small and equal or nearly equal --- it's a bit trickier
if the sample sizes are small and appreciably different. For large
sample sizes, the skewnesses aren't nearly as important (due to the
method's large sample robustness).
Section 7.4
- (pp. 234-235, The Null and Alternative Hypotheses)
This section introduces hypothesis testing, and restricts attention to a
test about the means using a two-sided alternative hypothesis (or
equivalently, a two-tailed test). The hypotheses for such a test
are given at the bottom of p. 234 and the top of p. 235. (Note:
I don't like using HA for the alternative hypothesis,
preferring to use
H1 instead. (After all, zero and one make a good
pair, while zero and A seem screwy.))
This type of
test is appropriate if the point is to determine if the data provide
statistically significant evidence that the distribution means
differ. (Later, in Sec. 7.6, a one-tailed test will be introduced,
which is appropriate if the goal is to determine if the data provide
statistically significant evidence that a particular one of the distribution
means is greater than the other distribution mean.)
- (p. 235, Example 7.10) A different set of
hypotheses is given in this example. S&W indicates that the two sets of
hypotheses aren't equivalent, but doesn't do a good job of describing
how they differ. The example deals
with a very common situation: the comparison of a treatment group
to a control group. If the experiment is done correctly, the
only difference between the two groups that is not due to random
assignment of the subjects (who are not identical) to the groups is that
one group of rats (the treatment group) was exposed to toluene. The way
the alternative hypothesis is worded in this example, the test is a test
to determine if the data provide
statistically significant evidence that the exposure to toluene had
any effect on the NE concentration. It could be that the
treatment affected the distribution in some way even though the mean of
the treatment distribution is the same as the mean of the control
distribution. If we are looking for evidence of any treatment
effect, we can say that we are testing the null hypothesis of no
treatment effect against the general alternative (of some
sort of treatment effect). The point is that testing for evidence that
the means differ is not equivalent to testing for evidence of a
treatment effect. (Some, including me, would say that the test procedure
emphasized in
this section is not appropriate if one is testing the null hypothesis of
no effect against the general alternative.)
- (p. 235, The t Statistic) I don't know why S&W put a
subscript of S on the test statistic, t. I refer to the
test statistic indicated here as Welch's test statistic, and say
that I am doing Welch's test when I use it to do a test. Others
call it the unequal variance t test, and some just call it the
two-sample t test, but doing that could lead to confusion since
another similar test procedure is also commonly referred to as the
two-sample t test. This other procedure, Student's two-sample
t test, has the same basic form for the test statistic, but uses the
pooled estimate of the variance (see p. 224) in the estimated
standard error (the denominator of the test statistic). The df used for
Student's two-sample t test can (and typically does) also differ
from what is used for Welch's test. Because there are two ways to
estimate the standard error for the difference in sample means, I
wouldn't express the test statistic in either case as is done near the
bottom of p. 235, since it isn't clear (unless one follows the
conventions of some particular book) which estimate of the standard
error is meant. *** Student's two-sample t (which uses the pooled
estimate of the variance in the estimated standard error for the
denominator) is appropriate to use if the two distribution variances can
be assumed to be equal. This may be a good assumption if the only
source of variation within a sample is due to measurement error (e.g.,
if the sample consists of several measurements of exactly the same
thing), and the
same measuring procedure is used for both samples. But if some of the
variation within a sample is due to differences in sampling units (e.g.,
people, plots of land), then maybe it isn't good to assume equal
variances. For example, if measurements are made on a sample of men and
a sample of women, there may be no good reason to assume that the degree
of variation among men is the same as the degree of variation among
women. *** If we are testing the null hypothesis of no difference (perhaps
no treatment effect) against the general alternative, then, assuming
nonnormality is not a concern, Student's two-sample t test seems
to me to be more appropriate than Welch's test, since if the null
hypothesis of no difference is true, then the distributions are the
same, and the variances are equal. (Although the variances need not be
equal if the alternative hypothesis is true, with regard to the
accuracy of a test, the concern is the sampling distribution under the
assumption that the null hypothesis is true.) *** Unfortunately, S&W in
places use the term Student's t when referring to Welch's test (see
Sec. 7.9). They don't call Welch's test Student's t test, but
they refer to using Student's t distribution to perform Welch's
test (which is accurate --- Welch's test does make use of the family of
T distributions). I think that calling Welch's test just the
two-sample t test is bad because of possible confusion with
Student's two-sample t test, and I don't think it is good to
possibly add to the confusion by overly using the term Student's
t when Welch's test is the focus.
- (p. 236, Example 7.11) The sentence "But even if the null
hypothesis H0 were true, we do not expect t to
be exactly zero; we expect the sample means to differ from one another
..." gets at a very important point --- we don't expect the sample means
to be equal even if the distribution means are equal, and so what is of
interest is whether the sample means are sufficiently different from one
another to provide strong evidence that the distribution means are not
the same. We can see something similar expressed in
Example 7.10: "or whether the truth might be that toluene has no
effect and that the observed difference ... reflects only chance
variation." It can be seen from Problem 1 of the homework that two
different random samples from the same population need not produce the
same value of a statistic, and so observing different values of some
statistical measure should not necessarily be taken as evidence of
distributional difference.
- (p. 236) Shortly after the indented statement near the middle of the
page, S&W has "We require independent random samples from normally
distributed populations." This isn't exactly correct --- if we needed
the samples to be from exactly normally distributed populations, then the
test procedure would seldom, if ever, be used. The fact is that while
the test is based on an assumption of normality, it is robust against
certain types of deviations from the normality assumption. (I'll
discuss the robustness properties of the test procedure in class.)
- (pp. 236-237) The bottom portion of p. 236 addresses the
compatibility of the data with the null hypothesis, using the observed
value of the test statistic as a measure of compatibility. In general,
with hypothesis testing one should also be concerned about the
compatibility of the data with the alternative hypothesis. However, in
this particular case, all possible values of the test statistic have the same
level of compatibility with the alternative hypothesis, and so the focus
can just be on the degree of compatibility with the null hypothesis.
The last sentence of the paragraph at the top of p. 237 is a good one:
since the density of the sampling distribution of the test statistic,
considering the case of the null hypothesis being true, is low "in the
far tails," such values of the test statistic are deemed to be
incompatible with the null hypothesis --- but such values aren't
necessarily as
unlikely if the alternative hypothesis is true.
- (p. 237, The P-Value)
I use p-value instead of P-value.
Some just use P, but I avoid that due to possible confusion with
the probability function. I don't approve of the terms double
tail and two-tailed p-value. (One can refer to the p-value
of a two-tailed test, but two-tailed p-value isn't a sensible term.)
Note that the indented statement is not a general definition of p-value --- it
just specifies what the p-value is equal to for the specific type of
two-tailed test under consideration.
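To make that concrete for this particular test: the p-value of the
two-tailed test is the probability, computed under the null sampling
distribution of the test statistic, of getting a value at least as far
from zero as the one observed. A small sketch (the values of the test
statistic and the degrees of freedom are made up):

    from scipy import stats

    t_obs = 1.71   # hypothetical observed value of the test statistic
    df = 12.8      # hypothetical Welch/Satterthwaite degrees of freedom

    # two-tailed p-value: area in both tails beyond |t_obs|
    p_two_tailed = 2 * stats.t.sf(abs(t_obs), df)
    print(p_two_tailed)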
- (p. 238, Definition (of p-value))
The gray box does not give the general definition of p-value. The
trouble with it as a definition is that for some tests the values of the
test statistic which are "at least as
extreme" may not be clearly indentifiable (e.g., if one is doing a
two-tailed test and the test statistic's null sampling distribution is
not symmetric). Nevertheless, the gray box gives a prescription of how
the p-value may be determined for many (but not all)
tests. A better definition is that the p-value is the smallest level at
which we can reject the null hypothesis in favor of the alternative with
the given data.
- (p. 238) The concept of the p-value being a measure of
compatibility of the data with the null hypothesis is useful.
- (p. 238, Drawing Conclusions from a t Test)
The first paragraph of this section addresses the issue of how small
is small when it comes to p-values --- that is, how small does a
p-value have to be to be regarded as evidence against the null
hypothesis? If the test result is going to be used to make a decision
(that is, if one thing will be done when the null hypothesis is rejected
in favor of the alternative, and another thing will be done when it is
not), then how small the p-value should be in order to reject the
null hypothesis should depend on the consequences of making an error ---
what happens if one rejects but the null hypothesis is really true, and
what is the penalty if one fails to reject but the alternative
hypothesis is really true? But in some situations, the losses due to
errors may be hard to quantify. For example, in a scientific study, it
may not be so easy to determine how strong the evidence should be in
order to make an experimental result worthy of publication. On one hand,
we don't want false conclusions to be published, but on the other hand,
we don't want the standard to be so high that possibly important results
are not reported due to some small amount of experimental noise giving
rise to some (possibly very small) doubt as to whether the observed
result is meaningful, as opposed to being just due to chance variation
(e.g., the random assignment of subjects making the treatment appear to
be effective when in fact it was just the case that stronger subjects
were randomly assigned to the treatment group and weaker ones to the control
group).
- (pp. 238-239, Example 7.13) Here's a situation where I
don't think the conclusion stated on p. 239 is that useful: the p-value
provides a measure of the strength of the evidence against the null
hypothesis, and stating a conclusion seems pointless. The smallish
p-value means that it would be rather unlikely for the observed result
to have been obtained if the null hypothesis is true, and so one might
think that there is some meaningful evidence that the alternative is
true --- but at the same time, the fact that the p-value isn't much
smaller should suggest that there is some doubt as to whether the
alternative is true. The experimental results don't prove things one
way or the other, and since we have some uncertainty, it seems better to
not state a "conclusion" but rather to let the p-value provide some
indication of the strength of the evidence ... a measure of the
uncertainty which exists. (Also see the last sentence of the first
paragraph under the Reporting the Results of a t Test
heading on p. 242.) To consider another example, suppose one
experiment resulted in a p-value of 0.049, and another resulted in a
p-value of 0.051 --- in both cases the strength of the evidence against
the null hypothesis is about the same, and it would be somewhat silly to
make a statement about rejecting the null hypothesis in one case and not
in the other. (Also, with regard to the footnote that pertains to this
example, I think it's proper to state that there is evidence of an
increase instead of evidence of a difference --- one chooses a
two-sided alternative when one is interested in making a claim of a
significant difference whichever mean is larger, but once a significant
result is obtained, it's clear which mean is larger and that can be
stated.)
- (p. 240) The first paragraph following Example 7.14 is quite
important. I don't like to use the phrase "accept the null hypothesis"
when it's not rejected, because as the next paragraph points out, the
data can be compatible with the alternative hypothesis even if it is
also compatible with the null hypothesis, and in such a case the data
doesn't strongly favor either hypothesis over the other one.
- (p. 240) The paragraph right before the Using Tables Versus
Using Technology heading is interesting. In some cases, say some
sort of comparison involving males and females, it might seem very
unlikely that the two distribution means are exactly equal, but testing
the null hypothesis that they are equal against the alternative that
they are not can still be useful --- if one does not reject the null
hypothesis it can be thought that since there is not strong evidence
suggesting that one of the means is larger than the other one, then even
if we think that they aren't exactly equal, it isn't clear which one is
the larger one ... that is, the sample variability is great enough so
that the sample mean from the distribution having the greater
distribution mean might be smaller than the other sample mean.
In a treatment versus control experiment, it may be possible that the
treatment does nothing at all, in which case the two samples can be
viewed as having come from the same distribution, and so in such a case the
distribution means would be exactly equal (and so the null hypothesis of
equal means may actually be exactly true).
- (p. 241, Example 7.15) The details of this example aren't
important, since the SPSS software can be used to supply p-values for
us.
- (p. 242) The 4th line on the page reminds me of something that I
want to inform you about: unless the observed outcome is absolutely
impossible under the null hypothesis, don't report a p-value as being
equal to zero. Even if the p-value is very small, give at least one
significant digit, or else state that the p-value is less than some
small upper bound, such as 0.001, 0.0005, etc. (Often bounding
a rather small p-value is preferable to reporting a more precise value,
because the accuracy of a really small p-value depends more heavily on
the assumptions of the test procedure (e.g., normality) being exactly
met.)
- (p. 242, Reporting the Results of a t Test) Note that
stating that a result is significant at the 5% level just means
that the p-value is less than or equal to 0.05. At times when there are
a lot of p-values at hand, for convenience one might just state which
ones are significant at a certain level, rather than giving all of the
detailed information. Also, when there is some doubt as to the accuracy
of the precise p-value (see previous comment pertaining to p. 242), but
it seems safe to assume that the p-value is rather small, one might opt
to just state that the result is significant at a certain small level
instead of reporting a p-value. I tend to prefer the term
statistically significant when referring to finding support for
the alternative hypothesis, since just using significant may be
taken to mean notable in some informal sense.
- (p. 242) As S&W points out, there is nothing particularly special
about 0.05 --- but it is commonly used as a significance level when
fixing a certain level is desired.
Section 7.5
- (p. 250) As the paragraph after Example 7.16 indicates,
while there is a relationship between a confidence interval and an
associated test result, there is an advantage in reporting both a
confidence interval and a p-value, and so I recommend that one generally
does both when reporting the results from an experiment. (What isn't
needed is a statement as to whether one can reject or not reject the
null hypothesis at a certain level, since that information and more can
be obtained from the p-value. (Also see the last sentence, not counting
the footnote, on p. 252.))
- (p. 252, Significance Level Versus P-Value)
Significance levels are important for studying the theoretical
properties of a test procedure. For example, if one wants to do a power
analysis (see Sec. 7.8), then it's necessary to specify the level of the
test being considered. But for reporting the results of a particular
experiment, I focus on the p-value, and usually don't even specify a
level for the test. (Note that one can report the p-value without
specifying a level for the test.) In cases where a level might be
specified, it's important to realize that the p-value may be less than,
equal to, or greater than the stated level of the test. The level of a
test pertains to the performance of a test having a predetermined
rejection criterion, and should be set (if set at all) before one even
looks at the data. The p-value results from the data from a particular
experiment --- it gives the strength of the evidence against the null
hypothesis.
- (p. 253, Table 7.10) This is an important table --- I'll
refer to it more than a few times in class. I don't think that there is
anything hard to understand about the table, but please make sure
that you take the time to understand it as soon as possible.
- (p. 253, Example 7.19) The dilemma of whether to reject the
null hypothesis or not in the case of a marginal p-value becomes less of
an important issue if the sample size is fairly large, since with a
large sample size an appreciable treatment effect should result in a
small p-value with high probability. But with a small sample size one
has to worry that if the null hypothesis isn't rejected when the p-value
is marginal, it could be that the experimental noise resulting from the
small sample has resulted in a decent treatment effect not being
statistically significant. (Comment: I think a one-tailed test
(see Sec. 7.6) would be better here --- it would increase the power of
detecting an important treatment effect.)
- (p. 254) It's important to realize that the two hypotheses are not
treated the same way --- in a sense we give the null hypothesis the
benefit of the doubt, in that we reject the null in favor of the
alternative only if the data is rather incompatible with the null
hypothesis. But if we don't have a tough standard for a rejection,
then a rejection could occur with a relatively high probability even
though the null hypothesis is true, and upon realizing that, it can be
concluded that a rejection of the null hypothesis doesn't really mean
much in such a case
(since a rejection could occur if the null hypothesis is false and
should be rejected, or a rejection could easily occur if the null
hypothesis is true and shouldn't be rejected). Only by requiring that
the probability of a type I error be small --- which is equivalent to
requiring that the p-value be rather small in order to reject --- can we
have a meaningful test procedure ... one that can sometimes lead to a
meaningful claim of significant evidence in favor of the alternative hypothesis.
But we must also realize that by requiring that the probability of a
type I error be small, we may wind up with a test procedure for which
the probability of a type II error is large --- but there may be little
that can be done about that unless the sample sizes are made larger.
- (p. 254, power) While undergraduate books tend to use
beta for the probability of a type II error, a lot of
graduate-level books use beta for the power, which is quite
different --- I'm used to using beta for power ... specifically,
I use beta for the power function (noting that for most
tests there isn't just a single value for a power, but rather the power
usually depends upon the magnitude of the treatment effect). It's
important to note that the power of a test depends upon the sample
size(s) --- if one doesn't have enough observations in an experiment,
the power to detect an important treatment effect may be rather small.
(I've seen this phenomenon work against many students in biology and
environmental science during my years at GMU --- their sample sizes were
too small, and they wound up lacking statistically significant evidence
to support the hypothesis that they wanted to support ... due to there
being too much uncertainty in the results when the sample sizes are
small ... the experimental noise makes it so it's hard to say that the
data are incompatible with the null hypothesis.)
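Here is a rough sketch of such a power calculation, using the statsmodels
package (my choice of tool, not something used in S&W); the effect size
below is the difference in means expressed in standard deviation units,
and the specific numbers are arbitrary:

    from statsmodels.stats.power import TTestIndPower

    analysis = TTestIndPower()

    # power of a two-sided, level 0.05 two-sample t test for an effect
    # size of 0.8 standard deviations, with 10 observations per group
    print(analysis.power(effect_size=0.8, nobs1=10, alpha=0.05,
                         ratio=1.0, alternative='two-sided'))

    # sample size per group needed to reach power 0.80 for the same effect
    print(analysis.solve_power(effect_size=0.8, power=0.80, alpha=0.05,
                               ratio=1.0, alternative='two-sided'))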
Section 7.6
- (pp. 256-257, Note) I prefer stating hypotheses for
a one-sided test as is described in the note --- have the pair of
hypotheses cover every possibility.
- (p. 258) 5 lines from the bottom of the page, we could also bound
the p-value from above by 0.5. Some would argue that if the p-value is
greater than 0.2, it really doesn't matter what value it is, but
bounding it from above by 0.5 does indicate that the estimated
difference in means is in the direction corresponding to the alternative
hypothesis.
- (p. 259, 1st paragraph) Note that the conclusion from a two-tailed
test can be directional if one rejects the null hypothesis.
- (pp. 261-262, Example 7.24) This example shows that if you
always first look at the data and then decide to do a one-sided test to
determine if there is statistically significant evidence that the means
differ in the way suggested by the data, your type I error rate will be
twice the nominal level of the test. In terms of p-values, your p-value
would always be half of what it should be. In order to prevent
expressing too strong of a result, you should decide what type of
alternative hypothesis to use before looking at the data in any way.
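A small simulation illustrates the point. Both samples below come from
the same normal distribution (so the null hypothesis is true), and the
one-sided alternative is chosen after looking at the data; the estimated
rejection rate comes out near 0.10 rather than the nominal 0.05. The
distributions, sample sizes, and seed are arbitrary choices of mine:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(1)
    n_reps, n = 10000, 15
    rejections = 0

    for _ in range(n_reps):
        x = rng.normal(0, 1, n)   # both samples come from the same
        y = rng.normal(0, 1, n)   # distribution, so H0 is true

        # pick the one-sided alternative suggested by the data
        alt = 'greater' if x.mean() > y.mean() else 'less'
        p = stats.ttest_ind(x, y, equal_var=False, alternative=alt).pvalue
        if p <= 0.05:
            rejections += 1

    print(rejections / n_reps)   # roughly 0.10, twice the nominal level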
- (p. 262, Computer note) I don't see a way to get
SPSS to report the p-value for a one-tailed test. (Sometimes I am
tempted to check to see if I've somehow installed SPSS Jr. by
mistake, even though there isn't such a thing. My guess is that later
on I'll see that SPSS has some nice things about it, but so far I've
been disappointed in that it doesn't have some basic things that I think
any statistical software package ought to have. But one thing good
about it is that it's easy to use. In case you're wondering how the
choice of SPSS was made, it came after discussion with faculty involved
in environmental science and biology, who were consulted when creating
STAT 535. Originally, the plan was to use Stata, because one faculty
member really pushed for it, but since he seems to be out of the picture
at GMU now, the decision was made to use SPSS because it is more
commonly used and is easy to use.) We can get the p-value for a one-tailed
test about the means from the p-value which is reported for a
two-tailed test, noting that if the sample means are in the order
indicated by the alternative hypothesis, the p-value for a one-tailed
test is just one half the value of the p-value for a two-tailed test,
and otherwise, if the sample means are not in the order indicated by the
alternative hypothesis, the p-value for a one-tailed test is 1 -
p/2, where p is the p-value from a two-tailed test.
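A small sketch of that conversion (the two-tailed p-value and the sample
means below are hypothetical):

    # hypothetical two-tailed p-value reported by SPSS, and sample means
    p_two_tailed = 0.084
    xbar_1, xbar_2 = 24.3, 21.7

    # suppose the alternative hypothesis is that mu_1 > mu_2
    if xbar_1 > xbar_2:
        # sample means are in the order indicated by the alternative
        p_one_tailed = p_two_tailed / 2
    else:
        p_one_tailed = 1 - p_two_tailed / 2

    print(p_one_tailed)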
Section 7.7
- (p. 267, Significant Difference Versus Important Difference)
The first paragraph gives examples of the use of the significant label
in statistical analysis. I prefer to use the term statistically
significant. For example, in the last sentence of the paragraph, I'd
instead use: The data do not provide statistically significant
evidence of toxicity.
- (pp. 267-268, Significant Difference Versus Important Difference)
The point is that a statistically significant difference need not be a
large difference, and could be such a small difference as to be
unimportant. (Because of this, one should always report an estimate of
the difference (perhaps using a confidence interval) in addition to the
p-value.)
On the other hand, insufficient data may prevent one from
claiming that an important difference is statistically significant.
- (pp. 268-269, Effect Size) There is no "magic number" that
has to be exceeded in order for an effect size to correspond to an
important difference --- it depends upon the particular situation.
In many fields, the effect size is not commonly used.
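For completeness, here is a sketch of one common sample version of the
effect size (the difference in sample means divided by the pooled sample
standard deviation, essentially Cohen's d); the data are hypothetical:

    import numpy as np

    # hypothetical samples, for illustration only
    x = np.array([10.2, 11.1, 9.8, 10.6, 10.9])
    y = np.array([9.1, 9.9, 10.4, 8.8, 9.5, 9.7])

    n1, n2 = len(x), len(y)
    sp = np.sqrt(((n1 - 1) * x.var(ddof=1) + (n2 - 1) * y.var(ddof=1))
                 / (n1 + n2 - 2))

    # estimated effect size: difference in means, in standard deviation units
    effect_size = abs(x.mean() - y.mean()) / sp
    print(effect_size)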
Section 7.8
- (p. 273) The first two paragraphs of the section are important.
The first paragraph gives a definition of power. Some students tend to
think of power as 1 minus the probability of a type II error
because a lot of books first introduce power like S&W does on p. 254,
but I think it's better to think of power as it's described in the first
paragraph of this section.
- (p. 273) Note that to maximize power given a fixed number of
subjects that can be assigned to either of two groups (say treatment and
control, or treatment 1 and treatment 2), it's typically best to divide
the subjects equally. If you have 20 subjects, don't put 15 in the
treatment group and 5 in the control group because you think the
treatment is of more interest than the control, because doing so can
hurt the power ... and if one has nonnormality to deal with, there is
less robustness when one of the sample sizes is so small.
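The statsmodels power calculation sketched in the p. 254 comment above
can be used to see the cost of an unbalanced split. With 20 subjects
total and an assumed effect size of one standard deviation (an arbitrary
choice of mine), the 10 + 10 split has higher power than the 15 + 5
split:

    from statsmodels.stats.power import TTestIndPower

    analysis = TTestIndPower()

    # 20 subjects total, effect size of 1.0 standard deviation, level 0.05
    balanced = analysis.power(effect_size=1.0, nobs1=10, alpha=0.05,
                              ratio=1.0)
    unbalanced = analysis.power(effect_size=1.0, nobs1=15, alpha=0.05,
                                ratio=5/15)

    print(balanced, unbalanced)   # the balanced split wins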
- (pp. 273-274, Dependence on sigma)
Note that power can be increased if the experimental noise is decreased.
- (p. 277, Example 7.35) The last paragraph (the note at the
end of the example) makes an
important point: when the sample sizes are large so that the power to
detect an important difference is high, then a failure to reject can
provide some meaningful information --- but if the sample sizes are
small, a failure to reject could just be due to low power, even though
the difference in means is rather large.
- (Table 5) This table is nice --- a lot of statistics books
don't include such a table. But when it gives values of 3 and 4 for
sample sizes, I wouldn't want to ever depend on using such small sample
sizes. Nonnormality can hurt the power, and it can also hurt the
validity of the test featured in Ch. 7. If you use such small sample
sizes, you're not going to be sure that your test results are reliable
--- there is no way to reasonably check the assumptions needed for
validity when the sample sizes are so small.
Section 7.9
I don't like the use of the term Student's t in this
section, since there is a testing procedure which is properly referred
to as Student's (two-sample) t test which is different from the
testing procedure emphasized in this chapter, which is Welch's test.
Welch's test uses a T distribution as an approximation for the
(null) sampling distribution of the test statistic, and some call it the
unequal variance two-sample t test, and I suppose that these
things have created a bit of confusion.
It used to be that Student's two-sample t test, which uses a
pooled estimate of the assumed common variance in the
estimated standard error of the difference in the sample means, was the
method emphasized in most elementary statistics books. But in more
recent years, Welch's test (which is seldom called by that name in
books) has been getting more support from textbook authors.
(Comment: I often think that, for the most part, the wrong people
write statistics textbooks. Many authors of introductory statistics books
are people who are not at major research universities and who tend to
teach low-level classes a lot, and graduate-level classes rarely, if
ever. My guess is that such people don't always keep up with
the latest and the greatest when it comes to statistics.)
In general, Welch's test should be the one which is emphasized more,
because if Welch's test is used when Student's t test should have
been used, typically little harm is done, but if Student's test is used
when Welch's test should have been used, some rather bad things can
happen (in some cases, the test can reject the null hypothesis with a
rather large probability if the null hypothesis is true, and in other
cases, the test can have rather low power to reject the null hypothesis
when it should reject the null hypothesis).
- (p. 280, Conditions) Part (a) of the 1st condition states
that the "populations must be large." Sometimes the "populations" are
hypothetical. For example, in testing the accuracy of
a new type of heat-seeking missile, one might test 25 missiles that are
built in a certain way. They may not be randomly chosen from a larger
population of missiles (since a large number of missiles may not be built
before some are tested), but we may view the 25 missiles as being
representative of other missiles that could be built --- in a case like
this, it may be better to think of the 25 observations based on the
missiles (maybe the observation is whether or not each hit its target) as
being the observed values of random variables having a certain
distribution, and the goal is to make inferences about this unknown
distribution (although an alternative viewpoint would be to say that
inferences are to be made about a hypothetical population of missiles
which could be built in the future, but with such a viewpoint, we don't
have that the 25 missiles used in the study were randomly drawn from the
population).
- (p. 280, Conditions) Part (a) of the 2nd condition states
that "the population distributions must be approximately normal" if
the sample sizes
are small. In some cases the test procedure works quite well even if
the distributions are rather nonnormal, with an example being if the
sample sizes are equal and both
distributions are skewed in exactly the same way (same direction, and to
the same degree).
- (p. 280, Conditions) Part (b) of the 2nd condition states
that "the population distributions need not be approximately normal" if the sample sizes
are large, and then a follow-up comment indicates that in many cases, 20
may qualify as large. But in some situations, even samples of size 50
may not be large enough --- it all depends on the nature of the
nonnormality. The worst cases tend to be ones for which the
distributions are strongly skewed in different directions, or perhaps one
distribution is strongly skewed, and the other isn't. When both
distributions are skewed about the same way, there is a cancellation
effect due to the fact that one sample mean is being subtracted from
another, but if they are skewed in opposite directions, the subtraction
in the numerator of the test statistic can cause the sampling
distribution of the numerator to be appreciably skewed (because sometimes
samples of size 40 aren't large enough to have the "central limit theorem
effect" kick in to a large enough degree).
- (p. 280, last paragraph) I don't think a histogram and a
stem-and-leaf display add anything to the assessment of approximate
normality, if one knows how to interpret a normal probability plot (aka,
probit plot).
- (p. 280, last sentence) The truth of this sentence depends on the
nature of the skewness --- if the distributions are skewed too
differently, and the sample sizes are rather small and perhaps not
equal, then skewness could be a problem. The skewness issue is more
important if a one-tailed test is being done, since for two-tailed tests
there is a type of cancellation effect (different from the cancellation
effect referred to above) that reduces concern about validity (but one
can still have screwy power characteristics for the test).
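One way to get a feel for how much trouble a particular pattern of
skewness causes is to simulate the rejection rates of Welch's test under
the null hypothesis. The sketch below uses distributions skewed in
opposite directions and reports both the two-tailed and one-tailed
rejection rates; the particular distributions, sample size, and seed are
arbitrary choices of mine, and the results will vary with those choices:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(2)
    n_reps, n = 10000, 15
    reject_two, reject_one = 0, 0

    for _ in range(n_reps):
        x = rng.exponential(1.0, n)         # skewed to the right, mean 1
        y = 2.0 - rng.exponential(1.0, n)   # skewed to the left, mean 1

        # the null hypothesis of equal means is true in both cases below
        p2 = stats.ttest_ind(x, y, equal_var=False).pvalue
        p1 = stats.ttest_ind(x, y, equal_var=False,
                             alternative='greater').pvalue
        reject_two += (p2 <= 0.05)
        reject_one += (p1 <= 0.05)

    # compare the estimated rejection rates with the nominal 0.05
    print(reject_two / n_reps, reject_one / n_reps)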
- (p. 281, Consequences of Inappropriate Use of Student's t)
It is noted that "long straggly tails" (a phrase which I don't care for)
can hurt the power of the test. Not only that, but in cases where the
null hypothesis is true, they can lead to an inappropriately high type I
error rate. (So it's the worst of both worlds --- not rejecting with
high probability when
rejection should occur, and rejection with too high of a probability
when rejection shouldn't occur.)
- (pp. 281-283, Example 7.36) It should be noted that the
means of the log-transformed random variables can be equal while the
means of the untransformed random variables differ, or the
means of the log-transformed random variables can differ while the
means of the untransformed random variables are equal. So the results
from testing the transformed data cannot be safely applied to the
distributions of the original data, which is sometimes quite undesirable.
That is, by transforming to approximate normality, you can sometimes
feel comfortable in using Welch's test, but you may wind up reaching a
conclusion about a pair of distributions that aren't the ones you'd like
to reach a conclusion about.
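A small numerical illustration of the first point: if a random variable
is lognormal, its mean on the original scale is exp(mu + sigma^2/2),
where mu and sigma^2 are the mean and variance on the log scale, so two
distributions can agree in their log-scale means and still have quite
different means on the original scale (the parameter values below are
arbitrary):

    import numpy as np

    # two lognormal distributions with the same mean on the log scale
    mu = 1.0
    sigma_1, sigma_2 = 0.5, 1.5

    # means on the original (untransformed) scale differ
    mean_1 = np.exp(mu + sigma_1**2 / 2)
    mean_2 = np.exp(mu + sigma_2**2 / 2)
    print(mean_1, mean_2)   # about 3.08 versus about 8.37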
Section 7.10
- (p. 285, How is H0 Chosen, 1st paragraph)
I don't agree with a lot of what's in this paragraph. One should first
determine what one wants to see if there is statistically significant
evidence to support, and this should be the alternative (or
research) hypothesis. Then the null hypothesis is just
everything else. In the middle of the paragraph, I don't think it's
right that "in the absence of evidence, we would expect the two drugs to be
equally effective." In the absence of evidence, why expect anything in
particular? (One might hazard a guess, but it'd just be a guess.)
In the case of a new drug/method/whatever, in some cases the alternative
should be that the new thing is better --- that's what we want to see if
we have significant evidence of.
- (p. 285, How is H0 Chosen, 2nd paragraph)
I don't agree with a lot of what's in this paragraph.
- (pp. 285-286, Another Look at P-Value)
The phrase "the P-value of the data" isn't commonly used.
More commonly used is the p-value of the test, but in such a case,
the test refers to the test done on a particular set of data.
Also, none of the definitions given in this subsection are actually
general definitions of p-value. The closest one to a general definition
(and I guess it qualifies as a suitable definition --- just a bit
awkward/informal, but nevertheless expresses the correct point) is given
near the top of p. 286: the indented portion of the 2nd paragraph on
that page. Finally, the last 10 lines of this subsection (near the
middle of p. 286) state some important points --- so read and learn!
- (p. 286, footnote)
I would say that Bayesian methods are seldom appropriate, and even when
they are, I have a hard time accepting that the probability that the
null hypothesis is true makes any sense (since really, the null
hypothesis is either true or it's not).
Section 7.11
- (p. 288) The Wilcoxon version of the test is called the Wilcoxon rank sum test; it is completely equivalent to the
Mann-Whitney test, in that although the test statistic is computed differently, one would always get the same p-value
whichever version of the test is used. In Ch. 9, we'll encounter the Wilcoxon signed-rank test,
which is used for different situations.
- (p. 288) The reason given for why it is called a nonparametric test is not good. It's a nonparametric test because we don't have
to assume any particular parametric model (like a pair of normal distributions).
- (p. 289) The pertinent null hypothesis is that the two distributions are identical, and it is tested against the general
alternative that the two distributions differ. In some situations (if it can be believed that either the two distributions are the
same, or, if they differ, that one is stochastically larger than the other), the W-M-W test can be used as a test about the distribution
means. In a more limited set of circumstances, the test can be viewed as a test about distribution medians. (Some statistical
software packages (not SPSS) make it seem as though it is a test about the medians, but this just isn't true --- one has to add extra
assumptions for it to be viewed as a test about the medians.)
- (p. 289, near the bottom of the page) It's not at all clear that the gap sample is slightly skewed to the left (i.e., negatively
skewed). (Note: The probit plots on p. 290 have the axes reversed from the way SPSS produces them, and from the way I describe
them in class. So the guidelines I give in class cannot be applied here.)
- (p. 290, Method) It's not necessarily true that the test statistic measures the degree of separation or shift, since it's
not necessarily the case that one distribution is merely shifted up or down relative to the other one --- the two distributions can
have very different shapes.
- (pp. 291-292) The tables on p. 291 and p. 292 have the "One tail" and "Two tails" labels on the wrong rows. However, I recommend that
you ignore these tables altogether! The way S&W describes how to do the test is nonstandard, and I think you'll be better off doing
it as I describe in class and using the tables I supply in class (if the sample sizes are less than 10, unless SPSS can
also give an exact p-value). (One cannot achieve good accuracy using the tables in S&W.)
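If you want a cross-check on SPSS, the test is also available in Python's scipy package, which can report an exact p-value
for small samples without ties; the data below are hypothetical:

    import numpy as np
    from scipy import stats

    # hypothetical samples, for illustration only
    x = np.array([4.3, 5.1, 4.8, 5.9, 6.2, 4.1])
    y = np.array([5.8, 6.4, 7.0, 6.6, 5.5])

    # Wilcoxon-Mann-Whitney test of identical distributions against
    # the general two-sided alternative
    res = stats.mannwhitneyu(x, y, alternative='two-sided', method='exact')
    print(res.statistic, res.pvalue)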
- (pp. 295-296, The Wilcoxon-Mann-Whitney Test Versus the t Test) The book is wrong in that the two tests are not really
aimed at answering the same question. Welch's test is a test about the distribution means, whereas the W-M-W test is a test for the
general two-sample problem (testing equal distributions against the general alternative) that can sometimes be used as a test about
distribution means. In cases where they can both be used for testing hypotheses about the means, neither one dominates the other ---
in some cases Welch's test is more powerful and in other cases the W-M-W test is more powerful.