Some Comments about Chapter 7 of Samuels & Witmer
Section 7.1
- (pp. 219-220, Example 7.1) The histograms, and the near
equality of the two sample standard deviations, suggest that a shift
model might be appropriate. With a shift model, it is assumed that
the two densities have exactly the same shape, but that one may be
shifted relative to the other. Having looked at a lot of data sets over
the years, I think that most of the time a shift model is not
appropriate --- usually if the means of two distributions differ, other
features, like variance and degree of skewness, differ as well.
A nice thing about being able to assume a shift model is that if it is
also the case that the
sample sizes are equal, or not too different, then some statistical
procedures based on an assumption of normality are fairly robust against
violations of the normality assumption.
- (p. 220, Example 7.2) The two samples here exhibit a
pattern that is relatively common --- they have the same general shape
(both skewed in the same direction), but have different variances.
(I wonder how they injected the flies.)
- (p. 221, Notation) It is common to use x for one
sample, and y for the other.
Section 7.2
- (p. 222, about middle of page) The difference in sample means is
a point estimate for the difference in distribution means.
- (p. 222, Example 7.3) The first paragraph of the example
refers to sample sizes of 8 and 7, but the table shows sample sizes of 7
and 5. The chapter notes on p. 645 don't provide any explanation for
the differences in sample sizes.
- (p. 223, gray box) The actual standard error for the difference
in the sample means would be the expression given in the box with the
sample variances replaced by the true variances. (The actual standard
error wouldn't need to be given by a special definition --- it results
from the general definition of the standard error of a statistic.)
The expression given
in the gray box is the estimated standard error.
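To make the distinction concrete, here is a minimal sketch in Python
(the sample values are made up purely for illustration) of the estimated
standard error given in the gray box:

    import numpy as np

    # hypothetical samples, for illustration only
    x = np.array([10.2, 11.1, 9.8, 10.6, 10.9])
    y = np.array([9.1, 9.9, 10.4, 8.8, 9.5, 9.7])

    n1, n2 = len(x), len(y)
    s1, s2 = x.std(ddof=1), y.std(ddof=1)   # sample standard deviations

    # estimated standard error of (xbar - ybar): plug in the sample variances
    se_hat = np.sqrt(s1**2 / n1 + s2**2 / n2)
    print(se_hat)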
- (p. 223, about 60% of the way down the page) The sentence that
begins "Whether we add ..." states a good point: both sample means have
some "noise" associated with them, and the "nosie" doesn't cancel just
because of the subtraction of one sample mean from the other.
- (p. 225, last paragraph) I agree that it's generally good to
allow for the variances being unequal. If one assumes
heteroscedasticity (unequal variances) and the variances are really
equal, little harm is done --- the method which allows for unequal
variances isn't optimal if the variances are equal, but it generally
produces about the same result as the slightly better method which
is based on an assumption of equal variances. But if one assumes
homoscedasticity (equal variances) and the variances really differ, then
the method which is based on an assumption of equal variances can
produce an appreciably inferior result. Still, the pooled
estimate of the variance (see the sketch below) is appropriate for some
settings in which there is reason to believe that the variances are
really equal. For example,
if the only source of variation is measurement error (i.e., no natural
population variability), and the same measuring instrument is used for
both samples, then it may be reasonable to assume that the true
distribution variances are the same.
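For reference, here is a similar sketch (the same kind of hypothetical
samples as above) of the pooled estimate of the variance from p. 224,
together with the estimated standard error that Student's two-sample t
procedure uses:

    import numpy as np

    # hypothetical samples, for illustration only
    x = np.array([10.2, 11.1, 9.8, 10.6, 10.9])
    y = np.array([9.1, 9.9, 10.4, 8.8, 9.5, 9.7])

    n1, n2 = len(x), len(y)
    s1_sq, s2_sq = x.var(ddof=1), y.var(ddof=1)   # sample variances

    # pooled estimate of the (assumed common) variance
    sp_sq = ((n1 - 1) * s1_sq + (n2 - 1) * s2_sq) / (n1 + n2 - 2)

    # estimated standard error used by Student's two-sample t procedure
    se_pooled = np.sqrt(sp_sq * (1 / n1 + 1 / n2))
    print(sp_sq, se_pooled)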
Section 7.3
- (p. 227) SPSS will compute the df given by expression (7.1)
for us. When I compute a confidence interval based on the degrees of
freedom given by (7.1) and the estimated standard error given in the
gray box on p. 223, I say that I am using Welch's method. (The
footnote on p. 227 also refers to Satterthwaite's method. Welch is the
one who proposed using the estimated standard error given on p. 223 for
two-sample tests and confidence intervals for the difference in means.
One can say that the degrees of freedom formula results from
Satterthwaite's method --- Satterthwaite developed a general scheme to
determine the appropriate degrees of freedom for settings in which
variances are not assumed to be equal.)
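For anyone who wants to check the arithmetic behind (7.1), here is a
minimal sketch of the Welch-Satterthwaite degrees of freedom, again with
made-up sample values:

    import numpy as np

    # hypothetical samples, for illustration only
    x = np.array([10.2, 11.1, 9.8, 10.6, 10.9])
    y = np.array([9.1, 9.9, 10.4, 8.8, 9.5, 9.7])

    n1, n2 = len(x), len(y)
    v1, v2 = x.var(ddof=1) / n1, y.var(ddof=1) / n2   # s_i^2 / n_i

    # Welch-Satterthwaite approximate degrees of freedom (expression (7.1))
    df = (v1 + v2)**2 / (v1**2 / (n1 - 1) + v2**2 / (n2 - 1))
    print(df)   # typically not an integer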
- (pp. 228-229, Example 7.7) I recommend that you try
to duplicate the results of this example using SPSS. A first step is to
read the data into the Data Editor. (I find that I always have to try
twice to read the data from the CD that came with the book --- the
first time produces some sort of complaint, but when I try again to open
the data file it works. Is anyone else having this problem? Is anyone
not having this problem?) Once the data is in, I like to look at
some plots and summary statistics before using any inference methods.
Going down from the Analyze menu, I stop on Descriptive
Statistics and then across to select Explore.
I click height into the Dependent List and group
into the Factor List, and then click OK. This produces
some summary statistics for each of the two samples, and also some
plots. With such small sample sizes, it's a bit worrisome that the
estimated skewnesses differ by more than 1 (one is positive, and the
other is negative), but at the same time, with such small sample sizes
these estimates may be way off. (Note that small sample sizes are bad
--- not only is it hard to check to determine if there are serious
violations of the assumption of (approximate) normality, but if there is
a problem then the smaller the sample sizes, the greater any adverse
effects will be.) Next I want to examine some normality plots. To make
the desired plots, I first need to mouse the 8 control values into the
3rd column of the data editor, and mouse the 7 treatment group values
into the 4th column of the data editor. Then I can make a normality
plot of the data in the 3rd column, and then one of the data in the 4th
column. Since neither plot looks horribly nonnormal, it is perhaps okay
to use some inference procedures designed for samples from normal
distributions. Next,
going down from the Analyze menu, I stop on Compare
Means and then across to select Independent-Samples T Test,
which will produce a confidence interval based on Welch's method, as
well as one based on Student's method, which uses the pooled estimate of
an assumed common variance in the estimated standard error of the
difference in the two sample means. (The one based on Welch's method is
generally preferred, but with SPSS you get them both at the same time.)
Next I click height into the Test Variable(s) box, and
group into the Grouping Variable box. The latter activity
seems to create confusion for SPSS --- two question marks (??) appear
in the box. In order to move ahead with the analysis, one needs to
click on Define Groups, and then type control and
ancy into the two boxes for the groups. (Note that
control and ancy are the two labels for the cases in the
2nd column of the data editor (for the group variable).) Now clicking
Continue and then OK produces the desired output. 95%
confidence intervals are formed by default --- if some other confidence
level is desired, then one would have to click on Options and
make a change before clicking OK. Note that the two intervals
produced are nearly identical --- and are identical upon proper
rounding. Since this will not always be the case, note that the bottom
interval outputted is the one corresponding to Welch's method. (A way to
cross-check the two tests outside of SPSS is sketched at the end of this
comment.)
(Question: Does the 14 day time period start when the seed is
first planted, or when the plant comes up out of the soil?)
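If you'd like that cross-check, both tests can be run in a few lines of
Python using scipy. The numbers below are hypothetical stand-ins for the
height data, not the values from the book; equal_var=False gives Welch's
method, and equal_var=True gives Student's pooled-variance method:

    import numpy as np
    from scipy import stats

    # hypothetical stand-ins for the control and ancy (treatment) heights
    control = np.array([10.0, 13.2, 19.8, 19.3, 21.2, 13.9, 20.3, 9.6])
    ancy = np.array([13.2, 19.5, 11.0, 5.8, 12.8, 7.1, 7.7])

    # Welch's test (does not assume equal variances)
    t_w, p_w = stats.ttest_ind(control, ancy, equal_var=False)

    # Student's two-sample t test (pooled variance)
    t_s, p_s = stats.ttest_ind(control, ancy, equal_var=True)

    print("Welch:  t = %.3f, two-tailed p = %.3f" % (t_w, p_w))
    print("Pooled: t = %.3f, two-tailed p = %.3f" % (t_s, p_s))

Confidence intervals can then be assembled by hand from the estimated
standard errors and degrees of freedom sketched in the Section 7.2 and
7.3 comments above.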
- (p. 231) Make sure that you understand the use of the rounding rule
which leads
to the (2, 191) interval near the top of the page --- I think it's bad
to indicate more accuracy than is warranted.
- (p. 231, Conditions for Validity) One never expects exact
normality. Also, certain types of nonnormality aren't of much concern.
For the confidence interval procedure covered in this section, the main
concern is whether or not the skewnesses appreciably differ if the
sample sizes are small and equal or nearly equal --- it's a bit trickier
if the sample sizes are small and appreciably different. For large
sample sizes, the skewnesses aren't nearly as important (due to the
method's large sample robustness).
Section 7.4
- (pp. 234-235, The Null and Alternative Hypotheses)
This section introduces hypothesis testing, and restricts attention to a
test about the means using a two-sided alternative hypothesis (or
equivalently, a two-tailed test). The hypotheses for such a test
are given at the bottom of p. 234 and the top of p. 235. (Note:
I don't like using HA for the alternative hypothesis,
preferring to use
H1 instead. (After all, zero and one make a good
pair, while zero and A seem screwy.))
This type of
test is appropriate if the point is to determine if the data provide
statistically significant evidence that the distribution means
differ. (Later, in Sec. 7.6, a one-tailed test will be introduced,
which is appropriate if the goal is to determine if the data provide
statistically significant evidence that a particular one of the distribution
means is greater than the other distribution mean.)
- (p. 235, Example 7.10) A different set of
hypotheses is given in this example. S&W indicates that the two sets of
hypotheses aren't equivalent, but doesn't do a good job of describing
how they differ. The example deals
with a very common situation: the comparison of a treatment group
to a control group. If the experiment is done correctly, the
only difference between the two groups that is not due to random
assignment of the subjects (who are not identical) to the groups is that
one group of rats (the treatment group) was exposed to toluene. The way
the alternative hypothesis is worded in this example, the test is a test
to determine if the data provide
statistically significant evidence that the exposure to toluene had
any effect on the NE concentration. It could be that the
treatment affected the distribution in some way even though the mean of
the treatment distribution is the same as the mean of the control
distribution. If we are looking for evidence of any treatment
effect, we can say that we are testing the null hypothesis of no
treatment effect against the general alternative (of some
sort of treatment effect). The point is that testing for evidence that
the means differ is not equivalent to testing for evidence of a
treatment effect. (Some, including me, would say that the test procedure
emphasized in
this section is not appropriate if one is testing the null hypothesis of
no effect against the general alternative.)
- (p. 235, The t Statistic) I don't know why S&W put a
subscript of S on the test statistic, t. I refer to the
test statistic indicated here as Welch's test statistic, and say
that I am doing Welch's test when I use it to do a test. Others
call it the unequal variance t test, and some just call it the
two-sample t test, but doing that could lead to confusion since
another similar test procedure is also commonly referred to as the
two-sample t test. This other procedure, Student's two-sample
t test, has the same basic form for the test statistic, but uses the
pooled estimate of the variance (see p. 224) in the estimated
standard error (the denominator of the test statistic). The df used for
Student's two-sample t test can (and typically does) also differ
from what is used for Welch's test. Because there are two ways to
estimate the standard error for the difference in sample means, I
wouldn't express the test statistic in either case as is done near the
bottom of p. 235, since it isn't clear (unless one follows the
conventions of some particular book) which estimate of the standard
error is meant. *** Student's two-sample t (which uses the pooled
estimate of the variance in the estimated standard error for the
denominator) is appropriate to use if the two distribution variances can
be assumed to be equal. This may be a good assumption if the only
source of variation within a sample is due to measurement error (e.g.,
if the sample consists of several measurements of exactly the same
thing), and the
same measuring procedure is used for both samples. But if some of the
variation within a sample is due to differences in sampling units (e.g.,
people, plots of land), then maybe it isn't good to assume equal
variances. For example, if measurements are made on a sample of men and
a sample of women, there may be no good reason to assume that the degree
of variation among men is the same as the degree of variation among
women. *** If we are testing the null hypothesis of no difference (perhaps
no treatment effect) against the general alternative, then, assuming
nonnormality is not a concern, Student's two-sample t test seems
to me to be more appropriate than Welch's test, since if the null
hypothesis of no difference is true, then the distributions are the
same, and the variances are equal. (Although the variances need not be
equal if the alternative hypothesis is true, with regard to the
accuracy of a test, the concern is the sampling distribution under the
assumption that the null hypothesis is true.) *** Unfortunately, S&W in
places use the term Student's t when referring to Welch's test (see
Sec. 7.9). They don't call Welch's test Student's t test, but
they refer to using Student's t distribution to perform Welch's
test (which is accurate --- Welch's test does make use of the family of
T distributions). I think that calling Welch's test just the
two-sample t test is bad because of possible confusion with
Student's two-sample t test, and I don't think it is good to
possibly add to the confusion by overly using the term Student's
t when Welch's test is the focus.
- (p. 236, Example 7.11) The sentence "But even if the null
hypothesis H0 were true, we do not expect t to
be exactly zero; we expect the sample means to differ from one another
..." gets at a very important point --- we don't expect the sample means
to be equal even if the distribution means are equal, and so what is of
interest is whether the sample means are sufficiently different from one
another to provide strong evidence that the distribution means are not
the same. We can see something similar expressed in
Example 7.10: "or whether the truth might be that toluene has no
effect and that the observed difference ... reflects only chance
variation." It can be seen from Problem 1 of the homework that two
different random samples from the same population need not produce the
same value of a statistic, and so observing different values of some
statistical measure should not necessarily be taken as evidence of
distributional difference.
- (p. 236) Shortly after the indented statement near the middle of the
page, S&W has "We require independent random samples from normally
distributed populations." This isn't exactly correct --- if we needed
the samples to be from exactly normally distributed populations, then the
test procedure would seldom, if ever, be used. The fact is that while
the test is based on an assumption of normality, it is robust against
certain types of deviations from the normality assumption. (I'll
discuss the robustness properties of the test procedure in class.)
- (pp. 236-237) The bottom portion of p. 236 addresses the
compatibility of the data with the null hypothesis, using the observed
value of the test statistic as a measure of compatibility. In general,
with hypothesis testing one should also be concerned about the
compatibility of the data with the alternative hypothesis. However, in
this particular case, all possible values of the test statistic have the same
level of compatibility with the alternative hypothesis, and so the focus
can just be on the degree of compatibility with the null hypothesis.
The last sentence of the paragraph at the top of p. 237 is a good one:
since the density of the sampling distribution of the test statistic,
considering the case of the null hypothesis being true, is low "in the
far tails," such values of the test statistic are deemed to be
incompatible with the null hypothesis --- but such values aren't
necessarily as
unlikely if the alternative hypothesis is true.
- (p. 237, The P-Value)
I use p-value instead of P-value.
Some just use P, but I avoid that due to possible confusion with
the probability function. I don't approve of the terms double
tail and two-tailed p-value. (One can refer to the p-value
of a two-tailed test, but two-tailed p-value isn't a sensible term.)
Note that the indented statement is not a general definition of p-value --- it
just specifies what the p-value is equal to for the specific type of
two-tailed test under consideration.
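To make that concrete for this particular test: the p-value of the
two-tailed test is the probability, computed under the null sampling
distribution of the test statistic, of getting a value at least as far
from zero as the one observed. A small sketch (the values of the test
statistic and the degrees of freedom are made up):

    from scipy import stats

    t_obs = 1.71   # hypothetical observed value of the test statistic
    df = 12.8      # hypothetical Welch/Satterthwaite degrees of freedom

    # two-tailed p-value: area in both tails beyond |t_obs|
    p_two_tailed = 2 * stats.t.sf(abs(t_obs), df)
    print(p_two_tailed)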
- (p. 238, Definition (of p-value))
The gray box does not give the general definition of p-value. The
trouble with it as a definition is that for some tests the values of the
test statistic which are "at least as
extreme" may not be clearly indentifiable (e.g., if one is doing a
two-tailed test and the test statistic's null sampling distribution is
not symmetric). Nevertheless, the gray box gives a prescription of how
the p-value may be determined for many (but not all)
tests. A better definition is that the p-value is the smallest level at
which we can reject the null hypothesis in favor of the alternative with
the given data.
- (p. 238) The concept of the p-value being a measure of
compatibility of the data with the null hypothesis is useful.
- (p. 238, Drawing Conclusions from a t Test)
The first paragraph of this section addresses the issue of how small
is small when it comes to p-values --- that is, how small does a
p-value have to be to be regarded as evidence against the null
hypothesis? If the test result is going to be used to make a decision
(that is, if one thing will be done when the null hypothesis is rejected
in favor of the alternative, and another thing will be done when it is
not), then how small the p-value should be in order to reject the
null hypothesis should depend on the consequences of making an error ---
what happens if one rejects but the null hypothesis is really true, and
what is the penalty if one fails to reject but the alternative
hypothesis is really true? But in some situations, the losses due to
errors may be hard to quantify. For example, in a scientific study, it
may not be so easy to determine how strong the evidence should be in
order to make an experimental result worthy of publication. On one hand,
we don't want false conclusions to be published, but on the other hand,
we don't want the standard to be so high that possibly important results
are not reported due to some small amount of experimental noise giving
rise to some (possibly very small) doubt as to whether the observed
result is meaningful, as opposed to being just due to chance variation
(e.g., the random assignment of subjects making the treatment appear to
be effective when in fact it was just the case that stronger subjects
were randomly assigned to the treatment group and weaker ones to the control
group).
- (pp. 238-239, Example 7.13) Here's a situation where I
don't think the conclusion stated on p. 239 is that useful: the p-value
provides a measure of the strength of the evidence against the null
hypothesis, and stating a conclusion seems pointless. The smallish
p-value means that it would be rather unlikely for the observed result
to have been obtained if the null hypothesis is true, and so one might
think that there is some meaningful evidence that the alternative is
true --- but at the same time, the fact that the p-value isn't much
smaller should suggest that there is some doubt as to whether the
alternative is true. The experimental results don't prove things one
way or the other, and since we have some uncertainty, it seems better to
not state a "conclusion" but rather to let the p-value provide some
indication of the strength of the evidence ... a measure of the
uncertainty which exists. (Also see the last sentence of the first
paragraph under the Reporting the Results of a t Test
heading on p. 242.) To consider another example, suppose one
experiment resulted in a p-value of 0.049, and another resulted in a
p-value of 0.051 --- in both cases the strength of the evidence against
the null hypothesis is about the same, and it would be somewhat silly to
make a statement about rejecting the null hypothesis in one case and not
in the other. (Also, with regard to the footnote that pertains to this
example, I think it's proper to state that there is evidence of an
increase instead of evidence of a difference --- one chooses a
two-sided alternative when one is interested in making a claim of a
significant difference whichever mean is larger, but once a significant
result is obtained, it's clear which mean is larger and that can be
stated.)
- (p. 240) The first paragraph following Example 7.14 is quite
important. I don't like to use the phrase "accept the null hypothesis"
when it's not rejected, because as the next paragraph points out, the
data can be compatible with the alternative hypothesis even if it is
also compatible with the null hypothesis, and in such a case the data
doesn't strongly favor either hypothesis over the other one.
- (p. 240) The paragraph right before the Using Tables Versus
Using Technology heading is interesting. In some cases, say some
sort of comparison involving males and females, it might seem very
unlikely that the two distribution means are exactly equal, but testing
the null hypothesis that they are equal against the alternative that
they are not can still be useful --- if one does not reject the null
hypothesis it can be thought that since there is not strong evidence
suggesting that one of the means is larger than the other one, then even
if we think that they aren't exactly equal, it isn't clear which one is
the larger one ... that is, the sample variability is great enough so
that the sample mean from the distribution having the greater
distribution mean might be smaller than the other sample mean.
In a treatment versus control experiment, it may be possible that the
treatment does nothing at all, in which case the two samples can be
viewed as having come from the same distribution, and so in such a case the
distribution means would be exactly equal (and so the null hypothesis of
equal means may actually be exactly true).
- (p. 241, Example 7.15) The details of this example aren't
important, since the SPSS software can be used to supply p-values for
us.
- (p. 242) The 4th line on the page reminds me of something that I
want to inform you about: unless the observed outcome is absolutely
impossible under the null hypothesis, don't report a p-value as being
equal to zero. Even if the p-value is very small, give at least one
significant digit, or else state that the p-value is less than some
small upper bound, such as 0.001, 0.0005, etc. (Often bounding
a rather small p-value is preferable to reporting a more precise value,
because the accuracy of a really small p-value depends more heavily on
the assumptions of the test procedure (e.g., normality) being exactly
met.)
- (p. 242, Reporting the Results of a t Test) Note that
stating that a result is significant at the 5% level just means
that the p-value is less than or equal to 0.05. At times when there are
a lot of p-values at hand, for convenience one might just state which
ones are significant at a certain level, rather than giving all of the
detailed information. Also, when there is some doubt as to the accuracy
of the precise p-value (see previous comment pertaining to p. 242), but
it seems safe to assume that the p-value is rather small, one might opt
to just state that the result is significant at a certain small level
instead of reporting a p-value. I tend to prefer the term
statistically significant when referring to finding support for
the alternative hypothesis, since just using significant may be
taken to mean notable in some informal sense.
- (p. 242) As S&W points out, there is nothing particularly special
about 0.05 --- but it is commonly used as a significance level when
fixing a certain level is desired.
Section 7.5
- (p. 250) As the paragraph after Example 7.16 indicates,
while there is a relationship between a confidence interval and an
associated test result, there is an advantage in reporting both a
confidence interval and a p-value, and so I recommend that one generally
does both when reporting the results from an experiment. (What isn't
needed is a statement as to whether one can reject or not reject the
null hypothesis at a certain level, since that information and more can
be obtained from the p-value. (Also see the last sentence, not counting
the footnote, on p. 252.))
- (p. 252, Significance Level Versus P-Value)
Significance levels are important for studying the theoretical
properties of a test procedure. For example, if one wants to do a power
analysis (see Sec. 7.8), then it's necessary to specify the level of the
test being considered. But for reporting the results of a particular
experiment, I focus on the p-value, and usually don't even specify a
level for the test. (Note that one can report the p-value without
specifying a level for the test.) In cases where a level might be
specified, it's important to realize that the p-value may be less than,
equal to, or greater than the stated level of the test. The level of a
test pertains to the performance of a test having a predetermined
rejection criterion, and should be set (if set at all) before one even
looks at the data. The p-value results from the data from a particular
experiment --- it gives the strength of the evidence against the null
hypothesis.
- (p. 253, Table 7.10) This is an important table --- I'll
refer to it more than a few times in class. I don't think that there is
anything hard to understand about the table, but please make sure
that you take the time to understand it as soon as possible.
- (p. 253, Example 7.19) The dilemma of whether to reject the
null hypothesis or not in the case of a marginal p-value becomes less of
an important issue if the sample size is fairly large, since with a
large sample size an appreciable treatment effect should result in a
small p-value with high probability. But with a small sample size one
has to worry that if the null hypothesis isn't rejected when the p-value
is marginal, it could be that the experimental noise resulting from the
small sample has resulted in a decent treatment effect not being
statistically significant. (Comment: I think a one-tailed test
(see Sec. 7.6) would be better here --- it would increase the power of
detecting an important treatment effect.)
- (p. 254) It's important to realize that the two hypotheses are not
treated the same way --- in a sense we give the null hypothesis the
benefit of the doubt, in that we reject the null in favor of the
alternative only if the data is rather incompatible with the null
hypothesis. But if we don't have a tough standard for a rejection,
then a rejection could occur with a relatively high probability even
though the null hypothesis is true, and upon realizing that, it can be
concluded that a rejection of the null hypothesis doesn't really mean
much in such a case
(since a rejection could occur if the null hypothesis is false and
should be rejected, or a rejection could easily occur if the null
hypothesis is true and shouldn't be rejected). Only by requiring that
the probability of a type I error be small --- which is equivalent to
requiring that the p-value be rather small in order to reject --- can we
have a meaningful test procedure ... one that can sometimes lead to a
meaningful claim of significant evidence in favor of the alternative hypothesis.
But we must also realize that by requiring that the probability of a
type I error be small, we may wind up with a test procedure for which
the probability of a type II error is large --- but there may be little
that can be done about that unless the sample sizes are made larger.
- (p. 254, power) While undergraduate books tend to use
beta for the probability of a type II error, a lot of
graduate-level books use beta for the power, which is quite
different --- I'm used to using beta for power ... specifically,
I use beta for the power function (noting that for most
tests there isn't just a single value for a power, but rather the power
usually depends upon the magnitude of the treatment effect). It's
important to note that the power of a test depends upon the sample
size(s) --- if one doesn't have enough observations in an experiment,
the power to detect an important treatment effect may be rather small.
(I've seen this phenomenon work against many students in biology and
environmental science during my years at GMU --- their sample sizes were
too small, and they wound up lacking statistically significant evidence
to support the hypothesis that they wanted to support ... due to there
being too much uncertainty in the results when the sample sizes are
small ... the experimental noise makes it so it's hard to say that the
data are incompatible with the null hypothesis.)
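Here is a rough sketch of such a power calculation, using the statsmodels
package (my choice of tool, not something used in S&W); the effect size
below is the difference in means expressed in standard deviation units,
and the specific numbers are arbitrary:

    from statsmodels.stats.power import TTestIndPower

    analysis = TTestIndPower()

    # power of a two-sided, level 0.05 two-sample t test for an effect
    # size of 0.8 standard deviations, with 10 observations per group
    print(analysis.power(effect_size=0.8, nobs1=10, alpha=0.05,
                         ratio=1.0, alternative='two-sided'))

    # sample size per group needed to reach power 0.80 for the same effect
    print(analysis.solve_power(effect_size=0.8, power=0.80, alpha=0.05,
                               ratio=1.0, alternative='two-sided'))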
Section 7.6
- (pp. 256-257, Note) I prefer stating hypotheses for
a one-sided test as is described in the note --- have the pair of
hypotheses cover every possibility.
- (p. 258) 5 lines from the bottom of the page, we could also bound
the p-value from above by 0.5. Some would argue that if the p-value is
greater than 0.2, it really doesn't matter what value it is, but
bounding it from above by 0.5 does indicate that the estimated
difference in means is in the direction corresponding to the alternative
hypothesis.
- (p. 259, 1st paragraph) Note that the conclusion from a two-tailed
test can be directional if one rejects the null hypothesis.
- (pp. 261-262, Example 7.24) This example shows that if you
always first look at the data and then decide to do a one-sided test to
determine if there is statistically significant evidence that the means
differ in the way suggested by the data, your type I error rate will be
twice the nominal level of the test. In terms of p-values, your p-value
would always be half of what it should be. In order to prevent
expressing too strong of a result, you should decide what type of
alternative hypothesis to use before looking at the data in any way.
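A small simulation illustrates the point. Both samples below come from
the same normal distribution (so the null hypothesis is true), and the
one-sided alternative is chosen after looking at the data; the estimated
rejection rate comes out near 0.10 rather than the nominal 0.05. The
distributions, sample sizes, and seed are arbitrary choices of mine:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(1)
    n_reps, n = 10000, 15
    rejections = 0

    for _ in range(n_reps):
        x = rng.normal(0, 1, n)   # both samples come from the same
        y = rng.normal(0, 1, n)   # distribution, so H0 is true

        # pick the one-sided alternative suggested by the data
        alt = 'greater' if x.mean() > y.mean() else 'less'
        p = stats.ttest_ind(x, y, equal_var=False, alternative=alt).pvalue
        if p <= 0.05:
            rejections += 1

    print(rejections / n_reps)   # roughly 0.10, twice the nominal level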
- (p. 262, Computer note) I don't see a way to get
SPSS to report the p-value for a one-tailed test. (Sometimes I am
tempted to check to see if I've somehow installed SPSS Jr. by
mistake, even though there isn't such a thing. My guess is that later
on I'll see that SPSS has some nice things about it, but so far I've
been disappointed in that it doesn't have some basic things that I think
any statistical software package ought to have. But one thing good
about it is that it's easy to use. In case you're wondering how the
choice of SPSS was made, it came after discussion with faculty involved
in environmental science and biology, who were consulted when creating
STAT 535. Originally, the plan was to use Stata, because one faculty
member really pushed for it, but since he seems to be out of the picture
at GMU now, the decision was made to use SPSS because it is more
commonly used and is easy to use.) We can get the p-value for a one-tailed
test about the means from the p-value which is reported for a
two-tailed test, noting that if the sample means are in the order
indicated by the alternative hypothesis, the p-value for a one-tailed
test is just one half the value of the p-value for a two-tailed test,
and otherwise, if the sample means are not in the order indicated by the
alternative hypothesis, the p-value for a one-tailed test is 1 -
p/2, where p is the p-value from a two-tailed test.
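A small sketch of that conversion (the two-tailed p-value and the sample
means below are hypothetical):

    # hypothetical two-tailed p-value reported by SPSS, and sample means
    p_two_tailed = 0.084
    xbar_1, xbar_2 = 24.3, 21.7

    # suppose the alternative hypothesis is that mu_1 > mu_2
    if xbar_1 > xbar_2:
        # sample means are in the order indicated by the alternative
        p_one_tailed = p_two_tailed / 2
    else:
        p_one_tailed = 1 - p_two_tailed / 2

    print(p_one_tailed)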
Section 7.7
- (p. 267, Significant Difference Versus Important Difference)
The first paragraph gives examples of the use of the significant label
in statistical analysis. I prefer to use the term statistically
significant. For example, in the last sentence of the paragraph, I'd
instead use: The data do not provide statistically significant
evidence of toxicity.
- (pp. 267-268, Significant Difference Versus Important Difference)
The point is that a statistically significant difference need not be a
large difference, and could be such a small difference as to be
unimportant. (Because of this, one should always report an estimate of
the difference (perhaps using a confidence interval) in addition to the
p-value.)
On the other hand, insufficient data may prevent one from
claiming that an important difference is statistically significant.
- (pp. 268-269, Effect Size) There is no "magic number" that
has to be exceeded in order for an effect size to correspond to an
important difference --- it depends upon the particular situation.
In many fields, the effect size is not commonly used.
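For completeness, here is a sketch of one common sample version of the
effect size (the difference in sample means divided by the pooled sample
standard deviation, essentially Cohen's d); the data are hypothetical:

    import numpy as np

    # hypothetical samples, for illustration only
    x = np.array([10.2, 11.1, 9.8, 10.6, 10.9])
    y = np.array([9.1, 9.9, 10.4, 8.8, 9.5, 9.7])

    n1, n2 = len(x), len(y)
    sp = np.sqrt(((n1 - 1) * x.var(ddof=1) + (n2 - 1) * y.var(ddof=1))
                 / (n1 + n2 - 2))

    # estimated effect size: difference in means, in standard deviation units
    effect_size = abs(x.mean() - y.mean()) / sp
    print(effect_size)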
Section 7.8
- (p. 273) The first two paragraphs of the section are important.
The first paragraph gives a definition of power. Some students tend to
think of power as 1 minus the probability of a type II error
because a lot of books first introduce power like S&W does on p. 254,
but I think it's better to think of power as it's described in the first
paragraph of this section.
- (p. 273) Note that to maximize power given a fixed number of
subjects that can be assigned to either of two groups (say treatment and
control, or treatment 1 and treatment 2), it's typically best to divide
the subjects equally. If you have 20 subjects, don't put 15 in the
treatment group and 5 in the control group because you think the
treatment is of more interest than the control, because doing so can
hurt the power ... and if one has nonnormality to deal with, there is
less robustness when one of the sample sizes is so small.
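The statsmodels power calculation sketched in the p. 254 comment above
can be used to see the cost of an unbalanced split. With 20 subjects
total and an assumed effect size of one standard deviation (an arbitrary
choice of mine), the 10 + 10 split has higher power than the 15 + 5
split:

    from statsmodels.stats.power import TTestIndPower

    analysis = TTestIndPower()

    # 20 subjects total, effect size of 1.0 standard deviation, level 0.05
    balanced = analysis.power(effect_size=1.0, nobs1=10, alpha=0.05,
                              ratio=1.0)
    unbalanced = analysis.power(effect_size=1.0, nobs1=15, alpha=0.05,
                                ratio=5/15)

    print(balanced, unbalanced)   # the balanced split wins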
- (pp. 273-274, Dependence on sigma)
Note that power can be increased if the experimental noise is decreased.
- (p. 277, Example 7.35) The last paragraph (the note at the
end of the example) makes an
important point: when the sample sizes are large so that the power to
detect an important difference is high, then a failure to reject can
provide some meaningful information --- but if the sample sizes are
small, a failure to reject could just be due to low power, even though
the difference in means is rather large.
- (Table 5) This table is nice --- a lot of statistics books
don't include such a table. But when it gives values of 3 and 4 for
sample sizes, I wouldn't want to ever depend on using such small sample
sizes. Nonnormality can hurt the power, and it can also hurt the
validity of the test featured in Ch. 7. If you use such small sample
sizes, you're not going to be sure that your test results are reliable
--- there is no way to reasonably check the assumptions needed for
validity when the sample sizes are so small.
Section 7.9
I don't like the use of the term Student's t in this
section, since there is a testing procedure which is properly referred
to as Student's (two-sample) t test which is different from the
testing procedure emphasized in this chapter, which is Welch's test.
Welch's test uses a T distribution as an approximation for the
(null) sampling distribution of the test statistic, and some call it the
unequal variance two-sample t test, and I suppose that these
things have created a bit of confusion.
It used to be that Student's two-sample t test, which uses a
pooled estimate of the assumed common variance in the
estimated standard error of the difference in the sample means, was the
method emphasized in most elementary statistics books. But in more
recent years, Welch's test (which is seldom called by that name in
books) has been getting more support from textbook authors.
(Comment: I often think that, for the most part, the wrong people
write statistics textbooks. Many authors of introductory statistics books
are people who are not at major research universities and who tend to
teach low-level classes a lot, and graduate-level classes rarely, if
ever. My guess is that such people don't always keep up with
the latest and the greatest when it comes to statistics.)
In general, Welch's test should be the one which is emphasized more,
because if Welch's test is used when Student's t test should have
been used, typically little harm is done, but if Student's test is used
when Welch's test should have been used, some rather bad things can
happen (in some cases, the test can reject the null hypothesis with a
rather large probability if the null hypothesis is true, and in other
cases, the test can have rather low power to reject the null hypothesis
when it should reject the null hypothesis).
- (p. 280, Conditions) Part (a) of the 1st condition states
that the "populations must be large." Sometimes the "populations" are
hypothetical. For example, in testing the accuracy of
a new type of heat-seeking missile, one might test 25 missiles that are
built in a certain way. They may not be randomly chosen from a larger
population of missiles (since a large number of missiles may not be built
before some are tested), but we may view the 25 missiles as being
representative of other missiles that could be built --- in a case like
this, it may be better to think of the 25 observations based on the
missiles (maybe the observation is whether or not each hit its target) as
being the observed values of random variables having a certain
distribution, and the goal is to make inferences about this unknown
distribution (although an alternative viewpoint would be to say that
inferences are to be made about a hypothetical population of missiles
which could be built in the future, but with such a viewpoint, we don't
have that the 25 missiles used in the study were randomly drawn from the
population).
- (p. 280, Conditions) Part (a) of the 2nd condition states
that "the population distributions must be approximately normal" if
the sample sizes
are small. In some cases the test procedure works quite well even if
the distributions are rather nonnormal, with an example being if the
sample sizes are equal and both
distributions are skewed in exactly the same way (same direction, and to
the same degree).
- (p. 280, Conditions) Part (b) of the 2nd condition states
that "the population distributions need not be approximately normal" if the sample sizes
are large, and then a follow-up comment indicates that in many cases, 20
may qualify as large. But in some situations, even samples of size 50
may not be large enough --- it all depends on the nature of the
nonnormality. The worst cases tend to be ones for which the
distributions are strongly skewed in different directions, or perhaps one
distribution is strongly skewed, and the other isn't. When both
distributions are skewed about the same way, there is a cancellation
effect due to the fact that one sample mean is being subtracted from
another, but if they are skewed in opposite directions, the subtraction
in the numerator of the test statistic can cause the sampling
distribution of the numerator to be appreciably skewed (because sometimes
samples of size 40 aren't large enough to have the "central limit theorem
effect" kick in to a large enough degree).
- (p. 280, last paragraph) I don't think a histogram and a
stem-and-leaf display add anything to the assessment of approximate
normality, if one knows how to interpret a normal probability plot (aka,
probit plot).
- (p. 280, last sentence) The truth of this sentence depends on the
nature of the skewness --- if the distributions are skewed too
differently, and the sample sizes are rather small and perhaps not
equal, then skewness could be a problem. The skewness issue is more
important if a one-tailed test is being done, since for two-tailed tests
there is a type of cancellation effect (different from the cancellation
effect referred to above) that reduces concern about validity (but one
can still have screwy power characteristics for the test).
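One way to get a feel for how much trouble a particular pattern of
skewness causes is to simulate the rejection rates of Welch's test under
the null hypothesis. The sketch below uses distributions skewed in
opposite directions and reports both the two-tailed and one-tailed
rejection rates; the particular distributions, sample size, and seed are
arbitrary choices of mine, and the results will vary with those choices:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(2)
    n_reps, n = 10000, 15
    reject_two, reject_one = 0, 0

    for _ in range(n_reps):
        x = rng.exponential(1.0, n)         # skewed to the right, mean 1
        y = 2.0 - rng.exponential(1.0, n)   # skewed to the left, mean 1

        # the null hypothesis of equal means is true in both cases below
        p2 = stats.ttest_ind(x, y, equal_var=False).pvalue
        p1 = stats.ttest_ind(x, y, equal_var=False,
                             alternative='greater').pvalue
        reject_two += (p2 <= 0.05)
        reject_one += (p1 <= 0.05)

    # compare the estimated rejection rates with the nominal 0.05
    print(reject_two / n_reps, reject_one / n_reps)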
- (p. 281, Consequences of Inappropriate Use of Student's t)
It is noted that "long straggly tails" (a phrase which I don't care for)
can hurt the power of the test. Not only that, but in cases where the
null hypothesis is true, they can lead to an inappropriately high type I
error rate. (So it's the worst of both worlds --- not rejecting with
high probability when
rejection should occur, and rejection with too high of a probability
when rejection shouldn't occur.)
- (pp. 281-283, Example 7.36) It should be noted that the
means of the log-transformed random variables can be equal while the
means of the untransformed random variables differ, or the
means of the log-transformed random variables can differ while the
means of the untransformed random variables are equal. So the results
from testing the transformed data cannot be safely applied to the
distributions of the original data, which is sometimes quite undesirable.
That is, by transforming to approximate normality, you can sometimes
feel comfortable in using Welch's test, but you may wind up reaching a
conclusion about a pair of distributions that aren't the ones you'd like
to reach a conclusion about.
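A small numerical illustration of the first point: if a random variable
is lognormal, its mean on the original scale is exp(mu + sigma^2/2),
where mu and sigma^2 are the mean and variance on the log scale, so two
distributions can agree in their log-scale means and still have quite
different means on the original scale (the parameter values below are
arbitrary):

    import numpy as np

    # two lognormal distributions with the same mean on the log scale
    mu = 1.0
    sigma_1, sigma_2 = 0.5, 1.5

    # means on the original (untransformed) scale differ
    mean_1 = np.exp(mu + sigma_1**2 / 2)
    mean_2 = np.exp(mu + sigma_2**2 / 2)
    print(mean_1, mean_2)   # about 3.08 versus about 8.37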
Section 7.10
- (p. 285, How is H0 Chosen, 1st paragraph)
I don't agree with a lot of what's in this paragraph. One should first
determine what one wants to see if there is statistically significant
evidence to support, and this should be the alternative (or
research) hypothesis. Then the null hypothesis is just
everything else. In the middle of the paragraph, I don't think it's
right that "in the absence of evidence, we would expect the two drugs to be
equally effective." In the absence of evidence, why expect anything in
particular? (One might hazard a guess, but it'd just be a guess.)
In the case of a new drug/method/whatever, in some cases the alternative
should be that the new thing is better --- that's what we want to see if
we have significant evidence of.
- (p. 285, How is H0 Chosen, 2nd paragraph)
I don't agree with a lot of what's in this paragraph.
- (pp. 285-286, Another Look at P-Value)
The phrase "the P-value of the data" isn't commonly used.
More commonly used is the p-value of the test, but in such a case,
the test refers to the test done on a particular set of data.
Also, none of the definitions given in this subsection are actually
general definitions of p-value. The closest one to a general definition
(and I guess it qualifies as a suitable definition --- just a bit
awkward/informal, but nevertheless expresses the correct point) is given
near the top of p. 286: the indented portion of the 2nd paragraph on
that page. Finally, the last 10 lines of this subsection (near the
middle of p. 286) state some important points --- so read and learn!
- (p. 286, footnote)
I would say that Bayesian methods are seldom appropriate, and even when
they are, I have a hard time accepting that the probability that the
null hypothesis is true makes any sense (since really, the null
hypothesis is either true or it's not).
Section 7.11
- (p. 288) The Wilcoxon version of the test is called the Wilcoxon rank sum test; it is completely equivalent to the
Mann-Whitney test, in that although the test statistic is computed differently, one would always get the same p-value
whichever version of the test is used. In Ch. 9, we'll encounter the Wilcoxon signed-rank test,
which is used for different situations.
- (p. 288) The reason given for why it is called a nonparametric test is not good. It's a nonparametric test because we don't have
to assume any particular parametric model (like a pair of normal distributions).
- (p. 289) The pertinent null hypothesis is that the two distributions are identical, and it is tested against the general
alternative that the two distributions differ. In some situations (if it can be believed that either the two distributions are the
same, or, if they differ, that one is stochastically larger than the other), the W-M-W test can be used as a test about the distribution
means. In a more limited set of circumstances, the test can be viewed as a test about distribution medians. (Some statistical
software packages (not SPSS) make it seem as though it is a test about the medians, but this just isn't true --- one has to add extra
assumptions for it to be viewed as a test about the medians.)
- (p. 289, near the bottom of the page) It's not at all clear that the gap sample is slightly skewed to the left (i.e., negatively
skewed). (Note: The probit plots on p. 290 have the axes reversed from the way SPSS produces them, and from the way I describe
them in class. So the guidelines I give in class cannot be applied here.)
- (p. 290, Method) It's not necessarily true that the test statistic measures the degree of separation or shift, since it's
not necessarily the case that one distribution is merely shifted up or down relative to the other one --- the two distributions can
have very different shapes.
- (pp. 291-292) The tables on p. 291 and p. 292 have the "One tail" and "Two tails" labels on the wrong rows. However, I recommend that
you ignore these tables altogether! The way S&W describes how to do the test is nonstandard, and I think you'll be better off doing
it as I describe in class and using the tables I supply in class (if the sample sizes are less than 10, unless SPSS can
also give an exact p-value). (One cannot achieve good accuracy using the tables in S&W.)
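If you want a cross-check on SPSS, the test is also available in Python's scipy package, which can report an exact p-value
for small samples without ties; the data below are hypothetical:

    import numpy as np
    from scipy import stats

    # hypothetical samples, for illustration only
    x = np.array([4.3, 5.1, 4.8, 5.9, 6.2, 4.1])
    y = np.array([5.8, 6.4, 7.0, 6.6, 5.5])

    # Wilcoxon-Mann-Whitney test of identical distributions against
    # the general two-sided alternative
    res = stats.mannwhitneyu(x, y, alternative='two-sided', method='exact')
    print(res.statistic, res.pvalue)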
- (pp. 295-296, The Wilcoxon-Mann-Whitney Test Versus the t Test) The book is wrong in that the two tests are not really
aimed at answering the same question. Welch's test is a test about the distribution means, whereas the W-M-W test is a test for the
general two-sample problem (testing equal distributions against the general alternative) that can sometimes be used as a test about
distribution means. In cases where they can both be used for testing hypotheses about the means, neither one dominates the other ---
in some cases Welch's test is more powerful and in other cases the W-M-W test is more powerful.