Some Comments about Chapter 11 of Samuels & Witmer
Section 11.1
- (p. 465, The problem of multiple comparisons)
One way to work around the multiple comparison problem is to make a
simple adjustment for the fact that doing all of the two-sample
comparisons makes the probability of at least one type I error
a value much greater than the type I error rate for each individual
test. Suppose that we have I = 5 groups, so that
C(5,2) = 10 pairwise comparisons can be considered.
If each of the 10 pairwise tests is done at level 0.005, then Boole's
inequality (a probability fact that is easy to understand) gives us that
the overall chance of at least one type I error from the 10 tests cannot
be larger than 10*0.005 = 0.05. In general, for the overall probability
of at least one type I error to be limited to no more than 0.05, we
can do each of the
C(I,2) pairwise tests at level
0.05/C(I,2). Such a workaround is not the
best way to go about things when the variances can be assumed to be
equal, since it's a conservative approach, and methods introduced in Ch.
11 will be more powerful. But if the variances differ greatly, then the
Ch. 11 methods aren't so good, and using the scheme based on Boole's
inequality with Welch's statistic may be an attractive option. (There
are some methods which aren't as conservative that allow for
heteroscedasticity (unequal variances), but such methods aren't commonly
used.)
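To make the mechanics concrete, here is a small sketch in Python (not from
S&W --- the group values below are made up just for illustration) of doing
all of the pairwise comparisons with Welch's statistic, each at level
0.05/C(I,2):

# Bonferroni-style workaround based on Boole's inequality: do each of the
# C(I,2) pairwise Welch tests at level 0.05 / C(I,2).  The data are made up.
from itertools import combinations
from math import comb
from scipy import stats

groups = {
    "A": [4.1, 3.8, 4.4, 4.0, 3.9],
    "B": [4.6, 4.9, 4.5, 5.1, 4.7],
    "C": [3.9, 4.2, 4.0, 4.3, 4.1],
}
I = len(groups)
alpha_each = 0.05 / comb(I, 2)   # per-comparison level

for (name1, x1), (name2, x2) in combinations(groups.items(), 2):
    t, p = stats.ttest_ind(x1, x2, equal_var=False)   # Welch's statistic
    print(name1, "vs", name2, ": p =", round(p, 4),
          "(reject)" if p < alpha_each else "(do not reject)")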
- (p. 465, Estimation of the standard deviation)
If the variances can be assumed to be equal, which is not always a
realistic assumption, then pooling the information from all of the
samples is a good idea. But if the variances differ enough, methods
based on an assumption of homoscedasticity, and involving a pooled
estimate of the (assumed common) standard deviation, can perform badly. If the
sample sizes are equal, or not too different, then the type I error rate
isn't affected so much by heteroscedasticity, but the power
characteristics of the test can be screwy. If the sample sizes differ
appreciably, the type I error rate of the commonly used testing
procedure can be very far from the nominal level --- resulting in either
an extremely anticonservative test, or in a test having rather poor
power.
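For reference, the pooled estimate referred to above weights each sample
variance by its degrees of freedom:

    s_{pooled}^2 = \frac{(n_1 - 1) s_1^2 + \cdots + (n_I - 1) s_I^2}{N - I},

where N = n_1 + ... + n_I is the combined sample size. This is the same as
the MS(within) (aka MSE) of Sec. 11.2.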
- (p. 467, 1st paragraph) This paragraph explains why the phrase
ANalysis Of VAriance is used for a test about distribution means.
Section 11.2
- (p. 468) Rather than use n* for the total number
of observations (aka combined sample size), I might use the commonly
used N. (K is often used, instead of I, for the
number of groups. I'll try to use I, but may slip and use
K at times.) The grand mean is sometimes called the mean
of the combined sample.
- (p. 469 & p. 472) I've long been of the opinion that it isn't
worthwhile to
try to understand why the phrase degrees of freedom is used, and
why the specific df values are what they are.
- (p. 469, 1st gray box) The expression given is typically called
the sum of squares due to error and denoted SSE. I'll use SSE
instead of writing SS(within).
G&H (Grafen & Hails book) use SSE.
- (p. 470, 1st gray box) The expression given is typically called
the mean squares due to error and denoted MSE. I'll use MSE
instead of writing MS(within).
G&H (Grafen & Hails book) use EMS instead of MSE.
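For reference, in symbols (with \bar{y}_{i \cdot} denoting the ith sample
mean and N the combined sample size) these two quantities are

    SS(within) = SSE = \sum_{i=1}^{I} \sum_{j=1}^{n_i} (y_{ij} - \bar{y}_{i \cdot})^2
    and
    MS(within) = MSE = \frac{SSE}{N - I}.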
Section 11.3
- (pp. 476-477) Some books use alpha instead of tau in
the model statement (and assume the reader can keep this use of
alpha different from the use of alpha regarding the size
or level of the test). Whether tau or alpha is used in the
model statement, the term is often referred to as the treatment
effect (where S&W use "effect of group"). In this ANOVA setting,
this use of treatment effect gives how the mean of the ith
treatment distribution differs from the "grand population mean" (which
is somewhat of a nebulous thing). Some books use e instead of
epsilon for the error term, and I think it's good to do so, to
follow a convention of using Greek letters for constants (often of
unknown value, but not variable) and Roman letters for random variables.
(Note: In model statements and formulas for test statistics and
confidence intervals, lower case Roman letters are sometimes used in the
ANOVA setting even when the letters refer to the random variables and not
their observed values. I suspect that this practice is due to a desire
to make the formulas appear less "busy"/cluttered.) The term error
term is perhaps unfortunate since it makes some people assume that
it refers to measurement error. It really captures all sources of
variation about the distribution mean, and in most settings the
variation is mostly due to differences among individuals in the
population as opposed to measurement error.
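Whichever letters are used, the model statement being discussed has the form

    y_{ij} = \mu + \tau_i + \epsilon_{ij},   i = 1, ..., I,   j = 1, ..., n_i,

where \mu is the grand population mean, \tau_i is the treatment effect for
the ith group, and the \epsilon_{ij} are iid error terms (typically assumed
to be, at least approximately, N(0, \sigma^2)).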
Section 11.4
- (p. 479) Note that Figure 11.6 illustrates the critical
value. Most books put the two df values in the subscript along with the
upper-tail probability.
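Software can supply such critical values; here is a small sketch in Python
(the upper-tail probability and df values are just made up for illustration):

# F critical value written F_{0.05; df1, df2} in the subscript style above.
from scipy import stats

df_num, df_den = 4, 20                       # made-up df values
crit = stats.f.ppf(1 - 0.05, df_num, df_den)
print(crit)                                  # value cut off by upper-tail area 0.05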
Section 11.5
My comments here will pertain to the use of SPSS in addition to
addressing some of what is in S&W.
Basically, for the usual one-way ANOVA F test, the key
assumptions are independent samples, normality, and
homoscedasticity.
For independent samples, we can obtain random samples,
independently, from I different distributions/populations, or we
can start with one randomly selected (or sometimes for convenience, they
are just typical (of the population)) group of experimental units, and
then randomly allocate these to the I different "treatment"
groups. But we cannot use a randomized block scheme. (If there is a
good way to create meaningful blocks of experimental units, doing so is
generally a good idea. But if this is done, we no longer can use the
one-way ANOVA F test and many related procedures that assume
independent samples. (Randomly assigning the units in each block to the
I different treatment groups gives us a two-way ANOVA design (see
Sec. 11.6 of S&W).))
The assumption of approximate normality is addressed in the first
paragraph on p. 485 and in
Figure 11.7 and
Figure 11.9. If the sample sizes aren't too small, individual
probit plots can be made. (Using SPSS, this will involve copying and
pasting the response variable values into various columns, since to run
the ANOVA, one needs all of the response values together in a single
column.) But if the sample sizes are less than 10, individual probit
plots may not be very useful, and it may be good to examine a probit
plot of the pooled residuals (which are just the residuals
associated with all of the observations (where a residual is the value
of an observation (of the response variable) with the related sample
mean subtracted from it --- the residuals are the deviations
referred to on p. 485)). The residuals are just estimates of the "error
terms" (since an observation minus it's related true mean is the error term for
the observation), and the assumption under consideration here is that
the error terms are iid random variables from a distribution which is
not too nonnormal. So if the sample means are decent estimates of the
distribution means, the pooled residuals should appear to be a random
sample from a nearnormal distribution. (Unfortunately, to obtain the
pooled residuals so that you can plot them, you have to do a bit of
work with SPSS --- getting them isn't as easy as it should be! I'll
explain how to get them and plot them below.)
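(For those who would rather not fight with SPSS, here is a small sketch in
Python of forming the pooled residuals and making a probit (normal) plot of
them --- the group labels and response values below are made up just to show
the mechanics.)

# Pooled residuals: each response value minus its group's sample mean,
# then a normal probability (probit) plot of all of them together.
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

group = np.array(["A"]*5 + ["B"]*5 + ["C"]*5)       # made-up group labels
y = np.array([4.1, 3.8, 4.4, 4.0, 3.9,
              4.6, 4.9, 4.5, 5.1, 4.7,
              3.9, 4.2, 4.0, 4.3, 4.1])             # made-up responses

resid = np.empty_like(y)
for g in np.unique(group):
    mask = group == g
    resid[mask] = y[mask] - y[mask].mean()          # deviation from group mean

stats.probplot(resid, dist="norm", plot=plt)        # points near a line => near-normal
plt.show()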
S&W doesn't contain much information about the robustness of the
one-way F test and related procedures. Basically, it's similar
to what we have for two independent samples and the robustness of
Student's t test and Welch's test. That is, if the distributions
are light-tailed and not too asymmetric, the test works very well. If
the distributions are all skewed in the same direction, and to about the
same degree, and the sample sizes aren't too different, then the
validity of the test isn't too much of a concern, but power can be poor
if the skewness is rather extreme (and one might consider using a
nonparametric procedure, such as the Kruskal-Wallis test). If the
distributions appear to be heavy-tailed and not too asymmetric (unless
the skewness is about the same for all of them), the test is
conservative, and so it is valid, but may have rather poor power (and
again, one might consider using a nonparametric test instead).
Taking a brief detour, given that I referred to the Kruskal-Wallis test
(K-W test) above, let me describe it here, explain when it can be used, and
indicate how to get SPSS to perform the test.
One can think of the K-W test in two ways: as an extension of the
two-sample Wilcoxon rank sum test, and as a rank analog of the one-way
ANOVA F test. To obtain the K-W test statistic, one ranks the
observations in the combined (pooled) sample from 1 to N, and
replaces the x_ij in the ANOVA F statistic by
their respective ranks, except the I-1 is omitted.
(So the "heart" of the statistic compares the average rank for each
sample to the overall average rank (for the combined sample) in the same
manner that the numerator of the F statistic compares the sample
mean for each sample to the sample mean of the combined sample (the
grand (sample) mean).)
As with the F test, one rejects for
large values of the K-W statistic (often denoted by H). The
asymptotic null sampling distribution is a chi-square distribution with
I-1 df. (The exact sampling distribution isn't a chi-square
distribution, but if the sample sizes are at least 5 or 6, the
upper-tail probability from the chi-square distribution often
approximates the
actual p-value (which is not too easy to obtain (almost all statistical software
uses the chi-square approximation)) fairly well.)
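(For reference, with no ties the statistic can be written

    H = \frac{12}{N(N+1)} \sum_{i=1}^{I} n_i \left( \bar{R}_i - \frac{N+1}{2} \right)^2,

where \bar{R}_i is the average rank for the ith sample and (N+1)/2 is the
overall average rank.)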
Provided that independence is not an issue,
the test is valid for the general I sample problem
(which is often referred to as the general K sample problem),
for which the
null hypothesis of identical distributions is tested against the general
alternative. It can at times be used to do a test about the
distribution means, if one is willing to assume that for any two
distributions that differ, one is stochastically larger than the other.
To do a K-W test using SPSS, put all of the response values (the
combined sample) in one column and put integers from 1 to I in
another column to indicate which sample each response value belongs to.
Then use Analyze > Nonparametric Tests > K Independent Samples,
click in the response and group variables, click on Define Range
to provide the values to
indicate the groups (typically 1 and I (if you want to include
all of the samples)), and click OK.
For the lamb data of
Example 11.2 and
Example 11.9 of S&W, one needs to type in the data, being sure to
create a column having the
integers 1, 2, and 3 to indicate the groups (since group indicators that
are words won't work). Upon running the K-W test, you should obtain a
p-value (which may not be too accurate given that the sample sizes are so
small) of about 0.35.
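(For those wanting a check on the SPSS output, here is a small sketch in
Python --- the values below are placeholders, not the actual lamb data from
S&W, so the resulting p-value won't match the 0.35 referred to above.)

# Kruskal-Wallis test for I independent samples; scipy reports a p-value
# from the chi-square approximation described above.
from scipy import stats

diet1 = [8.0, 9.5, 7.2]             # placeholder values, not the S&W lamb data
diet2 = [11.0, 10.1, 9.8, 12.3]
diet3 = [12.5, 13.0, 11.9, 14.2, 12.8]

H, p = stats.kruskal(diet1, diet2, diet3)
print("H =", round(H, 3), " p-value =", round(p, 3))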
The 2nd paragraph on p. 485 suggests a check of the assumption of
homoscedasticity, but it isn't a very good check, since one should not
expect all of the sample standard deviations to differ by less than a
factor of 2 when the sample sizes are small (not even for sample sizes as
large as 12), even if the true variances are all the same. Fortunately,
even though the assumption cannot be checked very meaningfully, the F
test is fairly robust for validity if the sample sizes are equal, or
nearly equal, unless the variances differ severely. But the test can have
odd power characteristics, even if the sample sizes are equal.
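To see why the factor-of-2 check isn't trustworthy with small samples, here
is a small simulation sketch in Python (my own illustration, with I = 4
groups of size 12): it estimates how often the largest sample standard
deviation exceeds twice the smallest even though the true variances are all
equal.

# Simulate I groups with equal true variances and count how often the
# sample SDs differ by more than a factor of 2.
import numpy as np

rng = np.random.default_rng(1)
I, n, reps = 4, 12, 10_000
count = 0
for _ in range(reps):
    sds = np.array([rng.normal(size=n).std(ddof=1) for _ in range(I)])
    if sds.max() > 2 * sds.min():
        count += 1
print("proportion with a 2-fold SD difference:", count / reps)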
Section 11.6
- (pp. 487-488, Example 11.12) This is a typical randomized
block experiment. The investigation could be done with a one-way
design, but the hope is that by making an adjustment for the distance
from the window and the differing amounts of light, the power to detect
differences among treatments will be increased due to the elimination of
some of the experimental noise that could arise from a one-way design.
Note that in Fig. 11.10, the ordering of the treatments within
each block is due to random assignment (which is why this is called a
randomized block design).
- (p. 488) The model statement allows for more than one observation
per cell, but typically for a randomized block experiment, there
is only one observation per cell (as is the case in Example
11.12), in which case we would have
y_ij instead of
y_ijk.
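In symbols (using tau for the treatment effects and beta for the block
effects --- notation that may differ a bit from S&W's), the model with one
observation per cell is

    y_{ij} = \mu + \tau_i + \beta_j + \epsilon_{ij},   i = 1, ..., I,   j = 1, ..., J,

where J is the number of blocks.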
- (p. 490, Example 11.14) To generate the ANOVA table using
SPSS, one can use
Analyze > General Linear Model > Univariate.
(Note: By using the alfalfa data from the CD that came
with S&W, one can see the format needed for the data.) In the initial
dialog window, click in height as the Dependent Variable,
trt (for treatment) as a Fixed Factor, and block as
a Fixed Factor. (Note: Usually, a blocking variable is a
Random Factor, but for this experiment it's better viewed as a
fixed factor.) Next, go to the Model dialog window, and
select Custom. (One gets suitable output using the default, but
the ANOVA table isn't in the usual form. To make the output similar to
what we can find in S&W, one needs to use Custom.) Then,
separately, click
trt and block into the Model box (if you highlight
them both and click them in together, they go in as a single product term
instead of as two separate factors), and for Build
Terms select Main effects (meaning that interaction terms
aren't going to be included in the model, which is a sensible choice
because there is no way to include interactions in the analysis if there
is only one observation per cell). Click Continue to close the
Model dialog window. (One can use the default settings for the
other items addressed in the dialog window.) Upon opening the Post
Hoc dialog window, I recommend selecting trt for Post Hoc
Tests and checking the Tukey box, which
will lead to the creation of studentized range simultaneous confidence
intervals, which can be used to possibly gain some understanding about
how the treatments differ (if they do differ). Then, to avoid getting too
much output at an early stage in the analysis, I recommend clicking
Continue to close the dialog window.
Upon opening the Save dialog window, I recommend checking off to
save Unstandardized Predicted Values and
Unstandardized Residuals, and then clicking Continue.
(Note: When we get into regression analysis, I'll emphasize other
types of residuals. But for now, I'm sticking with the simple
unstandardized ("raw") residuals.) Finally, click OK to produce
all of the requested output. Unfortunately, the output, while coming as
close as is possible to the book while using SPSS in a relatively
straightforward way, differs from what is found in S&W. However, all of
the needed information shown in the book is there --- just arranged
differently, and mixed in with some additional information which can be
quite useful. In particular, you should be able to find the F
statistic value of 5.47, and SPSS also gives the associated p-value,
which is not shown in Table 11.9.
Also shown in the output from SPSS, but not shown in Table 11.9,
is the F statistic and p-value for a test done to determine if
there is statistically significant evidence for differences between the
blocks. (Even though there isn't strong evidence for differences
between the blocks, it doesn't mean that blocking wasn't a good idea,
and we shouldn't, after the fact, pretend that blocks weren't used
and redo the analysis based on a one-way design.)
In the Multiple Comparisons box in the SPSS output,
one can see that there is statistically significant evidence that the
high treatment differs from the control, but we don't have
strong evidence that the low treatment differs from either of the
other two. The confidence interval for the difference in the means of
the form high - control is about (-1.63, -0.09).
The confidence interval for the difference in the means of
the form low - control is about (-1.41, 0.13) (which includes 0,
indicating that there isn't strong evidence of a difference).
The confidence interval for the difference in the means of
the form high - low is about (-0.99, 0.55) (which includes 0,
indicating that there isn't strong evidence of a difference).
The smallest value in the Sig. column of the
Multiple Comparisons output, 0.031, is the p-value for a test for
differences among treatments based on the Tukey HSD procedure. Note
that it is very close to the p-value of 0.032 which resulted from the
standard F test. (This is often the case when there are just 3
treatment groups. With 4 or more treatment groups, it can be that the
Tukey p-value is appreciably greater than or appreciably less than the
p-value from the F test --- it depends on how the treatment
effects are situated, with the Tukey method being more powerful if a
smallish proportion of the treatment effects are different from the bulk
of the treatment effects, which are the same or tightly clustered.)
A probit plot of the residuals indicates the possibility of a heavy-tailed
error term distribution. A plot of the residuals against the predicted
values isn't very informative --- there are just too few observations
from the experiment.
(See
parts (e) and (f) of Problem 78 on the
homework web page for instructions
for making such plots with SPSS.)
Due to the possibility of a heavy-tailed error term distribution, one
might want to try Friedman's test (a nonparametric test that can be used
with a randomized block design --- but not covered in S&W). I'll
explain in class how to do an approximate version of Friedman's test
using SPSS, and how an exact p-value can sometimes be obtained using
tables.
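For those not using SPSS, here is a small sketch in Python of the same sort
of analysis. The column names height, trt, and block follow the description
above, but the file name alfalfa.csv is just my assumption about how the data
from the S&W CD might be saved, so treat this as a sketch rather than
something to run verbatim. (Note that the Tukey intervals below are based on
a one-way comparison rather than the blocked model's MSE, so they will differ
somewhat from SPSS's GLM post hoc output.)

# Randomized block analysis: additive (main effects only) model, Tukey HSD
# simultaneous intervals for the treatments, and Friedman's test.
import pandas as pd
from scipy import stats
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm
from statsmodels.stats.multicomp import pairwise_tukeyhsd

df = pd.read_csv("alfalfa.csv")                     # assumed file name/format

fit = smf.ols("height ~ C(trt) + C(block)", data=df).fit()
print(anova_lm(fit))                                # ANOVA table with F statistics

print(pairwise_tukeyhsd(df["height"], df["trt"]))   # Tukey HSD intervals (one-way MSE)

wide = df.pivot(index="block", columns="trt", values="height")
print(stats.friedmanchisquare(*[wide[c] for c in wide.columns]))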
- (pp. 490-493, Example 11.15 and Example 11.16)
These examples pertain to a two-factor experiment (with both factors
being fixed effects factors). Here, an additive model is assumed
(and in this case it turns out to be a good assumption), but often, when
there is more than one observation per cell (here there are only 4
cells, and 13 observations per cell), interactions will be allowed
(which gets us away from the simple, but sometimes unrealistic, additive model).
- (pp. 495-496, Example 11.19)
Here the same experiment is considered as is considered in the examples
referred to just above. Only now the full model, including an
interaction term, is analyzed. This is typical --- one usually allows
for interactions unless there is some good reason to believe that they
don't exist. To generate the results of the ANOVA table (Table
11.14) using SPSS, we can use
Analyze > General Linear Model > Univariate.
(Note: By using the soybean data from the CD that came
with S&W, one can see the format needed for the data.) In the initial
dialog window, click in area as the Dependent Variable,
and both shaking and
light as Fixed Factors.
Next, go to the Model dialog window, and
select Full factorial.
Then click
Continue to close the
Model dialog window. (One should use the default settings for the
other items addressed in the dialog window.) There is no need to open
the Post
Hoc dialog window, since there are only two levels for each of the
factors (and so if the factor is significant, one has evidence that the
two levels have statistically significantly different treatment effects).
Upon opening the Save dialog window, I recommend checking off to
save Unstandardized Predicted Values and
Unstandardized Residuals, and then clicking Continue.
Finally, click OK to produce
all of the requested output.
You should be able to find the values of the various F
statistics, and SPSS also gives the associated p-values,
which are not shown in Table 11.14.
It can be noted that both factors are highly significant, with SPSS
reporting p-values of 0.000. I don't like to report a p-value as being
zero, since someone might take that to mean that it's absolutely
impossible for the observed data values to have arisen if the null
hypothesis is true. So when SPSS reports 0.000, I'd just write
p-value < 0.0005 when reporting the results.
It can also be noted that the p-value for the test for interaction is
about 0.865, indicating that there is no statistically significant
evidence for interactions. There may be mild interactions, but one
might be tempted to use an additive model to describe the phenomenon.
(Note: To fit an additive model,
select Custom instead of Full factorial
in the Model dialog window, and just use main effects for the two
fixed factors.)
The output also gives 0.000 for a p-value pertaining to the intercept.
This corresponds to a test of the null hypothesis that the grand mean
equals 0 against the alternative that the grand mean isn't 0. The very
significant test result indicates that there is very strong evidence
that the grand mean needs to be included in the model. (The null
hypothesis corresponds to not needing a grand mean term in the model.)
Often, people don't bother to do a test about the grand mean when
performing an ANOVA.
A probit plot of the pooled residuals suggests that assuming an
approximately normal error term distribution is a decent assumption.
A plot of the residuals against the predicted values shows that an
assumption of homoscedasticity may be okay as well.
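As with the alfalfa example, for those not using SPSS here is a small sketch
in Python of the full factorial analysis. The column names area, shaking, and
light follow the description above, but the file name soybean.csv is just my
assumption about how the data from the S&W CD might be saved.

# Two-factor ANOVA: full factorial model (with interaction) and, for
# comparison, the additive (main effects only) model mentioned above.
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

df = pd.read_csv("soybean.csv")                      # assumed file name/format

full = smf.ols("area ~ C(shaking) * C(light)", data=df).fit()
print(anova_lm(full))                                # includes the interaction test

additive = smf.ols("area ~ C(shaking) + C(light)", data=df).fit()
print(anova_lm(additive))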