Some Comments about Chapter 11 of Samuels & Witmer
Section 11.1
- (p. 465, The problem of multiple comparisons)
One way to work around the multiple comparison problem is to make a
simple adjustment for the fact that doing all of the two-sample
comparisons makes the probability of at least one type I error
a value much greater than the type I error rate for each individual
test. Suppose that we have I = 5 groups, so that
C(5,2) = 10 pairwise comparisons can be considered.
If each of the 10 pairwise tests is done at level 0.005, then Boole's
inequality (a probability fact that is easy to understand) gives us that
the overall chance of at least one type I error from the 10 tests cannot
be larger than 10*0.005 = 0.05. In general, for the overall probability
of at least one type I error to be limited to no more than 0.05, we
can do each of the
C(I,2) pairwise tests at level
0.05/C(I,2). Such a workaround is not the
best way to go about things when the variances can be assumed to be
equal, since it's a conservative approach, and methods introduced in Ch.
11 will be more powerful. But if the variances differ greatly, then the
Ch. 11 methods aren't so good, and using the scheme based on Boole's
inequality with Welch's statistic may be an attractive option. (There
are some methods which aren't as conservative that allow for
heteroscedasticity (unequal variances), but such methods aren't commonly
used.)
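To make the mechanics concrete, here is a small sketch in Python (not from
S&W --- the group values below are made up just for illustration) of doing
all of the pairwise comparisons with Welch's statistic, each at level
0.05/C(I,2):

# Bonferroni-style workaround based on Boole's inequality: do each of the
# C(I,2) pairwise Welch tests at level 0.05 / C(I,2).  The data are made up.
from itertools import combinations
from math import comb
from scipy import stats

groups = {
    "A": [4.1, 3.8, 4.4, 4.0, 3.9],
    "B": [4.6, 4.9, 4.5, 5.1, 4.7],
    "C": [3.9, 4.2, 4.0, 4.3, 4.1],
}
I = len(groups)
alpha_each = 0.05 / comb(I, 2)   # per-comparison level

for (name1, x1), (name2, x2) in combinations(groups.items(), 2):
    t, p = stats.ttest_ind(x1, x2, equal_var=False)   # Welch's statistic
    print(name1, "vs", name2, ": p =", round(p, 4),
          "(reject)" if p < alpha_each else "(do not reject)")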
- (p. 465, Estimation of the standard deviation)
If the variances can be assumed to be equal, which is not always a
realistic assumption, then pooling the information from all of the
samples is a good idea. But if the variances differ enough, methods
based on an assumption of homoscedasticity, and involving a pooled
estimate of the (assumed common) standard deviation, can perform badly. If the
sample sizes are equal, or not too different, then the type I error rate
isn't affected so much by heteroscedasticity, but the power
characteristics of the test can be screwy. If the sample sizes differ
appreciably, the type I error rate of the commonly used testing
procedure can be very far from the nominal level --- resulting in either
an extremely anticonservative test, or in a test having rather poor
power.
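For reference, the pooled estimate referred to above weights each sample
variance by its degrees of freedom:

    s_{pooled}^2 = \frac{(n_1 - 1) s_1^2 + \cdots + (n_I - 1) s_I^2}{N - I},

where N = n_1 + ... + n_I is the combined sample size. This is the same as
the MS(within) (aka MSE) of Sec. 11.2.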
- (p. 467, 1st paragraph) This paragraph explains why the phrase
ANalysis Of VAriance is used for a test about distribution means.
Section 11.2
- (p. 468) Rather than use n* for the total number
of observations (aka combined sample size), I might use the commonly
used N. (K is often used, instead of I, for the
number of groups. I'll try to use I, but may slip and use
K at times.) The grand mean is sometimes called the mean
of the combined sample.
- (p. 469 & p. 472) I've long been of the opinion that it isn't
worthwhile to
try to understand why the phrase degrees of freedom is used, and
why the specific df values are what they are.
- (p. 469, 1st gray box) The expression given is typically called
the sum of squares due to error and denoted SSE. I'll use SSE
instead of writing SS(within).
G&H (Grafen & Hails book) use SSE.
- (p. 470, 1st gray box) The expression given is typically called
the mean squares due to error and denoted MSE. I'll use MSE
instead of writing MS(within).
G&H (Grafen & Hails book) use EMS instead of MSE.
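For reference, in symbols (with \bar{y}_{i \cdot} denoting the ith sample
mean and N the combined sample size) these two quantities are

    SS(within) = SSE = \sum_{i=1}^{I} \sum_{j=1}^{n_i} (y_{ij} - \bar{y}_{i \cdot})^2
    and
    MS(within) = MSE = \frac{SSE}{N - I}.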
Section 11.3
- (pp. 476-477) Some books use alpha instead of tau in
the model statement (and assume the reader can keep this use of
alpha different from the use of alpha regarding the size
or level of the test). Whether tau or alpha is used in the
model statement, the term is often referred to as the treatment
effect (where S&W use "effect of group"). In this ANOVA setting,
this use of treatment effect gives how the mean of the ith
treatment distribution differs from the "grand population mean" (which
is somewhat of a nebulous thing). Some books use e instead of
epsilon for the error term, and I think it's good to do so, to
follow a convention of using Greek letters for constants (often of
unknown value, but not variable) and Roman letters for random variables.
(Note: In model statements and formulas for test statistics and
confidence intervals, lower case Roman letters are sometimes used in the
ANOVA setting even when the letters refer to the random variables and not
their observed values. I suspect that this practice is due to a desire
to make the formulas appear less "busy"/cluttered.) The term error
term is perhaps unfortunate since it makes some people assume that
it refers to measurement error. It really captures all sources of
variation about the distribution mean, and in most settings the
variation is mostly due to differences among individuals in the
population as opposed to measurement error.
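Whichever letters are used, the model statement being discussed has the form

    y_{ij} = \mu + \tau_i + \epsilon_{ij},   i = 1, ..., I,   j = 1, ..., n_i,

where \mu is the grand population mean, \tau_i is the treatment effect for
the ith group, and the \epsilon_{ij} are iid error terms (typically assumed
to be, at least approximately, N(0, \sigma^2)).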
Section 11.4
- (p. 479) Note that Figure 11.6 illustrates the critical
value. Most books put the two df values in the subscript along with the
upper-tail probability.
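Software can supply such critical values; here is a small sketch in Python
(the upper-tail probability and df values are just made up for illustration):

# F critical value written F_{0.05; df1, df2} in the subscript style above.
from scipy import stats

df_num, df_den = 4, 20                       # made-up df values
crit = stats.f.ppf(1 - 0.05, df_num, df_den)
print(crit)                                  # value cut off by upper-tail area 0.05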
Section 11.5
My comments here will pertain to the use of SPSS in addition to
addressing some of what is in S&W.
Basically, for the usual one-way ANOVA F test, the key
assumptions are independent samples, normality, and
homoscedasticity.
For independent samples, we can obtain random samples,
independently, from I different distributions/populations, or we
can start with one randomly selected (or sometimes for convenience, they
are just typical (of the population)) group of experimental units, and
then randomly allocate these to the I different "treatment"
groups. But we cannot use a randomized block scheme. (If there is a
good way to create meaningful blocks of experimental units, doing so is
generally a good idea. But if this is done, we no longer can use the
one-way ANOVA F test and many related procedures that assume
independent samples. (Randomly assigning the units in each block to the
I different treatment groups gives us a two-way ANOVA design (see
Sec. 11.6 of S&W).))
The assumption of approximate normality is addressed in the first
paragraph on p. 485 and in
Figure 11.7 and
Figure 11.9. If the sample sizes aren't too small, individual
probit plots can be made. (Using SPSS, this will involve copying and
pasting the response variable values into various columns, since to run
the ANOVA, one needs all of the response values together in a single
column.) But if the sample sizes are less than 10, individual probit
plots may not be very useful, and it may be good to examine a probit
plot of the pooled residuals (which are just the residuals
associated with all of the observations (where a residual is the value
of an observation (of the response variable) with the related sample
mean subtracted from it --- the residuals are the deviations
referred to on p. 485)). The residuals are just estimates of the "error
terms" (since an observation minus it's related true mean is the error term for
the observation), and the assumption under consideration here is that
the error terms are iid random variables from a distribution which is
not too nonnormal. So if the sample means are decent estimates of the
distribution means, the pooled residuals should appear to be a random
sample from a nearnormal distribution. (Unfortunately, to obtain the
pooled residuals so that you can plot them, you have to do a bit of
work with SPSS --- getting them isn't as easy as it should be! I'll
explain how to get them and plot them below.)
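(For those who would rather not fight with SPSS, here is a small sketch in
Python of forming the pooled residuals and making a probit (normal) plot of
them --- the group labels and response values below are made up just to show
the mechanics.)

# Pooled residuals: each response value minus its group's sample mean,
# then a normal probability (probit) plot of all of them together.
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

group = np.array(["A"]*5 + ["B"]*5 + ["C"]*5)       # made-up group labels
y = np.array([4.1, 3.8, 4.4, 4.0, 3.9,
              4.6, 4.9, 4.5, 5.1, 4.7,
              3.9, 4.2, 4.0, 4.3, 4.1])             # made-up responses

resid = np.empty_like(y)
for g in np.unique(group):
    mask = group == g
    resid[mask] = y[mask] - y[mask].mean()          # deviation from group mean

stats.probplot(resid, dist="norm", plot=plt)        # points near a line => near-normal
plt.show()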
S&W doesn't contain much information about the robustness of the
one-way F test and related procedures. Basically, it's similar
to what we have for two independent samples and the robustness of
Student's t test and Welch's test. That is, if the distributions
are light-tailed and not too asymmetric, the test works very well. If
the distributions are all skewed in the same direction, and to about the
same degree, and the sample sizes aren't too different, then the
validity of the test isn't too much of a concern, but power can be poor
if the skewness is rather extreme (and one might consider using a
nonparametric procedure, such as the Kruskal-Wallis test). If the
distributions appear to be heavy-tailed and not too asymmetric (unless
the skewness is about the same for all of them), the test is
conservative, and so it is valid, but may have rather poor power (and
again, one might consider using a nonparametric test instead).
Taking a brief detour, given that I referred to the Kruskal-Wallis test
(K-W test) above, let me describe it here, explain when it can be used, and
indicate how to get SPSS to perform the test.
One can think of the K-W test in two ways: as an extension of the
two-sample Wilcoxon rank sum test, and as a rank analog of the one-way
ANOVA F test. To obtain the K-W test statistic, one ranks the
observations in the combined (pooled) sample from 1 to N, and
replaces the x_ij in the ANOVA F statistic by
their respective ranks, except the I-1 is omitted.
(So the "heart" of the statistic compares the average rank for each
sample to the overall average rank (for the combined sample) in the same
manner that the numerator of the F statistic compares the sample
mean for each sample to the sample mean of the combined sample (the
grand (sample) mean).)
As with the F test, one rejects for
large values of the K-W statistic (often denoted by H). The
asymptotic null sampling distribution is a chi-square distribution with
I-1 df. (The exact sampling distribution isn't a chi-square
distribution, but if the sample sizes are at least 5 or 6, the
upper-tail probability from the chi-square distribution often
approximates the
actual p-value (which is not too easy to obtain (almost all statistical software
uses the chi-square approximation)) fairly well.)
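(For reference, with no ties the statistic can be written

    H = \frac{12}{N(N+1)} \sum_{i=1}^{I} n_i \left( \bar{R}_i - \frac{N+1}{2} \right)^2,

where \bar{R}_i is the average rank for the ith sample and (N+1)/2 is the
overall average rank.)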
Provided that independence is not an issue,
the test is valid for the general I sample problem
(which is often referred to as the general K sample problem),
for which the
null hypothesis of identical distributions is tested against the general
alternative. It can at times be used to do a test about the
distribution means, if one is willing to assume that for any two
distributions that differ, one is stochastically larger than the other.
To do a K-W test using SPSS, put all of the response values (the
combined sample) in one column and put integers from 1 to I in
another column to indicate which sample each response value belongs to.
Then use Analyze > Nonparametric Tests > K Independent Samples,
click in the response and group variables, click on Define Range
to provide the values to
indicate the groups (typically 1 and I (if you want to include
all of the samples)), and click OK.
For the lamb data of
Example 11.2 and
Example 11.9 of S&W, one needs to type in the data, being sure to
create a column having the
integers 1, 2, and 3 to indicate the groups (since group indicators that
are words won't work). Upon running the K-W test, you should obtain a
p-value (which may not be too accurate given that the sample sizes are so
small) of about 0.35.
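(For those wanting a check on the SPSS output, here is a small sketch in
Python --- the values below are placeholders, not the actual lamb data from
S&W, so the resulting p-value won't match the 0.35 referred to above.)

# Kruskal-Wallis test for I independent samples; scipy reports a p-value
# from the chi-square approximation described above.
from scipy import stats

diet1 = [8.0, 9.5, 7.2]             # placeholder values, not the S&W lamb data
diet2 = [11.0, 10.1, 9.8, 12.3]
diet3 = [12.5, 13.0, 11.9, 14.2, 12.8]

H, p = stats.kruskal(diet1, diet2, diet3)
print("H =", round(H, 3), " p-value =", round(p, 3))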
The 2nd paragraph on p. 485 suggests a check of the assumption of
homoscedasticity, but it isn't a very good check, since one should not
expect all of the sample standard deviations to differ by less than a
factor of 2 when the sample sizes are small (not even for sample sizes as
large as 12), even if the true variances are all the same. Fortunately,
even though the assumption cannot be checked very meaningfully, the F
test is fairly robust for validity if the sample sizes are equal, or
nearly equal, unless the variances differ severely. But the test can have
odd power characteristics, even if the sample sizes are equal.
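To see why the factor-of-2 check isn't trustworthy with small samples, here
is a small simulation sketch in Python (my own illustration, with I = 4
groups of size 12): it estimates how often the largest sample standard
deviation exceeds twice the smallest even though the true variances are all
equal.

# Simulate I groups with equal true variances and count how often the
# sample SDs differ by more than a factor of 2.
import numpy as np

rng = np.random.default_rng(1)
I, n, reps = 4, 12, 10_000
count = 0
for _ in range(reps):
    sds = np.array([rng.normal(size=n).std(ddof=1) for _ in range(I)])
    if sds.max() > 2 * sds.min():
        count += 1
print("proportion with a 2-fold SD difference:", count / reps)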
Section 11.6
- (pp. 487-488, Example 11.12) This is a typical randomized
block experiment. The investigation could be done with a one-way
design, but the hope is that by making an adjustment for the distance
from the window and the differing amounts of light, the power to detect
differences among treatments will be increased due to the elimination of
some of the experimental noise that could arise from a one-way design.
Note that in Fig. 11.10, the ordering of the treatments within
each block is due to random assignment (which is why this is called a
randomized block design).
- (p. 488) The model statement allows for more than one observation
per cell, but typically for a randomized block experiment, there
is only one observation per cell (as is the case in Example
11.12), in which case we would have
y_ij instead of
y_ijk.
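In symbols (using tau for the treatment effects and beta for the block
effects --- notation that may differ a bit from S&W's), the model with one
observation per cell is

    y_{ij} = \mu + \tau_i + \beta_j + \epsilon_{ij},   i = 1, ..., I,   j = 1, ..., J,

where J is the number of blocks.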
- (p. 490, Example 11.14) To generate the ANOVA table using
SPSS, one can use
Analyze > General Linear Model > Univariate.
(Note: By using the alfalfa data from the CD that came
with S&W, one can see the format needed for the data.) In the initial
dialog window, click in height as the Dependent Variable,
trt (for treatment) as a Fixed Factor, and block as
a Fixed Factor. (Note: Usually, a blocking variable is a
Random Factor, but for this experiment it's better viewed as a
fixed factor.) Next, go to the Model dialog window, and
select Custom. (One gets suitable output using the default, but
the ANOVA table isn't in the usual form. To make the output similar to
what we can find in S&W, one needs to use Custom.) Then,
separately, click
trt and block into the Model box (if you highlight
them both and click them in together, they go in as a single product term
instead of as two separate factors), and for Build
Terms select Main effects (meaning that interaction terms
aren't going to be included in the model, which is a sensible choice
because there is no way to include interactions in the analysis if there
is only one observation per cell). Click Continue to close the
Model dialog window. (One can use the default settings for the
other items addressed in the dialog window.) Upon opening the Post
Hoc dialog window, I recommend selecting trt for Post Hoc
Tests and checking the Tukey box, which
will lead to the creation of studentized range simultaneous confidence
intervals, which can be used to possibly gain some understanding about
how the treatments differ (if they do differ). Then, to avoid getting too
much output at an early stage in the analysis, I recommend clicking
Continue to close the dialog window.
Upon opening the Save dialog window, I recommend checking off to
save Unstandardized Predicted Values and
Unstandardized Residuals, and then clicking Continue.
(Note: When we get into regression analysis, I'll emphasize other
types of residuals. But for now, I'm sticking with the simple
unstandardized ("raw") residuals.) Finally, click OK to produce
all of the requested output. Unfortunately, the output, while coming as
close as is possible to the book while using SPSS in a relatively
straightforward way, differs from what is found in S&W. However, all of
the needed information shown in the book is there --- just arranged
differently, and mixed in with some additional information which can be
quite useful. In particular, you should be able to find the F
statistic value of 5.47, and SPSS also gives the associated p-value,
which is not shown in Table 11.9.
Also shown in the output from SPSS, but not shown in Table 11.9,
is the F statistic and p-value for a test done to determine if
there is statistically significant evidence for differences between the
blocks. (Even though there isn't strong evidence for differences
between the blocks, it doesn't mean that blocking wasn't a good idea,
and we shouldn't, after the fact, pretend that blocks weren't used
and redo the analysis based on a one-way design.)
In the Multiple Comparisons box in the SPSS output,
one can see that there is statistically significant evidence that the
high treatment differs from the control, but we don't have
strong evidence that the low treatment differs from either of the
other two. The confidence interval for the difference in the means of
the form high - control is about (-1.63, -0.09).
The confidence interval for the difference in the means of
the form low - control is about (-1.41, 0.13) (which includes 0,
indicating that there isn't strong evidence of a difference).
The confidence interval for the difference in the means of
the form high - low is about (-0.99, 0.55) (which includes 0,
indicating that there isn't strong evidence of a difference).
The smallest value in the Sig. column of the
Multiple Comparisons output, 0.031, is the p-value for a test for
differences among treatments based on the Tukey HSD procedure. Note
that it is very close to the p-value of 0.032 which resulted from the
standard F test. (This is often the case when there are just 3
treatment groups. With 4 or more treatment groups, it can be that the
Tukey p-value is appreciably greater than or appreciably less than the
p-value from the F test --- it depends on how the treatment
effects are situated, with the Tukey method being more powerful if a
smallish proportion of the treatment effects are different from the bulk
of the treatment effects, which are the same or tightly clustered.)
A probit plot of the residuals indicates the possibility of a heavy-tailed
error term distribution. A plot of the residuals against the predicted
values isn't very informative --- there are just too few observations
from the experiment.
(See
parts (e) and (f) of Problem 78 on the
homework web page for instructions
for making such plots with SPSS.)
Due to the possibility of a heavy-tailed error term distribution, one
might want to try Friedman's test (a nonparametric test that can be used
with a randomized block design --- but not covered in S&W). I'll
explain in class how to do an approximate version of Friedman's test
using SPSS, and how an exact p-value can sometimes be obtained using
tables.
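For those not using SPSS, here is a small sketch in Python of the same sort
of analysis. The column names height, trt, and block follow the description
above, but the file name alfalfa.csv is just my assumption about how the data
from the S&W CD might be saved, so treat this as a sketch rather than
something to run verbatim. (Note that the Tukey intervals below are based on
a one-way comparison rather than the blocked model's MSE, so they will differ
somewhat from SPSS's GLM post hoc output.)

# Randomized block analysis: additive (main effects only) model, Tukey HSD
# simultaneous intervals for the treatments, and Friedman's test.
import pandas as pd
from scipy import stats
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm
from statsmodels.stats.multicomp import pairwise_tukeyhsd

df = pd.read_csv("alfalfa.csv")                     # assumed file name/format

fit = smf.ols("height ~ C(trt) + C(block)", data=df).fit()
print(anova_lm(fit))                                # ANOVA table with F statistics

print(pairwise_tukeyhsd(df["height"], df["trt"]))   # Tukey HSD intervals (one-way MSE)

wide = df.pivot(index="block", columns="trt", values="height")
print(stats.friedmanchisquare(*[wide[c] for c in wide.columns]))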
- (pp. 490-493, Example 11.15 and Example 11.16)
These examples pertain to a two-factor experiment (with both factors
being fixed effects factors). Here, an additive model is assumed
(and in this case it turns out to be a good assumption), but often, when
there is more than one observation per cell (here there are only 4
cells, and 13 observations per cell), interactions will be allowed
(which gets us away from the simple, but sometimes unrealistic, additive model).
- (pp. 495-496, Example 11.19)
Here the same experiment is considered as is considered in the examples
referred to just above. Only now the full model, including an
interaction term, is analyzed. This is typical --- one usually allows
for interactions unless there is some good reason to believe that they
don't exist. To generate the results of the ANOVA table (Table
11.14) using SPSS, we can use
Analyze > General Linear Model > Univariate.
(Note: By using the soybean data from the CD that came
with S&W, one can see the format needed for the data.) In the initial
dialog window, click in area as the Dependent Variable,
and both shaking and
light as Fixed Factors.
Next, go to the Model dialog window, and
select Full factorial.
Then click
Continue to close the
Model dialog window. (One should use the default settings for the
other items addressed in the dialog window.) There is no need to open
the Post
Hoc dialog window, since there are only two levels for each of the
factors (and so if the factor is significant, one has evidence that the
two levels have statistically significantly different treatment effects).
Upon opening the Save dialog window, I recommend checking off to
save Unstandardized Predicted Values and
Unstandardized Residuals, and then clicking Continue.
Finally, click OK to produce
all of the requested output.
You should be able to find the values of the various F
statistics, and SPSS also gives the associated p-values,
which are not shown in Table 11.14.
It can be noted that both factors are highly significant, with SPSS
reporting p-values of 0.000. I don't like to report a p-value as being
zero, since someone might take that to mean that it's absolutely
impossible for the observed data values to have arisen if the null
hypothesis is true. So when SPSS reports 0.000, I'd just write
p-value < 0.0005 when reporting the results.
It can also be noted that the p-value for the test for interaction is
about 0.865, indicating that there is no statistically significant
evidence for interactions. There may be mild interactions, but one
might be tempted to use an additive model to describe the phenomenon.
(Note: To fit an additive model,
select Custom instead of Full factorial
in the Model dialog window, and just use main effects for the two
fixed factors.)
The output also gives 0.000 for a p-value pertaining to the intercept.
This corresponds to a test of the null hypothesis that the grand mean
equals 0 against the alternative that the grand mean isn't 0. The very
significant test result indicates that there is very strong evidence
that the grand mean needs to be included in the model. (The null
hypothesis corresponds to not needing a grand mean term in the model.)
Often, people don't bother to do a test about the grand mean when
performing an ANOVA.
A probit plot of the pooled residuals suggests that assuming an
approximately normal error term distribution is a decent assumption.
A plot of the residuals against the predicted values shows that an
assumption of homoscedasticity may be okay as well.
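As with the alfalfa example, for those not using SPSS here is a small sketch
in Python of the full factorial analysis. The column names area, shaking, and
light follow the description above, but the file name soybean.csv is just my
assumption about how the data from the S&W CD might be saved.

# Two-factor ANOVA: full factorial model (with interaction) and, for
# comparison, the additive (main effects only) model mentioned above.
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

df = pd.read_csv("soybean.csv")                      # assumed file name/format

full = smf.ols("area ~ C(shaking) * C(light)", data=df).fit()
print(anova_lm(full))                                # includes the interaction test

additive = smf.ols("area ~ C(shaking) + C(light)", data=df).fit()
print(anova_lm(additive))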