Some Comments about Using SPSS for Material Covered in Grafen and Hails


Chapter 4

G&H doesn't print all of the data used in its examples (and of course one wouldn't want to spend a lot of time typing in data even if it were all given in the book). Fortunately, we can go to the web site for the book and download the data --- and it's really easy to do!

First, "fire up" SPSS. Once SPSS is up, go to the G&H web site, and click SPSS on the right side of the web page under Datasets and supplements. Then click Datasets Individually under Datasets, and then click school_childrens'_maths.sav. If you elect to open the file (as opposed to saving it), the data set for the simple multiple regression example on pp. 56-57 will be put into the SPSS data editor.

If you look at a scatterplot of AMA (a measure of mathematical ability) against height, you'll see a pattern that suggests that taller students tend to do better on the math test than shorter students. (See part (e) of Problem 94 on the homework web page for some information about making scatterplots using SPSS.) If you do a regression using AMA as the dependent variable and height as the explanatory variable, you'll get strong evidence that the expected value of AMA is related to height. (See the web page pertaining to Ch. 12 of S&W for information about how to do a (simple) regression using SPSS.) The value of R2 is about 0.96, which indicates that height explains nearly all of the variability in the AMA scores, and a t test seeking evidence that the slope is not 0 results in a p-value less than 0.0005. Furthermore, a plot of the residuals against the predicted values indicates that the linear model provides a pretty good fit, and a probit plot of the residuals suggests a slightly light-tailed error term distribution, which means that least squares regression (as opposed to some sort of robust regression) should work well. As a further check that the fit is good, note from the Residuals Statistics part of the regression output that all of the studentized deleted residuals have a magnitude less than 1.9, which is an indication that the error term distribution does not have heavy tails. (Note: In order to feel good about such an interpretation of the studentized deleted residuals, one should feel good that the fitted model is decent. If the model isn't decent (meaning that the shape of the residual pattern isn't okay), the magnitudes of the studentized deleted residuals don't mean the same thing as they do when the model is decent.) If a correlation analysis is done instead of a regression, the sample correlation coefficient is found to be about 0.98, which is highly significantly different from 0.
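
If you'd rather work from a syntax window than the menus, here is roughly the sort of syntax the Paste button produces for this regression and its diagnostics. (This is just a sketch: I'm assuming the variables in the downloaded file are named ama and height --- check the Data Editor for the actual names, and adjust accordingly.)

* Simple regression of AMA on height, with diagnostic plots and saved residuals.
* NORMPROB(ZRESID) requests a normal probability plot of the standardized residuals,
* and SDRESID saves the studentized deleted residuals mentioned above.
REGRESSION
  /STATISTICS COEFF R ANOVA
  /DEPENDENT ama
  /METHOD=ENTER height
  /SCATTERPLOT=(*ZRESID ,*ZPRED)
  /RESIDUALS NORMPROB(ZRESID)
  /SAVE SDRESID.

To redo the analysis with age as the predictor (as described below), just replace height with age.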

From the simple study of the relationship of height with AMA, it is found that height can be a good predictor of AMA score. But we should be cautious in our interpretation. To gain some understanding as to why we should be cautious, study the relationship between age and AMA in a similar manner. The results are similar, suggesting that age is a good predictor of AMA score. In fact, a closer inspection indicates that age is a better predictor than is height. Using age as the predictor variable, we get that R2 is greater than 0.98, and the sample correlation is greater than 0.99.

To gain some additional insight into the situation with these three variables, look at a scatterplot of height against age, along with a sample correlation coefficient. The scatterplot shows a strong linear trend, and the sample correlation is nearly 0.99.
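
In syntax (again assuming the variable names ama, height, and age), the plot and the correlations can be requested as follows:

* Scatterplot of height against age, and the sample correlations among all three variables.
GRAPH
  /SCATTERPLOT(BIVAR)=age WITH height.
CORRELATIONS
  /VARIABLES=ama height age.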

Now fit a multiple regression model, using both age and height as explanatory (aka predictor) variables for AMA score. To do this using SPSS, start with Analyze > Regression > Linear, click the AMA variable into the Dependent box, and click both the age and the height variables into the Independent(s) box. Then click OK (although you might want to fiddle with things a bit more before doing so --- see the web page pertaining to Ch. 12 of S&W for suggestions). From the Coefficients part of the output, it can be seen that the p-value corresponding to age is less than 0.0005, indicating that even if one adjusts for height, age is still a useful predictor --- there is strong evidence that adding age to a model which already includes height will lead to improved predictions. (Note: The t test corresponding to age is a test of the null hypothesis that the coefficient of age is 0 against the alternative that it's not 0, given that a height term and an intercept term are in the model. We can think of it as a test which compares a model which only includes height (plus an intercept) to a model which includes both height and age (plus an intercept), and a small p-value is an indication that the more complex model is really better --- that adding age to a model which already includes height will lead to improved predictions.) It can be noted that the value of R2 doesn't change a lot in absolute terms, going from about 0.960 for the model just using height, to about 0.983 for the model using both age and height. But another way of thinking about it is that more than half of the unexplained variation that existed after fitting the model having only height as a predictor was eliminated when age was added to the model.
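
The syntax sketch for the two-predictor fit (same assumed variable names) is just the earlier REGRESSION command with both variables entered:

* Multiple regression of AMA on both age and height.
REGRESSION
  /STATISTICS COEFF R ANOVA
  /DEPENDENT ama
  /METHOD=ENTER age height.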

Note that based on the above observations, it would be incorrect to conclude that because just using height alone produced an R2 of about 0.96, and adding age only nudged the value of R2 to about 0.983, height must be the more important explanatory variable. If we consider the model having only age as a predictor, the value of R2 is about 0.983, and adding height to the model really doesn't improve it by any appreciable amount. A t test firms this up: from the Coefficients part of the output, it can be seen that the p-value corresponding to height is about 0.86, indicating that if one adjusts for age, height isn't a useful additional predictor --- there is not strong evidence that adding height to a model which already includes age will lead to improved predictions. (Note: The t test corresponding to height is a test of the null hypothesis that the coefficient of height is 0 against the alternative that it's not 0, given that an age term and an intercept term are in the model. We can think of it as a test which compares a model which only includes age (plus an intercept) to a model which includes both age and height (plus an intercept), and a small p-value is an indication that the more complex model is really better --- that adding height to a model which already includes age will lead to improved predictions. Only, in this case, we didn't get a small p-value.) Note that pp. 60-61 of G&H present another explanation of why it can be concluded that adding age to a model which already includes height leads to an improved model, but adding height to a model which already includes age doesn't lead to any significant improvement.

At this point, although a failure to reject the null hypothesis that the coefficient of height is 0 doesn't mean that the null hypothesis is true, for the sake of simplicity, one might consider presenting a model that uses only age as a predictor, and eliminates height. There are no universally agreed upon rules for whether or not a model should only include statistically significant predictors. In some situations, for example, if one wants to show that a certain treatment is effective, one might want to err on the side of including variables in the model even though they are not statistically significant, so that many variables can be adjusted for when assessing the effect of the treatment of interest. (If one adjusts for many possible influences on the response variable, and the inclusion of a treatment variable is statistically significant, then one can be more comfortable with the interpretation that the treatment directly affects the response.)

Since adding height to a model that already has age in it adds nearly nothing, I'd choose to remove height from the model and just use age as a predictor. Looking back at the model which uses only age as a predictor, one can note that the intercept term (aka constant term) is not statistically significant --- i.e., there isn't strong evidence that an intercept term is needed (there isn't strong evidence that the intercept is not 0). Whether or not to eliminate the intercept term isn't as big of a deal as is whether or not to eliminate a predictor variable, because the inclusion of an intercept doesn't add another variable to the model. In the fitted model having only age as a predictor, the estimated intercept is about 0.2, which is relatively small compared to the magnitudes of the AMA scores which are being modeled. So this might be a case where one would want to remove the nonsignificant intercept term and fit a model for which the mean AMA score is directly proportional to age (as opposed to being a linear function of age). To fit the model without an intercept, one needs to use Analyze > Regression > Linear, click Options (after specifying the model), and then uncheck the Include constant in equation box. Then click Continue, followed by OK. (Note: If you uncheck the box to get rid of the intercept, and then move on to another analysis, you should keep in mind that the box remains unchecked until you go back and check it again to indicate that you want an intercept in the model.) The coefficient for age changed from about 1.968 (for the model including the intercept) to about 1.992 (for the model without the intercept). As a final check that removing the intercept is okay, examine a plot of the residuals against the predicted values. Upon doing this, nothing notable is found. (Note: The value of R2 doesn't have the same meaning in the no intercept model as it does in the standard case of having an intercept in the model.)
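
To fit the no-intercept model with syntax, the /ORIGIN subcommand plays the role of unchecking the Include constant in equation box (variable names assumed, as before):

* Regression of AMA on age through the origin (no intercept term).
REGRESSION
  /ORIGIN
  /STATISTICS COEFF R ANOVA
  /DEPENDENT ama
  /METHOD=ENTER age
  /SCATTERPLOT=(*ZRESID ,*ZPRED).

(Unlike the dialog box, each REGRESSION command includes an intercept unless /ORIGIN is given, so there is no setting to remember to undo for later analyses.)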

Moving on to another situation, consider the example on pp. 58-59 of G&H. (The data set is saplings.sav.) Here, two things, water regime and initial height, are used to model/predict final height. Since water regime is categorical instead of numerical, we cannot use Analyze > Regression > Linear, and click in water as an independent (aka explanatory) variable. If we did that, SPSS would treat water as a numerical variable, so that a water value of 2 would mean twice as much water as a water value of 1, and that's not what the values for the water regimes are supposed to indicate. With some statistical software, a workaround would be to create 3 new indicator variables to account for the 4 water regimes, but with SPSS we can do things more easily using Analyze > General Linear Model > Univariate. Just click the final height variable in as the Dependent Variable, water as a Fixed Factor, and initial height as a Covariate. Upon clicking on Model, specify that you want a Custom model using only Main Effects, and click the two predictor variables into the Model box. Then click Continue, followed by OK. The output produced is similar to some of what is in Box 4.3(b) on p. 59 of G&H. (The sums of squares produced by SPSS match the Adj SS column in G&H --- I'm not going to go into the details about what the difference between the Adj SS and the Seq SS is. At least one of the numbers in G&H doesn't match SPSS in the 4th significant digit --- something I'll attribute to numerical inaccuracy at some point.) Note that now we have an F test for each of the predictors, as opposed to a t test. The F test corresponding to initial height can be viewed as a t test, as can any F test for which the numerator df is 1. The interpretation of the test is the same as the t test for a predictor in a regression model. (Really, we're doing a regression when using a general linear model.) A small p-value means that initial height leads to a (statistically) significantly better model. A large p-value would mean that there is no evidence that a model with both initial height and water is really better than a model with just water as a predictor. The F test for water, which has 4 levels, is similar to an ANOVA F test for a factor having 4 levels. The small p-value means that water regime is needed to explain the variation in the final heights. In a case like this, involving a categorical fixed factor, and a second predictor (often of secondary interest) which is numerical, we can say that we are doing an analysis of covariance. (The numerical variable can be referred to as a covariate.) In a case like this with a categorical fixed factor, one generally keeps the intercept term in the model no matter what the p-value for it is. Note that for the case under consideration, all of the p-values are rather small, and the R2 value is extremely high (0.998). Finally, I'll note that an alternative analysis would be to model the change in height with a one-way ANOVA model, using water regime as a fixed effects factor.
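
The corresponding syntax sketch, assuming the variables in saplings.sav are named final, initial, and water (again, check the Data Editor for the actual names):

* Analysis of covariance for the saplings example: water regime as a fixed factor,
* initial height as a covariate, main effects only (matching the Custom model above).
UNIANOVA final BY water WITH initial
  /PRINT=PARAMETER
  /DESIGN=water initial.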


Chapter 6

Ch. 4 contained one example of an analysis of covariance (ANCOVA). Another example is covered on pp. 99-102 (but once again, it isn't referred to as an analysis of covariance in G&H). This time, we can add a few new investigations to the analysis. (The data set is on the G&H web site as fats.sav.)

To start with, we can note that Fig. 6.3 shows that the females tend to have higher percentages of body fat than do the men. Provided that we feel comfortable treating the two samples as random samples from some populations of interest, we can do either a two-sample Student's t test or Welch's test to see if there is statistically significant evidence that the mean pct. body fat for females differs from the mean pct. body fat for males. For each test, SPSS can be used to produce a p-value of about 0.003, and we can conclude that the mean for females exceeds the mean for males.
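
In syntax, both tests come from a single T-TEST command. (I'm assuming the grouping variable in fats.sav is named sex, coded 1 for females and 2 for males as noted below, and that the response variable is named fat; adjust the names to match the file.)

* Independent-samples t test comparing mean pct body fat for the two sexes.
* The output gives both the pooled-variance (Student's) test and the
* unequal-variances (Welch's) test.
T-TEST GROUPS=sex(1 2)
  /VARIABLES=fat.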

Fig. 6.2 suggests that pct. body fat increases with weight. If we adjust for weight, and determine if there is a sex difference once weight is adjusted for, the evidence will be even stronger that males and females differ. (Since the males are on the whole heavier than the females, if there were no difference due to sex, we would expect the males to have larger pct. body fat measurements, given that pct. body fat increases with weight.)

Using Analyze > General Linear Model > Univariate, with pct body fat as the dependent variable, sex as a fixed factor, and weight as a covariate, and using a custom model to include only main effects, gives us an F statistic value of 68.73 and a reported p-value of 0.000 (which I would express using p-value < 0.0005), matching Box 6.2 on p. 100. In order to get estimates of the model parameters, one needs to have gone to Options and checked the Parameter estimates box before clicking OK to execute everything. The estimate of the intercept is 9.059, and the estimate of the coefficient for the female indicator (which I'll explain in class) is 7.904. (Note that the females are coded with 1s and the males with 2s in the G&H data set obtained from their web site.) The coefficient for weight (the slope estimate for the regression lines in Fig. 6.4 on p. 101) is 0.217 (which gives us that the mean pct body fat increases by about 0.22% for each additional kg (or about 2.2% for every 10 kg)). The intercept for the male line is 9.059, and the intercept for the female line is 9.059 + 7.904 = 16.963. (These match, ignoring a bit of accumulated rounding error and/or numerical imprecision, the information in Box 6.2 on p. 100, where G&H use the ANOVA scheme of giving a grand mean and nonzero deviations from the grand mean for both males and females. For the males we have 13.010 - 3.952 = 9.058, and for the females we have 13.010 + 3.952 = 16.962.) The value of 7.904 supplied by SPSS is the difference in the intercepts (the amount by which the female intercept is greater than the male intercept). This is not equal to the difference in the sample means for the pct body fat of males and females, which is about 4.37. The value of 7.904 represents the difference in mean pct body fat for females and males of the same weight.
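
A syntax sketch for this fit (with the assumed variable names fat, sex, and weight):

* ANCOVA for pct body fat: sex as a fixed factor, weight as a covariate, main effects only.
* /PRINT=PARAMETER requests the parameter estimates discussed above.
UNIANOVA fat BY sex WITH weight
  /PRINT=PARAMETER
  /DESIGN=sex weight.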

Note that the model fit assumes that the change, per kg of weight, in mean pct body fat is the same for both males and females --- that the two regression lines are parallel. Upon an inspection of Fig. 6.2 on p. 100, one might guess that this is a questionable assumption. To check on this assumption, we can include an interaction term in the model. Such a term will allow the male and female lines to have different slopes. Unfortunately, to fit such a model using SPSS, we cannot simply do as we did before (using General Linear Model > Univariate), only this time requesting that an interaction term be included. But we can "trick" SPSS into giving us what we want by creating a new variable in the Data Editor, having it be the weight values for females, but giving each male the value of 0 for this new variable. (This can be done by copying (not cutting) and pasting the female weights into a new column of the data editor, after entering the value of 0 in the first 10 rows of the column for the males.) Then just do everything as before, only this time adding the new variable as a second covariate. Now the p-value for the sex difference (which pertains to the intercept) is nonsignificant (about 0.32), but the p-value for the new variable is smallish, being about 0.035. The coefficient of weight is 0.186, and the coefficient of the new (interaction) variable is 0.217. So the overall slope for the males is 0.186, and the slope for the females is 0.186 + 0.217 = 0.403 (more than twice the value for the males). To summarize, the data indicates that it is wrong to try to force a model having a common slope for males and females. In a sense we just need to consider two separate regressions, although we don't have statistically significant evidence that the intercepts differ. To fit a model having a common intercept, but allowing different slopes, we can keep the two covariates in the model, but omit the sex fixed effects factor.
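
The new variable and the refit can also be done with syntax. Here femwt is just a made-up name for the females-only weight variable, with sex = 1 indicating the females, as noted above (and fat and weight are still assumed names):

* Create a variable equal to weight for the females (sex = 1) and 0 for the males.
COMPUTE femwt = 0.
IF (sex = 1) femwt = weight.
EXECUTE.
* Refit with femwt as a second covariate; it plays the role of a sex-by-weight interaction term.
UNIANOVA fat BY sex WITH weight femwt
  /PRINT=PARAMETER
  /DESIGN=sex weight femwt.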

An alternative way to do the investigation of different slopes is to use Analyze > Mixed Models > Linear. (If there are more than 2 groups of the fixed factor, this is definitely the way to go.) At the first screen, click on Continue. Then click the pct body fat variable into the Dependent Variable box, sex into the Factor(s) box, and weight into the Covariate(s) box. Then click on Fixed in order to be able to specify the model. Highlight sex by clicking on it, and click Add to put it into the Model box. Highlight weight by clicking on it, and click Add to put it into the Model box. Finally, highlight both sex and weight together, and click Add to put a sex*weight interaction variable/term into the Model box. Select Interaction using the drop-down menu in the middle of the window (although it isn't completely clear to me that this is necessary --- but it is one thing that works, and so I recommend it ... with SPSS it's hard to determine exactly what some of the features do). Then click Continue. Upon clicking on Statistics, check the Parameter estimates box, and then Continue. Upon clicking on Save, under Predicted Values & Residuals, check the boxes for Predicted Values and Residuals, and then click Continue, and finally click OK. One should get the same results as above, where I used General Linear Model > Univariate. In the Estimates of Fixed Effects part of the output, the estimate of about 0.217 for [sex=1]*weight is the additional amount of slope (corresponding to pct body fat per kg of weight) for the females, and it's significant at level 0.035 (same as before). A plot of the residuals against the predicted values looks in line with the assumption of homoscedasticity, and a probit plot of the residuals shows nothing too worrisome --- there is one slightly large negative residual, but it's nothing to worry about (although it could be interesting to determine which subject it corresponds to).
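
The MIXED procedure can be driven by syntax as well. Here is a sketch (same assumed variable names; with no random effects specified, this is just another route to the fixed-effects model above):

* Interaction model fit through the MIXED procedure.
* /PRINT=SOLUTION gives the Estimates of Fixed Effects table, and /SAVE adds the
* predicted values and residuals to the Data Editor for the plots described above.
MIXED fat BY sex WITH weight
  /FIXED=sex weight sex*weight
  /PRINT=SOLUTION
  /SAVE=PRED RESID.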


Here are some additional comments about analysis of covariance. In a typical analysis of covariance setting, there is one fixed factor (maybe diet, having three different levels, or sex, having two levels) and one (or more) covariates (where the covariate is a numerical variable, such as weight, height, or age). (Aside: If there are only a small number of different values of the numerical variable, it may be that one can address the data through a two-way ANOVA model. But if, typically, each case in the data set has a unique value for the numerical variable, an analysis of covariance (or regression) model is needed.) With an additive model, the mean of the response/dependent variable (the "y") is modeled as being a constant (referred to as the intercept in SPSS), plus a possible adjustment due to the fixed factor variable, plus a possible adjustment due to the numerical variable, of the form beta*x, where x is the covariate variable, and beta is its slope parameter (which corresponds to the change in the mean of the response variable per unit change of x). To fit an additive model, I recommend that you begin as is indicated in the immediately preceding paragraph, only upon clicking Fixed, select Main Effects from the little menu in the middle of the two boxes, and just add the two variables (the fixed factor variable and the covariate), one at a time, to the Model box. If both the fixed factor and the covariate produce small p-values (say, less than or equal to 0.05), then there is evidence that the model should at least include both variables, and the fitted additive model corresponds to two or more parallel lines (if we plot the estimated mean response against x). The coefficient of x is the slope of the lines, and the estimates corresponding to the levels of the fixed factor, along with the estimated intercept, give the intercepts for the various lines. (Note: To obtain the estimates, click Statistics and then click Parameter estimates before clicking OK.) If the fixed factor has three levels (say three different diets) which are coded as 1, 2, and 3 in the data set, then the estimated intercept (for the whole model) is the fitted intercept for the third group. (The output indicates that the estimated adjustment for the third group is 0.) The output will give estimates corresponding to the other groups of the fixed effects factor. These are adjustments to the overall intercept needed to give the intercepts for these other groups. For example, if the (overall/baseline) intercept is estimated to be 0.31, and the estimated coefficient corresponding to the first of the fixed factor groups is 0.12, then the estimated intercept for the first group is 0.31 + 0.12 = 0.43.
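
As a generic syntax sketch of the additive model (with hypothetical names y for the response, group for the fixed factor coded 1, 2, and 3, and x for the covariate):

* Additive ANCOVA: fixed factor plus covariate, no interaction (two or more parallel lines).
* In the Estimates of Fixed Effects table, the coefficient of x is the common slope, the
* intercept is the fitted intercept for the last-coded group, and the [group=1] and [group=2]
* estimates are the adjustments to that intercept for the other groups.
MIXED y BY group WITH x
  /FIXED=group x
  /PRINT=SOLUTION.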

If the p-value for the fixed factor isn't small, then we don't have strong evidence that we need any adjustments to the overall intercept to account for differences between the groups. That is, there isn't evidence that a simple regression model isn't adequate. (In a simple regression model we just have a single intercept.) If the p-value for the covariate isn't small, then we don't have strong evidence that we need any adjustment due to differing values of the covariate. That is, there isn't evidence that a model which just allows for differences in the mean due to the level of the fixed factor isn't adequate --- there is no strong evidence that the covariate is needed to model the mean response, and the model reduces to the one used for a one-way ANOVA. If neither the p-value for the fixed factor nor the p-value for the covariate is small, then we don't have evidence that either of these two variables is needed to model the mean response --- it may be the case that we just have a single distribution (with a single value for the mean, of course) underlying all of the observations of the response variable.

If both the p-value for the fixed factor and the p-value for the covariate are small, then we have a model which allows for different intercepts, but there is only a single slope to adjust the mean response for the value of the covariate. In some cases it may be that each group has its own slope. (The model would then correspond to three separate simple linear regression models --- one for each group.) We can incorporate different slopes if we move away from the additive model to one that allows for interaction of the fixed factor and the covariate. (If the covariate term adjusts the mean response for differences in the covariate values, with the slope associated with the covariate providing the rate of change of the mean response, then the interaction of the fixed factor and the covariate allows the covariate value to affect the mean response differently for the different levels of the fixed factor.) If we try the model which has the interaction incorporated, and the p-value which corresponds to the interaction is large, then we can reduce back to the additive model. If the p-value is small, then we have evidence that the more complex model is superior. In this case, to get the various slope estimates, we look at the estimated parameters from the more complex model. The coefficient given for the covariate is the estimated slope for the baseline group (the group coded with 3 if there are three groups coded with 1, 2, and 3). For the other slopes, we need to add an adjustment to the baseline slope. For example, for the slope of the 2nd group, we add the coefficient corresponding to the interaction for the 2nd group to the baseline slope. (The interaction coefficient for the 2nd group represents the difference between the slope of the 2nd group and the slope of the baseline group.)
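
Continuing the same hypothetical sketch, the interaction model just adds a group*x term:

* ANCOVA with a group-by-covariate interaction, allowing a separate slope for each group.
* The coefficient of x is the slope for the baseline (last-coded) group; each [group=g]*x
* coefficient is the amount added to that baseline slope to get group g's slope.
MIXED y BY group WITH x
  /FIXED=group x group*x
  /PRINT=SOLUTION.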

One might wonder: if we wind up fitting three separate lines (and, with regard to estimating the mean response, an analysis of covariance with interaction allowed for is indeed equivalent to fitting three separate simple regression models), why not just start by doing that? Why bother with the analysis of covariance model? Well, by using the analysis of covariance approach we have an easy way to determine if there is strong evidence that we need separate intercepts and/or separate slopes, and these may be the important issues to be addressed, since it's often the case that we want to know if there are any differences in the mean response for the different groups. And if we are just going to use one slope, the analysis of covariance method uses all of the data to estimate that single slope, and it uses all of the data to estimate the differences in the intercepts (which are the differences in the mean response, for the different groups, for observations taken at the same value of the covariate). (If one just tried to estimate these differences in the intercepts by using observations from different groups made at common values of the covariate, then there may not be much data to use, and in a lot of cases, even if there was some data of the type necessary, one could still be ignoring valuable information that could be obtained from the rest of the data.)

In all cases, when finalizing an analysis of covariance model, we should carefully examine the residuals. A funnel pattern suggests heteroscedasticity, and one may need to transform the response variable in order to achieve (approximate) homoscedasticity. Curvature in the residual plot (residuals plotted against the predicted values) suggests that just having the covariate in the model in a linear way isn't adequate, and one may need to transform the covariate or add higher order terms (e.g., instead of just using x as a single covariate, one may need to use both x and x2).
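
To make these checks with syntax, save the predicted values and residuals when fitting the model (e.g., with the /SAVE subcommand shown earlier) and then plot one against the other. The saved-variable names below (PRED_1 and RESID_1) are just assumptions --- the actual names SPSS uses depend on the procedure and on how many variables have already been saved, so check the Data Editor. A squared covariate, if needed, can be created with COMPUTE (x2 is a made-up name):

* Residuals plotted against the predicted values, using variables saved by the model fit.
GRAPH
  /SCATTERPLOT(BIVAR)=PRED_1 WITH RESID_1.
* If the plot shows curvature, create a squared covariate and add it to the model.
COMPUTE x2 = x*x.
EXECUTE.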