Some Comments about Chapter 12 of Samuels & Witmer


Section 12.1

Throughout this section, and the whole chapter, S&W frequently has an upper-case X where I think a lower-case x would be more appropriate. For correlation studies, we observe (X, Y) from a joint distribution (for X and Y). So we use (xi, yi) to denote the ith observed pair, and view (Xi, Yi) as the associated random variable (that we observe to get the data values). But for regression studies, x is typically thought of as a design variable which is controlled by the experimenter. Similar to having treatment groups in ANOVA which are controlled by the experimenter, in regression, the experimenter makes observations of Y corresponding to values of x selected by the experimenter. Now, a lot of times the data is observational in nature, and we really do make regression models based on observations of (X, Y) pairs. But when considering the model, we take y to be the observed value of a random variable Y which is associated with a fixed value of x. That is, given a certain value of x, we assume a model that specifies that Y has a distribution which depends upon that particular value of x, and we consider the observed value, y, to be one of many values that could have been observed with that value of x. (Note: The cases of the xi being controlled by the experimenter, and of the xi being observed from a bivariate distribution, are referred to on p. 527.)

What I feel are mistakes in the use of upper case versus lower case are too numerous to comment on individually, so I won't try to note them all.
  1. (pp. 525-526, Example 12.1) Here, x should be in lower case --- a classic case of a design variable, since the dose is controlled by the experimenter, and not associated with observations from a joint distribution. The Yi can be viewed as random variables --- clearly, given a fixed value of x, there is a distribution associated with the response variable, Y. However, on p. 526, it should be lower-case y, since there the yi are actual observed values, and not random variables which will eventually be observed.
  2. (p. 527, 1st new paragraph) The point is that the mean of Y may be reasonably modeled as a function of x. In the regression model that will be introduced later, there is an error term which accounts for the plotted points not all lying on a perfectly straight line, and so we shouldn't expect to see a sharp straight-line pattern in the scatter plot. But we might imagine that if we could plot the mean of Y given a fixed value of x, which will be denoted by E(Y | x), against x, we'd get a straight-line pattern. (A small simulation sketch following this list illustrates the idea.)
  3. (pp. 528-529, Example 12.3) The figures on pp. 528-529 suggest a curved relationship between E(Y | x) and x, which makes sense --- weight would tend to increase linearly with length only if the thickness of snakes were the same for all lengths.
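
As a concrete illustration of comment 2 above, here is a small simulation sketch (in Python with numpy; the parameter values are made up and have nothing to do with S&W's examples). For each fixed value of x, Y is drawn from a distribution whose mean, E(Y | x), is a straight-line function of x, and the observed y is just one of many values that could have occurred at that x.

    import numpy as np

    rng = np.random.default_rng(1)

    # Design values of x, chosen by the experimenter (these are not random).
    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

    # Hypothetical parameter values, for illustration only.
    beta0, beta1, sigma = 2.0, 0.5, 0.3

    # Given a fixed x, Y is random: its mean is E(Y | x) = beta0 + beta1 * x,
    # and the observed y is one of many values that could have been observed.
    mean_y_given_x = beta0 + beta1 * x
    y = mean_y_given_x + rng.normal(0.0, sigma, size=x.size)

    print(mean_y_given_x)  # the straight-line pattern lies in the means ...
    print(y)               # ... not in the observed y values themselves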

Section 12.2

  1. (p. 529, Equation of the Regression Line) While the equation given is that of a line, it's not the model for the phenomenon. If we model Y, there should be an additive error term to allow for variability about the straight line which is used for E(Y | x) --- that is, E(Y | x) can be expressed as a linear equation, but not Y. I'm going to use the Greek letter beta for the unknown parameters which specify the line, and then put "hats" on them to represent the estimates/estimators of these parameters. (I like to use Greek letters for constants, particularly parameters and population measures, and I like to use Roman letters for random variables, with lower case used to designate the observed values of random variables (the data values in a sample).)
  2. (p. 530, Plotting Tip) With SPSS, Graphs > Scatter can be used to create a plot like Figure 12.5 without the regression line shown. I'm not sure how to get the fitted line in the scatter plot.
  3. (p. 531) Although the number of points above the fitted line need not equal the number of points below the fitted line, the average of the residuals is equal to 0. Also note that when judging the nearness of the points to the line, it is the vertical distance that is considered (this distance corresponds to the magnitude of the residual). The perpendicular distances from the points to the line don't make sense --- instead of comparing an observed value of Y corresponding to some value of x to the estimate of E(Y | x) at that same x, a perpendicular distance compares an observed value and a mean belonging to two different values of x.
  4. (p. 532, Least Squares Criterion) This criterion leads to the least squares estimates on p. 529 --- all it takes is some relatively simple 3rd semester calculus. (A small computational sketch following this list shows the resulting estimates, along with the residual standard deviation from the next comment.) The method of least squares makes sense if the error terms (not introduced yet in this chapter) are approximately normally distributed, but it is not the best way to get estimates if the error term distribution has heavy tails. For heavy-tailed error term distributions, robust regression methods produce superior estimates --- but these robust methods aren't available on most statistical software.
  5. (p. 533, Residual Standard Deviation) S&W puts the "cart before the horse" here --- the formula in the gray box is an estimate for something that hasn't been introduced yet. (It's an estimate for the standard deviation of the error term distribution.)
  6. (p. 533, Example 12.6) The last sentence of the example isn't very precise --- it's not clear exactly what is meant by "tends to be off by" ... but the paragraph that starts at the bottom of the page firms things up a bit.
  7. (p. 535) Output similar to what is shown on this page can be produced using SPSS via Analyze > Regression > Linear, clicking the y variable into the Dependent box, clicking the x variable into the Independent box, and clicking OK.
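
To go with comments 3, 4, and 5 above, here is a small computational sketch (in Python with numpy, using made-up data rather than anything from S&W) of the least squares estimates and the residual standard deviation; it also confirms that the residuals average to 0.

    import numpy as np

    # Made-up (x, y) data for illustration; any paired data would do.
    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
    y = np.array([2.3, 2.9, 3.6, 3.8, 4.9, 5.2])

    n = x.size
    xbar, ybar = x.mean(), y.mean()

    # Least squares estimates (the values the calculus minimization yields).
    b1 = np.sum((x - xbar) * (y - ybar)) / np.sum((x - xbar) ** 2)  # slope
    b0 = ybar - b1 * xbar                                           # intercept

    # Residuals: vertical distances from the observed y values to the fitted line.
    resid = y - (b0 + b1 * x)
    print(resid.mean())  # the residuals average to 0 (up to rounding error)

    # Residual standard deviation: estimates the SD of the error term distribution.
    s_resid = np.sqrt(np.sum(resid ** 2) / (n - 2))
    print(b0, b1, s_resid)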

Section 12.3

  1. (p. 542) The material in the blue box is important. It should be lower-case x, instead of X, since the model pertains to the distribution of Y given a specific value of x. I like to write E or e for the error term, instead of epsilon, to comply with my convention of (mostly) using Roman letters for random variables (and lower case Roman letters for their observed values) and Greek letters for constants. This model is referred to as the simple linear regression model, where the term "simple" is due to the fact that there is just a single predictor variable, x, and E(Y | x) is just a (simple) linear function of x. Multiple regression models involve more than one predictor (aka explanatory, or independent) variable.
  2. (p. 543, Figure 12.9) This figure is nice in that it shows how the density of Y changes as x changes. Looking at the various densities in the figure, you can hopefully get an appreciation of how if the simple linear regression model holds, the observed values of Y corresponding to assorted values of x will be scattered about the line corresponding to E(Y | x).
  3. (p. 544, Remark) Most books aimed at statisticians don't use the term curvilinear, but the term seems to be used often in statistics books aimed at (and written by) people in the life sciences and social sciences. If E(Y | x) is a linear function of x^2, we still have a linear regression model, and in fact it's a simple linear regression model --- just based on using x^2 as the predictor, instead of x. We could also have a (simple) linear regression model based on log x. When something other than x is used as the predictor, a plot of E(Y | x) against x won't be a straight line, but it's still a linear regression if E(Y | x) is a linear function of some predictor (perhaps some function of x). (A small sketch following this list shows such a fit.)
  4. (p. 546) The paragraph right before Example 12.16, along with that example, are very important.
  5. (p. 544, Prediction and the Linear Model) This whole page is very important. A main message is: if a linear relationship holds, it's best to make use of the fitted linear relationship when making predictions, in order to let all of the available data contribute, since all of the data provides information about the precise nature of the unknown linear relationship; but if an assumed linear relationship is a bad assumption, then using the fitted linear model to make a prediction can be worse than just using a rather limited amount of the data to make a prediction.
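
To go with comment 3 above, here is a small sketch (in Python with numpy, using made-up data) showing that regressing Y on x^2, or on log x, is still just a simple linear regression --- the only change is which predictor gets plugged into the least squares formulas.

    import numpy as np

    # Made-up data with a curved relationship between y and x.
    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    y = np.array([1.2, 4.1, 9.3, 15.8, 24.9])

    def simple_fit(pred, resp):
        """Least squares fit of resp on a single predictor."""
        b1 = (np.sum((pred - pred.mean()) * (resp - resp.mean()))
              / np.sum((pred - pred.mean()) ** 2))
        b0 = resp.mean() - b1 * pred.mean()
        return b0, b1

    # Still a *simple linear* regression --- the predictor is just x squared.
    print(simple_fit(x ** 2, y))

    # Likewise, log(x) could serve as the predictor.
    print(simple_fit(np.log(x), y))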

Section 12.4

  1. (p. 548, 1st paragraph) Now the assumption of a normal distribution for the error terms is being added. By doing this, inference procedures can be derived (although S&W doesn't provide the details of the derivation, which is fine, since there are more important things to give attention to). Since we should never expect to encounter a situation for which the error term distribution is exactly normal, it's good that the inference procedures can be *okay* even if the error term distribution is only approximately normal, and in some large sample situations they can be okay for some purposes if the error term distribution is appreciably nonnormal. (In cases of appreciable nonnormality, other methods may be better, but alternative methods are rarely used, and are not commonly available on mainstream statistical software.)
  2. (p. 548) The standard error formula in the blue box doesn't require normality for the error terms. Also, it should be noted that it's a formula for the estimated standard error. Although it isn't immediately obvious why that is the correct formula, it follows from some relatively simple probability results. (A computational sketch following this list puts the formula to use.)
  3. (p. 549, Implications for Design) This paragraph is very important.
  4. (p. 550, Example 12.18) I don't like either of the ways that S&W gives a confidence interval. An interval estimate should be expressed as an interval. In this case, the confidence interval is (4.9, 9.4).
  5. (p. 550 & pp. 554-555, Example 12.18 & Example 12.21) To get the confidence interval using SPSS, first fix the data if you entered it from the CD that came with S&W --- the 10th row shouldn't be there. Then use Analyze > Regression > Linear, and click weight into the Dependent (variable) box and length into the Independent (variable) box. Before clicking OK, click on Statistics and check the Confidence intervals box, and then click Continue. Next, click on Save, and click to check the boxes for Unstandardized Predicted Values, Unstandardized Residuals, and Studentized deleted Residuals, and then click Continue. Finally, click OK to cause all of the output to be created. In the Coefficients part of the output, you should be able to see that the point estimate for the slope parameter is about 7.19, and that the 95% confidence interval for the slope is about (4.94, 9.45). (Note: I'm rounding the point estimate and confidence bounds to the nearest hundredth since the 2nd significant digit of the estimated standard error of the slope estimator is in the hundredth position.) The value of R^2 can be seen to be equal to about 0.89 in the Model Summary part of the output, which matches the value given at the bottom of p. 554 of S&W. Making a scatter plot of the unstandardized residuals against the unstandardized predicted values results in a plot like the one given in Fig. 12.29 on p. 571, and making a probit plot of the standardized residuals results in a plot similar to the one given in Fig. 12.30 on p. 572, but with the axes switched. One should check the studentized deleted residuals to see if any of them are greater than 2.5 in magnitude --- large studentized residuals indicate that one needs to be more careful in checking the fit, since outliers may be having too much influence.
  6. (p. 550, Testing the Hypothesis) SPSS will do tests like this one for us, so you need not be concerned with the details.
  7. (pp. 550-552, Example 12.19) To do the test using SPSS, read in the correct data and follow the SPSS steps described above for the previous example. The t statistic value of about 4.31 can be found in the Coefficients part of the output, matching the result on p. 552 of S&W. SPSS reports the p-value to be 0.000, so I would write p-value < 0.0005. (It can be noted that if S&W is doing a two-tailed test, as is indicated on the bottom of p. 551, then from using Table 4 the conclusion should be that the p-value is less than 0.001, since the table gives us that the upper tail probability is less than 0.0005, and that needs to be doubled for a two-tailed test.)
  8. (p. 552) The first paragraph after the end of the example makes an important point: one can do the test, and get a p-value, even if the linear model is not a good model --- but if the model doesn't hold, and there is no slope parameter associated with the phenomenon, the test is somewhat pointless ... it doesn't make sense to do a test about a parameter of a defective model that doesn't correspond to reality.
  9. (p. 552, Why (n - 2)?) A simple explanation is that using n - 2 provides us with an unbiased estimator, and results in a convenient null distribution for the test statistic.
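
To go with comments 2, 4, 6, and 7 above, here is a sketch (in Python with numpy and scipy; the data are made up, so the numbers won't match S&W's example) of the computations behind the SPSS output: the estimated standard error of the slope estimator, a 95% confidence interval for the slope expressed as an interval, and the t statistic and two-tailed p-value for testing that the slope parameter is 0.

    import numpy as np
    from scipy import stats

    # Made-up (x, y) data for illustration only.
    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
    y = np.array([2.1, 2.8, 3.3, 4.1, 4.4, 5.2, 5.6, 6.3])

    n = x.size
    b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    b0 = y.mean() - b1 * x.mean()
    resid = y - (b0 + b1 * x)
    s_resid = np.sqrt(np.sum(resid ** 2) / (n - 2))

    # Estimated standard error of the slope estimator (no normality needed here).
    se_b1 = s_resid / np.sqrt(np.sum((x - x.mean()) ** 2))

    # 95% confidence interval for the slope, expressed as an interval.
    tcrit = stats.t.ppf(0.975, df=n - 2)
    ci = (b1 - tcrit * se_b1, b1 + tcrit * se_b1)

    # t statistic and two-tailed p-value for the null hypothesis of zero slope.
    t_stat = b1 / se_b1
    p_value = 2 * stats.t.sf(abs(t_stat), df=n - 2)

    print(ci, t_stat, p_value)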

Section 12.5

  1. (p. 554) In the equation
    total variability = explained variability + unexplained variability,
     what the total variability and unexplained (by the regression) variability are is fairly clear --- the total variability is the sum of the squared deviations of the y values from their sample mean, and the unexplained variability is the sum of the squared residuals. (If all of the (x, y) pairs were on the regression line, the regression line would give the value of y for any value of x without error, and there would be no unexplained variability. The fact that the (x, y) pairs are not all on the regression line means that the regression line doesn't fully give the relationship between y and x --- the residuals account for the variation in the y values that is not explained by the regression of y on x.) Since it's clear what two of the three variabilities should be, the third one can be obtained by subtraction. That is, we have
    explained variability = total variability - unexplained variability.
    The proportion of variability explained by the regression is
    explained variability / total variability,
    or equivalently
    1 - unexplained variability / total variability.
     Most books and software use R^2 for the proportion of variability explained by the regression (aka the proportion of variation explained, or the coefficient of determination). It is algebraically equal to the square of the sample correlation coefficient, r, but this fact isn't especially easy to derive, and I don't think it's too important. You should have a clear understanding of what both r and R^2 are, and not worry too much about how they are related. I don't think R^2 should be in the section of the chapter that pertains to the sample correlation coefficient, since R^2 is a measure of how well x explains Y in a regression model, and r is a summary measure of the joint distribution of X and Y. A value of R^2 close to 1 means that the residuals are relatively small, and x is a rather good predictor of Y. A value of R^2 close to 0 means that knowing the value of x doesn't give us a lot of information about Y. A low R^2 doesn't necessarily mean that the regression model is inappropriate (although a low R^2 is consistent with the hypothesis that the mean of Y is not a linear function of x), since it may be that the model is appropriate and that the error term variance is relatively large, which scatters the y values greatly about a rather subtle straight-line pattern. (A small computational sketch following this list shows the variability decomposition and the resulting R^2.)
  2. (p. 555, The Correlation Coefficient) r should be referred to as the sample correlation coefficient --- it is used to estimate the distribution/population correlation coefficient, which is a measure of the strength of the linear relationship between X and Y. (Other measures of association are useful for measuring the strength of monotonic, but nonlinear, relationships.) I don't think the relationship between r and the estimate of the slope in a linear regression, given near the middle of the page, is important.
  3. (p. 556, Example 12.22) Even though the sample correlation, 0.944, is not too far from 1, a plot of the data (see p. 528 or p. 557) suggests a slight curvature in the relationship (which makes sense, because the relationship between weight and length would be linear if the cross-sectional area didn't depend on length, and it doesn't seem reasonable that short snakes and long snakes are of the same thickness). It's important to keep in mind that r can be close to -1 or 1 and a linear relationship not be appropriate, while for other data, r can be closer to 0 and a linear relationship be a decent summary of the data. The value of r depends on both the linearity of the overall relationship and the variability about the summarizing line.
  4. (p. 556, Example 12.23) The plots and corresponding sample correlations are good to study. Note that when the magnitude of the sample correlation is 0.35, it may not be real clear that there is a relationship between X and Y. But upon a more careful examination of the plots, one can detect a slight tendency for larger values of y to occur with larger values of x for the sample correlation of 0.35, and for smaller values of y to occur with larger values of x for the sample correlation of -0.35. It would be nice if S&W included some plots giving the values of r for bivariate samples from joint distributions for which there is a monotonic but nonlinear relationship between X and Y.
  5. (pp. 557-559) I don't think any of the material on these pages is too important --- there is more important material that we should concentrate on.
  6. (p. 560, 3rd paragraph) Finally, S&W gets around to describing that r is an estimate of a population measure, rho.
  7. (p. 560, Example 12.27) Here a sample of 38 people were selected, and (x, y) measurements obtained from each person. These pairs can be regarded as a random sample from the (joint) distribution of X and Y. r, the two sample means, and the two sample standard deviations, can be regarded as estimates of measures associated with the (joint) distribution of (X, Y). If, for a certain part of the analysis, we are just interested in the distribution of X or just the distribution of Y, then, for example, the sample of the xi can be regarded as a random sample from some underlying parent distribution, and the sample mean and sample standard deviation can be used as estimates of the mean and standard deviation of the distribution.
  8. (p. 560, Example 12.28) Here we don't have a single random sample from a bivariate distribution. x was controlled by the experimenter, and is not random. We really have three different random samples of y values. The data could be used in an ANOVA to determine if the means are not all the same, and if so, which are different from which other ones. But with a regression we can do something else --- we can model the mean of Y as a function of x. The sample correlation isn't an estimate of some population measure, because we don't view the (x, y) pairs as being due to a joint distribution. The sample mean and sample standard deviation of the x values aren't estimates of anything, because all of the x values were in a sense assigned, and are not viewed as being a random sample from some distribution. Also, the sample mean and sample standard deviation of the y values aren't estimates of anything simple, because the y values are not due to a single distribution, but rather we have observations from three (possibly) different distributions.
  9. (p. 561, the blue box) For doing a test of the null hypothesis that the distribution/population correlation is 0 against the alternative that it's not, the last form of the test statistic given in the box seems more appropriate. If one is interested in the correlation, then the sample correlation serves as an estimate of the distribution/population correlation, and using the last form of the test statistic, we just need the estimated correlation (and the sample size) to obtain the value of the test statistic. The first form of the test statistic is based on the results of fitting a regression model, modeling E(Y) as a linear function of x. Although the two forms of the test statistic always produce the same value, I think it'd be odd to compute a slope estimate for a regression model when the focus is the distribution/population correlation coefficient. (A small sketch following this list shows the computation.)
  10. (p. 561, Example 12.29) To obtain the sample correlation coefficient using SPSS, use Analyze > Correlate > Bivariate, click the x and y variables into the Variables box, and then click OK (although in a lot of cases, you might want to also click to check the Spearman box before clicking OK). The output is in the form of a correlation matrix (which is sort of silly if there is only one correlation of interest). The estimate, 0.583, can be seen in two of the boxes. The value of the t statistic isn't given, and the p-value is reported to be 0.000. Since I don't like to express a p-value as being 0 unless it really is 0, I'd write p-value < 0.0005. (One could obtain the t statistic value, as is done in S&W, and use Table 4 to determine that the upper-tail probability is less than 0.0005, and so the p-value for a two-tailed test is less than 2(0.0005) = 0.001.)
  11. (pp. 561-562, Cautionary Notes) These comments are very important. Comment 2 warns us that r is sensitive to outliers. Because of this, the hypothesis test about the correlation coefficient, which is based on r, can misbehave if the parent distribution of the data is too far from a bivariate normal distribution.
  12. (pp. 562-563) Unfortunately, there doesn't seem to be an easy way to obtain confidence intervals for the distribution/population correlation coefficient using SPSS.
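
To go with comment 1 above, here is a small sketch (in Python with numpy, using made-up data) of the three variabilities, the resulting value of R^2, and a numerical check that R^2 equals the square of the sample correlation coefficient.

    import numpy as np

    # Made-up (x, y) data for illustration only.
    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
    y = np.array([2.3, 2.9, 3.6, 3.8, 4.9, 5.2])

    b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    b0 = y.mean() - b1 * x.mean()
    resid = y - (b0 + b1 * x)

    total_var = np.sum((y - y.mean()) ** 2)  # total variability
    unexplained = np.sum(resid ** 2)         # unexplained (residual) variability
    explained = total_var - unexplained      # explained variability, by subtraction

    r_squared = explained / total_var        # = 1 - unexplained / total_var

    # Numerically, R^2 equals the square of the sample correlation coefficient r.
    r = np.corrcoef(x, y)[0, 1]
    print(r_squared, r ** 2)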
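
To go with comment 9 above, here is a sketch (again in Python, with made-up data viewed as a sample from a joint distribution) of the form of the test statistic that uses only the sample correlation and the sample size, along with a two-tailed p-value.

    import numpy as np
    from scipy import stats

    # Made-up paired data, viewed as a sample from the joint distribution of (X, Y).
    x = np.array([3.1, 4.2, 5.0, 5.9, 6.8, 7.7, 8.4, 9.0])
    y = np.array([2.0, 2.9, 2.7, 3.8, 3.6, 4.5, 4.3, 5.1])

    n = x.size
    r = np.corrcoef(x, y)[0, 1]  # sample correlation coefficient

    # Test statistic for the null hypothesis that the population correlation is 0.
    t_stat = r * np.sqrt(n - 2) / np.sqrt(1 - r ** 2)
    p_value = 2 * stats.t.sf(abs(t_stat), df=n - 2)

    print(r, t_stat, p_value)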

Section 12.6

  1. (p. 566, 2nd line) It's not clear that the random variables X and Y are uncorrelated just because the sample correlation coefficient is 0.
  2. (p. 568, Example 12.34) Since there are measurements on only 2 different people, one could use sample means in place of the individual observations (i.e., work with 2 pairs of sample means, with each sample mean based on the 10 observations of a variable for a single person) in order to have independent observations. But then the sample size is only 2, and there is no way to make meaningful inferences about the correlation from a sample of size 2.
  3. (p. 571) With simple regression (i.e., just a single explanatory variable, x) one can also plot the residuals against x to obtain a residual plot --- the pattern is the same whether x or the predicted values are used on the horizontal axis.
  4. (p. 572, The Use of Transformations) Transforming x can eliminate or reduce nonlinearity (curvature), but won't correct for heteroscedasticity. Transforming y affects the variability as well as the linearity/nonlinearity. In Example 12.38, a transformation of y corrects both the heteroscedasticity and the nonlinearity. This sometimes happens, but often the transformation of y needed to correct the heteroscedasticity results in a nonlinear pattern of the x and transformed y pairs. In such a case, a transformation of x may be needed to obtain a linear pattern. (Instead of transforming x, it is sometimes useful to add other terms to the model involving x, say an x^2 term, and possibly an x^3 term. Fitting such a model would require multiple regression.) The sketch following this list illustrates both the residual plot idea from the previous comment and a transformation of y.
  5. (p. 573, Example 12.39) This example, and the following paragraph on p. 574, are good.
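
To go with comments 3 and 4 above, here is a sketch (in Python with numpy; the data are generated so that the spread of y increases with x, which need not resemble S&W's examples) showing the residuals one would plot against x or against the fitted values, and a refit using log y as the response. With these particular made-up values, the log transformation both straightens the relationship and evens out the spread.

    import numpy as np

    rng = np.random.default_rng(2)

    # Made-up data whose spread increases with x (heteroscedasticity).
    x = np.linspace(1.0, 10.0, 30)
    y = np.exp(0.2 + 0.3 * x + rng.normal(0.0, 0.2, size=x.size))

    def simple_fit(pred, resp):
        b1 = (np.sum((pred - pred.mean()) * (resp - resp.mean()))
              / np.sum((pred - pred.mean()) ** 2))
        b0 = resp.mean() - b1 * pred.mean()
        return b0, b1

    # Residuals from regressing y on x.  For simple regression, plotting these
    # against x or against the fitted values shows the same pattern, since the
    # fitted values are a linear function of x.
    b0, b1 = simple_fit(x, y)
    resid = y - (b0 + b1 * x)

    # Transforming y (here, a log transformation) changes both the variability
    # and the shape of the relationship; refit with log(y) as the response.
    b0_log, b1_log = simple_fit(x, np.log(y))
    resid_log = np.log(y) - (b0_log + b1_log * x)

    # Crude check on the heteroscedasticity: compare the residual spread for
    # small x versus large x, before and after the transformation.
    half = x.size // 2
    print(resid[:half].std(), resid[half:].std())
    print(resid_log[:half].std(), resid_log[half:].std())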

Section 12.7

  1. (pp. 576-578, Example 12.40) This example is rather silly. Viewing it as a two-sample test about the means is much more straightforward than viewing it as a regression problem. The example shows that one can get the same result via regression, but there is no good reason to go that route.
  2. (p. 580, 1st paragraph) Notice that a relationship can be real without being causal. The relationship between akinete counts and photoperiod in Example 12.39 is real, but not causal.
  3. (p. 580, Extensions of Least Squares) The two (model) equations are bad. There needs to be an error term included with each of them --- one models the mean of Y given various predictor values, rather than expecting Y to be a perfect function of the predictors. Also, the predictors should be expressed in lower-case type, since we model Y given fixed values for the predictors. Written properly, the two models take the form Y = beta0 + beta1 x + beta2 x^2 + e and Y = beta0 + beta1 x1 + beta2 x2 + e, where e is the error term. The first (defective) model is sometimes referred to as a polynomial regression model. (I don't care for the term curvilinear.) If there is enough data, one might consider a 3rd or 4th degree polynomial model. When polynomial models are used, often it isn't believed that E(Y | x) is really a polynomial in x, but rather that E(Y | x) is some unknown function of x which can be reasonably approximated by a polynomial. (This is based on Taylor's theorem from calculus.) The polynomial model is an example of a multiple regression model, as is the second (defective) model given, which involves two different predictor variables, instead of two terms based on the same predictor variable.
  4. (pp. 580-581, Example 12.42) This is a good example. Since many things obviously are related to blood pressure, in order to properly study the relationship (if any) between cholesterol and blood pressure, other factors which may possibly be related to blood pressure should be adjusted for. Just looking at the relationship between cholesterol and blood pressure while ignoring other important variables can lead to nonsensical and misleading results. However, it should be noted that bad things can also occur if too many terms are used in a model.
  5. (p. 581, Nonparametric and Robust Regression and Correlation) Spearman's rank correlation coefficient is a measure of association that measures the strength of a monotonic (either increasing or decreasing) relationship/trend (as opposed to the sample Pearson correlation coefficient, which measures the strength of a linear relationship). To compute Spearman's statistic, we can rank the xi from smallest to largest (from 1 to n), rank the yi from smallest to largest (from 1 to n), and plug the ranks into the formula for the sample Pearson correlation coefficient. (A small computational sketch at the end of these comments shows this rank-then-correlate computation.) We can get the value easily using SPSS by just clicking to check the Spearman box before clicking OK when computing the Pearson sample correlation coefficient using Analyze > Correlate > Bivariate. Nonparametric and robust regression are topics out of the mainstream that we won't have time to deal with. Plus, they aren't available on a lot of easy-to-use statistical software packages (which is too bad, because if the error term distribution is heavy-tailed, the regression coefficients can be better estimated using robust regression instead of least squares --- but a drawback is that there aren't reliable robust test and interval estimation methods to complement the superior point estimates).
  6. (pp. 581-582, Analysis of Covariance) Although S&W doesn't provide any details, I'll point out that one can do analysis of covariance using SPSS via Analyze > General Linear Model > Univariate (one just needs to click the continuous variable(s) into the Covariate(s) box).
  7. (pp. 582-585, Logistic Regression) We're not going to have time to do much with logistic regression. I may be able to describe it a bit during the last lecture.
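
Finally, to go with comment 5 above, here is a small sketch (in Python with numpy and scipy, using made-up data having a monotonic but nonlinear relationship) of the rank-then-correlate computation for Spearman's rank correlation coefficient described there.

    import numpy as np
    from scipy import stats

    # Made-up data with a monotonic but nonlinear relationship.
    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0])
    y = np.array([1.1, 1.9, 4.2, 8.6, 17.0, 33.5, 64.0])

    # Rank each variable from smallest to largest (1 to n), then plug the ranks
    # into the formula for the sample Pearson correlation coefficient.
    rx = stats.rankdata(x)
    ry = stats.rankdata(y)
    spearman = np.corrcoef(rx, ry)[0, 1]

    # The Pearson correlation of the raw values is smaller, since the
    # relationship is monotonic but not linear.
    pearson = np.corrcoef(x, y)[0, 1]
    print(spearman, pearson)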