Some Comments about Chapter 12 of Samuels & Witmer
Section 12.1
Throughout this section, and the whole chapter, S&W frequently has an upper-case
X where I think a lower-case x would be more appropriate.
For correlation studies, we observe (X, Y) from a joint
distribution (for X and Y). So we use
(x_i, y_i) to denote the ith observed pair, and view
(X_i, Y_i) as the associated random variable (that we
observe to get the data values). But for regression studies, x
is typically thought of as a design variable which is controlled
by the experimenter. Similar to having treatment groups in ANOVA which
are controlled by the experimenter, in regression, the experimenter
makes observations of Y corresponding to values of x
selected by the experimenter. Now, a lot of times the data is
observational in nature, and we really do make regression models based
on observations of (X, Y) pairs. But when considering the
model, we take y to be the observed value of a random variable
Y which is associated with a fixed value of x. That is,
given a certain value of x, we assume a model that specifies
that Y has a distribution which depends upon that particular value
of x, and we consider the observed value, y, to be one of
many values that could have been observed with that value of x.
(Note: The cases of the x_i being controlled by the
experimenter, and of the x_i being observed from a
bivariate distribution, are referred to on p. 527.)
What I feel are mistakes with regard to the use of upper case / lower
case are too numerous to comment on them all, and so I won't bother.
- (pp. 525-526, Example 12.1) Here, x should be in lower
case --- a classic case of a design variable, since the dose is
controlled by the experimenter, and not associated with observations
from a joint distribution. The Y_i can be viewed as
random variables --- clearly, given a fixed value of x, there is
a distribution associated with the response variable, Y.
However, on p. 526, it should be lower-case y, since there the
y_i are actual observed values, and not random
variables which will eventually be observed.
- (p. 527, 1st new paragraph)
The point is that the mean of Y may be reasonably modeled as a
function of x. In the regression model that will be introduced
later, there is an error term which will account for the plotted
points not all being in a perfectly straight line, and so we shouldn't
expect to see a sharp straight line pattern in the scatter plot. But we
might imagine that if we could plot the mean of Y, given a fixed
value of x, which will be denoted by E(Y | x),
against x, we'd get a straight line pattern.
- (pp. 528-529, Example 12.3) The figures on pp. 528-529
suggest a curved relationship between
E(Y | x) and x, which makes sense --- weight would tend to
increase linearly with length only if the thickness of snakes were the
same at all lengths.
Section 12.2
- (p. 529, Equation of the Regression Line) While the
equation given is that of a line, it's not the model for the phenomenon.
If we model Y, there should be an additive error term to allow
for variability about the straight line which is used for
E(Y | x) --- that is,
E(Y | x) can be expressed as a linear equation, but not
Y. I'm going to use the Greek letter beta for the unknown
parameters which specify the line, and then put "hats" on them to
represent the estimates/estimators of these parameters. (I like to use
Greek letters for constants, particularly parameters and population
measures, and I like to use Roman letters for random variables, with
lower case used to designate the observed values of random variables
(the data values in a sample).)
- (p. 530, Plotting Tip) With SPSS, Graphs > Scatter
can be used to create a plot like Figure 12.5 without the regression
line shown. I'm not sure how to get the fitted line in the scatter
plot.
- (p. 531) Although the number of points above the fitted line need
not equal the number of points below the fitted line, the average
of the residuals is equal to 0. Also note that when judging the
nearness of the points to the line, the vertical distance is considered
(the distance corresponds to the magnitude of the residual). The
perpendicular distances from the line to the points don't make sense
--- instead of comparing an observed value of Y corresponding to
some value of x to the estimate of
E(Y | x), the perpendicular distance pertains to an observed
value and a mean for two different values of x.
- (p. 532, Least Squares Criterion)
This criterion leads to the least squares estimates on p. 529 --- all it
takes is relatively simple 3rd semester calculus results. The method of
least squares
makes sense if the error terms (not introduced yet in this chapter) are
approximately normally distributed, but is not the best way to get
estimates if the error term distribution has heavy tails. For
heavy-tailed error term distributions, robust regression methods
produce superior estimates --- but these robust methods aren't available
on most statistical software.
- (p. 533, Residual Standard Deviation)
S&W puts the "cart before the horse" here --- the formula in the gray
box is an estimate for something that hasn't been introduced yet.
(It's an estimate for the standard deviation of the error term
distribution.)
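To make the least squares formulas (p. 529) and the residual standard
deviation (p. 533) concrete, here is a minimal sketch in Python (Python
rather than SPSS, and with made-up data rather than any of the S&W data
sets; the variable names are my own).

import numpy as np

# Made-up (x, y) data --- not from S&W.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1, 6.9])

# Least squares estimates of the intercept and slope
# (the values which minimize the sum of squared residuals).
sxx = np.sum((x - x.mean()) ** 2)
sxy = np.sum((x - x.mean()) * (y - y.mean()))
beta1_hat = sxy / sxx
beta0_hat = y.mean() - beta1_hat * x.mean()

# Residuals: vertical distances from the points to the fitted line.
residuals = y - (beta0_hat + beta1_hat * x)
print("slope estimate:", beta1_hat)
print("intercept estimate:", beta0_hat)
print("mean of the residuals (should be essentially 0):", residuals.mean())

# Residual standard deviation (the formula in the gray box on p. 533):
# an estimate of the standard deviation of the error term distribution.
n = len(x)
s_e = np.sqrt(np.sum(residuals ** 2) / (n - 2))
print("residual standard deviation:", s_e)

Note that the mean of the residuals being (essentially) 0 is the fact
mentioned in the comment about p. 531 above.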
- (p. 533, Example 12.6) The last sentence of the example
isn't very precise --- it's not clear exactly what is meant by "tends to
be off by" ... but the paragraph that starts at the bottom of the page
firms things up a bit.
- (p. 535) Output similar to what is shown on this page can be
produced using SPSS via Analyze > Regression > Linear,
clicking the y variable into the Dependent box,
clicking the x variable into the Independent box, and
clicking OK.
Section 12.3
- (p. 542) The material in the blue box is important. It should be
lower-case x, instead of X, since the model pertains to
the distribution of Y given a specific value of x. I like
to write E or e for the error term, instead of epsilon, to
comply with my convention of (mostly) using Roman letters for random
variables (and lower case Roman letters for their observed values) and
Greek letters for constants. This model is referred to as the simple
linear regression model, where the term "simple" is due to the fact
that there is just a single predictor variable, x, and
E(Y | x) is just a (simple) linear function of x.
Multiple regression models involve more than one predictor
(aka explanatory, or independent) variable.
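Here is a small simulation sketch of the model in the blue box, in
Python, using toy parameter values of my own choosing (nothing here is
from S&W): for each fixed x, the observed values of Y scatter about
E(Y | x) = beta0 + beta1 x.

import numpy as np

rng = np.random.default_rng(1)

# Toy parameter values, chosen only for illustration.
beta0, beta1, sigma = 10.0, 2.0, 1.5
x_values = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

# For each fixed x, generate Y = beta0 + beta1 * x + error, with the
# errors drawn from a normal distribution with mean 0 and sd sigma.
for x in x_values:
    y_sample = beta0 + beta1 * x + rng.normal(0.0, sigma, size=1000)
    print(f"x = {x}:  E(Y | x) = {beta0 + beta1 * x:5.1f},  "
          f"mean of simulated Y values = {y_sample.mean():5.2f}")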
- (p. 543, Figure 12.9) This figure is nice in that it shows
how the density of Y changes as x changes. Looking at the
various densities in the figure, you can hopefully get an appreciation
of how, if the simple linear regression model holds, the observed values
of Y corresponding to assorted values of x will be
scattered about the line corresponding to
E(Y | x).
- (p. 544, Remark) Most books aimed at statisticians
don't use the term curvilinear, but this term seems to be often
used in books dealing with statistics which are aimed at (and written
by) people in the life sciences and social sciences. If one has that
E(Y | x) is a linear function of x^2, we would still have a
linear regression model, and in fact it'd be a simple linear
regression model --- just based on using x^2 as a predictor, instead
of x. We could also have a (simple) linear regression model based
on log x. When something other than x is used as the predictor,
a plot of E(Y | x) against x won't be a straight line, but
it's still a linear regression if E(Y | x) is a linear function
of some predictor (perhaps some function of x).
- (p. 546) The paragraph right before Example 12.16, along
with that example, is very important.
- (p. 544, Prediction and the Linear Model)
This whole page is very important. A main message is: if a linear
relationship holds, it's best to make use of the fitted linear
relationship when making predictions, in order to let all of the
available data contribute --- all of the data is meaningful in that
it provides information about the precise nature of the unknown linear
relationship. But if an assumed linear relationship is a bad
assumption, then using the fitted linear model to make a prediction
can be worse than just using a rather limited amount of the data to make
a prediction.
Section 12.4
- (p. 548, 1st paragraph) Now the assumption of a normal distribution
for the error terms is being added. By doing this, inference procedures
can be derived (although S&W doesn't provide the details of the
derivation, which is fine, since there are more important things to give
attention to). Since we should never expect to encounter a situation
for which the error term distribution is exactly normal, it's good that
the inference procedures can be *okay* even if the error term
distribution is only approximately normal, and in some large sample
situations they can be okay for some purposes if the error term
distribution is appreciably nonnormal. (In cases of appreciable
nonnormality, other methods may be better, but alternative methods are
rarely used, and are not commonly available on mainstream statistical
software.)
- (p. 548) The standard error formula in the blue box doesn't require
normality for the error terms. Also, it should be noted that it's a
formula for the estimated standard error.
Although it isn't immediately obvious
why that is the correct formula, it follows from some relatively simple
probability results.
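Here is a sketch of the standard error computation (and, looking ahead
to Example 12.18, the corresponding 95% confidence interval for the
slope), again in Python with made-up data; the t quantile comes from
scipy, which is just one convenient way to get it.

import numpy as np
from scipy import stats

# Made-up data --- not the snake data from S&W.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1, 6.9])
n = len(x)

sxx = np.sum((x - x.mean()) ** 2)
beta1_hat = np.sum((x - x.mean()) * (y - y.mean())) / sxx
beta0_hat = y.mean() - beta1_hat * x.mean()
residuals = y - (beta0_hat + beta1_hat * x)
s_e = np.sqrt(np.sum(residuals ** 2) / (n - 2))

# Estimated standard error of the slope estimator (no normality needed
# for this formula).
se_beta1 = s_e / np.sqrt(sxx)

# 95% confidence interval for the slope, using the t distribution with
# n - 2 degrees of freedom (this part does rely on the error terms being
# at least approximately normal).
t_crit = stats.t.ppf(0.975, df=n - 2)
ci = (beta1_hat - t_crit * se_beta1, beta1_hat + t_crit * se_beta1)
print("estimated SE of the slope estimator:", se_beta1)
print("95% confidence interval for the slope:", ci)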
- (p. 549, Implications for Design)
This paragraph is very important.
- (p. 550, Example 12.18) I don't like either of the ways that S&W
gives a confidence interval. An interval estimate should be expressed
as an interval. In this case, the confidence interval is (4.9,
9.4).
- (p. 550 & pp. 554-555, Example 12.18
& Example 12.21) To get the confidence interval
using SPSS, first fix the data if you entered it from the CD that came
with S&W --- the 10th row shouldn't be there. Then use
Analyze > Regression > Linear, and click weight into the
Dependent (variable) box and length into the
Independent (variable) box. Before clicking OK, click on
Statistics and check the Confidence intervals box, and
then click Continue. Next, click on Save, and click to
check the boxes for Unstandardized Predicted Values,
Unstandardized Residuals, and
Studentized deleted Residuals, and then click Continue. Finally,
click OK to cause all of the output to be created.
In the Coefficients part of the output, you should be able to see
that the point estimate for the slope parameter is about 7.19, and that
the 95% confidence interval for the slope is about (4.94, 9.45).
(Note: I'm rounding the point estimate and confidence bounds to
the nearest hundredth since the 2nd significant digit of the estimated
standard error of the slope estimator is in the hundredths place.)
The value of R^2 can be seen to be equal to about 0.89 in the
Model Summary part of the output, which matches the value given
at the bottom of p. 554 of S&W.
Making a scatter plot of the unstandardized residuals against the
unstandardized predicted values results in a plot like the one given in
Fig. 12.29 on p. 571, and making a probit plot of the
standardized residuals
results in a plot similar to the one given in
Fig. 12.30 on p. 572, but the axes are switched.
One should check the studentized deleted residuals to see if any of them
are greater than 2.5 --- large studentized residuals indicate that one
needs to be more careful in checking the fit, since outliers may be
having too much influence.
- (p. 550, Testing the Hypothesis)
SPSS will do tests like this one for us, so you need not be concerned
with the details.
- (pp. 550-552, Example 12.19)
To do the test using SPSS, read in the correct data and follow the SPSS
steps described above for the previous example.
The t statistic value of about 4.31 can be found in the
Coefficients part of the output, matching the result on p. 552 of
S&W. SPSS reports the p-value to be 0.000, so I would write p-value
< 0.0005. (It can be noted that if S&W is doing a two-tailed test,
as is indicated on the bottom of p. 551, then from using Table 4
the conclusion should be that the p-value is less than 0.001, since the
table gives us that the upper tail probability is less than 0.0005, and
that needs to be doubled for a two-tailed test.)
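Here is a sketch of the test computation, using the same made-up data
as in the earlier sketches (so the numbers won't match S&W's example):

import numpy as np
from scipy import stats

# Made-up data --- not from S&W.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1, 6.9])
n = len(x)

sxx = np.sum((x - x.mean()) ** 2)
beta1_hat = np.sum((x - x.mean()) * (y - y.mean())) / sxx
beta0_hat = y.mean() - beta1_hat * x.mean()
s_e = np.sqrt(np.sum((y - (beta0_hat + beta1_hat * x)) ** 2) / (n - 2))
se_beta1 = s_e / np.sqrt(sxx)

# t statistic for H0: slope = 0, and the two-tailed p-value based on the
# t distribution with n - 2 degrees of freedom.
t_stat = beta1_hat / se_beta1
p_value = 2 * stats.t.sf(abs(t_stat), df=n - 2)
print("t statistic:", t_stat)
print("two-tailed p-value:", p_value)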
- (p. 552) The first paragraph after the end of the example makes an
important point: one can do the test, and get a p-value, even if the
linear model is not a good model --- but if the model doesn't hold, and
there is no slope parameter associated with the phenomenon, the test is
somewhat pointless ... it doesn't make sense to do a test about a
parameter of a defective model that doesn't correspond to reality.
- (p. 552, Why (n - 2)?)
A simple explanation is that using n - 2 provides us with an
unbiased estimator, and results in a convenient null distribution for
the test statistic.
Section 12.5
- (p. 554) In the equation
total variability = explained variability + unexplained variability,
what the total variability and unexplained (by the regression)
variability are is fairly clear --- the total variability is the sum of
the squared deviations of the y values from their sample mean,
and the unexplained variability is the sum of the squared residuals.
(If all of the (x, y) pairs were on the regression
line, the regression line would give the value of y for any value of
x without error, and there would be no unexplained variability.
The fact that not all of the
(x, y) pairs are on the regression line means that
the regression line doesn't fully give the relationship between y
and x --- the residuals account for the variation in the y
values that is not explained by the regression of y on x.)
Since it's clear what two of the three variabilities should be, the
third one can be obtained by subtraction. That is, we have
explained variability = total variability - unexplained variability.
The proportion of variability explained by the regression is
explained variability / total variability,
or equivalently
1 - unexplained variability / total variability.
Most books and software use R^2 for the proportion of
variability explained by the regression (aka proportion of variation
explained, or coefficient of determination). It is algebraically equal
to the square of the sample correlation coefficient, r, but this
fact isn't real easy to derive, and I don't think it's too important.
You should have a clear understanding of what both r and
R^2 are, and not worry too much about how they are
related. I don't think R^2 should be in the section of the chapter
that pertains to the sample correlation coefficient, since R^2
is a measure of how well x explains Y in a regression
model, and r is a summary measure of the joint distribution of
X and Y. A value of R^2 close to 1 means that the
residuals are relatively small, and x is a rather good predictor
of Y. A value of R^2 close to 0 means that knowing the value
of x doesn't give us a lot of information about Y. A low
R^2 doesn't necessarily mean that the regression model is
inappropriate (although a low R^2 is consistent with the hypothesis
that the mean of Y is not a linear function of x), since it
may be that the model is appropriate and that the error term variance
is relatively large, which scatters the y values greatly around a
relatively subtle straight line pattern.
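Here is a sketch of the decomposition, with made-up data --- the point
is just that R^2 can be computed directly from the two "obvious"
variabilities, and that it agrees numerically with the square of the
sample correlation.

import numpy as np

# Made-up data --- not from S&W.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1, 6.9])

sxx = np.sum((x - x.mean()) ** 2)
beta1_hat = np.sum((x - x.mean()) * (y - y.mean())) / sxx
beta0_hat = y.mean() - beta1_hat * x.mean()
fitted = beta0_hat + beta1_hat * x

total_var = np.sum((y - y.mean()) ** 2)        # total variability
unexplained_var = np.sum((y - fitted) ** 2)    # sum of squared residuals
explained_var = total_var - unexplained_var

r_squared = explained_var / total_var          # = 1 - unexplained/total
r = np.corrcoef(x, y)[0, 1]                    # sample correlation
print("R^2 from the decomposition:", r_squared)
print("square of the sample correlation:", r ** 2)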
- (p. 555, The Correlation Coefficient)
r should be referred to as the sample correlation
coefficient --- it is used to estimate the distribution/population
correlation coefficient, which is a measure of the strength of the
linear relationship between X and Y. (Other
measures of association are useful for measuring the strength of
monotonic, but nonlinear, relationships.) I don't think the
relationship between r and the estimate of the slope in a linear
regression, given near the middle of the page, is important.
- (p. 556, Example 12.22)
Even though the sample correlation, 0.944, is not too far from 1, a plot
of the data (see p. 528 or p. 557) suggests a slight curvature in the
relationship (which makes sense, because the relationship between weight
and length would be linear if the cross-sectional area didn't depend on
length, and it doesn't seem reasonable that short snakes and long snakes
are of the same thickness).
It's important to keep in mind that r can be close
to -1 or 1 and a linear relationship not be appropriate, while for other
data, r can be closer to 0 and a linear relationship be a decent
summary of the data. The value of r depends on both the
linearity of the overall relationship and the variability about the
summarizing line.
- (p. 556, Example 12.23)
The plots and corresponding sample correlations are good to study. Note
that when the magnitude of the sample correlation is 0.35, it may not be
real
clear that there is a relationship between X and Y. But
upon a more careful examination of the plots, one can detect a slight
tendency for larger values of y to occur with larger values of
x for the sample correlation of 0.35, and for
smaller values of y to occur with larger values of
x for the sample correlation of -0.35. It would be nice if S&W
included some plots giving the values of r for bivariate samples
from joint distributions for which there is a monotonic but nonlinear
relationship between X and Y.
- (pp. 557-559) I don't think any of the material on these pages is too
important --- there is more important material that we should
concentrate on.
- (p. 560, 3rd paragraph) Finally, S&W gets around to
describing that r is an estimate of a population measure,
rho.
- (p. 560, Example 12.27)
Here a sample of 38 people was selected, and
(x, y)
measurements obtained from each person. These pairs can be regarded as
a random sample from the (joint) distribution of X and Y.
r, the two sample means, and the two sample standard deviations,
can be regarded as estimates of measures associated with the (joint)
distribution of
(X, Y). If, for a certain part of the analysis, we are
just interested in the distribution of X or just the distribution
of Y, then, for example, the sample of the xi
can be regarded as a random sample from some underlying parent distribution,
and the sample mean and sample standard deviation can be used as
estimates of the mean and standard deviation of the distribution.
- (p. 560, Example 12.28)
Here we don't have a single random sample from a bivariate distribution.
x was controlled by the experimenter, and is not random. We really
have three different random samples of y values. The data could
be used in an ANOVA to determine if the means are not all the same, and
if so, which are different from which other ones. But with a regression
we can do something else --- we can model the mean of Y as a
function of x. The sample correlation isn't an estimate of some
population measure, because we don't view the
(x, y) pairs as being due to a joint distribution.
The sample mean and sample standard deviation of the x values
aren't estimates of anything, because all of the x values were in
a sense assigned, and are not viewed as being a random sample from some
distribution. Also,
the sample mean and sample standard deviation of the y values
aren't estimates of anything simple, because the y values are not due to
a single distribution, but rather we have observations from three
(possibly) different distributions.
- (p. 561, the blue box) For doing a test of the null hypothesis
that the distribution/population correlation is 0 against the
alternative that it's not, the last form of the test statistic given in
the box seems more appropriate. If one is interested in the correlation,
then the sample correlation serves as an estimate of the
distribution/population correlation, and using the last form of the test
statistic, we just need the estimated correlation to obtain the value of
the test statistic. The first form of the test statistic is based on
the results of fitting a regression model, modeling E(Y) as a
linear function of x. Although the two forms of the test
statistic given always produce the same value, I think it'd be odd to
compute a slope parameter for a regression model when the focus is a
distribution/population correlation coefficient.
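The r-based form of the test statistic is
t = r sqrt(n - 2) / sqrt(1 - r^2), so only the sample correlation and
the sample size are needed. Here is a sketch of the computation, with a
made-up correlation and sample size (not values from S&W):

import numpy as np
from scipy import stats

# Made-up values --- not from S&W.
r = 0.60
n = 30

# Test statistic for H0: rho = 0, based only on the sample correlation,
# with the two-tailed p-value from the t distribution on n - 2 df.
t_stat = r * np.sqrt(n - 2) / np.sqrt(1 - r ** 2)
p_value = 2 * stats.t.sf(abs(t_stat), df=n - 2)
print("t statistic:", t_stat)
print("two-tailed p-value:", p_value)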
- (p. 561, Example 12.29)
To obtain the sample correlation coefficient using SPSS, use
Analyze > Correlate > Bivariate, click the x and y
variables into the Variables box, and then click OK
(although in a lot of cases, you might want to also click to check the
Spearman box before clicking OK). The output is in the
form of a correlation matrix (which is sort of silly if there is only
one correlation of interest). The estimate, 0.583, can be seen in two
of the boxes. The value of the t statistic isn't given, and the
p-value is reported to be 0.000. Since I don't like to express a
p-value as being 0 unless it really is 0, I'd write p-value <
0.0005. (One could obtain the t statistic value, as is done
in S&W, and use Table 4 to determine that the upper-tail
probability is less than 0.0005, and so the p-value for a two-tailed
test is less than 2(0.0005) = 0.001.)
- (pp. 561-562, Cautionary Notes)
These comments are very important. Comment 2 warns us that
r is sensitive to outliers. Because of this, the hypothesis test
about the correlation coefficient, which is based on r,
can misbehave if the parent
distribution of the data is too far from a bivariate normal
distribution.
- (pp. 562-563) Unfortunately, there doesn't seem to be an easy way
to obtain confidence intervals for the distribution/population
correlation coefficient using SPSS.
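If an interval is wanted anyway, the standard large-sample interval
based on Fisher's z transformation can be computed with a few lines of
code. This formula isn't from S&W, and it relies on the data being
roughly bivariate normal; the numbers below are made up.

import numpy as np
from scipy import stats

# Made-up sample correlation and sample size.
r = 0.60
n = 30

# Fisher's z transformation: arctanh(r) is approximately normal with
# standard error 1 / sqrt(n - 3), so form the interval on the z scale
# and transform the endpoints back with tanh.
z = np.arctanh(r)
half_width = stats.norm.ppf(0.975) / np.sqrt(n - 3)
ci = (np.tanh(z - half_width), np.tanh(z + half_width))
print("approximate 95% CI for the population correlation:", ci)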
Section 12.6
- (p. 566, 2nd line) It's not clear that the random variables
X and Y are uncorrelated just because the sample
correlation coefficient is 0.
- (p. 568, Example 12.34)
Since there are measurements on only 2 different people, one could use
sample means in place of the individual observations (i.e., work with 2
pairs of sample means, with each sample mean based on the 10
observations of a variable for a single person) in order to have
independent observations. But then the sample size is only 2, and there
is no way to make meaningful inferences about the correlation from a
sample of size 2.
- (p. 571) With simple regression (i.e., just a single explanatory
variable, x) one can also plot the residuals against x to
obtain a residual plot --- the pattern is the same whether x or
the predicted values are used on the horizontal axis.
- (p. 572, The Use of Transformations)
Transforming x can eliminate or reduce nonlinearity (curvature),
but won't correct for heteroscedasticity. Transforming y affects
the variability as well as the linearity/nonlinearity. In Example
12.38, a transformation of y corrects both the
heteroscedasticity and the nonlinearity. This sometimes happens, but
often the transformation of y needed to correct the
heteroscedasticity results in a nonlinear pattern of the x and
transformed y pairs. In such a case, a transformation of
x may be needed to obtain a linear pattern. (Instead of
transforming x, it is sometimes useful to add other terms to the
model involving x, say an x^2 term, and possibly an
x^3 term. Fitting such a model would require multiple
regression.)
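As a small illustration of transforming y, one can fit the least
squares line to (x, log y) instead of (x, y). The data below are made
up to show a curved, fanning-out pattern; they are not the data from
Example 12.38.

import numpy as np

# Made-up data with a roughly exponential pattern --- not from S&W.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
y = np.array([2.2, 3.1, 4.3, 6.5, 9.0, 13.5, 19.0, 28.0])

# Fit a straight line to y vs. x, and to log(y) vs. x.
slope_raw, intercept_raw = np.polyfit(x, y, 1)
slope_log, intercept_log = np.polyfit(x, np.log(y), 1)

# Compare the residuals: if the log transformation straightens the
# relationship (and stabilizes the variability), the second set of
# residuals should show much less of a systematic pattern.
resid_raw = y - (intercept_raw + slope_raw * x)
resid_log = np.log(y) - (intercept_log + slope_log * x)
print("residuals from fitting y on x:     ", np.round(resid_raw, 2))
print("residuals from fitting log y on x: ", np.round(resid_log, 3))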
- (p. 573, Example 12.39) This example, and the following
paragraph on p. 574, are good.
Section 12.7
- (pp. 576-578, Example 12.40) This example is rather silly.
Viewing it as a two-sample test about the means is much more
straightforward than viewing it as a regression problem. The example
shows that one can get the same result via regression, but there is
no good reason to go that route.
- (p. 580, 1st paragraph) Notice that a relationship can be real
without being a causal one. The relationship between akinete counts and
photoperiod in
Example 12.39 is real, but not causal.
- (p. 580, Extensions of Least Squares) The two (model)
equations are bad. There needs to be an error term included with each
of them --- one models the mean of Y given various
predictor variables instead of expecting that Y is a perfect
function of the predictors. Also, the predictors should be expressed in
lower-case type, since we model Y given fixed values for the
predictors. The first (defective) model is sometimes referred to as
a polynomial
regression model. (I don't care for the term curvilinear.)
If there is enough data, one might consider a 3rd or 4th degree
polynomial model. When polynomial models are used, often it isn't
believed that
E(Y | x)
is really a polynomial in x, but rather that
E(Y | x)
is some function of x, and the unknown function of
x can be reasonably approximated by a polynomial. (This is based
on the result of Taylor's theorem (from calculus).)
The polynomial model is an example of a multiple regression model, as is
the second (defective) model given, which involves two different
predictor variables, instead of two terms based on the same predictor
variable.
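Here is a sketch of fitting a quadratic --- a multiple regression with
the two predictors x and x^2 --- by least squares, using made-up data.
With the error term included, the (corrected) model being fit is
Y = beta0 + beta1 x + beta2 x^2 + error.

import numpy as np

# Made-up data with some curvature --- not from S&W.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0])
y = np.array([2.0, 2.8, 4.1, 6.2, 9.1, 12.8, 17.5])

# Design matrix with columns 1, x, and x^2; least squares then gives
# estimates of beta0, beta1, and beta2.
X = np.column_stack([np.ones_like(x), x, x ** 2])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
print("estimated beta0, beta1, beta2:", np.round(coef, 3))

# Fitted values and residuals from the quadratic fit.
fitted = X @ coef
print("residuals:", np.round(y - fitted, 3))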
- (pp. 580-581, Example 12.42) This is a good example. Since
many things obviously are related to blood pressure, in order to
properly study the relationship (if any) between cholesterol and blood
pressure, other factors which may possibly be related to blood pressure
should be adjusted for. Just looking at the relationship between
cholesterol and blood pressure while ignoring other important variables
can lead to nonsensical and misleading results.
However, it should be noted that bad things can also occur if too many terms
are used in a model.
- (p. 581, Nonparametric and Robust Regression and Correlation)
Spearman's rank correlation coefficient is a measure of
association that measures the strength of a monotonic (either increasing
or decreasing) relationship/trend (as opposed to the sample Pearson
correlation coefficient, which measures the strength of a linear
relationship). To compute Spearman's statistic, we can rank the
x_i from smallest to largest (from 1 to n), rank the
y_i from smallest to largest (from 1 to n), and plug into the formula
for the sample Pearson correlation coefficient. But we can get the
value easily using SPSS by just clicking to check the Spearman
box before clicking OK when computing the Pearson sample
correlation coefficient using Analyze > Correlate > Bivariate.
Nonparametric and robust regression are topics out of the mainstream
that we won't have time to deal with. Plus, they aren't
available on a lot of easy-to-use statistical software packages (which
is too bad, because if the error term distribution is heavy-tailed, the
regression coefficients can be better estimated using robust regression
instead of least squares --- but a drawback is that there aren't
reliable robust test and interval estimation methods to complement the
superior point estimates).
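Here is a sketch of the "rank, then plug into the Pearson formula"
recipe for Spearman's coefficient, with made-up data having a monotonic
but nonlinear relationship; scipy's built-in spearmanr is shown only as
a check.

import numpy as np
from scipy import stats

# Made-up data with a monotonic but nonlinear relationship.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
y = np.array([1.1, 1.3, 1.8, 3.0, 6.0, 12.0, 25.0, 60.0])

# Rank each variable from smallest to largest (ties would get averaged
# ranks), then apply the usual Pearson correlation formula to the ranks.
rx = stats.rankdata(x)
ry = stats.rankdata(y)
spearman_by_hand = np.corrcoef(rx, ry)[0, 1]

rho_check, _ = stats.spearmanr(x, y)
print("Spearman (ranks into the Pearson formula):", spearman_by_hand)
print("Spearman from scipy.stats.spearmanr:      ", rho_check)
print("Pearson on the raw data:                  ", np.corrcoef(x, y)[0, 1])

Since the made-up relationship is perfectly monotonic, the Spearman
coefficient is 1 even though the Pearson coefficient is noticeably less
than 1.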
- (pp. 581-582, Analysis of Covariance)
Although S&W doesn't provide any details, I'll point out that one can do
analysis of covariance using SPSS via Analyze > General Linear Model
> Univariate (one just needs to click the continuous variable(s) into
the Covariate(s) box).
- (pp. 582-585, Logistic Regression)
We're not going to have time to do much with logistic regression. I may
be able to describe it a bit during the last lecture.