Ordinary Least Squares Regression: A Look at the Basics Without Matrices
Suppose that for i = 1, 2, ..., n,
Y_i = β_0 + β_1 x_{i,1} + ... + β_p x_{i,p} + E_i,
where β_0, β_1, ..., β_p are unknown and E_1, E_2, ..., E_n are iid random variables having mean 0. Given the observed values of the dependent variable, y_1, y_2, ..., y_n, along with the observed values of the predictor variables, the least squares estimates of the unknown coefficients are the values of β_0, β_1, ..., β_p which minimize
Σ_i ( y_i - [β_0 + β_1 x_{i,1} + ... + β_p x_{i,p}] )^2.
The minimization can be done using calculus and solving a set of simultaneous equations, although in practice some matrix algebra results can be utilized.
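To make the minimization concrete, here is a minimal sketch in Python (not part of the original notes; the coefficients and data are invented, and numpy.linalg.lstsq is just one convenient way to carry out the computation):

```python
# Sketch with simulated data: obtain the least squares estimates numerically.
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 2
X = rng.normal(size=(n, p))                # observed predictor values x_{i,1}, ..., x_{i,p}
beta_true = np.array([1.0, 2.0, -0.5])     # [beta_0, beta_1, beta_2]; unknown in practice
y = beta_true[0] + X @ beta_true[1:] + rng.normal(scale=0.3, size=n)

# Append a column of ones so that the intercept beta_0 is estimated as well.
X1 = np.column_stack([np.ones(n), X])

# lstsq finds the b_0, ..., b_p minimizing sum_i (y_i - [b_0 + b_1 x_{i,1} + ... + b_p x_{i,p}])^2.
b = np.linalg.lstsq(X1, y, rcond=None)[0]
print("least squares estimates b_0, b_1, b_2:", b)
```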
Letting b_0, b_1, ..., b_p be the least squares estimates of the coefficients, these estimates should be relatively good provided that the
error term distribution (E is often referred to as the error term) is normal or "well behaved"
(i.e., not heavy-tailed and not highly skewed).
If the error term distribution is normal, then the least squares estimators are both MLEs
(maximum likelihood estimators) and UMVUEs
(uniformly minimum variance unbiased estimators).
If the error term distribution is not "nice" then some other method
of estimating the unknown parameters
(e.g., some type of robust regression)
may be appreciably better than ordinary least squares.
Provided that the assumed model is correct, and the error term distribution doesn't yield too many extreme outliers,
b_0 + b_1 x_1 + ... + b_p x_p
should be a good choice as an estimate of E( Y | x_1, ..., x_p ) if not too many of the true coefficient values are 0 or negligibly small. Likewise, in such situations, b_0 + b_1 x_1 + ... + b_p x_p should be a decent choice as a prediction of a currently unknown response value associated with a given set of predictor values.
But if too many of the unknown parameters are 0 or rather small, then including the predictors associated with these small coefficients can actually lead to worse predictions, and cause the mean to be estimated poorly, even though the assumed linear model is technically correct. (If this weren't the case, then including additional predictors of the form x_j^2, x_j^3, log x_j, and 1/x_j would never hurt performance.) Even though the deletion of weak predictors may create bias, this can be more than offset by a reduction in variance, and the estimation of the mean and predictions of the response can be done better using a simpler, but incorrect, model. (This concept of a bias-variance tradeoff is very important!)
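The following small simulation is a hedged illustration (not from the notes; the coefficients, sample size, and point x at which the mean is estimated are all invented) of this tradeoff: the three-coefficient model is technically correct, but the last coefficient is negligibly small, and in repeated samples dropping that predictor tends to estimate E(Y | x) at least as well.

```python
# Sketch: compare squared error of the estimated mean with and without a weak predictor.
import numpy as np

rng = np.random.default_rng(1)
n, reps = 20, 2000
beta = np.array([1.0, 2.0, 0.05])            # beta_2 is nonzero but negligibly small
x_new = np.array([1.0, 0.5, 0.5])            # [1, x_1, x_2] where we want to estimate E(Y | x)
true_mean = x_new @ beta

err_full, err_small = [], []
for _ in range(reps):
    X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
    y = X @ beta + rng.normal(size=n)
    b_full = np.linalg.lstsq(X, y, rcond=None)[0]          # use both predictors
    b_small = np.linalg.lstsq(X[:, :2], y, rcond=None)[0]  # drop the weak predictor
    err_full.append((x_new @ b_full - true_mean) ** 2)
    err_small.append((x_new[:2] @ b_small - true_mean) ** 2)

print("mean squared error of estimated mean, full model :", np.mean(err_full))
print("mean squared error of estimated mean, small model:", np.mean(err_small))
```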
However,
if the relationship between a predictor and the response is sufficiently nonlinear, then transforming the predictor, or adding higher-order terms involving the predictor, may appreciably improve performance. (In such a case the decrease in bias wins out over a possible increase in variance.) If the correct model is not an additive model
--- that is, we do not have
E( Y | x_1, ..., x_p ) = β_0 + h_1(x_1) + ... + h_p(x_p)
for some set of suitable functions h_1, ..., h_p --- then it may be desirable to include interaction terms in the model, which involve constructed variables based on two or more of the explanatory variables (with products being a commonly used form).
(Note: In class I will discuss the difference between linear and nonlinear regression. One has to be careful with the terminology --- so-called polynomial regression is actually a form of linear regression, even though it allows for a nonlinear relationship between the response and a predictor, because the parameters enter the model in a linear way.)
polynomial regression
For polynomial regression, we use a linear model, but we expand the set of predictors to include constructed variables like x_j^2, x_j^3, x_j x_k, x_j^2 x_k, and x_j x_k^2.
If we use the first and second powers of each explanatory variable, as opposed to just using each original explanatory variable, then we can say that we have a 2nd-order additive model. If we additionally include the cube of each explanatory variable in the set of predictor variables, then we can say that we have a 3rd-order additive model.
If to a 2nd-order additive model we add all products of the form x_j x_k, then we have a full 2nd-order model. If to a full 2nd-order model we add all cubes and all 3rd-order products of the forms x_j^2 x_k and x_j x_k^2, then we have a full 3rd-order model.
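As a sketch (with made-up values for two explanatory variables, just to show the constructed variables), the predictor sets for these models might be built as follows:

```python
# Sketch: construct the columns for 2nd-order additive, full 2nd-order, and full 3rd-order models.
import numpy as np

rng = np.random.default_rng(2)
x1, x2 = rng.normal(size=100), rng.normal(size=100)   # invented explanatory variables

# 2nd-order additive model: each variable enters only through x_j and x_j^2.
X_additive2 = np.column_stack([x1, x1**2, x2, x2**2])

# Full 2nd-order model: also include the cross product x_1 x_2.
X_full2 = np.column_stack([x1, x1**2, x2, x2**2, x1 * x2])

# Full 3rd-order model: add the cubes and the 3rd-order products x_1^2 x_2 and x_1 x_2^2.
X_full3 = np.column_stack([X_full2, x1**3, x2**3, x1**2 * x2, x1 * x2**2])

print(X_additive2.shape, X_full2.shape, X_full3.shape)   # (100, 4) (100, 5) (100, 9)
```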
If there are a lot of explanatory variables, then often one should not use a
full rth-order model, with r being too large, due to the "cost" (in terms of increased variance)
associated with the large number of unknown parameters to be estimated in the linear model.
To cut down on the number of predictors in the linear model, one can use different numbers of terms to adjust
the response for the
various explanatory variables and only selectively include interaction terms, but some statisticians shy away from
this practice because they fear that by being so selective in the inclusion of higher-order terms they will
overfit the model.
hypothesis testing
If one wants to test the null hypothesis H_0: β_j = 0 against the alternative hypothesis H_1: β_j ≠ 0, a t test can be used. The value of the test statistic is the estimated coefficient divided by the estimated standard error of the coefficient estimator, and with iid normal error terms, the null sampling distribution of the test statistic is a T distribution with n - p - 1 degrees of freedom, where p is the total number of predictors used in the model (not associating a predictor with the intercept term). E.g., we would use n - 2 df to do a test about the slope coefficient in a simple regression model.
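A hedged sketch of such t tests using Python's statsmodels (the data, model, and sample size here are invented purely for illustration):

```python
# Sketch: t statistics, p-values, and residual df for each coefficient in an OLS fit.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
n = 40
X = rng.normal(size=(n, 2))
y = 1.0 + 2.0 * X[:, 0] + rng.normal(size=n)   # the second predictor has true coefficient 0

fit = sm.OLS(y, sm.add_constant(X)).fit()
print(fit.tvalues)     # each estimate divided by its estimated standard error
print(fit.pvalues)     # two-sided p-values from the T distribution
print(fit.df_resid)    # n - p - 1 = 40 - 2 - 1 = 37 degrees of freedom
```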
If one wants to test the null hypothesis
H_0: β_{q+1} = β_{q+2} = ... = β_p = 0,
where p is the total number of predictors and q is less than p, against the general alternative that not all of β_{q+1}, β_{q+2}, ..., β_p are equal to 0, an F test can be used. (I'll explain how to do such an F test in class.)
Such a test is useful for determining whether there is statistically significant evidence that a full 2nd-order model is "better" than a 2nd-order additive model, or that a full 3rd-order model is "better" than a full 2nd-order model.
variable/predictor selection
The hypothesis tests described above are useful in arriving at a final model.
But there are various ways to go about building/selecting a model. One can use various stepwise regression algorithms, or one can use all/best subsets regression, which is generally preferred, although it's not always clear whether to use the best 3-predictor model, the best 4-predictor model, or a model having some other number of predictors. (We'll return to this issue in a later class period.) Sometimes I use one of the techniques referred to above as a starting point and then start tinkering, relying on my years of experience along with any subject matter knowledge that I can bring to bear on the analysis.
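As a rough sketch (with invented data, and using AIC only as one possible criterion, not a recommendation), an all-subsets search can be coded directly, recording the best model of each size:

```python
# Sketch: exhaustive (all subsets) search over five candidate predictors.
from itertools import combinations

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(5)
n, p = 60, 5
X = rng.normal(size=(n, p))
y = 1 + 2 * X[:, 0] - X[:, 2] + rng.normal(size=n)   # only two predictors really matter

best = {}                                            # best (AIC, subset) for each subset size
for k in range(1, p + 1):
    for subset in combinations(range(p), k):
        fit = sm.OLS(y, sm.add_constant(X[:, list(subset)])).fit()
        if k not in best or fit.aic < best[k][0]:
            best[k] = (fit.aic, subset)

for k, (aic, subset) in sorted(best.items()):
    print(f"best {k}-predictor model uses columns {subset} (AIC = {aic:.1f})")
```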
variable transformation
Residual analysis is a good starting point for the variable transformation process. Plotting the residuals against
the fitted values can show evidence of nonlinearity (incorrect model) and/or heteroscedasticity.
If one sees clear evidence of heteroscedasticity, the usual remedy is to transform the response variable.
(Alternatively, one can do a weighted regression, but I won't attempt to cover that in this course.)
Heteroscedasticity often creates a funnel shape on the residual plot. In such a case, one can often find a
monotonic transformation to apply to the response variable which will (largely) correct the nonconstant variance problem.
Power transformations and log transformations are commonly used.
Although one can search for a suitable transformation using an intelligent trial-and-error approach, it is often good
to apply the Box-Cox method (which is a likelihood-based technique to identify a power transformation which hopefully
corrects heteroscedasticity, nonlinearity, and nonnormality). One can make an initial guess as to what is a good
transformation of the response variable based on the Box-Cox results, then tinker with transforming (or adding higher-order
terms to) the set of explanatory variables, and then go back to see if additional adjustment of the response variable can lead to further improvement.
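A hedged sketch of the likelihood computation behind the Box-Cox method for the response in a regression (the data are simulated so that a log transformation, lambda near 0, should be about right; the profile log-likelihood formula with its Jacobian term is standard, but the crude grid search is only for illustration):

```python
# Sketch: profile the Box-Cox log-likelihood over the power lambda for a regression.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(6)
n = 100
x = rng.uniform(1, 5, size=n)
y = np.exp(0.5 + 0.4 * x + rng.normal(scale=0.2, size=n))   # positive, funnel-shaped on the raw scale

X = sm.add_constant(x)
log_y_sum = np.log(y).sum()

def boxcox_profile_loglik(lam):
    """Profile log-likelihood after a Box-Cox power transform of the response."""
    z = np.log(y) if abs(lam) < 1e-8 else (y**lam - 1) / lam
    rss = sm.OLS(z, X).fit().ssr
    return -n / 2 * np.log(rss / n) + (lam - 1) * log_y_sum  # Jacobian adjustment term

lams = np.linspace(-2, 2, 81)
ll = [boxcox_profile_loglik(lam) for lam in lams]
print("lambda maximizing the profile likelihood:", round(lams[int(np.argmax(ll))], 2))
```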
For transforming the explanatory variables (which should generally be done after transforming the response variable),
one can try the iterative Box-Tidwell technique to arrive at power transformations,
or one can use a variety of graphics-based approaches. One such graphical approach is to make a series of
component plus residual plots (aka partial residual plots), focusing on correcting the strongest
nonlinear relationships first.
To understand component plus residual plots (c+r plots), consider the case of simple regression with just one
predictor variable, x, and suppose that all of the x values are positive.
If a scatterplot of the ordered pairs, (x_i, y_i), shows an increasing pattern which generally curves upwards with a convex appearance, then the larger x values need to be "pulled out" away from 0 if one wants to straighten out the plotted points. This can be done by using a power transformation with a power greater than 1. For example, maybe a plot of the points (x_i^{3/2}, y_i) results in a generally linear pattern. If so, then one could fit a linear model using the transformed x as a predictor in place of x.
Using a power transformation with a power greater than 1 might also work well if the original scatter plot has a pattern of plotted points curving downward with a concave appearance.
If the points exhibit a convex downward trend (i.e., smaller y values are associated with larger x
values, with the negative slope becoming less steep as the points are farther from 0), then one needs to "pull in"
the large x values to increase the straightness of the pattern of plotted points. This can be done with a
power transformation with a power less than 1 (e.g., a square root or cube root transformation),
or perhaps a log transformation. One can even use negative powers like -1 (an inverse transformation).
A power less than 1 (or the use of a log transformation) might also work well if y generally increases with
increasing x, but there is a concave curvature (as opposed to the convex curvature considered previously).
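A tiny sketch (simulated data, invented curve) of the "pulling out" idea: when y grows roughly like x^{3/2}, a straight-line fit to the points (x^{3/2}, y) leaves much less lack-of-fit than a straight-line fit to the points (x, y).

```python
# Sketch: compare straight-line fits before and after a power transformation of x.
import numpy as np

rng = np.random.default_rng(7)
x = rng.uniform(0.5, 4, size=200)
y = 2 * x**1.5 + rng.normal(scale=0.3, size=200)   # convex, upward-curving pattern vs x

for power in (1.0, 1.5):
    t = x**power                                   # candidate transformation of the predictor
    line = np.polyfit(t, y, 1)                     # straight-line fit y = a + b * t
    rss = np.sum((y - np.polyval(line, t))**2)
    print(f"power {power}: straight-line residual sum of squares = {rss:.1f}")
```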
The problem of using such a scatter plot approach in a multiple regression setting is that to determine a good
transformation for a predictor we need to have some insight about its relationship with the dependent variable when
adjustments are being made for the other predictors. If unadjusted y values are plotted against the values of
a predictor, then we needn't expect to see a linear pattern even if the predictor should go into the model in a
simple linear way (i.e., untransformed). Since what we are interested in is the relationship between a predictor and the response, adjusted for the effects (as well as possible) of the other predictors, we can, assuming that x_m is the predictor currently being considered for transformation, plot the points
(x_{i,m}, y_i - b_0 - Σ_{j ≠ m} b_j x_{i,j}).
Using r to denote residuals, note that for a fitted model we have
y_i = b_0 + b_1 x_{i,1} + ... + b_p x_{i,p} + r_i.
Since this gives us that
y_i - b_0 - Σ_{j ≠ m} b_j x_{i,j} = b_m x_{i,m} + r_i,
we can more simply just plot (x_{i,m}, b_m x_{i,m} + r_i) (which suggests the name component plus residual plot) to show the relationship of interest.
With such a plot we can see whether an improved linear appearance can be achieved by "pushing out" or "pulling in" the large values of the predictor. (It should be noted that finding suitable transformations can be considerably harder if there are both positive and negative predictor values.)
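A sketch (with made-up data) of computing and plotting the component plus residual values b_m x_{i,m} + r_i for one predictor; statsmodels also provides sm.graphics.plot_ccpr for essentially the same plot.

```python
# Sketch: component plus residual (c+r) values for the predictor x1.
import numpy as np
import statsmodels.api as sm
import matplotlib.pyplot as plt

rng = np.random.default_rng(8)
n = 150
x1 = rng.uniform(0.5, 3, size=n)
x2 = rng.normal(size=n)
y = 1 + x1**2 + 0.5 * x2 + rng.normal(scale=0.3, size=n)   # x1 really enters nonlinearly

X = sm.add_constant(np.column_stack([x1, x2]))
fit = sm.OLS(y, X).fit()

m = 1                                        # column index of x1 (column 0 is the intercept)
cpr = fit.params[m] * X[:, m] + fit.resid    # the b_m x_{i,m} + r_i values derived above

plt.scatter(X[:, m], cpr)                    # curvature here suggests transforming x1
plt.xlabel("x1")
plt.ylabel("component + residual")
plt.show()
```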
When more than one predictor should go into the model in a nonlinear way, one can search for good transformations in
an iterative way, since the initial set of c+r plots can be a bit misleading as the response variable hasn't been
properly adjusted for the effects of the other predictors (because they haven't been transformed yet).
A good strategy is to look at the initial set of c+r plots (based on none of the predictors having been transformed),
and then start by trying to determine a transformation to straighten the relationship for the predictor yielding the
most highly nonlinear plot (since this bad predictor needs to be corrected in order to have the second round of c+r plots be based on a more appropriately adjusted response variable). Once the worst predictor is reasonably
transformed, one can make new c+r plots using the transformed predictor in the model, and then attack the most nonlinear
pattern. If more than one predictor has been transformed, it may be a good idea to go back and see if the early
transformations are still good.