Ordinary Least Squares Regression: A Look at the Basics Without Matrices



Suppose that for i = 1, 2, ..., n,
Yi = β0 + β1xi,1 + ... + βpxi,p + Ei,
where
β0, β1, ..., βp
are unknown and
E1, E2, ..., En
are iid random variables having mean 0. Given the observed values of the dependent variable,
y1, y2, ..., yn,
along with the observed values of the predictor variables, the least squares estimates of the unknown coefficients are the values of
β0, β1, ..., βp
which minimize
Σi ( yi - [β0 + β1xi,1 + ... + βpxi,p] )^2.
The minimization can be done using calculus and solving a set of simultaneous equations, although in practice some matrix algebra results can be utilized.
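
For concreteness, here is a minimal sketch in Python (using numpy, with invented data and variable names) of obtaining the least squares estimates numerically; np.linalg.lstsq does the minimization for us:

    import numpy as np

    rng = np.random.default_rng(0)
    n = 50
    x1 = rng.uniform(0, 10, n)
    x2 = rng.uniform(0, 10, n)
    # simulate from a model with beta0 = 2, beta1 = 1.5, beta2 = -0.5 and N(0,1) errors
    y = 2 + 1.5 * x1 - 0.5 * x2 + rng.normal(0, 1, n)

    # design matrix: a column of ones (for the intercept) plus the predictor columns
    X = np.column_stack([np.ones(n), x1, x2])

    # the least squares estimates b0, b1, b2 minimize the sum of squared residuals
    b = np.linalg.lstsq(X, y, rcond=None)[0]
    print("estimates:", b)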

Letting
b0, b1, ..., bp
be the least squares estimates of the coefficients, these estimates should be relatively good provided that the error term distribution (E is often referred to as the error term) is normal or "well behaved" (i.e., not heavy-tailed and not highly skewed). If the error term distribution is normal, then the least squares estimators are both MLEs (maximum likelihood estimators) and UMVUEs (uniformly minimum variance unbiased estimators). If the error term distribution is not "nice," then some other method of estimating the unknown parameters (e.g., some type of robust regression) may be appreciably better than ordinary least squares.
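
As a rough illustration of that last point, here is a sketch (in Python, assuming the statsmodels package is available; the data are invented) comparing ordinary least squares to one common type of robust regression, a Huber M-estimator, when a few gross outliers contaminate the response:

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(1)
    n = 60
    x = rng.uniform(0, 10, n)
    y = 1 + 2 * x + rng.normal(0, 1, n)
    y[:5] += 25                      # a few gross outliers in the response

    X = sm.add_constant(x)
    ols_fit = sm.OLS(y, X).fit()
    rob_fit = sm.RLM(y, X, M=sm.robust.norms.HuberT()).fit()

    print("OLS estimates:  ", ols_fit.params)   # noticeably affected by the outliers
    print("Huber estimates:", rob_fit.params)   # typically much closer to (1, 2)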

Provided that the assumed model is correct, and the error term distribution doesn't yield too many extreme outliers,
b0 + b1x1 + ... + bpxp
should be a good choice as an estimate of
E( Y | x1, ..., xp )
if not too many of the true coefficient values are 0 or negligibly small. Likewise, in such situations,
b0 + b1x1 + ... + bpxp
should be a decent choice as a prediction of a currently unknown response value associated with a given set of predictor values. But if too many of the unknown parameters are 0 or rather small, then including the predictors associated with these small coefficients can actually lead to worse predictions and cause the mean to be estimated poorly, even though the assumed linear model is technically correct. (If this weren't the case, then including additional predictors of the form
xj^2, xj^3, log xj, and 1/xj
would never hurt performance.) Even though the deletion of weak predictors may create bias, this can be more than offset by a reduction in variance, and the estimation of the mean and predictions of the response can be done better using a simpler, but incorrect, model. (This concept of a bias-variance tradeoff is very important!)
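
The following small simulation sketch (in Python with numpy; the setup and numbers are invented purely for illustration) shows the tradeoff: a fitted model that omits ten predictors whose true coefficients are 0 typically estimates E(Y | x) better than the technically correct model that includes them.

    import numpy as np

    rng = np.random.default_rng(2)
    n, n_noise, reps = 30, 10, 2000
    test_mse_full, test_mse_small = 0.0, 0.0

    for _ in range(reps):
        # one real predictor plus several predictors whose true coefficients are 0
        X = rng.normal(size=(n, 1 + n_noise))
        y = 1 + 0.8 * X[:, 0] + rng.normal(0, 1, n)
        Xtest = rng.normal(size=(n, 1 + n_noise))
        ytest_mean = 1 + 0.8 * Xtest[:, 0]

        Xf = np.column_stack([np.ones(n), X])           # all predictors
        Xs = np.column_stack([np.ones(n), X[:, :1]])    # the real predictor only
        bf = np.linalg.lstsq(Xf, y, rcond=None)[0]
        bs = np.linalg.lstsq(Xs, y, rcond=None)[0]

        pred_f = np.column_stack([np.ones(n), Xtest]) @ bf
        pred_s = np.column_stack([np.ones(n), Xtest[:, :1]]) @ bs
        test_mse_full += np.mean((pred_f - ytest_mean) ** 2) / reps
        test_mse_small += np.mean((pred_s - ytest_mean) ** 2) / reps

    print("MSE estimating E(Y|x), full model: ", test_mse_full)
    print("MSE estimating E(Y|x), small model:", test_mse_small)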

However, if the relationship between a predictor and the response is sufficiently nonlinear, then transforming the predictor, or adding higher-order terms involving the predictor, may appreciably improve performance. (In such a case the decrease in bias wins out over a possible increase in variance.) If the correct model is not an additive model --- that is, we do not have
E( Y | x1, ..., xp ) = β0 + h1(x1) + ... + hp(xp)
for some set of suitable functions
h1, ..., hp
--- then it may be desirable to include interaction terms in the model, which involve constructed variables based on two or more of the explanatory variables (with products being a commonly used form). (Note: In class I will discuss the difference between linear and nonlinear regression. One has to be careful with the terminology --- so-called polynomial regression is actually a form of linear regression even though it allows for a nonlinear relationship between the response and a predictor. (The parameters enter the model in a linear way.))


polynomial regression

For polynomial regression, we use a linear model, but we expand the set of predictors to include constructed variables like
xj^2, xj^3, xj xk, xj^2 xk, and xj xk^2.
If we use the first and second powers of each explanatory variable, as opposed to just using each original explanatory variable, then we can say that we have a 2nd-order additive model. If we additionally include the cube of each explanatory variable in the set of predictor variables, then we can say that we have a 3rd-order additive model. If to a 2nd-order additive model we add all products of the form
xj xk,
then we have a full 2nd-order model. If to a full 2nd-order model we add all of the cubes and 3rd-order products of the forms
xj^2 xk and xj xk^2,
then we have a full 3rd-order model.
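
As a sketch of the bookkeeping (in Python with numpy, for two explanatory variables with made-up values), the various predictor sets can be built up as follows:

    import numpy as np

    x1 = np.array([1.0, 3.0, 5.0, 7.0])
    x2 = np.array([2.0, 4.0, 6.0, 8.0])

    # 2nd-order additive model: x1, x2, and their squares
    X_add2 = np.column_stack([x1, x2, x1**2, x2**2])

    # full 2nd-order model: also include the cross product x1*x2
    X_full2 = np.column_stack([X_add2, x1 * x2])

    # full 3rd-order model: add the cubes and the 3rd-order products x1^2*x2 and x1*x2^2
    X_full3 = np.column_stack([X_full2, x1**3, x2**3, x1**2 * x2, x1 * x2**2])

    print(X_full3.shape)   # 4 observations, 9 constructed predictors (intercept added at fitting time)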

If there are a lot of explanatory variables, then often one should not use a full rth-order model, with r being too large, due to the "cost" (in terms of increased variance) associated with the large number of unknown parameters to be estimated in the linear model. To cut down on the number of predictors in the linear model, one can use different numbers of terms to adjust the response for the various explanatory variables and only selectively include interaction terms, but some statisticians shy away from this practice because they fear that by being so selective in the inclusion of higher-order terms they will overfit the model.


hypothesis testing

If one wants to test the null hypothesis
H0: βj = 0
against the alternative hypothesis
H1: βj ≠ 0,
a t test can be used. The value of the test statistic is the estimated coefficient divided by the estimated standard error of the coefficient estimator, and with iid normal error terms, the null sampling distribution of the test statistic is a t distribution with n - p - 1 degrees of freedom, where p is the total number of predictors used in the model (not counting the intercept term as a predictor). E.g., we would use n - 2 df to do a test about the slope coefficient in a simple regression model.
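
Here is a sketch in Python (using statsmodels, with invented data in which x2 truly has a zero coefficient) showing the t statistics, their p-values, and the n - p - 1 residual degrees of freedom that a typical regression fit reports:

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(3)
    n = 40
    x1 = rng.uniform(0, 10, n)
    x2 = rng.uniform(0, 10, n)
    y = 2 + 1.5 * x1 + rng.normal(0, 1, n)     # x2 has a true coefficient of 0

    X = sm.add_constant(np.column_stack([x1, x2]))
    fit = sm.OLS(y, X).fit()

    # each t value is an estimated coefficient divided by its estimated standard error;
    # under H0: beta_j = 0 it is referred to a t distribution with n - p - 1 = 37 df
    print(fit.tvalues)
    print(fit.pvalues)
    print(fit.df_resid)                        # n - p - 1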

If one wants to test the null hypothesis
H0: βq+1 = βq+2 = ... = βp = 0,
where p is the total number of predictors and q is less than p, against the general alternative that not all of
βq+1, βq+2, ..., βp
are equal to 0, an F test can be used. (I'll explain how to do such an F test in class.) Such a test is useful for determining whether there is statistically significant evidence that a full 2nd-order model is "better" than a 2nd-order additive model, or that a full 3rd-order model is "better" than a full 2nd-order model.
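
One standard way to carry out such a nested-model F test is to compare the residual sums of squares of the two fits. Here is a sketch in Python (numpy and scipy, with made-up data) testing whether a full 2nd-order model in two predictors improves on the 2nd-order additive model:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(4)
    n = 50
    x1 = rng.uniform(0, 10, n)
    x2 = rng.uniform(0, 10, n)
    y = 1 + 0.7 * x1 - 0.4 * x2 + rng.normal(0, 1, n)

    def rss(X):
        b = np.linalg.lstsq(X, y, rcond=None)[0]
        r = y - X @ b
        return r @ r

    # reduced model: 2nd-order additive (q = 4 predictors); full model adds x1*x2 (p = 5)
    X_red = np.column_stack([np.ones(n), x1, x2, x1**2, x2**2])
    X_full = np.column_stack([X_red, x1 * x2])
    q, p = 4, 5

    F = ((rss(X_red) - rss(X_full)) / (p - q)) / (rss(X_full) / (n - p - 1))
    p_value = stats.f.sf(F, p - q, n - p - 1)
    print("F =", F, " p-value =", p_value)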


variable/predictor selection

The hypothesis tests described above are useful in arriving at a final model. But there are various ways to go about building/selecting a model. One can use various stepwise regression algorithms, or one can use all/best subsets regression, which is generally preferred, although it's not always clear whether to use the best 3-predictor model, the best 4-predictor model, or a model having some other number of predictors. (We'll return to this issue later (in another class period).) Sometimes I use one of the techniques referred to above as a starting point and then start tinkering, relying on my years of experience along with any subject matter knowledge that I can bring to bear on the analysis.
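
For a modest number of candidate predictors, an all subsets search can be done by brute force. Here is a sketch in Python (numpy plus itertools, with invented data), scoring each subset by adjusted R^2, which is just one of several reasonable criteria:

    import numpy as np
    from itertools import combinations

    rng = np.random.default_rng(5)
    n, k = 60, 5
    X_all = rng.normal(size=(n, k))
    y = 1 + 2 * X_all[:, 0] - 1.5 * X_all[:, 2] + rng.normal(0, 1, n)

    def adj_r2(cols):
        X = np.column_stack([np.ones(n)] + [X_all[:, j] for j in cols])
        b = np.linalg.lstsq(X, y, rcond=None)[0]
        resid = y - X @ b
        rss = resid @ resid
        tss = np.sum((y - y.mean()) ** 2)
        p = len(cols)
        return 1 - (rss / (n - p - 1)) / (tss / (n - 1))

    # for each model size, find the best subset of predictors
    for size in range(1, k + 1):
        best = max(combinations(range(k), size), key=adj_r2)
        print(size, "predictor(s):", best, " adjusted R^2 =", round(adj_r2(best), 3))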


variable transformation

Residual analysis is a good starting point for the variable transformation process. Plotting the residuals against the fitted values can show evidence of nonlinearity (an incorrect model) and/or heteroscedasticity.
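
A sketch of such a plot in Python (numpy and matplotlib, with invented data whose error spread grows with the mean, so the funnel shape discussed below should be visible):

    import numpy as np
    import matplotlib.pyplot as plt

    rng = np.random.default_rng(6)
    n = 100
    x = rng.uniform(1, 10, n)
    y = 2 + 3 * x + rng.normal(0, 0.5 * x)   # error spread grows with x

    X = np.column_stack([np.ones(n), x])
    b = np.linalg.lstsq(X, y, rcond=None)[0]
    fitted = X @ b
    resid = y - fitted

    plt.scatter(fitted, resid)
    plt.axhline(0, color="gray")
    plt.xlabel("fitted values")
    plt.ylabel("residuals")
    plt.show()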

If one sees clear evidence of heteroscedasticity, the usual remedy is to transform the response variable. (Alternatively, one can do a weighted regression, but I won't attempt to cover that in this course.) Heteroscedasticity often creates a funnel shape on the residual plot. In such a case, one can often find a monotonic transformation to apply to the response variable which will (largely) correct the nonconstant variance problem. Power transformations and log transformations are commonly used.

Although one can search for a suitable transformation using an intelligent trial-and-error approach, it is often good to apply the Box-Cox method (which is a likelihood-based technique to identify a power transformation which hopefully corrects heteroscedasticity, nonlinearity, and nonnormality). One can make an initial guess as to what is a good transformation of the response variable based on the Box-Cox results, then tinker with transforming (or adding higher-order terms to) the set of explanatory variables, and then go back to see if additional adjustment of the response variable can lead to further improvement.
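
Here is a rough sketch (in Python with numpy; the data are invented, and the true response scale is logarithmic, so the selected power should come out near 0) of a simple grid-search version of the Box-Cox idea: for each candidate power, regress the scaled transformed response on the predictors and keep the power giving the smallest residual sum of squares, which amounts to maximizing the profile likelihood.

    import numpy as np

    rng = np.random.default_rng(7)
    n = 100
    x = rng.uniform(1, 10, n)
    y = np.exp(0.3 * x + rng.normal(0, 0.3, n))   # positive responses; log(y) is the "nice" scale

    X = np.column_stack([np.ones(n), x])
    gm = np.exp(np.mean(np.log(y)))               # geometric mean of the responses

    def rss_for_power(lam):
        # scaled Box-Cox transform, so residual sums of squares are comparable across powers
        z = gm * np.log(y) if lam == 0 else (y**lam - 1) / (lam * gm**(lam - 1))
        b = np.linalg.lstsq(X, z, rcond=None)[0]
        r = z - X @ b
        return r @ r

    powers = np.round(np.arange(-2, 2.01, 0.1), 2)
    best = min(powers, key=rss_for_power)
    print("power minimizing the residual sum of squares:", float(best))   # should be near 0 (log)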

For transforming the explanatory variables (which should generally be done after transforming the response variable), one can try the iterative Box-Tidwell technique to arrive at power transformations, or one can use a variety of graphics-based approaches. One such graphical approach is to make a series of component plus residual plots (aka partial residual plots), focusing on correcting the strongest nonlinear relationships first.

To understand component plus residual plots (c+r plots), consider the case of simple regression with just one predictor variable, x, and suppose that all of the x values are positive. If a scatterplot of the ordered pairs,
(xi, yi),
shows an increasing pattern which generally curves upwards with a convex appearance, then the larger x values need to be "pulled out" away from 0 if one wants to straighten out the plotted points. This can be done by using a power transformation with a power greater than 1. For example, maybe a plot of the points
(xi^(3/2), yi)
results in a generally linear pattern. If so, then one could fit a linear model using the transformed x as a predictor in place of x.
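
A tiny numerical sketch (Python with numpy; the data are generated so that x^(3/2) is the "right" predictor) of how the transformed predictor straightens things out, as measured by the residual standard deviation:

    import numpy as np

    rng = np.random.default_rng(8)
    n = 80
    x = rng.uniform(1, 10, n)
    y = 2 + 2 * x**1.5 + rng.normal(0, 1, n)   # a plot of (x, y) curves upward (convex)

    def resid_sd(pred):
        X = np.column_stack([np.ones(n), pred])
        b = np.linalg.lstsq(X, y, rcond=None)[0]
        resid = y - X @ b
        return np.sqrt(resid @ resid / (n - 2))

    print("residual SD using x:      ", round(resid_sd(x), 2))       # inflated by the curvature
    print("residual SD using x^(3/2):", round(resid_sd(x**1.5), 2))  # close to the true error SD of 1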

A power transformation with a power greater than 1 might also work well if the original scatter plot has a pattern of plotted points curving downward with a concave appearance. If the points exhibit a convex downward trend (i.e., smaller y values are associated with larger x values, with the negative slope becoming less steep as the points are farther from 0), then one needs to "pull in" the large x values to increase the straightness of the pattern of plotted points. This can be done with a power transformation with a power less than 1 (e.g., a square root or cube root transformation), or perhaps a log transformation. One can even use negative powers like -1 (an inverse transformation). A power less than 1 (or the use of a log transformation) might also work well if y generally increases with increasing x, but there is a concave curvature (as opposed to the convex curvature considered previously).

The problem with using such a scatterplot approach in a multiple regression setting is that, to determine a good transformation for a predictor, we need some insight into its relationship with the dependent variable when adjustments are made for the other predictors. If unadjusted y values are plotted against the values of a predictor, then we needn't expect to see a linear pattern even if the predictor should go into the model in a simple linear way (i.e., untransformed). Since what we are interested in is the relationship between a predictor and the response, adjusted (as well as possible) for the effects of the other predictors, if xm is the predictor currently being considered for transformation we can plot the points
(xi,m, yi - b0 - Σj ≠ m bjxi,j).
Using r to denote residuals, note that for a fitted model we have
yi = b0 + b1xi,1 + ... + bpxi,p + ri.
Since this gives us that
yi - b0 - Σj ≠ m bjxi,j = bmxi,m + ri,
we can more simply just plot
(xi,m, bmxi,m + ri)
(which suggests the name component plus residual plot) to show the relationship of interest. With such a plot we can see whether an improved linear appearance can be achieved by "pushing out" or "pulling in" the large values of the predictor. (It should be noted that finding suitable transformations can be considerably harder if there are both positive and negative predictor values.)
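
Here is a sketch of constructing such a plot by hand (Python with numpy and matplotlib; the data are invented, with x2 really entering through its square so that its c+r plot should show clear curvature):

    import numpy as np
    import matplotlib.pyplot as plt

    rng = np.random.default_rng(9)
    n = 100
    x1 = rng.uniform(0, 10, n)
    x2 = rng.uniform(0, 10, n)
    y = 1 + 2 * x1 + 0.3 * x2**2 + rng.normal(0, 1, n)   # x2 really enters through its square

    # fit the (misspecified) model that uses x2 untransformed
    X = np.column_stack([np.ones(n), x1, x2])
    b = np.linalg.lstsq(X, y, rcond=None)[0]
    resid = y - X @ b

    # component plus residual plot for x2: plot b2*x2 + residual against x2
    plt.scatter(x2, b[2] * x2 + resid)
    plt.xlabel("x2")
    plt.ylabel("component + residual")
    plt.show()   # the convex pattern suggests "pushing out" the large x2 values (a power above 1)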

When more than one predictor should go into the model in a nonlinear way, one can search for good transformations in an iterative way, since the initial set of c+r plots can be a bit misleading as the response variable hasn't been properly adjusted for the effects of the other predictors (because they haven't been transformed yet). A good strategy is to look at the initial set of c+r plots (based on none of the predictors having been transformed), and then start by trying to determine a transformation to straighten the relationship for the predictor yielding the most highly nonlinear plot (since this predictor needs to be corrected in order to have the second round of c+r plots be based on a response variable that has been more appropriately adjusted). Once the worst predictor is reasonably transformed, one can make new c+r plots using the transformed predictor in the model, and then attack the most nonlinear remaining pattern. If more than one predictor has been transformed, it may be a good idea to go back and see if the early transformations are still good.