Regression Tactics: Shrinkage Methods and Derived Input Directions



Here I provide comments about some parts of Ch. 3 of HTF that I didn't cover previously, but I still won't attempt to cover everything that is in the chapter.


shrinkage methods (as a substitute for variable selection in regression)

Ridge regression was originally developed as a way of dealing with troublesome degrees of correlation among the predictor variables in multiple regression. With ridge regression, instead of seeking the values of the βj which minimize
\sum_i ( y_i - [\beta_0 + \beta_1 x_{i,1} + \cdots + \beta_p x_{i,p}] )^2,
the βj values are chosen to minimize
\sum_i ( y_i - [\beta_0 + \beta_1 x_{i,1} + \cdots + \beta_p x_{i,p}] )^2 + \lambda \sum_{j \ne 0} \beta_j^2,
where λ is a nonnegative parameter which controls the amount of shrinkage of the coefficients --- the greater λ is, the greater the shrinkage of the coefficients relative to the least squares estimates (which correspond to the λ = 0 case). I think that in most cases, the predictors should be standardized first (by both subtracting off the sample mean and dividing by the sample standard deviation), since otherwise some variables can greatly dominate and the desired results may not be obtained.
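
To make this concrete, here is a minimal sketch (my own, not from HTF), assuming scikit-learn is available: the predictors are standardized before the ridge penalty is applied, the simulated data are placeholders, and the value λ = 1 (called alpha in scikit-learn) is arbitrary.

  import numpy as np
  from sklearn.pipeline import make_pipeline
  from sklearn.preprocessing import StandardScaler
  from sklearn.linear_model import Ridge

  rng = np.random.default_rng(0)
  X = rng.normal(size=(100, 5))                       # 100 cases, 5 predictors (illustrative)
  y = X @ np.array([2.0, -1.0, 0.5, 0.0, 0.0]) + rng.normal(size=100)

  # StandardScaler subtracts the sample mean and divides by the sample standard
  # deviation; Ridge then shrinks the standardized coefficients toward 0.
  model = make_pipeline(StandardScaler(), Ridge(alpha=1.0))
  model.fit(X, y)
  print(model.named_steps["ridge"].coef_)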

With OLS regression, when two predictors are highly correlated, relatively minor changes in the data can flip their coefficients from moderate values of the same sign (both positive, or both negative) to values where one is a rather large positive number and the other a rather large negative number. Since small changes in the data sufficient to cause such a phenomenon could be due to the error term values, the OLS coefficient estimators can have large variances. But in ridge regression, large magnitudes of the estimated coefficients are discouraged by a positive value of λ, and this helps to reduce the variance of the coefficient estimators (which may result in an appreciable reduction in the variance of the predicted values).
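
As a small illustration of this instability (my own example, not from the text), the sketch below fits OLS and ridge to two nearly collinear predictors for a few draws of the error term; the OLS coefficients can swing to large values of opposite sign, while the ridge coefficients stay near the true values of (1, 1).

  import numpy as np
  from sklearn.linear_model import LinearRegression, Ridge

  rng = np.random.default_rng(1)
  n = 50
  x1 = rng.normal(size=n)
  x2 = x1 + 0.01 * rng.normal(size=n)      # x2 is almost identical to x1
  X = np.column_stack([x1, x2])            # both predictors are on the same scale

  for rep in range(3):                     # three independent draws of the error term
      y = x1 + x2 + rng.normal(size=n)     # true coefficients are (1, 1)
      ols = LinearRegression().fit(X, y)
      ridge = Ridge(alpha=10.0).fit(X, y)
      print("OLS:", np.round(ols.coef_, 2), "  ridge:", np.round(ridge.coef_, 2))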

A positive value of λ also tends to pull all estimated coefficients towards 0, and this often makes it possible to remove some variables from the model altogether without changing the predictions appreciably. If λ is chosen on the basis of creating accurate predictions (for cases in a validation sample, or as judged by cross-validation), then if some coefficients are shrunk to (nearly) 0, the available data is suggesting that those variables just aren't useful predictors (and so one could choose to remove them in order to achieve a simpler model). In any case, by considering a number of different λ values, one arrives at a collection of different fitted models, and one can use cross-validation or a validation sample to select the value of λ which seems to provide the most accurate predictions. Since the models considered effectively have different numbers of predictors (if we ignore variables with really small coefficients), doing ridge regression in this way (considering multiple values of λ) is in a sense doing variable selection directly on the basis of estimated prediction performance. (In OLS regression, t tests and F tests are often used to do variable selection. The use of such tests presents a problem in that it is hard to account for the simultaneous inference phenomenon --- if one does a lot of tests, some statistically significant results are likely to be type I errors, and there is also the usual problem with hypothesis tests that some nonsignificant results may be type II errors.)
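
A hedged sketch of this tuning strategy, assuming scikit-learn: RidgeCV searches a grid of candidate λ values (the grid here is arbitrary) using 5-fold cross-validation, with the predictors standardized first; the simulated data are placeholders.

  import numpy as np
  from sklearn.pipeline import make_pipeline
  from sklearn.preprocessing import StandardScaler
  from sklearn.linear_model import RidgeCV

  rng = np.random.default_rng(0)
  X = rng.normal(size=(100, 5))
  y = X @ np.array([2.0, -1.0, 0.5, 0.0, 0.0]) + rng.normal(size=100)

  lambdas = np.logspace(-3, 3, 25)                      # candidate values of lambda
  model = make_pipeline(StandardScaler(),
                        RidgeCV(alphas=lambdas, cv=5))  # 5-fold cross-validation
  model.fit(X, y)
  print("chosen lambda:", model.named_steps["ridgecv"].alpha_)
  print("coefficients :", model.named_steps["ridgecv"].coef_)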

It should be noted that ridge regression is just an alternative way to arrive at the estimated coefficients in a linear regression model, and that it does not correct for heteroscedasticity and nonlinearities. It may be best to first fit an OLS model using whatever transformations and constructed variables seem appropriate, and then fit the same model using ridge regression.

An equivalent way to describe ridge regression is that for various positive values of s, one minimizes
\sum_i ( y_i - [\beta_0 + \beta_1 x_{i,1} + \cdots + \beta_p x_{i,p}] )^2,
subject to
\sum_{j \ne 0} \beta_j^2 \le s.
With the lasso method (developed by Tibshirani), for various positive values of t,
\sum_i ( y_i - [\beta_0 + \beta_1 x_{i,1} + \cdots + \beta_p x_{i,p}] )^2
is minimized subject to
\sum_{j \ne 0} |\beta_j| \le t,
and then one can choose to use the value of t which produces the smallest prediction errors as measured by the average squared prediction error for cases in a validation sample (or cross-validation can be used). While the lasso and ridge regression are quite similar, a difference is that with the lasso, instead of the coefficients being shrunk toward 0 but not quite getting there, the coefficients become exactly 0, one by one, as t is decreased.
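
To illustrate this difference, here is a rough sketch (not from the text), assuming scikit-learn's lasso_path: as the penalty grows (equivalently, as the bound t shrinks), the number of nonzero coefficients drops, with coefficients becoming exactly 0 one at a time. The simulated data are placeholders.

  import numpy as np
  from sklearn.preprocessing import StandardScaler
  from sklearn.linear_model import lasso_path

  rng = np.random.default_rng(2)
  X = rng.normal(size=(100, 6))
  y = X @ np.array([3.0, -2.0, 1.0, 0.0, 0.0, 0.0]) + rng.normal(size=100)

  Xs = StandardScaler().fit_transform(X)    # standardize the predictors
  yc = y - y.mean()                         # lasso_path does not fit an intercept

  # alphas come back in decreasing order; a larger penalty corresponds to a smaller bound t
  alphas, coefs, _ = lasso_path(Xs, yc, n_alphas=10)
  for a, c in zip(alphas, coefs.T):
      print(f"penalty {a:7.3f}   nonzero coefficients: {int(np.sum(np.abs(c) > 1e-8))}")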

methods using derived input directions (for dimensionality reduction)

When there are a lot of correlated explanatory variables, and not a lot of data, OLS regression may result in high variance (and so high error) estimates and predictions if too many of the explanatory variables are used. While shrinkage methods can be of some help, typically (since ridge regression isn't commonly used) people eliminate some of the explanatory variables. If one suspects that future cases to be predicted won't necessarily have the predictors so highly correlated, then there is some reason to worry about making perhaps arbitrary choices about which explanatory variables to keep in the model and which to eliminate. Methods using derived input directions provide another strategy for dealing with the problem of too many variables and not enough data, and better predictions can result (compared to what is obtained from a routine regression fit).

To do principal components regression, one first takes only the explanatory variables and finds perhaps several principal components. Then, one does OLS regression using the principal components as the candidates for the predictors. In this way, all of the explanatory variables can be in the final model (since each principal component involves all of the variables), but only a small number of coefficients have to be estimated by least squares. Furthermore, one has a set of predictors for which correlation should not present a problem. I'll add that principal components can also be used with classification methods --- it's a way to achieve dimensionality reduction.
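
A minimal sketch of principal components regression, assuming scikit-learn: standardize, keep the first few principal components, and run OLS on them. The simulated (correlated) data and the choice of 3 components are arbitrary; in practice the number of components would be chosen using a validation sample or cross-validation.

  import numpy as np
  from sklearn.pipeline import make_pipeline
  from sklearn.preprocessing import StandardScaler
  from sklearn.decomposition import PCA
  from sklearn.linear_model import LinearRegression

  rng = np.random.default_rng(3)
  X = rng.normal(size=(80, 10))
  X[:, 5:] = X[:, :5] + 0.1 * rng.normal(size=(80, 5))   # make the predictors highly correlated
  y = X[:, :3].sum(axis=1) + rng.normal(size=80)

  pcr = make_pipeline(StandardScaler(), PCA(n_components=3), LinearRegression())
  pcr.fit(X, y)
  print("training R^2:", pcr.score(X, y))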

The first principal component is the linear combination of the (explanatory) variables,
z_1 = a_{1,1} x_1 + a_{1,2} x_2 + \cdots + a_{1,p} x_p,
having
a_{1,1}^2 + a_{1,2}^2 + \cdots + a_{1,p}^2 = 1,
for which the sample variance of the n z1 values is maximized. (If there were no constraint on the magnitude of the coefficients, then there would be no variance-maximizing linear combination of the explanatory variables --- the variance could always be increased by increasing the magnitudes of the coefficients.) Equivalently, the a1,j identify a direction in the p-dimensional space such that if the n x points are projected onto a line having that direction, the projected points are as spread out as possible (as measured by their sample variance). When obtaining principal components, the variables are typically standardized first (by dividing by the sample standard deviation), since otherwise some variables can greatly dominate and the desired results may not be obtained.
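
For readers who want to see the computation, here is a short numpy sketch (my own, with made-up data): after standardizing, the first right singular vector of the data matrix supplies unit-length coefficients a1,1, ..., a1,p, and projecting the n points onto that direction gives z1 values with the largest attainable sample variance.

  import numpy as np

  rng = np.random.default_rng(4)
  X = rng.normal(size=(60, 4))
  Xs = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)   # standardize each column

  U, s, Vt = np.linalg.svd(Xs, full_matrices=False)
  a1 = Vt[0]                    # first principal component direction, unit length
  z1 = Xs @ a1                  # projections of the n points onto that direction
  print("coefficients:", np.round(a1, 3))
  print("sample variance of z1:", z1.var(ddof=1))     # the largest achievable value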

The second principal component is the linear combination of the (explanatory) variables,
z_2 = a_{2,1} x_1 + a_{2,2} x_2 + \cdots + a_{2,p} x_p,
having
a_{2,1}^2 + a_{2,2}^2 + \cdots + a_{2,p}^2 = 1,
and such that
a_2 = (a_{2,1}, a_{2,2}, \ldots, a_{2,p})
is orthogonal to
a_1 = (a_{1,1}, a_{1,2}, \ldots, a_{1,p}),
for which the sample variance of the n z2 values is maximized. Equivalently, the a2,j identify a direction in the p-dimensional space, orthogonal to the direction indicated by the first principal component, such that if the n x points are projected onto a line having that direction, the projected points are as spread out as possible (as measured by their sample variance). The third principal component is determined similarly, and corresponds to a direction orthogonal to the directions of the first two principal components, and subsequent principal components are defined in a similar manner. An alternative viewpoint for principal components is that the first k principal components are orthogonal directions corresponding to a k-dimensional subspace for which the orthogonal distances of the n p-dimensional points to that subspace are collectively as small as possible (as measured by the sum of the squared distances). This suggests that the n projected points --- the n k-dimensional vectors of the principal component values --- are the n k-dimensional vectors that best "approximate" the original p-dimensional points. (Note: The k-dimensional vectors of the principal component values can still be viewed as points in p-dimensional space since each principal component involves all p variables.) Due to this, it would seem that as little as possible is lost by replacing the original set of p explanatory variables with a smaller set of principal component variables.
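
The "best approximation" viewpoint can be checked numerically; the sketch below (my own, with made-up data) reconstructs the standardized points from their first k principal components and reports the sum of squared distances to the original points, which decreases as k grows.

  import numpy as np

  rng = np.random.default_rng(5)
  X = rng.normal(size=(60, 6))
  Xs = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

  U, s, Vt = np.linalg.svd(Xs, full_matrices=False)
  for k in range(1, 7):
      Xk = Xs @ Vt[:k].T @ Vt[:k]          # projection onto the first k directions
      print(k, "components, sum of squared distances:", round(float(np.sum((Xs - Xk) ** 2)), 2))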

It should be noted that the coefficients for the principal components were not determined using the response variable values --- they are not obtained by specifically trying to create a good model for the response variable. Because of this, the coefficients of the principal components do not lead to overfitting in the same way as what occurs if we specifically select a lot of coefficients in a linear regression model to make the fitted values close to the observed values of the response variable. It is also interesting to note that if one uses p principal components in a regression model, and then rewrites the fitted principal component model in terms of the original p explanatory variables, the result is the same as what one gets when a least squares model is fit using all p explanatory variables. So p principal components and p explanatory variables are of equal worth (with regard to finding a good prediction model). But it may well be that fewer than p principal components can be better than the same number of explanatory variables.
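
This equivalence is easy to verify numerically; in the sketch below (my own, with simulated data), regressing on all p principal components and running OLS on all p standardized predictors give the same fitted values.

  import numpy as np
  from sklearn.pipeline import make_pipeline
  from sklearn.preprocessing import StandardScaler
  from sklearn.decomposition import PCA
  from sklearn.linear_model import LinearRegression

  rng = np.random.default_rng(6)
  X = rng.normal(size=(50, 4))
  y = X @ np.array([1.0, -2.0, 0.5, 3.0]) + rng.normal(size=50)

  ols = make_pipeline(StandardScaler(), LinearRegression()).fit(X, y)
  pcr = make_pipeline(StandardScaler(), PCA(n_components=4), LinearRegression()).fit(X, y)
  print(np.allclose(ols.predict(X), pcr.predict(X)))   # True: identical fitted values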

There are other regression methods that use derived directions as predictors. One such method is partial least squares regression. Unlike principal components, the directions used in partial least squares regression are determined using the response variable values. Unfortunately, we don't have enough time to thoroughly cover all of the interesting methods that are introduced in HTF.
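
For completeness, here is a brief sketch using scikit-learn's PLSRegression; the number of components and the simulated data are arbitrary choices, and in practice the number of components would be tuned by cross-validation or a validation sample.

  import numpy as np
  from sklearn.cross_decomposition import PLSRegression
  from sklearn.metrics import r2_score

  rng = np.random.default_rng(7)
  X = rng.normal(size=(80, 10))
  y = X[:, 0] - 2 * X[:, 1] + rng.normal(size=80)

  pls = PLSRegression(n_components=2)   # scales the predictors internally by default
  pls.fit(X, y)
  print("training R^2:", r2_score(y, pls.predict(X).ravel()))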