Regression Tactics: Shrinkage Methods and Derived Input Directions
Here I provide comments about some parts of Ch. 3 of HTF that I didn't cover previously, but I still won't
attempt to cover everything that is in the chapter.
shrinkage methods (as a substitute for variable selection in regression)
Ridge regression was originally developed as a way of dealing with troublesome degrees of correlation among the
predictor variables in multiple regression.
With ridge regression, instead of seeking the values of the \(\beta_j\) which minimize
\[ \sum_i \left( y_i - [\beta_0 + \beta_1 x_{i,1} + \cdots + \beta_p x_{i,p}] \right)^2 , \]
the \(\beta_j\) values are chosen to minimize
\[ \sum_i \left( y_i - [\beta_0 + \beta_1 x_{i,1} + \cdots + \beta_p x_{i,p}] \right)^2 + \lambda \sum_{j \neq 0} \beta_j^2 , \]
where \(\lambda\) is a nonnegative parameter which controls the amount of shrinkage of the coefficients --- the greater
\(\lambda\) is, the greater the degree of shrinkage of the coefficients relative to the least squares estimates
(which correspond to the \(\lambda = 0\) case).
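As a small illustration of this criterion, here is a minimal sketch in Python with NumPy (names are my own choosing) of its closed-form minimizer; it centers the predictors so that the intercept \(\beta_0\), which the penalty excludes, can be handled separately.

```python
import numpy as np

def ridge_coefficients(X, y, lam):
    """Minimize sum_i (y_i - [b0 + x_i . b])^2 + lam * sum_j b_j^2,
    with the intercept b0 left unpenalized (handled by centering)."""
    xbar, ybar = X.mean(axis=0), y.mean()
    Xc, yc = X - xbar, y - ybar              # center predictors and response
    p = X.shape[1]
    # closed-form ridge solution: (Xc'Xc + lam*I)^{-1} Xc'yc
    beta = np.linalg.solve(Xc.T @ Xc + lam * np.eye(p), Xc.T @ yc)
    beta0 = ybar - xbar @ beta               # recover the intercept
    return beta0, beta
```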
I think that in most cases, the predictors should be standardized first (by both subtracting off the sample mean and
dividing by the sample standard deviation), since the ridge penalty is not scale invariant: without standardization the
amount of shrinkage applied to each coefficient depends on the (arbitrary) units of the corresponding variable, some
variables can greatly dominate, and overall the desired results may not be obtained.
With OLS regression, when two predictors are highly correlated, relatively minor changes in the data can cause their
corresponding coefficients, instead of having moderate values of the same sign (both positive, or both negative),
to take values such that one is a rather large positive value and the other a rather large negative
value. Since changes in the data small enough to cause such a phenomenon could be due to the error term values, the
OLS coefficient estimators can have large variances. But in ridge regression, large magnitudes of the estimated
coefficients are discouraged by a positive value of
\(\lambda\), and this helps to reduce the variance of the coefficient estimators (which may result in an appreciable
reduction in the variance of the predicted values).
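To see this variance reduction numerically, the following made-up simulation (reusing the ridge_coefficients sketch above, with an arbitrarily chosen penalty value) repeatedly generates data with two nearly collinear predictors and compares the spread of the OLS and ridge estimates of the first slope.

```python
import numpy as np

rng = np.random.default_rng(0)
n, reps, lam = 50, 2000, 10.0
ols_b1, ridge_b1 = [], []
for _ in range(reps):
    x1 = rng.normal(size=n)
    x2 = x1 + 0.05 * rng.normal(size=n)          # x2 is nearly equal to x1
    X = np.column_stack([x1, x2])
    y = 1.0 + 0.5 * x1 + 0.5 * x2 + rng.normal(size=n)
    b_ols = np.linalg.lstsq(np.column_stack([np.ones(n), X]), y, rcond=None)[0]
    ols_b1.append(b_ols[1])                      # OLS estimate of the x1 slope
    ridge_b1.append(ridge_coefficients(X, y, lam)[1][0])
print("sd of OLS slope estimates:  ", np.std(ols_b1))
print("sd of ridge slope estimates:", np.std(ridge_b1))
```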
A positive value of \(\lambda\) also tends to pull all estimated coefficients towards 0, and this tends to result in
being able to remove some variables from the model altogether without changing the predictions appreciably. If
\(\lambda\) is chosen on the basis of creating accurate predictions (for cases in a validation sample, or as judged by
cross-validation), then if some coefficients are shrunk to (nearly) 0, the available data is suggesting that
those variables just aren't useful predictors (and so one could choose to remove them in order to achieve a simpler
model). In any case, by considering a number of different \(\lambda\) values, one arrives at a collection of different
fitted models, and one can use cross-validation or a validation sample to select the value of \(\lambda\) which seems
to provide the most accurate predictions. Since the models in this collection have effectively
different numbers of predictors (if we ignore variables with really small coefficients), by doing ridge regression
in this way (considering multiple values for
\(\lambda\)), we are in a sense doing variable selection directly on the basis of estimated prediction performance.
(In OLS regression, t tests and F tests are often used to do variable selection. The use of such tests
presents a problem in that it is hard to account for the simultaneous inference phenomenon --- if one does a lot of
tests, some statistically significant results are likely to be type I errors (and there is also the problem
associated with hypothesis tests that some nonsignificant results may be type II errors).)
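As a hedged sketch of selecting \(\lambda\) this way in practice, scikit-learn's RidgeCV (where the penalty parameter is called alpha) can evaluate a grid of values by cross-validation; the made-up data, the grid, and the choice to standardize within the pipeline are all just choices for the illustration.

```python
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# made-up data with two highly correlated predictors
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 6))
X[:, 1] = X[:, 0] + 0.1 * rng.normal(size=100)
y = X[:, 0] - X[:, 2] + rng.normal(size=100)

alphas = np.logspace(-3, 3, 50)                  # candidate lambda values
model = make_pipeline(StandardScaler(),          # standardize, then shrink
                      RidgeCV(alphas=alphas, cv=5))
model.fit(X, y)
print("lambda chosen by cross-validation:", model.named_steps["ridgecv"].alpha_)
print("estimated coefficients:", model.named_steps["ridgecv"].coef_)
```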
It should be noted that ridge regression is just an alternative way to arrive at the estimated coefficients in a
linear regression model, and that it does not correct for heteroscedasticity and nonlinearities. It may be best to
first fit an OLS model using whatever transformations and constructed variables seem appropriate, and then fit the
same model using ridge regression.
An equivalent way to describe ridge regression is that, for various positive values of s, one minimizes
\[ \sum_i \left( y_i - [\beta_0 + \beta_1 x_{i,1} + \cdots + \beta_p x_{i,p}] \right)^2 \]
subject to
\[ \sum_{j \neq 0} \beta_j^2 \le s . \]
With the lasso method (developed by Tibshirani), for various positive values of t,
\[ \sum_i \left( y_i - [\beta_0 + \beta_1 x_{i,1} + \cdots + \beta_p x_{i,p}] \right)^2 \]
is minimized subject to
\[ \sum_{j \neq 0} |\beta_j| \le t , \]
and then one can choose to use the value of t which produces the smallest prediction errors as measured by
the average squared prediction error for cases in a validation sample (or cross-validation can be used).
While the lasso and ridge regression are quite similar, a difference is that with the lasso, instead of the
coefficients being shrunk toward 0 but never quite getting there, coefficients become exactly 0, one by one,
as t is decreased.
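The following made-up sketch with scikit-learn's Lasso illustrates that behavior; its penalty parameter alpha plays the role of \(\lambda\) in the penalized form, so increasing alpha corresponds to decreasing t, and the printed coefficient vectors show entries becoming exactly 0 one after another.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 6))
y = 3.0 * X[:, 0] + 1.0 * X[:, 1] + rng.normal(size=100)   # only two useful predictors

Xs = StandardScaler().fit_transform(X)
for alpha in [0.01, 0.1, 0.5, 1.0, 2.0]:         # larger alpha ~ smaller t
    coefs = Lasso(alpha=alpha).fit(Xs, y).coef_
    print(f"alpha = {alpha:4.2f}:", np.round(coefs, 2))
```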
methods using derived input directions (for dimensionality reduction)
When there are a lot of correlated explanatory variables, and not a lot of data, OLS regression may result in high
variance (and so high error) estimates and predictions if too many of the explanatory variables are used.
While
shrinkage methods can be of some help, typically (since ridge regression isn't commonly used) people eliminate some of
the explanatory variables.
If one suspects that future cases for which response values must be predicted won't
necessarily have such highly correlated predictors, then there is some reason to worry about making perhaps
arbitrary choices about which explanatory variables to keep in the model and which to eliminate.
Methods using derived input directions provide us with another strategy for dealing with the problem of too many
variables and not enough data, and better predictions can result (compared to what is gotten by a routine
regression analysis fit).
To do principal components regression, one first takes only the explanatory variables and finds perhaps
several principal components. Then, one does OLS regression using the principal components as the candidates for the
predictors. In this way, all of the explanatory variables can be in the final model (since each principal component
involves all of the variables), but only a small number of
coefficients have to be estimated by least squares. Furthermore, one has a set of predictors for which correlation
should not present a problem.
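A minimal principal components regression sketch (on made-up data; keeping 3 components is an arbitrary choice for the illustration, and in practice the number kept could itself be chosen by cross-validation):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(3)
X = rng.normal(size=(60, 10))
X[:, 5:] = X[:, :5] + 0.1 * rng.normal(size=(60, 5))   # correlated predictors
y = X[:, 0] + X[:, 5] + rng.normal(size=60)

# standardize, extract a few principal components, then run OLS on their scores
pcr = make_pipeline(StandardScaler(), PCA(n_components=3), LinearRegression())
pcr.fit(X, y)
print("fitted values for the first 5 cases:", np.round(pcr.predict(X[:5]), 2))
```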
I'll add that principal components can also be used with classification methods --- it's a way to achieve
dimensionality reduction.
The first principal component is the linear combination of the (explanatory) variables,
\[ z_1 = a_{1,1} x_1 + a_{1,2} x_2 + \cdots + a_{1,p} x_p , \]
having
\[ a_{1,1}^2 + a_{1,2}^2 + \cdots + a_{1,p}^2 = 1 , \]
for which the sample variance of the n
\(z_1\) values is maximized. (If there were no constraint on the magnitude of the coefficients, then
there would be no variance-maximizing linear combination of the explanatory variables --- the variance could always
be increased by increasing the magnitudes of the coefficients.)
Equivalently, the \(a_{1,j}\) identify a direction in the
p-dimensional space such that if the n x points are projected onto a line having that
direction, the projected points are as spread out as possible (as measured by their sample variance).
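Here is a sketch of how the first principal component can be computed from the singular value decomposition of the centered data matrix (only centering is done in this sketch; the standardization discussed just below would be applied to X beforehand): the leading right singular vector is the unit-length coefficient vector \(a_1\), and projecting onto it gives the \(z_1\) scores.

```python
import numpy as np

def first_principal_component(X):
    """Return the unit-length direction a1 that maximizes the sample
    variance of the projections z1 = Xc @ a1, where Xc is X with its
    column means subtracted, along with the z1 scores themselves."""
    Xc = X - X.mean(axis=0)
    # rows of Vt are the principal component directions, largest variance first
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    a1 = Vt[0]
    z1 = Xc @ a1
    return a1, z1
```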
When obtaining principal components, the variables are typically standardized first (by dividing by the sample
standard deviation), since otherwise variables whose sample variances are large (perhaps only because of the units of
measurement) can greatly dominate, and overall the desired results may not be obtained.
The second principal component is the linear combination of the (explanatory) variables,
\[ z_2 = a_{2,1} x_1 + a_{2,2} x_2 + \cdots + a_{2,p} x_p , \]
having
\[ a_{2,1}^2 + a_{2,2}^2 + \cdots + a_{2,p}^2 = 1 \]
and such that
\(a_2 = (a_{2,1}, a_{2,2}, \ldots, a_{2,p})\)
is orthogonal to
\(a_1 = (a_{1,1}, a_{1,2}, \ldots, a_{1,p})\),
for which the sample variance of the n
\(z_2\) values is maximized. Equivalently, the \(a_{2,j}\) identify a direction in the
p-dimensional space, orthogonal to the direction indicated by the first principal component, such that
if the n x points are projected onto a line having that
direction, the projected points are as spread out as possible (as measured by their sample variance).
The third principal component is determined similarly, and corresponds to a direction orthogonal to the directions of
the first two principal components, and subsequent principal components are defined in a like manner.
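A quick numerical check of these orthogonality claims, using scikit-learn's PCA (whose components_ attribute stores the directions \(a_1, a_2, \ldots\) as rows) on made-up data:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(4)
X = rng.normal(size=(200, 5))

A = PCA().fit(X).components_       # rows are the unit-length directions a1, a2, ...
print(np.round(A @ A.T, 10))       # numerically the identity matrix: orthonormal rows
```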
An alternative viewpoint for principal components is that the first k principal components are orthogonal
directions corresponding to a k-dimensional subspace for which the orthogonal distances of the n
p-dimensional points to that subspace are collectively as small as possible (as measured by the sum of the
squared distances). This suggests that the n projected points (the n
k-dimensional vectors of the principal component values) are the n
k-dimensional vectors that best "approximate" the original p-dimensional points.
(Note: The
k-dimensional vectors of the principal component values can still be viewed as points in
p-dimensional space since each principal component involves all p variables.) Due to this, it would
seem that perhaps as little as possible would be lost by replacing the original set of p explanatory variables
by a smaller set of principal component variables.
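The subspace viewpoint can be illustrated numerically: projecting made-up points onto the first k principal component directions (here k = 2, an arbitrary choice) and mapping the scores back to p-dimensional space with inverse_transform gives the points of the closest 2-dimensional (affine) subspace, and because the data below are constructed to lie close to a plane, the leftover sum of squared distances is a small fraction of the total variation.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(5)
latent = rng.normal(size=(200, 2))                 # points essentially on a plane
B = rng.normal(size=(2, 5))
X = latent @ B + 0.1 * rng.normal(size=(200, 5))   # 5 measured variables, mostly 2-d

pca = PCA(n_components=2).fit(X)
X_hat = pca.inverse_transform(pca.transform(X))    # closest points in the 2-d subspace
total = np.sum((X - X.mean(axis=0)) ** 2)
residual = np.sum((X - X_hat) ** 2)
print("total sum of squares about the mean:     ", round(total, 1))
print("sum of squared distances to the subspace:", round(residual, 1))
```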
It should be noted that the coefficients for the principal components were not determined using the response variable
values --- they are not obtained by specifically trying to create a good model for the response variable. Because of
this, the coefficients of the principal components do not lead to overfitting in the same way as what occurs if we
specifically select a lot of coefficients in a linear regression model to make the fitted values close to the
observed values of the response variable. It is also interesting to note that if one uses p principal
components in a regression model, and then rewrites the fitted principal component model in terms of the original
p explanatory variables, the result is the same as what one gets when a least squares model is fit using all p
explanatory variables. So p principal components and p explanatory variables are of equal worth (with
regard to finding a good prediction model). But it may well be that fewer than p principal components can be
better than the same number of explanatory variables.
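That equivalence is easy to check numerically (a sketch on made-up data): fitting OLS on all p principal component scores gives the same fitted values as OLS on the original p variables.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(6)
X = rng.normal(size=(100, 4))
y = X[:, 0] - 2.0 * X[:, 1] + rng.normal(size=100)

ols = LinearRegression().fit(X, y)
pcr_all = make_pipeline(PCA(n_components=X.shape[1]), LinearRegression()).fit(X, y)

# with all p components retained, the fitted values agree (up to rounding error)
print(np.allclose(ols.predict(X), pcr_all.predict(X)))
```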
There are other regression methods that use derived directions as predictors. One such method is partial least
squares regression. Unlike principal components, the directions used in partial least squares regression
are determined using the response variable values. Unfortunately, we don't have enough time to thoroughly cover all
of the interesting methods that are introduced in HTF.