Comparison of Linear Methods for Classification
Here I provide comments about some parts of Ch. 4 of HTF that I didn't cover (enough) previously, but I still won't
attempt to cover everything that is in the chapter.
using OLS regression for classification (also dealt with in Ch. 2) and comparison with LDA
In the binary case, one can code the two classes with 0 and 1, and then do a least squares regression.
If we view the fitted model as an estimate of E( Y | x1, ..., xp ), and then note that because Y is binary, this is the same as P( Y = 1 | x1, ..., xp ), then we have a way of estimating the probability that the outcome 1 results from a given value of x.
It may not be so bad that some of the estimated probabilities fall outside of [0, 1], because to obtain a classification the interest is only in whether or not the estimated probability exceeds 0.5, unless the misclassification cost is not the same for the two types of errors that can be made. (I'll say something about making adjustments to minimize the expected misclassification cost in such situations.)
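To make the adjustment concrete, here is a small sketch (my own illustration in Python, showing the standard decision-theoretic result, not anything specific to HTF's treatment): if c10 is the cost of misclassifying a class 0 observation as class 1, and c01 is the cost of the reverse error, then the expected cost is minimized by classifying as 1 whenever the estimated probability exceeds c10/(c10 + c01) instead of 0.5.

    import numpy as np

    def classify(prob_hat, c10=1.0, c01=1.0):
        # prob_hat: estimated P(Y = 1 | x); c10 is the cost of calling
        # a 0 a 1, and c01 the cost of calling a 1 a 0; with equal
        # costs the cutoff reduces to the usual 0.5
        cutoff = c10 / (c10 + c01)
        return (np.asarray(prob_hat) > cutoff).astype(int)

    # if a false negative is 4 times as costly as a false positive,
    # the cutoff drops to 1/(1 + 4) = 0.2
    print(classify([0.1, 0.3, 0.7], c10=1.0, c01=4.0))   # [0 1 1]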
Even though this regression scheme may seem a bit crude, it can be shown that in the two-class case it produces
a linear decision boundary which is parallel to (and often nearly identical to) the linear boundary determined by
LDA. (At most the intercepts differ, and then only if the numbers of observations for the two classes are not the
same.) Since the normality assumption underlying LDA is seldom met in "real" problems, there is no good reason to
strongly favor the LDA classifier over the regression classifier, and so in HTF it is suggested that one keep the
common direction identified by the two methods and then search for a good intercept, picking the value that works
best on the training data.
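To see the agreement of the two directions concretely, here is a small sketch (my own illustration, with made-up Gaussian data and unequal class sizes) that fits both classifiers and compares the normalized slope vectors of the two boundaries:

    import numpy as np
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

    rng = np.random.default_rng(0)
    # unbalanced two-class Gaussian data (60 vs. 40 observations)
    X = np.vstack([rng.normal(0, 1, (60, 2)), rng.normal(2, 1, (40, 2))])
    y = np.repeat([0, 1], [60, 40])

    # least squares on the 0/1 response; the boundary is where the
    # fitted value equals 0.5
    beta = np.linalg.lstsq(np.column_stack([np.ones(100), X]), y, rcond=None)[0]
    lda = LinearDiscriminantAnalysis().fit(X, y)

    # the slope vectors should be (nearly) proportional, i.e. the two
    # boundaries are (nearly) parallel; with unequal class sizes the
    # implied intercepts generally differ
    print(beta[1:] / np.linalg.norm(beta[1:]))
    print(lda.coef_[0] / np.linalg.norm(lda.coef_[0]))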
As with LDA, one can use transformed predictors and higher-order terms to create nonlinear boundaries (in the space of
the original predictors).
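For instance, one can feed LDA a degree-2 expansion of the predictors. In the sketch below (my own illustration, with made-up data in which one class is surrounded by a ring of points from the other), the boundary is linear in the five expanded variables but quadratic in the original two:

    import numpy as np
    from sklearn.preprocessing import PolynomialFeatures
    from sklearn.pipeline import make_pipeline
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

    rng = np.random.default_rng(1)
    # class 0 is a central blob, class 1 a surrounding ring: no straight
    # line in (x1, x2) separates them
    X0 = rng.normal(0, 0.5, (100, 2))
    theta = rng.uniform(0, 2 * np.pi, 100)
    X1 = np.column_stack([2 * np.cos(theta), 2 * np.sin(theta)])
    X1 = X1 + rng.normal(0, 0.3, (100, 2))
    X = np.vstack([X0, X1])
    y = np.repeat([0, 1], 100)

    # degree-2 expansion (x1, x2, x1^2, x1*x2, x2^2): the boundary is
    # linear in the expanded space but quadratic in the original one
    quad_lda = make_pipeline(
        PolynomialFeatures(degree=2, include_bias=False),
        LinearDiscriminantAnalysis(),
    ).fit(X, y)
    print(quad_lda.score(X, y))   # near 1; plain LDA does far worse here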
If there are more than two classes, then the workload is increased. (E.g., if there are three classes, one needs to
fit three regression models, one for each class indicator.) Also, masking can occur (see Fig. 4.2 and Fig. 4.3 in HTF (pp. 83-84)).
However, the masking problem can often be worked around by fitting higher-order regression models (expanding the set
of predictors), as in the sketch below. LDA does not suffer from masking in the same way that the regression method can.
Unlike in the two-class case, with three or more classes LDA and regression can give quite different results.
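Here is a small sketch of the masking phenomenon and the higher-order fix (my own illustration, with made-up one-dimensional data in the spirit of HTF's Fig. 4.2): with three well-separated classes along a single predictor, the fitted indicator-regression line for the middle class is nearly flat and almost never the largest of the three, so that class is almost never predicted, while adding a squared term recovers it.

    import numpy as np

    rng = np.random.default_rng(2)
    # three classes strung out along a single predictor
    x = np.concatenate([rng.normal(m, 0.5, 100) for m in (-4.0, 0.0, 4.0)])
    labels = np.repeat([0, 1, 2], 100)
    Y = np.eye(3)[labels]                       # indicator response matrix

    def fit_predict(design):
        # one least squares fit per class indicator; classify each point
        # by whichever fitted value is largest
        B = np.linalg.lstsq(design, Y, rcond=None)[0]
        return (design @ B).argmax(axis=1)

    linear = np.column_stack([np.ones_like(x), x])
    quadratic = np.column_stack([np.ones_like(x), x, x ** 2])

    # with the linear fit the middle class is (almost) never predicted;
    # the quadratic fit recovers all three classes
    print(np.bincount(fit_predict(linear), minlength=3))
    print(np.bincount(fit_predict(quadratic), minlength=3))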
comparison of using LDA and logistic regression for classification
Expressions (4.29) and (4.30) on pp. 103-104 of HTF show that with both LDA and logistic regression, the logit, and
hence the decision boundary, is modeled as being a linear function of the predictors. A main difference in how the
methods perform is due to the way the coefficients for the logit model are estimated. With LDA they result from the
parameter estimates for the assumed normal distributions, and with logistic regression maximum likelihood estimation
is used to fit the model, using the observed response variable values along with the predictor variable values to
directly estimate P( Y = 1 | x ).
If the normality and common covariance matrix assumptions
underlying LDA hold, or, more realistically, if they nearly hold, then LDA may do a bit better
than logistic regression. However, if the assumptions do not hold, and in particular if there are extreme outliers,
then LDA can do poorly and logistic regression can do much better, since logistic regression is more resistant to
extreme values.
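Here is a small sketch of this point (my own illustration, with made-up data): a few extreme but correctly labeled training points barely move the logistic regression fit, since they sit far on the correct side of the boundary, but they badly distort the mean and covariance estimates that LDA is built from.

    import numpy as np
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(3)

    def sample(n):
        # two Gaussian classes per draw; only x1 separates them
        X = np.vstack([rng.normal([0, 0], 1, (n, 2)),
                       rng.normal([3, 0], 1, (n, 2))])
        return X, np.repeat([0, 1], n)

    X_tr, y_tr = sample(100)
    X_te, y_te = sample(2000)

    # replace five class-1 training points with extreme (but correctly
    # labeled) values; logistic regression nearly ignores them, while
    # they drag LDA's mean and covariance estimates around
    X_tr[-5:] = rng.normal([40, 40], 1, (5, 2))

    for clf in (LinearDiscriminantAnalysis(), LogisticRegression(max_iter=1000)):
        print(type(clf).__name__, clf.fit(X_tr, y_tr).score(X_te, y_te))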
Another advantage of logistic regression is that there are nice ways to get guidance from the data
as to how to do variable transformation and variable selection; we don't have this with LDA (although one could
forge ahead and try some different tactics based on hints found in working with the data).
For example, in the last situation considered in
these examples, for which the predictors have nonnormal distributions and one
of them isn't useful for predicting the response (since the distribution of the variable is exactly the same for both
classes), logistic regression outperformed LDA on the untransformed data. Deleting the
nonsignificant variable from the logistic regression fit may bring further improvement, and still more
may be possible if variable transformation is employed.
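As a sketch of what such data-driven guidance can look like in practice (my own illustration, with made-up data loosely mimicking that setup, not the data from the examples linked above): one can inspect the Wald p-values from a fitted logistic regression and drop a predictor that appears useless.

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(4)
    n = 200
    x1 = rng.standard_t(3, size=n)         # nonnormal, related to the response
    x2 = rng.exponential(size=n)           # same distribution for both classes
    y = (x1 + rng.normal(size=n) > 0).astype(int)

    X = sm.add_constant(np.column_stack([x1, x2]))
    fit = sm.Logit(y, X).fit(disp=0)
    print(fit.pvalues)    # a large p-value for x2 suggests dropping it

    # refit with the apparently useless predictor removed
    fit2 = sm.Logit(y, sm.add_constant(x1)).fit(disp=0)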