Comparison of Linear Methods for Classification
Here I provide comments about some parts of Ch. 4 of HTF that I didn't cover (enough) previously, but I still won't
attempt to cover everything that is in the chapter.
using OLS regression for classification (also dealt with in Ch. 2) and comparison with LDA
In the binary case, one can code the two classes with 0 and 1, and then do a least squares regression.
If we view the fitted model as an estimate of E( Y | x1, ..., xp ), and then note that because Y is binary, this is the same as P( Y = 1 | x1, ..., xp ), then we have a way of estimating the probability that the outcome 1 results from a given value of x.
It may not be so bad that some of the estimated probabilities fall outside of [0, 1], because to obtain a classification the interest is only in whether or not the estimated probability exceeds 0.5, unless the misclassification cost is not the same for the two types of errors that can be made. (I'll say something about making adjustments to minimize the expected misclassification cost in such situations.)
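To make the adjustment concrete, here is a small sketch (my own illustration in Python, showing the standard decision-theoretic result, not anything specific to HTF's treatment): if c10 is the cost of misclassifying a class 0 observation as class 1, and c01 is the cost of the reverse error, then the expected cost is minimized by classifying as 1 whenever the estimated probability exceeds c10/(c10 + c01) instead of 0.5.

    import numpy as np

    def classify(prob_hat, c10=1.0, c01=1.0):
        # prob_hat: estimated P(Y = 1 | x); c10 is the cost of calling
        # a 0 a 1, and c01 the cost of calling a 1 a 0; with equal
        # costs the cutoff reduces to the usual 0.5
        cutoff = c10 / (c10 + c01)
        return (np.asarray(prob_hat) > cutoff).astype(int)

    # if a false negative is 4 times as costly as a false positive,
    # the cutoff drops to 1/(1 + 4) = 0.2
    print(classify([0.1, 0.3, 0.7], c10=1.0, c01=4.0))   # [0 1 1]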
Even though this regression scheme may seem a bit crude, it can be shown that in the two-class case it produces
a linear decision boundary which is parallel to (and often nearly identical to) the linear boundary determined by
LDA. (At most the intercepts differ, and then only if the numbers of observations for the two classes are not the
same.) Since the normality assumption underlying LDA is seldom met in "real" problems, there is no good reason to
strongly favor the LDA classifier over the regression classifier, and so in HTF it is suggested that one keep the
common direction identified by the two methods and then search for a good intercept, picking the value that works
best on the training data.
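To see the agreement of the two directions concretely, here is a small sketch (my own illustration, with made-up Gaussian data and unequal class sizes) that fits both classifiers and compares the normalized slope vectors of the two boundaries:

    import numpy as np
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

    rng = np.random.default_rng(0)
    # unbalanced two-class Gaussian data (60 vs. 40 observations)
    X = np.vstack([rng.normal(0, 1, (60, 2)), rng.normal(2, 1, (40, 2))])
    y = np.repeat([0, 1], [60, 40])

    # least squares on the 0/1 response; the boundary is where the
    # fitted value equals 0.5
    beta = np.linalg.lstsq(np.column_stack([np.ones(100), X]), y, rcond=None)[0]
    lda = LinearDiscriminantAnalysis().fit(X, y)

    # the slope vectors should be (nearly) proportional, i.e. the two
    # boundaries are (nearly) parallel; with unequal class sizes the
    # implied intercepts generally differ
    print(beta[1:] / np.linalg.norm(beta[1:]))
    print(lda.coef_[0] / np.linalg.norm(lda.coef_[0]))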
As with LDA, one can use transformed predictors and higher-order terms to create nonlinear boundaries (in the space of
the original predictors).
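For instance, one can feed LDA a degree-2 expansion of the predictors. In the sketch below (my own illustration, with made-up data in which one class is surrounded by a ring of points from the other), the boundary is linear in the five expanded variables but quadratic in the original two:

    import numpy as np
    from sklearn.preprocessing import PolynomialFeatures
    from sklearn.pipeline import make_pipeline
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

    rng = np.random.default_rng(1)
    # class 0 is a central blob, class 1 a surrounding ring: no straight
    # line in (x1, x2) separates them
    X0 = rng.normal(0, 0.5, (100, 2))
    theta = rng.uniform(0, 2 * np.pi, 100)
    X1 = np.column_stack([2 * np.cos(theta), 2 * np.sin(theta)])
    X1 = X1 + rng.normal(0, 0.3, (100, 2))
    X = np.vstack([X0, X1])
    y = np.repeat([0, 1], 100)

    # degree-2 expansion (x1, x2, x1^2, x1*x2, x2^2): the boundary is
    # linear in the expanded space but quadratic in the original one
    quad_lda = make_pipeline(
        PolynomialFeatures(degree=2, include_bias=False),
        LinearDiscriminantAnalysis(),
    ).fit(X, y)
    print(quad_lda.score(X, y))   # near 1; plain LDA does far worse here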
If there are more than two classes, then the workload is increased. (E.g., if there are three classes, one needs to
fit three regression models, one for each class indicator.) Also, masking can occur (see Fig. 4.2 and Fig. 4.3 in HTF (pp. 83-84)).
However, the masking problem can often be worked around by fitting higher-order regression models (expanding the set
of predictors), as in the sketch below. LDA does not suffer from masking in the same way that the regression method can.
Unlike in the two-class case, with three or more classes LDA and regression can give quite different results.
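Here is a small sketch of the masking phenomenon and the higher-order fix (my own illustration, with made-up one-dimensional data in the spirit of HTF's Fig. 4.2): with three well-separated classes along a single predictor, the fitted indicator-regression line for the middle class is nearly flat and almost never the largest of the three, so that class is almost never predicted, while adding a squared term recovers it.

    import numpy as np

    rng = np.random.default_rng(2)
    # three classes strung out along a single predictor
    x = np.concatenate([rng.normal(m, 0.5, 100) for m in (-4.0, 0.0, 4.0)])
    labels = np.repeat([0, 1, 2], 100)
    Y = np.eye(3)[labels]                       # indicator response matrix

    def fit_predict(design):
        # one least squares fit per class indicator; classify each point
        # by whichever fitted value is largest
        B = np.linalg.lstsq(design, Y, rcond=None)[0]
        return (design @ B).argmax(axis=1)

    linear = np.column_stack([np.ones_like(x), x])
    quadratic = np.column_stack([np.ones_like(x), x, x ** 2])

    # with the linear fit the middle class is (almost) never predicted;
    # the quadratic fit recovers all three classes
    print(np.bincount(fit_predict(linear), minlength=3))
    print(np.bincount(fit_predict(quadratic), minlength=3))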
comparison of using LDA and logistic regression for classification
Expressions (4.29) and (4.30) on pp. 103-104 of HTF show that with both LDA and logistic regression, the logit, and
hence the decision boundary, is modeled as being a linear function of the predictors. A main difference in how the
methods perform is due to the way the coefficients for the logit model are estimated. With LDA they result from the
parameter estimates for the assumed normal distributions, and with logistic regression maximum likelihood estimation
is used to fit the model, using the observed response variable values along with the predictor variable values to
directly estimate P( Y = 1 | x ).
If the normality and common covariance matrix assumptions
underlying LDA hold, or, more realistically, if they nearly hold, then LDA may do a bit better
than logistic regression. However, if the assumptions do not hold, and in particular if there are extreme outliers,
then LDA can do poorly and logistic regression can do much better, since logistic regression is more resistant to
extreme values.
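Here is a small sketch of this point (my own illustration, with made-up data): a few extreme but correctly labeled training points barely move the logistic regression fit, since they sit far on the correct side of the boundary, but they badly distort the mean and covariance estimates that LDA is built from.

    import numpy as np
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(3)

    def sample(n):
        # two Gaussian classes per draw; only x1 separates them
        X = np.vstack([rng.normal([0, 0], 1, (n, 2)),
                       rng.normal([3, 0], 1, (n, 2))])
        return X, np.repeat([0, 1], n)

    X_tr, y_tr = sample(100)
    X_te, y_te = sample(2000)

    # replace five class-1 training points with extreme (but correctly
    # labeled) values; logistic regression nearly ignores them, while
    # they drag LDA's mean and covariance estimates around
    X_tr[-5:] = rng.normal([40, 40], 1, (5, 2))

    for clf in (LinearDiscriminantAnalysis(), LogisticRegression(max_iter=1000)):
        print(type(clf).__name__, clf.fit(X_tr, y_tr).score(X_te, y_te))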
Another advantage of logistic regression is that there are nice ways to get guidance from the data
as to how to do variable transformation and variable selection; we don't have this with LDA (although one could
forge ahead and try some different tactics based on hints found in working with the data).
For example, in the last situation considered in
these examples, for which the predictors have nonnormal distributions and one
of them isn't useful for predicting the response (since the distribution of the variable is exactly the same for both
classes), logistic regression outperformed LDA on the untransformed data. Deleting the
nonsignificant variable from the logistic regression fit may bring further improvement, and still more
may be possible if variable transformation is employed.
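As a sketch of what such data-driven guidance can look like in practice (my own illustration, with made-up data loosely mimicking that setup, not the data from the examples linked above): one can inspect the Wald p-values from a fitted logistic regression and drop a predictor that appears useless.

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(4)
    n = 200
    x1 = rng.standard_t(3, size=n)         # nonnormal, related to the response
    x2 = rng.exponential(size=n)           # same distribution for both classes
    y = (x1 + rng.normal(size=n) > 0).astype(int)

    X = sm.add_constant(np.column_stack([x1, x2]))
    fit = sm.Logit(y, X).fit(disp=0)
    print(fit.pvalues)    # a large p-value for x2 suggests dropping it

    # refit with the apparently useless predictor removed
    fit2 = sm.Logit(y, sm.add_constant(x1)).fit(disp=0)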