Introduction to Classification and Regression



This course will cover a large number of methods used for classification and regression, with an emphasis on newer computer-intensive methods, since traditional methods such as discriminant analysis for classification (LDA and QDA) and ordinary least squares regression (OLS regression, sometimes just referred to as LS regression) are covered in other STAT courses at GMU.

In both classification and regression there is a response variable (aka the dependent variable or outcome variable), as well as one or more predictor variables (aka explanatory variables or inputs (or even independent variables)).



In classification problems, G is often used to denote the response variable (although sometimes Y is used). G is a categorical variable (or possibly an ordinal variable), and the script letter 𝒢 is used to denote the set of possible values of G. I'll begin by using
V1, V2, ..., Vm
(for some m ≥ 1) for the available predictor variables, although
X1, X2, ..., Xp
are commonly used.

For example, consider the spam example introduced on p. 2 of HTF. To obtain the data, 4601 e-mail messages were used, each of which is known to be either "legit" e-mail or spam. It is desired to use this sample of messages to create a spam filter (or at least learn something about the creation of a spam filter).

All of the "words" (e.g., you, free, hp, edu, remove) and special characters (e.g., !, $, @) were determined, and 11 key variables,
%george, %you, ..., %remove,
were identified (see Table 1.1 on p. 2 of HTF).

We could let
G = {spam, email},
(where email denotes "legit" e-mail) and let
V1 be %george, the percentage of words and characters in a message which are George (or george or GEORGE),
V2 be %you, the percentage of words and characters in a message which are you (or You or YOU),
...,
V11 be %remove, the percentage of words and characters in a message which are remove (or Remove or REMOVE).
(Note: One could easily create additional predictor variables from the sample of messages.)
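
As a concrete illustration of how percentage features of this kind might be computed from the raw text of a message, here is a minimal Python sketch; the function name and the simple tokenization rule are my own choices for illustration, not the procedure actually used to construct the spam data.

    import re

    def percent_word(message, word):
        # Split the message into "words" on runs of non-alphanumeric characters.
        # (The actual spam data also counted special characters such as ! and $;
        # this simplified tokenizer is just for illustration.)
        tokens = [t for t in re.split(r"\W+", message) if t]
        if not tokens:
            return 0.0
        matches = sum(1 for t in tokens if t.lower() == word.lower())
        return 100.0 * matches / len(tokens)

    # A toy message (made up):
    msg = "You can remove your name from our list. Reply REMOVE to opt out."
    print(percent_word(msg, "remove"))   # percentage of tokens matching "remove"
    print(percent_word(msg, "george"))   # 0.0 for this message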

A classifier is a rule which selects (predicts) a value for G, given observed values
v1, v2, ..., vm.
With some classification methods, the final classification rule may only involve a subset of the available predictors, even though all of them were considered and used to arrive at the final rule.

Typically, classifiers are constructed using a training sample (or learning sample) consisting of n complete cases for which the values of both the predictor variables and the response variable are known, although some methods allow for some predictor variable values to be missing. (In the machine learning community, the cases are sometimes called examples, and a classifier is sometimes referred to as a learner.) There are many methods for constructing classifiers from data! The purpose of the classifier is typically to predict the classes of future cases for which only the values of the predictor variables are known, although sometimes the chief goal for creating a classification rule is to gain a better understanding of a certain phenomenon.
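
To make the notion of a training sample concrete, the following sketch fits a classifier to a tiny made-up training sample (using scikit-learn, and a classification tree, which is just one of the many possible methods) and then predicts the class of a future case from its predictor values alone; the numerical values are invented and are not taken from the actual spam data.

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    # Hypothetical training sample: each row gives (%george, %you, %remove)
    # for one message, and y_train gives the known class of that message.
    X_train = np.array([[0.00, 2.26, 0.28],
                        [1.27, 1.27, 0.00],
                        [0.00, 1.80, 0.45],
                        [2.10, 0.40, 0.00]])
    y_train = np.array(["spam", "email", "spam", "email"])

    clf = DecisionTreeClassifier().fit(X_train, y_train)

    # Predict the class of a future message from its predictor values alone.
    x_new = np.array([[0.00, 1.60, 0.30]])
    print(clf.predict(x_new))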

Some simple classification rules are given near the bottom of p. 2 of HTF. It can be noted that each of them partitions the measurement space (the set of all possible values of the predictor variables) --- each point in the measurement space either belongs to the set of points that will be classified as spam, or the set of points that will be classified as email.
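
A rule of that general form can be written directly as a function of the predictor values; the thresholds below are placeholders rather than the exact ones in HTF, but the sketch shows how such a rule assigns every point of the measurement space to either the spam region or the email region.

    def simple_rule(pct_george, pct_you):
        # Hypothetical thresholds; every (pct_george, pct_you) point falls
        # into exactly one of the two regions of the measurement space.
        if pct_george < 0.6 and pct_you > 1.5:
            return "spam"
        return "email"

    print(simple_rule(0.0, 2.3))   # classified as spam
    print(simple_rule(1.2, 0.4))   # classified as email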



In regression analysis, the response variable is numerical, and is typically denoted by Y. Often the chief goal is to make a prediction of an unknown value of the response variable associated with a given set of known predictor values, or it is to find a function which approximates the mean or median of Y, conditioned on the predictor variable values. (A more ambitious goal is to develop a model for Y, consisting of an expression for the mean or median, along with the distribution of Y about the mean or median.)
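
A small numerical check of the mean-versus-median distinction may help: for a fixed setting of the predictors, the conditional mean is the best single prediction under squared-error loss, while the conditional median is best under absolute-error loss. The toy numbers below are made up solely to illustrate this.

    import numpy as np

    # Hypothetical Y values observed at one fixed setting of the predictor variables.
    y = np.array([1.0, 1.5, 2.0, 2.5, 10.0])   # one large value pulls up the mean

    candidates = np.linspace(0, 10, 1001)
    sq_loss = [np.mean((y - c) ** 2) for c in candidates]
    abs_loss = [np.mean(np.abs(y - c)) for c in candidates]

    print(candidates[np.argmin(sq_loss)])    # about 3.4, the mean
    print(candidates[np.argmin(abs_loss)])   # 2.0, the median
    print(y.mean(), np.median(y))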

If we consider the task of predicting values of Y, then creating a (regression) prediction rule is somewhat similar to constructing a classifier --- the main difference being that the response variable being predicted is numerical, instead of categorical. For regression, the prediction rule can be some function,
h(v1, v2, ..., vm),
of the inputs
v1, v2, ..., vm.
As with classification, there are many methods which can be used for regression. Even if one chooses to use OLS regression (by far the most commonly used method), there are many strategies for arriving at a specific prediction rule --- that is, there are many ways to determine what function of the predictors to use. (Least squares is a way of estimating the unknown parameters associated with a chosen model, but one is free to fit a wide variety of models, and it is often hard to select the most appropriate one.)
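
For instance, least squares can be used to estimate the parameters of several different functions of the same predictor, and the analyst still has to decide which fitted form to use. Here is a minimal sketch using simulated data (the model forms and the simulated log relationship are chosen only for illustration, so the second form should typically fit better here).

    import numpy as np

    rng = np.random.default_rng(1)
    v1 = rng.uniform(0.5, 5, size=100)
    y = 3 + 2 * np.log(v1) + rng.normal(scale=0.3, size=100)

    # Two candidate OLS models: linear in v1, and linear in log(v1).
    X_lin = np.column_stack([np.ones_like(v1), v1])
    X_log = np.column_stack([np.ones_like(v1), np.log(v1)])

    beta_lin, *_ = np.linalg.lstsq(X_lin, y, rcond=None)
    beta_log, *_ = np.linalg.lstsq(X_log, y, rcond=None)

    def rss(X, b):
        # Residual sum of squares for a fitted linear model.
        return np.sum((y - X @ b) ** 2)

    print("RSS, linear in v1:     ", rss(X_lin, beta_lin))
    print("RSS, linear in log(v1):", rss(X_log, beta_log))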

As an example, consider the prostate cancer example on pp. 3-4 of HTF. The response variable is given to be lpsa (the log of the PSA level). Perhaps the original explanatory variables included the cancer volume and the prostate weight, but it may be that using the log of the cancer volume and the log of the prostate weight leads to a simpler, and better performing, prediction function. It is often the case that using transformations of the original variables, e.g.,
x6 = 1/v3
and
x8 = v7^(3/2),
and/or constructed variables like
x9 = v5/v2
can result in a better prediction rule than one can obtain just using the original set of explanatory variables. (Sometimes constructed variables are called features, but some also refer to an untransformed explanatory variable as a feature.) Some prefer to avoid using too many transformations of the form
s(vj),
and instead add
vj^2 and vj^3
to the set of predictors, thinking that the ideal, but unknown, transformation
s(vj)
can be approximated by a linear function of
vj, vj^2, and vj^3.
However, it needs to be kept in mind that the more variables one uses, the more flexible the model is, and it could be that, while the training data are fit very well, the overfit predictor performs poorly when it comes to predicting the response variable values for cases not in the training data. (One can say that the generalization error performance is poor.)
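
The sketch below illustrates both the sort of constructed predictors described above and the overfitting danger that comes with adding many flexible terms; the simulated data, the choice of log as the unknown ideal transformation, and the polynomial degrees are all made up for illustration.

    import numpy as np

    rng = np.random.default_rng(3)

    # Simulated data in which the ideal (but "unknown") transformation is s(v) = log(v).
    n = 20
    v = rng.uniform(1, 3, size=n)
    y = 2 + np.log(v) + rng.normal(scale=0.2, size=n)

    v_new = rng.uniform(1, 3, size=2000)   # cases not in the training data
    y_new = 2 + np.log(v_new) + rng.normal(scale=0.2, size=2000)

    def poly_design(v, degree):
        # Columns 1, t, t^2, ..., t^degree (with v centered for numerical stability);
        # the powers of v stand in for the unknown transformation s(v).
        t = v - 2.0
        return np.column_stack([t ** k for k in range(degree + 1)])

    for degree in (2, 15):
        beta, *_ = np.linalg.lstsq(poly_design(v, degree), y, rcond=None)
        mse_train = np.mean((y - poly_design(v, degree) @ beta) ** 2)
        mse_new = np.mean((y_new - poly_design(v_new, degree) @ beta) ** 2)
        print(degree, mse_train, mse_new)
    # The degree-15 fit typically matches the training data more closely,
    # but often predicts the new cases worse (poorer generalization).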