Introduction to Local Methods for Regression and Classification
Here I provide comments about some parts of Ch. 2 of HTF that I didn't cover (enough) previously, but I still won't
attempt to cover everything that is in the chapter.
nearest neighbors methods
Suppose that there are p explanatory variables and we have a training set consisting of n observations of the form (xi, yi), where xi is the p-dimensional vector of explanatory variable values for the ith observation, and yi is the response variable value for the ith observation. To estimate the value of E(Y|x) for a given point x, or to predict a value of Y at that point, the method of k nearest neighbors averages the yi values corresponding to the k xi values which are closest to x.
To determine closeness, usually Euclidean distance is used, but other metrics can be used instead.
Typically, one should standardize the variables by dividing all of the values for the jth explanatory variable by the sample standard deviation of the n observed values of the jth explanatory variable (otherwise a variable like number of children would tend to have little influence in determining the nearest neighbors compared to a variable like salary (in dollars), and it would matter whether one used salary (in dollars) or salary (in thousands of dollars)).
It's possible that appreciably better results could be obtained if one didn't let all of the standardized predictors contribute equally to the distances that are used, since some of the predictors may be "noise variables" which have little or no influence on the response. (But if one uses a distance measure which incorporates weights, I don't know of a good way to efficiently adjust the weights to achieve (near) optimal performance.)
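Here is a minimal sketch of k nearest neighbors regression with standardized predictors, using the knn.reg function from the FNN package; the simulated data, the new point, and the choice of k = 5 are just for illustration.

```r
# k nearest neighbors regression with standardized predictors
# (simulated data; FNN's knn.reg averages the yi values of the k nearest xi)
library(FNN)

set.seed(1)
n <- 200; p <- 3
x <- matrix(rnorm(n * p), n, p)                    # training explanatory variables
y <- x[, 1]^2 + 0.5 * x[, 2] + rnorm(n, sd = 0.3)  # training responses

x.std <- scale(x)                # center and scale each column (centering doesn't affect distances)
x.new <- matrix(c(0.5, -1, 2), nrow = 1)           # a new point at which to predict
x.new.std <- scale(x.new,
                   center = attr(x.std, "scaled:center"),
                   scale  = attr(x.std, "scaled:scale"))

fit <- knn.reg(train = x.std, test = x.new.std, y = y, k = 5)
fit$pred                                           # average of the 5 nearest yi values
```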
By choosing different values for k, one can obtain different predicted values. So an obvious question is: how should one select the value of k?
If the goal is to minimize the mean squared prediction error, then what won't typically work well is to select the
value of k which minimizes the average squared error for the observations in the training sample, because the
average squared error is 0 for the case of k = 1. Instead, one should choose the value of k which
minimizes the average squared error on an independent validation set (which some refer to as a test
set), or one should use cross-validation.
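As a rough sketch of the cross-validation approach, the code below selects k for k nearest neighbors regression using 5-fold cross-validation (again with FNN::knn.reg; the simulated data, the candidate values of k, and the number of folds are all arbitrary choices made for illustration).

```r
# choosing k by 5-fold cross-validation
library(FNN)

set.seed(2)
n <- 200; p <- 3
x <- scale(matrix(rnorm(n * p), n, p))
y <- x[, 1]^2 + 0.5 * x[, 2] + rnorm(n, sd = 0.3)

k.values <- 1:25
folds <- sample(rep(1:5, length.out = n))      # random fold assignments
cv.mse <- sapply(k.values, function(k) {
  mean(sapply(1:5, function(f) {
    pred <- knn.reg(train = x[folds != f, ], test = x[folds == f, ],
                    y = y[folds != f], k = k)$pred
    mean((y[folds == f] - pred)^2)             # squared error on the held-out fold
  }))
})
k.values[which.min(cv.mse)]                    # the value of k with smallest CV error
```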
Some say that the effective number of parameters (sometimes called the effective degrees of freedom) for k nearest neighbors regression is n/k, because if we divided up the space of the explanatory variables into nonoverlapping regions having k observations in each one, to obtain n/k regions, and we estimated the expected response value for each region using the sample mean of the observations belonging to the region, we would have n/k parameters (the unknown means) to estimate. (Of course, with k nearest neighbors regression the regions aren't fixed.)
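For instance, with n = 200 training observations and k = 10, a k nearest neighbors regression fit has about 200/10 = 20 effective degrees of freedom, so it behaves roughly like a fitted model having 20 free parameters.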
k nearest neighbors regression is a local regression method --- that is, one doesn't try to find a single model to fit all of the observations in a data set (and fit the model using all of the data), but instead one just uses observations having x values in a smallish area to predict a response value corresponding to an x value in that area. There are other ways of doing local regression that are typically better. One relatively simple alternative is to predict Y using the Nadaraya-Watson weighted average (a kernel-based method), which uses a weighted average of the yi values, with the weights decreasing with increasing distance between xi and x (see p. 35 of HTF). However, other kernel-based methods such as lowess (loess) tend to perform better. (I'll discuss such kernel-based methods, and explain how to use R to perform them, when I cover some of the material in Ch. 6 of HTF.)
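Although the details are left for Ch. 6, base R already provides simple versions of both ideas: ksmooth computes a Nadaraya-Watson estimate for a single predictor, and loess does local weighted regression. The following is only a quick illustration with simulated data; the bandwidth and span values are arbitrary.

```r
# Nadaraya-Watson kernel regression and loess for a single predictor
set.seed(3)
x <- sort(runif(150, 0, 10))
y <- sin(x) + rnorm(150, sd = 0.3)

nw <- ksmooth(x, y, kernel = "normal", bandwidth = 1, x.points = x)  # Nadaraya-Watson
lo <- loess(y ~ x, span = 0.4)                                       # local weighted regression

plot(x, y, col = "gray")
lines(nw$x, nw$y, lwd = 2)                     # kernel-weighted average of the yi
lines(x, predict(lo), lwd = 2, lty = 2)        # loess fit
```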
k nearest neighbors classification is similar to k nearest neighbors regression, except that instead of averaging numerical response values to get a prediction, a simple majority rule is used to determine the predicted class for a given x --- in a sense, each of the k nearest neighbors in the training sample casts a single vote for its class, and the class with the largest number of votes becomes the predicted class. (If there are only two classes, only odd values are considered for k in order to reduce ties (since with an odd value of k the only problem that can occur is if there are ties when determining the nearest neighbors, and this tends not to happen with "continuous" numerical explanatory variables).) Unlike k nearest neighbors regression, k nearest neighbors classification can be competitive in many classification settings if care is taken to select a good value to use for k (using cross-validation, or a validation sample).
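As a sketch of this, the code below uses the knn and knn.cv functions from the class package on a two-class subset of the iris data, choosing k by leave-one-out cross-validation; the particular data set and the new observation are just for illustration.

```r
# k nearest neighbors classification, with k chosen by leave-one-out CV
library(class)

two <- droplevels(iris[iris$Species != "setosa", ])   # a two-class problem
x <- scale(two[, 1:4])                                 # standardized predictors
cl <- two$Species

k.values <- seq(1, 25, by = 2)                 # odd values of k only (two classes)
cv.error <- sapply(k.values, function(k)
  mean(knn.cv(train = x, cl = cl, k = k) != cl))
k.best <- k.values[which.min(cv.error)]

x.new <- scale(matrix(c(6.0, 3.0, 4.5, 1.5), nrow = 1),
               center = attr(x, "scaled:center"),
               scale  = attr(x, "scaled:scale"))
knn(train = x, test = x.new, cl = cl, k = k.best)      # majority vote of the k.best neighbors
```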
the curse of dimensionality
The so-called curse of dimensionality is a problem with local methods in high dimensions --- the nearest neighbors of x may be relatively far away. In regression, this can create a large bias in the predictions, especially if the target point corresponds to a minimum or maximum (because the response values for points in the neighborhood may all be well above or below a reasonable value for the prediction, since the neighborhood may have to be very large to contain k neighbors).
An example in HTF (pp. 22-23) deals with points uniformly distributed in (hyper)cubes of various dimensions. As the dimension, p, increases, the edges of a subset (hyper)cube have to become larger in order to contain (on average) the same percentage of the distributed points. To capture 1% of the points when p = 10, the edges of the subset cube have to be of length (about) 0.63 --- so nearest neighbors of the center of the (hyper)cube can be closer to some outer edge than they are to the center. Another example (p. 23) deals with a p-dimensional unit ball (meaning the radius is 1), again with uniformly distributed points in it.
Expression (2.24) on p. 23 gives the median distance from the center of the ball to the data point closest to the center. For p = 10, even with 500 points in the ball, the point closest to the center will typically (i.e., at its median distance) be closer to the surface of the ball than it is to the center! Sticking with this same example, and letting d(p, n) be the median distance to the point nearest the center when the dimension is p and the sample size is n, we have
- d(1, 10) = 0.067,
- d(2, 100) = 0.083,
- d(3, 1000) = 0.088,
- d(4, 10,000) = 0.091;
so even with n increasing exponentially, the median distance becomes larger as p increases.
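These values are easy to reproduce: expression (2.24) of HTF is d(p, n) = (1 - (1/2)^(1/n))^(1/p), and the edge length needed for a subcube to capture a fraction r of uniformly distributed points is r^(1/p). A quick check in R:

```r
# curse-of-dimensionality calculations from HTF, pp. 22-23
edge.length <- function(p, r) r^(1/p)         # edge needed to capture fraction r in p dimensions
d <- function(p, n) (1 - 0.5^(1/n))^(1/p)     # expression (2.24): median distance to nearest point

edge.length(10, 0.01)                          # about 0.63
d(10, 500)                                     # about 0.52 --- more than halfway to the surface
round(c(d(1, 10), d(2, 100), d(3, 1000), d(4, 10000)), 3)   # 0.067 0.083 0.088 0.091
```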
With all such examples, we have that in high dimensions, most points tend to be "close to the edge" --- most points
have an extreme value for at least one of the variables. Of course, in many real situations, points aren't uniformly
distributed in compact regions. But it's still the case that a lot of points will be extreme in at least one
variable, and that nearest neighbors are often not very similar.
To combat the curse, with nearest neighbors methods one could try to eliminate some variables of lesser usefulness in order to reduce the dimension. There isn't a really slick way to do this --- one could do a bit of brute-force searching, trying different subsets of the variables, I suppose (a rough sketch is given below). Or, one could take information learned from using other methods as to which variables may not be so important, and make more educated guesses about what possibilities to explore.
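Here is one rough sketch of the brute-force idea, assuming the number of predictors is small enough that all subsets can be tried: for each subset of the (standardized) predictors, estimate the mean squared prediction error of k nearest neighbors regression by cross-validation, and keep the subset with the smallest estimate. The data, the value of k, and the number of folds are all arbitrary.

```r
# brute-force search over predictor subsets for k nearest neighbors regression
library(FNN)

set.seed(4)
n <- 200; p <- 5
x <- scale(matrix(rnorm(n * p), n, p))         # predictors 3-5 are pure noise here
y <- x[, 1]^2 + 0.5 * x[, 2] + rnorm(n, sd = 0.3)

folds <- sample(rep(1:5, length.out = n))
cv.mse <- function(vars, k = 5) {
  mean(sapply(1:5, function(f) {
    pred <- knn.reg(train = x[folds != f, vars, drop = FALSE],
                    test  = x[folds == f, vars, drop = FALSE],
                    y = y[folds != f], k = k)$pred
    mean((y[folds == f] - pred)^2)
  }))
}

subsets <- unlist(lapply(1:p, function(m) combn(p, m, simplify = FALSE)),
                  recursive = FALSE)            # all nonempty subsets of the predictors
errors <- sapply(subsets, cv.mse)
subsets[[which.min(errors)]]                    # the subset with the smallest CV error
```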