Introduction to Local Methods for Regression and Classification
Here I provide comments about some parts of Ch. 2 of HTF that I didn't cover (enough) previously, but I still won't
attempt to cover everything that is in the chapter.
nearest neighbors methods
Suppose that there are p explanatory variables and we have a training set consisting of n observations of the form (xi, yi), where xi is the p-dimensional vector of explanatory variable values for the ith observation, and yi is the response variable value for the ith observation. To estimate the value of E(Y|x) for a given point x, or to predict a value of Y at that point, the method of k nearest neighbors averages the yi values corresponding to the k xi values which are closest to x.
To determine closeness, usually Euclidean distance is used, but other metrics can be used instead.
Typically, one should standardize the variables by dividing all of the values for the jth explanatory variable by the sample standard deviation of the n observed values of the jth explanatory variable (otherwise a variable like number of children would tend to have little influence in determining the nearest neighbors compared to a variable like salary (in dollars), and it would matter whether one used salary (in dollars) or salary (in thousands of dollars)).
It's possible that appreciably better results could be obtained if one didn't let all of the standardized predictors contribute equally to the distances that are used, since some of the predictors may be "noise variables" which have little or no influence on the response. (But if one uses a distance measure which incorporates weights, I don't know of a good way to efficiently adjust the weights to achieve (near) optimal performance.)
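Here is a minimal sketch of k nearest neighbors regression with standardized predictors, using the knn.reg function from the FNN package; the simulated data, the new point, and the choice of k = 5 are just for illustration.

```r
# k nearest neighbors regression with standardized predictors
# (simulated data; FNN's knn.reg averages the yi values of the k nearest xi)
library(FNN)

set.seed(1)
n <- 200; p <- 3
x <- matrix(rnorm(n * p), n, p)                    # training explanatory variables
y <- x[, 1]^2 + 0.5 * x[, 2] + rnorm(n, sd = 0.3)  # training responses

x.std <- scale(x)                # center and scale each column (centering doesn't affect distances)
x.new <- matrix(c(0.5, -1, 2), nrow = 1)           # a new point at which to predict
x.new.std <- scale(x.new,
                   center = attr(x.std, "scaled:center"),
                   scale  = attr(x.std, "scaled:scale"))

fit <- knn.reg(train = x.std, test = x.new.std, y = y, k = 5)
fit$pred                                           # average of the 5 nearest yi values
```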
By choosing different values for k, one can obtain different predicted values. So an obvious question is: how should one select the value of k?
If the goal is to minimize the mean squared prediction error, then what won't typically work well is to select the
value of k which minimizes the average squared error for the observations in the training sample, because the
average squared error is 0 for the case of k = 1. Instead, one should choose the value of k which
minimizes the average squared error on an independent validation set (which some refer to as a test
set), or one should use cross-validation.
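As a rough sketch of the cross-validation approach, the code below selects k for k nearest neighbors regression using 5-fold cross-validation (again with FNN::knn.reg; the simulated data, the candidate values of k, and the number of folds are all arbitrary choices made for illustration).

```r
# choosing k by 5-fold cross-validation
library(FNN)

set.seed(2)
n <- 200; p <- 3
x <- scale(matrix(rnorm(n * p), n, p))
y <- x[, 1]^2 + 0.5 * x[, 2] + rnorm(n, sd = 0.3)

k.values <- 1:25
folds <- sample(rep(1:5, length.out = n))      # random fold assignments
cv.mse <- sapply(k.values, function(k) {
  mean(sapply(1:5, function(f) {
    pred <- knn.reg(train = x[folds != f, ], test = x[folds == f, ],
                    y = y[folds != f], k = k)$pred
    mean((y[folds == f] - pred)^2)             # squared error on the held-out fold
  }))
})
k.values[which.min(cv.mse)]                    # the value of k with smallest CV error
```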
Some say that the effective number of parameters (sometimes called the effective degrees of freedom) for k nearest neighbors regression is n/k, because if we divided up the space of the explanatory variables into nonoverlapping regions having k observations in each one, to obtain n/k regions, and we estimated the expected response value for each region using the sample mean of the observations belonging to the region, we would have n/k parameters (the unknown means) to estimate. (Of course, with k nearest neighbors regression the regions aren't fixed.)
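For instance, with n = 200 training observations and k = 10, a k nearest neighbors regression fit has about 200/10 = 20 effective degrees of freedom, so it behaves roughly like a fitted model having 20 free parameters.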
k nearest neighbors regression is a local regression method --- that is, one doesn't try to find a single model to fit all of the observations in a data set (and fit the model using all of the data), but instead one just uses observations having x values in a smallish area to predict a response value corresponding to an x value in that area. There are other ways of doing local regression that are typically better. One relatively simple alternative is to predict Y using the Nadaraya-Watson weighted average (a kernel-based method), which uses a weighted average of the yi values, with the weights decreasing with increasing distance between xi and x (see p. 35 of HTF). However, other kernel-based methods such as lowess (loess) tend to perform better. (I'll discuss such kernel-based methods, and explain how to use R to perform them, when I cover some of the material in Ch. 6 of HTF.)
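Although the details are left for Ch. 6, base R already provides simple versions of both ideas: ksmooth computes a Nadaraya-Watson estimate for a single predictor, and loess does local weighted regression. The following is only a quick illustration with simulated data; the bandwidth and span values are arbitrary.

```r
# Nadaraya-Watson kernel regression and loess for a single predictor
set.seed(3)
x <- sort(runif(150, 0, 10))
y <- sin(x) + rnorm(150, sd = 0.3)

nw <- ksmooth(x, y, kernel = "normal", bandwidth = 1, x.points = x)  # Nadaraya-Watson
lo <- loess(y ~ x, span = 0.4)                                       # local weighted regression

plot(x, y, col = "gray")
lines(nw$x, nw$y, lwd = 2)                     # kernel-weighted average of the yi
lines(x, predict(lo), lwd = 2, lty = 2)        # loess fit
```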
k nearest neighbors classification is similar to k nearest neighbors regression, except that instead of averaging numerical response values to get a prediction, a simple majority rule is used to determine the predicted class for a given x --- in a sense, each of the k nearest neighbors in the training sample casts a single vote for its class, and the class with the largest number of votes becomes the predicted class. (If there are only two classes, only odd values are considered for k in order to reduce ties (since with an odd value of k the only problem that can occur is if there are ties when determining the nearest neighbors, and this tends not to happen with "continuous" numerical explanatory variables).) Unlike k nearest neighbors regression, k nearest neighbors classification can be competitive in many classification settings if care is taken to select a good value to use for k (using cross-validation, or a validation sample).
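As a sketch of this, the code below uses the knn and knn.cv functions from the class package on a two-class subset of the iris data, choosing k by leave-one-out cross-validation; the particular data set and the new observation are just for illustration.

```r
# k nearest neighbors classification, with k chosen by leave-one-out CV
library(class)

two <- droplevels(iris[iris$Species != "setosa", ])   # a two-class problem
x <- scale(two[, 1:4])                                 # standardized predictors
cl <- two$Species

k.values <- seq(1, 25, by = 2)                 # odd values of k only (two classes)
cv.error <- sapply(k.values, function(k)
  mean(knn.cv(train = x, cl = cl, k = k) != cl))
k.best <- k.values[which.min(cv.error)]

x.new <- scale(matrix(c(6.0, 3.0, 4.5, 1.5), nrow = 1),
               center = attr(x, "scaled:center"),
               scale  = attr(x, "scaled:scale"))
knn(train = x, test = x.new, cl = cl, k = k.best)      # majority vote of the k.best neighbors
```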
the curse of dimensionality
The so-called curse of dimensionality is a problem with local methods in high dimensions --- the nearest neighbors of x may be relatively far away. In regression, this can create a large bias in the predictions, especially if the target point corresponds to a minimum or maximum (because the response values for points in the neighborhood may all be well above or below a reasonable value for the prediction, since the neighborhood may have to be very large to contain k neighbors).
An example in HTF (pp. 22-23) deals with points uniformly distributed in (hyper)cubes of various dimensions. As the dimension, p, increases, the edges of a subset (hyper)cube have to become larger in order to contain (on average) the same percentage of the distributed points. To capture 1% of the points when p = 10, the edges of the subset cube have to be of length (about) 0.63 --- so nearest neighbors of the center of the (hyper)cube can be closer to some outer edge than they are to the center. Another example (p. 23) deals with a p-dimensional unit ball (meaning the radius is 1), again with uniformly distributed points in it.
Expression (2.24) on p. 23 gives the median distance from the center of the ball to the data point closest to the center. For p = 10, even with 500 points in the ball, the point closest to the center will typically (i.e., at its median distance) be closer to the surface of the ball than it is to the center! Sticking with this same example, and letting d(p, n) be the median distance to the point nearest the center when the dimension is p and the sample size is n, we have
- d(1, 10) = 0.067,
- d(2, 100) = 0.083,
- d(3, 1000) = 0.088,
- d(4, 10,000) = 0.091;
so even with n increasing exponentially, the median distance becomes larger as p increases.
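These values are easy to reproduce: expression (2.24) of HTF is d(p, n) = (1 - (1/2)^(1/n))^(1/p), and the edge length needed for a subcube to capture a fraction r of uniformly distributed points is r^(1/p). A quick check in R:

```r
# curse-of-dimensionality calculations from HTF, pp. 22-23
edge.length <- function(p, r) r^(1/p)         # edge needed to capture fraction r in p dimensions
d <- function(p, n) (1 - 0.5^(1/n))^(1/p)     # expression (2.24): median distance to nearest point

edge.length(10, 0.01)                          # about 0.63
d(10, 500)                                     # about 0.52 --- more than halfway to the surface
round(c(d(1, 10), d(2, 100), d(3, 1000), d(4, 10000)), 3)   # 0.067 0.083 0.088 0.091
```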
With all such examples, we have that in high dimensions, most points tend to be "close to the edge" --- most points
have an extreme value for at least one of the variables. Of course, in many real situations, points aren't uniformly
distributed in compact regions. But it's still the case that a lot of points will be extreme in at least one
variable, and that nearest neighbors are often not very similar.
To combat the curse, with nearest neighbors methods one could try to eliminate some variables of lesser usefulness in order to reduce the dimension. There isn't a really slick way to do this --- one could do a bit of brute-force searching, trying different subsets of the variables, I suppose (a rough sketch is given below). Or, one could take information learned from using other methods as to which variables may not be so important, and make more educated guesses about what possibilities to explore.
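Here is one rough sketch of the brute-force idea, assuming the number of predictors is small enough that all subsets can be tried: for each subset of the (standardized) predictors, estimate the mean squared prediction error of k nearest neighbors regression by cross-validation, and keep the subset with the smallest estimate. The data, the value of k, and the number of folds are all arbitrary.

```r
# brute-force search over predictor subsets for k nearest neighbors regression
library(FNN)

set.seed(4)
n <- 200; p <- 5
x <- scale(matrix(rnorm(n * p), n, p))         # predictors 3-5 are pure noise here
y <- x[, 1]^2 + 0.5 * x[, 2] + rnorm(n, sd = 0.3)

folds <- sample(rep(1:5, length.out = n))
cv.mse <- function(vars, k = 5) {
  mean(sapply(1:5, function(f) {
    pred <- knn.reg(train = x[folds != f, vars, drop = FALSE],
                    test  = x[folds == f, vars, drop = FALSE],
                    y = y[folds != f], k = k)$pred
    mean((y[folds == f] - pred)^2)
  }))
}

subsets <- unlist(lapply(1:p, function(m) combn(p, m, simplify = FALSE)),
                  recursive = FALSE)            # all nonempty subsets of the predictors
errors <- sapply(subsets, cv.mse)
subsets[[which.min(errors)]]                    # the subset with the smallest CV error
```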