From Separating Hyperplanes to Support Vector Machines
Sometimes two classes are easily separable. With one predictor, they can be separated by a point: all observations in
the training sample having the predictor variable value below a certain number are of one class, and all observations with
the predictor variable value above this number are of the other class. With two predictors, it is sometimes possible
to draw a straight line so that all points on one side of the line are of one class and all points on the other side
of the line are of the other class. In three dimensions, the classes may be separated by a single plane, and in
higher dimensions a hyperplane may separate them.
When classes are separable, the data presents a nice way to test classification methods. Generally, we would feel good if
a classification method produces a classifier which correctly classifies each member of the training sample when this
is possible using a very simple boundary. (A 1-NN classifier will always correctly classify the training data
(assuming that no two observations have exactly the same predictor values), but we shouldn't be impressed by this,
since the class boundaries can be extremely complex and overfitting can make the generalization performance pretty
bad.)
Fig. 4.14 on p. 129 of HTF2 shows an example where LDA (equivalently, 1st-order OLS regression, since the sample
sizes are equal) produces a linear boundary which does not separate the classes, even though many linear separators
exist.
Rosenblatt's perceptron learning algorithm can be used to find such a simple separating boundary, when one is possible.
(Note:
Although a kid with a crayon and a ruler might be able to easily identify a separating line when there are just two
predictors, visually finding a separating hyperplane in a high-dimensional space isn't necessarily easy.)
Although it may seem like a natural strategy to attempt to separate the classes with a simple linear boundary,
this figure
may suggest that such a separating boundary is not necessarily the best classifier.
the perceptron learning algorithm for separating hyperplanes
The perceptron learning algorithm is used with a two-class situation by coding the response variables with 1 and -1,
and a separating boundary (when one exists) is found iteratively. To see how such a boundary can be found, let's
consider the situation where there are two predictors, x1 and x2. An equation for a line in the x1-x2 plane can be
expressed as f(x) = 0, where
    f(x) = β0 + β1x1 + β2x2.
For a point not on the line (so f(x) ≠ 0), f(x) is proportional to the signed distance from the point to the line,
with f(x) > 0 for points on one side of the line, and f(x) < 0 for points on the other side of the line.
If we classify points as being of class 1 if f(x) > 0, and being of class -1 if f(x) < 0, then yi f(xi) will be
positive if a point is correctly classified and negative if it is incorrectly classified. Letting
    D(β0, β1, β2) = - Σ yi f(xi),
where the sum is only over the cases of the training data which are currently misclassified, gradients can be used
to determine a direction to "move" in the β0-β1-β2 space to make D(β0, β1, β2) smaller.
Applying a stochastic gradient descent algorithm (as described on p. 131 of HTF2), (β0, β1, β2) can be iteratively
adjusted until D(β0, β1, β2) becomes 0, thus determining a separating line. (A fine-tuning step would have to be
taken if any points were on the line corresponding to the value of (β0, β1, β2) which produced D(β0, β1, β2) = 0.)
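As a concrete illustration, here is a minimal sketch of the perceptron updates in Python. The synthetic data, the
step size, and the iteration limit are my own choices (not from HTF2); the point is just that a misclassified point
xi moves (β0, β1, β2) by a multiple of (yi, yi xi,1, yi xi,2), a step in the negative gradient direction of D.

```python
import numpy as np

rng = np.random.default_rng(0)

# two linearly separable classes, coded +1 and -1 (synthetic data)
X = np.vstack([rng.normal([2, 2], 0.5, (20, 2)),
               rng.normal([-2, -2], 0.5, (20, 2))])
y = np.concatenate([np.ones(20), -np.ones(20)])

beta0, beta = 0.0, np.zeros(2)   # beta0 and (beta1, beta2)
rho = 0.1                        # step size

for _ in range(1000):
    # points with y_i * f(x_i) <= 0 are currently misclassified (or on the line)
    mis = y * (beta0 + X @ beta) <= 0
    if not mis.any():
        break                    # D(beta0, beta1, beta2) = 0: a separating line has been found
    for i in np.where(mis)[0]:
        # the gradient of D at a misclassified point is (-y_i, -y_i x_i),
        # so step in the opposite direction
        beta0 += rho * y[i]
        beta += rho * y[i] * X[i]

print("separating line: f(x) = %.3f + %.3f x1 + %.3f x2" % (beta0, beta[0], beta[1]))
```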
optimal separating hyperplanes
A problem with the perceptron learning algorithm is that there may be many possible separating hyperplanes, and which
one is arrived at by applying the algorithm depends upon the initial values used.
(Another problem is that it can be very slow to converge, which is especially troublesome since convergence won't
occur if the classes aren't separable by a hyperplane, and it's often hard to determine whether the convergence is
just slow or impossible.)
In the absence of information to the contrary, one might think that a separating hyperplane which maximizes the
distance to the closest point might be a good choice with regard to defining a classifier having a small generalization
error.
Such a hyperplane is what Vapnik refers to as the
optimal separating hyperplane, and it provides a unique solution to the separating hyperplane problem.
This plane can be found by solving the optimization problem:
determine the values of β0, β1, ..., βp which maximize C subject to
    yi (β0 + β1xi,1 + ... + βpxi,p) ≥ C   (i = 1, 2, ..., n)
and
    β1² + ... + βp² = 1.
(Note:
If a constraint is not put on (β1, ..., βp) (note that β0 is not included), then the above does not have a solution,
and a unique hyperplane is not determined. By requiring (β1, ..., βp) to be a unit vector, we have that
(β1, ..., βp) determines a direction, and β0 determines a distance.)
In a sense, with two predictors the optimal separating hyperplane is the centerline of the widest possible strip of
"no man's land" which can be placed between the two classes.
(Fig. 4.16 on p. 134 of HTF2 shows such a strip. Two red class points and one green class point which touch the strip
are called support points or support vectors.)
In three (or more) dimensions, the optimal separating hyperplane determines the thickest separating slab (hyperslab).
The separating strip/slab is called the margin.
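For those who want to experiment, one way (my suggestion, not something from the text) to compute an approximately
optimal separating hyperplane is to fit scikit-learn's linear-kernel SVC with a very large cost parameter, which on
separable data closely approximates the hard-margin problem above; the fitted object also reports the support
vectors. The synthetic data below are just for illustration.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)
X = np.vstack([rng.normal([2, 2], 0.5, (20, 2)),     # class +1
               rng.normal([-2, -2], 0.5, (20, 2))])  # class -1
y = np.concatenate([np.ones(20), -np.ones(20)])

# a very large cost makes the soft-margin fit behave like the hard-margin problem
fit = SVC(kernel="linear", C=1e6).fit(X, y)

beta = fit.coef_[0]                        # (beta1, beta2), not scaled to unit length
beta0 = fit.intercept_[0]
half_width = 1.0 / np.linalg.norm(beta)    # plays the role of C in the formulation above

print("support vectors (the points that touch the margin):")
print(fit.support_vectors_)
print("half-width of the margin:", half_width)
```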
The hyperplane determined by LDA is influenced by all points, even those far away from a separating boundary, and
may not separate separable classes. Logistic regression will find a separating boundary, if one exists, but it may
not be the "optimal" separating boundary. Nevertheless, it's possible that the boundaries determined by LDA and
logistic regression give classifiers with smaller generalization error rates than the classifier corresponding to
the optimal separating hyperplane, which pays more attention to the points which are close to the region separating
two classes. But it's also possible that the optimal separating hyperplane determines a classifier that outperforms
both LDA and logistic regression.
support vector classification
Often there isn't a separating hyperplane if one just uses all of the explanatory variables as the predictors.
But if the predictor set is expanded by adding higher order terms (for example, x1², x1x2, x2², x1³, x1²x2, x1x2²,
and x2³, if there are two explanatory variables, x1 and x2),
then a separating hyperplane may be possible, since linear/flat boundaries in the expanded feature space can be
curved boundaries in the original space, and with increased flexibility it may be possible to achieve separation.
Another approach is to try to find a hyperplane which separates most of the points for the two classes, but allows
for some points to be on the wrong side, which is what a support vector classifier does.
A support vector classifier uses a margin, but allows points to be in the margin, and even allows a limited number of
them to be on the wrong side of its center line/plane/hyperplane, with it being possible that some points are outside
the margin on the wrong side. Slack variables, ξ1, ξ2, ..., ξn, give the amounts, relative to the half-thickness of
the margin, by which points are into the margin or on the wrong side of the margin (a small numeric illustration is
given after the list below).
- If a point is on the correct side of the margin, its slack variable value is 0.
- If a point is in the margin, but on the correct side of its center line/plane/hyperplane,
its slack variable value is between 0 and 1.
- If a point is on the center line/plane/hyperplane of the margin, its slack variable value is 1.
- If a point is in the margin, but on the wrong side of its center line/plane/hyperplane,
its slack variable value is between 1 and 2.
- If a point is on the wrong side of the margin, its slack variable value is greater than 2.
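Here is a hedged numeric sketch of these categories (not from the text). It uses scikit-learn's soft-margin
formulation, in which the fitted f is scaled so that the margin edges sit at yi f(xi) = 1; the slack
ξi = max(0, 1 - yi f(xi)) is then already expressed relative to the margin's half-thickness. The overlapping
synthetic data and the cost value are arbitrary choices of mine.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(2)
X = np.vstack([rng.normal([1, 1], 1.0, (30, 2)),      # class +1
               rng.normal([-1, -1], 1.0, (30, 2))])   # class -1, overlapping
y = np.concatenate([np.ones(30), -np.ones(30)])

fit = SVC(kernel="linear", C=1.0).fit(X, y)   # soft-margin cost (an arbitrary choice here)
f = fit.decision_function(X)                  # f(x_i) = beta0 + beta^T x_i
xi = np.maximum(0.0, 1.0 - y * f)             # slack variables, relative to the half-thickness

print("correct side of the margin (xi = 0):      ", np.sum(xi == 0))
print("in the margin, correct side (0 < xi < 1): ", np.sum((xi > 0) & (xi < 1)))
print("in the margin, wrong side (1 < xi < 2):   ", np.sum((xi > 1) & (xi < 2)))
print("outside the margin, wrong side (xi > 2):  ", np.sum(xi > 2))
```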
The decision boundary of a support vector classifier is the center line/plane/hyperplane of the margin which is
determined by finding the values of β0, β1, ..., βp which, for some fixed value, K, maximize C subject to
    yi (β0 + β1xi,1 + ... + βpxi,p) ≥ C(1 - ξi)   (i = 1, 2, ..., n),
    ξi ≥ 0   (i = 1, 2, ..., n),
    Σ ξi ≤ K,
and
    β1² + ... + βp² = 1.
(Note: By making the margin wider (by making C larger), the largest
ξi can be made to be smaller, but there will typically be more
positive ξi (due to more points being in the margin). Even in the two-dimensional case, it's
hard to look at a scatter plot of the points and guess the nature of the solution very precisely.)
Cross-validation can be used to determine a good value to use for K.
(Note:
Subsection 12.2.1 of HTF2 describes how the optimization problem can be solved. A different, but equivalent,
formulation of the problem is considered. γ is the tuning parameter in the alternative formulation
--- it plays a role similar to the one which K plays in the description given above.)
With a support vector classifier, points well on the correct side of the boundary do not play a large role in
shaping the boundary, and this generally seems like a good property. Logistic regression is similar in this regard,
but the situation is different for LDA.
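As a sketch of how the tuning might be done in practice (my own illustration, not from HTF2), scikit-learn's
soft-margin cost parameter, which is roughly inversely related to the slack budget K above (a large cost permits
little total slack), can be chosen by a cross-validated grid search. Note that scikit-learn calls this cost C, which
is a different quantity than the margin half-width C in the formulation above.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(3)
X = np.vstack([rng.normal([1, 1], 1.0, (50, 2)),      # class +1
               rng.normal([-1, -1], 1.0, (50, 2))])   # class -1, overlapping
y = np.concatenate([np.ones(50), -np.ones(50)])

# 10-fold cross-validation over a grid of cost values (the grid is arbitrary)
grid = GridSearchCV(SVC(kernel="linear"),
                    param_grid={"C": [0.01, 0.1, 1, 10, 100]},
                    cv=10)
grid.fit(X, y)
print("chosen cost:", grid.best_params_["C"])
print("cross-validated error rate:", 1 - grid.best_score_)
```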
support vector machines
The basic idea behind support vector machine classification is to use an expanded basis, sometimes with a
very large number of basis functions, with the support vector classification method; but actually it's a bit more
complicated. (There isn't time, and perhaps desire, to get into the details!)
A rather hard to understand scheme involving kernels produces an effect similar to a basis function expansion of the
original inputs. (Because the kernels are employed in an alternative formulation of the general strategy as
described above, it's hard to firmly grasp their role, and to understand how they relate to the other kernels that
we've encountered this summer.) Popular kernels are dth-degree polynomial kernels and radial basis
kernels. It's often hard to guess what kernel may work best in a given situation, and so generally it's good to try
several choices.
There can be a lot of flexibility in determining the boundary --- for some values of the tuning/regularization
parameter, the boundary can be quite wiggly, and overfitting can occur. For other values of the parameter,
wiggliness is discouraged, and bias is a problem. But if cross-validation, or a separate validation set, can be used
to determine a value for the parameter which seems just right, and an appropriate kernel is used, then the resulting
classifier can be very good. (Typically, this method compares rather well with a lot of other methods.)
I encourage you to look at Fig. 12.3 on p. 425 of HTF2 to see classifiers obtained with support vector
machines, one using a 4th-degree polynomial kernel and the other using a radial kernel.
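In the spirit of that figure, here is a hedged sketch of trying several kernels: a 4th-degree polynomial kernel and
a radial (RBF) kernel, each tuned by cross-validation over its cost (and, for the radial kernel, its width)
parameter. The roughly circular synthetic class boundary and the parameter grids are illustrative assumptions of
mine, not anything from the text.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(4)
X = rng.normal(size=(200, 2))
# a curved (roughly circular) class boundary, so no separating line exists in the original space
y = np.where(X[:, 0]**2 + X[:, 1]**2 < 1.2, 1, -1)

for kernel, extra in [("poly", {"degree": [4]}), ("rbf", {"gamma": ["scale", 1.0]})]:
    grid = GridSearchCV(SVC(kernel=kernel),
                        param_grid={"C": [0.1, 1, 10, 100], **extra},
                        cv=5)
    grid.fit(X, y)
    print(kernel, "kernel: cv error rate =", round(1 - grid.best_score_, 3),
          "with", grid.best_params_)
```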
(Note: I think the name support vector machine is silly --- the "machine" part bugs me.)