From Separating Hyperplanes to Support Vector Machines
Sometimes two classes are easily separable. With one predictor, they can be separated by a point: all observations in
the training sample having the predictor variable value below a certain number are of one class, and all observations with
the predictor variable value above this number are of the other class. With two predictors, it is sometimes possible
to draw a straight line so that all points on one side of the line are of one class and all points on the other side
of the line are of the other class. In three dimensions, the classes may be separated by a single plane, and in
higher dimensions a hyperplane may separate them.
When classes are separable, the data presents a nice way to test classification methods. Generally, we would feel good if
a classification method produces a classifier which correctly classifies each member of the training sample when this
is possible using a very simple boundary. (A 1-NN classifier will always correctly classify the training data
(assuming that no two observations have exactly the same predictor values), but we shouldn't be impressed by this,
since the class boundaries can be extremely complex and overfitting can make the generalization performance pretty
bad.)
Fig. 4.14 on p. 129 of HTF2 shows an example where LDA (equivalently, 1st-order OLS regression, since the sample
sizes are equal) produces a linear boundary which does not separate the classes, even though many linear separators
exist.
Rosenblatt's perceptron learning algorithm can be used to find such a simple separating boundary, when one is possible.
(Note:
Although a kid with a crayon and a ruler might be able to easily identify a separating line when there are just two
predictors, visually finding a separating hyperplane in a high-dimensional space isn't necessarily easy.)
Although it may seem like a natural strategy to attempt to separate the classes with a simple linear boundary,
this figure
may suggest that such a separating boundary is not necessarily the best classifier.
the perceptron learning algorithm for separating hyperplanes
The perceptron learning algorithm is used with a two-class situation by coding the response variables with 1 and -1,
and a separating boundary (when one exists) is found iteratively. To see how such a boundary can be found, let's
consider the situation where there are two predictors, x1 and x2. An equation for a line in the x1-x2 plane can be
expressed as f(x) = 0, where
    f(x) = β0 + β1x1 + β2x2.
For a point not on the line (so f(x) ≠ 0), f(x) is proportional to the signed distance from the point to the line,
with f(x) > 0 for points on one side of the line, and f(x) < 0 for points on the other side of the line.
If we classify points as being of class 1 if f(x) > 0, and being of class -1 if f(x) < 0, then yi f(xi) will be
positive if a point is correctly classified and negative if it is incorrectly classified. Letting
    D(β0, β1, β2) = - Σ yi f(xi),
where the sum is only over the cases of the training data which are currently misclassified, gradients can be used
to determine a direction to "move" in the β0-β1-β2 space to make D(β0, β1, β2) smaller.
Applying a stochastic gradient descent algorithm (as described on p. 131 of HTF2), (β0, β1, β2) can be iteratively
adjusted until D(β0, β1, β2) becomes 0, thus determining a separating line. (A fine-tuning step would have to be
taken if any points were on the line corresponding to the value of (β0, β1, β2) which produced D(β0, β1, β2) = 0.)
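As a concrete illustration, here is a minimal sketch of the perceptron updates in Python. The synthetic data, the
step size, and the iteration limit are my own choices (not from HTF2); the point is just that a misclassified point
xi moves (β0, β1, β2) by a multiple of (yi, yi xi,1, yi xi,2), a step in the negative gradient direction of D.

```python
import numpy as np

rng = np.random.default_rng(0)

# two linearly separable classes, coded +1 and -1 (synthetic data)
X = np.vstack([rng.normal([2, 2], 0.5, (20, 2)),
               rng.normal([-2, -2], 0.5, (20, 2))])
y = np.concatenate([np.ones(20), -np.ones(20)])

beta0, beta = 0.0, np.zeros(2)   # beta0 and (beta1, beta2)
rho = 0.1                        # step size

for _ in range(1000):
    # points with y_i * f(x_i) <= 0 are currently misclassified (or on the line)
    mis = y * (beta0 + X @ beta) <= 0
    if not mis.any():
        break                    # D(beta0, beta1, beta2) = 0: a separating line has been found
    for i in np.where(mis)[0]:
        # the gradient of D at a misclassified point is (-y_i, -y_i x_i),
        # so step in the opposite direction
        beta0 += rho * y[i]
        beta += rho * y[i] * X[i]

print("separating line: f(x) = %.3f + %.3f x1 + %.3f x2" % (beta0, beta[0], beta[1]))
```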
optimal separating hyperplanes
A problem with the perceptron learning algorithm is that there may be many possible separating hyperplanes, and which
one is arrived at by applying the algorithm depends upon the initial values used.
(Another problem is that it can be very slow to converge, which is especially troublesome since convergence won't
occur if the classes aren't separable by a hyperplane, and it's often hard to determine whether the convergence is
just slow or impossible.)
In the absence of information to the contrary, one might think that a separating hyperplane which maximizes the
distance to the closest point might be a good choice with regard to defining a classifier having a small generalization
error.
Such a hyperplane is what Vapnik refers to as the
optimal separating hyperplane, and it provides a unique solution to the separating hyperplane problem.
This plane can be found by solving the optimization problem:
determine the values of β0, β1, ..., βp which maximize C subject to
    yi (β0 + β1xi,1 + ... + βpxi,p) ≥ C   (i = 1, 2, ..., n)
and
    β1² + ... + βp² = 1.
(Note:
If a constraint is not put on (β1, ..., βp) (note that β0 is not included), then the above does not have a solution,
and a unique hyperplane is not determined. By requiring (β1, ..., βp) to be a unit vector, we have that
(β1, ..., βp) determines a direction, and β0 determines a distance.)
In a sense, with two predictors the optimal separating hyperplane is the centerline of the widest possible strip of
"no man's land" which can be placed between the two classes.
(Fig. 4.16 on p. 134 of HTF2 shows such a strip. Two red class points and one green class point which touch the strip
are called support points or support vectors.)
In three (or more) dimensions, the optimal separating hyperplane determines the thickest separating slab (hyperslab).
The separating strip/slab is called the margin.
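For those who want to experiment, one way (my suggestion, not something from the text) to compute an approximately
optimal separating hyperplane is to fit scikit-learn's linear-kernel SVC with a very large cost parameter, which on
separable data closely approximates the hard-margin problem above; the fitted object also reports the support
vectors. The synthetic data below are just for illustration.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)
X = np.vstack([rng.normal([2, 2], 0.5, (20, 2)),     # class +1
               rng.normal([-2, -2], 0.5, (20, 2))])  # class -1
y = np.concatenate([np.ones(20), -np.ones(20)])

# a very large cost makes the soft-margin fit behave like the hard-margin problem
fit = SVC(kernel="linear", C=1e6).fit(X, y)

beta = fit.coef_[0]                        # (beta1, beta2), not scaled to unit length
beta0 = fit.intercept_[0]
half_width = 1.0 / np.linalg.norm(beta)    # plays the role of C in the formulation above

print("support vectors (the points that touch the margin):")
print(fit.support_vectors_)
print("half-width of the margin:", half_width)
```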
The hyperplane determined by LDA is influenced by all points, even those far away from a separating boundary, and
may not separate separable classes. Logistic regression will find a separating boundary, if one exists, but it may
not be the "optimal" separating boundary. Nevertheless, it's possible that the boundaries determined by LDA and
logistic regression give classifiers with smaller generalization error rates than the classifier corresponding to
the optimal separating hyperplane, which pays more attention to the points which are close to the region separating
two classes. But it's also possible that the optimal separating hyperplane determines a classifier that outperforms
both LDA and logistic regression.
support vector classification
Often there isn't a separating hyperplane if one just uses all of the explanatory variables as the predictors.
But if the predictor set is expanded by adding higher order terms (for example, x1², x1x2, x2², x1³, x1²x2, x1x2²,
and x2³, if there are two explanatory variables, x1 and x2),
then a separating hyperplane may be possible, since linear/flat boundaries in the expanded feature space can be
curved boundaries in the original space, and with increased flexibility it may be possible to achieve separation.
Another approach is to try to find a hyperplane which separates most of the points for the two classes, but allows
for some points to be on the wrong side, which is what a support vector classifier does.
A support vector classifier uses a margin, but allows points to be in the margin, and even allows a limited number of
them to be on the wrong side of its center line/plane/hyperplane, with it being possible that some points are outside
the margin on the wrong side. Slack variables, ξ1, ξ2, ..., ξn, give the amounts, relative to the half-thickness of
the margin, by which points are into the margin or on the wrong side of the margin (a small numeric illustration is
given after the list below).
- If a point is on the correct side of the margin, its slack variable value is 0.
- If a point is in the margin, but on the correct side of its center line/plane/hyperplane,
its slack variable value is between 0 and 1.
- If a point is on the center line/plane/hyperplane of the margin, its slack variable value is 1.
- If a point is in the margin, but on the wrong side of its center line/plane/hyperplane,
its slack variable value is between 1 and 2.
- If a point is on the wrong side of the margin, its slack variable value is greater than 2.
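Here is a hedged numeric sketch of these categories (not from the text). It uses scikit-learn's soft-margin
formulation, in which the fitted f is scaled so that the margin edges sit at yi f(xi) = 1; the slack
ξi = max(0, 1 - yi f(xi)) is then already expressed relative to the margin's half-thickness. The overlapping
synthetic data and the cost value are arbitrary choices of mine.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(2)
X = np.vstack([rng.normal([1, 1], 1.0, (30, 2)),      # class +1
               rng.normal([-1, -1], 1.0, (30, 2))])   # class -1, overlapping
y = np.concatenate([np.ones(30), -np.ones(30)])

fit = SVC(kernel="linear", C=1.0).fit(X, y)   # soft-margin cost (an arbitrary choice here)
f = fit.decision_function(X)                  # f(x_i) = beta0 + beta^T x_i
xi = np.maximum(0.0, 1.0 - y * f)             # slack variables, relative to the half-thickness

print("correct side of the margin (xi = 0):      ", np.sum(xi == 0))
print("in the margin, correct side (0 < xi < 1): ", np.sum((xi > 0) & (xi < 1)))
print("in the margin, wrong side (1 < xi < 2):   ", np.sum((xi > 1) & (xi < 2)))
print("outside the margin, wrong side (xi > 2):  ", np.sum(xi > 2))
```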
The decision boundary of a support vector classifier is the center line/plane/hyperplane of the margin which is
determined by finding the values of β0, β1, ..., βp which, for some fixed value, K, maximize C subject to
    yi (β0 + β1xi,1 + ... + βpxi,p) ≥ C(1 - ξi)   (i = 1, 2, ..., n),
    ξi ≥ 0   (i = 1, 2, ..., n),
    Σ ξi ≤ K,
and
    β1² + ... + βp² = 1.
(Note: By making the margin wider (by making C larger), the largest
ξi can be made to be smaller, but there will typically be more
positive ξi (due to more points being in the margin). Even in the two-dimensional case, it's
hard to look at a scatter plot of the points and guess the nature of the solution very precisely.)
Cross-validation can be used to determine a good value to use for K.
(Note:
Subsection 12.2.1 of HTF2 describes how the optimization problem can be solved. A different, but equivalent,
formulation of the problem is considered. γ is the tuning parameter in the alternative formulation
--- it plays a role similar to the one which K plays in the description given above.)
With a support vector classifier, points well on the correct side of the boundary do not play a large role in
shaping the boundary, and this generally seems like a good property. Logistic regression is similar in this regard,
but the situation is different for LDA.
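As a sketch of how the tuning might be done in practice (my own illustration, not from HTF2), scikit-learn's
soft-margin cost parameter, which is roughly inversely related to the slack budget K above (a large cost permits
little total slack), can be chosen by a cross-validated grid search. Note that scikit-learn calls this cost C, which
is a different quantity than the margin half-width C in the formulation above.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(3)
X = np.vstack([rng.normal([1, 1], 1.0, (50, 2)),      # class +1
               rng.normal([-1, -1], 1.0, (50, 2))])   # class -1, overlapping
y = np.concatenate([np.ones(50), -np.ones(50)])

# 10-fold cross-validation over a grid of cost values (the grid is arbitrary)
grid = GridSearchCV(SVC(kernel="linear"),
                    param_grid={"C": [0.01, 0.1, 1, 10, 100]},
                    cv=10)
grid.fit(X, y)
print("chosen cost:", grid.best_params_["C"])
print("cross-validated error rate:", 1 - grid.best_score_)
```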
support vector machines
The basic idea behind support vector machine classification is to use an expanded basis, sometimes with a
very large number of basis functions, with the support vector classification method; but actually it's a bit more
complicated. (There isn't time, and perhaps desire, to get into the details!)
A rather hard to understand scheme involving kernels produces an effect similar to a basis function expansion of the
original inputs. (Because the kernels are employed in an alternative formulation of the general strategy as
described above, it's hard to firmly grasp their role, and to understand how they relate to the other kernels that
we've encountered this summer.) Popular kernels are dth-degree polynomial kernels and radial basis
kernels. It's often hard to guess what kernel may work best in a given situation, and so generally it's good to try
several choices.
There can be a lot of flexibility in determining the boundary --- for some values of the tuning/regularization
parameter, the boundary can be quite wiggly, and overfitting can occur. For other values of the parameter,
wiggliness is discouraged, and bias is a problem. But if cross-validation, or a separate validation set, can be used
to determine a value for the parameter which seems just right, and an appropriate kernel is used, then the resulting
classifier can be very good. (Typically, this method compares rather well with a lot of other methods.)
I encourage you to look at Fig. 12.3 on p. 425 of HTF2 to see classifiers obtained with support vector
machines, one using a 4th-degree polynomial kernel and the other using a radial kernel.
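In the spirit of that figure, here is a hedged sketch of trying several kernels: a 4th-degree polynomial kernel and
a radial (RBF) kernel, each tuned by cross-validation over its cost (and, for the radial kernel, its width)
parameter. The roughly circular synthetic class boundary and the parameter grids are illustrative assumptions of
mine, not anything from the text.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(4)
X = rng.normal(size=(200, 2))
# a curved (roughly circular) class boundary, so no separating line exists in the original space
y = np.where(X[:, 0]**2 + X[:, 1]**2 < 1.2, 1, -1)

for kernel, extra in [("poly", {"degree": [4]}), ("rbf", {"gamma": ["scale", 1.0]})]:
    grid = GridSearchCV(SVC(kernel=kernel),
                        param_grid={"C": [0.1, 1, 10, 100], **extra},
                        cv=5)
    grid.fit(X, y)
    print(kernel, "kernel: cv error rate =", round(1 - grid.best_score_, 3),
          "with", grid.best_params_)
```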
(Note: I think the name support vector machine is silly --- the "machine" part bugs me.)