Bayes Classifiers
The Bayes classifier for a specific setting is the optimal classifier for that setting (but its misclassification
rate need not be zero). To construct the Bayes classifier, we need to know
- the joint distribution of the predictor variables for each class (its cdf, pdf, or pmf),
- the probability of each class producing the item to be classified.
Since we usually do not have the information necessary for the construction of the Bayes classifier, it is typically
the case that the Bayes classifier is an unattainable ideal.
In some "constructed" examples for which has the needed information, the Bayes classifier can be determined (although
it is not always easy to express the classification rule in a simple way unless the example is a somewhat simple "toy"
example). In such cases, it is often of interest to see if any classifiers constructed from the data have error
rates nearly as low as the optimal Bayes error rate.
In some situations for which the exact Bayes error rate is not easily obtained, it is possible to estimate it rather
easily (and rather well) using Monte Carlo methods. (Note: It can be that the Bayes classifier is somewhat easy to
obtain, but its error rate is not so easy to obtain.)
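As a rough sketch of how such a Monte Carlo estimate might be done, suppose (purely for illustration; the priors, means, and covariance matrices below are made up) that there are two equally likely classes whose predictors follow known bivariate normal distributions. The Bayes rule, as discussed later in this section, classifies a case to the class maximizing the product of the class prior and the class-conditional density, and its error rate can be estimated by the proportion of simulated cases that the rule misclassifies.

import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)

# Hypothetical two-class setting (all parameters chosen only for illustration).
priors = np.array([0.5, 0.5])
dists = [multivariate_normal(mean=[0, 0], cov=[[1.0, 0.3], [0.3, 1.0]]),
         multivariate_normal(mean=[1, 1], cov=[[1.5, -0.2], [-0.2, 0.8]])]

# Simulate (G, X) pairs from the assumed joint distribution.
n = 200_000
g = rng.choice(2, size=n, p=priors)
x = np.empty((n, 2))
for k in (0, 1):
    idx = np.where(g == k)[0]
    x[idx] = dists[k].rvs(size=idx.size, random_state=rng)

# Apply the Bayes rule: classify each case to the class maximizing
# (class prior) * (class-conditional density at x).
scores = np.column_stack([priors[k] * dists[k].pdf(x) for k in (0, 1)])
g_hat = scores.argmax(axis=1)

print("estimated Bayes error rate:", np.mean(g_hat != g))

The standard error of such an estimate is roughly sqrt(p(1 - p)/n), so with n this large the estimate should be accurate to two or three decimal places.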
For a simple example, consider a binary classification problem based on a single discrete predictor, X.
Letting the class label G take values in {0, 1}, suppose that for class 0, X has pmf
pX,0(2) = 2/3 &
pX,0(4) = 1/3,
and for class 1, X has pmf
pX,1(2) = 1/3 &
pX,1(4) = 2/3.
Further suppose that for the item to be classified we have
pG(0) = 1/2 and pG(1) = 1/2.
To find the Bayes classifier, we can consider all nonrandomized classification rules. (It can be shown that
randomized classification rules cannot be superior to the best nonrandomized classification rule.) For the
nonrandomized rules we have
- rule A: g*(2) = 0, g*(4) = 0;
- rule B: g*(2) = 0, g*(4) = 1;
- rule C: g*(2) = 1, g*(4) = 0;
- rule D: g*(2) = 1, g*(4) = 1.
To obtain the error rate for rule A, we can do a straightforward calculation of the probability that rule A
classifies a randomly generated/selected case correctly. We have
P( g*(X) = G ) = P( 0 = G ) = 1/2.
It follows that the probability of an error is also 1/2.
For rule B it's a bit trickier. We have
P( g*(X) = G )
= P( g*(X) = G | (X,G) = (2,0) ) × P( (X,G) = (2,0) )
+ P( g*(X) = G | (X,G) = (4,0) ) × P( (X,G) = (4,0) )
+ P( g*(X) = G | (X,G) = (2,1) ) × P( (X,G) = (2,1) )
+ P( g*(X) = G | (X,G) = (4,1) ) × P( (X,G) = (4,1) )
= (1)(1/3) + (0)(1/6) + (0)(1/6) + (1)(1/3) = 2/3,
and so the probability of error for rule B is 1/3.
(Note: We have that
P( (X,G) = (2,0) ) = P( X = 2 | G = 0 ) × P( G = 0 ) = (2/3)(1/2) = 1/3.
The probabilities for the other possible values of (X,G) can be obtained in a similar manner.)
Similarly, it can be shown that the probabilities of error for rule C and rule D are 2/3 and 1/2, respectively.
So rule B minimizes the probability of misclassification --- it is the Bayes classifier, and the Bayes error rate is
1/3.
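The brute-force enumeration above is easy to carry out in code. The following sketch (using dictionaries to hold the pmfs and priors from the example) computes the exact probability of a correct classification for each of the four nonrandomized rules and reports the corresponding error rates.

from itertools import product

# Class-conditional pmfs of X and the class priors from the example.
pmf = {0: {2: 2/3, 4: 1/3},   # pmf of X for class 0
       1: {2: 1/3, 4: 2/3}}   # pmf of X for class 1
prior = {0: 1/2, 1: 1/2}

# Enumerate all nonrandomized rules g*: {2, 4} -> {0, 1} (rules A through D)
# and compute each rule's exact misclassification probability.
for labels in product([0, 1], repeat=2):
    rule = dict(zip([2, 4], labels))          # e.g. rule B is {2: 0, 4: 1}
    p_correct = sum(prior[g] * pmf[g][x]
                    for g in (0, 1) for x in (2, 4) if rule[x] == g)
    print(rule, "error rate:", 1 - p_correct)

Running this reproduces (up to floating-point rounding) the error rates 1/2, 1/3, 2/3, and 1/2 for rules A, B, C, and D, confirming that rule B is the Bayes classifier.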
In general, for any G and any discrete random variable X, the Bayes rule is the rule which maximizes
P( g*(X) = G ) = Σg Σx P( g*(X) = G | (X,G) = (x,g) ) × P( (X,G) = (x,g) ).
Noting that the right-hand side of the above equation is equal to
Σx [ Σg I(g*(x) = g) × P( G = g | X = x ) ] × P( X = x ),
it is clear that in order to maximize the probability of a correct classification, it suffices to assign, for each
possible value x of X, g*(x) the value which maximizes
Σg I(g*(x) = g) × P( G = g | X = x ) = P( G = g*(x) | X = x ).
The maximizing choice for g*(x) is the value of g which maximizes
P( G = g | X = x ) = P( X = x | G = g ) × P( G = g ) / P( X = x ).
So, for each value x of X, we should assign g*(x) the value of g which maximizes P( X = x | G = g ) × P( G = g ). If
the classes of G are equally likely, we should assign g*(x) the class which maximizes P( X = x | G = g ) --- which is
simply the class for which the probability of obtaining the observed value of X is the greatest.
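In the toy example above, this general recipe produces rule B directly, without enumerating all possible rules. A minimal sketch (reusing the same pmfs and priors) is:

# Form the Bayes rule directly: for each possible x, choose the class g
# that maximizes P( X = x | G = g ) * P( G = g ).
pmf = {0: {2: 2/3, 4: 1/3}, 1: {2: 1/3, 4: 2/3}}
prior = {0: 1/2, 1: 1/2}

bayes_rule = {x: max(prior, key=lambda g: pmf[g][x] * prior[g]) for x in (2, 4)}
print(bayes_rule)   # {2: 0, 4: 1}, i.e. rule B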
In the case where X is continuous, for each value of X we should assign g*(x) the value of g which maximizes
fg(x) × P( G = g ),
where fg(x) is the density of X for class g. If the classes of G are equally likely, we should assign g*(x) the class
which maximizes fg(x) --- which is simply the class for which the density of X is the highest at x (the observed
value of X).
Note that all of the above still holds if we view X as a p-dimensional random variable (and x as
a p-dimensional vector of observed attribute values).
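As an illustration of the continuous case, suppose (hypothetically; the distributions below are made up for the sketch) that the two classes are equally likely, with X normal with mean 0 and variance 1 for class 0, and normal with mean 2 and variance 1 for class 1. With equal priors, the Bayes rule compares the two densities at the observed x, and with equal variances this reduces to a threshold at the midpoint of the two means.

from scipy.stats import norm

# Hypothetical class-conditional densities (equal class priors assumed).
f0 = norm(loc=0, scale=1)   # density of X for class 0
f1 = norm(loc=2, scale=1)   # density of X for class 1

def bayes_rule(x):
    # With equal priors, classify to the class whose density is higher at x.
    return 0 if f0.pdf(x) > f1.pdf(x) else 1

# With equal variances this is equivalent to the threshold x > 1
# (the midpoint of the two means).
for x in (-0.5, 0.9, 1.1, 3.0):
    print(x, "->", bayes_rule(x))

The same comparison carries over when X is p-dimensional: each fg is then a joint density on p-dimensional space, evaluated at the observed vector x.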