Bayes Classifiers



The Bayes classifier for a specific setting is the optimal classifier for that setting (but its misclassification rate need not be zero). To construct the Bayes classifier, we need to know the distribution of X for each class and the prior class probabilities. Since we usually do not have the information necessary for the construction of the Bayes classifier, it is typically the case that the Bayes classifier is an unattainable ideal.

In some "constructed" examples for which has the needed information, the Bayes classifier can be determined (although it is not always easy to express the classification rule in a simple way unless the example is a somewhat simple "toy" example). In such cases, it is often of interest to see if any classifiers constructed from the data have error rates nearly as low as the optimal Bayes error rate. In some situations for which the exact Bayes error rate is not easily obtained, it is possible to estimate it rather easily (and rather well) using Monte Carlo methods. (Note: It can be that the Bayes classifier is somewhat easy to obtain, but it's error rate is not so easy to obtain.)

For a simple example, consider a binary classification problem based on a single discrete predictor, X. Letting G = {0, 1}, suppose that for class 0, X has pmf
pX,0(2) = 2/3 and pX,0(4) = 1/3,
and for class 1, X has pmf
pX,1(2) = 1/3 and pX,1(4) = 2/3.
Further suppose that, for the item to be randomly selected and classified, we have
pG(0) = 1/2 and pG(1) = 1/2.

To find the Bayes classifier, we can consider all nonrandomized classification rules. (It can be shown that randomized classification rules cannot be superior to the best nonrandomized classification rule.) Since X takes only the values 2 and 4, there are exactly four nonrandomized rules:
rule A: g*(2) = 0 and g*(4) = 0 (classify every case as class 0);
rule B: g*(2) = 0 and g*(4) = 1;
rule C: g*(2) = 1 and g*(4) = 0;
rule D: g*(2) = 1 and g*(4) = 1 (classify every case as class 1).

To obtain the error rate for rule A, we can do a straightforward calculation of the probability that rule A classifies a randomly generated/selected case correctly. We have
P( g*(X) = G ) = P( G = 0 ) = 1/2.
It follows that the probability of an error is also 1/2.

For rule B it's a bit trickier. We have
P( g*(X) = G )
= P( g*(X) = G | (X,G) = (2,0) ) × P( (X,G) = (2,0) )
+ P( g*(X) = G | (X,G) = (4,0) ) × P( (X,G) = (4,0) )
+ P( g*(X) = G | (X,G) = (2,1) ) × P( (X,G) = (2,1) )
+ P( g*(X) = G | (X,G) = (4,1) ) × P( (X,G) = (4,1) )
= (1)(1/3) + (0)(1/6) + (0)(1/6) + (1)(1/3) = 2/3,
and so the probability of error for rule B is 1/3. (Note: We have that
P( (X,G) = (2,0) ) = P( X = 2 | G = 0 ) × P( G = 0 ) = (2/3)(1/2) = 1/3.
The probabilities for the other possible values of (X,G) can be obtained in a similar manner.)

Similarly, it can be shown that the probabilities of error for rule C and rule D are 2/3 and 1/2, respectively. So rule B minimizes the probability of misclassification --- it is the Bayes classifier, and the Bayes error rate is 1/3.
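
The whole enumeration is easy to check with code. The short sketch below (Python, used here purely as an illustration) recomputes the exact error rate of each of the four nonrandomized rules from the pmfs and prior probabilities of the example.

    from itertools import product

    p_x_given_g = {0: {2: 2/3, 4: 1/3},   # pmf of X for class 0
                   1: {2: 1/3, 4: 2/3}}   # pmf of X for class 1
    p_g = {0: 1/2, 1: 1/2}                # prior class probabilities

    # Each nonrandomized rule assigns a class to x = 2 and a class to x = 4.
    for label, rule in zip("ABCD", product([0, 1], repeat=2)):
        g_star = dict(zip([2, 4], rule))
        error = sum(p_g[g] * p_x_given_g[g][x]
                    for g in p_g for x in [2, 4] if g_star[x] != g)
        print(f"rule {label}: g*(2) = {g_star[2]}, g*(4) = {g_star[4]}, "
              f"error = {error:.4f}")

Running this reproduces the error rates 1/2, 1/3, 2/3, and 1/2 for rules A through D.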

In general, for any G and any discrete random variable X, the Bayes rule is the rule which maximizes
P( g*(X) = G ) = Σg Σx P( g*(X) = G | (X,G) = (x,g) ) × P( (X,G) = (x,g) ).
Noting that the right hand side of the above equation is equal to
Σx [ Σg I(g*(x) = g) × P( G = g | X = x ) ] P( X = x ),
it is clear that in order to maximize the probability of a correct classification, it suffices, for each possible value of X, to assign g*(x) the value which maximizes
Σg I(g*(x) = g) × P( G = g | X = x ).
The maximizing choice for g*(x) is the value of g which maximizes
P( G = g | X = x ) = P( X = x | G = g ) × P( G = g ) / P( X = x ).
So, for each value of X, we should assign g*(x) the value of g which maximizes
P( X = x | G = g ) × P( G = g ).
If the classes of G are equally likely, we should assign g*(x) the class which maximizes P( X = x | G = g ) --- which is simply the class for which the probability of obtaining the observed value of X is the greatest.
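
As a minimal sketch of this rule in code (reusing the toy pmfs and priors from the example above, purely for illustration):

    # Choose the class g maximizing P(X = x | G = g) * P(G = g).
    def bayes_classify(x, p_x_given_g, p_g):
        return max(p_g, key=lambda g: p_x_given_g[g].get(x, 0.0) * p_g[g])

    p_x_given_g = {0: {2: 2/3, 4: 1/3}, 1: {2: 1/3, 4: 2/3}}
    p_g = {0: 1/2, 1: 1/2}

    print(bayes_classify(2, p_x_given_g, p_g))   # classifies x = 2 as class 0
    print(bayes_classify(4, p_x_given_g, p_g))   # classifies x = 4 as class 1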

In the case where X is continuous, for each value of X we should assign g*(x) the value of g which maximizes
fg(x) × P( G = g ),
where fg(x) is the density of X for class g. If the classes of G are equally likely, we should assign g*(x) the class which maximizes fg(x) --- which is simply the class for which the density of X is the highest at x (the observed value of X).
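
In code the continuous case looks the same, with densities in place of pmfs. The Gaussian class densities below are assumed purely for illustration (N(0,1) for class 0 and N(2,1) for class 1, so the densities cross at x = 1):

    from scipy.stats import norm

    densities = {0: norm(loc=0.0, scale=1.0).pdf,   # f_0(x)
                 1: norm(loc=2.0, scale=1.0).pdf}   # f_1(x)
    p_g = {0: 1/2, 1: 1/2}

    # Choose the class g maximizing f_g(x) * P(G = g).
    def bayes_classify(x):
        return max(p_g, key=lambda g: densities[g](x) * p_g[g])

    print(bayes_classify(0.4))   # -> 0 (below the equal-density point x = 1)
    print(bayes_classify(1.7))   # -> 1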

Note that all of the above still holds if we view X as a p-dimensional random variable (and x as a p-dimensional vector of observed attribute values).
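
For example, with p = 2 the same sketch goes through with an assumed pair of bivariate normal class densities; only the form of fg changes:

    import numpy as np
    from scipy.stats import multivariate_normal

    densities = {0: multivariate_normal(mean=[0, 0], cov=np.eye(2)).pdf,
                 1: multivariate_normal(mean=[2, 2], cov=np.eye(2)).pdf}
    p_g = {0: 1/2, 1: 1/2}

    def bayes_classify(x):
        return max(p_g, key=lambda g: densities[g](x) * p_g[g])

    print(bayes_classify([0.5, 0.2]))   # -> 0 (closer to the class-0 mean)
    print(bayes_classify([1.8, 1.5]))   # -> 1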