Bayes Classifiers
The Bayes classifier for a specific setting is the optimal classifier for that setting (but its misclassification
rate need not be zero). To construct the Bayes classifier, we need to know
- the joint distribution of the predictor variables for each class (its cdf, pdf, or pmf),
- the probability of each class producing the item to be classified.
Since we usually do not have the information necessary for the construction of the Bayes classifier, it is typically
the case that the Bayes classifier is an unattainable ideal.
In some "constructed" examples for which has the needed information, the Bayes classifier can be determined (although
it is not always easy to express the classification rule in a simple way unless the example is a somewhat simple "toy"
example). In such cases, it is often of interest to see if any classifiers constructed from the data have error
rates nearly as low as the optimal Bayes error rate.
In some situations for which the exact Bayes error rate is not easily obtained, it is possible to estimate it rather
easily (and rather well) using Monte Carlo methods. (Note: It can be that the Bayes classifier is somewhat easy to
obtain, but its error rate is not so easy to obtain.)
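As a rough sketch of how such a Monte Carlo estimate might be done, suppose (purely for illustration; the priors, means, and covariance matrices below are made up) that there are two equally likely classes whose predictors follow known bivariate normal distributions. The Bayes rule, as discussed later in this section, classifies a case to the class maximizing the product of the class prior and the class-conditional density, and its error rate can be estimated by the proportion of simulated cases that the rule misclassifies.

import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)

# Hypothetical two-class setting (all parameters chosen only for illustration).
priors = np.array([0.5, 0.5])
dists = [multivariate_normal(mean=[0, 0], cov=[[1.0, 0.3], [0.3, 1.0]]),
         multivariate_normal(mean=[1, 1], cov=[[1.5, -0.2], [-0.2, 0.8]])]

# Simulate (G, X) pairs from the assumed joint distribution.
n = 200_000
g = rng.choice(2, size=n, p=priors)
x = np.empty((n, 2))
for k in (0, 1):
    idx = np.where(g == k)[0]
    x[idx] = dists[k].rvs(size=idx.size, random_state=rng)

# Apply the Bayes rule: classify each case to the class maximizing
# (class prior) * (class-conditional density at x).
scores = np.column_stack([priors[k] * dists[k].pdf(x) for k in (0, 1)])
g_hat = scores.argmax(axis=1)

print("estimated Bayes error rate:", np.mean(g_hat != g))

The standard error of such an estimate is roughly sqrt(p(1 - p)/n), so with n this large the estimate should be accurate to two or three decimal places.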
For a simple example, consider a binary classification problem based on a single discrete predictor, X.
Letting the class label G take values in {0, 1}, suppose that for class 0, X has pmf
pX,0(2) = 2/3 &
pX,0(4) = 1/3,
and for class 1, X has pmf
pX,1(2) = 1/3 &
pX,1(4) = 2/3.
Further suppose that for the item to be classified we have
pG(0) = 1/2 and pG(1) = 1/2.
To find the Bayes classifier, we can consider all nonrandomized classification rules. (It can be shown that
randomized classification rules cannot be superior to the best nonrandomized classification rule.) For the
nonrandomized rules we have
- rule A: g*(2) = 0, g*(4) = 0;
- rule B: g*(2) = 0, g*(4) = 1;
- rule C: g*(2) = 1, g*(4) = 0;
- rule D: g*(2) = 1, g*(4) = 1.
To obtain the error rate for rule A, we can do a straightforward calculation of the probability that rule A
classifies a randomly generated/selected case correctly. We have
P( g*(X) = G ) = P( 0 = G ) = 1/2.
It follows that the probability of an error is also 1/2.
For rule B it's a bit trickier. We have
P( g*(X) = G )
= P( g*(X) = G | (X,G) = (2,0) ) × P( (X,G) = (2,0) )
+ P( g*(X) = G | (X,G) = (4,0) ) × P( (X,G) = (4,0) )
+ P( g*(X) = G | (X,G) = (2,1) ) × P( (X,G) = (2,1) )
+ P( g*(X) = G | (X,G) = (4,1) ) × P( (X,G) = (4,1) )
= (1)(1/3) + (0)(1/6) + (0)(1/6) + (1)(1/3) = 2/3,
and so the probability of error for rule B is 1/3.
(Note: We have that
P( (X,G) = (2,0) ) = P( X = 2 | G = 0 ) × P( G = 0 ) = (2/3)(1/2) = 1/3.
The probabilities for the other possible values of (X,G) can be obtained in a similar manner.)
Similarly, it can be shown that the probabilities of error for rule C and rule D are 2/3 and 1/2, respectively.
So rule B minimizes the probability of misclassification --- it is the Bayes classifier, and the Bayes error rate is
1/3.
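The brute-force enumeration above is easy to carry out in code. The following sketch (using dictionaries to hold the pmfs and priors from the example) computes the exact probability of a correct classification for each of the four nonrandomized rules and reports the corresponding error rates.

from itertools import product

# Class-conditional pmfs of X and the class priors from the example.
pmf = {0: {2: 2/3, 4: 1/3},   # pmf of X for class 0
       1: {2: 1/3, 4: 2/3}}   # pmf of X for class 1
prior = {0: 1/2, 1: 1/2}

# Enumerate all nonrandomized rules g*: {2, 4} -> {0, 1} (rules A through D)
# and compute each rule's exact misclassification probability.
for labels in product([0, 1], repeat=2):
    rule = dict(zip([2, 4], labels))          # e.g. rule B is {2: 0, 4: 1}
    p_correct = sum(prior[g] * pmf[g][x]
                    for g in (0, 1) for x in (2, 4) if rule[x] == g)
    print(rule, "error rate:", 1 - p_correct)

Running this reproduces (up to floating-point rounding) the error rates 1/2, 1/3, 2/3, and 1/2 for rules A, B, C, and D, confirming that rule B is the Bayes classifier.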
In general, for any G and any discrete random variable X, the Bayes rule is the rule which maximizes
P( g*(X) = G ) = Σg Σx P( g*(X) = G | (X,G) = (x,g) ) × P( (X,G) = (x,g) ).
Noting that the right-hand side of the above equation is equal to
Σx [ Σg I(g*(x) = g) × P( G = g | X = x ) ] × P( X = x ),
it is clear that in order to maximize the probability of a correct classification, it suffices to assign, for each
possible value x of X, g*(x) the value which maximizes
Σg I(g*(x) = g) × P( G = g | X = x ) = P( G = g*(x) | X = x ).
The maximizing choice for g*(x) is the value of g which maximizes
P( G = g | X = x ) = P( X = x | G = g ) × P( G = g ) / P( X = x ).
So, for each value x of X, we should assign g*(x) the value of g which maximizes P( X = x | G = g ) × P( G = g ). If
the classes of G are equally likely, we should assign g*(x) the class which maximizes P( X = x | G = g ) --- which is
simply the class for which the probability of obtaining the observed value of X is the greatest.
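In the toy example above, this general recipe produces rule B directly, without enumerating all possible rules. A minimal sketch (reusing the same pmfs and priors) is:

# Form the Bayes rule directly: for each possible x, choose the class g
# that maximizes P( X = x | G = g ) * P( G = g ).
pmf = {0: {2: 2/3, 4: 1/3}, 1: {2: 1/3, 4: 2/3}}
prior = {0: 1/2, 1: 1/2}

bayes_rule = {x: max(prior, key=lambda g: pmf[g][x] * prior[g]) for x in (2, 4)}
print(bayes_rule)   # {2: 0, 4: 1}, i.e. rule B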
In the case where X is continuous, for each value of X we should assign g*(x) the value of g which maximizes
fg(x) × P( G = g ),
where fg(x) is the density of X for class g. If the classes of G are equally likely, we should assign g*(x) the class
which maximizes fg(x) --- which is simply the class for which the density of X is the highest at x (the observed
value of X).
Note that all of the above still holds if we view X as a p-dimensional random variable (and x as
a p-dimensional vector of observed attribute values).
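As an illustration of the continuous case, suppose (hypothetically; the distributions below are made up for the sketch) that the two classes are equally likely, with X normal with mean 0 and variance 1 for class 0, and normal with mean 2 and variance 1 for class 1. With equal priors, the Bayes rule compares the two densities at the observed x, and with equal variances this reduces to a threshold at the midpoint of the two means.

from scipy.stats import norm

# Hypothetical class-conditional densities (equal class priors assumed).
f0 = norm(loc=0, scale=1)   # density of X for class 0
f1 = norm(loc=2, scale=1)   # density of X for class 1

def bayes_rule(x):
    # With equal priors, classify to the class whose density is higher at x.
    return 0 if f0.pdf(x) > f1.pdf(x) else 1

# With equal variances this is equivalent to the threshold x > 1
# (the midpoint of the two means).
for x in (-0.5, 0.9, 1.1, 3.0):
    print(x, "->", bayes_rule(x))

The same comparison carries over when X is p-dimensional: each fg is then a joint density on p-dimensional space, evaluated at the observed vector x.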