Bootstrap estimate of bias


Due to the limitations of HTML (as I understand it), for this explanation of the bootstrap estimate of bias I'll use Fx for the unknown distribution underlying the observations (the xj), and F* for the empirical cdf (based on the observations), which for large n is hopefully a reasonable estimate of Fx. The notation F* is actually apt, since the ecdf is the distribution of the Xj* (the random variables we obtain by resampling the original set of observations).

Fx governs the "real world" --- and if we knew what Fx was, we could use theory or Monte Carlo estimation to obtain either the bias of some estimator, T, or an estimate of that bias. That is, if obtaining E(T) analytically is too hard, then, knowing Fx, we could generate many samples of size n from it, compute the value of T from each sample, and simply average these values to obtain an estimate of E(T). If we generate a large number of samples, the law of large numbers tells us that our estimate of E(T) ought to be good.
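
To make that concrete, here is a minimal Python sketch (assuming numpy is available). The exponential distribution stands in for a known Fx, and a 10% trimmed mean stands in for T --- both are placeholder choices of mine, not anything dictated by the method:

    import numpy as np

    rng = np.random.default_rng(0)

    def trimmed_mean(v, prop=0.1):
        # Mean after dropping the smallest and largest prop fraction of the values.
        v = np.sort(v)
        k = int(len(v) * prop)
        return v[k:len(v) - k].mean()

    # Pretend we know Fx --- taken here to be exponential purely as a placeholder.
    n, n_rep = 25, 10_000
    samples = rng.exponential(scale=1.0, size=(n_rep, n))

    # Compute T on each sample and average; by the law of large numbers this
    # average converges to E(T) as n_rep grows.
    t_values = np.apply_along_axis(trimmed_mean, 1, samples)
    print("Monte Carlo estimate of E(T):", t_values.mean())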

But suppose we don't know what Fx is. Then we can't use it to generate the many samples we need for our Monte Carlo estimate.

To get around the fact that we don't know Fx, we can go visit "the bootstrap world." This world is governed by F*. The beauty of this world is that we know what F* is, and so we can generate as many samples of xj* values as we want! So nothing prevents us from doing a Monte Carlo experiment to obtain a very good estimate of E(T) in the bootstrap world.
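
Concretely, "generating a sample from F*" just means drawing n values from the original data with replacement. A minimal Python sketch, where the data and the choice of T (the plain sample mean) are placeholders of mine:

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.exponential(scale=1.0, size=25)   # stand-in for the original observations

    # Drawing n values from the data with replacement IS drawing a sample from F*:
    # on each draw, every xj gets probability 1/n.
    B = 10_000
    boot = rng.choice(x, size=(B, x.size), replace=True)

    # Each row is one bootstrap sample; compute T on each row and average to get
    # a Monte Carlo estimate of E(T) in the bootstrap world.
    t_star = boot.mean(axis=1)                # T is the plain sample mean here
    print("bootstrap-world estimate of E(T):", t_star.mean())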

What we hope is that the relationship between E(T) and the estimand (what we're using T to estimate) --- that relationship being the bias, E(T) minus the estimand --- is quite similar in the bootstrap world and in the real world. The larger n is, the better the ecdf should approximate the unknown real-world cdf, and the closer bootstrap-world results ought to be to real-world truths.

For example, suppose we want to know the bias of a trimmed mean as an estimator of the distribution median. We hit problems in the real world because we know neither the expected value of the estimator nor the distribution median, and so we have neither of the two pieces that make up the bias (bias = E(T) - median). But in the bootstrap world we do know the distribution median --- given the empirical distribution, based on n equally-likely outcomes, we can darn well determine its median (it's just the usual sample median of the original data set). And by resampling, generating as many bootstrap samples as we want, we can estimate the expected value of the trimmed mean in the bootstrap world very well. So in the bootstrap world we can get both of the pieces that make up the bias, and we can take the resulting bootstrap-world bias and use it as an estimate of the bias in the real world.
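
Putting the two pieces together, here is a sketch of the whole calculation in Python (assuming numpy; the data and the 10% trimming fraction are placeholder choices of mine):

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.exponential(scale=1.0, size=25)   # stand-in for the observed data

    def trimmed_mean(v, prop=0.1):
        # Mean after dropping the smallest and largest prop fraction of the values.
        v = np.sort(v)
        k = int(len(v) * prop)
        return v[k:len(v) - k].mean()

    # Piece 1: the estimand in the bootstrap world. The distribution median of
    # F* is just the sample median of the original data --- no estimation needed.
    median_star = np.median(x)

    # Piece 2: E(T) in the bootstrap world, estimated by resampling.
    B = 10_000
    boot = rng.choice(x, size=(B, x.size), replace=True)
    e_t_star = np.apply_along_axis(trimmed_mean, 1, boot).mean()

    # Bootstrap estimate of bias: (bootstrap-world E(T)) - (bootstrap-world estimand).
    print("bootstrap estimate of bias:", e_t_star - median_star)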


(Hopefully you now better understand the bootstrap estimate of bias. I also hope you better appreciate why I covered Monte Carlo estimation before presenting bootstrapping. Monte Carlo estimation is what we might want to do when we can't push the theory through. But to do Monte Carlo estimation as I describe in the class notes, one has to know what distribution underlies the phenomenon, and often in the real world we don't have that knowledge ... and so we resort to the bootstrap tactic. Note that bootstrapping is just Monte Carlo estimation in the bootstrap world. The ideal bootstrap estimates don't make use of resampling at all --- the ecdf is used directly to determine whatever expected values are needed. But because the empirical distribution isn't like a nice parametric distribution, where expected values can be obtained using integration or cute summation tricks, we settle for Monte Carlo estimates (obtained by bootstrap resampling followed by estimation) rather than determining the ideal bootstrap values.)
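
To make the "ideal bootstrap" remark concrete: for a tiny data set one really can enumerate all n^n equally-likely resamples and compute the needed expected value exactly, with no random resampling. A toy Python sketch, with made-up data and T taken to be the sample mean (for which the ideal bootstrap expectation works out to exactly the sample mean):

    import itertools
    import numpy as np

    x = np.array([2.1, 3.7, 4.4, 9.0])   # a tiny made-up data set, n = 4

    # Average T over all n**n equally-likely resamples (here 4**4 = 256) ---
    # the ideal bootstrap value of E(T), no random resampling involved. This
    # is feasible only for tiny n, which is why we usually settle for Monte
    # Carlo estimates instead.
    vals = [np.mean(r) for r in itertools.product(x, repeat=len(x))]
    print("ideal bootstrap E(T):", np.mean(vals))
    print("sample mean:", x.mean())   # matches exactly when T is the sample mean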