Bootstrap estimate of bias


Due to the limitations of HTML (as I understand it), for this explanation of the bootstrap estimate of bias I'll use Fx for the unknown distribution underlying the observations (the xj), and F* for the empirical cdf (based on the observations), which for large n is hopefully a reasonable estimate of Fx. The notation F* is actually apt, since the ecdf is the distribution of the Xj* (the random variables we obtain by resampling the original set of observations).

Fx governs the "real world" --- and if we knew what Fx was, we could use theory or Monte Carlo estimation to obtain either the bias of some estimator, T, or an estimate of that bias. That is, if obtaining E(T) analytically is too hard, then, knowing Fx, we could generate many samples of size n from it, compute the value of T from each sample, and simply average these values to obtain an estimate of E(T). If we generate a large number of samples, the law of large numbers tells us that our estimate of E(T) ought to be good.
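
To make that concrete, here is a minimal Python sketch (assuming numpy is available). The exponential distribution stands in for a known Fx, and a 10% trimmed mean stands in for T --- both are placeholder choices of mine, not anything dictated by the method:

    import numpy as np

    rng = np.random.default_rng(0)

    def trimmed_mean(v, prop=0.1):
        # Mean after dropping the smallest and largest prop fraction of the values.
        v = np.sort(v)
        k = int(len(v) * prop)
        return v[k:len(v) - k].mean()

    # Pretend we know Fx --- taken here to be exponential purely as a placeholder.
    n, n_rep = 25, 10_000
    samples = rng.exponential(scale=1.0, size=(n_rep, n))

    # Compute T on each sample and average; by the law of large numbers this
    # average converges to E(T) as n_rep grows.
    t_values = np.apply_along_axis(trimmed_mean, 1, samples)
    print("Monte Carlo estimate of E(T):", t_values.mean())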

But suppose we don't know what Fx is. Then we can't use it to generate the many samples we need for our Monte Carlo estimate.

To get around the fact that we don't know Fx, we can go visit "the bootstrap world." This world is governed by F*. The beauty of this world is that we know what F* is, and so we can generate as many samples of xj* values as we want! So nothing prevents us from doing a Monte Carlo experiment to obtain a very good estimate of E(T) in the bootstrap world.
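
Concretely, "generating a sample from F*" just means drawing n values from the original data with replacement. A minimal Python sketch, where the data and the choice of T (the plain sample mean) are placeholders of mine:

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.exponential(scale=1.0, size=25)   # stand-in for the original observations

    # Drawing n values from the data with replacement IS drawing a sample from F*:
    # on each draw, every xj gets probability 1/n.
    B = 10_000
    boot = rng.choice(x, size=(B, x.size), replace=True)

    # Each row is one bootstrap sample; compute T on each row and average to get
    # a Monte Carlo estimate of E(T) in the bootstrap world.
    t_star = boot.mean(axis=1)                # T is the plain sample mean here
    print("bootstrap-world estimate of E(T):", t_star.mean())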

What we hope is that the relationship between E(T) and the estimand (what we're using T to estimate) --- that relationship being the bias, E(T) minus the estimand --- is quite similar in the bootstrap world and in the real world. The larger n is, the better the ecdf should approximate the unknown real-world cdf, and the closer bootstrap-world results ought to be to real-world truths.

For example, suppose we want to know the bias of a trimmed mean as an estimator of the distribution median. We hit problems in the real world because we know neither the expected value of the estimator nor the distribution median, and so we have neither of the two pieces that make up the bias (bias = E(T) - median). But in the bootstrap world we do know the distribution median --- given the empirical distribution, based on n equally-likely outcomes, we can darn well determine its median (it's just the usual sample median of the original data set). And by resampling, generating as many bootstrap samples as we want, we can estimate the expected value of the trimmed mean in the bootstrap world very well. So in the bootstrap world we can get both of the pieces that make up the bias, and we can take the resulting bootstrap-world bias and use it as an estimate of the bias in the real world.
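
Putting the two pieces together, here is a sketch of the whole calculation in Python (assuming numpy; the data and the 10% trimming fraction are placeholder choices of mine):

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.exponential(scale=1.0, size=25)   # stand-in for the observed data

    def trimmed_mean(v, prop=0.1):
        # Mean after dropping the smallest and largest prop fraction of the values.
        v = np.sort(v)
        k = int(len(v) * prop)
        return v[k:len(v) - k].mean()

    # Piece 1: the estimand in the bootstrap world. The distribution median of
    # F* is just the sample median of the original data --- no estimation needed.
    median_star = np.median(x)

    # Piece 2: E(T) in the bootstrap world, estimated by resampling.
    B = 10_000
    boot = rng.choice(x, size=(B, x.size), replace=True)
    e_t_star = np.apply_along_axis(trimmed_mean, 1, boot).mean()

    # Bootstrap estimate of bias: (bootstrap-world E(T)) - (bootstrap-world estimand).
    print("bootstrap estimate of bias:", e_t_star - median_star)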


(Hopefully you now better understand the bootstrap estimate of bias. I also hope you better appreciate why I covered Monte Carlo estimation before presenting bootstrapping. Monte Carlo estimation is what we might want to do when we can't push the theory through. But to do Monte Carlo estimation as I describe in the class notes, one has to know what distribution underlies the phenomenon, and often in the real world we don't have that knowledge ... and so we resort to the bootstrap tactic. Note that bootstrapping is just Monte Carlo estimation in the bootstrap world. The ideal bootstrap estimates don't make use of resampling at all --- the ecdf is used directly to determine whatever expected values are needed. But because the empirical distribution isn't like a nice parametric distribution, where expected values can be obtained using integration or cute summation tricks, we settle for Monte Carlo estimates (obtained by bootstrap resampling followed by estimation) rather than determining the ideal bootstrap values.)
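
To make the "ideal bootstrap" remark concrete: for a tiny data set one really can enumerate all n^n equally-likely resamples and compute the needed expected value exactly, with no random resampling. A toy Python sketch, with made-up data and T taken to be the sample mean (for which the ideal bootstrap expectation works out to exactly the sample mean):

    import itertools
    import numpy as np

    x = np.array([2.1, 3.7, 4.4, 9.0])   # a tiny made-up data set, n = 4

    # Average T over all n**n equally-likely resamples (here 4**4 = 256) ---
    # the ideal bootstrap value of E(T), no random resampling involved. This
    # is feasible only for tiny n, which is why we usually settle for Monte
    # Carlo estimates instead.
    vals = [np.mean(r) for r in itertools.product(x, repeat=len(x))]
    print("ideal bootstrap E(T):", np.mean(vals))
    print("sample mean:", x.mean())   # matches exactly when T is the sample mean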