Bootstrap estimate of bias
Due to the limitations of HTML (as I understand it), for this
explanation of the bootstrap estimate of bias, I'll use Fx
for the unknown distribution underlying the observations (the Xj),
and I'll use F* for the empirical cdf (based on the observations),
which for large n is hopefully a reasonable estimate of Fx. Using
the * notation for the ecdf is "okay" actually, since the ecdf is
the distribution of the Xj* (the random variables that we observe by
resampling the original set of observations).

Fx governs the "real world" --- and if we knew what
Fx is we could use theory or Monte Carlo estimation to
obtain either the bias or an estimate of the bias of some estimator, T.
That is, if obtaining E(T) is too hard to do analytically,
but we knew Fx,
we could generate many samples of size n from it, compute the
value of T from each sample, and simply average these values to
obtain an estimate of
E(T). If we generate a large number of samples, the law of large
numbers gives us that our estimate of E(T)
ought to be good.
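To make that Monte Carlo step concrete, here is a minimal Python sketch (not from the original notes). It assumes, purely for illustration, that Fx is an exponential distribution with mean 1 and that T is a 20% trimmed mean; the distribution, the sample size, and the choice of T are all hypothetical.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

n = 25        # sample size, chosen arbitrarily for this sketch
B = 100_000   # number of Monte Carlo samples

# Pretend we know Fx: here, hypothetically, exponential with mean 1.
samples = rng.exponential(scale=1.0, size=(B, n))

# T is taken to be the 20% trimmed mean (an assumption for illustration).
t_values = stats.trim_mean(samples, proportiontocut=0.2, axis=1)

# By the law of large numbers, this average converges to E(T) as B grows.
print("Monte Carlo estimate of E(T):", t_values.mean())
```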
But suppose we don't know what Fx
is? Then we can't use it to generate the many samples we need for our
Monte Carlo estimate.
To get around the fact that we don't know Fx,
we can go visit "the bootstrap world." This world is governed by
F*. The beauty of this world is that we know what F*
is, and so we can generate as many samples of
xj* values as we want! So nothing prevents
us from doing a Monte Carlo experiment to obtain a very good estimate of
E(T) in the bootstrap world.
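A matching sketch of the bootstrap-world version: the only change from the previous example is that samples are drawn from F*, i.e., drawn with replacement from the observed data, rather than from Fx. The data here are a simulated stand-in for a real sample.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Simulated stand-in for the original observed sample.
data = rng.exponential(scale=1.0, size=25)

B = 100_000   # number of bootstrap resamples

# Sampling from F* means drawing n values from the data, with replacement.
resamples = rng.choice(data, size=(B, data.size), replace=True)
t_star = stats.trim_mean(resamples, proportiontocut=0.2, axis=1)

# A very good estimate of E(T) in the bootstrap world.
print("Bootstrap-world estimate of E(T):", t_star.mean())
```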
What we hope is that the relationship between E(T) and the
estimand (what we're using T to estimate) in the bootstrap world
is quite similar to the relationship between E(T) and the
estimand in the real world. The larger n is, the better the ecdf
should approximate the unknown real world cdf, and the closer bootstrap
world results ought to be to real world truths.
For example, if we want to know the bias of a trimmed mean as an
estimator of the distribution median, we encounter problems in the real
world: we don't know the expected value of the estimator and we
don't know the distribution median, and so we have neither of the two
pieces that make up the bias of the estimator (the bias being E(T)
minus the estimand). But in the bootstrap
world we do know the distribution median --- given the empirical
distribution, based on n equally-likely outcomes, we can darn
well determine the distribution median (it'd just be the usual sample
median of the original data set). And by resampling, generating as many
bootstrap samples as we want, we can estimate the
expected value of the trimmed mean in the bootstrap world very well. So in
the bootstrap world we can get the two pieces that make up the bias.
We can take this value of the bias from the bootstrap world and use
it as an estimate of the bias in the real world.
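Putting the two pieces together, here is a sketch of the bootstrap bias estimate for this example. The data, sample size, trimming proportion, and number of resamples are all illustrative choices, not prescriptions from the notes.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Illustrative data; in practice this is your one observed sample.
data = rng.lognormal(mean=0.0, sigma=1.0, size=30)

# Piece 1: the estimand in the bootstrap world. The median of F*
# is just the ordinary sample median of the original data.
median_star = np.median(data)

# Piece 2: E(T) in the bootstrap world, estimated by Monte Carlo,
# with T the 20% trimmed mean.
B = 100_000
resamples = rng.choice(data, size=(B, data.size), replace=True)
t_star_mean = stats.trim_mean(resamples, proportiontocut=0.2, axis=1).mean()

# Bootstrap bias estimate = (bootstrap-world E(T)) - (bootstrap-world estimand).
bias_hat = t_star_mean - median_star
print("Bootstrap estimate of the bias of the trimmed mean:", bias_hat)
```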
(Hopefully you now better understand the bootstrap estimate of bias.
Also, I hope you now better appreciate why I covered Monte Carlo
estimation before I presented bootstrapping. Monte Carlo estimation is
what we might want to do when we can't push the theory through. But to
do Monte Carlo estimation as I describe in the class notes, one has to
know what distribution underlies the phenomenon, and often in the real
world we don't have that knowledge ... and so we resort to the bootstrap
tactic. Note that bootstrapping is Monte Carlo estimation in the
bootstrap world. The ideal bootstrap estimates don't make
use of resampling --- the ecdf is just used to determine whatever
expected values are needed. But because the empirical distribution
isn't like a nice parametric distribution, where expected values can be
obtained using integration or cute summation tricks, instead of
determining the ideal bootstrap values exactly we use Monte Carlo
estimates (obtained by bootstrap resampling followed by estimation).)
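To illustrate the distinction, here is a toy sketch of my own (not from the notes): for a sample of size 4, the ideal bootstrap expectation of the sample median can be computed exactly by enumerating all 4^4 = 256 equally likely resamples, while the resampling approach merely approximates that number.

```python
import numpy as np
from itertools import product

data = np.array([1.0, 2.0, 4.0, 8.0])   # toy sample, n = 4
n = data.size

# Ideal bootstrap: enumerate all n**n = 256 equally likely resamples
# and average T (here the sample median) exactly -- no resampling noise.
ideal = np.mean([np.median(data[list(idx)])
                 for idx in product(range(n), repeat=n)])

# Monte Carlo version of the same quantity, via bootstrap resampling.
rng = np.random.default_rng(1)
resamples = rng.choice(data, size=(100_000, n), replace=True)
approx = np.median(resamples, axis=1).mean()

print("Ideal bootstrap value:    ", ideal)
print("Monte Carlo approximation:", approx)
```

For anything beyond tiny n this enumeration is hopeless (n**n resamples), which is exactly why the resampling approximation is what gets used in practice.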