Some Notes Pertaining to Ch. 4 of E&T



The empirical distribution has probability 1/n associated with each observation in a sample of size n. The empirical cdf is the corresponding cdf --- it's a step function if the data is one dimensional. The empirical distribution plays a huge role in bootstrapping.
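As a concrete illustration, here is a minimal sketch (in Python with NumPy; the name empirical_cdf is mine, not E&T's) of the empirical cdf as a step function that jumps by 1/n at each observation:

```python
import numpy as np

def empirical_cdf(sample):
    """Return the empirical cdf F_hat, where F_hat(t) = (# of observations <= t) / n."""
    x = np.sort(np.asarray(sample, dtype=float))
    n = len(x)
    def F_hat(t):
        # searchsorted with side="right" counts how many observations are <= t
        return np.searchsorted(x, t, side="right") / n
    return F_hat

F = empirical_cdf([3.1, 1.4, 2.7, 1.4, 5.0])
# F steps up by 1/5 at each observation (by 2/5 at the tied value 1.4)
```

Evaluating F at any point between the smallest and largest observations returns a multiple of 1/n, which is why the plot of F is a staircase for one-dimensional data.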

If we have observations of iid random variables (or, as some may choose to put it, independent observations of a random variable), then we lose no information useful for making inferences about the underlying distribution if we are just given the empirical distribution instead of the original sample. (In the language of STAT 652 we can say that the empirical distribution is sufficient. (If we are assuming a particular parametric model, it may not be minimal sufficient. Typically, bootstrapping is used without a parametric model being assumed. When used with a parametric model we can make use of the minimal sufficient statistics, and bootstrapping is done differently, not making heavy use of the empirical distribution.))

E&T's use of the term parameter makes the distribution mean and median parameters whether or not one is dealing with a parametric model. I tend to refer to parameters only when working with a parametric model --- otherwise I would refer to the mean, median, variance, etc. as distribution measures. (It's not a big deal, but of course I think my way is better since it makes a distinction between two different situations.)

Common distribution measures of interest are functionals. (A functional can be thought of as a function of a function --- it's a function where the input is a function as opposed to a numerical value, a set, or something else.) The functionals which are the distribution measures that we might want to make inferences about have distributions as inputs (and we can think of the distributions as being functions). For example, the distribution mean is a functional --- given a distribution we either integrate or sum using the distribution to obtain a value for the distribution mean (if it exists).
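To make the "function of a function" idea concrete, here is a small sketch (the name mean_functional is just for illustration) of the mean as a functional of a discrete distribution: the input is a distribution, given by its support points and probabilities, and the output is a number.

```python
def mean_functional(support, probs):
    """The mean as a functional: input a discrete distribution
    (support points and their probabilities), output a number."""
    return sum(x * p for x, p in zip(support, probs))

# A fair six-sided die has distribution mean 3.5:
mean_functional([1, 2, 3, 4, 5, 6], [1/6] * 6)
```

For a continuous distribution the sum would be replaced by an integral, but the functional viewpoint is the same: distribution in, number out.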

To obtain the plug-in estimate of a distribution measure one just inputs the empirical distribution into the functional. (As in-class examples I'll consider the mean, variance, and median. The median is perhaps a bit "tricky" due to possible nonuniqueness, but if you force yourself to define the median so as to eliminate nonuniqueness there is no problem.) Plug-in estimates play a big role in bootstrapping when one does not assume a parametric model. Since the empirical cdf converges to the true (but unknown) cdf, plug-in estimates should be decent if the sample size is not too small. (Notes: (1) As usual, no need to ask what is meant by "too small" because the answer is it depends. (2) With some parametric models, if we assume the model is correct then we might want to avoid using simple plug-in estimates because they won't depend on the data through a minimal sufficient statistic. For example, perhaps we should estimate the mean using something other than the sample mean --- maybe a function of the sum of the log xi. As another example, suppose that we want to estimate the median of a normal distribution. The plug-in estimate, the sample median, is inferior to the sample mean. In the 2nd paragraph on p. 37, E&T indicate that plug-in estimates may not be great when one is dealing with a parametric model. The last paragraph on p. 37 indicates that E&T will address modifications for using plug-in estimates and bootstrapping with parametric models in chapters to come.)
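The in-class examples above can be sketched in Python/NumPy (the function names are mine). One point worth seeing in code: because the empirical distribution puts mass 1/n on each observation, the plug-in variance divides by n, not the n - 1 of the usual unbiased sample variance.

```python
import numpy as np

def plug_in_mean(sample):
    # plugging the empirical distribution into the mean functional
    # gives the ordinary sample mean
    return float(np.mean(sample))

def plug_in_variance(sample):
    # plugging in gives (1/n) * sum of (xi - xbar)^2 --- division by n,
    # not the n - 1 used in the usual unbiased sample variance
    x = np.asarray(sample, dtype=float)
    return float(np.mean((x - x.mean()) ** 2))

def plug_in_median(sample):
    # one convention that eliminates nonuniqueness: for even n,
    # average the two middle order statistics (NumPy's default)
    return float(np.median(sample))

x = [1.0, 2.0, 3.0, 6.0]
# plug_in_mean(x) -> 3.0, plug_in_variance(x) -> 3.5, plug_in_median(x) -> 2.5
```

Note that plug_in_variance differs from np.var(x, ddof=1), the n - 1 version, precisely because the plug-in recipe treats the empirical distribution as the distribution.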

The last sentence of Sec. 4.3 (on p. 37) refers to the "one-sample nonparametric setup" --- it's perhaps easiest to get comfortable with bootstrapping in this simple setting, but E&T soon turn to more complicated situations in order to illustrate how general the bootstrap approach is.