Some Comments about Chapter 5 of Samuels & Witmer
Section 5.1
- (p. 149) While the first sentence of this section is okay (and can
be extended beyond the field of biology), to me it's suggestive of
hypothesis testing (which is introduced in Ch. 7) and not so much of the
types of things considered in Ch. 5 and Ch. 6, where the focus is on
estimation. For example, with the examples of Ch. 1, a key focus of the
data analysis can be whether there are differences due to some treatment,
or differences between different populations. What makes this hard to
determine in some cases is that we expect some differences in the
samples even if there are no differences in the parent distributions
associated with the samples. Hypothesis testing is concerned with
determining if there is strong evidence of real differences in the
presence of sampling variability. With estimation the focus is
different: the desire is to estimate the mean, median, or some other
summary measure of the parent distribution of the sample, from only the
observations belonging to the sample at hand. Because the sample doesn't
give us enough information to determine a value associated with the
population or distribution exactly,
we have to be content with producing a guess about the unknown value
of interest, and sampling variability leaves uncertainty
associated with our best guess. With estimation the focus is to
minimize the error in estimation due to the sampling variability, and to
quantify the amount of uncertainty associated with an estimate.
- (p. 149) One usually refers to the sampling distribution of
some specific statistic (e.g., an estimator or a test statistic). Knowledge
about the
sampling distribution of a statistic lets us know how sampling
variability affects its value, and this allows us to study the
distribution of the error in estimation or to determine if the data
provides strong evidence of a treatment effect or of population
differences (as opposed to the data being consistent with a hypothesis
of no difference).
- (p. 150, Example 5.1) This is similar to Problem 1 of the
homework (a problem that you are supposed to do, but not turn in). It
shows that when you produce an estimate (on p. 150, the sample mean can
be used as an estimate of the population mean), different samples can
produce different values of the estimate, and that an estimate of the
mean need not equal the actual population mean. (The example shows the
effects of sampling variation.)
- (p. 150) The 2nd sentence after Table 5.2 is important. In
some cases it may seem that the sample is a subset of a finite
population. In such a case, the random variables associated with the
observations are not independent. If we know the population size, we
can adjust for the lack of independence and obtain a more accurate
inference. But often we don't know the population size. However, if we
know that the population size, whatever it is, is much larger than
the sample size, the adjustment that should be made is negligible (see
the long footnote at the bottom of p. 158), and so
it is common to ignore the lack of independence and make inferences
under the assumption of independent observations.
(Although this is nothing
to worry about if you don't understand the explanation, some may find it
interesting, and at first a bit odd, that when we take a subset of a
finite population, which is referred to as sampling without
replacement, the random variables associated with the sample
are not independent, even though one is guaranteed not to pick the
same population member more than once, but when you sample with
replacement, observing a subject and then returning it to the population
so that the same subject can be picked again to contribute another
observation, the random variables associated with the sample are
independent, even though two or more of the observations in the sample
can be due to the same subject.
As a concrete example, suppose that there is some unknown number of type
A objects in a population of size 100, with the rest of the
objects being type B objects. When we draw a sample without
replacement, we have a lack of independence, since whatever type of
object is selected first influences the proportion of type A objects
remaining at the time of drawing the second object. However, if we just
observe the type of the first object and put it back in the population
so that it can possibly be selected again, the distribution at the time
of the second drawing of an object doesn't depend on what type of object
was drawn first, which means that the first two observations are
independent of one another. It should be noted that sampling without
replacement is preferable, since there is less uncertainty when we
observe a sample based on n different population members than
there is in a sample of size n that can include certain
population members more than once. But when n is small compared
to the population size, it can make very little difference which type of
sampling is used.)
There is another viewpoint that makes the lack of knowledge of the
population size even more unimportant. As an example, we could say that
we're not interested in just the treatment of current cancer patients,
but also in the treatment of those who will be diagnosed in the future.
By the time the results of some study can lead to something useful, some
of the cancer patients available at the time the study was done may have
died, and new patients will have been identified. So in one sense the
population of interest is always
changing --- but another take on it could be that it is infinite, and
with such a viewpoint it is legitimate to view the sample at hand as
resulting from independent random variables.
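To illustrate the point about the two sampling schemes numerically, here is a minimal simulation sketch (assuming Python with numpy is available; the population of 100 objects with 30 of type A matches the concrete example above, while the sample size of 10 and the number of repetitions are arbitrary choices):

    import numpy as np

    rng = np.random.default_rng(1)
    population = np.array([1] * 30 + [0] * 70)   # 30 type A objects (coded 1) out of 100
    n, reps = 10, 50_000                         # sample size and number of simulated samples

    # Sample proportion of type A objects under the two sampling schemes
    p_without = [rng.choice(population, size=n, replace=False).mean() for _ in range(reps)]
    p_with = [rng.choice(population, size=n, replace=True).mean() for _ in range(reps)]

    print("mean without replacement:", np.mean(p_without))       # both close to p = 0.3
    print("mean with replacement:   ", np.mean(p_with))
    print("variance without replacement:", np.var(p_without))
    print("variance with replacement:   ", np.var(p_with))       # close to p(1-p)/n

Both versions of the sample proportion have mean p, but the without-replacement version has the smaller variance, by the finite population correction factor (N - n)/(N - 1), which is close to 1 when n is small relative to the population size N (consistent with the remark above about the adjustment being negligible).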
Section 5.2
- (p. 151) Dichotomous observations are associated with
variables that have only two possible outcomes. When the outcomes are
coded as 0 and 1, dichotomous observations can be referred to as
binary observations. It is common to use dichotomous and binary
interchangeably, although I suppose to be technical, the use of the term
binary should be restricted to the 0/1 coding (as opposed to coding with
A and B, for example). Often binary/dichotomous variables are modeled
as iid Bernoulli random variables, but it is important to keep in mind
that this should be done only if we have independence and
a constant probability of "success" on each trial.
- (p. 152, Example 5.4) From Table 5.3 it can be seen
that the mean of the sample proportion is 0.3, which is the value of
p. It can be shown that
whatever value p is, the mean of the sample proportion
is equal to p. Because of this, the sample proportion is
referred to as an unbiased estimator. (In general, if the
expected value of an estimator is equal to the estimand (the
value being estimated), whatever value the estimand may be, the
estimator is said to be unbiased.) I think that unbiasedness is an
overrated property, and that the distribution of the error in estimation
(perhaps summarized by some average value associated with the magnitude
of the error in estimation) should be the focus --- by itself,
unbiasedness just means that in a sense the overestimates would be of
the same average magnitude as the underestimates in repeated
applications of the estimator (and so unbiasedness doesn't focus on the
magnitude of the error in estimation, but only on a balance in the
tendency to overestimate and underestimate).
Table 5.3 shows that even though the sample proportion is
unbiased, it doesn't produce an estimate close to the estimand with high
probability. In this case, it isn't that the estimator is defective
(the sample proportion is the best estimator to use in this situation),
but rather its poor performance is due to the sample size being so
small --- with only two observations, one cannot expect to have a good
estimate of p. Note that with
n = 2 the most likely value for the sample proportion is 0,
whereas with
n = 20 the most likely value for the sample proportion is 0.3
(see
Table 5.4 on p. 154),
which is the actual value of p.
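A quick way to verify the unbiasedness and the most likely values mentioned above is to compute the distribution of the sample proportion directly from the binomial distribution (a minimal sketch, assuming Python with scipy is available, and using p = 0.3 as in the example):

    from scipy.stats import binom

    p = 0.3
    for n in (2, 20):
        probs = [binom.pmf(k, n, p) for k in range(n + 1)]
        phat = [k / n for k in range(n + 1)]                 # possible values of the sample proportion
        mean = sum(ph * pr for ph, pr in zip(phat, probs))   # expected value of the sample proportion
        mode = phat[probs.index(max(probs))]                 # most likely value of the sample proportion
        print(f"n = {n:2d}: E(sample proportion) = {mean:.3f}, most likely value = {mode}")

The expected value comes out to 0.3 for both sample sizes, while the most likely value is 0 for n = 2 and 0.3 for n = 20, matching Tables 5.3 and 5.4.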
- (p. 155, Example 5.7) For all values of n, the mean
of the sample proportion is equal to the estimand, p. But the
variance of the sample proportion is
p(1 - p)/n, which is a decreasing function of
n. From Fig. 5.5 it can be seen that the probability mass
becomes more highly concentrated near p as n increases,
and this fact is also shown in
Table 5.5. If
Fig. 5.5 were to be extended to include a really large sample
size, it could be seen that with very high probability the sample
proportion will take a value extremely close to p. (Recall that
the law of large numbers gives us that a sample mean, which is
what the sample proportion is (if we view the number of successes as
being a sum of n Bernoulli random variables), converges to the
mean of the parent distribution, which is p for the Bernoulli
random variables underlying the sample proportion.)
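The increasing concentration of the probability mass near p can also be checked numerically using the exact binomial distribution (a minimal sketch, assuming Python with scipy; the sample sizes and the window of width 0.05 on each side of p are arbitrary choices):

    import math
    from scipy.stats import binom

    p = 0.3
    for n in (20, 100, 1000, 10000):
        # P(|sample proportion - p| <= 0.05), from the exact binomial distribution
        lo = math.ceil((p - 0.05) * n)
        hi = math.floor((p + 0.05) * n)
        prob = binom.cdf(hi, n, p) - binom.cdf(lo - 1, n, p)
        print(f"n = {n:5d}: P(sample proportion within 0.05 of p) = {prob:.3f}")

The probability climbs toward 1 as n increases, which is the law of large numbers at work.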
Section 5.3
- (p. 158, 1st paragraph) The facts given here about the sampling
distribution of the sample mean assume that the sample mean is based on
iid (independent identically distributed) random variables --- this is
implied on the bottom of p. 157 with the reference to random samples.
The mean and variance of the sampling distribution of the sample mean
(or one could just refer to the mean and variance of the sample mean
when taking it to be a random variable (so upper-case), as opposed to its
value based on a particular sample), can be obtained using the four
rules on pp. 100-101, as I showed in class as I led up to the law of
large numbers.
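As a numerical check on those facts about the mean and standard deviation of the sample mean, here is a minimal simulation sketch (assuming Python with numpy; the exponential parent distribution, which has mean and standard deviation both equal to 2 here, and the sample size of 25 are arbitrary choices):

    import numpy as np

    rng = np.random.default_rng(2)
    mu, sigma, n, reps = 2.0, 2.0, 25, 200_000   # exponential with mean 2 also has sd 2

    # Each row is an iid sample of size n; take the mean of each row
    xbar = rng.exponential(scale=mu, size=(reps, n)).mean(axis=1)

    print("mean of the sample means:", xbar.mean())   # close to mu = 2
    print("sd of the sample means:  ", xbar.std())    # close to sigma / sqrt(n) = 0.4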
- (p. 158, Example 5.8)
The first part of the 3rd sentence
should never have been written.
- (p. 159, Theorem 5.1) The first two parts are addressed at
the top of p. 158. The third part is new, and will be shown to be very
important as we cover the next two chapters. Note that all of the parts
together give us that the sampling distribution of the sample mean is
either normal or approximately normal, with a mean equal to the mean of
the distribution of the random variables making up the sample mean, and
a variance which decreases as the sample size increases. All of this is
consistent with the law of large numbers --- as the variance gets
smaller with increasing sample size, the probability mass associated
with the sample mean becomes more
highly concentrated about the mean of the parent distribution of the
observations, and so for very large n the sample mean will assume
a value very close to the mean of the parent distribution (which is what
is meant by stating that the sample mean converges to the mean of the
parent distribution).
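The approximate normality asserted by the third part of the theorem can be seen by simulating sample means from a skewed parent distribution and comparing a few of their quantiles with those of the normal distribution having the same mean and standard deviation (a minimal sketch, assuming Python with numpy and scipy; the exponential parent with mean 1 and the sample size of 40 are arbitrary choices):

    import numpy as np
    from scipy.stats import norm

    rng = np.random.default_rng(3)
    n, reps = 40, 200_000
    xbar = rng.exponential(scale=1.0, size=(reps, n)).mean(axis=1)   # skewed parent: mean 1, sd 1

    for q in (0.025, 0.25, 0.5, 0.75, 0.975):
        sim = np.quantile(xbar, q)
        approx = norm.ppf(q, loc=1.0, scale=1.0 / np.sqrt(n))        # normal approximation
        print(f"quantile {q:5.3f}: simulated {sim:.3f}, normal approximation {approx:.3f}")

The agreement isn't perfect (the parent distribution is quite skewed and n = 40 isn't huge), but it is already fairly close.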
- (p. 159, Fig. 5.7) Although the general idea is to show how
the probability mass is more highly concentrated about the mean
for the sampling distribution of the sample mean compared to the
distribution associated with just a single observation, the labeling of
the axes is rather screwy (and I advise not spending a lot of time
trying to figure it out).
- (p. 161, Example 5.10) This example is similar in spirit to
Example 5.7 from the preceding section.
- (p. 163, Example 5.12) This example shows that histograms
aren't necessarily good estimates of the density of the parent
distribution of a sample, since 8 different samples from the same
distribution produce rather different histograms.
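One can reproduce the phenomenon with a small simulation (a minimal sketch, assuming Python with numpy; the standard normal parent, the sample size of 25, and the bin boundaries are arbitrary choices, not the ones used in Example 5.12):

    import numpy as np

    rng = np.random.default_rng(4)
    bins = np.linspace(-3, 3, 7)                    # six equal-width bins on (-3, 3)

    # Eight samples of size 25, all from the same standard normal distribution
    for i in range(8):
        sample = rng.normal(size=25)
        counts, _ = np.histogram(sample, bins=bins)
        print(f"sample {i + 1}: bin counts {counts}")

The bin counts vary noticeably from sample to sample even though every sample comes from the same parent distribution.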
Section 5.4
- (p. 168, lines 7 & 8) The phrase "violently skewed" is just a bit
too whacky for me --- I think that highly skewed is a
more suitable expression to use.
- (p. 169, 1st paragraph) This paragraph touches upon some important
points. (However, even though one might see statements similar to the
2nd to the last sentence of the paragraph in other
books, I think that there are situations in which the distribution mean
is still a very relevant summary measure for a highly skewed distribution.
Even though the distribution mean isn't a typical value for a single
observation from the distribution, if one considers a sample of values,
the mean of the parent distribution can be a typical value for the
sample mean.)
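To put a number on that parenthetical remark, one can compare how often a single observation versus a sample mean falls near the distribution mean of a highly skewed distribution (a minimal sketch, assuming Python with numpy; the exponential distribution with mean 1, the sample size of 50, and the window of plus or minus 10% are arbitrary choices):

    import numpy as np

    rng = np.random.default_rng(5)
    mu, n, reps = 1.0, 50, 100_000                  # exponential with mean 1 is highly skewed

    singles = rng.exponential(scale=mu, size=reps)                  # single observations
    xbars = rng.exponential(scale=mu, size=(reps, n)).mean(axis=1)  # sample means, n = 50

    near = lambda x: np.mean(np.abs(x - mu) <= 0.1 * mu)
    print("P(single observation within 10% of the distribution mean):", near(singles))
    print("P(sample mean within 10% of the distribution mean):       ", near(xbars))

The first probability is small (the mean isn't a typical value for one observation), while the second is much larger, so the distribution mean is a quite reasonable summary when one is thinking about sample means.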
Section 5.5
- (p. 172, Example 5.17) Although the availability of
statistical software makes using the approximation to obtain
(approximate) probabilities less important than it used to be, there
still can be times when one may want to employ the approximation (e.g.,
if one doesn't have suitable software handy, or n is so large
that it causes a problem for the software). Even though we may not use
the approximation so much for numerical work, the approximation as
expressed in
Theorem 5.2 on p. 170 is still very important for the
justification of certain statistical procedures.
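For what it's worth, here is what using the approximation versus an exact computation looks like side by side (a minimal sketch, assuming Python with scipy; the choices n = 100, p = 0.3, the event X <= 25, and the use of a continuity correction are mine and not taken from the example):

    from math import sqrt
    from scipy.stats import binom, norm

    n, p = 100, 0.3
    mu, sd = n * p, sqrt(n * p * (1 - p))

    exact = binom.cdf(25, n, p)                    # exact P(X <= 25) for X ~ Binomial(100, 0.3)
    approx = norm.cdf((25 + 0.5 - mu) / sd)        # normal approximation with a continuity correction
    print("exact probability:      ", exact)
    print("approximate probability:", approx)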
- (p. 174) The rule of thumb given may not be the best one to use.
An alternate rule of thumb is to require that n be at least as
large as 9 times the larger of p/(1 - p) and (1 - p)/p.
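A small worked check of this alternate rule of thumb (a minimal sketch, assuming Python; the values of p are arbitrary):

    def min_n(p):
        # n should be at least 9 times the larger of p/(1-p) and (1-p)/p
        return 9 * max(p / (1 - p), (1 - p) / p)

    for p in (0.5, 0.3, 0.1, 0.01):
        print(f"p = {p}: require n >= {min_n(p):.0f}")

So p = 0.5 requires only n >= 9, while p = 0.01 requires n >= 891: the farther p is from 1/2, the larger the sample size needed before the normal approximation is trustworthy.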
Section 5.6
- (p. 176) The last paragraph is very important --- we can compare
the anticipated performances of various statistics by comparing their
sampling distributions (or sometimes, in practice, their estimated or
approximated sampling distributions). Also, while it is true that the
sample median is better than the sample mean for estimating the
mean/median/center of some symmetric distributions, the distribution has
to be pretty odd for this to be the case. However, for many
heavy-tailed symmetric distributions, some other estimator of the
mean/median/center can be superior to both the sample mean and the
sample median.
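As an illustration of comparing estimators through their (approximated) sampling distributions, here is a minimal simulation sketch (assuming Python with numpy and scipy; the t distribution with 3 degrees of freedom stands in for a heavy-tailed symmetric distribution centered at 0, the 20% trimmed mean stands in for "some other estimator," and the sample size of 25 is an arbitrary choice):

    import numpy as np
    from scipy.stats import trim_mean

    rng = np.random.default_rng(6)
    n, reps = 25, 100_000

    # Heavy-tailed symmetric parent: t distribution with 3 degrees of freedom (center 0)
    samples = rng.standard_t(df=3, size=(reps, n))

    estimators = {
        "sample mean": samples.mean(axis=1),
        "sample median": np.median(samples, axis=1),
        "20% trimmed mean": trim_mean(samples, 0.2, axis=1),
    }
    for name, est in estimators.items():
        print(f"{name:16s}: mean squared error = {np.mean(est ** 2):.4f}")

With this particular parent distribution the trimmed mean tends to beat both the sample mean and the sample median in mean squared error, which is the sort of comparison the last paragraph of the section has in mind.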