Some Comments about Chapter 3 of Samuels & Witmer
Section 3.1
I don't know of any other introductory statistics books that are as good
with regard to addressing the issues that are covered on pp. 75-77 of
S&W. (Even though S&W is a somewhat low-level book, it's a good one,
and you can really learn a lot from it if you read it carefully and
spend adequate time trying to understand most of the many points that
are covered.) Because we're going to cover the first portion of S&W (a
lot of which should be review material) pretty quickly, I won't spend a
lot of time on this material in class, but I will hit upon some of the
points from time to time.
- (p. 72) Note that the paragraph underneath the gray box describes 2
ways of choosing a simple random sample. The first way seems to me to be
the easier one, and it is the most commonly used scheme.
- (p. 72, Choosing a Random Sample) The paragraph that begins
at the bottom of the page gives
two applications of random sampling.
The first is selecting
sample members from a larger population. Sometimes, but not always,
this is actually
implemented. For example, when GMU conducts a survey of its faculty,
presumably names are randomly selected from a long list and those people
are requested to supply the desired information. (In this case, while a
random sample was the aim, unless the list is accurate, the
randomization is done properly, and everyone selected responds, the end
result is not a true random sample. The analysis may be done assuming
that it is, but one should be concerned about the introduction of sampling
bias (see p. 75). (The particular type of sampling bias in this
example is nonresponse bias, and
the fear is that the tendency to not participate in the survey is
correlated with what is being observed. In studies involving animals or
medical patients, one can still have nonresponse bias --- not necessarily
because a
subject refuses to participate, but perhaps because some subjects are
too weak to be included in the study, and omitting the weak ones biases
the results.) Nevertheless, because the population is clearly
defined
(assuming an appropriate definition of GMU's faculty is determined
(e.g., making decisions to include or exclude groups such as part-time
faculty, administrative faculty, etc.)) and not too hard to access, it
would be possible to actually get a random sample if a big enough effort was
made.) But often, it just isn't practical to get a simple
random sample. For example, in a study of the intelligence of a certain
type of monkey which lives in the wilds of Africa, it would be
impossible to assign ID numbers to all of the monkeys, randomly draw a
set of ID numbers, and then go collect the monkeys whose numbers were
drawn and use these randomly selected monkeys for the intelligence
study. Instead, we may have to be content to use whatever collection of
such monkeys we can get, and hope to extrapolate the results to the
larger population even though we aren't able to use a random sample.
(This example is similar to Example 3.6 on p. 76. Also note that
the 2nd paragraph on p. 75 addresses some of the points made here, and
the last paragraph on p. 77 pertains to the extrapolation of research
findings.)
The second use of random sampling is to allocate a collection of
subjects (perhaps obtained by sampling) to various treatment groups.
One should always strive to do this in some way (there are various
strategies corresponding to different experimental designs). By
randomly assigning subjects to treatment groups, we can sometimes get
meaningful conclusions from an experiment even if the collection of
subjects assigned were not randomly drawn from a larger population.
(For instance, in Example 1.1 on p. 2, it wouldn't be horrible
if the sheep were not randomly drawn from a larger population, as long
as they were randomly assigned to the two groups. It's the random
assignment that gives us a fair way to assess the effectiveness of the
vaccine on the 48 sheep used in the study, and it seems reasonable that
the results from these 48 sheep can be extrapolated to some larger
group of sheep. While it may be vague as to what the appropriate larger
group is, the main thing is that the experiment indicates that the
vaccine is effective on some population of sheep. But if instead of random assignment of the sheep to the
two treatment groups, the 24 healthiest looking sheep were given the
vaccine, or just the male sheep were given the vaccine, then the
results of the experiment wouldn't mean the same thing, since it wouldn't
be clear if the vaccine is effective or if healthy sheep can fight off
anthrax better than weaker sheep, or males are much more resistant to
anthrax than females.)
However, sometimes the nature of the groups to be compared makes random
assignment of subjects impossible (for instance, see Example 1.4 on pp. 3-4
and Example 1.6 on pp. 5-6), in which case having a random sample
for each group (from each respective population) becomes much more
important.
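Going back to the random-assignment point (Example 1.1), here's a minimal
Python sketch of randomly assigning subjects to two treatment groups (the
subject labels are just placeholders; the 48 subjects and two groups of 24
match Example 1.1):

  import random

  random.seed(0)                                         # any seed; just for reproducibility
  subjects = ["sheep " + str(i) for i in range(1, 49)]   # 48 subjects, as in Example 1.1
  random.shuffle(subjects)                               # put the subjects in a random order
  vaccinated, control = subjects[:24], subjects[24:]     # first 24 get the vaccine
  print(len(vaccinated), len(control))                   # 24 24
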
- (p. 74, Example 3.1)
*** minor mistake in book *** On the third
line of p. 74, column 12 should be column 11.
- (p. 74, Example 3.1) This is similar to what you are to do
for Exercise 3.2 on p. 77, which I have suggested that you do in a
certain way on the
homework web page. Specifically, so
that everyone who does it correctly will obtain the same result (making
it possible for me to indicate what the correct answer is), instead of
picking a starting point randomly, you are to start with the set of 5
digits in the upper left corner. Then you are to go down that first
column. Rather than use the 2nd and 3rd digits in the column as is done
in the book (see highlighted portion of Table 3.1 on p. 74), use the last
two digits (the 4th and 5th digits).
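Here is a rough Python sketch of the general digit-table scheme just
described (the population size of 75 is a made-up value, not the one in
Exercise 3.2, and random.randint plays the role of reading successive
pairs of digits from the table):

  import random

  random.seed(1)
  population_size = 75        # made-up value for illustration
  sample_size = 5

  chosen = []
  while len(chosen) < sample_size:
      pair = random.randint(0, 99)        # stands in for the next two table digits
      if 1 <= pair <= population_size and pair not in chosen:
          chosen.append(pair)             # skip out-of-range pairs and repeats

  print(chosen)                           # the 5 selected ID numbers
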
- (p. 75, line 7) The concept of chance error due to sampling is an
important one. Even when randomization is properly used, it is still
possible to get a misleading result; i.e., the collection of sample values
may be
rather unusual in some way. Even if the sample isn't highly unusual, it
is rare that a random sample exactly reflects the population. In either
case, inferences are subject to error even if the random sample is
properly collected. Because of this, not only is it important
to use statistical procedures which should minimize this sort of error,
but we should also try to indicate (quantitatively) something about the
accuracy associated with our best inferences ... because we always
expect to have some amount of error due to sampling.
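A small simulation (with a completely made-up population) may help make
the point concrete: even with a properly drawn simple random sample, the
sample mean misses the population mean by a different amount each time.

  import random

  random.seed(0)
  population = [random.gauss(50, 10) for _ in range(10000)]   # made-up population
  pop_mean = sum(population) / len(population)

  for _ in range(5):
      sample = random.sample(population, 25)      # a proper simple random sample
      sample_mean = sum(sample) / len(sample)
      print(round(sample_mean - pop_mean, 2))     # the chance error due to sampling
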
- (p. 75, Example 3.2)
Although this doesn't help with the problem of the net not getting the
small fish, to help reduce other sources of bias, the fishing locations
should be randomly chosen (perhaps using some sort of stratified
sampling plan (which unfortunately may not be covered in S&W)). Since
smaller fish and larger fish may tend to be differently distributed
throughout the bay, it would be bad to do all of the fishing in the same
location if the purpose was to learn something about the fish in the
entire bay. *** A similar bias could occur if the intelligence of
monkeys was studied using whatever monkeys were available instead of
trying to get a collection that would be similar to a random sample,
since perhaps it's the less intelligent monkeys that are more often
caught and distributed to researchers. (Anytime a sample of convenience
is used in animal studies, I guess there is a fear that the animals used
may be on the average slower (in thought and/or speed) than those
in the larger population from which they were captured.)
- (p. 75, Example 3.3) On the other hand (in contrast to what is
suggested in the book), it could be that smaller nerves are less likely
to be selected if the sampling isn't done carefully. For example, if we
randomly pick a spot on p. 78, a larger
oval is more likely to be hit than is a smaller one. (Important
note: Often, as is the case here, randomly is meant to mean
uniformly at random. With a discrete set this means that each
element is equally likely to be selected. With a continuum of
points, like the points of a line interval or the locations on a page,
the notion is a bit harder to describe (we'll address it later in Ch.
3), but it still means that no point is favored over any other point
when a selection is made. (In picking the nerve to be measured in
Example 3.3 or picking an oval from p. 78, if we randomly stab at
a spot, instead of first identifying all of the possibilities and then
picking one by giving each the same chance to be chosen, we're more
likely to poke at a large one than a small one.) However, sometimes randomly does not mean
uniformly at random --- something could be random but follow a normal
distribution instead of a uniform distribution.)
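The stab-at-a-spot point can be illustrated with a quick simulation (the
sizes below are made up): when the chance of selection is proportional to
size, the average size of the selected items is larger than the true
average.

  import random

  random.seed(0)
  sizes = [1, 1, 1, 1, 1, 5, 5, 10]                 # made-up "sizes"; true mean is 3.125

  uniform_picks = [random.choice(sizes) for _ in range(100000)]
  stab_picks = random.choices(sizes, weights=sizes, k=100000)   # chance proportional to size

  print(sum(uniform_picks) / len(uniform_picks))    # close to 3.125
  print(sum(stab_picks) / len(stab_picks))          # close to 6.2 (= 155/25, the size-weighted average)
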
- (pp. 75-76, Example 3.4)
This example, and the paragraph that follows it, address a point that a
lot of books fail to cover. (There is a lot that can be learned from a
careful reading of this book. Despite some mistakes and some things that
aren't done really well, on the whole it's a pretty good book.)
- (p. 76, Example 3.5) Here the sample used came from
crossbred plants created especially for the experiment as opposed to
being drawn from a larger population --- it could be that there is no
larger population of plants occurring naturally from which a good random
sample could have been drawn. Here we hope the
sample is representative of what may occur if more such plants are
created. It may be stretching things to claim that they are
representative of all such plants to be grown in the future since
different growing conditions may have some effects. I would guess that
in this setting, it's the comparisons of resistances of the progeny
plants with the other two types of plants that is the key focus, and the
important thing is that all three are grown in similar conditions,
exposing them to the same things. (That is, this example is similar to
Example 3.7 in that the meaningful population is a bit vague, and
the important thing is the comparison of treatments --- the experiment
seems like it may yield good information even if it isn't completely
clear to what extent the results can be extrapolated.)
Section 3.3
- (p. 78) I use P(E) instead of Pr{E}. P is a
function, but unlike f(x) = 1/x, where you plug in a
nonzero number and get out a number (the reciprocal of the number
plugged in), with P you plug in an event (which is just a subset of the
sample space) and get out a number (the probability of the event).
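To make the analogy concrete, here's a tiny Python version of P for a
fair-die experiment (Fraction is used just to keep the arithmetic exact):

  from fractions import Fraction

  sample_space = {face: Fraction(1, 6) for face in range(1, 7)}   # a fair die

  def P(event):
      # the input is an event (a subset of the sample space); the output is a number
      return sum(sample_space[outcome] for outcome in event)

  print(P({1, 3, 5}))   # 1/2, the probability of rolling an odd number
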
- (p. 80) I've never seen the double arrow used in this situation.
(My guess is that Ch. 3 is the weakest chapter in the book.) Most books
would use a single arrow (like I did in class) with the arrow pointing
from the ratio to the probability. Formally, it is the law of large
numbers that gives us that the ratio converges to the true
probability as the number of trials increases. It should be noted that
we don't always have to adopt the frequency interpretation ---
Example 3.10 came up with a probability using the notion of
equally-likely outcomes, using the fact that if 30% of the flies are
black, and one is chosen at random (giving each fly the same chance of
being selected), the probability of picking a black fly is 0.3. But
even though here we can determine exactly what the probability is, if we
estimate it with a ratio of the type on p. 80 by repeating the experiment
over and over, the estimate should be very close to 0.3 with high
probability after a sufficiently large number of trials.
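Here's a quick simulation of the relative-frequency idea for the fly
example: with the probability of a black fly set to 0.3, the observed
ratio settles down near 0.3 as the number of trials grows.

  import random

  random.seed(0)
  for n in (10, 100, 10000, 1000000):
      black = sum(random.random() < 0.3 for _ in range(n))   # number of black flies drawn
      print(n, black / n)                                    # the ratio approaches 0.3
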
Section 3.4
This section should follow Sec. 3.5, since it uses the concepts of
independence and conditional probability, which aren't
introduced until Sec. 3.5.
(On p. 91 (near the top, right after Example 3.22) it is stated
that material from Sec. 3.5 is used in Sec. 3.4.)
(I'm surprised that the middle part of Ch. 3
is so bad, considering that other parts of the book are so good.)
- (p. 84) Both trees shown on this page use the concept of
independent events (which isn't introduced until the next section of
the book). Specifically, it is being assumed that the outcome of the
first coin toss does not influence the probabilities associated with the
second toss (which seems like a reasonable assumption). The
probabilities given in the tree at the bottom of the page follow from
the notions of equally-likely outcomes and independent events. I'm
not really sure why the book has "the relative frequency interpretation of
probability can be a guide to the appropriate combination of the
probabilities of subevents" --- the statement makes little or no sense
to me, and I don't recall having seen the term "subevent" used in a probability
text before.
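If the bottom tree is about two tosses of a fair coin (as the surrounding
discussion suggests), the branch probabilities come from the following
sort of computation, assuming equally likely faces and independent tosses:

  from itertools import product

  for first, second in product("HT", repeat=2):
      prob = 0.5 * 0.5                  # independence lets us multiply along the branches
      print(first + second, prob)       # each of HH, HT, TH, TT has probability 0.25
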
- (p. 85) The paragraph right before Example 3.16 indicates
that the population needs to be large. Actually, if two flies are taken
one after another (without replacing the first fly before drawing
another fly --- so two different flies are taken) from a population with
exactly 30% black flies, then the probability of obtaining two black
flies is a bit less than (0.3)(0.3) = 0.09. For example, if the
population has 3 black flies and 7 gray flies, then the probability that
two black flies will be drawn is (3/10)(2/9) = 1/15, which is not equal
to 0.09. (The 2/9 is due to the fact that if a black fly is selected on
the first draw, only 2 of the 9 flies which remain at the time of the
second draw are black. (Note: This type of stuff isn't so
important in STAT 535. Of course, it isn't really difficult either, and
so there is no good reason why you shouldn't try to understand it.))
If we have 100 flies instead of just 10, the probability of getting two
black flies is (30/100)(29/99) = 29/330, which is about 0.0879. If there
are 1000 flies, the probability is 0.0898, which is pretty close to 0.09.
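The computations above can be summarized with a few lines of Python
(Fraction keeps the arithmetic exact; the population is taken to be
exactly 30% black in each case):

  from fractions import Fraction

  for N in (10, 100, 1000, 100000):
      black = 3 * N // 10                                     # exactly 30% black flies
      prob = Fraction(black, N) * Fraction(black - 1, N - 1)  # both draws black, no replacement
      print(N, float(prob))                                   # approaches (0.3)(0.3) = 0.09
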
- (pp. 85-86, Example 3.16)
The probability calculation right above Fig. 3.5 is making use of
Rule 4 on p. 89 of the next section. Conditional probabilities are also
being employed, and Rule 7 on p. 91 is being used to get each of the two
terms in the sum.
- (pp. 86-87, Example 3.17)
The probability calculation right above Fig. 3.6 is making use of
Rule 4 on p. 89 of the next section. Conditional probabilities are also
being employed, and Rule 7 on p. 91 is being used to get each of the two
terms in the sum.
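In generic form (with made-up numbers, not the ones in Examples 3.16 and
3.17), the kind of calculation being referred to looks like this: each of
the two terms in the sum is a conditional probability times the
probability of the conditioning event.

  p_case1, p_case2 = 0.3, 0.7                 # probabilities of two non-overlapping cases
  p_E_given_case1, p_E_given_case2 = 0.5, 0.2

  p_E = p_E_given_case1 * p_case1 + p_E_given_case2 * p_case2   # one term per case, then add
  print(p_E)                                  # 0.29 (apart from floating-point rounding)
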
Section 3.5
Although I find this stuff fascinating, I'm not going to have
you do a lot with these probability rules. There may be some HW
problems pertaining to them, but except for the really basic things that
I'll use over and over again, these rules won't be emphasized on the
exams. You'll find that as we go through the course, I'll refer to some
relatively simple rules a lot in justifying material from other parts of
the book, while other rules won't be referred to nearly as often.
Specifically, Rule 3 and Rule 4 (two rather simple rules) will be used from
time to time, and Rule 6 will be used a lot. Rule 1 is so trivial that
we may not notice it being used.
- (p. 88)
*** mistake in book ***
Rule 2 isn't correct. It's possible that the sum of
probabilities for various events exceeds 1. For example, consider
the random experiment of rolling a fair die and observing the number of
spots on the upward face. The sample space is
S = {1, 2, 3, 4, 5, 6}. Letting
A = {1, 3, 5} (an odd integer is obtained),
B = {2, 4, 6} (an even integer is obtained),
C = {3, 6} (an integer divisible by 3 is obtained),
we have
P(A) +
P(B) +
P(C) = 1/2 + 1/2 + 1/3 = 4/3, which is greater than 1.
To partially fix things, one could specify that the sum of the probabilities of
all possible outcomes (in the sample space) is equal to 1, but
even then we'd need to specify that the sample space is finite or
countably infinite (which is a term that you don't have to be concerned
about). For example, with the fair die we have
P( {1} ) +
P( {2} ) +
P( {3} ) +
P( {4} ) +
P( {5} ) +
P( {6} ) = 1/6 + 1/6 + 1/6 + 1/6 + 1/6 + 1/6 = 1. It can also be noted
that most probability books have P(S) = 1 as one of the basic
rules.
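The die counterexample can be checked in a few lines (again using
Fraction for exact arithmetic):

  from fractions import Fraction

  S = {1, 2, 3, 4, 5, 6}

  def P(event):
      return Fraction(len(event), len(S))    # equally likely outcomes

  A, B, C = {1, 3, 5}, {2, 4, 6}, {3, 6}
  print(P(A) + P(B) + P(C))                  # 4/3, greater than 1 (C overlaps both A and B)
  print(sum(P({s}) for s in S))              # 1, as the partial fix requires
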
- (p. 90) The concept of independent events is an important
one.
Section 3.6
Once again, I find it odd how S&W organizes the material in Ch. 3.
Random variables, which are formally introduced in the next section,
should be covered before getting into density curves (which
should formally be referred to as probability density functions
(pdfs), but are commonly just called densities).
Densities are very important and will be used a lot throughout the course.
I'll often sketch densities on the board to provide an indication of the
shape associated with the parent distribution of a random sample. (By
parent distribution I mean the distribution associated with
making a random observation of some variable --- in some cases it will
be assumed that making a random observation (that is, randomly selecting a
subject or location, and making a measurement) is well approximated by
observing a normally distributed random variable (normal distributions
are covered in Ch. 4), but in other cases I may need to indicate a
skewed, heavy-tailed, or bimodal distribution, and drawing a quick sketch
of the assumed density is a good way to convey this information.)
I'll also sketch densities of the sampling distributions
of certain estimators and test statistics
(sampling distributions for estimators are covered in Ch. 5, and
sampling distributions of test statistics are covered in Ch. 7 after
hypothesis testing is introduced).
(Note: A sampling distribution is just the distribution for
some statistic (e.g., an estimator or a test statistic). It's
just that because a statistic is a function of the random variables
associated with a random sample (so a function of more than one
random variable), its value depends upon the values in the sample,
and we refer to its distribution as a sampling distribution.)
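As a small illustration of the sampling-distribution idea (with a made-up
normal parent distribution), the sample mean of 25 observations is a
statistic, so it has its own distribution, which we can look at by
simulating many samples:

  import random
  from statistics import mean, stdev

  random.seed(0)
  sample_means = [mean(random.gauss(50, 10) for _ in range(25)) for _ in range(10000)]
  print(mean(sample_means))    # close to 50, the mean of the parent distribution
  print(stdev(sample_means))   # close to 10 / sqrt(25) = 2
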
- (p. 93) If we think of the three parts of Fig. 3.9 as
graphics which provide some indication of the relative likelihood of
observing the various possible values of something when a random
observation is made,
it should seem sensible that a smooth curve, like the one labeled (c),
is preferable to the choppy histogram shapes of (a) and (b). Looking at
(a), it should seem odd if the likelihood remained absolutely constant
between 150 and 160, and then jumped to a different value and remained
constant between 160 and 170 --- instead, a gradual change would seem to
make more sense.
- (p. 94) The blue box is very important --- the main idea is that
probabilities associated with a continuous variable having a density
correspond to areas under the curve.
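To make the area idea concrete, here's a sketch with a made-up density (a
triangular one on the interval from 0 to 10) and a crude Riemann-sum
approximation of an area under it:

  def f(y):
      # a made-up density: f(y) = (10 - y)/50 for 0 <= y <= 10, and 0 elsewhere
      return (10 - y) / 50 if 0 <= y <= 10 else 0.0

  a, b, n = 2, 5, 100000
  width = (b - a) / n
  area = sum(f(a + (i + 0.5) * width) for i in range(n)) * width
  print(area)    # about 0.39 -- the probability of observing a value between 2 and 5
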
- (pp. 94-95, The Continuum Paradox) Although it may seem
very odd at first that the probability of observing a
particular value exactly is 0 with a continuous variable,
I will try to convince you in class that this is the only value that
makes sense for such a probability. An important associated point is
that unlike the case with discrete variables (really, discrete random
variables --- which are not formally introduced until the next section),
with continuous (random) variables, a probability of 0 for a particular
outcome doesn't mean that outcome is impossible. (Again, this may seem
a bit odd (or very odd) if you're encountering this for the first time.)
- (p. 95, Example 3.29)
Note that with a continuous variable it doesn't matter whether the
endpoints, in this case 100 and 150, are included or not, since the
probability associated with each of these points is 0. But with a discrete
variable, whether or not the endpoints are to be included can matter, since the
associated probabilities may be positive.
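Using the same made-up triangular density from the sketch above, here's a
numerical look at why a single point gets probability 0: the probability
of landing within a half-width h of the value 4 shrinks to 0 as h shrinks.

  def F(y):
      # area under the triangular density from 0 to y (obtained by calculus)
      return (10 * y - y * y / 2) / 50

  for h in (1, 0.1, 0.001, 0.00001):
      print(h, F(4 + h) - F(4 - h))    # the probability shrinks toward 0 with h
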
Section 3.7
- (pp. 96-97, Example 3.31
and Example 3.34)
When we observe the number of spots
on the upward face of a die, we are observing an object (the upward face
of the die) that can take on various values. When we observe the height
of a randomly selected man, it's not that a man's height
fluctuates randomly, but rather it's the act of randomly
selecting a man and observing his height that can produce various
values according to a probability distribution. The common element is
that prior to the act of making an observation, more than one value
could occur, and the various values will occur according to a
probability distribution.
- (p. 97) Here is a way of describing the mean (aka expected value)
of a random variable: it is a weighted average of the random variable's
possible outcomes --- for a discrete random variable, each possible
outcome is weighted by its probability; for a continuous random
variable the density function provides the weights (and an integral has
to be used instead of a sum).
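For a discrete random variable, the weighted average is easy to compute
directly; here it is for the number of spots on a fair die (Fraction
keeps the arithmetic exact):

  from fractions import Fraction

  values_and_probs = {x: Fraction(1, 6) for x in range(1, 7)}   # a fair die

  mean = sum(x * p for x, p in values_and_probs.items())        # each value weighted by its probability
  print(mean)    # 7/2, i.e., 3.5
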
- (p. 100, Rules of Means of Random Variables)
The two rules are important for understanding the derivation of certain
statistics, but aren't used so much in actual data analysis.
- (p. 101, Rules of Variances of Random Variables)
The two rules are important for understanding the derivation of certain
statistics, but aren't used so much in actual data analysis.
Note that Rule 4 pertains to independent random variables only.
The book doesn't have a lot about independent random variables ---
basically, if two random variables are independent then the value that
one assumes doesn't affect the value that the other one assumes.
(See bottom portion of p. 100 for more information.)
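A quick simulation of the sort of fact involved here (using made-up
distributions): for independent random variables, the variance of the sum
equals the sum of the variances.

  import random
  from statistics import variance

  random.seed(0)
  n = 100000
  X = [random.gauss(0, 2) for _ in range(n)]       # variance 4
  Y = [random.uniform(0, 12) for _ in range(n)]    # variance 12

  print(variance(X) + variance(Y))                 # roughly 16
  print(variance([x + y for x, y in zip(X, Y)]))   # also roughly 16, since X and Y are independent
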
Section 3.8
- (p. 103, 1st paragraph) The two possible outcomes are generically
referred to as success and failure. The success
is the outcome which has probability p, and in some cases it may
be the outcome that seems to be the less desirable one in a practical
sense. For example, in a study of death rates, it may be convenient to
focus on the deaths (which may be relatively rare) instead of the
survivals. If we want to use p for the probability that a
randomly chosen subject corresponds to a death, then a death is
considered a success --- a bit odd perhaps, but try to get used to it.
(Note the Remark on p. 106.)
- (pp. 103-104, Example 3.44) The blue box above the example
stresses the independence of the trials. The probability calculations
in the example get across the concept of what is meant by independent
trials. If we let
A1 be the event that the first trial is a success
and A2 be the event that the second trial is a success
(noting that in this example, an albino child is considered to be a
"success" and a nonalbino a "failure"), the independence of the trials
means that these two events are independent (the chance that
A2 occurs does not depend on whether or not
A1 occurs), which gives us that the probability of the
intersection of these two events (which corresponds to two albino
children) is just the product of the probabilities of the events, which
is p*p = p^2.
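The p*p = p^2 calculation can also be checked by simulation (the value of
p below is just an illustrative choice, not necessarily the one in the
example): the fraction of simulated families in which both independent
trials are successes comes out close to p^2.

  import random

  random.seed(0)
  p, n = 0.25, 100000      # p is an illustrative value
  both = 0
  for _ in range(n):
      first_success = random.random() < p
      second_success = random.random() < p     # independent of the first trial
      if first_success and second_success:
          both += 1
  print(both / n, p * p)   # the two numbers should be close
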
- (p. 105) When speaking about the binomial coefficient
nCj, people often say "n choose j" due to
the fact that its value is equal to the number of different subsets of
size j that can be created (chosen) from a set of n items.
(There is another commonly used notation for the binomial coefficient
nCj which I'll often use when writing on
the board.) In the formula for the binomial distribution probabilities,
the n choose j factor is due to there being that number of
different sequences of successes and failures having j successes
among the n trials. So instead of adding two terms having
probability p(1 - p) as is done near the middle of p. 104
for the case of n being 2 and j being 1, for the
probability of j successes in n trials, there are
nCj terms to add, each having probability
p^j (1 - p)^(n - j).
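Written out in Python, the formula looks like this (math.comb(n, j) is
the "n choose j" count; the p = 0.25 below is just an illustrative
value):

  from math import comb

  def binomial_prob(j, n, p):
      # probability of j successes in n independent trials with success probability p
      return comb(n, j) * p**j * (1 - p)**(n - j)

  print(binomial_prob(1, 2, 0.25))    # 2 * 0.25 * 0.75 = 0.375, the n = 2, j = 1 case
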
- (p. 110, Example 3.50) This example indicates that not
every situation with a collection of binary outcomes should be modeled
using a binomial distribution.