Some Comments about Chapter 3 of Samuels & Witmer
Section 3.1
I don't know of any other introductory statistics books that are as good
with regard to addressing the issues that are covered on pp. 75-77 of
S&W. (Even though S&W is a somewhat low-level book, it's a good one,
and you can really learn a lot from it if you read it carefully and
spend adequate time trying to understand most of the many points that
are covered.) Because we're going to cover the first portion of S&W (a
lot of which should be review material) pretty quickly, I won't spend a
lot of time on this material in class, but I will hit upon some of the
points from time to time.
- (p. 72) Note that the paragraph underneath the gray box describes 2
ways of choosing a simple random sample. The first way seems to me to be
the easier one, and it is the most commonly used scheme.
- (p. 72, Choosing a Random Sample) The paragraph that begins
at the bottom of the page gives
two applications of random sampling.
The first is selecting
sample members from a larger population. Sometimes, but not always,
this is actually
implemented. For example, when GMU conducts a survey of its faculty,
presumably names are randomly selected from a long list and those people
are requested to supply the desired information. (In this case, while a
random sample was the aim, unless the list is accurate, the
randomization is done properly, and everyone selected responds, the end
result is not a true random sample. The analysis may be done assuming
that it is, but one should be concerned about the introduction of sampling
bias (see p. 75). (The particular type of sampling bias in this
example is nonresponse bias, and
the fear is that the tendency to not participate in the survey is
correlated with what is being observed. In studies involving animals or
medical patients, one can still have nonresponse bias --- not necessarily
because a
subject refuses to participate, but perhaps because some subjects are
too weak to be included in the study, and omitting the weak ones biases
the results.) Nevertheless, because the population is clearly
defined
(assuming an appropriate definition of GMU's faculty is determined
(e.g., making decisions to include or exclude groups such as part-time
faculty, administrative faculty, etc.)) and not too hard to access, it
would be possible to actually get a random sample if a big enough effort was
made.) But often, it just isn't practical to get a simple
random sample. For example, in a study of the intelligence of a certain
type of monkey which lives in the wilds of Africa, it would be
impossible to assign ID numbers to all of the monkeys, randomly draw a
set of ID numbers, and then go collect the monkeys whose numbers were
drawn and use these randomly selected monkeys for the intelligence
study. Instead, we may have to be content to use whatever collection of
such monkeys we can get, and hope to extrapolate the results to the
larger population even though we aren't able to use a random sample.
(This example is similar to Example 3.6 on p. 76. Also note that
the 2nd paragraph on p. 75 addresses some of the points made here, and
the last paragraph on p. 77 pertains to the extrapolation of research
findings.)
The second use of random sampling is to allocate a collection of
subjects (perhaps obtained by sampling) to various treatment groups.
One should always strive to do this in some way (there are various
strategies corresponding to different experimental designs). By
randomly assigning subjects to treatment groups, we can sometimes get
meaningful conclusions from an experiment even if the collection of
subjects assigned were not randomly drawn from a larger population.
(For instance, in Example 1.1 on p. 2, it wouldn't be horrible
if the sheep were not randomly drawn from a larger population, as long
as they were randomly assigned to the two groups. It's the random
assignment that gives us a fair way to assess the effectiveness of the
vaccine on the 48 sheep used in the study, and it seems reasonable that
the results from these 48 sheep can be extrapolated to some larger
group of sheep. While it may be vague as to what the appropriate larger
group is, the main thing is that the experiment indicates that the
vaccine is effective on some population of sheep. But if instead of random assignment of the sheep to the
two treatment groups, the 24 healthiest looking sheep were given the
vaccine, or just the male sheep were given the vaccine, then the
results of the experiment wouldn't mean the same thing, since it wouldn't
be clear if the vaccine is effective or if healthy sheep can fight off
anthrax better than weaker sheep, or males are much more resistant to
anthrax than females.)
However, sometimes the nature of the groups to be compared makes random
assignment of subjects impossible (for instance, see Example 1.4 on pp. 3-4
and Example 1.6 on pp. 5-6), in which case having a random sample
for each group (from each respective population) becomes much more
important.
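Going back to the random-assignment point (Example 1.1), here's a minimal
Python sketch of randomly assigning subjects to two treatment groups (the
subject labels are just placeholders; the 48 subjects and two groups of 24
match Example 1.1):

  import random

  random.seed(0)                                         # any seed; just for reproducibility
  subjects = ["sheep " + str(i) for i in range(1, 49)]   # 48 subjects, as in Example 1.1
  random.shuffle(subjects)                               # put the subjects in a random order
  vaccinated, control = subjects[:24], subjects[24:]     # first 24 get the vaccine
  print(len(vaccinated), len(control))                   # 24 24
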
- (p. 74, Example 3.1)
*** minor mistake in book *** On the third
line of p. 74, column 12 should be column 11.
- (p. 74, Example 3.1) This is similar to what you are to do
for Exercise 3.2 on p. 77, which I have suggested that you do in a
certain way on the
homework web page. Specifically, so
that everyone who does it correctly will obtain the same result (making
it possible for me to indicate what the correct answer is), instead of
picking a starting point randomly, you are to start with the set of 5
digits in the upper left corner. Then you are to go down that first
column. Rather than use the 2nd and 3rd digits in the column as is done
in the book (see highlighted portion of Table 3.1 on p. 74), use the last
two digits (the 4th and 5th digits).
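Here is a rough Python sketch of the general digit-table scheme just
described (the population size of 75 is a made-up value, not the one in
Exercise 3.2, and random.randint plays the role of reading successive
pairs of digits from the table):

  import random

  random.seed(1)
  population_size = 75        # made-up value for illustration
  sample_size = 5

  chosen = []
  while len(chosen) < sample_size:
      pair = random.randint(0, 99)        # stands in for the next two table digits
      if 1 <= pair <= population_size and pair not in chosen:
          chosen.append(pair)             # skip out-of-range pairs and repeats

  print(chosen)                           # the 5 selected ID numbers
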
- (p. 75, line 7) The concept of chance error due to sampling is an
important one. Even when randomization is properly used, it is still
possible to get a misleading result; i.e., the collection of sample values
may be
rather unusual in some way. Even if the sample isn't highly unusual, it
is rare that a random sample exactly reflects the population. In either
case, inferences are subject to error even if the random sample is
properly collected. Because of this, not only is it important
to use statistical procedures which should minimize this sort of error,
but we should also try to indicate (quantitatively) something about the
accuracy associated with our best inferences ... because we always
expect to have some amount of error due to sampling.
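A small simulation (with a completely made-up population) may help make
the point concrete: even with a properly drawn simple random sample, the
sample mean misses the population mean by a different amount each time.

  import random

  random.seed(0)
  population = [random.gauss(50, 10) for _ in range(10000)]   # made-up population
  pop_mean = sum(population) / len(population)

  for _ in range(5):
      sample = random.sample(population, 25)      # a proper simple random sample
      sample_mean = sum(sample) / len(sample)
      print(round(sample_mean - pop_mean, 2))     # the chance error due to sampling
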
- (p. 75, Example 3.2)
Although this doesn't help with the problem of the net not getting the
small fish, to help reduce other sources of bias, the fishing locations
should be randomly chosen (perhaps using some sort of stratified
sampling plan (which unfortunately may not be covered in S&W)). Since
smaller fish and larger fish may tend to be differently distributed
throughout the bay, it would be bad to do all of the fishing in the same
location if the purpose was to learn something about the fish in the
entire bay. *** A similar bias could occur if the intelligence of
monkeys was studied using whatever monkeys were available instead of
trying to get a collection that would be similar to a random sample,
since perhaps it's the less intelligent monkeys that are more often
caught and distributed to researchers. (Anytime a sample of convenience
is used in animal studies, I guess there is a fear that the animals used
may be on the average slower (in thought and/or speed) than those
in the larger population from which they were captured.)
- (p. 75, Example 3.3) On the other hand (in contrast to what is
suggested in the book), it could be that smaller nerves are less likely
to be selected if the sampling isn't done carefully. For example, if we
randomly pick a spot on p. 78, a larger
oval is more likely to be hit than is a smaller one. (Important
note: Often, as is the case here, randomly is meant to mean
uniformly at random. With a discrete set this means that each
element is equally likely to be selected. With a continuum of
points, like the points of a line interval or the locations on a page,
the notion is a bit harder to describe (we'll address it later in Ch.
3), but it still means that no point is favored over any other point
when a selection is made. (In picking the nerve to be measured in
Example 3.3 or picking an oval from p. 78, if we randomly stab at
a spot, instead of first identifying all of the possibilities and then
picking one by giving each the same chance to be chosen, we're more
likely to poke at a large one than a small one.) However, sometimes randomly does not mean
uniformly at random --- something could be random but follow a normal
distribution instead of a uniform distribution.)
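The stab-at-a-spot point can be illustrated with a quick simulation (the
sizes below are made up): when the chance of selection is proportional to
size, the average size of the selected items is larger than the true
average.

  import random

  random.seed(0)
  sizes = [1, 1, 1, 1, 1, 5, 5, 10]                 # made-up "sizes"; true mean is 3.125

  uniform_picks = [random.choice(sizes) for _ in range(100000)]
  stab_picks = random.choices(sizes, weights=sizes, k=100000)   # chance proportional to size

  print(sum(uniform_picks) / len(uniform_picks))    # close to 3.125
  print(sum(stab_picks) / len(stab_picks))          # close to 6.2 (= 155/25, the size-weighted average)
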
- (pp. 75-76, Example 3.4)
This example, and the paragraph that follows it, address a point that a
lot of books fail to cover. (There is a lot that can be learned from a
careful reading of this book. Despite some mistakes and some things that
aren't done really well, on the whole it's a pretty good book.)
- (p. 76, Example 3.5) Here the sample used came from
crossbred plants created especially for the experiment as opposed to
being drawn from a larger population --- it could be that there is no
larger population of plants occurring naturally from which a good random
sample could have been drawn. Here we hope the
sample is representative of what may occur if more such plants are
created. It may be stretching things to claim that they are
representative of all such plants to be grown in the future since
different growing conditions may have some effects. I would guess that
in this setting, it's the comparisons of resistances of the progeny
plants with the other two types of plants that is the key focus, and the
important thing is that all three are grown in similar conditions,
exposing them to the same things. (That is, this example is similar to
Example 3.7 in that the meaningful population is a bit vague, and
the important thing is the comparison of treatments --- the experiment
seems like it may yield good information even if it isn't completely
clear to what extent the results can be extrapolated.)
Section 3.3
- (p. 78) I use P(E) instead of Pr{E}. P is a
function, but unlike f(x) = 1/x, where you plug in a
nonzero number and get out a number (the reciprocal of the number
plugged in), with P you plug in an event (which is just a subset of the
sample space) and get out a number (the probability of the event).
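To make the analogy concrete, here's a tiny Python version of P for a
fair-die experiment (Fraction is used just to keep the arithmetic exact):

  from fractions import Fraction

  sample_space = {face: Fraction(1, 6) for face in range(1, 7)}   # a fair die

  def P(event):
      # the input is an event (a subset of the sample space); the output is a number
      return sum(sample_space[outcome] for outcome in event)

  print(P({1, 3, 5}))   # 1/2, the probability of rolling an odd number
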
- (p. 80) I've never seen the double arrow used in this situation.
(My guess is that Ch. 3 is the weakest chapter in the book.) Most books
would use a single arrow (like I did in class) with the arrow pointing
from the ratio to the probability. Formally, it is the law of large
numbers that gives us that the ratio converges to the true
probability as the number of trials increases. It should be noted that
we don't always have to adopt the frequency interpretation ---
Example 3.10 came up with a probability using the notion of
equally-likely outcomes, using the fact that if 30% of the flies are
black, and one is chosen at random (giving each fly the same chance of
being selected), the probability of picking a black fly is 0.3. But
even though here we can determine exactly what the probability is, if we
estimate it with a ratio of the type on p. 80 by repeating the experiment
over and over, the estimate should be very close to 0.3 with high
probability after a sufficiently large number of trials.
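Here's a quick simulation of the relative-frequency idea for the fly
example: with the probability of a black fly set to 0.3, the observed
ratio settles down near 0.3 as the number of trials grows.

  import random

  random.seed(0)
  for n in (10, 100, 10000, 1000000):
      black = sum(random.random() < 0.3 for _ in range(n))   # number of black flies drawn
      print(n, black / n)                                    # the ratio approaches 0.3
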
Section 3.4
This section should follow Sec. 3.5, since it uses the concepts of
independence and conditional probability, which aren't
introduced until Sec. 3.5.
(On p. 91 (near the top, right after Example 3.22) it is stated
that material from Sec. 3.5 is used in Sec. 3.4.)
(I'm surprised that the middle part of Ch. 3
is so bad, considering that other parts of the book are so good.)
- (p. 84) Both trees shown on this page use the concept of
independent events (which isn't introduced until the next section of
the book). Specifically, it is being assumed that the outcome of the
first coin toss does not influence the probabilities associated with the
second toss (which seems like a reasonable assumption). The
probabilities given in the tree at the bottom of the page follow from
the notions of equally-likely outcomes and independent events. I'm
not really sure why the book has "the relative frequency interpretation of
probability can be a guide to the appropriate combination of the
probabilities of subevents" --- the statement makes little or no sense
to me, and I don't recall having seen the term "subevent" used in a probability
text before.
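If the bottom tree is about two tosses of a fair coin (as the surrounding
discussion suggests), the branch probabilities come from the following
sort of computation, assuming equally likely faces and independent tosses:

  from itertools import product

  for first, second in product("HT", repeat=2):
      prob = 0.5 * 0.5                  # independence lets us multiply along the branches
      print(first + second, prob)       # each of HH, HT, TH, TT has probability 0.25
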
- (p. 85) The paragraph right before Example 3.16 indicates
that the population needs to be large. Actually, if two flies are taken
one after another (without replacing the first fly before drawing
another fly --- so two different flies are taken) from a population with
exactly 30% black flies, then the probability of obtaining two black
flies is a bit less than (0.3)(0.3) = 0.09. For example, if the
population has 3 black flies and 7 gray flies, then the probability that
two black flies will be drawn is (3/10)(2/9) = 1/15, which is not equal
to 0.09. (The 2/9 is due to the fact that if a black fly is selected on
the first draw, only 2 of the 9 flies which remain at the time of the
second draw are black. (Note: This type of stuff isn't so
important in STAT 535. Of course, it isn't really difficult either, and
so there is no good reason why you shouldn't try to understand it.))
If we have 100 flies instead of just 10, the probability of getting two
black flies is (30/100)(29/99) = 29/330, which is about 0.0879. If there
are 1000 flies, the probability is 0.0898, which is pretty close to 0.09.
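The computations above can be summarized with a few lines of Python
(Fraction keeps the arithmetic exact; the population is taken to be
exactly 30% black in each case):

  from fractions import Fraction

  for N in (10, 100, 1000, 100000):
      black = 3 * N // 10                                     # exactly 30% black flies
      prob = Fraction(black, N) * Fraction(black - 1, N - 1)  # both draws black, no replacement
      print(N, float(prob))                                   # approaches (0.3)(0.3) = 0.09
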
- (pp. 85-86, Example 3.16)
The probability calculation right above Fig. 3.5 is making use of
Rule 4 on p. 89 of the next section. Conditional probabilities are also
being employed, and Rule 7 on p. 91 is being used to get each of the two
terms in the sum.
- (pp. 86-87, Example 3.17)
The probability calculation right above Fig. 3.6 is making use of
Rule 4 on p. 89 of the next section. Conditional probabilities are also
being employed, and Rule 7 on p. 91 is being used to get each of the two
terms in the sum.
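In generic form (with made-up numbers, not the ones in Examples 3.16 and
3.17), the kind of calculation being referred to looks like this: each of
the two terms in the sum is a conditional probability times the
probability of the conditioning event.

  p_case1, p_case2 = 0.3, 0.7                 # probabilities of two non-overlapping cases
  p_E_given_case1, p_E_given_case2 = 0.5, 0.2

  p_E = p_E_given_case1 * p_case1 + p_E_given_case2 * p_case2   # one term per case, then add
  print(p_E)                                  # 0.29 (apart from floating-point rounding)
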
Section 3.5
Although I find this stuff fascinating, I'm not going to have
you do a lot with these probability rules. There may be some HW
problems pertaining to them, but except for the really basic things that
I'll use over and over again, these rules won't be emphasized on the
exams. You'll find that as we go through the course, I'll refer to some
relatively simple rules a lot in justifying material from other parts of
the book, while other rules won't be referred to nearly as often.
Specifically, Rule 3 and Rule 4 (two rather simple rules) will be used from
time to time, and Rule 6 will be used a lot. Rule 1 is so trivial that
we may not notice it being used.
- (p. 88)
*** mistake in book ***
Rule 2 isn't correct. It's possible that the sum of
probabilities for various events exceeds 1. For example, consider
the random experiment of rolling a fair die and observing the number of
spots on the upward face. The sample space is
S = {1, 2, 3, 4, 5, 6}. Letting
A = {1, 3, 5} (an odd integer is obtained),
B = {2, 4, 6} (an even integer is obtained),
C = {3, 6} (an integer divisible by 3 is obtained),
we have
P(A) +
P(B) +
P(C) = 1/2 + 1/2 + 1/3 = 4/3, which is greater than 1.
To partially fix things, one could specify that the sum of the probabilities of
all possible outcomes (in the sample space) is equal to 1, but
even then we'd need to specify that the sample space is finite or
countably infinite (which is a term that you don't have to be concerned
about). For example, with the fair die we have
P( {1} ) +
P( {2} ) +
P( {3} ) +
P( {4} ) +
P( {5} ) +
P( {6} ) = 1/6 + 1/6 + 1/6 + 1/6 + 1/6 + 1/6 = 1. It can also be noted
that most probability books have P(S) = 1 as one of the basic
rules.
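The die counterexample can be checked in a few lines (again using
Fraction for exact arithmetic):

  from fractions import Fraction

  S = {1, 2, 3, 4, 5, 6}

  def P(event):
      return Fraction(len(event), len(S))    # equally likely outcomes

  A, B, C = {1, 3, 5}, {2, 4, 6}, {3, 6}
  print(P(A) + P(B) + P(C))                  # 4/3, greater than 1 (C overlaps both A and B)
  print(sum(P({s}) for s in S))              # 1, as the partial fix requires
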
- (p. 90) The concept of independent events is an important
one.
Section 3.6
Once again, I find it odd how S&W organizes the material in Ch. 3.
Random variables, which are formally introduced in the next section,
should be covered before getting into density curves (which
should formally be referred to as probability density functions
(pdfs), but are commonly just called densities).
Densities are very important and will be used a lot throughout the course.
I'll often sketch densities on the board to provide an indication of the
shape associated with the parent distribution of a random sample. (By
parent distribution I mean the distribution associated with
making a random observation of some variable --- in some cases it will
be assumed that making a random observation (that is, randomly selecting a
subject or location, and making a measurement) is well approximated by
observing a normally distributed random variable (normal distributions
are covered in Ch. 4), but in other cases I may need to indicate a
skewed, heavy-tailed, or bimodal distribution, and drawing a quick sketch
of the assumed density is a good way to convey this information.)
I'll also sketch densities of the sampling distributions
of certain estimators and test statistics
(sampling distributions for estimators are covered in Ch. 5, and
sampling distributions of test statistics are covered in Ch. 7 after
hypothesis testing is introduced).
(Note: A sampling distribution is just the distribution for
some statistic (e.g., an estimator or a test statistic). It's
just that because a statistic is a function of the random variables
associated with a random sample (so a function of more than one
random variable), its value depends upon the values in the sample,
and we refer to its distribution as a sampling distribution.)
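As a small illustration of the sampling-distribution idea (with a made-up
normal parent distribution), the sample mean of 25 observations is a
statistic, so it has its own distribution, which we can look at by
simulating many samples:

  import random
  from statistics import mean, stdev

  random.seed(0)
  sample_means = [mean(random.gauss(50, 10) for _ in range(25)) for _ in range(10000)]
  print(mean(sample_means))    # close to 50, the mean of the parent distribution
  print(stdev(sample_means))   # close to 10 / sqrt(25) = 2
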
- (p. 93) If we think of the three parts of Fig. 3.9 as
graphics which provide some indication of the relative likelihood of
observing the various possible values of something when a random
observation is made,
it should seem sensible that a smooth curve, like the one labeled (c),
is preferable to the choppy histogram shapes of (a) and (b). Looking at
(a), it should seem odd if the likelihood remained absolutely constant
between 150 and 160, and then jumped to a different value and remained
constant between 160 and 170 --- instead, a gradual change would seem to
make more sense.
- (p. 94) The blue box is very important --- the main idea is that
probabilities associated with a continuous variable having a density
correspond to areas under the curve.
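To make the area idea concrete, here's a sketch with a made-up density (a
triangular one on the interval from 0 to 10) and a crude Riemann-sum
approximation of an area under it:

  def f(y):
      # a made-up density: f(y) = (10 - y)/50 for 0 <= y <= 10, and 0 elsewhere
      return (10 - y) / 50 if 0 <= y <= 10 else 0.0

  a, b, n = 2, 5, 100000
  width = (b - a) / n
  area = sum(f(a + (i + 0.5) * width) for i in range(n)) * width
  print(area)    # about 0.39 -- the probability of observing a value between 2 and 5
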
- (pp. 94-95, The Continuum Paradox) Although it may seem
very odd at first that the probability of observing a
particular value exactly is 0 with a continuous variable,
I will try to convince you in class that this is the only value that
makes sense for such a probability. An important associated point is
that unlike the case with discrete variables (really, discrete random
variables --- which are not formally introduced until the next section),
with continuous (random) variables, a probability of 0 for a particular
outcome doesn't mean that outcome is impossible. (Again, this may seem
a bit odd (or very odd) if you're encountering this for the first time.)
- (p. 95, Example 3.29)
Note that with a continuous variable it doesn't matter whether the
endpoints, in this case 100 and 150, are included or not, since the
probability associated with each of these points is 0. But with a discrete
variable, whether or not the endpoints are to be included can matter, since the
associated probabilities may be positive.
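Using the same made-up triangular density from the sketch above, here's a
numerical look at why a single point gets probability 0: the probability
of landing within a half-width h of the value 4 shrinks to 0 as h shrinks.

  def F(y):
      # area under the triangular density from 0 to y (obtained by calculus)
      return (10 * y - y * y / 2) / 50

  for h in (1, 0.1, 0.001, 0.00001):
      print(h, F(4 + h) - F(4 - h))    # the probability shrinks toward 0 with h
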
Section 3.7
- (pp. 96-97, Example 3.31
and Example 3.34)
When we observe the number of spots
on the upward face of a die, we are observing an object (the upward face
of the die) that can take on various values. When we observe the height
of a randomly selected man, it's not that a man's height
fluctuates randomly, but rather it's the act of randomly
selecting a man and observing his height that can produce various
values according to a probability distribution. The common element is
that prior to the act of making an observation, more than one value
could occur, and the various values will occur according to a
probability distribution.
- (p. 97) Here is a way of describing the mean (aka expected value)
of a random variable: it is a weighted average of the random variable's
possible outcomes --- for a discrete random variable, each possible
outcome is weighted by its probability; for a continuous random
variable the density function provides the weights (and an integral has
to be used instead of a sum).
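For a discrete random variable, the weighted average is easy to compute
directly; here it is for the number of spots on a fair die (Fraction
keeps the arithmetic exact):

  from fractions import Fraction

  values_and_probs = {x: Fraction(1, 6) for x in range(1, 7)}   # a fair die

  mean = sum(x * p for x, p in values_and_probs.items())        # each value weighted by its probability
  print(mean)    # 7/2, i.e., 3.5
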
- (p. 100, Rules of Means of Random Variables)
The two rules are important for understanding the derivation of certain
statistics, but aren't used so much in actual data analysis.
- (p. 101, Rules of Variances of Random Variables)
The two rules are important for understanding the derivation of certain
statistics, but aren't used so much in actual data analysis.
Note that Rule 4 pertains to independent random variables only.
The book doesn't have a lot about independent random variables ---
basically, if two random variables are independent then the value that
one assumes doesn't affect the value that the other one assumes.
(See bottom portion of p. 100 for more information.)
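A quick simulation of the sort of fact involved here (using made-up
distributions): for independent random variables, the variance of the sum
equals the sum of the variances.

  import random
  from statistics import variance

  random.seed(0)
  n = 100000
  X = [random.gauss(0, 2) for _ in range(n)]       # variance 4
  Y = [random.uniform(0, 12) for _ in range(n)]    # variance 12

  print(variance(X) + variance(Y))                 # roughly 16
  print(variance([x + y for x, y in zip(X, Y)]))   # also roughly 16, since X and Y are independent
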
Section 3.8
- (p. 103, 1st paragraph) The two possible outcomes are generically
referred to as success and failure. The success
is the outcome which has probability p, and in some cases it may
be the outcome that seems to be the less desirable one in a practical
sense. For example, in a study of death rates, it may be convenient to
focus on the deaths (which may be relatively rare) instead of the
survivals. If we want to use p for the probability that a
randomly chosen subject corresponds to a death, then a death is
considered a success --- a bit odd perhaps, but try to get used to it.
(Note the Remark on p. 106.)
- (pp. 103-104, Example 3.44) The blue box above the example
stresses the independence of the trials. The probability calculations
in the example get across the concept of what is meant by independent
trials. If we let
A1 be the event that the first trial is a success
and A2 be the event that the second trial is a success
(noting that in this example, an albino child is considered to be a
"success" and a nonalbino a "failure"), the independence of the trials
means that these two events are independent (the chance that
A2 occurs does not depend on whether or not
A1 occurs), which gives us that the probability of the
intersection of these two events (which corresponds to two albino
children) is just the product of the probabilities of the events, which
is p*p = p^2.
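The p*p = p^2 calculation can also be checked by simulation (the value of
p below is just an illustrative choice, not necessarily the one in the
example): the fraction of simulated families in which both independent
trials are successes comes out close to p^2.

  import random

  random.seed(0)
  p, n = 0.25, 100000      # p is an illustrative value
  both = 0
  for _ in range(n):
      first_success = random.random() < p
      second_success = random.random() < p     # independent of the first trial
      if first_success and second_success:
          both += 1
  print(both / n, p * p)   # the two numbers should be close
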
- (p. 105) When speaking about the binomial coefficient
nCj, people often say "n choose j" due to
the fact that its value is equal to the number of different subsets of
size j that can be created (chosen) from a set of n items.
(There is another commonly used notation for the binomial coefficient
nCj which I'll often use when writing on
the board.) In the formula for the binomial distribution probabilities,
the n choose j factor is due to there being that number of
different sequences of successes and failures having j successes
among the n trials. So instead of adding two terms having
probability p(1 - p) as is done near the middle of p. 104
for the case of n being 2 and j being 1, for the
probability of j successes in n trials, there are
nCj terms to add, each having probability
p^j (1 - p)^(n - j).
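Written out in Python, the formula looks like this (math.comb(n, j) is
the "n choose j" count; the p = 0.25 below is just an illustrative
value):

  from math import comb

  def binomial_prob(j, n, p):
      # probability of j successes in n independent trials with success probability p
      return comb(n, j) * p**j * (1 - p)**(n - j)

  print(binomial_prob(1, 2, 0.25))    # 2 * 0.25 * 0.75 = 0.375, the n = 2, j = 1 case
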
- (p. 110, Example 3.50) This example indicates that not
every situation with a collection of binary outcomes should be modeled
using a binomial distribution.