Some Comments about Chapter 8 of Samuels & Witmer
A lot of statistics books that are otherwise similar to S&W don't
have a chapter like this one. This chapter deals with a lot of issues
having to do with experimental design --- not so much the different
designs that will be covered in subsequent chapters, but rather with
issues that influence designs, and the interpretation of results. Many
other books would touch on some of these issues in chapters dealing with
the analysis of data from various designs, but a lot of introductory
books completely omit several of the issues that S&W introduce in this chapter.
Section 8.1
- (pp. 310-311, Observational Versus Experimental Studies)
Some have a hard time with the distinction between these two types of
studies. Maybe this is because some experiments are of the classic
type carried out in a controlled environment such as a lab or
greenhouse, and include elements such as the randomization of treatment
assignments, while other experiments can actually be carried out in the
field and have a rather simple sampling plan. The main difference is
that an observational study typically involves a sample of convenience,
instead of a random sample from a population/distribution. To deal with
a sample that's not a random sample, we try to develop a model,
and fit the model to the data. The idea is that if the model is
pretty good, the available data can be used to fit the model, and the
model can be used to make inferences about relationships. The danger is
that if the model doesn't include enough features and/or the data
doesn't "span" a board enough part of the population/distribution, the
fitted model and related inferences will suffer due to bias. Multiple
regression (which isn't covered by S&W, but will be covered in STAT
535), is the main statistical technique used for observational studies.
- (p. 311) A lot of books don't make a distinction between
explanatory variables and extraneous variables in the way
S&W do, largely because there isn't a huge difference in the way we
treat these types of variables --- some books would just lump them all
together and refer to them as explanatory variables.
Section 8.2
- (p. 311) The two sources of difficulty can be (possibly) dealt
with using multiple regression (as indicated above). Because S&W
doesn't have a chapter on multiple regression, it isn't given a lot of
specific attention; rather, it's among the methods beyond the scope of
the book that are referred to in some places.
- (pp. 313-315, Confounding and Spurious Association)
I think it would have been good to have gone right from the
"association is not causation" maxim to the concept of spurious
association. The
ultrasound example is a good one of a spurious association.
You might wonder how the
example about smoking and birthweight differs from the ultrasound
example ... why in one case the term spurious association is used, and
in the other case the issue of confounding is introduced. I
suppose the main difference is that in the ultrasound case, it may well
be that the use of ultrasound contributes nothing to a baby being born
with a low birthweight (i.e., ultrasound has no causal effect
whatsoever), while it may not be so clear that smoking isn't a
contributing factor to low birthweight, although it seems pretty clear
that we should definitely allow for the possibility that alcohol use
may be a contributing factor. It seems to me that the terms
confounder, confounded, and confounding aren't used
consistently by epidemiologists and statisticians (and others), but the
main point to get out of all of this material is that one needs to
adjust for the possible effects of other variables when assessing the
effect of a particular variable on a response variable. (Another
key point to be made at some point is that there should typically be
some doubt when determining causation, because of the possibility that
some lurking variable wasn't adjusted for --- although with a
good randomized experiment, where one has that the only systematic
difference between the treatment group and the control group is in fact
the treatment/variable being studied, it is sometimes possible to have
statistically significant evidence for causation.)
Section 8.3
- (p. 317) Some may say that the description of an experiment
given in the first sentence of the section is too limiting.
- (pp. 317-322) The more you read and hear about unusual and
unanticipated placebo effects, and experimenter biases, the more
you should be convinced that blinding (and hopefully doing a
double-blind experiment when appropriate) and using a control
sample are very important.
- (pp. 322-323, Example 8.15)
One could also randomize in such a way as to wind up with equal numbers
in each group, which is desirable for a variety of reasons (e.g., power,
and ease of doing the analysis). One can arrive at equal numbers in the
groups in several different ways (e.g., drawing one of 4 colors from a hat
or urn, but doing it without replacement, starting with N/4
chips/slips/bracelets of each color, where N is the total number
of subjects available, or by doing it the way suggested in the 2nd
paragraph following the end of the example on p. 323). A sketch of one
such scheme is given below.
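Here is a minimal sketch of such a scheme in Python (the group labels
and the value N = 100 are just placeholders, not anything from S&W); it
mimics starting with N/4 chips of each color and drawing them without
replacement:

    import random

    N = 100                          # total number of subjects (assumed divisible by 4)
    labels = ["A", "B", "C", "D"]    # hypothetical names for the 4 treatment groups
    assignments = labels * (N // 4)  # start with N/4 "chips" of each color
    random.shuffle(assignments)      # equivalent to drawing the chips without replacement
    # subject i (i = 0, ..., N-1) gets the treatment assignments[i]
    print(assignments[:10])

Every assignment with equal group sizes is equally likely under this
scheme, which is what the chips-in-an-urn description is after.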
- (p. 323, Fig. 8.3)
Instead of viewing the 4 groups as being unrelated, as might be
suggested by the arrangement of the figure, one could view them as being
"laid out" in a 2 by 2 pattern (i.e., a crossing of peer
support (yes, no) with monetary incentive (yes, no)).
- (p. 324, Why Randomize?)
The sluggish rats example, like other examples used in the chapter, is a
good one.
- (p. 324, Why Randomize?)
Randomization in no way guarantees that the treatment groups have no
important differences, but it does provide us with a way to fairly search
for a "signal" in the presence of experimental "noise." In short,
statistical procedures can reasonably account for differences just due
to the random allocation of nonidentical subjects, and allow us to
attribute any excess differences (differences not reasonably explained
by the random allocation of nonidentical subjects) to real differences
among the treatments. (An erroneous conclusion may be reached, but
consideration of the probability of a type I error addresses that
possibility.) A small randomization-test sketch illustrating this idea
is given below.
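Here is that sketch (Python, with made-up responses for two groups of 5;
this is my own illustration of the idea, not something developed in S&W's
chapter). It re-does the random allocation many times to see how often
chance alone produces a difference as large as the one observed:

    import random

    # made-up responses for 5 treated and 5 control subjects
    treatment = [14.2, 15.1, 13.8, 16.0, 15.5]
    control   = [13.0, 14.4, 12.9, 13.6, 14.1]
    observed = sum(treatment) / 5 - sum(control) / 5

    pooled = treatment + control
    count = 0
    reps = 10000
    for _ in range(reps):
        random.shuffle(pooled)                        # re-do the random allocation
        diff = sum(pooled[:5]) / 5 - sum(pooled[5:]) / 5
        if diff >= observed:
            count += 1
    # fraction of re-allocations giving a difference at least as large as
    # the one observed; a small value suggests the observed difference
    # isn't just due to the random allocation of nonidentical subjects
    print(count / reps)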
- (p. 324, Randomization and the Random Sampling Model)
I've referred to such conceptual populations before in my web page
comments on previous chapters.
Section 8.4
Viewed simplistically, blocking is a way to
work around the fact that the measurements made on the
available subjects cannot be reasonably
viewed as being the observed outcomes of independent identically distributed
random variables. (Example: measurements made on a large number of
animals from a small number of litters.) But it's more than a workaround
--- one can take advantage of the dependent observations as a way of
reducing the effects of experimental noise, by making the various
treatment groups less different from one another than completely random
allocation would typically make them. (Sometimes we create blocks when
we don't have to. When blocks don't arise "naturally" we can create
them by making matched sets of similar subjects.) So the bottom line is, by
blocking we can reduce the effect of experimental noise.
- (p. 331, Complementarity of Randomization and Blocking, 3rd
paragraph) I don't quite see the two related purposes --- or else I see
them as very related. The act of making the treatment groups
less different is a means to the end of reducing the noise and being
able to detect differences between treatments better. But in any case,
this paragraph nicely sums up what blocking does and gives us.
- (p. 332, Statistical Adjustment for Extraneous Variables)
Blocking is a way of hopefully cancelling out the effects of extraneous
variables. This is easiest to see when there are just two treatment
groups (could be one treatment and a control): the analysis is done by
first taking the differences of the measurements in each block, with the
hope being that if the subjects in each pair differ little from one
another, the contributions to the response due to other factors cancel
out with the subtraction, and what is left can mostly be attributed to
the differences between the treatments (a small sketch of this paired
analysis is given at the end of this section). E.g., age may
have a large effect on the response no matter which treatment is
applied, but if each pair has two subjects of the same age, age can no
longer be viewed as contributing to the difference in the responses for
the two members of the matched pair (block). Randomization is a
way of dealing with differences between subjects/units. By the use of
randomization we can fairly assess whether there is evidence of
treatment differences despite the presence of other sources that could
contribute to the observed differences. Still, one might be better
off trying to get rid of some of the experimental noise by blocking, and
then using randomization within blocks to "polish off the differences"
that still exist (although we don't remove the differences by
randomizing within the blocks --- so really it's more like dealing with
the differences that still exist). Alternatively, we may worry that not
enough of the differences can be cancelled out by blocking, and may
instead try to model (e.g., with a regression model)
the effects due to extraneous variables, thereby adjusting for such
differences individually, rather than trying to cancel them out.
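Here is the small paired-analysis sketch referred to above (Python, with
made-up numbers; it simply carries out the within-block subtraction
described in the text):

    # made-up responses for 5 matched pairs (blocks); in each pair one
    # member got the treatment and the other the control
    treatment = [21.3, 18.7, 25.1, 19.9, 23.4]
    control   = [19.8, 17.9, 23.6, 19.1, 22.0]

    # subtracting within each block cancels whatever the two members of
    # the block have in common (e.g., age), so what's left is mostly the
    # treatment effect plus noise
    diffs = [t - c for t, c in zip(treatment, control)]
    print(diffs, sum(diffs) / len(diffs))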
Section 8.5
- (pp. 334-336, Example 8.25)
Since all of the petri dishes were prepared from the same culture, and
presumably in the same way, there may not have been a good way to create
blocks, and so a complete randomization was done to assign 3 dishes to
each of the 43 treatments. (Note: If there is no good way to create
blocks, one loses a bit with regard to performance of statistical
procedures if blocks are created and used anyway. But if there are
factors contributing to differences between observational units, then
using such factors to create blocks can result in improved inferences.)
Later, we'll cover the analysis of data using such a nested/hierarchical
design. One of the things that you should learn then is that one would
need at least two dishes per treatment in such a study --- if there were
only one dish per treatment, there would be no way to determine if
observed variation was due to differences among dishes, or differences
due to treatments. (See the first two paragraphs on p. 337 for some
additional information about these matters.)
It can also be noted that generally it wouldn't be such a
good idea to have so many different treatments, and so few dishes per
treatment. Finally, please note the correct way to compute the
standard error estimate for the sample mean of the treatment 1
observations (a sketch of the proper and improper computations is given
below). (Some years ago I had to point out that a Ph.D. student
had done things incorrectly in a situation similar to this one. Her
advisor assured me that the incorrect way was standard practice in their
field (wetlands ecology), and rather than fix it and do it the correct
way, they kept the incorrect values and put a note to indicate that the
values weren't proper standard error estimates. I doubt that the
incorrect way is standard practice for good researchers in that field,
but given the stubbornness I observed, and the refusal to do it the right
way, it may well be that it's a commonly made mistake, and one that not
only often goes uncorrected, but is passed on to future generations of
researchers.)
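Here is the sketch referred to above (Python, with made-up dish values
and a made-up number of measurements per dish; these are not the data of
Example 8.25, just a generic illustration of the subsampling issue):

    import statistics as st

    # made-up values: 3 dishes for treatment 1, with 3 measurements per dish
    dishes = [[11.2, 11.8, 11.5],
              [13.1, 12.7, 13.4],
              [12.0, 12.3, 11.9]]

    # the dish is the experimental unit, so base the standard error of the
    # treatment mean on the 3 dish means
    dish_means = [st.mean(d) for d in dishes]
    se_proper = st.stdev(dish_means) / len(dish_means) ** 0.5

    # pooling all 9 measurements ignores dish-to-dish variation and tends
    # to understate the uncertainty in the treatment mean
    pooled = [x for d in dishes for x in d]
    se_naive = st.stdev(pooled) / len(pooled) ** 0.5

    print(se_proper, se_naive)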
- (p. 337, Determination of Sample Size, last sentence)
An example along these lines would be if the object was to estimate the
mean size of particles resulting from a certain process. Perhaps it is
suspected that the average particle size differs from batch to batch.
This leads to needing to use more than one batch! But it is also
suspected that multiple measurements made from the same batch will yield
different values --- that is, in addition to variation due to
differences between batches, there is also variation associated with
each individual batch. This leads to needing to take more than one
measurement per batch. Suppose that the expensive part of the
experiment is making the measurements (and that creating different
batches isn't a problem). If one wants to limit the total number of
measurements made to be 100, there is a trade-off to consider: one can
use a small number of batches and make many measurements per batch, or
one can use many batches and make a small number of measurements per
batch. If one can guess as to the relative amounts of variation that
will be observed both between batches and within batches, one can
determine an optimal sampling plan for which a total of 100 measurements
will be made. (I recall a case where there were 3 batches and hundreds
of measurements per batch. Unfortunately, the greatest source of
variation was the variation between different batches, and the quality
of the final estimate was severely hurt by there only being three
batches. The quality of the estimate would have been a lot better had
many more batches been used, and using fewer measurements per batch
wouldn't have hurt the quality much.) A sketch of this sort of
trade-off calculation is given below.
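Here is the sketch (Python, with guessed-at variance components that are
placeholders only, not values from any real study). With b batches and
m = 100/b measurements per batch, the variance of the grand mean is
roughly (between-batch variance)/b + (within-batch variance)/(b*m):

    # guessed-at variance components (placeholders)
    sigma2_between = 4.0     # batch-to-batch variance
    sigma2_within  = 1.0     # within-batch (measurement) variance
    total = 100              # total number of measurements allowed

    def var_of_grand_mean(b):
        # b batches, m = total/b measurements per batch
        m = total / b
        return sigma2_between / b + sigma2_within / (b * m)

    for b in [2, 4, 5, 10, 20, 25, 50, 100]:
        print(b, round(var_of_grand_mean(b), 4))
    # with these guesses the between-batch term dominates, so many batches
    # with few measurements each beats a few batches with many
    # measurements each

With the within-batch term fixed by the 100-measurement budget, it's the
number of batches that mostly determines the quality of the estimate.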
Section 8.6
- (p. 338) The term margin of error isn't used consistently,
and because of this, I don't like to use it at all. (I prefer to make
it clear how I'm expressing information about the uncertainty associated
with an estimate, either by giving an estimated standard error (and
clearly stating that this is what I'm doing), or by giving a confidence
interval.) S&W indicate that the margin of error is the +/- part
of a confidence interval, but some take it to be the estimated standard
error, and it should be realized that one can have 95%, 90%, 99%, or
some other type of confidence interval (and so indicating that it's the
+/- part of a confidence interval isn't very meaningful unless the
confidence level is also given). (During election time, and in
reporting on results of polls in general, I often see the term margin
of error used, but have never seen it specified just what is meant. A
small numerical illustration of how the different conventions differ is
given below.)
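Here is the illustration (Python; the poll numbers are made up). For a
sample proportion p-hat based on n responses, the estimated standard
error is sqrt(p-hat(1 - p-hat)/n), and the +/- part of a confidence
interval is a multiple of it that depends on the confidence level:

    n, phat = 1000, 0.52                 # made-up poll: 52% of 1000 respondents
    se = (phat * (1 - phat) / n) ** 0.5  # estimated standard error, about 0.016
    print(round(se, 3))
    print(round(1.645 * se, 3))          # +/- part of a 90% confidence interval
    print(round(1.960 * se, 3))          # +/- part of a 95% confidence interval
    print(round(2.576 * se, 3))          # +/- part of a 99% confidence interval
    # "margin of error" could refer to any of these values, so the term by
    # itself doesn't pin down what is being reported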
- (p. 339) The last three paragraphs (about missing data) are
important. A related problem is the one of how to decide if an unusual
observation should be used or omitted. (If a value is unusual, and
obviously the result of some sort of experimental or measurement error,
it should be omitted. If it's unusual but thought to not be an error,
then it shouldn't be omitted. The problem is that sometimes we don't
know if it's an error or not.)
- (pp. 340-341, Randomized Response Sampling)
This section deals with something that isn't commonly done. I'm not
going to cover it in class.
Section 8.7
- (p. 343) The first paragraph (labeled 3) is interesting.
The main point is that if the subjects can't be viewed as being randomly
sampled from, or at least representative of, some larger population, but
they are randomly allocated to the treatment groups, then if statistically
significant evidence of a difference is observed, one can think that the
difference isn't just due to the random assignment of different subjects
to equivalent treatments, but rather one can think that the treatment
does something. This allows us to conclude that the treatment affects
some subset of the overall population, but the problem is that
we don't know very much about the size and characteristics of the subset.
But in the initial stages of research on a treatment, I suppose that
evidence that the treatment affects some subset of the population can be
taken as a minor victory. So doing an experiment using a sample of
convenience instead of a randomly chosen sample can sometimes result in
something worth noting. But of course it would be better to have a
statistically significant result from a random sample, so that the
conclusion can be applied to the population from which the sample was
drawn.