Some Comments about Chapter 8 of Samuels & Witmer
A lot of statistics books that are otherwise similar to S&W don't
have a chapter like this one. This chapter deals with a lot of issues
having to do with experimental design --- not so much the different
designs that will be covered in subsequent chapters, but rather with
issues that influence designs, and the interpretation of results. Many
other books would touch on some of these issues in chapters dealing with
the analysis of data from various designs, but a lot of introductory
books completely omit several of the issues that S&W introduce in this chapter.
Section 8.1
- (pp. 310-311, Observational Versus Experimental Studies)
Some have a hard time with the distinction between these two types of
studies. Maybe this is because some experiments are of the classic
type carried out in a controlled environment such as a lab or
greenhouse, and include elements such as the randomization of treatment
assignments, while other experiments can actually be carried out in the
field and have a rather simple sampling plan. The main difference is
that an observational study typically involves a sample of convenience,
instead of a random sample from a population/distribution. To deal with
a sample that's not a random sample, we try to develop a model,
and fit the model to the data. The idea is that if the model is
pretty good, the available data can be used to fit the model, and the
model can be used to make inferences about relationships. The danger is
that if the model doesn't include enough features and/or the data
doesn't "span" a board enough part of the population/distribution, the
fitted model and related inferences will suffer due to bias. Multiple
regression (which isn't covered by S&W, but will be covered in STAT
535), is the main statistical technique used for observational studies.
- (p. 311) A lot of books don't make a distinction between
explanatory variables and extraneous variables in the way
S&W do, largely because there isn't a huge difference in the way we
treat these types of variables --- some books would just lump them all
together and refer to them as explanatory variables.
Section 8.2
- (p. 311) The two sources of difficulty can be (possibly) dealt
with using multiple regression (as indicated above). Because S&W
doesn't have a chapter on multiple regression, it isn't given a lot of
specific attention; rather, it's among the methods beyond the scope of
the book that are referred to in some places.
- (pp. 313-315, Confounding and Spurious Association)
I think it would have been good to have gone right from the
"association is not causation" maxim to the concept of spurious
association. The
ultrasound example is a good one of a spurious association.
You might wonder how the
example about smoking and birthweight differs from the ultrasound
example ... why in one case the term spurious association is used, and
in the other case the issue of confounding is introduced. I
suppose the main difference is that in the ultrasound case, it may well
be that the use of ultrasound contributes nothing to a baby being born
with a low birthweight (i.e., ultrasound has no causal effect
whatsoever), while it may not be so clear that smoking isn't a
contributing factor to low birthweight, although it seems pretty clear
that we should definitely allow for the possibility that alcohol use
may be a contributing factor. It seems to me that the terms
confounder, confounded, and confounding aren't used
consistently by epidemiologists and statisticians (and others), but the
main point to get out of all of this material is that one needs to
adjust for the possible effects of other variables when assessing the
effect of a particular variable on a response variable. (Another
key point to be made at some point is that there should typically be
some doubt when determining causation, because of the possibility that
some lurking variable wasn't adjusted for --- although with a
good randomized experiment, where one has that the only systematic
difference between the treatment group and the control group is in fact
the treatment/variable being studied, it is sometimes possible to have
statistically significant evidence for causation.)
Section 8.3
- (p. 317) Some may say that the description of an experiment
given in the first sentence of the section is too limiting.
- (pp. 317-322) The more you read and hear about unusual and
unanticipated placebo effects, and experimenter biases, the more
you should be convinced that blinding (and hopefully doing a
double-blind experiment when appropriate) and using a control
sample are very important.
- (pp. 322-323, Example 8.15)
One could also randomize in such a way as to wind up with equal numbers
in each group, which is desirable for a variety of reasons (e.g., power,
and ease of doing the analysis). One can arrive at equal numbers in the
groups in several different ways (e.g., drawing one of 4 colors from a hat
or urn, but doing it without replacement, starting with N/4
chips/slips/bracelets of each color, where N is the total number
of subjects available, or by doing it the way suggested in the 2nd
paragraph following the end of the example on p. 323). A sketch of one
such scheme is given below.
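Here is a minimal sketch of such a scheme in Python (the group labels
and the value N = 100 are just placeholders, not anything from S&W); it
mimics starting with N/4 chips of each color and drawing them without
replacement:

    import random

    N = 100                          # total number of subjects (assumed divisible by 4)
    labels = ["A", "B", "C", "D"]    # hypothetical names for the 4 treatment groups
    assignments = labels * (N // 4)  # start with N/4 "chips" of each color
    random.shuffle(assignments)      # equivalent to drawing the chips without replacement
    # subject i (i = 0, ..., N-1) gets the treatment assignments[i]
    print(assignments[:10])

Every assignment with equal group sizes is equally likely under this
scheme, which is what the chips-in-an-urn description is after.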
- (p. 323, Fig. 8.3)
Instead of viewing the 4 groups as being unrelated, as might be
suggested by the arrangement of the figure, one could view them as being
"laid out" in a 2 by 2 pattern (i.e., a crossing of peer
support (yes, no) with monetary incentive (yes, no)).
- (p. 324, Why Randomize?)
The sluggish rats example, like other examples used in the chapter, is a
good one.
- (p. 324, Why Randomize?)
Randomization in no way guarantees that the treatment groups have no
important differences, but it does provide us with a way to fairly search
for a "signal" in the presence of experimental "noise." In short,
statistical procedures can reasonably account for differences just due
to the random allocation of nonidentical subjects, and allow us to
attribute any excess differences (differences not reasonably explained
by the random allocation of nonidentical subjects) to real differences
among the treatments. (An erroneous conclusion may be reached, but
consideration of the probability of a type I error addresses that
possibility.) A small randomization-test sketch illustrating this idea
is given below.
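Here is that sketch (Python, with made-up responses for two groups of 5;
this is my own illustration of the idea, not something developed in S&W's
chapter). It re-does the random allocation many times to see how often
chance alone produces a difference as large as the one observed:

    import random

    # made-up responses for 5 treated and 5 control subjects
    treatment = [14.2, 15.1, 13.8, 16.0, 15.5]
    control   = [13.0, 14.4, 12.9, 13.6, 14.1]
    observed = sum(treatment) / 5 - sum(control) / 5

    pooled = treatment + control
    count = 0
    reps = 10000
    for _ in range(reps):
        random.shuffle(pooled)                        # re-do the random allocation
        diff = sum(pooled[:5]) / 5 - sum(pooled[5:]) / 5
        if diff >= observed:
            count += 1
    # fraction of re-allocations giving a difference at least as large as
    # the one observed; a small value suggests the observed difference
    # isn't just due to the random allocation of nonidentical subjects
    print(count / reps)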
- (p. 324, Randomization and the Random Sampling Model)
I've referred to such conceptual populations before in my web page
comments on previous chapters.
Section 8.4
Viewed simplistically, blocking is a way to
work around the fact that the measurements made on the
available subjects cannot be reasonably
viewed as being the observed outcomes of independent identically distributed
random variables. (Example: measurements made on a large number of
animals from a small number of litters.) But it's more than a workaround
--- one can take advantage of the dependent observations as a way of
reducing the effects of experimental noise, by making the various
treatment groups less different from one another than completely random
allocation would typically make them. (Sometimes we create blocks when
we don't have to. When blocks don't arise "naturally" we can create
them by making matched sets of similar subjects.) So the bottom line is, by
blocking we can reduce the effect of experimental noise.
- (p. 331, Complementarity of Randomization and Blocking, 3rd
paragraph) I don't quite see the two related purposes --- or else I see
them as very related. The act of making the treatment groups
less different is a means to the end of reducing the noise and being
able to detect differences between treatments better. But in any case,
this paragraph nicely sums up what blocking does and gives us.
- (p. 332, Statistical Adjustment for Extraneous Variables)
Blocking is a way of hopefully cancelling out the effects of extraneous
variables. This is easiest to see when there are just two treatment
groups (could be one treatment and a control): the analysis is done by
first taking the differences of the measurements in each block, with the
hope being that if the subjects in each pair differ little from one
another, the contributions to the response due to other factors cancel
out with the subtraction, and what is left can mostly be attributed to
the differences between the treatments (a small sketch of this paired
analysis is given at the end of this section). E.g., age may
have a large effect on the response no matter which treatment is
applied, but if each pair has two subjects of the same age, age can no
longer be viewed as contributing to the difference in the responses for
the two members of the matched pair (block). Randomization is a
way of dealing with differences between subjects/units. By the use of
randomization we can fairly assess whether there is evidence of
treatment differences despite the presence of other sources that could
contribute to the observed differences. Still, one might be better
off trying to get rid of some of the experimental noise by blocking, and
then using randomization within blocks to "polish off the differences"
that still exist (although we don't remove the differences by
randomizing within the blocks --- so really it's more like dealing with
the differences that still exist). Alternatively, we may worry that not
enough of the differences can be cancelled out by blocking, and may
instead try to model (e.g., with a regression model)
the effects due to extraneous variables, thereby adjusting for such
differences individually, rather than trying to cancel them out.
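Here is the small paired-analysis sketch referred to above (Python, with
made-up numbers; it simply carries out the within-block subtraction
described in the text):

    # made-up responses for 5 matched pairs (blocks); in each pair one
    # member got the treatment and the other the control
    treatment = [21.3, 18.7, 25.1, 19.9, 23.4]
    control   = [19.8, 17.9, 23.6, 19.1, 22.0]

    # subtracting within each block cancels whatever the two members of
    # the block have in common (e.g., age), so what's left is mostly the
    # treatment effect plus noise
    diffs = [t - c for t, c in zip(treatment, control)]
    print(diffs, sum(diffs) / len(diffs))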
Section 8.5
- (pp. 334-336, Example 8.25)
Since all of the petri dishes were prepared from the same culture, and
presumably in the same way, there may not have been a good way to create
blocks, and so a complete randomization was done to assign 3 dishes to
each of the 43 treatments. (Note: If there is no good way to create
blocks, one loses a bit with regard to performance of statistical
procedures if blocks are created and used anyway. But if there are
factors contributing to differences between observational units, then
using such factors to create blocks can result in improved inferences.)
Later, we'll cover the analysis of data using such a nested/hierarchical
design. One of the things that you should learn then is that one would
need at least two dishes per treatment in such a study --- if there were
only one dish per treatment, there would be no way to determine if
observed variation was due to differences among dishes, or differences
due to treatments. (See the first two paragraphs on p. 337 for some
additional information about these matters.)
It can also be noted that generally it wouldn't be such a
good idea to have so many different treatments, and so few dishes per
treatment. Finally, please note the correct way to compute the
standard error estimate for the sample mean of the treatment 1
observations (a sketch of the proper and improper computations is given
below). (Some years ago I had to point out that a Ph.D. student
had done things incorrectly in a situation similar to this one. Her
advisor assured me that the incorrect way was standard practice in their
field (wetlands ecology), and rather than fix it and do it the correct
way, they kept the incorrect values and put a note to indicate that the
values weren't proper standard error estimates. I doubt that the
incorrect way is standard practice for good researchers in that field,
but given the stubbornness I observed, and the refusal to do it the right
way, it may well be that it's a commonly made mistake, and one that not
only often goes uncorrected, but is passed on to future generations of
researchers.)
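Here is the sketch referred to above (Python, with made-up dish values
and a made-up number of measurements per dish; these are not the data of
Example 8.25, just a generic illustration of the subsampling issue):

    import statistics as st

    # made-up values: 3 dishes for treatment 1, with 3 measurements per dish
    dishes = [[11.2, 11.8, 11.5],
              [13.1, 12.7, 13.4],
              [12.0, 12.3, 11.9]]

    # the dish is the experimental unit, so base the standard error of the
    # treatment mean on the 3 dish means
    dish_means = [st.mean(d) for d in dishes]
    se_proper = st.stdev(dish_means) / len(dish_means) ** 0.5

    # pooling all 9 measurements ignores dish-to-dish variation and tends
    # to understate the uncertainty in the treatment mean
    pooled = [x for d in dishes for x in d]
    se_naive = st.stdev(pooled) / len(pooled) ** 0.5

    print(se_proper, se_naive)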
- (p. 337, Determination of Sample Size, last sentence)
An example along these lines would be if the object was to estimate the
mean size of particles resulting from a certain process. Perhaps it is
suspected that the average particle size differs from batch to batch.
This leads to needing to use more than one batch! But it is also
suspected that multiple measurements made from the same batch will yield
different values --- that is, in addition to variation due to
differences between batches, there is also variation associated with
each individual batch. This leads to needing to take more than one
measurement per batch. Suppose that the expensive part of the
experiment is making the measurements (and that creating different
batches isn't a problem). If one wants to limit the total number of
measurements made to be 100, there is a trade-off to consider: one can
use a small number of batches and make many measurements per batch, or
one can use many batches and make a small number of measurements per
batch. If one can guess as to the relative amounts of variation that
will be observed both between batches and within batches, one can
determine an optimal sampling plan for which a total of 100 measurements
will be made. (I recall a case where there were 3 batches and hundreds
of measurements per batch. Unfortunately, the greatest source of
variation was the variation between different batches, and the quality
of the final estimate was severely hurt by there only being three
batches. The quality of the estimate would have been a lot better had
many more batches been used, and using fewer measurements per batch
wouldn't have hurt the quality much.) A sketch of this sort of
trade-off calculation is given below.
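Here is the sketch (Python, with guessed-at variance components that are
placeholders only, not values from any real study). With b batches and
m = 100/b measurements per batch, the variance of the grand mean is
roughly (between-batch variance)/b + (within-batch variance)/(b*m):

    # guessed-at variance components (placeholders)
    sigma2_between = 4.0     # batch-to-batch variance
    sigma2_within  = 1.0     # within-batch (measurement) variance
    total = 100              # total number of measurements allowed

    def var_of_grand_mean(b):
        # b batches, m = total/b measurements per batch
        m = total / b
        return sigma2_between / b + sigma2_within / (b * m)

    for b in [2, 4, 5, 10, 20, 25, 50, 100]:
        print(b, round(var_of_grand_mean(b), 4))
    # with these guesses the between-batch term dominates, so many batches
    # with few measurements each beats a few batches with many
    # measurements each

With the within-batch term fixed by the 100-measurement budget, it's the
number of batches that mostly determines the quality of the estimate.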
Section 8.6
- (p. 338) The term margin of error isn't used consistently,
and because of this, I don't like to use it at all. (I prefer to make
it clear how I'm expressing information about the uncertainty associated
with an estimate, either by giving an estimated standard error (and
clearly stating that this is what I'm doing), or by giving a confidence
interval.) S&W indicate that the margin of error is the +/- part
of a confidence interval, but some take it to be the estimated standard
error, and it should be realized that one can have 95%, 90%, 99%, or
some other type of confidence interval (and so indicating that it's the
+/- part of a confidence interval isn't very meaningful unless the
confidence level is also given). (During election time, and in
reporting on results of polls in general, I often see the term margin
of error used, but have never seen it specified just what is meant. A
small numerical illustration of how the different conventions differ is
given below.)
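Here is the illustration (Python; the poll numbers are made up). For a
sample proportion p-hat based on n responses, the estimated standard
error is sqrt(p-hat(1 - p-hat)/n), and the +/- part of a confidence
interval is a multiple of it that depends on the confidence level:

    n, phat = 1000, 0.52                 # made-up poll: 52% of 1000 respondents
    se = (phat * (1 - phat) / n) ** 0.5  # estimated standard error, about 0.016
    print(round(se, 3))
    print(round(1.645 * se, 3))          # +/- part of a 90% confidence interval
    print(round(1.960 * se, 3))          # +/- part of a 95% confidence interval
    print(round(2.576 * se, 3))          # +/- part of a 99% confidence interval
    # "margin of error" could refer to any of these values, so the term by
    # itself doesn't pin down what is being reported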
- (p. 339) The last three paragraphs (about missing data) are
important. A related problem is the one of how to decide if an unusual
observation should be used or omitted. (If a value is unusual, and
obviously the result of some sort of experimental or measurement error,
it should be omitted. If it's unusual but thought to not be an error,
then it shouldn't be omitted. The problem is that sometimes we don't
know if it's an error or not.)
- (pp. 340-341, Randomized Response Sampling)
This section deals with something that isn't commonly done. I'm not
going to cover it in class.
Section 8.7
- (p. 343) The first paragraph (labeled 3) is interesting.
The main point is that if the subjects can't be viewed as being randomly
sampled from, or at least representative of, some larger population, but
they are randomly allocated to the treatment groups, then if statistically
significant evidence of a difference is observed, one can think that the
difference isn't just due to the random assignment of different subjects
to equivalent treatments, but rather one can think that the treatment
does something. This allows us to conclude that the treatment affects
some subset of the overall population, but the problem is that
we don't know very much about the size and characteristics of the subset.
But in the initial stages of research on a treatment, I suppose that
evidence that the treatment affects some subset of the population can be
taken as a minor victory. So doing an experiment using a sample of
convenience instead of a randomly chosen sample can sometimes result in
something worth noting. But of course it would be better to have a
statistically significant result from a random sample, so that the
conclusion can be applied to the population from which the sample was
drawn.