Some Comments about Chapter 2 of Samuels & Witmer
Section 2.1
- (p. 10) Even if all observations are given in years, some would
still treat age as a continuous variable, rather than a discrete
one. In principle, any positive value is possible for age, and
observations have to be rounded in some way (nearest year, month, or
day, for example). It doesn't make a lot of sense to specify that
weight is continuous (as is indicated at the top of p. 10) and
that age is discrete. Weight also has to be rounded in some way ---
whether it be to the nearest lb, kg, tenth of a gram, or mg --- and one
could view the rounded weights as being discrete. In reality, all
variables are in a sense discrete, but if the true quantity that's being
measured, as opposed to the rounded measurements, is not discrete, then
we tend to refer to the rounded measurements as being for a continuous
variable.
- (p. 11, Remark) Please get it straight that in
statistics we refer to a collection of 20 values as one sample, as
opposed to 20 samples. We should say that we have one sample of 20
observations. With regard to the blood, the sample is the 20 glucose
measurements, not the 20 specimens of blood, or the 20 subjects from
which the blood came.
Section 2.2
- (pp. 13-14) The histogram of Fig. 2.5 gives the same information as
does the dotplot (aka one-dimensional scatterplot) of Fig. 2.4.
- (p. 15) One should not overinterpret a histogram based on a small
sample. E.g., we shouldn't necessarily conclude that the distribution
underlying the sample from which Fig. 2.7 was created is bimodal, with
a mode between 200 and 220 in addition to a mode at about 100.
Histograms can be
rather unstable, in that if the bin width for the groups is changed, the
left endpoint of the first bin is changed, or another sample is drawn
from the same distribution and a new histogram is created, the resulting
histogram can display different features than the original one. In
general, I'm not a big fan of histograms.
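To illustrate this instability concretely (a quick sketch of my own, using
made-up simulated data rather than the data behind Fig. 2.7), the Python
snippet below draws two histograms of the same small sample, differing only
in bin width and in where the first bin starts:

    # Sketch (made-up data): the same small sample can look quite
    # different depending on the bin width and where the first bin starts.
    import numpy as np
    import matplotlib.pyplot as plt

    rng = np.random.default_rng(0)
    sample = rng.normal(loc=100, scale=15, size=30)  # one small sample

    fig, axes = plt.subplots(1, 2, figsize=(8, 3))
    axes[0].hist(sample, bins=np.arange(50, 160, 10))   # width 10, first bin at 50
    axes[0].set_title("bin width 10, first bin at 50")
    axes[1].hist(sample, bins=np.arange(55, 165, 20))   # width 20, first bin at 55
    axes[1].set_title("bin width 20, first bin at 55")
    plt.tight_layout()
    plt.show()

Rerunning it with a different seed, or with other bin choices, shows how
easily the apparent shape of a small-sample histogram changes.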
- (p. 15) Instead of stating "skewed to the right" it's better
to put "positively skewed." Similarly, a "negatively skewed"
distribution is one that S&W refers to as being "skewed to the left."
- (p. 18) I don't like either (a) or (b) of Fig. 2.13. If bins of
different widths are going to be used (which is sometimes a good idea),
the vertical axis shouldn't be for either frequency or relative
frequency --- instead the units of density should be used (although you
shouldn't worry about this for STAT 535, since I'm not going to
emphasize histograms anyway).
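In case it helps to see what the units of density mean in practice, here is
a small sketch of my own (with made-up data, not the data of Fig. 2.13):
with unequal bin widths, each bar's height is the relative frequency divided
by the bin width, so that the bar areas, not the heights, represent the
relative frequencies.

    # Sketch with made-up data: with unequal bin widths, put density
    # (relative frequency divided by bin width) on the vertical axis so
    # that the bar *areas* sum to 1.
    import numpy as np
    import matplotlib.pyplot as plt

    rng = np.random.default_rng(1)
    data = rng.gamma(shape=2.0, scale=10.0, size=200)

    bins = [0, 10, 20, 30, 50, 100]        # unequal widths
    plt.hist(data, bins=bins, density=True, edgecolor="black")
    plt.ylabel("density")
    plt.show()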
- (p. 19) If one turns a stem-and-leaf plot on its side it resembles
a histogram or dotplot. An advantage of the stem-and-leaf plot is that
one can determine the value of each observation.
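For what it's worth, a stem-and-leaf display is simple enough to produce by
hand or with a few lines of code; here's a bare-bones sketch (my own, with
made-up integer values) that splits each value into a tens-digit stem and a
ones-digit leaf:

    # Sketch: a bare-bones stem-and-leaf display for made-up integers,
    # using the tens digit as the stem and the ones digit as the leaf.
    values = [23, 25, 31, 31, 34, 38, 40, 42, 47, 55]

    stems = {}
    for v in sorted(values):
        stems.setdefault(v // 10, []).append(v % 10)

    for stem in sorted(stems):
        print(f"{stem} | {''.join(str(leaf) for leaf in stems[stem])}")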
- (p. 19, 3 lines from bottom) Instead of rounding to "one decimal
place" I think it should be to the nearest integer.
Section 2.3
The shape of the distribution(s) often influences the choice of which
statistical method to use --- some procedures aren't good choices if
there is too much skewness or heavy-tailedness.
Even though this section shows a lot of histograms, I'll stress in class
that histograms are not the best devices for assessing whether a
distribution is heavy- or light-tailed, or determining if there is
approximate symmetry or perhaps mild skewness. Other graphical methods
are more useful for such diagnostic determinations.
- (p. 21) The caption of Fig. 2.20 suggests that the smooth curve is
an approximation of the histogram, but often the opposite point of view
is better --- we think that some smooth curve (called the density
(see Ch. 3) of the distribution)
underlies the observed data, and that a histogram
based on a random sample is an approximation of the smooth curve which
corresponds to the distribution for the phenomenon; that is, the smooth
curve is the truth, and the histogram is an estimate of the truth based
on a finite number of observations.
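A small simulation sketch (mine, not tied to the data of Fig. 2.20) makes
this point of view concrete: the smooth curve below is the true density of
the distribution, and the histogram, computed from a finite sample drawn
from that distribution, is just an estimate of that curve.

    # Sketch: the smooth curve is the true density; the histogram,
    # computed from a finite random sample, is an estimate of that curve.
    import numpy as np
    import matplotlib.pyplot as plt
    from scipy.stats import norm

    rng = np.random.default_rng(2)
    sample = rng.normal(loc=0, scale=1, size=100)

    x = np.linspace(-4, 4, 400)
    plt.hist(sample, bins=15, density=True, alpha=0.5, label="histogram (estimate)")
    plt.plot(x, norm.pdf(x), label="true density")
    plt.legend()
    plt.show()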
- (p. 24) Fig. 2.26 is nice in that it illustrates two sources of
observational variability: (a) shows variability due to differences
between individuals, perhaps due to exposure to different conditions,
and (b) shows variability due to lack of precision in the measuring
procedure when more than one measurement is made
on the same individual.
Section 2.4
The reason that I like to skip Ch. 2 at first and talk about it after
having covered Ch. 3 and Ch. 4 is that by doing it my way I can more
meaningfully discuss the sample mean and sample median as
estimators. In Ch. 2 the sample mean and sample median are
mainly examined in their roles as summary measures for a sample
--- descriptive statistics. But usually one isn't primarily interested
in just the n values which make up a sample, but instead is
interested in using the data to make an inference about a larger
population of values from which the sample is drawn, or in some cases,
it's better to think of it as wanting to make an inference about the
distribution underlying the sample --- the distribution that in a
sense generated the sample. This notion, statistical inference,
is introduced in Sec. 2.8, but since a lot of the concepts needed to
address the issue properly aren't presented until later chapters of
S&W, the introduction in Ch. 2 is a bit awkward.
- (p. 26) Note the informal definition of statistic given in
the first paragraph. Really a statistic is the formula used to compute
the numerical value; i.e., it is a function of the observations
expressed in terms of the abstract y_i (see the blue box on p. 27 for
an example). Technically, a statistic should be defined using the
notation of random variables (introduced in Ch. 3), and so to define a
statistic one should use Y_i instead of y_i.
- (p. 26 & p. 30 (1st paragraph)) I think
it is bad to think of the sample mean or sample
median as necessarily being the "center" or "typical value" --- they are
what they are, and I can give examples of samples for which neither the
sample mean nor sample median is a good measure of the center, nor is
either a typical value.
- (p. 30, 2nd paragraph) That the sample mean may be highly
influenced by a small number of unusual values is not necessarily
undesirable, and it isn't necessarily true that the sample median is a
superior measure in such cases --- it depends on what the purpose of
the summary measure is. For example, suppose that the values in a
sample are net profits which result from drilling for oil 100 times.
It could be that 97 of them are negative because not enough oil (if any)
was found to offset the cost of looking to see if there is oil. But if
three of them are due to huge successes where oodles of oil was found,
and huge profits were made, the sample mean of the 100 values may be
positive (and even very large in an absolute sense), while the sample
median would be negative. The sample mean is the average net profit
that results from drilling, and the fact that it's positive even though
97% of the values are negative isn't really misleading --- the
average result of drilling was good (on the whole, money was made
from the 100 attempts to find oil). It's the negative sample median
that is misleading if the purpose is to characterize the average net
profit. In this case, the sample median is a typical value. Just
knowing that the sample median is negative means that at least half of
the time money was lost. But in this case that is a poor summary of the
full set of data if the purpose is to assess the profitability of
drilling for oil. We don't necessarily want a summary value to
represent the typical value. In summary,
the mean and median can be different --- they aren't always supposed to
be measures of the same thing --- and in some cases the interest may be
in one of the values more than the other, and in other situations it may
be reversed.
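A tiny made-up numerical version of this scenario (the profit figures are
invented just for illustration): 97 losses of 1 million each and 3 gains of
200 million each give a clearly positive sample mean but a negative sample
median.

    # Made-up numbers mimicking the oil-drilling scenario: 97 losses of
    # $1 million each and 3 gains of $200 million each.
    import numpy as np

    profits = np.array([-1.0] * 97 + [200.0] * 3)   # in millions of dollars

    print(np.mean(profits))    # 5.03 million: drilling was profitable on average
    print(np.median(profits))  # -1.0 million: the "typical" well lost money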
- (p. 30, 3rd paragraph) To give an example of what is being hinted
at here, suppose that we are interested in estimating the true
mean of the parent distribution of the sample (the distribution
responsible for the values in the sample). If the distribution is
symmetric (see p. 21),
then its mean is equal to its median, and one might think that either
the sample mean or the sample median could be used to estimate this
value of interest. In most cases the sample mean is the better
estimator to use. The sample mean is, in many respects,
the ideal estimator if the parent
distribution of interest is a normal distribution (see Ch. 4), but also in
most other cases it is superior to the sample median. But we shouldn't
always use the sample mean to get an estimate of the mean of a symmetric
distribution --- the sample median can be better if the distribution has
extremely heavy tails, and if the distribution has only moderately heavy
tails (compared to a normal distribution) estimators other than the
sample mean and sample median (for example, trimmed means and
M-estimators) can be better to use.
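Here is a rough simulation sketch of that last point (my own; the
distributions are chosen just for illustration): for normal data the sample
mean has the smallest spread, for a t distribution with 1 degree of freedom
(extremely heavy tails) the sample median is far better, and for a t
distribution with 3 degrees of freedom (moderately heavy tails) a 20%
trimmed mean does well.

    # Rough simulation sketch: compare the sampling variability of the
    # sample mean, sample median, and 20% trimmed mean as estimators of
    # the center (0) of three symmetric distributions.
    import numpy as np
    from scipy.stats import trim_mean

    rng = np.random.default_rng(3)
    n, reps = 25, 10_000

    def iqr(x):
        # interquartile range: a robust measure of an estimator's spread
        # (the sample mean has no finite variance in the 1-df t case)
        q75, q25 = np.percentile(x, [75, 25])
        return q75 - q25

    for name, draw in [
        ("normal", lambda: rng.standard_normal((reps, n))),
        ("t, 3 df (moderately heavy tails)", lambda: rng.standard_t(3, size=(reps, n))),
        ("t, 1 df (very heavy tails)", lambda: rng.standard_t(1, size=(reps, n))),
    ]:
        samples = draw()
        print(name)
        print("  mean:   ", round(iqr(samples.mean(axis=1)), 3))
        print("  median: ", round(iqr(np.median(samples, axis=1)), 3))
        print("  trimmed:", round(iqr(trim_mean(samples, 0.2, axis=1)), 3))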
Section 2.5
I don't particularly like boxplots. (I have a very different attitude
about them than does Dr. Gantz, who teaches STAT 510. Of course, my
attitude is the correct one.) Boxplots suppress too much information
about the shape of a sample or distribution. Dotplots, histograms, and
stem-and-leaf plots show much more. With small sample sizes, boxplots
can provide a misleading summary, because the ends of the box can be
bad estimates of the 25th and 75th percentiles of the distribution which
underlies the sample.
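To give a rough idea of why I say this (a simulation sketch of my own, not
something from S&W): with samples of size 10 from a standard normal
distribution, the sample quartiles that form the ends of the box vary a lot
from sample to sample around the true quartiles, which are about -0.674 and
0.674.

    # Sketch: with n = 10 observations from a standard normal distribution,
    # the sample quartiles (the ends of the box) are quite variable
    # estimates of the true quartiles (about -0.674 and 0.674).
    import numpy as np

    rng = np.random.default_rng(4)
    q1s, q3s = [], []
    for _ in range(10_000):
        s = rng.standard_normal(10)
        q1s.append(np.percentile(s, 25))
        q3s.append(np.percentile(s, 75))

    print("middle 95% of sample lower quartiles:", np.percentile(q1s, [2.5, 97.5]))
    print("middle 95% of sample upper quartiles:", np.percentile(q3s, [2.5, 97.5]))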
- (p. 35 & p. 38) Parallel boxplots, such as those shown in Fig. 2.32
and Fig. 2.34, are sometimes nice. When several different samples are
being compared, I don't mind so much that a lot of information is lost
--- the parallel boxplots provide me with a decent initial summary.
(However, if the sample sizes are rather small, I think other types of
graphical displays are superior.)
But for a single sample of values, I think a boxplot suppresses way too
much information --- with a single sample, I don't want such a simple
summary.
- (p. 35, Outliers) An outlier need not correspond to a
mistake. It can just be an extreme value from a heavy-tailed or skewed
distribution. Or it can be a rather unusual value in a large sample
from a "well-behaved" distribution, such as a normal distribution, and
in this case it's a value that, when looked at in conjunction with the
entire data set, is not inconsistent with what is to be expected ---
that is, in a large data set, one often expects to find a few outliers.
- (p. 36, 1st 2 lines) I tend to use the term outlier
informally. When I refer to an outlier, I just mean a rather unusual
observation in a sample, whether the observation is a mistake of some
sort or just an extreme value from one of the tails of the distribution.
While there have been various definitions proposed, none of
them work well in all situations. While the outlier identification
scheme given on p. 36 is a commonly used one, it is not the clear-cut
best scheme, and it seems best to not take it as the basis of a
definition of an outlier (even though S&W might have us believe that
it leads to the
definition). I agree with what is on lines 3 and 4 of p. 37 --- an
outlier is a value which is unusual relative to the other values in the
data set and their variability. But I don't think it's a good idea to
believe that there is a fool-proof system that will reasonably identify
outliers in all cases.
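If, as I believe, the scheme on p. 36 is the usual one based on fences
placed 1.5 interquartile ranges beyond the quartiles, then in code it
amounts to something like the following sketch (with made-up data):

    # Sketch of the common quartile-based outlier flagging scheme
    # (assuming the p. 36 rule is the usual 1.5 x IQR fence rule):
    # flag values more than 1.5 IQRs below Q1 or above Q3.
    import numpy as np

    data = np.array([3.1, 3.4, 3.6, 3.8, 4.0, 4.1, 4.3, 4.4, 4.7, 9.2])  # made up

    q1, q3 = np.percentile(data, [25, 75])
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr

    flagged = data[(data < lower_fence) | (data > upper_fence)]
    print(flagged)   # 9.2 is flagged, but that alone doesn't make it a mistake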
- (p. 37, 1st new paragraph) I agree that one should be very hesitant
about removing outliers. Often if it's reasonable to think that an
outlier is a mistake I'll remove it. However, sometimes an extreme value may
be a mistake, but a mistake that occurred because the value which should
have been recorded was rather extreme, and in such a case the outlier
provides some information about what should have been recorded, and to
completely ignore such a value would result in creating a biased sample.
Section 2.6
- (pp. 41-42) I think it should be
sample standard deviation and sample variance instead of
"sample standard deviation" and "sample variance"
since they are often used to estimate the distribution standard
deviation and variance --- it needs to be stressed that they are
computed using just a sample of values. So when s and
s2 are being referred to, the word sample
should always be used, just like we should use it to distinguish the
sample mean and sample median from the distribution/population mean and
median.
- (p. 31) Although it may be instructive to compute one sample
standard deviation in the way explained under the blue box in order to
gain a better understanding of the concept, when there are more than
3 or 4 observations in a sample I recommend using software.
(You won't be expected to compute such values for exams and quizzes.)
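For example, here is a minimal sketch using numpy (the data values are made
up; the ddof=1 argument is what requests division by n - 1 rather than n):

    # Sketch: computing the sample standard deviation and sample variance
    # with software; ddof=1 requests division by n - 1.
    import numpy as np

    y = np.array([12.1, 9.8, 11.3, 10.6, 13.0])   # made-up sample

    s = np.std(y, ddof=1)        # sample standard deviation
    s2 = np.var(y, ddof=1)       # sample variance
    print(s, s2)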
- (pp. 43-44, Why n-1?) Although I've seen other books offer
similar "explanations," I find such explanations lacking. The
explanation is more for why the term degrees of freedom is used
than it is for why n - 1 should be used instead of n.
A more meaningful explanation for subtracting 1 from the sample size may
be that in some ways a better estimator is obtained by doing so.
(Using n-1 makes the sample variance an unbiased estimator (to be
covered later) of the distribution variance.)
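A small simulation sketch of the unbiasedness claim (mine, not something
S&W presents): averaging many sample variances computed with the n - 1
divisor recovers the true variance, while dividing by n systematically
underestimates it.

    # Sketch: with many samples of size n = 5 from a distribution whose
    # true variance is 1, the n - 1 version of the sample variance averages
    # out to about 1, while the n version averages to about (n - 1)/n = 0.8.
    import numpy as np

    rng = np.random.default_rng(5)
    n, reps = 5, 100_000

    samples = rng.standard_normal((reps, n))
    var_nm1 = samples.var(axis=1, ddof=1)   # divide by n - 1
    var_n = samples.var(axis=1, ddof=0)     # divide by n

    print(var_nm1.mean())   # close to 1
    print(var_n.mean())     # close to 0.8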
- (p. 47, Estimating the SD from a Histogram) This
part isn't so important since most of the time one should be able to
compute the sample standard deviation directly.
Section 2.8
- (p. 57) The 2nd paragraph is an important one (as is the first
sentence of the Statistical Inference section near the bottom of
the page).
- (p. 58, Defining the Population) I guess there are two ways
to think about this: we can decide what the population we want to make
an inference about is and select a sample appropriately, or we can
select a sample that is convenient to use and decide what population it
represents. In either case, it's best to randomly select the sample
(although in some cases this isn't done and it is hoped that a sample of
convenience is representative of a larger population).
- (p. 58, last 3 lines) This is consistent with my comment about p.
11 above --- both the sample and the population are composed of
observations and potential observations (since the population includes
observations which would have been made if the random selection had been
different (to include different observational/experimental/sampling
units)). For instance, in Example 2.48 on p. 62 the population is the
set of observations and (mostly) potential observations that would have
occurred had other cancer patients been selected for the study.
- (p. 59, line 1) A population of interest need not be "indefinitely
large" --- sometimes it is finite (e.g., all people who will vote in the
next U.S. presidential election).
- (p. 61, Describing a Population) The first paragraph of
this subsection is an important one.
- (p. 61 & p. 63) In some cases it is better to refer to the
corresponding population characteristic as a population measure, as
opposed to a parameter. (This is a picky point: one that most
statisticians don't appreciate, and one that you don't have to worry
about. But unless the estimand of interest, say the population mean or
median, is represented by a parameter in a particular parametric model, I
wouldn't refer to it as a parameter. A lot of times we are interested
in a mean or median, but don't think that we're dealing with a
parametric model --- rather there is just some distribution of interest,
which is not necessarily a member of some parametric family.)