Some Comments about Chapter 2 of Samuels & Witmer
Section 2.1
- (p. 10) Even if all observations are given in years, some would
still treat age as a continuous variable, rather than a discrete
one. In principle, any positive value is possible for age, and
observations have to be rounded in some way (nearest year, month, or
day, for example). It doesn't make a lot of sense to specify that
weight is continuous (as is indicated at the top of p. 10) and
that age is discrete. Weight also has to be rounded in some way ---
whether it be to the nearest lb, kg, tenth of a gram, or mg --- and one
could view the rounded weights as being discrete. In reality, all
variables are in a sense discrete, but if the true quantity that's being
measured, as opposed to the rounded measurements, is not discrete, then
we tend to refer to the rounded measurements as being for a continuous
variable.
- (p. 11, Remark) Please get it straight that in
statistics we refer to a collection of 20 values as one sample, as
opposed to 20 samples. We should say that we have one sample of 20
observations. With regard to the blood, the sample is the 20 glucose
measurements, not the 20 specimens of blood, or the 20 subjects from
which the blood came.
Section 2.2
- (pp. 13-14) The histogram of Fig. 2.5 gives the same information as
does the dotplot (aka one-dimensional scatterplot) of Fig. 2.4.
- (p. 15) One should not overinterpret a histogram based on a small
sample. E.g., we shouldn't necessarily conclude that the distribution
underlying the sample from which Fig. 2.7 was created is bimodal, with
a mode between 200 and 220 in addition to a mode at about 100.
Histograms can be
rather unstable, in that if the bin width for the groups is changed, the
left endpoint of the first bin is changed, or another sample is drawn
from the same distribution and a new histogram is created, the resulting
histogram can display different features than the original one. In
general, I'm not a big fan of histograms.
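To illustrate this instability concretely (a quick sketch of my own, using
made-up simulated data rather than the data behind Fig. 2.7), the Python
snippet below draws two histograms of the same small sample, differing only
in bin width and in where the first bin starts:

    # Sketch (made-up data): the same small sample can look quite
    # different depending on the bin width and where the first bin starts.
    import numpy as np
    import matplotlib.pyplot as plt

    rng = np.random.default_rng(0)
    sample = rng.normal(loc=100, scale=15, size=30)  # one small sample

    fig, axes = plt.subplots(1, 2, figsize=(8, 3))
    axes[0].hist(sample, bins=np.arange(50, 160, 10))   # width 10, first bin at 50
    axes[0].set_title("bin width 10, first bin at 50")
    axes[1].hist(sample, bins=np.arange(55, 165, 20))   # width 20, first bin at 55
    axes[1].set_title("bin width 20, first bin at 55")
    plt.tight_layout()
    plt.show()

Rerunning it with a different seed, or with other bin choices, shows how
easily the apparent shape of a small-sample histogram changes.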
- (p. 15) Instead of stating "skewed to the right" it's better
to put "positively skewed." Similarly, a "negatively skewed"
distribution is one that S&W refers to as being "skewed to the left."
- (p. 18) I don't like either (a) or (b) of Fig. 2.13. If bins of
different widths are going to be used (which is sometimes a good idea),
the vertical axis shouldn't be for either frequency or relative
frequency --- instead the units of density should be used (although you
shouldn't worry about this for STAT 535, since I'm not going to
emphasize histograms anyway).
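In case it helps to see what the units of density mean in practice, here is
a small sketch of my own (with made-up data, not the data of Fig. 2.13):
with unequal bin widths, each bar's height is the relative frequency divided
by the bin width, so that the bar areas, not the heights, represent the
relative frequencies.

    # Sketch with made-up data: with unequal bin widths, put density
    # (relative frequency divided by bin width) on the vertical axis so
    # that the bar *areas* sum to 1.
    import numpy as np
    import matplotlib.pyplot as plt

    rng = np.random.default_rng(1)
    data = rng.gamma(shape=2.0, scale=10.0, size=200)

    bins = [0, 10, 20, 30, 50, 100]        # unequal widths
    plt.hist(data, bins=bins, density=True, edgecolor="black")
    plt.ylabel("density")
    plt.show()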
- (p. 19) If one turns a stem-and-leaf plot on its side it resembles
a histogram or dotplot. An advantage of the stem-and-leaf plot is that
one can determine the value of each observation.
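For what it's worth, a stem-and-leaf display is simple enough to produce by
hand or with a few lines of code; here's a bare-bones sketch (my own, with
made-up integer values) that splits each value into a tens-digit stem and a
ones-digit leaf:

    # Sketch: a bare-bones stem-and-leaf display for made-up integers,
    # using the tens digit as the stem and the ones digit as the leaf.
    values = [23, 25, 31, 31, 34, 38, 40, 42, 47, 55]

    stems = {}
    for v in sorted(values):
        stems.setdefault(v // 10, []).append(v % 10)

    for stem in sorted(stems):
        print(f"{stem} | {''.join(str(leaf) for leaf in stems[stem])}")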
- (p. 19, 3 lines from bottom) Instead of rounding to "one decimal
place" I think it should be to the nearest integer.
Section 2.3
The shape of the distribution(s) often influences the choice of which
statistical method to use --- some procedures aren't good choices if
there is too much skewness or heavy-tailedness.
Even though this section shows a lot of histograms, I'll stress in class
that histograms are not the best devices for assessing whether a
distribution is heavy- or light-tailed, or determining if there is
approximate symmetry or perhaps mild skewness. Other graphical methods
are more useful for such diagnostic determinations.
- (p. 21) The caption of Fig. 2.20 suggests that the smooth curve is
an approximation of the histogram, but often the opposite point of view
is better --- we think that some smooth curve (called the density
(see Ch. 3) of the distribution)
underlies the observed data, and that a histogram
based on a random sample is an approximation of the smooth curve which
corresponds to the distribution for the phenomenon; that is, the smooth
curve is the truth, and the histogram is an estimate of the truth based
on a finite number of observations.
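A small simulation sketch (mine, not tied to the data of Fig. 2.20) makes
this point of view concrete: the smooth curve below is the true density of
the distribution, and the histogram, computed from a finite sample drawn
from that distribution, is just an estimate of that curve.

    # Sketch: the smooth curve is the true density; the histogram,
    # computed from a finite random sample, is an estimate of that curve.
    import numpy as np
    import matplotlib.pyplot as plt
    from scipy.stats import norm

    rng = np.random.default_rng(2)
    sample = rng.normal(loc=0, scale=1, size=100)

    x = np.linspace(-4, 4, 400)
    plt.hist(sample, bins=15, density=True, alpha=0.5, label="histogram (estimate)")
    plt.plot(x, norm.pdf(x), label="true density")
    plt.legend()
    plt.show()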
- (p. 24) Fig. 2.26 is nice in that it illustrates two sources of
observational variability: (a) shows variability due to differences
between individuals, perhaps due to exposure to different conditions,
and (b) shows variability due to lack of precision in the measuring
procedure when more than one measurement is made
on the same individual.
Section 2.4
The reason that I like to skip Ch. 2 at first and talk about it after
having covered Ch. 3 and Ch. 4 is that by doing it my way I can more
meaningfully discuss the sample mean and sample median as
estimators. In Ch. 2 the sample mean and sample median are
mainly examined in their roles as summary measures for a sample
--- descriptive statistics. But usually one isn't primarily interested
in just the n values which make up a sample, but instead is
interested in using the data to make an inference about a larger
population of values from which the sample is drawn, or in some cases,
it's better to think of it as wanting to make an inference about the
distribution underlying the sample --- the distribution that in a
sense generated the sample. This notion, statistical inference,
is introduced in Sec. 2.8, but since a lot of the concepts needed to
address the issue properly aren't presented until later chapters of
S&W, the introduction in Ch. 2 is a bit awkward.
- (p. 26) Note the informal definition of statistic given in
the first paragraph. Really a statistic is the formula used to compute
the numerical value; i.e., it is a function of the observations
expressed in terms of the abstract y_i (see the blue box on p. 27 for
an example). Technically, a statistic should be defined using the
notation of random variables (introduced in Ch. 3), and so to define a
statistic one should use Y_i instead of y_i.
- (p. 26 & p. 30 (1st paragraph)) I think
it is bad to think of the sample mean or sample
median as necessarily being the "center" or "typical value" --- they are
what they are, and I can give examples of samples for which neither the
sample mean nor sample median is a good measure of the center, nor is
either a typical value.
- (p. 30, 2nd paragraph) That the sample mean may be highly
influenced by a small number of unusual values is not necessarily
undesirable, and it isn't necessarily true that the sample median is a
superior measure in such cases --- it depends on what the purpose of
the summary measure is. For example, suppose that the values in a
sample are net profits which result from drilling for oil 100 times.
It could be that 97 of them are negative because not enough oil (if any)
was found to offset the cost of looking to see if there is oil. But if
three of them are due to huge successes where oodles of oil was found,
and huge profits were made, the sample mean of the 100 values may be
positive (and even very large in an absolute sense), while the sample
median would be negative. The sample mean is the average net profit
that results from drilling, and the fact that it's positive even though
97% of the values are negative isn't really misleading --- the
average result of drilling was good (on the whole, money was made
from the 100 attempts to find oil). It's the negative sample median
that is misleading if the purpose is to characterize the average net
profit. In this case, the sample median is a typical value. Just
knowing that the sample median is negative means that at least half of
the time money was lost. But in this case that is a poor summary of the
full set of data if the purpose is to assess the profitability of
drilling for oil. We don't necessarily want a summary value to
represent the typical value. In summary,
the mean and median can be different --- they aren't always supposed to
be measures of the same thing --- and in some cases the interest may be
in one of the values more than the other, and in other situations it may
be reversed.
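A tiny made-up numerical version of this scenario (the profit figures are
invented just for illustration): 97 losses of 1 million each and 3 gains of
200 million each give a clearly positive sample mean but a negative sample
median.

    # Made-up numbers mimicking the oil-drilling scenario: 97 losses of
    # $1 million each and 3 gains of $200 million each.
    import numpy as np

    profits = np.array([-1.0] * 97 + [200.0] * 3)   # in millions of dollars

    print(np.mean(profits))    # 5.03 million: drilling was profitable on average
    print(np.median(profits))  # -1.0 million: the "typical" well lost money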
- (p. 30, 3rd paragraph) To give an example of what is being hinted
at here, suppose that we are interested in estimating the true
mean of the parent distribution of the sample (the distribution
responsible for the values in the sample). If the distribution is
symmetric (see p. 21),
then its mean is equal to its median, and one might think that either
the sample mean or the sample median could be used to estimate this
value of interest. In most cases the sample mean is the better
estimator to use. The sample mean is, in many respects,
the ideal estimator if the parent
distribution of interest is a normal distribution (see Ch. 4), but also in
most other cases it is superior to the sample median. But we shouldn't
always use the sample mean to get an estimate of the mean of a symmetric
distribution --- the sample median can be better if the distribution has
extremely heavy tails, and if the distribution has only moderately heavy
tails (compared to a normal distribution) estimators other than the
sample mean and sample median (for example, trimmed means and
M-estimators) can be better to use.
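Here is a rough simulation sketch of that last point (my own; the
distributions are chosen just for illustration): for normal data the sample
mean has the smallest spread, for a t distribution with 1 degree of freedom
(extremely heavy tails) the sample median is far better, and for a t
distribution with 3 degrees of freedom (moderately heavy tails) a 20%
trimmed mean does well.

    # Rough simulation sketch: compare the sampling variability of the
    # sample mean, sample median, and 20% trimmed mean as estimators of
    # the center (0) of three symmetric distributions.
    import numpy as np
    from scipy.stats import trim_mean

    rng = np.random.default_rng(3)
    n, reps = 25, 10_000

    def iqr(x):
        # interquartile range: a robust measure of an estimator's spread
        # (the sample mean has no finite variance in the 1-df t case)
        q75, q25 = np.percentile(x, [75, 25])
        return q75 - q25

    for name, draw in [
        ("normal", lambda: rng.standard_normal((reps, n))),
        ("t, 3 df (moderately heavy tails)", lambda: rng.standard_t(3, size=(reps, n))),
        ("t, 1 df (very heavy tails)", lambda: rng.standard_t(1, size=(reps, n))),
    ]:
        samples = draw()
        print(name)
        print("  mean:   ", round(iqr(samples.mean(axis=1)), 3))
        print("  median: ", round(iqr(np.median(samples, axis=1)), 3))
        print("  trimmed:", round(iqr(trim_mean(samples, 0.2, axis=1)), 3))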
Section 2.5
I don't particularly like boxplots. (I have a very different attitude
about them than does Dr. Gantz, who teaches STAT 510. Of course, my
attitude is the correct one.) Boxplots suppress too much information
about the shape of a sample or distribution. Dotplots, histograms, and
stem-and-leaf plots show much more. With small sample sizes, boxplots
can provide a misleading summary, because the ends of the box can be
bad estimates of the 25th and 75th percentiles of the distribution which
underlies the sample.
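To give a rough idea of why I say this (a simulation sketch of my own, not
something from S&W): with samples of size 10 from a standard normal
distribution, the sample quartiles that form the ends of the box vary a lot
from sample to sample around the true quartiles, which are about -0.674 and
0.674.

    # Sketch: with n = 10 observations from a standard normal distribution,
    # the sample quartiles (the ends of the box) are quite variable
    # estimates of the true quartiles (about -0.674 and 0.674).
    import numpy as np

    rng = np.random.default_rng(4)
    q1s, q3s = [], []
    for _ in range(10_000):
        s = rng.standard_normal(10)
        q1s.append(np.percentile(s, 25))
        q3s.append(np.percentile(s, 75))

    print("middle 95% of sample lower quartiles:", np.percentile(q1s, [2.5, 97.5]))
    print("middle 95% of sample upper quartiles:", np.percentile(q3s, [2.5, 97.5]))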
- (p. 35 & p. 38) Parallel boxplots, such as those shown in Fig. 2.32
and Fig. 2.34, are sometimes nice. When several different samples are
being compared, I don't mind so much that a lot of information is lost
--- the parallel boxplots provide me with a decent initial summary.
(However, if the sample sizes are rather small, I think other types of
graphical displays are superior.)
But for a single sample of values, I think a boxplot suppresses way too
much information --- with a single sample, I don't want such a simple
summary.
- (p. 35, Outliers) An outlier need not correspond to a
mistake. It can just be an extreme value from a heavy-tailed or skewed
distribution. Or it can be a rather unusual value in a large sample
from a "well-behaved" distribution, such as a normal distribution, and
in this case it's a value that, when looked at in conjunction with the
entire data set, is not inconsistent with what is to be expected ---
that is, in a large data set, one often expects to find a few outliers.
- (p. 36, 1st 2 lines) I tend to use the term outlier
informally. When I refer to an outlier, I just mean a rather unusual
observation in a sample, whether the observation is a mistake of some
sort or just an extreme value from one of the tails of the distribution.
While there have been various definitions proposed, none of
them work well in all situations. While the outlier identification
scheme given on p. 36 is a commonly used one, it is not the clear-cut
best scheme, and it seems best to not take it as the basis of a
definition of an outlier (even though S&W might have us believe that
it leads to the
definition). I agree with what is on lines 3 and 4 of p. 37 --- an
outlier is a value which is unusual relative to the other values in the
data set and their variability. But I don't think it's a good idea to
believe that there is a fool-proof system that will reasonably identify
outliers in all cases.
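If, as I believe, the scheme on p. 36 is the usual one based on fences
placed 1.5 interquartile ranges beyond the quartiles, then in code it
amounts to something like the following sketch (with made-up data):

    # Sketch of the common quartile-based outlier flagging scheme
    # (assuming the p. 36 rule is the usual 1.5 x IQR fence rule):
    # flag values more than 1.5 IQRs below Q1 or above Q3.
    import numpy as np

    data = np.array([3.1, 3.4, 3.6, 3.8, 4.0, 4.1, 4.3, 4.4, 4.7, 9.2])  # made up

    q1, q3 = np.percentile(data, [25, 75])
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr

    flagged = data[(data < lower_fence) | (data > upper_fence)]
    print(flagged)   # 9.2 is flagged, but that alone doesn't make it a mistake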
- (p. 37, 1st new paragraph) I agree that one should be very hesitant
about removing outliers. Often if it's reasonable to think that an
outlier is a mistake I'll remove it. However, sometimes an extreme value may
be a mistake, but a mistake that occurred because the value which should
have been recorded was rather extreme, and in such a case the outlier
provides some information about what should have been recorded, and to
completely ignore such a value would result in creating a biased sample.
Section 2.6
- (pp. 41-42) I think it should be
sample standard deviation and sample variance instead of
"sample standard deviation" and "sample variance"
since they are often used to estimate the distribution standard
deviation and variance --- it needs to be stressed that they are
computed using just a sample of values. So when s and
s2 are being referred to, the word sample
should always be used, just like we should use it to distinguish the
sample mean and sample median from the distribution/population mean and
median.
- (p. 31) Although it may be instructive to compute one sample
standard deviation in the way explained under the blue box in order to
gain a better understanding of the concept, when there are more than
3 or 4 observations in a sample I recommend using software.
(You won't be expected to compute such values for exams and quizzes.)
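For example, here is a minimal sketch using numpy (the data values are made
up; the ddof=1 argument is what requests division by n - 1 rather than n):

    # Sketch: computing the sample standard deviation and sample variance
    # with software; ddof=1 requests division by n - 1.
    import numpy as np

    y = np.array([12.1, 9.8, 11.3, 10.6, 13.0])   # made-up sample

    s = np.std(y, ddof=1)        # sample standard deviation
    s2 = np.var(y, ddof=1)       # sample variance
    print(s, s2)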
- (pp. 43-44, Why n-1?) Although I've seen other books offer
similar "explanations," I find such explanations lacking. The
explanation is more for why the term degrees of freedom is used
than it is for why n - 1 should be used instead of n.
A more meaningful explanation for subtracting 1 from the sample size may
be that in some ways a better estimator is obtained by doing so.
(Using n-1 makes the sample variance an unbiased estimator (to be
covered later) of the distribution variance.)
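A small simulation sketch of the unbiasedness claim (mine, not something
S&W presents): averaging many sample variances computed with the n - 1
divisor recovers the true variance, while dividing by n systematically
underestimates it.

    # Sketch: with many samples of size n = 5 from a distribution whose
    # true variance is 1, the n - 1 version of the sample variance averages
    # out to about 1, while the n version averages to about (n - 1)/n = 0.8.
    import numpy as np

    rng = np.random.default_rng(5)
    n, reps = 5, 100_000

    samples = rng.standard_normal((reps, n))
    var_nm1 = samples.var(axis=1, ddof=1)   # divide by n - 1
    var_n = samples.var(axis=1, ddof=0)     # divide by n

    print(var_nm1.mean())   # close to 1
    print(var_n.mean())     # close to 0.8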
- (p. 47, Estimating the SD from a Histogram) This
part isn't so important since most of the time one should be able to
compute the sample standard deviation directly.
Section 2.8
- (p. 57) The 2nd paragraph is an important one (as is the first
sentence of the Statistical Inference section near the bottom of
the page).
- (p. 58, Defining the Population) I guess there are two ways
to think about this: we can decide what the population we want to make
an inference about is and select a sample appropriately, or we can
select a sample that is convenient to use and decide what population it
represents. In either case, it's best to randomly select the sample
(although in some cases this isn't done and it is hoped that a sample of
convenience is representative of a larger population).
- (p. 58, last 3 lines) This is consistent with my comment about p.
11 above --- both the sample and the population are composed of
observations and potential observations (since the population includes
observations which would have been made if the random selection had been
different (to include different observational/experimental/sampling
units)). For instance, in Example 2.48 on p. 62 the population is the
set of observations and (mostly) potential observations that would have
occurred had other cancer patients been selected for the study.
- (p. 59, line 1) A population of interest need not be "indefinitely
large" --- sometimes it is finite (e.g., all people who will vote in the
next U.S. presidential election).
- (p. 61, Describing a Population) The first paragraph of
this subsection is an important one.
- (p. 61 & p. 63) In some cases it is better to refer to the
corresponding population characteristic as a population measure, as
opposed to a parameter. (This is a picky point: one that most
statisticians don't appreciate, and one that you don't have to worry
about. But unless the estimand of interest, say the population mean or
median, is represented by a parameter in a particular parametric model, I
wouldn't refer to it as a parameter. A lot of times we are interested
in a mean or median, but don't think that we're dealing with a
parametric model --- rather there is just some distribution of interest,
which is not necessarily a member of some parametric family.)