Comments About the Book
(and related material)
Note to STAT 554 students: Right now this web
page is mostly material that I developed for my 2001 summer seminar in robust statistics.
I may modify it to make it more appropriate for STAT 554 students later,
but even as is I think it'll help you to better understand the Wilcox
and August 24 meetings. I'll add other comments later.
(Generally, look for new comments each Sunday, but it may be later some weeks.)
Changes, or additional comments added after the meeting about the material,
are in green, and errors identified at or after
the meeting are shown in
bright red. Some of these modifications are due to comments by J. Gentle or A. Keesee, and others are just things that occurred to me after the meeting.
Since the book is rather elementary in places, I think it'll be best if we
use the book to generate discussion, but in our discussions try to go beyond
the level of the book. So in addition to reading my comments, try to always
come up with some remarks of your own.
Of course some students may want to use our time together to try to finally
gain a good understanding of matters that they are shaky about, and that'll be fine.
Even though the book is somewhat elementary in places, if at the end of the
summer we understand everything that's in it, then I think we can call the
seminar a huge success.
Finally, it should be noted that I'm not attempting to outline the chapters.
Rather, I'm making comments, many of the nature of a side remark, about the
material in the book --- but not necessarily about the main points presented in each chapter.
You'll need to let me know if there are parts of the book that you think need more attention.
1st meeting (June 22) : Preface, Ch. 1, & Ch. 2
THE GROWING GAP
- (p. viii) Wilcox states that "during the latter half of the twentieth
century, things began to change dramatically."
- If we go back 5 more years to the mid 1940s, then in addition to
robust and computationally-intensive methods, along with better exploratory
and diagnostic graphical techniques, also included would be some key developments.
It's important to note that development of new methods is linked to finding
fault with classical methods --- and so the increase in understanding of the
older methods is also important.
- (p. viii) Is there truly an "ever-increasing gap between state-of-the-art
methods versus techniques commonly used" and if so, who is to blame?
- Do the applied statisticians (many having inadequate training to start with)
just not keep up as they should? Are tired
old government workers too resistant to change?
- Are textbook writers too conservative --- not wanting to stray too far from the norm?
- Do members of the faculty fail to introduce the latest and the greatest?
Do they fail to instill the proper attitude in the students? (What is the proper attitude, anyway?)
- Are the researchers failing to be convincing? Have perhaps somewhat
unethical or just plain shabby researchers turned off mainstream applied
statisticians? (Unfortunately, Wilcox's 1997 book, along with his S-Plus functions, contain some errors that could mislead people.) It seems like some statisticians believe that if SAS doesn't
do it, it's not worth doing or it shouldn't be trusted.
- (p. viii) Wilcox claims that "various perspectives are not typically
covered in an applied course" and he's right.
- In STAT 554 I try to include a lot of methods that aren't
commonly used because I think they are good methods and because I want to
impress upon the students that things aren't nearly as simple as some lower-level
books and courses make them out to be. I include seldom-used techniques like
Johnson's modified t test, the Steel-Dwass test,
and the Harrell-Davis estimator, along with methods based on trimmed means.
I introduce M-estimators and discuss permutation tests. Along the way I
try to provide reasons for why these less commonly used methods should sometimes
be used. But each semester an important technique like the bootstrap only gets
between 0 and 60 seconds --- there just isn't time to describe bootstrap
methods and properly discuss their strengths and weaknesses.
- Not only does STAT 554 not get to address a lot of good methods, but
many students earning an M.S. degree may not be exposed to everything that they
perhaps should get in an M.S. program. (Should the curriculum be altered?
For the most part I think GMU has a wonderful M.S. program, but should greater
emphasis be placed on more modern methods, or would that come at the expense
of other things equally or perhaps more important?)
- (p. viii) Wilcox claims that "standard training in basic statistics does
not prepare the student for understanding the practical problems with
conventional techniques or why more modern tools might offer a practical advantage."
- I think I do a good job of this in STAT 554, but then I run short of
time when covering categorical data analysis and regression. (When trying to
include more ways to attack a given general problem in order to provide
performance in more situations (cases of the general problem), while at the
same time properly discussing the pros and cons of various methods and how
they compare, I don't have time to properly cover the topics towards the
end of the course. For M.S. students, who will take many other statistics
courses, maybe my way is a good way. But are others being properly treated?
(I don't want to ever go to a cook-bookish type of course, but sometimes I
do wonder if less would be more in a one semester course.))
- What do other universities do differently? (I suspect many place more
emphasis on theoretical issues, and it seems like that would leave less time
for a broader coverage of applied techniques. But are they spending more time
on newer approaches, and less time on classical methods?)
- (p. 1) What is meant by "arbitrarily small
departures from normality"?
Maybe the departures that cause trouble aren't so small if measured in a
different way. (E.g., two distributions can seem similar when Kolmogorov
distance is used to measure how much they differ, but they can have large differences
between some quantiles.)
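To make this concrete, here is a quick sketch (in Python; the mixture used --- 90% N(0,1) plus 10% N(0,100) --- is just an illustrative choice of mine, not an example from the book). The Kolmogorov distance between a standard normal and this contaminated normal is only about 0.04, yet their 0.999 quantiles differ enormously.

```python
from math import erf, sqrt

def Phi(x):
    # standard normal cdf
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def F_mix(x):
    # cdf of the contaminated normal: 0.9*N(0,1) + 0.1*N(0, 10^2)
    return 0.9 * Phi(x) + 0.1 * Phi(x / 10.0)

# Kolmogorov distance between the two cdfs (grid search)
ks = max(abs(Phi(i / 1000.0) - F_mix(i / 1000.0)) for i in range(-10000, 10001))

def quantile(F, p, lo=-200.0, hi=200.0):
    # invert a cdf by bisection
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if F(mid) < p:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

print(ks)                       # roughly 0.04
print(quantile(Phi, 0.999))     # about 3.09
print(quantile(F_mix, 0.999))   # about 23
```

So the two distributions are "close" in Kolmogorov distance while being wildly different out in the tails.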
- (p. 2) I wonder what "commonly recommended methods for dealing with
nonnormality have been found to be completely useless." Hopefully we'll
encounter more about this later in the book.
- I don't think he means standard nonparametric methods because they
work well in some situations.
- I think that transformations do horribly in some cases, but we shouldn't
rule out transformations altogether since they can be quite useful in some situations.
- Maybe he's thinking about cases where nonnormality is combined with
a nonnice heteroscedasticity structure (and small sample sizes).
- (p. 2) Wilcox states that "modern techniques stem from three major
developments", but I wonder how much most statisticians know about robust methods
and computationally intensive methods such as bootstrapping, projection
pursuit, CART, and MARS. Just how commonly used are these modern methods?
(Even among university faculty, although some may have been exposed to the
basic ideas, how many have enough experience in using the methods to
recommend when they should be used in place of other methods?)
- (p. 4) It's interesting how belief in the use of the sample mean led
Gauss to support the normal curve. (These days it's more the case that
belief in the normal curve lends support to the use of the sample mean.)
Also interesting that Stigler notes that Gauss's argument is faulty.
- (p. 4) Note that symmetry assumed a role as a key assumption as
a matter of convenience more than because empirical evidence suggested it.
- (p. 5) Reliance on CLT-based robustness arguments has been questioned
in recent years --- certainly the speculation (see p. 2) that 25 observations
should be adequate has been shown to be false. (Also, even if we have
robustness for validity, we don't necessarily have robustness for efficiency.)
- (p. 5) "In 1818, Bessel conducted the first empirical investigation that
focused on whether observations follow a normal curve." I wonder why it took so long.
- Although (see p. 3) the normal curve was first developed in
1733, that was because of its ability to provide an approximation to binomial probabilities.
- Uses of the normal curve to model nonbinary data seems to be due
(see top of p. 4) to matters of convenience. (The convenience factor
still exists today. Some statisticians are reluctant to admit that a
normal distribution isn't a good model for the error term distribution because
if they did they would no longer have the asymptotic optimality of MLEs and
UMVUEs to support the use of the simple least squares methods that they want to use.)
- It seems as though histograms would have suggested that some phenomena
follow at least approximately a normal curve, and so I would have
guessed that early on (before 1818) some comparisons would have been made.
- Note that this book gives information about the origin of some commonly
used terminology. Examples include the central limit theorem (p. 5) and the normal curve (p. 6).
- (p. 6)
Note that at one time Pearson thought that "nonnormal distributions were
actually mixtures of normal distributions and he proposed that efforts be made to
find techniques for separating these nonnormal distributions into what he
presumed were the normal components." But eventually "he abandoned this idea"
--- something that others have not done, since even in the 1990s research was still
being done on using mixtures of normal distributions for nonnormal distributions.
- (p. 7) In 1811, at age 62, Laplace, a supporter of the Bayesian point of
view, created the frequentist approach. In the 1800s the frequentist approach
gained momentum, and in the 1900s it was the dominant point of view, although in
the 1990s there was a surge of renewed interest in Bayesian methods. The renewed
interest was due in large part to improved computational techniques. (I don't
mean to go off on a rant here, but is
this renewed interest justified? Just because we can now easily do
something, does that mean we should do it? Shouldn't the New
Bayesians be asked to show that their methods give us superior inferences?)
- (p. 7) Note that Laplace developed the CLT-based confidence interval
approach in 1814, about 100 years before the work of Fisher and Gosset
("Student"). (Is this evidence to support the conjecture that now
knowledge about statistics is growing faster than before? Why was it that
"Student"'s contribution waited until 1908? (Of course one can't blame
people like Laplace, Gauss, and Cauchy --- those guys weren't idle!))
- We should review the goals of Part I of the book (see p. 11).
- (p. 11) Wilcox states that "here some important concepts and perspectives
are introduced that are not typically covered in an introductory course."
I think some of the things aren't even in most 400-level and 500-level courses.
- (p. 14) In many places in the book Wilcox refers to "the typical individual under study." This point of view seems to be common among social scientists.
They are often interested in the typical individual, and will choose between
estimators such as the sample mean and sample median to address the nature of
the typical individual. Whereas I tend to want a more specific focus: should
I be interested in the distribution mean, the median, or perhaps some quantile
other than the median? I think there are plenty of cases where the mean is
the proper target, even if it's out in the tail of the distribution and not
among the most likely outcomes. (Think about comparing two production methods
and wanting to choose the one that will yield the most product in a year, and
have data that are daily outputs. Should we focus on the mean or median? One
can also think about searching for places to drill for oil. Should we care about the
mean yield or the median yield?)
- (p. 15) Wilcox states that "probability curves are never exactly
symmetric" and yet it's interesting that
symmetry assumptions abound in statistics.
- (p. 16) I like Wilcox's simple description of outliers. (The book
returns to outliers in Ch. 3.)
- Be sure that you understand the concept of breakdown point.
Values are given for the sample mean (p. 17), the sample median (p. 19), the
weighted mean (p. 20 --- but he doesn't address the common case of the
trimmed mean (maybe because he doesn't consider trimmed means to be weighted
means (see p. 51))), the sample variance (pp. 21-22), and the least squares estimate
of the slope (p. 29).
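A tiny sketch of the breakdown idea (in Python, with made-up data): corrupting a single observation can drag the sample mean arbitrarily far, while the sample median barely notices.

```python
import statistics

data = [9.1, 9.4, 9.8, 10.0, 10.2, 10.5, 10.9]  # hypothetical data
bad = data[:-1] + [10.9e6]                      # one wild observation

# the mean follows the single corrupted value (breakdown point 1/n),
# while the median is unchanged (breakdown point 0.5)
print(statistics.mean(data), statistics.mean(bad))
print(statistics.median(data), statistics.median(bad))
```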
- The population median is presented on p. 17. Can you sketch a continuous
pdf for a distribution that has infinitely many values satisfying the
definition given for population median? What about a pmf for a discrete distribution that has no value satisfying the definition?
- (p. 17)
Wilcox has "the most common method for estimating the population median is with the so-called sample median" --- and this is an example of a commonly used
method not being the best method in most cases (since it's usually the case that some other estimator will be better than the sample median).
- (pp. 18-19) A brief discussion of sample mean vs. sample median for
symmetric distributions is given. There is a lot more that could have been added (but he's easing us into things).
- (p. 19) It's pointed out that the breakdown point is just one consideration and that there are others (like average accuracy).
- (p. 22) That "the low breakdown point of the variance turns out to be especially devastating" is something we'll want to look for later in the book.
- (p. 23) The concept that the choice of loss function can make a difference
is introduced. For guessing heights the absolute error seems just as sensible
as the more commonly used (in general) squared error (and for carnival games
(guessing heights, weights, or ages) the proper loss function is often of
the 0-1 nature). Often there is no clear best choice for a loss function.
- (p. 24) It's interesting that Ellis touched upon the general idea behind M-estimators as early as 1844.
- (p. 25) Note that the alternative to least squares, the average pairwise slope, dates back to at least 1750 --- about 50 or 60 years before least squares.
Boscovich's development of the LAD (aka LAV, LAR, L1) method (described on pp. 26-27) occurred in 1757 (so also before least squares).
- (p. 27) ERROR IN BOOK: In figure 2.5 (the figure
itself, not the caption) the point labels should have arc length instead.
- (p. 29) ERROR IN BOOK: It should be all
five choose two (or just 10) pairs of bivariate points instead of all ten pairs of slopes.
2nd meeting (June 29) : Ch. 3 & Ch. 4
- (p. 31) Wilcox states "in recent years it has become clear that this
curve can be a potential source for misleading --- even erroneous ---
conclusions" --- but older papers (see Miller's Beyond ANOVA for
references to some
earlier works) tend not to strongly suggest that
there are huge problems due to nonnormality. (Perhaps statisticians were
more forgiving about weaknesses in standard procedures due to a lack of
convenient and better alternatives.)
- (p. 32) ERROR IN BOOK: He should have The equation for the family of normal curves instead of "The family of equations
for the normal curve."
- (p. 32) ERROR IN BOOK: I don't think e
should be referred to as Euler's constant, since what is commonly
called Euler's constant (commonly
denoted by lower-case gamma) is the limit of
1 + 1/2 + 1/3 + 1/4 + ... + 1/n - log n
as n tends to infinity, and is about 0.577.
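A quick numerical check of that limit (Python):

```python
from math import log

def harmonic_minus_log(n):
    # 1 + 1/2 + ... + 1/n - log(n), which tends to Euler's constant
    return sum(1.0 / k for k in range(1, n + 1)) - log(n)

print(harmonic_minus_log(10**6))  # approaches 0.5772...
```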
- (pp. 32, 33, 34, & 60) ERRORS IN BOOK:
Wilcox should not use exactly when referring to the probabilities (like
0.68 and 0.954) pertaining to normal distributions, since the values he gives
are not the exact probabilities.
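The exact values are easy to compute from the error function (a quick Python check):

```python
from math import erf, sqrt

def normal_central_prob(k):
    # exact P(|Z| <= k) for a standard normal Z
    return erf(k / sqrt(2.0))

print(normal_central_prob(1))  # 0.6826..., not exactly 0.68
print(normal_central_prob(2))  # 0.9544..., not exactly 0.954
```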
- (p. 35) MAD is introduced. It is a relatively widely used alternative
measure of scale. (Some other alternatives are described in the book
Understanding Robust and Exploratory Data Analysis by Hoaglin, Mosteller, and Tukey.)
- (pp. 34-37) Three methods of identifying outliers are described. I
encourage you to test your understanding by confirming the values given
in the table
below, which are the approximate proportions of outliers one should expect
when applying the various methods to large samples from several different
distributions. (Notes: Below, MADN is MAD/0.6745. Also, the ------ entry in
the Cauchy row is due to the fact that the standard deviation doesn't exist.)
It can be noted that for the boxplot method, the outliers from symmetric
distributions are the values that
are about (for large samples) at least 0.6745*4 (or just about 2.7)
MADNs from the median, and thus, for symmetric distributions at least,
the boxplot method will tend to identify
fewer values as outliers than will the 2nd method (which utilizes 2*MADN).
I generally prefer the boxplot method. (One reason for this preference
is due to the way the boxplot method deals with skewed distributions.)
(Note: For an exponential distribution having a mean of 1, I get that the MAD
should converge to about 0.4812.)
distribution || 2 sd from mean || 2 MADN from median || boxplot method
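As a check on the exponential claim above: the population MAD m solves F(median + m) - F(median - m) = 1/2, and for an Exp(1) distribution (median = log 2) this reduces to sinh(m) = 1/2, so m = asinh(1/2). A quick Python verification:

```python
from math import asinh, exp, log

med = log(2.0)    # median of the Exp(1) distribution
mad = asinh(0.5)  # solves sinh(m) = 1/2
print(mad)        # about 0.4812, matching the value above

# check against the defining equation F(med + m) - F(med - m) = 1/2
lhs = (1 - exp(-(med + mad))) - (1 - exp(-(med - mad)))
print(lhs)
```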
- (p. 36) ERRORS IN BOOK:
At the very top of this page,
in the continuation of the example started on p. 35, there are two 7s in the
1st line that should be 8s (since M equals 8), and also in the 1st line
there is a 5 that should be a 6, a 3 that should be a 4, and a 1 that should
be a 2. In the 2nd line, the value of MAD should be 4.
- (pp. 38-44) I don't have any comments about the section on the CLT, but
let me know if you have any questions about it.
- Wilcox indicates that approximate normality for the sampling distribution
of a statistic can suffer due to outliers if the breakdown point of the statistic is low.
- Several important terms and facts are given that you should make sure that you
are comfortable with:
- mean squared error (p. 50);
- homoscedasticity and heteroscedasticity (p. 56);
- Gauss-Markov theorem (p. 56);
- squared standard error (equal to the variance of a statistic), and of course standard
error (p. 59);
- expression (4.1) (p. 60).
- (p. 49) Some main goals are to understand the result of the Gauss-Markov
theorem, to understand Laplace's confidence interval method based on
approximate normality, and to develop an appreciation of the role (and
potential weaknesses) of
homoscedasticity assumptions in applied statistics.
(Comment: In Beyond ANOVA, Miller is perhaps too willing to
ignore heteroscedasticity in some cases.)
- (p. 51) Note that with regard to the Gauss-Markov theorem, the class of
weighted means does not include trimmed means and the sample median --- the weights
are not assigned to the order statistics.
- (p. 53) At the bottom of the page Wilcox refers to a property that many robust estimators have: performance almost as good as parametric estimators for
normal distributions when the parent distribution for the data is normal, and
improved performance (sometimes to a large degree) in many settings in which
the parametric estimators perform suboptimally.
A similar point is made on p. 65 --- it would be nice if an estimator works
well in the presence of heteroscedasticity, and yet also works nearly as
well as estimators derived under an assumption of homoscedasticity when in
fact there is no heteroscedasticity.
- (pp. 54-55, paragraph right before the regression section) One might expect the sample median to outperform the sample
mean if f(eta), the value of the pdf at the median, is greater than
1/(2*sigma). This condition is met for the Laplace distribution, and
other distributions that are sufficiently peaked at the median, as well as
fairly extreme contaminated normal distributions, and other distributions
that have a standard deviation that is rather large relative to the dispersion
near the median. But lots and lots of distributions that have heavier tails
than a normal distribution are such that the sample mean is superior to the
sample median, and in some cases for which the sample median is superior to
the sample mean, some other estimator will be superior to the sample median.
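A small Monte Carlo sketch of the Laplace case (Python; the sample size and number of replications are arbitrary choices of mine): the sampling variance of the median comes out well below that of the mean, consistent with the asymptotic efficiency ratio of 2 for this distribution.

```python
import random
import statistics
from math import log

random.seed(1)

def laplace():
    # standard Laplace (double exponential) draw via the inverse cdf
    u = random.random()
    s = 1.0 if u >= 0.5 else -1.0
    return -s * log(1.0 - 2.0 * abs(u - 0.5))

reps, n = 2000, 50
means, medians = [], []
for _ in range(reps):
    x = [laplace() for _ in range(n)]
    means.append(statistics.mean(x))
    medians.append(statistics.median(x))

v_mean = statistics.variance(means)
v_median = statistics.variance(medians)
print(v_mean / v_median)  # noticeably above 1 (the asymptotic value is 2)
```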
- (p. 56) ERROR IN BOOK:
Something is off in the description pertaining to Figure 4.3. One way to fix
it would be to make the 10 (in X + 10) a 2.
- (p. 58) Wilcox refers to a strategy based on estimating how the variance
of Y changes with X. What does J. J. Miller teach about
adjusting for heteroscedasticity in cases for which a transformation of
Y doesn't stabilize the variance?
- (p. 58) Wilcox has "the notion of a confidence interval was not new" but
fails to indicate what methods were available prior to the one developed by
Laplace. I wonder what methods existed prior to 1814.
(I'll guess that some Bayesian
intervals similar to confidence intervals were in existence.)
- (p. 60) Wilcox has "a random sample means that all observations are
independent." But what about the term simple random sample? Can't that
be used to refer to a randomly selected subset of a finite population, and thus be obtained
by making observations that are not independent?
- (p. 62) Wilcox points out that "a practical issue is the accuracy of any confidence interval we compute." (Unfortunately, many times people use methods
in situations in which good accuracy cannot be expected (often when large
sample sizes are called for to justify a certain method, and yet the samples
at hand are small).)
- (p. 63) Wilcox has "to avoid mathematical difficulties, we make a convenient assumption and hope that it yields reasonably accurate results."
I think some people hope for too much! (It's not that I never rely on the
robustness of classical procedures, but I always try to assess if such
reliance is warranted.)
- (p. 64, 1st full paragraph) To test for dependence, we can assume
homoscedasticity (because if the null hypothesis is true, we must have
homoscedasticity). But if we want to make a good estimate of the slope,
we should not necessarily assume homoscedasticity. (If we're giving an
estimate of the slope, then we're clearly not making a firm commitment to
the null hypothesis being true.) But Wilcox seems to suggest that perhaps
a better test can be performed if heteroscedasticity is allowed for. (When
we do a test based on an assumption of homoscedasticity, and heteroscedasticity
causes the test to reject, then we're okay (no Type I error) since if there
is heteroscedasticity then the null hypothesis (in the case under
consideration) is not true. But what if the nature of the heteroscedasticity
results in a test having low power to reject when a rejection is warranted?
(Should our concern be with Type II errors?)) It's also worth noting that if
the null hypothesis is one of a slope of zero, and not independence, then
one should not necessarily assume homoscedasticity under the null hypothesis.
- (p. 65, 1st and 2nd lines)
If the underlying distribution is close enough to a normal distribution,
and one uses t critical values,
then one doesn't necessarily have to assume that "an accurate estimate of
the variance has been obtained" since a confidence interval based on
t takes the uncertainty of the variance into account.
- (p. 65, 2nd and 3rd lines) ERROR IN BOOK:
I think Wilcox is wrong to refer to a specific interval (an interval resulting from a particular set of data) and claim that "there is a .95 probability that
this interval contains the true slope." (In my opinion, the 95% confidence
results from the success rate of the method, but a particular result of the
method either does or does not (so probability 1 or 0) contain the estimand.
(In some situations I just do not like having to deal with ignorance of the truth.))
- (p. 65, paragraph right before the summary)
When doing regression, what do you currently do to check the assumptions?
3rd meeting (July 6) : Ch. 5
- (p. 67, 3rd to last sentence) Wilcox indicates that some of the standard
methods are robust for validity (to use Rupert Miller's terminology)
under certain circumstances (but they are not all that robust for tests
about means, and even in situations where they are robust for validity,
they are not necessarily robust for efficiency).
- (p. 68) For the trivia game at next year's picnic, make a note that
Egon Pearson is the son of Karl Pearson.
- (p. 69) Wilcox uses beta for the probability of type II error,
whereas a lot of books use it for power.
- (pp. 70-71) Make sure you understand the 4 bulleted points about how
power relates to other factors.
- (p. 71, middle of page) ERROR IN BOOK:
Wilcox doesn't define unbiased hypothesis testing method correctly.
- (p. 71, last full paragraph) ERROR IN BOOK:
I get that the values should be .328 (instead of .148) and .775 (instead of
- (p. 72, paragraph in middle of page) ERROR IN BOOK:
Wilcox seems to be associating type II errors with values of the sample mean
less than 35, which is okay (I suppose), but one should keep in mind that
the test will also fail to reject if the sample mean assumes some values
greater than or equal to 35 (since the sample mean has to be far enough above
35 in order to result in a rejection).
- (p. 72, 6 lines before the new section) ERROR
IN BOOK: Should be larger, not "higher."
- (p. 73)
The figure isn't real good if the desire is to illustrate the difference in
power because the rejection regions aren't shown.
- (p. 73, 1st line after (5.1)) ERROR IN BOOK:
The word theorem should follow central limit.
- (p. 73) It could be mentioned that Slutsky's theorem, as well as the
central limit theorem, plays a role in the asymptotic normality of T.
- (p. 74, last sentence of paragraph at top of page) Wilcox makes a good
point: one shouldn't just think of nonnormality in terms of outliers, since
things like skewness can have an appreciable effect even if there isn't a
large percentage of outliers.
- (p. 74, 1st full paragraph) I find it interesting that Gosset first used
Monte Carlo results to investigate the sampling distribution of T, and
then attempted a mathematical derivation.
- (p. 74, first sentence of paragraph at bottom of page) I wonder how
Fisher's "more formal mathematical derivation" differed from what Gosset did.
- (p. 76) The lognormal distribution referred to in the figure and in the
text is one having parameters 0 and 1 (and so, letting X be the lognormal
random variable, we have X = e^Z, where Z is a N(0,1)
random variable). The skewness of this lognormal distribution is about 6.18,
which is a fairly severe degree of skewness (even though Wilcox refers to
this particular skewed distribution as being light-tailed (and so he's
focusing on outliers, and not skewness)).
- (p. 77) Notice how spread out the variance values are in the figure.
It turns out that the true variance can be shown to be about 4.7 for the
lognormal distribution used (and so for one sample, the sample variance is
more than 50 times larger than the true variance --- indicating that outliers
can have a huge effect (even though Wilcox indicates the distribution isn't
highly prone to yielding outliers)).
- (p. 78, 2nd paragraph) I have some similar results that I can present
concerning type I errors when using Student's t test when the parent
distribution is skewed.
- (p. 78, last paragraph) I like Bradley's point of view concerning
type I errors (actual vs. nominal).
- (p. 79, FIGURE 5.5) Can anyone explain why the distribution of the test
statistic is negatively skewed, while the parent distribution for the data
is positively skewed?
- (p. 79) Here are some other facts about the lognormal distribution
being used (that has a skewness of about 6.2): the kurtosis (with the -3
included) is about 111, even though the percentage of outliers in large
samples would typically be small (about 7.8% outliers using the boxplot
method, and about 3.7% outliers using the 2 standard deviation rule). I
think it's not just the percentage of outliers that we have to focus on ---
another important factor is how extreme the outliers are!
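For reference, the standard moment formulas for a lognormal(0,1) distribution reproduce the numbers quoted above (writing w = exp(sigma^2) = e; with mu = 0 the exp(2*mu) factor in the variance is 1):

```python
from math import e, sqrt

w = e  # exp(sigma**2) with sigma = 1

variance = (w - 1) * w                        # about 4.67 (the "about 4.7" on p. 77)
skewness = (w + 2) * sqrt(w - 1)              # about 6.18
excess_kurtosis = w**4 + 2*w**3 + 3*w**2 - 6  # about 110.9 (the "about 111")

print(variance, skewness, excess_kurtosis)
```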
- (p. 80, paragraph at top of page) There is an asymptotic formula that relates
the skewness of the parent distribution and the sample size to the expected
value of T under the null hypothesis that the mean equals a specific value.
- (p. 81, FIGURE 5.6) The bootstrap estimate of the sampling distribution
is obtained using Monte Carlo results, much like the way the estimated
sampling distribution indicated in FIGURE 5.5 was obtained --- only here the
empirical distribution is taken to be the true distribution underlying the data,
and so we have another source of inaccuracy --- not only can the Monte Carlo
samples fail to be truly representative of the distribution from which they are
being drawn, but also the estimate of the true parent distribution (the
empirical distribution) may not be real good.
- (p. 82, middle paragraph) Wilcox seems a bit vague about the role of transformations here (but things get cleared up a bit later on).
- (p. 84, 3rd line from top) ERROR IN BOOK:
Should be .05, not .95.
- (p. 84, 4th line from top) W is sometimes referred to as Welch's statistic.
- (p. 85, 1st bullet) The recommendation to "avoid Student's T" seems a bit too strong to me. I think situations can arise where I'd use Student's
two sample t test.
- (p. 86, 1st bullet) I find this statement interesting, especially in light
of the explanation supplied in the first full paragraph of p. 87.
- (p. 87, last 4 sentences) Wilcox seems to indicate that Student's t
test should be robust for validity for the general two sample problem, but
one can in fact have serious inflation of the type I error rate if the common
distribution is skewed and one sample size is rather small while the other one
is rather large (it's pretty much the same type of thing as for a one sample test).
- (p. 89, paragraph at top of page) I like Wilcox's suggestion that the
results of several procedures can be examined and the consistency of the
resulting conclusions be useful in giving an indication as to the
trustworthiness of the procedures (but it would take some experience to
get good at doing this sort of thing properly).
- (p. 90, 2nd bullet, 2nd to last sentence) ERROR IN BOOK:
In his statement about the possibility of a biased test, he has the false
and the true reversed.
- (p. 90) That there are general conditions under which Student's T
does not converge to the right answer may have been derived by Cressie and
Whitford, but the result was known by others prior to the publication of their
1986 paper. A 1983 draft of Rupert Miller's Beyond ANOVA: Basics of
Applied Statistics, which was first published in 1986,
included results that indicated the problem with the two sample t test.
(Wilcox also gives Cressie and Whitford credit on p. 87 and
p. 93 while not indicating
that they weren't necessarily the first to know of the result.)
- (p. 91, near top) A situation is described in which the use of transformations might be okay (but not necessarily optimal).
4th meeting (July 13) : Ch. 6
- (p. 94) Wilcox indicates that he will cover two bootstrap methods. These
are the percentile method and the percentile t method (often referred
to as the bootstrap t method). We should spend some time discussing
these methods. Also, I can describe how bootstrapping can be used to obtain
estimates of standard error and bias.
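Here is a minimal sketch of those two uses of the bootstrap (Python; the data set, the choice of the median as the statistic, and B are all arbitrary choices of mine): resample with replacement, recompute the statistic each time, and use the spread and the shift of the bootstrap values.

```python
import random
import statistics

random.seed(0)

x = [9.2, 10.1, 11.5, 8.7, 13.0, 9.9, 10.4, 12.2, 7.8, 10.8]  # hypothetical data
theta_hat = statistics.median(x)

B = 2000
boot = []
for _ in range(B):
    resample = random.choices(x, k=len(x))  # sample n values with replacement
    boot.append(statistics.median(resample))

se_boot = statistics.stdev(boot)               # bootstrap estimate of standard error
bias_boot = statistics.mean(boot) - theta_hat  # bootstrap estimate of bias
print(se_boot, bias_boot)
```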
- (p. 96, 1st sentence) Wilcox indicates the parent distribution
of the data can be estimated from the data. Of course, of chief concern is how
well it can be estimated with smallish samples.
- (p. 98) Wilcox has that the percentile confidence interval "can be
theoretically justified" in large sample settings. I'm not so sure about
that if the parent distribution is skewed.
The book by Efron and Tibshirani indicates that the percentile interval is
"backwards" and gives no good justification for it (other than it seems to
work okay in some situations). We can talk about this.
Note that the asymptotic accuracy result indicated on p. 104 pertains to the
bootstrap t method, not the simple percentile method.
- (p. 98) Note that large samples are required for the percentile method to
work well for making an inference about the mean. So clearly this bootstrap
method cannot always be relied upon to be accurate.
- (p. 100) In (6.1), do you understand why the sample mean of the original
sample is being subtracted from the sample mean of the bootstrap sample?
- (p. 100) ERROR IN BOOK:
In computing the value of T*, Wilcox has
the values 18.6 and 18.1 reversed (see p. 96, towards bottom of page), and so the value of
T* should be -0.19.
- (p. 101) Note that a sample size of at least 100 is recommended
for the percentile t method (and B = 999 is
recommended). So clearly this bootstrap
method cannot always be relied upon to be accurate.
- (p. 101) A motivation for using 999 for B, instead of 1000, is that
999 T* values divide the real numbers into 1000 intervals.
- (p. 101) Hall's 1986 article indicates that coverage probability doesn't
suffer so badly when a small value is used for B, but with a small
value of B we put ourselves at risk of getting an interval that is longer than it needs to be.
- (p. 250) ERROR IN BOOK:
The correct page numbers for this article by Hall are 1453-62.
- (p. 102, 1st line) Rather than round .025B and .975B
to nearest integer, it'd be better to pick a value of B so that
0.025(B + 1) and 0.975(B + 1) are integers, and use these
values for L and U.
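A small sketch of that rule, using exact rational arithmetic to dodge floating-point surprises (the function name is mine): 0.025(B + 1) and 0.975(B + 1) are both integers whenever B + 1 is a multiple of 40.

```python
from fractions import Fraction

def percentile_cutoffs(B, alpha=Fraction(5, 100)):
    """1-indexed positions (L, U) of the ordered bootstrap values that
    bound a 1 - alpha percentile-type interval, assuming B was chosen
    so that (B + 1) * alpha / 2 is an integer."""
    lo = Fraction(B + 1) * alpha / 2
    hi = Fraction(B + 1) * (1 - alpha / 2)
    if lo.denominator != 1:
        raise ValueError("choose B so that (B + 1) * alpha / 2 is an integer")
    return int(lo), int(hi)

# B = 599 suits a 95% interval, since B + 1 = 600 is a multiple of 40
assert percentile_cutoffs(599) == (15, 585)
assert percentile_cutoffs(999) == (25, 975)
```

Note that this gives U = 585 for B = 599, which is relevant to the U = 585 versus 584 issue that comes up again in Ch. 9.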
- (p. 102) ERRORS IN BOOK:
Wilcox has 11.4 as the sample standard deviation for the sample given on
p. 96 (near the top) and repeated on p. 100,
but this value is actually about 11.14. Also,
it isn't clear to me where the values 2.08 and 2.55 are coming from.
Based on p. 100 and the caption of Fig. 6.3 on p. 101, it seems like these
values should be 2.14 and 2.01. Finally, the interval obtained from
assuming normality should be (13.4, 23.8) instead of (13.3, 23.9).
- (p. 102, 3rd line from bottom) ERROR IN BOOK:
The interval obtained from
assuming normality should be (2.2, 12.7) instead of (2.2, 12.8).
- (p. 104, 1st two lines) Wilcox's advice "to always use the percentile
t bootstrap when making inferences about a mean" seems questionable!
Did he consider Johnson's modified t test for tests about the mean
of a skewed distribution? And what if n is small? (Doesn't Wilcox
indicate that n shouldn't be too small when using bootstrapping? Does
he believe that a bootstrap method is best if the sample size is only 10?)
- (p. 105) I think it is bad that Wilcox just compares bootstrapping to
Student's t, when in many situations Student's t isn't the
best nonbootstrap method to use.
- (p. 105 and p. 107) It cracks me up that Wilcox refers to
"quantitative experts" (p. 105) and "authorities" (p. 107).
- (p. 108) To me it would make a lot more sense to use robust estimates
of the slope in the bootstrap procedure. (If you believe normality, it's
not clear that the bootstrap is needed, and if you worry about
nonnormality, I think a robust estimation procedure would be better.)
It would be interesting to study this with a Monte Carlo study. (It may
be that some fine-tuning would be called for, as is described at the top
of p. 109.)
- (p. 109, last paragraph) Do you understand how "the information
conveyed by the correlation coefficient differs from the least squares estimate of the slope"?
- (p. 112, at the top of the page) What do you think makes the actual
size of the test exceed 0.05?
- (p. 113) Note that the breakdown point of the correlation coefficient is very low --- a single sufficiently bad data point can greatly change its value.
- (p. 115) Note that in some situations the percentile method works better, and
in other situations the bootstrap t works better.
- (p. 115) ERROR IN BOOK:
Wilcox refers to Section 6.5 and Section 6.6, and yet the book has no numbered sections.
5th meeting (July 20) : Ch. 7
- (p. 118, about 1/2 way down page) Wilcox gives a reference for a "method
for answering the question
exactly", making it perhaps seem like something nontrivial --- whereas in fact
it is rather basic probability stuff (but I guess Wilcox doesn't assume his
typical reader knows much about probability). Does everyone understand how
the probability can be easily determined?
- (p. 119, at top of page) Can you show that if one approximates the
contaminated normal cdf with the standard normal cdf, the maximum error
is about 0.04? (It's a fun little probability exercise.)
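If you want to check your algebra numerically first, here is a quick sketch. I'm assuming the contaminated normal Wilcox uses elsewhere, the mixture 0.9 N(0,1) + 0.1 N(0, 10^2); the maximum gap between the two cdfs comes out near 0.04, at x a bit above 2.

```python
from math import erf, sqrt

def Phi(x):
    """Standard normal cdf via the error function."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def contam_cdf(x, eps=0.1, k=10.0):
    """cdf of the contaminated normal (1 - eps) N(0,1) + eps N(0, k^2)."""
    return (1.0 - eps) * Phi(x) + eps * Phi(x / k)

# maximum absolute difference between the two cdfs over a fine grid
max_err = max(abs(contam_cdf(i / 1000.0) - Phi(i / 1000.0))
              for i in range(-8000, 8001))   # roughly 0.04
```

The error is |eps| times |Phi(x/k) - Phi(x)|, so setting the derivative to zero locates the maximizing x; the grid search above just confirms the calculus.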
- (p. 120) Wilcox has that "the population variance is not robust", and by this he means that slight changes (measured in
certain ways) in the distribution can result in large changes in the
variance. Usually we think of robustness as relating to statistics, but
here he's applying the idea to a distribution summary measure.
- (p. 121, at top of page) The 0.96 comes from a one-sided test. (The way
it is worded, one might think that a two-sided test was under consideration.)
- (p. 121, towards bottom of page) It's hard to confirm the 0.28 value
without doing a Monte Carlo study since the sample sizes are too small to
count on the test statistic having approximately a standard normal or t
distribution with the underlying distribution of the data being so nonnormal.
- (p. 121, last 2 lines) ERROR IN BOOK:
It should be Chapter 5 instead of "Chapter 4" (see pp. 71-72 of Ch. 5).
- (p. 123, at top of page) The first sentence states a key idea!
- (p. 123, top half of page) The first full paragraph reminds us that
although one may have robustness for validity with large enough sample sizes,
one need not have robustness for efficiency in all such situations.
- (p. 123, 8 lines from bottom) ERROR IN BOOK:
Wilcox refers to the normal distribution as being light-tailed, but I think
it's better to think of the normal distribution as having neutral tails.
- (p. 127) The desire to label effect size seems big in psychology (Wilcox
is in a psychology department, and I've encountered this when dealing
with psychologists at GMU). I tend to wonder about the power of
detecting differences of practical concern for the situation under consideration,
but I think the magnitudes involved differ from situation to situation and don't
think in terms of preset definitions.
- (p. 132) At the end of the first paragraph of the new section,
I think it's the case
that while bootstrapping may result in improved accuracy when applied to
normal-theory test statistics used with nonnormal data, the power could still be poor because the test statistic is ill-suited for the task ... it's defective (as Wilcox suggests).
6th meeting (July 27) : Ch. 8
- (p. 139) ERROR IN BOOK: The first sentence
of the second paragraph needs to be reworded ("particularly the population
variance" seems out of place).
- (p. 140, top portion of page (and p. 158)) Unfortunately, Box's paper
has led some to believe that unequal variances are of little concern with
one-way ANOVA. (Both Rupert Miller and John Miller seem to extract this
lesson from the paper.) But Wilcox points out that Box considered only rather
tame cases of heteroscedasticity. Wilcox's 1997 book includes some
numerical results indicating that type I error rate can well exceed nominal
level if variances differ by enough (and unequal sample sizes serve to
aggravate the problem). Also of concern is the fact that power characteristics
can be screwy even if equal sample sizes serve to make actual size of test
close to nominal level. (Those of you who have taken STAT 554 should be
somewhat familiar with this phenomenon.)
- (p. 140, towards bottom of page) Wilcox suggests that problems due to
heteroscedasticity are underappreciated by applied researchers who use
statistical methods. My guess is that they are also underappreciated by most
statisticians holding graduate degrees in statistics. Perhaps lots of
statisticians are aware that there are some problems related to
heteroscedasticity, but many may not be adequately trained in how to deal
with such problems. (Often the semester ends before courses can address such matters.)
- (p. 141, at top of page) The first full paragraph describes a main goal
that is addressed in Chapters 8 and 9.
- (p. 141, last paragraph before new section) Wilcox has "At some point doubt arises as to whether the population mean provides a reasonable measure of what is typical." While this may be true, it may be that the focus should be
on the mean even if the mean doesn't correspond to a typical value. For more on this, read the 3rd item in my notes on Ch. 2 above.
- (p. 141, last 4 lines) The two classes of estimators are covered in
Ch. 8 (intro material) and Ch. 9 (using such estimators to make inferences).
- (p. 142, first paragraph) This paragraph indicates that the sample mean
and the sample median are extreme examples of trimmed means. I'll also
point out that they also fall into the class of M-estimators.
- (p. 142, first paragraph) The last sentence of this long paragraph
suggests that the relative merits of different degrees of trimming should be
considered, but Wilcox doesn't give a lot of information which would
allow us to get a good feel for how the degree of trimming affects performance.
- (p. 143, near top of page) The suggestion to trim 20% follows from a
strategy to trim as much as possible for protection against the ill effects
due to heavy tails while still being competitive in the ballpark of normality.
Studies that I have done suggest that 10% trimming is better than 20% trimming
in a lot of heavy tail situations. While 20% trimming does indeed do better
than 10% trimming in rather extreme heavy tail situations, unless the data
suggested that the case at hand may be such an extreme case, I'd prefer to
trim just 10%. (5% trimming may be slightly better if the underlying distribution
is only slightly heavy-tailed, but performance wouldn't drastically suffer
if 10% was trimmed in such situations.)
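A quick Monte Carlo along these lines is easy to set up. In the sketch below the contaminated normal, the sample size, and the function names are all my choices for illustration; it estimates the MSE of a gamma-trimmed mean so that different trimming proportions can be compared:

```python
import random

def trimmed_mean(x, gamma):
    """gamma-trimmed mean: drop the int(gamma * n) smallest and largest
    values, then average what's left (gamma = 0 gives the sample mean)."""
    s = sorted(x)
    g = int(gamma * len(s))
    kept = s[g:len(s) - g] if g > 0 else s
    return sum(kept) / len(kept)

def mc_mse(gamma, n=20, reps=4000, seed=1):
    """Monte Carlo MSE of the gamma-trimmed mean when sampling from the
    contaminated normal 0.9 N(0,1) + 0.1 N(0, 10^2) (true center 0)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(reps):
        x = [rng.gauss(0.0, 10.0 if rng.random() < 0.1 else 1.0)
             for _ in range(n)]
        total += trimmed_mean(x, gamma) ** 2
    return total / reps
```

In this particular setup both mc_mse(0.1) and mc_mse(0.2) come in well below mc_mse(0.0); which of 10% and 20% wins depends on how heavy the tails are, which is the point at issue.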
- (p. 145, Figure 8.2) I think this figure is a poor way to compare the
performances of the estimators since it truncates the tail behavior. He
should also give estimated MSEs and MAEs and/or supply some information
about the proportions of large errors of estimation. (Maybe over 99% of the
time the estimators would supply about the same quality
of estimates --- but if so
then perhaps the focus should be on the less than 1% of the instances in which
at least one of the estimators performs badly.)
- (p. 147) The last sentence of the paragraph that continues at the top of
the page suggests that trimmed means can be used for tests and confidence
intervals even if the underlying distribution is skewed. But I think that we'll
see in Ch. 9 that we must be content with making inferences about the
population/distribution trimmed mean instead of the population/distribution
mean, and of course (as Wilcox points out in several places) these
population measures can differ (and so one should give some thought as to
what it is that you really want to make an inference about).
- (p. 148, bottom portion of page) We are reminded that outliers can be
of interest --- but it is also suggested that outliers can get in the way when
the goal is to learn something about the bulk of the members of a group.
- (p. 149) Wilcox refers to
"three criteria that form the foundation of modern
robust methods." Information about these can be found in Wilcox's 1997 book.
Qualitative robustness pertains to the sensitivity of a statistic or
distribution measure to small changes ---
can small changes in data or distribution
result in large changes in value of statistic or distribution measure?
Infinitesimal robustness is similar --- it also deals with sensitivity
to small changes. But with this concept the effect of a small change is
described using an influence function (and a bounded influence function
results in good robustness).
- Quantitative robustness pertains to breakdown points.
- (p. 150) I think that it may be easier to get a feel for M-estimators
if a description based on the penalty function (rho) is given as opposed to
a description based on the influence function (psi). I can offer you such
a description when we meet.
- (p. 151, bottom portion of page) Although it isn't entirely bad to
think of M-estimators as ones that down-weight or ignore extreme observations,
with the Huber M-estimator it is perhaps more accurate to say that for
observations far from the bulk of the data, the "excess" distance away is
ignored (or the distance away is down-weighted).
- (p. 152, top half of page) The first two paragraphs on the page describe
the overall strategy. The last several sentences of the 2nd paragraph
describe a key part of the strategy.
- (p. 152, 3rd paragraph) To eliminate the biweight (aka bisquare) brand
from consideration seems a bit too extreme. I've found that if the tailweight
of the underlying distribution is heavy enough, the biweight variety is better
than the Huber variety of M-estimator. Sure one has to worry about lack of
convergence to a sensible value. But one could always compute Huber's
M-estimate as a check, and think of the biweight estimate as a slightly
superior estimate if indeed it is not drastically different from the Huber
estimate (and if they differ by more than a
bit, I'd take a careful look into the situation).
- (p. 152, 3rd paragraph; & p. 158) I think Wilcox puts too much emphasis
on the conclusions of the Freedman and Diaconis article. They show that
M-estimators converge to the correct value for symmetric unimodal (there are
some restrictions) distributions, but
that redescending M-estimators need not be consistent for
multimodal distributions. So if the underlying distribution is not
multimodal, maybe redescending M-estimators can do okay. (Of course,
consistency is an asymptotic result, and so maybe we need to be a bit concerned with smallish sample sizes.)
- (p. 152, last paragraph) Some explanation is given for the desire to
incorporate a measure of scale into the M-estimation procedure. We can
discuss the matter more when we meet.
- (p. 153, top portion of page) Wilcox indicates that "quantitative experts"
suggest setting K to the value 1.28. Other values that have been
suggested by reasonable people are 1.345 and 1.5. (1.28 is about
z0.1. 1.345 is the value that results in an ARE of 95%
when the underlying distribution is normal. 1.5 was suggested in Huber's
original 1964 paper (and is favored by Birkes and Dodge).)
- (p. 153, bottom half of page) I agree that the one-step Huber M-estimate
of location is nearly as good as a fully iterated one. I wonder if there is
a simple one-step estimate to use for an estimate of the slope parameter in
simple regression based on Huber's M-estimator, or if one-step versions exist
for the biweight estimate. (Clearly, one could stop after a single iteration,
but I wonder if this results in a relatively simple closed-form estimate.)
(Question: If no closed form estimate exists (say for an mle or an M-estimate),
is it okay to refer to an estimator? Surely we'd have an estimation
procedure, and one could assess things like unbiasedness and consistency for
the procedure, but I wonder if it's okay to use the term estimator in such cases.)
- (p. 153, bottom half of page) Back in the 1970s, calculus books referred
to Newton's method, but in the last 20 years I tend to see
Newton-Raphson used. Is there a distinction? Is it okay just to
call it Newton's method?
- (p. 154, top half of page) Note that the one-step (one iteration may
be a more accurate term to use) estimate can be
described in terms of a simple two-step procedure (outlier identification, followed by outlier removal (and averaging)) if one ignores the
1.28(MADN)(U - L) part. The one-step estimate is similar to
an adaptive trimmed mean.
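Here is how I read the computation on pp. 153-154, as a Python sketch (treat it as my interpretation, not a faithful transcription of Wilcox's formula; 0.6745 is the usual constant that makes MADN estimate the standard deviation at the normal):

```python
def median(x):
    s = sorted(x)
    n, m = len(s), len(s) // 2
    return s[m] if n % 2 else 0.5 * (s[m - 1] + s[m])

def one_step_m(x, K=1.28):
    """One-step Huber M-estimate of location: flag points more than
    K * MADN from the median, average the rest, and add the correction
    term K * MADN * (i2 - i1).  Assumes MADN > 0."""
    M = median(x)
    madn = median([abs(v - M) for v in x]) / 0.6745
    i1 = sum(1 for v in x if (v - M) / madn < -K)   # low outliers
    i2 = sum(1 for v in x if (v - M) / madn > K)    # high outliers
    middle = [v for v in x if abs(v - M) / madn <= K]
    return (K * madn * (i2 - i1) + sum(middle)) / (len(x) - i1 - i2)
```

With x = [1, 2, 3, 4, 100] the ordinary mean is 22, while the one-step estimate stays at 3 --- the correction term is what keeps this from being nothing more than an adaptive trimmed mean when the flagged counts i1 and i2 differ.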
- (p. 154, bottom half of page) Wilcox states that "the one-step M-estimator
looks very appealing." Based on studies that I have done (with the help of
various students), I agree. The one-step estimator does about as well as
(or better than) the 10% trimmed mean in situations in which the 10% trimmed
mean works very well, and
it does about as well as
(or better than) the 20% trimmed mean in situations in which the 20% trimmed
mean works very well. It's about 95% as good (using MSE as a measure of
goodness) as the sample mean if the underlying distribution is normal (even
for sample sizes in the ballpark of 10 or 20).
So if using the sample mean is rejected due to apparent heavy tails, then for
estimating the mean/median of a symmetric distribution, the one-step M-estimate
seems like a good choice. Although there are problems with using it to estimate
the mean or median of a skewed distribution if the sample size is sufficiently
large (due to bias that doesn't even vanish asymptotically,
resulting in an MSE that does not tend to 0), my work has indicated
that the M-estimate is not necessarily a bad choice for small sample size
situations (although for estimating the mean or
median, if the skewness is large
relative to the kurtosis, one may be better off using the sample mean
or a one-sided trimmed mean).
- (p. 155) Wilcox states that "the one-step M-estimator can have a
substantially smaller standard error" (compared to the 20% trimmed mean).
It's important to keep in mind that for large sample sizes, bias may also be of concern --- because neither estimator is guaranteed to be unbiased, or even
asymptotically unbiased, for the distribution mean or median.
The estimator with the smaller standard error is not necessarily the one having the smaller MSE.
- (p. 156, 4th sentence from top)
ERROR IN BOOK:
Instead of "an outlier" it should be some outliers.
- (p. 157) The paragraph right before the Summary gives somewhat of a
summary of Wilcox's opinions about the relative merits of the Huber
M-estimator and the 20% trimmed mean, and also provides something of a preview
for Ch. 9.
- (p. 157) The 2nd bulleted item of the Summary is rather important.
- (p. 158) The book by Staudte and Sheather may be good to investigate at
some point. (Just today I got a book on bootstrapping by Chernick that I'm
going to evaluate for seminar appropriateness.)
- (p. 158, last sentence) The standard error of the one-step estimator can be estimated with a simple bootstrap estimate of standard error.
- (p. 161, 3rd line from top & 3rd line from bottom)
ERROR IN BOOK:
The word probability should be replaced by distribution (or perhaps probability distribution).
- (p. 162, near middle of page)
The "intuitive explanation" that Wilcox refers to is attempted on the bottom
half of p. 164, but it is not a good explanation.
- (p. 164)
In expression (9.2), I think it may be better to replace gamma by
g/n, since the actual proportion trimmed can differ a bit from the
nominal value. This is consistent with what has been found to be true in
the two sample case. (See p. 170, where Wilcox has "Yuen's method has been
found to perform slightly better when sample sizes are small.")
- (p. 164)
Using expression (9.2), and assuming a large sample size,
determine what an optimal value for gamma is if the distribution is a
Laplace (double exponential) distribution. Please try to do this.
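As a partial check on my own answer: I haven't verified that this matches expression (9.2) term for term, but using the standard large-sample (Winsorized-variance) formula for the gamma-trimmed mean of a symmetric distribution, the Laplace case reduces to a closed form, and the variance falls monotonically toward 1 (the asymptotic variance of the median, 1/(4 f(0)^2)) as gamma grows --- suggesting that for the Laplace, maximal trimming is optimal.

```python
from math import log

def laplace_tm_avar(gamma):
    """Large-sample variance (times n) of the gamma-trimmed mean under the
    standard Laplace density f(x) = exp(-|x|) / 2, from the Winsorized
    variance formula; c = -log(2 * gamma) is the (1 - gamma) quantile."""
    c = -log(2.0 * gamma)
    return 2.0 * (1.0 - 2.0 * gamma * (c + 1.0)) / (1.0 - 2.0 * gamma) ** 2

grid = [g / 100.0 for g in range(1, 50)]        # gamma = 0.01, ..., 0.49
vals = [laplace_tm_avar(g) for g in grid]
# vals decreases steadily from near 2 (the Laplace variance, gamma -> 0)
# toward 1 (the median's asymptotic variance, gamma -> 1/2)
```

One can confirm the monotonicity with calculus: writing u = 2 * gamma, the derivative of the variance with respect to u has the sign of (1 + u) log u + 2 (1 - u), which is negative on (0, 1).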
- (p. 164)
ERROR IN BOOK:
3 lines below expression (9.2) a factor of 2 is missing from in
front of the gamma.
- (p. 166)
Note that the confidence interval given in expression (9.2)
is for the population (distribution) trimmed mean. For symmetric distributions
the distribution's trimmed mean coincides with the mean/median, but for skewed
distributions the trimmed mean is a nonstandard distribution measure to focus on.
- (p. 167, roughly 3rd quarter of page)
Results from one of my studies conflict with Wilcox's claims.
I found that increasing the proportion trimmed can cause tests to become anticonservative in some cases.
- (p. 167, bottom portion of page)
Unfortunately, Wilcox's conclusions are rather vague. I wish he had used
the style of his 1997 book where he reported on results for specific
distributions and sample sizes.
- (p. 168, 1st two lines)
Note that n cannot be too small if one wants to ensure good accuracy.
- (p. 168, last sentence of 1st full paragraph, and last sentence on page)
I'd like to see someone else confirm this. Also, I wish Wilcox would have
indicated what sample sizes and distributions he considered.
- (p. 168 & p. 171)
I think that using 585 for U makes more sense than 584. Also, the reason for B being 599 instead of 499 is that for a 95% confidence interval,
for which 2.5th and 97.5th percentiles are needed, B+1 should be a multiple of 40.
- (p. 169, 1st paragraph)
Wilcox has "and in the event sampling is from a normal curve, using means
offers only a slight advantage." My guess is that the alternative procedure
can result in about a 10% decrease in power, which is somewhat slight, but
not ultraslight. (Note: The reduced power of the alternative procedure
is also relevant to the last sentence on p. 173 and to 2nd to the last
bullet on p. 178.)
- (p. 171)
In addition to being a confidence interval for the difference between two
population trimmed means, expression (9.9) could also be used to perform a
test for the general two sample problem (testing the null hypothesis that
the two distributions are identical against the general alternative).
- (p. 174, last paragraph)
Wilcox compares trimmed mean procedure to M-estimator procedure. One of my studies suggests that for a variety of symmetric heavy-tailed distributions, the
signed-rank test outperforms (power comparable, but accuracy better (of
course, since signed-rank test is exact)) testing procedures based on trimmed means and M-estimators, although I didn't use the bootstrap methods that Wilcox recommends.
- (p. 174 & p. 175)
Note that in some cases Wilcox has found that the regular (and simple)
percentile bootstrap outperforms the bootstrap t (something that Wilcox refers to as "somewhat surprising").
- (p. 175, top half of page)
Note the way the D* values are formed. While it's clear
that doing it this way is reasonable, I wonder if more accuracy could be
achieved by doing it another way --- instead of just using B differences,
combine all of the estimates from resampling B times from the first
sample with all of the estimates obtained from resampling B times
from the second sample to form B2 differences.
Can anyone figure out a way to do it this alternative way?
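Here is one way to sketch the alternative (function name and details are mine; whether the B^2 version actually improves accuracy would have to be settled by simulation):

```python
import random

def percentile_ci_all_pairs(x, y, stat, B=599, alpha=0.05, seed=0):
    """Percentile-type interval for stat(X) - stat(Y) built from all
    B * B cross differences of the bootstrap estimates, instead of the
    usual B paired differences."""
    rng = random.Random(seed)
    bx = [stat([rng.choice(x) for _ in range(len(x))]) for _ in range(B)]
    by = [stat([rng.choice(y) for _ in range(len(y))]) for _ in range(B)]
    diffs = sorted(dx - dy for dx in bx for dy in by)   # B**2 values
    lo = int(alpha / 2 * len(diffs))
    hi = int((1 - alpha / 2) * len(diffs)) - 1
    return diffs[lo], diffs[hi]
```

The B^2 differences are not independent (each bootstrap estimate is reused B times), so it isn't obvious that the extra differences buy anything --- that is exactly the question worth simulating.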
- (p. 175, towards bottom of page)
Wilcox indicates here that with trimmed means the percentile bootstrap is better than the bootstrap t --- and so I wonder why he went into more detail
on the bootstrap t method and is somewhat casual in remarking that the
percentile method is better.
- (p. 178, 1st full sentence)
ERROR IN BOOK:
Wilcox has "In some cases the correct estimate is substantially smaller than
the incorrect estimate." I don't see how this is possible. Am I doing something wrong, or do you guys agree with me? Please take a moment or so to consider this.
- (p. 179)
Wilcox uses "measures" instead of variables on the next to the last
line, even though he had used variables previously on the page.
Although measures may be used in some fields, I don't think it's good to use
two different terms when one would suffice.
- (p. 183)
ERROR IN BOOK:
Wilcox refers to "Section 6.6" even though the book has no numbered sections.
- (p. 187, 1st full paragraph)
Note that the t test is fairly accurate for testing the null hypothesis
of independence against the general alternative, even if there is
nonnormality, but power can be low in some situations.
Also, in addition to having Spearman's rho and Kendall's tau, StatXact
has an exact permutation test based on Pearson's statistic (that is an exact way
to test for lack of independence) that doesn't require an assumption of
normality --- and if this exact version is employed,
one wouldn't have to worry about even slight inaccuracy due to nonnormality.
- (p. 188)
ERROR IN BOOK:
In the 2nd set of X and Y values a little more than halfway down the page,
the 3rd Y value should be 28 (instead of 47).
- (p. 190, 6th line)
Why recommend B = 600 here when previously 599 has been used?
This seems like needless lack of consistency.
- (p. 190, last 2 lines (and 1st line of p. 191))
I question that Spearman's rho and Kendall's tau are "typically covered"
in introductory statistics courses.
- (p. 191)
Here are some comments about Spearman's rho.
- Spearman's paper appeared in 1904, and so it is a rather old statistic.
- It should be mentioned that it is a measure of the strength of a
monotone relationship, whereas Pearson's statistic is a measure of the
strength of a linear relationship.
- When doing a test with a small sample, tables of the exact null sampling
distribution (or StatXact) should be used instead of a normal or t approximation.
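Since it is just Pearson's correlation applied to the (mid)ranks, Spearman's rho takes only a few lines to compute from scratch (midranks handle ties):

```python
def midranks(x):
    """Ranks 1..n, with tied values each given the average of the
    positions they occupy."""
    order = sorted(range(len(x)), key=lambda i: x[i])
    r = [0.0] * len(x)
    i = 0
    while i < len(x):
        j = i
        while j + 1 < len(x) and x[order[j + 1]] == x[order[i]]:
            j += 1
        for k in range(i, j + 1):
            r[order[k]] = (i + j) / 2 + 1   # average 1-based rank
        i = j + 1
    return r

def spearman_rho(x, y):
    """Spearman's rho: Pearson's correlation of the ranks, a measure of
    the strength of a monotone (not necessarily linear) relationship."""
    rx, ry = midranks(x), midranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    num = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    den = (sum((a - mx) ** 2 for a in rx)
           * sum((b - my) ** 2 for b in ry)) ** 0.5
    return num / den
```

For the perfectly monotone but nonlinear pairs x = 1, ..., 5 and y = x squared, rho is exactly 1 even though Pearson's r falls short of 1 --- which illustrates the monotone-versus-linear distinction in the first bullet above.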
- (p. 193)
In addition to the behavioral data analysis book by N. Cliff,
other books (e.g., the text I use for Nonparametric Statistics) also
contain information about the treatment of ties. (When there are a lot of ties,
the method used can make an appreciable difference.)
- (p. 193)
The 1997 Wilcox book contains some information about the methods related to M-estimators.
- (p. 197)
Note that the axis of the MVE need not directly correspond to the
correlation computed from the points inside the MVE.
- (p. 197)
What is the IML of SAS/IML?
- (very bottom of p. 200 and very top of p. 201)
Wilcox makes it seem as though Spearman's rho and Kendall's tau are to be
considered to be newish alternative methods, but they really aren't very new.
(Spearman's paper was published in 1904.)
- (pp. 200-201) Wilcox suggests that the nature of the association in
Figure 10.10 changes at about X = 150, but I wonder if he is putting too
much emphasis on the smooth. If the plot was truncated at X = 250 and the
smooth was removed, then to me a visual examination would not suggest that
there is a positive trend up to X = 150 and then no association for larger
values of X. (Note: I'm not a fan of smooths when the data is sparse as it is
in the right half of Figure 10.10. If the parameter(s) of the smoother were
set differently, the picture would change.)
- (p. 202, last bullet) If one wants to test using a null hypothesis of
independence instead of a null hypothesis of zero correlation, then I think the
tests based on an assumption of homoscedasticity should be preferred. One
could still use resampling to perform tests based on more exotic statistics,
but I think the resampling should be of the permutation variety as opposed
to resampling intact (x, y) ordered pairs.
9th meeting (August 24) : Ch. 11 & Ch. 12
- (p. 206, 2nd to last sentence of 1st full paragraph)
This relates to Ch. 10 (and recall, doing a t test
that the correlation is 0 using
Pearson's sample correlation coefficient is equivalent to doing a t
test of the null hypothesis that the slope is 0).
Wilcox indicates that if you reject that the slope, beta, is 0, you
can safely interpret that the distribution of Y depends on x,
but you should be careful about making the interpretation that
E(Y|x) is an increasing or decreasing function of x if the
assumption of homoscedasticity is in question (even if the rest of the
simple regression model holds).
- (p. 206, 3rd sentence of 2nd paragraph)
I agree with Wilcox's advice.
- (p. 206)
ERROR IN BOOK:
Wilcox refers to Section 11.12, but there are no numbered sections.
- (p. 208, 1st full paragraph)
It's interesting that Wilcox claims that the Theil-Sen estimator competes well
with least squares when there are iid normal error terms, but I wish he'd
have given a quantitative result! I'm not going to be happy with 80%
efficiency since M-regression can be very close to 95% efficient while
protecting against the ill effects of very heavy-tailed distributions.
- (p. 212, near middle of page, 1st sentence of paragraph)
It'd be more accurate to put reduces the bad effects of outliers
in place of "protects against outliers" since in some cases other methods
are appreciably more resistant to outliers (and so you don't want to think that
L1 regression provides complete protection). (Also, in the
3rd sentence of that paragraph, it's kind of silly to have that the breakdown
point is "only zero" --- I think it would be better to put that the breakdown
point is 1/n (or asymptotic breakdown point is 0), and perhaps remind
the reader that this is the lowest possible value.)
- (p. 213)
Some people think L1 regression is more resistant to weird
points than it really is. Figure 11.2 provides an excellent example of how
this method can fail.
Maybe we can find this data and then try other types of regression with it.
My guess is that least squares will do poorly, but some types of M-regression
will do well.
- (p. 213)
ERROR IN BOOK:
Ordering the residuals makes no sense --- need to use order statistics of
the squared residuals instead.
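The corrected criterion is simple to write down; in this sketch the coverage h (how many of the smallest squared residuals get summed) is my choice of default, with h near n/2 giving the high-breakdown version:

```python
def lts_objective(x, y, a, b, h=None):
    """Least trimmed squares criterion for the line y = a + b * x: the
    sum of the h smallest SQUARED residuals --- i.e., order the squared
    residuals, not the residuals themselves."""
    if h is None:
        h = len(x) // 2 + 1
    sq = sorted((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    return sum(sq[:h])

# five points on y = 1 + 2x plus one wild point: the true line has
# LTS objective 0, since the h = 4 smallest squared residuals are all 0
x = [0, 1, 2, 3, 4, 5]
y = [1, 3, 5, 7, 9, 100]
```

An actual LTS fit minimizes this criterion over (a, b), typically by random subsampling of elemental fits, since the objective is not smooth in the parameters.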
- (p. 214) Figure 11.3 shows that even high breakdown point methods can perform poorly.
- (pp. 213-215)
If one believes regression model with iid error terms holds, then LTS (and LMS too)
is a very poor choice for estimating the unknown parameters (according to results from studies that I have done). Not only are the MSEs relatively huge, but
also LTS is very slow on S-PLUS.
- (p. 215, last 2 sentences of 2nd to last paragraph)
I agree with Wilcox's advice, but one could add that the fits should be
examined graphically even if all of the methods you try are in near
agreement (since all could result in a screwy fit).
- (p. 216)
ERROR IN BOOK:
One needs to use the ordered absolute residuals, not the absolute values
of the ordered residuals.
- (p. 217)
One might wonder why use the median of the squares instead of the median
of the absolute values. I think it may have to do with having a unique
solution in even sample size cases. For odd sample sizes it would make
no difference whether squares or absolute values were used, but for even sample
sizes, where I think it uses the average of the two middlemost values, one
doesn't necessarily get unique estimates if absolute values are used.
(Think of the intercept, and moving
the fitted line up and down while keeping slope constant. Different values
of the intercept can result in the same value of the median of the absolute
residuals if the median is computed from two middlemost values.)
- (p. 217) I've seen in other places where LTS is generally better than LMS
(as Wilcox has). But the word must not have spread to all corners, since
some seem to use LMS instead of LTS when they want a high breakdown method.
- (p. 217)
ERROR IN BOOK:
Near middle of page, Wilcox has "indicated in Figure 11.3" but I don't see how
the figure indicates what he claims it does (since it doesn't even show the LMS fit).
- (p. 218)
The method described for identifying regression outliers seems superior to
using studentized residuals (since studentized residuals are based on least
squares fits, and are not necessarily well-behaved if there are multiple
outliers (or just general overall heavy-tailedness)). I wonder what the
alternative method corresponds to in the iid normal case. (I.e., are
the points labeled as regression outliers those with a studentized
residual of 2 or greater?)
- (p. 221)
I wish Wilcox would have included a description of the adjusted M-estimator
(but I guess we can refer to his 1997 book, although I think the
description there should be improved).
- (p. 221)
ERROR IN BOOK:
Wilcox refers to Section 11.10, but there are no numbered sections.
- (p. 221)
Wilcox claims using bootstrapping with adjusted M-estimator gives good
results even with small sample sizes, extreme nonnormality, and extreme
heteroscedasticity. Given what we've found in our studies, where we
don't have heteroscedasticity, I find his claim to be a bit hard to believe,
and so I
wish he would have included some numerical results to back up his claim.
- (p. 222, bottom of page)
I like the strategy of picking the estimator having the smallest estimated
standard error. It could be a lot of work if bootstrapping is needed to get
some of the standard error estimates, but if the data analysis is important,
one might want to go to all of the trouble.
- (p. 223)
I'd be interested in knowing what regression depth is. The scheme is to find
the line having the "highest regression depth" (seems odd to refer to a high depth --- I'd have used the word greatest), but the book doesn't indicate what regression depth is.
- (p. 224)
Wilcox seems to like the LTS estimator with a breakdown of 0.2 or 0.25. Too bad he
doesn't give any solid comparisons with other methods. For instance, how does
it compare with least squares and Huber M-regression in the iid normal case (and
how does it compare with other methods in situations
with iid error terms having contaminated normal distributions)?
- (p. 224)
I don't agree that a breakdown point of 0.13 is "dangerously low" since in
order for 13% of the data to cause big trouble, those 13% have to be working
together in a sense (as opposed to being 13% contamination scattered about
in different directions), and if the 13% of the bad points were working
together to result in a bad estimate, one could hope to spot the trouble
using graphical methods. (One idea for a graphical method would be to color
points having large residuals and then look at the p-dimensional
predictor space using a rotating cube (for p = 3) or parallel
coordinates (for p > 3).)
- (p. 224)
ERROR IN BOOK:
Wilcox refers to Section 10.9, but there are no numbered sections.
- (p. 224)
A problem with using the MVE described in Ch. 10 to identify the slope is
that if the x values are tightly clustered in the middle and sparse
at the ends, the MVE could indicate a very misleading result (and it would
be better to use information provided by the points having extreme x values).
- (pp. 224-226)
Wilcox seems to like the rather exotic methods, like Theil-Sen, the adjusted
M-estimator, and LTS with a breakdown of 0.2 or 0.25. My guess is that only a
relatively small number (maybe even a small number in an absolute sense) of
people actually use
these methods in practice. I'll guess that more people (but still a
small proportion of statisticians and users of statistical methods) use
the more well-known alternative methods like Huber or bisquare M-regression.
(Unfortunately, L1 regression seems to be the alternative
method many consider when they don't use least squares.)
Wilcox favors methods that may do well in the worst of situations
(combining heavy tails with heteroscedasticity), but I wonder if
using one of the
exotic methods favored by Wilcox would be as good as
using one of the more familiar M-regression methods when we can
determine that although we have a situation where least squares shouldn't
be trusted, we don't have an ultraextreme situation.
(I'll guess that in lots of iid error term cases, more common
M-regression methods will outperform those that Wilcox favors.)
- Below I'll give some results pertaining to some of the more commonly
used (I think) robust regression methods (that can be done using standard
S-PLUS functions). In a 1999 paper that I presented at a meeting in Chicago,
I first showed (using Monte Carlo results) that conclusions based on
asymptotic results (not requiring Monte Carlo work) seem to apply for the
most part when the sample size is only as large as 50. That being the case,
I developed a table of asymptotic relative efficiency (ARE) values with
which to compare 4 different regression methods (really 5, since I found
that using the Andrews weight function was nearly identical to using the
bisquare weight function (both in asymptotic results and Monte Carlo results
for smallish sample size situations)). The table below gives ARE values of
least squares (OLS), Huber M (Huber), and L1 (LAD) estimators with respect to
the bisquare M-estimator. So ARE values in the table greater than 1 indicate
that another estimator beats the bisquare estimator --- and it can be noted
that there are not a lot of ARE values greater than 1, and only two ARE
values exceed 1.05. Thirty-one different error term distributions are considered.
T15 denotes a T distribution with 15 df, cn(.05) denotes
a contaminated normal distribution with 0.05 being the probability
associated with the larger variance normal distribution. More than one row
is labeled cn(.05) since different scale factors can be used in a 5%
contaminated normal distribution. The scale factors used range from 2 to 10;
the details can be found in my 1999 paper. The column labeled
twi gives the value of a tailweight index.
It can be seen that L1 regression is typically a rather
poor choice (except for 2 rather extreme, and usually unrealistic, error
term distributions). Also, it should be noted that while bisquare M-regression
can be quite a bit better than least squares, it never does a whole lot
worse than least squares.
(Table of ARE values, with columns for dist'n, twi, OLS, Huber, and LAD, omitted here.)
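For readers who want to experiment with these distributions themselves, here's a small Python sketch (my own illustration, not code from the paper) of drawing from a cn(.05) distribution with scale factor k, where with probability 0.05 an observation comes from the normal component with k times the standard deviation:

```python
# Draw n observations from a contaminated normal cn(eps): with probability
# eps the observation comes from N(0, (k*sd)^2), otherwise from N(0, sd^2).
import random

def rcontam_normal(n, eps=0.05, k=3.0, rng=random):
    return [rng.gauss(0, k if rng.random() < eps else 1.0) for _ in range(n)]
```

The variance of this mixture is (1 - eps) + eps*k**2 (so 1.4 for eps = 0.05 and k = 3), which is a quick sanity check on any simulation using it.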
- (p. 227, top half of page)
Wilcox's strategy seems okay, but as he points out, using smoothers
for diagnostic purposes can be tricky in multiple regression settings.
To address the "criticism" he refers to, one could also employ the
strategy described on the bottom part of p. 222.
- (p. 227, bottom half of page)
I agree that it is often not good to use least squares and assume all is well,
but I think it should be mentioned that least squares is okay to use in many
instances. (If this was not the case, then in its present form STAT 656
should be eliminated.)
- (p. 229)
I like that this chapter will address permutation tests and rank-based tests,
since in the end one wants the best procedure to use in a given setting, not
just the best procedure that is considered to be a robust procedure (although
the book has considered classical normal theory procedures throughout, and compared them with robust procedures).
- (p. 231 & p. 234) How to view
the W-M-W test is dealt with near the bottom of page 231.
I think it's best
to think of the test as being one of the null hypothesis that all of the
random variables are iid from the same distribution against the general
alternative, but that it is sensitive to the value of p (introduced
on p. 230), and if certain assumptions are made, it can be viewed as a test
about means or medians (with a more rigid assumption needed to view it as a test
about medians). It's unfortunate that it is presented as a test about medians
in some places, but it's nice that on p. 234 Wilcox indicates that the W-M-W
test "is unsatisfactory when trying to make inferences about medians" (although
it would be better to state that it can be unsatisfactory unless one feels
that it is reasonable to make certain assumptions (e.g., if we have a shift
model situation, or a scale model for nonnegative random variables, then the
test can be used to address medians)).
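As a reminder of what the W-M-W statistic actually estimates, here's a little Python sketch (my own, not from the book) computing the sample analogue of p = P(X < Y) directly from two samples, counting ties as 1/2 (a common convention):

```python
# Sample estimate of p = P(X < Y): the proportion of (x, y) pairs with
# x < y, counting tied pairs as 1/2. This is the W-M-W statistic rescaled
# to lie between 0 and 1.
def phat(xs, ys):
    wins = sum(1.0 if x < y else 0.5 if x == y else 0.0
               for x in xs for y in ys)
    return wins / (len(xs) * len(ys))
```

Under the null hypothesis that all observations are iid from one distribution, phat should be near 0.5; values far from 0.5 drive rejection, which is why the test is sensitive to p rather than directly to the medians.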
- (p. 233)
In the first part of the paragraph that starts in the middle of the page,
Wilcox makes it seem that the reader should be prepared to go forth armed
with some really good methods, but really, he has given us scant information
with which to make informed decisions. For the rank-based methods designed
to work well if there is heteroscedasticity, he cites a book by N. Cliff.
I don't have this book yet, but I am aware that such rank-based methods were
introduced into the more mainstream statistical literature many years ago.
A GMU M.S. in Stat. Sci. graduate, Kelly Buchanan (now Kelly Thomas), did a
thesis for me in 1993, and it includes a very good literature review of what
is known as the generalized Behrens-Fisher problem (two sample tests about
means when there is nonnormality and heteroscedasticity), which includes many
references about modifications of standard nonparametric procedures. To
summarize, in the early 1960s there were papers indicating that the ordinary
W-M-W test was not an accurate test for the generalized Behrens-Fisher
problem, and including suggested modifications of the W-M-W test. Among such
papers were those of P. K. Sen (1962) and R. F. Potthoff (1963). Monte Carlo
studies done for the thesis indicate that a 1979 modification of Sen's
procedure developed by K. Y. Fung represented an improvement of earlier
efforts, but it should be noted that K. K. Yuen's (note: I think K. K. Yuen
became K. Y. Fung when she got married) 1974 trimmed mean modification of
Welch's test was found to be generally better than the rank-based Fung test.
It should also be noted that most of the studies pertaining to the
generalized Behrens-Fisher problem employed symmetric location-scale
families of distributions, and that much less is known about tests about
means in the presence of skewness and heteroscedasticity.
(Note: It's too bad that Wilcox doesn't seem to be as well acquainted with
mainstream statistics literature as he is with the literature from the social sciences dealing with statistical methods.)
- (p. 233)
For the data from the experiment to study the possible effect of ozone on
weight gain in rats, I think the ordinary W-M-W test would be a better starting
point than a version of the test designed to adjust for heteroscedasticity.
It seems to me that it would first be proper to test the null hypothesis that
ozone has no effect against the general alternative that weight gain is somehow
affected by the amount of ozone. If the null hypothesis of no effect is
rejected, then one could go about trying to characterize how the distributions
differ, examining means, medians, quantiles, variances, and other distribution
measures. To just be concerned about the value of p (which is the focus
of the rank-based test that adjusts for heteroscedasticity) seems a bit silly.
Suppose that p equals 0.5, but that the distributions have very different
dispersions; I think it'd be nice to make note of this, since it would mean
that one environment tends to produce more uniform weight gain, while in the
other environment there is a tendency to observe more extreme (both small and
large) weight gains.
- (p. 235)
In the middle portion of the page, Wilcox describes the Monte Carlo version
of the test. The "official" version of the test doesn't use random selections,
but rather considers all possible ways of dividing the n1 + n2 observations
into groups of sizes n1 and n2. StatXact does the exact version, and will also do a Monte Carlo version if the sample sizes are too large for the exact version.
Also, it should be mentioned that it's easy to get a p-value (and a confidence interval for the estimated p-value in the Monte Carlo case), and one doesn't
have to do a size 0.05 test the way Wilcox describes.
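To make this concrete, here's a rough Python sketch (my own illustration, not Wilcox's code) of the Monte Carlo version, returning an estimated p-value for the observed difference in sample means; the add-one adjustment in the last line is one common convention for keeping the estimate a valid p-value:

```python
# Monte Carlo two-sample permutation test: repeatedly shuffle the pooled
# data, split it into groups of the original sizes, and see how often the
# absolute difference in means is at least as large as the one observed.
import random

def perm_test(xs, ys, reps=2000, rng=random):
    pooled = list(xs) + list(ys)
    n1, n2 = len(xs), len(ys)
    observed = abs(sum(xs) / n1 - sum(ys) / n2)
    hits = 0
    for _ in range(reps):
        rng.shuffle(pooled)
        d = abs(sum(pooled[:n1]) / n1 - sum(pooled[n1:]) / n2)
        if d >= observed:
            hits += 1
    # add-one adjustment: counts the observed arrangement itself
    return (hits + 1) / (reps + 1)
```

The exact version would enumerate all C(n1 + n2, n1) splits instead of sampling them, which is what StatXact does when the sample sizes permit.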
- (p. 236, 1st full paragraph)
Wilcox correctly implies that the two sample permutation test is really an
exact test for the general two sample problem (of the null hypothesis of
identical distributions against the general alternative), and is not really
a test about the means. (But if one is willing to make some additional assumptions (e.g., assume that either the distributions are identical, or that one is
stochastically larger than the other if they differ), it can be considered
to be a test about the means.) If unequal
variances can cause the probability of rejecting to exceed the nominal level
even if the means are the same, it is still a valid test of the general
two sample problem (since if the variances differ, the distributions are
not the same, and so the correct decision is to reject the null hypothesis).
- (p. 238, 10 lines from bottom, and p. 239 (a bit below halfway down))
I don't like it that Wilcox uses the phrase "bootstrap techniques again appear to
have practical value" (my italics). (He uses similar phrases in other
places in the book.) I want some indication of how accurate a method is, and
more precise information about sample size recommendations.
I can't help but be at least a bit suspicious when he indicates his
endorsement applies to small sample size settings (since in simpler settings
it is known that somewhat largish sample sizes are needed to ensure accuracy
with bootstrap procedures). On p. 239, Wilcox has "certain types of bootstrap
methods appear to be best when sample sizes are small." Again, I am a bit
suspicious. Perhaps the bootstrap methods are best, but are they good enough?
- (p. 239, 2nd to last sentence in paragraph that begins the page)
Wilcox has "very small departures from normality in the fourth group can make
it highly unlikely that the differences among the first three groups will be
detected." It depends on what test procedure is used. Wilcox is correct to
have "can" because one large variance can adversely affect the ability to
identify any differences if a pooled estimate of scale is employed by the test
procedure. But if the test procedure employs pairwise comparison subtests,
then a large variance for the fourth group won't affect the ability to detect
differences among the first three groups.
- (p. 242)
ERROR IN BOOK:
In the indt description, I'll guess that it should be in
press a. instead of just "in press."
- (p. 243, 2nd to last bullet)
Wilcox has "practical methods have been derived and easy-to-use software is
available." Since I have found mistakes in expensive software (StatXact), I've
grown to be suspicious of software (and try to test it before I trust it ---
which can be a lot of work), and I'm even more suspicious of "freeware"
(Dallas and I have identified mistakes in some of Wilcox's S-PLUS functions,
and our Friday group has found fault with an S-PLUS function obtained from
Venables and Ripley). I've also grown to be suspicious of authors of papers.
Even if they present honest Monte Carlo results to support their claims, I
wonder what results they haven't put in the paper (e.g., maybe their method
only works well in some settings, and they only reported on cases in which
the method performs well, leaving it to others to identify situations in
which their method can be quite lousy).