Comments About the Book
(and related material)
Note to STAT 554 students: Right now this web
page is mostly material that I developed for my 2001 summer seminar in robust statistics.
I may modify it to make it more appropriate for STAT 554 students later,
but even as is I think it'll help you to better understand the Wilcox
and August 24 meetings. I'll add other comments later.
(Generally, look for new comments each Sunday, but it may be later some weeks.)
Changes, or additional comments added after the meeting about the material,
are in green, and errors identified at or after
the meeting are shown in
bright red. Some of these modifications are due to comments by J. Gentle or A. Keesee, and others are just things that occurred to me after the meeting.
Since the book is rather elementary in places, I think it'll be best if we
use the book to generate discussion, but in our discussions try to go beyond
the level of the book. So in addition to reading my comments, try to always
come up with some remarks of your own.
Of course some students may want to use our time together to try to finally
gain a good understanding of matters that they are shaky about, and that'll be fine.
Even though the book is somewhat elementary in places, if at the end of the
summer we understand everything that's in it, then I think we can call the
seminar a huge success.
Finally, it should be noted that I'm not attempting to outline the chapters.
Rather, I'm making comments, many of the nature of a side remark, about the
material in the book --- but not necessarily about the main points presented in each chapter.
You'll need to let me know if there are parts of the book that you think need more attention.
1st meeting (June 22) : Preface, Ch. 1, & Ch. 2
THE GROWING GAP
- (p. viii) Wilcox states that "during the latter half of the twentieth
century, things began to change dramatically."
- If we go back 5 more years to the mid 1940s, then in addition to
robust and computationally-intensive methods, along with better exploratory
and diagnostic graphical techniques, also included would be some key developments.
It's important to note that development of new methods is linked to finding
fault with classical methods --- and so the increase in understanding of the
older methods is also important.
- (p. viii) Is there truly an "ever-increasing gap between state-of-the-art
methods versus techniques commonly used" and if so, who is to blame?
- Do the applied statisticians (many having inadequate training to start with)
just not keep up as they should? Are tired
old government workers too resistant to change?
- Are textbook writers too conservative --- not wanting to stray too far from the norm?
- Do members of the faculty fail to introduce the latest and the greatest?
Do they fail to instill the proper attitude in the students? (What is the proper attitude, anyway?)
- Are the researchers failing to be convincing? Have perhaps somewhat
unethical or just plain shabby researchers turned off mainstream applied
statisticians? (Unfortunately, Wilcox's 1997 book, along with his S-Plus functions, contain some errors that could mislead people.) It seems like some statisticians believe that if SAS doesn't
do it, it's not worth doing or it shouldn't be trusted.
- (p. viii) Wilcox claims that "various perspectives are not typically
covered in an applied course" and he's right.
- In STAT 554 I try to include a lot of methods that aren't
commonly used because I think they are good methods and because I want to
impress upon the students that things aren't nearly as simple as some lower-level
books and courses make them out to be. I include seldom-used techniques like
Johnson's modified t test, the Steel-Dwass test,
and the Harrell-Davis estimator, along with methods based on trimmed means.
I introduce M-estimators and discuss permutation tests. Along the way I
try to provide reasons for why these less commonly used methods should sometimes
be used. But each semester an important technique like the bootstrap only gets
between 0 and 60 seconds --- there just isn't time to describe bootstrap
methods and properly discuss their strengths and weaknesses.
- Not only does STAT 554 not get to address a lot of good methods, but
many students earning an M.S. degree may not be exposed to everything that they
perhaps should get in an M.S. program. (Should the curriculum be altered?
For the most part I think GMU has a wonderful M.S. program, but should greater
emphasis be placed on more modern methods, or would that come at the expense
of other things equally or perhaps more important?)
- (p. viii) Wilcox claims that "standard training in basic statistics does
not prepare the student for understanding the practical problems with
conventional techniques or why more modern tools might offer a practical advantage."
- I think I do a good job of this in STAT 554, but then I run short of
time when covering categorical data analysis and regression. (When trying to
include more ways to attack a given general problem in order to provide
performance in more situations (cases of the general problem), while at the
same time properly discussing the pros and cons of various methods and how
they compare, I don't have time to properly cover the topics towards the
end of the course. For M.S. students, who will take many other statistics
courses, maybe my way is a good way. But are others being properly treated?
(I don't want to ever go to a cook-bookish type of course, but sometimes I
do wonder if less would be more in a one semester course.))
- What do other universities do differently? (I suspect many place more
emphasis on theoretical issues, and it seems like that would leave less time
for a broader coverage of applied techniques. But are they spending more time
on newer approaches, and less time on classical methods?)
- (p. 1) What is meant by "arbitrarily small
departures from normality"?
Maybe the departures that cause trouble aren't so small if measured in a
different way. (E.g., two distributions can seem similar when Kolmogorov
distance is used to measure how much they differ, but they can have large differences
between some quantiles.)
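To make this concrete, here is a quick sketch (in Python; the mixture used --- 90% N(0,1) plus 10% N(0,100) --- is just an illustrative choice of mine, not an example from the book). The Kolmogorov distance between a standard normal and this contaminated normal is only about 0.04, yet their 0.999 quantiles differ enormously.

```python
from math import erf, sqrt

def Phi(x):
    # standard normal cdf
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def F_mix(x):
    # cdf of the contaminated normal: 0.9*N(0,1) + 0.1*N(0, 10^2)
    return 0.9 * Phi(x) + 0.1 * Phi(x / 10.0)

# Kolmogorov distance between the two cdfs (grid search)
ks = max(abs(Phi(i / 1000.0) - F_mix(i / 1000.0)) for i in range(-10000, 10001))

def quantile(F, p, lo=-200.0, hi=200.0):
    # invert a cdf by bisection
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if F(mid) < p:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

print(ks)                       # roughly 0.04
print(quantile(Phi, 0.999))     # about 3.09
print(quantile(F_mix, 0.999))   # about 23
```

So the two distributions are "close" in Kolmogorov distance while being wildly different out in the tails.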
- (p. 2) I wonder what "commonly recommended methods for dealing with
nonnormality have been found to be completely useless." Hopefully we'll
encounter more about this later in the book.
- I don't think he means standard nonparametric methods because they
work well in some situations.
- I think that transformations do horribly in some cases, but we shouldn't
rule out transformations altogether since they can be quite useful in some situations.
- Maybe he's thinking about cases where nonnormality is combined with
a nonnice heteroscedasticity structure (and small sample sizes).
- (p. 2) Wilcox states that "modern techniques stem from three major
developments", but I wonder how much most statisticians know about robust methods
and computationally intensive methods such as bootstrapping, projection
pursuit, CART, and MARS. Just how commonly used are these modern methods?
(Even among university faculty, although some may have been exposed to the
basic ideas, how many have enough experience in using the methods to
recommend when they should be used in place of other methods?)
- (p. 4) It's interesting how belief in the use of the sample mean led
Gauss to support the normal curve. (These days it's more the case that
belief in the normal curve lends support to the use of the sample mean.)
Also interesting that Stigler notes that Gauss's argument is faulty.
- (p. 4) Note that symmetry assumed a role as a key assumption as
a matter of convenience more than because empirical evidence suggested it.
- (p. 5) Reliance on CLT-based robustness arguments has been questioned
in recent years --- certainly the speculation (see p. 2) that 25 observations
should be adequate has been shown to be false. (Also, even if we have
robustness for validity, we don't necessarily have robustness for efficiency.)
- (p. 5) "In 1818, Bessel conducted the first empirical investigation that
focused on whether observations follow a normal curve." I wonder why it took so long.
- Although (see p. 3) the normal curve was first developed in
1733, that was because of its ability to provide an approximation to binomial probabilities.
- Uses of the normal curve to model nonbinary data seems to be due
(see top of p. 4) to matters of convenience. (The convenience factor
still exists today. Some statisticians are reluctant to admit that a
normal distribution isn't a good model for the error term distribution because
if they did they would no longer have the asymptotic optimality of MLEs and
UMVUEs to support the use of the simple least squares methods that they want to use.)
- It seems as though histograms would have suggested that some phenomena
follow at least approximately a normal curve, and so I would have
guessed that early on (before 1818) some comparisons would have been made.
- Note that this book gives information about the origin of some commonly
used terminology. Examples include the central limit theorem (p. 5) and the normal curve (p. 6).
- (p. 6)
Note that at one time Pearson thought that "nonnormal distributions were
actually mixtures of normal distributions and he proposed that efforts be made to
find techniques for separating these nonnormal distributions into what he
presumed were the normal components." But eventually "he abandoned this idea"
--- something that others have not done, since even in the 1990s research was still
being done on using mixtures of normal distributions for nonnormal distributions.
- (p. 7) In 1811, at age 62, Laplace, a supporter of the Bayesian point of
view, created the frequentist approach. In the 1800s the frequentist approach
gained momentum, and in the 1900s it was the dominant point of view, although in
the 1990s there was a surge of renewed interest in Bayesian methods. The renewed
interest was due in large part to improved computational techniques. (I don't
mean to go off on a rant here, but is
this renewed interest justified? Just because we can now easily do
something, does that mean we should do it? Shouldn't the New
Bayesians be asked to show that their methods give us superior inferences?)
- (p. 7) Note that Laplace developed the CLT-based confidence interval
approach in 1814, about 100 years before the work of Fisher and Gosset
("Student"). (Is this evidence to support the conjecture that now
knowledge about statistics is growing faster than before? Why was it that
"Student"'s contribution waited until 1908? (Of course one can't blame
people like Laplace, Gauss, and Cauchy --- those guys weren't idle!))
- We should review the goals of Part I of the book (see p. 11).
- (p. 11) Wilcox states that "here some important concepts and perspectives
are introduced that are not typically covered in an introductory course."
I think some of the things aren't even in most 400-level and 500-level courses.
- (p. 14) In many places in the book Wilcox refers to "the typical individual under study." This point of view seems to be common among social scientists.
They are often interested in the typical individual, and will choose between
estimators such as the sample mean and sample median to address the nature of
the typical individual. Whereas I tend to want a more specific focus: should
I be interested in the distribution mean, the median, or perhaps some quantile
other than the median? I think there are plenty of cases where the mean is
the proper target, even if it's out in the tail of the distribution and not
among the most likely outcomes. (Think about comparing two production methods
and wanting to choose the one that will yield the most product in a year, and
have data that are daily outputs. Should we focus on the mean or median? One
can also think about searching for places to drill for oil. Should we care about the
mean yield or the median yield?)
- (p. 15) Wilcox states that "probability curves are never exactly
symmetric" and yet it's interesting that
symmetry assumptions abound in statistics.
- (p. 16) I like Wilcox's simple description of outliers. (The book
returns to outliers in Ch. 3.)
- Be sure that you understand the concept of breakdown point.
Values are given for the sample mean (p. 17), the sample median (p. 19), the
weighted mean (p. 20 --- but he doesn't address the common case of the
trimmed mean (maybe because he doesn't consider trimmed means to be weighted
means (see p. 51))), the sample variance (pp. 21-22), and the least squares estimate
of the slope (p. 29).
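A tiny sketch of the breakdown idea (in Python, with made-up data): corrupting a single observation can drag the sample mean arbitrarily far, while the sample median barely notices.

```python
import statistics

data = [9.1, 9.4, 9.8, 10.0, 10.2, 10.5, 10.9]  # hypothetical data
bad = data[:-1] + [10.9e6]                      # one wild observation

# the mean follows the single corrupted value (breakdown point 1/n),
# while the median is unchanged (breakdown point 0.5)
print(statistics.mean(data), statistics.mean(bad))
print(statistics.median(data), statistics.median(bad))
```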
- The population median is presented on p. 17. Can you sketch a continuous
pdf for a distribution that has infinitely many values satisfying the
definition given for population median? What about a pmf for a discrete distribution that has no value satisfying the definition?
- (p. 17)
Wilcox has "the most common method for estimating the population median is with the so-called sample median" --- and this is an example of a commonly used
method not being the best method in most cases (since it's usually the case that some other estimator will be better than the sample median).
- (pp. 18-19) A brief discussion of sample mean vs. sample median for
symmetric distributions is given. There is a lot more that could have been added (but he's easing us into things).
- (p. 19) It's pointed out that the breakdown point is just one consideration and that there are others (like average accuracy).
- (p. 22) That "the low breakdown point of the variance turns out to be especially devastating" is something we'll want to look for later in the book.
- (p. 23) The concept that the choice of loss function can make a difference
is introduced. For guessing heights the absolute error seems just as sensible
as the more commonly used (in general) squared error (and for carnival games
(guessing heights, weights, or ages) the proper loss function is often of
the 0-1 nature). Often there is no clear best choice for a loss function.
- (p. 24) It's interesting that Ellis touched upon the general idea behind M-estimators as early as 1844.
- (p. 25) Note that the alternative to least squares, the average pairwise slope, dates back to at least 1750 --- about 50 or 60 years before least squares.
Boscovich's development of the LAD (aka LAV, LAR, L1) method (described on pp. 26-27) occurred in 1757 (so also before least squares).
- (p. 27) ERROR IN BOOK: In figure 2.5 (the figure
itself, not the caption) the point labels should have arc length instead.
- (p. 29) ERROR IN BOOK: It should be all
five choose two (or just 10) pairs of bivariate points instead of all ten pairs of slopes.
2nd meeting (June 29) : Ch. 3 & Ch. 4
- (p. 31) Wilcox states "in recent years it has become clear that this
curve can be a potential source for misleading --- even erroneous ---
conclusions" --- but older papers (see Miller's Beyond ANOVA for
references to some
earlier works) tend not to strongly suggest that
there are huge problems due to nonnormality. (Perhaps statisticians were
more forgiving about weaknesses in standard procedures due to a lack of
convenient and better alternatives.)
- (p. 32) ERROR IN BOOK: He should have The equation for the family of normal curves instead of "The family of equations
for the normal curve."
- (p. 32) ERROR IN BOOK: I don't think e
should be referred to as Euler's constant, since what is commonly
called Euler's constant (commonly
denoted by lower-case gamma) is the limit of
1 + 1/2 + 1/3 + 1/4 + ... + 1/n - log n
as n tends to infinity, and is about 0.577.
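A quick numerical check of that limit (Python):

```python
from math import log

def harmonic_minus_log(n):
    # 1 + 1/2 + ... + 1/n - log(n), which tends to Euler's constant
    return sum(1.0 / k for k in range(1, n + 1)) - log(n)

print(harmonic_minus_log(10**6))  # approaches 0.5772...
```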
- (pp. 32, 33, 34, & 60) ERRORS IN BOOK:
Wilcox should not use exactly when referring to the probabilities (like
0.68 and 0.954) pertaining to normal distributions, since the values he gives
are not the exact probabilities.
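The exact values are easy to compute from the error function (a quick Python check):

```python
from math import erf, sqrt

def normal_central_prob(k):
    # exact P(|Z| <= k) for a standard normal Z
    return erf(k / sqrt(2.0))

print(normal_central_prob(1))  # 0.6826..., not exactly 0.68
print(normal_central_prob(2))  # 0.9544..., not exactly 0.954
```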
- (p. 35) MAD is introduced. It is a relatively widely used alternative
measure of scale. (Some other alternatives are described in the book
Understanding Robust and Exploratory Data Analysis by Hoaglin, Mosteller, and Tukey.)
- (pp. 34-37) Three methods of identifying outliers are described. I
encourage you to test your understanding by confirming the values given
in the table
below, which are the approximate proportions of outliers one should expect
when applying the various methods to large samples from several different
distributions. (Notes: Below, MADN is MAD/0.6745. Also, the ------ entry in
the Cauchy row is due to the fact that the standard deviation doesn't exist.)
It can be noted that for the boxplot method, the outliers from symmetric
distributions are the values that
are about (for large samples) at least 0.6745*4 (or just about 2.7)
MADNs from the median, and thus, for symmetric distributions at least,
the boxplot method will tend to identify
fewer values as outliers than will the 2nd method (which utilizes 2*MADN).
I generally prefer the boxplot method. (One reason for this preference
is due to the way the boxplot method deals with skewed distributions.)
(Note: For an exponential distribution having a mean of 1, I get that the MAD
should converge to about 0.4812.)
distribution || 2 sd from mean || 2 MADN from median || boxplot method
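As a check on the exponential claim above: the population MAD m solves F(median + m) - F(median - m) = 1/2, and for an Exp(1) distribution (median = log 2) this reduces to sinh(m) = 1/2, so m = asinh(1/2). A quick Python verification:

```python
from math import asinh, exp, log

med = log(2.0)    # median of the Exp(1) distribution
mad = asinh(0.5)  # solves sinh(m) = 1/2
print(mad)        # about 0.4812, matching the value above

# check against the defining equation F(med + m) - F(med - m) = 1/2
lhs = (1 - exp(-(med + mad))) - (1 - exp(-(med - mad)))
print(lhs)
```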
- (p. 36) ERRORS IN BOOK:
At the very top of this page,
in the continuation of the example started on p. 35, there are two 7s in the
1st line that should be 8s (since M equals 8), and also in the 1st line
there is a 5 that should be a 6, a 3 that should be a 4, and a 1 that should
be a 2. In the 2nd line, the value of MAD should be 4.
- (pp. 38-44) I don't have any comments about the section on the CLT, but
let me know if you have any questions about it.
- Wilcox indicates that approximate normality for the sampling distribution
of a statistic can suffer due to outliers if the breakdown point of the statistic is low.
- Several important terms and facts are given that you should make sure that you
are comfortable with:
- mean squared error (p. 50);
- homoscedasticity and heteroscedasticity (p. 56);
- Gauss-Markov theorem (p. 56);
- squared standard error (equal to the variance of a statistic), and of course standard
error (p. 59);
- expression (4.1) (p. 60).
- (p. 49) Some main goals are to understand the result of the Gauss-Markov
theorem, to understand Laplace's confidence interval method based on
approximate normality, and to develop an appreciation of the role (and
potential weaknesses) of
homoscedasticity assumptions in applied statistics.
(Comment: In Beyond ANOVA, Miller is perhaps too willing to
ignore heteroscedasticity in some cases.)
- (p. 51) Note that with regard to the Gauss-Markov theorem, the class of
weighted means does not include trimmed means and the sample median --- the weights
are not assigned to the order statistics.
- (p. 53) At the bottom of the page Wilcox refers to a property that many robust estimators have: performance almost as good as parametric estimators for
normal distributions when the parent distribution for the data is normal, and
improved performance (sometimes to a large degree) in many settings in which
the parametric estimators perform suboptimally.
A similar point is made on p. 65 --- it would be nice if an estimator works
well in the presence of heteroscedasticity, and yet also works nearly as
well as estimators derived under an assumption of homoscedasticity when in
fact there is no heteroscedasticity.
- (pp. 54-55, paragraph right before the regression section) One might expect the sample median to outperform the sample
mean if f(eta), the value of the pdf at the median, is greater than
1/(2*sigma). This condition is met for the Laplace distribution, and
other distributions that are sufficiently peaked at the median, as well as
fairly extreme contaminated normal distributions, and other distributions
that have a standard deviation that is rather large relative to the dispersion
near the median. But lots and lots of distributions that have heavier tails
than a normal distribution are such that the sample mean is superior to the
sample median, and in some cases for which the sample median is superior to
the sample mean, some other estimator will be superior to the sample median.
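A small Monte Carlo sketch of the Laplace case (Python; the sample size and number of replications are arbitrary choices of mine): the sampling variance of the median comes out well below that of the mean, consistent with the asymptotic efficiency ratio of 2 for this distribution.

```python
import random
import statistics
from math import log

random.seed(1)

def laplace():
    # standard Laplace (double exponential) draw via the inverse cdf
    u = random.random()
    s = 1.0 if u >= 0.5 else -1.0
    return -s * log(1.0 - 2.0 * abs(u - 0.5))

reps, n = 2000, 50
means, medians = [], []
for _ in range(reps):
    x = [laplace() for _ in range(n)]
    means.append(statistics.mean(x))
    medians.append(statistics.median(x))

v_mean = statistics.variance(means)
v_median = statistics.variance(medians)
print(v_mean / v_median)  # noticeably above 1 (the asymptotic value is 2)
```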
- (p. 56) ERROR IN BOOK:
Something is off in the description pertaining to Figure 4.3. One way to fix
it would be to make the 10 (in X + 10) a 2.
- (p. 58) Wilcox refers to a strategy based on estimating how the variance
of Y changes with X. What does J. J. Miller teach about
adjusting for heteroscedasticity in cases for which a transformation of
Y doesn't stabilize the variance?
- (p. 58) Wilcox has "the notion of a confidence interval was not new" but
fails to indicate what methods were available prior to the one developed by
Laplace. I wonder what methods existed prior to 1814.
(I'll guess that some Bayesian
intervals similar to confidence intervals were in existence.)
- (p. 60) Wilcox has "a random sample means that all observations are
independent." But what about the term simple random sample? Can't that
be used to refer to a randomly selected subset of a finite population, and thus be obtained
by making observations that are not independent?
- (p. 62) Wilcox points out that "a practical issue is the accuracy of any confidence interval we compute." (Unfortunately, many times people use methods
in situations in which good accuracy cannot be expected (often when large
sample sizes are called for to justify a certain method, and yet the samples
at hand are small).)
- (p. 63) Wilcox has "to avoid mathematical difficulties, we make a convenient assumption and hope that it yields reasonably accurate results."
I think some people hope for too much! (It's not that I never rely on the
robustness of classical procedures, but I always try to assess if such
reliance is warranted.)
- (p. 64, 1st full paragraph) To test for dependence, we can assume
homoscedasticity (because if the null hypothesis is true, we must have
homoscedasticity). But if we want to make a good estimate of the slope,
we should not necessarily assume homoscedasticity. (If we're giving an
estimate of the slope, then we're clearly not making a firm commitment to
the null hypothesis being true.) But Wilcox seems to suggest that perhaps
a better test can be performed if heteroscedasticity is allowed for. (When
we do a test based on an assumption of homoscedasticity, and heteroscedasticity
causes the test to reject, then we're okay (no Type I error) since if there
is heteroscedasticity then the null hypothesis (in the case under
consideration) is not true. But what if the nature of the heteroscedasticity
results in a test having low power to reject when a rejection is warranted?
(Should our concern be with Type II errors?)) It's also worth noting that if
the null hypothesis is one of a slope of zero, and not independence, then
one should not necessarily assume homoscedasticity under the null hypothesis.
- (p. 65, 1st and 2nd lines)
If the underlying distribution is close enough to a normal distribution,
and one uses t critical values,
then one doesn't necessarily have to assume that "an accurate estimate of
the variance has been obtained" since a confidence interval based on
t takes the uncertainty of the variance into account.
- (p. 65, 2nd and 3rd lines) ERROR IN BOOK:
I think Wilcox is wrong to refer to a specific interval (an interval resulting from a particular set of data) and claim that "there is a .95 probability that
this interval contains the true slope." (In my opinion, the 95% confidence
results from the success rate of the method, but a particular result of the
method either does or does not (so probability 1 or 0) contain the estimand.
(In some situations I just do not like having to deal with ignorance of the truth.))
- (p. 65, paragraph right before the summary)
When doing regression, what do you currently do to check the assumptions?
3rd meeting (July 6) : Ch. 5
- (p. 67, 3rd to last sentence) Wilcox indicates that some of the standard
methods are robust for validity (to use Rupert Miller's terminology)
under certain circumstances (but they are not all that robust for tests
about means, and even in situations where they are robust for validity,
they are not necessarily robust for efficiency).
- (p. 68) For the trivia game at next year's picnic, make a note that
Egon Pearson is the son of Karl Pearson.
- (p. 69) Wilcox uses beta for the probability of type II error,
whereas a lot of books use it for power.
- (pp. 70-71) Make sure you understand the 4 bulleted points about how
power relates to other factors.
- (p. 71, middle of page) ERROR IN BOOK:
Wilcox doesn't define unbiased hypothesis testing method correctly.
- (p. 71, last full paragraph) ERROR IN BOOK:
I get that the values should be .328 (instead of .148) and .775 (instead of
- (p. 72, paragraph in middle of page) ERROR IN BOOK:
Wilcox seems to be associating type II errors with values of the sample mean
less than 35, which is okay (I suppose), but one should keep in mind that
the test will also fail to reject if the sample mean assumes some values
greater than or equal to 35 (since the sample mean has to be far enough above
35 in order to result in a rejection).
- (p. 72, 6 lines before the new section) ERROR
IN BOOK: Should be larger, not "higher."
- (p. 73)
The figure isn't real good if the desire is to illustrate the difference in
power because the rejection regions aren't shown.
- (p. 73, 1st line after (5.1)) ERROR IN BOOK:
The word theorem should follow central limit.
- (p. 73) It could be mentioned that Slutsky's theorem, as well as the
central limit theorem, plays a role in the asymptotic normality of T.
- (p. 74, last sentence of paragraph at top of page) Wilcox makes a good
point: one shouldn't just think of nonnormality in terms of outliers, since
things like skewness can have an appreciable effect even if there isn't a
large percentage of outliers.
- (p. 74, 1st full paragraph) I find it interesting that Gosset first used
Monte Carlo results to investigate the sampling distribution of T, and
then attempted a mathematical derivation.
- (p. 74, first sentence of paragraph at bottom of page) I wonder how
Fisher's "more formal mathematical derivation" differed from what Gosset did.
- (p. 76) The lognormal distribution referred to in the figure and in the
text is one having parameters 0 and 1 (and so, letting X be the lognormal
random variable, we have X = e^Z, where Z is a N(0,1)
random variable). The skewness of this lognormal distribution is about 6.18,
which is a fairly severe degree of skewness (even though Wilcox refers to
this particular skewed distribution as being light-tailed (and so he's
focusing on outliers, and not skewness)).
- (p. 77) Notice how spread out the variance values are in the figure.
It turns out that the true variance can be shown to be about 4.7 for the
lognormal distribution used (and so for one sample, the sample variance is
more than 50 times larger than the true variance --- indicating that outliers
can have a huge effect (even though Wilcox indicates the distribution isn't
highly prone to yielding outliers)).
- (p. 78, 2nd paragraph) I have some similar results that I can present
concerning type I errors when using Student's t test when the parent
distribution is skewed.
- (p. 78, last paragraph) I like Bradley's point of view concerning
type I errors (actual vs. nominal).
- (p. 79, FIGURE 5.5) Can anyone explain why the distribution of the test
statistic is negatively skewed, while the parent distribution for the data
is positively skewed?
- (p. 79) Here are some other facts about the lognormal distribution
being used (that has a skewness of about 6.2): the kurtosis (with the -3
included) is about 111, even though the percentage of outliers in large
samples would typically be small (about 7.8% outliers using the boxplot
method, and about 3.7% outliers using the 2 standard deviation rule). I
think it's not just the percentage of outliers that we have to focus on ---
another important factor is how extreme the outliers are!
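For reference, the standard moment formulas for a lognormal(0,1) distribution reproduce the numbers quoted above (writing w = exp(sigma^2) = e; with mu = 0 the exp(2*mu) factor in the variance is 1):

```python
from math import e, sqrt

w = e  # exp(sigma**2) with sigma = 1

variance = (w - 1) * w                        # about 4.67 (the "about 4.7" on p. 77)
skewness = (w + 2) * sqrt(w - 1)              # about 6.18
excess_kurtosis = w**4 + 2*w**3 + 3*w**2 - 6  # about 110.9 (the "about 111")

print(variance, skewness, excess_kurtosis)
```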
- (p. 80, paragraph at top of page) There is an asymptotic formula that relates
the skewness of the parent distribution and the sample size to the expected
value of T under the null hypothesis that the mean equals a specific value.
- (p. 81, FIGURE 5.6) The bootstrap estimate of the sampling distribution
is obtained using Monte Carlo results, much like the way the estimated
sampling distribution indicated in FIGURE 5.5 was obtained --- only here the
empirical distribution is taken to be the true distribution underlying the data,
and so we have another source of inaccuracy --- not only can the Monte Carlo
samples fail to be truly representative of the distribution from which they are
being drawn, but also the estimate of the true parent distribution (the
empirical distribution) may not be real good.
- (p. 82, middle paragraph) Wilcox seems a bit vague about the role of transformations here (but things get cleared up a bit later on).
- (p. 84, 3rd line from top) ERROR IN BOOK:
Should be .05, not .95.
- (p. 84, 4th line from top) W is sometimes referred to as Welch's statistic.
- (p. 85, 1st bullet) The recommendation to "avoid Student's T" seems a bit too strong to me. I think situations can arise where I'd use Student's
two sample t test.
- (p. 86, 1st bullet) I find this statement interesting, especially in light
of the explanation supplied in the first full paragraph of p. 87.
- (p. 87, last 4 sentences) Wilcox seems to indicate that Student's t
test should be robust for validity for the general two sample problem, but
one can in fact have serious inflation of the type I error rate if the common
distribution is skewed and one sample size is rather small while the other one
is rather large (it's pretty much the same type of thing as for a one sample test).
- (p. 89, paragraph at top of page) I like Wilcox's suggestion that the
results of several procedures can be examined and the consistency of the
resulting conclusions be useful in giving an indication as to the
trustworthiness of the procedures (but it would take some experience to
get good at doing this sort of thing properly).
- (p. 90, 2nd bullet, 2nd to last sentence) ERROR IN BOOK:
In his statement about the possibility of a biased test, he has the false
and the true reversed.
- (p. 90) That there are general conditions under which Student's T
does not converge to the right answer may have been derived by Cressie and
Whitford, but the result was known by others prior to the publication of their
1986 paper. A 1983 draft of Rupert Miller's Beyond ANOVA: Basics of
Applied Statistics, which was first published in 1986,
included results that indicated the problem with the two sample t test.
(Wilcox also gives Cressie and Whitford credit on p. 87 and
p. 93 while not indicating
that they weren't necessarily the first to know of the result.)
- (p. 91, near top) A situation is described in which the use of transformations might be okay (but not necessarily optimal).
4th meeting (July 13) : Ch. 6
- (p. 94) Wilcox indicates that he will cover two bootstrap methods. These
are the percentile method and the percentile t method (often referred
to as the bootstrap t method). We should spend some time discussing
these methods. Also, I can describe how bootstrapping can be used to obtain
estimates of standard error and bias.
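Here is a minimal sketch of those two uses of the bootstrap (Python; the data set, the choice of the median as the statistic, and B are all arbitrary choices of mine): resample with replacement, recompute the statistic each time, and use the spread and the shift of the bootstrap values.

```python
import random
import statistics

random.seed(0)

x = [9.2, 10.1, 11.5, 8.7, 13.0, 9.9, 10.4, 12.2, 7.8, 10.8]  # hypothetical data
theta_hat = statistics.median(x)

B = 2000
boot = []
for _ in range(B):
    resample = random.choices(x, k=len(x))  # sample n values with replacement
    boot.append(statistics.median(resample))

se_boot = statistics.stdev(boot)               # bootstrap estimate of standard error
bias_boot = statistics.mean(boot) - theta_hat  # bootstrap estimate of bias
print(se_boot, bias_boot)
```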
- (p. 96, 1st sentence) Wilcox indicates the parent distribution
of the data can be estimated from the data. Of course, of chief concern is how
well it can be estimated with smallish samples.
- (p. 98) Wilcox has that the percentile confidence interval "can be
theoretically justified" in large sample settings. I'm not so sure about
that if the parent distribution is skewed.
The book by Efron and Tibshirani indicates that the percentile interval is
"backwards" and gives no good justification for it (other than it seems to
work okay in some situations). We can talk about this.
Note that the asymptotic accuracy result indicated on p. 104 pertains to the
bootstrap t method, not the simple percentile method.
- (p. 98) Note that large samples are required for the percentile method to
work well for making an inference about the mean. So clearly this bootstrap
method cannot always be relied upon to be accurate.
- (p. 100) In (6.1), do you understand why the sample mean of the original
sample is being subtracted from the sample mean of the bootstrap sample?
- (p. 100) ERROR IN BOOK:
In computing the value of T*, Wilcox has
the values 18.6 and 18.1 reversed (see p. 96, towards bottom of page), and so the value of
T* should be -0.19.
- (p. 101) Note that a sample size of at least 100 is recommended
for the percentile t method (and B = 999 is
recommended). So clearly this bootstrap
method cannot always be relied upon to be accurate.
- (p. 101) A motivation for using 999 for B, instead of 1000, is that
999 T* values divide the real numbers into 1000 intervals.
- (p. 101) Hall's 1986 article indicates that coverage probability doesn't
suffer so badly when a small value is used for B, but with a small
value of B we put ourselves at risk of getting an interval that is longer than it needs to be.
- (p. 250) ERROR IN BOOK:
The correct page numbers for this article by Hall are 1453-62.
- (p. 102, 1st line) Rather than round .025B and .975B
to nearest integer, it'd be better to pick a value of B so that
0.025(B + 1) and 0.975(B + 1) are integers, and use these
values for L and U.
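A small sketch of that rule, using exact rational arithmetic to dodge floating-point surprises (the function name is mine): 0.025(B + 1) and 0.975(B + 1) are both integers whenever B + 1 is a multiple of 40.

```python
from fractions import Fraction

def percentile_cutoffs(B, alpha=Fraction(5, 100)):
    """1-indexed positions (L, U) of the ordered bootstrap values that
    bound a 1 - alpha percentile-type interval, assuming B was chosen
    so that (B + 1) * alpha / 2 is an integer."""
    lo = Fraction(B + 1) * alpha / 2
    hi = Fraction(B + 1) * (1 - alpha / 2)
    if lo.denominator != 1:
        raise ValueError("choose B so that (B + 1) * alpha / 2 is an integer")
    return int(lo), int(hi)

# B = 599 suits a 95% interval, since B + 1 = 600 is a multiple of 40
assert percentile_cutoffs(599) == (15, 585)
assert percentile_cutoffs(999) == (25, 975)
```

Note that this gives U = 585 for B = 599, which is relevant to the U = 585 versus 584 issue that comes up again in Ch. 9.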
- (p. 102) ERRORS IN BOOK:
Wilcox has 11.4 as the sample standard deviation for the sample given on
p. 96 (near the top) and repeated on p. 100,
but this value is actually about 11.14. Also,
it isn't clear to me where the values 2.08 and 2.55 are coming from.
Based on p. 100 and the caption of Fig. 6.3 on p. 101, it seems like these
values should be 2.14 and 2.01. Finally, the interval obtained from
assuming normality should be (13.4, 23.8) instead of (13.3, 23.9).
- (p. 102, 3rd line from bottom) ERROR IN BOOK:
The interval obtained from
assuming normality should be (2.2, 12.7) instead of (2.2, 12.8).
- (p. 104, 1st two lines) Wilcox's advice "to always use the percentile
t bootstrap when making inferences about a mean" seems questionable!
Did he consider Johnson's modified t test for tests about the mean
of a skewed distribution? And what if n is small? (Doesn't Wilcox
indicate that n shouldn't be too small when using bootstrapping? Does
he believe that a bootstrap method is best if the sample size is only 10?)
- (p. 105) I think it is bad that Wilcox just compares bootstrapping to
Student's t, when in many situations Student's t isn't the
best nonbootstrap method to use.
- (p. 105 and p. 107) It cracks me up that Wilcox refers to
"quantitative experts" (p. 105) and "authorities" (p. 107).
- (p. 108) To me it would make a lot more sense to use robust estimates
of the slope in the bootstrap procedure. (If you believe normality, it's
not clear that the bootstrap is needed, and if you worry about
nonnormality, I think a robust estimation procedure would be better.)
It would be interesting to study this with a Monte Carlo study. (It may
be that some fine-tuning would be called for, as is described at the top
of p. 109.)
- (p. 109, last paragraph) Do you understand how "the information
conveyed by the correlation coefficient differs from the least squares estimate of the slope"?
- (p. 112, at the top of the page) What do you think makes the actual
size of the test exceed 0.05?
- (p. 113) Note that the breakdown point of the correlation coefficient is very low --- a single sufficiently bad data point can greatly change its value.
- (p. 115) Note that in some situations the percentile method works better, and
in other situations the bootstrap t works better.
- (p. 115) ERROR IN BOOK:
Wilcox refers to Section 6.5 and Section 6.6, and yet the book has no numbered sections.
5th meeting (July 20) : Ch. 7
- (p. 118, about 1/2 way down page) Wilcox gives a reference for a "method
for answering the question
exactly", making it perhaps seem like something nontrivial --- whereas in fact
it is rather basic probability stuff (but I guess Wilcox doesn't assume his
typical reader knows much about probability). Does everyone understand how
the probability can be easily determined?
- (p. 119, at top of page) Can you show that if one approximates the
contaminated normal cdf with the standard normal cdf, the maximum error
is about 0.04? (It's a fun little probability exercise.)
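If you want to check your algebra numerically first, here is a quick sketch. I'm assuming the contaminated normal Wilcox uses elsewhere, the mixture 0.9 N(0,1) + 0.1 N(0, 10^2); the maximum gap between the two cdfs comes out near 0.04, at x a bit above 2.

```python
from math import erf, sqrt

def Phi(x):
    """Standard normal cdf via the error function."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def contam_cdf(x, eps=0.1, k=10.0):
    """cdf of the contaminated normal (1 - eps) N(0,1) + eps N(0, k^2)."""
    return (1.0 - eps) * Phi(x) + eps * Phi(x / k)

# maximum absolute difference between the two cdfs over a fine grid
max_err = max(abs(contam_cdf(i / 1000.0) - Phi(i / 1000.0))
              for i in range(-8000, 8001))   # roughly 0.04
```

The error is |eps| times |Phi(x/k) - Phi(x)|, so setting the derivative to zero locates the maximizing x; the grid search above just confirms the calculus.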
- (p. 120) Wilcox has that "the population variance is not robust", and by this he means that slight changes (measured in
certain ways) in the distribution can result in large changes in the
variance. Usually we think of robustness as relating to statistics, but
here he's applying the idea to a distribution summary measure.
- (p. 121, at top of page) The 0.96 comes from a one-sided test. (The way
it is worded, one might think that a two-sided test was under consideration.)
- (p. 121, towards bottom of page) It's hard to confirm the 0.28 value
without doing a Monte Carlo study since the sample sizes are too small to
count on the test statistic having approximately a standard normal or t
distribution with the underlying distribution of the data being so nonnormal.
- (p. 121, last 2 lines) ERROR IN BOOK:
It should be Chapter 5 instead of "Chapter 4" (see pp. 71-72 of Ch. 5).
- (p. 123, at top of page) The first sentence states a key idea!
- (p. 123, top half of page) The first full paragraph reminds us that
although one may have robustness for validity with large enough sample sizes,
one need not have robustness for efficiency in all such situations.
- (p. 123, 8 lines from bottom) ERROR IN BOOK:
Wilcox refers to the normal distribution as being light-tailed, but I think
it's better to think of the normal distribution as having neutral tails.
- (p. 127) The desire to label effect size seems big in psychology (Wilcox
is in a psychology department, and I've encountered this when dealing
with psychologists at GMU). I tend to wonder about the power of
detecting differences of practical concern for the situation under consideration,
but I think the magnitudes involved differ from situation to situation and don't
think in terms of preset definitions.
- (p. 132) At the end of the first paragraph of the new section,
I think it's the case
that while bootstrapping may result in improved accuracy when applied to
normal-theory test statistics used with nonnormal data, the power could still be poor because the test statistic is ill-suited for the task ... it's defective (as Wilcox suggests).
6th meeting (July 27) : Ch. 8
- (p. 139) ERROR IN BOOK: The first sentence
of the second paragraph needs to be reworded ("particularly the population
variance" seems out of place).
- (p. 140, top portion of page (and p. 158)) Unfortunately, Box's paper
has led some to believe that unequal variances are of little concern with
one-way ANOVA. (Both Rupert Miller and John Miller seem to extract this
lesson from the paper.) But Wilcox points out that Box considered only rather
tame cases of heteroscedasticity. Wilcox's 1997 book includes some
numerical results indicating that type I error rate can well exceed nominal
level if variances differ by enough (and unequal sample sizes serve to
aggravate the problem). Also of concern is the fact that power characteristics
can be screwy even if equal sample sizes serve to make actual size of test
close to nominal level. (Those of you who have taken STAT 554 should be
somewhat familiar with this phenomenon.)
- (p. 140, towards bottom of page) Wilcox suggests that problems due to
heteroscedasticity are underappreciated by applied researchers who use
statistical methods. My guess is that they are also underappreciated by most
statisticians holding graduate degrees in statistics. Perhaps lots of
statisticians are aware that there are some problems related to
heteroscedasticity, but many may not be adequately trained in how to deal
with such problems. (Often the semester ends before courses can address such matters.)
- (p. 141, at top of page) The first full paragraph describes a main goal
that is addressed in Chapters 8 and 9.
- (p. 141, last paragraph before new section) Wilcox has "At some point doubt arises as to whether the population mean provides a reasonable measure of what is typical." While this may be true, it may be that the focus should be
on the mean even if the mean doesn't correspond to a typical value. For more on this, read the 3rd item in my notes on Ch. 2 above.
- (p. 141, last 4 lines) The two classes of estimators are covered in
Ch. 8 (intro material) and Ch. 9 (using such estimators to make inferences).
- (p. 142, first paragraph) This paragraph indicates that the sample mean
and the sample median are extreme examples of trimmed means. I'll also
point out that they also fall into the class of M-estimators.
- (p. 142, first paragraph) The last sentence of this long paragraph
suggests that the relative merits of different degrees of trimming should be
considered, but Wilcox doesn't give a lot of information which would
allow us to get a good feel for how the degree of trimming affects performance.
- (p. 143, near top of page) The suggestion to trim 20% follows from a
strategy to trim as much as possible for protection against the ill effects
due to heavy tails while still being competitive in the ballpark of normality.
Studies that I have done suggest that 10% trimming is better than 20% trimming
in a lot of heavy tail situations. While 20% trimming does indeed do better
than 10% trimming in rather extreme heavy tail situations, unless the data
suggested that the case at hand may be such an extreme case, I'd prefer to
trim just 10%. (5% trimming may be slightly better if the underlying distribution
is only slightly heavy-tailed, but performance wouldn't drastically suffer
if 10% was trimmed in such situations.)
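A quick Monte Carlo along these lines is easy to set up. In the sketch below the contaminated normal, the sample size, and the function names are all my choices for illustration; it estimates the MSE of a gamma-trimmed mean so that different trimming proportions can be compared:

```python
import random

def trimmed_mean(x, gamma):
    """gamma-trimmed mean: drop the int(gamma * n) smallest and largest
    values, then average what's left (gamma = 0 gives the sample mean)."""
    s = sorted(x)
    g = int(gamma * len(s))
    kept = s[g:len(s) - g] if g > 0 else s
    return sum(kept) / len(kept)

def mc_mse(gamma, n=20, reps=4000, seed=1):
    """Monte Carlo MSE of the gamma-trimmed mean when sampling from the
    contaminated normal 0.9 N(0,1) + 0.1 N(0, 10^2) (true center 0)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(reps):
        x = [rng.gauss(0.0, 10.0 if rng.random() < 0.1 else 1.0)
             for _ in range(n)]
        total += trimmed_mean(x, gamma) ** 2
    return total / reps
```

In this particular setup both mc_mse(0.1) and mc_mse(0.2) come in well below mc_mse(0.0); which of 10% and 20% wins depends on how heavy the tails are, which is the point at issue.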
- (p. 145, Figure 8.2) I think this figure is a poor way to compare the
performances of the estimators since it truncates the tail behavior. He
should also give estimated MSEs and MAEs and/or supply some information
about the proportions of large errors of estimation. (Maybe over 99% of the
time the estimators would supply about the same quality
of estimates --- but if so
then perhaps the focus should be on the less than 1% of the instances in which
at least one of the estimators performs badly.)
- (p. 147) The last sentence of the paragraph that continues at the top of
the page suggests that trimmed means can be used for tests and confidence
intervals even if the underlying distribution is skewed. But I think that we'll
see in Ch. 9 that we must be content with making inferences about the
population/distribution trimmed mean instead of the population/distribution
mean, and of course (as Wilcox points out in several places) these
population measures can differ (and so one should give some thought as to
what it is that you really want to make an inference about).
- (p. 148, bottom portion of page) We are reminded that outliers can be
of interest --- but it is also suggested that outliers can get in the way when
the goal is to learn something about the bulk of the members of a group.
- (p. 149) Wilcox refers to
"three criteria that form the foundation of modern
robust methods." Information about these can be found in Wilcox's 1997 book.
Qualitative robustness pertains to the sensitivity of a statistic or
distribution measure to small changes ---
can small changes in data or distribution
result in large changes in value of statistic or distribution measure?
Infinitesimal robustness is similar --- it also deals with sensitivity
to small changes. But with this concept the effect of a small change is
described using an influence function (and a bounded influence function
results in good robustness).
- Quantitative robustness pertains to breakdown points.
- (p. 150) I think that it may be easier to get a feel for M-estimators
if a description based on the penalty function (rho) is given as opposed to
a description based on the influence function (psi). I can offer you such
a description when we meet.
- (p. 151, bottom portion of page) Although it isn't entirely bad to
think of M-estimators as ones that down-weight or ignore extreme observations,
with the Huber M-estimator it is perhaps more accurate to say that for
observations far from the bulk of the data, the "excess" distance away is
ignored (or the distance away is down-weighted).
- (p. 152, top half of page) The first two paragraphs on the page describe
the overall strategy. The last several sentences of the 2nd paragraph
describe a key part of the strategy.
- (p. 152, 3rd paragraph) To eliminate the biweight (aka bisquare) brand
from consideration seems a bit too extreme. I've found that if the tailweight
of the underlying distribution is heavy enough, the biweight variety is better
than the Huber variety of M-estimator. Sure one has to worry about lack of
convergence to a sensible value. But one could always compute Huber's
M-estimate as a check, and think of the biweight estimate as a slightly
superior estimate if indeed it is not drastically different from the Huber
estimate (and if they differ by more than a
bit, I'd take a careful look into the situation).
- (p. 152, 3rd paragraph; & p. 158) I think Wilcox puts too much emphasis
on the conclusions of the Freedman and Diaconis article. They show that
M-estimators converge to the correct value for symmetric unimodal (there are
some restrictions) distributions, but
that redescending M-estimators need not be consistent for
multimodal distributions. So if the underlying distribution is not
multimodal, maybe redescending M-estimators can do okay. (Of course,
consistency is an asymptotic result, and so maybe we need to be a bit concerned with smallish sample sizes.)
- (p. 152, last paragraph) Some explanation is given for the desire to
incorporate a measure of scale into the M-estimation procedure. We can
discuss the matter more when we meet.
- (p. 153, top portion of page) Wilcox indicates that "quantitative experts"
suggest setting K to the value 1.28. Other values that have been
suggested by reasonable people are 1.345 and 1.5. (1.28 is about
z0.1. 1.345 is the value that results in an ARE of 95%
when the underlying distribution is normal. 1.5 was suggested in Huber's
original 1964 paper (and is favored by Birkes and Dodge).)
- (p. 153, bottom half of page) I agree that the one-step Huber M-estimate
of location is nearly as good as a fully iterated one. I wonder if there is
a simple one-step estimate to use for an estimate of the slope parameter in
simple regression based on Huber's M-estimator, or if one-step versions exist
for the biweight estimate. (Clearly, one could stop after a single iteration,
but I wonder if this results in a relatively simple closed-form estimate.)
(Question: If no closed form estimate exists (say for an mle or an M-estimate),
is it okay to refer to an estimator? Surely we'd have an estimation
procedure, and one could assess things like unbiasedness and consistency for
the procedure, but I wonder if it's okay to use the term estimator in such cases.)
- (p. 153, bottom half of page) Back in the 1970s, calculus books referred
to Newton's method, but in the last 20 years I tend to see
Newton-Raphson used. Is there a distinction? Is it okay just to
call it Newton's method?
- (p. 154, top half of page) Note that the one-step (one iteration may
be a more accurate term to use) estimate can be
described in terms of a simple two-step procedure (outlier identification, followed by outlier removal (and averaging)) if one ignores the
1.28(MADN)(U - L) part. The one-step estimate is similar to
an adaptive trimmed mean.
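Here is how I read the computation on pp. 153-154, as a Python sketch (treat it as my interpretation, not a faithful transcription of Wilcox's formula; 0.6745 is the usual constant that makes MADN estimate the standard deviation at the normal):

```python
def median(x):
    s = sorted(x)
    n, m = len(s), len(s) // 2
    return s[m] if n % 2 else 0.5 * (s[m - 1] + s[m])

def one_step_m(x, K=1.28):
    """One-step Huber M-estimate of location: flag points more than
    K * MADN from the median, average the rest, and add the correction
    term K * MADN * (i2 - i1).  Assumes MADN > 0."""
    M = median(x)
    madn = median([abs(v - M) for v in x]) / 0.6745
    i1 = sum(1 for v in x if (v - M) / madn < -K)   # low outliers
    i2 = sum(1 for v in x if (v - M) / madn > K)    # high outliers
    middle = [v for v in x if abs(v - M) / madn <= K]
    return (K * madn * (i2 - i1) + sum(middle)) / (len(x) - i1 - i2)
```

With x = [1, 2, 3, 4, 100] the ordinary mean is 22, while the one-step estimate stays at 3 --- the correction term is what keeps this from being nothing more than an adaptive trimmed mean when the flagged counts i1 and i2 differ.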
- (p. 154, bottom half of page) Wilcox states that "the one-step M-estimator
looks very appealing." Based on studies that I have done (with the help of
various students), I agree. The one-step estimator does about as well as
(or better than) the 10% trimmed mean in situations in which the 10% trimmed
mean works very well, and
it does about as well as
(or better than) the 20% trimmed mean in situations in which the 20% trimmed
mean works very well. It's about 95% as good (using MSE as a measure of
goodness) as the sample mean if the underlying distribution is normal (even
for sample sizes in the ballpark of 10 or 20).
So if using the sample mean is rejected due to apparent heavy tails, then for
estimating the mean/median of a symmetric distribution, the one-step M-estimate
seems like a good choice. Although there are problems with using it to estimate
the mean or median of a skewed distribution if the sample size is sufficiently
large (due to bias that doesn't even vanish asymptotically,
resulting in an MSE that does not tend to 0), my work has indicated
that the M-estimate is not necessarily a bad choice for small sample size
situations (although for estimating the mean or
median, if the skewness is large
relative to the kurtosis, one may be better off using the sample mean
or a one-sided trimmed mean).
- (p. 155) Wilcox states that "the one-step M-estimator can have a
substantially smaller standard error" (compared to the 20% trimmed mean).
It's important to keep in mind that for large sample sizes, bias may also be of concern --- because neither estimator is guaranteed to be unbiased, or even
asymptotically unbiased, for the distribution mean or median.
The estimator with the smaller standard error is not necessarily the one having the smaller MSE.
- (p. 156, 4th sentence from top)
ERROR IN BOOK:
Instead of "an outlier" it should be some outliers.
- (p. 157) The paragraph right before the Summary gives somewhat of a
summary of Wilcox's opinions about the relative merits of the Huber
M-estimator and the 20% trimmed mean, and also provides something of a preview
for Ch. 9.
- (p. 157) The 2nd bulleted item of the Summary is rather important.
- (p. 158) The book by Staudte and Sheather may be good to investigate at
some point. (Just today I got a book on bootstrapping by Chernick that I'm
going to evaluate for seminar appropriateness.)
- (p. 158, last sentence) The standard error of the one-step estimator can be estimated with a simple bootstrap estimate of standard error.
- (p. 161, 3rd line from top & 3rd line from bottom)
ERROR IN BOOK:
The word probability should be replaced by distribution (or perhaps probability distribution).
- (p. 162, near middle of page)
The "intuitive explanation" that Wilcox refers to is attempted on the bottom
half of p. 164, but it is not a good explanation.
- (p. 164)
In expression (9.2), I think it may be better to replace gamma by
g/n, since the actual proportion trimmed can differ a bit from the
nominal value. This is consistent with what has been found to be true in
the two sample case. (See p. 170, where Wilcox has "Yuen's method has been
found to perform slightly better when sample sizes are small.")
- (p. 164)
Using expression (9.2), and assuming a large sample size,
determine what an optimal value for gamma is if the distribution is a
Laplace (double exponential) distribution. Please try to do this.
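As a partial check on my own answer: I haven't verified that this matches expression (9.2) term for term, but using the standard large-sample (Winsorized-variance) formula for the gamma-trimmed mean of a symmetric distribution, the Laplace case reduces to a closed form, and the variance falls monotonically toward 1 (the asymptotic variance of the median, 1/(4 f(0)^2)) as gamma grows --- suggesting that for the Laplace, maximal trimming is optimal.

```python
from math import log

def laplace_tm_avar(gamma):
    """Large-sample variance (times n) of the gamma-trimmed mean under the
    standard Laplace density f(x) = exp(-|x|) / 2, from the Winsorized
    variance formula; c = -log(2 * gamma) is the (1 - gamma) quantile."""
    c = -log(2.0 * gamma)
    return 2.0 * (1.0 - 2.0 * gamma * (c + 1.0)) / (1.0 - 2.0 * gamma) ** 2

grid = [g / 100.0 for g in range(1, 50)]        # gamma = 0.01, ..., 0.49
vals = [laplace_tm_avar(g) for g in grid]
# vals decreases steadily from near 2 (the Laplace variance, gamma -> 0)
# toward 1 (the median's asymptotic variance, gamma -> 1/2)
```

One can confirm the monotonicity with calculus: writing u = 2 * gamma, the derivative of the variance with respect to u has the sign of (1 + u) log u + 2 (1 - u), which is negative on (0, 1).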
- (p. 164)
ERROR IN BOOK:
3 lines below expression (9.2) a factor of 2 is missing from in
front of the gamma.
- (p. 166)
Note that the confidence interval given in expression (9.2)
is for the population (distribution) trimmed mean. For symmetric distributions
the distribution's trimmed mean coincides with the mean/median, but for skewed
distributions the trimmed mean is a nonstandard distribution measure to focus on.
- (p. 167, roughly 3rd quarter of page)
Results from one of my studies conflict with Wilcox's claims.
I found that increasing the proportion trimmed can cause tests to become anticonservative in some cases.
- (p. 167, bottom portion of page)
Unfortunately, Wilcox's conclusions are rather vague. I wish he had used
the style of his 1997 book where he reported on results for specific
distributions and sample sizes.
- (p. 168, 1st two lines)
Note that n cannot be too small if one wants to ensure good accuracy.
- (p. 168, last sentence of 1st full paragraph, and last sentence on page)
I'd like to see someone else confirm this. Also, I wish Wilcox would have
indicated what sample sizes and distributions he considered.
- (p. 168 & p. 171)
I think that using 585 for U makes more sense than 584. Also, the reason for B being 599 instead of 499 is that for a 95% confidence interval,
for which 2.5th and 97.5th percentiles are needed, B+1 should be a multiple of 40.
- (p. 169, 1st paragraph)
Wilcox has "and in the event sampling is from a normal curve, using means
offers only a slight advantage." My guess is that the alternative procedure
can result in about a 10% decrease in power, which is somewhat slight, but
not ultraslight. (Note: The reduced power of the alternative procedure
is also relevant to the last sentence on p. 173 and to 2nd to the last
bullet on p. 178.)
- (p. 171)
In addition to being a confidence interval for the difference between two
population trimmed means, expression (9.9) could also be used to perform a
test for the general two sample problem (testing the null hypothesis that
the two distributions are identical against the general alternative).
- (p. 174, last paragraph)
Wilcox compares trimmed mean procedure to M-estimator procedure. One of my studies suggests that for a variety of symmetric heavy-tailed distributions, the
signed-rank test outperforms (power comparable, but accuracy better (of
course, since signed-rank test is exact)) testing procedures based on trimmed means and M-estimators, although I didn't use the bootstrap methods that Wilcox recommends.
- (p. 174 & p. 175)
Note that in some cases Wilcox has found that the regular (and simple)
percentile bootstrap outperforms the bootstrap t (something that Wilcox refers to as "somewhat surprising").
- (p. 175, top half of page)
Note the way the D* values are formed. While it's clear
that doing it this way is reasonable, I wonder if more accuracy could be
achieved by doing it another way --- instead of just using B differences,
combine all of the estimates from resampling B times from the first
sample with all of the estimates obtained from resampling B times
from the second sample to form B2 differences.
Can anyone figure out a way to do it this alternative way?
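Here is one way to sketch the alternative (function name and details are mine; whether the B^2 version actually improves accuracy would have to be settled by simulation):

```python
import random

def percentile_ci_all_pairs(x, y, stat, B=599, alpha=0.05, seed=0):
    """Percentile-type interval for stat(X) - stat(Y) built from all
    B * B cross differences of the bootstrap estimates, instead of the
    usual B paired differences."""
    rng = random.Random(seed)
    bx = [stat([rng.choice(x) for _ in range(len(x))]) for _ in range(B)]
    by = [stat([rng.choice(y) for _ in range(len(y))]) for _ in range(B)]
    diffs = sorted(dx - dy for dx in bx for dy in by)   # B**2 values
    lo = int(alpha / 2 * len(diffs))
    hi = int((1 - alpha / 2) * len(diffs)) - 1
    return diffs[lo], diffs[hi]
```

The B^2 differences are not independent (each bootstrap estimate is reused B times), so it isn't obvious that the extra differences buy anything --- that is exactly the question worth simulating.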
- (p. 175, towards bottom of page)
Wilcox indicates here that with trimmed means the percentile bootstrap is better than the bootstrap t --- and so I wonder why he went into more detail
on the bootstrap t method and is somewhat casual in remarking that the
percentile method is better.
- (p. 178, 1st full sentence)
ERROR IN BOOK:
Wilcox has "In some cases the correct estimate is substantially smaller than
the incorrect estimate." I don't see how this is possible. Am I doing something wrong, or do you guys agree with me? Please take a moment or so to consider this.
- (p. 179)
Wilcox uses "measures" instead of variables on the next to the last
line, even though he had used variables previously on the page.
Although measures may be used in some fields, I don't think it's good to use
two different terms when one would suffice.
- (p. 183)
ERROR IN BOOK:
Wilcox refers to "Section 6.6" even though the book has no numbered sections.
- (p. 187, 1st full paragraph)
Note that the t test is fairly accurate for testing the null hypothesis
of independence against the general alternative, even if there is
nonnormality, but power can be low in some situations.
Also, in addition to having Spearman's rho and Kendall's tau, StatXact
has an exact permutation test based on Pearson's statistic (that is an exact way
to test for lack of independence) that doesn't require an assumption of
normality --- and if this exact version is employed,
one wouldn't have to worry about even slight inaccuracy due to nonnormality.
- (p. 188)
ERROR IN BOOK:
In the 2nd set of X and Y values a little more than halfway down the page,
the 3rd Y value should be 28 (instead of 47).
- (p. 190, 6th line)
Why recommend B = 600 here when previously 599 has been used?
This seems like needless lack of consistency.
- (p. 190, last 2 lines (and 1st line of p. 191))
I question that Spearman's rho and Kendall's tau are "typically covered"
in introductory statistics courses.
- (p. 191)
Here are some comments about Spearman's rho.
- Spearman's paper appeared in 1904, and so it is a rather old statistic.
- It should be mentioned that it is a measure of the strength of a
monotone relationship, whereas Pearson's statistic is a measure of the
strength of a linear relationship.
- When doing a test with a small sample, tables of the exact null sampling
distribution (or StatXact) should be used instead of a normal or t approximation.
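Since it is just Pearson's correlation applied to the (mid)ranks, Spearman's rho takes only a few lines to compute from scratch (midranks handle ties):

```python
def midranks(x):
    """Ranks 1..n, with tied values each given the average of the
    positions they occupy."""
    order = sorted(range(len(x)), key=lambda i: x[i])
    r = [0.0] * len(x)
    i = 0
    while i < len(x):
        j = i
        while j + 1 < len(x) and x[order[j + 1]] == x[order[i]]:
            j += 1
        for k in range(i, j + 1):
            r[order[k]] = (i + j) / 2 + 1   # average 1-based rank
        i = j + 1
    return r

def spearman_rho(x, y):
    """Spearman's rho: Pearson's correlation of the ranks, a measure of
    the strength of a monotone (not necessarily linear) relationship."""
    rx, ry = midranks(x), midranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    num = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    den = (sum((a - mx) ** 2 for a in rx)
           * sum((b - my) ** 2 for b in ry)) ** 0.5
    return num / den
```

For the perfectly monotone but nonlinear pairs x = 1, ..., 5 and y = x squared, rho is exactly 1 even though Pearson's r falls short of 1 --- which illustrates the monotone-versus-linear distinction in the first bullet above.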
- (p. 193)
In addition to the behavioral data analysis book by N. Cliff,
other books (e.g., the text I use for Nonparametric Statistics) also
contain information about the treatment of ties. (When there are a lot of ties,
the method used can make an appreciable difference.)
- (p. 193)
The 1997 Wilcox book contains some information about the methods related to M-estimators.
- (p. 197)
Note that the axis of the MVE need not directly correspond to the
correlation computed from the points inside the MVE.
- (p. 197)
What is the IML of SAS/IML?
- (very bottom of p. 200 and very top of p. 201)
Wilcox makes it seem as though Spearman's rho and Kendall's tau are to be
considered to be newish alternative methods, but they really aren't very new.
(Spearman's paper was published in 1904.)
- (pp. 200-201) Wilcox suggests that the nature of the association in
Figure 10.10 changes at about X = 150, but I wonder if he is putting too
much emphasis on the smooth. If the plot was truncated at X = 250 and the
smooth was removed, then to me a visual examination would not suggest that
there is a positive trend up to X = 150 and then no association for larger
values of X. (Note: I'm not a fan of smooths when the data is sparse as it is
in the right half of Figure 10.10. If the parameter(s) of the smoother were
set differently, the picture would change.)
- (p. 202, last bullet) If one wants to test using a null hypothesis of
independence instead of a null hypothesis of zero correlation, then I think the
tests based on an assumption of homoscedasticity should be preferred. One
could still use resampling to perform tests based on more exotic statistics,
but I think the resampling should be of the permutation variety as opposed
to resampling intact (x, y) ordered pairs.
9th meeting (August 24) : Ch. 11 & Ch. 12
- (p. 206, 2nd to last sentence of 1st full paragraph)
This relates to Ch. 10 (and recall, doing a t test
that the correlation is 0 using
Pearson's sample correlation coefficient is equivalent to doing a t
test of the null hypothesis that the slope is 0).
Wilcox indicates that if you reject that the slope, beta, is 0, you
can safely interpret that the distribution of Y depends on x,
but you should be careful about making the interpretation that
E(Y|x) is an increasing or decreasing function of x if the
assumption of homoscedasticity is in question (even if the rest of the
simple regression model holds).
- (p. 206, 3rd sentence of 2nd paragraph)
I agree with Wilcox's advice.
- (p. 206)
ERROR IN BOOK:
Wilcox refers to Section 11.12, but there are no numbered sections.
- (p. 208, 1st full paragraph)
It's interesting that Wilcox claims that the Theil-Sen estimator competes well
with least squares when there are iid normal error terms, but I wish he'd
have given a quantitative result! I'm not going to be happy with 80%
efficiency since M-regression can be very close to 95% efficient while
protecting against the ill effects of very heavy-tailed distributions.
- (p. 212, near middle of page, 1st sentence of paragraph)
It'd be more accurate to put reduces the bad effects of outliers
in place of "protects against outliers" since in some cases other methods
are appreciably more resistant to outliers (and so you don't want to think that
L1 regression provides complete protection). (Also, in the
3rd sentence of that paragraph, it's kind of silly to have that the breakdown
point is "only zero" --- I think it would be better to put that the breakdown
point is 1/n (or asymptotic breakdown point is 0), and perhaps remind
the reader that this is the lowest possible value.)
- (p. 213)
Some people think L1 regression is more resistant to weird
points than it really is. Figure 11.2 provides an excellent example of how
this method can fail.
Maybe we can find this data and then try other types of regression with it.
My guess is that least squares will do poorly, but some types of M-regression
will do well.
- (p. 213)
ERROR IN BOOK:
Ordering the residuals makes no sense --- need to use order statistics of
the squared residuals instead.
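The corrected criterion is simple to write down; in this sketch the coverage h (how many of the smallest squared residuals get summed) is my choice of default, with h near n/2 giving the high-breakdown version:

```python
def lts_objective(x, y, a, b, h=None):
    """Least trimmed squares criterion for the line y = a + b * x: the
    sum of the h smallest SQUARED residuals --- i.e., order the squared
    residuals, not the residuals themselves."""
    if h is None:
        h = len(x) // 2 + 1
    sq = sorted((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    return sum(sq[:h])

# five points on y = 1 + 2x plus one wild point: the true line has
# LTS objective 0, since the h = 4 smallest squared residuals are all 0
x = [0, 1, 2, 3, 4, 5]
y = [1, 3, 5, 7, 9, 100]
```

An actual LTS fit minimizes this criterion over (a, b), typically by random subsampling of elemental fits, since the objective is not smooth in the parameters.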
- (p. 214) Figure 11.3 shows that even high breakdown point methods can perform poorly.
- (pp. 213-215)
If one believes regression model with iid error terms holds, then LTS (and LMS too)
is a very poor choice for estimating the unknown parameters (according to results from studies that I have done). Not only are the MSEs relatively huge, but
also LTS is very slow on S-PLUS.
- (p. 215, last 2 sentences of 2nd to last paragraph)
I agree with Wilcox's advice, but one could add that the fits should be
examined graphically even if all of the methods you try are in near
agreement (since all could result in a screwy fit).
- (p. 216)
ERROR IN BOOK:
One needs to use the ordered absolute residuals, not the absolute values
of the ordered residuals.
- (p. 217)
One might wonder why use the median of the squares instead of the median
of the absolute values. I think it may have to do with having a unique
solution in even sample size cases. For odd sample sizes it would make
no difference whether squares or absolute values were used, but for even sample
sizes, where I think it uses the average of the two middlemost values, one
doesn't necessarily get unique estimates if absolute values are used.
(Think of the intercept, and moving
the fitted line up and down while keeping slope constant. Different values
of the intercept can result in the same value of the median of the absolute
residuals if the median is computed from two middlemost values.)
- (p. 217) I've seen in other places where LTS is generally better than LMS
(as Wilcox has). But the word must not have spread to all corners, since
some seem to use LMS instead of LTS when they want a high breakdown method.
- (p. 217)
ERROR IN BOOK:
Near middle of page, Wilcox has "indicated in Figure 11.3" but I don't see how
the figure indicates what he claims it does (since it doesn't even show the LMS fit).
- (p. 218)
The method described for identifying regression outliers seems superior to
using studentized residuals (since studentized residuals are based on least
squares fits, and are not necessarily well-behaved if there are multiple
outliers (or just general overall heavy-tailedness)). I wonder what the
alternative method corresponds to in the iid normal case. (I.e., are
the points labeled as regression outliers those with a studentized
residual of 2 or greater?)
- (p. 221)
I wish Wilcox would have included a description of the adjusted M-estimator
(but I guess we can refer to his 1997 book, although I think the
description there should be improved).
- (p. 221)
ERROR IN BOOK:
Wilcox refers to Section 11.10, but there are no numbered sections.
- (p. 221)
Wilcox claims using bootstrapping with adjusted M-estimator gives good
results even with small sample sizes, extreme nonnormality, and extreme
heteroscedasticity. Given what we've found in our studies, where we
don't have heteroscedasticity, I find his claim to be a bit hard to believe,
and so I
wish he would have included some numerical results to back up his claim.
- (p. 222, bottom of page)
I like the strategy of picking the estimator having the smallest estimated
standard error. It could be a lot of work if bootstrapping is needed to get
some of the standard error estimates, but if the data analysis is important,
one might want to go to all of the trouble.
- (p. 223)
I'd be interested in knowing what regression depth is. The scheme is to find
the line having the "highest regression depth" (seems odd to refer to a high depth --- I'd have used the word greatest), but the book doesn't indicate what regression depth is.
- (p. 224)
Wilcox seems to like the LTS estimator with a breakdown of 0.2 or 0.25. Too bad he
doesn't give any solid comparisons with other methods. For instance, how does
it compare with least squares and Huber M-regression in the iid normal case (and
how does it compare with other methods in situations
with iid error terms having contaminated normal distributions)?
- (p. 224)
I don't agree that a breakdown point of 0.13 is "dangerously low" since in
order for 13% of the data to cause big trouble, those 13% have to be working
together in a sense (as opposed to being 13% contamination scattered about
in different directions), and if the 13% of the bad points were working
together to result in a bad estimate, one could hope to spot the trouble
using graphical methods. (One idea for a graphical method would be to color
points having large residuals and then look at the p-dimensional
predictor space using a rotating cube (for p = 3) or parallel
coordinates (for p > 3).)
- (p. 224)
ERROR IN BOOK:
Wilcox refers to Section 10.9, but there are no numbered sections.
- (p. 224)
A problem with using the MVE described in Ch. 10 to identify the slope is
that if the x values are tightly clustered in the middle and sparse
at the ends, the MVE could indicate a very misleading result (and it would
be better to use information provided by the points having extreme x values).
- (pp. 224-226)
Wilcox seems to like the rather exotic methods, like Theil-Sen, the adjusted
M-estimator, and LTS with a breakdown of 0.2 or 0.25. My guess is that only a
relatively small number (maybe even a small number in an absolute sense) of
people actually use
these methods in practice. I'll guess that more people (but still a
small proportion of statisticians and users of statistical methods) use
the more well-known alternative methods like Huber or bisquare M-regression.
(Unfortunately, L1 regression seems to be the alternative
method many consider when they don't use least squares.)
Wilcox favors methods that may do well in the worst of situations
(combining heavy tails with heteroscedasticity), but I wonder if
using one of the
exotic methods favored by Wilcox would be as good as
using one of the more familiar M-regression methods when we can
determine that although we have a situation where least squares shouldn't
be trusted, we don't have an ultraextreme situation.
(I'll guess that in lots of iid error term cases, more common
M-regression methods will outperform those that Wilcox favors.)
- Below I'll give some results pertaining to some of the more commonly
used (I think) robust regression methods (that can be done using standard
S-PLUS functions). In a 1999 paper that I presented at a meeting in Chicago,
I first showed (using Monte Carlo results) that conclusions based on
asymptotic results (not requiring Monte Carlo work) seem to apply for the
most part when the sample size is only as large as 50. That being the case,
I developed a table of asymptotic relative efficiency (ARE) values with
which to compare 4 different regression methods (really 5, since I found
that using the Andrews weight function was nearly identical to using the
bisquare weight function (both in asymptotic results and Monte Carlo results
for smallish sample size situations)). The table below gives ARE values of
least squares (OLS), Huber M (Huber), and L1 (LAD) estimators with respect to
the bisquare M-estimator. So ARE values in the table greater than 1 indicate
that another estimator beats the bisquare estimator --- and it can be noted
that there are not a lot of ARE values greater than 1, and only two ARE
values exceed 1.05. Thirty-one different error term distributions are considered.
T15 denotes a T distribution with 15 df, cn(.05) denotes
a contaminated normal distribution with 0.05 being the probability
associated with the larger variance normal distribution. More than one row
is labeled cn(.05) since different scale factors can be used in a 5%
contaminated normal distribution. The scale factors used range from 2 to 10;
the details can be found in my 1999 paper. The column labeled
twi gives the value of a tailweight index.
It can be seen that L1 regression is typically a rather
poor choice (except for 2 rather extreme, and usually unrealistic, error
term distributions). Also, it should be noted that while bisquare M-regression
can be quite a bit better than least squares, it never does a whole lot
worse than least squares.
(Table of ARE values, with columns for dist'n, twi, OLS, Huber, and LAD, omitted here.)
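For readers who want to experiment with these distributions themselves, here's a small Python sketch (my own illustration, not code from the paper) of drawing from a cn(.05) distribution with scale factor k, where with probability 0.05 an observation comes from the normal component with k times the standard deviation:

```python
# Draw n observations from a contaminated normal cn(eps): with probability
# eps the observation comes from N(0, (k*sd)^2), otherwise from N(0, sd^2).
import random

def rcontam_normal(n, eps=0.05, k=3.0, rng=random):
    return [rng.gauss(0, k if rng.random() < eps else 1.0) for _ in range(n)]
```

The variance of this mixture is (1 - eps) + eps*k**2 (so 1.4 for eps = 0.05 and k = 3), which is a quick sanity check on any simulation using it.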
- (p. 227, top half of page)
Wilcox's strategy seems okay, but as he points out, using smoothers
for diagnostic purposes can be tricky in multiple regression settings.
To address the "criticism" he refers to, one could also employ the
strategy described on the bottom part of p. 222.
- (p. 227, bottom half of page)
I agree that it is often not good to use least squares and assume all is well,
but I think it should be mentioned that least squares is okay to use in many
instances. (If this was not the case, then in its present form STAT 656
should be eliminated.)
- (p. 229)
I like that this chapter will address permutation tests and rank-based tests,
since in the end one wants the best procedure to use in a given setting, not
just the best procedure that is considered to be a robust procedure (although
the book has considered classical normal theory procedures throughout, and compared them with robust procedures).
- (p. 231 & p. 234) How to view
the W-M-W test is dealt with near the bottom of page 231.
I think it's best
to think of the test as being one of the null hypothesis that all of the
random variables are iid from the same distribution against the general
alternative, but that it is sensitive to the value of p (introduced
on p. 230), and if certain assumptions are made, it can be viewed as a test
about means or medians (with a more rigid assumption needed to view it as a test
about medians). It's unfortunate that it is presented as a test about medians
in some places, but it's nice that on p. 234 Wilcox indicates that the W-M-W
test "is unsatisfactory when trying to make inferences about medians" (although
it would be better to state that it can be unsatisfactory unless one feels
that it is reasonable to make certain assumptions (e.g., if we have a shift
model situation, or a scale model for nonnegative random variables, then the
test can be used to address medians)).
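As a reminder of what the W-M-W statistic actually estimates, here's a little Python sketch (my own, not from the book) computing the sample analogue of p = P(X < Y) directly from two samples, counting ties as 1/2 (a common convention):

```python
# Sample estimate of p = P(X < Y): the proportion of (x, y) pairs with
# x < y, counting tied pairs as 1/2. This is the W-M-W statistic rescaled
# to lie between 0 and 1.
def phat(xs, ys):
    wins = sum(1.0 if x < y else 0.5 if x == y else 0.0
               for x in xs for y in ys)
    return wins / (len(xs) * len(ys))
```

Under the null hypothesis that all observations are iid from one distribution, phat should be near 0.5; values far from 0.5 drive rejection, which is why the test is sensitive to p rather than directly to the medians.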
- (p. 233)
In the first part of the paragraph that starts in the middle of the page,
Wilcox makes it seem that the reader should be prepared to go forth armed
with some really good methods, but really, he has given us scant information
with which to make informed decisions. For the rank-based methods designed
to work well if there is heteroscedasticity, he cites a book by N. Cliff.
I don't have this book yet, but I am aware that such rank-based methods were
introduced into the more mainstream statistical literature many years ago.
A GMU M.S. in Stat. Sci. graduate, Kelly Buchanan (now Kelly Thomas), did a
thesis for me in 1993, and it includes a very good literature review of what
is known as the generalized Behrens-Fisher problem (two sample tests about
means when there is nonnormality and heteroscedasticity), which includes many
references about modifications of standard nonparametric procedures. To
summarize, in the early 1960s there were papers indicating that the ordinary
W-M-W test was not an accurate test for the generalized Behrens-Fisher
problem, and including suggested modifications of the W-M-W test. Among such
papers were those of P. K. Sen (1962) and R. F. Potthoff (1963). Monte Carlo
studies done for the thesis indicate that a 1979 modification of Sen's
procedure developed by K. Y. Fung represented an improvement of earlier
efforts, but it should be noted that K. K. Yuen's (note: I think K. K. Yuen
became K. Y. Fung when she got married) 1974 trimmed mean modification of
Welch's test was found to be generally better than the rank-based Fung test.
It should also be noted that most of the studies pertaining to the
generalized Behrens-Fisher problem employed symmetric location-scale
families of distributions, and that much less is known about tests about
means in the presence of skewness and heteroscedasticity.
(Note: It's too bad that Wilcox doesn't seem to be as well acquainted with
mainstream statistics literature as he is with the literature from the social sciences dealing with statistical methods.)
- (p. 233)
For the data from the experiment to study the possible effect of ozone on
weight gain in rats, I think the ordinary W-M-W test would be a better starting
point than a version of the test designed to adjust for heteroscedasticity.
It seems to me that it would first be proper to test the null hypothesis that
ozone has no effect against the general alternative that weight gain is somehow
affected by the amount of ozone. If the null hypothesis of no effect is
rejected, then one could go about trying to characterize how the distributions
differ, examining means, medians, quantiles, variances, and other distribution
measures. To just be concerned about the value of p (which is the focus
of the rank-based test that adjusts for heteroscedasticity) seems a bit silly.
Suppose that p equals 0.5, but that the distributions have very different
dispersions; I think it'd be nice to make note of this, since it would mean
that one environment tends to produce more uniform weight gain, while in the
other environment there is a tendency to observe more extreme (both small and
large) weight gains.
- (p. 235)
In the middle portion of the page, Wilcox describes the Monte Carlo version
of the test. The "official" version of the test doesn't use random selections,
but rather considers all possible ways of dividing the n1 + n2 observations
into groups of sizes n1 and n2. StatXact does the exact version, and will also do a Monte Carlo version if the sample sizes are too large for the exact version.
Also, it should be mentioned that it's easy to get a p-value (and a confidence interval for the estimated p-value in the Monte Carlo case), and one doesn't
have to do a size 0.05 test the way Wilcox describes.
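To make this concrete, here's a rough Python sketch (my own illustration, not Wilcox's code) of the Monte Carlo version, returning an estimated p-value for the observed difference in sample means; the add-one adjustment in the last line is one common convention for keeping the estimate a valid p-value:

```python
# Monte Carlo two-sample permutation test: repeatedly shuffle the pooled
# data, split it into groups of the original sizes, and see how often the
# absolute difference in means is at least as large as the one observed.
import random

def perm_test(xs, ys, reps=2000, rng=random):
    pooled = list(xs) + list(ys)
    n1, n2 = len(xs), len(ys)
    observed = abs(sum(xs) / n1 - sum(ys) / n2)
    hits = 0
    for _ in range(reps):
        rng.shuffle(pooled)
        d = abs(sum(pooled[:n1]) / n1 - sum(pooled[n1:]) / n2)
        if d >= observed:
            hits += 1
    # add-one adjustment: counts the observed arrangement itself
    return (hits + 1) / (reps + 1)
```

The exact version would enumerate all C(n1 + n2, n1) splits instead of sampling them, which is what StatXact does when the sample sizes permit.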
- (p. 236, 1st full paragraph)
Wilcox correctly implies that the two sample permutation test is really an
exact test for the general two sample problem (of the null hypothesis of
identical distributions against the general alternative), and is not really
a test about the means. (But if one is willing to make some additional assumptions (e.g., assume that either the distributions are identical, or that one is
stochastically larger than the other if they differ), it can be considered
to be a test about the means.) If unequal
variances can cause the probability of rejecting to exceed the nominal level
even if the means are the same, it is still a valid test of the general
two sample problem (since if the variances differ, the distributions are
not the same, and so the correct decision is to reject the null hypothesis).
- (p. 238, 10 lines from bottom, and p. 239 (a bit below halfway down))
I don't like it that Wilcox uses the phrase "bootstrap techniques again appear to
have practical value" (my italics). (He uses similar phrases in other
places in the book.) I want some indication of how accurate a method is, and
more precise information about sample size recommendations.
I can't help but be at least a bit suspicious when he indicates his
endorsement applies to small sample size settings (since in simpler settings
it is known that somewhat largish sample sizes are needed to ensure accuracy
with bootstrap procedures). On p. 239, Wilcox has "certain types of bootstrap
methods appear to be best when sample sizes are small." Again, I am a bit
suspicious. Perhaps the bootstrap methods are best, but are they good enough?
- (p. 239, 2nd to last sentence in paragraph that begins the page)
Wilcox has "very small departures from normality in the fourth group can make
it highly unlikely that the differences among the first three groups will be
detected." It depends on what test procedure is used. Wilcox is correct to
have "can" because one large variance can adversely affect the ability to
identify any differences if a pooled estimate of scale is employed by the test
procedure. But if the test procedure employs pairwise comparison subtests,
then a large variance for the fourth group won't affect the ability to detect
differences among the first three groups.
- (p. 242)
ERROR IN BOOK:
In the indt description, I'll guess that it should be in
press a. instead of just "in press."
- (p. 243, 2nd to last bullet)
Wilcox has "practical methods have been derived and easy-to-use software is
available." Since I have found mistakes in expensive software (StatXact), I've
grown to be suspicious of software (and try to test it before I trust it ---
which can be a lot of work), and I'm even more suspicious of "freeware"
(Dallas and I have identified mistakes in some of Wilcox's S-PLUS functions,
and our Friday group has found fault with an S-PLUS function obtained from
Venables and Ripley). I've also grown to be suspicious of authors of papers.
Even if they present honest Monte Carlo results to support their claims, I
wonder what results they haven't put in the paper (e.g., maybe their method
only works well in some settings, and they only reported on cases in which
the method performs well, leaving it to others to identify situations in
which their method can be quite lousy).