Comments about Statistical Inference, 2nd Ed. by G. Casella and R. Berger



Preface material

  1. (p. vii) The authors note that there is enough material in the book for two semesters. If GMU had a two-semester sequence based on this book, the first semester would be similar to STAT 544, which is a prerequisite for this course (STAT 652). Because students coming into STAT 652 are expected to know the material in a course like STAT 544, in STAT 652 we skip a lot of the probability material in the first portion of the book. However, C&B has some important material in its probability chapters that is not covered in STAT 544, and so for the first several weeks of STAT 652 I'll be covering some of that probability material. There is some other probability material in C&B that I just won't have time to cover, even though I doubt it was covered in STAT 544 (e.g., some of the results in Sec. 3.6). Since this material may be useful to you as you work the homework problems this semester, I strongly encourage you to spend some time looking through the first portion of the book so that, if nothing else, you'll be familiar with what kind of material is included, even if you don't take time to master it right away. I may briefly point out some of the more useful things in my notes and lectures.

Chapter 2

  1. (p. 56) C&B uses "E X" to denote the expected value of X. Many books use E(X) instead (which is what I'm used to). (Logically, there's nothing wrong with the way C&B does it, since E isn't a function.)
  2. (p. 63) The term kernel (as in kernel of a function) is introduced on this page. (You may not have encountered this in probability class.)
  3. (p. 67) Lemma 2.3.14 is a generalization of a result that you might remember from calculus classes. The simple form of the result is that if a is a constant, then the limit of (1 + a/n)^n, as n tends to infinity, is e^a. The generalization says that the limit is the same if the constant a is replaced by a sequence of constants that converges to a. (A quick numerical check is given below.)
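
     A quick numerical check of the limit result mentioned in the comment on p. 67 (this is just an illustration I cooked up, not something from the book): the following Python snippet compares (1 + a_n/n)^n to e^a for the arbitrary choice a = 2 and the sequence a_n = a + 1/n, which converges to a.

         import math

         a = 2.0                            # a fixed constant
         for n in [10, 100, 1000, 10000, 100000]:
             a_n = a + 1.0 / n              # a sequence of constants converging to a
             print(n, (1.0 + a_n / n) ** n, math.exp(a))

     As n grows, the computed value of (1 + a_n/n)^n gets close to e^a, just as Lemma 2.3.14 says it should.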

Chapter 3

  1. (p. 102) C&B uses some notation and terminology that I don't like, but often I'll try to change my ways to match the book in order to make it easier on you. However, I'm so used to using N(μ, σ²) to denote a normal distribution that I don't think I can get used to using n(μ, σ²). (The line above (3.3.13) indicates their notation is the usual notation, but I don't think that's true. I think more books use an upper-case N than a lower-case one.) I'll also point out that when they refer to "the normal distribution" they mean the family of normal distributions. While I think it's fine to refer to the standard normal distribution because it's a specific distribution, I don't like to say or write the normal distribution because there are infinitely many normal distributions ... there's a family of them. I usually like to write either a normal distribution or the family of normal distributions, but even with me the normal distribution sometimes slips out.
  2. (p. 111) The first sentence of Sec. 3.4 indicates that a "family of pdfs or pmfs" is an exponential family if they are of the given form. But we can also refer to a family of distributions as being an exponential family. E.g., the family of all two-parameter normal distributions is an exponential family. (The algebra verifying this for the normal family is written out after this list.)
  3. (p. 113) On the last line the indicator function isn't needed since it will always equal 1. If the support of the distribution isn't the set of all real numbers then it makes sense to use an indicator function, but here it's really not needed.
  4. (p. 137) The first line of subsection 3.8.3 uses the phrase "the exponential family" (which I don't particularly like). Usually we say or write that a family of distributions (or pmfs or pdfs) is an exponential family. E.g., the family of Poisson distributions is an exponential family, the family of all geometric distributions is an exponential family, etc. But on this page they use "the exponential family" to mean the set of all exponential family distributions. (When they indicate "the lognormal distribution" they mean the family of all lognormal distributions. So in the first paragraph of subsection 3.8.3, they mean that the family of lognormal distributions (which is an exponential family) is a member of the set of all such exponential families.)
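
     To back up the claim in the comment on p. 111 that the two-parameter normal family is an exponential family, here is the algebra (my own write-up, in LaTeX, using the h, c, w_i, t_i notation from the start of Sec. 3.4):

         f(x \mid \mu, \sigma^2)
           = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left( -\frac{(x-\mu)^2}{2\sigma^2} \right)
           = h(x)\, c(\mu,\sigma)\, \exp\!\big( w_1(\mu,\sigma)\, t_1(x) + w_2(\mu,\sigma)\, t_2(x) \big),

     with

         h(x) = 1, \quad
         c(\mu,\sigma) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\mu^2/(2\sigma^2)}, \quad
         w_1 = \frac{\mu}{\sigma^2}, \; t_1(x) = x, \quad
         w_2 = -\frac{1}{2\sigma^2}, \; t_2(x) = x^2,

     so the family has the required form with k = 2.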

Chapter 5

  1. (p. 207) Some authors of probability books state that the word mutually is not needed before independence when referring to three or more random variables.
  2. (p. 209) The notation on the bottom 2/3 of the page concerning sampling from a finite population is bad! If X_1 is the random variable corresponding to the first random selection, and x_1 its observed value, then {x_1, ..., x_N} should not be used for the set of population values. Instead, we could use {y_1, ..., y_N} as the set of population values. Then, if the first random selection was the 7th member of the population, y_7, we would have that the observed value of X_1 is x_1 = y_7. Also, the sentence on the 6th to the last line of the page doesn't seem quite right. If two or more population values are the same, then even if we sample without replacement the same value could be chosen more than once.
  3. (p. 219) In the 2nd line of the proof of Lemma 5.3.2, the example referred to should be 2.1.9 instead of 2.1.7.
  4. (p. 219, p. 223, & p. 225) Lemma 5.3.2 indicates that χ_p² is used to denote a chi-squared random variable, but in the statement of the lemma it is used in the usual way to denote a chi-squared distribution. On p. 223 (5th line), it's used in a somewhat informal way (that I'll discuss in class) when put under the square root (with p = n - 1). On p. 225 (in Example 5.3.7) it's used to denote a random variable. Similarly, on p. 225, F_{p,q} is used to denote a random variable in Example 5.3.7, and it's used to denote a distribution in Theorem 5.3.8. I'm not thrilled about the authors' informality here, but I guess the intelligent reader can determine what is meant from the context.
  5. (p. 220) I don't like the 4th line of this page ... the "Defining" part. The fact given in the 5th line can easily be shown to be true, and one doesn't need to have it follow from (5.3.1) (on the 3rd line).
  6. (p. 224) In the 2nd to the last sentence on the page, the authors note that a variance ratio may have an F distribution even if the parent populations are not normal, but I'll point out that with nonnormality the sampling distribution of the variance ratio can be quite different from an F distribution. It depends on the nature of the nonnormality. (A small simulation illustrating this is given after this list.)
  7. (p. 236) The 6th line indicates that "the random variable" converges in distribution, but really it's the sequence of random variables that converges.
  8. (p. 227) The notation given in Definition 5.4.2 is not widely used. (You shouldn't assume that others are familiar with this notation.)
  9. (p. 243) In the 5th line of the proof to Theorem 5.5.24, the = sign isn't appropriate (since the left side isn't equal to the right side). An approximately equals sign would be okay.
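
     Regarding the comment on p. 224 about the variance ratio under nonnormality, here is a small simulation sketch (my own illustration; the choice of a t distribution with 3 degrees of freedom as the heavy-tailed parent is arbitrary, and the code assumes numpy and scipy are installed). It estimates how often the ratio of two sample variances exceeds the nominal 5% F critical value when sampling from a normal parent and from a heavy-tailed parent.

         import numpy as np
         from scipy import stats

         rng = np.random.default_rng(0)
         n1, n2, reps = 10, 10, 100000
         crit = stats.f.ppf(0.95, n1 - 1, n2 - 1)   # nominal 5% critical value

         for name, draw in [("normal", rng.standard_normal),
                            ("t with 3 df", lambda size: rng.standard_t(3, size=size))]:
             x = draw((reps, n1))                   # reps samples of size n1
             y = draw((reps, n2))                   # reps samples of size n2
             ratio = x.var(axis=1, ddof=1) / y.var(axis=1, ddof=1)
             print(name, "estimated P(ratio > crit):", (ratio > crit).mean())

     Under the normal parent the estimated probability should be close to 0.05; with the heavy-tailed parent it generally isn't, which is the point of the comment above.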

Chapter 6

  1. (p. 281) In Theorem 6.2.13, θ can be a vector. I've also noticed other places where authors fail to use vector notation (bold face) when the parameter can be multi-dimensional. Some books don't use vector notation for multi-dimensional parameters, and use θ whether the parameter is one-dimensional (a single parameter) or of higher dimension (more than one parameter). But the authors use the bold face θ notation in other places (e.g., Sec. 3.4 and p. 241), and I find it odd that they aren't consistent in this regard.
  2. (p. 282) Two lines below Definition 6.2.16, an ancillary statistic is referred to as "an observation" which I find odd since elsewhere the authors seem to stress that a statistic is a random variable. In fact in the definition of ancillary statistic (Def. 6.2.16) it is made clear that it is a random variable since its distribution is referred to. So I'll chalk up the "observation" characterization as sloppy informality.
  3. (p. 288) On the 3rd line of the paragraph following Theorem 6.2.25, C&B notes that the parameter space does not contain an open set. Yet in the theorem, the condition about the open set pertains to the range of the w_i and not the parameter space. Something else odd is that really there is just one unknown parameter (θ), and so really the parameter space is just the set of all real numbers (thus one-dimensional) and not a parabola. I address this case on p. 6.2.17 of the course notes in a way that matches the wording of the theorem. (A similar theorem can be stated in terms of an open set in the parameter space ... some books do it this way. But this little example considered in C&B does not correspond well to the statement of their theorem. I find it all rather odd; see also the bit of algebra after this list.)
  4. (p. 292) The first paragraph of subsection 6.3.2 indicates that the subsection will pertain only to the discrete case, yet Example 6.3.4 deals with a continuous distribution. It's a nice example in that one piece of the evidence comes from the observed data and the other piece comes from what is known about the experiment. Also, one piece is an estimate of the unknown parameter and the other piece pertains to the precision that should be associated with the estimate. For a similar discrete distribution example, one could have an experiment consisting of observing n iid Bernoulli random variables, where n is known but the parameter is unknown. The evidence could consist of two parts, (i) the observed sample proportion of successes (which is an estimate of the unknown parameter) and (ii) n (which along with the observed sample proportion can be used to provide an estimate of the variance of the estimator).
  5. (pp. 292-300) Some of the material on these pages is a bit hard to thoroughly understand (based on the information given). Other parts of the material just aren't important as far as the focus of STAT 652 is concerned. I'm going to skip a lot of what's on these pages for now, other than summarizing some things about the likelihood principle. (Given our limited class time, I want to spend our time together going over things that I think will do you the most good, especially with regard to preparing you to do well on the homework and the final exam.)
  6. (p. 296) On the 12th and 13th lines, by "the proof of the Formal Likelihood Principle" I think the proof of Birnbaum's Theorem on p. 294 is meant. (That is, Kalbfleisch (and others) aren't convinced that the Formal Likelihood Principle follows from the Formal Sufficiency Principle and the Conditionality Principle.)
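
     Regarding the comment on p. 288: if I'm remembering the example correctly, it is (something like) the N(θ, θ²) family, and in any case that family illustrates the issue. Writing its pdf in the exponential family form of Sec. 3.4, the natural parameter functions are

         w_1(\theta) = \frac{\mu}{\sigma^2} = \frac{1}{\theta},
         \qquad
         w_2(\theta) = -\frac{1}{2\sigma^2} = -\frac{1}{2\theta^2},
         \qquad\text{so that}\qquad
         w_2 = -\tfrac{1}{2}\, w_1^2 .

     As θ ranges over the (one-dimensional) parameter space, the point (w_1(θ), w_2(θ)) traces out a parabola in the plane, and a parabola contains no two-dimensional open set. That is the sense in which the open set condition of Theorem 6.2.25 fails, and it's a statement about the range of the w_i, not about the parameter space itself.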

Chapter 7

  1. (p. 325) In the Bayes estimator for p (about 40% of the way down the page), they incorrectly have y instead of Y. (With the lowercase y, it's in the form of an estimate instead of an estimator. See the note after this list.)
  2. (p. 327) In the first line, I think it would be better to use independent observations (or independent random variables) instead of "independent samples."
  3. (p. 340) In the last sentence of the paragraph on the top portion of the page, I think "range of the pdf" should be replaced by either range of the random variable, or support of the distribution. (The pdf is a function. The set of possible inputs for a function is its domain, and the set of possible values it can assume is its range. For example, for a uniform (0, 2] r.v., the range of the pdf is {0, 1/2} (not an interval, just a set consisting of the two values 0 and 1/2), since the pdf is 1/2 on (0, 2] and it's 0 otherwise. The domain of the pdf is the entire real number line (as with any other univariate distribution), since the pdf is defined for all values even though it is only nonzero on (0, 2]. Now the random variable is also a function (defined on some sample space), and its range is the set of values for which the pdf is positive (which is called the support of the distribution).)
  4. (p. 342) In the first sentence of subsection 7.3.3, I think "unbiased estimates" should be replaced with unbiased estimators. Although many people casually use the term "unbiased estimate" when referring to an estimate obtained from an unbiased estimator, an estimate is an observed value of an estimator, and is not a r.v., so it doesn't have an expectation, and it's not sensible to indicate that it's unbiased.
  5. (p. 347) Theorem 7.3.23 is similar to what some books refer to as the Lehmann-Scheffé theorem. C&B give a Lehmann-Scheffé theorem on p. 369. (Altogether, Lehmann and Scheffé developed many results pertaining to UMVUEs.)
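
     Regarding the comment on p. 325: to make the estimator-versus-estimate distinction concrete, and assuming I'm recalling the setup correctly (Y ~ binomial(n, p) with a beta(α, β) prior, and the posterior mean used as the Bayes rule), the Bayes estimator should be written in terms of the random variable Y,

         \hat{p}_B = \frac{Y + \alpha}{\alpha + \beta + n},

     whereas replacing Y by an observed value y gives the corresponding estimate (y + α)/(α + β + n), which is a number rather than a random variable.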

Chapter 8

  1. (p. 373, right before Def'n 8.1.2) Many would disagree with the statement that the goal of a hypothesis test is to decide which of two complementary hypotheses is true. More accurately, I think the goal is to determine if the data provide strong evidence to reject the null hypothesis and support the alternative hypothesis.
  2. (p. 373, last 3 sentences) They indicate that θ_0 is the maximum acceptable proportion of defects, and then indicate that θ ≥ θ_0 corresponds to the proportion of defects being unacceptably high, which is a bit of a contradiction (it would be okay if it were > instead of ≥).
  3. (p. 374) Many would disagree with the first sentence, arguing that failing to reject the null hypothesis is not the same as accepting it as true. Also, many would find problems with the middle paragraph on the page.
  4. (p. 374) They introduce the term acceptance region. The authors excepted, I've never known a good statistician to use this term in this context. (Note: I don't mean to indicate that the authors aren't good statisticians, but I think it's safe to say that among good statisticians they are in the minority with regard to some of their beliefs and conventions concerning hypothesis testing.)
  5. (p. 386, first full sentence on page) I don't think the alternative hypothesis has to be the one she expects the data to support, and hopes to prove. (Typically, we don't "prove" anything with a hypothesis test.) The alternative, or research, hypothesis is the one you want to see whether there is strong evidence for. (This doesn't mean you expect it to be supported ... you're just checking to see if it is.)
  6. (p. 386, last line) The "cutoff points" are also known as (and perhaps more commonly known as) critical values.
  7. (p. 397, last sentence of first paragraph of subsection 8.3.4 (also Def'n 8.3.26)) In my (roughly) 52.5 years, over half of which have been spent heavily dealing with statistics, I've never seen a p-value referred to as a test statistic. One can certainly do something like they do, but it's not the common viewpoint about p-values. It puzzles me why the authors decided to go strongly against convention in this chapter.

Chapter 10

  1. (p. 467 (pertaining to 3rd paragraph)) I've never heard or seen anyone use the term "asymptopia" before.
  2. (p. 516) For (A3), "densities" should be "pdfs or pmfs".