comments about Ch. 5 of H & L

Comments about Ch. 5 of Applied Logistic Regression, 2nd Ed.

(p. 146) Although the square of the denominator in (5.1) looks different from the usual denominator in Pearson's chi-square goodness-of-fit statistic, (5.2) is indeed equal to the usual Pearson statistic for the corresponding 2 by J table. The key is that the sum of the two terms for a column of the table yields just one term in the (5.2) representation. (I can show you this if you want, but it's pretty easy if you want to try it.) Seeing where the chi-square distribution comes from is easy if you write the statistic as in (5.2) --- you have the sum of the squares of J asymptotically normal random variables. While J - (p+1) df results from following the usual "rules", to me it isn't easy to see why those are the proper df.
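The two-cells-into-one-term claim can be checked numerically. The sketch below assumes (5.1) has the form (y_j - m_j*pi_j) / sqrt(m_j*pi_j*(1 - pi_j)); the counts and fitted probabilities are made up for illustration. The algebra behind it is that the failure cell's deviation (m - y) - m(1 - pi) equals -(y - m*pi), and 1/(m*pi) + 1/(m*(1 - pi)) = 1/(m*pi*(1 - pi)).

```python
def pearson_two_cell(y, m, pi):
    """Usual Pearson contribution of one column of the 2 by J table:
    (observed - expected)^2 / expected, summed over the success and failure cells."""
    exp_success = m * pi
    exp_failure = m * (1 - pi)
    return ((y - exp_success) ** 2 / exp_success
            + ((m - y) - exp_failure) ** 2 / exp_failure)


def pearson_single_term(y, m, pi):
    """Single squared-residual term: (y - m*pi)^2 / (m*pi*(1 - pi))."""
    return (y - m * pi) ** 2 / (m * pi * (1 - pi))


# Hypothetical (y_j, m_j, pi_j) triples, one per covariate pattern.
patterns = [(3, 10, 0.25), (7, 12, 0.6), (1, 5, 0.1)]

two_cell_total = sum(pearson_two_cell(*p) for p in patterns)
single_term_total = sum(pearson_single_term(*p) for p in patterns)

# The two totals agree up to floating-point rounding.
print(two_cell_total, single_term_total)
```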
(p. 146) I think the deviance residual isn't quite as straightforward as the Pearson residual. It seems like the deviance (the sum of the squared deviance residuals) can be described as twice a difference in log-likelihoods: the log-likelihood of the saturated model minus the log-likelihood of the fitted model under consideration. Recall that at our first meeting there was some confusion about the term saturated model --- confusion that I blame on the authors of the book. P. 13 indicates that the likelihood of the saturated model is 1, which can only be if each factor in the likelihood product is 1, which can only be if each success has an associated probability of 1 and each failure has an associated probability of 1. Since H&L were referring to the setting in Table 1.1, where there are outcomes of both 0 and 1 for cases having the same covariate pattern, one can only conclude that the saturated model is some sort of an ideal model incorporating variables not in the data set, and that these phantom variables can be used to fully account for outcomes of both 0 and 1 for cases having the same covariate pattern of the observed covariates (and not the phantom ones). However, in other places in the book, H&L use the term saturated model the way Dr. Bolstein did at the first meeting. If there are J < n distinct covariate patterns, and outcomes of both 0 and 1 occur for some cases having the same covariate pattern, then one creates a fit for a saturated model by fitting each distinct covariate pattern using a sample proportion, and one does not have a likelihood of 1. This matches material on p. 166, where the log-likelihood of the saturated model is 0 (so likelihood is 1) only in the case of J = n (although it seems to me that the log-likelihood could be 0 even if J < n).
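A small sketch of the second meaning of saturated model (each covariate pattern fitted by its sample proportion) may help. The counts are hypothetical, and the likelihood used is the product of Bernoulli terms over individual cases, with no binomial coefficients, which is an assumption about the form intended in the book. It shows a log-likelihood below 0 when both outcomes occur within a pattern, and a log-likelihood of 0 with J < n when every pattern is all 0s or all 1s.

```python
import math


def saturated_loglik(patterns):
    """patterns: list of (successes y_j, cases m_j), one per distinct covariate pattern.
    Each pattern is fitted with its sample proportion y_j / m_j."""
    total = 0.0
    for y, m in patterns:
        p = y / m
        # A cell with a zero count contributes nothing (0 * log 0 taken as 0).
        if y > 0:
            total += y * math.log(p)
        if m - y > 0:
            total += (m - y) * math.log(1 - p)
    return total


mixed = [(3, 10), (7, 12)]            # J = 2 < n = 22, both outcomes within a pattern
pure = [(0, 4), (6, 6)]               # J = 2 < n = 10, each pattern all 0s or all 1s
ungrouped = [(1, 1), (0, 1), (1, 1)]  # J = n = 3

print(saturated_loglik(mixed))      # negative, so the likelihood is below 1
print(saturated_loglik(pure))       # 0.0 even though J < n
print(saturated_loglik(ungrouped))  # 0.0
```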
(p. 146, first sentence after (5.4)) (5.4) seems like (1.8) on p. 13 rather than (1.10) on p. 14.
(p. 148) I find it odd that H&L put a hat on their test statistic --- usually the hat notation is used to denote an estimate.
(p. 150, first paragraph and Table 5.1) It's not clear where the expected value of 12.7 comes from. Maybe it's as stated in the first paragraph, but in that case it's not clear where the probability 0.234 in the 5th row of the table comes from. That is, I don't see the correspondence between the probabilities given in the table and the expected counts given in the table. I would have guessed the expected count for the 5th row would be 58*0.234, but that doesn't equal 12.7.
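For the record, the arithmetic in that comment works out as follows. The only inputs are the three numbers quoted above (58, 0.234, 12.7). The second line computes the average probability that would be needed to give 12.7; one possible reading, which is purely an assumption here and not something confirmed by these notes, is that the tabled probability marks a boundary of the group rather than its average.

```python
n_group = 58          # cases in the 5th row, as quoted above
tabled_prob = 0.234   # probability shown for the 5th row
expected_count = 12.7 # expected count shown for the 5th row

# Guess from the comment: group size times the tabled probability.
guessed_expected = n_group * tabled_prob
print(round(guessed_expected, 3))  # 13.572

# Average fitted probability that would reproduce the tabled expected count.
implied_mean_prob = expected_count / n_group
print(round(implied_mean_prob, 3))  # 0.219
```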
(p. 152) The description of the Tsiatis test at the top of the page doesn't seem very clear --- but I'm going to choose not to worry about the details for now.
(p. 153) To understand why the procedure described is sensible, I suppose one has to look elsewhere --- the book seems to give a description w/o sufficient motivation.
(p. 157) PMC can be made close to 0.5 by putting theta₁ equal to 0.5 and beta₁ = mu close to 0.
(p. 162) Is it proper to call the area under the ROC curve the ROC, as is indicated by the general rule given on this page?
(p. 162) I wonder how good the guidelines suggested by the general rule are. I know in some classification settings one wouldn't be happy with less than 90% correct predictions ... but how does this relate to ROC? Recall that Jill brought this up before --- with something like CART, can we get an ROC curve, and if so how? (I think the simple way would be to change the priors --- I'm pretty sure this would work fine.) Why is the area under the ROC curve of great interest? In the end, don't we choose a specific classifier (e.g., logistic regression with a specific value for the probability cutoff), and once that is done, aren't we interested in its misclassification rate, or perhaps both its sensitivity and specificity? The area under the curve seems to relate to the performance of a collection of classifiers.
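To make the "collection of classifiers" point concrete, here is a minimal sketch with made-up fitted probabilities and outcomes. Each cutoff gives one classifier and one point on the ROC curve; the area under the curve summarizes all of them at once, and it equals the proportion of (outcome 1, outcome 0) pairs in which the case with outcome 1 has the higher fitted probability. The misclassification rate, by contrast, belongs to a single chosen cutoff. All names and data below are hypothetical.

```python
def roc_points(scores, labels):
    """(1 - specificity, sensitivity) for each cutoff, predicting 1 when score >= cutoff."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    points = [(0.0, 0.0)]
    for cutoff in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= cutoff and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= cutoff and y == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points


def trapezoid_area(points):
    """Area under a piecewise-linear curve given as (x, y) points sorted by x."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


def concordance(scores, labels):
    """Share of (1, 0) pairs where the outcome-1 case scores higher; ties count 1/2."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical fitted probabilities and observed outcomes.
scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.2]
labels = [1, 1, 0, 1, 0, 1, 0, 0]

auc = trapezoid_area(roc_points(scores, labels))
conc = concordance(scores, labels)
print(auc, conc)  # both 0.8125

# One specific classifier: cutoff 0.5.
cutoff = 0.5
errors = sum(1 for s, y in zip(scores, labels) if (s >= cutoff) != (y == 1))
print(errors / len(labels))  # 0.375
```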
(p. 164) The first paragraph of subsection 5.2.5 indicates that the R² measures aren't really measures of goodness-of-fit, even though they do relate to the degree of improvement over the constant model relative to the improvement of the perfect model over the constant model. The same is true in ordinary regression: one can have a high value of R² with a model that is clearly wrong (as indicated by a residual analysis), and one can have a low value when the model is correct (if the error term variance is relatively large). (An interesting thing about logistic regression is that the variation is related to the probabilities which are being fit --- whereas in ordinary regression, the variation need not be at all related to what is being fit.)
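The ordinary-regression half of that remark is easy to illustrate. In the sketch below (all data made up), a straight line fitted to an exactly quadratic relationship gives R² of about 0.93 even though the model is wrong, while a straight line fitted to data that really are linear, but with a large error variance, gives a small R². The simulated errors use a fixed seed; the particular seed and noise level are arbitrary choices.

```python
import random


def r_squared_of_line(x, y):
    """R^2 of the least-squares straight line fitted to (x, y)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    syy = sum((b - my) ** 2 for b in y)
    return sxy ** 2 / (sxx * syy)


# Wrong model, high R^2: the truth is y = x^2, but a line is fitted.
x1 = list(range(11))
y1 = [a ** 2 for a in x1]
r2_wrong_model = r_squared_of_line(x1, y1)

# Correct model, low R^2: the truth is a line, but the error sd (5) is large
# relative to the spread of the linear signal.
random.seed(0)
x2 = list(range(200))
y2 = [0.01 * a + random.gauss(0, 5) for a in x2]
r2_right_model = r_squared_of_line(x2, y2)

print(r2_wrong_model, r2_right_model)
```

A residual plot for the first fit would show the curvature immediately, which is the sense in which R² and goodness-of-fit are different questions.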
(p. 165) Recall that Ed Prokop indicated that he was having trouble matching some of the values given on this page.
(p. 173) Given that the approximation of the residual is (1 - h_j) y_j, why isn't the factor (1 - h_j) squared in the expression for its variance (which is the first displayed expression on p. 173)? (It can be noted that (5.14) and (5.15) are consistent with not squaring this factor, and so I hesitate to think that the square was accidentally omitted from the expression for the variance.) Also, since the probability is being estimated, it should be referred to as an estimated variance. Whoa! I initially failed to note that h_j is not a constant. From (5.13) on p. 169 it can be seen that it's a function of all of the responses (through pi hat), and so a random variable.
(p. 181) The first part of the paragraph at the bottom of the page, pertaining to pattern 31, is evidence that the diagnostic statistics can be somewhat misleading. The approximations aren't really good --- note that H&L claim that the approximations are only modestly correlated with the actual observed changes.
(p. 182) The issue of whether or not to delete the 5 cases is interesting. There are 628 cases in all, so if one fit a model to all but 5 of them, it could be thought to be good about 99% of the time. But then to use the model, should one check to see if a subject under consideration has a covariate pattern similar to one of the 5 that were ignored? Since the subject matter experts did not think the 5 covariate patterns were particularly unusual, the decision to not delete them was made. Would you have made the same decision?
(p. 187) It puzzles me greatly why the test described on the top portion of this page should be a two-tailed test! One could reject because the validation sample was in too much agreement with the model, which doesn't seem at all sensible to me. I think H&L have a(nother) mistake! In subsection 1.2 on p. 1146 of the pertinent JASA article, it states "Typically, large values of the test statistic ... indicate a lack of fit" --- so I think one should use an upper-tailed rejection region.
(p. 192) The RACE/SITE results are interesting. Taking the loose relative risk interpretation of odds ratio for the sake of simplicity, nonwhites are twice as likely as whites to remain drug free at Site A, but only half as likely at Site B. This serves to demonstrate the need for the interaction term for these two variables, but at the same time it may seem puzzling --- what is it about the different sites that is responsible for this?
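How an interaction term produces site-specific odds ratios can be sketched as below. The coefficients are hypothetical, chosen only to reproduce the "twice" and "half" figures above; they are not the estimates from the book. The coding (Site A as the reference level, RACE = 1 for nonwhite) is also an assumption.

```python
import math

# Hypothetical coefficients on the log-odds scale.
beta_race = math.log(2.0)           # nonwhite vs. white at Site A (reference site)
beta_race_by_site = math.log(0.25)  # additional effect for nonwhites at Site B

# Odds ratio for RACE within each site.
or_site_a = math.exp(beta_race)
or_site_b = math.exp(beta_race + beta_race_by_site)

print(or_site_a)  # about 2: nonwhites have twice the odds at Site A
print(or_site_b)  # about 0.5: nonwhites have half the odds at Site B
```

Without the interaction coefficient, a single RACE odds ratio would be forced on both sites.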
(p. 198) H&L again get screwed up with percent increases. (Were they absent the day this was covered in the 6th grade?) If 11 vs. 10 is a 10% increase, why isn't 2 vs. 1 a 100% increase (as opposed to their claim that it's a 50% increase)?
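The arithmetic, spelled out with a hypothetical helper function. Dividing by the new value instead of the old one would give 50% for 2 vs. 1; whether that is how the book's figure arose is only a guess.

```python
def percent_increase(old, new):
    """Percent increase from old to new, relative to the old value."""
    return 100.0 * (new - old) / old


print(percent_increase(10, 11))  # 10.0
print(percent_increase(1, 2))    # 100.0

# Dividing by the new value (not a percent increase) gives the 50% figure.
print(100.0 * (2 - 1) / 2)       # 50.0
```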