Some Comments about Chapter 9 of Hollander & Wolfe



In the first part of Ch. 9, we'll find connections with the material of Ch. 8 (in particular, Kendall's tau, and Mann's test for trend). For some simple things, we can make use of StatXact, but for other things (e.g., the multiple regression covered in the second part of Ch. 9) Minitab may be the most convenient software to use. (Since the Minitab developers have connections with the Penn State researchers who have contributed a lot to rank-based regression, Minitab includes some rank-based regression, while most other statistical software does not.)

Like H&W (in Sec. 9.7), I'll also give a brief, nondetailed description of other types of nonparametric regression. Tree-based regression (which can be done using CART or S-Plus) is a form of nonparametric regression, but there isn't enough time to cover it in STAT 657. (I covered the use of CART for nonparametric classification and regression in this past summer's STAT 789 course.) Prior to Sec. 9.7, the material in Ch. 9 is based on the assumption of a particular form for a linear regression model --- the nonparametric aspect is due to the fact that one doesn't have to assume a parametric model for the error term distribution. Tree-based regression, and the methods referred to in Sec. 9.7, don't assume that the form of the regression model is known --- they can be quite useful in helping one to uncover the nature of the relationships between various variables.


Section 9.1

This section deals with a simple situation. We can make use of Mann's test for trend (see Sec. 8.1, although H&W don't put much emphasis on Mann's test) to do a hypothesis test about the slope in a simple regression model.

I'll offer some specific comments about the text below.
p. 416, Data
Note that, for convenience, H&W assume that
x1 < x2 < ... < xn-1 < xn.
If the data are presented to you with the xi in some other order, one can simply relabel the points. But it is being assumed that none of the xi values are the same. (The situation of ties among the xi isn't treated in this section (although a reference to an article is given on p. 418).)
p. 416, Assumption A1
I refer to the ei as the error terms. They produce the variation of the yi about the regression line. The ei values can be due to measurement error and/or natural variation in what is being observed (i.e., for a given value of x, there can be more than one possible value of what is being observed, even if there is no measurement error). In some settings there can be practically no measurement error, and the variation is just due to population differences, but it's still common to refer to the ei as the error terms.
p. 416, Assumption A2
Note that the ei are taken to be iid random variables (H&W are using lower case here to refer to the random variables as well as their observed values) having a distribution with median 0 (as opposed to mean 0, as is often assumed in other descriptions of regression).
p. 416, Procedure
The Di of (9.3) are the residuals, viewed as random variables.
p. 417, Large-Sample Approximation
When the null hypothesis is true,
Di = ei,
and so
Dj - Di = ej - ei.
Whether or not the error term distribution is symmetric, the distribution of
ej - ei
is symmetric about 0 (since it's the difference of two iid random variables), and so
Dj - Di
is equally likely to assume a positive value or a negative value. So (making note of (9.5) on p. 416)
E0(Dj - Di) = 0,
and so (making note of (9.4) on p. 416)
E0(C) = 0.
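To make the procedure concrete, here is a minimal Python sketch of the statistic C of (9.4) and its large-sample standardization (assuming scipy is available; the function name slope_test is mine, not H&W's, and the null variance used is the Kendall-statistic variance n(n-1)(2n+5)/18, which is consistent with the numbers in Example 9.1).

import math
from scipy.stats import norm

def slope_test(x, y, beta0):
    # Sort the points by x (H&W assume x1 < x2 < ... < xn, all distinct)
    pts = sorted(zip(x, y))
    d = [yi - beta0 * xi for xi, yi in pts]   # the D_i of (9.3)
    n = len(d)
    # C of (9.4): the sum of sign(D_j - D_i) over all pairs with i < j
    C = sum((d[j] > d[i]) - (d[j] < d[i])
            for i in range(n) for j in range(i + 1, n))
    # Under H0 the D_i are iid (see above), so E0(C) = 0; the null
    # variance is the Kendall-statistic variance n(n-1)(2n+5)/18
    z = C / math.sqrt(n * (n - 1) * (2 * n + 5) / 18)
    return C, z, norm.cdf(z)   # lower-tail p-value, for H1: slope < beta0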
p. 418, Example 9.1
The description of the double ratio may be hard to understand at first. First of all, it should be realized that, even with no seeding, the rainfall in the target area can differ from the rainfall in the control area. That is, the T/Q ratio can differ from 1 with no seeding done over the target area. So, to study the effect of cloud seeding, one could compare the value of the T/Q ratio for the seeded periods (periods when the clouds over the target area are seeded (the clouds over the control area are never seeded)) to the value of the T/Q ratio for the unseeded periods. (Note that over the course of the year, sometimes seeding is done, and sometimes it's not. The amounts of rainfall are kept track of in both areas for both seeded periods and unseeded periods, and one double ratio value is computed to reflect the effect of seeding for each year.) Finally, in the expression
[T/Q (seeded)]/[T/Q (unseeded)]
on p. 418, it may appear that the (seeded) and (unseeded) designations go with the Q, which contradicts the book's indication that the control area is never seeded, while the target area is the one that is sometimes seeded and sometimes unseeded. Really, the (seeded) and (unseeded) designations go with the whole T/Q ratio.
p. 419, Example 9.1
Consider the work shown for this example, and note the similarity with the Sec. 8.1 procedures. (This similarity is noted in Comment 2 on p. 420.) C from Sec. 9.1 is the same as K from Sec. 8.1 if we replace the (xi, yi) pairs from Sec. 8.1 with the (xi, di) pairs of Sec. 9.1, and note that xj - xi is positive whenever j > i.
p. 419, Example 9.1
It is noted that the approximate p-value of 0.071 differs a bit from the exact p-value of about 0.117. But if a continuity correction (in this case of 1, changing the -6 to a -5) is used, the approximate p-value isn't so bad --- it's 0.110 --- even though n is only 5.
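Those numbers are easy to reproduce (a quick check, using the same null variance as in the sketch above):

import math
from scipy.stats import norm

sd0 = math.sqrt(5 * 4 * (2 * 5 + 5) / 18)   # null sd of C for n = 5
print(norm.cdf(-6 / sd0))                   # about 0.071, no correction
print(norm.cdf(-5 / sd0))                   # about 0.110, corrected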
p. 419, Comment 1
The distribution of
ej - ei
is symmetric about 0 (since it's the difference of two iid random variables), and so the median of
ej - ei
equals 0. Since
Yj - Yi
is equal to
ej - ei
plus a constant, the median of the distribution of
Yj - Yi
is just that constant.
p. 420, Comment 2
Note that when the null hypothesis value of the slope is equal to 0, the test about the slope reduces to Kendall's test (or Mann's test for trend if the xi are considered to be nonrandom). This being the case, one could get a small p-value due to a strong monotone relationship that isn't a linear relationship. So, as is typically the case with regression, one has to examine the data and confirm that a linear model makes sense. If one examines the squirrel monkey data from Table 9.3 on p. 421, while it seems clear that there is a strong relationship between the two variables, it's not clear that it's a linear relationship. (Note: I plan to give you a scatter plot of the data, so you may not want to bother to produce such a plot.) Where Problem 4 on pp. 420-421 instructs one to "test for the presence of a linear relationship between these two measurements" it should be kept in mind that the test of Sec. 9.1 can only be used for this purpose if (9.1) on p. 416 is assumed to hold. I think that it may be better to treat the data as being observations from a bivariate distribution, and use tests from Ch. 8 to reject the null hypothesis of independence and suggest that there is a positive association between the two variables. Then if appropriate graphics (e.g., a residual plot) suggest that a linear relationship may be appropriate, one may decide to assume that the model given by (9.1) holds.
p. 420, Comment 3
H&W indicates that Mann's test for trend can be viewed as a special case of the test of Sec. 9.1 about the slope, provided that the xi represent the time order and the null value of the slope is 0. But I think the connection with Mann's test need not be considered to be limited in this way. One can obviously use something other than time order with Mann's test (as long as one orders according to the x values, it doesn't matter whether x represents time, distance, or something else). Also, the Di can play the role of the Yi in Mann's test, and the Di are iid if the true slope is equal to the null hypothesis value, whether the null value is 0 or something else.

Section 9.2

This is a very short section.

I'll offer some specific comments about the text below.
p. 421, Procedure
Note that N is equal to n choose 2.
pp. 421-422, Procedure
It's disappointing that H&W doesn't point out a connection between the slope estimate of Sec. 9.2 and the test about the slope presented in Sec. 9.1. Basically, the point estimate of the slope is a value which is as compatible with the data as possible, when the test of Sec. 9.1 is used to judge compatibility. Specifically, if we use the point estimate as the null hypothesis value of the slope, and do a two-sided test, the p-value is 1. (Because of this, the point estimate of Sec. 9.2 has similarities to the Hodges-Lehmann estimates of previous chapters.) It can be determined that this is true by examination of the equation 3 lines from the bottom on p. 419, noting that the number of positive dj - di will be equal to the number of negative dj - di, resulting in a value of 0 for the test statistic. (I'll go over this in class.)
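To make the estimator concrete, here is a minimal Python sketch of the Sec. 9.2 point estimate (the median of the N = n choose 2 pairwise slopes; the function name theil_slope is mine):

import numpy as np

def theil_slope(x, y):
    # Median of the pairwise slopes S_ij = (y_j - y_i)/(x_j - x_i) over
    # all i < j, assuming the x values are distinct (no zero divisors)
    n = len(x)
    slopes = [(y[j] - y[i]) / (x[j] - x[i])
              for i in range(n) for j in range(i + 1, n)]
    return np.median(slopes)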
p. 422, Comment 4
It is noted that Dietz claims that there are nice aspects of the Theil estimator. Rand Wilcox also reports good things about the estimator, which is covered in his 2001 book Fundamentals of Modern Statistical Methods: Substantially Improving Power and Accuracy.
p. 422, Comment 5
While it is well known that the least squares estimator is sensitive to outliers, it is less well known that some robust alternatives can also produce a bad estimate when an extreme outlier occurs at a position where it has great influence. The Theil estimator, being the median of the pairwise slope estimates, is very resistant to outliers (it has a high breakdown point). One really wild value, no matter where it occurs, cannot ruin the Theil estimate.
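A quick toy demonstration of that resistance, reusing the theil_slope sketch from above (the data here are made up for illustration):

import numpy as np

x = np.arange(10.0)
y = 2.0 * x + 1.0
y[9] = 500.0                    # one wild outlier in the last response

print(np.polyfit(x, y, 1)[0])   # OLS slope, badly pulled by the outlier
print(theil_slope(x, y))        # still 2.0: most pairwise slopes are 2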
p. 422, Comment 6
It takes a bit of work to show that the least squares estimator is a weighted average of the Sijs.
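One can at least verify the claim numerically. The standard weighted-average representation (which I believe is what Comment 6 refers to) uses weights proportional to (xj - xi)^2, and a quick check with made-up data confirms that the weighted average of the Sij reproduces the least squares slope:

import numpy as np

rng = np.random.default_rng(0)
x, y = rng.normal(size=8), rng.normal(size=8)

pairs = [(i, j) for i in range(8) for j in range(i + 1, 8)]
w = np.array([(x[j] - x[i]) ** 2 for i, j in pairs])
s = np.array([(y[j] - y[i]) / (x[j] - x[i]) for i, j in pairs])

print(np.sum(w * s) / np.sum(w))   # weighted average of the S_ij
print(np.polyfit(x, y, 1)[0])      # least squares slope -- same value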

Section 9.3

This is another short section --- perhaps a bit too short, since, unfortunately, the book doesn't provide much insight into why the confidence interval presented has the correct coverage probability. It should be noted that sometimes the confidence interval produced by the method of this section is shorter than the one based on least squares, and in other instances, the opposite is true.

I'll offer some specific comments about the text below.
p. 424, Procedure
Note that in order for M and Q to be integers, the coverage probability has to be chosen so that the k critical value is an even integer.
p. 424, Large-Sample Approximation
(error in book) Since a confidence interval procedure that produces intervals which are wider than they need to be is conservative, and a confidence interval procedure that produces intervals which are narrower than they need to be is anticonservative, I think that one needs to use the smallest integer that is greater than or equal to the right-hand side of (9.25) instead of "the largest integer that is less than or equal to the right-hand side of (9.25)." (The same mistake is made on the 2nd line of p. 426.)
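Putting the pieces of this section together, here is a sketch of the interval computation with the ceiling used in place of the floor, as suggested above. The endpoint convention (the M-th and (Q+1)-th ordered pairwise slopes, with M = (N - k)/2 and Q = (N + k)/2) is my reading of the Procedure on p. 424, so check it against the book before relying on this:

import math
from scipy.stats import norm

def theil_ci(x, y, conf=0.95):
    n = len(x)
    slopes = sorted((y[j] - y[i]) / (x[j] - x[i])
                    for i in range(n) for j in range(i + 1, n))
    N = n * (n - 1) // 2
    z = norm.ppf(1 - (1 - conf) / 2)
    # Smallest integer >= the right-hand side of (9.25): rounding up,
    # not down, keeps the interval conservative, per the note above
    k = math.ceil(z * math.sqrt(n * (n - 1) * (2 * n + 5) / 18))
    if (N - k) % 2:   # bump k so that M and Q come out as integers
        k += 1
    M, Q = (N - k) // 2, (N + k) // 2
    if M < 1 or Q >= N:
        raise ValueError("n too small for this confidence level")
    # Endpoints S_(M) and S_(Q+1), in 1-based order statistic notation
    return slopes[M - 1], slopes[Q]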
p. 425, Comment 9
(error in book) H&W states that the lower and upper confidence bounds are given by (9.27) and (9.29), but these expressions are actually one-sided confidence intervals. The confidence bounds are just the endpoints of the one-sided confidence intervals.

Section 9.4

Yet another short section.

I'll offer some specific comments about the text below.
p. 426, Procedure
Note that the estimator is based on a really simple idea: once the slope estimate is determined, each of the (x, y) pairs can be used to produce an estimate of the intercept using (9.32), and the "overall" estimate is just the median of the n estimates obtained from the n points.
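In code, the idea is nearly a one-liner once the slope estimate is in hand (a sketch reusing the theil_slope function from the Sec. 9.2 comments above):

import numpy as np

def theil_intercept(x, y):
    # Each point gives an intercept estimate y_i - slope * x_i, in the
    # spirit of (9.32); the overall estimate is the median of the n values
    b = theil_slope(x, y)
    return np.median(np.asarray(y) - b * np.asarray(x))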
p. 427, Procedure
On the line after (9.36), I don't like that the median of the distribution of Y given a particular value for x is referred to as "the typical value" since a distribution median need not be a value which is close to values which are highly likely to occur.

Section 9.5

I may not spend much time on this section in class. Since I don't plan to assign any HW exercises based on this section, I won't take the time to carefully go through the procedure step by step. If you want to use the test at some point in the future, I think studying Example 9.5 on pp. 430-433 should help you perform the procedure correctly.

I'll offer some specific comments about the text below.
p. 429, Procedure
The pooled estimator given by (9.40) is a weighted sum of the k slope estimators associated with the k samples. To see that this is the case, note that in the slope estimator for a sample (an estimator having the form of the one given in Comment 5 on p. 422), the sample mean of the Yi can be omitted, since it can be pulled out in front of a sum, with the sum having the value 0.
p. 430, Procedure
Note that (9.42) gives residuals, and that the residuals are ranked from 1 to ni in each sample (see the top half of p. 433 for an example).

Section 9.6

My guess is that the most convenient way for most of you to do the HW exercise related to this section will be to use Minitab. (Note: The student version of Minitab that I have installed doesn't do the rank regression with the rreg command, but it's on the mainframe version that we all have access to. If you've never gotten your account established on the mason/osf1 system, you might want to do that rather soon. I'll make the HW exercise related to this section such that you don't have to print out anything --- you can just copy the answers from the screen. That way, you don't have to do a lot on the computer that you may not be very familiar with. Some information about getting onto the mainframe can be found on this web page that I have on my STAT 554 web site.) Although it isn't clear that the main method covered by this section is better than robust regression methods like those based on M-estimators, I think that everyone ought to have a way to do linear regression in addition to ordinary least squares (OLS), and so if you don't know any alternative method, then the material in this section may be quite useful to you in the future. (If nothing else, it can serve as a way to check the reasonableness of an OLS analysis.) Still, it should be noted that rank regression is seldom used.

Although H&W emphasizes hypothesis testing, it should also be pointed out that the estimation procedure used to obtain the coefficients in the fitted model is superior to OLS estimation in a lot of cases (e.g., many cases for which the error term distribution has heavy tails).

Since H&W is somewhat skimpy on the details in places, I'll point out that Ch. 6 of Alternative Methods of Regression by Birkes and Dodge (Wiley, 1993) provides some additional details (but still doesn't explain everything fully --- which may be just fine for most of us ... since life is too short to worry about all of the grubby details of every statistical procedure that we want to use).

I'll offer some specific comments about the text below.
p. 439, Assumptions
Note that this section deals with the usual multiple linear regression model. Rather than the usual assumption of a normal error term distribution, which allows one to form confidence intervals and prediction intervals, and do tests of hypotheses using OLS regression fits, here that assumption is relaxed, and it is only assumed that the error term distribution is symmetric, with a mean/median of 0. (H&W stipulates that the median is 0, but since symmetry is assumed, unless the error term distribution is such that the mean is not defined, the mean is also 0.)
p. 439, Hypothesis
(error in book) On the two lines below (9.53), H&W has "do not play significant roles" whereas it really should be do not play any role.
p. 440, Procedure
An inspection of (9.54) reveals the main difference in rank regression and OLS regression. If the difference in the brackets was the same as the difference in the parentheses, then minimizing (9.54) would lead to the OLS estimates. By incorporating the ranks, the large differences are given less influence than they are in OLS regression, and so rank regression is less sensitive to gross outliers (and generally performs better for heavy-tailed error term distributions).
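Here is a minimal numerical sketch of a fit of this type, minimizing a Wilcoxon-score dispersion of the form of (9.54) with a general-purpose optimizer. This is a generic illustration, not Minitab's rreg. Note that the dispersion doesn't change if a constant is added to all of the residuals, so the intercept isn't identified by the criterion and is estimated afterward from the residuals:

import numpy as np
from scipy.optimize import minimize
from scipy.stats import rankdata

def dispersion(beta, X, y):
    # Sum over i of [R(e_i) - (n+1)/2] * e_i, the form of (9.54);
    # replacing the bracketed rank term by e_i itself would give OLS
    e = y - X @ beta
    return np.sum((rankdata(e) - (len(y) + 1) / 2) * e)

def rank_fit(X, y):
    # X: n-by-p matrix of predictor values (no intercept column).
    # Start at OLS; the dispersion is convex but not smooth in beta,
    # so a derivative-free method is a safe choice for a sketch
    b0 = np.linalg.lstsq(X, y, rcond=None)[0]
    slopes = minimize(dispersion, b0, args=(X, y), method="Nelder-Mead").x
    intercept = np.median(y - X @ slopes)   # recovered from residuals
    return intercept, slopes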
p. 440, Procedure
In class, I'll explain the similarities of the testing procedure described here with the corresponding normal theory testing procedure. (Both use a test statistic which has a difference in a quality of fit measure (between the full and reduced models) in the numerator, and a measure of scale in the denominator (with the measure of scale being the estimated error term distribution variance in the normal theory F statistic, and being some other measure of spread related to the error term distribution in the rank regression statistic).)
pp. 441-446, Example 9.6
You can do the HW exercise related to this section using Minitab by closely following the steps used in this example.
p. 443, Example 9.6
(error in book) The 3rd row of M1 should be 0 0 1 0 instead of 0 0 0 0.
p. 447, Comment 23
H&W doesn't explain why the parameter tau (not the same as Kendall's tau) comes into play when testing hypotheses about the coefficients. The explanation isn't a simple one, and so it would take a lot of time in class to develop it. Being that the end of the semester is so near, we'll have to skip the explanation. (Really, a different type of nonparametric statistics course would be needed to cover things like this, and the development of asymptotic results that H&W present with little explanation. This semester's version of STAT 657 has focused on covering a large number of nonparametric methods for a wide variety of situations, and having students use the methods to complete the HW exercises.)
p. 448, Problem 33
(error in book) On the 2nd line from the bottom of the page, independent should be dependent.

Section 9.7

An important thing to make note of is that on p. 453 H&W indicates that the phrase nonparametric regression typically refers to regression done using methods such as those introduced in this section, and not the type of regression covered in the first six sections of Ch. 9. (To avoid confusion, I tend to use the phrase rank regression to refer to the method of Sec. 9.6.)

During class, I plan to present a bit more information about kernel regression smoothers, local regression smoothers, and spline regression smoothers than is included in H&W, and hope to also give brief descriptions of CART and MARS, but there isn't time to present too much material on any of these methods. (Some of them are rather complex, and I would need at least a whole lecture period to adequately explain them.)
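As a small preview of the kernel idea, here is a minimal Nadaraya-Watson-type sketch: the estimate at a point is a locally weighted average of the responses, and the bandwidth h is the tuning parameter behind the variance-bias trade-off mentioned in the p. 456 comments below.

import numpy as np

def kernel_smooth(x0, x, y, h):
    # Estimate at x0: average of the y_i, weighted by a Gaussian kernel
    # in the distance of each x_i from x0; small h tracks the data
    # closely (low bias, high variance), large h smooths more heavily
    w = np.exp(-0.5 * ((np.asarray(x) - x0) / h) ** 2)
    return np.sum(w * np.asarray(y)) / np.sum(w)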

I'll offer some specific comments about the text below.
p. 454 (relates to 2nd paragraph)
Although H&W focuses on regression based on a single predictor, the methods of this section can be extended to more than one predictor variable. A whole semester course would be needed to do justice to nonparametric multiple regression, and such a course would require a solid foundation in traditional multiple regression, and perhaps some prior knowledge of some common computational statistics techniques. But such a course would take students to the frontiers of modern statistics. Methods which are considered by some to be methods of data mining, knowledge discovery, and machine learning, are little different (or no different, in many cases) from established statistical methods used for this type of nonparametric regression, nonparametric classification, and clustering. A good book that covers a lot of such methods, and is written by three leading Stanford statisticians, is The Elements of Statistical Learning: Data Mining, Inference, and Prediction by Hastie, Tibshirani, and Friedman.
p. 454, Assumptions
In the 3rd paragraph of the subsection, H&W refers to the mu term in (9.64) as the median, but for some methods, the resulting estimated function may be closer to the mean than the median. In the 1st paragraph of the subsection, H&W indicates that the ei are iid random variables, but in a lot of situations the methods are applied when this is not thought to be the case, since often the variance (and even the general shape) of the error term distribution is thought to be nonconstant (i.e., depend on x).
p. 456, General Discussion
Note that Minitab and S-Plus include some of the more modern regression methods referred to in this section. I guess the fact that SAS is not referred to is an indication that it doesn't include any of the methods referred to in this section. (If this is indeed the case, then, as far as I know, SAS doesn't do rank regression, robust regression, or nonparametric regression. If anyone knows differently, please let me know.)
p. 456, General Discussion
H&W indicates that the variance-bias trade-off issue comes up when choosing a particular nonparametric regression method (so really, choosing a general method, and then also deciding how to "tune" it). But the variance versus bias issue is also present with typical applications of OLS regression. When one decides to omit nonsignificant variables in the variable selection process, often the hope is that variance can be decreased without adding too much bias. That is, if a variable has minor (but perhaps some) influence on the value of the response, then it can often be dropped from the model with little consequence. But it should be kept in mind that unless the model statement is exactly correct (not just which variables should be included, but also how they should be included (e.g., maybe quadratic, cubic, and interaction terms need to be added, or a variable needs to be transformed)), there will typically be bias associated with the estimates. (Another reason for omitting variables from a model is to develop a simpler model. Taking the view that most models are approximate anyway, there is some extra beauty in a simple model, as opposed to a more complex one, particularly when the simple model is almost as accurate as the more complex one.)

Section 9.8

Note that, 3 lines from the bottom on p. 456, one can see that in the case of an evenly spaced design, the AREs of the Theil estimator with respect to the least squares estimator are the same as the AREs of the signed-rank test with respect to the t test. So for large enough sample sizes, the Theil estimator is better than 95% efficient if the error term distribution is normal, and for heavy-tailed error term distributions, the Theil estimator can be superior.