HW #4 solutions, Spring '08, STAT 554

Answers for HW #4

Spring 2008

Note: The format used below is not what I expected you to use --- you should have given some plots, and need not have given the results for procedures which shouldn't be considered. I'm giving results obtained from many different methods in order to make it easier for me to grade the papers (since I suspect that not everyone will have consistently used the best methods).

Problem 1

A probit plot shows a clear pattern indicating fairly strong positive skewness. The skewness can also be seen from a symmetry plot, and a sample skewness of about 1.8 also provides evidence of positive skewness.
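For anyone wanting to check this kind of diagnostic outside of a stat package, here is a minimal Python sketch of the sample skewness and of the points used in a symmetry plot. The data values are made up (the homework data are not reproduced here), the function names are only illustrative, and the skewness uses the simple moment form; some packages apply a small-sample adjustment and will give slightly different values.

```python
def sample_skewness(xs):
    """Moment-based skewness m3 / m2**1.5 (no small-sample adjustment)."""
    n = len(xs)
    mean = sum(xs) / n
    m2 = sum((x - mean) ** 2 for x in xs) / n
    m3 = sum((x - mean) ** 3 for x in xs) / n
    return m3 / m2 ** 1.5


def symmetry_plot_points(xs):
    """Pairs (median - i-th smallest, i-th largest - median).

    For a symmetric sample the points fall near the line y = x; points
    above that line indicate a longer upper tail (positive skewness).
    """
    s = sorted(xs)
    n = len(s)
    med = s[n // 2] if n % 2 else 0.5 * (s[n // 2 - 1] + s[n // 2])
    return [(med - s[i], s[n - 1 - i] - med) for i in range(n // 2)]


# Hypothetical right-skewed values, for illustration only.
right_skewed = [1, 2, 3, 4, 20]
print(sample_skewness(right_skewed))
print(symmetry_plot_points(right_skewed))
```

With a positively skewed sample the skewness comes out positive and the second coordinate of the outermost symmetry-plot points exceeds the first.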

part (a)

Due to the skewness, which is large in magnitude relative to the kurtosis, the sample mean, 60, should be chosen as the estimate of the distribution mean. (Recall, the sample kurtosis is not as easy to interpret when there is appreciable skewness. The skewness being large can cause the kurtosis to be large.)

Some other (bad) estimates are given below (for grading purposes --- they should not be considered to be competitive here):

22 (sample median),
49.1 (trimmed mean (g = 2)),
40.4 (M-estimate (with bend parameter 1.345)).

part (b)

Since the desired test is about the mean of an apparently skewed distribution, Johnson's modified t test is the clear choice, and the desired p-value is about 0.028. The other tests are not valid.
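As a rough sketch of how the modified statistic differs from the ordinary one, the code below computes both. It assumes the commonly cited form of Johnson's (1978) statistic, which adds two terms involving the estimated third central moment to the numerator of the t statistic; the divisor used for the third moment (n here) is an assumption and may differ from the version used in class. The statistic is then referred to a t distribution with n - 1 degrees of freedom (that lookup is omitted because the Python standard library has no t distribution). The data are made up.

```python
import math


def student_t(xs, mu0):
    """Ordinary one-sample t statistic."""
    n = len(xs)
    mean = sum(xs) / n
    s2 = sum((x - mean) ** 2 for x in xs) / (n - 1)
    return (mean - mu0) / math.sqrt(s2 / n)


def johnson_t(xs, mu0):
    """Johnson's modified t statistic (assumed form, see lead-in)."""
    n = len(xs)
    mean = sum(xs) / n
    s2 = sum((x - mean) ** 2 for x in xs) / (n - 1)
    # Assumption: third central moment estimated with divisor n.
    mu3 = sum((x - mean) ** 3 for x in xs) / n
    d = mean - mu0
    numerator = d + mu3 / (6 * s2 * n) + mu3 / (3 * s2 ** 2) * d ** 2
    return numerator / math.sqrt(s2 / n)


# Hypothetical right-skewed values, for illustration only.
skewed = [1, 2, 3, 4, 20]
print(student_t(skewed, 3), johnson_t(skewed, 3))
```

When the estimated third moment is zero the two statistics coincide; with positive skewness the correction terms are positive, so the modified statistic is shifted upward relative to Student's t.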

For grading purposes, some other p-values are:

0.0023 (Student's t test),
0.003 (signed-rank test (Minitab's normal approx. using midranks)),
0.00016 (sign test).

part (c)

part (d)

Due to the skewness, the interval associated with the sign test can be considered, as can an interval resulting from the transformation-to-symmetry ploy.

With the positive skewness and only positive data values, power transformations using powers less than 1 can be investigated. While a power of 0.1 results in a sample skewness near zero for the transformed data, probit and symmetry plots suggest that the transformed distribution is not symmetric. (It's possible for a distribution to have a skewness of zero and at the same time not be symmetric.) Therefore the transformation ploy should not be used.
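The following sketch illustrates the two points made above, using made-up numbers rather than the homework data: how the sample skewness changes as the power is lowered, and how a sample can have zero skewness without being symmetric. The function names and the `shift` parameter are just illustrative choices, and the skewness is the simple moment form.

```python
def sample_skewness(xs):
    """Moment-based skewness m3 / m2**1.5 (no small-sample adjustment)."""
    n = len(xs)
    mean = sum(xs) / n
    m2 = sum((x - mean) ** 2 for x in xs) / n
    m3 = sum((x - mean) ** 3 for x in xs) / n
    return m3 / m2 ** 1.5


def skewness_after_power(xs, power, shift=0.0):
    """Skewness of (x - shift)**power; needs x - shift > 0 for every x."""
    return sample_skewness([(x - shift) ** power for x in xs])


# Hypothetical positive, right-skewed values.
positive_skewed = [1, 2, 3, 5, 8, 13, 40]
for p in (1.0, 0.5, 0.25, 0.1):
    print(p, round(skewness_after_power(positive_skewed, p), 2))

# Made-up sample with mean 0 and third central moment 0 that is
# nevertheless not symmetric about its mean (its median is 1).
zero_skew_asymmetric = [-4, -4, 1, 1, 1, 5]
print(sample_skewness(zero_skew_asymmetric))
```

This is why a near-zero skewness after transformation should be backed up by probit and symmetry plots before trusting the transformed data to be symmetric.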

I'll go with the interval based on the sign test. The resulting confidence interval is (14, 62).
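For reference, an interval of this type can be sketched in a few lines. The version below is the plain order-statistic interval (x_(k), x_(n-k+1)), with k chosen as large as possible so that the exact coverage, 1 - 2 P(B <= k - 1) with B ~ Binomial(n, 1/2), is at least the requested level. It does not do the nonlinear interpolation that some packages use to hit the nominal level exactly. The data and the function name are hypothetical.

```python
from math import comb


def sign_test_interval(xs, conf=0.95):
    """Order-statistic interval for the median with coverage >= conf.

    Returns (lower, upper, exact_coverage).
    """
    s = sorted(xs)
    n = len(s)
    alpha = 1.0 - conf
    k = 0        # rank of the lower endpoint (1-based); 0 = none found yet
    tail = 0.0   # P(B <= k - 1) for the chosen k
    cum = 0.0
    for j in range(n // 2):
        cum += comb(n, j) / 2 ** n   # cum = P(B <= j)
        if cum > alpha / 2:
            break
        k, tail = j + 1, cum
    if k == 0:
        raise ValueError("sample too small for the requested confidence")
    return s[k - 1], s[n - k], 1.0 - 2.0 * tail


# Hypothetical data, for illustration only.
data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
print(sign_test_interval(data))
```

Because the interval uses only order statistics, it is an interval for the median and does not require symmetry.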

Some other intervals are:

(37, 82) (Student's t procedure),
(28, 66) (signed-rank procedure).

part (e)

Due to the strong positive skewness, the smallish sample size, and the fact that the estimand is a rather extreme quantile in the stretched-out right tail of the distribution, E1 is the best choice, and the estimate is 120. (Note: The skewness is almost as strong as that of an exponential distribution, and my studies indicate that for a sample of size 25, E1 is appreciably better than E9 for estimating the 90th percentile of an exponential distribution (and E1 is also a bit better for a sample size as large as 50). So E1 seems to be the clear choice here.)

For purposes of comparison (and for grading purposes), some other estimates are:

120 (E6),
120 (E2),
183 (E9),
173 (E8),
187 (estimator on p. 115 of class notes),
183 (estimator on pp. 3-4 of handout from 7th lecture),
215 (E4).

Problem 2

A probit plot and a symmetry plot, along with the sample skewness, suggest the possibility of somewhat mild negative skewness, but the evidence for skewness is not conclusive. Still, given the possibility of skewness, I think it may be best to choose Johnson's modified t procedure, although in the end we can see that the same interval results from using the ordinary t interval. The resulting confidence interval is (-2.3×10³, 3.2×10³). (Since the estimated standard error of the sample mean is 925, a rule of thumb suggests that the confidence bounds should be expressed using one additional significant digit. However, one can note that the data values seem to have been rounded to the nearest hundred, and so in this case it doesn't seem appropriate to express additional accuracy in the interval estimate.)

Some other intervals are:

(-2.3×10³, 3.2×10³) (Student's t procedure),
(-2.5×10³, 3.2×10³) (signed-rank procedure),
(-1.6×10³, 3.9×10³) (sign procedure (approximate --- based on nonlinear interpolation)).

Problem 3

A symmetry plot and a probit plot suggest that the distribution is symmetric, or nearly symmetric, and a sample skewness of about 0.06 supports this conclusion. The probit plot also suggests that the underlying distribution is slightly light-tailed.

part (a)

Given the appearance of a slightly light-tailed symmetric distribution, Student's t interval should work fine. The resulting confidence interval is (3.82, 4.30). (Since the estimated standard error of the sample mean is 0.0889, a rule of thumb suggests that the confidence bounds should be expressed using one additional significant digit. However, one can note that the data values seem to have been rounded to the nearest hundredth, and so in this case it doesn't seem appropriate to express additional accuracy in the interval estimate.)

Some other intervals are:

(3.82, 4.30) (Johnson's modified t procedure),
(3.80, 4.32) (signed-rank procedure),
(3.69, 4.46) (sign procedure (approximate --- based on nonlinear interpolation)).

part (b)

Since the sample size is not too small, and there is very little or no skewness, E9 is the best choice, and the estimate is 4.98. (Note: My studies indicate that even for a sample of size as small as 25, E9 is appreciably better than E1 for estimating the 90th percentile of a normal distribution.)

For purposes of comparison (and for grading purposes), some other estimates are:

4.98 (E8),
4.99 (estimator on pp. 3-4 of handout from 7th lecture),
4.99 (estimator on p. 115 of class notes),
4.93 (E1),
4.90 (E2),
4.90 (E6),
5.02 (E4).

Problem 4

There are signs of heavy tails, and possibly skewness. (Recall, it's harder to judge the skewness/symmetry of a heavy-tailed distribution --- the wildness due to the heavy tails of a symmetric distribution can create an appearance of mild skewness.)

Since we can't be sure about the symmetry/skewness issue, the safe thing to do is to use a Huber M-estimate since it should be decent whether the underlying distribution is mildly skewed or symmetric. (When heavy tails are the dominant feature, it really doesn't matter much if the distribution is perfectly symmetric or a little bit skewed.) Since the tail weight seems to be more than just slightly greater than that of a normal distribution (based on an inspection of several Q-Q plots, as well as the sample kurtosis), we should avoid using 1.5 as a bend (since that choice would be appropriate for something only slightly heavy-tailed, like a logistic distribution), and instead use a bend of 1.345, or even perhaps 1.2.

A 20% trimmed mean could also be a decent choice. Among the trimmed means, the ones having 15%, 20%, and 25% trimming have the lowest estimated standard errors (although with such a small sample size we shouldn't take the estimated standard errors too seriously). Also, these give the same value, 0.43, as the two M-estimates when rounded to two significant digits. (It's nice when several of the top candidates all give the same estimate.) It should be noted that trimming just 10% is too little since various Q-Q plots suggest more than just a slightly heavy-tailed distribution (and it should be recalled that the sample kurtosis based on only 20 observations may not be too accurate, and in this case seems a bit low (based on the tail weight indicated by the Q-Q plots)).

Various estimates are:

0.41 (10% upper trimmed mean (trimming 2 from upper end)),
0.43 (5% upper trimmed mean (trimming 1 from upper end)),
0.45 (sample mean),
0.45 (5% trimmed mean (trimming 1 from each end)),
0.44 (10% trimmed mean (trimming 2 from each end)),
0.43 (15% trimmed mean (trimming 3 from each end)),
0.43 (20% trimmed mean (trimming 4 from each end)),
0.43 (one-step Huber M-estimate using bend of 1.345),
0.43 (one-step Huber M-estimate using bend of 1.2),
0.44 (one-step Huber M-estimate using bend of 1.5),
0.43 (25% trimmed mean (trimming 5 from each end)),
0.44 (30% trimmed mean (trimming 6 from each end)),
0.44 (35% trimmed mean (trimming 7 from each end)),
0.44 (sample median),
0.44 (Harrell-Davis estimate),
0.44 (Hodges-Lehmann estimate).

(It should be noted that the results of the studies that I've done (and presented to you) make it pretty clear that when the sample size is smallish, the sample median is typically not a good choice, and the Harrell-Davis estimator is also generally inferior to several other choices.) (FYI, the value of MADN is about 0.1038.)
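To make the trimmed mean, MADN, and one-step Huber M-estimate concrete, here is a minimal Python sketch. It assumes the usual definitions: g observations trimmed from each end, MADN = MAD / 0.6745, and a one-step estimate of the form (bend * MADN * (i2 - i1) + sum of the non-flagged values) / (n - i1 - i2), where i1 and i2 count the observations more than `bend` MADNs below and above the median. Whether this matches the class notes in every detail is an assumption, and the data are made up.

```python
from statistics import median


def trimmed_mean(xs, g):
    """Mean after dropping the g smallest and g largest observations."""
    s = sorted(xs)
    kept = s[g:len(s) - g]
    return sum(kept) / len(kept)


def madn(xs):
    """Median absolute deviation from the median, rescaled by 0.6745."""
    m = median(xs)
    return median([abs(x - m) for x in xs]) / 0.6745


def one_step_huber(xs, bend=1.345):
    """One-step Huber M-estimate of location (assumed form, see lead-in)."""
    m = median(xs)
    scale = madn(xs)
    low = sum(1 for x in xs if (x - m) / scale < -bend)    # i1
    high = sum(1 for x in xs if (x - m) / scale > bend)    # i2
    middle = sum(x for x in xs if abs(x - m) / scale <= bend)
    return (bend * scale * (high - low) + middle) / (len(xs) - low - high)


# Hypothetical sample with one low and one wild high value.
sample = [1, 4, 5, 6, 7, 8, 9, 10, 100]
print(sum(sample) / len(sample))    # sample mean, pulled up by 100
print(trimmed_mean(sample, 1))
print(one_step_huber(sample, bend=1.345))
```

Smaller bends flag more observations as extreme, which is why a bend of 1.2 or 1.345 is better suited to heavier tails than a bend of 1.5.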

Problem 5

A probit plot suggests negative skewness. The skewness is also suggested by a symmetry plot, and a sample skewness of about -0.6 also provides evidence of mild negative skewness.

part (a)

Since the desired test is about the mean of an apparently skewed distribution, Johnson's modified t test is the clear choice, and the desired p-value is about 0.01 (rounded from 0.012). The other tests are not valid.

For grading purposes, some other p-values are:

0.020 (Student's t test),
0.052 (signed-rank test (Minitab's normal approx. using midranks)),
0.25 (sign test).

part (b)

Due to the skewness, the possibilities are the sign test, and the transformation ploy, for which a transformation to symmetry (or very near symmetry) is done, and the t test (or signed-rank test) is applied to the transformed data.
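The sign test is simple enough to compute exactly from the binomial distribution, as in the sketch below. It drops observations equal to the hypothesized median and, for the two-sided case, doubles the smaller tail (capped at 1); other conventions for ties and two-sided p-values exist, so this is only one reasonable version. The data, the hypothesized value, and the function name are hypothetical.

```python
from math import comb


def sign_test_p_value(xs, m0, alternative="two-sided"):
    """Exact sign test p-value for H0: median = m0."""
    above = sum(1 for x in xs if x > m0)
    below = sum(1 for x in xs if x < m0)
    n = above + below   # observations equal to m0 are dropped

    def upper_tail(k):
        # P(B >= k) for B ~ Binomial(n, 1/2)
        return sum(comb(n, j) for j in range(k, n + 1)) / 2 ** n

    if alternative == "greater":   # H1: median > m0
        return upper_tail(above)
    if alternative == "less":      # H1: median < m0
        return upper_tail(below)
    return min(1.0, 2.0 * upper_tail(max(above, below)))


# Hypothetical data and null value, for illustration only.
data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
print(sign_test_p_value(data, 2.5))
print(sign_test_p_value(data, 2.5, alternative="greater"))
```

Since only the signs of the differences are used, the test typically has low power with a small sample, which is consistent with the sign test p-value being much larger than the others.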

With the negative skewness and only positive data values, power transformations using powers greater than 1 can be investigated, but no simple power transformation will do much to correct the skewness. However, the transformation y = (x - 9.55)**2.45 does pretty well, as do the transformations y = (x - 9.5)**2.7 and y = (x - 9.4)**3.25. All of these transformations result in a p-value of about 0.06 when a t test is done. Even though none of these transformations may be perfect, the fact that they are all pretty good and result in the same p-value (when rounded appropriately) suggests that some similar transformation which does achieve perfect symmetry will result in about the same p-value. (That is, the transformation technique appears to be pretty robust in this setting.) So it seems best to use a transformation followed by a t test. (Note: Since applying the transformation method is pretty tricky in this case, I didn't really expect anyone to go this route and feel confident about doing so. Therefore, if you simply reported the p-value of 0.25 from the sign test on the original data, and at least gave the transformation ploy adequate consideration, I'll give you almost full credit (even though the p-value is 4 times larger than the one resulting from the transformation ploy).)