Homework


Homework exercises will be assigned throughout the summer session. Typically you'll have at least a week to do each exercise, but towards the end of the course you may have less than a week for some of the exercises.

The various problems (and parts of problems) will be given different point values. I'm not sure how many points there will be in all. (I haven't taught a course based on this material before.)

Some of the problems will be designated as ones which you are supposed to work entirely on your own. Do not discuss these problems with anyone, and do not show your work to anyone or look at someone else's work. It'll be better for you to get no credit for a problem than it will be to get involved in an Honor Code violation case.


due June 15

1) (4 points --- work this one entirely on your own) Consider a sample of size 9 for which no two of the observations in the sample are equal (e.g., the Control sample on p. 11 of E&T --- it consists of 9 different values), and consider using the nine values to create a bootstrap sample. Consider the sample median of the bootstrap sample. Letting p(i) be the probability that the sample median of a bootstrap sample is equal to the observed value of the ith order statistic (from the original sample of size 9), develop an expression for p(i) and evaluate it for i = 1 through i = 9. Provide adequate justification for why your expression is correct (i.e., explain it to me). If you choose to modify expression (2.5) from Problem 2.4 (b) on p. 16 of E&T, you should still provide adequate justification for your expression. That is, it won't be adequate to state that you simply modified (2.5) in an obvious way --- I want a clear explanation for why your expression gives the desired probabilities. (In class I'll explain how these probabilities can be used to obtain the limiting value for the bootstrap estimate of the standard error of the sample median. Ask me to do this on Tuesday if I forget to do so --- I meant to cover this when I was talking about Ch. 2 on Thursday.)
Answer: Letting S denote the sample median (the 5th order statistic of the bootstrap sample), the probability that S is equal to x(i) is equal to P( S >= x(i) ) - P( S >= x(i+1) ), which is equal to P( 4 or fewer of the observations in the bootstrap sample are less than or equal to x(i-1) ) - P( 4 or fewer of the observations in the bootstrap sample are less than or equal to x(i) ). To get the first probability we can sum the binomial (9, (i-1)/9) pmf values for the outcomes 0, 1, 2, 3, and 4, and to get the second probability we can sum the binomial (9, i/9) pmf values for the outcomes 0, 1, 2, 3, and 4. (For i = 1 the first probability is 1, and for i = 9 the second probability is 0.) This results in a very slight modification of expression (2.5) on p. 16 of E&T.
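The expression in this answer can be evaluated in a couple of lines of R. This is only a minimal sketch (the variable names are illustrative and are not taken from the course's own code); it uses pbinom to get the binomial sums over the outcomes 0 through 4.

```r
# Probability that the bootstrap sample median equals the ith order statistic
# of an original sample of 9 distinct values.
n <- 9
i <- 1:n

# pbinom(4, n, q) is P(4 or fewer of the n bootstrap observations fall at or
# below a value whose empirical cdf is q).
p <- pbinom(4, n, (i - 1) / n) - pbinom(4, n, i / n)

print(round(p, 4))
print(sum(p))  # the nine probabilities should sum to 1
```

The probabilities are symmetric about i = 5, which is a quick sanity check on the expression.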
2) (2 points --- you can discuss this one with others, but do not copy the answer of another student) Give an answer for Problem 3.2 on p. 28 of E&T using no more than 3 sentences. Your answer should contain some sort of numerical result. (You may want to make use of the result given in Problem 3.1 on p. 28 of E&T.)
Answer: The probability of getting no repetitions when sampling with replacement is about 0.256, while the probability of getting no repetitions when sampling without replacement is 1. Since we should anticipate getting one or more repetitions when sampling 15 items with replacement from a population of 82 items, when no repetitions are observed it is not unreasonable to suspect that the sampling was done without replacement.
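The 0.256 figure can be checked directly. The sketch below assumes only what the answer states (15 draws with replacement from 82 items); the variable names are illustrative.

```r
# P(no item is repeated) when drawing 15 times with replacement from 82 items:
# (82/82) * (81/82) * ... * (68/82).
n_pop <- 82
n_draw <- 15
p_no_repeat <- prod((n_pop - 0:(n_draw - 1)) / n_pop)

print(round(p_no_repeat, 3))
```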

due June 22

3) (3 extra credit points --- work this one entirely on your own) Do Problem 6.5 on p. 57 of E&T. (Note: Extra credit points count towards the numerator, but not the denominator. That is, I don't expect many to earn the extra credit points and so when I determine your HW average, if you've gotten all of the problems correct except the extra credit problems, I'll give you 100% credit for the HW. (Extra credit points can be useful to impress me or to make up for other HW points you've missed earning.))
Answer: Imagine a line of 2n - 1 locations, labeled 1, 2, 3, ..., 2n - 2, 2n - 1. For each location, either a divider will be placed into it or it will be left empty, with n - 1 dividers being placed in all. Additionally, a divider will be placed before the first location and a divider will be placed after the last location --- so the two additional dividers are at locations 0 and 2n. The n + 1 dividers will be referred to as the 0th, 1st, 2nd, 3rd, 4th, 5th, ..., (n - 1)th, and nth dividers. There will be a total of n empty locations amongst the dividers. Each possible way of placing the dividers corresponds to a distinct bootstrap sample --- with the number of empty locations between the ith and the (i - 1)th dividers being the number of occurrences of x_i in the bootstrap sample. So the total number of distinct bootstrap samples is just the number of ways of choosing the n - 1 locations for dividers from the line of locations 1 through 2n - 1. So the desired value is the binomial coefficient C(2n - 1, n - 1). (Note: A number of students made use of a result given in Ch. 1 of A First Course in Probability, by S. Ross.)
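For small n the counting argument can be checked by brute force. The sketch below is illustrative only (count_distinct is a hypothetical helper, not part of any course code): it enumerates all n^n ordered resamples of the indices 1, ..., n, sorts each one so that order is ignored, and counts the distinct results.

```r
# Count distinct bootstrap samples (ignoring order) by brute-force enumeration.
count_distinct <- function(n) {
  # All n^n ordered ways of drawing n indices with replacement.
  all_samples <- expand.grid(rep(list(seq_len(n)), n))
  # Sorting each row removes the ordering, leaving only the multiset.
  sorted <- t(apply(all_samples, 1, sort))
  nrow(unique(sorted))
}

n_values <- 2:5
brute_force <- sapply(n_values, count_distinct)
formula <- choose(2 * n_values - 1, n_values - 1)

print(rbind(n = n_values, brute_force = brute_force, formula = formula))
```

Enumeration grows as n^n, so this check is only practical for very small n.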
4) (9 points in all (distributed as indicated below) --- you can discuss this one with others, but do not copy the work of another student) Use R's rnorm function to generate 75 observations from a standard normal distribution, using 321 as the random number seed. Give me the R code that will produce answers for parts (a), (b), and (c). But give me your final answers separately --- that is, I don't want to have to execute your R code to obtain your answers. (Note: The R code to do parts (a) through (c) can be rather short --- if you think it's necessary to do something complicated, requiring a lot of code, then you're making things way harder than necessary.)
Answer: This R code can be used to obtain the desired answers. (Execute it to see the answers.)

due June 29

5) (2 points --- work this one entirely on your own) Consider the cell survival data of Table 9.4 on p. 116 of E&T. If one bootstraps complete cases, what is the probability that the questionable case appears exactly k times in the first bootstrap sample, for k being 0, 1, 2, and 3? Give numerical values for the 4 probabilities, rounding each to the nearest thousandth, rather than giving a general expression involving k.
Answer: Letting Y be a binomial (14, 1/14) random variable, the desired probabilities are P(Y = k) for k being 0, 1, 2, and 3. R's dbinom function can be easily used to obtain the desired probabilities. They are 0.354, 0.382, 0.191, and 0.059.
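A minimal sketch of the dbinom call mentioned in this answer (the variable name is illustrative):

```r
# P(Y = k) for Y ~ binomial(14, 1/14) and k = 0, 1, 2, 3: the chance that one
# particular case is drawn exactly k times in a bootstrap sample of 14 cases.
probs <- dbinom(0:3, size = 14, prob = 1 / 14)

print(round(probs, 3))
```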
6) (2 points --- you can discuss this one with others, but do not copy the answer of another student) Give a simple (one short sentence/statement) answer to Problem 9.7 on p. 122 of E&T. (Note: (9.28) follows from (9.10), but you don't have to understand how (9.10) is obtained. Rather, you can think of it like this: given that (9.10) expresses the least squares estimates in terms of C and y, where C is the design matrix based on the real world data and y is the vector of observed response values (from the real world experiment), and (9.28) gives the bootstrap replicates of the estimates for the case of bootstrapping residuals, with y* being the vector of y* values given by (9.26), and C being the same design matrix used in (9.10); when bootstrapping complete cases as opposed to bootstrapping residuals, what needs to be changed about (9.28) in order to make it give the bootstrap replicates for complete case bootstrapping? It can be noted that when bootstrapping complete cases, (9.26) is no longer used to obtain the response values for the bootstrap data sets --- rather one just uses the y values from the resampled cases. So, for bootstrapping cases, y* is interpreted differently. That difference is not what I'm looking for here. I want to know what else has to be changed in order to modify (9.28) to have it give bootstrap replicates of the estimates for complete case bootstrapping. Despite all of these words of explanation, the answer I'm looking for is really simple. So if you come up with a simple answer and question whether I wanted something that seems so obvious, it may be the case that you have arrived at what I'm looking for.)
Answer: One needs to change C to C*, where the values in the ith row of C* are the predictor values for the ith randomly selected case. (Note: Unlike the case of bootstrapping residuals, for which the design matrix is the same for all B bootstrap samples, when bootstrapping cases a different design matrix may be used with each bootstrap sample.)
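The difference between the two resampling schemes can be sketched in R as follows. Everything here is an assumption made for illustration: the data are simulated (they are not the cell survival data), the two-column design matrix without an intercept is hypothetical, and ls_fit is a helper written for this sketch.

```r
set.seed(321)

# Simulated data, purely for illustration.
n <- 14
dose <- runif(n, 1, 10)
C <- cbind(dose, dose^2)                       # hypothetical design matrix
y <- 0.5 * dose + 0.1 * dose^2 + rnorm(n)      # hypothetical responses

# Least squares estimates in terms of a design matrix and a response vector.
ls_fit <- function(C, y) solve(t(C) %*% C, t(C) %*% y)

beta_hat <- ls_fit(C, y)
resid_hat <- as.numeric(y - C %*% beta_hat)

# Bootstrapping residuals: the design matrix C is the same for every
# bootstrap sample; only the response vector y* changes.
y_star_resid <- as.numeric(C %*% beta_hat) + sample(resid_hat, n, replace = TRUE)
beta_star_resid <- ls_fit(C, y_star_resid)

# Bootstrapping complete cases: rows are resampled, so the design matrix
# becomes C* (row i holds the predictors of the ith selected case) and y*
# holds the responses of those same cases.
idx <- sample(n, n, replace = TRUE)
C_star <- C[idx, , drop = FALSE]
y_star <- y[idx]
beta_star_cases <- ls_fit(C_star, y_star)

print(cbind(beta_hat, beta_star_resid, beta_star_cases))
```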

due July 6

7) (2 points --- you can discuss this one with others, but do not copy the answer of another student) Do Problem 10.5 on p. 140 of E&T.
Answer: 44,100 --- or 42,354 if 1.96 is used instead of 2. (Since I think that everyone got this one correct, I won't go to the trouble of typing an explanation for the answer here.)
8) (2 points --- you can discuss this one with others, but do not copy the answer of another student) In Problem 11.9 on p. 151 of E&T, I think the results given for α(x_i) and β(x_i, x_j) are incorrect. Give the correct results. (Notes: (1) You don't have to show any work --- you can just supply the answers. (2) The sample variance referred to is not the typically used unbiased sample variance. (3) Note the 1/n and 1/n^2 coefficients in (11.18) and be sure to take them into account when you give your expressions for α and β.)
Answer: α(x_i) = ((n-1)/n) x_i^2 and β(x_i, x_j) = -2 x_i x_j.
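These expressions can be checked numerically. The sketch below assumes that (11.18) has the form μ + (1/n) Σ α(x_i) + (1/n^2) Σ β(x_i, x_j), with μ = 0 here and the second sum taken over pairs i < j; that reading of (11.18) is an assumption, and the data values are made up.

```r
# Made-up data for the check.
x <- c(2.1, -0.4, 3.3, 1.0, 0.7)
n <- length(x)

# Plug-in (divide-by-n) sample variance.
plug_in_var <- mean((x - mean(x))^2)

# alpha(x_i) = ((n - 1) / n) * x_i^2
alpha <- ((n - 1) / n) * x^2

# beta(x_i, x_j) = -2 * x_i * x_j, evaluated over all pairs with i < j.
pairs <- combn(n, 2)
beta <- -2 * x[pairs[1, ]] * x[pairs[2, ]]

quadratic_form <- sum(alpha) / n + sum(beta) / n^2

print(c(plug_in_var = plug_in_var, quadratic_form = quadratic_form))
```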
9) (4 points --- you can discuss this one with others, but do not copy the answer of another student) Do Problem 11.12 on p. 152 of E&T, except that you don't have to make comparisons to the bootstrap estimates --- just give the jackknife estimates (and show how you used software to obtain the jackknife estimates).
Answer: This R code can be used to obtain the desired answers. (Execute it to see the answers.)
10) (10 points (including 2 extra credit points for the summary/conclusions and explanation) --- work this one entirely on your own) Do Problem 11.13 on p. 152 of E&T, only change the word "variance" to standard error, and use ten thousand samples of size 20 instead of only 100. Use B = 800 and set the random number seed to 321. So for part (a), get a jackknife estimate of the standard error of the sample mean from each of the 10000 samples of size 20, and get a bootstrap estimate of the standard error of the sample mean from each of the 10000 samples of size 20. Then report the sample mean and sample standard deviation of your sample of 10000 jackknife estimates, and report the sample mean and sample standard deviation of your sample of 10000 bootstrap estimates. (Note: I recommend that you loop through a routine 10000 times. At the start of the looped routine, generate a new sample of 20 N(1,1) observations. Also, within the looped routine, get the four estimates that you need. That is, don't do the computations for part (b) completely separate from the computations for part (a), since that would mean repeating a lot of stuff unnecessarily. Once you've gone through the looped routine 10000 times, exit the loop and then get the sample mean and sample standard deviation of each of the four sets of 10000 estimates that you produced.) Don't forget to supply a short statement to summarize and explain your findings. Turn in the R code you used to obtain the results, and also supply the desired means and standard deviations, along with your short summary/explanation. (Note: I'm not sure what sort of conclusions one could arrive at if one followed the book and used only 100 trials in a little Monte Carlo study to compare the performance of the bootstrap and the jackknife estimators --- the noise in such a small experiment is just too large to make it likely that one can discover what is true about the small differences in performance ... the results are way too dependent on the random number seed used. In order to effectively judge the slight difference in performance of the bootstrap and jackknife estimators in each of the two situations, I compared the estimated mean squared errors, which requires that you know the true value of the estimand. If you decide to go this route, and you obtain the true value of the estimand for part (b), be sure to highlight that result on your submitted homework paper.) Note: It may take a rather long time to get the results using R, but still it should only be a smallish number of minutes (as opposed to hours).
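As a rough sketch of the two estimators being compared, the code below computes a jackknife and a bootstrap estimate of the standard error of the sample mean for a single sample of 20 N(1,1) observations. It is not the full Monte Carlo study described above and is not the linked solution; the variable names are illustrative.

```r
set.seed(321)
x <- rnorm(20, mean = 1, sd = 1)
n <- length(x)
B <- 800

# Jackknife: recompute the sample mean with each observation left out in turn.
theta_loo <- sapply(seq_len(n), function(i) mean(x[-i]))
se_jack <- sqrt((n - 1) / n * sum((theta_loo - mean(theta_loo))^2))

# Bootstrap: recompute the sample mean on B resamples drawn with replacement.
theta_boot <- replicate(B, mean(sample(x, n, replace = TRUE)))
se_boot <- sd(theta_boot)

print(c(se_jack = se_jack, se_boot = se_boot, usual = sd(x) / sqrt(n)))
```

For the sample mean, the jackknife estimate coincides with the usual estimate sd(x)/sqrt(n), which makes a convenient check on the code.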
Answer: This R code can be used to obtain the desired answers. (Execute it to see the answers.)

due July 13

11) (9 points --- you can discuss this one with others, but do not copy the answer of another student) Before doing parts (a) and (b), set the random number seed to 321 and generate a sample of 40 standard normal observations using z <- rnorm(40). Use this sample for parts (a) and (b). Before doing parts (c) through (f), set the random number seed to 321 and generate a sample of 40 exponential observations using x <- rexp(40). Use this sample for parts (c) through (f).
Answer: This R code can be used to obtain the desired answers. (Execute it to see the answers.)

due July 20

12) (10 points --- you can discuss this one with others, but do not copy the answer of another student) Do a Monte Carlo study to compare the performances of Student's t interval, the bootstrap percentile interval, the BCA interval, the ABC interval, and the bootstrap t interval for estimating a distribution mean. Using confidence intervals having a nominal coverage probability of 0.90, for each of the five interval estimation methods, obtain estimates of the probability that the interval misses to the left, the probability that the interval misses to the right, and the coverage probability. Base the estimates on 2500 samples of size 25 generated using x <- exp( rnorm( 25 ) ). The x values will be observations from a lognormal distribution. Feel free to copy large parts of my R code, but of course some important changes will have to be made. For the bootstrap t intervals, don't use variance stabilization or nested bootstrapping. Instead, use the usual estimate of the standard error of a sample mean. Don't make use of the fact that the underlying distribution is a lognormal distribution to tweak the interval estimation methods in any way. Give me a simple table containing your results, and also turn in the R code that you used to obtain your results. (Important: Don't use a lognormal distribution that differs from the one indicated above.)
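The overall shape of such a study can be sketched as follows, shown only for Student's t interval and the bootstrap percentile interval. This is an illustration under stated assumptions, not the instructor's code: the seed, the value B = 200, the helper names, and the convention that "misses to the left" means the whole interval lies below the true mean are all choices made for this sketch.

```r
set.seed(321)                 # assumption: the problem does not specify a seed
true_mean <- exp(0.5)         # mean of the lognormal generated by exp(rnorm())
n_sims <- 2500
level <- 0.90

# Student's t interval for the mean.
t_interval <- function(x) as.numeric(t.test(x, conf.level = level)$conf.int)

# Bootstrap percentile interval for the mean (B = 200 keeps the sketch fast).
percentile_interval <- function(x, B = 200) {
  boot_means <- replicate(B, mean(sample(x, replace = TRUE)))
  as.numeric(quantile(boot_means, c((1 - level) / 2, (1 + level) / 2)))
}

# Apply one interval method to n_sims fresh samples and tabulate the outcomes.
estimate_rates <- function(interval_fun) {
  intervals <- replicate(n_sims, interval_fun(exp(rnorm(25))))
  lower <- intervals[1, ]
  upper <- intervals[2, ]
  c(miss_left  = mean(upper < true_mean),   # interval entirely below the mean
    miss_right = mean(lower > true_mean),   # interval entirely above the mean
    coverage   = mean(lower <= true_mean & true_mean <= upper))
}

results <- rbind(student_t  = estimate_rates(t_interval),
                 percentile = estimate_rates(percentile_interval))
print(round(results, 3))
```

Other interval methods could be added as further functions with the same one-argument signature.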


due July 27

13) (3 points --- you can discuss this one with others, but do not copy the answer of another student) Set the random number seed to 321 and generate a sample of 25 exponential observations using x <- rexp(25). Using this data, do a bootstrap test of the null hypothesis that the mean equals 0.75 against the alternative that the mean does not equal 0.75, using the sample mean (as opposed to Student's t statistic) as the test statistic. Rather than shift the empirical distribution to create a bootstrap world distribution in agreement with the null hypothesis, do a rescaling by multiplying the observations in the observed sample by an appropriate factor. Report the approximate p-value and also turn in the R code that you used.
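The rescaling idea can be sketched as below. The number of bootstrap samples (B = 2000) and the two-sided p-value definition (the proportion of bootstrap means at least as far from 0.75 as the observed mean) are assumptions of this sketch, not requirements stated in the problem.

```r
set.seed(321)
x <- rexp(25)
mu0 <- 0.75
B <- 2000    # assumption: the problem does not specify B

# Rescale so that the empirical distribution has mean exactly mu0.
x_null <- x * mu0 / mean(x)

# Bootstrap the sample mean under the rescaled (null) distribution.
boot_means <- replicate(B, mean(sample(x_null, replace = TRUE)))

# Two-sided approximate p-value (one possible definition).
p_value <- mean(abs(boot_means - mu0) >= abs(mean(x) - mu0))
print(p_value)
```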

14) (3 points --- you can discuss this one with others, but do not copy the answer of another student) Set the random number seed to 321 and generate a sample of 8 standard normal observations using w <- rnorm(8). Immediately following this (without resetting the random number seed), generate another sample of 8 normal observations using x <- 0.5 + rnorm(8). Immediately following this (without resetting the random number seed), generate another sample of 8 normal observations using y <- 1 + rnorm(8). Using these three samples, do a bootstrap test of the null hypothesis that all three underlying distributions are identical against the general alternative, using the difference between the largest of the three sample means and the smallest of the three sample means as the test statistic. Report the approximate p-value and also turn in the R code that you used.
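One way to set up such a test is sketched below. It assumes that the null hypothesis is imposed by resampling all three bootstrap samples from the pooled set of 24 observations and that B = 2000; both are choices made for this sketch, and range_of_means is a hypothetical helper.

```r
set.seed(321)
w <- rnorm(8)
x <- 0.5 + rnorm(8)
y <- 1 + rnorm(8)

pooled <- c(w, x, y)
group <- rep(1:3, each = 8)
B <- 2000    # assumption: the problem does not specify B

# Test statistic: largest sample mean minus smallest sample mean.
range_of_means <- function(values, group) {
  m <- tapply(values, group, mean)
  max(m) - min(m)
}

t_obs <- range_of_means(pooled, group)

# Under the null all three distributions are identical, so each bootstrap
# data set is drawn with replacement from the pooled observations.
t_star <- replicate(B, range_of_means(sample(pooled, replace = TRUE), group))

p_value <- mean(t_star >= t_obs)
print(c(t_obs = t_obs, p_value = p_value))
```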