MTB > # I'll use the mainframe version of Minitab to perform the rank von Neumann test
MTB > # on the batch chemical process yields data from the book by Box and Jenkins.
MTB > # First I'll read in the data and examine it.
MTB > read 'chemical.dat' c1
Entering data from file: chemical.dat
     70 rows read.
MTB > name c1 'yield'
MTB > print c1

yield
   47   64   23   71   38   64   55   41   59   48   71   35   57   40
   58   44   80   55   37   74   51   57   50   60   45   57   50   45
   25   59   50   71   56   74   50   58   45   54   36   54   48   55
   45   57   50   62   44   64   43   52   38   59   55   41   53   49
   34   35   54   45   68   38   50   60   39   59   40   57   54   23

MTB > describe c1

             N     MEAN   MEDIAN   TRMEAN    STDEV   SEMEAN
yield       70    51.13    51.50    51.21    11.91     1.42

           MIN      MAX       Q1       Q3
yield    23.00    80.00    43.75    58.25

MTB > let k90 = 1
MTB > execute 'skku'
Executing from file: skku.MTB
skewness   -0.0754499
kurtosis    0.117856
MTB > execute 'qqnorm'
Executing from file: qqnorm.MTB

[character normal Q-Q plot produced by the qqnorm macro: normal scores (C92, about -1.5 to 1.5) vs. ordered yields (C90, 24 to 84)]

MTB > # The data is consistent with observations of iid normal random variables
MTB > # based on what can be seen above. But with time-ordered data, it's good
MTB > # to check for dependence. A good way to check for serial correlation at
MTB > # lag 1 is to plot the pairs ( x_i, x_{i+1} ) and to compute the value of
MTB > # the serial correlation coefficient (see p. 35 of Miller's Beyond ANOVA).
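As a cross-check on the describe output, the summary statistics are easy to recompute outside Minitab. A minimal Python sketch (the data values are typed in from the listing above; Minitab's STDEV uses the n - 1 denominator, as does Python's statistics.stdev):

```python
import statistics

# Batch chemical process yields (Box and Jenkins), typed in from the listing above.
yields = [
    47, 64, 23, 71, 38, 64, 55, 41, 59, 48, 71, 35, 57, 40,
    58, 44, 80, 55, 37, 74, 51, 57, 50, 60, 45, 57, 50, 45,
    25, 59, 50, 71, 56, 74, 50, 58, 45, 54, 36, 54, 48, 55,
    45, 57, 50, 62, 44, 64, 43, 52, 38, 59, 55, 41, 53, 49,
    34, 35, 54, 45, 68, 38, 50, 60, 39, 59, 40, 57, 54, 23,
]

n = len(yields)                          # 70
mean = statistics.fmean(yields)          # MEAN, about 51.13
median = statistics.median(yields)       # MEDIAN, 51.5
stdev = statistics.stdev(yields)         # STDEV (n - 1 denominator), about 11.91
semean = stdev / n ** 0.5                # SEMEAN, about 1.42
y_min, y_max = min(yields), max(yields)  # MIN 23, MAX 80
```

(The skku macro's exact skewness/kurtosis conventions aren't shown here, so those two values aren't recomputed.)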
MTB > let c2 = c1
MTB > delete 70 c2
MTB > let c3 = c1
MTB > delete 1 c3
MTB > name c2 'x_i' c3 'x_{i+1}' k1 'r_1'
MTB > plot c3 c2

[character scatterplot of x_{i+1} (vertical, 20 to 80) vs. x_i (horizontal, 24 to 84); the point cloud has a downward-sloping pattern]

MTB > let c11 = c1 - mean(c1)
MTB > let c12 = c2 - mean(c1)
MTB > let c13 = c3 - mean(c1)
MTB > let k1 = ( sum( c12*c13 )/69 )/( sum( c11*c11 )/70 )
MTB > print k1

r_1   -0.395529

MTB > # A serial correlation coefficient of about -0.40 is a fairly strong
MTB > # indication of a lack of independence, and the plot above also suggests
MTB > # this. Noting that the sample mean is about 51, it can be seen that a
MTB > # smaller than average yield is usually followed by a larger than average
MTB > # yield, and a larger than average yield is usually followed by a smaller
MTB > # than average yield. (If we had independence, we'd expect to see more
MTB > # instances of two smaller than average values in a row and two larger
MTB > # than average values in a row.)
MTB >
MTB > # I'll now see if the rank von Neumann test provides statistically significant
MTB > # evidence for lack of independence. I'll do a two-tailed test since there
MTB > # may have been no good reason to suspect a particular type of dependence
MTB > # before looking at the data. It can be noted that there are tied values
MTB > # in the data set. I'll use the method of midranks to deal with the ties.
MTB > # This seems to be suggested by the 1982 JASA article by Bartels, but he warns
MTB > # that the distribution theory is affected by ties. My guess is that since the
MTB > # midranks are used in both the numerator and the denominator of the RVN statistic,
MTB > # there will be some cancellation effects and that the usual approximations may be
MTB > # okay.
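The k1 computation above is easy to reproduce directly: sum the lagged cross-products over the 69 pairs and divide by 69, then divide by the mean sum of squares with divisor 70. A minimal Python sketch (data typed in from the listing above):

```python
# Batch chemical process yields, typed in from the Minitab listing above.
yields = [
    47, 64, 23, 71, 38, 64, 55, 41, 59, 48, 71, 35, 57, 40,
    58, 44, 80, 55, 37, 74, 51, 57, 50, 60, 45, 57, 50, 45,
    25, 59, 50, 71, 56, 74, 50, 58, 45, 54, 36, 54, 48, 55,
    45, 57, 50, 62, 44, 64, 43, 52, 38, 59, 55, 41, 53, 49,
    34, 35, 54, 45, 68, 38, 50, 60, 39, 59, 40, 57, 54, 23,
]

n = len(yields)
xbar = sum(yields) / n
dev = [x - xbar for x in yields]

# r_1 = ( sum of (x_i - xbar)(x_{i+1} - xbar) over the 69 lagged pairs, /69 )
#       / ( sum of (x_i - xbar)^2 over all 70 values, /70 ),
# mirroring the k1 expression in the session above.
num = sum(dev[i] * dev[i + 1] for i in range(n - 1)) / (n - 1)
den = sum(d * d for d in dev) / n
r1 = num / den   # about -0.3955
```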
MTB >
MTB > let c21 = rank(c1)
MTB > # (Note: The rank function generates midranks when there are ties.)
MTB > print c21

C21
  25.0  62.0   1.5  66.0  10.0  62.0  43.5  15.5  55.5  26.5  66.0   5.5
  49.0  13.5  52.5  18.5  70.0  43.5   8.0  68.5  35.0  49.0  31.5  58.5
  22.0  49.0  31.5  22.0   3.0  55.5  31.5  66.0  46.0  68.5  31.5  52.5
  22.0  39.5   7.0  39.5  26.5  43.5  22.0  49.0  31.5  60.0  18.5  62.0
  17.0  36.0  10.0  55.5  43.5  15.5  37.0  28.0   4.0   5.5  39.5  22.0
  64.0  10.0  31.5  58.5  12.0  55.5  13.5  49.0  39.5   1.5

MTB > let c22 = c21
MTB > delete 70 c22
MTB > let c23 = c21
MTB > delete 1 c23
MTB > let c24 = c22 - c23
MTB > let k21 = sum( c24**2 )
MTB > let c25 = c21 - mean(c21)
MTB > let k22 = sum( c25**2 )
MTB > let k23 = k21/k22
MTB > name k21 'nm' k22 'dn' k23 'rvn'
MTB > print k21-k23

nm    83192.3
dn    28514.5
rvn   2.91754

MTB > # Now I'll use the beta dist'n to obtain an approximate p-value. Since rvn > 2,
MTB > # we'll have b > 1/2, and I can double the upper-tail probability, P( B >= b ),
MTB > # or I can exploit the symmetry of the approximating beta dist'n and double
MTB > # P( B <= 1 - b ), which is what I'll do below.
MTB > let k31 = k23/4
MTB > let k32 = ( 5*70*71*69*69/( 68*( 5*70*70 - 2*70 - 9 ) ) - 1 )/2
MTB > let k33 = 1 - k31
MTB > cdf k33 k34;
SUBC> beta k32 k32.
MTB > let k35 = 2*k34
MTB > name k31 'b' k32 'alpha' k33 '1 - b' k34 'cdf 1-b' k35 'p-value'
MTB > print k31-k35

b         0.729385
alpha     35.2247
1 - b     0.270615
cdf 1-b   0.000023904
p-value   0.000047808

MTB > # So we get an approximate p-value of 4.8 * 10^{-5}. Even though approximation
MTB > # formulas sometimes perform poorly when approximating extremely small probabilities,
MTB > # in this case I think it's safe to assume that whatever the exact p-value is, it's
MTB > # very small and we have strong evidence against independence.
MTB > # While I have the data read in, I'll go ahead and do a standard runs test on the total
MTB > # number of runs when the data is dichotomized using the sample median (which is 51.5).
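For readers without Minitab, the whole RVN computation can be sketched in Python using only the standard library. The midranks function below reproduces what Minitab's rank does with ties, and beta_cdf is a standard continued-fraction evaluation of the regularized incomplete beta function (the Numerical Recipes-style algorithm), standing in for Minitab's cdf with the beta subcommand:

```python
import math

# Batch chemical process yields, typed in from the Minitab listing above.
yields = [
    47, 64, 23, 71, 38, 64, 55, 41, 59, 48, 71, 35, 57, 40,
    58, 44, 80, 55, 37, 74, 51, 57, 50, 60, 45, 57, 50, 45,
    25, 59, 50, 71, 56, 74, 50, 58, 45, 54, 36, 54, 48, 55,
    45, 57, 50, 62, 44, 64, 43, 52, 38, 59, 55, 41, 53, 49,
    34, 35, 54, 45, 68, 38, 50, 60, 39, 59, 40, 57, 54, 23,
]


def midranks(x):
    """Ranks 1..n, with tied observations sharing the average (mid) rank."""
    n = len(x)
    order = sorted(range(n), key=lambda i: x[i])
    r = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and x[order[j + 1]] == x[order[i]]:
            j += 1
        for k in range(i, j + 1):
            r[order[k]] = (i + j) / 2 + 1   # average of ranks i+1 .. j+1
        i = j + 1
    return r


def _betacf(a, b, x):
    # Continued-fraction part of the regularized incomplete beta function,
    # evaluated with the modified Lentz algorithm.
    fpmin = 1.0e-300
    qab, qap, qam = a + b, a + 1.0, a - 1.0
    c, d = 1.0, 1.0 - qab * x / qap
    if abs(d) < fpmin:
        d = fpmin
    d = 1.0 / d
    h = d
    for m in range(1, 300):
        m2 = 2 * m
        for aa in (m * (b - m) * x / ((qam + m2) * (a + m2)),
                   -(a + m) * (qab + m) * x / ((a + m2) * (qap + m2))):
            d = 1.0 + aa * d
            if abs(d) < fpmin:
                d = fpmin
            c = 1.0 + aa / c
            if abs(c) < fpmin:
                c = fpmin
            d = 1.0 / d
            h *= d * c
        if abs(d * c - 1.0) < 3.0e-14:
            break
    return h


def beta_cdf(x, a, b):
    """Regularized incomplete beta I_x(a, b) = P(B <= x) for B ~ Beta(a, b)."""
    if x <= 0.0:
        return 0.0
    if x >= 1.0:
        return 1.0
    front = math.exp(math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
                     + a * math.log(x) + b * math.log(1.0 - x))
    if x < (a + 1.0) / (a + b + 2.0):
        return front * _betacf(a, b, x) / a
    return 1.0 - front * _betacf(b, a, 1.0 - x) / b


n = len(yields)
r = midranks(yields)
nm = sum((r[i] - r[i + 1]) ** 2 for i in range(n - 1))  # k21, about 83192.3
rbar = sum(r) / n                                       # (n + 1)/2 = 35.5
dn = sum((ri - rbar) ** 2 for ri in r)                  # k22, about 28514.5
rvn = nm / dn                                           # about 2.91754

# Symmetric beta approximation: RVN/4 ~ Beta(alpha, alpha), with
# alpha = ( 5n(n+1)(n-1)^2 / ( (n-2)(5n^2 - 2n - 9) ) - 1 )/2 as in k32.
b_stat = rvn / 4
alpha = (5 * n * (n + 1) * (n - 1) ** 2
         / ((n - 2) * (5 * n * n - 2 * n - 9)) - 1) / 2  # about 35.22
p_value = 2 * beta_cdf(1 - b_stat, alpha, alpha)         # two-tailed, about 4.8e-5
```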
MTB > runs 51.5 c1

    yield
    K = 51.5000

    THE OBSERVED NO. OF RUNS = 59
    THE EXPECTED NO. OF RUNS = 36.0000
    35 OBSERVATIONS ABOVE K   35 BELOW
    THE TEST IS SIGNIFICANT AT 0.0000

MTB > # Even though Minitab didn't supply a p-value, it did some of the hard work in
MTB > # determining that the number of runs is 59. (Minitab indicates that the approximate
MTB > # p-value rounds to 0.0000, and so I guess we can claim p-value < 0.00005. (The last
MTB > # time I checked, no matter what the sample size is, Minitab uses a normal approximation
MTB > # without a continuity correction to obtain an approximate p-value when doing a runs test.))
MTB > # I'll now use a normal approximation with a continuity correction to obtain a p-value.
MTB > # Since the number of runs observed, 59, exceeds the expected value under the null hypothesis,
MTB > # 36, to get an approximate p-value we can double an upper-tail probability. The approximate
MTB > # p-value is
MTB > #     2 * [ 1 - Phi( z ) ],
MTB > # where
MTB > #     z = ( r - 1/2 - E(R) )/sqrt( Var(R) ),
MTB > # and E(R) and Var(R) are the mean and variance under the null hypothesis.
MTB > # Equivalently, the approximate p-value is 2*Phi(-z), which is how I'll get it below.
MTB > # The null mean is 36. I'll store the null variance in k41.
MTB > let k41 = 2*35*35*( 2*35*35 - 35 - 35 )/( 70*70*69 )
MTB > let k42 = ( 36 + 1/2 - 59 )/sqrt(k41)
MTB > cdf k42 k43;
SUBC> norm 0 1.
MTB > let k44 = 2*k43
MTB > name k41 'null var' k42 '-z' k44 'approx p'
MTB > print k41 k42 k44

null var   17.2464
-z         -5.41793
approx p   0.000000119

MTB > # Using the continuity correction, the approx. p-value is 1.2 * 10^{-7}. Without the
MTB > # continuity correction, the approx. p-value comes out as 3.1 * 10^{-8}.
MTB > save 'chem'
Saving worksheet in file: chem.MTW

To do the runs test using StatXact, start up StatXact and
(1) copy and paste the data set (I supply a link to the data file) into the first
    column of the Case Data spreadsheet.
    (If a new Case Data spreadsheet doesn't appear when you start StatXact, click
    File > New > Case Data to bring one up.)
(2) Next click Nonparametrics > One-Sample Goodness-of-Fit > Runs to bring up the
    dialog box for the Runs Test. Var1 should be highlighted in the Variables box
    (if not, click it to highlight it).
(3) Click the uppermost Right Arrow to put Var1 in the Response box.
(4) Click the Median button in the Cut-off box.
(5) Click the Exact button in the Compute box.
(6) Finally, click OK.
You should get an exact p-value of about 1.7 * 10^{-8} for a two-sided test (see the
2-Sided column (2nd to the last column) under P-Value in the output, looking at the
Exact (bottom) row). The two-tailed test p-value is based on | R - E(R) |, the absolute
difference between R and E(R). (Note: Although the StatXact manual (viewable using
Help) indicates that a continuity correction is used for the Asymptotic p-value, this
isn't always the case ... a normal approximation without a continuity correction is
used sometimes. (When I did a runs test with a sample of size 12, the continuity
correction was used for StatXact's asymptotic p-value computation, but when I used
data sets having 60 and 70 observations, the continuity correction was not used.))

Software isn't of much help with tests based on runs up and down, except to compute
standard normal cdf values for approximate p-values. With the data under consideration
there are 60 runs up and down. (Note: Even though there are tied values in the data,
they didn't cause a problem when determining the runs up and down, since no consecutive
values were tied.) The sample size (70) is too large to make use of Table E of G&C,
but we can use the normal approximation. Using the formula at the very bottom of
p. 3-27 of the class notes, I obtained an approximate p-value of 1.6 * 10^{-4}.

The longest run up or down is of length 3. Although the table on p.
3-28 of the class notes doesn't give results for n = 70, using the results given for
n = 60 and n = 80 it can be concluded that, for n = 70, P( L >= 3 ) is between 0.9890
and 0.9977, and P( L <= 3 ) is between 0.3316 and 0.4432. So the length of the longest
run up or down observed, 3, is neither unusually long nor unusually short. (I'll guess
that 4 is the most likely outcome under the null hypothesis, and that 3 is the 2nd
most likely outcome.)

I don't know of a good way to do a test based on the length of the longest run of the
dichotomized data. The tables don't cover the n = 70 case, and I don't know of an
asymptotic approximation to use.
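The runs-based analyses above can be cross-checked in Python. The sketch below recomputes the runs above/below the median, the runs up and down, and the length of the longest run up or down. The null moments used are the standard ones — E(R) = 2*n1*n2/n + 1 and Var(R) = 2*n1*n2*(2*n1*n2 - n)/(n^2*(n - 1)) for the median-based test, and E(R) = (2n - 1)/3 and Var(R) = (16n - 29)/90 for runs up and down; I'm assuming these match the class-notes formulas, since they reproduce the numbers reported above. The tail probabilities for the longest run are estimated by Monte Carlo rather than taken from a table, so they're only approximate:

```python
import math
import random

# Batch chemical process yields, typed in from the Minitab listing above.
yields = [
    47, 64, 23, 71, 38, 64, 55, 41, 59, 48, 71, 35, 57, 40,
    58, 44, 80, 55, 37, 74, 51, 57, 50, 60, 45, 57, 50, 45,
    25, 59, 50, 71, 56, 74, 50, 58, 45, 54, 36, 54, 48, 55,
    45, 57, 50, 62, 44, 64, 43, 52, 38, 59, 55, 41, 53, 49,
    34, 35, 54, 45, 68, 38, 50, 60, 39, 59, 40, 57, 54, 23,
]

n = len(yields)


def count_runs(signs):
    """Number of runs (maximal blocks of like symbols) in a sequence."""
    return 1 + sum(signs[i] != signs[i + 1] for i in range(len(signs) - 1))


def longest_run(signs):
    """Length of the longest block of like symbols in a sequence."""
    best = cur = 1
    for i in range(1, len(signs)):
        cur = cur + 1 if signs[i] == signs[i - 1] else 1
        best = max(best, cur)
    return best


# --- Runs above/below the sample median (51.5) ---
above = [x > 51.5 for x in yields]
r_med = count_runs(above)                         # 59 runs
n1, n2 = sum(above), n - sum(above)               # 35 above, 35 below
e_med = 2 * n1 * n2 / n + 1                       # E(R) = 36
v_med = 2 * n1 * n2 * (2 * n1 * n2 - n) / (n * n * (n - 1))   # about 17.25
z_med = (r_med - 0.5 - e_med) / math.sqrt(v_med)  # continuity-corrected, about 5.42
p_med = math.erfc(z_med / math.sqrt(2))           # = 2*(1 - Phi(z)), very small

# --- Runs up and down ---
moves = [yields[i + 1] > yields[i] for i in range(n - 1)]  # no tied neighbors here
r_ud = count_runs(moves)                          # 60 runs up and down
e_ud = (2 * n - 1) / 3                            # about 46.33
v_ud = (16 * n - 29) / 90                         # about 12.12
z_ud = (r_ud - 0.5 - e_ud) / math.sqrt(v_ud)
p_ud = math.erfc(z_ud / math.sqrt(2))             # two-tailed, about 1.6e-4

# --- Longest run up or down, with a Monte Carlo check of the table bounds ---
L_obs = longest_run(moves)                        # 3
random.seed(1)
sims = 20000
ge3 = le3 = 0
for _ in range(sims):
    x = [random.random() for _ in range(n)]       # iid continuous null, no ties
    L = longest_run([x[i + 1] > x[i] for i in range(n - 1)])
    ge3 += L >= 3
    le3 += L <= 3
p_ge3 = ge3 / sims   # should land near the interval (0.9890, 0.9977) given above
p_le3 = le3 / sims   # should land near the interval (0.3316, 0.4432) given above
```

(The p_med value from erfc may come out somewhat smaller than the 1.2 * 10^{-7} the mainframe Minitab printed; this deep in the tail the accuracy of the normal cdf routine matters, but either way the value is extremely small.)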