MTB > # I'll use the mainframe version of Minitab to perform the rank von Neumann test
MTB > # on the batch chemical process yields data from the book by Box and Jenkins.
MTB > # First I'll read in the data and examine it.
MTB > read 'chemical.dat' c1
Entering data from file: chemical.dat
70 rows read.
MTB > name c1 'yield'
MTB > print c1
yield
47 64 23 71 38 64 55 41 59 48 71 35 57
40 58 44 80 55 37 74 51 57 50 60 45 57
50 45 25 59 50 71 56 74 50 58 45 54 36
54 48 55 45 57 50 62 44 64 43 52 38 59
55 41 53 49 34 35 54 45 68 38 50 60 39
59 40 57 54 23
MTB > describe c1
            N     MEAN   MEDIAN   TRMEAN   STDEV   SEMEAN
yield      70    51.13    51.50    51.21   11.91     1.42
          MIN      MAX       Q1       Q3
yield   23.00    80.00    43.75    58.25
MTB > let k90 = 1
MTB > execute 'skku'
Executing from file: skku.MTB
skewness -0.0754499
kurtosis 0.117856
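Aside: the 'skku' macro isn't widely available, so here's a rough Python sketch of the
same check (assuming chemical.dat is just the whitespace-separated yields; scipy's
bias-corrected estimators may differ slightly from whatever formulas the macro uses):

    import numpy as np
    from scipy.stats import skew, kurtosis

    y = np.loadtxt('chemical.dat').ravel()       # the 70 yields
    print("skewness:", skew(y, bias=False))      # should be near -0.075
    print("kurtosis:", kurtosis(y, bias=False))  # excess kurtosis, near 0.118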
MTB > execute 'qqnorm'
Executing from file: qqnorm.MTB
C92 - *
- *
- *
1.5+ 3
- 3 *
- 32 *
- 6*
- 24**
0.0+ 32*2
- * *2*3
- *6
- 222
- **2
-1.5+ *2
- *
- *
- *
------+---------+---------+---------+---------+---------+C90
24 36 48 60 72 84
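Aside: the 'qqnorm' macro's character plot pairs normal scores (C92) with the yields
(C90). A sketch of the same plot in Python, assuming Blom plotting positions (which I
believe is what Minitab's NSCORES command uses):

    import numpy as np
    import matplotlib.pyplot as plt
    from scipy.stats import norm, rankdata

    y = np.loadtxt('chemical.dat').ravel()
    nscores = norm.ppf((rankdata(y) - 0.375) / (len(y) + 0.25))  # Blom positions
    plt.scatter(y, nscores)
    plt.xlabel('yield')
    plt.ylabel('normal score')
    plt.show()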
MTB > # Based on the summary statistics, the near-zero skewness and kurtosis, and the
MTB > # fairly straight normal plot above, the data is consistent with observations of
MTB > # iid normal random variables. But with time-ordered data, it's good to check for
MTB > # dependence. A good way to check for serial correlation of lag 1 is to plot the
MTB > # pairs ( x_i, x_{i+1} ) and to compute the value of the serial correlation
MTB > # coefficient (see p. 35 of Miller's Beyond ANOVA).
MTB > let c2 = c1
MTB > delete 70 c2
MTB > let c3 = c1
MTB > delete 1 c3
MTB > name c2 'x_i' c3 'x_{i+1}' k1 'r_1'
MTB > plot c3 c2
80+ *
- * *
x_{i+1} - * * *
- *
- * * * *
60+ * 2** 3
- 2* * 3 * * ** * * *
- * * * 3* 2
- ** *
- * ** 2 * **
40+ * 2 *** * *
- * * ** *
-
-
- * * *
20+
-
------+---------+---------+---------+---------+---------+x_i
24 36 48 60 72 84
MTB > let c11 = c1 - mean(c1)
MTB > let c12 = c2 - mean(c1)
MTB > let c13 = c3 - mean(c1)
MTB > let k1 = ( sum( c12*c13 )/69 )/( sum( c11*c11 )/70 )
MTB > print k1
r_1 -0.395529
MTB > # A serial correlation coefficient of about -0.40 is a fairly strong
MTB > # indication of a lack of independence, and the plot above also suggests
MTB > # this. Noting that the sample mean is about 51, it can be seen that a
MTB > # smaller than average yield is usually followed by a larger than average
MTB > # yield, and a larger than average yield is usually followed by a smaller
MTB > # than average yield. (If we had independence, we'd expect to see more
MTB > # instances of two smaller than average values in a row and two larger
MTB > # than average values in a row.)
MTB >
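Aside: the lag-1 serial correlation computed above can be cross-checked with a short
Python sketch that mirrors the k1 formula (lagged cross-products over n - 1, divided
by the centered sum of squares over n):

    import numpy as np

    y = np.loadtxt('chemical.dat').ravel()
    n = len(y)
    d = y - y.mean()
    r1 = (np.sum(d[:-1] * d[1:]) / (n - 1)) / (np.sum(d * d) / n)
    print("r_1:", r1)                            # about -0.396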
MTB > # I'll now see if the rank von Neumann test provides statistically significant
MTB > # evidence for lack of independence. I'll do a two-tailed test since there
MTB > # may have been no good reason to suspect a particular type of dependence
MTB > # before looking at the data. It can be noted that there are tied values
MTB > # in the data set. I'll use the method of midranks to deal with the ties.
MTB > # This seems to be suggested by the 1982 JASA article by Bartels, but he warns
MTB > # that the distribution theory is affected by ties. My guess is that since the
MTB > # midranks are used in both the numerator and the denominator of the RVN statistic,
MTB > # there will be some cancellation effects and that the usual approximations may be
MTB > # okay.
MTB >
MTB > let c21 = rank(c1)
MTB > # (Note: The rank function generates midranks when there are ties.)
MTB > print c21
C21
25.0 62.0 1.5 66.0 10.0 62.0 43.5 15.5 55.5 26.5 66.0
5.5 49.0 13.5 52.5 18.5 70.0 43.5 8.0 68.5 35.0 49.0
31.5 58.5 22.0 49.0 31.5 22.0 3.0 55.5 31.5 66.0 46.0
68.5 31.5 52.5 22.0 39.5 7.0 39.5 26.5 43.5 22.0 49.0
31.5 60.0 18.5 62.0 17.0 36.0 10.0 55.5 43.5 15.5 37.0
28.0 4.0 5.5 39.5 22.0 64.0 10.0 31.5 58.5 12.0 55.5
13.5 49.0 39.5 1.5
MTB > let c22 = c21
MTB > delete 70 c22
MTB > let c23 = c21
MTB > delete 1 c23
MTB > let c24 = c22 - c23
MTB > let k21 = sum( c24**2 )
MTB > let c25 = c21 - mean(c21)
MTB > let k22 = sum( c25**2 )
MTB > let k23 = k21/k22
MTB > name k21 'nm' k22 'dn' k23 'rvn'
MTB > print k21-k23
nm 83192.3
dn 28514.5
rvn 2.91754
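Aside: a Python sketch of the same RVN computation (scipy's rankdata assigns midranks
to ties by default, matching Minitab's rank function):

    import numpy as np
    from scipy.stats import rankdata

    y = np.loadtxt('chemical.dat').ravel()
    r = rankdata(y)                              # midranks, as in c21
    num = np.sum(np.diff(r) ** 2)                # sum of (R_i - R_{i+1})^2
    den = np.sum((r - r.mean()) ** 2)
    print(num, den, num / den)                   # about 83192, 28514.5, 2.9175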
MTB > # Now I'll use the beta dist'n to obtain an approximate p-value. Since rvn > 2,
MTB > # we'll have b > 1/2, and I can double the upper-tail probability, P( B >= b),
MTB > # or I can exploit the symmetry of the approximating beta dist'n and double
MTB > # P( B <= 1 - b ), which is what I'll do below.
MTB > let k31 = k23/4
MTB > let k32 = ( 5*70*71*69*69/( 68*( 5*70*70 - 2*70 - 9 ) ) - 1 )/2
MTB > let k33 = 1 - k31
MTB > cdf k33 k34;
SUBC> beta k32 k32.
MTB > let k35 = 2*k34
MTB > name k31 'b' k32 'alpha' k33 '1 - b' k34 'cdf 1-b' k35 'p-value'
MTB > print k31-k35
b 0.729385
alpha 35.2247
1 - b 0.270615
cdf 1-b 0.000023904
p-value 0.000047808
MTB > # So we get an approximate p-value of 4.8 * 10^{-5}. Even though approximation
MTB > # formulas sometimes perform poorly when approximating extremely small probabilities,
MTB > # in this case I think it's safe to assume that whatever the exact p-value is, it's
MTB > # very small and we have strong evidence against independence.
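Aside: the beta approximation just used is easy to sketch in Python; B = RVN/4 is
treated as Beta(alpha, alpha), with alpha given by the same moment-matching formula
used for k32 above:

    from scipy.stats import beta

    n, rvn = 70, 2.91754
    b = rvn / 4
    alpha = (5*n*(n + 1)*(n - 1)**2 / ((n - 2)*(5*n**2 - 2*n - 9)) - 1) / 2
    print("p-value:", 2 * beta.cdf(1 - b, alpha, alpha))  # about 4.8e-5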
MTB > # While I have the data read in, I'll go ahead and do a standard runs test on the total
MTB > # number of runs when the data is dichotomized using the sample median (which is 51.5).
MTB > runs 51.5 c1
yield
K = 51.5000
THE OBSERVED NO. OF RUNS = 59
THE EXPECTED NO. OF RUNS = 36.0000
35 OBSERVATIONS ABOVE K 35 BELOW
THE TEST IS SIGNIFICANT AT 0.0000
MTB > # Even though Minitab didn't supply a precise p-value, it did some of the hard work
MTB > # in determining that the number of runs is 59. Minitab indicates that the approximate
MTB > # p-value rounds to 0.0000, and so I guess we can claim p-value < 0.00005. (The last
MTB > # time I checked, no matter what the sample size is, Minitab uses a normal approximation
MTB > # without a continuity correction to obtain an approximate p-value when doing a runs test.)
MTB > # I'll now use a normal approximation with a continuity correction to obtain a p-value.
MTB > # Since the number of runs observed, 59, exceeds the expected value under the null hypothesis,
MTB > # 36, to get an approximate p-value we can double an upper-tail probability. The approximate
MTB > # p-value is
MTB > # 2 * [ 1 - Phi( z ) ],
MTB > # where
MTB > # z = ( r - 1/2 - E(R) )/sqrt( Var(R) ),
MTB > # where E(R) and Var(R) are the mean and variance under the null hypothesis.
MTB > # Equivalently, the approximate p-value is 2*Phi(-z), which is how I'll get it below.
MTB > # The null mean is 36. I'll store the null variance in k41.
MTB > let k41 = 2*35*35*( 2*35*35 - 35 - 35 )/( 70*70*69 )
MTB > let k42 = ( 36 + 1/2 - 59 )/sqrt(k41)
MTB > cdf k42 k43;
SUBC> norm 0 1.
MTB > let k44 = 2*k43
MTB > name k41 'null var' k42 '-z' k44 'approx p'
MTB > print k41 k42 k44
null var 17.2464
-z -5.41793
approx p 0.000000119
MTB > # Using the continuity correction, the approx. p-value is 1.2 * 10^{-7}. Without the
MTB > # continuity correction, the approx. p-value comes out as 3.1 * 10^{-8}.
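Aside: a Python sketch of the continuity-corrected normal approximation just described
(n1 = n2 = 35 observations above and below the median, and r = 59 observed runs):

    from math import sqrt
    from scipy.stats import norm

    n1 = n2 = 35
    n, r = n1 + n2, 59
    mean_r = 2*n1*n2/n + 1                            # E(R) = 36
    var_r = 2*n1*n2*(2*n1*n2 - n) / (n**2 * (n - 1))  # about 17.246
    z = (r - 0.5 - mean_r) / sqrt(var_r)
    print("approx p:", 2 * norm.cdf(-z))              # about 1.2e-7
    # dropping the 0.5 correction gives about 3.1e-8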
MTB > save 'chem'
Saving worksheet in file: chem.MTW
To do the runs test using StatXact, start up StatXact and
(1) copy and paste the data set (I supply a link to the data file) into the first column of
the Case Data spreadsheet. (If a new Case Data spreadsheet doesn't appear when you start StatXact,
click
File > New > Case Data
to bring up a new Case Data spreadsheet.)
(2) Next click
Nonparametrics > One-Sample Goodness-of-Fit > Runs
to bring up the dialog box for the Runs Test. Var1 should be highlighted in the Variables
box (if not, click it to highlight it).
(3) Click the uppermost Right Arrow to put Var1 in the Response box.
(4) Click the Median button in the Cut-off box.
(5) Click the Exact button in the Compute box.
(6) Finally, click OK.
You should get an exact p-value of about 1.7 * 10^{-8} for a two-sided test (see the 2-Sided
column (the second-to-last column) under P-Value in the output, looking at the Exact (bottom) row).
The two-tailed test p-value is based on | R - E(R) |, the absolute difference between R and E(R).
(Note: Although the StatXact manual (viewable using Help) indicates that a continuity correction
is used for the Asymptotic p-value, this isn't always the case: sometimes a normal approximation
without a continuity correction is used. When I did a runs test with a sample of size 12, the
continuity correction was used for StatXact's asymptotic p-value computation, but when I used data
sets having 60 and 70 observations, it was not.)
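For anyone without StatXact, the exact two-sided p-value can be cross-checked from the standard
exact null distribution of the number of runs; here's a rough Python sketch (the two-sided tail
is taken to be |R - E(R)| >= |59 - E(R)|, matching the description above):

    from math import comb

    n1 = n2 = 35
    n = n1 + n2
    total = comb(n, n1)

    def prob(r):
        # exact null probability of r runs given n1 values above and n2 below the cut-off
        if r % 2 == 0:
            k = r // 2
            return 2 * comb(n1 - 1, k - 1) * comb(n2 - 1, k - 1) / total
        k = (r - 1) // 2
        return (comb(n1 - 1, k - 1) * comb(n2 - 1, k)
                + comb(n1 - 1, k) * comb(n2 - 1, k - 1)) / total

    e_r = 2 * n1 * n2 / n + 1                    # E(R) = 36
    p = sum(prob(r) for r in range(2, n + 1) if abs(r - e_r) >= abs(59 - e_r))
    print(p)                                     # about 1.7e-8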
Software isn't of much help with tests based on runs up and down, except to compute standard
normal cdf values for approximate p-values. With the data under consideration there are 60
runs up and down. (Note: Even though there are tied values in the data, they didn't cause a
problem when determining the runs up and down since no consecutive values were tied.)
The sample size (70) is too large to make use of Table E of G&C, but we can use the normal
approximation. Using the formula at the very bottom of p. 3-27 of the class notes, I obtained
an approximate p-value of 1.6 * 10^{-4}.
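A Python sketch of that normal approximation (using the usual null moments for the number of
runs up and down, E(R) = (2n - 1)/3 and Var(R) = (16n - 29)/90, with a continuity correction;
I'm assuming the class-notes formula is equivalent to this):

    from math import sqrt
    from scipy.stats import norm

    n, runs = 70, 60
    e = (2*n - 1) / 3                 # E(R) = 46.33
    v = (16*n - 29) / 90              # Var(R) = 12.12
    z = (runs - 0.5 - e) / sqrt(v)    # continuity-corrected
    print(2 * norm.cdf(-z))           # about 1.6e-4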
The longest run up or down is of length 3. Although the table on p. 3-28 of the class notes
doesn't give results for n = 70, using the results given for n = 60 and n = 80 it can be concluded
that, for n = 70, P( L >= 3) is between 0.9890 and 0.9977, and P( L <= 3 ) is between 0.3316 and
0.4432. It can be concluded that the length of the longest run up or down observed, 3, is neither
unusually long nor unusually short. (I'll guess that 4 is the most likely outcome under the null
hypothesis, and that 3 is the 2nd most likely outcome.)
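That guess can be checked by simulation; here's a rough Monte Carlo sketch (with run length
counted as the number of consecutive same-sign differences, which I'm assuming matches the
convention of the class-notes table):

    import numpy as np

    rng = np.random.default_rng(0)
    n, reps = 70, 100_000
    counts = {}
    for _ in range(reps):
        s = np.sign(np.diff(rng.random(n)))  # +1 for a step up, -1 for a step down
        longest = run = 1
        for a, b in zip(s[:-1], s[1:]):      # longest stretch of equal signs
            run = run + 1 if a == b else 1
            longest = max(longest, run)
        counts[longest] = counts.get(longest, 0) + 1
    for k in sorted(counts):
        print(k, counts[k] / reps)           # estimated P( L = k )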
I don't know of a good way to do a test based on the length of the longest run of the dichotomized
data. The tables don't cover the n = 70 case, and I don't know of an asymptotic approximation to
use.