Abbreviated R Log of Illustrations of Permutational
and Bootstrap Data Analyses, 5/5/08 ff
=========================================================
(I) SHOE DATA EXAMPLE: PERMUTATIONAL DISTRIBUTION
> shoes = data.frame(boy=rep(1:10,2), trt=rep(c("A","B"), rep(10,2)),
side=c("L","L","R","L","R","L","L","L","R","L","R","R","L","R",
"L","R","R","R","L","R"), wear=c(13.2,8.2,10.9,14.3,10.7,6.6,9.5,
10.8,8.8,13.3,14.0,8.8,11.2,14.2,11.8,6.4,9.8,11.3,9.3,13.6))
> shoes
boy trt side wear ### data from Box, Hunter and Hunter 1978
1 1 A L 13.2 ### discussed on pp.116-118 and p.138
2 2 A L 8.2 ### of Venables and Ripley book.
3 3 A R 10.9
4 4 A L 14.3
5 5 A R 10.7
6 6 A L 6.6
7 7 A L 9.5
8 8 A L 10.8
9 9 A R 8.8
10 10 A L 13.3
11 1 B R 14.0
12 2 B R 8.8
13 3 B L 11.2
14 4 B R 14.2
15 5 B L 11.8
16 6 B R 6.4
17 7 B R 9.8
18 8 B R 11.3
19 9 B L 9.3
20 10 B R 13.6
> aggregate(shoes$wear, by=shoes$trt, mean)
Group x
1 A 10.63
2 B 11.04
Similarly ...
> aggregate(shoes$wear, by=shoes[2:3], mean)
trt side x
1 A L 10.84286
2 B L 10.76667
3 A R 10.13333
4 B R 11.15714
> var.test(shoes$wear[1:10], shoes$wear[11:20])
### F test for variance equality ### as on pp.117-8
...
F = 0.9474, num df = 9, denom df = 9, p-value = 0.9372
...
> t.test(shoes$wear[1:10], shoes$wear[11:20], var=T)
...
t = -0.3689, df = 18, p-value = 0.7165
95 percent confidence interval:
-2.744924 1.924924
### The observed mean difference is 10.63-11.04 = -0.41
### The t-distribution can be used to get the CI (-2.74, 1.92)
## for the difference, but since we don't believe the
## normality, we generate a CI based on the quantiles of
## 1000 randomly sampled orderings of A,B labels among the
## 20 shoe-wear numbers: the sample space is 20 choose 10
## = exp(lgamma(21)-2*lgamma(11)) = 184756, so it would be
## feasible but not necessary to enumerate all of them
## exhaustively .
### PERMUTATIONAL APPROACH: WE WANT TO RECREATE THE DISTRIBUTION
### OF THE T-STAT WITH RESPECT TO PERMUTED RESIDUALS:
> means = c(mean(shoes$wear[1:10]), mean(shoes$wear[11:20]))
resids = c(shoes$wear[1:10]-means[1], shoes$wear[11:20]-means[2])
> newt0 = numeric(1000)
for(i in 1:1000) {
permut = sample(20,20)
newt0[i] = t.test(resids[permut[1:10]],
resids[permut[11:20]], var=T)$stat
}
> hist(newt0, nclass=30, prob=T, xlab="t-stat's", main=
paste("Histogram of t-statistics for Shoe-Data, Permutational",
"\n In terms of Residuals"))
lines(seq(-4,4,length=120), dt(seq(-4,4,length=120),18))
### We can see from this that the perrmutational distribution
### is not bad, but not exactly like t_{18}: its hump is a little
### too small near 0, and there is a sharp bump a little
### to the right of zero, but the behavior farther out it the
### tails is extremely like the t distribution !
THE CONFIDENCE INTERVAL FOR MU_A-MU_B FROM THE ORIGINAL DATA
IS BASED ON THE QUANTILES OF THIS PERMUTATIONAL DISTRIBUTON, AS
FOLLOWS:
> quantile(newt0, c(.025, .05, .95, .975))
2.5% 5.0% 95.0% 97.5%
-2.023708 -1.721438 1.746166 2.171158
### If we wanted them to be symmetric, then we could use:
> quantile(abs(newt0), c(.90, .95))
90% 95%
1.738890 2.093144
### Compare this with t_{18} quantiles:
> qt(c(.025, .05, .95, .975),18)
[1] -2.100922 -1.734064 1.734064 2.100922
## symmetric version was perfectly in tune with t !!
## So the permutationally based 95% CI becomes:
> means[1]-means[2] + c(-1,1)*2.0931*
(means[1]-means[2])/ t.test(shoes$wear[1:10],
shoes$wear[11:20], var=T)$stat
[1] -2.736231 1.916231 ## contrast with previous: (-2.745, 1.925)
### Of course, the whole thing could have been done with a
## larger permutational sample, to approximate more
## closely the true permutational distribution.
## THE PERMUTATIONAL RESULT IS USUALLY PRESENTED AS A
## NULL-HYPOTHESIS CALCULATION OF AN EXTREME-TAIL
## PROBABILITY, IE A P-VALUE:
> 2*(1-pt(abs(t.test(shoes$wear[1:10], shoes$wear[11:20],
var=T)[[1]]),18))
t
0.716498 ### This was the t-dist p-value:
> sum(newt0 <= -0.3689 | newt0 >= 0.3689)/1000
[1] [1] 0.712 ### This is the empirical p-value
### RECALL that the t.test value was -0.3689
----------------------------------------------------
(II) GALAXIES DATA EXAMPLE: PARAMETRIC BOOTSTRAP
Now let's consider how to sample from distributions which
are more complicated, like the Galaxies data, to simulate
more complicated quantities like the median:
> summary(galaxies/1000)
Min. 1st Qu. Median Mean 3rd Qu. Max.
9.172 19.53 20.83 20.83 23.13 34.28
## Our next objective will be to simulate in order to find
## the sampling variablility for data 'like' the Galaxies
## of the median-statistic.
## First idea is: PARAMETRIC BOOTSTRAP
>>> Attempt number 1: let's simulate from the parametric
distribution consisting of mixture of 4 logistic components
>>> For this, we will create 1000 samples of 82 observations from the
previously fitted mixture of logistics, as follows:
### First the component-labels:
> cmpvec = sample(1:4, 82000, replace=T, prob=c(.08531, .26916,
.60933, 0.0362))
> nobs = table(cmpvec)
1 2 3 4
6940 21986 50123 2951
### Next the actual observations:
> mn = c(9.671603, 19.732924, 22.247312,32.974545)
scal = c(0.2531770, 0.3116578, 1.2410406, 0.5674815)
> gobs = numeric(82000)
for(i in 1:4) gobs[cmpvec==i] = rlogis(nobs[i],mn[i],scal[i])
> gobs = matrix(gobs, ncol=1000)
gmed = apply(gobs,2,median)
summary(gmed)
Min. 1st Qu. Median Mean 3rd Qu. Max.
19.86 20.56 20.89 20.91 21.23 22.17
> var(gmed)
[1] 0.2079557
> quantile(gmed,c(.025,.975))
2.5% 97.5%
20.15464 21.83569
### This is a data-generated interval which makes
### a lot of sense as a prediction for where the median
## estimator will fall with 95% Confidence when the
## true median is near median(galxies). But this is not exactly
## interpretable as a confidence interval! It differs in particular
### from the following interval which is unavoidably based on the
### asymptotic large-sample normal-distribution for the median.
> median(galaxies/1000) + c(-1.96,1.96)*sqrt(0.2080)
[1] 19.93960 21.72740
### This is the normal interval: we can see that it is
### off-center, and that is because the histogram of the
### empirically generated sample medians is skewed !!
> hist(gmed, xlab="Galaxy speed / 1000", nclass=30, prob=T,
ylab="Scaled Rel Freq", main=
"Scaled Histogram for Median of Galaxy Speed, n=82")
### It is worthwhile at this point to find the exact median
### of the fitted mixture of logistic distributions:
> gpts = seq(8,35, length=301)
mixlogis = function(x) .08531*plogis(x,mn[1],scal[1]) +
.26916*plogis(x,mn[2],scal[2]) + .60933*
plogis(x,mn[3],scal[3]) +0.0362*plogis(x,mn[4],scal[4])
dfInv = function(z) predict(smooth.spline(mixlogis(gpts),gpts,
spar=1.e-6), z)$y
> dfInv(.5)
[1] 20.88156 ### compare with empirical median 20.83
### Here was another way to figure this median.
> prb <- c(.08531, .26916, .60933, 0.0362)
uniroot(function(x) {
tmat <- matrix(plogis(outer(mn,x,"-")/scal), nrow=4)
c(prb %*% tmat)-.5 }, c(17,23))$root
[1] 20.88154
------------------------------------------------------------
### We pursue the CI discussion a little further:
### as explained in class, T(Fhat_n) - med_g is approximately the
### same distributionally as T(Fboot_n) - T(Fapprox),
### where the Fapprox is taken to be the estimated mixture
### distribution in the parametric-bootstrap case (and simply the
### empirical distribution function in the nonparametric bootstrap
### case below) which leads in the present case to the CI, as
### follows.
## We found T(Fapprox) = 20.88155 in two ways above.
> median(galaxies/1000) + 20.88155 - c(21.83569, 20.15464)
[1] 19.879 21.560
### This one is a lot closer to the normal
### interval !! This would be our recommended CI based on Parametric
### bootstrap !
>>> Attempt number 2: let's simulate from a spline-smoothed
distribution estimate. This is actually easy to do when we
recall that the vector rnkx of 50 empirical distribution
function values corresponded to x-values pts. We directly
approximate the inverse function by a smoothing spline:
> galxinv <- smooth.spline(rnkx, pts, spar=1e-6)
## Then generate 82000 pseudo-random galaxy-speed observations
## from this smoothed empirical distribution:
> gobs2 <- matrix(predict.smooth.spline(galxinv,runif(82000))$y,
ncol=82)
gmed2 <- apply(gobs2,1,median)
> summary(gmed2)
Min. 1st Qu. Median Mean 3rd Qu. Max.
20 20.36 20.62 20.82 21.21 22.33
### Almost identical to previous value:
> quantile(gmed2,c(.025,.975))
2.5% 97.5%
20.13287 22.00272
> hist(gmed2, nclass=30, prob=T)
### Although the quantiles of this median distribution
### are somewhat similar to the previous ones, the shape
### of the histogram is quite different !!!
### Now the median of the resulting approximates distribution is
> predict.smooth.spline(galxinv,0.5)$y
[1] 20.84478
### The resulting CI is : 20.8335 + 20.84478 - c(22.00272, 20.13287)
### = (19.67556, 21.54541)
>>> Attempt number 3: let's simulate from the kernel-density
estimator (logistic kernel, bandwidth 0.2). The idea is first
to simulate a set of 82000 indices 1:82 to correspond with the
individual logistics being superposed, then to simulate the logistics
with the corresponding galaxies values as means, and 0.2 as scales:
> glxinds <- sample(1:82, 82000, replace=T)
gobs3 <- matrix(galaxies[glxinds]/1000 + rlogis(82000)*0.2, ncol=82)
gmed3 <- apply(gobs3,1,median)
> hist(gmed3, nclass=30, prob=T)
### This again looks very much like the logistic-mixture
### distribution from the parametric bootstrap.
> summary(gmed3)
Min. 1st Qu. Median Mean 3rd Qu. Max.
19.69 20.57 20.85 20.91 21.2 22.44
> quantile(gmed3,c(.025,.975))
2.5% 97.5%
20.15532 21.9298 ### Again much as before !!!
### Now get median for kernel-approximated distribution as:
> uniroot( function(x) {
kerdf <- outer(x,galaxies, function(x,y) plogis(x,galaxies,0.2))
c(kerdf %*% rep(1/82,82)) - 0.5}, c(17,23))$root
[1] 20.84609
### The resulting CI is : 20.8335 + 20.84609 - c(21.9298, 20.15532)
### = (19.74979, 21.52427)
============================================================
(III) ONE IDEA OF THE NONPARAMETRIC BOOTSTRAP IS AS THE
LIMIT OF THIS LAST SIMULATION FROM THE KERNEL-DENSITY ESTIMATOR,
IN THE LIMIT AS BANDWIDTH GOES TO 0 !!!
Let's implement this idea for the galaxies data. It is actually more
difficult to imagine a proper bootstrap for the shoes, as we
discussed in class.
Now the galaxies, 1000 iterations with bootstrap sample of size 82:
> bootgalx <- matrix(sample(galaxies,82000, replace=T), ncol=82)
mdgalx <- apply(bootgalx,1,median)
> motif()
> hist(mdgalx-median(galaxies), nclass=30)
### Plotted histogram shows very spread-out and non-normal
distribution!
### Recall that the idea of bootstrap is to view this distribution
as essentially the same as that of the sample median minus the
true median. Therefore the 95% CI for the median is:
> median(galaxies) - quantile(mdgalx-median(galaxies), c(.975,.025))
97.5% 2.5%
19.72601 21.492 ### pretty close to the confidence intervals
### generated via "Parametric Bootstraps" above.
Recall that the parametric mixed-logistic bootstrap gave:
(19.81710, 21.47235)
and the `parametric' spline bootstrap gave:
(19.664, 21.534)
and the `parametric' kernel-density (logistic, bandwidth .2)
bootstrap gave:
(19.737, 21.512)
Another run of the nonparametric bootstrap gives:
> median(galaxies) - quantile(apply(matrix(sample(galaxies,82000,
replace=T), ncol=82),1,median)-median(galaxies), c(.975,.025))
97.5% 2.5%
19.707 21.49661
### This is quite similar to the previous intervals, and may be easier
to recommend because it depends on fewer assumptions !
------------------------------------------------------
The shoe-dataset example is slightly more difficult because a fair
re-sampling can be done not by permuting randomly as we did in the
(null-hypothesis) permutational setting, but rather by re-sampling
within each group ! As in regression problems more generally, it is
not the actual measured observations which can (in non-null settings)
be treated as permutable or `exchangeable', but rather the residuals!!
Here is a bootstrap for the shoes:
> bootshoe <- rbind(matrix(shoes$wear[sample(1:10,10000, replace=T)],
ncol=1000), matrix(shoes$wear[sample(11:20,10000, replace=T)],
ncol=1000))
> difsamp <- c(c(rep(.1,10),rep(-.1,10)) %*% bootshoe)
> summary(difsamp)
Min. 1st Qu. Median Mean 3rd Qu. Max.
-3.66000 -1.11250 -0.42000 -0.43613 0.21000 2.69000
> c(2*mean(shoes$wear[1:10]-shoes$wear[11:20])-quantile(
difsamp,c(.975,.025)))
[1] -2.46025 1.60025
### This should be contrasted with the previous intervals we found:
the permutational interval based on studentized differences
( -2.662879 1.842879)
and the t-distribution CI
( -2.7449, 1.9249)
### In this setting, with such a small sample-size, it is not clear
### that the nonparametric bootstrap is reliable: of these
### intervals, probably the permutational one makes most sense.