Abbreviated R Log of Illustrations of Permutational and Bootstrap
Data Analyses, 5/5/08 ff
=========================================================

(I) SHOE DATA EXAMPLE: PERMUTATIONAL DISTRIBUTION

> shoes = data.frame(boy=rep(1:10,2), trt=rep(c("A","B"), rep(10,2)),
     side=c("L","L","R","L","R","L","L","L","R","L","R","R","L","R",
            "L","R","R","R","L","R"),
     wear=c(13.2,8.2,10.9,14.3,10.7,6.6,9.5,10.8,8.8,13.3,14.0,8.8,
            11.2,14.2,11.8,6.4,9.8,11.3,9.3,13.6))
> shoes
   boy trt side wear     ### data from Box, Hunter and Hunter 1978,
1    1   A    L 13.2     ### discussed on pp.116-118 and p.138
2    2   A    L  8.2     ### of the Venables and Ripley book.
3    3   A    R 10.9
4    4   A    L 14.3
5    5   A    R 10.7
6    6   A    L  6.6
7    7   A    L  9.5
8    8   A    L 10.8
9    9   A    R  8.8
10  10   A    L 13.3
11   1   B    R 14.0
12   2   B    R  8.8
13   3   B    L 11.2
14   4   B    R 14.2
15   5   B    L 11.8
16   6   B    R  6.4
17   7   B    R  9.8
18   8   B    R 11.3
19   9   B    L  9.3
20  10   B    R 13.6

> aggregate(shoes$wear, by=list(trt=shoes$trt), mean)
  trt     x
1   A 10.63
2   B 11.04

Similarly ...

> aggregate(shoes$wear, by=shoes[2:3], mean)
  trt side        x
1   A    L 10.84286
2   B    L 10.76667
3   A    R 10.13333
4   B    R 11.15714

> var.test(shoes$wear[1:10], shoes$wear[11:20])  ### F test for equality
...                                              ### of variances, as on
F = 0.9474, num df = 9, denom df = 9, p-value = 0.9372    ### pp.117-8
...
> t.test(shoes$wear[1:10], shoes$wear[11:20], var.equal=T)
...
t = -0.3689, df = 18, p-value = 0.7165
95 percent confidence interval:
 -2.744924  1.924924

### The observed mean difference is 10.63 - 11.04 = -0.41 .
### The t-distribution can be used to get the CI (-2.74, 1.92)
##  for the difference, but since we do not believe the
##  normality, we generate a CI based on the quantiles of
##  1000 randomly sampled orderings of A,B labels among the
##  20 shoe-wear numbers: the sample space has choose(20,10)
##  = exp(lgamma(21)-2*lgamma(11)) = 184756 points, so it would
##  be feasible, but is not necessary, to enumerate all of them
##  exhaustively.

### PERMUTATIONAL APPROACH: WE WANT TO RECREATE THE DISTRIBUTION
### OF THE T-STAT WITH RESPECT TO PERMUTED RESIDUALS:

> means = c(mean(shoes$wear[1:10]), mean(shoes$wear[11:20]))
  resids = c(shoes$wear[1:10]-means[1], shoes$wear[11:20]-means[2])
> newt0 = numeric(1000)
  for(i in 1:1000) {
     permut = sample(20,20)
     newt0[i] = t.test(resids[permut[1:10]], resids[permut[11:20]],
                       var.equal=T)$statistic }
> hist(newt0, nclass=30, prob=T, xlab="t-stat's", main=
     paste("Histogram of t-statistics for Shoe-Data, Permutational",
           "\n In terms of Residuals"))
  lines(seq(-4,4,length=120), dt(seq(-4,4,length=120),18))

### We can see from this that the permutational distribution
### is not bad, but not exactly like t_{18}: its hump is a little
### too small near 0, and there is a sharp bump a little to the
### right of zero, but the behavior farther out in the tails is
### extremely close to the t distribution !

THE CONFIDENCE INTERVAL FOR MU_A - MU_B FROM THE ORIGINAL DATA IS
BASED ON THE QUANTILES OF THIS PERMUTATIONAL DISTRIBUTION, AS FOLLOWS:

> quantile(newt0, c(.025, .05, .95, .975))
      2.5%       5.0%      95.0%      97.5%
 -2.023708  -1.721438   1.746166   2.171158

### If we wanted them to be symmetric, then we could use:

> quantile(abs(newt0), c(.90, .95))
      90%      95%
 1.738890 2.093144

### Compare this with the t_{18} quantiles:

> qt(c(.025, .05, .95, .975), 18)
[1] -2.100922 -1.734064  1.734064  2.100922

##  The symmetric version is almost perfectly in tune with t !!
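### ASIDE (a sketch, not part of the original log): since the sample
### space has only choose(20,10) = 184756 points, the permutational
### distribution of the t-statistic could also be enumerated exactly
### with combn(), at the cost of a minute or two of computation; its
### exact symmetric critical value could then replace the Monte Carlo
### value 2.093144 used below.

> grpA = combn(20, 10)     ## 10 x 184756 matrix of all possible A-groups
  allt = apply(grpA, 2, function(ix)
      t.test(resids[ix], resids[-ix], var.equal=T)$statistic)
> quantile(abs(allt), c(.90, .95))   ## exact symmetric critical values

### In what follows we stay with the 1000-sample Monte Carlo quantiles.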
##  So the permutationally based 95% CI becomes:

> means[1]-means[2] + c(-1,1)*2.0931*(means[1]-means[2])/
     t.test(shoes$wear[1:10], shoes$wear[11:20], var.equal=T)$statistic
[1] -2.736231  1.916231     ## contrast with previous: (-2.745, 1.925)

### Of course, the whole thing could have been done with a larger
##  permutational sample, to approximate the true permutational
##  distribution more closely.

##  THE PERMUTATIONAL RESULT IS USUALLY PRESENTED AS A
##  NULL-HYPOTHESIS CALCULATION OF AN EXTREME-TAIL
##  PROBABILITY, I.E. A P-VALUE:

> 2*(1-pt(abs(t.test(shoes$wear[1:10], shoes$wear[11:20],
       var.equal=T)$statistic), 18))
       t
0.716498               ### This was the t-dist p-value.

> sum(newt0 <= -0.3689 | newt0 >= 0.3689)/1000
[1] 0.712              ### This is the empirical p-value.
### RECALL that the observed t-statistic was -0.3689 .

----------------------------------------------------

(II) GALAXIES DATA EXAMPLE: PARAMETRIC BOOTSTRAP

Now let's consider how to sample from distributions which are more
complicated, like the Galaxies data, to simulate more complicated
quantities like the median:

> library(MASS)        ## the galaxies data come from the MASS package
> summary(galaxies/1000)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  9.172   19.53   20.83   20.83   23.13   34.28

##  Our next objective will be to simulate in order to find the
##  sampling variability of the median statistic for data 'like'
##  the galaxies. The first idea is the PARAMETRIC BOOTSTRAP.

>>> Attempt number 1: let's simulate from the parametric distribution
    consisting of a mixture of 4 logistic components.

>>> For this, we will create 1000 samples of 82 observations from the
    previously fitted mixture of logistics, as follows:

### First the component labels:

> cmpvec = sample(1:4, 82000, replace=T, prob=c(.08531, .26916,
      .60933, .0362))
> nobs = table(cmpvec)
    1     2     3     4
 6940 21986 50123  2951

### Next the actual observations:

> mn = c(9.671603, 19.732924, 22.247312, 32.974545)
  scal = c(0.2531770, 0.3116578, 1.2410406, 0.5674815)
> gobs = numeric(82000)
  for(i in 1:4) gobs[cmpvec==i] = rlogis(nobs[i], mn[i], scal[i])
> gobs = matrix(gobs, ncol=1000)
  gmed = apply(gobs, 2, median)
  summary(gmed)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  19.86   20.56   20.89   20.91   21.23   22.17
> var(gmed)
[1] 0.2079557
> quantile(gmed, c(.025,.975))
    2.5%    97.5%
20.15464 21.83569

### This is a data-generated interval which makes a lot of sense as
### a prediction for where the median estimator will fall with 95%
##  confidence when the true median is near median(galaxies). But
##  this is not exactly interpretable as a confidence interval! It
### differs in particular from the following interval, which is
### unavoidably based on the asymptotic large-sample normal
### distribution for the median:

> median(galaxies/1000) + c(-1.96,1.96)*sqrt(0.2080)
[1] 19.93960 21.72740

### This is the normal interval: we can see that it is off-center,
### and that is because the histogram of the empirically generated
### sample medians is skewed !!

> hist(gmed, xlab="Galaxy speed / 1000", nclass=30, prob=T,
     ylab="Scaled Rel Freq", main=
     "Scaled Histogram for Median of Galaxy Speed, n=82")

### It is worthwhile at this point to find the exact median
### of the fitted mixture of logistic distributions:

> gpts = seq(8, 35, length=301)
  mixlogis = function(x) .08531*plogis(x,mn[1],scal[1]) +
      .26916*plogis(x,mn[2],scal[2]) + .60933*plogis(x,mn[3],scal[3]) +
      .0362*plogis(x,mn[4],scal[4])
  dfInv = function(z) predict(smooth.spline(mixlogis(gpts), gpts,
      spar=1.e-6), z)$y
> dfInv(.5)
[1] 20.88156     ### compare with the empirical median 20.83
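### ASIDE (a sketch, not part of the original log): the two-stage
### simulation above can be collapsed into a single vectorized draw,
### since rlogis() recycles vector-valued location and scale
### arguments elementwise:

> prb = c(.08531, .26916, .60933, .0362)
  rmixlogis = function(n) {
     k = sample(1:4, n, replace=T, prob=prb)  ## component label per draw
     rlogis(n, mn[k], scal[k]) }              ## one logistic draw per label
> gmed.alt = apply(matrix(rmixlogis(82000), ncol=1000), 2, median)

### gmed.alt reproduces the distribution of gmed above, up to
### Monte Carlo error.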
### Here is another way to figure this mixture median:

> prb <- c(.08531, .26916, .60933, .0362)
  uniroot(function(x) {
      tmat <- matrix(plogis(outer(mn, x, "-")/scal), nrow=4)
      c(prb %*% tmat) - .5 }, c(17,23))$root
[1] 20.88154

------------------------------------------------------------

### We pursue the CI discussion a little further: as explained in
### class, T(Fhat_n) - med_F (the estimator minus the true median)
### has approximately the same distribution as T(Fboot_n) - T(Fapprox),
### where Fapprox is taken to be the estimated mixture distribution
### in the parametric-bootstrap case (and simply the empirical
### distribution function in the nonparametric bootstrap case below).
### In the present case this leads to the following CI.

##  We found T(Fapprox) = 20.88155 in two ways above.

> median(galaxies/1000) + 20.88155 - c(21.83569, 20.15464)
[1] 19.879 21.560

### This one is a lot closer to the normal interval !! This would be
### our recommended CI based on the parametric bootstrap !

>>> Attempt number 2: let's simulate from a spline-smoothed
    distribution estimate. This is actually easy to do when we recall
    (from the earlier density-estimation log) that the vector rnkx of
    50 empirical distribution function values corresponded to the
    x-values pts. We directly approximate the inverse distribution
    function by a smoothing spline:

> galxinv <- smooth.spline(rnkx, pts, spar=1e-6)

##  Then generate 82000 pseudo-random galaxy-speed observations
##  from this smoothed empirical distribution:

> gobs2 <- matrix(predict(galxinv, runif(82000))$y, ncol=82)
  gmed2 <- apply(gobs2, 1, median)
> summary(gmed2)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
     20   20.36   20.62   20.82   21.21   22.33

### Roughly similar to the previous summary:

> quantile(gmed2, c(.025,.975))
    2.5%    97.5%
20.13287 22.00272

> hist(gmed2, nclass=30, prob=T)

### Although the quantiles of this median distribution are somewhat
### similar to the previous ones, the shape of the histogram is
### quite different !!!

### Now the median of the resulting approximating distribution is

> predict(galxinv, 0.5)$y
[1] 20.84478

### The resulting CI is : 20.8335 + 20.84478 - c(22.00272, 20.13287)
###                     = (19.67556, 21.54541)

>>> Attempt number 3: let's simulate from the kernel-density
    estimator (logistic kernel, bandwidth 0.2). The idea is first to
    simulate a set of 82000 indices 1:82 to correspond with the
    individual logistics being superposed, then to simulate the
    logistics with the corresponding galaxies values as means and
    0.2 as scales:

> glxinds <- sample(1:82, 82000, replace=T)
  gobs3 <- matrix(galaxies[glxinds]/1000 + rlogis(82000)*0.2, ncol=82)
  gmed3 <- apply(gobs3, 1, median)
> hist(gmed3, nclass=30, prob=T)

### This again looks very much like the logistic-mixture
### distribution from the parametric bootstrap.

> summary(gmed3)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  19.69   20.57   20.85   20.91   21.20   22.44
> quantile(gmed3, c(.025,.975))
    2.5%    97.5%
20.15532  21.9298

### Again much as before !!!
### Now get the median of the kernel-approximated distribution
### (the kernel CDF at x is the average of the 82 logistic CDF's):

> uniroot(function(x) mean(plogis(x, galaxies/1000, 0.2)) - 0.5,
      c(17,23))$root
[1] 20.84609

### The resulting CI is : 20.8335 + 20.84609 - c(21.9298, 20.15532)
###                     = (19.74979, 21.52427)

============================================================

(III) ONE IDEA OF THE NONPARAMETRIC BOOTSTRAP IS AS THE LIMIT OF THIS
LAST SIMULATION FROM THE KERNEL-DENSITY ESTIMATOR, IN THE LIMIT AS
THE BANDWIDTH GOES TO 0 !!!

Let's implement this idea for the galaxies data. It is actually more
difficult to imagine a proper bootstrap for the shoes, as we
discussed in class.
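### ASIDE (a sketch, not part of the original log): the limiting
### connection can be seen numerically by shrinking the kernel
### bandwidth toward 0 in the Attempt-3 simulation; at bandwidth 0
### the draws are exactly a with-replacement resample of the data.

> for (bw in c(0.2, 0.05, 0.01, 0)) {
     ix  = sample(1:82, 82000, replace=T)
     smp = matrix(galaxies[ix]/1000 + rlogis(82000)*bw, ncol=82)
     cat("bw =", bw, "  2.5%/97.5% quantiles of medians:",
         round(quantile(apply(smp,1,median), c(.025,.975)), 3), "\n") }

### The quantiles stabilize as bw decreases, approaching those of
### the nonparametric bootstrap implemented next.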
Now the galaxies, 1000 iterations with bootstrap samples of size 82:

> bootgalx <- matrix(sample(galaxies/1000, 82000, replace=T), ncol=82)
  mdgalx <- apply(bootgalx, 1, median)
> hist(mdgalx - median(galaxies/1000), nclass=30)

### The plotted histogram shows a very spread-out and non-normal
### distribution! Recall that the idea of the bootstrap is to view
### this distribution as essentially the same as that of the sample
### median minus the true median. Therefore the 95% CI for the
### median is:

> median(galaxies/1000) - quantile(mdgalx - median(galaxies/1000),
      c(.975,.025))
   97.5%     2.5%
19.72601   21.492

### pretty close to the confidence intervals generated via the
### "parametric bootstraps" above. Recall that

the parametric mixed-logistic bootstrap gave:  (19.879, 21.560) ,
the `parametric' spline bootstrap gave:        (19.676, 21.545) ,
and the `parametric' kernel-density (logistic, bandwidth .2)
bootstrap gave:                                (19.750, 21.524) .

Another run of the nonparametric bootstrap gives:

> median(galaxies/1000) - quantile(apply(matrix(sample(galaxies/1000,
      82000, replace=T), ncol=82), 1, median) - median(galaxies/1000),
      c(.975,.025))
   97.5%     2.5%
  19.707 21.49661

### This is quite similar to the previous intervals, and may be
### easier to recommend because it depends on fewer assumptions !

------------------------------------------------------

The shoe-dataset example is slightly more difficult, because a fair
re-sampling cannot be done by randomly permuting labels as we did in
the (null-hypothesis) permutational setting, but must instead
re-sample within each treatment group! As in regression problems
more generally, it is not the actual measured observations which can
(in non-null settings) be treated as permutable or `exchangeable',
but rather the residuals!!

Here is a bootstrap for the shoes:

> bootshoe <- rbind(
     matrix(shoes$wear[sample(1:10, 10000, replace=T)], ncol=1000),
     matrix(shoes$wear[sample(11:20, 10000, replace=T)], ncol=1000))
> difsamp <- c(c(rep(.1,10), rep(-.1,10)) %*% bootshoe)
> summary(difsamp)
     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
 -3.66000 -1.11250 -0.42000 -0.43613  0.21000  2.69000
> c(2*mean(shoes$wear[1:10] - shoes$wear[11:20]) -
      quantile(difsamp, c(.975,.025)))
[1] -2.46025  1.60025

### This should be contrasted with the previous intervals we found:
### the permutational interval based on studentized differences,
### (-2.736, 1.916), and the t-distribution CI, (-2.7449, 1.9249).

### In this setting, with such a small sample size, it is not clear
### that the nonparametric bootstrap is reliable: of these
### intervals, probably the permutational one makes the most sense.
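### ASIDE (a sketch, not part of the original log; it assumes the
### recommended package 'boot' is installed): the same within-group
### resampling can be carried out with boot() using its strata
### argument, and boot.ci() with type="basic" applies the same
### 2*estimate - quantiles construction used by hand above.

> library(boot)
  meandiff = function(w, ix) {           ## statistic for boot():
     ws = w[ix]                          ## ix respects the strata, so
     mean(ws[1:10]) - mean(ws[11:20]) }  ## positions 1-10 stay group A
  bs = boot(shoes$wear, meandiff, R=1000, strata=rep(1:2, each=10))
> boot.ci(bs, type="basic")

### Up to Monte Carlo error this reproduces the interval
### (-2.46, 1.60) found above.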