Abbreviated R Log of Illustrations of Permutational and Bootstrap
Data Analyses, 5/5/08 ff
=========================================================

(I) SHOE DATA EXAMPLE: PERMUTATIONAL DISTRIBUTION

> shoes = data.frame(boy=rep(1:10,2), trt=rep(c("A","B"), rep(10,2)),
     side=c("L","L","R","L","R","L","L","L","R","L","R","R","L","R",
            "L","R","R","R","L","R"),
     wear=c(13.2,8.2,10.9,14.3,10.7,6.6,9.5,10.8,8.8,13.3,14.0,8.8,
            11.2,14.2,11.8,6.4,9.8,11.3,9.3,13.6))
> shoes
   boy trt side wear     ### data from Box, Hunter and Hunter 1978,
1    1   A    L 13.2     ### discussed on pp.116-118 and p.138
2    2   A    L  8.2     ### of the Venables and Ripley book.
3    3   A    R 10.9
4    4   A    L 14.3
5    5   A    R 10.7
6    6   A    L  6.6
7    7   A    L  9.5
8    8   A    L 10.8
9    9   A    R  8.8
10  10   A    L 13.3
11   1   B    R 14.0
12   2   B    R  8.8
13   3   B    L 11.2
14   4   B    R 14.2
15   5   B    L 11.8
16   6   B    R  6.4
17   7   B    R  9.8
18   8   B    R 11.3
19   9   B    L  9.3
20  10   B    R 13.6

> aggregate(shoes$wear, by=list(trt=shoes$trt), mean)
  trt     x
1   A 10.63
2   B 11.04

Similarly ...

> aggregate(shoes$wear, by=shoes[2:3], mean)
  trt side        x
1   A    L 10.84286
2   B    L 10.76667
3   A    R 10.13333
4   B    R 11.15714

> var.test(shoes$wear[1:10], shoes$wear[11:20])  ### F test for equality
...                                              ### of variances, as on
F = 0.9474, num df = 9, denom df = 9, p-value = 0.9372    ### pp.117-8
...
> t.test(shoes$wear[1:10], shoes$wear[11:20], var.equal=T)
...
t = -0.3689, df = 18, p-value = 0.7165
95 percent confidence interval:
 -2.744924  1.924924

### The observed mean difference is 10.63 - 11.04 = -0.41 .
### The t-distribution can be used to get the CI (-2.74, 1.92)
##  for the difference, but since we do not believe the
##  normality, we generate a CI based on the quantiles of
##  1000 randomly sampled orderings of A,B labels among the
##  20 shoe-wear numbers: the sample space has choose(20,10)
##  = exp(lgamma(21)-2*lgamma(11)) = 184756 points, so it would
##  be feasible, but is not necessary, to enumerate all of them
##  exhaustively.

### PERMUTATIONAL APPROACH: WE WANT TO RECREATE THE DISTRIBUTION
### OF THE T-STAT WITH RESPECT TO PERMUTED RESIDUALS:

> means = c(mean(shoes$wear[1:10]), mean(shoes$wear[11:20]))
  resids = c(shoes$wear[1:10]-means[1], shoes$wear[11:20]-means[2])
> newt0 = numeric(1000)
  for(i in 1:1000) {
     permut = sample(20,20)
     newt0[i] = t.test(resids[permut[1:10]], resids[permut[11:20]],
                       var.equal=T)$statistic }
> hist(newt0, nclass=30, prob=T, xlab="t-stat's", main=
     paste("Histogram of t-statistics for Shoe-Data, Permutational",
           "\n In terms of Residuals"))
  lines(seq(-4,4,length=120), dt(seq(-4,4,length=120),18))

### We can see from this that the permutational distribution
### is not bad, but not exactly like t_{18}: its hump is a little
### too small near 0, and there is a sharp bump a little to the
### right of zero, but the behavior farther out in the tails is
### extremely close to the t distribution !

THE CONFIDENCE INTERVAL FOR MU_A - MU_B FROM THE ORIGINAL DATA IS
BASED ON THE QUANTILES OF THIS PERMUTATIONAL DISTRIBUTION, AS FOLLOWS:

> quantile(newt0, c(.025, .05, .95, .975))
      2.5%       5.0%      95.0%      97.5%
 -2.023708  -1.721438   1.746166   2.171158

### If we wanted them to be symmetric, then we could use:

> quantile(abs(newt0), c(.90, .95))
      90%      95%
 1.738890 2.093144

### Compare this with the t_{18} quantiles:

> qt(c(.025, .05, .95, .975), 18)
[1] -2.100922 -1.734064  1.734064  2.100922

##  The symmetric version is almost perfectly in tune with t !!
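### ASIDE (a sketch, not part of the original log): since the sample
### space has only choose(20,10) = 184756 points, the permutational
### distribution of the t-statistic could also be enumerated exactly
### with combn(), at the cost of a minute or two of computation; its
### exact symmetric critical value could then replace the Monte Carlo
### value 2.093144 used below.

> grpA = combn(20, 10)     ## 10 x 184756 matrix of all possible A-groups
  allt = apply(grpA, 2, function(ix)
      t.test(resids[ix], resids[-ix], var.equal=T)$statistic)
> quantile(abs(allt), c(.90, .95))   ## exact symmetric critical values

### In what follows we stay with the 1000-sample Monte Carlo quantiles.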
##  So the permutationally based 95% CI becomes:

> means[1]-means[2] + c(-1,1)*2.0931*(means[1]-means[2])/
     t.test(shoes$wear[1:10], shoes$wear[11:20], var.equal=T)$statistic
[1] -2.736231  1.916231     ## contrast with previous: (-2.745, 1.925)

### Of course, the whole thing could have been done with a larger
##  permutational sample, to approximate the true permutational
##  distribution more closely.

##  THE PERMUTATIONAL RESULT IS USUALLY PRESENTED AS A
##  NULL-HYPOTHESIS CALCULATION OF AN EXTREME-TAIL
##  PROBABILITY, I.E. A P-VALUE:

> 2*(1-pt(abs(t.test(shoes$wear[1:10], shoes$wear[11:20],
       var.equal=T)$statistic), 18))
       t
0.716498               ### This was the t-dist p-value.

> sum(newt0 <= -0.3689 | newt0 >= 0.3689)/1000
[1] 0.712              ### This is the empirical p-value.
### RECALL that the observed t-statistic was -0.3689 .

----------------------------------------------------

(II) GALAXIES DATA EXAMPLE: PARAMETRIC BOOTSTRAP

Now let's consider how to sample from distributions which are more
complicated, like the Galaxies data, to simulate more complicated
quantities like the median:

> library(MASS)        ## the galaxies data come from the MASS package
> summary(galaxies/1000)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  9.172   19.53   20.83   20.83   23.13   34.28

##  Our next objective will be to simulate in order to find the
##  sampling variability of the median statistic for data 'like'
##  the galaxies. The first idea is the PARAMETRIC BOOTSTRAP.

>>> Attempt number 1: let's simulate from the parametric distribution
    consisting of a mixture of 4 logistic components.

>>> For this, we will create 1000 samples of 82 observations from the
    previously fitted mixture of logistics, as follows:

### First the component labels:

> cmpvec = sample(1:4, 82000, replace=T, prob=c(.08531, .26916,
      .60933, .0362))
> nobs = table(cmpvec)
    1     2     3     4
 6940 21986 50123  2951

### Next the actual observations:

> mn = c(9.671603, 19.732924, 22.247312, 32.974545)
  scal = c(0.2531770, 0.3116578, 1.2410406, 0.5674815)
> gobs = numeric(82000)
  for(i in 1:4) gobs[cmpvec==i] = rlogis(nobs[i], mn[i], scal[i])
> gobs = matrix(gobs, ncol=1000)
  gmed = apply(gobs, 2, median)
  summary(gmed)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  19.86   20.56   20.89   20.91   21.23   22.17
> var(gmed)
[1] 0.2079557
> quantile(gmed, c(.025,.975))
    2.5%    97.5%
20.15464 21.83569

### This is a data-generated interval which makes a lot of sense as
### a prediction for where the median estimator will fall with 95%
##  confidence when the true median is near median(galaxies). But
##  this is not exactly interpretable as a confidence interval! It
### differs in particular from the following interval, which is
### unavoidably based on the asymptotic large-sample normal
### distribution for the median:

> median(galaxies/1000) + c(-1.96,1.96)*sqrt(0.2080)
[1] 19.93960 21.72740

### This is the normal interval: we can see that it is off-center,
### and that is because the histogram of the empirically generated
### sample medians is skewed !!

> hist(gmed, xlab="Galaxy speed / 1000", nclass=30, prob=T,
     ylab="Scaled Rel Freq", main=
     "Scaled Histogram for Median of Galaxy Speed, n=82")

### It is worthwhile at this point to find the exact median
### of the fitted mixture of logistic distributions:

> gpts = seq(8, 35, length=301)
  mixlogis = function(x) .08531*plogis(x,mn[1],scal[1]) +
      .26916*plogis(x,mn[2],scal[2]) + .60933*plogis(x,mn[3],scal[3]) +
      .0362*plogis(x,mn[4],scal[4])
  dfInv = function(z) predict(smooth.spline(mixlogis(gpts), gpts,
      spar=1.e-6), z)$y
> dfInv(.5)
[1] 20.88156     ### compare with the empirical median 20.83
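### ASIDE (a sketch, not part of the original log): the two-stage
### simulation above can be collapsed into a single vectorized draw,
### since rlogis() recycles vector-valued location and scale
### arguments elementwise:

> prb = c(.08531, .26916, .60933, .0362)
  rmixlogis = function(n) {
     k = sample(1:4, n, replace=T, prob=prb)  ## component label per draw
     rlogis(n, mn[k], scal[k]) }              ## one logistic draw per label
> gmed.alt = apply(matrix(rmixlogis(82000), ncol=1000), 2, median)

### gmed.alt reproduces the distribution of gmed above, up to
### Monte Carlo error.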
### Here is another way to figure this mixture median:

> prb <- c(.08531, .26916, .60933, .0362)
  uniroot(function(x) {
      tmat <- matrix(plogis(outer(mn, x, "-")/scal), nrow=4)
      c(prb %*% tmat) - .5 }, c(17,23))$root
[1] 20.88154

------------------------------------------------------------

### We pursue the CI discussion a little further: as explained in
### class, T(Fhat_n) - med_F (the estimator minus the true median)
### has approximately the same distribution as T(Fboot_n) - T(Fapprox),
### where Fapprox is taken to be the estimated mixture distribution
### in the parametric-bootstrap case (and simply the empirical
### distribution function in the nonparametric bootstrap case below).
### In the present case this leads to the following CI.

##  We found T(Fapprox) = 20.88155 in two ways above.

> median(galaxies/1000) + 20.88155 - c(21.83569, 20.15464)
[1] 19.879 21.560

### This one is a lot closer to the normal interval !! This would be
### our recommended CI based on the parametric bootstrap !

>>> Attempt number 2: let's simulate from a spline-smoothed
    distribution estimate. This is actually easy to do when we recall
    (from the earlier density-estimation log) that the vector rnkx of
    50 empirical distribution function values corresponded to the
    x-values pts. We directly approximate the inverse distribution
    function by a smoothing spline:

> galxinv <- smooth.spline(rnkx, pts, spar=1e-6)

##  Then generate 82000 pseudo-random galaxy-speed observations
##  from this smoothed empirical distribution:

> gobs2 <- matrix(predict(galxinv, runif(82000))$y, ncol=82)
  gmed2 <- apply(gobs2, 1, median)
> summary(gmed2)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
     20   20.36   20.62   20.82   21.21   22.33

### Roughly similar to the previous summary:

> quantile(gmed2, c(.025,.975))
    2.5%    97.5%
20.13287 22.00272

> hist(gmed2, nclass=30, prob=T)

### Although the quantiles of this median distribution are somewhat
### similar to the previous ones, the shape of the histogram is
### quite different !!!

### Now the median of the resulting approximating distribution is

> predict(galxinv, 0.5)$y
[1] 20.84478

### The resulting CI is : 20.8335 + 20.84478 - c(22.00272, 20.13287)
###                     = (19.67556, 21.54541)

>>> Attempt number 3: let's simulate from the kernel-density
    estimator (logistic kernel, bandwidth 0.2). The idea is first to
    simulate a set of 82000 indices 1:82 to correspond with the
    individual logistics being superposed, then to simulate the
    logistics with the corresponding galaxies values as means and
    0.2 as scales:

> glxinds <- sample(1:82, 82000, replace=T)
  gobs3 <- matrix(galaxies[glxinds]/1000 + rlogis(82000)*0.2, ncol=82)
  gmed3 <- apply(gobs3, 1, median)
> hist(gmed3, nclass=30, prob=T)

### This again looks very much like the logistic-mixture
### distribution from the parametric bootstrap.

> summary(gmed3)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  19.69   20.57   20.85   20.91   21.20   22.44
> quantile(gmed3, c(.025,.975))
    2.5%    97.5%
20.15532  21.9298

### Again much as before !!!
### Now get the median of the kernel-approximated distribution
### (the kernel CDF at x is the average of the 82 logistic CDF's):

> uniroot(function(x) mean(plogis(x, galaxies/1000, 0.2)) - 0.5,
      c(17,23))$root
[1] 20.84609

### The resulting CI is : 20.8335 + 20.84609 - c(21.9298, 20.15532)
###                     = (19.74979, 21.52427)

============================================================

(III) ONE IDEA OF THE NONPARAMETRIC BOOTSTRAP IS AS THE LIMIT OF THIS
LAST SIMULATION FROM THE KERNEL-DENSITY ESTIMATOR, IN THE LIMIT AS
THE BANDWIDTH GOES TO 0 !!!

Let's implement this idea for the galaxies data. It is actually more
difficult to imagine a proper bootstrap for the shoes, as we
discussed in class.
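### ASIDE (a sketch, not part of the original log): the limiting
### connection can be seen numerically by shrinking the kernel
### bandwidth toward 0 in the Attempt-3 simulation; at bandwidth 0
### the draws are exactly a with-replacement resample of the data.

> for (bw in c(0.2, 0.05, 0.01, 0)) {
     ix  = sample(1:82, 82000, replace=T)
     smp = matrix(galaxies[ix]/1000 + rlogis(82000)*bw, ncol=82)
     cat("bw =", bw, "  2.5%/97.5% quantiles of medians:",
         round(quantile(apply(smp,1,median), c(.025,.975)), 3), "\n") }

### The quantiles stabilize as bw decreases, approaching those of
### the nonparametric bootstrap implemented next.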
Now the galaxies, 1000 iterations with bootstrap samples of size 82:

> bootgalx <- matrix(sample(galaxies/1000, 82000, replace=T), ncol=82)
  mdgalx <- apply(bootgalx, 1, median)
> hist(mdgalx - median(galaxies/1000), nclass=30)

### The plotted histogram shows a very spread-out and non-normal
### distribution! Recall that the idea of the bootstrap is to view
### this distribution as essentially the same as that of the sample
### median minus the true median. Therefore the 95% CI for the
### median is:

> median(galaxies/1000) - quantile(mdgalx - median(galaxies/1000),
      c(.975,.025))
   97.5%     2.5%
19.72601   21.492

### pretty close to the confidence intervals generated via the
### "parametric bootstraps" above. Recall that

the parametric mixed-logistic bootstrap gave:  (19.879, 21.560) ,
the `parametric' spline bootstrap gave:        (19.676, 21.545) ,
and the `parametric' kernel-density (logistic, bandwidth .2)
bootstrap gave:                                (19.750, 21.524) .

Another run of the nonparametric bootstrap gives:

> median(galaxies/1000) - quantile(apply(matrix(sample(galaxies/1000,
      82000, replace=T), ncol=82), 1, median) - median(galaxies/1000),
      c(.975,.025))
   97.5%     2.5%
  19.707 21.49661

### This is quite similar to the previous intervals, and may be
### easier to recommend because it depends on fewer assumptions !

------------------------------------------------------

The shoe-dataset example is slightly more difficult, because a fair
re-sampling cannot be done by randomly permuting labels as we did in
the (null-hypothesis) permutational setting, but must instead
re-sample within each treatment group! As in regression problems
more generally, it is not the actual measured observations which can
(in non-null settings) be treated as permutable or `exchangeable',
but rather the residuals!!

Here is a bootstrap for the shoes:

> bootshoe <- rbind(
     matrix(shoes$wear[sample(1:10, 10000, replace=T)], ncol=1000),
     matrix(shoes$wear[sample(11:20, 10000, replace=T)], ncol=1000))
> difsamp <- c(c(rep(.1,10), rep(-.1,10)) %*% bootshoe)
> summary(difsamp)
     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
 -3.66000 -1.11250 -0.42000 -0.43613  0.21000  2.69000
> c(2*mean(shoes$wear[1:10] - shoes$wear[11:20]) -
      quantile(difsamp, c(.975,.025)))
[1] -2.46025  1.60025

### This should be contrasted with the previous intervals we found:
### the permutational interval based on studentized differences,
### (-2.736, 1.916), and the t-distribution CI, (-2.7449, 1.9249).

### In this setting, with such a small sample size, it is not clear
### that the nonparametric bootstrap is reliable: of these
### intervals, probably the permutational one makes the most sense.
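### ASIDE (a sketch, not part of the original log; it assumes the
### recommended package 'boot' is installed): the same within-group
### resampling can be carried out with boot() using its strata
### argument, and boot.ci() with type="basic" applies the same
### 2*estimate - quantiles construction used by hand above.

> library(boot)
  meandiff = function(w, ix) {           ## statistic for boot():
     ws = w[ix]                          ## ix respects the strata, so
     mean(ws[1:10]) - mean(ws[11:20]) }  ## positions 1-10 stay group A
  bs = boot(shoes$wear, meandiff, R=1000, strata=rep(1:2, each=10))
> boot.ci(bs, type="basic")

### Up to Monte Carlo error this reproduces the interval
### (-2.46, 1.60) found above.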