R Script for Demonstration of an Estimator Using Regression within Strata. ====================================== 10/8/07 This example is a simulation, which we will build with N = 1e5, n=200, H=4, nh=50 for each h Attributes yi will be related to covariates x_i within strata as follows. In stratum h , yi = a[h] + b[h]*xi with N(0,sgsq[h]) errors, xi = N(m[h], 1) distributed > mh = 1:4 bh = c(-.5, 1.2, 0.1, -1) ah = c(.5, 2, 3, 5.5) sgsqh = rep(.25, 4) uv = rnorm(1.e5)*.25 wv = rnorm(1.e5)*.5 hv = sample(1:4,1e5,replace=T) xv = mh[hv] + uv yv = ah[hv] + bh[hv]*xv + wv ## Actual mean: > c(mean(yv), mean(xv)) [1] 2.301839 2.500015 ### ybar is target !! ### This worked really quickly! This is the simulated frame. ### Now draw the samples, first SRSWOR, then stratified > samp1ind = sample(1:1e5, 200) samp2ind = NULL for(i in 1:4) samp2ind = union(samp2ind, sample((1:1e5)[hv==i],50)) > Nh = table(hv) ### numbers to use in stratifying hv hv 1 2 3 4 25055 24835 25179 24931 ### Now we progressively try several methods of analysis: (1) Plain SRS estimation without using x > c(ybhat = mean(yv[samp1ind]), SE = sqrt((1-.002)*var(yv[samp1ind])/200)) ybhat SE 2.4523263 0.1317693 (2) Regression SRS estimation using x, treating mean(x) as known > tmplm= lm(yv[samp1ind] ~ xv[samp1ind]) c(ybreg = as.numeric(tmplm$coef[1]+tmplm$coef[2]*mean(xv)), SE = sqrt((1-.002)*var(summary(tmplm)$resid)/200)) ybreg SE 2.4639112 0.1295492 ### So there was a slight decrease in variance due to > cor(xv,yv) [1] 0.2126113 ### although as it happened the regression estimator was ### actually less accurate than the plain one. (3) Stratified estimator ignoring x within groups > yhbar = numeric(4) xhbar = numeric(4) yhSE = numeric(4) for(i in 1:4) { yhbar[i] = mean(yv[samp2ind][hv[samp2ind]==i]) xhbar[i] = mean(xv[hv==i]) yhSE[i] = sqrt(var(yv[samp2ind][hv[samp2ind]==i])) } > yhbar [1] 0.02054844 4.55372672 3.11581761 1.49220159 > yhSE [1] 0.4889455 0.6001247 0.4902415 0.5399412 > c(ybstrat = sum((Nh/1e5)*yhbar), SE = sqrt((1-.002)*sum((yhSE*Nh/1e5)^2/50))) ybstrat SE 2.29261894 0.03753321 ### Estimator and SE much better !! (4) Stratified estimator with regression within strata ! > chmat = array(0, c(4,4), dimnames=list(NULL, c("ahat","bhat", "sgsqhat", "cor"))) for(i in 1:4) { inds = samp2ind[hv[samp2ind]==i] tmplm = lm(yv[inds] ~ xv[inds]) chmat[i,] = c(tmplm$coef, var(summary(tmplm)$resid), cor(xv[inds],yv[inds])) } c(yregstrat= sum((Nh/1e5)*(chmat[,1] + chmat[,2]*xhbar)), SE = sqrt((1-.002)*sum((Nh/1.e5)^2*chmat[,3]/50))) yregstrat SE 2.27668738 0.03376308 ### This last estimator is slightly lower-variance, but ### actually happens to be less accurate than the stratified ### estimator. The example could be made a LOT more extreme ### by making the within-group correlations strong and consistently ### negative! > chmat[,4] -0.3863068 0.4704235 0.2800386 -0.5313739 ## within-stratum corr's