DEMO related to 10/5/05 Stat 440 Class ====================================== Consider regression vs stratification in dummy-variable setting with binary attribute Y (like opinion-poll response, with 1=Yes). In this illustration, N=12000, n=80, and there are five equal-sized strata (N_h=2400) with respective numbers of Y-attributes equal to 1 in these strata being 240 (=.1*2400), 720, 1200, 1680, 2160. ----------------------------------------------------------- Step 1. Simulate 12,000 Y values, using 5-valued label variable ZL (with equal probabilities of all 5 labels, and respective conditional prob's .1, .3, .5, .7, .9 of Y=1 given ZL = 1:5) > ZL <- sample(1:5, 12000, replace=T) Yv <- rbinom(12000, 1, seq(.1,.9,.2)[ZL]) > table(Yv, ZL) ZL Yv 1 2 3 4 5 0 2114 1712 1172 758 235 1 232 718 1196 1723 2140 ### Looks OK > round(.Last.value[2,]/table(ZL), 3) ZL 1 2 3 4 5 0.099 0.295 0.505 0.694 0.901 ### Looks OK ---------------------------------------------------------- Step 2. Now draw 1000 samples of size 80, SRS, but analyze using regression in each case. ### First see how this looks with one sample > smpind <- sample(1:12000,80) tmpmod <- lm(Yv[smpind] ~ factor(ZL[smpind]) - 1) ## coeff's are 0.20 , 0.19 , 0.45 , 0.61 , 0.88 ## estimating .1, .3, .5, .7, .9 This is a regression model (fitted without intercept) using dummy variables for the different levels of ZL. Another way to get the same coefficient vector is to form the ratios (for j=1,..,5) sum(Yv[smpind]*(ZL[smpind]==j))/sum(ZL[smpind]==j) These are simply the observed groupwise proportions of "yes" answers [Yv==1] in the sample. ## Estimate Yv total [which happens to be: sum(Yv)=6009] by > tyreg <- c(tmpmod$coef %*% rep(2400,5)) > round(tyreg,1) [1] 5576.7 ## Also estimate RMSE of this estimator: > round(12000*sqrt((1-80/12000)*var(tmpmod$resid)/80),1) [1] 583 ## Theoretical value: > Yvhat <- seq(.1,.9,.2)[ZL] round(12000*sqrt((1-80/12000)*var(Yv-Yvhat)/80),1) [1] 552 ## Now do the loop with 1000 such samples, and save the ## estimators and calculate empirical MSE > tyrg.vec <- numeric(1000) for (i in 1:1000) { smpind <- sample(1:12000,80) tyrg.vec[i] <- 2400*sum(tmpmod <- lm(Yv[smpind] ~ factor(ZL[smpind]) - 1)$coef) } ## The "factor" inside the lm command tells R to treat ## the ZL labels like dummy regressor variables (5 such ## predictor variables, one for each label 1:5, a column ## of 0's and 1's indicating for each index whether that ## index has the specified label. > mean(tyrg.vec) [1] 5992.887 ### The avg estimate is near 6009 > sqrt(var(tyrg.vec)) [1] 569.5409 > sqrt(mean((tyrg.vec-6009)^2)) [1] 569.4841 ### Both the s.dev and RMSE of the ### regression est's are near the theoretical value 552 --------------------------------------------------------- Step 3. Now look at stratified estimator based on balanced numbers of ZL's (16 in each group) > Gps <- NULL for(j in 1:5) Gps <- c(Gps, list(1:12000)[ZL==j])) ## Gps is a list in which the j'th list-component ## (j=1:5) is the set of indices in 1:12000 for ## which ZL=1, etc. > unlist(lapply(Gps,length)) [1] 2346 2430 2368 2481 2375 ## respective numbers of indices with ZL values 1:5 > tstr.vec <- numeric(1000) for (i in 1:1000) for (j in 1:5) tstr.vec[i] <- tstr.vec[i] + 2400*mean(Yv[sample(Gps[[j]],16)]) > mean(tstr.vec) [1] 6007.95 ## the avg over all 1000 estimators > round(sqrt(var(tstr.vec)),1) [1] 542.5 ### Very similar to theoretical reg-est SE ### Now calculate the stratified estimator's theoretical RMSE > 2400*sqrt((1-16/2400)*sum(seq(.1,.9,.2)*(1-seq(.1,.9,.2)))/16) [1] 551.3257