SMALL LOG IN R FOR THREE METHODS OF SAMPING PPS WITH REPLACEMENT USING AS ILLUSTRATIVE POPULATION THE 300 COUNTIES IN "agsrs.dat" and using ACRES92 as size variable. ===================================================== 11/7/07 > dim(agsrs.fr) [1] 300 14 > names(agsrs.fr) [1] "COUNTY" "STATE" "ACRES92" "ACRES87" "ACRES82" "FARMS92" [7] "FARMS87" "FARMS82" "LARGEF92" "LARGEF87" "LARGEF82" "SMALLF92" [13] "SMALLF87" "SMALLF82" > summary(agsrs.fr[,3]) Min. 1st Qu. Median Mean 3rd Qu. Max. 0 87510 196700 297900 376100 2234000 ### only 1 exactly equal to 0 > xsiz = agsrs[,3] cumsiz = cumsum(xsiz) ## vecs of length 300 ## METHOD 1: fcn in R newsamp = sample(1:300, 10, prob=xsiz/sum(xsiz), replace=T) > newsamp [1] 266 12 234 255 185 29 258 131 111 2 ## METHOD 2: cum size cumprob = cumsiz/sum(xsiz) usamp = runif(10) newsamp2 = numeric(10) for(i in 1:10) newsamp2[i] = min((1:300)[cumprob > usamp[i]]) > newsamp2 [1] 252 27 161 72 287 91 62 293 101 249 ## METHOD 3: Lahiri (1951) or `rejection' method newsamp3 = numeric(10) maxsiz = max(xsiz) for(i in 1:10) { evt = T while(evt) { MM = sample(1:300,1) uu = runif(1) evt = (maxsiz*uu > xsiz[MM]) } newsamp3[i] = MM } > newsamp3 [1] 235 162 235 210 252 257 265 88 163 267 ### How can we check that all three algorithms give the same ### type of samples ? ## We can at least generate a lot and check that the histograms ## and variances are the same newsamp2 = newsamp3 = numeric(1e4) newsamp = sample(1:300, 1e4, prob=xsiz/sum(xsiz), replace=T) usamp = runif(1e4) for(i in 1:1e4) newsamp2[i] = min((1:300)[cumprob > usamp[i]]) for(i in 1:1e4) { evt = T while(evt) { MM = sample(1:300,1) uu = runif(1) evt = (maxsiz*uu > xsiz[MM]) } newsamp3[i] = MM } v1 = hist(newsamp, breaks = seq(.5, 300.5, 5), plot=F)$counts v2 = hist(newsamp2, breaks = seq(.5, 300.5, 5), plot=F)$counts v3 = hist(newsamp3, breaks = seq(.5, 300.5, 5), plot=F)$counts psiz = xsiz/sum(xsiz) ### break into chi-square for 20 groups gpsz = c(t(array(psiz, c(5,60))) %*% rep(1,5)) > sum((v1-1e4*gpsz)^2/(1e4*gpsz)) [1] 67.10581 > sum((v2-1e4*gpsz)^2/(1e4*gpsz)) [1] 81.5297 > sum((v3-1e4*gpsz)^2/(1e4*gpsz)) [1] 48.8767 ### these are all supposed to be chi-square 59df with mean 59 and ### SD 10.39, so OK > c(var(v1), var(v2), var(v3)) [1] 20801.04 18666.97 19496.06 #### OK, not too bad