Small Log to Illustrate Factors within Linear/GL Models in R ========================================================= 11/2/15 We will use data on Bass data again as an example. > Basstmp0 = read.table("http://www.math.umd.edu/~slud/s430.old/Data/bass", header=T, skip=35) > dim(Basstmp0) [1] 53 12 > names(Basstmp0) [1] "ID" "Lake" "Alkalinity" [4] "pH" "Calcium" "Chlorophyll" [7] "AvgMercury" "No.samples" "min" [10] "max" ## let's recode the alkalinity variable to three categories A,B,C ## based on intervals of values > summary(Basstmp0$Alk) Min. 1st Qu. Median Mean 3rd Qu. Max. 1.20 6.60 19.60 37.53 66.50 128.00 > plot(Basstmp0$Alk,log(Basstmp0$AvgM)) > summary(Basstmp0$pH) Min. 1st Qu. Median Mean 3rd Qu. Max. 3.600 5.800 6.800 6.591 7.400 9.100 > Basstmp0$Alkalinity = factor(cut(Basstmp$Alk, breaks=c(0,15,40,130), right=F, labels=c("A","B","C"))) names(Basstmp0)[3] = "Alk" Basstmp0$pH <- factor(round(Basstmp$pH)) contrasts(Basstmp0$pH) = contr.helmert(6) > levels(Basstmp0$Alk) [1] "A" "B" "C" ### NB. "contrasts" assume an intercept term. > .Options$contrasts ### this is the default unordered ordered "contr.treatment" "contr.poly" > sapply(Basstmp0[,3:4],contrasts) $Alkalinity B C ### in this (default) coding, the intercept b0 coef is the A 0 0 ### group mean for A, but b0+b1 is the group mean for B, B 1 0 ### and b0+b2 is the group mean for C. C 0 1 $pH [,1] [,2] [,3] [,4] [,5] 4 -1 -1 -1 -1 -1 5 1 -1 -1 -1 -1 6 0 2 -1 -1 -1 7 0 0 3 -1 -1 8 0 0 0 4 -1 9 0 0 0 0 5 ### Note that the columns within the block defined by a factor are orthogonal ### TO SEE HOW A LITTLE DATASET RESULTS IN A LINEAR-MODEL DESIGN MATRIX, CONSIDER > cbind(Basstmp0[1:15,c(3,4,7)], model.matrix(lm(AvgMercury ~ ., data=Basstmp0[,c(3,4,7)]))[1:15,]) Alkalinity pH AvgMercury (Intercept) AlkalinityB AlkalinityC pH1 pH2 pH3 pH4 pH5 1 A 6 1.23 1 0 0 0 2 -1 -1 -1 2 A 5 1.33 1 0 0 1 -1 -1 -1 -1 3 C 9 0.04 1 0 1 0 0 0 0 5 4 B 7 0.44 1 1 0 0 0 3 -1 -1 5 A 5 1.20 1 0 0 1 -1 -1 -1 -1 6 B 7 0.27 1 1 0 0 0 3 -1 -1 7 A 5 0.48 1 0 0 1 -1 -1 -1 -1 8 C 8 0.19 1 0 1 0 0 0 4 -1 9 B 6 0.83 1 1 0 0 2 -1 -1 -1 10 A 6 0.81 1 0 0 0 2 -1 -1 -1 11 A 5 0.71 1 0 0 1 -1 -1 -1 -1 12 B 7 0.50 1 1 0 0 0 3 -1 -1 13 B 7 0.49 1 1 0 0 0 3 -1 -1 14 A 6 1.16 1 0 0 0 2 -1 -1 -1 15 C 8 0.05 1 0 1 0 0 0 4 -1 ### You can see that the rows in the little "contrasts" tables show just how indicated columns would be coded for a observation with factor level indicated as the row name. The rule is: if c1,...,ck are the values for levels L1,...,Lk in the jth column of the contrasts matrix, then the corresponding coefficient of that column multiplies a regressor c1*I[L=L1] + c2*I[L=L2] + ... + ck*I[L=Lk] so in a model involving pH above, the beta_pH1 coeff multiplies I[pH=5]-I[pH=4], the beta_pH2 coef multiplies 2*I[pH=6]-I[pH=4]-I[pH=4] , etc. ### Now let's go back to the standard "treatment" coding where each coef multiplies the indicator of a single level of pH. > contrasts(Basstmp$pH) = contr.treatment(6) > Basstmp = cbind.data.frame(logAvM = log(Basstmp[,7]), Basstmp[,-c(1,2,7:11)]) lmBass.tmp = lm(logAvM ~ . - Calcium , data=Basstmp) lmBass.tmp Call: lm(formula = logAvM ~ . - Calcium, data = Basstmp) Coefficients: (Intercept) AlkB AlkC pH5 pH6 pH7 -0.39735 -0.29219 -1.34262 -0.12499 -0.16217 -0.11645 pH8 pH9 Chlorophyll agedata 0.61369 0.60361 -0.01426 0.32958 > dim(model.matrix(lmBass.tmp)) [1] 53 10 > round(t(summary(lmBass.tmp)$coef[,1:3]), 4) (Intercept) AlkB AlkC pH5 pH6 pH7 pH8 pH9 Estimate -0.3973 -0.2922 -1.3426 -0.1250 -0.1622 -0.1164 0.6137 0.6036 Std. Error 0.2661 0.2540 0.3666 0.3167 0.2903 0.3545 0.4906 0.5718 t value -1.4933 -1.1506 -3.6627 -0.3947 -0.5587 -0.3285 1.2508 1.0556 Chlorophyll agedata Estimate -0.0143 0.3296 Std. Error 0.0032 0.2083 t value -4.5047 1.5822 > lmBass.tmp2 = step(lmBass.tmp, .~ (Alk + pH + Chlorophyll + agedata )^2, k=2, direction="both") Start: AIC=-60.81 logAvM ~ (Alk + pH + Calcium + Chlorophyll + agedata) - Calcium Df Sum of Sq RSS AIC - pH 5 1.4474 12.985 -64.545 + Alk:pH 2 1.3442 10.193 -63.375 + Chlorophyll:agedata 1 0.6338 10.903 -61.804 + Alk:agedata 2 0.9269 10.610 -61.248 11.537 -60.809 - agedata 1 0.6717 12.209 -59.810 + Alk:Chlorophyll 2 0.1786 11.359 -57.636 + pH:agedata 5 1.2927 10.245 -57.108 + pH:Chlorophyll 5 0.5863 10.951 -53.574 - Alk 2 4.0407 15.578 -48.895 - Chlorophyll 1 5.4446 16.982 -42.322 Step: AIC=-64.55 logAvM ~ Alk + Chlorophyll + agedata Df Sum of Sq RSS AIC + Chlorophyll:agedata 1 0.9958 11.989 -66.775 12.985 -64.545 - agedata 1 0.5290 13.514 -64.429 + Alk:agedata 2 0.6758 12.309 -63.378 + Alk:Chlorophyll 2 0.0781 12.907 -60.865 + pH 5 1.4474 11.537 -60.809 - Alk 2 5.1948 18.180 -50.710 - Chlorophyll 1 4.5812 17.566 -50.529 Step: AIC=-66.77 logAvM ~ Alk + Chlorophyll + agedata + Chlorophyll:agedata Df Sum of Sq RSS AIC 11.989 -66.775 - Chlorophyll:agedata 1 0.9958 12.985 -64.545 + Alk:Chlorophyll 2 0.2157 11.773 -63.737 + Alk:agedata 2 0.0766 11.912 -63.114 + pH 5 1.0853 10.903 -61.804 - Alk 2 5.7723 17.761 -49.944 > lmBass.tmp3 = step(lmBass.tmp, .~ (Alk + pH + Chlorophyll + agedata )^2, k=4, direction="both") Start: AIC=-40.81 logAvM ~ (Alk + pH + Calcium + Chlorophyll + agedata) - Calcium Df Sum of Sq RSS AIC - pH 5 1.4474 12.985 -54.545 - agedata 1 0.6717 12.209 -41.810 11.537 -40.809 + Chlorophyll:agedata 1 0.6338 10.903 -39.804 + Alk:pH 2 1.3442 10.193 -39.375 + Alk:agedata 2 0.9269 10.610 -37.248 + Alk:Chlorophyll 2 0.1786 11.359 -33.636 - Alk 2 4.0407 15.578 -32.895 + pH:agedata 5 1.2927 10.245 -27.108 - Chlorophyll 1 5.4446 16.982 -24.322 + pH:Chlorophyll 5 0.5863 10.951 -23.574 Step: AIC=-54.55 logAvM ~ Alk + Chlorophyll + agedata Df Sum of Sq RSS AIC - agedata 1 0.5290 13.514 -56.429 + Chlorophyll:agedata 1 0.9958 11.989 -54.775 12.985 -54.545 + Alk:agedata 2 0.6758 12.309 -49.378 + Alk:Chlorophyll 2 0.0781 12.907 -46.865 - Alk 2 5.1948 18.180 -44.710 - Chlorophyll 1 4.5812 17.566 -42.529 + pH 5 1.4474 11.537 -40.809 Step: AIC=-56.43 logAvM ~ Alk + Chlorophyll Df Sum of Sq RSS AIC 13.514 -56.429 + agedata 1 0.5290 12.985 -54.545 + Alk:Chlorophyll 2 0.0898 13.424 -48.782 - Alk 2 4.8073 18.321 -48.299 + pH 5 1.3047 12.209 -41.810 - Chlorophyll 1 6.2326 19.746 -40.328 > extractAIC(lmBass.tmp, k=2) [1] 10.0000 -60.8094 > extractAIC(lmBass.tmp, k=4) [1] 10.0000 -40.8094 > rbind(tmp2=extractAIC(lmBass.tmp2,k=2), tmp3=extractAIC(lmBass.tmp3,k=2)) [,1] [,2] tmp2 6 -66.77455 tmp3 4 -64.42917 > rbind(tmp2=extractAIC(lmBass.tmp2,k=4), tmp3=extractAIC(lmBass.tmp3,k=4)) [,1] [,2] tmp2 6 -54.77455 tmp3 4 -56.42917 > round(summary(lmBass.tmp3)$coef,5) Estimate Std. Error t value Pr(>|t|) (Intercept) -0.26574 0.11706 -2.27021 0.02763 AlkB -0.25820 0.17934 -1.43968 0.15632 AlkC -0.83085 0.20006 -4.15308 0.00013 Chlorophyll -0.01320 0.00278 -4.75384 0.00002 > anova(update(lmBass.tmp3, formula = .~.+Alk:Chlorophyll)) Analysis of Variance Table Response: logAvM Df Sum Sq Mean Sq F value Pr(>F) Alk 2 16.8994 8.4497 29.5843 4.824e-09 *** Chlorophyll 1 6.2326 6.2326 21.8216 2.531e-05 *** Alk:Chlorophyll 2 0.0898 0.0449 0.1572 0.855 Residuals 47 13.4239 0.2856 --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 > model.matrix(update(lmBass.tmp3, formula = .~.+Alk:Chlorophyll))[1:15,] (Intercept) AlkB AlkC Chlorophyll AlkB:Chlorophyll AlkC:Chlorophyll 1 1 0 0 0.7 0.0 0.0 2 1 0 0 3.2 0.0 0.0 3 1 0 1 128.3 0.0 128.3 4 1 1 0 3.5 3.5 0.0 5 1 0 0 1.8 0.0 0.0 6 1 1 0 44.1 44.1 0.0 7 1 0 0 3.4 0.0 0.0 8 1 0 1 33.7 0.0 33.7 9 1 1 0 1.6 1.6 0.0 10 1 0 0 22.5 0.0 0.0 11 1 0 0 14.9 0.0 0.0 12 1 1 0 4.0 4.0 0.0 13 1 1 0 11.6 11.6 0.0 14 1 0 0 5.8 0.0 0.0 15 1 0 1 71.1 0.0 71.1 > newfac = interaction(Basstmp$Alk, factor(Basstmp$agedat)) > levels(newfac) [1] "A.0" "B.0" "C.0" "A.1" "B.1" "C.1" > contrasts(newfac) B.0 C.0 A.1 B.1 C.1 A.0 0 0 0 0 0 B.0 1 0 0 0 0 C.0 0 1 0 0 0 A.1 0 0 1 0 0 B.1 0 0 0 1 0 C.1 0 0 0 0 1