Log of Clustering Steps in R for HW7 ==================================== > uscomp = read.table("uscomp2.txt") > dim(uscomp) [1] 79 8 > unlist(lapply(uscomp,class)) V1 V2 V3 V4 V5 V6 V7 V8 "factor" "integer" "integer" "integer" "numeric" "numeric" "numeric" "factor" ## Omit the names columns 1,8; then scale the others by SD ## and apply cluster algorithm to the resulting vectors, ## using Euclidean norm distance. > uscompS = data.matrix(uscomp[,2:7]) uscompS = uscompS - outer(rep(1,79), apply(uscompS,2,mean), "*") uscompS = uscompS %*% diag(1/sqrt(diag(t(uscompS) %*% uscompS))) ### load package "cluster", then try out "diana" and "agnes" > clus1 = diana(uscompS) plot(clus1) clus2 = agnes(uscompS) plot(clus2) ### It is hard to read these plots. We need to code clusters up ### to a given height, from the output cluster objects. This ### is done below. ### Consider an example. We create 20 observations in 4 groups of ### 3-vectors: Multivariate Normal all with variance 2*diag(4) and ### means rep(-1,3), rep(1,3), rep(3,3), rep(5,3) > xmat = array(sqrt(2)*rnorm(60), c(20,3)) + outer(rep(seq(-1,5,2), rep(5,4)),rep(1,3), "*") clusx1 = diana(xmat) plot(clusx1) ### See the pictures ClusxAG.pdf clusx2 = agnes(xmat) ### and ClusxDI.pdf with links on the plot(clusx2) ### web-page. > clusx2$height [1] 2.0429589 1.5700492 3.1436726 1.8496718 2.6483806 4.2790983 5.4098963 [8] 1.0150814 1.6290795 1.3861984 2.3627324 0.8115837 3.2783917 2.6410553 [15] 4.5544819 8.8511412 1.6352478 2.8538983 3.6537399 > clusx2$merge [,1] [,2] [1,] -16 -18 [2,] -11 -13 [3,] -15 -17 [4,] -6 -8 [5,] 2 3 [6,] -2 -5 [7,] -9 -12 [8,] -1 4 [9,] 5 1 [10,] -14 -19 [11,] 7 -10 [12,] 6 -4 [13,] 8 11 [14,] 9 10 [15,] 12 -3 [16,] 13 -7 [17,] 14 -20 [18,] 16 17 [19,] 18 15 > sort(clusx2$height) [1] 0.8115837 1.0150814 1.3861984 1.5700492 1.6290795 1.6352478 1.8496718 [8] 2.0429589 2.3627324 2.6410553 2.6483806 2.8538983 3.1436726 3.2783917 [15] 3.6537399 4.2790983 4.5544819 5.4098963 8.8511412 ### The successive merges happen at the sorted heights, from smallest ## to largest. The unsorted heights show the merges pictured from ## left to right. ### Further routines to be coded and illustrated below will show how to generate and interpret the clusters visually. > ClusGp function(ht, clusobj) { ### The idea here is to accumulate list of distinct groups ### using "merge" component for all indices whose heights ### are <= ht sortht = sort(clusobj$height) num = sum(sortht <= ht) Active = NULL Grps = NULL Merge = clusobj$merge for (i in 1:num) { Active = c(Active,i) if(Merge[i,1]<0 & Merge[i,2] <0) { Grps = c(Grps,list(-Merge[i,])) } else if (Merge[i,1]*Merge[i,2]<0) { agp = max(Merge[i,]) bnode = -min(Merge[i,]) Grps = c(Grps, list(c(Grps[[agp]],bnode))) Active = setdiff(Active,agp) } else { Grps = c(Grps, list(c(Grps[[Merge[i,1]]], Grps[[Merge[i,2]]]))) Active = setdiff(Active,Merge[i,]) } } Clust = Grps[Active] ### At this point, all of the Active elements ### in Grps represent non-singleton clusters. ### The next 5 lines of code would be included only ### if we wanted to list all of the singleton ### clusters as part of Clust, but we don't: # nnod = nrow(clusobj$data) # Sngltn = NULL # for(i in Active) Sngltn = union(Sngltn,Grps[[i]]) # Sngltn = setdiff(1:nnod,Sngltn) # for(i in Sngltn) Clust = c(Clust, list(i)) list(Grps=Grps,Active=Active, Singletons=Sngltn, Clusters=Clust) } ### NOTE: the output Active indexes only those Groups ## of nodes that have merged with at least one other ## node at a "height" <= ht. The nodes not included ## in the Grps can all be regarded as singleton ## clusters, and they are listed that way in the ## output Clusters. > ClusGp(2,clusx2)[3:4] $Singletons [1] 1 3 4 7 10 14 19 20 $Clusters $Clusters[[1]] [1] 16 18 $Clusters[[2]] [1] 6 8 $Clusters[[3]] [1] 11 13 15 17 $Clusters[[4]] [1] 2 5 $Clusters[[5]] [1] 9 12 > ClusGp(4, clusx2)[3:4] ### 5 clusters plus 2 singletons. $Singletons [1] 7 20 $Clusters $Clusters[[1]] [1] 6 8 1 9 12 10 $Clusters[[2]] [1] 11 13 15 17 16 18 14 19 $Clusters[[3]] [1] 2 5 4 3 ### These clusters are about as close as we can come to ### the "true" clustering into 1:5, 6:10, 11:15, 16:20