LOG for Splus Illustration of Cox Model Calculations. ==================================================== 12/4/03 HERE is an illustrative comparison using the data in Example 8.4 (Renal insufficiency, previously downloaded from the Section 1.4 dataset as "Dialys.fram"). (I) Basic statements & preliminary exhibits > dialcox <- coxph(Surv(time,Infec) ~ Gp, data=Dialys.fram) > dialcox$loglik [1] -104.2319 -103.0278 ### Note that the 1st component is the initial value for Efron's ### likelihood, 2nd is the final value. (Compare p. 260, bottom.) ## Since the default initial value for beta is 0, the difference of ## initial and final logLik's times 2 gives the Likelihood Ratio ## Chi-sq statistics value below: 2*(-103.0278+104.2319) = 2.4082. ### If we had wanted Breslow's likelihood, and corresponding ### estimates, we should specify method="breslow" within coxph. > summary(dialcox) Call: coxph(formula = Surv(time, Infec) ~ Gp, data = Dialys.fram) n= 119 coef exp(coef) se(coef) z p Gp -0.613 0.542 0.398 -1.54 0.12 exp(coef) exp(-coef) lower .95 upper .95 Gp 0.542 1.85 0.248 1.18 Rsquare= 0.02 (max possible= 0.827 ) Likelihood ratio test= 2.41 on 1 df, p=0.121 Wald test = 2.37 on 1 df, p=0.124 Efficient score test = 2.44 on 1 df, p=0.118 > dialcox$coef/sqrt(dialcox$var) [,1] [1,] -1.539453 ### squared value is 2.3699 = Wald Stat attr(, "names"): [1] "Gp" ### Now we try to re-compute score statistic test from scratch. ## In this special case, the logrank statistic numerator is the ## same as the score statistic, exactly because we are ## replacing beta by 0 in the comensated number of group-1 deaths. > tmp <- survdiff(Surv(time,Infec) ~ Gp, data=Dialys.fram > tmp$chisq [1] 2.529506 ## not quite the same as `efficient score test' above > c(tmp$obs[1],tmp$exp[1], tmp$var[1,1]) [1] 15.000000 11.036448 6.210596 ### and (15-11.036)^2/6.2106=2.530. ### Variance in "efficient score" also does not agree with logrank ### if method="breslow": > coxph(Surv(time,Infec) ~ Gp, data=Dialys.fram, method="breslow")$score [1] 2.487286 ### NOTE that although the Survival pictures show that Group 1 ### has higher risks (because its survival function drops faster), ### the beta coefficient is negative. That is because the second ### group is coded as 2 rather than 0. If the second group wee coded ### 0, then the coefficient would be +0.613. > srvdial <- survfit(Surv(time, Infec) ~ Gp, data=Dialys.fram) > srvdial$strata Gp=1 Gp=2 23 24 > srvdial$time [1] 1.5 2.5 3.5 4.5 5.5 6.5 7.5 8.5 9.5 10.5 11.5 12.5 13.5 14.5 15.5 [16] 16.5 18.5 21.5 22.5 23.5 25.5 26.5 27.5 0.5 1.5 2.5 3.5 4.5 5.5 6.5 [31] 7.5 8.5 9.5 10.5 11.5 12.5 14.5 15.5 16.5 18.5 19.5 20.5 22.5 24.5 25.5 [46] 26.5 28.5 > srvdial$n.risk [1] 43 42 40 36 33 31 29 25 22 20 18 16 14 13 11 10 9 8 6 4 3 2 1 76 60 [26] 56 49 43 40 35 33 30 27 25 22 20 16 14 13 11 10 7 6 5 4 3 1 > plot.survfit(srvdial, lty=c(1,3), main= "Kaplan-Meier Plots for Two Dialysis-patient Groups (1=solid)") printgraph(file="TwoKMDial.ps") ### This can give two survival curves plotted separately: ### we can plot -log(S) to see whether the curves look linear ### and/or proportional > plot(srvdial$time[1:23], -log(srvdial$surv[1:23]), type="l", lty=1, xlim=c(0,30), xlab="Time", ylab="Cum Haz", main="Cumulative Hazard Functions from 2 Separate Dialysis Groups") lines(srvdial$time[24:47], -log(srvdial$surv[24:47]), lty=3) printgraph(file="TwHazDial.ps") > plot(survfit(dialcox), main= ### Summary survival curve with confidence "Summary Survival Curve estimate from Dialysis Data") ### band based on Cox model printgraph(file="SumCoxDial.ps") ### From these exhibits we can roughly guess that the two groups' ### survival curves should fit adequately to a Cox model ### (rough proportionality of cum Hazards), AND that the curves ### are also not far from exponential, so that maybe the survival ### regression model should also fit adequately. (II) Nuisance survival function. ## We noted that the summary survival function under the Cox model could be displayed, along with confidence intervals for each time-point , using the survfit function with Cox-model fitted object as input. ## You might think that there would be an output-list component ## from which we could immediately estimate the nuisance (baseline) ## cumulative-hazard Lambda0 or survival function. But there does not ## seem to be, and according to the Venables and Ripley chapter on ## Survival Analysis (which is a good source for Splus survival ## analysis commands and some possibilities of datasets to analyze), ## this nuisance cumulative hazard is a derived quantity which we must ## estimate ourselves from the Cox-model output, as follows: > srvdialA <- survfit(Surv(time, Infec), data=Dialys.fram) > rsksum <- aggregate.data.frame(exp(dialcox$coef*Dialys.fram[3]), Dialys.fram[1], sum)$Gp > srvnuis <- cumsum(srvdialA$n.ev/rev(cumsum(rev(rsksum)))) > mean(exp(Dialys.fram$Gp*dialcox$coef)) [1] 0.3834194 > round(rbind(nuisrsk=rev(cumsum(rev(rsksum))), atrisk=srvdialA$n.risk),2) [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] nuisrsk 45.63 40.93 39.21 36.07 32.14 29.63 27.08 25.41 22.36 19.85 18.18 atrisk 119.00 103.00 98.00 89.00 79.00 73.00 66.00 62.00 55.00 49.00 45.00 [,12] [,13] [,14] [,15] [,16] [,17] [,18] [,19] [,20] [,21] [,22] srvnuis 16.22 14.55 12.29 11.74 10.07 9.24 8.11 7.27 6.39 6.1 5.01 atrisk 40.00 36.00 30.00 29.00 25.00 23.00 20.00 18.00 15.00 14.0 12.00 [,23] [,24] [,25] [,26] [,27] [,28] srvnuis 3.64 3.09 2.8 1.97 0.84 0.29 atrisk 9.00 8.00 7.0 5.00 2.00 1.00 > plot(srvdialA$time, srvnuis, xlab="time", ylab="Cum Haz", main= "Cox-Model Nuisance Survival Curve Estimator, Dialysis Data" ) printgraph(file="NuisDial.ps") ### You can see that the graph looks very linear !! at least out to t= ### time[25] = 25.5. > srvnuis[25]/srvdialA$time[25] [1] 0.05321062 ### Suggests the slope parameter = constant ### Exponential nuisance-distribution hazard = 0.0532 ### Return to calculation of `efficient score' test. ### Numerator: > dth1sum <- aggregate.data.frame(Dialys.fram$Infec*(2-Dialys.fram$Gp), list(Dialys.fram$time), sum)$x rsk1sum <- rev(cumsum(rev(aggregate.data.frame(2-Dialys.fram$Gp, list(Dialys.fram$time), sum)$x))) > dth1sum [1] 0 1 0 1 2 1 0 0 2 1 1 1 0 0 0 1 1 1 0 0 0 0 1 0 0 1 0 0 > rsk1sum [1] 43 43 42 40 36 33 31 29 25 22 20 18 16 14 13 11 10 9 8 8 8 6 4 3 3 [26] 2 1 0 ### Statistic numerator > sum(dth1sum - rsk1sum/srvdialA$n.risk) [1] 2.818619 ### Variance for Denominator >sum(srvdialA$n.risk*rsk1sum/srvdialA$n.risk*(1-rsk1sum/srvdialA$n.risk)) [1] 277.2631 ------------------------------------------------------------------- (III). Exponential proportional-hazard models can be fitted using Splus 3.4 function "survreg" with the dist="extreme", and link="log" option, e.g. (from online help): survreg(Surv(futime, fustat) ~ ecog.ps + rx, data = ovarian, dist = 'extreme', link = 'log', fixed = list(scale = 1)) This is fitted as a Generalized Linear Model. We know something about parametrizing the exponential models directly, as discussed in class, but this gives another way to do the exponential model for the data in #8.1. In the Dialysis example: > dialreg <- survreg(Surv(time, Infec) ~ Gp, data = Dialys.fram, dist = 'extreme', link = 'log', fixed = list(scale = 1)) > dialreg$coef (Intercept) Gp 2.943625 0.5335044 ## The model being fitted is: T ~ Expon ( e^{b'z} lam0 ), or in terms of a unit exponential V = exp(E), E = extreme-value variate log(T) = -log(lam0) - b'z + E ## But the parameterization in survreg (cf Venables and Ripley p.360) ## is log T ~ beta' z + sig * log(V) ## Thus sig = scale = 1 and -log(lam0) = Intercept ### So here the fitted parameter b=.533 (compare with Cox coef -.613), ### again with sign indicating log(T) is larger within group 2. ### Intercept of 2.943 = -log(lam0) which puts lam0 = ### exp(-2.943) = 0.05271 just like the value .0532 found above !! --------------------------------------------------------------------- (IV) Fitting of model with auxiliary time-dependent covariate log(t). ### The key point to understand is that we must establish a new data- ### frame with times in (start,stop) [left-truncated, right-censored] ### format, non-time-dependent covariates repeated for all records ### involving the same person, and time-dependent covariates entered ### with value at final time-point. Time-intervals in this case would ### be all intervals between distinct times in the study !! ## Recall that srvdialA$time contains 28 ordered distinct times in the ## study and contains successive integer times k+0.5 , k=0..16, 18..28 > Dialys.fram[1,] time Infec Gp 44 0.5 1 2 > newdial <- data.frame(start=0, stop=0.5, Infec=1, Gp=2, ZXlntim=0) > for (i in 2:119) { timind <- match(Dialys.fram$time[i],srvdialA$time) tmpmat <- array(0, dim=c(timind,5)) tmpmat[,4] <- rep(Dialys.fram[i,3], timind) tmpmat[timind,3] <- Dialys.fram$Infec[i] tmpmat[,1] <- if(timind>1) c(0,srvdialA$time[1:(timind-1)]) else 0 tmpmat[,2] <- srvdialA$time[1:timind] newdial <- rbind.data.frame(newdial,tmpmat) } > newdial$ZXlntim <- log(newdial$stop) * as.numeric(newdial$Gp==1) > dim(newdial) [1] 1132 5 > names(newdial) [1] "start" "stop" "Infec" "Gp" "ZXlntim" > dialcoxT <- coxph(Surv(start,stop,Infec) ~ Gp + ZXlntim, data=newdial) ... coef exp(coef) se(coef) z p Gp 1.44 4.21 1.029 1.40 0.160 ZXlntim 1.47 4.36 0.587 2.51 0.012 Likelihood ratio test=14.8 on 2 df, p=0.000595 n= 1132 ### This seems to indicate lack of proportionality !!!?? ### It is surprising because of our visual `finding' that an ### exponential model looked adequate. To avod problems near the ### end of the survival axis, let's re-code so that the time variable ### levels off for all times above the upper quartile, 22. > newdial$ZXlntim <- log(pmin(newdial$stop,22)) * as.numeric(newdial$Gp==1) > dialcoxT <- coxph(Surv(start,stop,Infec) ~ Gp + ZXlntim, data=newdial) ... coef exp(coef) se(coef) z p Gp 1.44 4.22 1.032 1.39 0.160 ZXlntim 1.48 4.38 0.589 2.51 0.012 ### Essentially same result !! #### Note that Gp still does not look significant by ### itself, but we could see from the original survival curves ### that there is a large difference ! ! =================================================================== FURTHER ILLUSTRATION OF MULTIPLE-COVARIATE / TIME-DEP-COV MODELLING > heartcox <- coxph(Surv(start,stop,event) ~ age + year + surgery, data=heart) > summary(heartcox) Call: coxph(formula = Surv(start, stop, event) ~ age + year + surgery, data = heart) n= 172 coef exp(coef) se(coef) z p age 0.0271 1.027 0.0134 2.02 0.043 year -0.1463 0.864 0.0704 -2.08 0.038 surgery -0.6376 0.529 0.3670 -1.74 0.082 exp(coef) exp(-coef) lower .95 upper .95 age 1.027 0.973 1.001 1.055 year 0.864 1.158 0.753 0.992 surgery 0.529 1.892 0.257 1.085 Rsquare= 0.084 (max possible= 0.969 ) Likelihood ratio test= 15.1 on 3 df, p=0.00172 Wald test = 14.5 on 3 df, p=0.0023 Efficient score test = 15 on 3 df, p=0.00179 > round(cbind(coef=heartcox$coef, se=sqrt(diag(heartcox$var))),3) coef se age 0.027 0.013 year -0.146 0.070 surgery -0.638 0.367 ## AIC for initial model (no cov's) andfinal (3 cov's) > -2*heartcox$loglik + 2*c(0,3) [1] 596.2427 587.1323 ### What about other models, eg AIC/LRT test for significance of ### surgery covariate ? > tmp <- coxph(Surv(start,stop,event) ~ age + year, data=heart) > 2*heartcox$loglik[2]- 2*tmp$loglik[2] ### LRT= 3.45 not significant. ### AIC in 2-coeff cases: > -2*coxph(Surv(start,stop,event) ~ age + year, data=heart)$loglik[2]+4 [1] 588.585 > -2*coxph(Surv(start,stop,event) ~ age+surgery, data=heart)$loglik[2]+4 [1] 589.5268 > -2*coxph(Surv(start,stop,event) ~ year+surgery,data=heart)$loglik[2]+4 [1] 589.684 ### Now time-dependent covariate case: > heartcoxT <- coxph(Surv(start,stop,event) ~ age + year + surgery + transplant, data=heart) > summary(heartcoxT) ... coef exp(coef) se(coef) z p age 0.02717 1.028 0.0137 1.9809 0.048 year -0.14635 0.864 0.0705 -2.0768 0.038 surgery -0.63721 0.529 0.3672 -1.7352 0.083 transplant -0.00513 0.995 0.1569 -0.0327 0.970 ... > c(heartcoxT$loglik[2], -2*heartcox$loglik[2] + 8) [1] -290.5656 589.1323 ### AIC not even needed here: clearly no better than 3-covariate model. But what if we take into account both time (year) AND transplant ?! > heartcoxT <- coxph(Surv(start,stop,event) ~ age + year*transplant + surgery, data=heart) coef exp(coef) se(coef) z p age 0.0299 1.030 0.0137 2.18 0.029 year -0.1539 0.857 0.0715 -2.15 0.031 transplant -0.3106 0.733 0.2656 -1.17 0.240 surgery -0.6641 0.515 0.3681 -1.80 0.071 year:transplant 0.0987 1.104 0.0697 1.42 0.160 ### Very interesting !? old coefficients seem even better now > -2*heartcoxT$loglik[2]+10 #### 589.1035 ### See these results also in Venables and Ripley p. 377: > heartcoxT <- coxph(Surv(start,stop,event) ~ age + strata(transplant)+ year + year:transplant + surgery, data=heart) coef exp(coef) se(coef) z p age 0.0331 1.034 0.0140 2.36 0.018 year -0.1622 0.850 0.0723 -2.24 0.025 surgery -0.6457 0.524 0.3719 -1.74 0.083 year:transplant 0.1140 1.121 0.0718 1.59 0.110 > idtrans <- heart$id[heart$transplant==1] ### 69 unique cases (34 others) > indtrans <- match(heart$id,idtrans) > sum(is.na(indtrans)) [1] 34 > sum(!is.na(indtrans)) [1] 138 ### Positions with NA are the records for NONtransplanted patients ### Positions without NA are records for transplant patients ### Idea would be to display survival curves separately: > plot(survfit(heartcoxT)) ### But could do many other data-displays as well for separate group. ==================================================================== (V) Further data display and analysis concerning lack of Cox-model fits in the Dialysis dataset treated above. =================================================================== ### Let us proceed by plotting ratios of (cumulative) hazards for the 2 treatment groups in Dailys.fram ## We make use here of a utility program created to allow piecewise-contant interpolation between given values of survival or cumulative hazard functions > Jfunc function(newx, oldx, oldy, initx = 0, inity = 1) { ### evaluates right-continuous step-function ### with values c(inity,oldy) at c(initx,oldx) at the ### points newx which may be between values of oldx ### Assume oldx sorted, all > initx . pts <- c(initx, oldx) vals <- c(inity, oldy) ## want to know for each element of newx how many of the ## values pts are <= . rankvec <- c(outer(newx, pts, function(x, y) x >= y) %*% rep(1, length(pts))) ## now get new set of y-values by simple evaluation! vals[rankvec] } ## Recall that "srvdial" found from stratified survfit in (I) above ## gave estimated survival-function values separately for each of ## the two groups (23 values for Gp1, then 24 for Gp 2). > plot(srvdial$time[1:23],srvdial$surv[1:23]) lines((1:130)/5, Jfunc((1:130)/5, srvdial$time[1:23],srvdial$surv[1:23])) ### This just tests the function and shows that the group-1 KM function previously plotted in "TwoKMDial.ps" ### Now we do the same for the two groups' separate cumulative hazard ### functions, analogous to "TwHazDial.ps" > plot(srvdial$time[1:23], -log(srvdial$surv[1:23]), pch=1) points(srvdial$time[24:47],-log(srvdial$surv[24:47]), pch=2) lines((1:150)/5, Jfunc((1:150)/5, srvdial$time[1:23],-log(srvdial $surv[1:23]),inity=0), lty=1) lines((1:150)/5, Jfunc((1:150)/5, srvdial$time[24:47],-log(srvdial $surv[24:47]),inity=0), lty=3) ### This all worked very nicely, now plot the hazard ratio: > tmph1 <- Jfunc((1:150)/5, srvdial$time[1:23],-log(srvdial $surv[1:23]),inity=0) tmph2 <- Jfunc((1:150)/5, srvdial$time[24:47],-log(srvdial $surv[24:47]),inity=0) plot((1:150)/5, tmph1/tmph2, xlab="Time", ylab="Ratio of Cum Hazards", main= "(Cum-) Hazard ratios for Gp1 over Gp2, Dialysis data") printgraph(file="DialHazrat.ps") =============================================================== (VI) Since we can see that the Cox-model did not in fact fit in the Dialysis-Data Example, let's look at sopm,e lack-of-fit diagnostics. ## Recall that srvdialA$time gives time-sequence ## and srvnuis the corresponding nuisance (cum) hazard from Cox model > plot(srvdialA$time,srvnuis) lines((1:150)/5, Jfunc((1:150)/5, srvdialA$time, srvnuis, inity=0), lty=3) > mtgrsd.dial <- residuals.coxph(dialcox) ### martingale residuals ### given as a long vector in the same order as orginal data-frame > plot(cumsum(mtgrsd.dial)) printgraph(file="DialMtgRsd.ps") ### what we would use as variance is: ### cumsum(Dialys.fram$Infec-mtgrsd.dial) > plot(cumsum(mtgrsd.dial)/sqrt(cumsum(Dialys.fram$Infec-mtgrsd.dial )), xlab="Time", ylab="Standardized Cum residual", main= "Dialysis Data, Cox-Model Diag Plot Based on Stand Cum Mtg Resids") printgraph(file="Dial.CumRsd.ps")