LOG TO ILLUSTRATE REGRESSION MODELLING AND GRAPHICAL TECHNIQUES
IN THE CONTEXT OF SIMULATED DATA                        11/3/06
=============================================
(SO THAT WE CAN KNOW THE 'TRUE' MODELS)

We begin by creating a small simulated SAS data-set which has a
strong quadratic term.

libname home ".";

data home.SIMSET0 (drop = seed I);
 if _N_ = 1 then do;
   SEED = 55;    /* use a non-zero seed to make the generated
                    data-set reproducible --- helps in debugging */
   do i = 1 to 500;
     X = ranuni(seed);
   end;    /* this is just to initialize or "burn in" the simulation */
 end;      /* initialization */
 do I = 1 to 75;
   x = ranuni(seed);
   y = 0.7 + 1.3*X + 0.8*X**2 + 0.8*rannor(seed);
     /* a = 0.7, b1 = 1.3, b2 = 0.8, sigsq = (0.8)^2 */
   output;
 end;   /* Need OUTPUT before "end" in order to have more than 1 OBS !! */

data simset0;
 set home.simset0;
 Xsq = X**2;
 Ymn = 0.7 + 1.3*X + 0.8*Xsq;
run;
/* Of course in a real regression problem we would not know the true mean */

options linesize=70 nocenter;
goptions csymbol=black;
symbol1 V="+";
symbol2 V=NONE I=RLCLI95;
symbol3 V=NONE I=JOIN Line=3;
symbol4 V=diamond;
symbol5 V=NONE I=RLCLI99;

proc sort data=simset0;
 by X;    /* sort first to graph with I=JOIN */
proc gplot data=simset0;
 plot Y * X = 1
      Y * X = 2     /* no "outliers" with 5 in place of 2 !! */
      YMN * X = 3   / overlay;
 title "Simple Regression Model Plots vs True Curve";
run;

NOTE: Regression equation : y = 0.645668 + 2.011291*X.

### We can see that the coefficient of X got inflated because
### of the X^2 term (which was ignored in the model so far).
### There are three points which land slightly outside the
### confidence band CLI95, which is not at all surprising
### because the EXPECTED number of points falling outside is
### 75 * (.05) = 3.75.
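For readers who want to check the slope-inflation effect outside SAS, here is a hypothetical Python analogue of the simulation and simple linear fit above. The RNG differs from SAS's ranuni/rannor, so the fitted numbers will not reproduce the logged values exactly; but for X ~ Uniform(0,1) one can compute that the population least-squares slope is b1 + b2*Cov(X^2,X)/Var(X) = 1.3 + 0.8*1 = 2.1, close to the logged 2.011291.

```python
import numpy as np

# Python sketch of the SAS simulation: 75 points from
#   y = 0.7 + 1.3*X + 0.8*X^2 + 0.8*N(0,1),  X ~ Uniform(0,1).
# Seed 55 is carried over from the SAS log but drives a different
# generator, so the fit only matches the log in expectation.
rng = np.random.default_rng(55)
n = 75
x = rng.uniform(size=n)
y = 0.7 + 1.3 * x + 0.8 * x**2 + 0.8 * rng.standard_normal(n)

# Fit the (misspecified) simple linear model y = a + b*x.
b, a = np.polyfit(x, y, 1)
print(f"y = {a:.4f} + {b:.4f}*x")

# For X ~ U(0,1): Cov(X^2, X) = 1/4 - (1/3)(1/2) = 1/12 and
# Var(X) = 1/12, so the omitted 0.8*X^2 term adds 0.8*1 = 0.8
# to the population slope, inflating it from 1.3 to about 2.1.
```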
### But if we had wanted to identify the points falling
### outside, we can:

proc reg data=simset0;
 model y = X;
 output out=LinRsd0 p=PredL r=RsdL L95=Low U95=Hi
        stdr=StdR Student=StuResW Rstudent=StuResWO;
data BigRsd;
 set LinRsd0 (drop=Xsq);
 if Y < Low or Y > Hi or abs(StuResWO) > 1.96;
proc print;
run;

Simple Regression Model Plots vs True Curve   10:20 Friday, November 3, 2006

 Obs      X          y          Ymn       PredL        Low
  1     0.43740   -0.65496    1.42168    1.52541   -0.26815
  2     0.46659   -0.30014    1.48074    1.58413   -0.20891
  3     0.48727   -0.62646    1.52339    1.62570   -0.16710

 Obs      Hi        RsdL       StdR      StuResW    StuResWO
  1     3.31897   -2.18038    0.88713   -2.45779   -2.54862
  2     3.37716   -1.88427    0.88740   -2.12337   -2.17708
  3     3.41851   -2.25216    0.88751   -2.53761   -2.63926

## One thing we can do to find out whether there is any use in
## including a model term based on Xsq is to get a partial
## correlation:

proc corr data=simset0;
 var Y;
 with Xsq;
 partial X;
run;

Pearson Partial Correlation Coefficients, N = 75
Prob > |r| under H0: Partial Rho=0

                y
   Xsq      0.21307
            0.0683

/* So we do not quite have significance, but this suggests the
   Xsq term will help the fit. */

proc reg data=simset0;
 model y = x xsq;
 output out=Qrsd0 p=PredQ r=RsdQ RStudent=StuRQ;

The REG Procedure
Model: MODEL1
Dependent Variable: y

                      Parameter Estimates

                      Parameter    Standard
 Variable      DF     Estimate       Error    t Value   Pr > |t|
 Intercept      1      1.08035     0.31299       3.45     0.0009
 X              1     -0.48110     1.39129      -0.35     0.7305
 Xsq            1      2.42558     1.31078       1.85     0.0683

### NOTE the last p-value, .0683, is the same as for the partial
### correlation! This is not a coincidence !!!

data Combin (drop=Ymn);
 merge Qrsd0
       LinRsd0 (keep=RsdL StuResWO rename=(StuResWO=StuRL));
 Tru = Ymn - PredQ;
proc gplot;
 plot Tru  * X = 3
      RsdQ * X = 1
      RsdL * X = 4  / overlay;
 title "Quadratic vs Linear Model Residuals";
run;

### This graph shows that the residuals from the Quadratic model
### (pluses) generally but not always lie closer to the horizontal
### 0-line than do the Linear-model residuals (diamonds).
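The "not a coincidence" remark can be verified algebraically: the t statistic for the Xsq coefficient in the two-regressor model satisfies t = r*sqrt(n-3)/sqrt(1-r^2), where r is the partial correlation of Y and Xsq given X, so the two tests share one p-value. A Python sketch of this identity (simulated data with an arbitrary seed, so the values differ from the SAS log, but the identity holds exactly):

```python
import numpy as np

# Demonstrate: partial corr(Y, Xsq | X) and the t-test for Xsq in the
# quadratic regression give the same test statistic.
rng = np.random.default_rng(55)
n = 75
x = rng.uniform(size=n)
y = 0.7 + 1.3 * x + 0.8 * x**2 + 0.8 * rng.standard_normal(n)
xsq = x**2

def resid(v, x):
    """Residuals of v after regressing out an intercept and x."""
    X = np.column_stack([np.ones_like(x), x])
    beta, *_ = np.linalg.lstsq(X, v, rcond=None)
    return v - X @ beta

# Partial correlation = correlation of the two residual vectors.
r = np.corrcoef(resid(y, x), resid(xsq, x))[0, 1]
t_from_r = r * np.sqrt(n - 3) / np.sqrt(1 - r**2)

# Direct t statistic for the Xsq coefficient in y ~ 1 + x + xsq.
X_full = np.column_stack([np.ones(n), x, xsq])
beta, *_ = np.linalg.lstsq(X_full, y, rcond=None)
res = y - X_full @ beta
s2 = res @ res / (n - 3)                       # error variance estimate
cov = s2 * np.linalg.inv(X_full.T @ X_full)    # coefficient covariance
t_direct = beta[2] / np.sqrt(cov[2, 2])

print(t_from_r, t_direct)   # the two t values agree
```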
### Dashed line shows the true mean-curve minus the predicted values.

NOTE that the quadratic-model fitted coefficient of X is negative,
and it should not be !!  This is a small-sample phenomenon which can
occur because X and X-squared are highly correlated:

proc corr data=simset0;
 var xsq;
 with x;
run;     /* corr .9681 */

===========================================================
NOW WE CONTINUE WITH A LARGER SIMULATED DATASET TO ILLUSTRATE
THE USE OF RESIDUALS FALLING OUTSIDE OF EXPECTED RANGES
-------------------------------------------------------
(continuation 11/8/06, using the same OPTIONS and GOPTIONS as before)

libname home ".";

data home.SIMSET1 (drop = seed I);
 if _N_ = 1 then do;
   SEED = 77777;
   do i = 1 to 500;
     X = ranuni(seed);
   end;
 end;
 do I = 1 to 500;
   x = ranuni(seed);
   y = 0.7 + 1.3*X + 0.8*X**2 + 0.8*rannor(seed);
     /* a = 0.7, b1 = 1.3, b2 = 0.8, sigsq = (0.8)^2 */
   output;
 end;

data simset1;
 set home.simset1;
 Xsq = X**2;
 Ymn = 0.7 + 1.3*X + 0.8*Xsq;

proc sort data=simset1;
 by X;
proc reg data=simset1;
 model y = X;
 output out=LinRsd1 p=PredL r=RsdL L95=Low U95=Hi stdr=StdR
        Student=StuResW Rstudent=StuResWO CookD=Cook;

data tmp;
 set LinRsd1;
 LowRsd = Low - PredL;
 HiRsd  = Hi - PredL;
 Zer = 0;
 YmnRsd = Ymn - PredL;

symbol6 V=NONE I=JOIN LINE=2;
proc gplot data=tmp;
 plot StuResWO * X = 4   /* same as StuResW in this case
                            because n is large */
      Zer    * X = 3
      LowRsd * X = 6
      HiRsd  * X = 6
      YmnRsd * X = 6  / overlay;
 title "Studentized Residual Plot, SimSet1";
run;

data tmp2;
 set tmp;
 if abs(StuResWO) > 1.96;
   /* Log window shows this happens 17 times, while the
      expected number = 500 * (.05) = 25 */

proc corr data=tmp;
 var Y;
 with Xsq;
 partial X;
run;
  /* Value .14834 is highly significant, p-value .0009, even
     though the studentized residuals don't show it ! */
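The logged correlation .9681 between X and X-squared is no accident: for X ~ Uniform(0,1), Cov(X, X^2) = 1/12, Var(X) = 1/12, and Var(X^2) = 1/5 - 1/9 = 4/45, so the population correlation is sqrt((1/12)/(4/45)) = sqrt(45/48) ≈ 0.9682. A Python sketch checking this, and showing that centering X before squaring (a standard remedy, not used in the SAS log above) removes most of the collinearity:

```python
import numpy as np

# Population correlation of X and X^2 for X ~ Uniform(0,1):
#   Cov(X,X^2)=1/12, Var(X)=1/12, Var(X^2)=4/45
rho = np.sqrt((1 / 12) / (4 / 45))
print(round(rho, 4))   # 0.9682 -- essentially the logged sample value .9681

# Sample check with n = 500 (seed 77777 echoes the SAS log but feeds
# a different generator, so the sample value is only approximate).
rng = np.random.default_rng(77777)
x = rng.uniform(size=500)
r_raw = np.corrcoef(x, x**2)[0, 1]

# Centering X makes Cov(Xc, Xc^2) = E[Xc^3] = 0 for this symmetric
# distribution, so the sample correlation should sit near 0.
xc = x - x.mean()
r_cent = np.corrcoef(xc, xc**2)[0, 1]
print(round(r_raw, 3), round(r_cent, 3))
```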