Script for Elementary Descriptive Statistics
============================================

The data used for illustration is the "pima" dataset, downloaded from

   http://www.ics.uci.edu/~mlearn/MLRepository.html

which is the web page for a really nice repository of interesting
datasets. (These might be useful as the source for your data project
at the end of the term.) You can also find the dataset (in compressed
form) in the Data directory linked from the course web page. Note that
you must first copy and paste the text file (which you can display on
the web page from the Data directory) into a file "pima.dat" in your
workspace. The gzipped file cannot be read directly using INFILE and
INPUT.

   data pima ;
      infile "pima.dat";
      input Obs pregnant glucose diastolic triceps
            insulin bmi diabetes age Diab ;
      if _N_ > 1;
   run;

The first line of the file contains only column headers, so it is
"bad data"; we remove it by keeping only the records from the 2nd
line on.

The data now consist of 768 records from the National Institute of
Diabetes and Digestive and Kidney Diseases on adult female Pima
Indians. The variables relate to diabetes.

Begin by looking at the dataset using various summary statistics
calculated in SAS.

   proc freq data=pima ;
      table pregnant / nocum;
   run;

                    Freq Table for Pregnancies

                        The FREQ Procedure

               pregnant    Frequency     Percent
               ---------------------------------
                      0         111       14.45
                      1         135       17.58
                      2         103       13.41
                      3          75        9.77
                      4          68        8.85
                      5          57        7.42
                      6          50        6.51
                      7          45        5.86
                      8          38        4.95
                      9          28        3.65
                     10          24        3.13
                     11          11        1.43
                     12           9        1.17
                     13          10        1.30
                     14           2        0.26
                     15           1        0.13
                     17           1        0.13

Issuing the following OPTIONS statement saves a lot of editing of
output from now on: output lines are about as wide as a word-processor
window, and the date is not printed on every page.

   options linesize = 70 nodate;

Now we get selected summary statistics for selected variables in the
dataset. PROC UNIVARIATE with the same DATA and VAR options would
produce far more output than we need here.
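For comparison, the PROC UNIVARIATE call referred to above would look
roughly like the sketch below. It is not part of the original session.
The ODS SELECT line is an added assumption about how one might trim
the listing: it asks for only the Moments and Quantiles tables, and
deleting it gives the full default output.

```sas
/* Sketch only: not part of the original session.               */
/* ODS SELECT restricts the next procedure's printed output to  */
/* the named tables; remove it to see everything.               */
ods select Moments Quantiles;

proc univariate data=pima;
   title "Univariate summary (sketch)";
   var glucose bmi;
run;
```

The script itself uses PROC MEANS with a short list of statistic
keywords instead: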
   proc means data=pima mean median clm
              std q1 q3 min max;
      title "Descriptive stats";
      var glucose bmi ;
   run;

                          Descriptive stats

                         The MEANS Procedure

                                        Lower 95%       Upper 95%
 Variable          Mean        Median   CL for Mean     CL for Mean
 --------------------------------------------------------------------
 glucose    120.8945313   117.0000000   118.6297225     123.1593400
 bmi         31.9925781    32.0000000    31.4340966      32.5510596
 --------------------------------------------------------------------

                                    Lower         Upper
 Variable       Std Dev          Quartile      Quartile       Minimum
 --------------------------------------------------------------------
 glucose     31.9726182        99.0000000   140.5000000             0
 bmi          7.8841603        27.3000000    36.6000000             0
 --------------------------------------------------------------------

 Variable         Maximum
 -----------------------
 glucose      199.0000000
 bmi           67.1000000
 -----------------------

Now let's cross-tabulate two variables: AGE grouped by decade (coded
by the midpoint of the decade) and DIASTOLIC grouped in intervals of
20 (coded by the lower endpoint of the interval).

   data pimatmp;
      set pima (keep = diastolic age);
      diastolic = floor(diastolic/20)*20;
      age = 5 + 10* floor(age/10);
   proc freq;
      table diastolic * age / nocum nopercent ;
      title "Cross-tabulation";
   run;

                          Cross-tabulation

                         The FREQ Procedure

 Table of diastolic by age

 diastolic     age

 Frequency|      25|      35|      45|      55|  Total
 ---------+--------+--------+--------+--------+
        0 |     20 |      9 |      5 |      0 |     35
 ---------+--------+--------+--------+--------+
       20 |      3 |      1 |      0 |      0 |      4
 ---------+--------+--------+--------+--------+
       40 |     66 |     11 |      3 |      0 |     82
 ---------+--------+--------+--------+--------+
       60 |    230 |    101 |     67 |     30 |    442
 ---------+--------+--------+--------+--------+
       80 |     74 |     40 |     37 |     24 |    189
 ---------+--------+--------+--------+--------+
      100 |      2 |      3 |      6 |      3 |     15
 ---------+--------+--------+--------+--------+
      120 |      1 |      0 |      0 |      0 |      1
 ---------+--------+--------+--------+--------+
 Total         396      165      118       57      768
 (Continued)

 Table of diastolic by age

 diastolic     age

 Frequency|      65|      75|      85|  Total
 ---------+--------+--------+--------+
        0 |      0 |      1 |      0 |     35
 ---------+--------+--------+--------+
       20 |      0 |      0 |      0 |      4
 ---------+--------+--------+--------+
       40 |      2 |      0 |      0 |     82
 ---------+--------+--------+--------+
       60 |     13 |      0 |      1 |    442
 ---------+--------+--------+--------+
       80 |     13 |      1 |      0 |    189
 ---------+--------+--------+--------+
      100 |      1 |      0 |      0 |     15
 ---------+--------+--------+--------+
      120 |      0 |      0 |      0 |      1
 ---------+--------+--------+--------+
 Total          29        2        1      768

In class we discuss PROC UNIVARIATE and PROC SORT, and next week a few
graphical PROCs (PLOT, GPLOT, etc.).

   proc univariate data=pimatmp plot ;
      title "Crude Plots";
   run;

This gives many different statistics plus a few key descriptive plots
(histogram, boxplot, and normal probability plot) for each of the two
variables in the dataset. The plots are reproduced below.

                             Crude Plots

                       The UNIVARIATE Procedure
                         Variable:  diastolic

                        Histogram                       #  Boxplot
122.5+*                                                 1     0
     .
     .
     .
     .**                                               15     |
     .                                                        |
     .                                                        |
     .                                                        |
     .*******************                             189  +-----+
     .                                                     |     |
     .                                                     |     |
     .                                                     |     |
 62.5+*********************************************   442  *--+--*
     .                                                        |
     .                                                        |
     .                                                        |
     .*********                                        82     |
     .
     .
     .
     .*                                                 4     0
     .
     .
     .
  2.5+****                                             35     0
      ----+----+----+----+----+----+----+----+----+
      * may represent up to 10 counts

                       Normal Probability Plot
122.5+                                                  *
     |
     |
     |                                                 ++
     |                                             ******
     |                                           +++
     |                                         +++
     |                                      +++
     |                               ***************
     |                                 +++
     |                              +++
     |                            ++
 62.5+               *****************
     |                      +++
     |                    ++
     |                 +++
     |         *******+
     |            ++
     |         +++
     |      +++
     |    ++  **
     | +++
     |+
     |
  2.5+*********
      +----+----+----+----+----+----+----+----+----+----+
          -2        -1         0        +1        +2

                       The UNIVARIATE Procedure
                           Variable:  age

                        Histogram                       #  Boxplot
 87.5+*                                                 1     0
     .
     .*                                                 2     |
     .                                                        |
     .****                                             29     |
     .                                                        |
 57.5+*******                                          57     |
     .                                                        |
     .**************                                  118  +-----+
     .                                                     |     |
     .*******************                             165  |     |
     .                                                     |  +  |
 27.5+********************************************    396  *-----*
      ----+----+----+----+----+----+----+----+----
      * may represent up to 9 counts

                       Normal Probability Plot
 87.5+                                                  *
     |
     |                                                  *
     |
     |                                          *********
     |                                                +++
 57.5+                                     ******+++++
     |                                       ++++
     |                               *******+
     |                              +++++
     |                         *******
     |                      ++++
 27.5+**************************
      +----+----+----+----+----+----+----+----+----+----+
          -2        -1         0        +1        +2
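
The PROC MEANS output above reports a minimum of 0 for both glucose
and bmi, and 35 diastolic readings fall in the lowest interval (0 to
19). Values of exactly 0 are not physiologically plausible for these
measurements, so one possible reading is that 0 was recorded where a
measurement was missing. The script does not say this, so the sketch
below is only an illustration under that assumption. The dataset name
pimaclean and the title text are hypothetical.

```sas
/* Sketch only: assumes a recorded 0 stands for a missing      */
/* measurement. The dataset name pimaclean is hypothetical.    */
data pimaclean;
   set pima;
   if glucose   = 0 then glucose   = .;   /* . is SAS numeric missing */
   if bmi       = 0 then bmi       = .;
   if diastolic = 0 then diastolic = .;
run;

/* N and NMISS show how many values were kept and how many set aside. */
proc means data=pimaclean n nmiss mean median std min max;
   title "Descriptive stats, zeros treated as missing";
   var glucose bmi diastolic;
run;
```

Comparing this listing with the earlier one would show how much the
zeros pull down the means and the lower quartiles, if the assumption
holds.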