Stat 430 Homework Assignments, Spring 2009

First Homework Assignment, counting 10 points, due Wednesday, Feb. 4, 2009.

Get SAS running on a PC, WAM, or Unix platform. Using the example programs on pages 3 and 7 of the text as templates, as well as copy-and-paste operations from your favorite word-processor, write a SAS program to input the first 12 data-lines of the dataset "nature02801-s2.dat", columns 2-11 ("Archip" through "LatitS"). [Find the dataset in the data-directory "http://www.math.umd.edu/~evs/s430.old/Data".] Use Proc Sort to sort the data in increasing order of "LatitS", and print the results using Proc Print with an appropriate title. Using Proc Means, calculate and print Means, Minima, and Standard Deviations for the "Area" and "Elev" column entries of the data you have entered, and (using Proc Freq) Frequency Tables for "Isol_25" and "Deforstn". Edit the program(s) and output together into a single document, showing the lines of code and the relevant output produced by SAS. One good way to do this would be to create a page of code, a page of SAS-generated output (condensed from multiple pages of output), and some lines of explanation, either interspersed or on a separate page.

Second Homework Assignment, counting 15 points, due Monday, Feb. 16, 2009. Hand in 10 pages maximum.

(I) Consider the data set pima.dat of personal characteristics, body measurements, and indicators of diabetes for 768 Pima Indian women, which can be found in the Data directory.
(a) Make several histograms of the diastolic (diastolic blood pressure) variable, with number of categories ("levels") varying from 10 to 80, using GCHART. Which histogram seems the best at describing the distribution of the data? Explain briefly in words what criteria you used in choosing one number of levels as best, and hand in only your best histogram.
(b) Make a histogram (just one, the best you can) of the diabetes variable in the pima dataset using the same procedure as in (a). What differences are there between the best histogram from (a) and the one in (b)? Describe these briefly, using no information other than what you see in the pictures.

(II) Make a histogram, with the same number of cells as in (I)(b), of the logarithms base 10 of the diabetes variable values.
(c) What can you say about the differences in the detail & pattern of the data that are displayed in this histogram by comparison with the one in (I)(b) ?
(d) Which summary statistics, if any, that you can compute from PROC MEANS with these data give different useful information about the PIMA diastolic and log10(diabetes) data?

(III) Group the PIMA data into 5 Age groups of roughly equal size, and create and compare boxplots (in a single picture, side-by-side, using either PROC UNIVARIATE or PROC BOXPLOT) for the log10(diabetes) values for members of these groups. What does the comparison of the boxplots tell you ? Is the information more or less interpretable than a simple scatterplot of log10(diabetes) vs Age which you can create using PROC GPLOT ?
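
The grouping step in (III) amounts to choosing four cut points at the 20%, 40%, 60%, and 80% quantiles of Age. The following is only a sketch of that logic in Python (the assignment itself is to be done in SAS); the ages are made-up values, not the Pima data, and the function names are hypothetical.

```python
from bisect import bisect_left

# Hypothetical ages, NOT the Pima data; used only to show the grouping logic.
ages = [33, 21, 45, 22, 60, 26, 22, 29, 52, 24, 31, 41, 25, 36, 28]

def quantile_cutpoints(values, k):
    """Return k-1 cut points: the j-th is the smallest observed value
    with at least a fraction j/k of the data at or below it."""
    s = sorted(values)
    n = len(s)
    return [s[(j * n + k - 1) // k - 1] for j in range(1, k)]

def group_of(value, cuts):
    """Group number 1..k: one plus the number of cut points strictly below value."""
    return bisect_left(cuts, value) + 1

cuts = quantile_cutpoints(ages, 5)
age_group = [group_of(a, cuts) for a in ages]
sizes = [age_group.count(g) for g in range(1, 6)]
print("cut points:", cuts)
print("group sizes:", sizes)
```

Because every woman of a given age falls on the same side of each cut point, tied ages always land in the same group; with many ties the groups come out only roughly equal in size, which is all the problem asks for.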

(IV). Write a single Table using SAS that will contain the MEAN, MEDIAN, and Standard Deviation of log10(diabetes) for all of the 5 Age-Groups you created from the PIMA data in problem (III).

(V). Construct a normal probability plot for diastolic and log10(diabetes) in the PIMA dataset. Comment on any departures from normality observed, and comment on whether or not these departures correspond to features seen in the best histograms in parts (I)-(II).

Third Homework Assignment, counting 15 points, due Monday, Mar. 2, 2009. Hand in 10 pages maximum.

(I). The dataset students contains data from the survey described in
Chase, M. A., and Dummer, G. M. (1992), "The Role of Sports as a Social Determinant for Children,"
     Research Quarterly for Exercise and Sport, 63, pp. 418-424.
The survey set out to investigate the concept of `popularity' among public school students.
     (a) Prepare a frequency table for the variable SCHOOL. Indicate whether any particular schools seem
to have been sampled from more or less than most.
     (b) Prepare two vertical bar charts showing how MONEY is related to LOCALE, using the SUBGROUP
option with variable LOCALE. Repeat, reversing the roles of MONEY and LOCALE in the vertical bar chart,
and indicate how the two variables appear to be dependent. Which plot seems to indicate the relationship more clearly?
     (c) Repeat (b) using the option GROUP instead of SUBGROUP. How do these plots compare with those in (b)
in terms of illustrating the dependence?

(II). In studies of the placebo effect, it has been established that nausea can arise after a medicine is taken by mouth
even though there is no physical cause for the distress. As a result, a placebo (a tablet consisting of inert material
and free of the drug) is given to half of the patients in a drug trial. The patients have no idea if they are getting
a placebo or not, and their response (nausea or no nausea) is recorded after the dose. The results are given in
the following table:

                  Nauseated    Not Nauseated
Drug Given           15             35
Placebo Given         4             46

(a) Is there evidence of an association between nausea and the taking of the drug? Explain which statistic(s) you
used, and give the associated p-value(s). Remember to edit the rows of the table together as indicated above.
(b) What does the odds ratio (and confidence interval on it) indicate about the relationship between the two
variables in the table that you prepared? Note that it may be less than or greater than 1 (so that its logarithm may be
negative or positive), depending on the order in which SAS orders the levels of your categorical variable in the table.
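
As a hand check on whatever SAS reports for this table, the arithmetic behind the Pearson chi-square statistic and the odds ratio can be sketched in a few lines of Python. This is an illustration only, not the required SAS analysis: it assumes no continuity correction and a large-sample (Wald) interval on the log odds ratio, and the variable names are hypothetical.

```python
import math
from statistics import NormalDist

# Counts from the table above: rows = treatment, columns = (nauseated, not nauseated).
a, b = 15, 35   # drug given
c, d = 4, 46    # placebo given
n = a + b + c + d

# Pearson chi-square statistic for a 2x2 table (no continuity correction).
chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

# Upper-tail probability of a chi-square variable with 1 degree of freedom.
p_value = math.erfc(math.sqrt(chi2 / 2))

# Odds ratio with a 95% Wald interval computed on the log scale.
odds_ratio = (a * d) / (b * c)
se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
z = NormalDist().inv_cdf(0.975)
ci_low = math.exp(math.log(odds_ratio) - z * se_log_or)
ci_high = math.exp(math.log(odds_ratio) + z * se_log_or)

# Listing the rows in the other order gives the reciprocal odds ratio.
odds_ratio_swapped = (c * b) / (d * a)

print(f"chi-square = {chi2:.3f}, p = {p_value:.4f}")
print(f"odds ratio = {odds_ratio:.3f}, 95% CI = ({ci_low:.3f}, {ci_high:.3f})")
print(f"odds ratio with rows swapped = {odds_ratio_swapped:.3f}")
```

The last line illustrates the remark about level ordering: reversing the order of the rows replaces the odds ratio by its reciprocal, so its logarithm changes sign while the strength of association is unchanged.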

(III) The data set home contains data on Albuquerque housing prices based on a random sample of over 100 homes
sold Feb 15 to Apr 30, 1993. The data were obtained from the Albuquerque Board of Realtors. Note that you will
need to replace the asterisks with periods in order for SAS to process the datalines properly.
(a) Create a scatterplot of TAXES versus FEATS, with separate plotting characters (e.g. circle, square, etc.) for the
four different classes of houses defined by combinations (0,0), (1,0), (0,1), (1,1) of (COR,NE). Does this plot tell
you anything about whether the relationship between TAXES and FEATS is different in the four different classes
of houses defined by (COR,NE) ?
(b) Create side-by-side boxplots of TAXES for the 4 (COR,NE) groups, and of FEATS for the 4 (COR,NE) groups,
to see whether these groups differ from each other in their Taxes and their numbers of Features. What do you
conclude from these pictures ?
(c) Break the set of houses into three groups according to whether they have LOW, MEDIUM, or HIGH taxes.
(Use quantiles of the TAXES variable to do this.) Then use a Chi-square test of Row-Column independence,
calculated through PROC FREQ, to determine whether there is any relationship between the tax group and the
COR status of the houses. Explain your conclusions.
(d) Plot the Empirical Distribution Function of the TAXES variable for these data, and use the graph you
produce to estimate the 0.6 quantile of TAXES.
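
For part (d), the empirical distribution function at x is the fraction of observations less than or equal to x, and a quantile is read off the step plot as the first x at which the steps reach the desired height. A minimal Python sketch of that reading, using made-up tax values rather than the Albuquerque data and hypothetical function names:

```python
# Hypothetical tax values, NOT the Albuquerque data.
taxes = [420, 515, 560, 610, 640, 700, 735, 810, 905, 1150]

def edf(sample, x):
    """Empirical distribution function: fraction of observations <= x."""
    return sum(v <= x for v in sample) / len(sample)

def edf_quantile(sample, p):
    """Smallest observed value x whose EDF height is at least p."""
    for x in sorted(sample):
        if edf(sample, x) >= p:
            return x
    raise ValueError("p must lie in (0, 1]")

q60 = edf_quantile(taxes, 0.6)
print("EDF at 640:", edf(taxes, 640))
print("estimated 0.6 quantile:", q60)
```

Other quantile conventions interpolate between neighboring order statistics, so a value read from a graph or reported by software may differ slightly from this one.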

Fourth Homework Assignment, counting 15 points, due Wednesday, Mar. 25, 2009. Hand in 10 pages maximum.

(I). Input the data set "nature02801-s2.dat" from the web-page data directory (the Polynesian islands dataset
frequently mentioned in class).
   (a) Exactly 12 of the island observations occur in consecutive "pairs", with Islnd variable ending in L for
the first observation, and in W for the second. These 12 pairs of observations actually each come from the SAME
island, respectively on the L=Leeward and W=Windward side. Use SAS to perform a t-test on the (natural) logarithms
of Rainfall for the L observations versus the W to see whether there is a difference in average log-rainfall between
Lee and Windward. Which kind of t-test do you think is more appropriate, two-sample pooled-variance or
matched-pairs ? Why ? Interpret your results.
   (b) Note that the pairs of island-observations found in (a), which really correspond to the same island,
have identical values for many island-attribute variables, such as Elev and LatitS. Remove the duplicate
observations from the dataset (leaving only 56 distinct islands), and break them into two groups according to
whether the LatitS variable is positive (which actually means the island is south of the equator) or negative.
Do a t-test to check whether the average logarithm of the Elev variable for the resulting dataset is different for
islands North vs. South of the equator.
   (c) In both parts of the problem, say whether you think the formal t-tests you did look reasonable based
on the histograms of the log(Rain) and log(Elev) variables in the separate datasets used for (a) and (b).
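
The two t statistics contrasted in part (a) differ only in how the standard error is formed: the pooled version treats the two samples as independent, while the matched-pairs version works with the within-island differences. The sketch below shows both computations in Python on made-up log-rainfall values for four islands (not the course data); the function names are hypothetical, and the t-tests for the assignment should still be run in SAS.

```python
import math
from statistics import mean, stdev

# Hypothetical log-rainfall values for 4 islands, NOT the course data.
lee = [7.0, 7.4, 6.8, 7.9]    # leeward side
wind = [7.5, 7.8, 7.4, 8.3]   # windward side of the same islands, same order

def pooled_t(x, y):
    """Two-sample t statistic with pooled variance (df = nx + ny - 2)."""
    nx, ny = len(x), len(y)
    sp2 = ((nx - 1) * stdev(x) ** 2 + (ny - 1) * stdev(y) ** 2) / (nx + ny - 2)
    return (mean(x) - mean(y)) / math.sqrt(sp2 * (1 / nx + 1 / ny))

def paired_t(x, y):
    """Matched-pairs t statistic based on the differences (df = n - 1)."""
    diffs = [a - b for a, b in zip(x, y)]
    return mean(diffs) / (stdev(diffs) / math.sqrt(len(diffs)))

t_pooled = pooled_t(lee, wind)
t_paired = paired_t(lee, wind)
print(f"pooled t = {t_pooled:.2f}, paired t = {t_paired:.2f}")
```

In this invented example the island-to-island variation is large compared with the lee/windward differences, so the two statistics come out very different in size; whether the same happens with the real data is for you to determine.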

(II). Simple linear regression analysis makes best sense when the response and the predictor are linearly related
(i.e. the observations, when plotted, seem to lie bunched along a straight line). The following datasets are to be analyzed:
(i) The 3 small datasets in the file "Anscombe". These were developed by the statistician Frank Anscombe to
illustrate why it is a really, really bad idea to fit regression models without using plots to help interpret them.
(ii) The 4 small datasets in the file "transform", indicating some naturally occurring linear and non-linear relationships.
a) Make scatterplots for each of the Anscombe datasets. Use PROC CORR to find the correlation between the
response and the predictor. Comment upon what the results suggest about the reliability of looking at summary
statistics like correlations alone to establish the existence of a linear trend.
b) For the datasets transform, make scatterplots and decide if the plot indicates a linear pattern or not. If you think
that the relationship between the two variables is non-linear, then apply an appropriate transformation. You can try
various combinations of log(z), 1/z, sqrt(z), and the untransformed variables. Here z can denote either the response
or the predictor. Comment on any unusual patterns observed, and present only plots of the original data and the one
in which you use the best transformation (if one is needed at all). The best transformation should produce the most
linear pattern over the entire range of the predictor.

(III) Using the ASCII dataset cigcancer.dat in the Data directory,
(i) Find the correlations among all of the cancer rates and the partial correlations of the same cancer rates after
removing the effect of Cigarette smoking.
(ii) Not all cancers seem to have much to do with cigarette smoking. Based on what you found in (i), which
cancers would you say have rates most related to smoking ?
(iii) After removing the effect of cigarette smoking by creating an output file of residuals, assess the remaining
(linear) dependence, if any, among the rates of the four types of cancer. Would you say that significant dependence
remains ? If so, can you guess what it might be due to ? (This is a non-statistical question.)
(iv) Try plotting the LUNG versus BLADDER cancer-rate residuals after first removing the effect of cigarette
smoking. I could not make much sense of this scatterplot directly. But now try plotting (either with separate plotting
characters or in separate pictures, or by marking the points by hand within your scatterplot) to distinguish the
"urban" states from the others. (My "urban" list consisted of: "CA" "CT" "DE" "DC" "FL" "IL" "MD" "MA" "NJ"
"NY" "OH" "PE" "RI" "WI".) NOW what does the scatterplot suggest ?

Fifth Homework Assignment, counting 15 points, due Monday, April 13, 2009. Hand in 10 pages maximum.

(I) Do problems 2 and 3 on the Partial Correlation worksheet (within the Partial Correlation Handout in the Handouts
web-page directory and referenced in the "current reading" page.)
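
As a reminder of what a partial correlation measures, here is a minimal Python sketch (illustration only; in SAS the PARTIAL statement of PROC CORR does this directly). It computes the partial correlation of x and y given z in two ways: as the ordinary correlation between the residuals of x and y after each is regressed on z, and by the usual closed-form expression in the three pairwise correlations. The data values and function names are hypothetical.

```python
import math
from statistics import mean

def pearson(x, y):
    """Ordinary Pearson correlation between two equal-length lists."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def residuals(y, z):
    """Residuals of y after a least-squares straight-line fit on z."""
    mz, my = mean(z), mean(y)
    slope = (sum((c - mz) * (b - my) for c, b in zip(z, y))
             / sum((c - mz) ** 2 for c in z))
    return [b - (my + slope * (c - mz)) for c, b in zip(z, y)]

def partial_corr(x, y, z):
    """Correlation of x and y after removing the linear effect of z from both."""
    return pearson(residuals(x, z), residuals(y, z))

# Hypothetical values, chosen only to exercise the functions.
z = [18.0, 22.0, 25.0, 28.0, 30.0, 24.0]
x = [12.0, 17.0, 20.0, 22.0, 27.0, 19.0]
y = [3.0, 3.9, 4.1, 4.8, 4.6, 4.2]

r_xy, r_xz, r_yz = pearson(x, y), pearson(x, z), pearson(y, z)
via_formula = (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))
via_residuals = partial_corr(x, y, z)
print(round(via_residuals, 4), round(via_formula, 4))
```

The two printed numbers agree up to rounding error, which is the identity that justifies "removing the effect" of a variable by working with residuals.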

(II) The data set Forbes was obtained by a 19th century physicist who wanted to provide tables allowing altitude to be
measured based on the boiling point of water. This is not as silly as it sounds, as in those days altimeters were large and
sensitive barometers, while thermometers, though breakable, were small and robust, and much more easily hauled up
remote mountains. The dataset consists of measurements of the boiling point of water in degrees Fahrenheit, and
measurements of air pressure in inches of mercury.
(a) Construct a scatterplot for the air pressure as a function of temperature. Run a regression on the observations
and comment on evidence of outliers and non-linearity.
(b) Knowing that, to a first approximation, pressure is proportional to exp(β T) where T is temperature,
make an appropriate transformation on pressure so that a linear model can be properly fit. Make a new scatterplot
to confirm the linear pattern.
(c) Remove the outlier, run a new regression, and make a scatterplot with regression line and a residual plot.
(d) Use the fitted line to provide a 99% prediction interval for the transformed pressure at temperatures of
200.5 degrees and 150 degrees. [Hint: most of what you need to calculate the prediction interval can be found by running
PROC MEANS on the predictor]. Why is the second estimate not to be trusted? What further observations would you
need to make in order to have any confidence in it?
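
The hint in (d) refers to the standard prediction-interval formula for simple linear regression, written out here for reference. The notation is an assumption of this note, not taken from the text: n is the number of observations used in the fit (after the outlier is removed), b_0 and b_1 are the fitted intercept and slope, s is the root mean squared error from the regression, and x-bar and s_x are the mean and standard deviation of the predictor (temperature) as reported by PROC MEANS.

```latex
\hat{y}_0 \;\pm\; t_{0.995,\,n-2}\; s\,
\sqrt{\,1 + \frac{1}{n} + \frac{(x_0-\bar{x})^2}{(n-1)\,s_x^2}\,},
\qquad \hat{y}_0 = b_0 + b_1 x_0 .
```

Here x_0 is the temperature at which the prediction is wanted (200.5 or 150 degrees), and (n-1) s_x^2 is the corrected sum of squares of the predictor. The interval is for the transformed pressure, and its validity rests on the fitted linear model holding at x_0.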

Sixth Homework Assignment, counting 15 points, due Friday, April 24, 2009. Hand in 12 pages maximum.
Do the three problems in the HW6 pdf file.

Seventh and Last Homework Assignment, counting 15 points, due Friday, May 8, 2009. Hand in 12 pages maximum.
Do the three problems in the HW7 pdf file.

© Eric V Slud, Apr. 25, 2009.