Stat 430 Sample Test for in-class Midterm F06

INSTRUCTIONS: The test will be closed-book, but you may use as a memory aid a one- or two-sided 8.5" by 11" sheet of notes. You may also bring and use a calculator, although you need not simplify numerical expressions involving only numbers. (That is, if you get your answer in the form of a numerical expression without symbols which could be evaluated by keying into a simple calculator, then you need not do the arithmetic evaluation.)

NOTE: there are more problems here than I would put on the test. I will probably put 3 like these on the test.

(I) Start with "data"

    x   1  1  3  2  1  2  1  3
    y   1  2  5  3  2  4  1  6

You are given the calculated results

    ybar   = (1/8) sum y_i = 3,
    xbar   = (1/8) sum x_i = 1.75,
    ssq.x  = sum (x_i-xbar)^2 = 5.5,
    ssq.y  = sum (y_i-ybar)^2 = 24,
    cov.xy = sum (y_i-ybar)*(x_i-xbar) = 11.

(a) What is the (sample) correlation between the x and y data vectors ?

(b) What are the least-squares estimated intercept and slope parameters for the model y=a+bx based on these data ?

(c) Find the projection of the vector y - ybar * 1_n on the vector x - xbar * 1_n , where 1_n denotes the vector (of dimension n=8) with all entries 1. What is its interpretation in the setting of simple linear regression ?

(d) Assume that a SAS dataset XYdat, with the two given vectors x and y as columns, resides in your workspace in a SAS session. Write a little SAS program to create and print the projection vector defined in (c).

(II) Consider the 2x2 contingency table

                Sick     Not   Row-tot
    Exposed       50     150       200
    Not          200    1600      1800
    Col-Tot      250    1750      2000

(a) Calculate the McNemar test statistic and explain what null hypothesis it tests and what distribution its value is compared to in deciding acceptance or rejection.

(b) Calculate the (Pearson) Chi-squared test for row-column independence for these data, and explain clearly what null hypothesis it tests and what statistical distribution its value is compared to in deciding acceptance or rejection.
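
A note on the quantities given in Problem (I): they can be reproduced directly from the raw x and y data. The short sketch below is only an arithmetic check, written in Python rather than SAS for convenience. The variable names (ssq_x, ssq_y, cov_xy) are stand-ins for the handout's ssq.x, ssq.y and cov.xy, and the sums are deliberately left undivided by n or n-1, as in the problem statement.

```python
# Raw data from Problem (I).
x = [1, 1, 3, 2, 1, 2, 1, 3]
y = [1, 2, 5, 3, 2, 4, 1, 6]
n = len(x)

# Sample means.
xbar = sum(x) / n
ybar = sum(y) / n

# Centered sums of squares and cross-products
# (raw sums, not divided by n or n-1, matching the handout).
ssq_x = sum((xi - xbar) ** 2 for xi in x)
ssq_y = sum((yi - ybar) ** 2 for yi in y)
cov_xy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))

print(xbar, ybar, ssq_x, ssq_y, cov_xy)  # prints: 1.75 3.0 5.5 24.0 11.0
```

All five values agree with those stated in the problem, so parts (a)-(c) can be worked from the given summaries alone.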

(II') In the setting of (II), give a short SAS program to input the data into an appropriate SAS dataset and perform (a) a McNemar test, and (b) a chi-square test for independence. Explain in a few sentences what you would be assuming about the data for each of these two tests, and exactly what the null hypothesis in each case says about some statistical parameter. Is there any 2x2 table setting -- not necessarily an exposure-illness study like this one -- where it would make sense to test (separately) BOTH of the hypotheses for the tests (a), (b) ?

(III) I have two SAS datasets, Cust1 and Cust2, listing some information about customers of my retail business.

    Cust1 has columns:  ID  Sales.M1  Age
    Cust2 has columns:  ID  Sales.M6  Sales.Yr

Here ID is a 6-character unique identifier, the Sales.xxx numeric variables give dollar amounts in Month 1, Month 6, and for the Year, and Age is numeric. The two datasets "Cust1" and "Cust2" have some but not all of the same customers in them. For each of the following three tasks, either write a little SAS program, or tell the SAS datastep and PROC steps you would use to accomplish them using the Cust1, Cust2 data.

(a) Find the number of ID's common to both of the Cust1 and Cust2 datasets.

(b) Find the total number of ID's (without duplicates) appearing in the union of the two datasets.

(c) Find the correlation between Sales.M6 and Sales.Yr within the subset of Customers whose age is known to be at least 50.

(IV) I have three 100-dimensional vectors V, W, Z with entries contained in a SAS dataset VecData. For each of the following three tasks, either write a little SAS program, or tell the SAS datastep and PROC steps you would use to accomplish them using the dataset VecData.

(a) Find the correlation between V and W and also the partial correlation between V and W after removing the linear effect of the variable Z.

(b) Create a histogram of Z observations on the subset of records for which the V observations are > 30 .

(c) Create a scatterplot of the residuals from the simple linear regression of V on W versus the predicted values of V from the same linear regression.

=====================================================

Additional topics I might have asked about:

--- how to create a 2x2 table from data input from an ASCII file using categories defined in PROC FORMAT.

--- sorting and use of FIRST.XX and LAST.XX variables.

--- confidence interval and t-test (paired and unpaired) from PROC MEANS and PROC TTEST.

--- definition of quantiles and QQplot, use of QQplots to check normality, and how to extract quantile information from PROC UNIVARIATE.

--- creating and saving SAS files to your home directory; concatenating and merging files.