Stat 430 Sample Test for in-class Midterm F06

 INSTRUCTIONS: The test will be closed-book, but 
you may use a one- or two-sided 8.5" by 11" sheet of 
notes as a memory aid. You may also bring and use a calculator, 
although you need not simplify numerical expressions 
involving only numbers. (That is, if you get your answer in 
the form of a numerical expression without symbols which 
could be evaluated by keying into a simple calculator, then 
you need not do the arithmetic evaluation.)

NOTE: there are more problems here than I would put on the test.
I will probably put 3 like these on the test.

(I) Start with "data"

x    1    1    3    2    1    2    1    3
y    1    2    5    3    2    4    1    6

You are given the calculated results ybar= (1/8) sum y_i = 3, 
xbar= (1/8) sum x_i = 1.75, ssq.x = sum (x_i-xbar)^2 = 5.5,  
ssq.y = sum (y_i-ybar)^2 = 24, 
cov.xy = sum (y_i-ybar)*(x_i-xbar) = 11.

(a) What is the (sample) correlation between the x and 
y data vectors?

(b) What are the least-squares estimated intercept and 
slope parameters for the model  y=a+bx  based on these 
data?

(c) Find the projection of the vector  y - ybar * 1_n  
on the vector  x - xbar * 1_n , where  1_n  denotes the 
vector (of dimension  n=8) with all entries 1.  What 
is its interpretation in the setting of simple linear 
regression?

(d) Assume that a SAS dataset XYdat, with the two given 
vectors x and y as columns, resides in your workspace 
in a SAS session. Write a little SAS program to create 
and print the projection vector defined in (c).
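
As a numerical check on parts (a)-(c), the summary quantities given above can be recomputed directly from the eight data pairs. The following is a minimal Python sketch, not the SAS program requested in (d); all variable names are illustrative assumptions.

```python
import math

# Data from problem (I)
x = [1, 1, 3, 2, 1, 2, 1, 3]
y = [1, 2, 5, 3, 2, 4, 1, 6]
n = len(x)

xbar = sum(x) / n
ybar = sum(y) / n

# Centered vectors  x - xbar*1_n  and  y - ybar*1_n
xc = [xi - xbar for xi in x]
yc = [yi - ybar for yi in y]

ssq_x = sum(v * v for v in xc)
ssq_y = sum(v * v for v in yc)
cov_xy = sum(u * v for u, v in zip(xc, yc))

# (a) sample correlation
r = cov_xy / math.sqrt(ssq_x * ssq_y)

# (b) least-squares slope and intercept for y = a + b*x
b = cov_xy / ssq_x
a = ybar - b * xbar

# (c) projection of centered y onto centered x:
#     (<yc, xc> / <xc, xc>) * xc  =  b * xc
proj = [b * v for v in xc]

print(r, a, b)
print(proj)
```

Because the projection coefficient  cov.xy / ssq.x  is the same ratio as the least-squares slope, each entry of the projection is the fitted value minus ybar.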


(II) Consider the  2x2  contingency table
               Sick   Not   Row-tot
      Exposed   50    150    200
      Not      200   1600   1800
      Col-Tot  250   1750   2000

(a) Calculate the McNemar test statistic and 
explain what null hypothesis it tests and 
what distribution its value is compared to in 
deciding acceptance or rejection.
(b) Calculate the (Pearson) Chi-squared test statistic 
for row-column independence for these data, and 
explain clearly what null hypothesis it tests 
and what statistical distribution its value is 
compared to in deciding acceptance or rejection.
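
The arithmetic for both statistics can be checked with a few lines of code. The following is a minimal Python sketch under stated assumptions: it uses the usual textbook forms without continuity correction, namely (b-c)^2/(b+c) for McNemar and the sum of (observed-expected)^2/expected for Pearson. The variable names are illustrative.

```python
# Cell counts from the 2x2 table in problem (II)
a, b = 50, 150      # Exposed:      Sick, Not
c, d = 200, 1600    # Not exposed:  Sick, Not
n = a + b + c + d

# McNemar statistic: uses only the two off-diagonal cells
mcnemar = (b - c) ** 2 / (b + c)

# Pearson chi-squared: compare each observed count with the count
# expected under independence, (row total)*(column total)/n
rows = [a + b, c + d]
cols = [a + c, b + d]
observed = [[a, b], [c, d]]
chisq = 0.0
for i in range(2):
    for j in range(2):
        expected = rows[i] * cols[j] / n
        chisq += (observed[i][j] - expected) ** 2 / expected

print(mcnemar, chisq)
```

For a 2x2 table, each statistic is referred to a chi-squared distribution with 1 degree of freedom, although the two null hypotheses differ.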


(II') In the setting of (II), give a short SAS program
to input the data into an appropriate SAS dataset and 
perform (a) a McNemar test, and (b) a chi-square test 
for independence. Explain in a few sentences what you 
would be assuming about the data for each of these two
tests, and exactly what the null hypothesis in each 
case says about some statistical parameter.

Is there any 2x2 table setting -- not necessarily an 
exposure-illness study like this one --  where it 
would make sense to test (separately) BOTH of the 
hypotheses for the tests (a), (b)?


(III) I have two SAS datasets, Cust1 and Cust2, listing 
some information about customers of my retail business.
Cust1 has columns: ID  Sales.M1  Age      ,  and 
Cust2 has columns: ID  Sales.M6  Sales.Yr
Here ID is a 6-character unique identifier, the Sales.xxx 
numeric variables give dollar amounts in Month 1, Month 6, 
and for the Year, and Age is numeric. The two datasets
"Cust1" and "Cust2" have some but not all of the same 
customers in them.

For each of the following three tasks, either write a 
little SAS program, or tell the SAS datastep and PROC 
steps you would use to accomplish them using the Cust1, 
Cust2 data.

(a) Find the number of ID's common to both of the Cust1 
and Cust2 datasets.

(b) Find the total number of distinct ID's (without 
duplicates) appearing in the union of the two datasets.

(c) Find the correlation between Sales.M6 and Sales.Yr 
within the subset of Customers whose age is known to be 
at least 50.

(IV) I have three 100-dimensional vectors  V, W, Z with 
entries contained in a SAS dataset  VecData. For each 
of the following three tasks, either write a little SAS 
program, or tell the SAS datastep and PROC steps you 
would use to accomplish them using the dataset  VecData.

(a) Find the correlation between V and W and also the 
partial correlation between V and W  after removing the 
linear effect of the variable Z. 
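
For reference, "removing the linear effect of Z" can be made concrete in two equivalent ways: correlate the residuals of V and W after regressing each on Z, or apply the standard formula in terms of the three pairwise correlations. The following is a minimal Python sketch, not a SAS solution. The short vectors and the helper names `corr`, `residuals`, and `partial_corr` are hypothetical, made up for illustration, and are not the contents of VecData.

```python
import math

def corr(u, v):
    """Sample (Pearson) correlation between two equal-length lists."""
    n = len(u)
    ubar, vbar = sum(u) / n, sum(v) / n
    suv = sum((a - ubar) * (b - vbar) for a, b in zip(u, v))
    suu = sum((a - ubar) ** 2 for a in u)
    svv = sum((b - vbar) ** 2 for b in v)
    return suv / math.sqrt(suu * svv)

def residuals(u, z):
    """Residuals from the simple linear regression of u on z (with intercept)."""
    n = len(u)
    ubar, zbar = sum(u) / n, sum(z) / n
    slope = (sum((a - ubar) * (b - zbar) for a, b in zip(u, z))
             / sum((b - zbar) ** 2 for b in z))
    return [(a - ubar) - slope * (b - zbar) for a, b in zip(u, z)]

def partial_corr(v, w, z):
    """Partial correlation of v and w given z, from pairwise correlations."""
    r_vw, r_vz, r_wz = corr(v, w), corr(v, z), corr(w, z)
    return (r_vw - r_vz * r_wz) / math.sqrt((1 - r_vz ** 2) * (1 - r_wz ** 2))

# Hypothetical illustrative data (NOT from VecData)
V = [2.0, 4.0, 5.0, 7.0, 9.0, 12.0]
W = [1.0, 3.0, 2.0, 6.0, 5.0, 8.0]
Z = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]

p_formula = partial_corr(V, W, Z)
p_resid = corr(residuals(V, Z), residuals(W, Z))
print(p_formula, p_resid)  # the two values agree up to rounding error
```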

(b) Create a histogram of Z observations on the subset 
of records for which the V observations are > 30 .

(c) Create a scatterplot of the residuals from the 
simple linear regression of V on W versus the 
predicted values of V from the same linear regression.

=====================================================
Additional topics I might have asked about:

--- how to create a 2x2 table from data input from 
an ASCII file using categories defined in PROC FORMAT.

--- sorting and use of FIRST.XX and LAST.XX variables.

--- confidence interval and t-test (paired and 
unpaired) from PROC MEANS and PROC TTEST.

--- definition of quantiles and QQplot, use of QQplots 
to check normality, and how to extract quantile 
information from PROC UNIVARIATE.

--- creating and saving SAS files to your home 
directory; concatenating and merging files.