First Homework Assignment, counting 10 points, due Wednesday, Feb. 4, 2009.
Get SAS running on a PC, WAM, or Unix platform. Using the example
programs on pages 3 and 7 of the text as templates, as well as copy-and-paste
operations from your favorite word-processor, write a SAS program to
input the first 12 data-lines of the dataset "nature02801-s2.dat", columns 2-11
("Archip" through "LatitS"). [Find the dataset in the data-directory
http://www.math.umd.edu/~evs/s430.old/Data.] Use Proc Sort to
sort the data in increasing order of "LatitS", and print the results
using Proc Print with an appropriate title. Using Proc Means, calculate
and print means, Min, and Standard Deviations for "Area"
and "Elev" column entries for the data you have entered, and (using Proc Freq)
Frequency Tables for "Isol_25" and "Deforstn". Edit the program(s) and output
together into a single document, showing the lines of code and relevant output
produced by SAS. One good way to do it would be to create a page of code, a
page of SAS-generated output (condensed from multiple pages of output) and
some lines of explanation, either interspersed or on a separate
page.
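The sequence of procedure calls described above can be outlined as follows. This is only a minimal sketch, assuming the first 12 data lines have already been read into a SAS data set; the data set names `islands` and `islands_sorted` are hypothetical, and the variable names are the ones quoted in the assignment.

```sas
/* Assumes WORK.ISLANDS (hypothetical name) already holds the 12      */
/* observations, with variables named as quoted in the assignment.    */
proc sort data=islands out=islands_sorted;
   by LatitS;                    /* ascending order is the default */
run;

title 'First 12 islands, sorted by LatitS';
proc print data=islands_sorted;
run;

proc means data=islands_sorted mean min std;
   var Area Elev;
run;

proc freq data=islands_sorted;
   tables Isol_25 Deforstn;      /* one-way frequency tables */
run;
title;                           /* clear the title afterwards */
```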
Second Homework Assignment, counting 15 points, due Monday,
Feb. 16, 2009. Hand in 10 pages maximum.
(I) Consider the data set pima.dat of personal
characteristics, body measurements, and indicators of diabetes for
768 Pima Indian women, which can be found in the Data directory.
(a) Make several histograms of the
diastolic (diastolic blood pressure) variable, with number of
categories ("levels") varying from 10 to 80, using
GCHART. Which histogram seems the best at describing the distribution
of the data? Explain briefly in words what criteria you used in
choosing one number of levels as best, and hand in only your best
histogram.
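One way to vary the number of histogram categories in GCHART is the LEVELS= option of the VBAR statement. The lines below are a sketch under assumptions: the data set name `pima` is hypothetical, and the variable is assumed to be numeric and named `diastolic` as in the problem statement.

```sas
/* Assumes WORK.PIMA (hypothetical name) with a numeric variable DIASTOLIC. */
proc gchart data=pima;
   vbar diastolic / levels=10;   /* 10 midpoint categories */
   vbar diastolic / levels=40;
   vbar diastolic / levels=80;
run;
quit;
```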
(b) Make a histogram (just one, the best you
can) of the diabetes variable in the pima dataset using the
same procedure as in (a). What differences are there between the best
histogram from (a) and the one in (b)? Describe these briefly, using
no information other than what you see in the pictures.
(II) Make a histogram, with the same number of cells as in (I)(b), of
the logarithms base 10 of the diabetes variable values.
(c) What can you say about the differences in the detail & pattern of the
data that are displayed in this histogram by comparison with the one in
(I)(b)?
(d) Which summary statistics, if any, that you can compute from
PROC MEANS with these data give different useful information about
the PIMA diastolic and log10(diabetes) data?
(III) Group the PIMA data into 5 Age groups of roughly equal size, and create and compare boxplots (in a single picture, side-by-side, using either PROC UNIVARIATE or PROC BOXPLOT) for the log10(diabetes) values for members of these groups. What does the comparison of the boxplots tell you ? Is the information more or less interpretable than a simple scatterplot of log10(diabetes) vs Age which you can create using PROC GPLOT ?
(IV). Write a single Table using SAS that will contain the MEAN, MEDIAN, and Standard Deviation of log10(diabetes) for all of the 5 Age-Groups you created from the PIMA data in problem (III).
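The grouping, boxplot and summary-table steps in (III) and (IV) could be approached along these lines. This is a sketch, not a prescribed solution: the data set names (`pima`, `pima2`, `pima5`) and the variable names `age`, `logdiab` and `agegrp` are assumptions, and PROC RANK is just one possible way to form groups of roughly equal size.

```sas
/* Assumes WORK.PIMA (hypothetical) with numeric AGE and positive DIABETES. */
data pima2;
   set pima;
   logdiab = log10(diabetes);
run;

/* GROUPS=5 assigns group numbers 0..4 based on quintiles of AGE;  */
/* tied ages land in the same group, so sizes are only approximate. */
proc rank data=pima2 groups=5 out=pima5;
   var age;
   ranks agegrp;
run;

proc sort data=pima5;
   by agegrp;                    /* PROC BOXPLOT expects sorted groups */
run;

proc boxplot data=pima5;
   plot logdiab*agegrp;          /* side-by-side boxplots */
run;

proc means data=pima5 mean median std maxdec=3;
   class agegrp;
   var logdiab;
run;
```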
(V). Construct a normal probability plot for diastolic and
log10(diabetes) in the PIMA dataset. Comment on any departures from
normality observed, and comment on whether or not these departures
correspond to features seen in the best histograms in parts (I)-(II).
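Normal probability plots can be requested with the PROBPLOT statement of PROC UNIVARIATE. The sketch below assumes a hypothetical data set `pima` with variables `diastolic` and `diabetes`; `logdiab` is a name introduced here for illustration.

```sas
/* Assumes WORK.PIMA (hypothetical) with DIASTOLIC and positive DIABETES. */
data pima2;
   set pima;
   logdiab = log10(diabetes);
run;

proc univariate data=pima2 noprint;
   /* reference line uses the sample mean and standard deviation */
   probplot diastolic logdiab / normal(mu=est sigma=est);
run;
```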
Third Homework Assignment, counting 15 points, due Monday,
Mar. 2, 2009. Hand in 10 pages maximum.
(I). The dataset students contains data from a survey
described in
Chase, M. A., and Dummer, G. M. (1992), "The Role of Sports as a
Social Determinant for Children," Research Quarterly for
Exercise and Sport, 63, pp. 418-424.
The survey set out to investigate the concept of `popularity' among
public school students.
(a) Prepare a frequency table for the
variable SCHOOL. Indicate whether any particular schools seem to
have been sampled from more or less than most.
(b) Prepare two vertical bar charts showing how MONEY
is related to LOCALE, using the SUBGROUP
option with variable LOCALE.
Repeat, reversing the roles of MONEY and LOCALE in the vertical bar chart,
and indicate how the two variables appear to be dependent. Which plot
seems to indicate the relationship more clearly?
(c) Repeat (b) using the option GROUP
instead of SUBGROUP. How do these plots compare with those in (b) in
terms of illustrating the dependence?
(II). In studies of the placebo effect, it has been
established that nausea can arise after a medicine is taken by mouth
even though there is no physical cause for the distress. As a result,
a placebo (a tablet consisting of inert material and free of the drug)
is given to half of the patients in a drug trial. The patients have no
idea if they are getting a placebo or not, and their response (nausea
or no nausea) is recorded after the dose. The results are given in the
following table:

                 Nauseated   Not Nauseated
Drug Given           15            35
Placebo Given         4            46

(a) Is there evidence of an association between nausea and the
taking of the drug? Explain which statistic(s) you used, and give the
associated p-value(s). Remember to edit the rows of the table together
as indicated above.
(b) What does the odds ratio (and confidence interval on it)
indicate about the relationship between the two variables in the
table that you prepared? Note that it may be greater or less than 1
(equivalently, its logarithm may be positive or negative), depending
on the order in which SAS orders the levels of your categorical
variable in the table.
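A 2x2 table of counts like the one in this problem can be entered as cell counts and analyzed with a WEIGHT statement in PROC FREQ. The following is a sketch under assumptions: the data set name `nausea` and the variable names `treatment`, `response` and `count` are hypothetical, and ORDER=DATA is used only so that the level order matches the table as printed.

```sas
/* Cell counts from the table in the problem; names are hypothetical. */
data nausea;
   input treatment $ response $ count;
   datalines;
Drug    Nausea   15
Drug    NoNausea 35
Placebo Nausea    4
Placebo NoNausea 46
;
run;

proc freq data=nausea order=data;
   weight count;                 /* each row stands for COUNT patients */
   /* CHISQ: chi-square tests (plus Fisher's exact test for 2x2 tables) */
   /* RELRISK: odds ratio and relative risks with confidence limits     */
   tables treatment*response / chisq relrisk;
run;
```

With this level order, the sample odds ratio is (15 x 46)/(35 x 4), about 4.93; listing the levels in the other order gives its reciprocal.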
(III) The data set home contains data on Albuquerque housing
prices based on a random sample of over 100 homes sold Feb 15 to Apr
30, 1993. The data were obtained from the Albuquerque Board of
Realtors. Note that you will need to replace the asterisks with
periods in order for SAS to process the datalines properly.
(a) Create a scatterplot of TAXES versus FEATS, with separate plotting
characters (e.g. circle, square, etc.) for the four different classes of
houses defined by combinations (0,0), (1,0), (0,1), (1,1) of (COR,NE).
Does this plot tell you anything about whether the relationship
between TAXES and FEATS is different in the four different classes
of houses defined by (COR,NE)?
(b) Create side-by-side boxplots of TAXES for the 4 (COR,NE) groups, and
of FEATS for the 4 (COR,NE) groups, to see whether these groups differ
from each other in their Taxes and their numbers of Features. What do
you conclude from these pictures?
(c) Break the set of houses into three groups according to whether
they have LOW, MEDIUM, or HIGH taxes. (Use quantiles of the TAXES
variable to do this.) Then use a Chi-square test of Row-Column
independence, calculated through PROC FREQ, to determine whether there is
any relationship between the tax group and the COR status of the houses.
Explain your conclusions.
(d) Plot the Empirical Distribution Function of the TAXES variable for
these data, and use the graph you produce to estimate the 0.6 quantile
of TAXES.
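Parts (c) and (d) could be set up roughly as below. This is a sketch under assumptions: the data set names `home` and `home3` and the variable name `taxgrp` are hypothetical, and PROC RANK with GROUPS=3 is only one way of forming quantile-based groups.

```sas
/* Assumes WORK.HOME (hypothetical) with numeric TAXES and a 0/1 variable COR. */
proc rank data=home groups=3 out=home3;
   var taxes;
   ranks taxgrp;                 /* 0 = LOW, 1 = MEDIUM, 2 = HIGH */
run;

proc freq data=home3;
   tables taxgrp*cor / chisq;    /* chi-square test of independence */
run;

proc univariate data=home noprint;
   cdfplot taxes;                /* empirical distribution function */
run;
```

Observations with a missing TAXES value receive a missing group and are left out of the PROC FREQ table by default.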
Fourth Homework Assignment, counting 15 points, due Wednesday,
Mar. 25, 2009. Hand in 10 pages maximum.
(I). Input the data set "nature02801-s2.dat" from the web-page
data directory (the Polynesian islands dataset frequently mentioned
in class).
(a) Exactly 12 of the island observations occur in consecutive "pairs",
with Islnd variable ending in L for the first observation, and in W for
the second. These 12 pairs of observations actually each come from the
SAME island, respectively on the L=Leeward and W=Windward side. Use SAS
to perform a t-test on the (natural) logarithms of Rainfall for the L
observations versus the W to see whether there is a difference in
average log-rainfall between Lee and Windward. Which kind of t-test do
you think is more appropriate, two-sample pooled-variance or
matched-pairs? Why? Interpret your results.
(b) Note that the pairs of island-observations found in (a), which really
correspond to the same island, have identical values for many
island-attribute variables, such as Elev and LatitS. Remove the duplicate
observations from the dataset (leaving only 56 distinct islands), and
break them into two groups according to whether the LatitS variable is
positive (which actually means the island is south of the equator) or
negative. Do a t-test to check whether the average logarithm of
the Elev variable for the resulting dataset is different for
islands North vs. South of the equator.
(c) In both parts of the problem, say whether you think the formal t-tests
you did look reasonable based on the histograms of the log(Rain) and
log(Elev) variables in the separate datasets used for (a) and (b).
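For reference, the two PROC TTEST syntaxes that parts (a) and (b) choose between look like this. The sketch does not settle which test is appropriate in (a). All data set names (`pairs`, `islands56`, and those derived from them) and the variables `rain_l`, `rain_w`, `south` are hypothetical; only `Elev` and `LatitS` come from the assignment.

```sas
/* Matched-pairs form. Assumes WORK.PAIRS (hypothetical): one row per */
/* island, with leeward and windward rainfall in RAIN_L and RAIN_W.   */
data pairs_log;
   set pairs;
   logL = log(rain_l);           /* LOG is the natural logarithm */
   logW = log(rain_w);
run;

proc ttest data=pairs_log;
   paired logL*logW;
run;

/* Two-sample form. Assumes WORK.ISLANDS56 (hypothetical): one row    */
/* per distinct island, with ELEV and LATITS.                         */
data north_south;
   set islands56;
   logElev = log(Elev);
   south = (LatitS > 0);         /* 1 = south of the equator, 0 = north */
run;

proc ttest data=north_south;
   class south;
   var logElev;                  /* output includes pooled and Satterthwaite tests */
run;
```

A two-sample analysis for part (a) would use the same CLASS/VAR pattern on a data set with one row per observation and a Lee/Windward indicator.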
(II). Simple linear regression analysis makes best sense
when the response and the predictor are linearly related
(i.e. the observations, when plotted, seem to lie bunched along a
straight line). The following datasets are to be analyzed:
(i) The 3 small datasets in the file "Anscombe". These were developed by
the statistician Frank Anscombe to illustrate why it is a really,
really bad idea to fit regression models without using plots to
help interpret them.
(ii) The 4 small datasets in the file "transform", indicating some
naturally occurring linear and non-linear relationships.
a) Make scatterplots for each of the Anscombe
datasets. Use PROC CORR to find the correlation between the
response and the predictor. Comment upon what the results suggest about
the reliability of looking at summary statistics like correlations
alone to establish the existence of a linear trend.
b) For the datasets transform, make scatterplots and decide if the
plot indicates a linear pattern or not. If you think that the
relationship between the two variables is non-linear, then apply an
appropriate transformation. You can try various combinations of log(z),
1/z, sqrt(z), and the untransformed variables. Here z can denote
either the response or the predictor. Comment on any unusual patterns
observed, and present only plots of the original data and the one in
which you use the best transformation (if one is needed at all). The
best transformation should produce the most linear pattern over the
entire range of the predictor.
(III) Using the ASCII dataset cigcancer.dat in the Data directory,
(i) Find the correlations among all of the cancer rates and the
partial correlations of the same cancer rates after removing the
effect of Cigarette smoking.
(ii) Not all cancers seem to have much to do with cigarette smoking.
Based on what you found in (i), which cancers would you say have rates
most related to smoking?
(iii) After removing the effect of cigarette smoking by creating an
output file of residuals, assess the remaining (linear) dependence,
if any, among the rates of the four types of cancer. Would you say
that significant dependence remains? If so, can you guess what it
might be due to? (This is a non-statistical question.)
(iv) Try plotting the LUNG versus BLADDER cancer-rate residuals after
first removing the effect of cigarette smoking. I could not make much
sense of this scatterplot directly. But now try plotting (either with
separate plotting characters or in separate pictures, or by marking
the points by hand within your scatterplot) to distinguish the
"urban" states from the others. (My "urban" list consisted of:
"CA" "CT" "DE" "DC" "FL" "IL" "MD" "MA" "NJ"
"NY" "OH" "PE" "RI" "WI".)
NOW what does the scatterplot suggest?
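The correlation, partial-correlation and residual steps in (i) and (iii) might be organized as follows. This is a sketch under assumptions: the data set name `cigcancer` and the variable names `cig`, `kidney` and `leukemia` are hypothetical (only LUNG and BLADDER are named in the problem), so they would need to be matched to the actual file.

```sas
/* Assumes WORK.CIGCANCER (hypothetical) with a smoking variable CIG  */
/* and four cancer-rate variables; names other than LUNG and BLADDER  */
/* are placeholders.                                                  */
proc corr data=cigcancer;
   var bladder lung kidney leukemia;
run;

proc corr data=cigcancer;
   var bladder lung kidney leukemia;
   partial cig;                  /* partial correlations given CIG */
run;

/* Residuals of each rate after regressing on CIG, saved to RESIDS.   */
proc reg data=cigcancer noprint;
   model bladder lung kidney leukemia = cig;
   output out=resids r=r_bladder r_lung r_kidney r_leukemia;
run;
quit;
```

The output data set `resids` keeps the original variables alongside the residuals, so a state identifier in the input would remain available for marking points in (iv).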
Fifth Homework Assignment, counting 15 points, due Monday,
April 13, 2009. Hand in 10 pages maximum.
(I) Do problems 2 and 3 on the Partial Correlation worksheet (within
the Partial Correlation Handout in the Handouts web-page directory
and referenced in the "current reading" page).
(II) The data set Forbes was obtained by a 19th
century physicist who wanted to provide tables allowing altitude to
be measured based on the boiling point of water. This is
not as silly as it sounds, as in those days altimeters were large and
sensitive barometers, while thermometers, though breakable, were small
and robust, and much more easily hauled up remote mountains. The
dataset consists of measurements of the boiling point of water in
degrees Fahrenheit, and measurements of air pressure in inches of
mercury.
(a) Construct a scatterplot for the air pressure as a function of
temperature. Run a regression on the observations and comment on
evidence of outliers and non-linearity.
(b) Knowing that, to a first approximation, pressure is proportional to
exp(β T) where T is temperature, make an appropriate transformation on
pressure so that a linear model can be properly fit. Make a new
scatterplot to confirm the linear pattern.
(c) Remove the outlier, run a new regression, and make a scatterplot
with regression line and a residual plot.
(d) Use the fitted line to provide a 99% prediction interval for the
transformed pressure at temperatures of 200.5 degrees and 150 degrees.
[Hint: most of what you need to calculate the prediction interval can
be found by running PROC MEANS on the predictor.] Why is the second
estimate not to be trusted? What further observations would you
need to make in order to have any confidence in it?
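As a reminder of what the hint in (d) points to, the usual prediction interval for a new response in simple linear regression is written out below. The notation is introduced here and is not taken from the course text: x_0 is the new temperature, \hat{y}_0 the fitted transformed pressure at x_0, s the root mean squared error from the regression, and n, \bar{x}, s_x the count, mean and standard deviation of the predictor as reported by PROC MEANS.

```latex
\hat{y}_0 \;\pm\; t_{n-2,\,0.995}\; s\,
\sqrt{\,1 + \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{(n-1)\,s_x^2}\,}
```

Here t_{n-2, 0.995} is the 0.995 quantile of the t distribution with n-2 degrees of freedom, which gives 99% two-sided coverage, and (n-1) s_x^2 is the corrected sum of squares of the predictor.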
Sixth Homework Assignment, counting 15 points, due Friday,
April 24, 2009. Hand in 12 pages maximum.
Do the three problems on the HW6 pdf file.
Seventh and Last Homework Assignment, counting 15 points, due Friday,
May 8, 2009. Hand in 12 pages maximum.
Do the three problems on the HW7 pdf file.