LOG Covering Multiple Regression Steps in
Data Analysis
========================== Stat 430, 11/20/06
libname home ".";
data home.pima;
infile "ASCdata/pima.dat";
input Obs preg gluc diast triceps
insulin bmi diabscor age Diab;
if Obs ne . ; /* to skip header */
data pima2;
set home.pima;
if diast ne 0 and gluc ne 0 and triceps ne 0
and insulin ne 0 and bmi ne 0 ;
/* remove missing values: keep only 392 out of 768 obs */
proc reg data=pima2;
model diast = bmi age gluc Diab;
plot residual. * predicted.;
output out = home.pim2rsd p= Yprd r = Yrsd;
run;
### Info from output window:
Parameter Standard
Variable DF Estimate Error t Value Pr > |t|
Intercept 1 41.43186 3.81766 10.85 <.0001
bmi 1 0.48751 0.08571 5.69 <.0001
age 1 0.31908 0.06177 5.17 <.0001
gluc 1 0.02704 0.02239 1.21 0.2279
Diab 1 -0.19096 1.49342 -0.13 0.8983
Root MSE 11.41383 R-Square 0.1743
## Here the "parameter estimate" values are the beta coefficients
## and their standard errors and studentized ratios
## (beta entry divided by its SE) are given along with t_{n-p}
## p-values. Here the number p of beta coeff's is 5, and degrees
## of freedom n-p = 387.
## We can see that the gluc and Diab coef's are very insignificant
## and those terms could be dropped from the model.
## Overall R^2 = .174, and corrected total SSQ = 387*S_Y^2 = 61056
## So the RSS = (1-.1743)*61056 = 50414 (see below), and the model
## SSQ is .1743*61056 = 10642. (Discrepancies due to rounding errors.)
Analysis of Variance
Sum of Mean
Source DF Squares Square F Value Pr > F
Model 4 10639 2659.73795 20.42 <.0001
Error 387 50417 130.27545
Corrected Total 391 61056
## The residual plot produced in PROC REG shows very little pattern,
## maybe none at all. I tried a lot of alternative fits and
## residuals plots with nothing other than the bmi and age showing
## convincing behavior in the data.
Useful plots are: resid vs fitted, studentized residuals vs fitted
Also QQplot for residuals
## Diast looks linear in scatterplots vs each of bmi and age,
## and quadratic and interaction terms do not help. (The
## interaction bmi*age actually makes each of the bmi and age
## coeffs along with the interaction coeff look non-significant !
## So we conclude with the observation of correlations and partial
correlation and their relation to the anova table produced using
proc glm (using same model statement syntax as proc reg)..
proc glm data=pima2;
model diast = bmi age ;
run;
Parameter Estimate Error t Value Pr > |t|
Intercept 43.31286275 3.18487011 13.60 <.0001
bmi 0.50649114 0.08229787 6.15 <.0001
age 0.34319109 0.05669778 6.05
## What we wanted was the "Type I SS":
Source DF Type I SS Mean Square F Value Pr > F
bmi 1 5657.493492 5657.493492 43.47 <.0001
age 1 4768.626652 4768.626652 36.64 <.0001
Error 389 50629.43088 130.15278
Corrected Total 391 61055.55102
proc corr data=pima2;
var diast; with bmi; /* corr 0.30440 p-val <.0001 */
proc corr data=pima2;
var diast; with age; partial bmi; run;
/* partial corr = 0.29339 p-val <.0001 */
Analysis of Variance Table
Response: diastolic
Df Sum Sq Mean Sq F value Pr(>F)
bmi 1 5657 5657 43.468 1.407e-10 ***
age 1 4769 4769 36.639 3.349e-09 ***
Residuals 389 50629 130
cor(diastolic,bmi) ### = 0.3044034
391*var(diastolic) = 61055.55 = 5657 + 4769 + 50629
61055*.3044^2 ### = 5657 OK!!!
Partial corr Diast vs Age partial wrt BMI found above = 0.2933928
> (61055 - 5657)*(.29339^2) ### 4768.532 OK !!!
Also found five largest studentized residuals as follows:
the two or three largest may be too extreme:
663 5 598 19 126
2.544109 3.213193 3.574369 4.103650 4.460879
### The 3 most extreme observations, those with Obs #'s 598, 19, 126
### were flagged and identified by the R software package.