LOG Covering Multiple Regression Steps in Data Analysis ========================== Stat 430, 11/20/06 libname home "."; data home.pima; infile "ASCdata/pima.dat"; input Obs preg gluc diast triceps insulin bmi diabscor age Diab; if Obs ne . ; /* to skip header */ data pima2; set home.pima; if diast ne 0 and gluc ne 0 and triceps ne 0 and insulin ne 0 and bmi ne 0 ; /* remove missing values: keep only 392 out of 768 obs */ proc reg data=pima2; model diast = bmi age gluc Diab; plot residual. * predicted.; output out = home.pim2rsd p= Yprd r = Yrsd; run; ### Info from output window: Parameter Standard Variable DF Estimate Error t Value Pr > |t| Intercept 1 41.43186 3.81766 10.85 <.0001 bmi 1 0.48751 0.08571 5.69 <.0001 age 1 0.31908 0.06177 5.17 <.0001 gluc 1 0.02704 0.02239 1.21 0.2279 Diab 1 -0.19096 1.49342 -0.13 0.8983 Root MSE 11.41383 R-Square 0.1743 ## Here the "parameter estimate" values are the beta coefficients ## and their standard errors and studentized ratios ## (beta entry divided by its SE) are given along with t_{n-p} ## p-values. Here the number p of beta coeff's is 5, and degrees ## of freedom n-p = 387. ## We can see that the gluc and Diab coef's are very insignificant ## and those terms could be dropped from the model. ## Overall R^2 = .174, and corrected total SSQ = 387*S_Y^2 = 61056 ## So the RSS = (1-.1743)*61056 = 50414 (see below), and the model ## SSQ is .1743*61056 = 10642. (Discrepancies due to rounding errors.) Analysis of Variance Sum of Mean Source DF Squares Square F Value Pr > F Model 4 10639 2659.73795 20.42 <.0001 Error 387 50417 130.27545 Corrected Total 391 61056 ## The residual plot produced in PROC REG shows very little pattern, ## maybe none at all. I tried a lot of alternative fits and ## residuals plots with nothing other than the bmi and age showing ## convincing behavior in the data. Useful plots are: resid vs fitted, studentized residuals vs fitted Also QQplot for residuals ## Diast looks linear in scatterplots vs each of bmi and age, ## and quadratic and interaction terms do not help. (The ## interaction bmi*age actually makes each of the bmi and age ## coeffs along with the interaction coeff look non-significant ! ## So we conclude with the observation of correlations and partial correlation and their relation to the anova table produced using proc glm (using same model statement syntax as proc reg).. proc glm data=pima2; model diast = bmi age ; run; Parameter Estimate Error t Value Pr > |t| Intercept 43.31286275 3.18487011 13.60 <.0001 bmi 0.50649114 0.08229787 6.15 <.0001 age 0.34319109 0.05669778 6.05 ## What we wanted was the "Type I SS": Source DF Type I SS Mean Square F Value Pr > F bmi 1 5657.493492 5657.493492 43.47 <.0001 age 1 4768.626652 4768.626652 36.64 <.0001 Error 389 50629.43088 130.15278 Corrected Total 391 61055.55102 proc corr data=pima2; var diast; with bmi; /* corr 0.30440 p-val <.0001 */ proc corr data=pima2; var diast; with age; partial bmi; run; /* partial corr = 0.29339 p-val <.0001 */ Analysis of Variance Table Response: diastolic Df Sum Sq Mean Sq F value Pr(>F) bmi 1 5657 5657 43.468 1.407e-10 *** age 1 4769 4769 36.639 3.349e-09 *** Residuals 389 50629 130 cor(diastolic,bmi) ### = 0.3044034 391*var(diastolic) = 61055.55 = 5657 + 4769 + 50629 61055*.3044^2 ### = 5657 OK!!! Partial corr Diast vs Age partial wrt BMI found above = 0.2933928 > (61055 - 5657)*(.29339^2) ### 4768.532 OK !!! Also found five largest studentized residuals as follows: the two or three largest may be too extreme: 663 5 598 19 126 2.544109 3.213193 3.574369 4.103650 4.460879 ### The 3 most extreme observations, those with Obs #'s 598, 19, 126 ### were flagged and identified by the R software package.