LOG Covering Multiple Regression Steps in 
Data Analysis
========================== Stat 430, 11/20/06

libname home ".";
data home.pima;
  infile "ASCdata/pima.dat";
  input  Obs preg gluc diast triceps 
         insulin bmi diabscor age Diab;
  if Obs ne . ;                /* to skip header */

data pima2;
  set home.pima;
  if diast ne 0 and gluc ne 0 and triceps ne 0 
       and insulin ne 0   and bmi ne 0 ;
  /* remove missing values: keep only 392 out of 768 obs */

proc reg data=pima2;
  model diast = bmi  age  gluc  Diab;
  plot residual. * predicted.;
  output out = home.pim2rsd p= Yprd r = Yrsd;
  run;

### Info from output window:

                      Parameter       Standard
Variable     DF       Estimate          Error    t Value    Pr > |t|

Intercept     1       41.43186        3.81766      10.85      <.0001
bmi           1        0.48751        0.08571       5.69      <.0001
age           1        0.31908        0.06177       5.17      <.0001
gluc          1        0.02704        0.02239       1.21      0.2279
Diab          1       -0.19096        1.49342      -0.13      0.8983

      Root MSE             11.41383    R-Square     0.1743


## Here the "parameter estimate" values are the beta coefficients
##   and their standard errors and studentized ratios
##   (beta entry divided by its SE) are given along with t_{n-p}
##   p-values. Here the number p  of  beta coeff's is 5, and degrees
##   of freedom  n-p = 387.

## We can see that the gluc and Diab coef's are very insignificant
##   and those terms could be dropped from the model.

## Overall R^2 = .174, and corrected total SSQ = 387*S_Y^2 = 61056
## So the RSS = (1-.1743)*61056 = 50414 (see below), and the model 
## SSQ is .1743*61056 = 10642. (Discrepancies due to rounding errors.)

                      Analysis of Variance

                          Sum of           Mean
Source          DF        Squares         Square    F Value    Pr > F

Model           4          10639     2659.73795      20.42    <.0001
Error         387          50417      130.27545
                        Corrected Total         391          61056

## The residual plot produced in PROC REG shows very little pattern,
##  maybe none at all. I tried a lot of alternative fits and 
##  residuals plots with nothing other than the bmi and age showing
##  convincing behavior in the data.

Useful plots are: resid vs fitted, studentized residuals vs fitted
Also QQplot for residuals

## Diast looks linear in scatterplots vs each of bmi and age,
##   and quadratic and interaction terms do not help. (The
##   interaction  bmi*age actually makes each of the bmi and age
##   coeffs along with the interaction coeff look non-significant !

## So we conclude with the observation of correlations and partial
   correlation and their relation to the anova table produced using
   proc glm (using same model statement syntax as proc reg)..

 proc glm data=pima2;
   model diast = bmi age ;
   run;

Parameter      Estimate      Error      t Value    Pr > |t|

Intercept    43.31286275    3.18487011    13.60      <.0001
bmi           0.50649114    0.08229787     6.15      <.0001
age           0.34319109    0.05669778     6.05


## What we wanted was the "Type I SS":

 Source     DF       Type I SS     Mean Square    F Value    Pr > F

 bmi        1     5657.493492     5657.493492      43.47    <.0001
 age        1     4768.626652     4768.626652      36.64    <.0001
 
 Error             389     50629.43088       130.15278
Corrected Total    391     61055.55102

proc corr data=pima2;
   var diast; with bmi; /* corr    0.30440  p-val  <.0001  */

proc corr data=pima2;
   var diast; with age; partial bmi; run;
           /* partial corr =    0.29339   p-val  <.0001  */


Analysis of Variance Table

Response: diastolic
           Df Sum Sq Mean Sq F value    Pr(>F)
bmi         1   5657    5657  43.468 1.407e-10 ***
age         1   4769    4769  36.639 3.349e-09 ***
Residuals 389  50629     130

 cor(diastolic,bmi) ### = 0.3044034

391*var(diastolic) =  61055.55 =  5657 + 4769 + 50629

 61055*.3044^2  ### = 5657  OK!!!

Partial corr Diast vs Age partial wrt BMI found above =  0.2933928

> (61055 - 5657)*(.29339^2)  ###   4768.532   OK !!!

Also found five largest studentized residuals as follows: 
the two or three largest may be too extreme:

     663        5      598       19      126
2.544109 3.213193 3.574369 4.103650 4.460879

### The 3 most extreme observations, those with Obs #'s 598, 19, 126 
###    were flagged and identified by the R software package.