SECOND LOG Covering Multiple Regression Model Building Steps,
including ANOVA from PROC GLM
=================================          Stat 430, 11/22/06

## Note first that all of the output features we have been
## using (predicted values, residuals, LCL and UCL, etc.)
## can also be produced using PROC GLM. The extra advantage
## of GLM is that it produces a "Type I SSQ" Analysis of
## Variance (ANOVA) table. (See below for discussion.) The
## only disadvantage of PROC GLM is that it does not have
## automatic model selection options, which we will cover
## in PROC REG after the break.

libname home ".";

data simfil (drop=seed i);
   seed = 537;
   do i = 1 to 600;
      x = -0.5*log(1 - ranuni(seed));
      if ranuni(seed) < 0.4 then w = 1;
      else w = 0;
      y = 3 + 2*x - x**2 + 1.5*(x**2)*w + 1.2*rannor(seed);
      xsq = x**2;
      XW = x*w;
      XSQW = (x**2)*w;
      output;
   end;
run;          /* produces 600 obs, 6 variables */

proc corr data=simfil;
   var y;
   with x w;
run;
## Cor y with x = 0.2284,  p-val <.0001
## Cor y with w = 0.19425, p-val <.0001

proc corr data=simfil;
   var y;
   with xsq XW;
   partial x w;
run;
## Partial (after removing x, w):
## Cor y with xsq = -0.26696, p-val <.0001
## Cor y with XW  =  0.48074, p-val <.0001

Note that the partial correlation of Y with XW is far too
large to be due to chance, even though there is no X*W term
in the model we simulated !! The reason for the large
correlation is that the residuals after removing X, W from Y
are indeed correlated with XW, because X^2*W is. That tells
us to interpret the high partial correlation as saying not
exactly that X*W is present, but that some interaction term
like it is.
---------------------------------------------------

proc glm data=simfil;
   model y = x w xsq XW XSQW;
   output out=rsdetc p=Yprd r=Yrsd Rstudent=StudYRes
          L95=LowYCI U95=HiYCI;
run;

## Here p = number of regression coefficients is 6, one of
## which is for the intercept. The DF for Model terms below
## is the number p-1 = 5 of NONCONSTANT explanatory variables.
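The point about partial correlations can be checked by brute force in
any language: regress Y and the new variable on the controls, then
correlate the two residual vectors. A minimal numpy sketch, assuming a
simulation of the same form as the data step above (the seed and RNG
stream differ from SAS's ranuni, so the numbers only roughly match the
0.48 and -0.267 reported):

```python
import numpy as np

rng = np.random.default_rng(537)   # arbitrary seed, NOT SAS's ranuni stream
n = 600
x = -0.5 * np.log(1 - rng.uniform(size=n))    # Exponential, as in the data step
w = (rng.uniform(size=n) < 0.4).astype(float)
y = 3 + 2*x - x**2 + 1.5*(x**2)*w + 1.2*rng.standard_normal(n)
xsq, xw = x**2, x*w

def partial_corr(y, v, controls):
    """Correlation of y with v after removing the linear effect of controls."""
    X = np.column_stack([np.ones(len(y))] + controls)
    ry = y - X @ np.linalg.lstsq(X, y, rcond=None)[0]
    rv = v - X @ np.linalg.lstsq(X, v, rcond=None)[0]
    return np.corrcoef(ry, rv)[0, 1]

print(partial_corr(y, xsq, [x, w]))   # clearly negative, like SAS's -0.267
print(partial_corr(y, xw,  [x, w]))   # strongly positive, like SAS's 0.481
```

The second partial correlation comes out large and positive even though
no X*W term was simulated, because the residuals are correlated with
XSQW, which in turn is highly correlated with XW.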
                                 Sum of
Source               DF         Squares     Mean Square    F Value   Pr > F
Model                 5        502.3705        100.4741      71.20   <.0001
Error               594        838.1891          1.4111
Corrected Total     599       1340.5596

## This crude Analysis of Variance (ANOVA) table is produced
## by both PROC REG and PROC GLM. Recall that the Error Mean
## Square = Residual Sum of Squares divided by n-p is the same
## as the estimated residual variance sigma_hat^2 = 1.4111 in
## this case. Also, CSS = corrected total sum of squares =
## (n-1)*S_Y^2 = 1340.56.

### The Type I SSQ ANOVA table is produced only by PROC GLM
### and is the main reason to use that PROC. Each line gives
### the decrease in RSS due to inclusion of the term named in
### the "Source" column. The Type I SS on each line is equal
### to CSS*(squared multiple correlation R^2 including the
### current term minus R^2 including only the previous model
### terms), and is also equal to the RSS for the regression
### model including all previous model terms minus the RSS
### after including the current term.

### Under the null hypothesis that the model in the previous
### line of the table already fits, the current F value,
### which is the current (Type I SS divided by DF) divided by
### sigma_hat^2, would be distributed as an F statistic with
### DF and n-p degrees of freedom. This statistic is centered
### near 1, and values above 3 or 4 are large (but the exact
### percentage points do depend on DF and n-p, e.g. the 95%
### quantile for 1, 594 df's is 3.86 and the 99% quantile is
### 6.68).

Source      DF       Type I SS     Mean Square    F Value   Pr > F
x            1        69.93302        69.93302      49.56   <.0001
w            1        50.74898        50.74898      35.96   <.0001
xsq          1        86.93936        86.93936      61.61   <.0001
XW           1       233.99894       233.99894     165.83   <.0001
XSQW         1        60.75024        60.75024      43.05   <.0001

### So this Type I SS table says strongly that every term
### entered into the model was HIGHLY SIGNIFICANT AT THE
### POINT WHERE IT WAS ENTERED INTO THE MODEL, although we
### know that not all of these terms should be needed in the
### final model.
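The Type I (sequential) decomposition above can be verified by brute
force: fit the nested sequence of models, difference the RSS's, and
check that the pieces telescope to the Model SS. A numpy/scipy sketch
under the same simulated-data assumptions (arbitrary seed, so the SS
values only roughly match SAS's):

```python
import numpy as np
from scipy.stats import f

rng = np.random.default_rng(537)   # arbitrary seed, NOT SAS's ranuni stream
n = 600
x = -0.5 * np.log(1 - rng.uniform(size=n))
w = (rng.uniform(size=n) < 0.4).astype(float)
y = 3 + 2*x - x**2 + 1.5*(x**2)*w + 1.2*rng.standard_normal(n)
terms = {"x": x, "w": w, "xsq": x**2, "XW": x*w, "XSQW": (x**2)*w}

def rss(cols):
    """Residual sum of squares for y regressed on intercept + cols."""
    X = np.column_stack([np.ones(n)] + cols)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return np.sum((y - X @ beta) ** 2)

css = np.sum((y - y.mean()) ** 2)      # corrected total SS = (n-1)*S_Y^2
prev, cols, type1 = css, [], {}
for name, v in terms.items():          # enter terms in model order
    cols.append(v)
    cur = rss(cols)
    type1[name] = prev - cur           # Type I SS: drop in RSS at entry
    prev = cur
sigma2 = prev / (n - 6)                # error mean square, df = n - p

for name, ss in type1.items():
    print(name, ss, ss / sigma2)       # each Type I SS and its F value (DF=1)
# the Type I SS's telescope: they sum to Model SS = CSS - RSS(full)
print(f.ppf(0.95, 1, n - 6), f.ppf(0.99, 1, n - 6))   # ~3.86 and ~6.68
```

The last line reproduces the quoted F(1, 594) percentage points; the
telescoping identity holds for any data, since each Type I SS is just
the RSS drop between two nested fits.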
Dependent Variable: y : Type III SS

### The table of coefficient estimates, standard errors and
### t-ratios with p-values is produced by either PROC REG or
### PROC GLM.

                              Standard
Parameter      Estimate          Error     t Value   Pr > |t|
Intercept        3.1139        0.10385       29.99     <.0001
x                1.5485        0.27164        5.70     <.0001
w               -0.0815        0.17448       -0.47     0.6406
xsq             -0.8639        0.11220       -7.70     <.0001
XW              -0.0055        0.49145       -0.01     0.9911
XSQW             1.5887        0.24214        6.56     <.0001

### Here we can see that x, xsq, and XSQW are highly
### significant in the final model including all terms, but
### that w and XW are not significant at all: we know this to
### be true from the simulation code.

### The Type III ANOVA table (produced by PROC GLM) gives
### information essentially like that of the coefficient
### t-statistics: it tells whether a specific coefficient is
### significantly different from 0 when all other terms have
### already been entered into the model. Indeed, the F Value
### column is exactly equal to the square of the t Value
### column from the previous table of coefficient estimates,
### so it is not surprising that the corresponding p-values
### are exactly equal.

Source      DF     Type III SS     Mean Square    F Value   Pr > F
x            1        45.85473        45.85473      32.50   <.0001
w            1         0.30788         0.30788       0.22   0.6406
xsq          1        83.65504        83.65504      59.28   <.0001
XW           1         0.00017         0.00017       0.00   0.9911
XSQW         1        60.75024        60.75024      43.05   <.0001

### This log shows that we want to look both at whether
### individual coefficients are significant in the final
### model and at whether variables, IN THE ORDER ENTERED,
### contributed a noticeable reduction in RSS.

## Generally we will not include a higher-order term unless
## the lower-order terms made of the same ingredients are
## also included, but as you can see here that may result in
## some terms with negligible coefficients being kept !!
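The claim that each Type III F value is exactly the square of the
corresponding t-ratio can also be verified directly: for a single-DF
term, Type III SS = RSS(model without the term) - RSS(full model), and
F = Type III SS / sigma_hat^2. A numpy sketch, again under the same
simulated-data assumptions (arbitrary seed, not SAS's stream), checking
the identity for the w coefficient:

```python
import numpy as np

rng = np.random.default_rng(537)   # arbitrary seed, NOT SAS's ranuni stream
n = 600
x = -0.5 * np.log(1 - rng.uniform(size=n))
w = (rng.uniform(size=n) < 0.4).astype(float)
y = 3 + 2*x - x**2 + 1.5*(x**2)*w + 1.2*rng.standard_normal(n)
cols = [x, w, x**2, x*w, (x**2)*w]            # x, w, xsq, XW, XSQW

Xfull = np.column_stack([np.ones(n)] + cols)
beta, *_ = np.linalg.lstsq(Xfull, y, rcond=None)
rss_full = np.sum((y - Xfull @ beta) ** 2)
sigma2 = rss_full / (n - Xfull.shape[1])      # error mean square, df = n - p

# t-ratio for the w coefficient (column 2 of Xfull)
cov = sigma2 * np.linalg.inv(Xfull.T @ Xfull)
t_w = beta[2] / np.sqrt(cov[2, 2])

# Type III SS for w: extra RSS incurred when w is dropped from the full model
Xdrop = np.delete(Xfull, 2, axis=1)
bd, *_ = np.linalg.lstsq(Xdrop, y, rcond=None)
ss3_w = np.sum((y - Xdrop @ bd) ** 2) - rss_full
F_w = ss3_w / sigma2

print(F_w, t_w**2)    # the two numbers agree: Type III F = t^2
```

The agreement is an algebraic identity for single-DF terms, not a
feature of this particular simulation, which is why the Type III
p-values duplicate those of the t-table.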