THIRD LOG Covering Multiple Regression Model
Building Steps: Model Selection and Weights
========================================== Stat 430, 11/27/06
(I) The first portion of this Log illustrates the SELECTION
option within the MODEL statement of PROC REG.
libname home ".";
data home.simfil (drop=seed i);
seed=537;
do i=1 to 600;
x = -0.5*log(1- ranuni(seed));
if ranuni(seed) < 0.4 then w = 1;
else w=0;
y = 3+ 2*X - X**2 + 1.5*(X**2)*W + 1.2*rannor(seed);
xsq = X**2; XW = X*W; XSQW = (X**2)*W; XCU = X**3;
XCUW = XCU*W;
output;
end;
run;   /* produces 600 obs, 8 variables */
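## A Python stand-in for the SAS data step above -- a sketch only, since
## Python's random streams differ from SAS ranuni/rannor, so individual
## values will not match the log:

```python
import math
import random

rng = random.Random(537)          # plays the role of seed=537 (different stream than SAS)
rows = []
for _ in range(600):
    x = -0.5 * math.log(1 - rng.random())   # Exponential with mean 0.5, like -0.5*log(1-ranuni)
    w = 1 if rng.random() < 0.4 else 0      # Bernoulli(0.4) group indicator
    y = 3 + 2*x - x**2 + 1.5*(x**2)*w + 1.2*rng.gauss(0, 1)
    rows.append(dict(x=x, w=w, y=y, xsq=x**2, xw=x*w,
                     xsqw=(x**2)*w, xcu=x**3, xcuw=(x**3)*w))
```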
## Recall that the following Type I SS table was generated in the
MultReg2.Log script, using PROC GLM with variables entered
in the order X W XSQ XW XSQW
Source DF Type I SS Mean Square F Value Pr > F
x 1 69.93302 69.93302 49.56 <.0001
w 1 50.74898 50.74898 35.96 <.0001
xsq 1 86.93936 86.93936 61.61 <.0001
XW 1 233.99894 233.99894 165.83 <.0001
XSQW 1 60.75024 60.75024 43.05 <.0001
## Now we consider the model-selection features in PROC REG.
proc reg data=home.simfil;
model y = x w xsq xcu xw xsqw /
selection = forward slentry=.05;
run;
Corrected Total 599 1340.55960
### NOTE BELOW: Cp = RSS/sighat**2 + 2p - n, where sighat**2 is the
### MSE of the full model and p counts the fitted parameters
### (including the intercept).
Forward Selection: Step 1
Variable XSQW Entered: R-Square = 0.2913 and C(p) = 77.0811
Forward Selection: Step 2
Variable XCU Entered: R-Square = 0.3490 and C(p) = 24.2663
Forward Selection: Step 3
Variable x Entered: R-Square = 0.3736 and C(p) = 2.8871
No other variable met the 0.0500 significance level for entry into the
model.
proc reg data=home.simfil;
model y = x w xsq xcu xw xsqw /
selection = forward slentry=.15;
run;
### Exactly the same result !!! The problem is that at each
### stage the best of the not-yet-selected variables is chosen,
### and none is ever deleted.
### NOTE that the model with all of the correct terms in it
### ( X XSQ XSQW and no others ) has R-square = .3741
proc reg data=home.simfil;
model y = x w xsq xcu xw xsqw /
selection = backward slstay=.10;
run;
Backward Elimination: Step 0
All Variables Entered: R-Square = 0.3756 and C(p) = 7.0000
Backward Elimination: Step 1
Variable XW Removed: R-Square = 0.3755 and C(p) = 5.0938
Backward Elimination: Step 2
Variable w Removed: R-Square = 0.3751 and C(p) = 3.5122
Backward Elimination: Step 3
Variable XCU Removed: R-Square = 0.3741 and C(p) = 2.4055
All variables left in the model are significant at the 0.1000 level.
### This leaves only the true model terms !!!
proc reg data=home.simfil;
model y = x w xsq xcu xw xsqw /
selection = stepwise slentry = .05 slstay=.10;
run;
STEPS: entered XSQW, then XCU, then X, then stopped with the statements:
All variables left in the model are significant at the 0.1000 level.
No other variable met the 0.0500 significance level for entry into
the model.
## Here is one final way to do it, searching for the best Cp and
## also producing the values of AIC, another criterion, defined by:
##      AIC = n*log(RSS/n) + 2p
proc reg data=home.simfil;
model y = x w xsq xcu xw xsqw /
selection = CP AIC ;
run;
### Now get exhaustive list tabulating CP & AIC by # variables in
model, from best to worst according to Cp criterion:
Number in
 Model      C(p)    R-Square       AIC    Variables in Model
    3     2.4055      0.3741  209.1769    x xsq XSQW
    3     2.8871      0.3736  209.6628    x XCU XSQW
  ...
    4     3.5122      0.3751  210.2745    x xsq XCU XSQW
  ...
    5     5.4091      0.3752  212.1703    x xsq XCU XW XSQW
  ...
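## A quick Python check (not part of the SAS log) that the printed C(p)
## and AIC values agree with the formulas above; the small discrepancies
## come from the rounding of the reported R-squares:

```python
import math

n, sstot = 600, 1340.55960           # n and the Corrected Total SS from the log
rss_full = (1 - 0.3756) * sstot      # full model: 6 predictors + intercept
sighat2 = rss_full / (n - 7)         # full-model MSE, the sighat**2 in Cp

rss3 = (1 - 0.3741) * sstot          # model x xsq XSQW  (p = 4 with intercept)
cp3 = rss3 / sighat2 + 2 * 4 - n     # Cp  = RSS/sighat**2 + 2p - n
aic3 = n * math.log(rss3 / n) + 2 * 4  # AIC = n*log(RSS/n) + 2p
```

Both land within about 0.05 of the tabulated 2.4055 and 209.1769.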
=============================================================
(II) The second part of this Script illustrates why and how to
use WEIGHTS in PROC REG.
* Simulate a dataset with variances growing in prescribed fashion
as a function of a simple explanatory variable X.
data home.simWGT (drop = i seed);
seed = 77777;
do i=1 to 400;
x = 2.5 + 0.5*rannor(seed);
if(x < 0) then x=1e-2;
y = -1 + 0.7*x + 0.5*sqrt(x)*rannor(seed);
xsqrt = sqrt(x);
xrtinv = 1/xsqrt;
yst = y/xsqrt;
W = 1/x;
output;
end;
run;
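## Again a Python stand-in for the SAS data step (a sketch: the rannor
## stream differs, so values will not match the log), showing how the
## error standard deviation is made to grow like sqrt(x):

```python
import math
import random

rng = random.Random(77777)           # plays the role of seed=77777 (different stream than SAS)
data = []
for _ in range(400):
    x = 2.5 + 0.5 * rng.gauss(0, 1)
    if x < 0:
        x = 1e-2
    y = -1 + 0.7 * x + 0.5 * math.sqrt(x) * rng.gauss(0, 1)   # Var(error) proportional to x
    data.append((x, y))
```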
### As mentioned in class 11/27, this y vs x dataset does
### not satisfy the homoscedastic errors assumption
goptions csymbol=black;
options nocenter nodate linesize=70;
proc reg data=home.simwgt;
model y = x ;
plot residual. * x;
run;
### You might not have spotted it if you didn't know to
### look, but this residuals plot definitely flares out,
### becoming wider as x becomes larger.
### So now we re-run the regression two ways: once
### with a WEIGHT statement and once with transformed
### response and regression variables.
proc reg data=home.simwgt;
model y = x ;
weight w;
run;
## Estimates a = -0.94088 , b = 0.67378
## Std errors 0.20146 0.08227
## The estimated standard errors are also valid here,
## because the weights are specified inversely
## proportional to the residual variances.
## Now do the transformed regression: we indicated in class 11/27
## that the original model is equivalent to
##      model yst = xrtinv xsqrt / noint ;
## NB: to fit a model without an intercept, use the NOINT model option!
proc reg data=home.simwgt;
model yst = xrtinv xsqrt / noint;
var x; /* needed for next plot statement */
plot residual.*x ;
run;
Rsq = 0.0987
Dependent Variable: yst
Parameter Standard
Variable DF Estimate Error t Value Pr > |t|
xrtinv 1 -0.94088 0.20146 -4.67 <.0001
xsqrt 1 0.67378 0.08130 8.29 <.0001
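## Why the transformed fit reproduces the weighted fit exactly: dividing
##    y = a + b*x + sqrt(x)*e      through by sqrt(x) gives
##    y/sqrt(x) = a*(1/sqrt(x)) + b*sqrt(x) + e,
## so no-intercept OLS of yst on (xrtinv, xsqrt) minimizes exactly the
## weighted criterion sum (1/x)*(y - a - b*x)**2. A numeric check in
## Python (toy data, not the simWGT stream):

```python
import math

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [0.3, 0.5, 1.4, 1.6, 2.9]

# (1) WLS of y on x with weights 1/x, in closed form
w = [1 / xi for xi in x]
sw = sum(w)
xb = sum(wi * xi for wi, xi in zip(w, x)) / sw
yb = sum(wi * yi for wi, yi in zip(w, y)) / sw
b_wls = (sum(wi * (xi - xb) * (yi - yb) for wi, xi, yi in zip(w, x, y))
         / sum(wi * (xi - xb) ** 2 for wi, xi in zip(w, x)))
a_wls = yb - b_wls * xb

# (2) no-intercept OLS of yst = y/sqrt(x) on u = 1/sqrt(x) and v = sqrt(x),
#     solving the 2x2 normal equations directly
u = [1 / math.sqrt(xi) for xi in x]
v = [math.sqrt(xi) for xi in x]
yst = [yi / math.sqrt(xi) for yi, xi in zip(y, x)]
suu = sum(ui * ui for ui in u)
svv = sum(vi * vi for vi in v)
suv = sum(ui * vi for ui, vi in zip(u, v))
buy = sum(ui * yi for ui, yi in zip(u, yst))
bvy = sum(vi * yi for vi, yi in zip(v, yst))
det = suu * svv - suv * suv
a_tr = (buy * svv - bvy * suv) / det   # coefficient of 1/sqrt(x)
b_tr = (bvy * suu - buy * suv) / det   # coefficient of sqrt(x)
```

The two pairs of coefficients agree to machine precision, just as the
SAS estimates -0.94088 and 0.67378 appear in both fits above.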
### The residuals plot now is much more patternless.
### When we re-fit the same model WITH intercept we find,
### as expected, that the intercept is insignificantly
### different from 0.
proc reg data=home.simwgt;
model yst = xrtinv xsqrt ;
var x;
plot residual. * x;
run;
Intercept 1 -4.40620 4.10714 -1.07 0.2840
xrtinv 1 2.46807 3.18395 0.78 0.4387
xsqrt 1 2.08449 1.31747 1.58 0.1144
## Remarkably, and confusingly, ALL of the coefficients
## seem insignificant now, although we know they are not.
## The culprit is near-collinearity: with x clustered near 2.5,
## the columns 1, 1/sqrt(x), and sqrt(x) are almost linearly
## dependent, which inflates all of the standard errors.
## More importantly, the residuals look just as before,
## and the overall model Rsq is again .0987 !!
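## A quick numerical illustration (Python sketch, fresh random stream,
## so not the exact SAS values): over the narrow range of x, xrtinv and
## xsqrt are very strongly (negatively) correlated, so once an intercept
## column is added the design is nearly singular:

```python
import math
import random

rng = random.Random(77777)               # any seed; the effect is generic
x = [max(2.5 + 0.5 * rng.gauss(0, 1), 1e-2) for _ in range(400)]
u = [1 / math.sqrt(xi) for xi in x]      # xrtinv
v = [math.sqrt(xi) for xi in x]          # xsqrt

def corr(a, b):
    """Sample correlation coefficient of two equal-length lists."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b))
    return cov / math.sqrt(sum((ai - ma) ** 2 for ai in a)
                           * sum((bi - mb) ** 2 for bi in b))

r = corr(u, v)   # very close to -1
```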