THIRD LOG Covering Multiple Regression Model
Building Steps: Model Selection and Weights
========================================== Stat 430, 11/27/06
(I) The first portion of this Log illustrates the SELECTION
option within the MODEL statement of PROC REG.
libname home ".";
data home.simfil (drop=seed i);
seed=537;
do i=1 to 600;
x = -0.5*log(1- ranuni(seed));
if ranuni(seed) < 0.4 then w = 1;
else w=0;
y = 3+ 2*X - X**2 + 1.5*(X**2)*W + 1.2*rannor(seed);
xsq = X**2; XW = X*W; XSQW = (X**2)*W; XCU = X**3;
XCUW = XCU*W;
output;
end;
run;   /* produces 600 obs, 8 variables */
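## A Python stand-in for the SAS data step above -- a sketch only, since
## Python's random streams differ from SAS ranuni/rannor, so individual
## values will not match the log:

```python
import math
import random

rng = random.Random(537)          # plays the role of seed=537 (different stream than SAS)
rows = []
for _ in range(600):
    x = -0.5 * math.log(1 - rng.random())   # Exponential with mean 0.5, like -0.5*log(1-ranuni)
    w = 1 if rng.random() < 0.4 else 0      # Bernoulli(0.4) group indicator
    y = 3 + 2*x - x**2 + 1.5*(x**2)*w + 1.2*rng.gauss(0, 1)
    rows.append(dict(x=x, w=w, y=y, xsq=x**2, xw=x*w,
                     xsqw=(x**2)*w, xcu=x**3, xcuw=(x**3)*w))
```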
## Recall that the following Type I SS table was generated in the
MultReg2.Log script, using PROC GLM with variables entered
in the order X W XSQ XW XSQW
Source DF Type I SS Mean Square F Value Pr > F
x 1 69.93302 69.93302 49.56 <.0001
w 1 50.74898 50.74898 35.96 <.0001
xsq 1 86.93936 86.93936 61.61 <.0001
XW 1 233.99894 233.99894 165.83 <.0001
XSQW 1 60.75024 60.75024 43.05 <.0001
## Now we consider the model-selection features in PROC REG.
proc reg data=home.simfil;
model y = x w xsq xcu xw xsqw /
selection = forward slentry=.05;
run;
Corrected Total 599 1340.55960
### NOTE BELOW: Cp = RSS/sighat**2 + 2p - n, where sighat**2 is the
### MSE of the full model and p counts the fitted parameters
### (including the intercept).
Forward Selection: Step 1
Variable XSQW Entered: R-Square = 0.2913 and C(p) = 77.0811
Forward Selection: Step 2
Variable XCU Entered: R-Square = 0.3490 and C(p) = 24.2663
Forward Selection: Step 3
Variable x Entered: R-Square = 0.3736 and C(p) = 2.8871
No other variable met the 0.0500 significance level for entry into the
model.
proc reg data=home.simfil;
model y = x w xsq xcu xw xsqw /
selection = forward slentry=.15;
run;
### Exactly the same result !!! The problem is that at each
### stage the best of the not-yet-selected variables is chosen,
### and none is ever deleted.
### NOTE that the model with all of the correct terms in it
### ( X XSQ XSQW and no others ) has R-square = .3741
proc reg data=home.simfil;
model y = x w xsq xcu xw xsqw /
selection = backward slstay=.10;
run;
Backward Elimination: Step 0
All Variables Entered: R-Square = 0.3756 and C(p) = 7.0000
Backward Elimination: Step 1
Variable XW Removed: R-Square = 0.3755 and C(p) = 5.0938
Backward Elimination: Step 2
Variable w Removed: R-Square = 0.3751 and C(p) = 3.5122
Backward Elimination: Step 3
Variable XCU Removed: R-Square = 0.3741 and C(p) = 2.4055
All variables left in the model are significant at the 0.1000 level.
### This leaves only the true model terms !!!
proc reg data=home.simfil;
model y = x w xsq xcu xw xsqw /
selection = stepwise slentry = .05 slstay=.10;
run;
STEPS: entered XSQW, then XCU, then X, then stopped with the statements:
All variables left in the model are significant at the 0.1000 level.
No other variable met the 0.0500 significance level for entry into
the model.
## Here is one final way to do it, searching for the best Cp and
## also producing the values of AIC, another criterion, defined by:
##      AIC = n*log(RSS/n) + 2p
proc reg data=home.simfil;
model y = x w xsq xcu xw xsqw /
selection = CP AIC ;
run;
### Now get exhaustive list tabulating CP & AIC by # variables in
model, from best to worst according to Cp criterion:
Number in
 Model      C(p)    R-Square       AIC    Variables in Model
    3     2.4055      0.3741  209.1769    x xsq XSQW
    3     2.8871      0.3736  209.6628    x XCU XSQW
  ...
    4     3.5122      0.3751  210.2745    x xsq XCU XSQW
  ...
    5     5.4091      0.3752  212.1703    x xsq XCU XW XSQW
  ...
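## A quick Python check (not part of the SAS log) that the printed C(p)
## and AIC values agree with the formulas above; the small discrepancies
## come from the rounding of the reported R-squares:

```python
import math

n, sstot = 600, 1340.55960           # n and the Corrected Total SS from the log
rss_full = (1 - 0.3756) * sstot      # full model: 6 predictors + intercept
sighat2 = rss_full / (n - 7)         # full-model MSE, the sighat**2 in Cp

rss3 = (1 - 0.3741) * sstot          # model x xsq XSQW  (p = 4 with intercept)
cp3 = rss3 / sighat2 + 2 * 4 - n     # Cp  = RSS/sighat**2 + 2p - n
aic3 = n * math.log(rss3 / n) + 2 * 4  # AIC = n*log(RSS/n) + 2p
```

Both land within about 0.05 of the tabulated 2.4055 and 209.1769.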
=============================================================
(II) The second part of this Script illustrates why and how to
use WEIGHTS in PROC REG.
* Simulate a dataset with variances growing in prescribed fashion
as a function of a simple explanatory variable X.
data home.simWGT (drop = i seed);
seed = 77777;
do i=1 to 400;
x = 2.5 + 0.5*rannor(seed);
if(x < 0) then x=1e-2;
y = -1 + 0.7*x + 0.5*sqrt(x)*rannor(seed);
xsqrt = sqrt(x);
xrtinv = 1/xsqrt;
yst = y/xsqrt;
W = 1/x;
output;
end;
run;
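## Again a Python stand-in for the SAS data step (a sketch: the rannor
## stream differs, so values will not match the log), showing how the
## error standard deviation is made to grow like sqrt(x):

```python
import math
import random

rng = random.Random(77777)           # plays the role of seed=77777 (different stream than SAS)
data = []
for _ in range(400):
    x = 2.5 + 0.5 * rng.gauss(0, 1)
    if x < 0:
        x = 1e-2
    y = -1 + 0.7 * x + 0.5 * math.sqrt(x) * rng.gauss(0, 1)   # Var(error) proportional to x
    data.append((x, y))
```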
### As mentioned in class 11/27, this y vs x dataset does
### not satisfy the homoscedastic errors assumption
goptions csymbol=black;
options nocenter nodate linesize=70;
proc reg data=home.simwgt;
model y = x ;
plot residual. * x;
run;
### You might not have spotted it if you didn't know to
### look, but this residuals plot definitely flares out,
### becoming wider as x becomes larger.
### So now we re-run the regression two ways: once
### with a WEIGHT statement and once with transformed
### response and regression variables.
proc reg data=home.simwgt;
model y = x ;
weight w;
run;
## Estimates a = -0.94088 , b = 0.67378
## Std errors 0.20146 0.08227
## The estimated standard errors are also valid here,
## because the weights are specified inversely
## proportional to the residual variances.
## Now do the transformed regression: we indicated in class 11/27
## that the original model is equivalent to
##      model yst = xrtinv xsqrt / noint ;
## NB: to fit a model without an intercept, use the NOINT model option!
proc reg data=home.simwgt;
model yst = xrtinv xsqrt / noint;
var x; /* needed for next plot statement */
plot residual.*x ;
run;
Rsq = 0.0987
Dependent Variable: yst
Parameter Standard
Variable DF Estimate Error t Value Pr > |t|
xrtinv 1 -0.94088 0.20146 -4.67 <.0001
xsqrt 1 0.67378 0.08130 8.29 <.0001
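## Why the transformed fit reproduces the weighted fit exactly: dividing
##    y = a + b*x + sqrt(x)*e      through by sqrt(x) gives
##    y/sqrt(x) = a*(1/sqrt(x)) + b*sqrt(x) + e,
## so no-intercept OLS of yst on (xrtinv, xsqrt) minimizes exactly the
## weighted criterion sum (1/x)*(y - a - b*x)**2. A numeric check in
## Python (toy data, not the simWGT stream):

```python
import math

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [0.3, 0.5, 1.4, 1.6, 2.9]

# (1) WLS of y on x with weights 1/x, in closed form
w = [1 / xi for xi in x]
sw = sum(w)
xb = sum(wi * xi for wi, xi in zip(w, x)) / sw
yb = sum(wi * yi for wi, yi in zip(w, y)) / sw
b_wls = (sum(wi * (xi - xb) * (yi - yb) for wi, xi, yi in zip(w, x, y))
         / sum(wi * (xi - xb) ** 2 for wi, xi in zip(w, x)))
a_wls = yb - b_wls * xb

# (2) no-intercept OLS of yst = y/sqrt(x) on u = 1/sqrt(x) and v = sqrt(x),
#     solving the 2x2 normal equations directly
u = [1 / math.sqrt(xi) for xi in x]
v = [math.sqrt(xi) for xi in x]
yst = [yi / math.sqrt(xi) for yi, xi in zip(y, x)]
suu = sum(ui * ui for ui in u)
svv = sum(vi * vi for vi in v)
suv = sum(ui * vi for ui, vi in zip(u, v))
buy = sum(ui * yi for ui, yi in zip(u, yst))
bvy = sum(vi * yi for vi, yi in zip(v, yst))
det = suu * svv - suv * suv
a_tr = (buy * svv - bvy * suv) / det   # coefficient of 1/sqrt(x)
b_tr = (bvy * suu - buy * suv) / det   # coefficient of sqrt(x)
```

The two pairs of coefficients agree to machine precision, just as the
SAS estimates -0.94088 and 0.67378 appear in both fits above.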
### The residuals plot now is much more patternless.
### When we re-fit the same model WITH intercept we find,
### as expected, that the intercept is insignificantly
### different from 0.
proc reg data=home.simwgt;
model yst = xrtinv xsqrt ;
var x;
plot residual. * x;
run;
Intercept 1 -4.40620 4.10714 -1.07 0.2840
xrtinv 1 2.46807 3.18395 0.78 0.4387
xsqrt 1 2.08449 1.31747 1.58 0.1144
## Remarkably, and confusingly, ALL of the coefficients
## seem insignificant now, although we know they are not.
## The culprit is near-collinearity: with x clustered near 2.5,
## the columns 1, 1/sqrt(x), and sqrt(x) are almost linearly
## dependent, which inflates all of the standard errors.
## More importantly, the residuals look just as before,
## and the overall model Rsq is again .0987 !!
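## A quick numerical illustration (Python sketch, fresh random stream,
## so not the exact SAS values): over the narrow range of x, xrtinv and
## xsqrt are very strongly (negatively) correlated, so once an intercept
## column is added the design is nearly singular:

```python
import math
import random

rng = random.Random(77777)               # any seed; the effect is generic
x = [max(2.5 + 0.5 * rng.gauss(0, 1), 1e-2) for _ in range(400)]
u = [1 / math.sqrt(xi) for xi in x]      # xrtinv
v = [math.sqrt(xi) for xi in x]          # xsqrt

def corr(a, b):
    """Sample correlation coefficient of two equal-length lists."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b))
    return cov / math.sqrt(sum((ai - ma) ** 2 for ai in a)
                           * sum((bi - mb) ** 2 for bi in b))

r = corr(u, v)   # very close to -1
```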