SECOND LOG Covering Multiple Regression Model
Building Steps, including ANOVA from PROC GLM
================================= Stat 430, 11/22/06
## Note first that all of the output features
## we have been using (predicted values, residuals, LCL
## and UCL, etc.) can also be produced using PROC GLM.
## The extra advantage of GLM is that it produces
## a "Type I SSQ" Analysis of Variance (ANOVA)
## Table. (See below for discussion.) The only
## disadvantage to PROC GLM is that it does not
## have automatic model selection options, which
## we will cover in PROC REG after the break.
libname home ".";
data simfil (drop=seed i);
seed=537;
do i=1 to 600;
x = -0.5*log(1- ranuni(seed));
if ranuni(seed) < 0.4 then w = 1;
else w=0;
y = 3+ 2*X - X**2 + 1.5*(X**2)*W + 1.2*rannor(seed);
xsq = X**2; XW = X*W; XSQW = (X**2)*W;
output; end; /* produces 600 obs, 6 variables */
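For readers who want to experiment outside SAS, here is a rough Python replica of the data step above. This is a sketch only: Python's random module is not the same generator as SAS's ranuni/rannor, so even with seed 537 the draws will not reproduce this log's numbers.

```python
import math
import random

# Rough Python replica of the SAS data step above.  The RNG differs
# from SAS's ranuni/rannor, so the simulated values will not match
# the output shown in this log, only the model structure.
random.seed(537)
rows = []
for _ in range(600):
    x = -0.5 * math.log(1.0 - random.random())   # Exponential draw, as in the SAS code
    w = 1 if random.random() < 0.4 else 0        # Bernoulli(0.4) indicator
    y = 3 + 2*x - x**2 + 1.5*(x**2)*w + 1.2*random.gauss(0.0, 1.0)
    rows.append({"y": y, "x": x, "w": w,
                 "xsq": x**2, "xw": x*w, "xsqw": (x**2)*w})
# rows now holds 600 observations with the same 6 variables as simfil
```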
proc corr data=simfil;
var y; with x W;
## Cor y with x = 0.2284, p-val <.0001
## Cor y with w = 0.19425, p-val <.0001
proc corr data=simfil;
var y; with xsq xW; partial x W;
## Partial (after removing x W )
## Cor y with xsq = -0.26696 , p-val <.0001
## Cor y with XW = 0.48074 , p-val <.0001
Note that the partial correlation of Y with XW is
way too large to be due to chance, even though there
is no X*W term in the model we simulated !! The reason
for the large correlation is that the residuals of Y
after removing X and W are indeed correlated with XW,
because X^2*W is. That tells us to interpret the high
partial correlation as saying not exactly that X*W is
present, but that some interaction term like it is.
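The defining identity behind PROC CORR's PARTIAL statement, that the partial correlation of Y with XW given X and W is the plain correlation between the two residual series after regressing each on (1, X, W), can be checked numerically. Here is a sketch using numpy (an addition, not part of the SAS log; the RNG differs from SAS, so the value will be near, but not equal to, the 0.48074 printed above):

```python
import numpy as np

# Partial correlation of y with xw, controlling for x and w, computed
# as the correlation of residuals from regressing each on (1, x, w).
# Data are simulated from the same model as the SAS data step above.
rng = np.random.default_rng(537)
n = 600
x = -0.5 * np.log(1 - rng.random(n))
w = (rng.random(n) < 0.4).astype(float)
y = 3 + 2*x - x**2 + 1.5*(x**2)*w + 1.2*rng.standard_normal(n)
xw = x * w

Z = np.column_stack([np.ones(n), x, w])      # controls, with intercept

def resid(v):
    beta, *_ = np.linalg.lstsq(Z, v, rcond=None)
    return v - Z @ beta

partial = np.corrcoef(resid(y), resid(xw))[0, 1]
# partial comes out large and positive, even though no x*w term was
# simulated, because resid(xw) is correlated with the x^2*w effect.
```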
---------------------------------------------------
proc glm data=simfil;
model y = x W xsq XW XSQW ;
output out= rsdetc p = Yprd r = Yrsd
Rstudent = StudYRes L95 = LowYCI U95 = HiYCI;
run;
## Here p = number of regression coefficients is 6, one of
which is for the intercept. The DF for Model terms below
is the number p-1 = 5 of NONCONSTANT explanatory variables.
Sum of
Source DF Squares Mean Square F Value Pr > F
Model 5 502.3705 100.4741 71.20 <.0001
Error 594 838.1891 1.4111
Corrected Total 599 1340.5596
## This crude Analysis of Variance (ANOVA) table is produced
both by PROC REG and PROC GLM. Recall that the Error Mean Square
= Residual Sum of Squares divided by n-p is the same as the
estimated residual variance sigma_hat^2 = 1.4111 in this case.
Also, CSS = Corrected sum of squares total = (n-1)*S_Y^2 = 1340.56.
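The arithmetic just described can be verified directly from the printed ANOVA table. A small check (in Python, as an illustration added to this log):

```python
# Verify the crude ANOVA arithmetic using the values printed above.
model_ss, error_ss, total_ss = 502.3705, 838.1891, 1340.5596
n, p = 600, 6

sigma_hat_sq = error_ss / (n - p)               # Error Mean Square = RSS/(n-p)
overall_f = (model_ss / (p - 1)) / sigma_hat_sq  # Model MS / Error MS
sample_var_y = total_ss / (n - 1)               # S_Y^2, since CSS = (n-1)*S_Y^2
```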
### The Type I SSQ ANOVA Table is produced only by PROC GLM and
is the main reason to use that PROC. Each line provides a
snapshot of the new decrease in RSS due to inclusion of the
current term in the "Source" line of the model. In each line,
the Type I SS is equal to CSS*(multiple R-squared
including the current term minus multiple R-squared
including all previous model terms), and is also equal to the
RSS for the regression model including all previous model terms
but not the current term, minus the RSS after also
including the current term.
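The "successive drop in RSS" description can be demonstrated by fitting the nested sequence of models directly. A numpy sketch (added for illustration; the RNG differs from SAS, so the SS values differ from the table below, but the telescoping identity holds exactly):

```python
import numpy as np

# Type I (sequential) SS computed as successive drops in RSS as terms
# are added one at a time, in the order x, w, xsq, xw, xsqw.
rng = np.random.default_rng(537)
n = 600
x = -0.5 * np.log(1 - rng.random(n))
w = (rng.random(n) < 0.4).astype(float)
y = 3 + 2*x - x**2 + 1.5*(x**2)*w + 1.2*rng.standard_normal(n)
cols = [np.ones(n), x, w, x**2, x*w, (x**2)*w]   # intercept first, then entry order

def rss(k):
    """Residual sum of squares using the first k columns."""
    X = np.column_stack(cols[:k])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return float(np.sum((y - X @ beta) ** 2))

css = float(np.sum((y - y.mean()) ** 2))          # corrected total SS
type1_ss = [rss(k) - rss(k + 1) for k in range(1, len(cols))]
# The five sequential SS telescope: they sum to CSS - RSS(full model),
# i.e. to the overall Model SS of the crude ANOVA table.
```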
### Under the null hypothesis that the model already fits in the
previous line of the table, the current F-value, which is the
current (Type I SS divided by DF) divided by sigma_hat^2,
would be distributed as an F statistic with DF and n-p degrees
of freedom. This statistic is centered near 1, and values above
3 or 4 are large (but the exact percentage points do depend on DF
and n-p, e.g. the 95% quantile for 1, 594 df's is 3.86 and the 99%
quantile is 6.68).
Source DF Type I SS Mean Square F Value Pr > F
x 1 69.93302 69.93302 49.56 <.0001
w 1 50.74898 50.74898 35.96 <.0001
xsq 1 86.93936 86.93936 61.61 <.0001
XW 1 233.99894 233.99894 165.83 <.0001
XSQW 1 60.75024 60.75024 43.05 <.0001
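The F Value column of this table can be recomputed from the printed Type I SS values and the Error Mean Square (a quick Python check, added for illustration):

```python
# Recompute the F Value column of the Type I SS table above:
# F = (Type I SS / DF) / sigma_hat^2, with DF = 1 for every term here
# and sigma_hat^2 = 1.4111 from the Error Mean Square.
sigma_hat_sq = 1.4111
type1_ss = {"x": 69.93302, "w": 50.74898, "xsq": 86.93936,
            "XW": 233.99894, "XSQW": 60.75024}
f_values = {term: (ss / 1) / sigma_hat_sq for term, ss in type1_ss.items()}
```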
### So this Type I SS table says strongly that every term
entered into the model was HIGHLY SIGNIFICANT AT THE POINT WHERE
IT WAS ENTERED INTO THE MODEL, although we know that not all
of these terms should be needed in the final model.
Dependent Variable: y : Type III SS
### The table of coefficient estimates, standard errors and t-ratios
with p-values is produced by either PROC REG or PROC GLM.
Standard
Parameter Estimate Error t Value Pr > |t|
Intercept 3.1139 0.10385 29.99 <.0001
x 1.5485 0.27164 5.70 <.0001
w -0.0815 0.17448 -0.47 0.6406
xsq -0.8639 0.11220 -7.70 <.0001
XW -0.0055 0.49145 -0.01 0.9911
XSQW 1.5887 0.24214 6.56 <.0001
### Here we can see that x, xsq, and xsqw in the final model including
all terms are super-significant, but that W and XW are not
significant at all: we know this to be true from the simulation
code.
### The Type III ANOVA table (produced by PROC GLM) gives information
essentially like that of the coefficient t-ratios,
which tell whether a specific coefficient is significantly
different from 0 when all other terms have already been entered
into the model. Indeed, the F Value column is exactly equal to
the square of the t Value column from the previous table of
coefficient estimates, so it is not surprising that
the corresponding p-values are exactly equal.
Source DF Type III SS Mean Square F Value Pr > F
x 1 45.85473 45.85473 32.50 <.0001
w 1 0.30788 0.30788 0.22 0.6406
xsq 1 83.65504 83.65504 59.28 <.0001
XW 1 0.00017 0.00017 0.00 0.9911
XSQW 1 60.75024 60.75024 43.05 <.0001
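The t-squared relation can be confirmed from the two printed tables (the small discrepancies are just rounding of the displayed t values; a Python check added for illustration):

```python
# Verify that each Type III F Value equals the square of the
# corresponding t Value from the coefficient table above.
t_values = {"x": 5.70, "w": -0.47, "xsq": -7.70, "XW": -0.01, "XSQW": 6.56}
f_type3  = {"x": 32.50, "w": 0.22, "xsq": 59.28, "XW": 0.00, "XSQW": 43.05}
t_squared = {term: t ** 2 for term, t in t_values.items()}
# e.g. 5.70**2 = 32.49, matching 32.50 up to the rounding of t
```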
### This log shows that we want to look both at whether individual
### coefficients are significant in the final model and at whether
### variables, IN THE ORDER ENTERED, contributed
### to a noticeable reduction in RSS.
## Generally we will not include a higher-order term unless the
lower-order terms made of the same ingredients are also included,
but as you can see here that may result in some terms with
negligible coefficients being kept !!