SECOND LOG Covering Multiple Regression Model Building Steps,
including ANOVA from PROC GLM
=================================          Stat 430, 11/22/06

## Note first that all of the output features we have been
## using (predicted values, residuals, LCL and UCL, etc.)
## can also be produced using PROC GLM. The extra advantage
## of GLM is that it produces a "Type I SSQ" Analysis of
## Variance (ANOVA) table. (See below for discussion.) The
## only disadvantage of PROC GLM is that it does not have
## automatic model selection options, which we will cover
## in PROC REG after the break.

libname home ".";

data simfil (drop=seed i);
   seed = 537;
   do i = 1 to 600;
      x = -0.5*log(1 - ranuni(seed));
      if ranuni(seed) < 0.4 then w = 1;
      else w = 0;
      y = 3 + 2*x - x**2 + 1.5*(x**2)*w + 1.2*rannor(seed);
      xsq = x**2;
      XW = x*w;
      XSQW = (x**2)*w;
      output;
   end;
run;          /* produces 600 obs, 6 variables */

proc corr data=simfil;
   var y;
   with x w;
run;
## Cor y with x = 0.2284,  p-val <.0001
## Cor y with w = 0.19425, p-val <.0001

proc corr data=simfil;
   var y;
   with xsq XW;
   partial x w;
run;
## Partial (after removing x, w):
## Cor y with xsq = -0.26696, p-val <.0001
## Cor y with XW  =  0.48074, p-val <.0001

Note that the partial correlation of Y with XW is far too
large to be due to chance, even though there is no X*W term
in the model we simulated !! The reason for the large
correlation is that the residuals after removing X, W from Y
are indeed correlated with XW, because X^2*W is. That tells
us to interpret the high partial correlation as saying not
exactly that X*W is present, but that some interaction term
like it is.
---------------------------------------------------

proc glm data=simfil;
   model y = x w xsq XW XSQW;
   output out=rsdetc p=Yprd r=Yrsd Rstudent=StudYRes
          L95=LowYCI U95=HiYCI;
run;

## Here p = number of regression coefficients is 6, one of
## which is for the intercept. The DF for Model terms below
## is the number p-1 = 5 of NONCONSTANT explanatory variables.
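The point about partial correlations can be checked by brute force in
any language: regress Y and the new variable on the controls, then
correlate the two residual vectors. A minimal numpy sketch, assuming a
simulation of the same form as the data step above (the seed and RNG
stream differ from SAS's ranuni, so the numbers only roughly match the
0.48 and -0.267 reported):

```python
import numpy as np

rng = np.random.default_rng(537)   # arbitrary seed, NOT SAS's ranuni stream
n = 600
x = -0.5 * np.log(1 - rng.uniform(size=n))    # Exponential, as in the data step
w = (rng.uniform(size=n) < 0.4).astype(float)
y = 3 + 2*x - x**2 + 1.5*(x**2)*w + 1.2*rng.standard_normal(n)
xsq, xw = x**2, x*w

def partial_corr(y, v, controls):
    """Correlation of y with v after removing the linear effect of controls."""
    X = np.column_stack([np.ones(len(y))] + controls)
    ry = y - X @ np.linalg.lstsq(X, y, rcond=None)[0]
    rv = v - X @ np.linalg.lstsq(X, v, rcond=None)[0]
    return np.corrcoef(ry, rv)[0, 1]

print(partial_corr(y, xsq, [x, w]))   # clearly negative, like SAS's -0.267
print(partial_corr(y, xw,  [x, w]))   # strongly positive, like SAS's 0.481
```

The second partial correlation comes out large and positive even though
no X*W term was simulated, because the residuals are correlated with
XSQW, which in turn is highly correlated with XW.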
                                 Sum of
Source               DF         Squares     Mean Square    F Value   Pr > F
Model                 5        502.3705        100.4741      71.20   <.0001
Error               594        838.1891          1.4111
Corrected Total     599       1340.5596

## This crude Analysis of Variance (ANOVA) table is produced
## by both PROC REG and PROC GLM. Recall that the Error Mean
## Square = Residual Sum of Squares divided by n-p is the same
## as the estimated residual variance sigma_hat^2 = 1.4111 in
## this case. Also, CSS = corrected total sum of squares =
## (n-1)*S_Y^2 = 1340.56.

### The Type I SSQ ANOVA table is produced only by PROC GLM
### and is the main reason to use that PROC. Each line gives
### the decrease in RSS due to inclusion of the term named in
### the "Source" column. The Type I SS on each line is equal
### to CSS*(squared multiple correlation R^2 including the
### current term minus R^2 including only the previous model
### terms), and is also equal to the RSS for the regression
### model including all previous model terms minus the RSS
### after including the current term.

### Under the null hypothesis that the model in the previous
### line of the table already fits, the current F value,
### which is the current (Type I SS divided by DF) divided by
### sigma_hat^2, would be distributed as an F statistic with
### DF and n-p degrees of freedom. This statistic is centered
### near 1, and values above 3 or 4 are large (but the exact
### percentage points do depend on DF and n-p, e.g. the 95%
### quantile for 1, 594 df's is 3.86 and the 99% quantile is
### 6.68).

Source      DF       Type I SS     Mean Square    F Value   Pr > F
x            1        69.93302        69.93302      49.56   <.0001
w            1        50.74898        50.74898      35.96   <.0001
xsq          1        86.93936        86.93936      61.61   <.0001
XW           1       233.99894       233.99894     165.83   <.0001
XSQW         1        60.75024        60.75024      43.05   <.0001

### So this Type I SS table says strongly that every term
### entered into the model was HIGHLY SIGNIFICANT AT THE
### POINT WHERE IT WAS ENTERED INTO THE MODEL, although we
### know that not all of these terms should be needed in the
### final model.
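The Type I (sequential) decomposition above can be verified by brute
force: fit the nested sequence of models, difference the RSS's, and
check that the pieces telescope to the Model SS. A numpy/scipy sketch
under the same simulated-data assumptions (arbitrary seed, so the SS
values only roughly match SAS's):

```python
import numpy as np
from scipy.stats import f

rng = np.random.default_rng(537)   # arbitrary seed, NOT SAS's ranuni stream
n = 600
x = -0.5 * np.log(1 - rng.uniform(size=n))
w = (rng.uniform(size=n) < 0.4).astype(float)
y = 3 + 2*x - x**2 + 1.5*(x**2)*w + 1.2*rng.standard_normal(n)
terms = {"x": x, "w": w, "xsq": x**2, "XW": x*w, "XSQW": (x**2)*w}

def rss(cols):
    """Residual sum of squares for y regressed on intercept + cols."""
    X = np.column_stack([np.ones(n)] + cols)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return np.sum((y - X @ beta) ** 2)

css = np.sum((y - y.mean()) ** 2)      # corrected total SS = (n-1)*S_Y^2
prev, cols, type1 = css, [], {}
for name, v in terms.items():          # enter terms in model order
    cols.append(v)
    cur = rss(cols)
    type1[name] = prev - cur           # Type I SS: drop in RSS at entry
    prev = cur
sigma2 = prev / (n - 6)                # error mean square, df = n - p

for name, ss in type1.items():
    print(name, ss, ss / sigma2)       # each Type I SS and its F value (DF=1)
# the Type I SS's telescope: they sum to Model SS = CSS - RSS(full)
print(f.ppf(0.95, 1, n - 6), f.ppf(0.99, 1, n - 6))   # ~3.86 and ~6.68
```

The last line reproduces the quoted F(1, 594) percentage points; the
telescoping identity holds for any data, since each Type I SS is just
the RSS drop between two nested fits.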
Dependent Variable: y : Type III SS

### The table of coefficient estimates, standard errors and
### t-ratios with p-values is produced by either PROC REG or
### PROC GLM.

                              Standard
Parameter      Estimate          Error     t Value   Pr > |t|
Intercept        3.1139        0.10385       29.99     <.0001
x                1.5485        0.27164        5.70     <.0001
w               -0.0815        0.17448       -0.47     0.6406
xsq             -0.8639        0.11220       -7.70     <.0001
XW              -0.0055        0.49145       -0.01     0.9911
XSQW             1.5887        0.24214        6.56     <.0001

### Here we can see that x, xsq, and XSQW are highly
### significant in the final model including all terms, but
### that w and XW are not significant at all: we know this to
### be true from the simulation code.

### The Type III ANOVA table (produced by PROC GLM) gives
### information essentially like that of the coefficient
### t-statistics: it tells whether a specific coefficient is
### significantly different from 0 when all other terms have
### already been entered into the model. Indeed, the F Value
### column is exactly equal to the square of the t Value
### column from the previous table of coefficient estimates,
### so it is not surprising that the corresponding p-values
### are exactly equal.

Source      DF     Type III SS     Mean Square    F Value   Pr > F
x            1        45.85473        45.85473      32.50   <.0001
w            1         0.30788         0.30788       0.22   0.6406
xsq          1        83.65504        83.65504      59.28   <.0001
XW           1         0.00017         0.00017       0.00   0.9911
XSQW         1        60.75024        60.75024      43.05   <.0001

### This log shows that we want to look both at whether
### individual coefficients are significant in the final
### model and at whether variables, IN THE ORDER ENTERED,
### contributed a noticeable reduction in RSS.

## Generally we will not include a higher-order term unless
## the lower-order terms made of the same ingredients are
## also included, but as you can see here that may result in
## some terms with negligible coefficients being kept !!
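The claim that each Type III F value is exactly the square of the
corresponding t-ratio can also be verified directly: for a single-DF
term, Type III SS = RSS(model without the term) - RSS(full model), and
F = Type III SS / sigma_hat^2. A numpy sketch, again under the same
simulated-data assumptions (arbitrary seed, not SAS's stream), checking
the identity for the w coefficient:

```python
import numpy as np

rng = np.random.default_rng(537)   # arbitrary seed, NOT SAS's ranuni stream
n = 600
x = -0.5 * np.log(1 - rng.uniform(size=n))
w = (rng.uniform(size=n) < 0.4).astype(float)
y = 3 + 2*x - x**2 + 1.5*(x**2)*w + 1.2*rng.standard_normal(n)
cols = [x, w, x**2, x*w, (x**2)*w]            # x, w, xsq, XW, XSQW

Xfull = np.column_stack([np.ones(n)] + cols)
beta, *_ = np.linalg.lstsq(Xfull, y, rcond=None)
rss_full = np.sum((y - Xfull @ beta) ** 2)
sigma2 = rss_full / (n - Xfull.shape[1])      # error mean square, df = n - p

# t-ratio for the w coefficient (column 2 of Xfull)
cov = sigma2 * np.linalg.inv(Xfull.T @ Xfull)
t_w = beta[2] / np.sqrt(cov[2, 2])

# Type III SS for w: extra RSS incurred when w is dropped from the full model
Xdrop = np.delete(Xfull, 2, axis=1)
bd, *_ = np.linalg.lstsq(Xdrop, y, rcond=None)
ss3_w = np.sum((y - Xdrop @ bd) ** 2) - rss_full
F_w = ss3_w / sigma2

print(F_w, t_w**2)    # the two numbers agree: Type III F = t^2
```

The agreement is an algebraic identity for single-DF terms, not a
feature of this particular simulation, which is why the Type III
p-values duplicate those of the t-table.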