THIRD LOG Covering Multiple Regression Model
Building Steps: Model Selection and Weights
========================================== Stat 430, 11/27/06
(I) The first portion of this Log illustrates the SELECTION
option within the MODEL statement of PROC REG.
libname home ".";
data home.simfil (drop=seed i);
seed=537;
do i=1 to 600;
x = -0.5*log(1- ranuni(seed));
if ranuni(seed) < 0.4 then w = 1;
else w=0;
y = 3+ 2*X - X**2 + 1.5*(X**2)*W + 1.2*rannor(seed);
xsq = X**2; XW = X*W; XSQW = (X**2)*W; XCU = X**3;
XCUW = XCU*W;
output;
end;
run;   /* produces 600 obs, 8 variables */
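## A Python stand-in for the SAS data step above -- a sketch only, since
## Python's random streams differ from SAS ranuni/rannor, so individual
## values will not match the log:

```python
import math
import random

rng = random.Random(537)          # plays the role of seed=537 (different stream than SAS)
rows = []
for _ in range(600):
    x = -0.5 * math.log(1 - rng.random())   # Exponential with mean 0.5, like -0.5*log(1-ranuni)
    w = 1 if rng.random() < 0.4 else 0      # Bernoulli(0.4) group indicator
    y = 3 + 2*x - x**2 + 1.5*(x**2)*w + 1.2*rng.gauss(0, 1)
    rows.append(dict(x=x, w=w, y=y, xsq=x**2, xw=x*w,
                     xsqw=(x**2)*w, xcu=x**3, xcuw=(x**3)*w))
```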
## Recall that the following Type I SS table was generated in the
MultReg2.Log script, using PROC GLM with variables entered
in the order X W XSQ XW XSQW
Source DF Type I SS Mean Square F Value Pr > F
x 1 69.93302 69.93302 49.56 <.0001
w 1 50.74898 50.74898 35.96 <.0001
xsq 1 86.93936 86.93936 61.61 <.0001
XW 1 233.99894 233.99894 165.83 <.0001
XSQW 1 60.75024 60.75024 43.05 <.0001
## Now we consider the model-selection features in PROC REG.
proc reg data=home.simfil;
model y = x w xsq xcu xw xsqw /
selection = forward slentry=.05;
run;
Corrected Total 599 1340.55960
### NOTE BELOW: Cp = RSS/sighat**2 + 2p - n, where sighat**2 is the
### MSE of the full model and p counts the fitted parameters
### (including the intercept).
Forward Selection: Step 1
Variable XSQW Entered: R-Square = 0.2913 and C(p) = 77.0811
Forward Selection: Step 2
Variable XCU Entered: R-Square = 0.3490 and C(p) = 24.2663
Forward Selection: Step 3
Variable x Entered: R-Square = 0.3736 and C(p) = 2.8871
No other variable met the 0.0500 significance level for entry into the
model.
proc reg data=home.simfil;
model y = x w xsq xcu xw xsqw /
selection = forward slentry=.15;
run;
### Exactly the same result !!! The problem is that at each
### stage the best of the not-yet-selected variables is chosen,
### and none is ever deleted.
### NOTE that the model with all of the correct terms in it
### ( X XSQ XSQW and no others ) has R-square = .3741
proc reg data=home.simfil;
model y = x w xsq xcu xw xsqw /
selection = backward slstay=.10;
run;
Backward Elimination: Step 0
All Variables Entered: R-Square = 0.3756 and C(p) = 7.0000
Backward Elimination: Step 1
Variable XW Removed: R-Square = 0.3755 and C(p) = 5.0938
Backward Elimination: Step 2
Variable w Removed: R-Square = 0.3751 and C(p) = 3.5122
Backward Elimination: Step 3
Variable XCU Removed: R-Square = 0.3741 and C(p) = 2.4055
All variables left in the model are significant at the 0.1000 level.
### This leaves only the true model terms !!!
proc reg data=home.simfil;
model y = x w xsq xcu xw xsqw /
selection = stepwise slentry = .05 slstay=.10;
run;
STEPS: entered XSQW, then XCU, then X, then stopped with the statements:
All variables left in the model are significant at the 0.1000 level.
No other variable met the 0.0500 significance level for entry into
the model.
## Here is one final way to do it, searching for the best Cp and
## also producing the values of AIC, another criterion, defined by:
##      AIC = n*log(RSS/n) + 2p
proc reg data=home.simfil;
model y = x w xsq xcu xw xsqw /
selection = CP AIC ;
run;
### Now get exhaustive list tabulating CP & AIC by # variables in
model, from best to worst according to Cp criterion:
Number in
 Model      C(p)    R-Square       AIC    Variables in Model
    3     2.4055      0.3741  209.1769    x xsq XSQW
    3     2.8871      0.3736  209.6628    x XCU XSQW
  ...
    4     3.5122      0.3751  210.2745    x xsq XCU XSQW
  ...
    5     5.4091      0.3752  212.1703    x xsq XCU XW XSQW
  ...
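## A quick Python check (not part of the SAS log) that the printed C(p)
## and AIC values agree with the formulas above; the small discrepancies
## come from the rounding of the reported R-squares:

```python
import math

n, sstot = 600, 1340.55960           # n and the Corrected Total SS from the log
rss_full = (1 - 0.3756) * sstot      # full model: 6 predictors + intercept
sighat2 = rss_full / (n - 7)         # full-model MSE, the sighat**2 in Cp

rss3 = (1 - 0.3741) * sstot          # model x xsq XSQW  (p = 4 with intercept)
cp3 = rss3 / sighat2 + 2 * 4 - n     # Cp  = RSS/sighat**2 + 2p - n
aic3 = n * math.log(rss3 / n) + 2 * 4  # AIC = n*log(RSS/n) + 2p
```

Both land within about 0.05 of the tabulated 2.4055 and 209.1769.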
=============================================================
(II) The second part of this Script illustrates why and how to
use WEIGHTS in PROC REG.
* Simulate a dataset with variances growing in prescribed fashion
as a function of a simple explanatory variable X.
data home.simWGT (drop = i seed);
seed = 77777;
do i=1 to 400;
x = 2.5 + 0.5*rannor(seed);
if(x < 0) then x=1e-2;
y = -1 + 0.7*x + 0.5*sqrt(x)*rannor(seed);
xsqrt = sqrt(x);
xrtinv = 1/xsqrt;
yst = y/xsqrt;
W = 1/x;
output;
end;
run;
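## Again a Python stand-in for the SAS data step (a sketch: the rannor
## stream differs, so values will not match the log), showing how the
## error standard deviation is made to grow like sqrt(x):

```python
import math
import random

rng = random.Random(77777)           # plays the role of seed=77777 (different stream than SAS)
data = []
for _ in range(400):
    x = 2.5 + 0.5 * rng.gauss(0, 1)
    if x < 0:
        x = 1e-2
    y = -1 + 0.7 * x + 0.5 * math.sqrt(x) * rng.gauss(0, 1)   # Var(error) proportional to x
    data.append((x, y))
```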
### As mentioned in class 11/27, this y vs x dataset does
### not satisfy the homoscedastic errors assumption
goptions csymbol=black;
options nocenter nodate linesize=70;
proc reg data=home.simwgt;
model y = x ;
plot residual. * x;
run;
### You might not have spotted it if you didn't know to
### look, but this residuals plot definitely flares out,
### becoming wider as x becomes larger.
### So now we re-run the regression two ways: once
### with a WEIGHT statement and once with transformed
### response and regression variables.
proc reg data=home.simwgt;
model y = x ;
weight w;
run;
## Estimates a = -0.94088 , b = 0.67378
## Std errors 0.20146 0.08227
## The estimated standard errors are also valid here,
## because the weights are specified inversely
## proportional to the residual variances.
## Now do the transformed regression: we indicated in class 11/27
## that the original model is equivalent to
##      model yst = xrtinv xsqrt / noint ;
## NB: to fit a model without an intercept, use the NOINT model option!
proc reg data=home.simwgt;
model yst = xrtinv xsqrt / noint;
var x; /* needed for next plot statement */
plot residual.*x ;
run;
Rsq = 0.0987
Dependent Variable: yst
Parameter Standard
Variable DF Estimate Error t Value Pr > |t|
xrtinv 1 -0.94088 0.20146 -4.67 <.0001
xsqrt 1 0.67378 0.08130 8.29 <.0001
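## Why the transformed fit reproduces the weighted fit exactly: dividing
##    y = a + b*x + sqrt(x)*e      through by sqrt(x) gives
##    y/sqrt(x) = a*(1/sqrt(x)) + b*sqrt(x) + e,
## so no-intercept OLS of yst on (xrtinv, xsqrt) minimizes exactly the
## weighted criterion sum (1/x)*(y - a - b*x)**2. A numeric check in
## Python (toy data, not the simWGT stream):

```python
import math

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [0.3, 0.5, 1.4, 1.6, 2.9]

# (1) WLS of y on x with weights 1/x, in closed form
w = [1 / xi for xi in x]
sw = sum(w)
xb = sum(wi * xi for wi, xi in zip(w, x)) / sw
yb = sum(wi * yi for wi, yi in zip(w, y)) / sw
b_wls = (sum(wi * (xi - xb) * (yi - yb) for wi, xi, yi in zip(w, x, y))
         / sum(wi * (xi - xb) ** 2 for wi, xi in zip(w, x)))
a_wls = yb - b_wls * xb

# (2) no-intercept OLS of yst = y/sqrt(x) on u = 1/sqrt(x) and v = sqrt(x),
#     solving the 2x2 normal equations directly
u = [1 / math.sqrt(xi) for xi in x]
v = [math.sqrt(xi) for xi in x]
yst = [yi / math.sqrt(xi) for yi, xi in zip(y, x)]
suu = sum(ui * ui for ui in u)
svv = sum(vi * vi for vi in v)
suv = sum(ui * vi for ui, vi in zip(u, v))
buy = sum(ui * yi for ui, yi in zip(u, yst))
bvy = sum(vi * yi for vi, yi in zip(v, yst))
det = suu * svv - suv * suv
a_tr = (buy * svv - bvy * suv) / det   # coefficient of 1/sqrt(x)
b_tr = (bvy * suu - buy * suv) / det   # coefficient of sqrt(x)
```

The two pairs of coefficients agree to machine precision, just as the
SAS estimates -0.94088 and 0.67378 appear in both fits above.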
### The residuals plot now is much more patternless.
### When we re-fit the same model WITH intercept we find,
### as expected, that the intercept is insignificantly
### different from 0.
proc reg data=home.simwgt;
model yst = xrtinv xsqrt ;
var x;
plot residual. * x;
run;
Intercept 1 -4.40620 4.10714 -1.07 0.2840
xrtinv 1 2.46807 3.18395 0.78 0.4387
xsqrt 1 2.08449 1.31747 1.58 0.1144
## Remarkably, and confusingly, ALL of the coefficients
## seem insignificant now, although we know they are not.
## The culprit is near-collinearity: with x clustered near 2.5,
## the columns 1, 1/sqrt(x), and sqrt(x) are almost linearly
## dependent, which inflates all of the standard errors.
## More importantly, the residuals look just as before,
## and the overall model Rsq is again .0987 !!
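## A quick numerical illustration (Python sketch, fresh random stream,
## so not the exact SAS values): over the narrow range of x, xrtinv and
## xsqrt are very strongly (negatively) correlated, so once an intercept
## column is added the design is nearly singular:

```python
import math
import random

rng = random.Random(77777)               # any seed; the effect is generic
x = [max(2.5 + 0.5 * rng.gauss(0, 1), 1e-2) for _ in range(400)]
u = [1 / math.sqrt(xi) for xi in x]      # xrtinv
v = [math.sqrt(xi) for xi in x]          # xsqrt

def corr(a, b):
    """Sample correlation coefficient of two equal-length lists."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b))
    return cov / math.sqrt(sum((ai - ma) ** 2 for ai in a)
                           * sum((bi - mb) ** 2 for bi in b))

r = corr(u, v)   # very close to -1
```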