Log to Illustrate Data Analysis involving Dummy Varbls ====================================================== Ships data : to see the transformed dataset "ShipData" inputted to SAS below, see the course Data Directory ### 5 types (A:E) x 3 years of construction (60=1960-1964, 65=1965-1969, 70, 75) ### x 2 periods of operation (60=1960-1974, 75 = 1975-1979) ### (months in) service is quantitative, correlated 0.852 with incidents Source: P. McCullagh and J. A. Nelder, (1983), _Generalized Linear Models._ Chapman & Hall, section 6.3.2, page 137 SUMMARY OF LINEAR MODEL COEFF'S FITTED TO ALL VARIABLES: Estimate Std. Error t value Pr(>|t|) (Intercept) -21.448322468 9.7501259330 -2.1997995 3.566406e-02 service 0.001075327 0.0001815585 5.9227580 1.730609e-06 typeB 9.058471238 4.3659311530 2.0748085 4.667860e-02 typeC -3.306965265 3.2432652215 -1.0196407 3.160517e-01 typeD -2.446871894 3.2444233883 -0.7541777 4.566218e-01 typeE -0.664215603 3.2439105940 -0.2047577 8.391444e-01 year65 7.380488596 2.9058251582 2.5398942 1.650543e-02 year70 8.213733177 2.9568138231 2.7779000 9.343891e-03 year75 2.108045506 3.0507524204 0.6909920 4.948830e-01 period 0.311070865 0.1369833567 2.2708661 3.049699e-02 ## now have significant coeff's for A vs B but not A vs C, D,E ## also it looks as though 65,70 and 60,75 group together!? ANOVA: Response: incidents Df Sum Sq Mean Sq F value Pr(>F) service 1 6344.6 6344.6 150.8714 3.11e-13 *** type 4 428.1 107.0 2.5448 0.05999 . year 3 478.6 159.5 3.7933 0.02032 * period 1 216.9 216.9 5.1568 0.03050 * Residuals 30 1261.6 42.1 ## We can see that all model terms look significant except ## possibly type, and that is because types B, (A,E), (C,D) ## seem to form three distinct groups. ## Also, scatterplot seems to show many small values of service ## and incidents with roughly log behavior. ## So RECODED to logserv = log(1+service) and ## logIncd = log(1+incidents) and ## grp = 0 for type (A,E) ## = 1 for type B ## = 2 for type (C,D) COEFFICIENTS FOR RE-FITTED LINEAR MODEL: Estimate Std. Error t value Pr(>|t|) (Intercept) -1.53020214 0.85625029 -1.787097 8.339826e-02 logserv 0.26771121 0.04456239 6.007560 1.061944e-06 grp1 0.96627165 0.28923421 3.340793 2.135015e-03 grp2 -0.60837469 0.20485589 -2.969769 5.610574e-03 year65 0.36468738 0.26625722 1.369681 1.803220e-01 year70 0.87301292 0.27193671 3.210353 3.013619e-03 year75 0.55055610 0.27471177 2.004123 5.358252e-02 period 0.01340861 0.01307316 1.025659 3.127433e-01 ## Now not much going on in year-by-year period; drop "period" ## and RECODE year to LateYr = 1 of year 65 or 70, 0 otherwise. COEFFICIENTS FOR RE-FITTED LINEAR MODEL: Estimate Std. Error t value Pr(>|t|) (Intercept) -0.6858036 0.25950376 -2.642750 1.221708e-02 logserv 0.3108127 0.03453341 9.000347 1.237695e-10 grp1 0.8263505 0.27500534 3.004852 4.886704e-03 grp2 -0.6200908 0.20525145 -3.021128 4.683654e-03 LateYr 0.5641176 0.18548303 3.041344 4.442523e-03 PLOTS SHOWN IN CLASS: SEE "ShipPics.pdf" in Scripts Directory Note that the first Residuals picture and set of fitted lines refers to the linear model for logIncd verses logserv + grp1 + grp2 + LateYr, while the second refers to residuals and fitted lines done SEPARATELY for each of the (rather small) sets of observations in the 6 different grp x LateYr categories. > table(grp,LateYr) LateYr #### Unequal numbers of obs in different groups ! grp 0 1 0 8 8 1 4 4 2 8 8 ANOVA TABLES SEEM TO SHOW THAT INTERACTION TERMS BETWEEN logserv and LateYr or between logserv and grp Response: logIncd Df Sum Sq Mean Sq F value Pr(>F) logserv 1 45.109 45.109 141.5783 1.138e-13 *** grp 2 10.615 5.307 16.6575 9.062e-06 *** LateYr 1 3.111 3.111 9.7639 0.003629 ** logserv:LateYr 1 0.938 0.938 2.9452 0.095231 . Residuals 34 10.833 0.319 --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 > shiplm3C = lm(logIncd ~ logserv + grp + LateYr + grp*logserv) > anova(shiplm3C) Analysis of Variance Table Response: logIncd Df Sum Sq Mean Sq F value Pr(>F) logserv 1 45.109 45.109 148.380 9.312e-14 *** grp 2 10.615 5.307 17.458 6.730e-06 *** LateYr 1 3.111 3.111 10.233 0.003042 ** logserv:grp 2 1.739 0.869 2.860 0.071539 . Residuals 33 10.032 0.304 --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ========================================================= Now we explore models formally , using dummies and variable selection. ================================================== On input, "ShipData" cantains only columns: grp LateYr logserv logIncd data Ships (drop=id); infile "ASCdata/ShipData" MISSOVER; input id grp LateYr logserv logIncd ; if grp = 1 then do; grp1=1; grp2=0; end; else if grp=2 then do; grp1=0; grp2=1; end; else do; grp1=0; grp2=0; end; lsrvXLate = logserv * LateYr; lsrvgp1 = logserv*grp1; lsrvgp2 = logserv*grp2; LateGp1 = grp1*LateYr; LateGp2 = grp2*LateYr; lsrvsq = logserv**2; if logserv ne . ; /* NOTE the use of the MISSOVER option to keep the header line of different length from ruining data input */ options nocenter nodate linesize = 80; proc reg data=Ships; model logIncd = logserv grp1 grp2 LateYr lsrvXLate lsrvgp1 lsrvgp2 LateGp1 LateGp2 lsrvsq / SELECTION = forward SLentry = .10 ; run; Dependent Variable: logIncd Summary of Forward Selection Variable Number Partial Model Step Entered Vars In R-Square R-Squ C(p) F Val Pr> F 1 lsrvsq 1 0.8205 .8205 26.7050 173.69 <.0001 2 lsrvgp2 2 0.0490 .8695 11.5838 13.90 .0006 3 LateGp2 3 0.0329 .9024 2.0955 12.13 .0013 4 grp2 4 0.0077 .9100 1.4225 2.98 .0933 /* Re-did the fitting with CP selection crieterion */ Number in Model C(p) R-Squ AIC Variables in Model 3 0.5852 .9067 -64.1550 grp2 LateGp2 lsrvsq 4 1.4225 .9100 -63.6085 grp2 lsrvgp2 LateGp2 lsrvsq 4 1.5221 .9098 -63.4819 grp2 LateYr LateGp2 lsrvsq 4 1.9201 .9086 -62.9799 logserv grp2 LateGp2 lsrvsq 4 1.9703 .9085 -62.9170 LateYr lsrvgp2 LateGp2 lsrvsq 4 2.0382 .9083 -62.8322 lsrvXLate lsrvgp2 LateGp2 lsrvsq 3 2.0955 .9024 -62.3428 lsrvgp2 LateGp2 lsrvsq 5 2.2860 .9133 -63.0820 grp2 LateYr lsrvgp2 LateGp2 lsrvsq 4 2.3056 .9075 -62.4997 grp2 lsrvXLate LateGp2 lsrvsq ### We can see at the end that many models with almost identical Rsq are possible. All involve LateGp2 ,ie, the interactionof LateYr and Gp2, and also gp2 either individually or through its interactions with logserv. Here the log-transform on service does not seem to have been quite right, since the quadratic of this log also figures in most of the good models. ## At the end of the model-fitting, grp can be replaced as a variable by grp2, and both of the interactions lsrvgrp2 and LateGp2 do belong in the model. I would report the forward-selected model in this case as the "final" one.