R Log for solution of HW12 on Insurance dataset in MASS ======================================================= 11/9/2015 > library(MASS) > Insur = Insurance > dim(Insur) [1] 64 5 > names(Insur) [1] "District" "Group" "Age" "Holders" "Claims" > unlist(lapply(Insur,class)) District Group1 Group2 Age1 Age2 Holders Claims "factor" "ordered" "factor" "ordered" "factor" "integer" "integer" > round(data.matrix(aggregate(Claims, by=list(Distr=District), mean)),2) Distr x [1,] 1 86.31 [2,] 2 55.69 [3,] 3 34.56 [4,] 4 20.38 > round(data.matrix(aggregate(Claims/Holders, by=list(Distr=District), mean)),2) Distr x [1,] 1 0.16 [2,] 2 0.19 [3,] 3 0.18 [4,] 4 0.19 ## we have a 4 x 4 x 4 array of means for claims/Holders, as follows: > AvClaim = array(aggregate(Claims/Holders, by=list(Distr=District, Group=Group,Age=Age), mean)$x, c(4,4,4), dimnames=list(levels(District), levels(Group), levels(Age)) ) > round(AvClaim,4) , , <25 <1l 1-1.5l 1.5-2l >2l 1 0.1929 0.2218 0.1429 0.1667 2 0.2588 0.1678 0.2121 0.4444 3 0.1429 0.1887 0.3333 0.4286 4 0.1000 0.2258 0.2778 0.0000 , , 25-29 <1l 1-1.5l 1.5-2l >2l 1 0.1326 0.1567 0.1818 0.2535 2 0.1367 0.1629 0.2629 0.3125 3 0.1507 0.1548 0.2436 0.0690 4 0.1515 0.1235 0.1795 0.3750 , , 30-35 <1l 1-1.5l 1.5-2l >2l 1 0.0813 0.1279 0.2085 0.1919 2 0.1457 0.1169 0.1765 0.1667 3 0.1124 0.1542 0.1983 0.1860 4 0.1000 0.1803 0.2353 0.3200 , , >35 <1l 1-1.5l 1.5-2l >2l 1 0.0929 0.1117 0.1421 0.1704 2 0.0934 0.1187 0.1288 0.1646 3 0.1034 0.1144 0.1460 0.1510 4 0.1139 0.1409 0.1831 0.2895 ### Let's examine a saturated linear model for Claims/Holders Insur = cbind.data.frame(ClaimPerH = Claims/Holders, Insur) > names(Insur) [1] "ClaimPerH" "District" "Group" "Age" "Holders" [6] "Claims" detach() > mod1 = lm(ClaimPerH ~ District+Group+Age, data=Insur) > round( summary(mod1)$coef , 5) Estimate Std. Error t value Pr(>|t|) (Intercept) 0.16096 0.01755 9.17393 0.00000 District2 0.03088 0.02481 1.24466 0.21863 District3 0.01886 0.02481 0.76019 0.45044 District4 0.02629 0.02481 1.05970 0.29400 Group.L 0.07725 0.01755 4.40285 0.00005 Group.Q 0.00248 0.01755 0.14136 0.88811 Group.C -0.01084 0.01755 -0.61769 0.53938 Age.L -0.05680 0.01755 -3.23751 0.00206 Age.Q 0.00063 0.01755 0.03584 0.97154 Age.C -0.00285 0.01755 -0.16239 0.87161 > modForw = step(lm(ClaimPerH ~ 1, data=Insur), .~(District+Group+Claims+Holders)^3, direction="forward", k=4) ### NB ln(n) = 4.16 here. Start: AIC=-317.07 ClaimPerH ~ 1 Df Sum of Sq RSS AIC + Group 3 0.097459 0.32663 -321.78 + Holders 1 0.045375 0.37871 -320.31 + Claims 1 0.035186 0.38890 -318.61 0.42409 -317.07 + District 3 0.008887 0.41520 -306.42 Step: AIC=-321.78 ClaimPerH ~ Group Df Sum of Sq RSS AIC + Holders 1 0.0246732 0.30195 -322.81 + Claims 1 0.0222485 0.30438 -322.30 0.32663 -321.78 + District 3 0.0088872 0.31774 -311.55 Step: AIC=-322.81 ClaimPerH ~ Group + Holders Df Sum of Sq RSS AIC 0.30195 -322.81 + Claims 1 0.0015083 0.30045 -319.13 + Group:Holders 3 0.0203728 0.28158 -315.28 + District 3 0.0045133 0.29744 -311.77 > modB = step(modForw, .~(District+Group+Claims+Holders)^3, direction="both", k=2) ### no change in model ### Provisionally the model in terms of Group + Holders is ### best, without interaction > round(summary(modForw)$coef, 5) Estimate Std. Error t value Pr(>|t|) (Intercept) 0.19241 0.01058 18.17770 0.00000 Group.L 0.06954 0.01823 3.81528 0.00033 Group.Q -0.00849 0.01857 -0.45733 0.64911 Group.C -0.00374 0.01817 -0.20569 0.83774 Holders -0.00003 0.00002 -2.19567 0.03206 ### Suggests that only the second group (1-1.5l) indicator ### along with Holders is significant. > plot(mod1$fit, Insur$ClaimPerH) plot(mod1$fit, mod1$resid) #### the results are really glaring: the residuals look roughly #### balanced around 0, but they fan out dramatically (slightly ### asymmetrically) with larger variability at larger fitted values identify(mod1$fit, mod1$resid) [1] 29 45 46 61 ## The extreme residuals are observations with small numbers of Holders > Insur[.Last.value,] ClaimPerH District Group Age Holders Claims 29 0.44444444 2 >2l <25 9 4 45 0.42857143 3 >2l <25 7 3 46 0.06896552 3 >2l 25-29 29 2 61 0.00000000 4 >2l <25 3 0 ## NB there is a systematic average decrease in Holders by district, so these smallest numbers of holders might deserve separate handling. Also, interaction terms were disallowed because they contain so many parameters. So now try direct coding, using numeric ordered levels. attach(Insurance) Insur2 = cbind.data.frame(ClaimsPerH=Claims/Holders, Insurance[4], lapply(Insurance[1:3],as.numeric)) > mod2 = lm(ClaimsPerH ~ I(Group>2) + Holders , data=Insur2) extractAIC(mod2, k=2) [1] 3.0000 -334.0707 extractAIC(update(mod2,.~.+I(Group>2)*Holders)) [1] 4.0000 -335.6128 ### OK, so a second, possibly better model is the one with these terms and interaction > mod3 = update(mod2,.~.+I(Group>2)*Holders) plot(mod3$fitted, Insur2$ClaimsPerH) ### leaves many of the problems with residuals, but the spread ### of residuals is definitely greater for larger fitted numbers > plot(mod4$fit, mod4$resid) > extractAIC(mod4, k=2) [1] 5.0000 -337.0349 > points(mod4$fit[Insur2$Holders<30], mod4$resid[Insur2$Holders<30], pch=18) ### solid diamonds for the Holders<30 cases: ### A look at this picture shows very nice patternless residuals except for ### the records with very small numbers of Holders which have an unavoidable ### and unpredictable spread. The largest and smallest ClaimsPerH are in the ### Age=1 and Holders < 30 category. > Insur2[Insur2$Holders<30 & Insur2$Group==4,] ClaimsPerH Holders District Group Age 13 0.16666667 24 1 4 1 29 0.44444444 9 2 4 1 45 0.42857143 7 3 4 1 46 0.06896552 29 3 4 2 61 0.00000000 3 4 4 1 62 0.37500000 16 4 4 2 63 0.32000000 25 4 4 3 > summary(Insur2$ClaimsPerH) Min. 1st Qu. Median Mean 3rd Qu. Max. 0.0000 0.1286 0.1638 0.1800 0.2094 0.4444