Concrete Data Analysis R Log for HW13 in STAT 705 11/18/2015 =================================================================== ## There were no end-line characters in these online datasets ## word-processing the *csv file by deleting the header with quotes ## after setting "word-wrap" option leaves a *csv file with no header > ConcrDat = read.table("Concrete_Data.csv", header=F, sep=",") > dim(ConcrDat) [1] 1030 9 > names(ConcrDat) = c("Cement", "BLFurnSlag","FlyAsh","Water", "SuperPlast","CoarsAgg","FineAggr","Age","CmprStrgth") > summary(ConcrDat) Cement BLFurnSlag FlyAsh Water Min. :102.0 Min. : 0.0 Min. : 0.00 Min. :121.8 1st Qu.:192.4 1st Qu.: 0.0 1st Qu.: 0.00 1st Qu.:164.9 Median :272.9 Median : 22.0 Median : 0.00 Median :185.0 Mean :281.2 Mean : 73.9 Mean : 54.19 Mean :181.6 3rd Qu.:350.0 3rd Qu.:142.9 3rd Qu.:118.30 3rd Qu.:192.0 Max. :540.0 Max. :359.4 Max. :200.10 Max. :247.0 SuperPlast CoarsAgg FineAggr Age Min. : 0.000 Min. : 801.0 Min. :594.0 Min. : 1.00 1st Qu.: 0.000 1st Qu.: 932.0 1st Qu.:731.0 1st Qu.: 7.00 Median : 6.400 Median : 968.0 Median :779.5 Median : 28.00 Mean : 6.205 Mean : 972.9 Mean :773.6 Mean : 45.66 3rd Qu.:10.200 3rd Qu.:1029.4 3rd Qu.:824.0 3rd Qu.: 56.00 Max. :32.200 Max. :1145.0 Max. :992.6 Max. :365.00 CmprStrgth Min. : 2.33 1st Qu.:23.71 Median :34.45 Mean :35.82 3rd Qu.:46.13 Max. :82.60 > head(ConcrDat) Cement BLFurnSlag FlyAsh Water SuperPlast CoarsAgg FineAggr Age CmprStrgth 1 540.0 0.0 0 162 2.5 1040.0 676.0 28 79.99 2 540.0 0.0 0 162 2.5 1055.0 676.0 28 61.89 3 332.5 142.5 0 228 0.0 932.0 594.0 270 40.27 4 332.5 142.5 0 228 0.0 932.0 594.0 365 41.05 5 198.6 132.4 0 192 0.0 978.4 825.5 360 44.30 6 266.0 114.0 0 228 0.0 932.0 670.0 90 47.03 > mod1 = step(lm(CmprStrgth~1, data=ConcrDat), .~Cement+BLFurnSlag+FlyAsh + Water+SuperPlast+CoarsAgg+FineAggr+Age, k=4, direction="forward") > mod1$call lm(formula = CmprStrgth ~ Cement + SuperPlast + Age + BLFurnSlag + Water + FlyAsh, data = ConcrDat) ### NB initial corrected SSQ > 1029*var(ConcrDat$CmprStr) [1] 287175.2 > anova(mod1) Analysis of Variance Table Response: CmprStrgth Df Sum Sq Mean Sq F value Pr(>F) Cement 1 71173 71173 656.869 < 2.2e-16 *** SuperPlast 1 29676 29676 273.886 < 2.2e-16 *** Age 1 37499 37499 346.091 < 2.2e-16 *** BLFurnSlag 1 19922 19922 183.869 < 2.2e-16 *** Water 1 9524 9524 87.903 < 2.2e-16 *** FlyAsh 1 8537 8537 78.793 < 2.2e-16 *** Residuals 1023 110843 108 --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 > plot(ConcrDat$CoarsAgg, mod1$resid) ### fanning out pattern, could be plenty of structure left > hist(ConcrDat$Cement, nclass=35, prob=T) ## fills range 100:500 hist(ConcrDat$BLFurn, nclass=35, prob=T) > mean( ConcrDat$BLFurn < 13 ) ## 46.1% have values essentially 0 ### could first try to fit model on those values hist(ConcrDat$FlyAsh, nclass=35, prob=T) > mean( ConcrDat$FlyAsh < 10 ) ## 55.0% have values essentially 0 hist(ConcrDat$Water, nclass=35, prob=T) ## fills range reasonably > mean( ConcrDat$SuperPl < 1 ) ## 36.8% values ~ 0 hist(ConcrDat$CoarseAgg, nclass=35, prob=T) ## fills out range hist(ConcrDat$FineAggr, nclass=35, prob=T) ## fills out range ### Let's try model strictly for the case of 0 BLFurnSlag,FlyAsh, SuperPlast > sum(ConcrDat$BLFurn < 13 & ConcrDat$FlyAsh < 10 & ConcrDat$SuperPl < 1) ### 209 records like this > zerind = which(ConcrDat$BLFurn < 13 & ConcrDat$FlyAsh < 10 & ConcrDat$SuperPl < 1) > mod2 = step(update(mod1, formula = .~.-BLFurnSlag-FlyAsh-SuperPlast, data=ConcrDat[zerind,]), .~ (Cement+Water+CoarsAgg+FineAggr+Age)^3, direction="both", k=4) > mod2$call lm(formula = CmprStrgth ~ Cement + Age + Water + FineAggr + Cement:Water + Water:FineAggr, data = ConcrDat[zerind, ]) anova(mod2) Analysis of Variance Table Response: CmprStrgth Df Sum Sq Mean Sq F value Pr(>F) Cement 1 22220.4 22220.4 383.9366 < 2.2e-16 *** Age 1 7683.0 7683.0 132.7519 < 2.2e-16 *** Water 1 1529.5 1529.5 26.4274 6.442e-07 *** FineAggr 1 2.7 2.7 0.0465 0.82941 Cement:Water 1 1241.2 1241.2 21.4464 6.508e-06 *** Water:FineAggr 1 244.0 244.0 4.2167 0.04132 * Residuals 202 11690.8 57.9 > plot(ConcrDat$Age[zerind], mod2$resid) ### residuals show clear need to estimate higher when Age is small, ### and slightly lower as roughly linear function of Age when Age > 100 ### NB Age is in days > extractAIC(mod2) [1] 7.0000 855.0626 > mod2B = update(mod2, formula = .~. + I(Age>100)*Age) > extractAIC(mod2B) [1] 9.0000 722.3541 ### OK, this makes a big difference ! > plot(mod2B$fit,mod2B$resid) ## not too bad plot(ConcrDat$Age[zerind],mod2B$resid) > mod2C = update(mod2B, formula = .~. + I(Age^2) + I(Age>100)*I(Age^2)) > extractAIC(mod2C) [1] 11.0000 636.9007 > anova(mod2C) ### Now everything looks significant -- a MUCH better model plot(ConcrDat$Age[zerind], mod2C$resid) ### NB FinaAggr by itself does not seem to make much difference, but ### interaction term with water does !! ### Now try to add in other variable being > 0, once at a time mod2D = update(mod2C, data=ConcrDat) anova(mod2D) Analysis of Variance Table Response: CmprStrgth Df Sum Sq Mean Sq F value Pr(>F) Cement 1 71173 71173 971.9021 < 2.2e-16 *** Age 1 23993 23993 327.6441 < 2.2e-16 *** Water 1 34599 34599 472.4670 < 2.2e-16 *** FineAggr 1 11714 11714 159.9576 < 2.2e-16 *** I(Age > 100) 1 20156 20156 275.2436 < 2.2e-16 *** I(Age^2) 1 28397 28397 387.7821 < 2.2e-16 *** Cement:Water 1 5350 5350 73.0519 < 2.2e-16 *** Water:FineAggr 1 576 576 7.8619 0.005145 ** Age:I(Age > 100) 1 3137 3137 42.8380 9.404e-11 *** I(Age > 100):I(Age^2) 1 13459 13459 183.7953 < 2.2e-16 *** Residuals 1019 74622 73 ### Everything HIGHLY significant > plot(ConcrDat$SuperP, mod2D$resid) ### OK slightly negative slope cor(ConcrDat$SuperP, mod2D$resid) ## .099, but p-value .001 in lm mod3 = step(mod2D, .~. + SuperPlast*(Cement+Age+Water+FineAggr)^2, direction="forward", k=4) ### resid Sum Squares = 62576 ... Step: AIC=4306.02 CmprStrgth ~ Cement + Age + Water + FineAggr + I(Age > 100) + I(Age^2) + SuperPlast + Cement:Water + Water:FineAggr + Age:I(Age > 100) + I(Age > 100):I(Age^2) + Water:SuperPlast + Cement:Age + Age:FineAggr + Cement:SuperPlast + Age:SuperPlast + Cement:Water:SuperPlast + Cement:Age:SuperPlast anova(mod3) rbind(mod2D= extractAIC(mod2D, k=4), mod3 = extractAIC(mod3,k=4)) mod2D 11 4455.356 mod3 19 4306.022 plot(ConcrDat$Age, mod3$resid) ### generally OK plot(ConcrDat$SuperPlast, mod3$resid) ### " plot(ConcrDat$BLFurn, mod3$resid) ### OK now looks like a slight ### linear or quadratic tendency remains cor(ConcrDat$BLFurn, mod3$resid) ## 0.34 round(summary(lm(mod3$resid ~ BLFurnSlag + I(BLFurnSlag^2), data=ConcrDat))$coef[,1:3], 5) Estimate Std. Error t value (Intercept) -2.73305 0.31545 -8.66392 BLFurnSlag 0.06117 0.00763 8.02086 I(BLFurnSlag^2) -0.00014 0.00003 -4.19447 mod4 = update(mod3, formula=.~. + BLFurnSlag + I(BLFurnSlag^2)) ### makes a big difference --- resid SSQ down to 48147 now mod4B = step(mod4, .~. + BLFurnSlag*(Cement+Age+Water+SuperPlast)^2, direction="forward", k=4) ### added only the term BLFurnSlag:SuperPlast ### reducing AIC to 4042.26 and SSQ to 47847 ### Finally try to add in FlyAsh > plot(ConcrDat$FlyAsh, mod4B$resid) ### maybe quadratic ? cor(ConcrDat$FlyAsh, mod4B$resid) ## 0.114 round(summary(lm(mod4B$resid~FlyAsh+I(FlyAsh^2), data=ConcrDat))$coef,4) Estimate Std. Error t value Pr(>|t|) (Intercept) -0.8132 0.2822 -2.8812 0.0040 FlyAsh 0.0427 0.0123 3.4665 0.0005 I(FlyAsh^2) -0.0002 0.0001 -2.5659 0.0104 mod5 = update(mod4B, formula = .~.+FlyAsh+I(FlyAsh^2)) ### AIC with k=4 is 3975, and SSQ is 44505, df 24 mod5B = step(mod5, .~.+FlyAsh*(Cement+Age+Water+ BLFurnSlag+SuperPlast+FineAggr), k=4, direction="forward") ### improves AIC with k=4 to 3943, df=27, SSQ=42655. Stop with mod 5 or 5B > plot(mod5$fit, mod5$resid) plot(mod5B$fit, mod5B$resid) ### both very similar > plot(mod5$fit, ConcrDat$CmprStrgth) ### very nice prediction, multiple R^2 is > summary(mod5)$r.sq [1] 0.8450244 > summary(mod5)$adj.r.sq [1] 0.8414812 > plot(ConcrDat$FlyAsh, mod5$resid) plot(ConcrDat$BLFurnSlag, mod5$resid) plot(ConcrDat$Age, mod5$resid) plot(ConcrDat$Cement, mod5$resid) plot(ConcrDat$FineAggr, mod5$resid) plot(ConcrDat$Water, mod5$resid) plot(ConcrDat$SuperPlast, mod5$resid) > cor(ConcrDat$SuperPlast^2, mod5$resid) [1] -0.02676724 ### might be large enough to include as a term plot(ConcrDat$CoarsAgg, mod5$resid) > cor(ConcrDat$CoarsAgg, mod5$resid) [1] 0.04535715 ### also might be included > summary(mod5)$sigma [1] 6.651294 > length(mod5$coef) [1] 24 > sqrt((1030/1006)*mean(mod5$resid^2)) [1] 6.651294 ### Check normality of residuals and standardized residuals ... > hist(mod5$resid, nclass=40, prob=T) curve(dnorm(x,0, summary(mod5)$sigma), c(-20,30), add=T, col="blue") ### looks pretty normal but slightly too much mass in center and tails ### Standardized residuals: NB $cov.unscaled is (X^{tr}X)^{-1} > tmpmat = model.matrix(mod5) sum(abs(c(solve(t(tmpmat)%*%tmpmat)-summary(mod5)$cov.unscaled))) [1] 3.899469e-10 SE.resid = sqrt(1-c( ((tmpmat%*%summary(mod5)$cov.unscaled)*tmpmat) %*% rep(1,24) )) * summary(mod5)$sigma summary(mod5$resid/SE.resid)> summary(mod5$resid/SE.resid) Min. 1st Qu. Median Mean 3rd Qu. Max. -3.801000 -0.672200 -0.026800 -0.000225 0.570700 4.718000 > 1-pnorm(3.2) [1] 0.0006871379 plot(mod5$fit, mod5$resid/SE.resid, xlab="Fitted values", ylab="Std.ized Residuals", main= "Std.Resid's vs. Predictors, \n 24-term model for Concrete") abline(h=3.2, col="red") abline(h=-3.2, col="red") LgAbsRsd = numeric(5) modlab = c("1","2D","3","4B","5") for(i in 1:5) { aux = get(paste("mod",modlab[i],sep="")) tmpm = model.matrix(aux) SErsd = sqrt(1-c( ((tmpm%*%summary(aux)$cov.unsc)*tmpm) %*% rep(1,1030-aux$df) )) * summary(aux)$sigma LgAbsRsd[i] = sum(abs(aux$resid)/SErsd > 3.2) } > round(rbind(df = 1030-c(mod1=mod1$df, mod2D=mod2D$df, mod3=mod3$df, mod4B=mod4B$df, mod5=mod5$df), sig = c(summary(mod1)$sigma, summary(mod2D)$sigma, summary(mod3)$sigma, summary(mod4B)$sigma, summary(mod5)$sigma), Rsq = c(summary(mod1)$r.sq, summary(mod2D)$r.sq, summary(mod3)$r.sq, summary(mod4B)$r.sq, summary(mod5)$r.sq), AIC = c(extractAIC(mod1,4)[2], extractAIC(mod2D,4)[2], extractAIC(mod3,4)[2], extractAIC(mod4B,4)[2], extractAIC(mod5,4)[2]), LgRsd = LgAbsRsd ), 3) mod1 mod2D mod3 mod4B mod5 df 7.000 11.000 19.000 22.000 24.000 sig 10.409 8.557 7.867 6.892 6.651 Rsq 0.614 0.740 0.782 0.833 0.845 AIC 26694.800 17647.378 14651.920 10983.338 10144.289 LgRsd 1.000 4.000 4.000 4.000 3.000 plot(mod5$fit, mod5$resid/SE.resid, xlab="Fitted values", ylab="Std.ized Residuals", main= "Std.Resid's vs. Predictors, \n 24-term model for Concrete") abline(h=3.2, col="red") abline(h=-3.2, col="red") > tmppoly = locpoly(mod5$fit, mod5$resid/SE.resid, degree=3, bandwidth=3) lines(tmppoly$x,tmppoly$y, col="blue") > plot(ConcrDat$Cement, mod5$resid/SE.resid, xlab="Fitted values", ylab="Std.ized Residuals", main= "Std.Resid's vs. Predictors, \n 24-term model for Concrete") abline(h=3.2, col="red") abline(h=-3.2, col="red") tmppoly = locpoly(ConcrDat$Cement, mod5$resid/SE.resid, degree=3, bandwidth=10) lines(tmppoly$x,tmppoly$y, col="blue") tmppoly = locpoly(ConcrDat$Cement, mod5$resid/SE.resid, degree=3, bandwidth=20) lines(tmppoly$x,tmppoly$y, col="green") > anova(mod5) Analysis of Variance Table Response: CmprStrgth Df Sum Sq Mean Sq F value Pr(>F) Cement 1 71173 71173 1608.7925 < 2.2e-16 *** Age 1 23993 23993 542.3503 < 2.2e-16 *** Water 1 34599 34599 782.0761 < 2.2e-16 *** FineAggr 1 11714 11714 264.7784 < 2.2e-16 *** I(Age > 100) 1 20156 20156 455.6116 < 2.2e-16 *** I(Age^2) 1 28397 28397 641.8968 < 2.2e-16 *** SuperPlast 1 7518 7518 169.9430 < 2.2e-16 *** BLFurnSlag 1 18226 18226 411.9735 < 2.2e-16 *** I(BLFurnSlag^2) 1 1470 1470 33.2340 1.086e-08 *** FlyAsh 1 4954 4954 111.9750 < 2.2e-16 *** I(FlyAsh^2) 1 1226 1226 27.7033 1.729e-07 *** Cement:Water 1 39 39 0.8755 0.3496512 Water:FineAggr 1 185 185 4.1815 0.0411265 * Age:I(Age > 100) 1 2279 2279 51.5129 1.380e-12 *** I(Age > 100):I(Age^2) 1 11300 11300 255.4224 < 2.2e-16 *** Water:SuperPlast 1 315 315 7.1156 0.0077642 ** Cement:Age 1 631 631 14.2628 0.0001683 *** Age:FineAggr 1 442 442 9.9946 0.0016170 ** Cement:SuperPlast 1 333 333 7.5222 0.0062027 ** Age:SuperPlast 1 544 544 12.2954 0.0004741 *** SuperPlast:BLFurnSlag 1 226 226 5.1076 0.0240348 * Cement:Water:SuperPlast 1 2241 2241 50.6601 2.086e-12 *** Cement:Age:SuperPlast 1 710 710 16.0580 6.596e-05 *** Residuals 1006 44505 44 --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 > round(summary(mod5)$coef[,-4], 3) Estimate Std. Error t value (Intercept) -67.964 21.050 -3.229 Cement 0.263 0.029 9.199 Age 0.942 0.050 18.743 Water 0.262 0.122 2.150 FineAggr 0.045 0.025 1.773 I(Age > 100)TRUE 17.357 12.136 1.430 I(Age^2) -0.006 0.000 -19.990 SuperPlast 6.659 1.172 5.679 BLFurnSlag 0.088 0.010 8.752 I(BLFurnSlag^2) 0.000 0.000 -3.511 FlyAsh 0.103 0.015 6.886 I(FlyAsh^2) 0.000 0.000 -3.696 Cement:Water -0.001 0.000 -5.506 Water:FineAggr 0.000 0.000 -1.713 Age:I(Age > 100)TRUE -0.746 0.105 -7.075 I(Age > 100)TRUE:I(Age^2) 0.006 0.000 16.260 Water:SuperPlast -0.043 0.007 -6.323 Cement:Age 0.000 0.000 -2.235 Age:FineAggr 0.000 0.000 -2.335 Cement:SuperPlast -0.022 0.003 -7.085 Age:SuperPlast 0.018 0.004 4.955 SuperPlast:BLFurnSlag 0.003 0.001 4.533 Cement:Water:SuperPlast 0.000 0.000 7.260 Cement:Age:SuperPlast 0.000 0.000 -4.007 > mod5C = step(mod5, formmula = .~ (Cement+Age+Water+FineAggr+ I(Age>100) + I(Age^2) + SuperPlast+BLFurnSlag + I(BLFurnSlag^2)+FlyAsh+ I(FlyAsh^2))^3, k=6, direction="both") ### can drop Water:FineAggr , but no other changes suggested ... ## We have already seen that the mean square residual (essentially ## sigma^2) looks good for the "final" model. ## Let's see whether that model also does well when fitted on a ## training-set but evaluated on a test set: subind = sample(1:1030, 700) mod5Train = update(mod5, data=ConcrDat[subind,]) > summary(mod5Train)$sig [1] 6.725003 > sqrt(mean( (c(model.matrix(mod5)[setdiff(1:1030,subind),] %*% mod5Train$coef) - ConcrDat$CmprStrgth[setdiff(1:1030,subind)])^2)) [1] 6.807561 ### NB can also use "predict" on glm with new x's