HW 16   Stat 705   Fall 2015        Assigned Monday 11/23/15,   DUE Wednesday 12/2/15

Use (repeated, randomized) data splitting to assess the RMSE of the final model you
fitted to the Concrete data in HW13. [If you did not achieve a final model you want to
continue with, use the one I supplied in my solution to HW13.] Obtain the RMSE another
way by a parametric-bootstrap simulation. Pay some attention to the sampling variability
(simulation "errors") in each method in deciding how many replications you need for each.
Which of the two methods (cross-validation by data-splitting, or parametric bootstrap)
should better represent what you might see in a future data-set with a data-generating
mechanism very similar to the one that underlies the Concrete data? What assumptions are
you making for the validity of each method? Can you think of any other way to assess the
mean-squared prediction error of analysis by the specific model in this data setting?
What assumptions does it require for it to give a large-sample-valid estimate of RMSE?

##>>--------------------- Solution -------------------

## Let's consider a large set (10,000) of even data-splits for evaluating mod5C,
## my final model in the HW13 solution.

> mod5C$call
lm(formula = CmprStrgth ~ Cement + Age + Water + FineAggr + I(Age > 100) +
     I(Age^2) + SuperPlast + BLFurnSlag + I(BLFurnSlag^2) + FlyAsh + I(FlyAsh^2) +
     Cement:Water + Age:I(Age > 100) + I(Age > 100):I(Age^2) + Water:SuperPlast +
     Cement:Age + Age:FineAggr + Cement:SuperPlast + Age:SuperPlast +
     SuperPlast:BLFurnSlag + Cement:Water:SuperPlast + Cement:Age:SuperPlast,
     data = ConcrDat)

MSEconcr = array(0, c(1e4,4), dimnames=list(NULL,
       c("MSEtrain", "MSEtest", "ParBootMSE", "NPBoot")))
tmpmat = model.matrix(mod5C)
for(i in 1:1e4) {                                 ### < 2 min !
   subind = sample(1:1030, 515)
   othind = setdiff(1:1030, subind)
   mod5CTr = update(mod5C, data=ConcrDat[subind,])
   MSEconcr[i,1] = (492/515)*summary(mod5CTr)$sig^2
   MSEconcr[i,2] = mean( (tmpmat[othind,] %*% mod5CTr$coef -
                              ConcrDat$Cmpr[othind])^2 )
}

> summary(MSEconcr[,1])
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  33.66   40.60   42.22   42.23   43.89   50.95
> summary(MSEconcr[,2])
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  37.73   45.13   46.99   47.08   48.90   74.83

### So the MSE estimate is definitely higher (by 11.5%) on the test halves
### than on the training halves.
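
### To pay some attention to the sampling variability of the simulation itself, one can
### compute the Monte Carlo standard error sd/sqrt(R) of each mean MSE directly (a minimal
### sketch, assuming the first two columns of MSEconcr have been filled in by the loop
### above); the (1.96/100)*sd terms in the CI's below are exactly 1.96 times this quantity
### when R = 1e4.

R = nrow(MSEconcr)                               ## number of replications (1e4)
MCse = apply(MSEconcr[,1:2], 2, sd)/sqrt(R)      ## simulation SE of each mean MSE
round(rbind(meanMSE = colMeans(MSEconcr[,1:2]), MCse = MCse,
            relMCse = MCse/colMeans(MSEconcr[,1:2])), 4)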
> hist(sqrt(MSEconcr[,1]), nclass=50, prob=T, xlim=c(5.6,7.8), col="red",
       xlab="rMSE's for mod5C", main="Train & Test rMSE, Concrete mod5C")
> plot(hist(sqrt(MSEconcr[,2]), nclass=50, prob=T, plot=F), add=T, freq=F)

## The means and CI's based on R = 10,000 replications:

> round(rbind(Train = c(rMSE.Mean = sqrt(mean(MSEconcr[,1])),
          CI = sqrt(mean(MSEconcr[,1]) + (1.96/100)*c(-1,1)*sd(MSEconcr[,1]))),
              Test  = c(rMSE.Mean = sqrt(mean(MSEconcr[,2])),
          CI = sqrt(mean(MSEconcr[,2]) + (1.96/100)*c(-1,1)*sd(MSEconcr[,2])))), 4)
      rMSE.Mean    CI1    CI2
Train    6.4982 6.4946 6.5019
Test     6.8613 6.8573 6.8653

### Now for the parametric bootstrap based on the fixed ConcrDat predictors.

> fit5C = mod5C$fitted ;  sig5C = summary(mod5C)$sig*sqrt(1007/1030)
  auxrsd = array(rnorm(1e4*1030), c(1e4,1030))
  MSEparB0 = numeric(1e4)
  auxmat = tmpmat %*% solve(t(tmpmat) %*% tmpmat) %*% t(tmpmat)
  for (i in 1:1e4) {                              ## < 1 min
     CmpStr2 = fit5C + sig5C*auxrsd[i,]
     MSEparB0[i] = mean( (ConcrDat$Cmpr - auxmat %*% CmpStr2)^2 )
  }

> c(mean = sqrt(mean(MSEparB0)),
    CI = sqrt(mean(MSEparB0) + (1.96/100)*c(-1,1)*sd(MSEparB0)))
    mean      CI1      CI2
6.655944 6.655523 6.656365

### NOTE: this parametric bootstrap with fixed predictors is not quite comparable
### to MSE's based on re-simulated predictors, so we re-do the parametric bootstrap
### in MSEconcr array column 3 with nonparametric-bootstrap re-simulated X's.

indArr = array(sample(1:1030, 1e4*1030, replace=T), c(1e4,1030))
unix.time({ for(i in 1:1e4) {                     ## 71.3 secs.
   CmpStr2 = fit5C[indArr[i,]] + sig5C*auxrsd[i,]
   MSEconcr[i,3] = summary(update(mod5C, formula = CmpStr2 ~ .,
                      data=ConcrDat[indArr[i,],]))$sig^2 * (1007/1030)
} })

### Finally, we do a completely nonparametric (NP) bootstrap.

unix.time({ for(i in 1:1e4) {                     ## 65.3 secs.
   MSEconcr[i,4] = summary(update(mod5C,
                      data=ConcrDat[indArr[i,],]))$sig^2 * (1007/1030)
} })

> round(rbind(ParBoot = c(rMSE.Mean = sqrt(mean(MSEconcr[,3])),
          CI = sqrt(mean(MSEconcr[,3]) + (1.96/100)*c(-1,1)*sd(MSEconcr[,3]))),
              NPboot  = c(rMSE.Mean = sqrt(mean(MSEconcr[,4])),
          CI = sqrt(mean(MSEconcr[,4]) + (1.96/100)*c(-1,1)*sd(MSEconcr[,4])))), 4)
        rMSE.Mean    CI1    CI2
ParBoot    6.5095 6.5067 6.5123
NPboot     6.5028 6.4993 6.5062

### So all of these different methods give answers that are quite comparable.
### The cross-validated MSE's based on the Test halves are clearly larger than the
### MSE's obtained by the other methods, which would be a little puzzling except
### that the data-split estimators are based on training samples only half as
### large as those used by the bootstrap methods.

ASSUMPTIONS NEEDED FOR VALID MSE ESTIMATES:  the parametric bootstrap requires a
valid model, while the other methods do not.  But as noted above, the split-data
method is not 100% comparable, because its predictions are based on training
samples half the size of those used in the other methods!
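
### One way to check that last point would be to repeat the data-splitting with a larger
### training fraction and see whether the test rMSE moves down toward the bootstrap values.
### A sketch only (not run in the session above); the 2/3 training-set size of 687 rows is
### an arbitrary choice, and the loop simply mirrors the splitting loop at the start.

MSEtest23 = numeric(1e4)
for(i in 1:1e4) {
   subind = sample(1:1030, 687)                   ## ~2/3 of the 1030 rows for training
   othind = setdiff(1:1030, subind)
   mod5CTr = update(mod5C, data=ConcrDat[subind,])
   MSEtest23[i] = mean( (tmpmat[othind,] %*% mod5CTr$coef -
                             ConcrDat$CmprStrgth[othind])^2 )
}
sqrt(mean(MSEtest23))        ## compare with the Test and bootstrap rMSE's above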