HW 16   Stat 705   Fall 2015        Assigned Monday 11/23/15,   DUE Wednesday 12/2/15

Use (repeated, randomized) data splitting to assess the RMSE of the final model you
fitted to the Concrete data in HW13. [If you did not achieve a final model you want to
continue with, use the one I supplied in my solution to HW13.] Obtain the RMSE another
way by a parametric-bootstrap simulation. Pay some attention to the sampling variability
(simulation "errors") in each method in deciding how many replications you need for each.
Which of the two methods (cross-validation by data-splitting, or parametric bootstrap)
should better represent what you might see in a future data-set with a data-generating
mechanism very similar to the one that underlies the Concrete data? What assumptions are
you making for the validity of each method? Can you think of any other way to assess the
mean-squared prediction error of analysis by the specific model in this data setting?
What assumptions does it require for it to give a large-sample-valid estimate of RMSE?

##>>--------------------- Solution -------------------

## Let's consider a large set (10,000) of even data-splits for evaluating mod5C,
## my final model in the HW13 solution.

> mod5C$call
lm(formula = CmprStrgth ~ Cement + Age + Water + FineAggr + I(Age > 100) +
     I(Age^2) + SuperPlast + BLFurnSlag + I(BLFurnSlag^2) + FlyAsh + I(FlyAsh^2) +
     Cement:Water + Age:I(Age > 100) + I(Age > 100):I(Age^2) + Water:SuperPlast +
     Cement:Age + Age:FineAggr + Cement:SuperPlast + Age:SuperPlast +
     SuperPlast:BLFurnSlag + Cement:Water:SuperPlast + Cement:Age:SuperPlast,
     data = ConcrDat)

MSEconcr = array(0, c(1e4,4), dimnames=list(NULL,
       c("MSEtrain", "MSEtest", "ParBootMSE", "NPBoot")))
tmpmat = model.matrix(mod5C)
for(i in 1:1e4) {                                 ### < 2 min !
   subind = sample(1:1030, 515)
   othind = setdiff(1:1030, subind)
   mod5CTr = update(mod5C, data=ConcrDat[subind,])
   MSEconcr[i,1] = (492/515)*summary(mod5CTr)$sig^2
   MSEconcr[i,2] = mean( (tmpmat[othind,] %*% mod5CTr$coef -
                              ConcrDat$Cmpr[othind])^2 )
}

> summary(MSEconcr[,1])
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  33.66   40.60   42.22   42.23   43.89   50.95
> summary(MSEconcr[,2])
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  37.73   45.13   46.99   47.08   48.90   74.83

### So the MSE estimate is definitely higher (by 11.5%) on the test halves
### than on the training halves.
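
### To pay some attention to the sampling variability of the simulation itself, one can
### compute the Monte Carlo standard error sd/sqrt(R) of each mean MSE directly (a minimal
### sketch, assuming the first two columns of MSEconcr have been filled in by the loop
### above); the (1.96/100)*sd terms in the CI's below are exactly 1.96 times this quantity
### when R = 1e4.

R = nrow(MSEconcr)                               ## number of replications (1e4)
MCse = apply(MSEconcr[,1:2], 2, sd)/sqrt(R)      ## simulation SE of each mean MSE
round(rbind(meanMSE = colMeans(MSEconcr[,1:2]), MCse = MCse,
            relMCse = MCse/colMeans(MSEconcr[,1:2])), 4)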
> hist(sqrt(MSEconcr[,1]), nclass=50, prob=T, xlim=c(5.6,7.8), col="red",
       xlab="rMSE's for mod5C", main="Train & Test rMSE, Concrete mod5C")
> plot(hist(sqrt(MSEconcr[,2]), nclass=50, prob=T, plot=F), add=T, freq=F)

## The means and CI's based on R = 10,000 replications:

> round(rbind(Train = c(rMSE.Mean = sqrt(mean(MSEconcr[,1])),
          CI = sqrt(mean(MSEconcr[,1]) + (1.96/100)*c(-1,1)*sd(MSEconcr[,1]))),
              Test  = c(rMSE.Mean = sqrt(mean(MSEconcr[,2])),
          CI = sqrt(mean(MSEconcr[,2]) + (1.96/100)*c(-1,1)*sd(MSEconcr[,2])))), 4)
      rMSE.Mean    CI1    CI2
Train    6.4982 6.4946 6.5019
Test     6.8613 6.8573 6.8653

### Now for the parametric bootstrap based on the fixed ConcrDat predictors.

> fit5C = mod5C$fitted ;  sig5C = summary(mod5C)$sig*sqrt(1007/1030)
  auxrsd = array(rnorm(1e4*1030), c(1e4,1030))
  MSEparB0 = numeric(1e4)
  auxmat = tmpmat %*% solve(t(tmpmat) %*% tmpmat) %*% t(tmpmat)
  for (i in 1:1e4) {                              ## < 1 min
     CmpStr2 = fit5C + sig5C*auxrsd[i,]
     MSEparB0[i] = mean( (ConcrDat$Cmpr - auxmat %*% CmpStr2)^2 )
  }

> c(mean = sqrt(mean(MSEparB0)),
    CI = sqrt(mean(MSEparB0) + (1.96/100)*c(-1,1)*sd(MSEparB0)))
    mean      CI1      CI2
6.655944 6.655523 6.656365

### NOTE: this parametric bootstrap with fixed predictors is not quite comparable
### to MSE's based on re-simulated predictors, so we re-do the parametric bootstrap
### in MSEconcr array column 3 with nonparametric-bootstrap re-simulated X's.

indArr = array(sample(1:1030, 1e4*1030, replace=T), c(1e4,1030))
unix.time({ for(i in 1:1e4) {                     ## 71.3 secs.
   CmpStr2 = fit5C[indArr[i,]] + sig5C*auxrsd[i,]
   MSEconcr[i,3] = summary(update(mod5C, formula = CmpStr2 ~ .,
                      data=ConcrDat[indArr[i,],]))$sig^2 * (1007/1030)
} })

### Finally, we do a completely nonparametric (NP) bootstrap.

unix.time({ for(i in 1:1e4) {                     ## 65.3 secs.
   MSEconcr[i,4] = summary(update(mod5C,
                      data=ConcrDat[indArr[i,],]))$sig^2 * (1007/1030)
} })

> round(rbind(ParBoot = c(rMSE.Mean = sqrt(mean(MSEconcr[,3])),
          CI = sqrt(mean(MSEconcr[,3]) + (1.96/100)*c(-1,1)*sd(MSEconcr[,3]))),
              NPboot  = c(rMSE.Mean = sqrt(mean(MSEconcr[,4])),
          CI = sqrt(mean(MSEconcr[,4]) + (1.96/100)*c(-1,1)*sd(MSEconcr[,4])))), 4)
        rMSE.Mean    CI1    CI2
ParBoot    6.5095 6.5067 6.5123
NPboot     6.5028 6.4993 6.5062

### So all of these different methods give answers that are quite comparable.
### The cross-validated MSE's based on the Test halves are clearly larger than the
### MSE's obtained by the other methods, which would be a little puzzling except
### that the data-split estimators are based on training samples only half as
### large as those used by the bootstrap methods.

ASSUMPTIONS NEEDED FOR VALID MSE ESTIMATES:  the parametric bootstrap requires a
valid model, while the other methods do not.  But as noted above, the split-data
method is not 100% comparable, because its predictions are based on training
samples half the size of those used in the other methods!
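
### One way to check that last point would be to repeat the data-splitting with a larger
### training fraction and see whether the test rMSE moves down toward the bootstrap values.
### A sketch only (not run in the session above); the 2/3 training-set size of 687 rows is
### an arbitrary choice, and the loop simply mirrors the splitting loop at the start.

MSEtest23 = numeric(1e4)
for(i in 1:1e4) {
   subind = sample(1:1030, 687)                   ## ~2/3 of the 1030 rows for training
   othind = setdiff(1:1030, subind)
   mod5CTr = update(mod5C, data=ConcrDat[subind,])
   MSEtest23[i] = mean( (tmpmat[othind,] %*% mod5CTr$coef -
                             ConcrDat$CmprStrgth[othind])^2 )
}
sqrt(mean(MSEtest23))        ## compare with the Test and bootstrap rMSE's above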