Solution to HW Problem 22, about bootstrapping in the context of HW21.
=========================

*** The basis for bootstrapping is that we have 3 `statistics' (spelled
out below) for estimating P(loss > 165 | hard, tens) from a given
data-set (a data frame with columns loss, hard, tens).  Methods for
doing this could include:

 --- a direct logistic-regression estimator of coefficients within a
     model which gives this probability as a logistic distribution
     function value;
 --- a linear-regression-type estimator using I[Y_i > 165] as the
     response variables;
 --- a linear-regression estimator for the mean of Y_i given X_i, which
     you could then couple with a normal-distribution assumption on the
     residuals to get a conditional probability estimate;
 --- a linear-regression estimator for the mean of Y_i given X_i, which
     you could then couple with a density estimate on the residuals to
     get a conditional probability estimate;
 --- a kernel-type nonparametric regression estimate using I[Y_i > 165]
     as the response variables.

The statistics used in the solution below are:

 (a) from normal linear regression:

        1 - Phi( (165 - b0 - b1*hard - b2*tens)/sigmahat )

 (b) from logistic regression:

        plogis( c0 + c1*hard + c2*tens )

 (c) from a kernel method (by analogy with the NPR approach), where
     x = b0 + b1*hard + b2*tens is the fitted linear combination at the
     evaluation point and ker is a kernel density (in the code below,
     the logistic density with bandwidth 40):

        sum_i I[loss_i > 165] * ker((x - b0 - b1*hard_i - b2*tens_i)/40) /
            sum_i ker((x - b0 - b1*hard_i - b2*tens_i)/40)

These are the statistics we bootstrap to get confidence intervals.

>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
 But I will add some more indications, in changes to this solution
 document over the weekend, of how the results would have looked with
 some of the other suggested statistics above.        12/11/09
<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<

> Estim3 <- function(evalfram, bootfram) {
     ## (a) normal linear regression fitted on bootfram, evaluated
     ## at the (hard,tens) rows of evalfram
     bootlm <- lm(loss ~ hard + tens, data=bootfram)
     evallmfit <- bootlm$coef[1] +
         c(data.matrix(evalfram[,2:3]) %*% bootlm$coef[2:3])
     lmest <- 1 - pnorm((165 - evallmfit)/summary(bootlm)$sigma)
     ## (b) logistic regression on the indicators I[loss > 165]
     binvec <- as.numeric(bootfram$loss > 165)
     bootlgs <- glm(cbind(binvec, 1-binvec) ~ hard + tens,
                    data=bootfram, family=binomial)
     glmest <- plogis(bootlgs$coef[1] +
         c(data.matrix(evalfram[,2:3]) %*% bootlgs$coef[2:3]))
     ## (c) kernel (logistic, bandwidth 40) regression of the indicators
     ## on the fitted linear combination of (hard,tens)
     kerboot <- outer(evallmfit, bootlm$fit, function(x,y) dlogis(x,y,40))
     kerest <- c(kerboot %*% binvec)/c(kerboot %*% rep(1, nrow(bootfram)))
     list(lm.est = lmest, glm.est = glmest, ker.est = kerest)
  }

> Estim3(RubbAlt[1:5,], RubbAlt)
$lm.est:
[1] 0.99999998 0.85462905 0.50546810 0.19793889 0.04011242

$glm.est:
[1] 0.999999891 0.950345758 0.497125366 0.078551388 0.006529196

$ker.est:
[1] 0.9593606 0.6186340 0.4408652 0.3167944 0.2251193

### This completes the definition of the statistics.  Now we bootstrap,
### which means that the bootfram data will be generated by the
### different with-replacement (nonparametric bootstrap) resampling
### strategies, while evalfram will always be RubbAlt[1:5,].

> Boot2mat <- Boot1mat <- array(0, dim=c(4000,5,3))
> for (i in 1:4000) Boot1mat[i,,] <- matrix(unlist(Estim3(RubbAlt[1:5,],
       RubbAlt[sample(30,30, replace=T),])), ncol=3)

### This was the simplest bootstrap strategy: pure with-replacement
### sampling of triples (rows of the data frame).
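### (Added sketch, not part of the original solution: the 12/11/09 note
### above promised illustrations of the other suggested statistics.)
### One of them -- the linear-regression-type estimator with the
### indicators I[Y_i > 165] as responses -- could be coded in the same
### style as Estim3 and fed through exactly the same bootstrap loops.
### The function name EstimBin is hypothetical, and the truncation to
### [0,1] is one possible convention, needed because a linear fit to
### indicators can fall outside the unit interval.

EstimBin <- function(evalfram, bootfram) {
    ## regress the indicator I[loss > 165] linearly on (hard, tens)
    binvec <- as.numeric(bootfram$loss > 165)
    binlm <- lm(binvec ~ hard + tens, data=bootfram)
    ## evaluate the fitted plane at the (hard,tens) rows of evalfram
    est <- binlm$coef[1] + c(data.matrix(evalfram[,2:3]) %*% binlm$coef[2:3])
    pmin(pmax(est, 0), 1)     ## truncate to [0,1]
}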
### Next consider what is probably the most appealing method for
### bootstrapping a dataset which approximately satisfies a linear
### regression model with additive iid errors.  We begin by setting up
### the linear regression & residuals from the original data (the HW21
### fit rubberlm), then bootstrapping (hard,tens) with replacement and
### independently bootstrapping the residuals with replacement.

> resids <- rubberlm$resid
  yhats <- rubberlm$fit
> for (i in 1:4000) {
     inds1 <- sample(30,30, replace=T)
     inds2 <- sample(30,30, replace=T)
     Boot2mat[i,,] <- matrix(unlist(Estim3(RubbAlt[1:5,],
        cbind.data.frame(loss = yhats[inds1] + resids[inds2],
                         RubbAlt[inds1,2:4]))), ncol=3)
  }

### In each of these loops there were warnings (of exactly the same type
### as in the cross-validation steps) from the logistic-regression glm
### function.

> M1 <- round( apply(Boot1mat, 2:3, function(x) quantile(x,.025)), 4)
       [,1]   [,2]   [,3]
[1,] 1.0000 1.0000 0.8583
[2,] 0.6636 0.5107 0.4253
[3,] 0.2621 0.0000 0.2609
[4,] 0.0574 0.0000 0.1600
[5,] 0.0033 0.0000 0.0959
> M2 <- round( apply(Boot1mat, 2:3, function(x) quantile(x,.975)), 4)
       [,1]   [,2]   [,3]
[1,] 1.0000 1.0000 0.9868
[2,] 0.9680 1.0000 0.7747
[3,] 0.6986 1.0000 0.6166
[4,] 0.3266 0.5820 0.4934
[5,] 0.0898 0.0686 0.3942
> M3 <- round( apply(Boot2mat, 2:3, function(x) quantile(x,.025)), 4)
       [,1]   [,2]   [,3]
[1,] 1.0000 0.9999 0.8604
[2,] 0.6324 0.5483 0.3842
[3,] 0.2482 0.0000 0.2410
[4,] 0.0445 0.0000 0.1521
[5,] 0.0022 0.0000 0.0921
> M4 <- round( apply(Boot2mat, 2:3, function(x) quantile(x,.975)), 4)
       [,1]   [,2]   [,3]
[1,] 1.0000 1.0000 0.9882
[2,] 0.9836 1.0000 0.8174
[3,] 0.7670 1.0000 0.6655
[4,] 0.3935 0.6294 0.5291
[5,] 0.1183 0.0698 0.4147

### We still have to make confidence intervals out of these quantiles
### and the estimated quantities above (from the original dataset).
### Of course, instead of bootstrapping the residuals we could have
### taken the parametric approach of simulating them as normally
### distributed with mean 0 and variance summary(rubberlm)$sigma^2,
### but that is very similar and less appealing!

OK, for constructing CI's we focus on the function of the data which
results in the estimated probability P(loss > 165 | hard, tens).
Unfortunately, there is no nonparametric functional which gives a
consistent estimator, and this is not only a problem of the small
sample size (30) but also a problem of the type of (conditional!)
probability we are estimating.  The main issue is that in a
nonparametric context, only the observations precisely at the specific
(hard,tens) value help us estimate this conditional probability.

*** So I sent you all a long message indicating that the functionals T
*** we are estimating are those indicated above: the one based on normal
*** linear regression, the one based on logistic regression, and the
*** nonparametric kernel regression using the linear combination of the
*** hard and tens variables estimated from linear regression.

### We complete this analysis by generating an array of confidence
### intervals, as follows: 5 points by 3 estimators by 2 (Lower, Upper)
### CI endpoints by 2 bootstrap methods.
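### (Added remark, not in the original document:) the intervals built
### below are the "basic" (reflected) nonparametric bootstrap intervals:
### if That is the statistic evaluated on the original data and q(.025),
### q(.975) are its bootstrap quantiles, the interval is
###
###        [ 2*That - q(.975) ,  2*That - q(.025) ]
###
### i.e. the bootstrap quantiles reflected around the original estimate.
### For instance, for point 3 and the lin.reg statistic under Bootstr.a:

> 2*0.50546810 - c(0.6986, 0.2621)
[1] 0.3123362 0.7488362

### which, after rounding, are the Pt3 lin.reg entries 0.312 and 0.749
### in the Bootstr.a tables below.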
> Rubber[1:5,2:3]
  hard tens
1   45  162
2   55  233
3   61  232
4   66  231
5   71  231

> CIarr <- array(0, dim=c(5,3,2,2),
       dimnames=list(paste("Pt", 1:5, sep=""),
                     c("lin.reg","lgst.reg","ker.NPR"),
                     c("Lower","Upper"), c("Bootstr.a","Bootstr.c")))
> Ests <- matrix(unlist(Estim3(RubbAlt[1:5,], RubbAlt)), ncol=3)
> Ests
           [,1]        [,2]      [,3]
[1,] 0.99999998 0.999999891 0.9593606
[2,] 0.85462905 0.950345758 0.6186340
[3,] 0.50546810 0.497125366 0.4408652
[4,] 0.19793889 0.078551388 0.3167944
[5,] 0.04011242 0.006529196 0.2251193

> CIarr[,,1,1] <- 2*Ests - M2    ### This is the nonparametric
  CIarr[,,2,1] <- 2*Ests - M1    ### bootstrap form of CI's
  CIarr[,,1,2] <- 2*Ests - M4
  CIarr[,,2,2] <- 2*Ests - M3
> round(pmin(pmax(CIarr,0),1),3)
, , Lower, Bootstr.a

    lin.reg lgst.reg ker.NPR
Pt1   1.000    1.000   0.932
Pt2   0.741    0.901   0.463
Pt3   0.312    0.000   0.265
Pt4   0.069    0.000   0.140
Pt5   0.000    0.000   0.056

, , Upper, Bootstr.a

    lin.reg lgst.reg ker.NPR
Pt1   1.000    1.000   1.000
Pt2   1.000    1.000   0.812
Pt3   0.749    0.994   0.621
Pt4   0.338    0.157   0.474
Pt5   0.077    0.013   0.354

, , Lower, Bootstr.c

    lin.reg lgst.reg ker.NPR
Pt1   1.000    1.000   0.931
Pt2   0.726    0.901   0.420
Pt3   0.244    0.000   0.216
Pt4   0.002    0.000   0.104
Pt5   0.000    0.000   0.036

, , Upper, Bootstr.c

    lin.reg lgst.reg ker.NPR
Pt1   1.000    1.000   1.000
Pt2   1.000    1.000   0.853
Pt3   0.763    0.994   0.641
Pt4   0.351    0.157   0.481
Pt5   0.078    0.013   0.358
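### (Added comparison, not part of the original solution:) a simpler
### alternative to the reflected intervals above is the percentile
### method, which uses the bootstrap quantiles M1--M4 directly as the
### interval endpoints.  A minimal sketch, reusing the dimnames of
### CIarr (the name PctArr is hypothetical):

> PctArr <- array(0, dim=c(5,3,2,2), dimnames=dimnames(CIarr))
  PctArr[,,1,1] <- M1 ; PctArr[,,2,1] <- M2   ### percentile CI's, Bootstr.a
  PctArr[,,1,2] <- M3 ; PctArr[,,2,2] <- M4   ### percentile CI's, Bootstr.c
> round(PctArr, 3)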