Instructor: Eric Slud, Statistics program, Math. Dept.
Office: Mth 2314, email: slud@umd.edu
Office Hours:
W 1:30-3pm or by appointment
Primary Course References:
Efron, B. (1982), The Jackknife, the Bootstrap and Other Resampling Plans. SIAM, Philadelphia.
DasGupta, A. (2008), Asymptotic Theory of Statistics and Probability. Springer, New York.
Gentle, J. (2009), Computational Statistics. Springer, New York.
Hall, P. (1992), The Bootstrap and Edgeworth Expansion. Springer, New York.
Wasserman, L. (2006), All of Nonparametric Statistics. Springer, New York.
Please fill out the online Evaluation form for this course and instructor at http://CourseEvalUM.umd.edu. Thank you.
Overview: The topic of the course is statistical Resampling Methods, with emphasis on Bootstrap Methods. Resampling refers to statistical procedures based on the (often repeated) re-use of a dataset through randomly selected subsets of it. These methods are supported by large-sample probability limit theorems, which sometimes apply even in surprisingly small-to-moderate-sized datasets. The goal of such procedures is any or all of: bias reduction, variance estimation and Confidence Interval construction for statistical estimators, and calculation of null reference distributions for statistics used in hypothesis testing. In many Data Science applications, these techniques provide sensible approaches to the estimation of reference distributions under minimal modeling assumptions, with mathematically justified properties under broad qualitative assumptions on the data-generating mechanism. The course is suitable either for STAT students or for students in Applied Mathematics or Mathematical Data Science.
The course will be taught at STAT MA level, more or less, with a mixture of theory, software-oriented (primarily R) applications, and data analyses and simulation case-studies.
Special Features of the Course this Term. 50-minute lectures will be offered live (i.e., synchronously) over Zoom through ELMS and recorded. Each lecture will consist of one segment containing primarily theoretical material, and one providing computational data illustration in R. The theory pieces will be pdf slides with voiceover, sometimes with handwritten pieces via a document camera. The R pieces will be either slides with R code and pictures, or live demonstration using R or RStudio windows.
THEORETICAL MATERIAL. Both in homeworks and lectures, there will be theoretical material at the level of probability theory (STAT 410 or sometimes 600-601) related to the law of large numbers and the central limit theorem (sometimes "functional central limit theorems", which I will explain), along with the `delta method' (Taylor linearization). There will be some proofs, mostly at advanced-calculus level but some involving measure-theoretic ideas.
Prerequisite: Stat 410 or 600-601, Stat 420 or 700-701, plus some computing familiarity, preferably including some R.
Course requirements and Grading: There will be 6 graded homework sets (one every 2-2.5 weeks), plus a project/paper at the end. Homeworks will be split between theory problems and statistical computations and interpretations with data. The homework will be worth 65% of the grade, the term paper 35%.
Course Outline:
(1.) Monte Carlo Simulation versus Resampling/Bootstrap (1/25-27/2021)
(2.) Statistical Functionals: Bias Reduction via Jackknife & Bootstrap (1/27-29/2021)
(3.) Reference Distributions: Bootstrap Hypothesis Tests and Confidence Intervals (2/1-3/2021)
(4.) Bootstrap with Complicated Statistics (sometimes non-smooth) (2/3-5/2021)
(5.) Proof of Consistency of Bootstrap Distributional Estimate for the Mean
(6.) More on Statistical Functionals, "Influence Functions", and Bootstrap
(7.) Enhanced accuracy for the Bootstrap vs. asymptotic normal approximations
(8.) Double and Iterated Bootstrap for Higher-order Accuracy
(9.) Bootstrap in Regression Problems -- Bootstrapping Residuals
(10.) Some Settings where Bootstrap does not Work
(11.) Relation between Functional Central Limit (empirical-process) Theory and Bootstrap Limit Theory
(12.) Weighted and Multiplier Bootstraps
(13.) Parametric Bootstrap -- Theory and application in Mixed-Model and Empirical-Bayes Problems
(14.) Bootstrap for Sample-Survey Inference
(15.) Other Applications of Bootstrap (Survival Analysis, possibly others)
(16.) Bootstrap in Problems with Dependent Data (Time Series, Spatial) Idea of "Block Bootstrap"
(17.) Bootstrap Variants in Machine Learning -- Boosting and Bagging
COMPUTING. In lectures, the homework sets, and possibly also the course project, you will be doing computations on real and simulated datasets using a statistical computation platform or library. Any of several statistical-computing platforms are suitable for this: R, Matlab, Python, or others. If you are learning one of these packages for the first time, or investing some effort toward deepening your statistical computing skills, I recommend R, which is free and open-source and is the most flexible and useful for research statisticians. I will provide links to free online R tutorials, along with examples and scripts, and will offer some R help as needed.
Getting Started in R. Lots of R introductory materials can be found on my STAT 705 website from several years ago, in particular in these Notes. Another free and interactive site I recently came across for introducing R to social scientists is: https://campus.sagepub.com/blog/beginners-guide-to-r.
A set of R Scripts on many topics related to the course are available in this directory.
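If you are brand-new to R, the following minimal sketch (mine, not one of the course scripts) shows the core resampling operation of the course: a nonparametric bootstrap standard error and percentile confidence interval for a sample median.

## Minimal nonparametric bootstrap sketch (illustrative, not a course script):
## standard error and 95% percentile interval for the sample median.
set.seed(101)
x <- rexp(50, rate = 1)                     # simulated data sample, n = 50
B <- 300                                    # number of bootstrap replicates
meds <- replicate(B, median(sample(x, replace = TRUE)))
se.boot <- sd(meds)                         # bootstrap standard error
ci.boot <- quantile(meds, c(0.025, 0.975))  # percentile confidence interval
c(median = median(x), se = se.boot)
ci.boot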
Notes and Guidelines. Homeworks should be handed in as pdf's through ELMS "Assignments". Solutions will usually be posted, and a percentage deduction of the overall HW score will generally be made for late papers.
Assignment 1. (First 2 weeks of course, HW due Wed., Feb. 10). Reading: Gentle (2009) chapters on simulation and bootstrap, Efron (1982) Ch. 2 & 5, Wasserman (2006) Ch. 3, plus R scripts from Lectures. Then solve and hand in all of the following 6 problems. One of the problems concerns the data sample
0.49, 0.51, 0.48, 0.54, 0.50, 0.46, 0.44, 0.56, 0.45, 0.47,
and asks: compute the probability P(max(x*_j, 1 ≤ j ≤ 10) < 0.56).
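As a hedged illustration (assuming, as the notation x*_j suggests, that x*_1, ..., x*_10 denotes a nonparametric bootstrap sample drawn with replacement from these 10 values), the probability can be checked by simulation:

## Simulation check of P(max of a size-10 bootstrap sample < 0.56),
## assuming x*_1,...,x*_10 are iid draws with replacement from the data.
xdat <- c(0.49, 0.51, 0.48, 0.54, 0.50, 0.46, 0.44, 0.56, 0.45, 0.47)
set.seed(7)
mean(replicate(1e5, max(sample(xdat, 10, replace = TRUE)) < 0.56))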
Homework Assignment 2. (Second 2 weeks of course, HW due Sat., Feb. 27, 11:59pm). Reading: Wasserman (2006) Ch. 3, DasGupta (2008) Chapter 29, plus R scripts from Lectures. Then solve and hand in all of the following problems:
(1). (counts as 2 problems) Perform an adequate number of Monte Carlo iterations to distinguish the coverage performance of the sample-median estimator with bootstrap confidence interval for the following three data settings, all with sample size n=50 and number of bootstrap replicates B=300:
(a) f(x) = Gamma(2,1), (b) discrete Uniform on the set {1,...,35}, (c) f(x) = 3(1-2x)^2 on [0,1].
You may make your own choice of Bootstrap Confidence Interval type -- basic-pivotal, percentile, or one of the improved percentile intervals -- but use the same type of bootstrap CI throughout your simulation. Also, use R=1000 or more Monte Carlo iterations, but determine this number with the simulation error in mind, to make the simulation adequate to distinguish CI performance clearly.
(2). If X and Y are independent Gamma(n_1, b) and Gamma(n_2, b) random variables, use the Multivariate Central Limit Theorem (and multivariate delta method) to prove that n^{1/2} (X/(X+Y) - λ), with n = n_1 + n_2, converges in distribution to N(0, λ(1-λ)) as n_1, n_2 → ∞ in such a way that n_1/(n_1+n_2) - λ = o((n_1+n_2)^{-1/2}), where λ ∈ (0,1). The main hint is that Gamma(n_1, b) is the sum of n_1 iid Gamma(1, b) random variables, and Gamma(n_2, b) the sum of n_2 iid Gamma(1, b) r.v.'s.
(3). (Complicated Statistic arising from estimation after testing) Suppose that we observe a data sample X_1, ..., X_n and want to estimate the standard deviation σ of X_1. Assume these data are either N(0, σ^2) or Logistic (with density (1/b) e^{x/b}/(1+e^{x/b})^2, which has variance (π b)^2/3), and that we estimate σ with a test-then-estimate statistic T_n. Find bootstrap estimates for the variance of T_n, and perform a Monte Carlo simulation with sample sizes n = 40, 80 to see how accurate the bootstrap estimates are.
(4). Prove that if F_n and F are strictly increasing and continuous distribution functions such that F_n(0)=0 and F_n(1)=1 and F_n(x) → F(x) pointwise for all x, as n → ∞, then the Mallows-Wasserstein metric d_2(F_n, F) → 0. Do this by defining random variables Y ~ F, U = F(Y), Y_n = F_n^{-1}(U).
(5). Define a nonparametric bootstrap estimator T_n of the variance of the sample median for iid samples X_1,...,X_n. Also define a parametric-bootstrap estimator V_n of the same quantity for Expon(λ) data samples. Do the means of these estimators agree for large n when the data sample they are based on is actually Expon(λ)? What about their variances? What does theory say about the answers to these questions? Also give a computational answer to the question based on R=400 Monte Carlo iterations with B=300 bootstrap samples.
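The following is a rough sketch (with illustrative choices λ = 1, n = 50, B = 300) of the two variance estimators contrasted in problem (5), computed for a single simulated data sample; the full problem wraps such calculations in R = 400 Monte Carlo iterations.

## Sketch of problem (5)'s two estimators for one Expon(1) data sample:
## nonparametric vs parametric bootstrap variance of the sample median.
set.seed(11)
n <- 50; B <- 300
x <- rexp(n, rate = 1)
## nonparametric: resample the data themselves
Tn <- var(replicate(B, median(sample(x, replace = TRUE))))
## parametric: estimate lambda by 1/mean(x), then simulate from Expon(lambda-hat)
lam.hat <- 1/mean(x)
Vn <- var(replicate(B, median(rexp(n, rate = lam.hat))))
c(nonparametric = Tn, parametric = Vn)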
Homework Assignment 3. due Sun., March 21, 11:59pm.
This homework set is based on Lecture material related to Bootstrap consistency, Edgeworth expansions, and permutational and bootstrap hypothesis testing. Specific lecture references are given in bold-face for each problem.
(1) Do a Monte Carlo study to exhibit the relative sizes of the bootstrap-versus-true and normal-versus-true distribution differences, like that of the RscriptLec12.RLog script for Lecture 12 slides 2-5 (with something like N=4000, B=1000), for larger sample sizes like n=80 and 120. The goal is to confirm that the bootstrap difference is generally the smaller one for a summand distribution with skewness (say Gamma with shape parameter not too large, < 5), and that this does not happen for a symmetric summand distribution (say Beta(α,α) for α < 3). To show differences between distribution functions pictorially, it would be reasonable to use "lines" in R, which is basically linear interpolation between finely spaced ordered points along the x-axis.
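As a small illustration of the suggested plotting style (the curves shown are stand-ins, not the assignment's distribution differences):

## Plotting differences of distribution functions with lines():
## xs is a fine grid of sorted points, d1 and d2 two difference curves.
xs <- seq(-3, 3, length = 400)
d1 <- pnorm(xs) - pt(xs, df = 5)      # stand-in curves for illustration
d2 <- pnorm(xs) - pt(xs, df = 20)
plot(xs, d1, type = "n", xlab = "x", ylab = "difference of DFs")
lines(xs, d1, col = "red")            # linear interpolation between grid points
lines(xs, d2, col = "blue")
abline(h = 0, lty = 2)
legend("topright", legend = c("curve 1", "curve 2"),
       col = c("red", "blue"), lty = 1)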
(2) In Lecture 11, slides 5 and 6, leading up to the formal Theorem about Edgeworth expansions for standardized means, we introduced the Cramér condition that the characteristic function of a summand random variable should have complex modulus bounded away from 1 as its argument t goes to infinity. Hall (1992, Chapter 2) has additional details on this condition.
(a) Show that this condition does not hold for Binom(1,p) random variables.
(b) Show that this condition does not hold for discrete random variables which have a finite number of possible values, all of which are integers.
(c) Show that this condition does not hold [this assertion has been corrected from the earlier statement of this problem] for the discrete random variable with 3 possible values 0, 1, b (each with positive probability), for b not a rational number.
(3) Show numerically, by Monte Carlo simulations of large size, that the Edgeworth expansion (first-order, up to O(n^{-1/2}) terms) displayed in Lecture 11, slides 5 and 6, gives numerically more accurate distribution-function values for the standardized mean than the limiting normal distribution, for n = 30, 50, 70, based on lifetime distributions with skewness such as Weibull and log-normal.
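One possible sketch of such a comparison, using the standard one-term Edgeworth form Φ(x) - φ(x)·γ·(x^2-1)/(6√n) with γ the summand skewness -- verify this against the exact display on Lecture 11, slides 5-6 -- for a log-normal summand:

## Monte Carlo check that the one-term Edgeworth expansion
## Phi(x) - phi(x)*gam*(x^2-1)/(6*sqrt(n)) beats the plain normal limit.
set.seed(21)
n <- 30; N <- 1e5
s2 <- 0.5^2                              # log-normal log-scale variance
mu.ln <- exp(s2/2)                       # log-normal mean
sd.ln <- sqrt((exp(s2)-1)*exp(s2))       # log-normal standard deviation
gam <- (exp(s2)+2)*sqrt(exp(s2)-1)       # log-normal skewness
Z <- replicate(N, (mean(rlnorm(n, 0, sqrt(s2))) - mu.ln)/(sd.ln/sqrt(n)))
xs <- seq(-2, 2, by = 0.5)
true.F <- sapply(xs, function(x) mean(Z <= x))   # Monte Carlo "truth"
norm.F <- pnorm(xs)
edge.F <- pnorm(xs) - dnorm(xs)*gam*(xs^2 - 1)/(6*sqrt(n))
round(cbind(x = xs, MC = true.F, normal = norm.F, Edgeworth = edge.F), 4)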
(4) Carry out the Taylor series and Delta Method calculations to figure out exactly, up to first- and second-order (n^{-1/2} and n^{-1}) terms, the Edgeworth-expansion corrections to the Central Limit Theorem for a smooth transformation of the sample mean based on Expon(λ) data. The basic Edgeworth expansion is given in Lecture 11 on slides 5 and 6, and in this exercise you should use that expansion together with the Delta Method to obtain the analogous expansion (with different polynomials multiplying the normal density in the correction terms) for the transformed quantity.
(5) Do a Monte Carlo study (with a large number N of simulation iterations) of size and power for permutational and bootstrap tests (of nominal significance level α = 0.05) based on the (pooled-variance) t-test statistic and the Mann-Whitney statistic (introduced in Lecture 13, slide 8 and mentioned again in Lecture 15), for equality of two samples (sizes m = n = 30) in the nonparametric location-shift problem. Also do this for a bootstrap test based on a modified t-statistic obtained by studentizing the difference of sample means. Do your simulation with the normal distribution and the Logistic distribution. For the power calculation, choose a set of (at least 5) alternative values of the location shift close but not too close to 0, so that the powers are bounded away from 0.05 and 1.
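A minimal sketch of the permutation-test ingredient (two-sample pooled-variance t statistic, with an illustrative Logistic location shift); the bootstrap versions follow the same pattern with resampling in place of permutation:

## Two-sample permutation test based on the pooled-variance t statistic.
set.seed(31)
m <- 30; n <- 30
x <- rlogis(m); y <- rlogis(n) + 0.5     # location shift 0.5 as an example
tstat <- function(a, b) t.test(a, b, var.equal = TRUE)$statistic
obs <- tstat(x, y)
pooled <- c(x, y)
perm <- replicate(2000, { idx <- sample(m + n)
                          tstat(pooled[idx[1:m]], pooled[idx[-(1:m)]]) })
p.value <- mean(abs(perm) >= abs(obs))   # two-sided permutation p-value
p.value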
Homework Assignment 4. due Sun., April 18, 11:59pm. Five (5) Problems in all.
Read the material on Bootstrapping Regression in Sections 29.12-29.15 of the DasGupta (2008) Bootstrap Chapter, and do problems 29.20, 29.21, 29.23, 29.24 in that chapter, p. 495. Also solve and hand in the following additional problem:
(A) (2-pass linear regression) Suppose that data consist of (X_i, Y_i) with X_i = (1, Z_{i,1}, Z_{i,2}, ..., Z_{i,9}), where the Z_{i,k} are correlated multivariate normal random variables with mean 0 and variance 1. (The correlation matrix should not be the identity, or the problem is not meaningful. The essential correlation is between Z_{i,1} and the other Z_{i,k} variables, but it is OK to leave those other Z_{i,k} (k ≥ 2) uncorrelated with each other.) The Y_i variables are assumed to satisfy Y_i = β_0 + ∑_{k=1}^{9} β_k Z_{i,k} + ε_i, where ε_i has distribution function F_0(x/σ), F_0 an unknown distribution with mean 0 and variance 1, and σ > 0 an unknown scale parameter. The 2-pass estimator β̂_1 is defined by the following steps (a sketch in R appears after this problem):
Step 1. Find the least-squares estimator β̃ of β, and the corresponding estimator of σ defined by σ̃^2 = n^{-1} ∑_{i=1}^n (Y_i - β̃'X_i)^2.
Step 2. Find the set K_0 (possibly empty) of indices k ≥ 2 for which (β̃_k)^2 ≥ 1.96^2 σ̃^2 ((X'X)^{-1})_{k+1,k+1}.
Step 3. Replace the predictor vectors X_i by the vectors W_i of length |K_0|+2 given by W_i = (1, Z_{i,1}, (Z_{i,k}: k ∈ K_0)), and define β̂_1 as the estimated coefficient of Z_{i,1} in the least-squares estimate (W'W)^{-1} W'Y, where W is the n × (|K_0|+2) matrix with i-th row equal to W_i.
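A hedged sketch of one way to generate the correlated design and carry out Steps 1-3 (the equicorrelation value 0.3, the β values, and n = 100 are illustrative choices, not part of the problem statement):

## Sketch of data generation and the 2-pass estimator in problem (A).
library(MASS)                            # for mvrnorm
set.seed(41)
n <- 100
Sig <- diag(9)
Sig[1, 2:9] <- Sig[2:9, 1] <- 0.3        # Z1 correlated with the other Z_k
Z <- mvrnorm(n, mu = rep(0, 9), Sigma = Sig)
beta <- c(2, 1, rep(0.3, 8)); sigma <- 1
Y <- beta[1] + Z %*% beta[-1] + sigma*rnorm(n)
X <- cbind(1, Z)
## Step 1: least-squares estimates of beta and sigma
bt <- solve(crossprod(X), crossprod(X, Y))
s2t <- sum((Y - X %*% bt)^2)/n
## Step 2: indices k >= 2 whose coefficients pass the 1.96 threshold
XtXinv <- solve(crossprod(X))
K0 <- which(bt[3:10]^2 >= 1.96^2 * s2t * diag(XtXinv)[3:10]) + 1
## Step 3: refit using only the intercept, Z1, and the selected Z_k
W <- cbind(1, Z[, 1], Z[, K0, drop = FALSE])
beta1.hat <- solve(crossprod(W), crossprod(W, Y))[2]
beta1.hat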
Homework Assignment 5. due Sun., May 2, 11:59pm. Five (5) Problems in all.
This assignment is based entirely on a dataset, the "Boston Housing Data", available in the MASS R library as the data-frame "Boston". You are to look at these data from the vantage point of the linear homoscedastic regression model for the outcome "medv" in terms of all given predictor variables other than "nox" and "rad". "medv" is the median price (in units of $1000, in 1970) in a town or area near Boston. The assigned tasks are: (A) to do a goodness-of-fit analysis (of the adequacy of the linear regression model) using the wild-bootstrap method of Stute et al. (1998) covered in Lecture 23, slides 8-10; (B) to prove a fact about one of the statistics in that analysis; and (C) to estimate a specific mean-square prediction error via GBS bootstrap for the Boston Housing price data at the mean predictor value. The point values of the 3 parts are: 30 for (A), and 10 for each of (B) and (C).
(A) (30 points) The Boston Housing Data, which contain 506 records, can be accessed in R by
> library(MASS)
> data(Boston)
and the linear regression model under study in this assignment is the one (with intercept) fitted by the R statement
> model1 = lm(medv ~ . - nox - rad, data=Boston)
The task in this problem part is to do a test for Goodness of Fit of this linear model using the Wild Bootstrap method of Stute et al. (1998, JASA) covered in Lecture 23, slides 8-10. The steps are:
(1) Obtain coefficient estimates, fitted values, and residuals from the original data using the output list components $coef, $fitted, and $resid. Here the outcome Y is defined equal to the medv variable.
(2) Generate B batches of n=506 wild-bootstrap outcome vectors using the method given by Stute et al., with Y_i* = X_i'β̂ + ε̂_i V_i, where the V_i are iid variables independent of all the data, generated by the statistician with mean 0 and variance 1. (You could use double-exponential, standard normal, or any distribution you like that has all moments.)
(3) Define the random function R_n(x) = n^{-1/2} ∑_{i=1}^n (Y_i - X_i'β̂) 1{X_i ≤ x}, for x in the same value-space as the non-constant predictors, where X_i ≤ x means that the inequality holds coordinatewise and where β̂ is the ordinary least-squares estimate based on the original data. Define also the corresponding random function on each batch of wild-bootstrapped data, R_n*(x) = n^{-1/2} ∑_{i=1}^n (Y_i* - X_i'β̂*) 1{X_i ≤ x}, where β̂* is the ordinary least-squares estimator for the bootstrapped outcome data Y* regressed on the same (unchanged) predictors (plus intercept) from the original data.
(4) There are two different statistics and hypothesis tests you are asked to implement: the first based on the Kolmogorov-Smirnov-type statistic sup_x |R_n(x)|, with rejection threshold determined from the 0.95 quantile calculated from the bootstrap statistics sup_x |R_n*(x)| calculated from B bootstrap-data batches of size n. The second test you are asked to implement is based on the Cramér-von Mises-type statistic n^{-1} ∑_{j=1}^n R_n(X_j)^2, with rejection cutoff determined by the 0.95 quantile of the B bootstrap statistics n^{-1} ∑_{j=1}^n R_n*(X_j)^2 obtained from B independent wild-bootstrap batches of data.
(5) In implementing both tests, first find the set of B bootstrapped statistic values, next say whether the test rejects (for the Boston Housing data and the indicated regression model), and then find the p-value. A small R sketch of the wild-bootstrap loop is given after part (C) below.
(B) (10 points) The statistic R_n(x) is a function of the argument x ranging over the value space of the non-constant regression predictor variables. Prove that the sup over x of the |R_n(x)| function is achieved at one of the observed data points x = X_i, so that for the sup hypothesis-test statistic the R_n function needs to be evaluated only at the n points of the original data sample (which is the same as the set of predictor values for the bootstrapped data, since the X_i are left unchanged by the wild bootstrap on slide 9 of Lec. 23).
(C) (10 points) Use the GBS bootstrap to create an estimate of the Mean Square Prediction Error of the model at an X-vector exactly equal to the mean of the predictor variables, assuming the same homoscedastic linear regression model you analyzed above.
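A schematic sketch of the wild-bootstrap loop for the sup-type test of part (A); the form of R_n used here is the reconstruction given in step (3) above and should be checked against Lecture 23, slides 8-10. B is kept small purely for illustration.

## Schematic wild bootstrap for the sup-type goodness-of-fit statistic.
library(MASS)
data(Boston)
model1 <- lm(medv ~ . - nox - rad, data = Boston)
Xmat <- model.matrix(model1)             # predictors (with intercept)
res <- resid(model1); fit <- fitted(model1)
n <- nrow(Xmat)
## R_n evaluated at each data point X_j (coordinatewise indicators)
Rn.at.data <- function(e) {
  Z <- Xmat[, -1]                        # non-constant predictors
  sapply(1:n, function(j) {
    ind <- colSums(t(Z) <= Z[j, ]) == ncol(Z)   # X_i <= X_j coordinatewise
    sum(e[ind])/sqrt(n) })
}
T.obs <- max(abs(Rn.at.data(res)))
B <- 200                                 # small B for illustration only
T.boot <- replicate(B, {
  Ystar <- fit + res*rnorm(n)            # wild bootstrap, V_i ~ N(0,1)
  mstar <- lm(Ystar ~ Xmat - 1)          # refit on unchanged predictors
  max(abs(Rn.at.data(resid(mstar)))) })
c(T.obs = T.obs, cutoff = quantile(T.boot, 0.95),
  p.value = mean(T.boot >= T.obs))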
Homework Assignment 6. due Wed., May 12, 11:59pm.
In both of the following problems, do the requested computations and present the results in a few well-chosen numerical exhibits (pictures and/or tables). Describe in words what you have done and what the results mean, i.e., whether your results are what would have been expected from the theory presented in class. Hand in the R code you use as an appendix to your submitted paper, and indicate whether you did any checks to verify that your R code did exactly what it was supposed to. It would be helpful to comment your R code with text explaining what each code segment does.
(I) Consider the lognormally distributed dataset of size 40 generated by the R command
Xdat = exp(0.5 + rnorm(40)*0.8)
We have seen in our discussion of the double bootstrap in Lecture 27, slides 3-7, that the double bootstrap is supposed to increase the accuracy of any bias-correction or confidence-interval procedure by a factor of smaller polynomial order in n. Illustrate this for the statistical functional defined there, using a Monte Carlo study of N=500 datasets distributed the same as Xdat above:
(i) by comparing the bias reduction achieved in a single bootstrap with that achieved in a double bootstrap, and
(ii) by finding and comparing the coverage probability of a basic pivotal bootstrap confidence interval based on a single bootstrap with that achieved in a double bootstrap.
Recall that in slides 3 and 7 of Lecture 27, we stated the general equation being solved in the double bootstrap, with one choice of adjustment function for the Basic Pivotal Confidence Interval and another for the estimation of bias (i.e., bias reduction). In both double bootstraps, use the second-stage adjustment function as in Step 5 on Slide 5 of Lecture 27. Also, use the R code in RscriptLec27.RLog as a template.
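A schematic sketch of single versus double bootstrap bias correction, using log of the sample mean as a stand-in functional (the assignment's actual functional and adjustment-function details come from Lecture 27 and RscriptLec27.RLog):

## Schematic single vs double bootstrap bias correction for a functional,
## illustrated with theta = log of the mean, estimated by log(mean(X)).
set.seed(1)
X <- exp(0.5 + rnorm(40)*0.8)            # same form as Xdat in the assignment
thetahat <- log(mean(X))
B1 <- 500; B2 <- 100
t1 <- numeric(B1); bias2 <- numeric(B1)
for (b in 1:B1) {
  Xs <- sample(X, replace = TRUE)        # first-level resample
  t1[b] <- log(mean(Xs))
  t2 <- replicate(B2, log(mean(sample(Xs, replace = TRUE))))  # second level
  bias2[b] <- mean(t2) - t1[b]           # bias estimate within resample b
}
bias1 <- mean(t1) - thetahat             # single-bootstrap bias estimate
theta.single <- thetahat - bias1         # single-bootstrap correction
## double bootstrap adjusts the bias estimate itself (iterated correction):
theta.double <- thetahat - (2*bias1 - mean(bias2))
c(raw = thetahat, single = theta.single, double = theta.double)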
(II) Perform a parametric bootstrap with B=1000 to estimate the Mean Squared Prediction Error of a Beta-Binomial Regression, for each target parameter, in the following data setting:
set.seed(4713)
Xvec = runif(70)
nvec = sample(10:20, 70, replace=T)
mu = plogis(-1 + 0.5*Xvec)
Yvec = rbinom(70, nvec, rbeta(70, 5*mu, 5*(1-mu)))
The Beta-binomial regression model takes the form Y_i ~ Binom(n_i, p_i), where p_i ~ Beta(τ μ_i, τ (1-μ_i)) and logit(μ_i) = β_1 + β_2 X_i, for i = 1, ..., 70, with unknown parameters (τ, β_1, β_2).
The Beta-binomial model was presented on slide 4 of Lecture 29; the form of the Empirical Best Predictor was given on slide 6 of Lec. 29; and the form of the Parametric Bootstrap and associated MSPE estimator were given on slide 8 of Lecture 29B. You should use RscriptLec29.RLog as a template for your computations, augmented by the following R function for the Beta-binomial log-likelihood. Here the input parameters are:
in.th = c(log(tau), beta1, beta2)
in.nr = nvec
in.y = Yvec
in.xmat = cbind(1, Xvec)
logL.BBIN = function (in.th, in.nr, in.y, in.xmat) {
  t.tau = exp(in.th[1])                    # tau > 0 via log parameterization
  t.mu = plogis(c(in.xmat %*% in.th[-1]))  # mu_i = logit^{-1}(beta1 + beta2*X_i)
  t.a = t.mu * t.tau                       # Beta parameters (a_i, b_i)
  t.b = t.tau * (1 - t.mu)
  t1 = lgamma(in.nr+1) - lgamma(in.y+1) - lgamma(in.nr-in.y+1)  # log binomial coeff.
  t2 = lgamma(t.tau) - lgamma(t.a) - lgamma(t.b)                # -log Beta(a_i, b_i)
  t3 = lgamma(in.y+t.a) + lgamma(in.nr-in.y+t.b) - lgamma(in.nr+t.tau)
  sum(t1 + t2 + t3)                        # total Beta-binomial log-likelihood
}
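For example (with illustrative starting values), the log-likelihood can be maximized numerically by optim:

## Fit the Beta-binomial regression by maximizing logL.BBIN with optim.
neg.logL <- function(th) -logL.BBIN(th, nvec, Yvec, cbind(1, Xvec))
fit <- optim(c(log(5), -1, 0.5), neg.logL, method = "BFGS")
theta.hat <- fit$par                     # (log tau, beta1, beta2)
c(tau = exp(theta.hat[1]), beta1 = theta.hat[2], beta2 = theta.hat[3])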
FINAL PROJECT ASSIGNMENT, due Tuesday, May 18, 2021, 11:59pm (uploaded to ELMS as a pdf or MS Word document). As a final course project, you are to write a paper including at least 7-10 pages of narrative (and no more than 15), plus relevant code and graphical or tabular exhibits, on a statistical journal article or book-chapter related to the course, or else a data analysis or case-study [or simulation study] based on a dataset or data structure of your choosing.
(1) Pre-history of the Bootstrap, a 2003 Statistical Science paper by Peter Hall.
(2). A set of R Scripts on many topics related to the course are available in this directory.
(3). Several R packages related to Bootstrap are: list to be updated shortly, with links.
Additional Computing Resources. There are many publicly available datasets for practice data-analyses. Many of them are taken from journal articles and/or textbooks and documented or interpreted. A good place to start is Statlib. Here is another good source. Datasets needed in the course will either be posted to the course web-page or indicated by links provided here.
A good set of links to data sources from various organizations including Federal and international statistical agencies is at Washington Statistical Society links.
The UMCP Math Department home page.
The University of Maryland home page.
My home page.
© Eric V Slud, May 5, 2021.