Instructor: Eric Slud, Statistics program, Math. Dept.
Office: Mth 2314, email: slud@umd.edu
Office Hours:
W 1:30-3pm or by appointment
Primary Course References:
Efron, B. (1982), The Jackknife, the Bootstrap and Other Resampling Plans. SIAM, Philadelphia.
DasGupta, A. (2008), Asymptotic Theory of Statistics and Probability. Springer, New York.
Gentle, J. (2009), Computational Statistics. Springer, New York.
Hall, P. (1992), The Bootstrap and Edgeworth Expansion. Springer, New York.
Wasserman, L. (2006), All of Nonparametric Statistics. Springer, New York.
Please fill out the online Evaluation form for this course and instructor at http://CourseEvalUM.umd.edu. Thank you.
Overview: The topic of the course is statistical Resampling Methods, with emphasis on Bootstrap Methods. Resampling refers to statistical procedures based on the (often repeated) re-use of a dataset through randomly selected subsets of it. These methods are supported by large-sample probability limit theorems, which sometimes apply even in surprisingly small-to-moderate-sized datasets. The goal of such procedures is any or all of: bias reduction, variance estimation and Confidence Interval construction for statistical estimators, and calculation of null reference distributions for statistics used in hypothesis testing. In many Data Science applications, these techniques provide sensible approaches to the estimation of reference distributions under minimal modeling assumptions, with mathematically justified properties under broad qualitative assumptions on the data-generating mechanism. The course is suitable either for STAT students or for students in Applied Mathematics or Mathematical Data Science.
The course will be taught at STAT MA level, more or less, with a mixture of theory, software-oriented (primarily R) applications, and data analyses and simulation case-studies.
Special Features of the Course this Term. 50-minute lectures will be offered live (i.e., synchronously) over Zoom through ELMS and recorded. Each lecture will consist of one segment containing primarily theoretical material, and one providing computational data illustration in R. The theory pieces will be pdf slides with voiceover, sometimes with handwritten pieces via a document camera. The R pieces will be either slides with R code and pictures, or live demonstration using R or RStudio windows.
THEORETICAL MATERIAL. Both in homeworks and lectures, there will be theoretical material at the level of probability theory (STAT 410 or sometimes 600-601) related to the law of large numbers and the central limit theorem (sometimes "functional central limit theorems", which I will explain), along with the `delta method' (Taylor linearization). There will be some proofs, mostly at advanced-calculus level but some involving measure-theoretic ideas.
Prerequisite: Stat 410 or 600-601, Stat 420 or 700-701, plus some computing familiarity, preferably including some R.
Course requirements and Grading: There will be 6 graded homework sets (one every 2-2.5 weeks), plus a project/paper at the end. Homeworks will be split between theory problems and statistical computations and interpretations with data. The homework will be worth 65% of the grade, the term paper 35%.
Course Outline:
(1.) Monte Carlo Simulation versus Resampling/Bootstrap (1/25-27/2021)
(2.) Statistical Functionals: Bias Reduction via Jackknife & Bootstrap (1/27-29/2021)
(3.) Reference Distributions: Bootstrap Hypothesis Tests and Confidence Intervals (2/1-3/2021)
(4.) Bootstrap with Complicated Statistics (sometimes non-smooth) (2/3-5/2021)
(5.) Proof of Consistency of Bootstrap Distributional Estimate for the Mean
(6.) More on Statistical Functionals, "Influence Functions", and Bootstrap
(7.) Enhanced accuracy for the Bootstrap vs. asymptotic normal approximations
(8.) Double and Iterated Bootstrap for Higher-order Accuracy
(9.) Bootstrap in Regression Problems -- Bootstrapping Residuals
(10.) Some Settings where Bootstrap does not Work
(11.) Relation between Functional Central Limit (empirical-process) Theory and Bootstrap Limit Theory
(12.) Weighted and Multiplier Bootstraps
(13.) Parametric Bootstrap -- Theory and application in Mixed-Model and Empirical-Bayes Problems
(14.) Bootstrap for Sample-Survey Inference
(15.) Other Applications of Bootstrap (Survival Analysis, possibly others)
(16.) Bootstrap in Problems with Dependent Data (Time Series, Spatial) Idea of "Block Bootstrap"
(17.) Bootstrap Variants in Machine Learning -- Boosting and Bagging
COMPUTING. In lectures, the homework sets, and possibly also the course project, you will be doing computations on real and simulated datasets using a statistical computation platform or library. Any of several statistical-computing platforms are suitable for this: R, Matlab, Python, or others. If you are learning one of these packages for the first time, or investing some effort toward deepening your statistical computing skills, I recommend R, which is free and open-source and is the most flexible and useful for research statisticians. I will provide links to free online R tutorials, along with examples and scripts, and will offer some R help as needed.
Getting Started in R. Lots of R introductory materials can be found on my STAT 705 website from several years ago, in particular in these Notes. Another free and interactive site I recently came across for introducing R to social scientists is: https://campus.sagepub.com/blog/beginners-guide-to-r.
A set of R Scripts on many topics related to the course are available in this directory.
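If you are brand-new to R, the following minimal sketch (mine, not one of the course scripts) shows the core resampling operation of the course: a nonparametric bootstrap standard error and percentile confidence interval for a sample median.

## Minimal nonparametric bootstrap sketch (illustrative, not a course script):
## standard error and 95% percentile interval for the sample median.
set.seed(101)
x <- rexp(50, rate = 1)                     # simulated data sample, n = 50
B <- 300                                    # number of bootstrap replicates
meds <- replicate(B, median(sample(x, replace = TRUE)))
se.boot <- sd(meds)                         # bootstrap standard error
ci.boot <- quantile(meds, c(0.025, 0.975))  # percentile confidence interval
c(median = median(x), se = se.boot)
ci.boot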
Notes and Guidelines. Homeworks should be handed in as pdf's through ELMS "Assignments". Solutions will usually be posted, and a percentage deduction of the overall HW score will generally be made for late papers.
Assignment 1. (First 2 weeks of course, HW due Wed., Feb. 10). Reading: Gentle (2009) chapters on simulation and bootstrap, Efron (1982) Ch. 2 & 5, Wasserman (2006) Ch. 3, plus R scripts from Lectures. Then solve and hand in all of the following 6 problems. One of the problems concerns the data sample
0.49, 0.51, 0.48, 0.54, 0.50, 0.46, 0.44, 0.56, 0.45, 0.47,
and asks: compute the probability P(max(x*_j, 1 ≤ j ≤ 10) < 0.56).
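As a hedged illustration (assuming, as the notation x*_j suggests, that x*_1, ..., x*_10 denotes a nonparametric bootstrap sample drawn with replacement from these 10 values), the probability can be checked by simulation:

## Simulation check of P(max of a size-10 bootstrap sample < 0.56),
## assuming x*_1,...,x*_10 are iid draws with replacement from the data.
xdat <- c(0.49, 0.51, 0.48, 0.54, 0.50, 0.46, 0.44, 0.56, 0.45, 0.47)
set.seed(7)
mean(replicate(1e5, max(sample(xdat, 10, replace = TRUE)) < 0.56))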
Homework Assignment 2. (Second 2 weeks of course, HW due Sat., Feb. 27, 11:59pm). Reading: Wasserman (2006) Ch. 3, DasGupta (2008) Chapter 29, plus R scripts from Lectures. Then solve and hand in all of the following problems:
(1). (counts as 2 problems) Perform an adequate number of Monte Carlo iterations to distinguish the coverage performance of the sample-median estimator with bootstrap confidence interval for the following three data settings, all with sample size n=50 and number of bootstrap replicates B=300:
(a) f(x) = Gamma(2,1), (b) discrete Uniform on the set {1,...,35}, (c) f(x) = 3(1-2x)^2 on [0,1].
You may make your own choice of Bootstrap Confidence Interval type -- basic-pivotal, percentile, or one of the improved percentile intervals -- but use the same type of bootstrap CI throughout your simulation. Also, use R=1000 or more Monte Carlo iterations, but determine this number with the simulation error in mind, to make the simulation adequate to distinguish CI performance clearly.
(2). If X and Y are independent Gamma(n_1, b) and Gamma(n_2, b) random variables, use the Multivariate Central Limit Theorem (and multivariate delta method) to prove that n^{1/2} (X/(X+Y) - λ), with n = n_1 + n_2, converges in distribution to N(0, λ(1-λ)) as n_1, n_2 → ∞ in such a way that n_1/(n_1+n_2) - λ = o((n_1+n_2)^{-1/2}), where λ ∈ (0,1). The main hint is that Gamma(n_1, b) is the sum of n_1 iid Gamma(1, b) random variables, and Gamma(n_2, b) the sum of n_2 iid Gamma(1, b) r.v.'s.
(3). (Complicated Statistic arising from estimation after testing) Suppose that we observe a data sample X_1, ..., X_n and want to estimate the standard deviation σ of X_1. Assume these data are either N(0, σ^2) or Logistic (with density (1/b) e^{x/b}/(1+e^{x/b})^2, which has variance (π b)^2/3), and that we estimate σ with a test-then-estimate statistic T_n. Find bootstrap estimates for the variance of T_n, and perform a Monte Carlo simulation with sample sizes n = 40, 80 to see how accurate the bootstrap estimates are.
(4). Prove that if F_n and F are strictly increasing and continuous distribution functions such that F_n(0)=0 and F_n(1)=1 and F_n(x) → F(x) pointwise for all x, as n → ∞, then the Mallows-Wasserstein metric d_2(F_n, F) → 0. Do this by defining random variables Y ~ F, U = F(Y), Y_n = F_n^{-1}(U).
(5). Define a nonparametric bootstrap estimator T_n of the variance of the sample median for iid samples X_1,...,X_n. Also define a parametric-bootstrap estimator V_n of the same quantity for Expon(λ) data samples. Do the means of these estimators agree for large n when the data sample they are based on is actually Expon(λ)? What about their variances? What does theory say about the answers to these questions? Also give a computational answer to the question based on R=400 Monte Carlo iterations with B=300 bootstrap samples.
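The following is a rough sketch (with illustrative choices λ = 1, n = 50, B = 300) of the two variance estimators contrasted in problem (5), computed for a single simulated data sample; the full problem wraps such calculations in R = 400 Monte Carlo iterations.

## Sketch of problem (5)'s two estimators for one Expon(1) data sample:
## nonparametric vs parametric bootstrap variance of the sample median.
set.seed(11)
n <- 50; B <- 300
x <- rexp(n, rate = 1)
## nonparametric: resample the data themselves
Tn <- var(replicate(B, median(sample(x, replace = TRUE))))
## parametric: estimate lambda by 1/mean(x), then simulate from Expon(lambda-hat)
lam.hat <- 1/mean(x)
Vn <- var(replicate(B, median(rexp(n, rate = lam.hat))))
c(nonparametric = Tn, parametric = Vn)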
Homework Assignment 3. due Sun., March 21, 11:59pm.
This homework set is based on Lecture material related to Bootstrap consistency, Edgeworth expansions, and permutational and bootstrap hypothesis testing. Specific lecture references are given in bold-face for each problem.
(1) Do a Monte Carlo study to exhibit the relative sizes of the bootstrap-versus-true and normal-versus-true distribution differences, like that of the RscriptLec12.RLog script for Lecture 12 slides 2-5 (with something like N=4000, B=1000), for larger sample sizes like n=80 and 120. The goal is to confirm that the bootstrap difference is generally the smaller one for a summand distribution with skewness (say Gamma with shape parameter not too large, < 5), and that this does not happen for a symmetric summand distribution (say Beta(α,α) for α < 3). To show differences between distribution functions pictorially, it would be reasonable to use "lines" in R, which is basically linear interpolation between finely spaced ordered points along the x-axis.
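As a small illustration of the suggested plotting style (the curves shown are stand-ins, not the assignment's distribution differences):

## Plotting differences of distribution functions with lines():
## xs is a fine grid of sorted points, d1 and d2 two difference curves.
xs <- seq(-3, 3, length = 400)
d1 <- pnorm(xs) - pt(xs, df = 5)      # stand-in curves for illustration
d2 <- pnorm(xs) - pt(xs, df = 20)
plot(xs, d1, type = "n", xlab = "x", ylab = "difference of DFs")
lines(xs, d1, col = "red")            # linear interpolation between grid points
lines(xs, d2, col = "blue")
abline(h = 0, lty = 2)
legend("topright", legend = c("curve 1", "curve 2"),
       col = c("red", "blue"), lty = 1)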
(2) In Lecture 11, slides 5 and 6, leading up to the formal Theorem about Edgeworth expansions for standardized means, we introduced the Cramér condition that the characteristic function of a summand random variable should have complex modulus bounded away from 1 as its argument t goes to infinity. Hall (1992, Chapter 2) has additional details on this condition.
(a) Show that this condition does not hold for Binom(1,p) random variables.
(b) Show that this condition does not hold for discrete random variables which have a finite number of possible values, all of which are integers.
(c) Show that this condition does not hold [this assertion has been corrected from the earlier statement of this problem] for the discrete random variable with 3 possible values 0, 1, b (each with positive probability), for b not a rational number.
(3) Show numerically, by Monte Carlo simulations of large size, that the Edgeworth expansion (first-order, up to O(n^{-1/2}) terms) displayed in Lecture 11, slides 5 and 6, gives numerically more accurate distribution-function values for the standardized mean than the limiting normal distribution, for n = 30, 50, 70, based on lifetime distributions with skewness such as Weibull and log-normal.
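One possible sketch of such a comparison, using the standard one-term Edgeworth form Φ(x) - φ(x)·γ·(x^2-1)/(6√n) with γ the summand skewness -- verify this against the exact display on Lecture 11, slides 5-6 -- for a log-normal summand:

## Monte Carlo check that the one-term Edgeworth expansion
## Phi(x) - phi(x)*gam*(x^2-1)/(6*sqrt(n)) beats the plain normal limit.
set.seed(21)
n <- 30; N <- 1e5
s2 <- 0.5^2                              # log-normal log-scale variance
mu.ln <- exp(s2/2)                       # log-normal mean
sd.ln <- sqrt((exp(s2)-1)*exp(s2))       # log-normal standard deviation
gam <- (exp(s2)+2)*sqrt(exp(s2)-1)       # log-normal skewness
Z <- replicate(N, (mean(rlnorm(n, 0, sqrt(s2))) - mu.ln)/(sd.ln/sqrt(n)))
xs <- seq(-2, 2, by = 0.5)
true.F <- sapply(xs, function(x) mean(Z <= x))   # Monte Carlo "truth"
norm.F <- pnorm(xs)
edge.F <- pnorm(xs) - dnorm(xs)*gam*(xs^2 - 1)/(6*sqrt(n))
round(cbind(x = xs, MC = true.F, normal = norm.F, Edgeworth = edge.F), 4)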
(4) Carry out the Taylor series and Delta Method calculations to figure out exactly, up to first- and second-order (n^{-1/2} and n^{-1}) terms, the Edgeworth-expansion corrections to the Central Limit Theorem for a smooth transformation of the sample mean based on Expon(λ) data. The basic Edgeworth expansion is given in Lecture 11 on slides 5 and 6, and in this exercise you should use that expansion together with the Delta Method to obtain the analogous expansion (with different polynomials multiplying the normal density in the correction terms) for the transformed quantity.
(5) Do a Monte Carlo study (with a large number N of simulation iterations) of size and power for permutational and bootstrap tests (of nominal significance level α = 0.05) based on the (pooled-variance) t-test statistic and the Mann-Whitney statistic (introduced in Lecture 13, slide 8 and mentioned again in Lecture 15), for equality of two samples (sizes m = n = 30) in the nonparametric location-shift problem. Also do this for a bootstrap test based on a modified t-statistic obtained by studentizing the difference of sample means. Do your simulation with the normal distribution and the Logistic distribution. For the power calculation, choose a set of (at least 5) alternative values of the location shift close but not too close to 0, so that the powers are bounded away from 0.05 and 1.
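A minimal sketch of the permutation-test ingredient (two-sample pooled-variance t statistic, with an illustrative Logistic location shift); the bootstrap versions follow the same pattern with resampling in place of permutation:

## Two-sample permutation test based on the pooled-variance t statistic.
set.seed(31)
m <- 30; n <- 30
x <- rlogis(m); y <- rlogis(n) + 0.5     # location shift 0.5 as an example
tstat <- function(a, b) t.test(a, b, var.equal = TRUE)$statistic
obs <- tstat(x, y)
pooled <- c(x, y)
perm <- replicate(2000, { idx <- sample(m + n)
                          tstat(pooled[idx[1:m]], pooled[idx[-(1:m)]]) })
p.value <- mean(abs(perm) >= abs(obs))   # two-sided permutation p-value
p.value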
Homework Assignment 4. due Sun., April 18, 11:59pm. Five (5) Problems in all.
Read the material on Bootstrapping Regression in Sections 29.12-29.15 of the DasGupta (2008) Bootstrap Chapter, and do problems 29.20, 29.21, 29.23, 29.24 in that chapter, p. 495. Also solve and hand in the following additional problem:
(A) (2-pass linear regression) Suppose that data consist of (X_i, Y_i) with X_i = (1, Z_{i,1}, Z_{i,2}, ..., Z_{i,9}), where the Z_{i,k} are correlated multivariate normal random variables with mean 0 and variance 1. (The correlation matrix should not be the identity, or the problem is not meaningful. The essential correlation is between Z_{i,1} and the other Z_{i,k} variables, but it is OK to leave those other Z_{i,k} (k ≥ 2) uncorrelated with each other.) The Y_i variables are assumed to satisfy Y_i = β_0 + ∑_{k=1}^{9} β_k Z_{i,k} + ε_i, where ε_i has distribution function F_0(x/σ), F_0 an unknown distribution with mean 0 and variance 1, and σ > 0 an unknown scale parameter. The 2-pass estimator β̂_1 is defined by the following steps (a sketch in R appears after this problem):
Step 1. Find the least-squares estimator β̃ of β, and the corresponding estimator of σ defined by σ̃^2 = n^{-1} ∑_{i=1}^n (Y_i - β̃'X_i)^2.
Step 2. Find the set K_0 (possibly empty) of indices k ≥ 2 for which (β̃_k)^2 ≥ 1.96^2 σ̃^2 ((X'X)^{-1})_{k+1,k+1}.
Step 3. Replace the predictor vectors X_i by the vectors W_i of length |K_0|+2 given by W_i = (1, Z_{i,1}, (Z_{i,k}: k ∈ K_0)), and define β̂_1 as the estimated coefficient of Z_{i,1} in the least-squares estimate (W'W)^{-1} W'Y, where W is the n × (|K_0|+2) matrix with i-th row equal to W_i.
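A hedged sketch of one way to generate the correlated design and carry out Steps 1-3 (the equicorrelation value 0.3, the β values, and n = 100 are illustrative choices, not part of the problem statement):

## Sketch of data generation and the 2-pass estimator in problem (A).
library(MASS)                            # for mvrnorm
set.seed(41)
n <- 100
Sig <- diag(9)
Sig[1, 2:9] <- Sig[2:9, 1] <- 0.3        # Z1 correlated with the other Z_k
Z <- mvrnorm(n, mu = rep(0, 9), Sigma = Sig)
beta <- c(2, 1, rep(0.3, 8)); sigma <- 1
Y <- beta[1] + Z %*% beta[-1] + sigma*rnorm(n)
X <- cbind(1, Z)
## Step 1: least-squares estimates of beta and sigma
bt <- solve(crossprod(X), crossprod(X, Y))
s2t <- sum((Y - X %*% bt)^2)/n
## Step 2: indices k >= 2 whose coefficients pass the 1.96 threshold
XtXinv <- solve(crossprod(X))
K0 <- which(bt[3:10]^2 >= 1.96^2 * s2t * diag(XtXinv)[3:10]) + 1
## Step 3: refit using only the intercept, Z1, and the selected Z_k
W <- cbind(1, Z[, 1], Z[, K0, drop = FALSE])
beta1.hat <- solve(crossprod(W), crossprod(W, Y))[2]
beta1.hat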
Homework Assignment 5. due Sun., May 2, 11:59pm. Five (5) Problems in all.
This assignment is based entirely on a dataset, the "Boston Housing Data", available in the MASS R library as the data-frame "Boston". You are to look at these data from the vantage point of the linear homoscedastic regression model for the outcome "medv" in terms of all given predictor variables other than "nox" and "rad". "medv" is the median price (in units of $1000, in 1970) in a town or area near Boston. The assigned tasks are: (A) to do a goodness-of-fit analysis (of the adequacy of the linear regression model) using the wild-bootstrap method of Stute et al. (1998) covered in Lecture 23, slides 8-10; (B) to prove a fact about one of the statistics in that analysis; and (C) to estimate a specific mean-square prediction error via GBS bootstrap for the Boston Housing price data at the mean predictor value. The point values of the 3 parts are: 30 for (A), and 10 for each of (B) and (C).
(A) (30 points) The Boston Housing Data, which contain 506 records, can be accessed in R by
> library(MASS)
> data(Boston)
and the linear regression model under study in this assignment is the one (with intercept) fitted by the R statement
> model1 = lm(medv ~ . - nox - rad, data=Boston)
The task in this problem part is to do a test for Goodness of Fit of this linear model using the Wild Bootstrap method of Stute et al. (1998, JASA) covered in Lecture 23, slides 8-10. The steps are:
(1) Obtain coefficient estimates, fitted values, and residuals from the original data using the output list components $coef, $fitted, and $resid. Here the outcome Y is defined equal to the medv variable.
(2) Generate B batches of n=506 wild-bootstrap outcome vectors using the method given by Stute et al., with Y_i* = X_i'β̂ + ε̂_i V_i, where the V_i are iid variables independent of all the data, generated by the statistician with mean 0 and variance 1. (You could use double-exponential, standard normal, or any distribution you like that has all moments.)
(3) Define the random function R_n(x) = n^{-1/2} ∑_{i=1}^n (Y_i - X_i'β̂) 1{X_i ≤ x}, for x in the same value-space as the non-constant predictors, where X_i ≤ x means that the inequality holds coordinatewise and where β̂ is the ordinary least-squares estimate based on the original data. Define also the corresponding random function on each batch of wild-bootstrapped data, R_n*(x) = n^{-1/2} ∑_{i=1}^n (Y_i* - X_i'β̂*) 1{X_i ≤ x}, where β̂* is the ordinary least-squares estimator for the bootstrapped outcome data Y* regressed on the same (unchanged) predictors (plus intercept) from the original data.
(4) There are two different statistics and hypothesis tests you are asked to implement: the first based on the Kolmogorov-Smirnov-type statistic sup_x |R_n(x)|, with rejection threshold determined from the 0.95 quantile calculated from the bootstrap statistics sup_x |R_n*(x)| calculated from B bootstrap-data batches of size n. The second test you are asked to implement is based on the Cramér-von Mises-type statistic n^{-1} ∑_{j=1}^n R_n(X_j)^2, with rejection cutoff determined by the 0.95 quantile of the B bootstrap statistics n^{-1} ∑_{j=1}^n R_n*(X_j)^2 obtained from B independent wild-bootstrap batches of data.
(5) In implementing both tests, first find the set of B bootstrapped statistic values, next say whether the test rejects (for the Boston Housing data and the indicated regression model), and then find the p-value. A small R sketch of the wild-bootstrap loop is given after part (C) below.
(B) (10 points) The statistic R_n(x) is a function of the argument x ranging over the value space of the non-constant regression predictor variables. Prove that the sup over x of the |R_n(x)| function is achieved at one of the observed data points x = X_i, so that for the sup hypothesis-test statistic the R_n function needs to be evaluated only at the n points of the original data sample (which is the same as the set of predictor values for the bootstrapped data, since the X_i are left unchanged by the wild bootstrap on slide 9 of Lec. 23).
(C) (10 points) Use the GBS bootstrap to create an estimate of the Mean Square Prediction Error of the model at an X-vector exactly equal to the mean of the predictor variables, assuming the same homoscedastic linear regression model you analyzed above.
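A schematic sketch of the wild-bootstrap loop for the sup-type test of part (A); the form of R_n used here is the reconstruction given in step (3) above and should be checked against Lecture 23, slides 8-10. B is kept small purely for illustration.

## Schematic wild bootstrap for the sup-type goodness-of-fit statistic.
library(MASS)
data(Boston)
model1 <- lm(medv ~ . - nox - rad, data = Boston)
Xmat <- model.matrix(model1)             # predictors (with intercept)
res <- resid(model1); fit <- fitted(model1)
n <- nrow(Xmat)
## R_n evaluated at each data point X_j (coordinatewise indicators)
Rn.at.data <- function(e) {
  Z <- Xmat[, -1]                        # non-constant predictors
  sapply(1:n, function(j) {
    ind <- colSums(t(Z) <= Z[j, ]) == ncol(Z)   # X_i <= X_j coordinatewise
    sum(e[ind])/sqrt(n) })
}
T.obs <- max(abs(Rn.at.data(res)))
B <- 200                                 # small B for illustration only
T.boot <- replicate(B, {
  Ystar <- fit + res*rnorm(n)            # wild bootstrap, V_i ~ N(0,1)
  mstar <- lm(Ystar ~ Xmat - 1)          # refit on unchanged predictors
  max(abs(Rn.at.data(resid(mstar)))) })
c(T.obs = T.obs, cutoff = quantile(T.boot, 0.95),
  p.value = mean(T.boot >= T.obs))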
Homework Assignment 6. due Wed., May 12, 11:59pm.
In both of the following problems, do the requested computations and present the results in a few well-chosen numerical exhibits (pictures and/or tables). Describe in words what you have done and what the results mean, i.e., whether your results are what would have been expected from the theory presented in class. Hand in the R code you use as an appendix to your submitted paper, and indicate whether you did any checks to verify that your R code did exactly what it was supposed to. It would be helpful to comment your R code with text explaining what each code segment does.
(I) Consider the lognormally distributed dataset of size 40 generated by the R command
Xdat = exp(0.5 + rnorm(40)*0.8)
We have seen in our discussion of the double bootstrap in Lecture 27, slides 3-7, that the double bootstrap is supposed to increase the accuracy of any bias-correction or confidence-interval procedure by a factor of smaller polynomial order in n. Illustrate this for the statistical functional defined there, using a Monte Carlo study of N=500 datasets distributed the same as Xdat above:
(i) by comparing the bias reduction achieved in a single bootstrap with that achieved in a double bootstrap, and
(ii) by finding and comparing the coverage probability of a basic pivotal bootstrap confidence interval based on a single bootstrap with that achieved in a double bootstrap.
Recall that in slides 3 and 7 of Lecture 27, we stated the general equation being solved in the double bootstrap, with one choice of adjustment function for the Basic Pivotal Confidence Interval and another for the estimation of bias (i.e., bias reduction). In both double bootstraps, use the second-stage adjustment function as in Step 5 on Slide 5 of Lecture 27. Also, use the R code in RscriptLec27.RLog as a template.
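A schematic sketch of single versus double bootstrap bias correction, using log of the sample mean as a stand-in functional (the assignment's actual functional and adjustment-function details come from Lecture 27 and RscriptLec27.RLog):

## Schematic single vs double bootstrap bias correction for a functional,
## illustrated with theta = log of the mean, estimated by log(mean(X)).
set.seed(1)
X <- exp(0.5 + rnorm(40)*0.8)            # same form as Xdat in the assignment
thetahat <- log(mean(X))
B1 <- 500; B2 <- 100
t1 <- numeric(B1); bias2 <- numeric(B1)
for (b in 1:B1) {
  Xs <- sample(X, replace = TRUE)        # first-level resample
  t1[b] <- log(mean(Xs))
  t2 <- replicate(B2, log(mean(sample(Xs, replace = TRUE))))  # second level
  bias2[b] <- mean(t2) - t1[b]           # bias estimate within resample b
}
bias1 <- mean(t1) - thetahat             # single-bootstrap bias estimate
theta.single <- thetahat - bias1         # single-bootstrap correction
## double bootstrap adjusts the bias estimate itself (iterated correction):
theta.double <- thetahat - (2*bias1 - mean(bias2))
c(raw = thetahat, single = theta.single, double = theta.double)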
(II) Perform a parametric bootstrap with B=1000 to estimate the Mean Squared Prediction Error of a Beta-Binomial Regression, for each target parameter, in the following data setting:
set.seed(4713)
Xvec = runif(70)
nvec = sample(10:20, 70, replace=T)
mu = plogis(-1 + 0.5*Xvec)
Yvec = rbinom(70, nvec, rbeta(70, 5*mu, 5*(1-mu)))
The Beta-binomial regression model takes the form Y_i ~ Binom(n_i, p_i), where p_i ~ Beta(τ μ_i, τ (1-μ_i)) and logit(μ_i) = β_1 + β_2 X_i, for i = 1, ..., 70, with unknown parameters (τ, β_1, β_2).
The Beta-binomial model was presented on slide 4 of Lecture 29; the form of the Empirical Best Predictor was given on slide 6 of Lec. 29; and the form of the Parametric Bootstrap and associated MSPE estimator were given on slide 8 of Lecture 29B. You should use RscriptLec29.RLog as a template for your computations, augmented by the following R function for the Beta-binomial log-likelihood. Here the input parameters are:
in.th = c(log(tau), beta1, beta2)
in.nr = nvec
in.y = Yvec
in.xmat = cbind(1, Xvec)
logL.BBIN = function (in.th, in.nr, in.y, in.xmat) {
  t.tau = exp(in.th[1])                    # tau > 0 via log parameterization
  t.mu = plogis(c(in.xmat %*% in.th[-1]))  # mu_i = logit^{-1}(beta1 + beta2*X_i)
  t.a = t.mu * t.tau                       # Beta parameters (a_i, b_i)
  t.b = t.tau * (1 - t.mu)
  t1 = lgamma(in.nr+1) - lgamma(in.y+1) - lgamma(in.nr-in.y+1)  # log binomial coeff.
  t2 = lgamma(t.tau) - lgamma(t.a) - lgamma(t.b)                # -log Beta(a_i, b_i)
  t3 = lgamma(in.y+t.a) + lgamma(in.nr-in.y+t.b) - lgamma(in.nr+t.tau)
  sum(t1 + t2 + t3)                        # total Beta-binomial log-likelihood
}
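For example (with illustrative starting values), the log-likelihood can be maximized numerically by optim:

## Fit the Beta-binomial regression by maximizing logL.BBIN with optim.
neg.logL <- function(th) -logL.BBIN(th, nvec, Yvec, cbind(1, Xvec))
fit <- optim(c(log(5), -1, 0.5), neg.logL, method = "BFGS")
theta.hat <- fit$par                     # (log tau, beta1, beta2)
c(tau = exp(theta.hat[1]), beta1 = theta.hat[2], beta2 = theta.hat[3])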
FINAL PROJECT ASSIGNMENT, due Tuesday, May 18, 2021, 11:59pm (uploaded to ELMS as a pdf or MS Word document). As a final course project, you are to write a paper including at least 7-10 pages of narrative (and no more than 15), plus relevant code and graphical or tabular exhibits, on a statistical journal article or book-chapter related to the course, or else a data analysis or case-study [or simulation study] based on a dataset or data structure of your choosing.
(1) Pre-history of the Bootstrap, a 2003 Statistical Science paper by Peter Hall.
(2). A set of R Scripts on many topics related to the course are available in this directory.
(3). Several R packages related to Bootstrap are: list to be updated shortly, with links.
Additional Computing Resources. There are many publicly available datasets for practice data-analyses. Many of them are taken from journal articles and/or textbooks and documented or interpreted. A good place to start is Statlib. Here is another good source. Datasets needed in the course will either be posted to the course web-page or indicated by links provided here.
A good set of links to data sources from various organizations including Federal and international statistical agencies is at Washington Statistical Society links.
The UMCP Math Department home page.
The University of Maryland home page.
My home page.
© Eric V Slud, May 5, 2021.