Homework Problem 7, Due Wednesday Oct 14. -------------------------------------- (a) Generate 1000 independent batches of 20 i.i.d. random 2-vectors (X_i,Y_i), i=1,..,20 for which X has a standard logistic distribution, and conditionally given X , Y/X ~ Unif(3/4, 5/4) #----------------- Solution to part (a) ---------------------- SimPairs = function(N=1000,K=20) { ### here K is the batch size and N the number of independent ### batches to generate. Xvec = rlogis(N*K) Yvec = runif(N*K,.75,1.25)*Xvec array(c(Xvec,Yvec), dim=c(N,K,2), dimnames=list(1:N, 1:K, c("X","Y"))) } XYarr = SimPairs() ## To check for desired distribution, some or all of the following could be used: ## (i) Check for marginal distribution of X and Y/X: > hist(c(XYarr[,,"X"]), nclass=50, prob=T) curve(dlogis(x), -6,6, add=T, col="orange") > hist(c(XYarr[,,2]/XYarr[,,1]), nclass=50, prob=T) abline(h=2, col="orange") ## (ii) Check for marginal distribution of Y. Here the marginal density must be obtained by numerical integration, which we do for the whole set of 20,000 Y's by setting up a Simpson's rule with 60 bins because it can be done really fast by linear algebra as opposed to "integrate" or loops. (It would also have been fine to use "integrate" to define a function Ydens(y) and then overplot that using > curve(Ydens(y), add=T, col="green") instead of the overplotting done in "lines" below. Note that the distribution function of Y given U follows the formula P(Y<=y|U=u) = plogis(y/u), so the unconditional distribution function is F_Y(y) = 2*int_{0.75}^{1.25} plogis(y/u) du and the density is Ydens(y) = 2*int_{0.75}^{1.25} dlogis(y/u) du/u Weights and points for the 61 Simpson-rule points seq(.75,1.25, by=.5/60) are > wts = (0.5/(6*30))*c(1,rep(c(4,2),29),4,1) hist(c(XYarr[,,2]), nclass=50, prob=T, xlab="Y values", main = paste("Scaled Histogram of Y with \n", "Theoretical Density Overplotted")) pts = seq(.75,1.25, by=.5/60) ypts = sort( c(XYarr[,,2]) ) fYarr = outer(ypts, pts, function(y,u) 2*dlogis(y/u)/u ) Ydensarr = array(fYarr %*% wts, c(1000,20)) lines(ypts, c(Ydensarr), col="orange") ### could have used far fewer points than these 20,000 ### to get a good over-plotted density curve ## (iii) Could also directly evaluate the empirical relative frequency versus the theoretical probability of falling into a set like (X,Y) in [-3,1]X[0,4], which is the same as of falling into [0,1] X [0,1.25], which is simply the probability that X falls in [0,1] > tmp = c(abs(XYarr[,,1]+1)<=2)*c(abs(XYarr[,,2]-2)<=2) tmp2 = c(abs(XYarr[,,1]-0.5)<=0.5)*c(abs(XYarr[,,2]-0.625)<=0.625) > c(mean(tmp), mean(tmp2)) [1] 0.2291 0.2291 ### OK, they are supposed to be the same !! c(Empir.Prob=mean(tmp2), Empir.CI = mean(tmp)+1.96*c(-1,1)*sqrt(mean(tmp2)* (1-mean(tmp2))/sqrt(2e4)), Theor.Prob = plogis(1)-0.5) Empir.Prob Empir.CI1 Empir.CI2 Theor.Prob 0.2291000 0.1598356 0.2983644 0.2310586 ### OK. ---------- HW Assignment, cont'd ----------------------- (b) The objective of this simulation is to evaluate E( (Y_(10) - X_(10))^2 ) (*) where the subscript _(10) refers to the 10th order statistic (ie the 10th smallest observation) within the batch. One method of evaluating the expectation (*) is by a direct simulation: perform (and time!) a simulation of 10000 batches and find (*) as a direct empirical average of 10000 quantities (Y_(10) - X_(10))^2 that you simulate. Explore whether you can speed up the simulation, while obtaining accuracy at least as good as with your first method, by using fewer simulation replications and applying the "method of control variates" with control variate (X_(10))^2 . #----------------------- Solution to part (b) ------------------- Xord10 = apply(XYarr[,,1], 1, function(vec) sort(vec)[10]) Yord10 = apply(XYarr[,,2], 1, function(vec) sort(vec)[10]) # A hint was given in the problem statement that the 10th order statistic from 20 iid Uniform[0,1] random variables is Beta(10,11), so the exact distribution of the X_(10) order-statistic is the same as qlogis(Z) for a Beta(10,11) random variable Z. Therefore the density of X_(10), by the transformation formula for monotone functions of a univariate random variable, is dbeta(plogis(x),10,11)*dlogis(x) and E( X_(10)^2 ) = > AvXordsq = integrate(function(x) x^2*dbeta(plogis(x),10,11)*dlogis(x),-Inf,Inf)$value AvXordsq [1] 0.2103327 ### Compare this with the empirical 2nd moment mean(Xord10^2) [1] 0.2062683 ### Comparison seems OK ## The first method (with CI), assuming Xord10 and Yord10 already calculated, is unix.time({ Z = (Xord10-Yord10)^2 SDZ = sd(Z) avZ = mean(Z) cat("Est.= ", avZ,", and CIwidth = ",2*1.96*SDZ/sqrt(1000),", \n") }) Est.= 0.003078716 , and CIwidth = 0.0008166372 user system elapsed 0 0 0 ### Second method: unix.time({ Z = (Xord10-Yord10)^2 AvXordsq=integrate(function(x) x^2*dbeta(plogis(x),10,11)*dlogis(x),-Inf,Inf)$value X10sq = Xord10^2-AvXordsq tmp = lm(Z~X10sq) cat("Est.= ", tmp$coef[1],", and CIwidth = ", 2*1.96*summary(tmp)$coef[1,2],", \n") }) Est.= 0.003129713 , and CIwidth = 0.0006834433 , user system elapsed 0 0 0 Timing runs cannot be distinguished based on 1000 batches; doing it again with N=1e5 gives instead unix.time({ XYarr = SimPairs(N=1e5) Xord10 = apply(XYarr[,,1], 1, function(vec) sort(vec)[10]) Yord10 = apply(XYarr[,,2], 1, function(vec) sort(vec)[10]) }) user system elapsed 21.30 0.06 21.39 unix.time({ Z = (Xord10-Yord10)^2 SDZ = sd(Z) avZ = mean(Z) cat("Est.= ", avZ,", and CIwidth = ",2*1.96*SDZ/sqrt(1e5)," \n") }) Est.= 0.003293407 , and CIwidth = 9.048807e-05 , user system elapsed 0 0 0 unix.time({ Z = (Xord10-Yord10)^2 AvXordsq=integrate(function(x) x^2*dbeta(plogis(x),10,11)*dlogis(x),-Inf,Inf)$value X10sq = Xord10^2-AvXordsq tmp = lm(Z~X10sq) cat("Est.= ", tmp$coef[1],", and CIwidth = ", 2*1.96*summary(tmp)$coef[1,2]," \n") }) Est.= 0.00328951 , and CIwidth = 7.259099e-05 user system elapsed 0.04 0.00 0.06 ### So as we saw in class, the time of computation of the estimate is not reduced by using the control variates method here, BUT the increase in accuracy (reduction in confidence interval width) is about 20%, which means that the reduction in N needed to achieve the same accuracy is about 36%. > cor(Xord10^2, (Yord10-Xord10)^2) [1] 0.6021362 > sqrt(1-.6021362^2) [1] 0.7983905 ### the reduction factor for CI width