Homework Problem 7, Due Wednesday Oct 14.
--------------------------------------

(a) Generate 1000 independent batches of 20 i.i.d. random 2-vectors 
(X_i,Y_i), i=1,..,20 for which X has a standard logistic distribution, 
and conditionally given X ,         Y/X ~ Unif(3/4, 5/4) 

#----------------- Solution to part (a) ----------------------

SimPairs = function(N=1000,K=20) {
  ### here K is the batch size and N the number of independent 
  ###   batches to generate.
   Xvec = rlogis(N*K)
   Yvec = runif(N*K,.75,1.25)*Xvec
  array(c(Xvec,Yvec), dim=c(N,K,2), dimnames=list(1:N,
      1:K, c("X","Y"))) }

XYarr = SimPairs()

## To check for desired distribution, some or all of the following could be used:
## (i) Check for marginal distribution of X and Y/X:

> hist(c(XYarr[,,"X"]), nclass=50, prob=T)
  curve(dlogis(x), -6,6, add=T, col="orange")

> hist(c(XYarr[,,2]/XYarr[,,1]), nclass=50, prob=T)
  abline(h=2, col="orange")

## (ii) Check for marginal distribution of Y. Here the marginal density must 
be obtained by numerical integration, which we do for the whole set of 20,000 Y's
by setting up a Simpson's rule with 60 bins because it can be done really fast by
linear algebra as opposed to "integrate" or loops. (It would also have been fine 
to use "integrate" to define a function Ydens(y) and then overplot that using 

> curve(Ydens(y), add=T, col="green") 

instead of the overplotting done in "lines" below. Note that the distribution 
function of Y given U follows the formula P(Y<=y|U=u) = plogis(y/u), so the 
unconditional distribution function is F_Y(y) = 2*int_{0.75}^{1.25} plogis(y/u) du
and the density is  Ydens(y) = 2*int_{0.75}^{1.25} dlogis(y/u) du/u

Weights and points for the 61 Simpson-rule points seq(.75,1.25, by=.5/60) are

> wts = (0.5/(6*30))*c(1,rep(c(4,2),29),4,1)
  hist(c(XYarr[,,2]), nclass=50, prob=T, xlab="Y values",
    main = paste("Scaled Histogram of Y with \n",
                "Theoretical Density Overplotted"))
  pts = seq(.75,1.25, by=.5/60)
  ypts = sort( c(XYarr[,,2]) )
  fYarr = outer(ypts, pts, function(y,u) 2*dlogis(y/u)/u ) 
  Ydensarr = array(fYarr %*% wts, c(1000,20))
  lines(ypts, c(Ydensarr), col="orange")
     ### could have used far fewer points than these 20,000 
     ###    to get a good over-plotted density curve

## (iii) Could also directly evaluate the empirical relative frequency versus
the theoretical probability of falling into a set like (X,Y) in [-3,1]X[0,4],
which is the same as of falling into [0,1] X [0,1.25], which is simply the 
probability that X falls in [0,1]

> tmp = c(abs(XYarr[,,1]+1)<=2)*c(abs(XYarr[,,2]-2)<=2)
  tmp2 = c(abs(XYarr[,,1]-0.5)<=0.5)*c(abs(XYarr[,,2]-0.625)<=0.625)
> c(mean(tmp), mean(tmp2))
[1] 0.2291 0.2291            ### OK, they are supposed to be the same !!

c(Empir.Prob=mean(tmp2), Empir.CI = mean(tmp)+1.96*c(-1,1)*sqrt(mean(tmp2)*
  (1-mean(tmp2))/sqrt(2e4)), Theor.Prob = plogis(1)-0.5)
Empir.Prob  Empir.CI1  Empir.CI2 Theor.Prob 
 0.2291000  0.1598356  0.2983644  0.2310586                  ### OK.


---------- HW Assignment, cont'd -----------------------

(b) The objective of this simulation is to evaluate

E( (Y_(10) - X_(10))^2 )               (*)

where the subscript _(10) refers to the 10th order statistic (ie the 10th 
smallest observation) within the batch.

One method of evaluating the expectation (*) is by a direct simulation: 
perform (and time!) a simulation of 10000 batches and find (*) as a direct 
empirical average of 10000 quantities  (Y_(10) - X_(10))^2  that you simulate.

Explore whether you can speed up the simulation, while obtaining accuracy 
at least as good as with your first method, by using fewer simulation 
replications and applying the "method of control variates" with control 
variate  (X_(10))^2 .

#----------------------- Solution to part (b) -------------------

Xord10 = apply(XYarr[,,1], 1, function(vec) sort(vec)[10])
Yord10 = apply(XYarr[,,2], 1, function(vec) sort(vec)[10])


# A hint was given in the problem statement that the 10th order statistic from 
20 iid Uniform[0,1] random variables is Beta(10,11), so the exact distribution 
of the X_(10) order-statistic is the same as qlogis(Z) for a Beta(10,11) 
random variable Z. Therefore the density of X_(10), by the transformation 
formula for monotone functions of a univariate random variable, is

dbeta(plogis(x),10,11)*dlogis(x)

and E( X_(10)^2 ) = 

> AvXordsq = integrate(function(x) x^2*dbeta(plogis(x),10,11)*dlogis(x),-Inf,Inf)$value
  AvXordsq
[1] 0.2103327
### Compare this with the empirical 2nd moment
mean(Xord10^2)
[1] 0.2062683      ### Comparison seems OK

## The first method (with CI), assuming Xord10 and Yord10 already calculated, is 

unix.time({ Z = (Xord10-Yord10)^2
            SDZ = sd(Z)
            avZ = mean(Z)
            cat("Est.= ", avZ,", and CIwidth = ",2*1.96*SDZ/sqrt(1000),", \n") })
Est.=  0.003078716 , and CIwidth =  0.0008166372  
   user  system elapsed 
      0       0       0 

### Second method:

unix.time({ Z = (Xord10-Yord10)^2
            AvXordsq=integrate(function(x) 
                  x^2*dbeta(plogis(x),10,11)*dlogis(x),-Inf,Inf)$value
            X10sq = Xord10^2-AvXordsq
            tmp = lm(Z~X10sq)
            cat("Est.= ", tmp$coef[1],", and CIwidth = ",
                  2*1.96*summary(tmp)$coef[1,2],", \n") })
Est.=  0.003129713 , and CIwidth =  0.0006834433 , 
   user  system elapsed 
      0       0       0 

Timing runs cannot be distinguished based on 1000 batches; doing it again 
with N=1e5 gives instead

unix.time({ XYarr = SimPairs(N=1e5) 
            Xord10 = apply(XYarr[,,1], 1, function(vec) sort(vec)[10]) 
            Yord10 = apply(XYarr[,,2], 1, function(vec) sort(vec)[10])  })
   user  system elapsed 
  21.30    0.06   21.39 

unix.time({ Z = (Xord10-Yord10)^2
            SDZ = sd(Z)
            avZ = mean(Z)
            cat("Est.= ", avZ,", and CIwidth = ",2*1.96*SDZ/sqrt(1e5)," \n") })
Est.=  0.003293407 , and CIwidth =  9.048807e-05 , 
   user  system elapsed 
      0       0       0 

unix.time({ Z = (Xord10-Yord10)^2
            AvXordsq=integrate(function(x) 
                  x^2*dbeta(plogis(x),10,11)*dlogis(x),-Inf,Inf)$value
            X10sq = Xord10^2-AvXordsq
            tmp = lm(Z~X10sq)
            cat("Est.= ", tmp$coef[1],", and CIwidth = ",
                  2*1.96*summary(tmp)$coef[1,2]," \n") })
Est.=  0.00328951 , and CIwidth =  7.259099e-05  
   user  system elapsed 
   0.04    0.00    0.06 

### So as we saw in class, the time of computation of the estimate is not 
reduced by using the control variates method here, BUT the increase in 
accuracy (reduction in confidence interval width) is about 20%, which means 
that the reduction in N needed to achieve the same accuracy is about 36%.

> cor(Xord10^2, (Yord10-Xord10)^2)
[1] 0.6021362
> sqrt(1-.6021362^2)
[1] 0.7983905             ### the reduction factor for CI width