HW 14   Stat 705   Fall 2015     Assigned Saturday 11/14/15     DUE Monday 11/23/15

(A) ---------- Solution --------------------

> set.seed(3)
> x = runif(100)
> y = 3 - 7*x + 10*abs(x-.5)^1.5*rnorm(100)
> Dset = cbind.data.frame(y, x)
> tmp1 = lm(y ~ x)
> summary(tmp1)$coef
              Estimate Std. Error    t value     Pr(>|t|)
(Intercept)  2.746689  0.3028526    9.069393 1.265556e-14
x           -6.667683  0.5391668  -12.366643 1.003877e-21

> wtfr = cbind.data.frame(ystar = y/abs(x-.5)^1.5, onestar = 1/abs(x-.5)^1.5,
                          xstar = x/abs(x-.5)^1.5)
> tmp2 = lm(ystar ~ onestar + xstar - 1, data=wtfr)
> summary(tmp2)$coef
          Estimate Std. Error   t value     Pr(>|t|)
onestar   2.878960  0.1394922  20.63886 1.877378e-37
xstar    -6.756853  0.2783812 -24.27194 3.112938e-43

> tmp3 = lm(y ~ x, weights = 1/abs(x-.5)^3)
> summary(tmp3)$coef
              Estimate Std. Error   t value     Pr(>|t|)
(Intercept)  2.878960  0.1394922  20.63886 1.877378e-37
x           -6.756853  0.2783812 -24.27194 3.112938e-43

## Check that the two weighted fits agree; any discrepancies are at the level
## of floating-point rounding:
> sum(abs(summary(tmp2)$coef - summary(tmp3)$coef))
> sum(abs(summary(tmp2)$cov.unsc - summary(tmp3)$cov.unsc))
> sum(abs(summary(tmp2)$sig - summary(tmp3)$sig))
[1] 1.421085e-14
> sum(abs(tmp2$fitted*abs(Dset$x-0.5)^1.5 - tmp3$fitted))
[1] 8.310505e-13

These checks VERIFY that the weighted least squares model y ~ x fitted to this
dataset in two different ways, (1) using lm with weights = b(x), and (2) using
lm (without an intercept!) with ystar = y*sqrt(b(x)) and all model-matrix
columns also multiplied componentwise by sqrt(b(x)), gives the same fit.

NB: A student pointed out (correctly) that if any observations x[i] were
extremely close to 0.5, then the weights for those points would be so large as
to dominate all other points.  In these data, the smallest distance from a
point x[i] to 0.5, and the largest corresponding weight, are

> aux = min(abs(Dset$x-0.5))
> c(aux, 1/aux^3)
[1] 5.023914e-03 7.886301e+06

If we were doing this with real data, I would certainly suggest tapering these
weights so that the ratio of largest to smallest is no larger than 1000, say
(a sketch of one way to do this appears after part (B) below).

(B) ------ Solution -------------------

> set.seed(3577)
> x = rgamma(250, 0.7)
> z = sample(1:4, 250, replace=T, prob=c(0.2,0.3,0.3,0.2))
> nv = rep(c(5,10), 125)
> y = rbinom(250, nv, plogis(0.3 + 1.2*x - 0.7*z))
> AuxFr = cbind.data.frame(y, x, z, nv)
> probit1 = glm(cbind(y, nv-y) ~ x + z, data=AuxFr, family=binomial(link="probit"))
## or
> probit1B = glm(y/nv ~ x + z, data=AuxFr, weights=nv, family=binomial(link="probit"))

## Saturated-model deviance:
> glm(cbind(y, nv-y) ~ I(qnorm((y+1.e-8)/(nv+2.e-8))), data=AuxFr,
      family=binomial(link="probit"))$dev
[1] 7.199995e-07
### This shows that the saturated-model deviance is (numerically) 0, which means
### that the reference value of -2*logLik, evaluated at the saturated model, is
> Refdv = -2*sum(dbinom(y, nv, (y+1.e-7)/(nv+2.e-7), log=T))
> Refdv
[1] 480.7319

## Null deviance, four ways
> prbt0 = glm(cbind(y, nv-y) ~ 1, data=AuxFr, family=binomial(link="probit"))
> prbt0$dev
[1] 652.3531
> prbt0$null.dev
[1] 652.3531
> summary(prbt0$fitted)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
 0.3872  0.3872  0.3872  0.3872  0.3872  0.3872
> -2*sum(dbinom(y, nv, prbt0$fitted, log=T)) - Refdv
[1] 652.3531
> c( -2*logLik(prbt0) - Refdv )
[1] 652.3531

### So the current (residual) deviance is obtained in either of two ways:
> c(probit1$dev, c(-2*logLik(probit1) - Refdv))
[1] 262.7669 262.7669
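
To follow up on the tapering suggestion at the end of part (A), here is a
minimal sketch of one way to do it, capping every weight at 1000 times the
smallest weight before refitting.  The object names (rawWt, capWt, tmp3t) are
illustrative only, and other tapering schemes (e.g. capping at a high quantile
of the weights) would work just as well.

## Sketch only: taper the WLS weights so that max(weight)/min(weight) <= 1000
rawWt = 1/abs(Dset$x - 0.5)^3          # raw weights; their ratio is roughly 10^6 here
capWt = pmin(rawWt, 1000*min(rawWt))   # cap the largest weights at 1000 * smallest
tmp3t = lm(y ~ x, data = Dset, weights = capWt)
summary(tmp3t)$coef                    # compare with summary(tmp3)$coef above

The capped fit will of course differ somewhat from tmp3; the point is that no
handful of observations near x = 0.5 can dominate the weighted fit.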
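
For part (B), probit1 and probit1B are two ways of writing the same
grouped-binomial probit fit, one with a two-column cbind(successes, failures)
response and one with a proportion response and prior weights nv.  A quick
sketch of a check (assuming the fitted objects above are in the workspace) that
the two agree:

## The two specifications should give the same coefficients and residual deviance
all.equal(coef(probit1), coef(probit1B))
c(probit1$deviance, probit1B$deviance)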
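
The null deviance reported above can also be recovered directly from the
grouped-binomial deviance formula D = 2 * sum[ y*log(y/mu) + (n-y)*log((n-y)/(n-mu)) ],
with mu = n*phat the fitted counts.  Below is a sketch; the names mu0 and
devterm are illustrative only, and terms with y = 0 or y = n are taken as 0.

## Grouped-binomial deviance of the null probit fit, computed from the formula
mu0 = prbt0$fitted * AuxFr$nv          # fitted counts n*phat under the null fit
devterm = function(a, b) ifelse(a == 0, 0, a*log(a/b))   # handles 0*log(0) = 0
2*sum(devterm(AuxFr$y, mu0) + devterm(AuxFr$nv - AuxFr$y, AuxFr$nv - mu0))
## this should reproduce prbt0$dev = 652.3531 up to rounding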