HW 14   Stat 705   Fall 2015     Assigned Saturday 11/14/15     DUE Monday 11/23/15

(A) ---------- Solution --------------------

> set.seed(3)
> x = runif(100)
> y = 3 - 7*x + 10*abs(x-.5)^1.5*rnorm(100)
> Dset = cbind.data.frame(y, x)
> tmp1 = lm(y ~ x)
> summary(tmp1)$coef
              Estimate Std. Error    t value     Pr(>|t|)
(Intercept)  2.746689  0.3028526    9.069393 1.265556e-14
x           -6.667683  0.5391668  -12.366643 1.003877e-21

> wtfr = cbind.data.frame(ystar = y/abs(x-.5)^1.5, onestar = 1/abs(x-.5)^1.5,
                          xstar = x/abs(x-.5)^1.5)
> tmp2 = lm(ystar ~ onestar + xstar - 1, data=wtfr)
> summary(tmp2)$coef
          Estimate Std. Error   t value     Pr(>|t|)
onestar   2.878960  0.1394922  20.63886 1.877378e-37
xstar    -6.756853  0.2783812 -24.27194 3.112938e-43

> tmp3 = lm(y ~ x, weights = 1/abs(x-.5)^3)
> summary(tmp3)$coef
              Estimate Std. Error   t value     Pr(>|t|)
(Intercept)  2.878960  0.1394922  20.63886 1.877378e-37
x           -6.756853  0.2783812 -24.27194 3.112938e-43

## Check that the two weighted fits agree; any discrepancies are at the level
## of floating-point rounding:
> sum(abs(summary(tmp2)$coef - summary(tmp3)$coef))
> sum(abs(summary(tmp2)$cov.unsc - summary(tmp3)$cov.unsc))
> sum(abs(summary(tmp2)$sig - summary(tmp3)$sig))
[1] 1.421085e-14
> sum(abs(tmp2$fitted*abs(Dset$x-0.5)^1.5 - tmp3$fitted))
[1] 8.310505e-13

These checks VERIFY that the weighted least squares model y ~ x fitted to this
dataset in two different ways, (1) using lm with weights = b(x), and (2) using
lm (without an intercept!) with ystar = y*sqrt(b(x)) and all model-matrix
columns also multiplied componentwise by sqrt(b(x)), gives the same fit.

NB: A student pointed out (correctly) that if any observations x[i] were
extremely close to 0.5, then the weights for those points would be so large as
to dominate all other points.  In these data, the smallest distance from a
point x[i] to 0.5, and the largest corresponding weight, are

> aux = min(abs(Dset$x-0.5))
> c(aux, 1/aux^3)
[1] 5.023914e-03 7.886301e+06

If we were doing this with real data, I would certainly suggest tapering these
weights so that the ratio of largest to smallest is no larger than 1000, say
(a sketch of one way to do this appears after part (B) below).

(B) ------ Solution -------------------

> set.seed(3577)
> x = rgamma(250, 0.7)
> z = sample(1:4, 250, replace=T, prob=c(0.2,0.3,0.3,0.2))
> nv = rep(c(5,10), 125)
> y = rbinom(250, nv, plogis(0.3 + 1.2*x - 0.7*z))
> AuxFr = cbind.data.frame(y, x, z, nv)
> probit1 = glm(cbind(y, nv-y) ~ x + z, data=AuxFr, family=binomial(link="probit"))
## or
> probit1B = glm(y/nv ~ x + z, data=AuxFr, weights=nv, family=binomial(link="probit"))

## Saturated-model deviance:
> glm(cbind(y, nv-y) ~ I(qnorm((y+1.e-8)/(nv+2.e-8))), data=AuxFr,
      family=binomial(link="probit"))$dev
[1] 7.199995e-07
### This shows that the saturated-model deviance is (numerically) 0, which means
### that the reference value of -2*logLik, evaluated at the saturated model, is
> Refdv = -2*sum(dbinom(y, nv, (y+1.e-7)/(nv+2.e-7), log=T))
> Refdv
[1] 480.7319

## Null deviance, four ways
> prbt0 = glm(cbind(y, nv-y) ~ 1, data=AuxFr, family=binomial(link="probit"))
> prbt0$dev
[1] 652.3531
> prbt0$null.dev
[1] 652.3531
> summary(prbt0$fitted)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
 0.3872  0.3872  0.3872  0.3872  0.3872  0.3872
> -2*sum(dbinom(y, nv, prbt0$fitted, log=T)) - Refdv
[1] 652.3531
> c( -2*logLik(prbt0) - Refdv )
[1] 652.3531

### So the current (residual) deviance is obtained in either of two ways:
> c(probit1$dev, c(-2*logLik(probit1) - Refdv))
[1] 262.7669 262.7669
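
To follow up on the tapering suggestion at the end of part (A), here is a
minimal sketch of one way to do it, capping every weight at 1000 times the
smallest weight before refitting.  The object names (rawWt, capWt, tmp3t) are
illustrative only, and other tapering schemes (e.g. capping at a high quantile
of the weights) would work just as well.

## Sketch only: taper the WLS weights so that max(weight)/min(weight) <= 1000
rawWt = 1/abs(Dset$x - 0.5)^3          # raw weights; their ratio is roughly 10^6 here
capWt = pmin(rawWt, 1000*min(rawWt))   # cap the largest weights at 1000 * smallest
tmp3t = lm(y ~ x, data = Dset, weights = capWt)
summary(tmp3t)$coef                    # compare with summary(tmp3)$coef above

The capped fit will of course differ somewhat from tmp3; the point is that no
handful of observations near x = 0.5 can dominate the weighted fit.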
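
For part (B), probit1 and probit1B are two ways of writing the same
grouped-binomial probit fit, one with a two-column cbind(successes, failures)
response and one with a proportion response and prior weights nv.  A quick
sketch of a check (assuming the fitted objects above are in the workspace) that
the two agree:

## The two specifications should give the same coefficients and residual deviance
all.equal(coef(probit1), coef(probit1B))
c(probit1$deviance, probit1B$deviance)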
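
The null deviance reported above can also be recovered directly from the
grouped-binomial deviance formula D = 2 * sum[ y*log(y/mu) + (n-y)*log((n-y)/(n-mu)) ],
with mu = n*phat the fitted counts.  Below is a sketch; the names mu0 and
devterm are illustrative only, and terms with y = 0 or y = n are taken as 0.

## Grouped-binomial deviance of the null probit fit, computed from the formula
mu0 = prbt0$fitted * AuxFr$nv          # fitted counts n*phat under the null fit
devterm = function(a, b) ifelse(a == 0, 0, a*log(a/b))   # handles 0*log(0) = 0
2*sum(devterm(AuxFr$y, mu0) + devterm(AuxFr$nv - AuxFr$y, AuxFr$nv - mu0))
## this should reproduce prbt0$dev = 652.3531 up to rounding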