Stat 705 HW 15 Solution
=======================

> kyphosis = read.table(
     "http://www-users.math.umd.edu/~evs/s798c/Data/kyph2.dat",
     header=T, skip=8)
> table(as.numeric(kyphosis$Kyphosis)-1)
 0  1
64 17
> kyphosis$Kyphosis = as.numeric(kyphosis$Kyphosis)-1
> kyphosis[1:5,]
  Id Kyphosis Age Number Start
1  1        0  71      3     5
2  2        0 158      3    14
3  3        1 128      4     5
4  4        0   2      5     1
5  5        0   1      4    15

> fitmod1 = step(glm(Kyphosis ~ 1, family=binomial, data=kyphosis),
     Kyphosis ~ Age + Number + Start + I(Age^2), k=2)
### all variables are entered, with k=2, but Number is not significant
> round(summary(fitmod1)$coef, 3)
            Estimate Std. Error z value Pr(>|z|)
(Intercept)   -4.384      2.055  -2.133    0.033
Start         -0.204      0.071  -2.883    0.004
Number         0.427      0.237   1.805    0.071
Age            0.082      0.035   2.364    0.018
I(Age^2)       0.000      0.000  -2.082    0.037
### NOTE that only "Start" is significant if I(Age^2) is not within
### scope. Another way to say this: with k=3.84 and scope given only
### by the Number + Start + Age model, no predictor variable other
### than "Start" would be entered!

> round(summary(update(fitmod1, .~.-Number))$coef, 3)
            Estimate Std. Error z value Pr(>|z|)
(Intercept)   -1.737      1.288  -1.348    0.178
Start         -0.234      0.069  -3.403    0.001
Age            0.074      0.031   2.386    0.017
I(Age^2)       0.000      0.000  -2.105    0.035
> fitmod1 = update(fitmod1, .~.-Number)
> table(kyphosis$Kyphosis, trunc(5*fitmod1$fit))

     0  1  2  3  4
  0 49 10  0  4  1
  1  3  5  4  3  2
> tabfit1 = .Last.value
> round(tabfit1[2,]/(tabfit1[1,]+tabfit1[2,]), 4)
     0      1      2      3      4
0.0577 0.3333 1.0000 0.4286 0.6667
### This expresses the observed frequency of Kyphosis response
### within each of the 5 covariate-defined categories.
### Since these are categories defined by intervals of successively
### larger fitted values, it makes sense that the larger the category
### number, the larger the observed frequency of response. That is
### what we see except in the very strange category 2 !!
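## The R idiom table(Kyphosis, trunc(5*fit)) bins the fitted
## probabilities into fifths and tabulates observed responses per bin.
## The same bookkeeping can be sketched outside R, e.g. in Python
## (the fitted values below are made up for illustration only):

```python
def bin_index(p, nbins=5):
    """Mimic R's trunc(nbins*p): map a fitted probability to a bin 0..nbins-1.
    (We clip p == 1.0 into the top bin, where trunc alone would give nbins.)"""
    return min(int(nbins * p), nbins - 1)

def observed_freq_by_bin(fitted, response, nbins=5):
    """Observed response frequency within each fitted-value bin."""
    hits = [0] * nbins   # responses (y = 1) per bin
    tot = [0] * nbins    # all observations per bin
    for p, y in zip(fitted, response):
        j = bin_index(p, nbins)
        tot[j] += 1
        hits[j] += y
    return [h / t if t else float("nan") for h, t in zip(hits, tot)]

# hypothetical fitted probabilities and 0/1 responses
freqs = observed_freq_by_bin([0.05, 0.15, 0.45, 0.9], [0, 1, 1, 1])
```

## A well-calibrated fit should give frequencies that increase with
## the bin number, which is exactly the informal check made above.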
## This is a setting with 5 cells, so 5 degrees of freedom, but
## then 4 parameters were estimated, so there is just 1 remaining df.
## If we take the observed numbers in the 5 categories as fixed,

> nobs = table(trunc(5*fitmod1$fit))
 0  1  2  3  4
52 15  4  7  3

### then the model expected and the observed numbers of positive
### responses are respectively:

> round(rbind(expected= aggregate(fitmod1$fit,
       list(trunc(5*fitmod1$fit)), sum)$x,
     observed= aggregate(kyphosis$Kyphosis,
       list(trunc(5*fitmod1$fit)), sum)$x), 2)
         [,1] [,2] [,3] [,4] [,5]
expected 3.75 4.02 1.94 4.83 2.45
observed 3.00 5.00 4.00 3.00 2.00
### really bad in category 2
> counts1 = .Last.value
> sum((counts1[1,]-counts1[2,])^2*(1/counts1[1,]+1/(nobs-counts1[1,])))
[1] 7.422915
### Looks large! Not a good model, even though this is not a formal test...
### Probably the model with Number left in is better.
### Nevertheless, this is all I meant you to do for this exercise.

## Next follow the corresponding steps for the probit model:

> fitmod2 = step(glm(Kyphosis ~ 1, family=binomial(link="probit"),
     data=kyphosis), Kyphosis ~ Age + Number + Start + I(Age^2), k=3.84)
### Now only "Start" is included in the model:
> round(summary(fitmod2)$coef, 3)
            Estimate Std. Error z value Pr(>|z|)
(Intercept)    0.547      0.382   1.434    0.152
Start         -0.132      0.034  -3.825    0.000

### With this model, we can imitate the previous informal
### "goodness of fit" analysis:

> table(kyphosis$Kyphosis, trunc(5*fitmod2$fit))

     0  1  2  3
  0 49  7  4  4
  1  5  3  6  3
> tabfit2 = .Last.value
> round(tabfit2[2,]/(tabfit2[1,]+tabfit2[2,]), 4)
     0      1      2      3
0.0926 0.3000 0.6000 0.4286
### strange because the highest predictor category
### does not actually have the highest observed
### relative response frequency, but the numbers are small!
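## The informal statistic computed above,
##     sum_j (E_j - O_j)^2 * (1/E_j + 1/(n_j - E_j)),
## can be sketched as a small Python function (not part of the R
## session; the numbers fed in below are the rounded expected/observed
## counts and cell sizes from the logistic fit above):

```python
def informal_chisq(expected, observed, nobs):
    """Informal goodness-of-fit statistic for binned binary data:
    sum over cells of (E_j - O_j)^2 * (1/E_j + 1/(n_j - E_j)),
    where E_j, O_j are expected and observed positive responses in
    cell j and n_j is the total count in cell j (so n_j - E_j is the
    expected number of non-responses)."""
    return sum((e - o) ** 2 * (1.0 / e + 1.0 / (n - e))
               for e, o, n in zip(expected, observed, nobs))

# rounded counts from the logistic-model table above
stat = informal_chisq([3.75, 4.02, 1.94, 4.83, 2.45],
                      [3, 5, 4, 3, 2],
                      [52, 15, 4, 7, 3])   # reproduces ~7.42
```

## This reproduces the 7.42 value, most of it contributed by the
## strange category 2 (expected 1.94 vs observed 4).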
> round(rbind(expected= aggregate(fitmod2$fit,
       list(trunc(5*fitmod2$fit)), sum)$x,
     observed= aggregate(kyphosis$Kyphosis,
       list(trunc(5*fitmod2$fit)), sum)$x), 2)
         [,1] [,2] [,3] [,4]
expected 4.99 2.54 4.66 4.53
observed 5.00 3.00 6.00 3.00
### Not too bad:
> counts2 = .Last.value
> nobs2 = table(trunc(5*fitmod2$fit))
> sum((counts2[1,]-counts2[2,])^2*(1/counts2[2,]+1/(nobs2-counts2[1,])))
[1] 2.462474
### This would be compared with chi-squared on 4-2 = 2 df.
## So it seems that the simpler probit model fits better !!
## (See below for more details and some formal tests.)

================================================================

The formal goodness of fit test is not so easy to do without
re-fitting the model a different and slightly more complicated way.
The complication arises because there are estimated parameters (the
coefficients in the model), which must be estimated by a minimum
chi-squared criterion (equivalent to maximizing the likelihood for the
COUNT data, i.e. the numbers of observations falling into the various
cells you define).

If you divide the population into categories based on the predictor
variables, creating say K cells based on predictors, then there will
altogether be 2*K cells based on predictors x response. For each
parameter vector beta, and each cell j based on predictors, you would
find the number R_j of responses and M_j of nonresponses for
individuals within that cell, and then you would find the
corresponding expected numbers, e.g. in the logistic model

    E_{j1}(beta) = sum of plogis(x_i*beta) over predictor vectors x_i
                   falling in cell j

and similarly E_{j0}(beta) = M_j + R_j - E_{j1}(beta). Note that
R_j - E_{j1}(beta) = -(M_j - E_{j0}(beta)). Then the minimum
chi-square would be the minimum over beta of

    sum_{j=1}^K (R_j - E_{j1}(beta))^2 * (1/E_{j1}(beta) + 1/E_{j0}(beta))

and this minimum would be a test statistic for goodness of fit which
would reject if larger than qchisq(1-alpha, K-p), where p is the
number of estimated coefficients in the model.
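## The criterion just described can be sketched in Python to make the
## bookkeeping concrete (toy data and names; the actual minimization
## over beta is done with optim() in the R session below):

```python
import math

def plogis(t):
    """Logistic CDF, R's plogis."""
    return 1.0 / (1.0 + math.exp(-t))

def chisq_beta(beta, X, cell, R, M):
    """Chi-square criterion from the text:
      E_{j1}(beta) = sum of plogis(x_i . beta) over rows i in cell j,
      E_{j0}(beta) = M_j + R_j - E_{j1}(beta),
      stat = sum_j (R_j - E_{j1})^2 * (1/E_{j1} + 1/E_{j0}).
    cell[i] gives the predictor-cell (0..K-1) of observation i;
    R, M are the per-cell response and non-response counts."""
    K = max(cell) + 1
    E1 = [0.0] * K
    for x, j in zip(X, cell):
        E1[j] += plogis(sum(a * b for a, b in zip(x, beta)))
    stat = 0.0
    for j in range(K):
        E0 = M[j] + R[j] - E1[j]
        stat += (R[j] - E1[j]) ** 2 * (1.0 / E1[j] + 1.0 / E0)
    return stat

# toy example: 4 observations, 2 cells, intercept + one predictor
X = [[1, 0], [1, 1], [1, 2], [1, 3]]
cell = [0, 0, 1, 1]
R, M = [1, 1], [1, 1]
stat0 = chisq_beta([0.0, 0.0], X, cell, R, M)  # beta = 0 fits these counts exactly
```

## Minimizing chisq_beta over beta would give the min-chi-square
## statistic, to be compared with a chi-squared(K-p) distribution.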
The degrees of freedom are K-p because there are 2K cells but K
constraints (i.e. that the total numbers of observations with
predictors in cell j are fixed at M_j + R_j) and p estimated
parameters.

============================================================

HERE IS AN IMPLEMENTATION OF THE FORMAL GOODNESS OF FIT TEST,
for the logistic fitted model.

## First we create 4 cells by cross-classifying Age by Start

> table(kyphosis$Age > 84, kyphosis$Start > 9)

        FALSE TRUE
  FALSE    14   26
  TRUE      9   32
> celnum = as.numeric(interaction(kyphosis$Age > 84, kyphosis$Start > 9))
> table(celnum)
celnum
 1  2  3  4
14  9 26 32
> celrsp = aggregate(kyphosis$Kyphosis, list(celnum), sum)$x
> celrsp
[1] 4 7 2 4

## Now we create a little function for the chi-square to
## minimize in fitmod1:

> Xmat1 = cbind( model.matrix(fitmod1), kyphosis$Number )
> mobs1 = table(celnum)
> ChiSq1 = function(beta, Xmat, logit=T) {
     Ebeta = aggregate(c( if(logit) plogis(Xmat %*% beta) else
            pnorm(Xmat %*% beta) ), list(celnum), sum)$x
     sum((celrsp-Ebeta)^2*(1/Ebeta + 1/(mobs1-Ebeta))) }

> ChiSq1(fitmod1$coef, Xmat1[,1:4])     ### looks OK, but 0 df !
[1] 1.207659
> optim(fitmod1$coef, ChiSq1, Xmat=Xmat1[,1:4])
### gets to 0, but for absurd parameter values

> optim(fitmod1$coef[1:3], ChiSq1, Xmat=Xmat1[,1:3])[1:2]   ## Omits Age^2
$par
(Intercept)       Start         Age
 0.22386528 -0.26165877  0.01217635
$value
[1] 2.284673
### OK, good for 1 df

### Here is another way to get 1 df: include only Start and Number
> optim(rep(0,3), ChiSq1, Xmat=Xmat1[,c(1,2,5)])[1:2]
$par
[1] -1.2925359 -0.1687925  0.3874084
$value
[1] 4.907799
### So for this chi-square with Age cells, better
### to use Age rather than Number in the model.

> optim(fitmod1$coef[1:2], ChiSq1, Xmat=Xmat1[,1:2])[1:2]   ## Uses only Start
> 1-pchisq(5.48,2)
[1] 0.06457035
### barely non-significant

### Let's do the same thing for the probit model
> ChiSq1(fitmod2$coef, Xmat1[,1:2], F)
[1] 5.50925
### Barely non-significant but not the min!
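## An aside on the p-value 1-pchisq(5.48,2): with 2 degrees of freedom
## the chi-squared distribution is Exponential with mean 2, so the
## upper-tail probability has the closed form exp(-x/2). A quick
## Python check (not part of the R session):

```python
import math

def chisq2_pvalue(x):
    """Upper-tail probability P(X > x) for X ~ chi-squared(2).
    With 2 df the chi-squared density is (1/2)*exp(-x/2), i.e.
    Exponential(rate 1/2), so the survival function is exp(-x/2)."""
    return math.exp(-x / 2.0)

p = chisq2_pvalue(5.48)   # matches R's 1-pchisq(5.48,2) = 0.06457035
```

## The same shortcut gives the probit p-values quoted below, e.g.
## chisq2_pvalue(5.471) rounds to 0.0649.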
> optim(fitmod2$coef[1:2], ChiSq1, Xmat=Xmat1[,1:2], logit=F)[1:2]
$par
(Intercept)       Start
  0.6055596  -0.1347709
$value
[1] 5.471162
### p-value 1-pchisq(5.471,2) = 0.0649

> optim(c(fitmod2$coef[1:2],0), ChiSq1, Xmat=Xmat1[,1:3], logit=F)[1:2]  ## uses Age
$par
(Intercept)        Start
0.179369670 -0.147975325  0.006260119
$value
[1] 2.637998
### Now slightly worse than logistic
### but still non-significant chi-sq
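## The only difference between the logistic and probit versions of
## ChiSq1 is the link: plogis vs pnorm. For reference, both CDFs are
## available without special libraries in Python via the error
## function (a sketch, not part of the R session):

```python
import math

def pnorm(t):
    """Standard normal CDF (R's pnorm), via the error function:
    Phi(t) = (1/2) * (1 + erf(t / sqrt(2)))."""
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))

def plogis(t):
    """Logistic CDF (R's plogis)."""
    return 1.0 / (1.0 + math.exp(-t))

# e.g. fitted response probability under the min-chi-square probit fit
# above, at a hypothetical Start value of 5:
p_probit = pnorm(0.6055596 - 0.1347709 * 5)
```

## Swapping pnorm for plogis in the chi-square criterion is all that
## the logit=F argument of ChiSq1 does.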