Stat 705 HW 15 Solution
=======================

> kyphosis = read.table(
     "http://www-users.math.umd.edu/~evs/s798c/Data/kyph2.dat",
     header=T, skip=8)
> table(as.numeric(kyphosis$Kyphosis)-1)
 0  1
64 17
> kyphosis$Kyphosis = as.numeric(kyphosis$Kyphosis)-1
> kyphosis[1:5,]
  Id Kyphosis Age Number Start
1  1        0  71      3     5
2  2        0 158      3    14
3  3        1 128      4     5
4  4        0   2      5     1
5  5        0   1      4    15

> fitmod1 = step(glm(Kyphosis ~ 1, family=binomial, data=kyphosis),
     Kyphosis ~ Age + Number + Start + I(Age^2), k=2)
### all variables are entered, with k=2, but Number is not significant
> round(summary(fitmod1)$coef, 3)
            Estimate Std. Error z value Pr(>|z|)
(Intercept)   -4.384      2.055  -2.133    0.033
Start         -0.204      0.071  -2.883    0.004
Number         0.427      0.237   1.805    0.071
Age            0.082      0.035   2.364    0.018
I(Age^2)       0.000      0.000  -2.082    0.037
### NOTE that only "Start" is significant if I(Age^2) is not within
### scope. Another way to say this: with k=3.84 and scope given only
### by the Number + Start + Age model, no predictor variable other
### than "Start" would be entered!

> round(summary(update(fitmod1, .~.-Number))$coef, 3)
            Estimate Std. Error z value Pr(>|z|)
(Intercept)   -1.737      1.288  -1.348    0.178
Start         -0.234      0.069  -3.403    0.001
Age            0.074      0.031   2.386    0.017
I(Age^2)       0.000      0.000  -2.105    0.035
> fitmod1 = update(fitmod1, .~.-Number)
> table(kyphosis$Kyphosis, trunc(5*fitmod1$fit))

     0  1  2  3  4
  0 49 10  0  4  1
  1  3  5  4  3  2
> tabfit1 = .Last.value
> round(tabfit1[2,]/(tabfit1[1,]+tabfit1[2,]), 4)
     0      1      2      3      4
0.0577 0.3333 1.0000 0.4286 0.6667
### This expresses the observed frequency of Kyphosis response
### within each of the 5 covariate-defined categories.
### Since these are categories defined by intervals of successively
### larger fitted values, it makes sense that the larger the category
### number, the larger the observed frequency of response. That is
### what we see except in the very strange category 2 !!
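## The R idiom table(Kyphosis, trunc(5*fit)) bins the fitted
## probabilities into fifths and tabulates observed responses per bin.
## The same bookkeeping can be sketched outside R, e.g. in Python
## (the fitted values below are made up for illustration only):

```python
def bin_index(p, nbins=5):
    """Mimic R's trunc(nbins*p): map a fitted probability to a bin 0..nbins-1.
    (We clip p == 1.0 into the top bin, where trunc alone would give nbins.)"""
    return min(int(nbins * p), nbins - 1)

def observed_freq_by_bin(fitted, response, nbins=5):
    """Observed response frequency within each fitted-value bin."""
    hits = [0] * nbins   # responses (y = 1) per bin
    tot = [0] * nbins    # all observations per bin
    for p, y in zip(fitted, response):
        j = bin_index(p, nbins)
        tot[j] += 1
        hits[j] += y
    return [h / t if t else float("nan") for h, t in zip(hits, tot)]

# hypothetical fitted probabilities and 0/1 responses
freqs = observed_freq_by_bin([0.05, 0.15, 0.45, 0.9], [0, 1, 1, 1])
```

## A well-calibrated fit should give frequencies that increase with
## the bin number, which is exactly the informal check made above.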
## This is a setting with 5 cells, so 5 degrees of freedom, but
## then 4 parameters were estimated, so there is just 1 remaining df.
## If we take the observed numbers in the 5 categories as fixed,

> nobs = table(trunc(5*fitmod1$fit))
 0  1  2  3  4
52 15  4  7  3

### then the model expected and the observed numbers of positive
### responses are respectively:

> round(rbind(expected= aggregate(fitmod1$fit,
       list(trunc(5*fitmod1$fit)), sum)$x,
     observed= aggregate(kyphosis$Kyphosis,
       list(trunc(5*fitmod1$fit)), sum)$x), 2)
         [,1] [,2] [,3] [,4] [,5]
expected 3.75 4.02 1.94 4.83 2.45
observed 3.00 5.00 4.00 3.00 2.00
### really bad in category 2
> counts1 = .Last.value
> sum((counts1[1,]-counts1[2,])^2*(1/counts1[1,]+1/(nobs-counts1[1,])))
[1] 7.422915
### Looks large! Not a good model, even though this is not a formal test...
### Probably the model with Number left in is better.
### Nevertheless, this is all I meant you to do for this exercise.

## Next follow the corresponding steps for the probit model:

> fitmod2 = step(glm(Kyphosis ~ 1, family=binomial(link="probit"),
     data=kyphosis), Kyphosis ~ Age + Number + Start + I(Age^2), k=3.84)
### Now only "Start" is included in the model:
> round(summary(fitmod2)$coef, 3)
            Estimate Std. Error z value Pr(>|z|)
(Intercept)    0.547      0.382   1.434    0.152
Start         -0.132      0.034  -3.825    0.000

### With this model, we can imitate the previous informal
### "goodness of fit" analysis:

> table(kyphosis$Kyphosis, trunc(5*fitmod2$fit))

     0  1  2  3
  0 49  7  4  4
  1  5  3  6  3
> tabfit2 = .Last.value
> round(tabfit2[2,]/(tabfit2[1,]+tabfit2[2,]), 4)
     0      1      2      3
0.0926 0.3000 0.6000 0.4286
### strange because the highest predictor category
### does not actually have the highest observed
### relative response frequency, but the numbers are small!
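## The informal statistic computed above,
##     sum_j (E_j - O_j)^2 * (1/E_j + 1/(n_j - E_j)),
## can be sketched as a small Python function (not part of the R
## session; the numbers fed in below are the rounded expected/observed
## counts and cell sizes from the logistic fit above):

```python
def informal_chisq(expected, observed, nobs):
    """Informal goodness-of-fit statistic for binned binary data:
    sum over cells of (E_j - O_j)^2 * (1/E_j + 1/(n_j - E_j)),
    where E_j, O_j are expected and observed positive responses in
    cell j and n_j is the total count in cell j (so n_j - E_j is the
    expected number of non-responses)."""
    return sum((e - o) ** 2 * (1.0 / e + 1.0 / (n - e))
               for e, o, n in zip(expected, observed, nobs))

# rounded counts from the logistic-model table above
stat = informal_chisq([3.75, 4.02, 1.94, 4.83, 2.45],
                      [3, 5, 4, 3, 2],
                      [52, 15, 4, 7, 3])   # reproduces ~7.42
```

## This reproduces the 7.42 value, most of it contributed by the
## strange category 2 (expected 1.94 vs observed 4).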
> round(rbind(expected= aggregate(fitmod2$fit,
       list(trunc(5*fitmod2$fit)), sum)$x,
     observed= aggregate(kyphosis$Kyphosis,
       list(trunc(5*fitmod2$fit)), sum)$x), 2)
         [,1] [,2] [,3] [,4]
expected 4.99 2.54 4.66 4.53
observed 5.00 3.00 6.00 3.00
### Not too bad:
> counts2 = .Last.value
> nobs2 = table(trunc(5*fitmod2$fit))
> sum((counts2[1,]-counts2[2,])^2*(1/counts2[2,]+1/(nobs2-counts2[1,])))
[1] 2.462474
### This would be compared with chi-squared on 4-2 = 2 df.
## So it seems that the simpler probit model fits better !!
## (See below for more details and some formal tests.)

================================================================

The formal goodness of fit test is not so easy to do without
re-fitting the model a different and slightly more complicated way.
The complication arises because there are estimated parameters (the
coefficients in the model), which must be estimated by a minimum
chi-squared criterion (equivalent to maximizing the likelihood for the
COUNT data, i.e. the numbers of observations falling into the various
cells you define).

If you divide the population into categories based on the predictor
variables, creating say K cells based on predictors, then there will
altogether be 2*K cells based on predictors x response. For each
parameter vector beta, and each cell j based on predictors, you would
find the number R_j of responses and M_j of nonresponses for
individuals within that cell, and then you would find the
corresponding expected numbers, e.g. in the logistic model

    E_{j1}(beta) = sum of plogis(x_i*beta) over predictor vectors x_i
                   falling in cell j

and similarly E_{j0}(beta) = M_j + R_j - E_{j1}(beta). Note that
R_j - E_{j1}(beta) = -(M_j - E_{j0}(beta)). Then the minimum
chi-square would be the minimum over beta of

    sum_{j=1}^K (R_j - E_{j1}(beta))^2 * (1/E_{j1}(beta) + 1/E_{j0}(beta))

and this minimum would be a test statistic for goodness of fit which
would reject if larger than qchisq(1-alpha, K-p), where p is the
number of estimated coefficients in the model.
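## The criterion just described can be sketched in Python to make the
## bookkeeping concrete (toy data and names; the actual minimization
## over beta is done with optim() in the R session below):

```python
import math

def plogis(t):
    """Logistic CDF, R's plogis."""
    return 1.0 / (1.0 + math.exp(-t))

def chisq_beta(beta, X, cell, R, M):
    """Chi-square criterion from the text:
      E_{j1}(beta) = sum of plogis(x_i . beta) over rows i in cell j,
      E_{j0}(beta) = M_j + R_j - E_{j1}(beta),
      stat = sum_j (R_j - E_{j1})^2 * (1/E_{j1} + 1/E_{j0}).
    cell[i] gives the predictor-cell (0..K-1) of observation i;
    R, M are the per-cell response and non-response counts."""
    K = max(cell) + 1
    E1 = [0.0] * K
    for x, j in zip(X, cell):
        E1[j] += plogis(sum(a * b for a, b in zip(x, beta)))
    stat = 0.0
    for j in range(K):
        E0 = M[j] + R[j] - E1[j]
        stat += (R[j] - E1[j]) ** 2 * (1.0 / E1[j] + 1.0 / E0)
    return stat

# toy example: 4 observations, 2 cells, intercept + one predictor
X = [[1, 0], [1, 1], [1, 2], [1, 3]]
cell = [0, 0, 1, 1]
R, M = [1, 1], [1, 1]
stat0 = chisq_beta([0.0, 0.0], X, cell, R, M)  # beta = 0 fits these counts exactly
```

## Minimizing chisq_beta over beta would give the min-chi-square
## statistic, to be compared with a chi-squared(K-p) distribution.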
The degrees of freedom are K-p because there are 2K cells but K
constraints (i.e. that the total numbers of observations with
predictors in cell j are fixed at M_j + R_j) and p estimated
parameters.

============================================================

HERE IS AN IMPLEMENTATION OF THE FORMAL GOODNESS OF FIT TEST,
for the logistic fitted model.

## First we create 4 cells by cross-classifying Age by Start

> table(kyphosis$Age > 84, kyphosis$Start > 9)

        FALSE TRUE
  FALSE    14   26
  TRUE      9   32
> celnum = as.numeric(interaction(kyphosis$Age > 84, kyphosis$Start > 9))
> table(celnum)
celnum
 1  2  3  4
14  9 26 32
> celrsp = aggregate(kyphosis$Kyphosis, list(celnum), sum)$x
> celrsp
[1] 4 7 2 4

## Now we create a little function for the chi-square to
## minimize in fitmod1:

> Xmat1 = cbind( model.matrix(fitmod1), kyphosis$Number )
> mobs1 = table(celnum)
> ChiSq1 = function(beta, Xmat, logit=T) {
     Ebeta = aggregate(c( if(logit) plogis(Xmat %*% beta) else
            pnorm(Xmat %*% beta) ), list(celnum), sum)$x
     sum((celrsp-Ebeta)^2*(1/Ebeta + 1/(mobs1-Ebeta))) }

> ChiSq1(fitmod1$coef, Xmat1[,1:4])     ### looks OK, but 0 df !
[1] 1.207659
> optim(fitmod1$coef, ChiSq1, Xmat=Xmat1[,1:4])
### gets to 0, but for absurd parameter values

> optim(fitmod1$coef[1:3], ChiSq1, Xmat=Xmat1[,1:3])[1:2]   ## Omits Age^2
$par
(Intercept)       Start         Age
 0.22386528 -0.26165877  0.01217635
$value
[1] 2.284673
### OK, good for 1 df

### Here is another way to get 1 df: include only Start and Number
> optim(rep(0,3), ChiSq1, Xmat=Xmat1[,c(1,2,5)])[1:2]
$par
[1] -1.2925359 -0.1687925  0.3874084
$value
[1] 4.907799
### So for this chi-square with Age cells, better
### to use Age rather than Number in the model.

> optim(fitmod1$coef[1:2], ChiSq1, Xmat=Xmat1[,1:2])[1:2]   ## Uses only Start
> 1-pchisq(5.48,2)
[1] 0.06457035
### barely non-significant

### Let's do the same thing for the probit model
> ChiSq1(fitmod2$coef, Xmat1[,1:2], F)
[1] 5.50925
### Barely non-significant but not the min!
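## An aside on the p-value 1-pchisq(5.48,2): with 2 degrees of freedom
## the chi-squared distribution is Exponential with mean 2, so the
## upper-tail probability has the closed form exp(-x/2). A quick
## Python check (not part of the R session):

```python
import math

def chisq2_pvalue(x):
    """Upper-tail probability P(X > x) for X ~ chi-squared(2).
    With 2 df the chi-squared density is (1/2)*exp(-x/2), i.e.
    Exponential(rate 1/2), so the survival function is exp(-x/2)."""
    return math.exp(-x / 2.0)

p = chisq2_pvalue(5.48)   # matches R's 1-pchisq(5.48,2) = 0.06457035
```

## The same shortcut gives the probit p-values quoted below, e.g.
## chisq2_pvalue(5.471) rounds to 0.0649.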
> optim(fitmod2$coef[1:2], ChiSq1, Xmat=Xmat1[,1:2], logit=F)[1:2]
$par
(Intercept)       Start
  0.6055596  -0.1347709
$value
[1] 5.471162
### p-value 1-pchisq(5.471,2) = 0.0649

> optim(c(fitmod2$coef[1:2],0), ChiSq1, Xmat=Xmat1[,1:3], logit=F)[1:2]  ## uses Age
$par
(Intercept)        Start
0.179369670 -0.147975325  0.006260119
$value
[1] 2.637998
### Now slightly worse than logistic
### but still non-significant chi-sq
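## The only difference between the logistic and probit versions of
## ChiSq1 is the link: plogis vs pnorm. For reference, both CDFs are
## available without special libraries in Python via the error
## function (a sketch, not part of the R session):

```python
import math

def pnorm(t):
    """Standard normal CDF (R's pnorm), via the error function:
    Phi(t) = (1/2) * (1 + erf(t / sqrt(2)))."""
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))

def plogis(t):
    """Logistic CDF (R's plogis)."""
    return 1.0 / (1.0 + math.exp(-t))

# e.g. fitted response probability under the min-chi-square probit fit
# above, at a hypothetical Start value of 5:
p_probit = pnorm(0.6055596 - 0.1347709 * 5)
```

## Swapping pnorm for plogis in the chi-square criterion is all that
## the logit=F argument of ChiSq1 does.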