DATA ANALYSIS LOG IN R USING GLM
===================================
(I) (10/28/09) Begin by getting syntax for different
"family" and "link" choices.
### Start with basic syntax for "binomial" data, i.e.
### data which comes in batches of "successes", and
### "failures" under each of a number of experimental
### settings summarized through predictor "covariates"
### here written as X1 and X2:
> X1 = runif(120); X2 = rpois(120,3)
ntrials = rep(c(5,6,3), 40)
success = rbinom( 120, ntrials, plogis(-1 + 0.6*X1 + 0.1*X2) )
failure = ntrials - success
simlogis = data.frame( X1, X2, success, failure )
### So the data looks like
X1 X2 success failure
1 0.2120602 3 3 2
2 0.7260127 1 2 4
3 0.1604437 3 1 2
4 0.4431811 4 0 5
5 0.8397253 5 5 1
...
### "Logistic regression" should give intercept -1, and
### coefs for X1 and X2 respectively 0.6, 0.1
> fitB0 = glm ( cbind(success, failure) ~ X1 + X2,
family=binomial, data=simlogis )
### To display coef's, you get much more info from
### summary(fitB0)$coef than from fitB0$coef
> round( summary(fitB0)$coef, 4)
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.1342 0.2400 -4.7259 0.0000
X1 0.8906 0.3188 2.7940 0.0052
X2 0.0998 0.0566 1.7627 0.0780
#### NOT BAD, but not precise as you can see from Std.Error
### Here the model was the correct one; but for "probit"
### regression which puts success probability equal to
### pnorm( a + b1*X1 + b2*X2 ) , an incorrect model,
### get a, b1 and b2 nearly the same after rescaling,
### since the scale of the logistic is different
### from the scale of the normal by the factor (ratio of sd's)
> sqrt( 1 / integrate(function(x) x^2*dlogis(x), -Inf, Inf)$val )
[1] 0.5513289
### Now fit "probit" regression:
> fitB0.probit = glm ( cbind(success, failure) ~ X1 + X2,
family=binomial(link="probit"), data=simlogis )
> round( rbind(probit= fitB0.probit$coef,
logit.rescal= fitB0$coef*0.5513), 4)
(Intercept) X1 X2
probit -0.7045 0.5487 0.0628
logit.rescal -0.6253 0.4910 0.0550
### Two sets of coef's close after rescaling
-------------------------------------------------------------
(II) Binomial data often comes in binary form, with ntrials = 1.
### Use data "rats" from "survival" package in R.
### Data on 150 rats contain identifying "litter",
### "rx" (indicator of injection of drug after initial
### administration of carcinogen), time in days on
### study (ignored initally, and "status" which is
### indicator of tumor, our binary response variable.
> library(survival)
> fitB1 = glm(cbind(status,1-status) ~ rx, family=binomial,
data = rats)
### NOTE you can use the column headers as variable names
### if you specify the data-frame using "data="
> fitB1
Call: glm(formula = cbind(status, 1 - status) ~ rx, family = binomial, data = rats)
Coefficients:
(Intercept) rx
-1.450 1.127
Degrees of Freedom: 149 Total (i.e. Null); 148 Residual
Null Deviance: 174
Residual Deviance: 165.3 AIC: 169.3
> summary(fitB1)$coef
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.450010 0.2549063 -5.688404 1.282324e-08
rx 1.127237 0.3835089 2.939272 3.289845e-03
### Coefficients fitted by MLE, link="logit" is default for
### binomial-family data: this is logistic regression
## Std.error found as sqrt of diagonal in variance:
> sqrt(diag(summary(fitB1)$cov.scaled))
(Intercept) rx
0.2549063 0.3835089
### "Deviance" = "Residual Deviance" = -2*logLik
> c(fitB1$deviance, -2*logLik(fitB1))
[1] 165.2738 165.2738
### "Null.deviance" is -2 times the logLik
### for the same model with only a const term
> c(fitB1$null.dev,
-2*logLik(update(fitB1, formula = .~ 1))
[1] 173.9746 173.9746
### NOTE the use of the "update" function to refit a model
### of the same type as the one previously done: in the
### model formula, the left-hand side term is a .
### to indicate that it is the same as before. We have
### changed the right-hand side from rx (which automatically
### included intercept along with rx predictor)
### to one with "1" or intercept alone.
### Next use "update" to change the fitB1 model not in its
### model formula but in its specified link within
### the binomial "family".
> fitB2 = update(fitB1, family=binomial(link="probit"))
rbind(logit= fitB1$coef, probit= fitB2$coef,
rescal.probit = fitB2$coef/0.5513)
(Intercept) rx
logit -1.4500102 1.1272368
probit -0.8778963 0.6760028
rescal.probit -1.5924112 1.2261977
### RECALL THAT WE USE DEVIANCES AND DIFFERENCES BETWEEN
### THEM BECAUSE WILKS' THEOREM says that 2 times the
### difference between logLik for a model with p extra
### parameter dimensions versus logLik for a base model
### is equal to approximately a chi-square p df variate
### when the base model is the true one !
### LET's use this idea to examine the quality of the model
### with predictor log(time) in the model along with rx
### NOTE that we must put the expression log(time)
### within I( ) to have it evaluated and constructed as
### a new predictor within the glm fitting function
> fitB3 = update(fitB1, formula= . ~ . + I(log(time)))
### the "data" is the same now as in fitB1, so "time"
### is the column in the data-frame, and a log(time)
### predictor is created under "glm"
### It was reasonable to use "time" as a predictor, partly
### because the range of times is not too different for rats who
### generate tumors, so "time" is not really a response variable.
> summary(rats$time[rats$status==1])
Min. 1st Qu. Median Mean 3rd Qu. Max.
34.00 66.75 80.00 77.28 92.50 104.00
> summary(rats$time[rats$status==0])
Min. 1st Qu. Median Mean 3rd Qu. Max.
45.00 83.50 104.00 93.85 104.00 104.00
> cbind(rats[1:10,], model.matrix(fitB3)[1:10,])
litter rx time status (Intercept) rx I(log(time))
1 1 1 101 0 1 1 4.615121
2 1 0 49 1 1 0 3.891820
3 1 0 104 0 1 0 4.644391
4 2 1 104 0 1 1 4.644391
5 2 0 102 0 1 0 4.624973
6 2 0 104 0 1 0 4.644391
7 3 1 104 0 1 1 4.644391
8 3 0 104 0 1 0 4.644391
9 3 0 104 0 1 0 4.644391
10 4 1 77 0 1 1 4.343805
### the first 4 columns are the original "rats" data
### the last 3 are the design matrix columns created by glm
### So we can look at the new model fit for significance of coefs
> summary(fitB3)$coef
Estimate Std. Error z value Pr(>|z|)
(Intercept) 17.868661 4.4924813 3.977459 6.965559e-05
rx 1.194019 0.4284626 2.786751 5.323935e-03
I(log(time)) -4.355407 1.0180848 -4.278040 1.885462e-05
### or alternatvely compare deviances or logLik's with fitB1
> c(2*(logLik(fitB3)-logLik(fitB1)), fitB1$dev-fitB3$dev)
[1] 24.37307 24.37307
### this is LRT stat to be compared with chisq 1 df
> 1-pchisq(24.373,1)
[1] 7.937339e-07 ## still highly significant but
### somewhat different p-value from the I(log(time)) coef
### probably because the model is still far from the
### right one.
### We could try to enter additional terms like log(time)^2
### or rx * log(time)
> fitB4 = update(fitB3, .~. + I(rx*log(time)) + I(log(time)^2))
Estimate Std. Error z value Pr(>|z|)
(Intercept) -9.784968 52.690740 -0.1857056 0.85267560
rx -13.501206 8.732837 -1.5460274 0.12209794
I(log(time)) 10.179730 24.364938 0.4178024 0.67609159
I(rx * log(time)) 3.324780 1.973389 1.6848069 0.09202582
I(log(time)^2) -1.871215 2.819823 -0.6635932 0.50695069
### This time, the new variables do NOT look significant
### which we can check also through deviances:
> fitB3$dev - fitB4$dev
[1] 3.901065 ### to be compared with chisq 2df, so is
### not at all significant
### We can also do these deviance comparisons all at once
### by looking at an "analysis of deviance" table
> anova(fitB4)
Analysis of Deviance Table
Model: binomial, link: logit
Response: cbind(status, 1 - status)
Terms added sequentially (first to last)
Df Deviance Resid. Df Resid. Dev
NULL 149 173.975
rx 1 8.701 148 165.274
I(log(time)) 1 24.373 147 140.901
I(rx * log(time)) 1 3.478 146 137.422
I(log(time)^2) 1 0.423 145 137.000
### As in an anova table, in which RSS replaced Deviance
### these "Deviance values" are the amounts by which the
### current model-fitting line decreases the deviance:
### (recall that fitB1 uses rx only, fitB3 augments by log(time),
### and fitB4 by log(time) plus the 2 additional rx*log(tim)
### and log(time)^2 terms
> Devs = c(fitB1$null.dev, fitB1$dev, fitB3$dev,
update(fitB3, .~.+I(rx*log(time)))$dev, fitB4$dev)
> Devs
[1] 173.9746 165.2738 140.9007 137.4223 136.9997
> round ( -diff(Devs) , 3 ) ### successive differences of llks
[1] 8.701 24.373 3.478 0.423 ### give Deviance col in "anova"
------------------------------------------------
(III) Now look at a descriptive check of model fit within the last
series of models. We can plot for each litter the observed versus
predicted mean "status) (ie mean tumor proportion):
## First we can group the fitted and observed data into 50x3 matrices
## so that we can create sums over the 3 members of each litter
> fitmn = c( matrix(fitB3$fitted, ncol=3) %*% c(1,1,1)/3 )
obsmn = c( matrix(rats$status, ncol=3) %*% c(1,1,1)/3 )
plot(fitmn, obsmn)
### the picture shows generally increasing relationship, but there
### is still only a weak relationship between fitted and obs.
> table(obsmn*3) ### there are lots of litter with 0 tumors,
0 1 2 3 ### and 1 litter with 3
19 23 7 1
### Clearly there is still a lot difference across litters !
### Here is one more idea for descriptively plotting fit vs observation:
### we can assemble litters into bins according to their predictors
### and then plot predictor versus observed tumor prop for the group.
### hisotgram of logit values qlogis(fitB3$fitted) suggests grouping
### into intervals with breaks c(-2.5,-2.0, -1.5, -1.0, -0.5, 0, 4)
> hist(qlogis(fitB3$fitted), breaks=c(-2.5,-2.0, -1.5, -1.0, -0.5, 0, 4),
plot=F)$counts
[1] 54 10 35 16 16 19
### Define groups mathematically:
> grp = ifelse(qlogis(fitB3$fitted) > 0, 6,
1+trunc(2*qlogis(fitB3$fitted)+5))
grp
1 2 3 4 5 6
54 10 35 16 16 19
### The following command produces a table of fitted and
### status values summed by group.
> aggregate.data.frame(cbind.data.frame(fit=round(fitB3$fitted,2),
obs = rats$status), list(grp=grp), sum)
grp fit obs ### so the fit is actually not bad !!
1 1 4.95 3
2 2 1.39 0
3 3 8.32 12 ### although we can see in grp 3 that
4 4 5.24 5 ### the model is not very precise.
5 5 6.79 8
6 6 13.53 12
-----------------------------------------------------------------
### One way to see "litter" differences is enter "litter" as a
### fixed or random effect; another would be to fit a model
### using "conditional logistic regression" (clogit in R)
### which allows the nuisance parameter litter-effect to cancel
### away by conditioning on the sufficient statistic = total number
### of tumors for each litter.
> summary(glm(cbind(status,1-status) ~ rx + time + factor(litter),
family=binomial, data=rats))$coef[1:5,]
Estimate Std. Error z value Pr(>|z|)
(Intercept) 9.6333910 3.626144e+00 2.656649127 0.0078921537
rx 3.4784372 9.711567e-01 3.581746504 0.0003413049
time -0.1517933 3.951368e-02 -3.841537689 0.0001222660
factor(litter)2 -16.6354770 9.136262e+03 -0.001820819 0.9985471977
factor(litter)3 -16.6106806 9.070080e+03 -0.001831371 0.9985387783
...
### estimates wildly different from those before: unreliable because
### the number of parameters is now increased by 49 (= # litters -1)
### Here is the "conditional logistic" fit (conditioned on litter totals)
> clogit(status ~ rx + I(log(time)) + strata(litter), data=rats, method="exact")
Call:
clogit(status ~ rx + I(log(time)) + strata(litter), data = rats,
method = "exact")
coef exp(coef) se(coef) z p
rx 1.97 7.176830 0.68 2.90 0.0038
I(log(time)) -6.91 0.000996 2.50 -2.77 0.0056
Likelihood ratio test=28.6 on 2 df, p=6.2e-07 n= 150
### Tells that the coef's are significant and roughly like those
### fitted in glm, but still goodness of fit is hard to assess.
> fitB3$coef
(Intercept) rx I(log(time))
17.868661 1.194019 -4.355407