Log for Logistic Regression Analysis of Pima Data
=================================================
Stat 430, 12/6/06   (data can be found in the UCI repository,
which has a link on the course web-page).

options linesize=70 nocenter nodate;

data pima0 ;
   infile "ASCdata/pima.dat";
   input Obs pregnant glucose diastolic triceps insulin
         bmi diabetes age Diab ;
   if obs ne . ;
run;              /* 768 by 10 dataset */

proc corr data=pima0;
   var diab;
   with pregnant diabetes;
run;

   CORRELATIONS  Diab with:

        pregnant      0.22190      <.0001
        diabetes      0.17384      <.0001

### Both highly significant, although the correlations
### themselves are modest in size.

## I tried to analyze "pregnant", the number of pregnancies
## (ranging from 0 to 17), first by re-coding it to 3 categories
## (0, 1-3, and 4+) in steps not shown here (a sketch of such a
## recode is given below, after the stepwise-selection notes),
## but that actually weakened the analysis.

proc logistic data=pima0;
   model Diab = pregnant glucose diastolic triceps insulin
                bmi diabetes age;
run;

          Analysis of Maximum Likelihood Estimates

                               Standard       Wald
Parameter    DF    Estimate       Error    Chi-Square  Pr > ChiSq

Intercept     1      8.4047      0.7166      137.5452      <.0001
pregnant      1     -0.1232      0.0321       14.7466      0.0001
glucose       1     -0.0352     0.00371       89.8965      <.0001
diastolic     1      0.0133     0.00523        6.4537      0.0111
triceps       1    -0.00062     0.00690        0.0080      0.9285
insulin       1     0.00119    0.000901        1.7485      0.1861
bmi           1     -0.0897      0.0151       35.3467      <.0001
diabetes      1     -0.9452      0.2991        9.9828      0.0016
age           1     -0.0149     0.00933        2.5372      0.1112

### Only triceps and insulin look unimportant.

### Now let's try an automatic selection.

data pima1;  set pima0;
   agesq = age*age;
run;

proc logistic data=pima1;
   model Diab = agesq pregnant glucose diastolic insulin bmi
        diabetes age  pregnant|glucose pregnant|diastolic
        pregnant|insulin pregnant|bmi pregnant|diabetes
        pregnant|age  glucose|diastolic glucose|insulin
        glucose|bmi glucose|diabetes glucose|age
        diastolic|insulin diastolic|bmi diastolic|diabetes
        diastolic|age  insulin|bmi insulin|diabetes insulin|age
        bmi|diabetes bmi|age  diabetes|age / stepwise ;
run;

               Summary of Stepwise Selection

       Effect                        Number     Score
Step   Entered              DF       In      Chi-Square  Pr > ChiSq

  1    glucose               1        1        167.1922      <.0001
  2    bmi                   1        2         34.3033      <.0001
  3    pregnant              1        3         27.3305      <.0001
  4    diabetes              1        4          9.6773      0.0019
  5    glucose*diabetes      1        5          9.5047      0.0020
  6    diastolic             1        6          6.0955      0.0136

          Analysis of Maximum Likelihood Estimates

                                    Standard      Wald
Parameter          DF   Estimate      Error    Chi-Square  Pr > ChiSq

Intercept           1     9.7526     0.9461     106.2493       <.0001
pregnant            1    -0.1565     0.0281      30.9176       <.0001
glucose             1    -0.0479    0.00580      68.2077       <.0001
diastolic           1     0.0124    0.00509       5.9655       0.0146
bmi                 1    -0.0879     0.0143      37.6694       <.0001
diabetes            1    -4.1489     1.1177      13.7791       0.0002
glucose*diabetes    1     0.0250    0.00814       9.4219       0.0021

### These stepwise, forward, or backward selections do require
### some care, since the order in which terms are listed in the
### MODEL statement partly determines the stepwise trajectory.
### There are also options (here and also in PROC REG) for
### forcing the inclusion of some terms in the model: the
### syntax --- in both PROCs --- is (after your SELECTION=
### choice) the option INCLUDE=k, which ensures that the first
### k model terms you have listed are kept in all models fitted.
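### A sketch of the 3-category recode of PREGNANT mentioned
### above. The exact steps were not shown in the log, so this is
### only a minimal version; PREGCAT is a name introduced here
### just for illustration:

data pima2;  set pima0;
   if pregnant = 0 then pregcat = 0;            /* none      */
   else if pregnant <= 3 then pregcat = 1;      /* 1-3       */
   else pregcat = 2;                            /* 4 or more */
run;

### PREGCAT could then replace PREGNANT in the MODEL statement
### (declared in a CLASS statement so that PROC LOGISTIC treats
### it as categorical); as noted above, though, this re-coding
### actually weakened the analysis.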
### After fitting a similar stepwise-selected logistic regression
### using different software (R), and doing some more interactive
### steps looking at the significance of individual coefficients,
### I found the following as the "best" model:

> glm3$call
glm(formula = cbind(Diab, 1 - Diab) ~ glucose + diastolic +
    bmi + diabetes + age + I(age^2) + glucose:diabetes +
    diastolic:age, family = binomial, data = pima0)

> summary(glm3)$coef
                     Estimate   Std. Error   z value     Pr(>|z|)
(Intercept)      -12.464859018 1.7075468315 -7.299864 2.880595e-13
glucose            0.045881317 0.0059454537  7.717042 1.190603e-14
diastolic         -0.054205074 0.0196112980 -2.763972 5.710248e-03
bmi                0.080127736 0.0145553959  5.505019 3.691278e-08
diabetes           3.704377000 1.1319492614  3.272565 1.065765e-03
age                0.261892820 0.0605939160  4.322098 1.545528e-05
I(age^2)          -0.003942011 0.0006835604 -5.766881 8.075205e-09
glucose:diabetes  -0.022027030 0.0082891666 -2.657327 7.876290e-03
diastolic:age      0.001218600 0.0005808102  2.098104 3.589600e-02

> table(pima0$Diab)
  0   1
500 268

> sum(glm3$fit[pima0$Diab==0] >= 0.5)/500
[1] 0.132          #### Easier to "predict" nondiabetic
> sum(glm3$fit[pima0$Diab==1] <= 0.5)/268
[1] 0.3731343      ### than diabetic

-------------------------------------------------------------------
### Now we try to reproduce this selection search in SAS, using
### the PROC LOGISTIC selection option with INCLUDE=8 (the first
### 8 effects listed below are those in the R model):

proc logistic data=pima1;
   model Diab = glucose diastolic bmi diabetes age agesq
        glucose|diabetes diastolic|age  pregnant insulin
        pregnant|glucose pregnant|diastolic pregnant|insulin
        pregnant|bmi pregnant|diabetes pregnant|age
        glucose|diastolic glucose|insulin glucose|bmi glucose|age
        diastolic|insulin diastolic|bmi diastolic|diabetes
        insulin|bmi insulin|diabetes insulin|age
        bmi|diabetes bmi|age  diabetes|age
        / stepwise include=8 lackfit;
run;

            Model Fit Statistics

                               Intercept
               Intercept          and
Criterion        Only          Covariates

AIC              995.484         711.687
SC              1000.128         753.482
-2 Log L         993.484         693.687

The LOGISTIC Procedure

     Testing Global Null Hypothesis: BETA=0

Test                  Chi-Square     DF    Pr > ChiSq

Likelihood Ratio        299.7965      8        <.0001
Score                   250.2776      8        <.0001
Wald                    177.2121      8        <.0001

       Residual Chi-Square Test

Chi-Square     DF    Pr > ChiSq
   32.2487     21        0.0552

NOTE: No (additional) effects met the 0.05 significance level
      for entry into the model.

### This says that the model found in R above still looks best
### when using SAS with the SELECTION= and INCLUDE= options.

### The LACKFIT option causes the Hosmer-Lemeshow goodness-of-fit
### statistic to be calculated:

   Hosmer and Lemeshow Goodness-of-Fit Test

   Chi-Square     DF    Pr > ChiSq
       8.9515      8        0.3464

### which indicates no reason to reject the fitted model.
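### For comparison with the R error rates computed above, here is
### a minimal sketch of the same mis-classification computation
### in SAS; the dataset name PREDS and variable name PHAT are
### choices made just for this illustration:

proc logistic data=pima1;
   model Diab = glucose diastolic bmi diabetes age agesq
                glucose*diabetes diastolic*age;
   output out=preds predicted=phat;
run;

data _null_;  set preds end=eof;
   /* By default PROC LOGISTIC models the probability of the
      LOWER response level, so PHAT estimates Pr(Diab=0) ---
      note the sign-flip between the SAS and R coefficient
      estimates above --- and the classification rule is
      reversed relative to the R computation. */
   if Diab = 0 then do;  n0 + 1;  if phat <  0.5 then err0 + 1;  end;
   else             do;  n1 + 1;  if phat >= 0.5 then err1 + 1;  end;
   if eof then do;
      rate0 = err0/n0;   rate1 = err1/n1;
      put rate0= rate1= ;    /* should roughly match 0.132, 0.373 */
   end;
run;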
==============================================================
Additional script on the output of PROC LOGISTIC using the
alternative R/N (events/trials) MODEL-statement syntax.
==============================================================

libname home ".";

data home.simlogist (drop=i);
   seed = 55733;
   do i=1 to 600;
      x1 = ranuni(seed);
      x2 = ranbin(seed,1,.5);
      ncas = ranpoi(seed,8);
        /* Here I made the number of "trials" at each observation
           a Poisson(8) distributed r.v. NCAS, and Y will be the
           number of successes */
      y = ranbin(seed, ncas, 1/(1+exp(-1.5 + 0.3*x1 + 0.5*x2)));
        /* Now we have y = # successes ranging from 0 to ncas */
      output;
   end;
run;

options nodate nocenter linesize=70;

proc logistic data=home.simlogist;
   model y/ncas = x1 x2 x1|x2 ;
run;

### The output included the following:

            Model Fit Statistics

                               Intercept
               Intercept          and
Criterion        Only          Covariates

AIC             5479.608        5444.085    ### = -2*logL + 2*p
SC              5486.084        5469.990
-2 Log L        5477.608        5436.085
   ...

The LOGISTIC Procedure

          Analysis of Maximum Likelihood Estimates

                               Standard       Wald
Parameter    DF    Estimate       Error    Chi-Square  Pr > ChiSq

Intercept     1      1.4100      0.1007      196.0715      <.0001
x1            1     -0.2637      0.1670        2.4931      0.1143
x2            1     -0.4794      0.1356       12.4902      0.0004
x1*x2         1      0.1319      0.2269        0.3382      0.5609

### The display-table of coefficient estimates, SE's, and
### significance-test statistics looks very much the same when
### we use the R/N MODEL-statement syntax in PROC LOGISTIC.
### Here the "Estimate" column contains the estimated beta-
### coefficients, with SE's in the next column. The "Wald
### Chi-Square" statistics have 1 degree of freedom and are
### simply the squares of the "studentized" ratios Estimate/SE,
### and the ChiSq p-values in the final column are then the SAME
### as the (2-sided) normal-distribution p-values for the
### studentized Estimates.

### In this case, although we created x1 and x2 to have nonzero
### true coefficients (-0.3 and -0.5), x1 looks INsignificant
### because the insignificant interaction term x1*x2 is included
### in the model. To get rid of the insignificant terms, try
### backward selection:

proc logistic data=home.simlogist;
   model y/ncas = x1 x2 x1|x2 / selection=backward sls=.10;
run;

Step 1. Effect x1*x2 is removed:
   ...
NOTE: No (additional) effects met the 0.1 significance level for
      removal from the model.

          Summary of Backward Elimination

        Effect          Number       Wald
Step    Removed     DF    In      Chi-Square   Pr > ChiSq

  1     x1*x2        1     2         0.3382        0.5609

          Analysis of Maximum Likelihood Estimates

                               Standard       Wald
Parameter    DF    Estimate       Error    Chi-Square  Pr > ChiSq

Intercept     1      1.3726      0.0769      318.2839      <.0001
x1            1     -0.1923      0.1130        2.8958      0.0888
x2            1     -0.4107      0.0666       38.0016      <.0001

### Note that although x1 is more significant than before once
### x1*x2 is removed, x1 would ALSO have been removed if we had
### taken SLS=.05 .
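### As a quick numerical check of the Wald/p-value relationship
### described above, a minimal sketch using the x1 row of the
### last table (the variable names here are arbitrary):

data _null_;
   est = -0.1923;   se = 0.1130;
   wald = (est/se)**2;            /* = 2.896, matching the printed
                                     2.8958 up to rounding of the
                                     displayed est and se          */
   p_chi  = 1 - probchi(wald, 1);       /* chi-square(1) p-value   */
   p_norm = 2*(1 - probnorm(abs(est/se)));  /* 2-sided normal p    */
   put wald= p_chi= p_norm= ;     /* both p's = 0.089, as printed  */
run;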