Log for Logistic Regression Analysis of Pima Data
=================================================
Stat 430, 12/6/06   (data can be found in the UCI repository,
which has a link on the course web-page).

options linesize=70 nocenter nodate;

data pima0 ;
   infile "ASCdata/pima.dat";
   input Obs pregnant glucose diastolic triceps insulin
         bmi diabetes age Diab ;
   if obs ne . ;
run;              /* 768 by 10 dataset */

proc corr data=pima0;
   var diab;
   with pregnant diabetes;
run;

   CORRELATIONS  Diab with:

        pregnant      0.22190      <.0001
        diabetes      0.17384      <.0001

### Both highly significant, although the correlations
### themselves are modest in size.

## I tried to analyze "pregnant", the number of pregnancies
## (ranging from 0 to 17), first by re-coding it to 3 categories
## (0, 1-3, and 4+) in steps not shown here (a sketch of such a
## recode is given below, after the stepwise-selection notes),
## but that actually weakened the analysis.

proc logistic data=pima0;
   model Diab = pregnant glucose diastolic triceps insulin
                bmi diabetes age;
run;

          Analysis of Maximum Likelihood Estimates

                               Standard       Wald
Parameter    DF    Estimate       Error    Chi-Square  Pr > ChiSq

Intercept     1      8.4047      0.7166      137.5452      <.0001
pregnant      1     -0.1232      0.0321       14.7466      0.0001
glucose       1     -0.0352     0.00371       89.8965      <.0001
diastolic     1      0.0133     0.00523        6.4537      0.0111
triceps       1    -0.00062     0.00690        0.0080      0.9285
insulin       1     0.00119    0.000901        1.7485      0.1861
bmi           1     -0.0897      0.0151       35.3467      <.0001
diabetes      1     -0.9452      0.2991        9.9828      0.0016
age           1     -0.0149     0.00933        2.5372      0.1112

### Only triceps and insulin look unimportant.

### Now let's try an automatic selection.

data pima1;  set pima0;
   agesq = age*age;
run;

proc logistic data=pima1;
   model Diab = agesq pregnant glucose diastolic insulin bmi
        diabetes age  pregnant|glucose pregnant|diastolic
        pregnant|insulin pregnant|bmi pregnant|diabetes
        pregnant|age  glucose|diastolic glucose|insulin
        glucose|bmi glucose|diabetes glucose|age
        diastolic|insulin diastolic|bmi diastolic|diabetes
        diastolic|age  insulin|bmi insulin|diabetes insulin|age
        bmi|diabetes bmi|age  diabetes|age / stepwise ;
run;

               Summary of Stepwise Selection

       Effect                        Number     Score
Step   Entered              DF       In      Chi-Square  Pr > ChiSq

  1    glucose               1        1        167.1922      <.0001
  2    bmi                   1        2         34.3033      <.0001
  3    pregnant              1        3         27.3305      <.0001
  4    diabetes              1        4          9.6773      0.0019
  5    glucose*diabetes      1        5          9.5047      0.0020
  6    diastolic             1        6          6.0955      0.0136

          Analysis of Maximum Likelihood Estimates

                                    Standard      Wald
Parameter          DF   Estimate      Error    Chi-Square  Pr > ChiSq

Intercept           1     9.7526     0.9461     106.2493       <.0001
pregnant            1    -0.1565     0.0281      30.9176       <.0001
glucose             1    -0.0479    0.00580      68.2077       <.0001
diastolic           1     0.0124    0.00509       5.9655       0.0146
bmi                 1    -0.0879     0.0143      37.6694       <.0001
diabetes            1    -4.1489     1.1177      13.7791       0.0002
glucose*diabetes    1     0.0250    0.00814       9.4219       0.0021

### These stepwise, forward, or backward selections do require
### some care, since the order in which terms are listed in the
### MODEL statement partly determines the stepwise trajectory.
### There are also options (here and also in PROC REG) for
### forcing the inclusion of some terms in the model: the
### syntax --- in both PROCs --- is (after your SELECTION=
### choice) the option INCLUDE=k, which ensures that the first
### k model terms you have listed are kept in all models fitted.
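### A sketch of the 3-category recode of PREGNANT mentioned
### above. The exact steps were not shown in the log, so this is
### only a minimal version; PREGCAT is a name introduced here
### just for illustration:

data pima2;  set pima0;
   if pregnant = 0 then pregcat = 0;            /* none      */
   else if pregnant <= 3 then pregcat = 1;      /* 1-3       */
   else pregcat = 2;                            /* 4 or more */
run;

### PREGCAT could then replace PREGNANT in the MODEL statement
### (declared in a CLASS statement so that PROC LOGISTIC treats
### it as categorical); as noted above, though, this re-coding
### actually weakened the analysis.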
### After fitting a similar stepwise-selected logistic regression
### using different software (R), and doing some more interactive
### steps looking at the significance of individual coefficients,
### I found the following as the "best" model:

> glm3$call
glm(formula = cbind(Diab, 1 - Diab) ~ glucose + diastolic +
    bmi + diabetes + age + I(age^2) + glucose:diabetes +
    diastolic:age, family = binomial, data = pima0)

> summary(glm3)$coef
                     Estimate   Std. Error   z value     Pr(>|z|)
(Intercept)      -12.464859018 1.7075468315 -7.299864 2.880595e-13
glucose            0.045881317 0.0059454537  7.717042 1.190603e-14
diastolic         -0.054205074 0.0196112980 -2.763972 5.710248e-03
bmi                0.080127736 0.0145553959  5.505019 3.691278e-08
diabetes           3.704377000 1.1319492614  3.272565 1.065765e-03
age                0.261892820 0.0605939160  4.322098 1.545528e-05
I(age^2)          -0.003942011 0.0006835604 -5.766881 8.075205e-09
glucose:diabetes  -0.022027030 0.0082891666 -2.657327 7.876290e-03
diastolic:age      0.001218600 0.0005808102  2.098104 3.589600e-02

> table(pima0$Diab)
  0   1
500 268

> sum(glm3$fit[pima0$Diab==0] >= 0.5)/500
[1] 0.132          #### Easier to "predict" nondiabetic
> sum(glm3$fit[pima0$Diab==1] <= 0.5)/268
[1] 0.3731343      ### than diabetic

-------------------------------------------------------------------
### Now we try to reproduce this selection search in SAS, using
### the PROC LOGISTIC selection option with INCLUDE=8 (the first
### 8 effects listed below are those in the R model):

proc logistic data=pima1;
   model Diab = glucose diastolic bmi diabetes age agesq
        glucose|diabetes diastolic|age  pregnant insulin
        pregnant|glucose pregnant|diastolic pregnant|insulin
        pregnant|bmi pregnant|diabetes pregnant|age
        glucose|diastolic glucose|insulin glucose|bmi glucose|age
        diastolic|insulin diastolic|bmi diastolic|diabetes
        insulin|bmi insulin|diabetes insulin|age
        bmi|diabetes bmi|age  diabetes|age
        / stepwise include=8 lackfit;
run;

            Model Fit Statistics

                               Intercept
               Intercept          and
Criterion        Only          Covariates

AIC              995.484         711.687
SC              1000.128         753.482
-2 Log L         993.484         693.687

The LOGISTIC Procedure

     Testing Global Null Hypothesis: BETA=0

Test                  Chi-Square     DF    Pr > ChiSq

Likelihood Ratio        299.7965      8        <.0001
Score                   250.2776      8        <.0001
Wald                    177.2121      8        <.0001

       Residual Chi-Square Test

Chi-Square     DF    Pr > ChiSq
   32.2487     21        0.0552

NOTE: No (additional) effects met the 0.05 significance level
      for entry into the model.

### This says that the model found in R above still looks best
### when using SAS with the SELECTION= and INCLUDE= options.

### The LACKFIT option causes the Hosmer-Lemeshow goodness-of-fit
### statistic to be calculated:

   Hosmer and Lemeshow Goodness-of-Fit Test

   Chi-Square     DF    Pr > ChiSq
       8.9515      8        0.3464

### which indicates no reason to reject the fitted model.
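### For comparison with the R error rates computed above, here is
### a minimal sketch of the same mis-classification computation
### in SAS; the dataset name PREDS and variable name PHAT are
### choices made just for this illustration:

proc logistic data=pima1;
   model Diab = glucose diastolic bmi diabetes age agesq
                glucose*diabetes diastolic*age;
   output out=preds predicted=phat;
run;

data _null_;  set preds end=eof;
   /* By default PROC LOGISTIC models the probability of the
      LOWER response level, so PHAT estimates Pr(Diab=0) ---
      note the sign-flip between the SAS and R coefficient
      estimates above --- and the classification rule is
      reversed relative to the R computation. */
   if Diab = 0 then do;  n0 + 1;  if phat <  0.5 then err0 + 1;  end;
   else             do;  n1 + 1;  if phat >= 0.5 then err1 + 1;  end;
   if eof then do;
      rate0 = err0/n0;   rate1 = err1/n1;
      put rate0= rate1= ;    /* should roughly match 0.132, 0.373 */
   end;
run;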
==============================================================
Additional script on the output of PROC LOGISTIC using the
alternative R/N (events/trials) MODEL-statement syntax.
==============================================================

libname home ".";

data home.simlogist (drop=i);
   seed = 55733;
   do i=1 to 600;
      x1 = ranuni(seed);
      x2 = ranbin(seed,1,.5);
      ncas = ranpoi(seed,8);
        /* Here I made the number of "trials" at each observation
           a Poisson(8) distributed r.v. NCAS, and Y will be the
           number of successes */
      y = ranbin(seed, ncas, 1/(1+exp(-1.5 + 0.3*x1 + 0.5*x2)));
        /* Now we have y = # successes ranging from 0 to ncas */
      output;
   end;
run;

options nodate nocenter linesize=70;

proc logistic data=home.simlogist;
   model y/ncas = x1 x2 x1|x2 ;
run;

### The output included the following:

            Model Fit Statistics

                               Intercept
               Intercept          and
Criterion        Only          Covariates

AIC             5479.608        5444.085    ### = -2*logL + 2*p
SC              5486.084        5469.990
-2 Log L        5477.608        5436.085
   ...

The LOGISTIC Procedure

          Analysis of Maximum Likelihood Estimates

                               Standard       Wald
Parameter    DF    Estimate       Error    Chi-Square  Pr > ChiSq

Intercept     1      1.4100      0.1007      196.0715      <.0001
x1            1     -0.2637      0.1670        2.4931      0.1143
x2            1     -0.4794      0.1356       12.4902      0.0004
x1*x2         1      0.1319      0.2269        0.3382      0.5609

### The display-table of coefficient estimates, SE's, and
### significance-test statistics looks very much the same when
### we use the R/N MODEL-statement syntax in PROC LOGISTIC.
### Here the "Estimate" column contains the estimated beta-
### coefficients, with SE's in the next column. The "Wald
### Chi-Square" statistics have 1 degree of freedom and are
### simply the squares of the "studentized" ratios Estimate/SE,
### and the ChiSq p-values in the final column are then the SAME
### as the (2-sided) normal-distribution p-values for the
### studentized Estimates.

### In this case, although we created x1 and x2 to have nonzero
### true coefficients (-0.3 and -0.5), x1 looks INsignificant
### because the insignificant interaction term x1*x2 is included
### in the model. To get rid of the insignificant terms, try
### backward selection:

proc logistic data=home.simlogist;
   model y/ncas = x1 x2 x1|x2 / selection=backward sls=.10;
run;

Step 1. Effect x1*x2 is removed:
   ...
NOTE: No (additional) effects met the 0.1 significance level for
      removal from the model.

          Summary of Backward Elimination

        Effect          Number       Wald
Step    Removed     DF    In      Chi-Square   Pr > ChiSq

  1     x1*x2        1     2         0.3382        0.5609

          Analysis of Maximum Likelihood Estimates

                               Standard       Wald
Parameter    DF    Estimate       Error    Chi-Square  Pr > ChiSq

Intercept     1      1.3726      0.0769      318.2839      <.0001
x1            1     -0.1923      0.1130        2.8958      0.0888
x2            1     -0.4107      0.0666       38.0016      <.0001

### Note that although x1 is more significant than before once
### x1*x2 is removed, x1 would ALSO have been removed if we had
### taken SLS=.05 .
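### As a quick numerical check of the Wald/p-value relationship
### described above, a minimal sketch using the x1 row of the
### last table (the variable names here are arbitrary):

data _null_;
   est = -0.1923;   se = 0.1130;
   wald = (est/se)**2;            /* = 2.896, matching the printed
                                     2.8958 up to rounding of the
                                     displayed est and se          */
   p_chi  = 1 - probchi(wald, 1);       /* chi-square(1) p-value   */
   p_norm = 2*(1 - probnorm(abs(est/se)));  /* 2-sided normal p    */
   put wald= p_chi= p_norm= ;     /* both p's = 0.089, as printed  */
run;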