LOG TO ILLUSTRATE REGRESSION MODELLING AND GRAPHICAL
TECHNIQUES IN THE CONTEXT OF SIMULATED DATA
============================================= 11/3/06
(SO THAT WE CAN KNOW THE 'TRUE' MODELS)
We begin by creating a small simulated SAS data-set
which has a strong quadratic term.
libname home ".";
data home.SIMSET0 (drop = seed I );
if _N_ = 1 then do;
SEED = 55;
/* use non-zero seed to make the generated data-set
reproducible --- helps in debugging */
do i=1 to 500; X=ranuni(seed); end;
/* this is just to initialize or "burn in" the simulation */
end; /* initialization */
do I=1 to 75;
x = ranuni(seed);
y = 0.7 + 1.3*X + 0.8*X**2 + 0.8*rannor(seed);
/* a = 0.7, b1 = 1.3, b2 = .8, sigsq = (.8)^2 */
output; end;
/* Need output before "end" in order to have more
than 1 OBS !! */
data simset0;
set home.simset0;
Xsq = X**2;
Ymn = 0.7 + 1.3*X + 0.8*Xsq ;
run;
/* Of course in a real regression problem we would not
know the true mean */
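### The data-generating step above can be sketched in Python.
### numpy's generator stands in for ranuni/rannor, so the seed and
### random stream do NOT match the SAS run -- only the model agrees:

```python
import numpy as np

# Same model as the SAS data step: y = a + b1*X + b2*X^2 + sigma*Z
# with a = 0.7, b1 = 1.3, b2 = 0.8, sigma = 0.8.  Seed 55 is kept
# for reproducibility but produces a different stream than ranuni.
rng = np.random.default_rng(55)
n = 75
x = rng.uniform(size=n)                                   # ranuni analogue
y = 0.7 + 1.3*x + 0.8*x**2 + 0.8*rng.standard_normal(n)   # rannor analogue
ymn = 0.7 + 1.3*x + 0.8*x**2     # true mean -- known only because we simulate
```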
options linesize=70 nocenter;
goptions csymbol=black;
symbol1 V="+" ;
symbol2 V=NONE I=RLCLI95 ;
symbol3 V=NONE I=JOIN Line=3;
symbol4 V=diamond;
symbol5 V=NONE I=RLCLI99 ;
proc sort data=simset0;
by X; /* sort first to graph with I=JOIN */
proc gplot data=simset0;
plot Y * X = 1
Y * X = 2 /* no "outliers" with 5 in place of 2 !! */
YMN * X = 3 /overlay ;
title "Simple Regression Model Plots vs True Curve";
run;
NOTE: Regression equation : y = 0.645668 + 2.011291*X.
### We can see that the fitted coeff of X got inflated because
### of the X^2 term (which the model so far ignores).
### Three points land slightly outside the CLI95 confidence
### band, which is not at all surprising, because the
### EXPECTED number of points falling outside is
### 75 * (.05) = 3.75 .
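### To make the "not at all surprising" claim concrete: if each point
### independently fell outside a 95% band with probability .05 (an
### approximation -- the exceedance indicators are not exactly
### independent), the count outside would be Binomial(75, .05):

```python
from math import comb

n, p = 75, 0.05
mean = n * p                # expected count outside = 3.75
# probability of seeing 3 or more points outside
p_ge_3 = 1 - sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(3))
print(mean, p_ge_3)         # seeing at least 3 such points is quite likely
```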
### If we want to identify the points falling outside,
### we can:
proc reg data=simset0;
model y = X ;
output out=LinRsd0 p=PredL r = RsdL L95=Low U95=Hi
stdr =StdR Student = StuResW
Rstudent = StuResWO;
data BigRsd;
set LinRsd0 (drop=Xsq);
if Y < Low or Y > Hi or abs(StuResWO) > 1.96;
proc print; run;
Simple Regression Model Plots vs true Curve
10:20 Friday, November 3, 2006
Obs      X          y         Ymn       PredL       Low        Hi        RsdL      StdR     StuResW   StuResWO
 1    0.43740   -0.65496   1.42168   1.52541   -0.26815   3.31897   -2.18038   0.88713   -2.45779   -2.54862
 2    0.46659   -0.30014   1.48074   1.58413   -0.20891   3.37716   -1.88427   0.88740   -2.12337   -2.17708
 3    0.48727   -0.62646   1.52339   1.62570   -0.16710   3.41851   -2.25216   0.88751   -2.53761   -2.63926
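### The STUDENT (internally studentized, StuResW) and RSTUDENT
### (externally studentized, StuResWO) columns can be sketched in
### Python for the straight-line fit; studentized() below is a
### hypothetical helper, not part of any SAS interface:

```python
import numpy as np

def studentized(x, y):
    """Internally and externally studentized residuals for the
    least-squares straight-line fit of y on x."""
    n = len(y)
    X = np.column_stack([np.ones(n), x])
    h = np.diag(X @ np.linalg.inv(X.T @ X) @ X.T)     # leverages
    e = y - X @ np.linalg.lstsq(X, y, rcond=None)[0]  # ordinary residuals
    s2 = e @ e / (n - 2)                              # MSE, 2 parameters
    internal = e / np.sqrt(s2 * (1 - h))              # STUDENT
    s2_i = ((n - 2)*s2 - e**2/(1 - h)) / (n - 3)      # leave-one-out MSE
    external = e / np.sqrt(s2_i * (1 - h))            # RSTUDENT
    return internal, external
```

### The external version re-estimates sigma with observation i deleted,
### so an outlying point cannot inflate the scale used to judge it; the
### two satisfy external = internal*sqrt((n-3)/(n-2-internal^2)).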
## One way to find out whether a model term based on Xsq
## would improve the fit is to compute a partial
## correlation:
proc corr data=simset0;
var Y; with Xsq; partial X;
run;
Pearson Partial Correlation Coefficients, N = 75
Prob > |r| under H0: Partial Rho=0
y
Xsq 0.21307
0.0683 /* So we do not quite have significance,
but this suggests that the Xsq term
will help the fit. */
proc reg data=simset0;
model y = x xsq ;
output out= Qrsd0 p=PredQ r = RsdQ RStudent= StuRQ;
The REG Procedure
Model: MODEL1
Dependent Variable: y
Parameter Estimates
Parameter Standard
Variable DF Estimate Error t Value Pr > |t|
Intercept 1 1.08035 0.31299 3.45 0.0009
X 1 -0.48110 1.39129 -0.35 0.7305
Xsq 1 2.42558 1.31078 1.85 0.0683
### NOTE the last p-value .0683 is the same as for the partial
### correlation !  This is not a coincidence: the t-test for an
### added regressor is algebraically equivalent to the test that
### its partial correlation (given the other regressors) is 0.
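### The equivalence can be checked numerically.  The data below are
### freshly simulated from the same model (not the SAS stream, so the
### numbers differ from the printout above), and the two t statistics
### come out identical:

```python
import numpy as np

rng = np.random.default_rng(55)
n = 75
x = rng.uniform(size=n)
y = 0.7 + 1.3*x + 0.8*x**2 + 0.8*rng.standard_normal(n)
xsq = x**2

def resid_on(v, x):
    # residuals of v after the straight-line fit on x (with intercept)
    A = np.column_stack([np.ones(len(v)), x])
    return v - A @ np.linalg.lstsq(A, v, rcond=None)[0]

# partial correlation of y and xsq given x, and its t statistic (df = n-3)
r = np.corrcoef(resid_on(y, x), resid_on(xsq, x))[0, 1]
t_partial = r * np.sqrt((n - 3) / (1 - r**2))

# t statistic for xsq in the full regression y ~ 1 + x + xsq
A = np.column_stack([np.ones(n), x, xsq])
b = np.linalg.lstsq(A, y, rcond=None)[0]
e = y - A @ b
s2 = e @ e / (n - 3)
t_reg = b[2] / np.sqrt(s2 * np.linalg.inv(A.T @ A)[2, 2])

print(t_partial, t_reg)   # identical, hence identical p-values
```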
data Combin (drop=Ymn);
merge Qrsd0 LinRsd0 (keep = RsdL StuResWO
rename=(StuResWO=StuRL));
Tru = Ymn - PredQ;
proc Gplot;
plot Tru * X = 3
RsdQ * X = 1
RsdL * X = 4 / overlay;
title " Quadratic vs Linear Model Residuals";
run;
### This graph shows that the residuals from the Quadratic model
### (pluses in circles) generally, but not always, lie closer to
### the horizontal 0-line than do the Linear model residuals
### (diamonds).  The dashed line shows the true mean-curve minus
### the quadratic-model predicted values.
### NOTE that the quadratic-model fitted coeff of X is negative,
### and it should not be !!  This is a small-sample phenomenon
### which can occur because X and X-squared are highly correlated:
proc corr data=simset0;
var xsq; with x; run; /* corr .9681 */
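### The .9681 sample value is no accident: for X ~ Uniform(0,1) the
### population correlation of X and X^2 can be worked out in closed
### form, and a quick Monte Carlo check agrees:

```python
import numpy as np

# Cov(X, X^2) = E[X^3] - E[X]E[X^2] = 1/4 - (1/2)(1/3) = 1/12
# Var(X) = 1/12,   Var(X^2) = E[X^4] - (E[X^2])^2 = 1/5 - 1/9 = 4/45
rho = (1/12) / np.sqrt((1/12) * (4/45))   # = sqrt(15)/4, about 0.9682
print(rho)

# Monte Carlo check
rng = np.random.default_rng(55)
x = rng.uniform(size=200_000)
print(np.corrcoef(x, x**2)[0, 1])
```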
===========================================================
NOW WE CONTINUE WITH A LARGER SIMULATED DATASET TO ILLUSTRATE
THE USE OF RESIDUALS FALLING OUTSIDE OF EXPECTED RANGES
-------------------------------------------------------
(continuation 11/8/06, using same OPTIONS and GOPTIONS as before)
libname home ".";
data home.SIMSET1 (drop = seed I );
if _N_ = 1 then do;
SEED = 77777;
do i=1 to 500; X=ranuni(seed); end; end;
do I=1 to 500;
x = ranuni(seed);
y = 0.7 + 1.3*X + 0.8*X**2 + 0.8*rannor(seed);
/* a = 0.7, b1 = 1.3, b2 = .8, sigsq = (.8)^2 */
output; end;
data simset1;
set home.simset1;
Xsq = X**2;
Ymn = 0.7 + 1.3*X + 0.8*Xsq ;
proc sort data=simset1; by X;
proc reg data=simset1;
model y = X ;
output out=LinRsd1 p=PredL r = RsdL L95=Low U95=Hi
stdr =StdR Student = StuResW
Rstudent = StuResWO CookD = Cook;
data tmp ;
set LinRsd1;
LowRsd = Low - PredL;
HiRsd = Hi-PredL;
Zer = 0;
YmnRsd = Ymn - PredL;
symbol6 V=NONE I = JOIN LINE = 2;
proc gplot data=tmp;
plot StuResWO * X = 4
/* same as StuResW in this case because n is large */
Zer * X = 3
LowRsd * X = 6
HiRsd * X = 6
YmnRsd * X = 6 / overlay;
title "Studentized Residual Plot, SimSet1";
run;
data tmp2 ;
set tmp;
if abs(StuResWO) > 1.96;
/* Log window shows this happens 17 times, while the
expected number is 500 * (.05) = 25 */
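### Under the same rough approximation as before (independent 5%
### exceedances), the count of |studentized residual| > 1.96 among
### 500 points is roughly Binomial(500, .05), so 17 is low but not
### alarmingly so:

```python
from math import sqrt

n, p = 500, 0.05
mean, sd = n * p, sqrt(n * p * (1 - p))   # mean 25, sd about 4.87
print(mean, sd, (17 - mean) / sd)         # 17 is only ~1.6 sd below the mean
```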
proc corr data=tmp;
var Y; with Xsq; partial X;
run;
/* The value .14834 is highly significant (p-value .0009),
even though the studentized residuals don't show it ! */