LOG TO COVER LOOSE ENDS FROM SEVERAL PREVIOUS SAS LECTURES
================================================== 3/5/09

(I) TOPIC: data example and output from PROC TTEST.
---------------------------------------------------

 data football;
   infile "ASCdata/Football";
   input Trial Air Helium;
       if Air ne . ; run;

/* DATA look like this:
Obs    Trial    Air    Helium
  1       1      25      25
  2       2      23      16
  3       3      18      25
...
 36      36      25      29
 37      37      31      29
 38      38      28      30
 39      39      28      26                 */


options linesize=65 nocenter nodate;
proc gchart data=football;
   vbar air helium / levels=10; run;

/* Air does not seem to have outliers. Helium certainly has
    some separated low values (5 values of 16 and below),
    but they may not be isolated enough to be removed as
    outliers. */

  data foot2;
     set football;
     VALUE=Air; Label="A"; OUTPUT;
     VALUE=Helium; Label="H"; OUTPUT;
  run;
  proc sort data=foot2; by Label;
  proc boxplot; plot VALUE*Label; run;

* The boxplot for Helium suggests a higher median and quartiles
  than for Air, but also a definitely greater spread;

(b) The same kicker kicks both footballs, alternating between
A and H. There is certainly dependence between the
different kicks, and we might expect the greatest dependence
between kicks closest in time. The pairing is there but not
very strong: the unpaired 2-sample t-test is not quite right,
although the one-sample (paired) version may not be entirely
suitable either!
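
One rough way to gauge how strong the pairing is would be to look
at the within-trial correlation of the two distances. This is only
a sketch and was not part of the original analysis; it assumes the
football dataset created above.

```sas
/* Sketch (not in the original log): within-trial correlation
   of Air and Helium as a rough gauge of the pairing */
proc corr data=football;
   var Air Helium;
run;
```

A correlation near zero would suggest that the paired and unpaired
analyses should give similar answers.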

(c)  data onesamp;
       set football;
       dif = air - helium;
     proc means data=onesamp t prt;
       var dif; run;

The MEANS Procedure

Analysis Variable : dif

t Value    Pr > |t|
-------------------
  -0.42      0.6770
-------------------

* Another way to get the same result on the original
   dataset, as described in Chapter 6, is: ;

proc ttest data=football;
   paired air*helium; run;

                           Statistics

                         Lower CL            Upper CL   Lower CL
Difference           N       Mean     Mean       Mean    Std Dev
Air - Helium        39     -2.687   -0.462     1.7643     5.6117


                          Upper CL
Difference       Std Dev   Std Dev  Std Err  Minimum  Maximum
Air - Helium      6.8666    8.8495   1.0995      -17       14


                   T-Tests
Difference           DF    t Value    Pr > |t|
Air - Helium         38      -0.42      0.6770
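
The t statistic above is just Mean / Std Err, with Std Err =
Std Dev / sqrt(N). A minimal sketch that recomputes these from the
rounded values printed in the Statistics table (the variable names
are illustrative only):

```sas
/* Sketch: recompute Std Err, t and the two-sided p-value from
   the rounded summary statistics printed by PROC TTEST */
data _null_;
   n    = 39;
   mean = -0.462;     /* rounded Mean from the output    */
   sd   = 6.8666;     /* Std Dev from the output         */
   se   = sd / sqrt(n);
   t    = mean / se;
   p    = 2 * (1 - probt(abs(t), n - 1));
   put se= t= p=;     /* values are written to the log   */
run;
```

Up to rounding of the inputs, the values written to the log should
agree with Std Err, t Value and Pr > |t| in the output above.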


*Although a few helium values had looked like outliers, a
histogram of the differences did not make those
observations look like outliers ;
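
The code for that histogram is not recorded in this log. A sketch
of how it might be produced, assuming the onesamp dataset from (c)
and reusing the GCHART call from earlier:

```sas
/* Sketch: histogram of the paired differences; levels=10 is
   copied from the earlier GCHART call and is an arbitrary choice */
proc gchart data=onesamp;
   vbar dif / levels=10;
run;
```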

* If you did want to do a two-sample t test, here is one
   way to code it using a Group variable;

data twot;
  set football (drop = Trial Helium rename=(Air=Dist))
      football (drop = Trial Air rename=(Helium = Dist));
  /* football has 39 records, so rows 1-39 of twot are the Air kicks */
  if _N_ < 40 then Group = "A"; else Group = "H";
run;

/* NOTE the way in which the same dataset is entered twice !! */

proc ttest data = twot;
  class Group;
  var Dist;    run;

                            T-Tests

Variable   Method          Variances     DF   t Value   Pr > |t|

Dist       Pooled          Equal         76     -0.37     0.7122
Dist       Satterthwaite   Unequal     70.7     -0.37     0.7123


                    Equality of Variances

Variable    Method      Num DF    Den DF    F Value    Pr > F
Dist        Folded F        38        38       1.76    0.0862
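
The Group assignment in the twot data step relies on knowing that
football has exactly 39 records. A sketch of an alternative that
does not hard-code the record count, using the IN= dataset option
(twot2 and inA are hypothetical names):

```sas
/* Sketch: same stacked dataset, with Group set from an IN= flag
   that is 1 when the current record comes from the first copy */
data twot2;
   set football (in=inA drop=Trial Helium rename=(Air=Dist))
       football (drop=Trial Air rename=(Helium=Dist));
   if inA then Group = "A";
   else Group = "H";
run;
```

Running PROC TTEST on twot2 with the same CLASS and VAR statements
should reproduce the two-sample output above.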


(II) Outputting records from PROC FREQ, also including some tricks
related to SORTing: example (I) from HW4, F08:
------------------------------------------------------------

 First, all of the " characters were removed from the
Geog.dat file with a word processor.

     data Geog;
       infile "Geog.dat";
       input ID State $ Locoff X ;
     run;  /* 1200 records kept */

(i) For the number of states:

 data tmp3 (keep = state);
   set geog;
 proc sort noduplicates;
   by state;
 proc print; run;   /* 10 unique states */

/* Output has them listed as:
      AZ  CA  CT  GA  NY  SD  TX  VT  WI  WY  */
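
A possible shortcut for the count alone, assuming a SAS release in
which PROC FREQ supports the NLEVELS option:

```sas
/* Sketch: report the number of distinct State values without
   building a temporary dataset; NOPRINT suppresses the one-way
   frequency table but leaves the "Number of Variable Levels" table */
proc freq data=Geog nlevels;
   tables State / noprint;
run;
```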

proc means data=Geog N mean ;
  class State Locoff;
  var X;
  output out= Outgeo N = nobs Mean = Xavg ; run;
  /* 126 observations: the first gives the overall dataset nobs and
        mean (_TYPE_=0); the next 11 give the Locoff nobs and means
        (_TYPE_=1); the next 10 give the Statewise nobs and means
        (_TYPE_=2); and the last 104 give the cross-classified
        State x Locoff nobs and means (_TYPE_=3) */

  options nocenter linesize=80;
        proc print data = Outgeo; run;


Obs    State    Locoff    _TYPE_    _FREQ_    nobs      Xavg

  1                .         0       1200     1200    6.23471
  2                1         1        233      233    5.89323
  3                2         1        220      220    6.07311
  4                3         1        194      194    5.87485
  5                4         1        157      157    6.13181
  6                5         1         80       80    6.92724
  7                6         1         78       78    6.89394
  8                7         1         48       48    6.69385
  9                8         1         69       69    6.50832
 10                9         1         64       64    6.61998
 11               10         1         33       33    6.79821
 12               11         1         24       24    6.65525
 13     AZ         .         2        104      104    6.41100
 14     CA         .         2        263      263    6.16598
 15     CT         .         2         83       83    6.37181
 16     GA         .         2        127      127    6.43317
 17     NY         .         2        197      197    6.19989
 18     SD         .         2         24       24    6.06613
 19     TX         .         2        169      169    6.20879
 20     VT         .         2         58       58    6.24859
 21     WI         .         2        113      113    6.04733
 22     WY         .         2         62       62    6.21561
 23     AZ         1         3         16       16    5.90256
 24     AZ         2         3         19       19    6.35563
 25     AZ         3         3         17       17    6.06753
 26     AZ         4         3         11       11    6.97627
 27     AZ         5         3          7        7    6.76829
 28     AZ         6         3          5        5    7.00520
...

   data geo2 (drop = _TYPE_ _FREQ_);
    set outgeo;
       if _TYPE_ = 3;  /* the ones with no missing class varbls */
     run;   /* again get 104 records this way */
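
The subsetting step on _TYPE_ can probably be skipped by asking
PROC MEANS for only the full cross-classification with the NWAY
option. A sketch (geo2nway is a hypothetical dataset name):

```sas
/* Sketch: NWAY keeps only the rows with the highest _TYPE_ value,
   i.e. the State x Locoff cells; NOPRINT suppresses the listing */
proc means data=Geog nway noprint;
   class State Locoff;
   var X;
   output out=geo2nway (drop=_TYPE_ _FREQ_) N=nobs Mean=Xavg;
run;
```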

/* This file geo2 has 104 recs with 4 columns
           State Locoff nobs  Xavg
   Could also get the same records (with the last two columns
   named Npers and meanX) by coding directly: */

   proc sort data=Geog out = geo2;
      by State Locoff;
   data geo2 (drop=X ID);
      set geo2;
       by State Locoff;
       if first.Locoff then do;
             Npers=0; meanX=0; end;
       Npers+1;
       meanX+X;
       if last.Locoff then do;
            meanX = meanX/Npers;
            output; end;  run;    /* This also solves (ii) */

/* To check that all ID's were unique: */

data tmp1 (keep = id);
   set Geog;
proc sort noduplicates;
  by ID;
  run;
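
With the sort above, uniqueness is confirmed if tmp1 still has
1200 records. A sketch of a direct check with PROC SQL (Nrecords
and NuniqueID are hypothetical column names):

```sas
/* Sketch: the IDs are all unique exactly when the two counts agree */
proc sql;
   select count(*)           as Nrecords,
          count(distinct ID) as NuniqueID
   from Geog;
quit;
```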


/* LAST STEPS: outputting table & chisq residuals */
   
 proc freq  data=Geog ;
         tables State * Locoff /norow nopercent nocum nocol
               chisq expected cellchi2 deviation out = OutTab;
          run;
      proc print data=OutTab; run;
   

/* Unfortunately, the output dataset contains only:

Obs    State    Locoff    COUNT    PERCENT

  1     AZ         1        16     1.33333
  2     AZ         2        19     1.58333
  3     AZ         3        17     1.41667
  4     AZ         4        11     0.91667
  5     AZ         5         7     0.58333
  6     AZ         6         5     0.41667
  7     AZ         7         5     0.41667
  8     AZ         8         9     0.75000
  9     AZ         9         5     0.41667
 10     AZ        10         7     0.58333
 11     AZ        11         3     0.25000
 12     CA         1        51     4.25000
 13     CA         2        42     3.50000
 14     CA         3        42     3.50000
 15     CA         4        38     3.16667
...

So we must use dataset computations and proc means to get chi-square
residuals printed out, although the "cellchi2" and "deviation" options
specify that closely related quantities be printed out in the OUTPUT
window.


*/
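
A sketch of one way to carry out those dataset computations,
starting from OutTab. It uses PROC SQL for the marginal totals
rather than proc means, which is a substitution made here for
brevity. Resid, RowTot, ColTot, Total and PearsonRes are
hypothetical names.

```sas
/* Sketch: expected counts under independence,
   Expected = (row total) * (column total) / (grand total) */
proc sql;
   create table Resid as
   select a.State, a.Locoff, a.COUNT,
          r.RowTot * c.ColTot / g.Total as Expected
   from OutTab as a,
        (select State, sum(COUNT) as RowTot
            from OutTab group by State) as r,
        (select Locoff, sum(COUNT) as ColTot
            from OutTab group by Locoff) as c,
        (select sum(COUNT) as Total from OutTab) as g
   where a.State = r.State and a.Locoff = c.Locoff
   order by a.State, a.Locoff;
quit;

data Resid;
   set Resid;
   Deviation  = COUNT - Expected;          /* observed - expected     */
   CellChi2   = Deviation**2 / Expected;   /* cell chi-square term    */
   PearsonRes = Deviation / sqrt(Expected);/* signed Pearson residual */
run;

proc print data=Resid; run;
```

One caveat: 10 states by 11 Locoff values give 110 possible cells,
but only 104 occur, so cells with zero count are presumably absent
from OutTab and therefore from Resid. Adding the SPARSE option to
the TABLES statement should include them.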