LOG TO COVER LOOSE ENDS FROM SEVERAL PREVIOUS SAS LECTURES
==================================================  3/5/09

(I) TOPIC: data example and output from PROC TTEST.
---------------------------------------------------

data football;
   infile "ASCdata/Football";
   input Trial Air Helium;
   if Air ne . ;
run;

/* DATA look like this:

   Obs    Trial    Air    Helium
     1       1      25      25
     2       2      23      16
     3       3      18      25
   ...
    36      36      25      29
    37      37      31      29
    38      38      28      30
    39      39      28      26
*/

options linesize=65 nocenter nodate;
proc gchart data=football;
   vbar air helium / levels=10;
run;

* Air does not seem to have outliers. There certainly are
  some separated low values of Helium (5 values 16 and below),
  but they may not be isolated enough to be removed as outliers;

data foot2;
   set football;
   VALUE=Air;    Label="A"; OUTPUT;
   VALUE=Helium; Label="H"; OUTPUT;
run;

proc sort data=foot2;
   by Label;
proc boxplot;
   plot VALUE*Label;
run;

* Boxplot for Helium looks as though median and quartiles are
  higher, but also definitely has greater spread;

(b) The same kicker kicks both footballs, alternating between
    A and H. There is certainly dependence between the different
    kicks, and we might expect the greatest dependence between
    kicks closest together in time. The pairing is there but not
    very strong: the unpaired 2-sample t-test is not quite right,
    although maybe the one-sample (paired) version is not so
    suitable either!
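
One way to look at how strong the pairing actually is would be to compute the correlation between the Air and Helium distances within a trial, and between each kick and the one before it. The following is only a sketch: it assumes that Trial order is time order, and the dataset name lagchk and the variables AirLag and HeLag are made up for illustration.

```sas
/* Sketch only: correlation between the paired kicks */
proc corr data=football;
   var Air Helium;
run;

/* Lag-1 dependence, assuming the records are in time order.
   lagchk, AirLag and HeLag are hypothetical names.          */
data lagchk;
   set football;
   AirLag = lag(Air);      /* Air distance on the previous trial    */
   HeLag  = lag(Helium);   /* Helium distance on the previous trial */
run;

proc corr data=lagchk;
   var Air Helium;
   with AirLag HeLag;
run;
```

A within-trial correlation near zero would mean that the paired and unpaired t-tests give similar answers.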

(c)
data onesamp;
   set football;
   dif = air - helium;
proc means data=onesamp t prt;
   var dif;
run;

       The MEANS Procedure

    Analysis Variable : dif

     t Value    Pr > |t|
     -------------------
       -0.42      0.6770
     -------------------

* Another way to get the same result on the original
  dataset, which is described in Chapter 6, is: ;

proc ttest data=football;
   paired air*helium;
run;

                        Statistics

                       Lower CL             Upper CL    Lower CL
 Difference        N       Mean      Mean       Mean     Std Dev
 Air - Helium     39     -2.687    -0.462     1.7643      5.6117

                             Upper CL
 Difference       Std Dev     Std Dev    Std Err    Minimum    Maximum
 Air - Helium      6.8666      8.8495     1.0995        -17         14

                  T-Tests

 Difference        DF    t Value    Pr > |t|
 Air - Helium      38      -0.42      0.6770

* Although a few helium values had looked like outliers,
  a histogram of the differences did not make those
  observations look like outliers ;

* If you did want to do a two-sample t test, here is one
  way to code it using a Group variable;

data twot;
   set football (drop = Trial Helium rename=(Air=Dist))
       football (drop = Trial Air rename=(Helium = Dist));
   if _N_ < 40 then Group = "A";
   else Group = "H";
   /* NOTE the way in which the same dataset is entered twice !! */

proc ttest data = twot;
   class Group;
   var Dist;
run;

                          T-Tests

 Variable    Method           Variances      DF    t Value    Pr > |t|
 Dist        Pooled           Equal          76      -0.37      0.7122
 Dist        Satterthwaite    Unequal      70.7      -0.37      0.7123

                    Equality of Variances

 Variable    Method      Num DF    Den DF    F Value    Pr > F
 Dist        Folded F        38        38       1.76    0.0862


(II) Outputting records from PROC FREQ, also including some
     tricks related to SORTing: example (I) from HW4, F08:
------------------------------------------------------------

First, all of the double-quote characters (") were removed from
the Geog.dat file in a word processor.
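
The quote characters could probably also be removed inside SAS instead of in a word processor. The sketch below rests on two assumptions that this log does not confirm: that only the State field was quoted in the raw file, and that the unedited file is available under the hypothetical name GeogRaw.dat.

```sas
/* Sketch only: strip double quotes while reading the raw file.
   Assumes only State is quoted; GeogRaw.dat is a made-up name. */
data Geog;
   infile "GeogRaw.dat";
   input ID State $ Locoff X;
   State = compress(State, '"');   /* remove any " characters */
run;
```

If the numeric fields were quoted as well, this would not work as written, since the INPUT statement would fail to read them as numbers.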

data Geog;
   infile "Geog.dat";
   input ID State $ Locoff X ;
run;
/* 1200 records kept */

(i) For # states:

data tmp3 (keep = state);
   set geog;
proc sort noduplicates;
   by state;
proc print;
run;
/* 10 unique states */
/* Output has them listed as: AZ CA CT GA NY SD TX VT WI WY */

proc means data=Geog N mean ;
   class State Locoff;
   var X;
   output out= Outgeo N = nobs Mean = Xavg ;
run;

/* 126 observations, of which the first gives the overall dataset
   nobs and mean (_TYPE_=0); the next 11 give the Locoff nobs and
   means (_TYPE_=1); the next 10 give the Statewise nobs and means
   (_TYPE_=2); and the last 104 give the cross-classified
   State x Locoff nobs and means (_TYPE_=3) */

options nocenter linesize=80;
proc print data = Outgeo;
run;

 Obs    State    Locoff    _TYPE_    _FREQ_    nobs      Xavg
   1                .         0       1200     1200    6.23471
   2                1         1        233      233    5.89323
   3                2         1        220      220    6.07311
   4                3         1        194      194    5.87485
   5                4         1        157      157    6.13181
   6                5         1         80       80    6.92724
   7                6         1         78       78    6.89394
   8                7         1         48       48    6.69385
   9                8         1         69       69    6.50832
  10                9         1         64       64    6.61998
  11               10         1         33       33    6.79821
  12               11         1         24       24    6.65525
  13     AZ         .         2        104      104    6.41100
  14     CA         .         2        263      263    6.16598
  15     CT         .         2         83       83    6.37181
  16     GA         .         2        127      127    6.43317
  17     NY         .         2        197      197    6.19989
  18     SD         .         2         24       24    6.06613
  19     TX         .         2        169      169    6.20879
  20     VT         .         2         58       58    6.24859
  21     WI         .         2        113      113    6.04733
  22     WY         .         2         62       62    6.21561
  23     AZ         1         3         16       16    5.90256
  24     AZ         2         3         19       19    6.35563
  25     AZ         3         3         17       17    6.06753
  26     AZ         4         3         11       11    6.97627
  27     AZ         5         3          7        7    6.76829
  28     AZ         6         3          5        5    7.00520
  ...
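
The rows with _TYPE_ = 3 in this listing are the State x Locoff cells. If only those rows are wanted, the NWAY option of PROC MEANS restricts the output dataset to the full cross-classification, so no later subsetting is needed. A minimal sketch, with OutNway as a hypothetical dataset name:

```sas
/* Sketch only: keep just the cross-classified cells.
   OutNway is a made-up output dataset name.          */
proc means data=Geog nway noprint;
   class State Locoff;
   var X;
   output out=OutNway (drop=_TYPE_ _FREQ_) N=nobs Mean=Xavg;
run;
```

The resulting dataset should have one record per State x Locoff combination that actually occurs in Geog, with columns State, Locoff, nobs and Xavg.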

data geo2 (drop = _TYPE_ _FREQ_);
   set outgeo;
   if _TYPE_ = 3;   /* the ones with no missing class variables */
run;
/* again get 104 records this way */

/* This file geo2 has 104 records with 4 columns
      State  Locoff  nobs  Xavg
   Could also get the same information (with the columns
   named Npers and meanX) by coding directly: */

proc sort data=Geog out = geo2;
   by State Locoff;
data geo2 (drop=X ID);
   set geo2;
   by State Locoff;
   if first.Locoff then do;
      Npers=0; meanX=0;
   end;
   Npers+1;
   meanX+X;
   if last.Locoff then do;
      meanX = meanX/Npers;
      output;
   end;
run;
/* This also solves (ii) */

/* To check that all ID's were unique: */

data tmp1 (keep = id);
   set Geog;
proc sort noduplicates;
   by ID;
run;

/* LAST STEPS: outputting table & chisq residuals */

proc freq data=Geog ;
   tables State * Locoff /norow nopercent nocum nocol
          chisq expected cellchi2 deviation out = OutTab;
run;

proc print data=OutTab;
run;

/* Unfortunately, the output dataset contains only:

   Obs    State    Locoff    COUNT    PERCENT
     1     AZ         1        16     1.33333
     2     AZ         2        19     1.58333
     3     AZ         3        17     1.41667
     4     AZ         4        11     0.91667
     5     AZ         5         7     0.58333
     6     AZ         6         5     0.41667
     7     AZ         7         5     0.41667
     8     AZ         8         9     0.75000
     9     AZ         9         5     0.41667
    10     AZ        10         7     0.58333
    11     AZ        11         3     0.25000
    12     CA         1        51     4.25000
    13     CA         2        42     3.50000
    14     CA         3        42     3.50000
    15     CA         4        38     3.16667
   ...

   So we must use dataset computations and proc means to get the
   chi-square residuals printed out, although the "cellchi2" and
   "deviation" options specify that closely related quantities
   be printed in the OUTPUT window. */
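
One possible way to carry out those dataset computations is sketched here. It uses PROC SQL rather than PROC MEANS to get the row and column totals, which is a substitution made for brevity and not the method of the notes. The names Resid, RowTot, ColTot, Tot, Expctd, PearsRes and CellChi are all made up for illustration.

```sas
/* Sketch only: expected counts and chi-square residuals from OutTab.
   Expected = (row total)*(column total)/(grand total).              */
proc sql;
   create table Resid as
   select t.State, t.Locoff, t.COUNT,
          r.RowTot * c.ColTot / g.Tot                 as Expctd,
          (t.COUNT - calculated Expctd)
               / sqrt(calculated Expctd)              as PearsRes,
          calculated PearsRes * calculated PearsRes   as CellChi
   from OutTab as t,
        (select State,  sum(COUNT) as RowTot
            from OutTab group by State)  as r,
        (select Locoff, sum(COUNT) as ColTot
            from OutTab group by Locoff) as c,
        (select sum(COUNT) as Tot from OutTab) as g
   where t.State = r.State and t.Locoff = c.Locoff
   order by t.State, t.Locoff;
quit;

proc print data=Resid;
run;
```

Note that OutTab lists only the cells with nonzero counts (104 of the 10 x 11 = 110 possible), so the empty cells do not appear in Resid even though they contribute to the chi-square statistic. Adding the SPARSE option to the TABLES statement should put the zero-count cells into the OUT= dataset. Depending on the SAS release, the OUTEXPECT option on the TABLES statement may also write the expected counts to the OUT= dataset directly. That is worth checking in the PROC FREQ documentation.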