Solution Script for HW2, S705
=============================

> Titanic3 = read.csv("http://biostat.mc.vanderbilt.edu/wiki/pub/Main/DataSets/titanic3.csv",
               stringsAsFactors=F)
> dim(Titanic3)
[1] 1309   14

(a) How many different people have data in this dataset ?
    NOTE: there are some duplicate names. How many ? Do you think that 
            the duplicate-pairs represent different people ? Give 
            reasoning based on information in the dataset.

> attach(Titanic3)
> which(duplicated(name))
[1] 727 926
> Titanic3[c(727,926),]
    pclass survived                 name    sex age sibsp parch ticket   fare cabin
727      3        0 Connolly, Miss. Kate female  30     0     0 330972 7.6292      
926      3        0     Kelly, Mr. James   male  44     0     0 363592 8.0500      
    embarked boat body home.dest
727        Q        NA   Ireland
926        S        NA 

> Titanic3[name==name[727],]
    pclass survived                 name    sex age sibsp parch ticket   fare cabin
726      3        1 Connolly, Miss. Kate female  22     0     0 370373 7.7500      
727      3        0 Connolly, Miss. Kate female  30     0     0 330972 7.6292      
    embarked boat body home.dest
726        Q   13   NA   Ireland
727        Q        NA   Ireland

## The two Kate Connollys have quite different ages (22 vs 30) and ticket ids, and 
## only one survived. Similarly, the two James Kellys are distinct people (although 
## neither survived, they embarked at different ports).

## So there are 2 duplicated names, both common Irish names, but each pair represents 
##    two different people: the dataset has 1309 distinct people. None of the names 
##    are missing or blank.
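
## A minimal sketch of the same duplicate check without attach(). The toy data 
## frame and its values are hypothetical stand-ins, not rows of Titanic3. Since 
## duplicated() flags only the second copy of a name, combining it with 
## fromLast=TRUE lists both members of each pair side by side.

```r
# Hypothetical toy data, standing in for Titanic3
toy <- data.frame(
  name   = c("Doe, Mr. John", "Roe, Miss. Jane", "Doe, Mr. John"),
  age    = c(25, 31, 52),
  ticket = c("A1", "B2", "C3"),
  stringsAsFactors = FALSE
)

# TRUE for every row whose name occurs more than once (first copy included)
dup_all <- duplicated(toy$name) | duplicated(toy$name, fromLast = TRUE)

# Show the duplicate groups together, sorted by name
dups <- toy[dup_all, ]
dups <- dups[order(dups$name), ]
print(dups)
```

## Differing ages and tickets within a group, as here, suggest different people.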

(b)  As is true of many datasets, this one has some missing (NA) values in 
numeric columns and blanks in character columns (which are also "missing" 
in the same sense). Designate each column as numeric or character, and 
count the missing values.
    Also count the individual records (rows) of the data-frame for which there 
are respectively 1, 2, ... missing fields. What is the maximum ?

> sapply(Titanic3, class)
     pclass    survived        name         sex         age       sibsp       parch 
  "integer"   "integer" "character" "character"   "numeric"   "integer"   "integer" 
     ticket        fare       cabin    embarked        boat        body   home.dest 
"character"   "numeric" "character" "character" "character"   "integer" "character" 

> aux = sapply(Titanic3,class)=="character"
  sapply(Titanic3[,aux], function(col) sum(col==""))
     name       sex    ticket     cabin  embarked      boat home.dest 
        0         0         0      1014         2       823       564 
> sapply(Titanic3[,!aux], function(col) sum(is.na(col)))
  pclass survived      age    sibsp    parch     fare     body 
       0        0      263        0        0        1     1188 

> missarr = cbind(is.na(Titanic3[,!aux]), Titanic3[,aux]=="")
  table(apply(missarr,1,sum))

  1   2   3   4   5 
186 290 399 278 156 

## That is, every row has at least one missing field; 186 have just 1, 290 have 2, etc.
## The maximum number of missing fields in any one row is 5.
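
## The same counts can be sketched with one helper that treats NA and "" alike. 
## The helper name is_missing and the toy data below are hypothetical; the idea 
## is only that rowSums/colSums on a logical matrix give the per-row and 
## per-column missing counts.

```r
# Hypothetical toy data with NA in numeric columns and "" in a character column
toy <- data.frame(
  age   = c(22, NA, 30),
  body  = c(NA, NA, 135L),
  cabin = c("", "C85", ""),
  stringsAsFactors = FALSE
)

# A field is missing if it is NA, or if it is a blank in a character column
is_missing <- function(col) is.na(col) | (is.character(col) & col == "")

miss <- sapply(toy, is_missing)   # logical matrix, one column per variable
per_col <- colSums(miss)          # missing values per column
per_row <- rowSums(miss)          # missing fields per record
print(per_col)
print(table(per_row))             # how many records have 1, 2, ... missing
print(max(per_row))               # maximum missing fields in any record
```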


(c) Tabulate the survival rates, by sex and passenger class (pclass) for 
individuals on the Titanic3 data file.

> attach(Titanic3)      ### lets columns be referenced directly, e.g. as "pclass" 
                   ### rather than "Titanic3$pclass" (already done in part (a); 
                   ### repeat only when starting from a fresh session)

> tmp = round(table(sex[survived==1], pclass[survived==1])/
        table(sex,pclass), 3)
  dimnames(tmp)[[2]] = paste("Class",1:3,sep="")
  tmp

         Class1 Class2 Class3
  female  0.965  0.887  0.491
  male    0.341  0.146  0.152

## So this does show a striking difference in survival rate by sex and pclass.
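
## Because survived is coded 0/1, its mean within a group is that group's 
## survival rate, so the table can also be sketched with tapply(). The toy 
## data below are hypothetical, not taken from Titanic3.

```r
# Hypothetical toy data
toy <- data.frame(
  survived = c(1, 0, 1, 1, 0, 0),
  sex      = c("female", "female", "male", "female", "male", "male"),
  pclass   = c(1, 1, 1, 3, 3, 3)
)

# Mean of a 0/1 indicator within each sex-by-class cell = survival rate
rates <- tapply(toy$survived, list(toy$sex, toy$pclass), mean)
print(round(rates, 3))
```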


(d) Exploratory (descriptive) questions, to answer as time permits:

--- at what young age x does it seem that there was no longer a survival 
advantage to have been aged <= x on the Titanic ? Does this seem to have 
varied with passenger class ?


> Advantage = function(x) mean(survived[age <= x], na.rm=T) - 
                          mean(survived[age>x], na.rm=T)
  adv = numeric(25)
  for(i in 1:25) adv[i] = Advantage(i)
  plot(adv, xlab="Age cutoff", ylab="Survival rate diff",
     main="Survival rate diff for age <=x vs age > x", type="l",col="blue")

> round(adv[15:25],3)
 [1] 0.186 0.165 0.138 0.103 0.080 0.067 0.032 0.040 0.034 0.042 0.028

### The advantage plateaus (at around .17) up to about age 15, then decreases 
### sharply to < .05 at age 21.
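
## A sketch of the same calculation with the data passed in as arguments rather 
## than taken from attached columns. The lower-case name advantage and the toy 
## vectors are hypothetical. Indexing by a comparison on a missing age yields NA, 
## which na.rm=TRUE then drops, so passengers of unknown age are ignored.

```r
# Survival-rate difference: aged <= x minus aged > x
advantage <- function(x, survived, age) {
  mean(survived[age <= x], na.rm = TRUE) - mean(survived[age > x], na.rm = TRUE)
}

# Hypothetical toy data; the last passenger has unknown age
toy_age      <- c(2, 8, 15, 30, 45, NA)
toy_survived <- c(1, 1, 0, 0, 1, 1)

# Evaluate at several cutoffs at once instead of looping
adv_toy <- sapply(c(10, 20), advantage, survived = toy_survived, age = toy_age)
print(round(adv_toy, 3))
```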

> AdvTabl = function(x) table(pclass[survived==1 & age <=x])/table(pclass[age<=x]) - 
                        table(pclass[survived==1 & age>x])/table(pclass[age>x])

> SurvDifTbl = array(0,c(17,3), dimnames=list(paste("age",5:21,sep="-"), 
                                 paste("class",1:3,sep="")))
  for(i in 5:21) SurvDifTbl[i-4,] = AdvTabl(i)

> mat2 = table(ceiling(age[age>=5 & age <22]), pclass[age>=5 & age <22]) 
  dimnames(mat2)[[2]] = paste("#class",1:3,sep="")
  cbind(round(SurvDifTbl,3), mat2)

       class1 class2 class3 #class1 #class2 #class3
age-5   0.030  0.596  0.272       0       1       4
age-6   0.114  0.598  0.247       1       1       4
age-7   0.114  0.601  0.237       0       1       3
age-8   0.114  0.611  0.215       0       4       2
age-9   0.114  0.611  0.208       0       0      10
age-10  0.114  0.611  0.176       0       0       4
age-11  0.166  0.611  0.155       1       0       3
age-12  0.166  0.616  0.159       0       2       2
age-13  0.200  0.619  0.157       1       1       3
age-14  0.225  0.583  0.159       1       2       5
age-15  0.245  0.587  0.170       1       1       6
age-16  0.283  0.519  0.167       3       2      14
age-17  0.242  0.502  0.135       4       3      13
age-18  0.237  0.355  0.125       6       9      24
age-19  0.188  0.305  0.108       5       9      18
age-20  0.188  0.340  0.093       0       4      19
age-21  0.190  0.277  0.065       5       8      29

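## Columns class1-class3 give the survival-rate difference (age <= x minus age > x) 
## within each class; columns #class1-#class3 count the passengers of each class 
## whose age rounds up to that value.
## The advantage is strongest in class 2 !! 
## But class 3 is where most of the children were.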
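
## Dividing one table() by another assumes every class appears in both tables. 
## A sketch of a variant that avoids this by using tapply(), which leaves an NA 
## cell when a class has no passengers in an age group. The name adv_by_class 
## and the toy data are hypothetical.

```r
# Per-class survival-rate difference: aged <= x minus aged > x
adv_by_class <- function(x, survived, age, pclass) {
  young <- age <= x                       # NA when age is unknown
  rates <- tapply(survived, list(young, pclass), mean)  # NA groups are dropped
  rates["TRUE", ] - rates["FALSE", ]
}

# Hypothetical toy data; the last passenger has unknown age
toy_age      <- c(3, 40, 5, 50, 7, 60, NA)
toy_survived <- c(1, 0, 1, 1, 0, 0, 1)
toy_pclass   <- c(1, 1, 2, 2, 3, 3, 3)

res <- adv_by_class(10, toy_survived, toy_age, toy_pclass)
print(res)
```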


--- what were the home-destinations of Titanic passengers, and the average 
ages of passengers and survival rates corresponding to each ?   
           OMITTED AS PER EMAILED INSTRUCTIONS


--- does the survival rate of passengers within each passenger class seem 
to depend at all on the fare paid ?

> classAvgfare = c(mean(fare[pclass==1]), mean(fare[pclass==2]),
                   mean(fare[pclass==3], na.rm=T))
> round(classAvgfare,2)
[1] 87.51 21.18 13.30

> mat3 = table((fare > classAvgfare[pclass])[survived==1], pclass[survived==1])/
  table(fare > classAvgfare[pclass], pclass)
  dimnames(mat3) = list(c("LowerFare","HigherFare"), paste("Class",1:3,sep=""))
  round(mat3,3)

           Class1 Class2 Class3
LowerFare   0.575  0.343  0.243
HigherFare  0.722  0.559  0.284

### So those who paid an above-average fare for their pclass did have higher 
### survival rates in classes 1 and 2, but there was not much of a similar 
### effect in class 3.
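
### A sketch of the same comparison using ave(), which returns each passenger's 
### class-mean fare directly, so no lookup by class index is needed. The toy 
### data are hypothetical; a missing fare gives an NA comparison, and tapply() 
### drops that passenger.

```r
# Hypothetical toy data; one fare is missing
toy <- data.frame(
  pclass   = c(1, 1, 1, 3, 3, 3),
  fare     = c(100, 50, 30, 10, NA, 6),
  survived = c(1, 1, 0, 1, 0, 0)
)

# Mean fare of each passenger's own class, ignoring missing fares
class_mean <- ave(toy$fare, toy$pclass, FUN = function(v) mean(v, na.rm = TRUE))

# Survival rate by (above class-mean fare?) and class
higher <- toy$fare > class_mean
fare_rates <- tapply(toy$survived, list(higher, toy$pclass), mean)
print(fare_rates)
```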

### Whether these differences were "significant" is a topic for more 
###    formal investigation ...