Solution Script for HW2, S705
=============================

> Titanic3 = read.csv("http://biostat.mc.vanderbilt.edu/wiki/pub/Main/DataSets/titanic3.csv",
                      stringsAsFactors=FALSE)
> dim(Titanic3)
[1] 1309   14

(a) How many different people have data in this dataset? NOTE: there are
    some duplicate names. How many? Do you think that the duplicate pairs
    represent different people? Give reasoning based on information in
    the dataset.

> attach(Titanic3)
> which(duplicated(name))
[1] 727 926
> Titanic3[c(727,926),]
    pclass survived                 name    sex age sibsp parch ticket   fare cabin
727      3        0 Connolly, Miss. Kate female  30     0     0 330972 7.6292
926      3        0     Kelly, Mr. James   male  44     0     0 363592 8.0500
    embarked boat body home.dest
727        Q        NA   Ireland
926        S        NA
> Titanic3[name==name[727],]
    pclass survived                 name    sex age sibsp parch ticket   fare cabin
726      3        1 Connolly, Miss. Kate female  22     0     0 370373 7.7500
727      3        0 Connolly, Miss. Kate female  30     0     0 330972 7.6292
    embarked boat body home.dest
726        Q   13   NA   Ireland
727        Q        NA   Ireland

## Two names are duplicated (rows 727 and 926).
## Of the two Kate Connollys, the ages and ticket IDs are quite different, and
## only one survived. Similarly for the two James Kellys (although neither
## survived, they embarked at different ports).
## So despite the duplicates in these common Irish names, there were 1309 distinct
## people. None of the names are missing or blank.

(b) As is true of many datasets, this one has some missing (NA) values in
    numeric columns and blanks in character columns (which are also
    "missing" in the same sense). Designate each column as numeric or
    character, and count the missing values. Also count the individual
    records (rows) of the data frame for which there are respectively
    1, 2, ... missing fields (and what is the maximum?).

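The counting asked for in (b) can be tried first on a small made-up data frame. This is only a sketch: `toy` and all of its values are hypothetical and are not taken from Titanic3. It treats NA in numeric columns and "" in character columns as missing, then uses colSums() and rowSums() on the resulting logical matrix.

```r
# Toy data frame (hypothetical values) with NAs in numeric columns
# and blanks in character columns.
toy = data.frame(age   = c(22, NA, 30, NA),
                 fare  = c(7.75, 8.05, NA, 7.63),
                 cabin = c("", "C85", "", ""),
                 boat  = c("13", "", "", "4"),
                 stringsAsFactors = FALSE)

# Logical matrix: TRUE where a field is missing
# ("" for character columns, NA for numeric columns).
miss = sapply(toy, function(col) if (is.character(col)) col == "" else is.na(col))

miss.per.col = colSums(miss)   # missing values in each column
miss.per.row = rowSums(miss)   # missing fields in each record

miss.per.col
table(miss.per.row)            # how many records have 1, 2, ... missing fields
max(miss.per.row)              # largest number of missing fields in any record
```

Building one logical matrix first keeps the per-column and per-row counts consistent with each other, since both are sums over the same array.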
> sapply(Titanic3, class)
     pclass    survived        name         sex         age       sibsp       parch
  "integer"   "integer" "character" "character"   "numeric"   "integer"   "integer"
     ticket        fare       cabin    embarked        boat        body   home.dest
"character"   "numeric" "character" "character" "character"   "integer" "character"
> aux = sapply(Titanic3,class)=="character"
> sapply(Titanic3[,aux], function(col) sum(col==""))
     name       sex    ticket     cabin  embarked      boat home.dest
        0         0         0      1014         2       823       564
> sapply(Titanic3[,!aux], function(col) sum(is.na(col)))
  pclass survived      age    sibsp    parch     fare     body
       0        0      263        0        0        1     1188
> missarr = cbind(is.na(Titanic3[,!aux]), Titanic3[,aux]=="")
> table(apply(missarr,1,sum))

  1   2   3   4   5
186 290 399 278 156

## That is, every row has at least one missing field; 186 have just 1, 290 have 2,
## etc., up to a maximum of 5 missing fields (156 rows).

(c) Tabulate the survival rates, by sex and passenger class (pclass), for
    individuals on the Titanic3 data file.

> attach(Titanic3)   ### needed so that columns can be referenced directly,
                     ### eg as "pclass" rather than "Titanic3$pclass"
                     ### (skip if already attached in part (a): attaching
                     ### twice only produces masking messages)
> tmp = round(table(sex[survived==1], pclass[survived==1])/
              table(sex,pclass), 3)
> dimnames(tmp)[[2]] = paste("Class",1:3,sep="")
> tmp

         Class1 Class2 Class3
  female  0.965  0.887  0.491
  male    0.341  0.146  0.152

## So this does show a striking difference in survival rate by sex and pclass.

(d) Exploratory (descriptive) questions, to answer as time permits:

--- at what young age x does it seem that there was no longer a survival
    advantage to have been aged <= x on the Titanic? Does this seem to
    have varied with passenger class?

> Advantage = function(x) mean(survived[age <= x], na.rm=TRUE) -
                          mean(survived[age>x], na.rm=TRUE)
> adv = numeric(25)
> for(i in 1:25) adv[i] = Advantage(i)
> plot(adv, xlab="Age cutoff", ylab="Survival rate diff",
       main="Survival rate diff for age <=x vs age > x", type="l",col="blue")
> round(adv[15:25],3)
[1] 0.186 0.165 0.138 0.103 0.080 0.067 0.032 0.040 0.034 0.042 0.028

### The advantage plateaus (around .17) up to age 15, then decreases
### sharply to < .05 at age 21.

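The cutoff comparison above relies on attach(Titanic3). A minimal sketch of the same calculation as a function that takes its vectors explicitly is given here, tried on made-up data. The function name `advantage` and the vectors `surv.toy` and `age.toy` are hypothetical and are not part of the original script.

```r
# Difference in survival rate between those aged <= x and those aged > x.
# Records with missing age are excluded from both groups.
advantage = function(x, survived, age) {
  ok = !is.na(age)
  mean(survived[ok & age <= x]) - mean(survived[ok & age > x])
}

# Made-up data (hypothetical): 1 = survived, 0 = did not; one age is missing.
surv.toy = c(1, 1,  0,  1,  0,  0,  1,  0)
age.toy  = c(2, 8, 12, NA, 30, 45, 25, 60)

# Evaluate the difference at several cutoffs at once.
cutoffs = c(5, 10, 12)
adv.toy = vapply(cutoffs, advantage, numeric(1),
                 survived = surv.toy, age = age.toy)
round(adv.toy, 3)
```

With the real data, the call would presumably look like advantage(15, Titanic3$survived, Titanic3$age), which avoids depending on what is currently attached.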
> AdvTabl = function(x)
     table(pclass[survived==1 & age <=x])/table(pclass[age<=x]) -
     table(pclass[survived==1 & age>x])/table(pclass[age>x])
> SurvDifTbl = array(0,c(17,3), dimnames=list(paste("age",5:21,sep="-"),
                     paste("class",1:3,sep="")))
> for(i in 5:21) SurvDifTbl[i-4,] = AdvTabl(i)
> mat2 = table(ceiling(age[age>=5 & age <22]), pclass[age>=5 & age <22])
> dimnames(mat2)[[2]] = paste("#class",1:3,sep="")
> cbind(round(SurvDifTbl,3), mat2)
       class1 class2 class3 #class1 #class2 #class3
age-5   0.030  0.596  0.272       0       1       4
age-6   0.114  0.598  0.247       1       1       4
age-7   0.114  0.601  0.237       0       1       3
age-8   0.114  0.611  0.215       0       4       2
age-9   0.114  0.611  0.208       0       0      10
age-10  0.114  0.611  0.176       0       0       4
age-11  0.166  0.611  0.155       1       0       3
age-12  0.166  0.616  0.159       0       2       2
age-13  0.200  0.619  0.157       1       1       3
age-14  0.225  0.583  0.159       1       2       5
age-15  0.245  0.587  0.170       1       1       6
age-16  0.283  0.519  0.167       3       2      14
age-17  0.242  0.502  0.135       4       3      13
age-18  0.237  0.355  0.125       6       9      24
age-19  0.188  0.305  0.108       5       9      18
age-20  0.188  0.340  0.093       0       4      19
age-21  0.190  0.277  0.065       5       8      29

## Columns class1-class3 give the survival-rate difference (age <= x minus
## age > x) within each class at cutoff x; columns #class1-#class3 give the
## number of passengers of that age in each class.
## The advantage is strongest in class 2!
## But class 3 is where most of the children were.

--- what were the home-destinations of Titanic passengers, and the average
    ages of passengers and survival rates corresponding to each?

    OMITTED AS PER EMAILED INSTRUCTIONS

--- does the survival rate of passengers within each passenger class seem
    to depend at all on the fare paid?

> classAvgfare = c(mean(fare[pclass==1]), mean(fare[pclass==2]),
                   mean(fare[pclass==3], na.rm=TRUE))
> round(classAvgfare,2)
[1] 87.51 21.18 13.30
> mat3 = table((fare > classAvgfare[pclass])[survived==1], pclass[survived==1])/
         table(fare > classAvgfare[pclass], pclass)
> dimnames(mat3) = list(c("LowerFare","HigherFare"), paste("Class",1:3,sep=""))
> round(mat3,3)
           Class1 Class2 Class3
LowerFare   0.575  0.343  0.243
HigherFare  0.722  0.559  0.284

### So those who paid an above-average fare for their pclass did have higher
### survival rates in classes 1 and 2, but there was not much of a similar
### effect in class 3.
### Whether these differences were "significant" is a topic for more
### formal investigation ...
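
One way such a formal investigation might begin is a test of equal survival proportions between the two fare groups within a single class. The sketch below uses base R's prop.test() on made-up counts: the vectors `survivors` and `totals` are hypothetical and are NOT Titanic3 results. Applying it to the real data would need the actual numbers of survivors and passengers in each fare group.

```r
# Hypothetical counts for ONE passenger class (made-up numbers):
# survivors and group sizes for the lower-fare and higher-fare groups.
survivors = c(LowerFare = 40, HigherFare = 30)
totals    = c(LowerFare = 100, HigherFare = 50)

# Two-sample test of equal proportions (chi-squared based, with
# continuity correction by default).
res = prop.test(survivors, totals)

res$estimate   # the two sample proportions
res$p.value    # p-value for the hypothesis of equal survival rates
```

A separate test per class keeps the comparison within pclass, as in the table above. Other choices, such as fisher.test() on the 2x2 table or a logistic regression of survival on fare, would be equally reasonable starting points.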