HW3 solution, Due Wednesday September 14, 2016.
-----------------------------------------------

(a) Define and test an R function "framfix" which takes as a single
    argument a data-frame (infram) and does the following:
    (o) it coerces all boolean columns to numeric, and then:
    (i) it changes all missing (NA) values in numeric columns to 0;
    (ii) it tabulates the numbers of distinct values of each character
         column, and puts this into a vector; and tabulates the
         frequencies of the different values for each character column,
         and puts this information into a list with number of components
         equal to the number of character columns; and
    (iii) it tabulates means and variances of the numeric columns into a
          Kx2 matrix, where K = number of numeric columns.
    The output should be a list (with appropriately named components).

SOLUTION:

## The following code stops with a message in case any of the columns are
## other than "numeric","integer","character", or "logical", but it
## would have been OK simply to ignore those columns and proceed.

> framfix = function(infram) {
     colclass = sapply(infram, class)
     oddclass = setdiff(unique(colclass), c("numeric","integer",
                        "character","logical"))
     if(length(oddclass) > 0) {
        cat("Unusual class types:",oddclass,"!! \n", sep=" ")
     } else {
        ## (o) coerce logical columns to numeric and relabel their class
        for(i in which(colclass=="logical"))
            infram[,i] = as.numeric(infram[,i])
        colclass[colclass=="logical"] = "numeric"
        numids = which(colclass %in% c("integer","numeric"))
        Num.MeanVar = array(0,c(length(numids),2), dimnames=list(numids,
                        c("Mean","Var")))
        ## (i),(iii) replace NA by 0, then tabulate mean and variance
        for(i in seq_along(numids)) {
            k = numids[i]
            infram[is.na(infram[,k]), k] = 0
            Num.MeanVar[i,] = c(mean(infram[,k]), var(infram[,k])) }
        ## distinct values of each numeric column (after NA replacement)
        Num.Distinct = lapply(infram[numids], unique)
        names(Num.Distinct) = paste("col.",numids,sep="")
        ## (ii) frequency table for each character column
        Freq.Charcols = NULL
        charids = which(colclass=="character")
        for(i in charids) Freq.Charcols = c(Freq.Charcols,
                        list(table(infram[,i])) )
        names(Freq.Charcols) = paste("col",charids,sep=".")
        list(Num.Distinct=Num.Distinct, Freq.Charcols=Freq.Charcols,
             Num.MeanVar = Num.MeanVar) } }

> toyfram = cbind.data.frame(V1=1:4, V2=letters[5:8], V3=c(3,NA,-2,1),
                V4=c(4,1,NA,0), V5=rep(c(T,F),2))
> toyfram
  V1 V2 V3 V4    V5
1  1  e  3  4  TRUE
2  2  f NA  1 FALSE
3  3  g -2 NA  TRUE
4  4  h  1  0 FALSE

> framfix(toyfram[,-5])
Unusual class types: factor !!
    ## V2 is a factor here because R versions before 4.0 convert
    ## character strings to factors by default in data frames.

> toyfram[,2] = as.character(toyfram[,2])
> framfix(toyfram)
$Num.Distinct
$Num.Distinct$col.1
[1] 1 2 3 4

$Num.Distinct$col.3
[1]  3  0 -2  1

$Num.Distinct$col.4
[1] 4 1 0

$Num.Distinct$col.5
[1] 1 0


$Freq.Charcols
$Freq.Charcols$col.2

e f g h
1 1 1 1


$Num.MeanVar
  Mean       Var
1 2.50 1.6666667
3 0.50 4.3333333
4 1.25 3.5833333
5 0.50 0.3333333

(b) Suppose you are given a 3-way array Xarr and want to compute the
    2-dimensional array:
        ( sum_{j,k} X[i,j,k] X[i',j,k] )_{i,i'}
    indexed by i,i'. Do this inside a function, two ways: (1) using a
    for-loop, and (2) using vector/matrix commands [Hint: this may involve
    re-formatting 3-way arrays into 2-way arrays !] Also inside the same
    function, insert code to record the system.time taken by the two
    methods of calculation. Give the code for your function, and do some
    tests on small arrays to show that your function computes correctly
    on small 3-dim arrays.
    Finally, apply your function to the array
        array(runif(8e6), c(200,160,250) )
    and verify that your two ways of computing the 200 x 200 output give
    the same result (to within your machine accuracy).

SOLUTION:

> SumCalc = function(Xarr) {
     dims = dim(Xarr)
     ## Method 1: double for-loop over the index pairs (i,i')
     tim1 = system.time({
        Frstsum = array(0, c(dims[1],dims[1]))
        for(i in 1:dims[1]) for(ip in 1:dims[1])
           Frstsum[i,ip] = Frstsum[i,ip] + sum(Xarr[i,,]*Xarr[ip,,]) })
     ## Method 2: flatten the (j,k) indices into one, then a matrix product
     tim2 = system.time({
        Xaux = array(Xarr, c(dims[1],dims[2]*dims[3]))
        Scndsum = Xaux %*% t(Xaux) })
     list(Sum1 = Frstsum, Sum2 = Scndsum, times=c(tim1[1],tim2[1]))}

> SumCalc( array(1:18,c(3,3,2)) )
$Sum1
     [,1] [,2] [,3]
[1,]  591  642  693
[2,]  642  699  756
[3,]  693  756  819

$Sum2
     [,1] [,2] [,3]
[1,]  591  642  693
[2,]  642  699  756
[3,]  693  756  819

$times
user.self user.self
        0         0

> tmp = SumCalc(array(runif(8e6), c(200,160,250)))
> sum(abs(tmp[[1]]-tmp[[2]]))
[1] 1.842061e-06
    ### close enough to 0: this sums 200 x 200 = 4e4 absolute
    ### differences between entries that are each of order 1e4

> tmp$times
user.self user.self
   192.24      2.57
    ### roughly a 75-fold speed advantage for the built-in matrix operations
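
As a hedged aside on the matrix method in part (b): the sketch below repeats the reshaping step with tcrossprod() in place of Xaux %*% t(Xaux), and spot-checks one entry against the defining double sum using all.equal(). The function name SumCalcFast and the variables Xsmall, Fast and direct12 are hypothetical names introduced only for this illustration; they are not part of the solution above.

```r
# Hypothetical helper (not part of the original solution): the matrix
# method of SumCalc, written with tcrossprod().
SumCalcFast <- function(Xarr) {
  dims <- dim(Xarr)
  # R stores arrays in column-major order (i varies fastest), so element
  # (i,j,k) lands in row i, column j + (k-1)*dims[2] of the reshaped matrix.
  Xaux <- array(Xarr, c(dims[1], dims[2] * dims[3]))
  tcrossprod(Xaux)   # computes Xaux %*% t(Xaux)
}

Xsmall <- array(1:18, c(3, 3, 2))
Fast <- SumCalcFast(Xsmall)

# Spot-check one entry against the defining sum over (j,k).
direct12 <- sum(Xsmall[1, , ] * Xsmall[2, , ])

# all.equal() compares up to a relative tolerance (about 1.5e-8 by default),
# which suits floating-point results better than an exact == test.
stopifnot(isTRUE(all.equal(Fast[1, 2], direct12)))
print(Fast)
```

On the small test array this reproduces the 3 x 3 matrix printed above. For the large array, a call such as all.equal(tmp$Sum1, tmp$Sum2) would be one way to state the "within machine accuracy" check as a relative rather than absolute comparison.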