BASIC COMMANDS IN R FOR DATA ENTRY AND COMPUTATIONS
OF LIFE TABLES AND ESTIMATORS, WITH ILLUSTRATIONS
================================================== 9/7/05
PRELIMINARY
Start R by simply typing "R" at a unix command-prompt,
within the directory where you would like to save your
R workspace, or by double-clicking an R icon in Windows.
When you exit R, by typing
> q()
if you answer "yes" or "y" to the save request, the
workspace will be saved in ".RData" and will
automatically be re-loaded when you start R again within
the same directory. (In Windows you can save your R
workspace by going to "save" under the File Menu, and
choosing a directory and naming a file with suffix
".RData" to save into, which can then re-open in future
by double-clicking on it.)
Note that R commands are given following command prompt >
and that text within a line following "#" is ignored,
and that continuation lines begin with "+".
> .First <- function() {
+ options(editor="emacs")
+ library(survival)
+ help.start() }
## Putting this function with this name into your saved
## workspace means that it is called immediately whenever
## you re-open the workspace in a new R session.
### To call the initializing function ".First" directly:
> .First()
Making links in ~/.R ...
If netscape is already running, it is *not* restarted, and you must
switch to its window.
Otherwise, be patient ...
---------------------------------------------------
(1) R objects are named and assigned with the assignment operator
"<-" . Some examples of R objects are
VECTORS
> x <- 1:10
> y <- c(3, 7, -4, 2, 4.555, 20)
> z <- runif(100)
> L <- c("a","B","Wr")
Here x is a 10-dimensional vector with entries 1,2,3,...,10;
y is a 6-dimensional vector with indicated entries;
z is a vector of 100 randomly generated uniform variates;
and L is a 3-dimensional vector whose entries are
character-strings.
Refer to vector entries in square-brackets, eg "x[3]".
Define subvectors by referring to a vector of integer
entries, e.g.
> y[c(1,3,6)]
[1] 3 -4 20
MATRICES
> M <- matrix( 1:10, ncol=2) ### defines a 5x2 matrix M
> M
[,1] [,2]
[1,] 1 6
[2,] 2 7
[3,] 3 8
[4,] 4 9
[5,] 5 10
> M[3,2]
> M[3,2]
[1] 8
> M[2:4,]
[,1] [,2]
[1,] 2 7
[2,] 3 8
[3,] 4 9
LISTS
R objects can conveniently be strung together
into lists, which means that a number of related
vectors or matrices (or other R objects, including
lists!) can be given a single name and worked with
as a unit.
This is useful because the "object-oriented" style
of R typically makes the output from any statistical
function in R a list which may then serve as input
to another statistical function (either a pre-coded
one, or one which you may code yourself).
Thus:
> LL <- list(x=x, y=y, Ran = round(z[1:7],5))
> LL
$x
[1] 1 2 3 4 5 6 7 8 9 10
$y
[1] 3.000 7.000 -4.000 2.000 4.555 20.000
$Ran
[1] 0.76233 0.08557 0.35093 0.27004 0.11411 0.50839 0.25646
The "x", "y", and "Ran" names are used to refer
to the three list components; thus LL$x is the vector 1:10,
LL$y is the 6-vector with same entries as y defined above,
and LL$Ran is the vector of the first 10 entries of z rounded
to 5 decimal places.
DATA FRAMES
A data frame is an object like a matrix (with entries that
can be referred to in the same way, eg: Dfram[3,8]) but
also with the structure of a list. So if the 3rd column of
the data-frame Dfram has header or name "ndth", then the
whole column can be referred to either as Dfram[,3] or as
Dfram$ndth. One useful aspect of data-frames is that,
unlike matrices which must be either all numeric or all
boolean or all character, you can have a data-fram with
different colmuns of different types, e.g.
> Dfram <- cbind.data.frame(newx=x[4:9], Ran2=
round(z[11:16],4), Good=c(T,T,F,T,F,T), label=c(L,L))
> Dfram
newx Ran2 Good label
1 4 0.7302 TRUE a
2 5 0.0851 TRUE B
3 6 0.8456 FALSE Wr
4 7 0.4622 TRUE a
5 8 0.6959 FALSE B
6 9 0.8326 TRUE Wr
> Dfram$Ran2
[1] 0.7302 0.0851 0.8456 0.4622 0.6959 0.8326
---------------------------------------------------
(2) DATA ENTRY.
See the script at
http://www.math.umd.edu/~evs/Data/ImportDat.txt
for some illustrations.
The R function scan("indata.txt") reads all of the
entries (assumed separated by blanks) as a long
numeric vector if possible, or as a vector of
character-strings if there is any character data in
the input file (as there generally will be if you do
not skip initial header
lines).
If the ASCII input file is in a tabular form, with
column headers in the first line, and with exactly
the same number of entries in each line (and with
character data always appearing in the
same columns if at all), then the command
> newfram <- read.table("indata.txt", header=T)
will enter it into a data-frame named "newfram".
---------------------------------------------------
(3) LIFE TABLE AND SURVIVAL-RELATED FUNCTIONS
Let's make a sample survival dataset using random
numbers, beginning with a specified seed so that
you can reproduce the commands.
> .Random.seed
[1] 1 695938663 850815041
> NewSeed <- .Random.seed
In order to use the same seed, you should
issue the command
> .Random.seed <- c(1, 695938663 850815041)
Now we create a vector T consisting of 20 unordered
exponentially distributed variates, and a vector Dth
consisting of Binom(1,.7) variates.
> T <- rexp(20)
Dth <- rbinom(20,1,.7)
> Dth
[1] 1 0 0 1 1 1 0 1 1 1 1 0 1 1 1 1 1 1 0 1
> round(T,6)
[1] 1.285517 1.217859 0.710261 1.941706 1.493600 0.261126 1.149667 1.223809
[9] 0.519391 1.752205 1.728407 0.671232 0.851777 1.269756 0.394057 1.331550
[17] 0.175630 1.498749 0.001993 0.113999
Two of the very useful R commands related to survival data
are Surv and survfit, both within the "survival" package
to which you have access after the command
> library(survival)
which you must either type directly or put into your
initializing function .First() as above.
"Surv" creates a "survival object" by marking with a "+"
those times that are censored, ie those that correspond to
death-indicator 0:
> Surv(T,Dth)
[1] 1.285517101 1.217858786+ 0.710260889+ 1.941706190 1.493600235
[6] 0.261125644 1.149666859+ 1.223808518 0.519391177 1.752204616
[11] 1.728407180 0.671232499+ 0.851776768 1.269756441 0.394056911
[16] 1.331549701 0.175629556 1.498749230 0.001993308+ 0.113998710
Then survfit creates a list with various useful life-table
elements and estimates, using the survival object as input:
> Slist <- survfit(Surv(T,Dth))
> names(Slist)
[1] "n" "time" "n.risk" "n.event" "surv" "type"
[7] "std.err" "upper" "lower" "conf.type" "conf.int" "call"
At this point, you should use the fact that the $time, $n.event,
and $n.risk list components are respectively the ordered distinct
survival times, the number of observed deaths (with Dth=1) at
those times, and the number-at-risk:
> round(Slist$time,5)
[1] 0.00199 0.11400 0.17563 0.26113 0.39406 0.51939 0.67123 0.71026 0.85178
[10] 1.14967 1.21786 1.22381 1.26976 1.28552 1.33155 1.49360 1.49875 1.72841
[19] 1.75220 1.94171
> Slist$n.ev
[1] 0 1 1 1 1 1 0 0 1 0 0 1 1 1 1 1 1 1 1 1
> Slist$n.risk
[1] 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1
If we want to restrict these vectors to the times at which observed
deaths occur, we make an index-vector
> ind <- (1:20)[Slist$n.ev==1]
> ind
[1] 2 3 4 5 6 9 12 13 14 15 16 17 18 19 20
> round(Slist$time[ind],4)
[1] 0.1140 0.1756 0.2611 0.3941 0.5194 0.8518 1.2238 1.2698 1.2855 1.3315
[11] 1.4936 1.4987 1.7284 1.7522 1.9417
> Slist$n.ev[ind]
[1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
> Slist$n.risk[ind]
[1] 19 18 17 16 15 12 9 8 7 6 5 4 3 2 1
## You should check these commands and results to make
## sure that you understand them.
NOTE: Slist$surv contains the Kaplan--Meier estimated
(right-continuous) survival function estimator at the
times in Slist$time.
-------------------------------------------------------------
(4) ILLUSTRATION USING SIMULATED SURVIVAL DATA,
INCLUDING PLOTTING
==============================================
Here are some steps in R to reproduce the Table 1.4 and
Figure 1.2 exhibits relating to the data in Table 1.1,
Chapter 1 of Kalbfleisch and Prentice.
> Tbl1.1fr <- cbind.data.frame( Group=c(rep(1,19), rep(2,21)),
Dth=c(rep(1,17),0,0,rep(1,19),0,0), Times=
c(143, 164, 188, 188, 190, 192, 206, 209, 213, 216, 220,
227, 230, 234, 246, 265, 304, 216, 244,
142, 156, 163, 198, 205, 232, 232, rep(233, 4), 239,
240, 261, 280, 280, 296, 296, 323, 204, 344))
> Ex1.1Gp1 <- survfit(Surv(Tbl1.1fr$Times[1:19],
Tbl1.1fr$Dth[1:19]))
Ex1.1Gp2 <- survfit(Surv(Tbl1.1fr$Times[20:40],
Tbl1.1fr$Dth[20:40]))
### To get Table 1.4 outputs: first for Group 1
> round(cbind(ti=Ex1.1Gp1$time, di=Ex1.1Gp1$n.ev,
ni=Ex1.1Gp1$n.ri, SKMti=Ex1.1Gp1$surv,
SKMvar=(Ex1.1Gp1$surv*Ex1.1Gp1$std.err)^2),5)
### and a similar command works for Group 2.
### Now for the over-plotted pictures
In general, plot(x,y) sets up axes and plots points
(x[i],y[i]) for two vectors of equal lengths. Additional
character-string arguments xlab, ylab, and main to the
"plot" function respectively control the x-axis label,
y-axis label and main title of the graph.
If you want to plot lines, use the argument type="l";
if stairsteps type="s" (which is what we want in a
survival-function context). Plotting-characters for
plotted points are controlled by the pch argument,
for plotted lines by lty (=1 for solid line, the
default, or =2 or higher for dashed and dotted lines).
Finally, to overplot you should issue subsequent commands
with "points(x,y)" to plot new points on the same axes,
or "lines(x,y)" to plot new lines.
Here are the commands used to generate a graph like
Figure 1.2, p.17, in R.
> plot(Ex1.1Gp1$time, Ex1.1Gp1$surv, xlab="Time",
ylab="Surv Prob", type="s", lty=1, xlim=c(0,325),
main="Carcinogenesis Data KM Plots")
lines(Ex1.1Gp2$time, Ex1.1Gp2$surv, type="s", lty=2)
## To add a legend, you type the following in R,
## then click the middle mouse-button over the graph
## at the point where you want ther upper-left of
## the printed legend box to appear. [On PC, first
## click left mouse-button and then right.
> legend(locator(), legend=c("Group 1","Group 2"), lty=1:2)
## After producing a graph, to get rid of the graph window type
> dev.off()
Finally, to do a log-rank test comparing survival in the two
groups:
> survdiff(formula = Surv(Times, Dth) ~ Group, data = Tbl1.1fr)
Call:
survdiff(formula = Surv(Times, Dth) ~ Group, data = Tbl1.1fr)
N Observed Expected (O-E)^2/E (O-E)^2/V
Group=1 19 17 12.2 1.853 3.12
Group=2 21 19 23.8 0.954 3.12
Chisq= 3.1 on 1 degrees of freedom, p= 0.0772
## Here the Log-Rank "numerator" O-E for Gp2 vs Gp1 survival
## (with positive numbers indicating "excess deaths in Gp1
## which implies Gp2 superiority with respect to survival)
## is 17-12.2 = 4.8, and the variance can be found as
## 3.12/(4.8^2) = 0.1354. The signed logrank statistic is
## (O-E)/sqrt(V) = sqrt(3.12) = 1.766352, which is treated as
## a standard normal deviate. We can recover the p-value,
## except for errors due to rounding in the table above, as
> 2*(1-pnorm(1.76635))
[1] 0.0773