Instructor: Professor Eric Slud, Statistics Program, Math Dept., Rm 2314, x5-5469, slud@umd.edu
Office hours: M 1-2, W 11-12 (initially), or email me to make an appointment (can be on Zoom).
Please fill out the on-line Evaluation form on this Course and instructor at http://CourseEvalUM.umd.edu. Thank you.
SAMPLE PROBLEMS FOR IN-CLASS TEST: from 2005 and 2008, and Practice Problems 2021.
Syllabus
Course Overview: The topic of the course is the statistical analysis of data on
lifetimes or durations. Such data often have the feature of being (right-) censored,
where subjects may leave the study at random times (and in some cases return) and those who
are in the study and have not died at the ending time of the study are simply recorded as
being still alive. Another possible data pattern is truncation, where subjects enter
the study (at a recorded time) only if they meet some criterion which may involve an
age-variable or time since diagnosis or other preliminary event. Such data arise
frequently in clinical trials, epidemiologic studies, reliability tests, and insurance. We
first present parameterizations of survival distributions, in terms of hazard intensities,
which lend themselves to the formulation of parametric models, including regression-type
models which relate failure-time distributions to auxiliary biomedical predictors. The
special features of truncation or censoring present unique challenges in the formulation
of likelihoods and efficient estimation and testing in settings where the distributions
of arrival-times and withdrawal-times are unknown and not parametrically modelled.
This statistical topic has achieved great prominence in the theoretical statistical
literature because it is a particularly good arena for the introduction of techniques
of estimating and testing finite-dimensional parameter values --- such as a treatment-
effectiveness parameter in clinical studies --- in the presence of infinite-dimensional
unknown parameters. Such problems are called Semiparametric.
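As a reminder of the standard notation behind the overview above, the relation between hazard and survival function, and the usual likelihood for right-censored data, can be sketched as follows. The symbols ($h$, $f$, $S$, the observed pairs $(t_i, \delta_i)$ with $\delta_i = 1$ for an observed failure, and the parameter $\theta$) are notational assumptions introduced here, and the likelihood shown is the textbook form under independent, noninformative right-censoring with a continuous failure-time distribution.

```latex
h(t) = \frac{f(t)}{S(t)}, \qquad
S(t) = \exp\Big( -\int_0^t h(u)\, du \Big), \qquad
L(\theta) = \prod_{i=1}^{n} f(t_i;\theta)^{\delta_i}\, S(t_i;\theta)^{1-\delta_i}
          = \prod_{i=1}^{n} h(t_i;\theta)^{\delta_i}\, S(t_i;\theta)
```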
Klein, J. and Moeschberger, M. (2003) Survival Analysis: Techniques for
Censored and Truncated Data, 2nd ed. Springer
Prerequisites: Stat 410 and either Stat 420 or Stat 700. The
presentation will be geared to second-year Stat grad students.
Grading: The course grade will be based 50% on 6 or 7 homework problem sets, 25% on an in-class test (tentatively scheduled for October 29), and 25% on a course project or paper at the end. The homework problems will be a mixture of theoretical problems at Stat 410/Stat 700 level, and of computational or data-analysis problems. Grading of these problems will also be based on the quality of verbal descriptions and interpretations of results: submitted analyses presenting only undigested numerical output will be graded down. The in-class test will be designed to test (i) definitions (of models and distributions and statistics), (ii) ability to use model definitions to construct likelihoods (and partial likelihoods) and derive statistics from them, and (iii) basic properties of estimators and test-statistics studied in class.
President Pines provided clear expectations to the University about the wearing of masks for students, faculty, and staff. Face coverings over the nose and mouth are required while you are indoors at all times. The only allowed exception when it comes to classrooms and laboratories is for course instructors while they are teaching and adequately distanced from the class. Students not wearing a mask will be given a warning and asked to wear one, or will be asked to leave the room immediately. Students who have additional issues with the mask expectation after a first warning will be referred to the Office of Student Conduct for failure to comply with a directive of University officials.
(1) Kalbfleisch, J. and Prentice, R. (2002) The Statistical Analysis of Failure Time Data, 2nd ed. Wiley
This book was used once for the course. Its explanations are harder and less straightforward, but often more intuitive.
(2) Another very useful and readable recommended text (reissued in 1998
as a paperback and currently as an e-book)
(3) An easier book that can be used for self-study and review, free to students as an e-book through the UMD library:
(4) For the more mathematically inclined, a primarily
theoretical text by two former Maryland students:
Coverage of the Klein & Moeschberger book will be Chapters 1-9,
plus a few miscellaneous topics. The main topics are:
Klein & Moeschberger is a very methods-oriented book, and will be covered along with R software implementation with real-data examples. The Miller book explains things well and gives good background and literature references. For additional mathematical justifications, including the connection with counting processes and martingales, I will draw additional material from Fleming and Harrington, my own notes, and the research literature. Other data examples, and more sophisticated data analyses, can be found in the Kalbfleisch and Prentice book (get the data from the R package KMsurv).
Computing in the course can be done with R, SAS, or any other package you are familiar with that also has preprogrammed Survival Analysis modules. However,
R is by far the best choice if you want guidance and/or help from me, and if you want access to the newest methods from the research literature. Various datasets can be explored and accessed within existing R packages and libraries, e.g. by issuing the command
> data() after > library(survival) or [for all datasets from the Klein and Moeschberger book including its exercises] after library(KMsurv). Whatever package you choose, you can get computing help, datasets, and further links here. In particular you can get lots of survival datasets, including some that were in the Kalbfleisch and Prentice book, by clicking here and searching for the keyword "survival".
See the Handouts section below for a link to the "Rbasics" file connected with the data analysis tasks needed for Homeworks. For the systematic Introduction to R and R reference manual distributed with the R software, either download from the R website or simply invoke
the command > help.start() from within R. For a slightly
less extensive introductory tutorial in R, click here. A very handy reference card containing R commands can be found here.
For a directory of R Scripts relevant to this course, click here. Handouts can be found at linked pages for each of the following topics:
(0) Basics on R commands for data entry and Life Tables and Life Table construction. In addition, various useful files on Statistical Computing can be found at my course web-page for Stat 705, along with additional relevant links.
(1) UPDATED Nelson-Aalen & Kaplan-Meier calculation
(2) UPDATED Illustrative R Script for Survival Curves, Hazards, Medians, and SE's.
(3) UPDATED Nelson-Aalen calculation for left-truncated right-censored data.
(4) UPDATED Script and Illustrative Picture on model fitting of VA Lung-Cancer data in R. This Script and picture also contain material about fitting and plotting the Cox Model for the same dataset and comparing the results to the previous accelerated failure time parametric regression model.
(5) UPDATED R calculations for weighted logrank (2-sample) test. Also available is a New Script-file on Stratified and K-sample Logrank statistics using "survdiff" UPDATED for F21.
(6) New illustration of Stratified versus interaction-term tests of difference between coefficients in subgroups of a survival dataset. This R script and picture explain in the example of a Mayo lung-cancer study that there are differences between the coefficient for a baseline health index ("Karnofsky score") for the two sexes in the study, but that these differences are obscured if an assumption of common baseline hazard for both sexes is made.
(7) Handout containing UPDATED R Log on Self-Consistency Property of Kaplan-Meier Estimator and Redistribute-to-the-Right Algorithm and UPDATED Coding for Turnbull (1974) self-consistent estimator of survival-distribution in double-censored survival data.
(8) R Script for Time-dependent Cox-Model fitting, illustrated with data analysis of Mayo-Clinic Lung Cancer Data. An UPDATED version is now available in the RScripts directory, here.
(9) R script for calculating Partial Likelihoods in (non-time-dependent) Cox-model. This includes calculations with risk-groups. The script will later be
augmented to include the calculation of score statistics for individual coefficients.
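For orientation, the main estimators named in the handout list above can be written in their standard textbook forms. This is a sketch under stated assumptions, not a transcription of the handouts: $t_j$, $d_j$, $l_j$ denote the ordered distinct death-times, numbers of deaths, and numbers at risk, while the covariate vectors $z_i$, the risk set $R(t)$, the failure indicators $\delta_i$, and the coefficient vector $\beta$ are notation introduced here. The partial likelihood is shown for the case of no tied death-times.

```latex
\hat{\Lambda}_{NA}(t) = \sum_{j:\, t_j \le t} \frac{d_j}{l_j}, \qquad
\hat{S}_{KM}(t) = \prod_{j:\, t_j \le t} \Big( 1 - \frac{d_j}{l_j} \Big), \qquad
PL(\beta) = \prod_{i:\, \delta_i = 1}
  \frac{\exp(\beta^{\top} z_i)}{\sum_{k \in R(t_i)} \exp(\beta^{\top} z_k)}
```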
Chapter 1. Introduction: Terminology, data structures & examples. 1 class, 8/30
Chapter 2. Failure Time models.
Chapter 3. Censored-Data Parametric Inference & Likelihoods.
4 classes, 9/13 - 9/20
Chapter 4. Nonparametric survival-curve estimation.
Chapter 5. Estimates for other censoring schemes.
Chapter 7. Rank statistics for 1- and 2-sample Tests.
Chapter 8. Relative Risk Regression Models.
Chapter 6. Other estimation techniques.
Chapter 9. Stratified & Time-Dependent Covariate Cox models.
3 classes, 11/12 - 11/17
Chapter 10. Extended Survival Regression Models.
Problem Set 1, Due Monday Sept. 13, 2021. (6 Problems in all, worth 10 points each.)
Problem Set 2, Six Problems in all, Due Tuesday Sept. 28, 2021 (11:59pm).
Problem Set 3, Due Sunday October 17, 2021, 11:59pm.
Problem Set 4, due Monday October 25, 2021, 11:59pm. (3 Problems, 30 points) plus an optional 5-pt Extra-credit problem.
Problem Set 5, due Thursday November 18, 2021, 11:59pm. (6 Problems, 60 points).
Problem Set 6, due Saturday December 11, 2021, 11:59pm. (6 Problems, 60 points).
I have created a data-file Lymphom.dat which you can use in your project. It is large, with 31689 records of 13 columns each, subsetted and re-coded from the National Cancer Institute's SEER database of Lymphoma cancer cases from 1973-2001. The file can be inputted with read.table and you will get the proper column-headers if you use the option header=T. You may certainly subset it further in any analyses you do and write up. Details concerning the records retained, the variables chosen, their meanings and the way I re-coded them, can be found here.
Guidelines for the Final Project.
As will be discussed in class, the culminating work for the course, beyond HW and the in-class Test, is a take-home course project which is to consist of a 10-12 page paper based on an original data analysis using the ideas covered in the course, to be handed in before 11:59pm, Saturday December 18, 2021. You may find data anywhere. I suggest that you find a survival dataset with enough structure (eg, regression variables, clear hypothesis of interest like treatment effectiveness in a two-group clinical trial) and sufficient sample-size so that it would make sense to try a few different survival analyses and compare the results. You will be graded on appropriateness and interest of the analyses and especially on the clarity and reasonableness of the conclusions (and/or comparisons among conclusions from different methods) that you reach. Your 10-12 pages (excluding data and plots) should explain clearly the models and assumptions and conclusions in a readable narrative. You may hand in (but preferably give URL for) data, intermediate statistical results, and summary displays such as plots and/or histograms, but I do not want to be given any undigested outputs. That is, any such computed outputs should be presented as exhibits, with specific references to such material and suitable interpretations given in the text of your paper.
If you want to do anything other than a data analysis and narrative for your paper (eg, simulation study or exploration of theoretical and illustrative material on additional methods not covered in the course), such an alternative may be OK, but you must see me about it to get it approved first !!
© Eric V Slud, December 9, 2021.
Homework Guidelines: Homework papers are to be worked on individually, except that you may share verbal hints (or get such hints from me) about how to approach a problem. Working together or sharing computed results or written work is a violation of the Code of Academic Integrity and will be reported. You are to hand in HW papers electronically as pdfs posted to ELMS. If you create the homework paper by using text files containing R scripts, graphical outputs or scanned files, then I recommend that you import these into MS Word and save the document as a single pdf before submitting it. Multiple-document submissions will not be acceptable.
About in-class Masking for Fall 2021
Academic Integrity and HONOR CODE
The University of Maryland, College Park has a nationally recognized Code of Academic Integrity, administered by the Student Honor Council. This Code sets standards for academic integrity at Maryland for all undergraduate and graduate students. As a student you are responsible for upholding these standards for this course. It is very important for you to be aware of the consequences of cheating, fabrication, facilitation, and plagiarism. For more information on the Code of Academic Integrity or the Student Honor Council, please visit http://www.shc.umd.edu.
To further exhibit your commitment to academic integrity, remember to sign the Honor Pledge on all examinations: "I pledge on my honor that I have not given or received any unauthorized assistance on this examination (assignment)."
The following material and handouts were produced 10+ years ago, when the Klein & Moeschberger text had a web-site from which data could be downloaded. In the intervening years that website was taken down and some of the important R functions in the survival package (such as survfit) have changed in important ways. The datasets can now be found in the R package KMsurv, and the R functions are mostly still usable with minor changes. But these handouts have to be refreshed and updated to be completely current. I am gradually doing that, and will mark with the label UPDATED those R Scripts below for which I have made these modifications in Fall 2021.
NOTE: to get started using survival-related functions in R, you need to "load" the R survival package, which is accomplished by the command: > library(survival)
Do problems # 2.3, 2.9 (the times to substitute are 12, 24, and 60 months), 2.10, 2.16, and 2.20 from the end of Chapter 2 in the Klein and Moeschberger text.
Also to be handed in: using the data in Table 1.2 of the book, create a life table, with rows
corresponding to ordered increasing infection times within each of the two ("Surgically Placed Catheter" and "Percutaneous Placed Catheter") groups, showing the number of "failures" (=infections) occurring
at that time, and the number at risk (ie individuals within the group who are neither infected nor censored before that time).
See the R script at LifeTab.RLog for an indication of how to build Life Tables using R commands.
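The life-table bookkeeping described above (number of failures at each distinct failure time, and number at risk just before it) can be sketched in a few lines. This is only an illustration in plain Python on made-up data; the function name `life_table` and the toy values are hypothetical, and the course's own R script LifeTab.RLog remains the reference for how to do this in R.

```python
from collections import Counter


def life_table(times, events):
    """Return rows (time, n_at_risk, n_failures) at each distinct failure time.

    times  : observed times (failure or censoring) for one group
    events : 1 if the observed time is a failure, 0 if it is censored
    """
    # Count failures at each distinct failure time.
    failures = Counter(t for t, e in zip(times, events) if e == 1)
    rows = []
    for t in sorted(failures):
        # At risk at t: neither failed nor censored strictly before t.
        n_at_risk = sum(1 for u in times if u >= t)
        rows.append((t, n_at_risk, failures[t]))
    return rows


# Hypothetical toy data (NOT taken from Table 1.2 of the text).
toy_times = [2, 3, 3, 5, 7, 8, 8, 10]
toy_events = [1, 1, 0, 1, 0, 1, 1, 0]

table = life_table(toy_times, toy_events)
for row in table:
    print(row)
```

With two groups, the same function would simply be applied to each group's times and event indicators separately.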
Do #3.6, 3.8, 4.1(a)-(e) and (i), 4.2(a)-(c) and (e)-(f). In addition
( 5th problem to hand in ): read Theoretical Note 1 on pp. 56-57 and
show as much as you can of the following statement given there:
if, in a bivariate setting with dependent (X,C) having a joint density, the
function ρ(t) defined on p.56 is known to be identically equal to a constant ρ,
and also the sub-distribution function F_1(t) and event-time survival function S_T(t) are known, then the marginal survival function S_X(t) is uniquely determined, and this survival function depends in a monotonically decreasing way
on ρ.
( 6th problem to hand in ): (a) Suppose that T = λ X^α for some constants α > 1 and λ > 0. Find a formula for h_T(t) in terms of the function h_X(·) and known functions of t (which depend on λ, α). (b) Use (a) to show that if X ~ f(x, β, θ) has h_X(t, β, θ) = e^θ h_0(t, β), then the hazard function of T also factors into a function of θ times a function of (t, β).
In the dataset, you should ignore the "pair" information. The last column is "treat" , a factor (categorical) variable.
for appropriately chosen V(t) = V_n(t) = n · ∑_{j: t_j < t} d_j/(l_j(l_j - d_j)). Now suppose as in (I) that there are two independent samples or groups with separate KM curves \hat{S}_{KM,g}(t), g=1,2. Denote by t_{jk}, d_{jk}, l_{jk} the respective ordered distinct death-times, number of deaths, and numbers at risk for these two samples or groups, and assume (as a null hypothesis H0) that both have the same S_X(t) function. Show how to construct a hypothesis test of H0 at significance level 0.05 based on
where S(t) is a pooled estimator of S_X(t) obtained e.g. as an average of \hat{S}_{KM,g}(t) for g=1,2, and V*(t) a suitable variance estimator.
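To make the ingredients of this problem concrete, here is a minimal sketch of how a single-sample Kaplan-Meier curve and the sum n · ∑ d_j/(l_j(l_j - d_j)) could be computed by hand. It is written in plain Python on made-up data purely as an illustration; the function name `km_and_variance` and the toy values are hypothetical, and it does not construct the two-sample test asked for above. In R the corresponding quantities would normally come from survfit in the survival package.

```python
def km_and_variance(times, events):
    """Return rows (t_j, KM estimate just after t_j, n * running variance sum).

    The third entry accumulates d_j / (l_j * (l_j - d_j)) over death-times
    up to and including t_j, so it equals V_n(t) for t just after t_j.
    """
    n = len(times)
    death_times = sorted({t for t, e in zip(times, events) if e == 1})
    surv, vsum = 1.0, 0.0
    rows = []
    for t in death_times:
        d = sum(1 for u, e in zip(times, events) if u == t and e == 1)  # deaths at t
        l = sum(1 for u in times if u >= t)                            # at risk at t
        surv *= 1.0 - d / l
        if l == d:
            # Term is undefined when everyone at risk dies; report infinity.
            vsum = float("inf")
        else:
            vsum += d / (l * (l - d))
        rows.append((t, surv, n * vsum))
    return rows


# Hypothetical toy data (not from the course text).
toy_times = [2, 3, 3, 5, 7, 8, 8, 10]
toy_events = [1, 1, 0, 1, 0, 1, 1, 0]

rows = km_and_variance(toy_times, toy_events)
for t, s, v in rows:
    print(t, round(s, 4), round(v, 4))
```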
(I) From Chapter 8 of the course text, do problems 8.4, 8.5, 8.8(a) and (c).
(II) Using the data and results of problem 8.8, find and plot estimators for:
(a) the baseline cumulative hazard function Λ0(t),
and (b) the population summary survival functions for the ALL, AML Low-Risk, and AML Hi-Risk groups.
(III) Do problems 6.3 and 9.3.