Statistics 705
COMPUTATIONAL STATISTICS IN R
Fall 2017, MW 5-6:15,
MATH Building 1308
Instructor: Eric Slud, Statistics Program,
Math. Dept., evs@math.umd.edu
Office: MTH 2314, x5-5469
Office hours: tentatively, M 3, W 1. But you can make an appointment for
office-hour help at other times by emailing me.
Course Text (Recommended): Venables, W. N. and Ripley, B. D., Modern Applied Statistics with S (4th ed., 2002). New York: Springer-Verlag.
Additional Recommended texts (free to UMCP students through campus library account):
Spector, P. Data Manipulation with R
(2008), Springer e-book.
Robert, C. and Casella, G.
Introducing Monte Carlo Methods with R (2010), Springer e-book, for use with
mcsm R package.
Albert, J. Bayesian Computation with R
(2009), Springer e-book.
Gentle, J. Computational Statistics
(2009), Springer e-book.
There is no required text. There are many R
introductions available on the web, and a set of pdf course notes, including an Introduction to R,
available on this web-page. An excellent introduction to R concepts and syntax can be
found in the recommended Venables and Ripley text, but the main value you will derive from that
book is its short and insightful introduction to the use of the major base statistical
packages, some of which will be introduced in this course.
Some Recommended Online Resources:
The R Introduction that is distributed free with the downloaded R code can be found at
this link. It does give
a first exposure to R concepts and definitions, but is not as useful as the syntax
portions (the first 80 pages) of the Venables and Ripley text. Many other Introductions and sets of
notes can be found online, e.g. the Rodriguez
Princeton Notes. The freely downloadable notes for this course, under the Lecture
Notes descriptions below, are another good source.
A really useful short summary of a lot of R commands can be found here.
Overview of course: Statistical research and application has changed dramatically because of cheap and powerful computational and graphical tools. This course presents modern methods of computational statistics and their application to both practical problems and research. The techniques covered in STAT 705, which include some numerical-analysis ideas arising particularly in Statistics, should be part of every statistician's toolbox.
Statistical
methodology in the course will be presented informally, with emphasis on the intuitive basis for
the techniques and brief discussion of their theoretical pedigree. Implementation of each method
will be given in R, and each method will be illustrated by application to data, often from
real datasets but sometimes from datasets simulated from statistical models.
Prerequisite: STAT 420 or STAT 700, and some programming experience (any language).
Course requirements and Grading: Grading will be based completely on graded
DAILY assignments involving data analysis and statistical
computation (a total of about 20-22 of them). The homework tasks will be of moderate length and
difficulty, assigned in each class session and usually due 2 classes after they are
assigned.
Homework Guidelines: For Fall 2017, you may [and it is actually
preferred that you] hand in your homeworks electronically as single-document pdf's,
directed to the grader by the due date at the specially created gmail address
stat705.grader@gmail.com. If you create the homework paper from text files containing
R scripts, then I recommend that you import these into MS Word and save the document as a pdf
before sending it to the grader. Doing it this way makes it particularly easy to include R
exhibits such as Tables and Graphs as part of a single document. Multiple-document submissions
will not be acceptable.
Also: the grader
will deduct at least 20% credit for late papers, unless you first (before the due-date) get
permission from me for lateness.
For information and Directories on the following topics, click these links:
Homework information, HW Directory
Data source info, Data Directory
Lecture Notes descriptions, Lecture Notes Directory
Rlog and Scripts descriptions, Rlog and Scripts Directory
HONOR CODE
The University
of Maryland, College Park has a nationally recognized Code of Academic Integrity, administered by the
Student Honor Council. This Code sets standards for academic integrity at Maryland for all undergraduate
and graduate students. As a student you are responsible for upholding these standards for this course.
It is very important for you to be aware of the consequences of cheating, fabrication, facilitation,
and plagiarism. For more information on the Code of Academic Integrity or the Student Honor Council,
please visit http://www.shc.umd.edu.
To further exhibit your commitment to academic integrity,
remember to sign the Honor Pledge on all examinations and assignments:
"I pledge on my honor that
I have not given or received any unauthorized assistance on this examination (assignment)."
1. Introduction to R:
Starting and quitting R, on-line help, R operators and functions, creating R objects, data types (vectors, matrices, factors, functions, lists), managing data (combining objects, subsetting, creation of frames), R graphics.
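To give a flavor of this material, here is a minimal sketch of a first R session (the object names and values are arbitrary illustrations, not taken from the course notes):

  # vectors and factors
  x   <- c(5.1, 3.8, 2.2, 7.4)          # numeric vector
  grp <- factor(c("a","b","a","b"))     # factor with 2 levels
  x[2:3]                                # subset by position
  x[x > 3]                              # subset by logical condition

  # lists and data frames
  lst <- list(values = x, group = grp)
  dat <- data.frame(y = x, g = grp)
  dat$y[dat$g == "a"]                   # rows of y belonging to group "a"

  # a simple function and a quick plot
  sqErr <- function(u, v) (u - v)^2
  plot(x, type = "b", main = "Toy example")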
2. Monte Carlo and Simulation in R:
Basic random number generation, applications of LLN and CLT in simulations, numerical integration, importance sampling, empirical distributions, Markov Chain Monte Carlo. Managing loops in R.
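As a small illustration of the Law-of-Large-Numbers idea behind simple Monte Carlo (a toy example, not taken from the course notes or homeworks), the following sketch approximates the integral of exp(-x^2/2) over (0,1) and attaches a CLT-based standard error:

  set.seed(705)                       # reproducible pseudo-random numbers
  n <- 1e5
  U <- runif(n)                       # Uniform(0,1) draws
  g <- exp(-U^2/2)                    # integrand evaluated at the draws

  est <- mean(g)                      # LLN: sample mean approximates the integral
  se  <- sd(g)/sqrt(n)                # CLT-based Monte Carlo standard error
  c(MC.estimate = est, MC.std.err = se)

  # check against deterministic numerical integration
  integrate(function(x) exp(-x^2/2), 0, 1)$value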
3. Numerical Optimization in Statistics:
Objective functions in statistics, and managing functions in R. Linear and nonlinear least squares, special considerations in maximizing likelihoods, penalized likelihood, steepest descent, quasi-Newton-Raphson methods, constrained maximization, EM algorithm. Diagnostics for misspecified models.
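For instance, a likelihood can be maximized numerically with optim(); the sketch below is a toy illustration (simulated Gamma data, with a log-parametrization chosen here only for convenience), not a prescribed course solution:

  set.seed(705)
  y <- rgamma(200, shape = 2.5, rate = 1.3)     # simulated data

  negloglik <- function(par, dat) {             # objective: minus log-likelihood
    -sum(dgamma(dat, shape = exp(par[1]), rate = exp(par[2]), log = TRUE))
  }                                             # log-parametrization keeps parameters positive

  fit <- optim(c(0, 0), negloglik, dat = y, method = "BFGS", hessian = TRUE)
  exp(fit$par)                                  # MLEs of (shape, rate)
  sqrt(diag(solve(fit$hessian)))                # approximate std. errors on the log scale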
4. Linear and Generalized Linear Models:
Regression summaries, model fitting, prediction, model updating, analysis of residuals, model criticism, ANOVA, generalized linear models, specifying link and variance functions, stepwise model selection, deviance analysis.
Brief comparisons of implementations in R and SAS. Fitting mixed-effect (generalized) linear models in R.
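A minimal sketch of the kind of R calls involved, using only built-in datasets (attitude, mtcars) as arbitrary illustrations:

  ## linear model on the built-in "attitude" data
  fit1 <- lm(rating ~ complaints + learning, data = attitude)
  summary(fit1)                          # coefficient table, R^2, residual SE
  anova(fit1)                            # sequential ANOVA table
  fit2 <- update(fit1, . ~ . + raises)   # model updating
  predict(fit2, newdata = attitude[1:3, ])
  plot(fitted(fit1), resid(fit1))        # quick residual diagnostic

  ## logistic regression (a generalized linear model) on built-in "mtcars"
  fit3 <- glm(am ~ wt + hp, family = binomial(link = "logit"), data = mtcars)
  summary(fit3)
  anova(fit3, test = "Chisq")            # analysis of deviance
  step(fit3, trace = 0)                  # stepwise model selection by AIC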
5. Bootstrap Methodology:
Parametric bootstrap, empirical CDF, bootstrap standard errors and confidence intervals, estimation of bias, jackknife, application to regression.
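As a toy example (not a course assignment), a nonparametric bootstrap standard error and percentile interval for a sample median can be coded directly with replicate():

  set.seed(705)
  x <- rexp(60, rate = 0.4)                 # simulated data; true median = log(2)/0.4
  B <- 2000
  med.star <- replicate(B, median(sample(x, replace = TRUE)))   # bootstrap replicates

  sd(med.star)                              # bootstrap standard error
  mean(med.star) - median(x)                # bootstrap estimate of bias
  quantile(med.star, c(0.025, 0.975))       # percentile confidence interval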
6. Smoothing & Nonparametric Regression:
Spline smoothing, density estimation, local polynomial regression, kernel smoothing, selecting tuning parameters by cross-validation. Graphical aspects of smoothing.
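A small sketch of these tools on artificial data (an illustration only; the bandwidth rule and span value below are arbitrary choices):

  set.seed(705)
  x <- sort(runif(200, 0, 10))
  y <- sin(x) + rnorm(200, sd = 0.4)          # noisy curve

  dens <- density(y, bw = "SJ")               # kernel density estimate, Sheather-Jones bandwidth
  sm   <- smooth.spline(x, y, cv = TRUE)      # smoothing spline, tuned by ordinary cross-validation
  lo   <- loess(y ~ x, span = 0.3)            # local polynomial regression

  plot(x, y, col = "grey", main = "Spline and loess fits")
  lines(x, predict(sm, x)$y, lwd = 2)
  lines(x, predict(lo), lwd = 2, lty = 2)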
7. MCMC and the Gibbs Sampler:
Definitions and basic ideas of MCMC and Gibbs-Sampler simulation methodology, possibly including a brief introduction to `Bayesian Computing' using BUGS through R.
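As a minimal illustration of the MCMC idea (a toy random-walk Metropolis sampler for a standard normal target, not the BUGS-based material of the course):

  set.seed(705)
  n.iter <- 10000
  theta  <- numeric(n.iter)
  theta[1] <- 0
  for (t in 2:n.iter) {
    prop <- theta[t-1] + rnorm(1, sd = 1.5)          # random-walk proposal
    logr <- dnorm(prop, log = TRUE) - dnorm(theta[t-1], log = TRUE)
    theta[t] <- if (log(runif(1)) < logr) prop else theta[t-1]   # accept or stay
  }
  mean(theta[-(1:1000)])        # sample mean after burn-in (target here is N(0,1))
  acf(theta, lag.max = 50)      # autocorrelation diagnostic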
Note: This course is about the R
language and statistical programming platform. This free software package is syntactically very
similar to the older Splus. If you are new to R, you should get started as soon as
possible, using it either on your university Glue account in a Linux setting, or on a
workstation or PC at the University or at home, by downloading the software
following the instructions at the R website.
For the systematic Introduction to R and R reference manual distributed with the R
software, either download from the R website or simply
invoke the command
> help.start()
from within R. For a quick start, see my own
Rbasics handout originally
intended for a Survival Analysis class, and then read more about R objects and syntax in the
Venables and Ripley text, in my Notes, and in the R introduction manual distributed with
the R software.
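The standard R help facilities referred to above can be reached with commands such as:

  help.start()            # HTML help, including the "An Introduction to R" manual
  ?seq                    # help page for a specific function
  help.search("spline")   # search the help files by topic
  example(seq)            # run the examples from a help page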
In the middle of the course, we may also mention SAS and other statistical software,
primarily in order to contrast the way in which linear and generalized-linear models are handled
in the different software packages, but this course will not spend any time introducing SAS
or other software.
The topics of the individual pdf-file note-packets are as follows:

Sec1NotF09.pdf: Overview, Unix & R preliminaries, R language elements, Vector & Array operations, Inputting Data, and Lists. Functions in R, & how and why to vectorize.
Sec2NotF09.pdf: Introduction to Pseudo-Random Number Generation.
Sec3NotF09.pdf: Introduction to Graphics in R. Also: Simulation speedup methods (Accept-Reject & Importance sampling).
Sec4NotF09.pdf: Numerical maximization methods (for likelihoods).
Sec5NotF09.pdf: Miscellanea: subsetting & parallelizing, plus: Introduction to Smoothing Splines (and their use in quick function-inversion in R).
Sec6NotF16.pdf: EM (Expectation-Maximization) Algorithm for ML estimation with missing data.
Sec7NotF09.pdf: Markov Chain Monte Carlo: introduction and application in an EM estimation problem in random-intercept logistic regression. For additional pdf files of "Mini-Course" Lectures, see below.
BayesConjug.pdf: Conjugate priors for Bayesian inference from data assumed to follow Exponential Family distributions.

The remaining Handouts/Notes date from previous years and relate to comparisons between Splus (which apply also to R) versus SAS:
Lec03Pt5.pdf: SAS Introduction.
Lec03Pt5B.pdf: Linear Regression in SAS (including some graphics.)
Lec03Pt5C.pdf: Factors, ANOVA and Regression in SAS vs. Splus.
Lec03Pt5D.pdf: Simulation in Splus versus SAS.

HANDOUTS distributed in class are included for reference here.
The topics treated on these handout logs are as follows:

Explaining the Gibbs Sampler: This is a readable, well-written introduction to the idea of the Gibbs Sampler, a good choice for reading material to go with the lectures and HW on the Gibbs Sampler and MCMC.
For Background on Markov Chain Monte Carlo: First see the Introduction and application of MCMC within an EM estimation problem in random-intercept logistic regression. For additional pdf files of "Mini-Course" Lectures, including computer-generated figures, see Lec. 1 on the Metropolis-Hastings Algorithm and Lec. 2 on the Gibbs Sampler, with Figures that can be found in the Mini-Course Figure Folders.
EM example on Random Effects ANOVA: this is a pdf associated with an old HW problem, not assigned this year, working out the EM iteration for the EM algorithm likelihood maximization in a Balanced Two-Way Random Effects Analysis of Variance (ANOVA) setting like the one treated in the Class R Log for 10/22/2015.
DensNPR.Log: this log is a condensed version, from Spring '04, of the DensEst.Log and NonPReg.Log below, illustrating several different density estimation and nonparametric regression and smoothing techniques. In addition, the density estimation part has a small section on (Least-Squares) cross-validated bandwidth selection, and the nonparametric regression component also has some material on comparative evaluation of methods using cross-validation.
Factor.Log: class handout on R handling of Factors and contrasts (using the Bass data in an illustrative example) within linear model fitting functions.
Contrasts.txt: handout mentioned in the 4/4/08 class on defining contrasts in R for use with Factors in fitting linear models.
StepExmp.Log: gives a script in R and SAS for stepwise (mostly forward) selection of variables for linear regression within an R dataset called "attitude", rating places to work in terms of ratings in various categories reported on numerical scales.
GLMdispersF08.Log: is the record of a small R session showing how the dispersion and goodness of fit of glm-fitted model objects can be assessed.
RNGdemoF08.Log: covers an in-class demonstration of random-number generation and simulation, plus a brief section on unix.time applied to linear-algebra operations.
Graphics_Rejection.Log: re-caps an in-class demonstration of acceptance/rejection sampling, with outputs illustrated by graphics.
ImportSamp.Log: gives the Log covered in class on Importance Sampling.
Antith_Contr09: is a Log covered in class about the methods of Antithetic Variables and Control Variates for speeding up Monte Carlo.
Minimiz.Log: is a Log combining two parts: one about numerical maximization using "nlm" with and without supplying "gradient" and "hessian" attributes for the values of the function being minimized. The second part is a log involving Maximization, Root-finding, & vectorization in R.
Rfcn.Log: a log on simulation of Mixtures and inverse functions via uniroot.
RlogF09.LinRegr.txt: an R log covered in class 10/26/09 about using and interpreting the R linear model-fitting function "lm".
RlogF09.GLM.txt: an R log from 10/28/09 about fitting and comparison of generalized linear models using the R model-fitting function "glm".
PredSamp.LM: an R log covered in class in Nov. 2009 about Bayesian posterior and predictive sampling in normal linear regression (related to the "bass" data of Fall 2009 HW 14 and the BayesConjug.pdf Lec-Notes file).
SteamDat.Exmp: illustration using Steam-Use data from the Draper and Smith regression book, showing PROC REG in SAS and the R steps related to the function lm for reproducing the same computed results.
CrabsLog.pdf: extended data-fitting example in (Splus and) R for GLM analysis of the Horseshoe Crab data discussed extensively in Agresti's Categorical Data Analysis book.
DensEst.Log: log illustrating several different density estimation techniques (kernel-density estimation, splines, and parametric fitting by a mixture of Gaussian or logistic components) using the Galaxies data from a 1996 article by Roeder. Plots can be found in pdf format here.
NonPReg.Log: log illustrating several methods of nonparametric regression and smoothing, using artificial (simulated) data. Methods include kernel-density, lowess, and splines. Plots can be found in pdf format here.
Bootstr.Log: log with data examples to illustrate the connections between, and mechanics of, Permutational distributions, p-values & confidence intervals, the Parametric Bootstrap and (a very quick idea of) the Nonparametric Bootstrap.
A technical report which explains in some detail the idea of "adaptive
Gaussian quadrature" related to the topic of
"Laplace Approximation"
covered in Stat 705 can be found here.
Steps for analysis of the kyphosis dataset (available both as a dataset in R and under the ASCII data directory on this web-page) using Generalized Linear Model modules: glm in R and PROC GENMOD in SAS.
SASlog1.txt : log of practice scripts for categorical data analysis (PROC's FREQ and GENMOD in SAS).
SASlog2.txt :
log on GLM's and deviance, with Analysis of Deviance Tables and implementations in both
SAS and Splus.
SASlog3.txt :
additional material specifically related to the kyphosis dataset, model-fitting and interpretation in
both SAS and R including some material on `deviance' and `standardized
Pearson' logistic-regression residuals.
Some additional material on stepwise fitting in PROC LOGISTIC and building an analysis of deviance table from SAS output can be found in another SASlog.
Finally, an R log summarizing, for some GLM's, the steps of Fisher scoring versus Newton-Raphson iterations to calculate Maximum Likelihood Estimates can be found in NR.FS.Glm.
Listings of special-purpose R functions referenced in Lec-Notes and Handouts
can be found here.
HOMEWORK PROBLEMS and due dates (usually 2, sometimes 3 classes after they are assigned) can be found here. (Occasional solutions will also be posted to the same place.) For guidelines on the amount of material (code & output) to submit with the Homeworks, see the Instructions.txt file. As described in the Instructions file, Homeworks may also be handed in as hard-copy in class on the due-date, although electronic submission as a single pdf (see the Homework Guidelines above) is preferred.
DATA
Several datasets used in the course and handouts can be found here in ASCII or text format. Later in the course, I may post additional large datasets to shared drive space available on University accounts.
In addition, in any environment supporting R, you have access to many datasets in pre-supplied R libraries, which you can see by issuing the commands
> search() or
> data()
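For example (an illustration using the MASS package, which accompanies the Venables and Ripley text), you can list and load a package dataset as follows:

  data(package = "MASS")    # list the datasets shipped with the MASS package
  library(MASS)
  data(Animals)             # load one of them into the workspace
  head(Animals)             # first few rows: body and brain weights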
COMPUTER ACCOUNTS. MATH, STAT, and AMSC graduate students have access to R, SAS and Matlab
under Unix through their University glue accounts. R is freely available in Unix or PC form
through this link.
Additional Computing Resources. There are many publicly
available datasets for practice data-analyses. Many of them are taken from journal articles and/or
textbooks and documented or interpreted. A good place to start is
Statlib, and additional sources can be found here.
Datasets needed in the course will either be available in indicated R packages, posted to the Data Directory linked to this
web-page, or indicated by links which will be provided in this space.
CourseEvalUM main page: https://www.CourseEvalUM.umd.edu (top button)
The UMCP Math Department homepage.
The University of Maryland home page.
My home page.
© Eric V Slud, August 28, 2017.