Instructor: Eric Slud, Statistics program, Math. Dept.
Office: Mth 2314, x5-5469, email evs@math.umd.edu
Office Hours: W or F 11am-12pm or by appointment
Course Text: A. Agresti, Categorical Data Analysis, 3rd ed. 2013. Find Errata here.
Overview: This course covers the statistical analysis of discrete data, cross-classified by and modeled in terms of auxiliary covariate measurements which may be continuous or discrete. Such data structures arise in a wide variety of fields of application, especially in the social and biological sciences. The basic underlying model is the multinomial distribution, with cell-probabilities parametrically restricted according to their array structure, with conditional probability masses for a distinguished response variable often expressed linearly in terms of covariates. Important models of this type (some of which generalize to the case of continuous covariates) include logistic regression, other `generalized linear models', and loglinear models. The modern approach to these topics involves estimation via likelihood-based methods or generalizations to so-called quasilikelihood estimating equations, with emphasis on statistical computing and model diagnostics. In addition, computational advances have made categorical data models with random effects tractable to estimate and interpret, and Bayesian and empirical-Bayes methods are an important part of the material included in the new edition of the Agresti text. Methods covered in the course will be presented in terms of theoretical properties, computational implementation (primarily in R), and real-data application.
NOTE ON USE OF THEORETICAL MATERIAL. Both in homeworks and the in-class test, there will be theoretical material at the level of probability theory needed to apply the law of large numbers and central limit theorem, along with the `delta method' (Taylor linearization) and other manipulations at advanced-calculus level.
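For reference, the simplest (univariate) form of the delta method referred to here states that

$$ \sqrt{n}\,(T_n - \theta)\ \to_d\ N(0,\sigma^2) \quad\Longrightarrow\quad \sqrt{n}\,\{\,g(T_n) - g(\theta)\,\}\ \to_d\ N\big(0,\,[g'(\theta)]^2\,\sigma^2\big), $$

for any g differentiable at θ with g'(θ) ≠ 0. For example, taking T_n to be a binomial sample proportion (so σ² = p(1-p)) and g(p) = log{p/(1-p)} shows that the logit of the sample proportion is asymptotically normal with variance 1/{n p(1-p)}, the basis of the logit-scale Wald interval mentioned under Assignment 1 below.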
Prerequisite: Stat 420 or Stat 700, plus some computing familiarity, preferably including some R.
Course requirements and Grading: there will be 6 graded homework sets (one every 2 weeks), plus a project/paper at the end. Homeworks will be split between theory problems and statistical computations and interpretations with data. The homework will be worth 65% of the grade, the term paper 35%.
Course Coverage: topics from the Agresti book, listed in the numbered outline (items 1-10) later on this page.
Slide-decks for Lectures, adapted and revised from the way they were given in Fall 2020, can be found in this directory.
NOTE ON COMPUTING. Both in the homework-sets and the course project, you will be required to do computations on real datasets well beyond the scope of hand calculation or spreadsheet programs. Any of several statistical-computing platforms can be used to accomplish these: R, SAS, Minitab, Matlab, SPSS, or others. If you are learning one of these packages for the first time, or investing some effort toward deepening your statistical computing skills, I recommend R, which is free and open-source and is the most flexible and useful for research statisticians. I will provide links to free online R tutorials, along with examples and scripts, and will offer some R help. The Agresti book gives scripts and illustrations in SAS.
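For those starting out in R, the following minimal sketch (with simulated data, not a course dataset) shows the style of call used throughout the course for fitting a categorical-data model:

## minimal R sketch: logistic regression on simulated data
set.seed(1)
x <- rnorm(200)                          # a continuous covariate
y <- rbinom(200, 1, plogis(-0.5 + x))    # binary responses from a logistic model
fit <- glm(y ~ x, family = binomial)     # fit by iteratively reweighted least squares
summary(fit)$coefficients                # estimates, standard errors, Wald z-statistics
confint.default(fit)                     # Wald confidence intervals for the coefficients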
Getting Started in R and SAS. Lots of R introductory materials can be found on my
STAT 705 website from several years ago, in particular in these Notes.
Another free and interactive site I recently came across for introducing R to social scientists is: https://campus.sagepub.com/blog/beginners-guide-to-r.
Various pieces of information to help you get started in using SAS can be found under an old (F09) course
website Stat430. In particular you can find:
--- an overview of the minimum necessary steps to use SAS from Mathnet.
--- a series of SAS logs with edited outputs for illustrative examples.
The Agresti text has an Appendix A describing software, including SAS scripts, which can be used to perform categorical data analyses. In addition, datasets can be downloaded from Agresti's website. Several logs in SAS (with some comparative Splus analyses [as precursor to R]) doing illustrative data analyses and including standard SAS scripts can be found here. There is also a lengthy manual for performing R analyses of examples in (the 2nd edition of) the Agresti book.
A set of R Scripts on many topics related to the course is available in this directory. Those that are specifically cited in Lectures will be separately located in this sub-directory.
Notes and Guidelines. Homeworks should be handed in as pdf's through ELMS "Assignments". Solutions will usually be posted on ELMS, and a percentage deduction of the overall HW score will generally be made for late papers.
Assignment 1. (First 2 weeks of course, HW due Fri., Feb. 9). Read all of Chapter 1, plus the first of the sections from the historical notes in Chap. 17. Then solve and hand in all of the following problems:
(b) Let the m black balls position themselves among N bins as in (a). But now suppose that, given the positions of the black balls, the positions of the n white balls are chosen in such a way that the odds for each white ball to fall in a bin occupied by a black ball are multiplied by a factor exp(θ) as compared with the odds of falling in a bin not occupied by a black ball. (A more precise way to model this question is as follows: (i) suppose that for some fixed probability p, for each bin j = 1,...,N independently of the others, a black ball is placed in bin j; (ii) condition on the total number of bins containing black balls being equal to m; (iii) suppose that each bin not containing a black ball, independently of all other bins, has probability q of receiving a white ball, while each bin containing a black ball, independently of all other bins, has probability q* [defined to satisfy q*/(1-q*) = exp(θ) q/(1-q)] of receiving a white ball; and (iv) condition on the total number of bins containing white balls being equal to n.) Now what is the probability distribution of the number X of bins occupied by both a black and a white ball? Your answer should turn out not to depend on q* and q, i.e. to depend only on m, n, N and θ.
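This is not part of the assignment, but one hedged way to sanity-check whatever closed-form answer you derive is to simulate the mechanism (i)-(iv) directly in R; every numerical setting below (N, m, n, theta, q) is an arbitrary illustrative choice, not taken from the problem.

## simulate mechanism (i)-(iv) and tabulate X = number of bins holding both colors
set.seed(770)
N <- 20; m <- 8; n <- 6; theta <- 1.2
q <- n/N                                   # chosen so the rejection step below accepts often enough
qstar <- plogis(theta + qlogis(q))         # q* satisfies  q*/(1-q*) = exp(theta) q/(1-q)
one.draw <- function() {
  black <- rep(FALSE, N)
  black[sample(N, m)] <- TRUE              # given the total m, the black-ball positions are uniform
  repeat {                                 # rejection step: condition on exactly n white-occupied bins
    white <- rbinom(N, 1, ifelse(black, qstar, q)) == 1
    if (sum(white) == n) return(sum(black & white))
  }
}
X <- replicate(20000, one.draw())
round(table(X)/length(X), 4)               # empirical distribution of X, to compare with your formula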
A listing of R functions and commands that can be used to solve problem (A) above can be found here. You can also look at the resulting picture either for sample size n=40 as requested, or for n=100.
For an interesting comparison between an `Agresti-Coull (1998) confidence interval' advocated by the author of our text (see problem 1.25), versus the other standard intervals we are studying, and also versus a transformed Wald interval (with 0.5 added to the numbers of successes and failures) on the logit scale, see this picture.
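The following hedged R sketch (with illustrative counts y = 12 successes out of n = 40) computes the three intervals being compared; the formulas are the standard ones, not taken from a course script.

## 95% intervals for a binomial proportion: Wald, Agresti-Coull, and logit-scale Wald
y <- 12; n <- 40; z <- qnorm(0.975)
p.hat <- y/n
wald <- p.hat + c(-1,1) * z * sqrt(p.hat*(1-p.hat)/n)
n.ac <- n + z^2                            # Agresti-Coull: add z^2/2 successes and z^2/2 failures
p.ac <- (y + z^2/2)/n.ac
agresti.coull <- p.ac + c(-1,1) * z * sqrt(p.ac*(1-p.ac)/n.ac)
lo <- log((y+0.5)/(n-y+0.5))               # logit-scale Wald with 0.5 added to successes and failures
se.lo <- sqrt(1/(y+0.5) + 1/(n-y+0.5))
logit.wald <- plogis(lo + c(-1,1) * z * se.lo)
rbind(wald = wald, agresti.coull = agresti.coull, logit.wald = logit.wald)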
Assignment 2. (Second 2 weeks of course, HW due Tuesday, Feb. 27, 11:59pm). Read all of Chapter 2, plus Chap. 3 Sections 3.1-3.3 and 3.5.1-3.5.2. Then solve and hand in the following problems:
[2 x 5 contingency table: rows Drugs and No Drugs, columns five diagnosis categories; counts not reproduced here.]
Using these data, and assuming the rows were sampled independently (each as an iid sample): (a) conduct a test of (row vs. column) independence and interpret the P-value; (b) obtain standardized residuals and interpret them; and (c) partition the LRT (and the approximating chi-square statistics, if you like) into three components describing differences and similarities among the diagnoses, by comparing (i) the first two columns, (ii) the 3rd and 4th columns, and (iii) the last column against the combination of the first two columns and the combination of the 3rd and 4th columns.
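As a hedged sketch (not the required solution), parts (a)-(c) can be organized around calls like the following; the matrix tab stands for the 2 x 5 table above, and the counts written here are placeholders only.

## placeholder counts -- replace by the Drugs / No Drugs table above
tab <- matrix(c(100, 80, 60, 50, 40,
                 90, 70, 65, 55, 45), nrow = 2, byrow = TRUE,
              dimnames = list(c("Drugs","No Drugs"), paste0("Diagnosis", 1:5)))
chisq.test(tab)                            # (a) Pearson test of row-column independence
G2 <- 2*sum(tab*log(tab/chisq.test(tab)$expected))   # likelihood-ratio (G^2) statistic
1 - pchisq(G2, df = (nrow(tab)-1)*(ncol(tab)-1))
chisq.test(tab)$stdres                     # (b) standardized (adjusted) residuals
sub <- tab[, 1:2]                          # (c) one component of the partition: columns 1 vs 2
2*sum(sub*log(sub/chisq.test(sub)$expected))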
Assignment 3. (Third 2 weeks of course, HW due 3/16/24, 11:59pm). Read the Bayes material in Sec. 3.6, plus Chapter 4 and the first few sections of Chapter 5. Then solve and hand in the following problems (6 in all): #4.10 and #4.12 on pages 156-158, plus the following:
(I). Do problem #3.21 in Agresti Chapter 3. But after doing part (b), do two more parts assigned here: (II). Fit a logistic regression model to the Crabs mating data with outcome variable (Crabs$sat > 1). Use the predictors spine (as a factor), weight (rounded to the nearest 0.2 kg), and width (rounded to the nearest cm). You may use interactions if they help. Fit the best model you can, and assess the quality of fit of your best model using the techniques in Sec. 5.2.
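A hedged sketch (not a prescribed solution) of how the model in problem (II) might be set up, followed by a generic Fisher-scoring loop of the kind problem (III) below asks you to code; the data-frame name Crabs and its columns sat, spine, weight, width are assumed to match the horseshoe-crab data as distributed on Agresti's website.

## problem (II): outcome indicator and rounded predictors
Crabs$y   <- as.numeric(Crabs$sat > 1)
Crabs$wt2 <- round(Crabs$weight/0.2) * 0.2     # weight rounded to the nearest 0.2 kg
Crabs$wd1 <- round(Crabs$width)                # width rounded to the nearest cm
fit <- glm(y ~ factor(spine) + wt2 + wd1, family = binomial, data = Crabs)
summary(fit)
## generic Fisher scoring for logistic regression (with the canonical logit link,
## observed and expected information coincide, so this is also Newton-Raphson)
X <- model.matrix(fit); y <- Crabs$y
beta <- rep(0, ncol(X))
for (it in 1:10) {
  p <- plogis(drop(X %*% beta))
  W <- p*(1-p)                                 # diagonal of the weight matrix
  score <- crossprod(X, y - p)                 # gradient of the log-likelihood
  info  <- crossprod(X, X*W)                   # Fisher information  t(X) %*% diag(W) %*% X
  beta  <- beta + solve(info, score)
}
cbind(glm = coef(fit), scoring = drop(beta))   # the two fits should agree closely
sqrt(diag(solve(info)))                        # SEs from the information matrix, as in problem (III)(b)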
(III). For the "best" model you fit in problem (II): (a) fit the coefficients directly (by a likelihood calculation that you code in R using "optim" with method="BFGS") and also by coding the Fisher-scoring algorithm and using 5 or 10 iterations (starting from all coefficients = 0), and (b) check that the SEs for coefficients found by "glm" are close to those found from your observed-information-matrix estimates calculated in (a).
(IV). (Compare problem 4.33 on p.162.) Use the formulas in the book or class to show how the observed (whole-dataset, not per-observation) information matrix with the probit link depends on the data and differs from the expected (whole-dataset) or Fisher information.
Assignment 4. HW submissions due by upload to ELMS by April 5, 2024, 11:59pm.
Assignment 5. HW submissions due by upload to ELMS by April 21, 2024, 11:59pm.
Reading for this assignment includes the book and Lecture material on Power (Section 6.6 plus handouts plus Lectures 17-18), Chapter 9 (including some computational topics in Sections 9.6-9.7), and Section 8.1. Problems to hand in (6 in all): 9.2, 9.3, 9.16 (d and e only), and 9.34 in Agresti, plus:
(A). Data are to be collected from 6 clinical centers on the effectiveness of a treatment
for heart disease in diabetics. The same number m of diabetic patients will be recruited at each
center, to be randomly divided into 2 groups, m/2 patients to be treated using standard therapy and
m/2 with a new experimental drug. However, three of the clinical centers will restrict their
recruitment to patients with better overall health, and the developers of the treatment have reason
to believe that the difference between probability of positive response for the treatment and
standard therapy should be twice as great for these patients as for those with the average
overall health of the general (diabetic) population. Effectiveness will be measured using an overall
diagnostic evaluation after 6 months, and since the therapy is not risk-free, significance testing will be
done two-tailed. The standard therapy is approximately 40% effective, and it is desired
that the multi-center clinical trial have an overall power 80% (in significance tests of size 0.05)
to detect a positive-response proportion at least 10% larger than the proportion under standard
therapy. Determine the smallest per-center sample size m for which this overall power is achieved.
(B). Consider the dataset in Table 9.16 (given in Ex. 9.1 in Agresti) in which 3
binary factors G, I, and H are measured on 621 subjects. Fit the loglinear model (GI, HI)
[with sum-constraints]. Based on the model, find confidence intervals for each of the
main-effect coefficients for G=Male, for I=Support, and for H=Support, and find a confidence
interval for the probability that a subject will be in the (Male, Support, Support) category.
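As a hedged sketch of problem (B)'s set-up (not a full solution), the (GI, HI) loglinear model can be fitted as a Poisson GLM with sum-to-zero contrasts; the eight counts written below are placeholders to be replaced by the Table 9.16 entries, arranged to match the expand.grid ordering. This also illustrates the contrast coding with sum side-conditions discussed in the handout on factors and contrasts.

## loglinear model (GI, HI) as a Poisson GLM with contr.sum coding
dat <- expand.grid(G = c("Female","Male"),
                   I = c("Oppose","Support"),
                   H = c("Oppose","Support"))
dat$count <- c(10, 20, 30, 40, 50, 60, 70, 80)    # placeholders -- use the Table 9.16 counts
fit <- glm(count ~ G*I + H*I, family = poisson, data = dat,
           contrasts = list(G = "contr.sum", I = "contr.sum", H = "contr.sum"))
summary(fit)
confint.default(fit)            # Wald intervals; note contr.sum reports the first level's
                                # coefficient, and the second level's effect is its negative
## a cell probability and its interval can be built from predict(fit, se.fit = TRUE)
## on the fitted linear predictor, together with the delta method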
Assignment 6, due by 11:59pm, Tuesday, May 7, 2024.
FINAL PROJECT ASSIGNMENT, due Friday, May 17, 2024, 11:59pm (uploaded to ELMS as pdf or MS Word document). As a final course
project, you are to write a paper including 5-10 pages of narrative, plus relevant code and graphical or tabular
exhibits, on either a statistical journal article or textbook chapter (not covered in class) related to the course, or a data analysis or case study based on a dataset of your choosing. The guideline is that the paper should be 10-12 pages if it is primarily expository, based on an article or chapter, but could have somewhat fewer pages of narrative if based on a data-analytic case study. However, for the latter kind of paper, all numerical outputs should be accompanied by the code used to generate them, plus clear verbal discussion and interpretation of software outputs and graphical exhibits. For a data analysis or case study, the paper should present a coherent and reasoned data analysis, with evidence and connecting text supporting the model you choose to fit, the variables you choose to include and exclude, whatever indications you can provide of the adequacy of fit of the models, and a summary of what the model says about the generating mechanism of the data.
(0) Slide-decks for lectures, adapted and revised from the form in which they were given in Fall 2020, can be found in this directory. (1) Two pdf-handout files contain readings related to (Maximum Likelihood) estimation of parameters in Generalized Linear Mixed Models (GLMM's), specifically in random-intercept logistic regression models: (i) A handout from
Stat 705 on ML estimation using the EM (Expectation-Maximization) algorithm along with another on MCMC (Markov Chain Monte Carlo) techniques. (ii) A
technical report (written by me for the Small Area Income and Poverty Estimates program at the Census Bureau) on numerical maximization of the random-intercept logistic regression model using the Adaptive Gaussian Quadratures method developed by Pinheiro and Bates (the authors of related nonlinear-model mixed-effect software in R later adapted to NLMIXED in SAS). (2). A link to a lecture by Agresti in Italy on
History of Categorical Data Analysis. Further historical material can be found in an interesting historical article by Stephen Stigler (2002) showing just how recent is the display of data by cross-classification into contingency tables. (3). Handouts produced for other classes cover Asymptotics relating Wald, Score and LRT Tests, and another gives a Proof of Wilks' Theorem and the equivalence of the corresponding chi-square statistic with
Wald & Rao-Score statistics. Here you can also find a handout on power in contingency-table score tests. (4). You can get an idea of test topics and course coverage from previous semesters in an old In-Class Test from April 14, 2003. Also see a directory
of SASlogs and a
Sample Test. A small writeup of computational details related to the first problem of the sample test can be found here. (5). Proof
of limiting distribution of multinomial Chi-square goodness of fit test statistic. (6). See the directory Survey Confidence Intervals for two papers and a supplement on the extension of binomial confidence intervals to unknown proportions estimated in complex surveys. The JSM 2014 paper, published in the American Statistical Association Proceedings of the Survey Research and Methods Section from the 2014 annual statistical meetings, contains the results of a simulation study showing the relative merits of various binomial-proportion Confidence Intervals adapted to complex survey data. The other paper and supplement, which extends the simulation study and improves the way the intervals are adapted to survey data, has appeared recently in the Journal of Survey Statistics and Methodology.
(7). A very interesting case-study on a criminal case in the Netherlands and
the importance of accounting for overdispersion in doing justice to a criminal-defendant. The case study is authored
by eminent Dutch statisticians, Gill, Groeneboom and de Jong. The math is very accessible and the point very clear.
(8). A set of R Scripts on many topics related to the course is available in this directory. Those that are specifically cited in Lectures will be separately located in this sub-directory. (9). Several R packages for fitting Generalized Linear Mixed Models (particularly, binomial- and Poisson-family random-intercept models) were mentioned in a script-file covered in class. Some only
approximate the GLMM log-likelihood using Monte Carlo techniques, such as glmm or blme, while others (which are most useful
for the relatively simple random-intercept models arising in the applications in Agresti) calculate the
log-likelihoods as accurately as desired using Adaptive Gaussian Quadrature (AGQ): these include lme4, glmmML, and GLMMadaptive, and their results can also be checked against my own code in the workspace Rscripts. Also see an exposition of AGQ that I wrote in a random-intercept logistic-regression context, which should be accessible and useful to students in this course.
(10). Another package that can be used to fit multiple-outcome ("generalized-logistic" or "multinomial")
logistic regression is mlogit.
That package, which may be the only R package currently capable of fitting random-intercept models of generalized
logistic type, was written not for that purpose but to analyze `social choice' datasets of interest to
econometricians.
(11). A handout explains the meaning and use of factors and contrasts, first in the Linear-Model setting and then in connection with categorical-data GLM's and Loglinear models. Both in that handout, and in another handout on loglinear models transformed to GLM's, are discussed the necessary steps to code contrasts and interactions from loglinear models (with sum side-conditions on coefficients) to GLM's.
1. Introduction --- binomial and multinomial probabilities, statistical tests, estimators
and confidence intervals. Law of large numbers, central limit theorem, delta method,
asymptotic normal distribution of maximum likelihood estimators, Wilks' Theorem. 2. Computational Techniques -- Numerical Maximization, Fisher Scoring, Iterative Proportional
Fitting, EM Algorithm, Bayesian (MCMC) Computation. 3. Describing Contingency Tables --- models and measures of independence vs. association of
categorical variables in multiway contingency tables, including tests of independence. Hypotheses equating
proportions for different variables. Conditional and marginal odds ratios and relative risks. Confidence
intervals for parameters, including Bayesian formulations. Historical notes on contingency table analysis.
4. Generalized linear models. Formulation of conditional response probabilities as
linear expressions in terms of covariables. Likelihood and inference. Quasilikelihood and estimating equations. 5. Logistic regression. Interpretation and inference on model parameters. Model
fitting, prediction, and comparison. 6. Model-building, including variable selection, diagnostics, and inference about variable
associations in logistic regression models. 7. Logistic regression extensions: multiple-category responses, weighted observations,
and missing data. 8. Loglinear models and their likelihood-based parametric statistical inference. 9. Generalized linear models with random effects. Likelihood and penalized likelihood
based inference. Missing-data formulation. 10. Comparison of prediction and classification using various model-fitting strategies. Likelihood, quasilikelihood, penalized likelihood, Bayes. Models & strategies include logistic regression and multicategory
extensions, loglinear models, GLMMs, Support Vector Machines, recursive partitioning & decision trees.
Additional Computing Resources. There are many
publicly available datasets for practice data-analyses. Many of them are taken from journal articles
and/or textbooks and documented or interpreted. A good place to start is Statlib. Here is another good source. Datasets needed in the course
will either be posted to the course web-page, or indicated by links which will be provided here.
(I). (a) Build a simple, reasonable model using fixed-effect logistic regression for the outcome
class="malignant" in the dataset "biopsy" in package MASS, using only the first biopsies for the 645 unique IDs,
and ignoring variable V6.
For all three parts, fit the model and interpret/compare the coefficients and quality of fit.
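A hedged sketch of the data set-up for problem (I)(a) (the modeling choices themselves are left to you); it uses only MASS and base R, and the full-covariate fit at the end is just a starting point to simplify from.

## keep the first biopsy per unique ID, drop V6, and code the outcome
library(MASS)
data(biopsy)
first <- biopsy[!duplicated(biopsy$ID), ]
first$V6 <- NULL
first$y  <- as.numeric(first$class == "malignant")
fit <- glm(y ~ V1 + V2 + V3 + V4 + V5 + V7 + V8 + V9, family = binomial, data = first)
summary(fit)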
(II). Consider a dataset with a three-level outcome Y (=0,1 or 2) and a predictor X
generated artificially with the model implicitly defined by the following commands:
set.seed(7010)
Xvec = rnorm(5000)                        # covariate values
Ymat = array(0, c(5000,3))                # indicator columns for the three outcome categories
Ymat[,1] = rbinom(5000, 1, exp(-1.5+0.6*Xvec)/(1+exp(-1.5+0.6*Xvec)+exp(-.5+0.1*Xvec)))
Ymat[,2] = rbinom(5000, 1-Ymat[,1], plogis(-.5+0.1*Xvec))  # drawn only when category 1 did not occur
Ymat[,3] = 1-Ymat[,1]-Ymat[,2]            # indicator of the remaining category
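One hedged way (not necessarily the package intended for the assignment) to fit a three-category baseline-category logit model to these simulated data is nnet::multinom; taking category 3 as the baseline lets the fitted coefficients be compared with the intercepts and slopes appearing in the rbinom calls above.

## baseline-category logit fit to the simulated (Xvec, Ymat) data
library(nnet)
Ycat <- relevel(factor(max.col(Ymat)), ref = "3")   # which column holds the 1; category 3 as baseline
mfit <- multinom(Ycat ~ Xvec)
summary(mfit)          # rows give (Intercept, Xvec) coefficients for categories 1 and 2 vs. category 3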
(III). Find and display the posterior densities of four coefficients [Intercept and District2,
District3, and District4] in the Bayesian Poisson (fixed-effects) regression of Claims in the dataset
"Insurance" within the R package MASS, using a Bayesian GLMM package in R (such as blme or brms or bayesglm). Note that you should specify
the Poisson regression with an "offset" term of log(Holders) added to the Intercept.
(IV). Use a Bayesian GLMM package (as in (III)), or else a direct Metropolis-Hastings MCMC implementation as in the script BayesLgst.RLog, to do a Bayesian analysis of the random-intercept effect in the "biopsy" dataset (using only the first biopsy for each unique ID, and omitting V6), based on a mixed-effect generalized linear model with fixed-effect covariates plus a random intercept (logistic regression with random intercept). Your analysis should give point and interval estimates for the standard deviation of the random-intercept effect and say something descriptive about the posterior distribution of that standard deviation.
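As a hedged sketch for problem (IV) (one option among those mentioned), using brms and assuming the data frame first prepared for problem (I) above (first biopsy per unique ID, V6 dropped, y = 1 for class "malignant") and a working Stan installation:

## random-intercept logistic regression for the biopsy data, fit by MCMC with brms
library(brms)
bfit <- brm(y ~ V1 + V2 + V3 + V4 + V5 + V7 + V8 + V9 + (1 | ID),
            family = bernoulli(), data = first, chains = 2, seed = 770)
summary(bfit)    # the Group-Level Effects block summarizes sd(Intercept) for ID:
                 # posterior mean and a 95% credible interval
plot(bfit)       # posterior densities and trace plots, including the intercept SD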
Two good sources of data for the paper are the StatLib web-site mentioned below,
or Agresti's book web-site.
Possible topics for the paper include: (a) Zero-inflated Poisson regression models, based on an
original paper of Lambert but discussed in connection with the Horseshoe Crabs dataset in a web-page posted by Agresti (indexed
also under heading 2 in his book web-page).
(b) The relationship between individual heterogeneity and
overdispersion and prediction, in a very nice article connected with a court case in the Netherlands
mentioned as a Handout (item (7) in the Handouts list).
(c) Discussion of `raking' in connection with survey-weighted contingency tables, and extension of the Iterative Proportional Fitting Algorithm covered in the loglinear chapter in the course.
(d) I mentioned in class that those of you with interests in Educational Statistics might consider covering some article or book-chapter in categorical data analysis related to Item Response Theory modeling, such as the article Semiparametric Estimation in the Rasch Model and Related Exponential Response Models, Including a Simple Latent Class Model for Item Analysis, by Bruce Lindsay, Clifford C. Clogg, John Grego, in the Journal of the American Statistical Association, Vol. 86, No. 413 (Mar., 1991), pp. 96-107, http://www.jstor.org/stable/2289719.
(e) You might also base your paper on discussion of an R-package, with data
illustration, to do analyses related to some categorical-data structure not covered in detail in the class, such as analysis of multicategory generalizations of (fixed-effect) logistic or probit regression, or ordinal-outcome categorical modeling, or `social choice' modeling.
(f) Some additional keywords for topics, that I could fill in if there is interest, can be
found here.
Handouts for Reference
August 27, 2020
SYLLABUS for Stat 770, based on Agresti 3rd edition, Fall 2020 and Spring 2024
A good set of links to data sources from various organizations including Federal
and international statistical agencies is at Washington
Statistical Society links.
Eric V Slud, Apr. 24, 2024.