**Instructor:** Eric Slud, Statistics program, Math. Dept.

**Office:** Mth 2314, x5-5469, email evs@math.umd.edu

**Office Hours:** W or F 11am-12pm or by appointment

**Course Text:** A. Agresti, *Categorical Data Analysis*, 3rd ed. 2013. Find Errata
here.

**Overview:** This course covers the statistical analysis of discrete data, cross-classified by and modeled in terms of auxiliary covariate measurements which may be continuous or discrete. Such data structures arise in a wide variety of fields of application, especially in the social and biological sciences. The basic underlying model is the multinomial distribution, with cell-probabilities parametrically
restricted according to their array structure, with conditional probability masses for a distinguished
response variable often expressed linearly in terms of covariates. Important models of this type (some
of which generalize to the case of continuous covariates) include logistic regression, other 'generalized
linear models', and loglinear models. The modern approach to these topics involves estimation via
likelihood-based methods or generalizations to so-called quasilikelihood estimating equations, with
emphasis on statistical computing and model diagnostics. In addition, computational advances have made
categorical data models with random effects tractable to estimate and interpret, and Bayesian and
empirical-Bayes methods are an important part of the material included in the new edition of the
Agresti text. Methods covered in the course will be presented in terms of theoretical properties,
computational implementation (primarily in **R**), and real-data application.

**NOTE ON USE OF THEORETICAL MATERIAL.** Both in homeworks and the in-class test, there will
be theoretical material at the level of probability theory needed to apply the law of large numbers and
central limit theorem, along with the 'delta method' (Taylor linearization) and other manipulations at
advanced-calculus level.
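As an indication of the level intended, here is a sketch of the one-parameter delta method in its standard textbook form. The symbols $T_n$, $\theta$, $\sigma^2$ and $g$ are notation chosen for this illustration only, and $g$ is assumed differentiable at $\theta$ with $g'(\theta) \neq 0$.

```latex
\sqrt{n}\,(T_n - \theta) \;\xrightarrow{d}\; N(0, \sigma^2)
\quad\Longrightarrow\quad
\sqrt{n}\,\bigl(g(T_n) - g(\theta)\bigr) \;\xrightarrow{d}\; N\bigl(0,\; [g'(\theta)]^2 \sigma^2\bigr).

% Example: X ~ Binom(n, p), \hat p = X/n, so \sigma^2 = p(1-p).
% With g(p) = \log\{p/(1-p)\} we have g'(p) = 1/\{p(1-p)\}, hence
\sqrt{n}\,\Bigl(\log\frac{\hat p}{1-\hat p} - \log\frac{p}{1-p}\Bigr)
\;\xrightarrow{d}\; N\Bigl(0,\; \frac{1}{p(1-p)}\Bigr).
```

In the example, the large-sample variance of the sample logit is therefore approximately $1/\{n\,p(1-p)\}$.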

**Prerequisite:** Stat 420 or Stat 700, plus some computing familiarity, preferably including some **R**.

**Course requirements and Grading:** There will be 6 graded homework sets (one every 2 weeks), plus a project/paper at the end.
Homeworks will be split between theory problems and statistical computations and interpretations with data.
The homework will be worth 65% of the grade, the term paper 35%.

**Course Coverage:** in the Agresti book:

**Slide-decks for Lectures, adapted and revised from the way they were given in Fall 2020, can be found in this directory.**

**NOTE ON COMPUTING.** Both in the homework-sets and the course project, you will be required
to do computations on real datasets well beyond the scope of hand calculation or spreadsheet programs.
Any of several statistical-computing platforms can be used to accomplish these: **R**, SAS, Minitab,
Matlab, SPSS, or others. If you are learning one of these packages for the first time, or investing some effort toward
deepening your statistical computing skills, I recommend
**R**, which is free and open-source and is the most flexible and useful for research statisticians.
I will provide links to free online **R** tutorials, along with examples and scripts, and will
offer some **R** help. The Agresti book gives scripts and illustrations in SAS.

**Getting Started in R and SAS.** Lots of R introductory materials can be found on my
STAT 705 website from several years ago, in particular in these Notes.
Another free and interactive site I recently came across for introducing R to social scientists is: https://campus.sagepub.com/blog/beginners-guide-to-r.

Various pieces of information to help you get started in using SAS can be found under an old (F09) course
website Stat430. In particular you can find:

--- an overview of the minimum necessary steps to use SAS from Mathnet.

--- a series of SAS logs with edited outputs for illustrative examples.

The Agresti text has an Appendix A describing software, including SAS scripts, which can be used to perform categorical data analyses. In addition, datasets can be downloaded from Agresti's website. Several logs in SAS (with some comparative Splus analyses [as precursor to R]) doing illustrative data analyses and including standard SAS scripts can be found here. There is also a lengthy manual for performing R analyses of examples in (the 2nd edition of) the Agresti book.

A set of R Scripts on many topics related to the course is available in this directory. Those that are specifically cited in Lectures will be separately located in this sub-directory.

**Notes and Guidelines.** Homeworks should be handed in as PDFs through ELMS "Assignments". Solutions will
usually be posted on ELMS, and a percentage deduction of the overall HW score will generally be made for late papers.

Assignment 1. (First 2 weeks of course, HW due Fri., Feb. 9). Read all of Chapter 1, plus the first of the sections from the historical notes in Chap. 17. Then solve and hand in all of the following problems:

**(A).**

**(B).** Suppose that X_{i}, i=1,2,..., are independent and identically distributed
(*iid*) discrete random variables with values {1,...,K} and probability mass function P(X_{i}=k) =
p_{k}. Find the joint probability distribution of (N_{k}, k=1,...,K), where
N_{k} = ∑_{i=1}^{N} I_{[X_{i}=k]} and
N~Poisson(n) is independent of {X_{i}: i=1,2,...}.

**(C).**

(b) Let the m black balls position themselves among N bins as in (a). But now suppose that, given the positions
of black balls, the positions for the n white balls are chosen in such a way that the odds for each white ball
to fall in a bin occupied by a black ball is multiplied by a factor e^{θ} as compared with the
odds of falling in a bin not occupied by a black ball. (*A more precise way to say or model this question is
as follows: (i) suppose that each bin j=1,...,N, independently of the others, receives a black ball with some
fixed probability p; but that (ii) we condition on the total number of bins containing
black balls being equal to m; and (iii) suppose that each bin not containing a black ball, independently of all
other bins, has probability q of receiving a white ball, while each bin containing a black
ball, independently of all other bins, has probability q^{*} [defined to satisfy
q^{*}/(1-q^{*}) = e^{θ} q/(1-q)] of receiving a white ball; but
that (iv) we condition on the total number of bins containing white balls being equal to n.*) Now what is
the probability distribution of the number X of bins occupied by both a black and a white ball?

**(D).**

**A listing of R functions and commands that can be used to solve problem (A) above can be found here. You can also look at the resulting picture either for
sample size n=40 as requested, or
for n=100.**

For an interesting comparison between an 'Agresti-Coull (1998) confidence interval' advocated by the author of our text (see problem 1.25), versus the other standard intervals we are studying, and also versus a transformed Wald-interval (with 0.5 added to number of successes and failures) on logit scale, see this picture.
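The intervals compared above can be computed in a few lines of base **R**. The following is only a minimal sketch with made-up counts (x = 12 successes in n = 40 trials, not data from the course), and it assumes that the "transformed Wald-interval" means a Wald interval for the empirical logit with 0.5 added to successes and failures, mapped back to the probability scale.

```r
# Hypothetical data (not from the course): x successes in n Bernoulli trials
x <- 12
n <- 40
alpha <- 0.05
z <- qnorm(1 - alpha / 2)
phat <- x / n

# Wald interval: phat +/- z * estimated standard error
wald <- phat + c(-1, 1) * z * sqrt(phat * (1 - phat) / n)

# Score (Wilson) interval: prop.test without continuity correction
score <- as.numeric(prop.test(x, n, conf.level = 1 - alpha,
                              correct = FALSE)$conf.int)

# Clopper-Pearson ("exact") interval, as returned by binom.test
cp <- as.numeric(binom.test(x, n, conf.level = 1 - alpha)$conf.int)

# Agresti-Coull interval: Wald form recentered at (x + z^2/2) / (n + z^2)
ntil <- n + z^2
ptil <- (x + z^2 / 2) / ntil
ac <- ptil + c(-1, 1) * z * sqrt(ptil * (1 - ptil) / ntil)

# Wald interval on the logit scale with 0.5 added to successes and failures
# (assumed interpretation), transformed back with the inverse logit
lgt <- log((x + 0.5) / (n - x + 0.5))
se_lgt <- sqrt(1 / (x + 0.5) + 1 / (n - x + 0.5))
logit_wald <- plogis(lgt + c(-1, 1) * z * se_lgt)

ci <- rbind(Wald = wald, Score = score, ClopperPearson = cp,
            AgrestiCoull = ac, LogitWald = logit_wald)
colnames(ci) <- c("lower", "upper")
print(round(ci, 4))
```

Changing `x` and `n` (for instance to n = 100) shows how quickly the intervals come to agree as the sample size grows.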

Assignment 2. (Second 2 weeks of course, HW due Tuesday Feb. 27, 11:59pm). Read all of Chapter 2, plus Chap. 3 Sections 3.1-3.3 and 3.5.1-3.5.2. Then solve and hand in the following problems:

**(A).** Consider the data in Table 2.8 of the Agresti book, on page 57, and assume that the data sampled in each Age group are independent and identically distributed draws from a well-defined population of employed people. Find a 95% confidence interval for the fraction of each age-defined subpopulation that is *"fairly satisfied"* (i.e., falls in
Job-satisfaction category (2)), according to each of the following methods: (i) Wald, (ii) Inverted Score-test,
(iii) Clopper-Pearson, (iv) Likelihood Ratio Test, and (v) Bayesian credible interval based on a Beta(1,1) = Uniform prior distribution on the unknown fraction p. Now **assume** you know that the proportion falling in Job-satisfaction category (2) is the **same** in all three age-groups, and (vi) find a 95% confidence interval for that proportion.

**(B).** Establish (formulas for) a large-sample CI for log relative risk log(p_{1}/p_{2}) using the Delta
method, in the setting of independent observations X_{j} ~ Binom(n_{j}, p_{j}) for j=1,2, and
apply it to the data in Table 2.1 on Aspirin and Heart Attack Study data on p.38, in two ways: (i) to find a 90% CI
for log RR of fatal heart attack for those on placebo relative to those taking aspirin, and (ii) to find a 95% CI
for log RR for incidence of Heart Attack (whether fatal or not) for those on placebo relative to those taking aspirin.
Assume that the placebo patients and those taking aspirin were sampled independently and equiprobably from large
general populations.

**(C).** The following data from a paper by Helmes and Fekken (1986 Jour. Clin. Psych. **42**, 569-576)
classifies a sample of psychiatric patients by their diagnoses and whether their treatment prescribed drugs:

Drugs

No Drugs

Using these data and assuming the rows were sampled independently and iid, (a) conduct a test of (row vs column) independence and interpret the P-value; (b) obtain standardized residuals and interpret them; and (c) partition the LRT (and the approximating chi-squares, if you like) into three components to describe differences and similarities among the diagnoses, by comparing (i) the first two columns, (ii) the 3rd and 4th columns, and (iii) the last column to the combination of the first two columns and the combination of the 3rd and 4th columns.

*In this problem each sampled individual (out of the total of 276 in the table) is viewed as an iid random draw from a large population of psychiatric patients, with the facts recorded about the psychiatric disorder affecting each patient and then also the information about whether or not they were prescribed drugs. So this is an overall multinomial table, if you regard the number n=276 as fixed in advance. If you condition further on the number of patients receiving and not receiving drugs, each of the rows of the Table becomes multinomial, but that is not the way the experiment was done. The interpretation of the question is that we are testing whether the type of disorder and the fact of receiving drugs as treatment are dependent as categorical random variables.*

**(D).** Of the 14 candidates for 4 managerial positions, 7 are female and 7 male. Denote the females F1,...,F7 and the males M1,...,M7. The actual result of choosing the managers is (F3,M1,M4,M6). (i) How many possible without-replacement samples were there? Construct the contingency table for the without-replacement sample actually obtained. (ii) Find and interpret the P-value for the Fisher's Exact Test of the hypothesis that all candidates were equally likely to be selected. (iii) Re-do the same problem if there were 60 candidates (30 male and 30 female) for the four managerial positions.

**(E).** Find the likelihood ratio test and chi-squared test statistics in a 3x3 table
for the hypothesis H_{0}: p_{jk} ∝ exp(aj + bk + cjk) versus the general alternative.
Find and interpret the P-values for these statistics in the 3x3 table with first row (3,7,18), second row
(5,18,17), and 3rd row (9,35,42). *Note: you should find the likelihood-maximizing values â,
b̂ and ĉ under H_{0}, either using a numerical-maximizing function like*

**(F).** Thirty measurements W_{1},..., W_{30} of body weight of male students are collected by randomly sampling men from a large population. These are thought to be normally distributed. But the only data we have access to are the numbers of the 30 weights respectively falling into
the intervals (0,142], (142,165], (165,180], (180,200] and (200,∞), and those counts are respectively 4, 7, 9, 8, 2. Use these data to find a likelihood ratio test of the null hypothesis that the 30 iid observations were normally distributed.

Assignment 3. (Third 2 weeks of course, HW due 3/16/24 11:59pm). Read Bayes Sec. 3.6, plus Chapter 4 and the first few sections of Chapter 5. Then solve and hand in the following problems (6 in all): **# 4.10 and 4.12 on pages 156-158**, plus the following:

**(I).** Do problem # 3.21 in Agresti Chapter 3. But after doing part (b), do two more parts assigned here:

**ratio** of proportions π_{1}/π_{2}; and

_{1} and π_{2} and probability 1/2 to Uniformly distributed π = π_{1} = π_{2}

**(II).** Fit a logistic regression model to the Crabs mating data with outcome variable (Crabs$sat > 1). Use the predictors: spine (as factor), weight (rounded to nearest 0.2kg), and width (rounded to nearest cm). You may use interactions if they help. Fit the best model you can, and assess the quality of fit of your best model using the techniques in Sec. 5.2.

**(III).** For the "best" model you fit in problem (II): (a) fit the coefficients directly (by a likelihood calculation that you code in R using "optim" with method="BFGS") and also by coding the Fisher-scoring algorithm and using 5 or 10 iterations (starting from all coef's = 0), and (b) check that the SEs for coefficients found by "glm" are close to those found from your observed information matrix estimates calculated in (a).

**(IV).** (Compare problem 4.33 on p.162) Use the formulas in the book or class to show how the observed (whole-dataset, not per-observation) information matrix with the probit link depends on the data and differs from the expected (whole-dataset) or Fisher information.

Assignment 4. HW submissions due by upload to ELMS by April 5, 2024, 11:59pm. **Reading:** the rest of Chapters 5 and 6 plus Sec. 7.1-7.2. Then solve and hand in the following problems (7 in all): 4.20, 5.6, 5.9,
5.30, 5.39, 6.8, 6.14 (ROC and AUC only).

Assignment 5. HW submissions due by upload to ELMS by April 21, 2024, 11:59pm.
Reading for this assignment includes the book and Lecture material on Power (Section 6.6 plus handouts plus Lectures 17-18), Chapter 9 (including some computational topics in Sections 9.6-9.7), and Section 8.1.

Problems to hand in (6 in all): 9.2, 9.3, 9.16 (d and e only), 9.34 in Agresti, plus:

**(A).** Data are to be collected from 6 clinical centers on the effectiveness of a treatment
for heart disease in diabetics. The same number m of diabetic patients will be recruited at each
center, to be randomly divided into 2 groups, m/2 patients to be treated using standard therapy and
m/2 with a new experimental drug. However, three of the clinical centers will restrict their
recruitment to patients with better overall health, and the developers of the treatment have reason
to believe that the difference between probability of positive response for the treatment and
standard therapy should be twice as great for these patients as for those with the average
overall health of the general (diabetic) population. Effectiveness will be measured using an overall
diagnostic evaluation after 6 months, and since the therapy is not risk-free, significance testing will be
done two-tailed. The standard therapy is approximately 40% effective, and it is desired
that the multi-center clinical trial have an overall power 80% (in significance tests of size 0.05)
to detect a positive-response proportion at least 10% larger than the proportion under standard
therapy.

**(B).** Consider the dataset in Table 9.16 (given in Ex.9.1 in Agresti) in which 3
binary factors G, I, and H are measured on 621 subjects. Fit the loglinear model (GI, HI)
[with sum-constraints]. Based on the model, find confidence intervals for each of the
main-effect coefficients for G=Male, for I=Support, and for H=Support, and find a confidence
interval for the probability that a subject will be in the (Male,Support,Support) category.

Assignment 6, due by 11:59pm, Tuesday May 7, 2024. The reading
for the last part of the course consists of Chapter 13, Chapter 14 sections 14.3 and 14.5.

**(I).** (a) Build a simple, reasonable model using fixed-effect logistic regression for the outcome
class="malignant" in the dataset "biopsy" in package MASS, using only the first biopsies for the 645 unique IDs,
and ignoring variable V6.

For all three parts, fit the model and interpret/compare the coefficients and quality of fit.

**(II).** Consider a dataset with a three-level outcome Y (=0,1 or 2) and a predictor X
generated artificially with the model

using the commands

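```r
# Simulate 5000 observations; the outcome is stored as three indicator columns of Ymat
set.seed(7010)
Xvec = rnorm(5000)
Ymat = array(0, c(5000,3))
# Column 1: Bernoulli indicator of the first outcome category
Ymat[,1] = rbinom(5000,1, exp(-1.5+0.6*Xvec)/(1+exp(-1.5+0.6*Xvec)+exp(-.5+0.1*Xvec)))
# Column 2: can equal 1 only where column 1 is 0 (binomial size is 1-Ymat[,1])
Ymat[,2] = rbinom(5000,1-Ymat[,1], plogis(-.5+0.1*Xvec))
# Column 3: the remaining category, so that each row sums to 1
Ymat[,3] = 1-Ymat[,1]-Ymat[,2]
```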

**(III).** Find and display the posterior densities of four coefficients [Intercept and District2,
District3, and District4] in the Bayesian Poisson (fixed-effects) regression of Claims in the dataset
"Insurance" within the R package MASS, using a Bayesian GLMM package in **R** (such as **blme** or **brms** or **bayesglm**). Note that you should specify
the Poisson regression with an "offset" term of log(Holders) added to the Intercept.

**(IV).** Use a Bayesian GLMM package (as in **(III)**, or else a direct Metropolis-Hastings MCMC implementation as in the script **BayesLgst.RLog**) to do a Bayesian analysis of the random intercept effect in the "biopsy" dataset (using only first biopsy for each unique ID, and omitting V6) using a fixed and mixed-effect generalized-linear-model (logistic-regression with random intercept). Your analysis should give point and interval estimates for the standard deviation of the random intercept effect and say something descriptive about the posterior distribution of that standard deviation.
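The "offset" mentioned in problem (III) can be illustrated with an ordinary (non-Bayesian) fit. This is only a sketch of how an offset is specified in `glm`, using the same "Insurance" data from MASS. It is not the Bayesian analysis the problem asks for, and the object names `fit_offset` and `fit_arg` are made up for the illustration.

```r
library(MASS)  # provides the Insurance data frame

# Poisson regression of Claims on District, with log(Holders) entering as an
# offset, i.e. a term in the linear predictor whose coefficient is fixed at 1
fit_offset <- glm(Claims ~ District + offset(log(Holders)),
                  family = poisson, data = Insurance)

# The same model specified through the `offset` argument of glm
fit_arg <- glm(Claims ~ District, offset = log(Holders),
               family = poisson, data = Insurance)

# The two specifications should give the same coefficient estimates
print(cbind(formula_offset = coef(fit_offset), argument_offset = coef(fit_arg)))

# exp(Intercept) is the fitted District 1 claim rate per policyholder; the
# other exponentiated coefficients are rate ratios relative to District 1
print(exp(coef(fit_offset)))
```

With the offset in place, the model describes claims per policyholder rather than raw claim counts, which is why the District coefficients are interpretable as log rate ratios.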

FINAL PROJECT ASSIGNMENT, due Friday, May 17, 2024, 11:59pm (uploaded to ELMS as pdf or MS Word document). As a final course
project, you are to write a paper including 5-10 pages of narrative, plus relevant code and graphical or tabular
exhibits, on a statistical journal article or textbook-chapter (not covered in class) related to the course or else a data analysis or case-study based on a dataset of your choosing. The guideline is that the paper should be 10-12 pages if it is primarily expository based on an article or chapter, but could have somewhat fewer pages of narrative if based on a data-analytic case study. However, for the latter kind of paper, all numerical outputs should be accompanied by code used to generate them, plus clear verbal discussion and interpretation of software outputs and graphical exhibits. For a data-analysis or case study, the paper should present a coherent and reasoned data analysis with evidence and connecting text supporting the model you choose to fit, the variables you choose to include and exclude, whatever indications you can provide for the adequacy of fit of the models, and a summary of what the model says about the generating mechanism of the data.

Two good sources of data for the paper are the StatLib web-site mentioned below,
or Agresti's book web-site.

Possible topics for the paper include: **(a)** Zero-inflated Poisson regression models, based on an
original paper of Lambert but discussed in connection with the Horseshoe Crabs dataset in a web-page posted by Agresti (indexed
also under heading 2. in his book web-page).

**(b)** The relationship between individual heterogeneity and
overdispersion and prediction, in a very nice article connected with a court case in the Netherlands
mentioned as a Handout (number (7) under Handouts section below).

**(c)** Discussion of 'raking' in connection with survey-weighted contingency tables, and extension of the Iterative Proportional Fitting Algorithm covered in the loglinear chapter in the course.

**(d)** I mentioned in class that those of you with interests in Educational Statistics might consider covering some article or book-chapter in categorical data analysis related to Item Response Theory modeling, such as the article *Semiparametric Estimation in the Rasch Model and Related Exponential Response Models, Including a Simple Latent Class Model for Item Analysis*, by Bruce Lindsay, Clifford C. Clogg, John Grego, in the Journal of the American Statistical Association, Vol. 86, No. 413 (Mar., 1991), pp. 96-107, http://www.jstor.org/stable/2289719.

**(e)** You might also base your paper on discussion of an R-package, with data
illustration to do analyses related to some categorical-data data structure not covered in detail in the class, such as analysis of multicategory generalizations of (fixed-effect) logistic or probit regression, or ordinal-outcome categorical modeling, or 'social choice' modeling.

**(f)** Some additional keywords for topics, that I could fill in if there is interest, can be
found here.

**(0) Slide-decks for lectures, adapted and revised from the form in which they were given in Fall 2020, can be found in this directory.**

**(1)** Two pdf-handout files contain readings related to (Maximum Likelihood) estimation of parameters in Generalized Linear Mixed Models (GLMM's), specifically in random-intercept logistic regression models:

(i) A handout from Stat 705 on ML estimation using the EM (Expectation-Maximization) algorithm along with another on MCMC (Markov Chain Monte Carlo) techniques.

(ii) A technical report (written by me for the Small Area Income and Poverty Estimates program at the Census Bureau) on numerical maximization of the random-intercept logistic regression model using the Adaptive Gaussian Quadratures method developed by Pinheiro and Bates (the authors of related nonlinear-model mixed-effect software in R later adapted to NLMIXED in SAS).

**(2).** A link to a lecture by Agresti in Italy on
History of Categorical Data Analysis. Further historical material can be found in an interesting historical article by Stephen Stigler (2002) showing just how recent is the display of data by cross-classification into contingency tables.

**(3).** Handouts produced for other classes cover Asymptotics relating Wald, Score and LRT Tests, and another Proof of Wilks' Theorem and equivalence of corresponding chi-square statistic with
Wald & Rao-Score statistics. Here you can also find a handout on power in contingency-table score tests.

**(4).** You can get an idea of test topics and course coverage from previous semesters in an old In-Class Test from April 14, 2003. Also see a directory
of SASlogs and a
Sample Test. A small writeup of computational details related to the first problem of the sample test can be found here.

**(5).** Proof
of limiting distribution of multinomial Chi-square goodness of fit test statistic.

**(6).** See the directory Survey Confidence Intervals for two papers and a supplement on the extension of binomial confidence intervals to unknown proportions estimated in complex surveys. The JSM 2014 paper, published in the American Statistical Association Proceedings of the Survey Research and Methods Section from the 2014 annual statistical meetings, contains the results of a simulation study showing the relative merits of various binomial-proportion Confidence Intervals adapted to complex survey data. The other paper and supplement, which extends the simulation study and improves the way the intervals are adapted to survey data, has appeared recently in the Journal of Survey Statistics and Methodology.

**(7).** A very interesting case-study on a criminal case in the Netherlands and
the importance of accounting for overdispersion in doing justice to a criminal-defendant. The case study is authored
by eminent Dutch statisticians, Gill, Groeneboom and de Jong. The math is very accessible and the point very clear.

**(8).** A set of R Scripts on many topics related to the course is available in this directory. Those that are specifically cited in Lectures will be separately located in this sub-directory.

**(9).** Several R packages for fitting Generalized Linear Mixed Models (particularly, binomial and Poisson
family random-intercept models) were mentioned in a script-file covered in class. Some only
approximate the GLMM log-likelihood using Monte Carlo techniques, such as glmm or blme, while others (which are most useful
for the relatively simple random-intercept models arising in the applications in Agresti) calculate the
log-likelihoods as accurately as desired using Adaptive Gaussian Quadrature (AGQ): these include lme4, glmmML, or GLMMadaptive, and these can also be checked against my own code in the workspace Rscripts. Also see exposition of AGQ that I wrote in a random-intercept logistic-regression context, which should be accessible and useful to students in this course.

**(10).** Another package that can be used to fit multiple-outcome ("generalized-logistic" or "multinomial")
logistic regression is mlogit.
That package, which may be the only R package currently capable of fitting random-intercept models of generalized
logistic type, was written not for that purpose but to analyze 'social choice' datasets of interest to
econometricians.

**(11).** A handout explains the meaning and use of factors and contrasts, first in the Linear-Model setting and then in connection with categorical-data GLM's and Loglinear models. Both in that handout, and in another handout on loglinear models transformed to GLM's, are discussed the necessary steps to code contrasts and interactions from loglinear models (with sum side-conditions on coefficients) to GLM's.
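For readers of handout (9), the Adaptive Gaussian Quadrature (AGQ) idea can be summarized as follows. This is a sketch in notation chosen here for illustration ($y_{ij}$, $x_{ij}$, $\beta$, $\sigma$, $u_i$, $h_i$, $\hat u_i$, $\tau_i$), not necessarily the notation of the handouts or of any particular package. It assumes a random-intercept logistic model with standard-normal random effects $u_i$ scaled by $\sigma$.

```latex
% Model: logit P(y_{ij} = 1 \mid u_i) = x_{ij}'\beta + \sigma u_i, with u_i \sim N(0,1) iid.
% Likelihood contribution of cluster i (\phi = standard normal density):
L_i(\beta,\sigma) = \int_{-\infty}^{\infty} h_i(u)\,du,
\qquad
h_i(u) = \phi(u) \prod_{j=1}^{n_i} p_{ij}(u)^{y_{ij}} \{1 - p_{ij}(u)\}^{1-y_{ij}},
\qquad
p_{ij}(u) = \frac{e^{x_{ij}'\beta + \sigma u}}{1 + e^{x_{ij}'\beta + \sigma u}}.

% Center and scale the quadrature grid at the mode and curvature of h_i:
\hat u_i = \arg\max_u \, h_i(u),
\qquad
\tau_i = \Bigl[ -\frac{\partial^2}{\partial u^2} \log h_i(u) \Big|_{u = \hat u_i} \Bigr]^{-1/2}.

% K-point approximation, with (z_k, w_k) the Gauss-Hermite nodes and weights
% for the weight function e^{-z^2}:
L_i(\beta,\sigma) \;\approx\; \sqrt{2}\,\tau_i \sum_{k=1}^{K} w_k \, e^{z_k^2} \,
h_i\bigl(\hat u_i + \sqrt{2}\,\tau_i z_k\bigr).
```

With $K=1$ (node $z_1 = 0$, weight $w_1 = \sqrt{\pi}$) this reduces to the Laplace approximation $\sqrt{2\pi}\,\tau_i\,h_i(\hat u_i)$, and increasing $K$ refines the approximation.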

August 27, 2020

**1. Introduction** --- binomial and multinomial probabilities, statistical tests, estimators
and confidence intervals. Law of large numbers, central limit theorem, delta method,
asymptotic normal distribution of maximum likelihood estimators, Wilks' Theorem.

**2. Computational Techniques** -- Numerical Maximization, Fisher Scoring, Iterative Proportional
Fitting, EM Algorithm, Bayesian (MCMC) Computation.

**3. Describing Contingency Tables** --- models and measures of independence vs. association of
categorical variables in multiway contingency tables, including tests of independence. Hypotheses equating
proportions for different variables. Conditional and marginal odds ratios and relative risks. Confidence
intervals for parameters, including Bayesian formulations. Historical notes on contingency table analysis.

**4. Generalized linear models.** Formulation of conditional response probabilities as
linear expressions in terms of covariables. Likelihood and inference. Quasilikelihood and estimating equations.

**5. Logistic regression.** Interpretation and inference on model parameters. Model
fitting, prediction, and comparison.

**6. Model-building** including variable selection, diagnostics and inference about variable
associations in logistic regression models.

**7. Logistic regression extensions**: multiple-category responses, weighted observations,
and missing data.

**8. Loglinear models** and their likelihood-based parametric statistical inference.

**9. Generalized linear models with random effects.** Likelihood and penalized likelihood
based inference. Missing-data formulation.

**10. Comparison of prediction and classification using various model-fitting strategies.** Likelihood, quasilikelihood, penalized likelihood, Bayes. Models & strategies include logistic regression and multicategory
extensions, loglinear models, GLMMs, Support Vector Machines, recursive partitioning & decision trees

Additional Computing Resources. There are many
publicly available datasets for practice data-analyses. Many of them are taken from journal articles
and/or textbooks and documented or interpreted. A good place to start is Statlib. Here is another good source. Datasets needed in the course
will either be posted to the course web-page or indicated by links which will be provided here.

A good set of links to data sources from various organizations including Federal
and international statistical agencies is at Washington
Statistical Society links.

**First Class: Wed., January 24, 2024**

**Two classes Friday Feb. 16 and Monday Feb. 19 will be Asynchronous Zoom classes on ELMS.**

**Last schedule-adjustment Date (for Drop/Withdrawal): April 9, 2020**

**No Classes March 18--22, 2024: Spring Break**

**Last day of classes: Wed. May 8, 2024**

**The UMCP Math Department home page.
The University of Maryland home page.
My home page.
Eric V Slud, Apr. 24, 2024.**