Instructor: Professor Eric Slud, Statistics Program, Math Dept., Rm 2314, x5-5469, slud@umd.edu
Office hours: M 1-2, W 11-12 (initially), or email me to make an
appointment (can be on Zoom).
Syllabus
Course Text: K. Mardia, J. Kent, and J. Bibby, Multivariate Analysis, 1980, Academic Press (paperback, free online). This text covers both theory and data examples, with ample verbal explanations and motivation.

Recommended Texts: (i) R. Johnson and D. Wichern, Applied Multivariate Statistical Analysis, 6th ed., available free as pdf online. This is a popular and good applied book to be used as a source of examples and alternate, intuitive explanations. (ii) A standard and authoritative reference, but theoretical and fairly dry, with deeper mathematical treatment than Mardia, Kent and Bibby. (iii) W. Haerdle and L. Simar, Applied Multivariate Statistical Analysis, 2nd ed., 2007. Another good applied book, maybe at a slightly higher mathematical level than Johnson-Wichern; available as a free e-book to students through the UMD library.

Overview:
This course is about statistical models and methods of inference for multivariate observations with dependent coordinates. Much of the theoretical material relates to the multivariate normal distribution and to the statistical sampling behavior of empirical variance-covariance matrices and of various projections and eigen-decompositions of them. Models studied include regression, principal components analysis, factor models, and canonical correlations. In addition, important algorithmic or machine-learning methods like Clustering and Support Vector Machines will also be discussed. All methods will be illustrated using computational data examples in R.

Prerequisite: STAT 420 or STAT 700. Probability theory material needed throughout this course includes joint probability densities and change-of-variable formulas, the law of large numbers, and the (multivariate) central limit theorem. In addition, the course makes extensive use of linear algebra, especially eigenvalues and eigenspaces and singular value decompositions.

Familiarity with some (any) statistical software package would be very helpful, but familiarity with R would be best. The presentation will be geared to second-year Stat grad students. The data exercises in the course require that you have familiarity with, and access to, a reasonably powerful statistical software package, e.g. R, SAS, Python or MATLAB. I will do examples and provide software scripts in R, and can help you get past coding difficulties in R, but can probably not help much with programming difficulties if you do your data exercises in other languages.

Course requirements and Grading: There will be 6 graded homework sets (one every 1½ to 2 weeks), which together will count 45% of the course grade. These will be divided about evenly between theoretical problems and computational data-analysis problems. There will also be an in-class test (tentatively scheduled for Wed., March 16) and a final take-home exam or project, which will respectively count 25% and 30% toward the overall course grade. All homework and take-home work will be handed in as uploaded *.pdf or *.doc files on ELMS.

The course project will be either a paper on a topic not fully covered in class, with mathematical content related to the material of the course and preferably with an illustrative data analysis, or an extended and coherent data analysis and write-up (of about 10-12 pages, not including computer output). It will be due by midnight, Monday May 16, 2022.

Some Datasets for the project and homework can be found here. Another good source of larger and more challenging datasets is the UCI Machine Learning Repository. Different directories linked on this web-page will contain R Scripts (mostly R Logs containing code, discussion and interpretations for class material that you should work through yourself) and Handouts (mostly pdfs, with some additional R Scripts).
Statistical Computing Scripts and Handouts

(1). There are lots of very handy R packages doing Multivariate Statistics calculations and displays that you can download directly, and I will tell you about them throughout the term. Some will be used as sources of interesting multivariate data, some for implementing the theory that we cover in class. A few of the packages that you will want to install on your own computers (see the short install sketch below) are: mclust, YaleToolkit, ICSNP, and chemometrics. For an overview of all of the R packages doing Multivariate Statistics tasks, try this Multivariate Stats link.
(2). In the RScripts directory you will find, updated periodically, a set of scripts covering class R demonstrations and illustrating related topics and R packages you can use to solve homework problems. (Of course you can use other packages too, but please cross-check any new ones you find that we have not discussed in class, to verify that they compute things the way you expect them to, so that they match the computing formulas we develop in class.)
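For convenience, here is a minimal sketch of installing (once, from CRAN) and attaching the packages named above:

install.packages(c("mclust", "YaleToolkit", "ICSNP", "chemometrics"))
library(mclust)         # model-based clustering; also supplies the banknote data used later
library(YaleToolkit)    # data-exploration and visualization utilities
library(ICSNP)          # multivariate nonparametric and Hotelling-type tests
library(chemometrics)   # multivariate data examples and utilities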
Homework: Assignments, including any changes and hints, will continually be posted here. The most current form of each assignment will also be posted on ELMS. The directory in which you can find old homework assignments and selected problem solutions is Homework.

HW1, due Monday Feb. 7, 11:59pm (upload to ELMS). Read Chapters 1 and 2 of Mardia, Kent and Bibby. Do problems #2.5.1, 2.6.4, 2.7.1 in MKB, along with 4 additional problems (A), (B), (C) and (D) that are written out in the linked pdf document HW1Spr22.pdf in the Homework folder. All 7 are to be handed in (uploaded) on ELMS by Monday Feb. 7. NOTE: There is a correction to the statement of Problem B.(iii), given explicitly on ELMS and also linked here. The revised form of the HW1Spr22.pdf document linked above reflects this correction, as of 5pm 2/7/22.

HW2, due Wednesday Feb. 23, 11:59pm (upload to ELMS). Hand in the following 7 problems by 11:59pm Wednesday 2/23/22. Reading assignment in the MKB text: Chapter 3, omitting 3.4.3, 3.6.2 and 3.7-3.8; Ch. 4, Sec. 4.2 through 4.2.2.2, pp. 102-107; and Ch. 5, Sections 5.2.1 and 5.3.1. Some of Ch. 3 is difficult reading, and you may find the coverage in Haerdle and Simar (2007, 2nd ed.), Chapter 5, Sec. 6.1 and Sec. 7.1, a little less demanding. In that case, the Haerdle and Simar reading (compare the similar treatment in Johnson and Wichern) should be good enough.
(I) Problems 3.2.6 on p.87, 3.4.2 on p.89, and 3.4.16 on p.92 in Chapter 3 of MKB.
(II) Exercise 5.10 on pp.159-160 of Haerdle and Simar, and Ex. 4.21, p.205, in Johnson-Wichern.
(III) Generate 10,000 independent W4(Σ,10) random matrices, where Σ is the 4x4 diagonal matrix diag(1:4). Use this random sample to estimate the 0.5, 0.8, 0.9 and 0.95 quantiles of each of the 4 eigenvalues of a W4(Σ,10) matrix, and look at the histograms of these eigenvalues (10,000 observations for each eigenvalue) to see how each of them differs from a normal distribution.
(IV) In (III), you may use the rWishart function in base R to do the main part of this simulation. Check as part of your simulation, using the distribution of the largest eigenvalue, that rWishart gives you the same result as the simWish function coded from first principles in the classroom demonstration. (A sketch of the rWishart part appears after this assignment.)
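For (III)-(IV), a minimal sketch using only base R (the seed is arbitrary, and the simWish comparison from class is not reproduced here):

set.seed(750)                                    # arbitrary seed, for reproducibility
Sigma <- diag(1:4)                               # 4x4 diagonal scale matrix
W <- rWishart(10000, df = 10, Sigma = Sigma)     # 4 x 4 x 10000 array of W_4(Sigma, 10) draws
eigs <- apply(W, 3, function(M)                  # 4 x 10000 matrix of ordered eigenvalues
  eigen(M, symmetric = TRUE, only.values = TRUE)$values)
apply(eigs, 1, quantile, probs = c(0.5, 0.8, 0.9, 0.95))   # quantiles for each ordered eigenvalue
hist(eigs[1, ], breaks = 60,                     # histogram of the largest eigenvalue
     main = "Largest eigenvalue of a W4(diag(1:4), 10) matrix")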
HW3, due Saturday Mar. 12, 11:59pm (upload to ELMS). Topics: power of tests, UITs, and regression.
(I) (10 points) MKB #3.5.1(c). Problem on the power of the 1-sample Hotelling T2 test at a particular value μ1 ≠ μ0 = 0.
(II) (15 points) Recall the dataset "Coated" within package MVTests (2 samples, associated with different "Coatings", consisting of 15 2-dimensional observations of a Depth and a Number) that we used in class, in the HotellingT2.RLog script in the RScripts directory of the course web-page. We performed a two-sample Hotelling T2 test and used the MVTests function "TwoSamplesHT2" to reproduce the exact statistic value and p-value. That function also outputs simultaneous confidence intervals $CI for the differences between the Depth and Number mean-parameters across the two coatings. Those simultaneous confidence intervals are the ones you would get from implementing the two-sample test as a Union-Intersection Test, as discussed in class on 2/25. Show exactly how you would compute those simultaneous confidence intervals, verifying to 6 decimal-place accuracy that these are the confidence intervals (at the default confidence level 0.95) produced by "TwoSamplesHT2" in MVTests. (A usage sketch of this function appears after this assignment.)
(III) (15 points) (a) Consider the problem of testing H0: μ = k μ0 in terms of an n x p Normal(μ, Σ) data-matrix X, where μ0 is known and (k, Σ) are unknown. Find the Likelihood Ratio Test, including the exact distribution of the test statistic. (The MKB book has MLE information in Ch. 4 to help you with this.)
(IV) This problem investigates the Fisher iris data (which can also be found in the MVTests package). Using the R script provided in http://www.math.umd.edu/~evs/s750/Rscript.Iris, (a) perform and interpret a small Monte Carlo simulation, using the estimated covariance matrix as the true parameter, and (b) bootstrap the iris data, which means: repeatedly draw iid samples of the same size (n = 150) by sampling equiprobably with replacement from the original set of 150 observation vectors (one variable of which is the 3-valued categorical label "Species"), to check the deficiencies of the Wishart as the distribution of the sample covariance matrix in the iris data sample.
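As a starting point for (II), a minimal usage sketch of the MVTests call discussed in class; the column selection assumes measurement columns named Depth and Number and a group column named Coating, so check names(Coated) against your installed version of the package:

library(MVTests)
data(Coated)
# two-sample Hotelling T^2 test of equal mean vectors across the two Coatings
res <- TwoSamplesHT2(data = Coated[, c("Depth", "Number")],
                     group = Coated$Coating, alpha = 0.05)
summary(res)
res$CI     # the simultaneous confidence intervals you are asked to reproduce by hand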
HW4, due Sunday, April 10, 11:59pm (upload to ELMS), 65 points total. Topics: PCA and Factor Analysis.
(I) Do problems 8.2.2-8.2.4 combined into one problem (counts 15 points), 8.4.2 and 8.8.1 in MKB, and 8.10 in Johnson-Wichern, p.473.
(II) Here are two more problems on Factor Analysis: Johnson and Wichern #9.9 and #9.28. For the data on international women's track records in problem 9.28, step through the PCA analysis in #8.18 [but do not hand that analysis in; use it only to inform your factor analysis]. (A sketch of the basic PCA and factor-analysis calls in R appears after this assignment.)
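A minimal sketch of the corresponding base-R workflow, illustrated on the built-in swiss data purely as a stand-in for the assignment's datasets (all tuning choices here are illustrative):

X <- scale(swiss)                   # standardize a generic numeric data matrix
pc <- prcomp(X)                     # PCA on the correlation scale (since X is standardized)
summary(pc)                         # proportion of variance explained by each component
fa <- factanal(X, factors = 2, rotation = "varimax")   # ML factor analysis with rotation
print(fa, cutoff = 0.3)             # suppress small loadings for readability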
HW5, due Monday, April 25, 11:59pm (upload to ELMS), 60 points, 6 problems total. Topics: Factor Analysis, Canonical Correlations, Cluster Analysis.
(I) Do problems 9.6.1, 13.2.1, 13.3.2 in MKB, and 12.7 in Johnson and Wichern.
(II) Do problem 10.2.10 in MKB, but you need not use problem 10.2.9 to find the canonical correlations; use any method you like. As part of the same problem [to be handed in], perform a hypothesis test of whether reading ability is correlated with arithmetic ability (using multivariate-normal assumptions, as in the MKB chapter).
(III) Consider the banknote data (in the mclust package), restricting yourself to the first 100 observations (the genuine banknotes) and omitting the 1st and 6th columns ("Status" and "Top"). (A sketch of loading these data and computing cluster silhouettes appears after this assignment.)
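For (III), a minimal sketch of loading the banknote subset and of a generic clustering-plus-silhouette computation, assuming the mclust and cluster packages (the distance and the clustering method shown are illustrative choices, not required ones):

library(mclust)        # provides the banknote data
library(cluster)       # agnes(), silhouette()
data(banknote)
X <- scale(banknote[1:100, -c(1, 6)])    # genuine notes only; drop "Status" and "Top"
d <- dist(X)                             # Euclidean dissimilarity matrix
ag <- agnes(d, method = "average")       # one of several hierarchical methods to try
for (k in 3:6) {
  cl <- cutree(as.hclust(ag), k = k)     # cut the tree into k clusters
  cat("k =", k, "  mean silhouette width =",
      round(mean(silhouette(cl, d)[, "sil_width"]), 3), "\n")
}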
HW6, due 11:59pm May 10, 2022. A 4-part problem, with point values indicated, totaling 65 pts.

Read in the dataset HCV from the UCI Machine Learning Repository, omitting the first column and deleting the 7 subjects in the "suspect Blood Donor" category, using the following R statements:

HCV = read.csv("https://archive.ics.uci.edu/ml/machine-learning-databases/00571/hcvdat0.csv")[,-1]
HCV = HCV[HCV$Category != "0s=suspect Blood Donor",]
HCV$Category = factor(HCV$Category)
HCV$Sex = ifelse(HCV$Sex=="m", 1, 2)

Of the remaining columns (all numeric except for the factor Category), Category is the class label that we hope to reproduce via clustering. Since the measurements in columns 4:13 are very skewed (positive), convert them to logs before proceeding:

for(i in 4:13) HCV[,i] = log(HCV[,i])

You may find that you want to take further transformations of the columns before going on to other steps (informed by how those steps turn out if you do not transform).

There are a total of 31 missing values in the remaining data. Replace them by the average of the values in their respective columns within the same "Category" (after doing whatever nonlinear transformations of the columns seem appropriate to you, like the logs above). For example, if ALP is missing in a row with Category = "Cirrhosis", then replace it by the average of all ALP values in the "Cirrhosis" records.

(I) (20 pts) Apply k-means, agnes (separately with method="single", method="complete", and method="average"), diana, and at least one other clustering algorithm (your choice) to these data. You might want to pre-process the data (to make all the measurement units and indicators comparable) by re-scaling so that all the numeric columns have variance 1. Some of these clustering methods will give nonsensical answers (e.g. clusters of 1, etc.), so you may want to choose a distance function yourself, by creating a "dissimilarity matrix" in a way other than simply using L2 or L1 distances.
(A) Calculate the confusion matrices for the methods you try that create 4 clusters.
(B) Calculate the "silhouette" for each clustering method with 3 to 6 clusters, and explain, for each method, which number of clusters is best to choose based upon the "silhouette".
(II) (15 pts) Apply a PC or Factor Analysis method to the dataset, ignoring the Sex and Category columns, and find the best grouping of columns you can, into at most 4 groups, for describing these data.
(III) (10 pts) Using [only] the PCs or Factors you found in part (II), together with Sex, re-do the clustering methods you tried in part (I). Did you lose clustering information or precision in passing from (I) to these new clusterings?
(IV) (20 pts) Pick the single most successful of the clustering methods you tried in (I) (or possibly (III)), and assess its accuracy by computing a Confusion Matrix and the (average) Sensitivity and PPV of the 4 clusters across 2000 resampled datasets created by the Nonparametric Bootstrap. By looking at these same performance metrics for the clusters in each of the 2000 resampled datasets, provide the best summary you can of how stable your clustering method is (across these 2000 resampled datasets). (A sketch of the imputation step and of the bootstrap loop appears after this assignment.)

NOTE: Because the "0=Blood Donor" category is so large compared to the others, you may subset it if you like; i.e., if you think it will clarify your clustering, you may reduce the size of the dataset by retaining a random subset (generated once) of 75 of the 533 "0=Blood Donor" records.
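For the imputation step and for part (IV), a minimal sketch, assuming HCV has been prepared as above; the clustering call inside the loop is a placeholder for whichever method you actually select, and the accuracy summary uses a crude best-match labeling of clusters:

# category-wise mean imputation of the missing values (run after your transformations)
for (j in 4:13) {
  miss <- is.na(HCV[, j])
  if (any(miss))
    HCV[miss, j] <- ave(HCV[, j], HCV$Category,
                        FUN = function(x) mean(x, na.rm = TRUE))[miss]
}

# nonparametric bootstrap of clustering performance
set.seed(750)                                  # arbitrary seed, for reproducibility
B <- 2000
n <- nrow(HCV)
acc <- numeric(B)
num <- sapply(HCV, is.numeric)                 # numeric columns to cluster on
for (b in 1:B) {
  boot <- HCV[sample(n, n, replace = TRUE), ]  # resample rows with replacement
  cl <- kmeans(scale(boot[, num]), centers = 4, nstart = 10)$cluster  # placeholder method
  conf <- table(cl, boot$Category)             # confusion matrix: clusters vs. true labels
  acc[b] <- sum(apply(conf, 1, max)) / n       # accuracy after matching each cluster to its modal label
}
summary(acc)                                   # location and spread give a stability summary
hist(acc, breaks = 40, main = "Bootstrap clustering accuracy")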
Lecture Handouts

(A) File of keywords for topics covered, lecture by lecture.
(B) Handout on Spherical Symmetry cleans up some unfinished business from the January 31 lecture on the Spherical Symmetry topic.
(C) Handout on EM Algorithm to supplement our class coverage of EM and its application to ML estimation in the Factor Model.
(D) Text-file containing suggested topics, papers and book-chapters you might use in your Term Project.
(E) A beautifully written tutorial introduction to Spectral Clustering supplements the material we covered briefly on Spectral Clustering in class on April 22, with a little more illustration of implementation (on an ideal dataset and on the iris data) in the R Script IrisCluster.RLog. That Script contains a quick overview of all the clustering methods we have covered, implemented in R. (A toy spectral-clustering sketch also appears after this list.)
(F) Handout summarizing bootstrap ideas for Large-p Clustering based on background bootstrap ideas plus several journal papers. The references and brief notes on the papers can be found here.
(G) The paper Kernel methods in machine learning, Annals of Statistics 2008, is a sophisticated reference beyond the scope of this course, to serve as further reading on kernel methods in multivariate statistics beyond what we cover in the final 3 lectures.
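To accompany (E), a toy spectral-clustering sketch in base R on an idealized two-group dataset (the Gaussian affinity bandwidth, the use of the symmetric normalized Laplacian, and all other choices here are illustrative):

set.seed(1)
X <- rbind(matrix(rnorm(100, mean = 0), 50, 2),   # two well-separated point clouds
           matrix(rnorm(100, mean = 4), 50, 2))
n <- nrow(X)
A <- exp(-as.matrix(dist(X))^2 / 2)               # Gaussian affinity, bandwidth 1
diag(A) <- 0
Dhalf <- diag(1 / sqrt(rowSums(A)))               # D^{-1/2}
L <- diag(n) - Dhalf %*% A %*% Dhalf              # symmetric normalized Laplacian
U <- eigen(L, symmetric = TRUE)$vectors[, (n-1):n]  # eigenvectors of the 2 smallest eigenvalues
U <- U / sqrt(rowSums(U^2))                       # row-normalize (Ng-Jordan-Weiss variant)
cl <- kmeans(U, centers = 2, nstart = 10)$cluster
table(cl, rep(1:2, each = 50))                    # recovered clusters vs. true groups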
Read Chapters 1 and 2 of Mardia, Kent and Bibby.
Do problems # 2.5.1, 2.6.4, 2.7.1 in MKB, along with 4 additional problems (A), (B), (C) and (D) that are written out in the linked pdf document HW1Spr22.pdf in the Homework folder. All 7 are to be handed in (uploaded) Monday Feb. 7 on ELMS.
NOTE: There is a correction to the statement of Problem B.(iii), given explicitly on ELMS and also linked here. The revised form of the HW1Spr22.pdf document linked above reflects this correction, as of 5pm 2/7/22.
Reading assignment in MKB text: Chapter 3 omitting 3.4.3, 3.6.2 and 3.7-3.8, Ch. 4 Sec. 4.2 through 4.2.2.2, pp.~102-107,
and Ch. 5, Section 5.2.1 and 5.3.1
Some of Ch.3 is difficult reading, and you may find the coverage in Haerdle and Simar (2007, 2nd ed.) Chapter 5, Sec. 6.1, Sec.7.1 a little less demanding. In that case, the Haerdle and Simar reading (and compare the similar treatment in Johnson and Wichern) should be good enough.
(I) Problems 3.2.6 on p.87, 3.4.2 on p.89, and 3.4.16 on p.92 in Chapter 3 of MKB.
(II) Exercise 5.10 on pp.159-160 of Haerdle and Simar, and Ex.4.21, p.205 in Johnson-Wichern.
(III) Generate 10,000 independent W4(Σ,10) random matrices, where Σ is a 4x4 diagonal matrix = diag(1:4). Use this random sample to estimate 0.5, 0.8, 0.9 and 0.95 quantiles for each of the 4 eigenvalues of a W4(Σ,10) matrix, and look at the histograms of these eigenvalues (10,000 observations for each eignevalue) to see how each of them differs from a normal distribution.
(IV) In (III), you may use the rWishart function in base-R to do the main part of this simulation. Check as part of your simulation, using the distribution of the largest eigenvalue, that rWishart is giving you the same result as the simWish function coded in classroom demonstration from first principles.
SYLLABUS for Stat 750
We will cover Chapters 2, 3, 5-10, and 13 of the Mardia, Kent, and Bibby book thoroughly. Topics include: the multivariate normal distribution; the Wishart and Hotelling distributions; tests of hypotheses; estimation in the general linear model; distributions of test criteria; generalized distance; principal components; canonical correlations; factor analysis; and clustering. Other chapters and topics in MKB will be touched on more lightly, and material will be taken from the other books and some journal papers. Specific references on machine-learning topics and applied examples will be added as the term progresses.
OUTLINE
0. Overview/Introduction: Matrix and Data Structures.
1. Linear Algebra & Probability Review.
2. Wishart distribution; Hotelling T2; Mahalanobis distance.
3. Statistics based on likelihood for multivariate normal data.
4. Multivariate Regression.
5. Econometric Ideas.
6. Principal Components Analysis.
7. Factor Analysis.
8. Cluster Analysis.
9. Miscellaneous Data-Analytic and Machine-Learning Ideas in Multivariate Stats.
Important Dates
© Eric V Slud, May 4, 2022.