Mon. 3-4, Rm Mth 2400
Fall '05
Eric Slud, Statistics Program, Math Department, Rm 2314, x5-5469
Interested participants should get in touch with
me at evs@math.umd.edu
Research Focus: Large datasets arise naturally in many areas of
science, government, and business. Typically, as a dataset grows, so
does the complexity of the questions one addresses with it. Such
problems range from standard parametric models whose parameters are
allowed to vary with cross-classifying variables, to problems in
regression and time series which lead naturally to the development of
order-selection techniques, to semiparametric statistical inference
(problems where the nuisance parameters are infinite-dimensional but
may be approximated in some sense by finite-dimensional parameters of
growing dimension), to classification and clustering.
Growing parameter dimension violates the formal setting of most
textbook mathematical
statistics, in which parameter-dimension
is fixed and sample size increases to infinity.
The new setting
requires a new Asymptotic Theory which explicitly recognizes the
controlled
growth of the parameter-space of a probability model
as a function of the dataset size.
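As a concrete illustration of why fixed-dimension asymptotics can fail, here is a minimal simulation sketch of the classical Neyman-Scott paired-data problem (see Neyman and Scott 1948 in the reading list). Each of n pairs has its own mean, so the number of nuisance parameters grows with n, and the maximum likelihood estimator of the common variance converges to half the true value. The function name, the uniform distribution used to draw the pair means, and the sample sizes are assumptions chosen only for illustration.

```python
import random


def neyman_scott_estimates(n_pairs, sigma2=1.0, seed=0):
    """Simulate n_pairs pairs X_i1, X_i2 ~ N(mu_i, sigma2), one mean per pair.

    Returns (mle, corrected): the maximum likelihood estimate of sigma2,
    which divides the within-pair sum of squares by the 2n observations,
    and the estimate that divides by the n residual degrees of freedom.
    """
    rng = random.Random(seed)
    sigma = sigma2 ** 0.5
    within_ss = 0.0
    for _ in range(n_pairs):
        # Hypothetical choice: pair means drawn uniformly; any values would do,
        # since the within-pair residuals do not depend on mu_i.
        mu_i = rng.uniform(-10.0, 10.0)
        x1 = rng.gauss(mu_i, sigma)
        x2 = rng.gauss(mu_i, sigma)
        pair_mean = 0.5 * (x1 + x2)
        within_ss += (x1 - pair_mean) ** 2 + (x2 - pair_mean) ** 2
    mle = within_ss / (2 * n_pairs)   # tends to sigma2 / 2 as n grows
    corrected = within_ss / n_pairs   # tends to sigma2 as n grows
    return mle, corrected


if __name__ == "__main__":
    for n in (100, 10_000, 100_000):
        mle, corrected = neyman_scott_estimates(n, sigma2=1.0, seed=1)
        print(f"n={n:>7}  MLE={mle:.3f}  corrected={corrected:.3f}")
```

Dividing by the residual degrees of freedom n rather than by the number of observations 2n restores consistency in this example. In less tractable models no such simple correction is available, which is part of the motivation for the adjusted and modified profile likelihoods among the topics listed below.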
This Research Interaction seminar on mathematical and statistical
topics in Large Cross-Classified Datasets broadly encompasses the
overlap of my students' thesis projects and most of my own current
research interests.
The reading list presented below for this RIT has a somewhat more
theoretical flavor than those in related past RITs I have run.
However, the problem descriptions and applications we discussed in the
past are closely related to those we will discuss this term, and
should serve as good motivation for students in search of research
problems which mix theory and applications.
Graduate Prerequisites: To benefit from this research
activity, a graduate student
should have completed Stat 700
and at least one of Stat 740, 741, 750, or 770,
and have some
familiarity with Statistical Computing at the level of Stat 430
(SAS
programming) or Stat 798C (Splus or R).
Undergraduate Prerequisites: An interested undergraduate should have
had at least one course in Mathematical Statistics (e.g. Stat 401 or
420) and experience, either in courses or projects, with numerical
computing or data analysis.
Graduate Program: Graduate students will be involved in
reading and presenting
papers from the statistical literature
concerning provable properties of models and
statistical
inference methods related to large cross-classified data structures,
including longitudinal data and spatial or survey data
cross-classified or stratified in
terms of many observed
covariates. In some cases, students may explore and
present
software for the statistical analysis of some of the data
structures studied.
(One example would be GEE, or generalized estimating equation,
methods for longitudinal data.)
Undergraduate Program: Undergraduate students will be
involved primarily in
comparative numerical experiments
involving algorithms for simulating and analyzing
the large
cross-classified data structures we study.
Work Schedule: We will meet weekly in the fall of 2005. I will give
the talks for the first couple of weeks. Students will choose from the
following list of Topics and Papers (which will regularly be augmented
on this web-page) and present the material in the weeks after that.
Presentations can be informal, but should be detailed enough, and
present enough background, that we can understand the issues and ideas
clearly. It is expected that many presentations will extend to a
second week.
Topics by Keyword:
misspecified regression models,
random-effect GLMs,
regression-variable & model selection,
principal components analysis & factor analysis,
asymptotics for linear models with growing numbers of regressors,
errors-in-variables ('measurement error') models,
longitudinal models & GEE methods (Generalized Estimating Equations),
Panel Data Econometrics,
classification and clustering in large datasets,
experimental design ('response surface methodology')
Topics by Theoretical Idea:
Rates of growth of parameter dimension p(n) compatible with consistency, asymptotic normality & efficiency of estimators,
Linear model asymptotics; GLM extensions using exponential families,
Profile and adjusted profile likelihoods,
Projected score and Hilbert space techniques,
Estimating equations,
Other modified profile likelihoods,
Properties of automatic model selection methods,
Methods related to misspecified models.
Relevant papers read in past terms (which might be revisited):
Neyman, J. and Scott, E. (1948) Consistent estimates based on partially consistent observations. Econometrica 16, 1-32.
Self, Steven G. and Liang, Kung-Yee (1987) Asymptotic properties of maximum likelihood estimators and likelihood ratio tests under nonstandard conditions. J. Amer. Statist. Assoc. 82, 605-610.
White, Halbert (1982) Maximum likelihood estimation of misspecified models. Econometrica 50, no. 1, 1-25.
Important papers on the focused RIT topic:
Chen, Ru (2005) Misspecified Models with Parameters of Increasing Dimension. University of Maryland College Park Thesis, Statistics Program, August 2005.
He, X. and Shao, Qi-Man (2004) On parameters of increasing dimensions. Preprint.
Jiang, Jiming (1999) Conditional inference about generalized linear mixed models. Ann. Statist. 27, 1974-2007.
Li, Haihong, Lindsay, Bruce G. and Waterman, Richard P. (2003) Efficiency of projected score methods in rectangular array asymptotics. J. Roy. Statist. Soc. Ser. B 65, 191-208.
Lindsay, Bruce, Clogg, C., and Grego, J. (1991) Semiparametric estimation in the Rasch model and related exponential response models, including a simple latent class model for item analysis. J. Amer. Statist. Assoc. 86, 96-107.
McCullagh, P. and Tibshirani, R. (1990) A simple method for the adjustment of profile likelihoods. J. Roy. Statist. Soc. Ser. B 52, 325-344.
Pfanzagl, J. (1993) Incidental versus random nuisance parameters. Ann. Statist. 21, 1663-1691.
Portnoy, Stephen (1988) Asymptotic behavior of likelihood methods for exponential families when the number of parameters tends to infinity. Ann. Statist. 16, 356-366.
Sartori, N. (2005) Modifications to the profile likelihood in models with incidental nuisance parameters. Preprint.
Slud, E. and Vonta, F. (2005) Efficient semiparametric estimators via modified profile likelihood. J. Statist. Planning & Inf. 129, 339-367.
Wei, C. Z. (1992) On predictive least squares principles. Ann. Statist. 20, 1-42.
Yang, Yuhong, and Barron, Andrew R. (1998) An asymptotic property of model selection criteria. IEEE Trans. Inform. Theory 44, no. 1, 95-116.
Yohai, V. and Maronna, R. (1979) Asymptotic behavior of M-estimators for the linear model. Ann. Statist. 7, 258-268.
Many additional references on large datasets with particular reference
to DNA
Microarrays (used in AMSC seminars in Fall '01 and RIT
in Spring '02)
can be found
here.
A
web-site of microarray references, created by a statistician at LSU named
Barry Moser, may also be helpful. There is also a Special Issue of
Statistical Science
(Feb. 2003) on Statistical Challenges and Methods
for Microarray Analysis
which contains survey articles and
bibliographies with many items of methodological
interest for
large cross-classified genomic datasets. Special Issues of other
journals
(including Statistica Sinica) also were devoted to
the topic. See further 2002 RIT
materials here.
In 2003 the RIT broadened to encompass non-genomics problems under the
heading of Statistics of Large Cross-Classified Datasets. You can see
the background discussion, readings and talk titles at the RIT '03
web-page.
Finally, in Spring '04, the Large Cross-Classified Datasets we
considered were (mostly biomedical) multicenter studies unified by
random effects models, under the trendy title of "Meta-Analysis", with
web-page here.
Schedule of Talks:
Sept 26: Eric Slud (1990 JRSSB paper of McCullagh & Tibshirani)
Please get in touch with me at
evs@math.umd.edu
to volunteer to read a paper and give a talk.