Thursdays 1-2, Rm MTH 2400
Fall 2003
Eric Slud, Statistics Program, Math Department, Rm 2314
Interested participants should get in touch with me at evs@math.umd.edu
Research Focus: Large datasets arise naturally
in many areas of science,
government, and business. Typically, as a dataset
grows, so does the complexity of the questions
one addresses with it. Such
problems range from Semiparametric Statistical
Inference to Order-selection
problems in regression and time series, to Classification
and Clustering as in the
Microarray Data problem-area in which I ran the
AMSC seminar in Fall '01
and an RIT in 2002-03. This situation is unlike the formal
setting of most mathematical
statistics, in which parameter-dimension is fixed
and sample size increases to
infinity. The contrast suggests the need for
a new Asymptotics which explicitly
recognizes the growth of the parameter-space
of a probability model as a function
of the size n of the dataset.
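As a small illustration of why such an Asymptotics is needed, here is a minimal sketch of the classical setting of Neyman and Scott (1948), listed among the papers read in past terms: each pair of observations has its own unknown mean, so the number of parameters grows in proportion to the number of pairs. The function name, the range of the nuisance means, and the seeds are assumptions made only for this illustration; they are not taken from the seminar materials.

```python
import random


def neyman_scott_mle(num_pairs, sigma=1.0, seed=0):
    """MLE of sigma^2 when pair i = (X_i1, X_i2) has its own unknown mean mu_i."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_pairs):
        # One nuisance mean per pair (the uniform range is an arbitrary choice).
        mu = rng.uniform(-5.0, 5.0)
        x1 = rng.gauss(mu, sigma)
        x2 = rng.gauss(mu, sigma)
        # Sum of squares about the pair mean reduces to (x1 - x2)^2 / 2.
        total += (x1 - x2) ** 2 / 2.0
    # The MLE divides by the total number of observations, 2 * num_pairs.
    return total / (2 * num_pairs)


for n in (100, 10_000, 100_000):
    print(n, round(neyman_scott_mle(n), 3))
```

As the number of pairs grows, the estimate should settle near sigma^2/2 rather than sigma^2, so the MLE is inconsistent even though the sample size tends to infinity. Dividing instead by num_pairs (the residual degrees of freedom) removes the bias in this particular example.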
This `research interaction' seminar on mathematical/statistical topics in
Large Cross-Classified Datasets broadly encompasses
the overlap of my students'
thesis projects and most of my own current research
interests.
Graduate Prerequisites: To benefit from
this research activity, a graduate student
should have completed Stat 700 and at least one
of Stat 740, 741, 750, or 770, and
have some familiarity with Statistical Computing
at the level of Stat 430 (SAS
programming) or Stat 798C (Splus and SAS).
Undergraduate Prerequisites: An interested
undergraduate should have had at
least one course in Mathematical Statistics (e.g.
Stat 401 or 420) and considerable
experience -- either in courses or projects --
with numerical computing or
data analysis.
Graduate Program: Graduate students will
be involved in reading and presenting
papers from the statistical literature concerning
provable properties of models and
statistical-inference methods related to large
cross-classified data structures,
including longitudinal data and spatial or survey
data cross-classified or stratified in
terms of many observed covariates. In some cases,
students may explore and present
software for the statistical analysis of some
of the data structures studied. (One
example would be GEE, or generalized
estimating equation, methods for
longitudinal data.)
Undergraduate Program: Undergraduate students
will be involved primarily in
comparative numerical experiments involving algorithms
for simulating and analyzing
the large cross-classified data structures we
study.
Work Schedule: Unlike in previous terms, we will meet weekly
in the fall of 2003. Students will choose from
the following list of Topics and Papers
(which will regularly be augmented on this web-page)
and present the material in
subsequent weeks. Presentations can be informal,
but should be detailed enough and
present enough background that we can understand
the issues and ideas clearly. It is
expected that many presentations will extend
to a second week.
Topics:
misspecified models,
random-effect GLMs,
regression-variable & model selection,
principal components analysis,
factor analysis,
asymptotics for models with numbers of regressors growing with sample size,
errors-in-variables (`measurement error') models,
longitudinal models & GEE methods (Generalized Estimating Equations),
Panel Data Econometrics,
classification and clustering in large datasets,
experimental design (`response surface methodology').
Papers read in past terms:
Neyman, J. and Scott, E. (1948) Consistent estimates based on
partially consistent observations. Econometrica 16, 1-32.
Pinheiro, J. and Bates, D. (1995) Documentation of Splus functions
lme and nlme
(with data illustrations).
Robinson, G. K. (1991) That BLUP is a good thing: the estimation
of random effects. With comments and a rejoinder by the author.
Statist. Sci. 6, 15-51.
Self, Steven G. and Liang, Kung-Yee (1987) Asymptotic properties
of maximum likelihood estimators and likelihood ratio tests under
nonstandard conditions. Jour. Amer. Statist. Assoc. 82,
605-610.
Shibata, Ritei (1981) An optimal selection of regression variables.
Biometrika 68, 45-54.
Many additional references on large datasets
with particular reference to DNA
Microarrays (used in AMSC seminars and RITs
in past terms) can be found here.
A web-site of microarray references, created by Barry Moser,
a statistician at LSU, may also be helpful.
Additional papers and books of interest:
Agresti, A. (2002) Categorical Data Analysis, 2nd ed. Chapter 11.
Akaike, Hirotugu (1970) Statistical predictor identification.
Ann. Inst. Statist. Math. 22, 203-217.
Akaike, H. (1973) Information theory and an extension of the maximum
likelihood principle. Proc. 2nd Int. Symp. on Information Theory,
B. N. Petrov and F. Csaki, eds. Akademiai Kiado, Budapest, 267-281.
Barron, Andrew, Rissanen, Jorma, and Yu, Bin (1998) The minimum description
length principle in coding and modeling. Information theory: 1948-1998.
IEEE Trans. Inform. Theory 44, no. 6, 2743-2760.
Box, George E. P., Hunter, William G., and Hunter, J. Stuart (1978)
Statistics for experimenters. An introduction to design, data analysis,
and model building. John Wiley & Sons, New York-Chichester-Brisbane.
Ford, Ian, Titterington, D. M., and Kitsos, Christos P. (1989) Recent
advances in nonlinear experimental design. Technometrics 31, 49-60.
Fuller, Wayne (1987) Measurement Error Models. New York: John Wiley.
Li, Haihong, Lindsay, Bruce G. and Waterman, Richard P. (2003) Efficiency
of projected
score methods in rectangular array asymptotics. J. R. Stat.
Soc. Ser. B 65, 191-208.
Lindsay, Bruce G. and Lesperance, Mary L. (1995) A review of
semiparametric mixture models. Statistical modelling (Leuven, 1993).
J. Statist. Plann. Inference 47, 29-39.
Myers, Raymond H. and Montgomery, Douglas C. (2002) Response Surface
Methodology:
process and product optimization using designed experiments. New
York: John Wiley.
Rao, C. R. and Wu, Y. (2001) On model selection (with discussion).
IMS Lecture Notes-Monograph Series, Vol. 38, pp. 1-64.
Schwarz, G. (1978). Estimating the dimension of a model. Ann. Statist. 6, 461-464.
Slud, E., Stone, M., Smith, P. and Goldstein, M. Jr. (2002) Principal
components
representation of the two-dimensional coronal tongue surface,
Phonetica 59, 108-133.
White, Halbert (1982) Maximum likelihood estimation of misspecified
models. Econometrica 50, no. 1, 1-25.
Yang, Yuhong, and Barron, Andrew R. (1998) An asymptotic property
of model
selection criteria. IEEE Trans. Inform. Theory 44, no.
1, 95-116.
There is also a Special Issue of Statistical
Science (Feb. 2003) on Statistical
Challenges and Methods for Microarray Analysis
which contains survey articles
and bibliographies with many items of interest
for our RIT. Special Issues of
other journals (including Statistica Sinica)
have also been devoted to the topic.
Other papers used in the Candidacy presentation of Ru Chen:
Freedman, D. A. (1983) A note on screening regression equations.
The American Statistician 37, 152-155.
Freedman, Laurence S. and Pee, David (1989) Return to a note on screening
regression equations. The American Statistician 43, 279-282.
Portnoy, Stephen (1988) Asymptotic behavior of likelihood methods
for exponential
families when the number of parameters tends to infinity. Ann.
Statist.
16, 356-366.
Shao, Jun (1997) An asymptotic theory for linear model selection.
With comments and
a rejoinder by the author. Statist. Sinica 7, 221-264.
Other papers used in the Candidacy presentation of Sophie Tsou:
Anderson, T.W. and Rubin, H. (1956) Statistical inference in factor
analysis. Proc. 3rd
Berk. Symp. 5, 111-150.
Tucker, L. (1966) Some mathematical notes on 3-mode factor analysis.
Psychometrika
31(3).
(Other) Papers covered this term:
Andrews, D. W. K. and Ploberger, W. (1994) Optimal
tests when a nuisance parameter is present only under the alternative.
Econometrica 62(6), 1383-1414.
Ghosh, M. and Rao, J. (1994) Small area estimation: an appraisal.
Statist. Sci. 9.
(Other) Papers to be considered for the future:
Liu, M., Taylor, J. and Belin, T. (2000) Multiple
imputation and posterior simulation for
multivariate missing data in longitudinal studies. Biometrics
56, 1157-1163.