Mon. 3-4, Rm Mth 2400
Fall '05
 
Eric Slud, Statistics Program, Math Department, Rm 2314, x5-5469
Interested participants should get in touch with me at evs@math.umd.edu
Research Focus: Large datasets arise naturally in many areas of science,
government, and business. Typically, as a dataset gets larger, the
complexity of the questions one addresses with it also increases. Such
problems range from standard parametric models whose parameters are
allowed to vary with cross-classifying variables, to problems in
regression and time series which lead naturally to the development of
Order-selection techniques, to Semiparametric Statistical Inference
(problems where the nuisance parameters are infinite-dimensional but may
be approximated in some sense by finite-dimensional parameters of growing
dimension), to Classification and Clustering. Growing parameter dimension
violates the formal setting of most textbook mathematical statistics, in
which parameter dimension is fixed and sample size increases to infinity.
The new setting requires a new Asymptotic Theory which explicitly
recognizes the controlled growth of the parameter space of a probability
model as a function of the dataset size.
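To see concretely why a growing parameter dimension breaks the textbook
setting, a minimal sketch of the classical Neyman-Scott example (from the
1948 paper among the past-term readings listed further down) may help.
Each of n strata contributes two normal observations with its own mean and
a common variance, so the number of nuisance parameters grows with n, and
the maximum likelihood estimator of the variance settles near half the
true value instead of the true value. The function name, the range of the
stratum means, and the seed are illustrative assumptions only.

```python
import random


def neyman_scott_mle(n_strata, sigma=1.0, seed=0):
    """MLE of the common variance from n_strata pairs, one mean per pair.

    Hypothetical helper for illustration: every stratum has its own
    nuisance mean, so the parameter dimension is n_strata + 1.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_strata):
        mu = rng.uniform(-5.0, 5.0)      # stratum mean (arbitrary range)
        x1 = rng.gauss(mu, sigma)
        x2 = rng.gauss(mu, sigma)
        xbar = 0.5 * (x1 + x2)           # MLE of this stratum's mean
        total += (x1 - xbar) ** 2 + (x2 - xbar) ** 2
    # The MLE divides by the total number of observations, 2 * n_strata.
    return total / (2 * n_strata)


if __name__ == "__main__":
    for n in (100, 10_000, 100_000):
        # With sigma = 1 the estimates approach 0.5, not 1.0.
        print(n, round(neyman_scott_mle(n), 3))
```

Dividing instead by n_strata (one degree of freedom lost per stratum)
removes the bias in this particular model; the adjusted and modified
profile likelihoods on the topic list address the same issue in more
general models.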
This Research Interaction seminar on mathematical and statistical topics
in Large Cross-Classified Datasets broadly encompasses the overlap of my
students' thesis projects and most of my own current research interests.
The reading list presented below for this RIT has a somewhat more
theoretical flavor than those in related past RITs I have run. However,
the problem descriptions and applications which we discussed in the past
are closely related to those we will discuss this term, and should serve
as very good motivation for students in search of research problems
which mix theory and applications.
Graduate Prerequisites: To benefit from this research activity, a
graduate student should have completed Stat 700 and at least one of
Stat 740, 741, 750, or 770, and have some familiarity with Statistical
Computing at the level of Stat 430 (SAS programming) or Stat 798C
(Splus or R).
Undergraduate Prerequisites: An interested undergraduate should have had
at least one course in Mathematical Statistics (e.g. Stat 401 or 420) and
experience, either in courses or projects, with numerical computing or
data analysis.
Graduate Program: Graduate students will be involved in reading and
presenting papers from the statistical literature concerning provable
properties of models and statistical inference methods related to large
cross-classified data structures, including longitudinal data and spatial
or survey data cross-classified or stratified in terms of many observed
covariates. In some cases, students may explore and present software for
the statistical analysis of some of the data structures studied. (One
example would be GEE, or generalized estimating equation, methods for
longitudinal data.)
Undergraduate Program: Undergraduate students will be involved primarily
in comparative numerical experiments involving algorithms for simulating
and analyzing the large cross-classified data structures we study.
Work Schedule: We will meet weekly in the fall of 2005. Students will
choose from the following list of Topics and Papers (which will regularly
be augmented on this web-page) and present the material in subsequent
weeks, after a couple of weeks of introductory talks by me. Presentations
can be informal, but should be detailed enough, and present enough
background, that we can understand the issues and ideas clearly. It is
expected that many presentations will extend to a second week.
 
Topics by Keyword:
     misspecified regression models, random-effect GLMs,
     regression-variable & model selection,
     principal components analysis & factor analysis,
     asymptotics for linear models with growing numbers of regressors,
     errors-in-variables (`measurement error') models,
     longitudinal models & GEE methods (Generalized Estimating Equations),
     Panel Data Econometrics,
     classification and clustering in large datasets,
     experimental design (`response surface methodology')
Topics by Theoretical Idea:
     Rates of growth of parameter dimension p(n) compatible with
          consistency, asymptotic normality & efficiency of estimators,
     Linear model asymptotics; GLM extensions using exponential families,
     Profile and adjusted profile likelihoods,
     Projected score and Hilbert space techniques,
     Estimating equations,
     Other modified profile likelihoods,
     Properties of automatic model selection methods,
     Methods related to misspecified models.
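As a rough formalization of the first theoretical topic above, the
question of admissible growth rates p(n) can be written as follows. The
notation is an illustrative assumption, not taken from any particular
paper on the reading list.

```latex
% Illustrative notation only: theta_n, p(n), a_n, sigma_n are assumed symbols.
% Data of size n from a model whose parameter dimension p(n) grows with n:
\[
  X_1,\dots,X_n \sim P_{\theta_n}, \qquad
  \theta_n \in \Theta_n \subset \mathbb{R}^{p(n)}, \qquad
  p(n) \to \infty \ \text{ as } n \to \infty .
\]
% Consistency of an estimator \hat\theta_n in Euclidean norm:
\[
  \bigl\| \hat\theta_n - \theta_n \bigr\| \xrightarrow{\;P\;} 0 .
\]
% Asymptotic normality of linear functionals, for unit vectors a_n in R^{p(n)}
% and a suitable standardizing sequence sigma_n(a_n):
\[
  \frac{\sqrt{n}\; a_n^{\top} (\hat\theta_n - \theta_n)}{\sigma_n(a_n)}
  \xrightarrow{\;D\;} N(0,1) .
\]
```

Results in this literature typically take the form of a condition such as
p(n)^k / n -> 0, where the exponent k depends on the model and on which of
the properties (consistency, asymptotic normality, efficiency) is wanted.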
Relevant papers read in past terms (which might be revisited):
Neyman, J. and Scott, E. (1948) Consistent estimates based on partially
consistent observations. Econometrica 16, 1-32.
Self, Steven G. and Liang, Kung-Yee (1987) Asymptotic properties of
maximum likelihood estimators and likelihood ratio tests under
nonstandard conditions. J. Amer. Statist. Assoc. 82, 605-610.
White, Halbert (1982) Maximum likelihood estimation of misspecified
models. Econometrica 50, no. 1, 1-25.
Important papers on the focused RIT topic:
Chen, Ru (2005) Misspecified Models with Parameters of Increasing
Dimension. University of Maryland College Park Thesis, Statistics
Program, August 2005.
He, X. and Shao, Qi-Man (2004) On parameters of increasing dimensions.
Preprint.
Jiang, Jiming (1999) Conditional inference about generalized linear
mixed models. Ann. Statist. 27, 1974-2007.
Li, Haihong, Lindsay, Bruce G. and Waterman, Richard P. (2003)
Efficiency of projected score methods in rectangular array asymptotics.
J. Roy. Statist. Soc. Ser. B 65, 191-208.
Lindsay, Bruce, Clogg, C., and Grego, J. (1991) Semiparametric
estimation in the Rasch model and related exponential response models,
including a simple latent class model for item analysis. J. Amer.
Statist. Assoc. 86, 96-107.
McCullagh, P. and Tibshirani, R. (1990) A simple method for the
adjustment of profile likelihoods. J. Roy. Statist. Soc. Ser. B 52,
325-344.
Pfanzagl, J. (1993) Incidental versus random nuisance parameters.
Ann. Statist. 21, 1663-1691.
Portnoy, Stephen (1988) Asymptotic behavior of likelihood methods for
exponential families when the number of parameters tends to infinity.
Ann. Statist. 16, 356-366.
Sartori, N. (2005) Modifications to the profile likelihood in models
with incidental nuisance parameters. Preprint.
Slud, E. and Vonta, F. (2005) Efficient semiparametric estimators via
modified profile likelihood. J. Statist. Planning & Inf. 44, 339-367.
Wei, C. Z. (1992) On predictive least squares principles. Ann. Statist.
20, 1-42.
Yang, Yuhong, and Barron, Andrew R. (1998) An asymptotic property of
model selection criteria. IEEE Trans. Inform. Theory 44, no. 1, 95-116.
Yohai, V. and Maronna, R. (1979) Asymptotic behavior of M-estimators
for the linear model. Ann. Statist. 7, 258-268.
         
Many additional references on large datasets with particular reference to
DNA Microarrays (used in AMSC seminars in Fall '01 and RIT in Spring '02)
can be found here. A web-site of microarray references, created by a
statistician at LSU named Barry Moser, may also be helpful. There is also
a Special Issue of Statistical Science (Feb. 2003) on Statistical
Challenges and Methods for Microarray Analysis which contains survey
articles and bibliographies with many items of methodological interest
for large cross-classified genomic datasets. Special Issues of other
journals (including Statistica Sinica) were also devoted to the topic.
See further 2002 RIT materials here.
         
In 2003 the RIT broadened to encompass non-genomics problems under the
heading of Statistics of Large Cross-Classified Datasets. You can see the
background discussion, readings and talk titles at the RIT '03 web-page.
Finally, in Spring '04, the Large Cross-Classified Datasets considered
became (mostly biomedical) multicenter studies unified by random effects
models under the trendy title of "Meta-Analysis", with web-page here.
Schedule of Talks:
     Sept 26, Eric Slud (1990 JRSSB paper of McCullagh & Tibshirani)
Please get in touch with me at evs@math.umd.edu to volunteer to read a
paper and give a talk.