Statistics Seminars, Spring 2012
Spring 2012 Talks
(Spring 2012, Seminar No. 1)
SPEAKER: Dr. Yaakov Malinovsky, Dept. of Math. and Stat., UMBC
University of Maryland
Baltimore, MD USA
TITLE: Monotonicity in the Sample Size of the Length of Classical Confidence Intervals
TIME AND PLACE:
February 9, 2012, 3:30pm
 
Room 1313, Math Bldg.
ABSTRACT:
It is proved that the average length of standard confidence intervals for
parameters of gamma and normal distributions
monotonically decreases with the sample size. Though the monotonicity seems
a very natural property, the proofs are based on fine properties of the
classical gamma function and are of independent interest.
(This is joint work with Abram Kagan.)
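As an illustration of the result (not the authors' proof), the monotonicity can be checked numerically for the standard normal-mean interval x-bar ± t_{n-1,α/2} s/√n, whose expected length involves the gamma function through E[s]. The sketch below assumes scipy is available for the t quantile; the helper name is ours.

```python
import math
from scipy.stats import t

def expected_ci_length(n, alpha=0.05, sigma=1.0):
    """Average length of the standard CI for a normal mean with unknown
    variance: 2 * t_{n-1, alpha/2} * E[s] / sqrt(n), where
    E[s] = sigma * sqrt(2/(n-1)) * Gamma(n/2) / Gamma((n-1)/2)."""
    log_es = (0.5 * math.log(2.0 / (n - 1))
              + math.lgamma(n / 2) - math.lgamma((n - 1) / 2))
    t_crit = t.ppf(1 - alpha / 2, df=n - 1)
    return 2.0 * t_crit * sigma * math.exp(log_es) / math.sqrt(n)

lengths = [expected_ci_length(n) for n in range(2, 60)]
```

Printing `lengths` shows a strictly decreasing sequence, consistent with the theorem.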
(Spring 2012, Seminar No. 2)
SPEAKER: Prof. David Hamilton, Dept. of Mathematics
University of Maryland
College Park, MD USA
TITLE: An Accurate Genetic Clock and the Third Moment
TIME AND PLACE:
February 16, 2012, 3:30pm
 
Room 1313, Math Bldg.
ABSTRACT:
The genetic clock uses mutations at molecular markers to estimate the
time T1 of origin of a population. It has become important in the evolution of
species and diseases, forensics, history and genealogy. However, the two types
of methods used yield very different estimates even from the same data. For
humans at about 10,000 ybp, "mean square estimates" (MSE) give results
about 100% higher than "Bayesian analysis of random trees" (BAT). Also, the
SDs are about 50% of T1. (In the last 500 years all methods give similar and
accurate results.)
Our new theory explains why MSE overestimates by about 50%, while
BAT underestimates by about 25%. This is not just a mathematical problem
but involves two quite different physical phenomena. The first comes from
the mutation process itself. The second is macroscopic and arises from the
reproductive dominance of elite lineages. Our method deals with both giving
15% accuracy at 10,000 ybp. This is precise enough to resolve a question first
mentioned in Genesis, argued over by archeologists and linguists (and Nazis):
the origin of the Europeans.
The theory depends on solving a stochastic system of infinite-dimensional
ODEs by hyperbolic Bessel functions. At the heart is a new inequality for
probability distributions P normalized with mean μ = 0 and variance σ² = 1:
if the third moment μ₃ > 0, then P((1, +∞)) > 0.
(Spring 2012, Seminar No. 3)
SPEAKER: Dr. Ping Li, Cornell University
TITLE: Probabilistic Hashing Methods for Fitting Massive Logistic Regressions and SVM with Billions of
Variables
TIME AND PLACE:
February 23, 2012, 3:30pm
 
Room 1313, Math Bldg.
ABSTRACT:
In modern applications, statistical tasks such as classification using logistic regression or SVM often encounter extremely
high-dimensional massive datasets. In the context of search, certain industry applications have used datasets in 2^64 dimensions,
which is larger than the square of a billion. This talk will introduce a recent probabilistic hashing technique called b-bit minwise
hashing (Research Highlights in Comm. of the ACM, 2011), which has been used for efficiently computing set similarities in massive data.
Most recently (NIPS 2011), we realized that b-bit minwise hashing can be seamlessly integrated with statistical learning algorithms
such as logistic regression or SVM to solve extremely large-scale prediction problems. Interestingly, for binary data, b-bit minwise
hashing is substantially more accurate than other popular methods such as random projections. Experimental results on 200GB of data (in a billion
dimensions) will also be presented.
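As a toy sketch of the idea (not Dr. Li's implementation), one can simulate minwise hashing with random linear hash functions and keep only the lowest b bits of each minimum; for b = 1 and sparse sets, the collision rate of the stored bits is roughly 0.5 + 0.5·J, where J is the Jaccard similarity. All names and parameters below are illustrative.

```python
import random

def bbit_minhash(s, num_hashes=500, b=1, prime=(1 << 61) - 1, seed=42):
    """b-bit minwise signature: for each random hash function, store only
    the lowest b bits of the minimum hashed element of the set s."""
    rng = random.Random(seed)
    sig = []
    for _ in range(num_hashes):
        a, c = rng.randrange(1, prime), rng.randrange(prime)
        m = min((a * x + c) % prime for x in s)
        sig.append(m & ((1 << b) - 1))
    return sig

def collision_rate(sig1, sig2):
    # fraction of positions where the stored b-bit minima agree
    return sum(u == v for u, v in zip(sig1, sig2)) / len(sig1)
```

For example, A = {0..99} and B = {50..149} have Jaccard similarity 1/3, so the b = 1 collision rate should land near 0.5 + 0.5/3 ≈ 0.67.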
(Spring 2012, Seminar No. 4)
SPEAKER: Yuriy Sverchkov, Intelligent Systems Program
University of Pittsburgh
TITLE: A Multivariate Probabilistic Method for Comparing Datasets
TIME AND PLACE:
March 1, 2012, 3:30pm
 
Room 1313, Math Bldg.
ABSTRACT:
We present a novel method for obtaining a concise and mathematically
grounded description of multivariate differences between a pair of
datasets. Often data collected under similar circumstances reflect
fundamentally different patterns. For example, information about patients
undergoing similar treatments in different intensive care units (ICUs), or
within the same ICU during different periods, may show systematically
different outcomes. In such circumstances, the multivariate probability
distributions induced by the datasets would differ in selected ways. To
capture the probabilistic relationships, we learn a Bayesian network (BN)
from the union of the two datasets. We include an indicator variable that
represents the dataset from which a given patient record is obtained. We
then extract the relevant conditional distributions from the network by
finding the conditional probabilities that differ most when conditioning
on the indicator variable. Our work is a form of explanation that bears
some similarity to previous work on BN explanation; however, while
previous work has mostly focused on justifying inference, our work is
aimed at explaining multivariate differences between distributions.
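A stripped-down version of the idea (using raw conditional frequencies rather than a learned Bayesian network, with hypothetical variable names) might look like:

```python
from collections import defaultdict, Counter

def largest_conditional_shift(records, target, parents):
    """records: dicts containing `target`, the `parents`, and a 'dataset'
    indicator in {0, 1}. Returns the parent configuration where the
    conditional distribution of `target` differs most between the two
    datasets, measured by total variation distance."""
    groups = defaultdict(lambda: (Counter(), Counter()))
    for r in records:
        key = tuple(r[p] for p in parents)
        groups[key][r["dataset"]][r[target]] += 1
    best_key, best_tv = None, -1.0
    for key, (c0, c1) in groups.items():
        if not c0 or not c1:
            continue  # configuration unseen in one dataset
        n0, n1 = sum(c0.values()), sum(c1.values())
        values = set(c0) | set(c1)
        tv = 0.5 * sum(abs(c0[v] / n0 - c1[v] / n1) for v in values)
        if tv > best_tv:
            best_key, best_tv = key, tv
    return best_key, best_tv
```

For instance, if outcomes for high-severity patients flip between two ICUs while low-severity outcomes stay the same, the function flags the high-severity configuration with total variation distance 1.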
(Spring 2012, Seminar No. 5)
SPEAKER: Paul M. Torrens, Associate Professor, Geosimulation Research Laboratory,
Department of Geographical Sciences
University of Maryland
College Park, MD USA
TITLE: Modeling Human Movement
TIME AND PLACE:
March 8, 2012, 3:30pm
 
Room 1313, Math Bldg.
ABSTRACT: The boundaries between the real and virtual are continually being blurred in geographic research. Owing to the difficulty in
experimenting with real people and places on the ground, geographers are turning, increasingly, to synthetic, computer-based worlds for testing their
ideas, plans, policies, and hypotheses. These simulation environments are incredibly useful for exploring human behavior in critical situations,
which are practically inaccessible to academic inquiry by other means. One example domain is modeling human movement, which is important in exploring
a variety of systems and problems, from estimating evacuation potential and planning pedestrian infrastructure, to understanding crowd dynamics and
marketing retail facilities. Building models of these things is a challenge: to be useful, the models need to be realistic. For geographers, this
often means that the models should put the right people in the right places and times, doing the right things, in the right company and context. This
is not always an easy task and many model-builders have relied upon mathematics and algorithms that are "good enough" proxies for geography, drawing
inspiration from parsimonious methods from physics, economics, and informatics to build their simulations. This often places the models at odds with
the theory that they purport to explore and with the reality on the ground that they promise to mimic. The discrepancies between traditional methods
for representing movement and our evolving understanding of the realities of movement have become more and more apparent as high-resolution data from
location-aware technologies have become more common.
My contention is that much more useful models and experimental schemes can be built in computer models if we seed them with realistic human,
behavioral, and urban geography, potentially with the result of expanding the range of questions that can actually be posed in simulation. This could
advance geographic information science and geocomputing in some important ways, but it could also forge new connections between quantitative and
computational geography and the more theoretical and practical interests that computational scientists and social scientists share.
I will demonstrate my research to develop a flexible and fundamentally realistic pipeline for simulating human movement, one that caters to the
biomechanics of the human body, the cognition that allows humans to acquire geographic information while moving through social and built
infrastructure, and the behavior that enables them to make use of that information to determine their actions and interactions at multiple scales of
space and time. The models are built on theoretical assumptions, but they are also "fed" with realistic data from location-aware technologies. I will
show the usefulness of this scheme for applications in ordinary and extraordinary pedestrian and crowd movement.
(Spring 2012, Seminar No. 6)
SPEAKER: Prof. Bimal Sinha, Dept. of Mathematics and
Statistics, UMBC
University of Maryland
Baltimore, MD USA
TITLE: Generalized P-values: Theory and Applications
TIME AND PLACE:
Thursday, March 15, 2012, 3:30pm
 
Room 1313, Math Bldg.
ABSTRACT:
During the last fifteen years or so, generalized P-values have become
quite useful in solving testing problems in many non-standard situations.
In this talk the notion of a generalized P-value will be explained and its
many applications will be presented. The application area will mostly
include linear models.
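One classical application is the Behrens-Fisher problem of comparing two normal means with unequal variances. A generalized p-value can be approximated by Monte Carlo from generalized pivotal quantities for the two means; the sketch below assumes numpy and is our illustration, not material from the talk.

```python
import numpy as np

def behrens_fisher_gpv(x, y, B=200_000, seed=0):
    """Monte Carlo generalized p-value for H0: mu_x = mu_y when the two
    normal samples may have unequal variances. Uses the generalized
    pivotal quantity mu_i ~ xbar_i - T_i * s_i / sqrt(n_i), where T_i is
    a Student-t draw on n_i - 1 degrees of freedom."""
    rng = np.random.default_rng(seed)
    x, y = np.asarray(x, float), np.asarray(y, float)
    nx, ny = len(x), len(y)
    mux = x.mean() - rng.standard_t(nx - 1, B) * x.std(ddof=1) / np.sqrt(nx)
    muy = y.mean() - rng.standard_t(ny - 1, B) * y.std(ddof=1) / np.sqrt(ny)
    d = mux - muy
    # two-sided p-value from the Monte Carlo distribution of the pivot
    return 2 * min((d <= 0).mean(), (d >= 0).mean())
```

Clearly separated samples yield a tiny p-value, while samples with equal means yield a large one.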
(Spring 2012, Seminar No. 7)
SPEAKER: Prof. Hegang H. Chen, Division of Biostatistics and Bioinformatics,
UMD School of Medicine
University of Maryland
Baltimore, MD USA
TITLE: Optimal Selection Criteria for Regular Fractional Factorial Designs
TIME AND PLACE:
Thursday, April 5, 2012, 3:30pm
 
Room 1313, Math Bldg.
ABSTRACT:
Fractional factorial designs have a long history of successful use in scientific investigations. Resolution (Box and Hunter (1961)) and its
refinement, minimum aberration (Fries and Hunter (1980)), are commonly used criteria for selecting regular fractional factorial designs. Both of
these criteria are based on wordlength patterns of the designs. Cheng, Steinberg and Sun (1999) showed that the minimum aberration criterion is a good
surrogate for some model-robustness criteria such as maximum estimation capacity. Recently, the concept of estimation index (Chen and Cheng (2004))
was proposed to help assess a fractional factorial design's capability to estimate factorial effects. The estimation index provides some insight into
when a design is capable of entertaining the largest number of lower-order effects. In this talk, the relationships among estimation index,
resolution, minimum aberration and estimation capacity will be discussed. In addition to deriving some general results of relationship among various
criteria, I will demonstrate how to combine information on wordlength pattern and estimation index to study the estimation capability of regular
fractional factorial designs. This talk is based on joint work with Prof. Ching-Shui Cheng.
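For illustration (not from the talk), the wordlength pattern of a regular two-level design can be computed directly from its defining words: encode factors as bitmasks, generate the defining contrast subgroup by XOR, and count word lengths; the resolution is the length of the shortest word.

```python
from itertools import combinations

def wordlength_pattern(generators, k):
    """Wordlength pattern (A_1, ..., A_k) of a regular two-level design,
    from its defining words given as bitmasks over the k factors."""
    words = set()
    for r in range(1, len(generators) + 1):
        for combo in combinations(generators, r):
            w = 0
            for g in combo:
                w ^= g          # products of defining words, mod squares
            words.add(w)
    lengths = [bin(w).count("1") for w in words]
    return [lengths.count(i) for i in range(1, k + 1)]

# hypothetical textbook example: 2^(5-2) design with I = ABD = ACE
A, B, C, D, E = (1 << i for i in range(5))
pattern = wordlength_pattern([A | B | D, A | C | E], 5)
resolution = min(i + 1 for i, a in enumerate(pattern) if a > 0)
```

Here the subgroup is {ABD, ACE, BCDE}, so the pattern is (0, 0, 2, 1, 0) and the design has resolution III.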
(Spring 2012, Seminar No. 8)
SPEAKER: Dr. Michail Sverchkov, Bureau of Labor Statistics
TITLE: On Modeling and Estimation of Response Probabilities when Missing
Data are Not Missing at Random
TIME AND PLACE:
Thursday, April 12, 2012, 3:30pm
 
Room 1313, Math Bldg.
ABSTRACT:
Most methods that deal with the estimation of response probabilities
assume, either explicitly or implicitly, that the missing data are missing
at random (MAR). However, in many practical situations this assumption is
not valid, since the probability to respond often depends on the outcome
value or on latent variables related to the outcome. The case where the
missing data are not MAR (NMAR) can be treated by postulating a parametric
model for the distribution of the outcomes under full response and a model
for the response probabilities. The two models define a parametric model
for the joint distribution of the outcome and the response indicator, and
therefore the parameters of this model can be estimated by maximizing the
likelihood corresponding to this distribution. Modeling the distribution
of the outcomes under full response, however, can be problematic since no
data are available from this distribution. Back in 2008 the speaker
proposed an approach that permits estimating the parameters of the model
for the response probabilities without modeling the distribution of the
outcomes under full response. The approach utilizes relationships between
the sample distribution and the sample-complement distribution derived by
Sverchkov and Pfeffermann in 2004. The present paper extends the above
approach.
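To see why not-at-random response matters, a small simulation (our illustration only, with the response probability assumed known) shows that the naive respondent mean is badly biased while inverse-probability weighting recovers the truth:

```python
import math
import random

random.seed(0)
N = 200_000
population = [random.gauss(0.0, 1.0) for _ in range(N)]  # true mean 0

def response_prob(y):
    # NMAR: the chance of responding depends on the outcome itself
    return 1.0 / (1.0 + math.exp(-y))

respondents = [y for y in population if random.random() < response_prob(y)]

naive_mean = sum(respondents) / len(respondents)  # biased upward
# inverse-probability weighting with the (here, known) response model
w = [1.0 / response_prob(y) for y in respondents]
weighted_mean = sum(wi * y for wi, y in zip(w, respondents)) / sum(w)
```

Large positive outcomes respond more often, so `naive_mean` comes out around 0.4 instead of 0, while `weighted_mean` is close to 0. The practical difficulty addressed in the talk is that the response model must be estimated, not assumed known.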
(Spring 2012, Seminar No. 9)
SPEAKER: Professor Yury Tyurin, Moscow State University,
MOSCOW
TITLE: Geometric Theory of Multivariate Analysis
TIME AND PLACE:
Thursday, April 19, 2012, 3:30pm
 
Room 1313, Math Bldg.
ABSTRACT:
In the talk a new geometric approach to multivariate analysis will be
presented. It will be illustrated on inference for linear models.
(Spring 2012, Seminar No. 10)
SPEAKER: Professor Jian-Jian Ren, University of Maryland - College Park
TITLE: BIVARIATE NONPARAMETRIC MAXIMUM LIKELIHOOD ESTIMATOR WITH RIGHT CENSORED DATA
TIME AND PLACE:
Thursday, April 26, 2012, 3:30pm
 
Room 1313, Math Bldg.
ABSTRACT:
In the analysis of survival data, we often encounter situations where the response variable (the survival time) T is subject to right censoring, but
the covariates Z are completely observable. To use the nonparametric approach (i.e., without imposing any model assumptions) in the study of the
relation between the right censored response variable T and the completely observable covariate variable Z, one natural thing to do is to estimate
the bivariate distribution function F_o(t, z) of (T, Z) based on the available bivariate data which are right censored in one coordinate, which we
call BD1RC data. In this article, we derive the bivariate nonparametric maximum likelihood estimator (BNPMLE) F_n(t, z) for F_o(t, z) based on the
BD1RC data, which has an explicit expression and is unique in the sense of empirical likelihood. Other nice features of F_n(t, z) include that it has
only nonnegative probability masses, and thus is monotone in the bivariate sense; these properties generally do not hold for most existing
distribution estimators with censored bivariate data. We show that under the BNPMLE F_n(t, z), the conditional distribution function (d.f.) of T
given Z has the same form as the Kaplan-Meier estimator in the univariate case, and that the marginal d.f. F_n(\infty, z) coincides with the
empirical d.f. of the covariate sample. We also show that when there is no censoring, F_n(t, z) coincides with the bivariate empirical distribution
function. For the case with discrete covariate Z, the strong consistency and weak convergence of F_n(t, z) are established. The extension of our
BNPMLE F_n(t, z) to the case of p-variate Z with p > 1 is straightforward. This work is joint with Tonya Riddlesworth.
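For the discrete-covariate case, the estimator described above can be sketched (illustratively, not the authors' code) as the empirical mixture over covariate values of within-stratum Kaplan-Meier distribution functions:

```python
def km_cdf(times, events, t):
    """Product-limit (Kaplan-Meier) estimate of P(T <= t) from
    right-censored data; events[i] is 1 for an observed failure."""
    surv = 1.0
    for u in sorted({tt for tt, e in zip(times, events) if e and tt <= t}):
        at_risk = sum(1 for tt in times if tt >= u)
        deaths = sum(1 for tt, e in zip(times, events) if e and tt == u)
        surv *= 1.0 - deaths / at_risk
    return 1.0 - surv

def bivariate_bnpmle(data, t, z):
    """Sketch of F_n(t, z) for discrete Z: mix the within-stratum
    Kaplan-Meier d.f. of T by the empirical distribution of Z."""
    n = len(data)  # data: list of (time, event, covariate) triples
    total = 0.0
    for v in {c for _, _, c in data}:
        if v <= z:
            sub = [(tt, e) for tt, e, c in data if c == v]
            total += (len(sub) / n) * km_cdf([tt for tt, _ in sub],
                                             [e for _, e in sub], t)
    return total
```

Consistent with the abstract, when no observation is censored this reduces exactly to the bivariate empirical distribution function.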
(Spring 2012, Seminar No. 11)
SPEAKER: Eric Slud (Census Bureau and UMCP) and Jiraphan Suntornchost (UMCP)
University of Maryland - College Park
TITLE: Parametric Survival Densities from Phase-Type Models
TIME AND PLACE:
Thursday, May 3, 2012, 3:30pm
 
Room 1313, Math Bldg.
ABSTRACT:
After a historical survey of parametric survival models,
from actuarial, biomedical, demographical and engineering sources, we
will discuss the persistent reasons why parametric models still play
an important role in exploratory statistical research. The phase-type
models are advanced as a flexible family of latent-class models with
interpretable components. These models are now supported by
computational statistical methods that make numerical calculation of
likelihoods and statistical estimation of parameters feasible in
theory for quite complicated settings. However, consideration of Fisher
Information and likelihood-ratio type tests (Kullback-Leibler
distances) to discriminate between model families indicates that only
the simplest phase-type model topologies can be stably estimated in
practice, even on rather large datasets. An example of a parametric
model with features of mixtures, multiple stages or `hits', and a
trapping-state is given to illustrate simple computational tools in R,
both on simulated data and on a large SEER 1992-2002 breast-cancer
dataset.
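A minimal numerical sketch of a phase-type survival density, using a hypothetical two-phase Coxian model and assuming numpy and scipy for the matrix exponential (the rates are ours, not from the SEER analysis):

```python
import numpy as np
from scipy.linalg import expm

# hypothetical two-phase Coxian model: start in phase 1, then either
# exit (rate 0.5) or move to phase 2 (rate 1.0), which exits at rate 0.8
alpha = np.array([1.0, 0.0])   # initial distribution over transient phases
T = np.array([[-1.5, 1.0],
              [0.0, -0.8]])    # sub-generator of the transient phases
t0 = -T @ np.ones(2)           # exit rates into the absorbing state

def phase_type_density(t):
    # f(t) = alpha * exp(T t) * t0
    return float(alpha @ expm(T * t) @ t0)

def phase_type_survival(t):
    # S(t) = alpha * exp(T t) * 1
    return float(alpha @ expm(T * t) @ np.ones(2))
```

The interpretable components the abstract mentions are visible here: phases act as latent stages ("hits"), and mixtures or trapping states are obtained by changing alpha and T.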