Statistics Seminars, Spring 2012
Spring 2012 Talks
(Spring 2012, Seminar No. 1)
SPEAKER: Dr. Yaakov Malinovsky, Dept. of Math. and Stat., UMBC
University of Maryland
Baltimore, MD USA
TITLE: Monotonicity in the Sample Size of the Length of Classical Confidence Intervals
TIME AND PLACE:
February 9, 2012, 3:30pm
 
Room 1313, Math Bldg.
ABSTRACT:
It is proved that the average length of standard confidence intervals for
parameters of gamma and normal distributions
monotonically decreases with the sample size. Though the monotonicity seems
a very natural property, the proofs are based on fine properties of the
classical gamma function and are of independent interest.
(This is joint work with Abram Kagan.)
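As an illustration of the result (not the authors' proof), the monotonicity can be checked numerically for the standard normal-mean interval x-bar ± t_{n-1,α/2} s/√n, whose expected length involves the gamma function through E[s]. The sketch below assumes scipy is available for the t quantile; the helper name is ours.

```python
import math
from scipy.stats import t

def expected_ci_length(n, alpha=0.05, sigma=1.0):
    """Average length of the standard CI for a normal mean with unknown
    variance: 2 * t_{n-1, alpha/2} * E[s] / sqrt(n), where
    E[s] = sigma * sqrt(2/(n-1)) * Gamma(n/2) / Gamma((n-1)/2)."""
    log_es = (0.5 * math.log(2.0 / (n - 1))
              + math.lgamma(n / 2) - math.lgamma((n - 1) / 2))
    t_crit = t.ppf(1 - alpha / 2, df=n - 1)
    return 2.0 * t_crit * sigma * math.exp(log_es) / math.sqrt(n)

lengths = [expected_ci_length(n) for n in range(2, 60)]
```

Printing `lengths` shows a strictly decreasing sequence, consistent with the theorem.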
(Spring 2012, Seminar No. 2)
SPEAKER: Prof. David Hamilton, Dept. of Mathematics
University of Maryland
College Park, MD USA
TITLE: An Accurate Genetic Clock and the Third Moment
TIME AND PLACE:
February 16, 2012, 3:30pm
 
Room 1313, Math Bldg.
ABSTRACT:
The genetic clock uses mutations at molecular markers to estimate the
time T1 of origin of a population. It has become important in the evolution of
species and diseases, forensics, history and genealogy. However, the two types
of methods used yield very different estimates even from the same data. For
humans at about 10,000 ybp, "mean square estimates" (MSE) give results
about 100% higher than "Bayesian analysis of random trees" (BAT). Also, the
SDs are about 50% of T1. (In the last 500 years all methods give similar and
accurate results.)
Our new theory explains why MSE overestimates by about 50%, while
BAT underestimates by about 25%. This is not just a mathematical problem
but involves two quite different physical phenomena. The first comes from
the mutation process itself. The second is macroscopic and arises from the
reproductive dominance of elite lineages. Our method deals with both giving
15% accuracy at 10,000 ybp. This is precise enough to resolve a question first
mentioned in Genesis, argued over by archeologists and linguists (and Nazis):
the origin of the Europeans.
The theory depends on solving a stochastic system of infinite-dimensional
ODEs by hyperbolic Bessel functions. At the heart is a new inequality for
probability distributions P normalized with mean μ = 0 and variance σ² = 1:
if the third moment μ₃ > 0, then P((1, +∞)) > 0.
(Spring 2012, Seminar No. 3)
SPEAKER: Dr. Ping Li, Cornell University
TITLE: Probabilistic Hashing Methods for Fitting Massive Logistic Regressions and SVM with Billions of
Variables
TIME AND PLACE:
February 23, 2012, 3:30pm
 
Room 1313, Math Bldg.
ABSTRACT:
In modern applications, statistical tasks such as classification using logistic regression or SVM often encounter extremely
high-dimensional massive datasets. In the context of search, certain industry applications have used datasets in 2^64 dimensions,
which is larger than the square of a billion. This talk will introduce a recent probabilistic hashing technique called b-bit minwise
hashing (Research Highlights in Comm. of the ACM, 2011), which has been used for efficiently computing set similarities in massive data.
Most recently (NIPS 2011), we realized that b-bit minwise hashing can be seamlessly integrated with statistical learning algorithms
such as logistic regression or SVM to solve extremely large-scale prediction problems. Interestingly, for binary data, b-bit minwise
hashing is substantially more accurate than other popular methods such as random projections. Experimental results on 200GB of data (in a billion
dimensions) will also be presented.
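As a toy sketch of the idea (not Dr. Li's implementation), one can simulate minwise hashing with random linear hash functions and keep only the lowest b bits of each minimum; for b = 1 and sparse sets, the collision rate of the stored bits is roughly 0.5 + 0.5·J, where J is the Jaccard similarity. All names and parameters below are illustrative.

```python
import random

def bbit_minhash(s, num_hashes=500, b=1, prime=(1 << 61) - 1, seed=42):
    """b-bit minwise signature: for each random hash function, store only
    the lowest b bits of the minimum hashed element of the set s."""
    rng = random.Random(seed)
    sig = []
    for _ in range(num_hashes):
        a, c = rng.randrange(1, prime), rng.randrange(prime)
        m = min((a * x + c) % prime for x in s)
        sig.append(m & ((1 << b) - 1))
    return sig

def collision_rate(sig1, sig2):
    # fraction of positions where the stored b-bit minima agree
    return sum(u == v for u, v in zip(sig1, sig2)) / len(sig1)
```

For example, A = {0..99} and B = {50..149} have Jaccard similarity 1/3, so the b = 1 collision rate should land near 0.5 + 0.5/3 ≈ 0.67.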
(Spring 2012, Seminar No. 4)
SPEAKER: Yuriy Sverchkov, Intelligent Systems Program
University of Pittsburgh
TITLE: A Multivariate Probabilistic Method for Comparing Datasets
TIME AND PLACE:
March 1, 2012, 3:30pm
 
Room 1313, Math Bldg.
ABSTRACT:
We present a novel method for obtaining a concise and mathematically
grounded description of multivariate differences between a pair of
datasets. Often data collected under similar circumstances reflect
fundamentally different patterns. For example, information about patients
undergoing similar treatments in different intensive care units (ICUs), or
within the same ICU during different periods, may show systematically
different outcomes. In such circumstances, the multivariate probability
distributions induced by the datasets would differ in selected ways. To
capture the probabilistic relationships, we learn a Bayesian network (BN)
from the union of the two datasets. We include an indicator variable that
represents the dataset from which a given patient record is obtained. We
then extract the relevant conditional distributions from the network by
finding the conditional probabilities that differ most when conditioning
on the indicator variable. Our work is a form of explanation that bears
some similarity to previous work on BN explanation; however, while
previous work has mostly focused on justifying inference, our work is
aimed at explaining multivariate differences between distributions.
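A stripped-down version of the idea (using raw conditional frequencies rather than a learned Bayesian network, with hypothetical variable names) might look like:

```python
from collections import defaultdict, Counter

def largest_conditional_shift(records, target, parents):
    """records: dicts containing `target`, the `parents`, and a 'dataset'
    indicator in {0, 1}. Returns the parent configuration where the
    conditional distribution of `target` differs most between the two
    datasets, measured by total variation distance."""
    groups = defaultdict(lambda: (Counter(), Counter()))
    for r in records:
        key = tuple(r[p] for p in parents)
        groups[key][r["dataset"]][r[target]] += 1
    best_key, best_tv = None, -1.0
    for key, (c0, c1) in groups.items():
        if not c0 or not c1:
            continue  # configuration unseen in one dataset
        n0, n1 = sum(c0.values()), sum(c1.values())
        values = set(c0) | set(c1)
        tv = 0.5 * sum(abs(c0[v] / n0 - c1[v] / n1) for v in values)
        if tv > best_tv:
            best_key, best_tv = key, tv
    return best_key, best_tv
```

For instance, if outcomes for high-severity patients flip between two ICUs while low-severity outcomes stay the same, the function flags the high-severity configuration with total variation distance 1.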
(Spring 2012, Seminar No. 5)
SPEAKER: Paul M. Torrens, Associate Professor, Geosimulation Research Laboratory,
Department of Geographical Sciences
University of Maryland
College Park, MD USA
TITLE: Modeling Human Movement
TIME AND PLACE:
March 8, 2012, 3:30pm
 
Room 1313, Math Bldg.
ABSTRACT: The boundaries between the real and virtual are continually being blurred in geographic research. Owing to the difficulty in
experimenting with real people and places on the ground, geographers are turning, increasingly, to synthetic, computer-based worlds for testing their
ideas, plans, policies, and hypotheses. These simulation environments are incredibly useful for exploring human behavior in critical situations,
which are practically inaccessible to academic inquiry by other means. One example domain is modeling human movement, which is important in exploring
a variety of systems and problems, from estimating evacuation potential and planning pedestrian infrastructure, to understanding crowd dynamics and
marketing retail facilities. Building models of these things is a challenge: to be useful, the models need to be realistic. For geographers, this
often means that the models should put the right people in the right places and times, doing the right things, in the right company and context. This
is not always an easy task and many model-builders have relied upon mathematics and algorithms that are "good enough" proxies for geography, drawing
inspiration from parsimonious methods from physics, economics, and informatics to build their simulations. This often places the models at odds with
the theory that they purport to explore and with the reality on the ground that they promise to mimic. The discrepancies between traditional methods
for representing movement and our evolving understanding of the realities of movement have become more and more apparent as high-resolution data from
location-aware technologies have become more common.
My contention is that much more useful models and experimental schemes can be built in computer models if we seed them with realistic human,
behavioral, and urban geography, potentially with the result of expanding the range of questions that can actually be posed in simulation. This could
advance geographic information science and geocomputing in some important ways, but it could also forge new connections between quantitative and
computational geography and the more theoretical and practical interests that computational scientists and social scientists share.
I will demonstrate my research to develop a flexible and fundamentally realistic pipeline for simulating human movement, one that caters to the
biomechanics of the human body, the cognition that allows humans to acquire geographic information while moving through social and built
infrastructure, and the behavior that enables them to make use of that information to determine their actions and interactions at multiple scales of
space and time. The models are built on theoretical assumptions, but they are also "fed" with realistic data from location-aware technologies. I will
show the usefulness of this scheme for applications in ordinary and extraordinary pedestrian and crowd movement.
(Spring 2012, Seminar No. 6)
SPEAKER: Prof. Bimal Sinha, Dept. of Mathematics and
Statistics, UMBC
University of Maryland
Baltimore, MD USA
TITLE: Generalized P-values: Theory and Applications
TIME AND PLACE:
Thursday, March 15, 2012, 3:30pm
 
Room 1313, Math Bldg.
ABSTRACT:
During the last fifteen years or so, generalized P-values have become
quite useful in solving testing problems in many non-standard situations.
In this talk the notion of a generalized P-value will be explained and its
many applications will be presented. The application area will mostly
include linear models.
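One classical application is the Behrens-Fisher problem of comparing two normal means with unequal variances. A generalized p-value can be approximated by Monte Carlo from generalized pivotal quantities for the two means; the sketch below assumes numpy and is our illustration, not material from the talk.

```python
import numpy as np

def behrens_fisher_gpv(x, y, B=200_000, seed=0):
    """Monte Carlo generalized p-value for H0: mu_x = mu_y when the two
    normal samples may have unequal variances. Uses the generalized
    pivotal quantity mu_i ~ xbar_i - T_i * s_i / sqrt(n_i), where T_i is
    a Student-t draw on n_i - 1 degrees of freedom."""
    rng = np.random.default_rng(seed)
    x, y = np.asarray(x, float), np.asarray(y, float)
    nx, ny = len(x), len(y)
    mux = x.mean() - rng.standard_t(nx - 1, B) * x.std(ddof=1) / np.sqrt(nx)
    muy = y.mean() - rng.standard_t(ny - 1, B) * y.std(ddof=1) / np.sqrt(ny)
    d = mux - muy
    # two-sided p-value from the Monte Carlo distribution of the pivot
    return 2 * min((d <= 0).mean(), (d >= 0).mean())
```

Clearly separated samples yield a tiny p-value, while samples with equal means yield a large one.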
(Spring 2012, Seminar No. 7)
SPEAKER: Prof. Hegang H. Chen, Division of Biostatistics and Bioinformatics,
UMD School of Medicine
University of Maryland
Baltimore, MD USA
TITLE: Optimal Selection Criteria for Regular Fractional Factorial Designs
TIME AND PLACE:
Thursday, April 5, 2012, 3:30pm
 
Room 1313, Math Bldg.
ABSTRACT:
Fractional factorial designs have a long history of successful use in scientific investigations. Resolution (Box and Hunter (1961)) and its
refinement, minimum aberration (Fries and Hunter (1980)), are commonly used criteria for selecting regular fractional factorial designs. Both of
these criteria are based on wordlength patterns of the designs. Cheng, Steinberg and Sun (1999) showed that the minimum aberration criterion is a good
surrogate for some model-robustness criteria such as maximum estimation capacity. Recently, the concept of estimation index (Chen and Cheng (2004))
was proposed to help assess a fractional factorial design's capability to estimate factorial effects. The estimation index provides some insight into
when a design is capable of entertaining the largest number of lower-order effects. In this talk, the relationships among estimation index,
resolution, minimum aberration and estimation capacity will be discussed. In addition to deriving some general results of relationship among various
criteria, I will demonstrate how to combine information on wordlength pattern and estimation index to study the estimation capability of regular
fractional factorial designs. This talk is based on joint work with Prof. Ching-Shui Cheng.
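For illustration (not from the talk), the wordlength pattern of a regular two-level design can be computed directly from its defining words: encode factors as bitmasks, generate the defining contrast subgroup by XOR, and count word lengths; the resolution is the length of the shortest word.

```python
from itertools import combinations

def wordlength_pattern(generators, k):
    """Wordlength pattern (A_1, ..., A_k) of a regular two-level design,
    from its defining words given as bitmasks over the k factors."""
    words = set()
    for r in range(1, len(generators) + 1):
        for combo in combinations(generators, r):
            w = 0
            for g in combo:
                w ^= g          # products of defining words, mod squares
            words.add(w)
    lengths = [bin(w).count("1") for w in words]
    return [lengths.count(i) for i in range(1, k + 1)]

# hypothetical textbook example: 2^(5-2) design with I = ABD = ACE
A, B, C, D, E = (1 << i for i in range(5))
pattern = wordlength_pattern([A | B | D, A | C | E], 5)
resolution = min(i + 1 for i, a in enumerate(pattern) if a > 0)
```

Here the subgroup is {ABD, ACE, BCDE}, so the pattern is (0, 0, 2, 1, 0) and the design has resolution III.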
(Spring 2012, Seminar No. 8)
SPEAKER: Dr. Michail Sverchkov, Bureau of Labor Statistics
TITLE: On Modeling and Estimation of Response Probabilities when Missing
Data are Not Missing at Random
TIME AND PLACE:
Thursday, April 12, 2012, 3:30pm
 
Room 1313, Math Bldg.
ABSTRACT:
Most methods that deal with the estimation of response probabilities
assume, either explicitly or implicitly, that the missing data are missing
at random (MAR). However, in many practical situations this assumption is
not valid, since the probability to respond often depends on the outcome
value or on latent variables related to the outcome. The case where the
missing data are not MAR (NMAR) can be treated by postulating a parametric
model for the distribution of the outcomes under full response and a model
for the response probabilities. The two models define a parametric model
for the joint distribution of the outcome and the response indicator, and
therefore the parameters of this model can be estimated by maximizing the
likelihood corresponding to this distribution. Modeling the distribution
of the outcomes under full response, however, can be problematic since no
data are available from this distribution. Back in 2008 the speaker
proposed an approach that permits estimating the parameters of the model
for the response probabilities without modeling the distribution of the
outcomes under full response. The approach utilizes relationships between
the sample distribution and the sample-complement distribution derived by
Sverchkov and Pfeffermann in 2004. The present paper extends the above
approach.
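To see why not-at-random response matters, a small simulation (our illustration only, with the response probability assumed known) shows that the naive respondent mean is badly biased while inverse-probability weighting recovers the truth:

```python
import math
import random

random.seed(0)
N = 200_000
population = [random.gauss(0.0, 1.0) for _ in range(N)]  # true mean 0

def response_prob(y):
    # NMAR: the chance of responding depends on the outcome itself
    return 1.0 / (1.0 + math.exp(-y))

respondents = [y for y in population if random.random() < response_prob(y)]

naive_mean = sum(respondents) / len(respondents)  # biased upward
# inverse-probability weighting with the (here, known) response model
w = [1.0 / response_prob(y) for y in respondents]
weighted_mean = sum(wi * y for wi, y in zip(w, respondents)) / sum(w)
```

Large positive outcomes respond more often, so `naive_mean` comes out around 0.4 instead of 0, while `weighted_mean` is close to 0. The practical difficulty addressed in the talk is that the response model must be estimated, not assumed known.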
(Spring 2012, Seminar No. 9)
SPEAKER: Professor Yury Tyurin, Moscow State University,
MOSCOW
TITLE: Geometric Theory of Multivariate Analysis
TIME AND PLACE:
Thursday, April 19, 2012, 3:30pm
 
Room 1313, Math Bldg.
ABSTRACT:
In the talk a new geometric approach to multivariate analysis will be
presented. It will be illustrated on inference for linear models.
(Spring 2012, Seminar No. 10)
SPEAKER: Professor Jian-Jian Ren, University of Maryland - College Park
TITLE: BIVARIATE NONPARAMETRIC MAXIMUM LIKELIHOOD ESTIMATOR WITH RIGHT CENSORED DATA
TIME AND PLACE:
Thursday, April 26, 2012, 3:30pm
 
Room 1313, Math Bldg.
ABSTRACT:
In the analysis of survival data, we often encounter situations where the response variable (the survival time) T is subject to right censoring, but
the covariates Z are completely observable. To use the nonparametric approach (i.e., without imposing any model assumptions) in the study of the
relation between the right censored response variable T and the completely observable covariate variable Z, one natural thing to do is to estimate
the bivariate distribution function F_o(t, z) of (T, Z) based on the available bivariate data which are right censored in one coordinate, which we
call BD1RC data. In this article, we derive the bivariate nonparametric maximum likelihood estimator (BNPMLE) F_n(t, z) for F_o(t, z) based on the
BD1RC data, which has an explicit expression and is unique in the sense of empirical likelihood. Other nice features of F_n(t, z) include that it has
only nonnegative probability masses, and thus is monotone in the bivariate sense; these properties generally do not hold for most existing
distribution estimators with censored bivariate data. We show that under the BNPMLE F_n(t, z), the conditional distribution function (d.f.) of T
given Z has the same form as the Kaplan-Meier estimator in the univariate case, and that the marginal d.f. F_n(\infty, z) coincides with the
empirical d.f. of the covariate sample. We also show that when there is no censoring, F_n(t, z) coincides with the bivariate empirical distribution
function. For the case with discrete covariate Z, the strong consistency and weak convergence of F_n(t, z) are established. The extension of our
BNPMLE F_n(t, z) to the case of p-variate Z with p > 1 is straightforward. This work is joint with Tonya Riddlesworth.
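For the discrete-covariate case, the estimator described above can be sketched (illustratively, not the authors' code) as the empirical mixture over covariate values of within-stratum Kaplan-Meier distribution functions:

```python
def km_cdf(times, events, t):
    """Product-limit (Kaplan-Meier) estimate of P(T <= t) from
    right-censored data; events[i] is 1 for an observed failure."""
    surv = 1.0
    for u in sorted({tt for tt, e in zip(times, events) if e and tt <= t}):
        at_risk = sum(1 for tt in times if tt >= u)
        deaths = sum(1 for tt, e in zip(times, events) if e and tt == u)
        surv *= 1.0 - deaths / at_risk
    return 1.0 - surv

def bivariate_bnpmle(data, t, z):
    """Sketch of F_n(t, z) for discrete Z: mix the within-stratum
    Kaplan-Meier d.f. of T by the empirical distribution of Z."""
    n = len(data)  # data: list of (time, event, covariate) triples
    total = 0.0
    for v in {c for _, _, c in data}:
        if v <= z:
            sub = [(tt, e) for tt, e, c in data if c == v]
            total += (len(sub) / n) * km_cdf([tt for tt, _ in sub],
                                             [e for _, e in sub], t)
    return total
```

Consistent with the abstract, when no observation is censored this reduces exactly to the bivariate empirical distribution function.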
(Spring 2012, Seminar No. 11)
SPEAKER: Eric Slud (Census Bureau and UMCP) and Jiraphan Suntornchost (UMCP)
University of Maryland - College Park
TITLE: Parametric Survival Densities from Phase-Type Models
TIME AND PLACE:
Thursday, May 3, 2012, 3:30pm
 
Room 1313, Math Bldg.
ABSTRACT:
After a historical survey of parametric survival models,
from actuarial, biomedical, demographical and engineering sources, we
will discuss the persistent reasons why parametric models still play
an important role in exploratory statistical research. The phase-type
models are advanced as a flexible family of latent-class models with
interpretable components. These models are now supported by
computational statistical methods that make numerical calculation of
likelihoods and statistical estimation of parameters feasible in
theory for quite complicated settings. However, consideration of Fisher
Information and likelihood-ratio type tests (Kullback-Leibler
distances) to discriminate between model families indicates that only
the simplest phase-type model topologies can be stably estimated in
practice, even on rather large datasets. An example of a parametric
model with features of mixtures, multiple stages or `hits', and a
trapping-state is given to illustrate simple computational tools in R,
both on simulated data and on a large SEER 1992-2002 breast-cancer
dataset.
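A minimal numerical sketch of a phase-type survival density, using a hypothetical two-phase Coxian model and assuming numpy and scipy for the matrix exponential (the rates are ours, not from the SEER analysis):

```python
import numpy as np
from scipy.linalg import expm

# hypothetical two-phase Coxian model: start in phase 1, then either
# exit (rate 0.5) or move to phase 2 (rate 1.0), which exits at rate 0.8
alpha = np.array([1.0, 0.0])   # initial distribution over transient phases
T = np.array([[-1.5, 1.0],
              [0.0, -0.8]])    # sub-generator of the transient phases
t0 = -T @ np.ones(2)           # exit rates into the absorbing state

def phase_type_density(t):
    # f(t) = alpha * exp(T t) * t0
    return float(alpha @ expm(T * t) @ t0)

def phase_type_survival(t):
    # S(t) = alpha * exp(T t) * 1
    return float(alpha @ expm(T * t) @ np.ones(2))
```

The interpretable components the abstract mentions are visible here: phases act as latent stages ("hits"), and mixtures or trapping states are obtained by changing alpha and T.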