Venturini, Sergio; Dominici, Francesca; Parmigiani, Giovanni
An important question in health services research is the
estimation of the proportion of medical expenditures that exceed
a given threshold. Typically, medical expenditures exhibit
highly skewed, heavy-tailed distributions, for which (a) simple
variable transformations are insufficient to achieve a tractable
low-dimensional parametric form and (b) nonparametric methods
are not efficient in estimating exceedance probabilities for
large thresholds. Motivated by this context, in this paper we
propose a general Bayesian approach for the estimation of tail
probabilities of heavy-tailed distributions, based on a mixture
of gamma distributions in which the mixing occurs over the shape
parameter. This family provides a flexible and novel approach
to modeling heavy-tailed distributions; it is computationally
efficient,...
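Computing an exceedance probability under such a shape-mixture of gammas is straightforward; the sketch below uses hypothetical placeholder weights, shapes, and a shared scale (the Bayesian fitting itself is omitted):

```python
from scipy.stats import gamma

# Hypothetical fitted mixture over the shape parameter: the weights, shapes,
# and shared scale are illustrative placeholders, not values from the paper.
weights = [0.70, 0.25, 0.05]
shapes = [1.5, 4.0, 12.0]
scale = 1000.0  # e.g., dollars

def tail_prob(threshold):
    """P(X > threshold) under the mixture: weighted sum of the
    component gamma survival functions."""
    return sum(w * gamma.sf(threshold, a=k, scale=scale)
               for w, k in zip(weights, shapes))
```

A small-weight component with a large shape parameter is what lets the mixture place appreciable mass far in the right tail.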
Zhang, Xiaoxi; Johnson, Timothy D.; Little, Roderick J. A.; Cao, Yue
Quantitative Magnetic Resonance Imaging (qMRI) provides
researchers with insight into pathological and physiological
alterations of living tissue, with which they hope to predict
(local) therapeutic efficacy early and determine optimal
treatment schedules. However, the analysis of qMRI has
been limited to ad-hoc heuristic methods. Our research provides
a powerful statistical framework for image analysis and sheds
light on future localized adaptive treatment regimes tailored to
the individual’s response. We assume that, in an imperfect
world, we observe only a blurred and noisy version of the underlying
pathological/physiological changes via qMRI, due to measurement
errors or unpredictable influences. We use a hidden Markov
random field to model the spatial dependence in the...
Ferkingstad, Egil; Frigessi, Arnoldo; Rue, Håvard; Thorleifsson, Gudmar; Kong, Augustine
In an empirical Bayesian setting, we provide a new multiple
testing method, useful when an additional covariate is
available, that influences the probability of each null
hypothesis being true. We measure the posterior significance of
each test conditionally on the covariate and the data, leading
to greater power. Using covariate-based prior information in an
unsupervised fashion, we produce a list of significant
hypotheses which differs in length and order from the list
obtained by methods that do not take covariate information into
account. Covariate-modulated posterior probabilities of each
null hypothesis are estimated using a fast approximate
algorithm. The new method is applied to expression quantitative
trait loci (eQTL) data.
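As a toy illustration of covariate-modulated posterior null probabilities, here is a two-groups sketch in which the prior probability of the null depends on a covariate through a hypothetical logistic curve; the single assumed alternative density and all parameter values are placeholders, far simpler than the fast approximate algorithm of the paper:

```python
import numpy as np
from scipy.stats import norm

def pi0(x, a=0.0, b=1.0):
    """Prior P(H0 true | covariate x) via a hypothetical logistic curve."""
    return 1.0 / (1.0 + np.exp(-(a + b * x)))

def posterior_null(z, x, mu_alt=2.0):
    """Posterior probability that H0 is true, given z-score and covariate.

    Two-groups model: nulls are N(0, 1), alternatives N(mu_alt, 1)
    (an assumed, illustrative alternative density)."""
    f0 = norm.pdf(z)              # null density
    f1 = norm.pdf(z, loc=mu_alt)  # assumed alternative density
    p0 = pi0(x)
    return p0 * f0 / (p0 * f0 + (1 - p0) * f1)
```

Two hypotheses with the same z-score can then receive different posterior significance when their covariate values differ, which is what reorders the list of findings.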
Scharpf, Robert B.; Parmigiani, Giovanni; Pevsner, Jonathan; Ruczinski, Ingo
Chromosomal DNA is characterized by variation between individuals
at the level of entire chromosomes (e.g., aneuploidy in which
the chromosome copy number is altered), segmental changes
(including insertions, deletions, inversions, and
translocations), and changes to small genomic regions (including
single nucleotide polymorphisms). A variety of alterations that
occur in chromosomal DNA, many of which can be detected using
high-density single nucleotide polymorphism (SNP) microarrays,
are linked to normal variation as well as disease and are
therefore of particular interest. These include changes in copy
number (deletions and duplications) and genotype (e.g., the
occurrence of regions of homozygosity). Hidden Markov models
(HMM) are particularly useful for detecting such alterations,
modeling the spatial dependence...
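A minimal Viterbi-decoding sketch for such copy-number HMMs follows; the three states, Gaussian emission means for the log R ratios, and transition probabilities are made-up illustrative values, not the emission and transition models actually used for SNP arrays:

```python
import numpy as np

# Illustrative three-state copy-number HMM on log R ratios.
states = ["deletion", "normal", "duplication"]
means = np.array([-0.6, 0.0, 0.4])   # assumed emission means per state
sd = 0.2                             # assumed common emission s.d.
log_trans = np.log(np.array([
    [0.90, 0.09, 0.01],
    [0.005, 0.99, 0.005],
    [0.01, 0.09, 0.90],
]))
log_init = np.log(np.array([0.005, 0.99, 0.005]))

def log_emit(y):
    """Log Gaussian emission density of observation y under each state."""
    return -0.5 * ((y - means) / sd) ** 2 - np.log(sd * np.sqrt(2 * np.pi))

def viterbi(obs):
    """Most probable state path for a sequence of log R ratios."""
    V = log_init + log_emit(obs[0])
    back = []
    for y in obs[1:]:
        scores = V[:, None] + log_trans   # scores[i, j]: best path ending i -> j
        back.append(scores.argmax(axis=0))
        V = scores.max(axis=0) + log_emit(y)
    path = [int(V.argmax())]
    for b in reversed(back):
        path.append(int(b[path[-1]]))
    return [states[s] for s in reversed(path)]
```

The sticky diagonal of the transition matrix is what encodes the spatial dependence along the chromosome: isolated noisy markers are smoothed over, while a run of low log R ratios is decoded as a deletion.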
Yu, Qingzhao; Stasny, Elizabeth A.; Li, Bin
It is difficult to accurately estimate the rates of rape and
domestic violence due to the sensitive nature of these crimes.
There is evidence that bias in estimating the crime rates from
survey data may arise because some women respondents are
“gagged” in reporting some types of crimes by the use of a
telephone rather than a personal interview, and by the presence
of a spouse during the interview. On the other hand, since data
on these crimes are collected every year, the analysis would be
more efficient if we could identify and make use of information
from previous years’ data. In this paper we propose a model
to...
Höfling, Holger; Tibshirani, Robert
Given a predictor of outcome derived from a high-dimensional
dataset, pre-validation is a useful technique for comparing it
to competing predictors on the same dataset. For microarray
data, it allows one to compare a newly derived predictor for
disease outcome to standard clinical predictors on the same
dataset. We study pre-validation analytically to determine if
the inferences drawn from it are valid. We show that while
pre-validation generally works well, the straightforward “one
degree of freedom” analytical test from pre-validation can be
biased and we propose a permutation test to remedy this problem.
In simulation studies, we show that the permutation test has the
nominal level and achieves roughly the same...
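The permutation remedy can be illustrated generically. The sketch below is a plain permutation test of association between a predictor and an outcome (function name hypothetical); the paper's actual procedure permutes within the pre-validation construction itself, which this simplification does not capture:

```python
import numpy as np

rng = np.random.default_rng(0)

def perm_pvalue(pred, outcome, n_perm=999):
    """Permutation p-value for association between a (e.g., pre-validated)
    predictor and an outcome, using |correlation| as the test statistic."""
    obs = abs(np.corrcoef(pred, outcome)[0, 1])
    hits = 0
    for _ in range(n_perm):
        perm = rng.permutation(outcome)
        if abs(np.corrcoef(pred, perm)[0, 1]) >= obs:
            hits += 1
    # add-one correction keeps the p-value strictly positive
    return (hits + 1) / (n_perm + 1)
```

Because the null distribution is generated from the data themselves, the test attains its nominal level even when the analytical "one degree of freedom" reference distribution would be biased.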
Buishand, T. A.; de Haan, L.; Zhou, C.
We consider daily rainfall observations at 32 stations in the
province of North Holland (the Netherlands) during 30 years. Let
T be the total rainfall in this area on one
day. An important question is: what is the amount of rainfall
T that is exceeded once in 100 years? This is
clearly a problem belonging to extreme value theory. Also, it is
a genuinely spatial problem.

Recently, a theory of extremes of continuous stochastic processes
has been developed. Using the ideas of that theory and much
computer power (simulations), we have been able to come up with
a reasonable answer to the question above.
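For the univariate version of the question, the 100-year return level is simply a high quantile of a generalized extreme value (GEV) distribution fitted to annual maxima. The sketch below uses simulated data in place of the rainfall records and ignores the genuinely spatial aspect that the paper addresses:

```python
import numpy as np
from scipy.stats import genextreme

# Simulated stand-in for 30 years of annual maximum daily rainfall (mm);
# the GEV parameters used for simulation are purely illustrative.
annual_max = genextreme.rvs(c=-0.1, loc=30.0, scale=8.0,
                            size=30, random_state=42)

# Fit a GEV to the annual maxima; the T-year return level is the
# quantile exceeded on average once in T years.
c, loc, scale = genextreme.fit(annual_max)
level_100 = genextreme.ppf(1 - 1 / 100, c, loc=loc, scale=scale)
```

With only 30 years of data, the 100-year level is an extrapolation beyond the observed range, which is precisely why extreme value theory, rather than the empirical distribution, is needed.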
Shen, Haipeng; Huang, Jianhua Z.
We consider forecasting the latent rate profiles of a time series
of inhomogeneous Poisson processes. The work is motivated by
operations management of queueing systems, in particular,
telephone call centers, where accurate forecasting of call
arrival rates is a crucial primitive for efficient staffing of
such centers. Our forecasting approach utilizes dimension
reduction through a factor analysis of Poisson variables,
followed by time series modeling of factor score series. Time
series forecasts of factor scores are combined with factor
loadings to yield forecasts of future Poisson rate profiles.
Penalized Poisson regressions on factor loadings guided by time
series forecasts of factor scores are used to generate dynamic
within-process rate updating. Methods are...
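The dimension-reduction-then-forecast pipeline can be sketched as follows; a variance-stabilizing square-root transform and SVD stand in for the factor analysis of Poisson variables, and a naive carry-forward of recent factor scores stands in for the paper's time-series models and penalized Poisson regressions:

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated days x intervals matrix of call counts with a smooth daily profile.
days, intervals = 60, 24
base = 50 + 30 * np.sin(np.linspace(0, np.pi, intervals))
counts = rng.poisson(base, size=(days, intervals))

X = np.sqrt(counts)  # square root roughly stabilizes Poisson variance
U, s, Vt = np.linalg.svd(X - X.mean(axis=0), full_matrices=False)
scores = U[:, :2] * s[:2]   # factor scores, one pair per day
loadings = Vt[:2]           # factor loadings over intervals

# Naive forecast: carry forward the mean of the last week's factor scores
# (a placeholder for a proper time-series model of the score series).
next_scores = scores[-7:].mean(axis=0)
rate_forecast = (X.mean(axis=0) + next_scores @ loadings) ** 2
```

Squaring the forecast on the root scale returns it to the rate scale, yielding a full rate profile for the next day from just two factor-score forecasts.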
Zheng, Lu; Zelen, Marvin
The purpose of this paper is to investigate and develop methods
for the analysis of multi-center randomized clinical trials that
rely only on the randomization process as the basis of inference.
Our motivation is prompted by the fact that most current
statistical procedures used in the analysis of randomized
multi-center studies are model-based. The randomization feature
of the trials is usually ignored. An important characteristic of
model-based analysis is that it is straightforward to model
covariates. Nevertheless, in nearly all model-based analyses,
the effects due to different centers and, in general, the design
of the clinical trials are ignored. An alternative to a
model-based analysis is to have...
Stark, Philip B.
There are many sources of error in counting votes: the apparent
winner might not be the rightful winner. Hand tallies of the
votes in a random sample of precincts can be used to test the
hypothesis that a full manual recount would find a different
outcome. This paper develops a conservative sequential test
based on the vote-counting errors found in a hand tally of a
simple or stratified random sample of precincts. The procedure
includes a natural escalation: If the hypothesis that the
apparent outcome is incorrect is not rejected at stage s,
more precincts are audited. Eventually, either the hypothesis is
rejected—and the apparent outcome is confirmed—or all precincts
have...
Gelman, Andrew; Cai, Cexun Jeffrey
Could John Kerry have gained votes in the 2004 Presidential
election by more clearly distinguishing himself from George Bush
on economic policy? At first thought, the logic of political
preferences would suggest not: the Republicans are to the right
of most Americans on economic policy, and so in a
one-dimensional space with party positions measured with no
error, the optimal strategy for the Democrats would be to stand
infinitesimally to the left of the Republicans. The median voter
theorem suggests that each party should keep its policy
positions just barely distinguishable from the opposition.

In a multidimensional setting, however, or when voters vary in
their perceptions of the parties’ positions,...
Kou, S. C.
Advances in nanotechnology have allowed scientists to study
biological processes on an unprecedented nanoscale
molecule-by-molecule basis, opening the door to addressing many
important biological problems. A phenomenon observed in recent
nanoscale single-molecule biophysics experiments is
subdiffusion, which departs markedly from the classical Brownian
diffusion theory. In this paper, by incorporating fractional
Gaussian noise into the generalized Langevin equation, we
formulate a model to describe subdiffusion. We conduct a
detailed analysis of the model, including (i) a spectral
analysis of the stochastic integro-differential equations
introduced in the model and (ii) a microscopic derivation of the
model from a system of interacting particles. In addition to its
analytical tractability and clear physical underpinning, the
model is...
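Subdiffusion under fractional Gaussian noise can be verified numerically: fractional Brownian motion with Hurst index H < 1/2 has mean-squared displacement E[B_H(t)^2] = t^(2H), growing sublinearly in t. The sketch below simulates paths from the exact fBm covariance (it is a simulation check of the subdiffusive scaling only, not the paper's spectral analysis of the generalized Langevin equation):

```python
import numpy as np

# Fractional Brownian motion with H < 1/2: subdiffusive scaling t^(2H).
H, n = 0.3, 200
t = np.arange(1, n + 1)

# Exact fBm covariance: Cov(B_H(s), B_H(t)) = (s^2H + t^2H - |t - s|^2H) / 2.
cov = 0.5 * (t[:, None] ** (2 * H) + t[None, :] ** (2 * H)
             - np.abs(t[:, None] - t[None, :]) ** (2 * H))
L = np.linalg.cholesky(cov + 1e-10 * np.eye(n))  # jitter for stability

rng = np.random.default_rng(0)
paths = L @ rng.standard_normal((n, 500))  # 500 exact sample paths
msd = (paths ** 2).mean(axis=1)            # empirical E[B_H(t)^2]
```

A log-log regression of the empirical mean-squared displacement against t recovers a slope near 2H = 0.6, i.e., clearly below the slope of 1 that classical Brownian diffusion would give.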
Lee, Ann B.; Nadler, Boaz; Wasserman, Larry
Tuglus, Catherine; van der Laan, Mark J.
We would like to congratulate Lee, Nadler and Wasserman on their
contribution to clustering and data reduction methods for high
p and low n situations. A composite of
clustering and traditional principal components analysis,
treelets is an innovative method for multi-resolution analysis
of unordered data. It is an improvement over traditional PCA and
an important contribution to clustering methodology. Their paper
presents theory and supporting applications addressing the two
main goals of the treelet method: (1) uncovering the underlying
structure of the data and (2) reducing the data prior to applying
statistical learning methods. We will organize our discussion
into two main parts to address their methodology in terms of
each of these two...
Qiu, Xing
This is a discussion of the paper “Treelets—An adaptive multi-scale
basis for sparse unordered data” by Ann B. Lee, Boaz Nadler and
Larry Wasserman. In this paper the authors defined a new type of
dimension reduction algorithm, namely, the treelet algorithm.
The treelet method has the merit of being completely data
driven, and its decomposition is easier to interpret compared
with PCA. It is suitable in certain situations, but it also
has its own limitations. I will discuss both the strengths and
weaknesses of this method when applied to microarray data
analysis.
Tibshirani, Robert
Meinshausen, Nicolai; Bühlmann, Peter
We congratulate Lee, Nadler and Wasserman (henceforth LNW) on a
very interesting paper on new methodology and supporting theory.
Treelets seem to tackle two important problems of modern data
analysis at once. For datasets with many variables, treelets
give powerful predictions even if variables are highly
correlated and redundant. Maybe more importantly, interpretation
of the results is intuitive. Useful insights about relevant
groups of variables can be gained.

Our comments and questions include: (i) Could the success of
treelets be replicated by a combination of hierarchical
clustering and PCA? (ii) When choosing a suitable basis,
treelets seem to be largely an unsupervised method. Could the
results be even more interpretable and...
Bickel, Peter J.; Ritov, Ya’acov
Murtagh, Fionn
Lee, Ann B.; Nadler, Boaz; Wasserman, Larry
In many modern applications, including analysis of gene expression
and text documents, the data are noisy, high-dimensional, and
unordered—with no particular meaning to the given order of the
variables. Yet, successful learning is often possible due to
sparsity: the fact that the data are typically redundant with
underlying structures that can be represented by only a few
features. In this paper we present treelets—a novel
construction of multi-scale bases that extends wavelets to
nonsmooth signals. The method is fully adaptive, as it returns a
hierarchical tree and an orthonormal basis which both
reflect the internal structure of the data. Treelets are
especially well-suited as a dimensionality reduction and feature
selection tool prior...
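One level of a treelet-style construction can be sketched as a local PCA on the most correlated pair of variables: a 2x2 Jacobi rotation replaces the pair with a coarse "sum" component and a detail "difference" component. This single merge step (function name hypothetical) is a simplification of the authors' full multi-scale algorithm, which iterates merges up a hierarchical tree:

```python
import numpy as np

def treelet_step(X):
    """One treelet-style merge: find the most correlated pair of columns
    of X and apply the 2x2 PCA (Jacobi) rotation that decorrelates them.

    Returns the rotated data and the merged pair of indices."""
    C = np.cov(X, rowvar=False)
    R = np.corrcoef(X, rowvar=False)
    np.fill_diagonal(R, 0)                        # ignore self-correlation
    i, j = np.unravel_index(np.abs(R).argmax(), R.shape)
    # Rotation angle that zeroes the (i, j) sample covariance.
    theta = 0.5 * np.arctan2(2 * C[i, j], C[i, i] - C[j, j])
    c, s = np.cos(theta), np.sin(theta)
    Y = X.copy()
    Y[:, i] = c * X[:, i] + s * X[:, j]    # coarse "sum" component
    Y[:, j] = -s * X[:, i] + c * X[:, j]   # detail "difference" component
    return Y, (i, j)
```

Repeating this step, always keeping the coarse components available for further merges, yields the hierarchical tree and the orthonormal multi-scale basis described in the abstract.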