Department of Statistical Sciences
https://hdl.handle.net/2104/4761

Topic on the statistical analysis of high-dimensional data.
https://hdl.handle.net/2104/10685
Issued: 2019-04-15
High-dimensional genomic data can provide deep insight into biological processes. However, conventional statistical methods typically cannot be applied directly to genomic data sets because the number of markers commonly exceeds the sample size, rendering the sample covariance matrix singular. Here, we examine three scenarios involving high-dimensional genomic data: reordering of principal components of multi-class data based on alternative criteria, comparing tests for two population means on high-dimensional data, and correcting for systematic batch effects in microarray data. All three investigations overcome issues of dimensionality and use principal components for dimension reduction, visualization, or statistical analysis. First, we use alternatively ordered principal components to produce low-dimensional models for visualization; second, we compare five high-dimensional tests of two means and describe a principal-component alternative to Hotelling's T2 test; and finally, we utilize principal-component reduction of microarray data to visualize existing batch effects between cohorts. Overall, we explore solutions to the analysis of high-dimensional genomic data through the use of principal components analysis or other adaptations to reach the desired analytic objectives.
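The singularity problem the abstract describes, and the standard way around it, can be illustrated with a minimal sketch (synthetic data, not the dissertation's data): when the number of markers p exceeds the sample size n, the p-by-p sample covariance matrix has rank at most n - 1, but the principal components are still computable directly from the SVD of the centered data matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 2000          # n samples, p markers: p >> n, so the p x p
X = rng.normal(size=(n, p))  # sample covariance matrix is singular

# Center and take the thin SVD; at most n - 1 PCs have nonzero variance
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
scores = U * s                     # n x n matrix of PC scores
explained = s**2 / np.sum(s**2)    # proportion of variance per PC

# Conventional ordering: singular values (hence PC variances) descend;
# the dissertation's first project reorders these by alternative criteria
assert np.all(np.diff(s) <= 0)
```

This avoids ever forming the singular p-by-p covariance matrix explicitly.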
A power contrast of tests for homogeneity of covariance matrices in a high-dimensional setting.
https://hdl.handle.net/2104/10516
Issued: 2018-10-31
Multivariate statistical analyses, such as linear discriminant analysis, MANOVA, and profile analysis, have a covariance-matrix homogeneity assumption. Until recently, homogeneity testing of covariance matrices was limited to the well-posed problem, where the number of observations is much larger than the data dimension. Linear dimension reduction has many applications in classification and regression but has been used very little in hypothesis testing for equal covariance matrices. In this manuscript, we first contrast the powers of five current tests for homogeneity of covariance matrices under a high-dimensional setting for two population covariance matrices using Monte Carlo simulations. We then derive a linear dimension reduction method specifically constructed for testing homogeneity of high-dimensional covariance matrices. We also explore the effect of our proposed linear dimension reduction for two or more covariance matrices on the power of four tests for homogeneity of covariance matrices under a high-dimensional setting for two- and three-population covariance matrices. We determine that our proposed linear dimension reduction method, when applied to the original data before using an appropriate test, can yield a substantial increase in power.
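The reduce-then-test strategy can be sketched as follows. This is a simplified stand-in, not the dissertation's derived reduction: here the two samples are projected onto the leading right singular vectors of the pooled centered data (a PCA projection), and Box's M test, which requires nonsingular group covariance matrices, is applied to the reduced k-dimensional data.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n1, n2, p, k = 40, 40, 200, 5          # p >> n per group; reduce to k dims
X1 = rng.normal(size=(n1, p))
X2 = rng.normal(size=(n2, p)) * 1.5    # group 2 covariance inflated

# Linear dimension reduction: project onto the k leading right singular
# vectors of the pooled, centered data (PCA; a stand-in for the
# manuscript's tailored reduction)
X = np.vstack([X1, X2])
Xc = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
Y1, Y2 = X1 @ Vt[:k].T, X2 @ Vt[:k].T

# Box's M test on the reduced data (now well-posed: k << n1, n2)
S1 = np.cov(Y1, rowvar=False)
S2 = np.cov(Y2, rowvar=False)
Sp = ((n1 - 1) * S1 + (n2 - 1) * S2) / (n1 + n2 - 2)
M = ((n1 + n2 - 2) * np.log(np.linalg.det(Sp))
     - (n1 - 1) * np.log(np.linalg.det(S1))
     - (n2 - 1) * np.log(np.linalg.det(S2)))

# Chi-square approximation with Box's small-sample correction (2 groups)
c = ((2 * k**2 + 3 * k - 1) / (6 * (k + 1))) * (
    1 / (n1 - 1) + 1 / (n2 - 1) - 1 / (n1 + n2 - 2))
df = k * (k + 1) / 2
pval = stats.chi2.sf(M * (1 - c), df)
```

Box's M is known to be sensitive to non-normality; the four tests compared in the manuscript are better suited to the high-dimensional setting, but the projection step is the same idea.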
Bayesian approach to partially validated binary regression with response and exposure misclassification.
https://hdl.handle.net/2104/10454
Issued: 2018-06-09
Misclassification of epidemiological and observational data is a problem that commonly arises and can have adverse ramifications on the validity of results if not properly handled. Considerable research has been conducted for the case when only the response or only the exposure is misclassified, while less work has been done on the simultaneous case. We extend previous frequentist work by investigating a Bayesian approach to dependent, differential misclassification models. Using a logit model with misclassified binary response and exposure variables and assuming a validation subsample is available, we compare the resulting confidence and credible intervals under the two paradigms. We compare the results under varying validation-subsample percentages of the overall sample size: 100% (ideal scenario), 25%, 15%, 10%, 5%, 2.5%, and 0% (naive scenario). We extend this work further by examining scenarios for which the assumptions may falter; we assume independent, differential misclassification, increase the overall sample size, and vary the influence of our priors from diffuse to concentrated.
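The attenuation caused by ignoring misclassification can be seen in a much-simplified frequentist sketch: response misclassification only, with sensitivity and specificity treated as known (as if estimated from a 100% validation subsample). The dissertation's Bayesian model with both response and exposure misclassified is considerably richer; this only illustrates the naive-versus-corrected contrast.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import expit

rng = np.random.default_rng(2)
n = 2000
x = rng.normal(size=n)
b0, b1 = -0.5, 1.0
y = rng.binomial(1, expit(b0 + b1 * x))     # true (unobserved) response
se, sp = 0.9, 0.85                          # sensitivity, specificity
ystar = np.where(y == 1,                    # observed, misclassified response
                 rng.binomial(1, se, n),
                 rng.binomial(1, 1 - sp, n))

def corrected_nll(beta):
    # Observed-response likelihood: P(Y*=1|x) = se*p + (1-sp)*(1-p)
    p = expit(beta[0] + beta[1] * x)
    q = se * p + (1 - sp) * (1 - p)
    return -np.sum(ystar * np.log(q) + (1 - ystar) * np.log(1 - q))

def naive_nll(beta):
    # Pretends ystar is the true response
    p = expit(beta[0] + beta[1] * x)
    return -np.sum(ystar * np.log(p) + (1 - ystar) * np.log(1 - p))

corrected = minimize(corrected_nll, x0=[0.0, 0.0]).x
naive = minimize(naive_nll, x0=[0.0, 0.0]).x
# The naive slope is attenuated toward zero; the corrected slope
# recovers something near the true b1 = 1
```

With only a partial validation subsample, se and sp become unknown parameters informed by the validated records, which is where the Bayesian machinery pays off.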
Weibull mixture model for grouped data and pattern identification in spatial and spatial-temporal data.
https://hdl.handle.net/2104/10453
Issued: 2018-06-22
Motivated by problems in geology, civil engineering, and material science, we develop statistical models and apply statistical tools for characterizing patterns that exist in different types of data. In the first project, we propose the Weibull mixture model to fit the distribution of grain size in continental sediments in geological studies. We use an EM algorithm to fit the model and a bootstrap likelihood ratio test (LRT) to compare different models. The performance of the bootstrap LRT is studied via simulation and is compared with the model selection criteria AIC and BIC. Situations where the bootstrap LRT is preferred in model selection are discussed. In the second project, we develop robust models for detecting outliers in a large spatial-temporal dataset that contains daily ground deformation values in a New York City tunnel excavation project. Systematic outliers and random outliers are defined and identified using robust spatial kriging models and robust time series models. The residuals from these models are pooled to construct outlier bounds using a moving window technique. The observations whose residuals fall outside of the outlier bounds are flagged as outliers. Artificial outliers are generated and added to the ground deformation data to study the accuracy and stability of the proposed techniques in outlier detection. In the third project, we apply a set of spatial statistical tools for characterizing materials at the atomic level, with a multivariate three-dimensional atom probe tomography dataset as an example. We use Moran's I statistic to study the global spatial structure of each atomic feature and apply the local indicator of spatial association (LISA) to study the local spatial regions where high or low values of a given atomic feature exist. LISA is also used for detecting spatial outliers. We then use the local indicator of spatial cross-correlation (LISC) to find where in space high or low levels of two atomic features occur simultaneously. For LISA and LISC, we conduct a conditional permutation test for each location and then compare methods to handle the multiple testing issues. We also discuss the effect of different weight functions and neighborhood selection on the significance of the statistical tests.
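The global version of the machinery in the third project can be sketched with toy data. This is a minimal illustration of Moran's I with a permutation null on a one-dimensional lattice, assuming binary adjacency weights; the dissertation's conditional, per-location tests for LISA/LISC and its 3-D atom-probe neighborhoods are more involved.

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy 1-D lattice of sites carrying a spatially autocorrelated feature
n = 100
z = np.cumsum(rng.normal(size=n))        # random walk: strong spatial structure

# Binary contiguity weight matrix: each site's neighbors are adjacent sites
W = np.zeros((n, n))
i = np.arange(n - 1)
W[i, i + 1] = W[i + 1, i] = 1

def morans_i(x, W):
    # I = (n / sum(W)) * sum_ij W_ij (x_i - xbar)(x_j - xbar) / sum_i (x_i - xbar)^2
    xc = x - x.mean()
    return len(x) / W.sum() * (xc @ W @ xc) / (xc @ xc)

I_obs = morans_i(z, W)

# Permutation test: shuffling values over locations destroys the spatial
# structure, giving a reference distribution under no autocorrelation
perms = np.array([morans_i(rng.permutation(z), W) for _ in range(999)])
pval = (1 + np.sum(perms >= I_obs)) / (999 + 1)
```

Changing `W` (row-standardization, larger neighborhoods, distance decay) changes both I and its null distribution, which is the weight-function sensitivity the abstract refers to.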