# Theses/Dissertations - Statistical Sciences

Permanent URI for this collection: https://hdl.handle.net/2104/4798

## Browse

### Recent Submissions

Computational tools for data visualization, Bayesian Deming regression, and software documentation. (2023-08) Otto, James M., 1997-; Kahle, David J.

This dissertation consists of three chapters that, while unified under the broad context of the development of “computational tools”, are generally distinct. In the first chapter, we propose the ggdensity package for visualizing “highest density regions” in R by extending the ggplot2 package for data visualization. In the second chapter, we develop Bayesian models for hierarchical Deming regression data, providing an implementation with the BayesDeming R package. We also apply the proposed models to example data, with an application to a chemistry, manufacturing, and controls project at a major US pharmaceutical company. Finally, we propose the tldr package, providing a system of “example-forward” documentation in the R console. Included in the discussion is the accompanying tldrDocs package, providing documentation for popular base R functions.

Bayesian dynamic borrowing strategies with power priors and quantifying prior information for circular priors. (2023-08) Xie, Chang, 1994-; Seaman, John Weldon, 1956-

This dissertation concerns two problems in Bayesian prior construction. One is the development of a dynamic historical borrowing strategy tailored for settings with small current or historical data samples. The second concerns strategies for the assessment of circular priors. Chapter two introduces a novel dynamic borrowing method that can be applied in both clinical and non-clinical settings. Recent approaches such as that in Thompson et al. (2021) do not accommodate situations with limited current sample sizes. Our approach integrates data amplification techniques specifically for small data sets. We assess the effectiveness of this method through simulation. In Chapter three, we explore the bioassay validation process, which necessitates borrowing from previous studies.
We apply the method introduced in Chapter two to a case study for bioassay validation. The outcomes are then contrasted with results derived from another dynamic borrowing method we present in this chapter. Chapter four shifts focus to the quantification of information contained in a circular prior. Circular data are measurements on a unit circle, with the von Mises model being the most widely used model for such data. We gauge the information contained in priors for the von Mises data model using prior equivalent sample size.

Statistical methods for outcome misclassification adjustment in causal inference and spatial classification. (December 2022) Ma, Yuhan, 1992-; Song, Joon Jin.

Outcome misclassification occurs when a categorical variable is incorrectly assigned to a group due to an imperfect diagnostic device. The failure to account for outcome misclassification results in biased estimation and inaccurate predictions. This dissertation focuses on correcting outcome misclassification in causal inference and spatial classification. In the first work, we describe a Bayesian propensity score analysis to estimate the causal effect in observational studies with misclassified multinomial outcomes. To adjust for the effect of misclassification, informative Dirichlet priors are specified based on previous studies. Taking misclassification into account significantly reduces bias and yields coverage probabilities closer to a given nominal value. In spatial classification, we propose validation data-based adjustment methods using interval validation data. Regression calibration (RC) and multiple imputation (MI) are utilized to correct misclassified outcomes where a gold-standard device is not available. Spatial generalized linear mixed model (SGLMM) and indicator kriging (IK) are applied to spatial classification at unsampled locations. From a Bayesian perspective, we propose two-stage methods incorporating validation data into prior elicitation.
The proposed frequentist and Bayesian methods significantly improve the spatial classification accuracy.

Bayesian method of predicting in-game win probability across sports. (December 2022) Maddox, Jason, 1995-; Harvill, Jane L.

In this dissertation, we create different in-game win probability models for several sports using a Bayesian methodology. In the first chapter, we create a college basketball model using score differential and time and compare the model to other models found in the literature. In the second chapter, we extend the model from the first chapter into the NBA. In doing so, we also make adjustments to aid in the performance of the model. In the third chapter, we create a college football win probability model, accounting for many more factors than the score differential and time.

Comparing predictive accuracy of multiple forecasts. (August 2022) Best, William John, 1997-; Harvill, Jane L.

In this dissertation, we consider different statistical tests of equal predictive accuracy (EPA). In the first chapter, we present background on key time series models and results that are foundational to the existing tests for EPA, as well as to the new methods proposed within. In Chapter two, we extend the approach of Hering and Genton (2012) from comparing two forecasts of univariate time series to comparing more than two competing forecasts for vector time series. In Chapter three, we provide a bias modification to a nonparametric estimator of the asymptotic variance of the loss differential series, and use it in a modified test statistic. The bias correction results in a different null distribution, the Hotelling T² distribution, for finding critical values and p-values. We also consider a bootstrap approach for estimating p-values. Finally, in Chapter four we consider a parametric approach. In particular, the loss differential series is modeled using a stationary linear process.
The estimated coefficients are used for computing a parametric estimate of the variance of the loss differential series, and that estimator is then applied to the test statistics. Because of the relatively good performance of the methods in Chapter four compared with the methods of the previous chapter, the tests in Chapter four were applied to a series of residuals from predicting the paths of satellites orbiting the Earth. The data were provided by NASA. For each of the newly proposed methods in Chapters two through four, extensive Monte Carlo simulations are conducted for investigating the probabilistic properties of the tests, and each simulation study is accompanied by a discussion. Of all methods considered in the dissertation, the methods of Chapter four were better overall in terms of empirical size being close to nominal and having higher power in most of the cases.

Applications of functional data analysis to environmental problems. (August 2022) Durell, Luke, 1995-; Hering, Amanda S.

Functional Data Analysis (FDA) is a relatively recent framework within the statistical sciences, and while it offers compelling benefits to many applications, it has not yet gained widespread applied use. Two important environmental applications, water quality profile forecasting and larval fish photolocomotor response studies, measure functional data and stand to profit from employing FDA. In this work, we present the first application of FDA to these two problems in the environmental and biological sciences. Specifically, this dissertation analyzes the most temporally and vertically dense dissolved oxygen lake profiles in the water quality forecasting literature. This is the first work to introduce full function forecasting with exogenous variables, various machine learning approaches, and empirical prediction band construction in the context of functional principal component machine learning hybrid models.
Additionally, this research introduces both a new permutation test for two-way functional ANOVA and the first simulation study comparing four global F-based statistics in a two-way functional ANOVA setting.

Bayesian approaches for survival data in pharmaceutical research. (2020-09-15) Prajapati, Purvi Kishor, 1992-; Stamey, James D.; Seaman, John Weldon, 1984-

In this research, we consider Bayesian methodologies to address problems in biopharmaceutical research, most of which are motivated by real-world problems in network meta-analysis, prior elicitation, and adaptive designs. Network meta-analysis is a hierarchical model used to combine the results of multiple studies, and allows us to make direct and indirect comparisons between treatments. We investigate Bayesian network meta-analysis models for survival data based on modeling the log-hazard rates, as opposed to hazard ratios. Expert opinion is often needed to construct priors for time-to-event data, especially in pediatric and oncology studies. For this, we propose a prior elicitation method for the Weibull time-to-event distribution that is based on potentially observable time-to-event summaries which can be transformed to obtain a joint prior distribution for the Weibull parameters. Bayesian adaptive designs take advantage of accumulating information by allowing key trial parameters to change in response to accruing information and predefined rules. We introduce a novel model-based Bayesian assessment of reading speed that uses an adaptive algorithm to target key reading metrics. These metrics are used in the assessment of reading speed in individuals with poor vision.

A beta regression approach to nonparametric longitudinal data classification in clinical trials. (2022-04-07) Hernandez, Roberto Sergio, 1995-; Tubbs, Jack Dale.

Classification is an important topic in statistical analysis.
For example, in applications involving clinical trials, a common objective is to determine whether or not novel medicines and treatments differ from existing standards of care. There are numerous methods and approaches in the literature for this problem when the endpoint of interest is normally distributed or can be approximated by an asymptotic normal distribution, yet the approaches available for a non-normally distributed endpoint are limited. This is especially true when these endpoints are correlated across time. In this dissertation, we investigated several techniques for use with longitudinal, repeated measures data where there is a special interest in adapting some recent results found in the literature on beta regression. The proposed methods provided a model that is nonparametric with regard to the design endpoint and that can be used in the repeated measures problem.

Covariate-adjusted ROC regressions and the extensions in trend tests. (2022-04-21) Meng, Xing, 1988-; Tubbs, Jack Dale.

In 2008 the faculty and doctoral graduate students within the Department of Statistical Sciences at Baylor University met with statisticians from Eli Lilly and Company to discuss ongoing long-term problems with the possibility that the department would begin collaborative work with the Lilly statisticians. One such problem led to several doctoral dissertations over the subsequent 8-10 years. The initial approach made use of the diagnostic ability of the area under the receiver operating characteristic (ROC) curve (AUC) as a binary classifier for continuous outcomes. A generalized linear model (GLM) framework enabled one to investigate covariate effects to model the ROC using parametric and semi-parametric regression methods. The latest addition to this ongoing investigation into problems of this type made use of the beta regression for placement values and the property that their CDF is the ROC curve.
The underlying motivation for the dissertation is to revisit the previous work using various models with the AUC and ROC regression with beta regression using covariate-adjusted placement values. The parametric and beta regression ROC models are introduced in Chapter Two. Both methods were applied to clinical data with different combinations of covariate effects. An extension of the ROC regression methods to the Jonckheere trend test is presented in Chapter Three. A further extension of the ROC regression models to determining the minimum effective dose is presented in Chapter Four, where the parametric and beta regression approaches were compared through a simulation study. Chapter Five applied ROC regression in three real clinical studies: incontinence data, pancreatic data, and breast cancer data. The beta ROC regression and parametric methodology were compared using the real data examples.

Contributions to the practical application of Bayesian methods to survival analysis in clinical trials. (2022-03-11) Miyakawa, Evan, 1995-; Stamey, James D.; Kahle, David J.

This dissertation is composed of three chapters that deal with fairly distinct concepts. In the first chapter, we compare and contrast the major Bayesian computational platforms accessible in the R statistical computing environment using large-scale simulations across a diverse collection of modeling scenarios. In the second chapter, we assess the performance of several model selection criteria for a complex family of network meta-analysis models for survival data. We also propose a technique for study outlier detection and present simulation results that demonstrate its effectiveness. Finally, the third chapter covers methods for constructing prediction intervals for forecasts for various machine learning algorithms. After describing existing strategies, we propose a new technique for measuring forecast uncertainty that can be used on a wide set of machine learning models.
This final chapter was the result of a collaboration with researchers at the Institute for Defense Analyses (IDA).

Statistical methods for complex spatial data. (2021-07-12) Kim, Minho, 1988-; Song, Joon Jin.

Spatial analysis is an active research area as it allows us to solve problems containing geographic information in various applications. In this dissertation, we consider some challenging issues we often face in practice. The work of this dissertation mainly focuses on spatial binary data. Binary data contain much less information than continuous data, which hinders our ability to obtain accurate predictions. To tackle this issue, we present a Bayesian downscaling model using spatially varying coefficients, which allows us to make inferences at high resolution from low resolution observed data. We also consider a situation where the binary data are measured with some errors, causing misclassification in the data. In practice, misclassification is a well-known problem, but it is often ignored, and analysis is performed as if the data were measured perfectly. We address this issue by presenting a spatial misclassification model. While high resolution data may be superior in spatial coverage, they often suffer from a considerable number of censored observations due to a limit of detection of a device. To properly handle this issue, a statistical method with a predictor subject to censoring is presented. In addition, we relax a linearity assumption between a response and a predictor variable to increase the flexibility of modeling.
We examine each model by performing extensive simulation studies and illustrate with real-world applications using precipitation data in South Korea.

Contributions to algebraic pattern recognition and integrated likelihood ratio confidence intervals. (2021-04-12) Ma, Qida, 1990-; Kahle, David J.; Young, Dean M.

Nonlinear polynomial equations describe positive-dimensional solution sets called real varieties, which in general may be quite complex but locally look like smooth manifolds. In the first chapter, we describe ways to numerically construct useful parameterizations of real varieties using a combination of Monte Carlo strategies to stochastically explore real varieties and tools from deep learning. In the second chapter, in the opposite direction, we describe ways to recognize algebraic patterns when only points with noise are provided by using our proposed RSS model with model selection strategies. In the third and fourth chapters, we propose integrated-likelihood-ratio confidence interval estimations for a Poisson rate parameter with one or two misclassification parameters using double sampling. We also compare the performance of our proposed integrated-likelihood-ratio CI estimations with other CI estimation strategies using both Monte Carlo simulations and real-world examples.

Bayesian spatial misclassification model for areal count data with applications to COVID-19. (2021-04-21) Chen, Jinjie, 1986-; Stamey, James D.

As of December 14, 2020, there have been more than 72.1 million confirmed cases of COVID-19 globally and more than 1.61 million deaths. In the United States, there are more than 16,200,000 confirmed cases and 299,000 COVID-19-related deaths, the most cases and deaths of any country. However, even with the huge number of confirmed diagnoses, the public burden of the pandemic is still masked by under-reporting and misclassification.
Based on the Bayesian spatial model and Poisson regression, we study two topics, aiming to provide a flexible quantitative approach for simulating and correcting the under-reporting and misclassification of COVID-19 at the US state level. Topic 1 quantifies under-reporting rates with Poisson-logistic regression, combined with the prior information derived from the results of the SARS-CoV-2 antibody sampling study, and then estimates the true number of COVID-19 cases in each state of the US. Topic 1 also incorporates the Besag-York-Mollié 2 (BYM2) model to correct the bias of parameter estimation caused by ignoring the spatial autocorrelation. Topic 2 proposes a bivariate Bayesian spatial misclassification model, which can simultaneously calibrate the misclassification of two counts of the same area (for example, state or county). Deaths related to COVID-19 are considered to be misclassified to other causes and vice versa (although the latter is relatively rare). In addition, because the number of deaths at the state level shows obvious spatial similarity, BYM2 random effects are included to explain the variability beyond the covariates. Our model was applied to state-level COVID-19 deaths and other deaths, achieving satisfactory results that can be a reference for estimating the true COVID-19 deaths. Topic 3 proposes and discusses the determination of sample size based on the skew-normal distribution. This method adopts Bayesian intensive simulation to overcome limitations of closed-form approximation and the normality assumption while ensuring sufficient statistical power and nominal coverage of the confidence interval (or credible set). Our approach demonstrates good performance and application prospects.

Integrated-likelihood-ratio confidence intervals obtained from data via a double-sampling scenario. (2020-08-26) Wiley, Briceon Curtis, 1994-; Stamey, James D.

Hypothesis testing has been a primary focus of statistical inference.
Recently, confidence intervals (CIs) have been suggested as a superior inference form because of the additional information they provide to a scientist to aid decision making. For public health data, business data, and other types of data, misclassification is often present and can cause estimators to be biased, thus leading to incorrect conclusions. Tenenbein (1970) has provided a double-sampling scheme to correct for misclassification through the use of an infallible data set that is combined with a larger fallible data set subject to misclassification. Many authors have utilized the double-sampling procedure to correct for misclassification in their data. When constructing confidence intervals, for instance, Rahardja and Yang (2015) derived Wald intervals for one-sample binomial problems, and Lyles (2002) proposed a Wald interval for two-sample binomial problems. Also, Riggs et al. (2009) provided confidence intervals for one-sample Poisson rate parameters. In addition, Li (2009) built similar intervals for the difference of two Poisson rate parameters. We derive integrated-likelihood-ratio (ILR) confidence intervals, first proposed by Severini (2010), for each of these situations to demonstrate their effectiveness in estimating parameters from data subject to misclassification. In chapter one, we derive an ILR CI for a one-sample binomial data set and demonstrate that it has at least nominal coverage while providing narrow average interval widths when the binomial parameter is small. In chapter two, we apply a transformation related to one from Fisher and Robbins (2019) to make the ILR CI less conservative when estimating a one-sample binomial parameter, thus providing closer-to-nominal coverage while decreasing the average interval width. In chapter three, we extend the ILR CI to estimate the log odds-ratio of two binomial parameters when the binary data are subject to misclassification. 
Finally, in chapter four we demonstrate the ILR CI’s efficacy versus the Wald and score CIs for estimating the ratio of two Poisson rate parameters using data sampled via a double-sampling scenario.

Multivariate fault detection and isolation. (2020-06-08) Klanderman, Molly C., 1994-; Hering, Amanda S.

In a variety of industrial settings, many complexly related variables are monitored to ensure that a process remains in control (IC) over time. Faults that remain undetected can cause extensive damage that requires costly repairs; even if the fault is detected quickly, it can be difficult to isolate the fault without the aid of data-driven diagnostic tools. Multivariate statistical process monitoring (MSPM) methods are designed to detect an abnormal process and sometimes to also isolate the shifted variables in a system. However, these methods often require assumptions such as normality, stationarity, and autocorrelation, which are not often met in practice. Additionally, the metrics used to evaluate MSPM schemes, terminology, and notation are inconsistent, making it difficult to understand and compare methods in the literature. In our first project, we propose a distribution-free, retrospective fault detection and isolation (FD&I) method for autocorrelated, nonstationary processes. First, we detrend the data using observations from an IC period to account for known causes of fluctuations in the mean. Then, we perform fused lasso to drive any small changes in the mean to zero, and we use the estimated effective sample size in the Extended Bayesian Information Criterion to account for autocorrelation in the choice of the regularization parameter. In our second project, we develop a fully integrated online FD&I method that can also handle non-normality, nonstationarity, and autocorrelation. The method is based on principal component analysis, where the shifted variables are recovered using adaptive lasso.
In addition, we design an enhanced visualization technique to assist operators in fault diagnosis. In both projects, we illustrate our method’s performance in case studies with known faults from a wastewater treatment facility. In our third project, we summarize the most common metrics used to evaluate FD&I methods from a survey of MSPM literature. We offer a way to standardize the notation and language, and we synthesize their strengths and weaknesses. Then, we propose a suite of new metrics to jointly assess a method’s detection and isolation ability that allows the user to tailor the metric to a particular process based on the consequences of a missed detection.

Lehmann ROC regression and spatial classification. (2019-12-17) Innerst, Melissa, 1993-; Song, Joon Jin.; Tubbs, Jack Dale.

Receiver operating characteristic (ROC) curves are a widely used measure of accuracy in diagnostic tests. Recently, there has been an increased interest in the effect that covariates have on the accuracy of the tests. As a result, several regression models for the ROC have been proposed. Two existing methods, a binormal model and a more recently proposed method based upon the beta distribution, are discussed in the first chapter. A third model based upon the Lehmann assumption and the commonly used proportional hazard ratio model is considered. An objective of this dissertation is to introduce this method and to compare it with the existing methods in order to determine if and when it performs well. We do this by constructing simulated data from three distributions: the normal, extreme-value, and Weibull. The methods are further illustrated using real leukocyte elastase data. In the second chapter, we expand our investigation of the ROC models beyond diagnostic testing to the problem of identifying cases and controls using repeated measurement data. Little is found concerning this problem in the literature, so we begin with the simple case of a simple dose-response model.
The results are based upon simulations when using the beta model with the Lehmann model with normal, extreme-value, and Weibull data. The dependency structure of the repeated measures makes use of a copula model. In the third chapter the emphasis changes as we address the question, “Will there be precipitation at a given location?” There are multiple statistical and machine learning methods available to address this question. We consider two statistical and three machine learning methods for estimating precipitation area. Since this problem involves spatial areas, we expanded the above models by incorporating spatial information into the estimation problem. The data were obtained from a network consisting of VIS rain gauges, automated weather system (AWS) tipping-bucket rain gauges, and an S-band dual-polarimetric weather radar for ten different rain events in South Korea. The mean squared prediction error (MSPE) and leave-one-out cross validation (LOOCV) are used to measure the performance of the methods.

Detecting episodes of star formation using Bayesian model selection. (2019-10-30) Lawler, Andrew Joseph, 1985-; Tubbs, Jack Dale.

Bayesian model comparison is a data-driven method to establish model complexity. In this dissertation we investigate its use in detecting multiple episodes of star formation from the analysis of the Spectral Energy Distribution (SED) of galaxies. This method is validated by simulating galaxy catalogs modeled after 3D-HST galaxies at redshift z ∼ 1. The SEDs of galaxies are derived using multivariate kernel density estimates of the input parameter distributions before fitting results and Bayes factors for multiple scenarios of nested models. In addition, we investigate the role that prior specification has in the derivation of physical parameters. These results are then compared to Bayes factors calculated using the Savage-Dickey Density Ratio (SDDR).
The results of this investigation indicate that the use of Bayes factors is a promising tool when the model has a high level of complexity. We also demonstrate that the choice of priors plays an important role in the accuracy of results and that the SDDR is a good proxy for the Bayes factor. This last finding has significant computational advantages when compared with the computationally intensive nested sampling algorithms.

On testing for a difference in two high-dimensional mean vectors. (2019-11-18) Worley, Whitney V., 1990-; Young, Dean M.

A common problem in multivariate statistical analysis involves testing for differences in the mean vectors from two populations with equal covariance matrices. This problem is considered well-posed when the sum of the two sample sizes is greater than the data dimension and, therefore, the traditional Hotelling’s T² test can be applied. In cases where the data dimension exceeds the sample-sizes sum minus two, the pooled sample covariance matrix is singular and, thus, nontraditional tests must be formulated. Using Monte Carlo simulations, we first contrast the powers of five hypothesis tests for two high-dimensional means that have been proposed in the statistical literature. We then examine the efficacy of linear dimension reduction derived from the singular value decomposition of the total data matrix and explore its effect on the powers of the five tests when the tests are conducted with the dimension-reduced data. We then propose a new test for the difference in two high-dimensional mean vectors that combines aspects of the random subspaces and cluster subspaces tests to improve test power.

Bayesian adjustment for misclassification bias and prior elicitation for dependent parameters. (2019-11-25) Lakshminarayanan, Divya Ranjani, 1993-; Seaman, John Weldon, 1956-

This research is motivated by problems in biopharmaceutical research.
Prior elicitation is defined as formulating an expert's beliefs about one or more uncertain quantities into a joint probability distribution, and is often used in Bayesian statistics for specifying prior distributions for parameters in the data model. However, there is limited research on eliciting information about dependent random variables, which is often necessary in practice. We develop methods for constructing a prior distribution for the correlation coefficient using expert elicitation. Electronic health records are often used to assess potential adverse drug reaction risk, which may be misclassified for many reasons. Unbiased estimation in the presence of outcome misclassification requires additional information. Using internally validated data, we develop Bayesian models for analyzing misclassified data with a validation substudy and compare their performance to the existing frequentist approaches.

Bayesian inference for vaccine efficacy and prediction of survival probability in prime-boost vaccination regimes. (2019-11-08) Lu, Yuelin, 1992-; Seaman, John Weldon, 1956-

This dissertation consists of two major topics on applying Bayesian statistical methods in vaccine development. Chapter two concerns the estimation of vaccine efficacy from validation samples with selection bias. Since there exists a selection bias in the validated group, traditional assumptions about the non-validated group being missing at random do not hold. A selection bias parameter is introduced to handle this problem. Extending the methods of Scharfstein et al. (2006), we construct and validate a data generating mechanism that simulates real-world data and allows evaluation of their model. We implement the Bayesian model in JAGS and assess its performance via simulation. Chapter three introduces a two-level Bayesian model which can be used in predicting survival probability from administered dose concentrations.
This research is motivated by the need to use limited information to infer the probability of survival for the next Ebola outbreak under a heterologous prime-boost vaccine regimen. The first level models the relationship between dose and induced antibody count. We use a two-stage response surface to model this relationship. The second level models the association between the antibody count and the probability of survival using a logistic regression. We combine these models to predict survival probability from administered dosage. We illustrate application of the model with three examples in this chapter and evaluate its performance in Chapter four.