Statistical methods for high-throughput data integration: methodologies in disease research and drug discovery.


Access rights

No access - Contact


The wide application of high-throughput technologies in biomedical research calls for integrative approaches to data mining and knowledge discovery. Consequently, methodologies that deliver robust and systematic integration are in unprecedented demand. Two important sub-disciplines of biomedical research, disease research and drug discovery, have become ever-evolving frontiers for the integration of “big data”. In disease research, p-value combination has been broadly employed to integrate statistical evidence from multiple studies. Conventional p-value combination methods commonly assume that the combined tests are independent and homogeneous, and these assumptions are constantly challenged by the complex nature of high-throughput biomedical datasets. In this dissertation, we propose a novel and robust p-value combination algorithm based on the Pareto dominance principle from multi-objective optimization, which accounts for dependency and heterogeneity in the data. Compared to existing methods, the Pareto method attains adaptive rejection regions by “learning” the multivariate null distribution estimated by permutations. It therefore achieves superior performance when combining heterogeneous effects from multiple datasets, while maintaining appropriate error control for correlated tests. The Pareto meta-gene-set-analysis tool, PEACH, was developed and tested on a 16-cancer pan-cancer dataset from The Cancer Genome Atlas (TCGA). The PEACH algorithm demonstrated significantly improved statistical power and the ability to detect important pathways related to sub-groups of cancers.
Computational drug repurposing based on gene expression data, meanwhile, has gained increasing popularity in the field of drug discovery. The Connectivity Map (CMap) is a major database for repurposing drugs from gene expression data. However, key limitations of the current signature-based drug-repurposing paradigm have hindered accurate and unbiased repurposing. In the second part of this dissertation, we developed a frame-breaking statistical approach, Dr. Insight, that removes the requirement of subjectively selecting a gene signature to query the CMap database. We performed comprehensive studies using simulated data and disease datasets and validated the superior performance of Dr. Insight compared to previous methods. A TCGA breast cancer case study was also performed to showcase the application of Dr. Insight to breast cancer drug repurposing, from drug redirection to the systematic construction of disease-specific drug-target networks.
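
The contrast the abstract draws between conventional p-value combination and a permutation-based Pareto approach can be illustrated with a minimal Python sketch. The first function is Fisher's method, a standard combination rule that assumes independent tests. The other two are only an assumed, simplified reading of "Pareto dominance" applied to vectors of p-values: they count how many permutation null vectors are at least as extreme as the observed vector in every study. The function names and this dominance-count statistic are hypothetical illustrations, not the PEACH algorithm, whose rejection-region construction is not specified in the abstract.

```python
import math


def fisher_combine(pvalues):
    """Fisher's method: combine p-values from tests assumed to be independent."""
    k = len(pvalues)
    stat = -2.0 * sum(math.log(p) for p in pvalues)
    # Under the null, stat follows a chi-square distribution with 2k degrees
    # of freedom. For even degrees of freedom the survival function has the
    # closed form exp(-x/2) * sum_{i<k} (x/2)^i / i!.
    half = stat / 2.0
    term, total = 1.0, 1.0
    for i in range(1, k):
        term *= half / i
        total += term
    return math.exp(-half) * total


def dominates(a, b):
    """True if vector a Pareto-dominates b: no larger anywhere, smaller somewhere."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def pareto_empirical_p(observed, null_vectors):
    """Hypothetical dominance-count p-value (an assumption, not PEACH itself).

    `null_vectors` holds p-value vectors computed on permuted data, so any
    dependence between studies is carried by the permutations themselves.
    """
    obs = tuple(observed)
    hits = sum(
        1 for v in null_vectors
        if tuple(v) == obs or dominates(tuple(v), obs)
    )
    # Add-one correction keeps the empirical p-value strictly positive.
    return (1 + hits) / (1 + len(null_vectors))


if __name__ == "__main__":
    print(fisher_combine([0.05, 0.05]))
    print(pareto_empirical_p((0.05, 0.05), [(0.01, 0.01), (0.5, 0.5), (0.04, 0.9)]))
```

In this sketch a null vector such as (0.04, 0.9) does not count against the observed (0.05, 0.05), because it is more extreme in only one study. A dominance-based rule of this kind therefore weighs evidence differently from a single summed statistic, which is why it is offered here as an illustration only.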



Statistical methods. Data integration. Disease. Drug.