Selected topics in high-dimensional statistical learning.

Date

2012-08

Authors

Ramey, John A.

Access rights

Worldwide access

Abstract

Advances in microarray technology have equipped researchers to measure the expression levels of thousands of genes simultaneously, yielding increasingly large and complex data sets. However, because of the cost and time required to obtain individual observations, the sample sizes of the resulting data sets are often much smaller than the number of genes measured. Hence, owing to the curse of dimensionality [Bellman, 1961], the analysis of these data sets with classic multivariate statistical methods is challenging and, at times, impossible. Consequently, numerous supervised and unsupervised learning methods have been proposed to improve upon classic methods.
In Chapter 2 we formulate a clustering stability evaluation method based on decision-theoretic principles to assess the quality of clusters proposed by a clustering algorithm used to identify subtypes of cancer for diagnosis. Using three artificial data sets and a well-known microarray data set from Khan et al. [2001], we demonstrate that our proposed clustering-evaluation method is better suited to comparing clustering algorithms, and is more interpretable, than the figure of merit (FOM) method from Yeung, Haynor, and Ruzzo [2001] and the cluster stability evaluation method from Hennig [2007].
In Chapter 3 we investigate model selection of the regularized discriminant analysis (RDA) classifier proposed by Friedman [1989]. Using four small-sample, high-dimensional data sets, we compare the classification performance of RDA models selected with five conditional error-rate estimators to that of models selected with the leave-one-out (LOO) error-rate estimator, which Friedman [1989] recommended for RDA model selection. We recommend the 10-fold cross-validation (CV) estimator and the bootstrap CV estimator from Fu, Carroll, and Wang [2005] for model selection with the RDA classifier.
In Chapters 4 and 5 we consider the diagonal linear discriminant analysis (DLDA) classifier, the shrinkage-based DLDA (SDLDA) classifier from Pang, Tong, and Zhao [2009], and the shrinkage-mean-based DLDA (SmDLDA) classifier from Tong, Chen, and Zhao [2012]. We propose four alternative classifiers and demonstrate, using six well-known microarray data sets, that they are often superior to the diagonal classifiers because they preserve off-diagonal classificatory information by nearly simultaneously diagonalizing the sample covariance matrix of each class.

Keywords

Supervised learning, Unsupervised learning, Clustering, Clustering stability, Clustering evaluation, Classification, Naive Bayes classifier, Regularized discriminant analysis, Diagonal discriminant analysis, Error-rate estimation, Gene expression data, Microarray data