Doctor of Philosophy (PHD)
Electrical and Computer Engineering (Engineering)
Date of Defense
First Committee Member
Second Committee Member
Third Committee Member
Manohar N. Murthi
Fourth Committee Member
Fifth Committee Member
Genotype and phenotype associations are of paramount importance in understanding the genetic basis of living organisms, improving traits of interest in animal and plant breeding, and gaining insights into complex biological systems and the etiology of human diseases. With advances in molecular biology such as microarrays, high-throughput next-generation sequencing, and RNA-seq, the number of available genotype markers far exceeds the number of available samples in association studies. The objective of this dissertation is to develop sparse models for such high-dimensional data, develop accurate sparse variable selection and estimation algorithms for the models, and design statistical methods for robust hypothesis tests of genotype and phenotype associations. We develop a novel empirical Bayesian least absolute shrinkage and selection operator (EBlasso) algorithm with Normal, Exponential and Gamma (NEG) and Normal, Exponential (NE) hierarchical prior distributions, and an empirical Bayesian elastic net (EBEN) algorithm with an innovative Normal and generalized Gamma (NG) hierarchical prior distribution, for both general linear and generalized logistic regression models. Both empirical Bayes methods estimate the variance components of the regression coefficients with closed-form solutions and perform automatic variable selection, such that a variable with zero variance is excluded from the model. Because the variance components have closed-form solutions and posterior modes are not estimated for excluded variables, the two empirical Bayes methods infer sparse models efficiently. With both the covariance and the posterior modes estimated, they also provide a statistical testing method that considers as much information as possible without increasing the degrees of freedom (DF). Extensive simulation studies are carried out to evaluate the performance of the proposed methods, and real datasets are analyzed for validation.
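As a brief illustration of why a Normal and Exponential hierarchy leads to a lasso-type penalty, the following worked equation marginalizes the variance out of a two-level prior. The parameterization is an assumption made for illustration (an exponential prior with rate \lambda on the variance \sigma_j^2 of coefficient \beta_j); the abstract does not state the exact form used in the dissertation.

```latex
\begin{align}
\beta_j \mid \sigma_j^2 &\sim \mathcal{N}(0, \sigma_j^2), \qquad
\sigma_j^2 \mid \lambda \sim \mathrm{Exp}(\lambda), \\
p(\beta_j \mid \lambda)
  &= \int_0^\infty \frac{1}{\sqrt{2\pi\sigma_j^2}}
     \exp\!\left(-\frac{\beta_j^2}{2\sigma_j^2}\right)
     \lambda\, e^{-\lambda\sigma_j^2}\, d\sigma_j^2
   = \frac{\sqrt{2\lambda}}{2}
     \exp\!\left(-\sqrt{2\lambda}\,\lvert\beta_j\rvert\right).
\end{align}
```

Under this assumed form the marginal prior is a Laplace density, whose negative logarithm is proportional to |\beta_j|, the lasso penalty. A variance of \sigma_j^2 = 0 places all prior mass at \beta_j = 0, which corresponds to excluding that variable from the model.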
Both simulation and real data analyses suggest that the two methods are fast and accurate genotype-phenotype association methods that can easily handle high-dimensional data, including possible main and interaction effects. Comparing the two methods, EBlasso typically selects one variable out of a group of highly correlated effects, whereas the EBEN algorithm encourages a grouping effect that selects a group of effects if they are correlated. Beyond the verification simulations and real data analyses, we further demonstrate the advantage of the developed algorithms through two exploratory applications, namely whole-genome QTL mapping for an elite rice hybrid and a pathway-based genome-wide association study (GWAS) for human Parkinson disease (PD). In the first application, we exploit whole-genome markers of an immortalized F2 population derived from an elite rice hybrid to perform QTL mapping for the rice-yield phenotype. Our QTL model includes additive and dominance main effects of 1,619 markers and all pair-wise interactions, for a total of more than 5 million possible effects. This study not only reveals the major role of epistasis in influencing rice yield, but also provides a set of candidate genetic loci for further experimental investigation. In the second application, we employ the EBlasso logistic regression model for pathway-based GWAS to include all possible main effects and a large number of pair-wise interactions of single nucleotide polymorphisms (SNPs) in a pathway, for a total of more than 32 million effects included in the model. With effects inferred by EBlasso, the statistical significance of a pathway is tested with the Wald statistic, and reliable effects in a significant pathway are selected using the stability selection technique. Another important area of genotype and phenotype association is inferring the structure of gene regulatory networks (GRNs).
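The contrast between selecting one effect and selecting a correlated group can be illustrated with the ordinary (non-Bayesian) lasso and elastic net penalties. The sketch below is not the EBlasso or EBEN algorithm. It is plain cyclic coordinate descent on synthetic data, and the names `coordinate_descent`, `l1`, and `l2` as well as the penalty values are hypothetical choices made for illustration. Two predictors are made identical, which is the extreme case of correlation.

```python
import numpy as np


def soft_threshold(rho, lam):
    """Soft-thresholding operator used by L1-penalized updates."""
    return np.sign(rho) * max(abs(rho) - lam, 0.0)


def coordinate_descent(X, y, l1, l2=0.0, n_sweeps=200):
    """Minimize (1/2n)||y - Xb||^2 + l1*||b||_1 + (l2/2)*||b||^2.

    l2 = 0 gives the lasso; l2 > 0 gives an elastic net penalty.
    """
    n, p = X.shape
    beta = np.zeros(p)
    resid = y.copy()                      # residual y - X @ beta
    col_sq = (X ** 2).sum(axis=0) / n     # x_j' x_j / n for each column
    for _ in range(n_sweeps):
        for j in range(p):
            # Correlation of column j with the partial residual.
            rho = X[:, j] @ resid / n + col_sq[j] * beta[j]
            new = soft_threshold(rho, l1) / (col_sq[j] + l2)
            resid += X[:, j] * (beta[j] - new)
            beta[j] = new
    return beta


# Synthetic data: columns 0 and 1 are identical, column 2 is unrelated.
rng = np.random.default_rng(0)
n = 200
x1 = rng.standard_normal(n)
x3 = rng.standard_normal(n)
X = np.column_stack([x1, x1, x3])
X = (X - X.mean(axis=0)) / X.std(axis=0)
y = 2.0 * X[:, 0] + 0.1 * rng.standard_normal(n)
y = y - y.mean()

beta_lasso = coordinate_descent(X, y, l1=0.1)
beta_en = coordinate_descent(X, y, l1=0.1, l2=0.5)
print("lasso      :", beta_lasso)
print("elastic net:", beta_en)
```

With identical columns the lasso solution is not unique, and this cyclic update puts essentially all of the weight on the first of the two columns. The added ridge term makes the elastic net objective strictly convex, so the two identical columns receive equal coefficients.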
We develop a GRN inference algorithm by exploring sparse model selection and estimation methods in structural equation models (SEMs). We extend a previously developed sparse-aware maximum likelihood (SML) algorithm to incorporate the adaptive elastic net penalty for the SEM likelihood function (SEM-EN) and infer the model using a parallelized block coordinate ascent algorithm. With the versatile penalty function and powerful parallel computation, the SEM-EN algorithm is able to infer a network with thousands of nodes. The performance of the developed algorithm is demonstrated through simulation studies, in which power of detection and false discovery rate both suggest that SEM-EN significantly improves GRN inference over the previously developed SEM-SML algorithm. When applied to infer the GRN of a real budding yeast dataset with more than 3,000 nodes, SEM-EN infers a sparse network corroborated by previous independent studies in terms of the roles of hub nodes and the functions of key clusters. Given the fundamental importance of genotype and phenotype associations in understanding the genetic basis of complex biological systems, the EBlasso-NE, EBlasso-NEG, EBEN, and SEM-EN algorithms and software packages developed in this dissertation achieve the effectiveness, robustness and efficiency that are needed for successful application to biology. With the advancement of high-throughput molecular technologies in generating information at genetic, epigenetic, transcriptional and post-transcriptional levels, the methods developed in this dissertation can have broad applications to infer different types of genotype and phenotype associations.
Sparse; Empirical Bayes; Lasso; Elastic Net; Generalized linear regression; Structural Equation Models
Huang, Anhui, "Sparse Model Learning for Inferring Genotype and Phenotype Associations" (2014). Open Access Dissertations. 1186.