Publication Date



Open access

Embargo Period


Degree Type


Degree Name

Doctor of Philosophy (PHD)


Electrical and Computer Engineering (Engineering)

Date of Defense


First Committee Member

Xiaodong Cai

Second Committee Member

Miroslav Kubat

Third Committee Member

Jie Xu

Fourth Committee Member

Sawsan Khuri

Fifth Committee Member

Stefan Wuchty


Variation in gene expression is an important mechanism underlying phenotypic variation in morphological, physiological and behavioral traits as well as disease susceptibility. Connecting DNA variants to gene expression levels not only improves understanding of the biological network, but also enhances the mapping of these quantitative traits. Thus, an understanding of the mechanism of gene expression and the genotype/phenotype relationship is of paramount importance to both scientific research and social economics. The primary function of the gene expression process is to convert information stored in genes into gene products such as RNAs or proteins. This complex process is fundamentally controlled by a class of proteins known as transcription factors (TFs) that bind to specific locations on the DNA double helix. These binding sites, known as transcription factor binding sites (TFBSs), are generally short motifs of 6-20 base pairs. The discovery of new TFBSs will contribute to the establishment of gene regulation networks, diagnosis of genetic diseases and new drug design. On the other hand, the genotype/phenotype relationship is mainly explained by multiple quantitative trait loci (QTLs), epistatic effects and environmental factors. A QTL is a section of DNA that correlates with variation in a phenotype. The QTL typically is linked to, or contains, the genes that control that phenotype. Interactions among QTLs, or between genes and environmental factors, contribute substantially to variation in complex traits. During the last two decades, the use of QTLs in animal and plant breeding has proven effective for increasing food production, resistance to diseases and pests, and tolerance to heat, cold and drought, and for improving nutrient content.
Therefore, the objective of this dissertation is to develop sparse models for such high dimensional data, develop accurate sparse variable selection and estimation algorithms for the models, and design statistical methods for robust hypothesis tests for the TFBS identification and QTL mapping problems. Although the sparse model learning work presented in this thesis is used in the context of TFBS identification or QTL mapping, the algorithms are equally applicable to a broad range of problems, such as whole-genome QTL mapping and pathway-based genome-wide association studies (GWAS). The widely used computational methods for identifying TFBSs based on the position weight matrix (PWM) assume that the nucleotides at different positions of a TFBS are independent. However, several experimental results demonstrate dependencies among different positions. Recently, Bayesian networks (BN) and variable order Bayesian networks (VOBN) were proposed to model such dependencies and thereby improve the accuracy of predicting TFBSs. However, BN and VOBN model the dependencies in a directional manner, which may hinder their ability to fully capture complex dependencies. To this end, we develop a Markov random field (MRF) based model for TFBSs capable of capturing complex undirected relationships among positions within motifs. To capture a wide range of dependencies in a sparse model without overfitting, we develop a feature selection method that retains only the most relevant features of the model. An extensive simulation study confirmed that our MRF-based method outperforms state-of-the-art methods based on VOBN. To further reduce the computational complexity of our algorithm, we introduce a novel pairwise MRF model for TFBSs, and develop a fast algorithm to learn the model parameters.
Specifically, we adopt an optimization method that employs the log determinant relaxation approach to evaluate the partition function in the MRF, which dramatically reduces the computational complexity of the algorithm. For the genotype/phenotype association problem, we develop a novel empirical Bayesian least absolute shrinkage and selection operator (EBlasso) algorithm with normal and exponential (NE) and normal, exponential and gamma (NEG) hierarchical prior distributions. Both of these algorithms employ a novel proximal gradient approach to simultaneously estimate model parameters, which leads to extremely fast convergence. Furthermore, we develop a novel proximal gradient hybrid model capable of detecting more QTLs than the basic version while still maintaining a lower false positive rate. With both the covariance and the posterior modes estimated, these algorithms also provide a statistical testing method that considers as much information as possible without increasing the degrees of freedom (DF). Extensive simulation studies are carried out to evaluate the performance of the proposed methods, and real datasets are analyzed for validation. Both simulation and real data analyses suggest that the new methods are fast and accurate genotype-phenotype association methods that can easily handle high dimensional data, including possible main and interaction effects, and are orders of magnitude faster than existing state-of-the-art methods. Specifically, EBlasso-NEG can easily handle more than 10^5 possible effects within a few seconds on an average personal computer. Given the fundamental importance of gene expression and genotype/phenotype associations in understanding the genetic basis of complex biological systems, the MRF, pairwise-MRF, EBlasso-NE, EBlasso-NEG and EBlasso-NEG hybrid algorithms and software packages developed in this dissertation achieve the effectiveness, robustness and efficiency needed for successful application to biology.
With the advancement of high-throughput molecular technologies in generating information at genetic, epigenetic, transcriptional and posttranscriptional levels, the methods developed here have broad applications in inferring TFBSs and different types of genotype and phenotype associations.
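
As an illustration of the position weight matrix (PWM) baseline that the abstract contrasts with dependency-aware models, the following minimal Python sketch may help. It is not code from the dissertation: the toy site alignment, the function names (`build_pwm`, `score`, `best_hit`), the pseudocount and the uniform background frequency are all assumptions made for the example. The point to notice is that a candidate site is scored by summing one term per position, which is exactly the independence assumption that BN, VOBN and MRF models relax.

```python
import math

# Hypothetical toy alignment of known binding sites (illustrative only,
# not data from the dissertation).
SITES = ["TATAAT", "TATGAT", "TACAAT", "TATAAT", "GATAAT"]
ALPHABET = "ACGT"


def build_pwm(sites, pseudocount=1.0, background=0.25):
    """Build per-position log-odds scores from equal-length aligned sites.

    The pseudocount and uniform background are assumptions of this sketch.
    """
    width = len(sites[0])
    pwm = []
    for pos in range(width):
        column = [site[pos] for site in sites]
        total = len(column) + pseudocount * len(ALPHABET)
        scores = {}
        for base in ALPHABET:
            freq = (column.count(base) + pseudocount) / total
            scores[base] = math.log2(freq / background)
        pwm.append(scores)
    return pwm


def score(pwm, seq):
    """Score a candidate site as a sum of independent per-position terms."""
    return sum(column[base] for column, base in zip(pwm, seq))


def best_hit(pwm, sequence):
    """Slide the PWM along a sequence; return (best score, start index)."""
    width = len(pwm)
    windows = [
        (score(pwm, sequence[i:i + width]), i)
        for i in range(len(sequence) - width + 1)
    ]
    return max(windows)


pwm = build_pwm(SITES)
hit_score, hit_start = best_hit(pwm, "CCGTATAATGCC")
print(f"best window starts at {hit_start} with score {hit_score:.2f}")
```

Because `score` adds position-specific terms, it cannot reward or penalize particular combinations of nucleotides at two positions; a pairwise model would add terms indexed by pairs of positions to capture that.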


Sparse model learning; TFBS; QTL; motif; Genotype and Phenotype Associations