Publication Date



Open access

Embargo Period


Degree Type


Degree Name

Doctor of Philosophy (PHD)


Biostatistics (Medicine)

Date of Defense


First Committee Member

J. Sunil Rao

Second Committee Member

Hemant Ishwaran

Third Committee Member

Lily Wang

Fourth Committee Member

Nagi Ayad


In this thesis, I develop new variable selection and statistical modeling techniques within the framework of L1 shrinkage estimation, with applications to high-dimensional genomic and pharmacogenomic datasets. In the first part of the thesis, I revisit the problem of variable selection in linear regression models. Although numerous variable selection procedures have been developed, their finite-sample performance is often unsatisfactory. I develop a new strategy for variable selection in adaptive least absolute shrinkage and selection operator (Lasso) and adaptive elastic-net estimation with the number of predictors $p_n$ diverging. The basic idea is first to use the trace paths of their LARS solutions to bootstrap estimates of maximum frequency (MF) models conditioned on model dimension. Conditioning on dimension effectively mitigates overfitting; to deal with underfitting, these MF models are then prediction-weighted. I show that the new method is not only model selection consistent but also has an attractive convergence rate, which leads to outstanding finite-sample performance. In the second part, I propose a new statistical model to re-explore the Genomics of Drug Sensitivity in Cancer (GDSC) study (Garnett et al., 2012). To link drug sensitivity with genomic profiles, the study screened 639 human tumor cell lines with 130 cancer drugs, ranging from known chemotherapeutic agents to experimental compounds. However, several statistical challenges remain in analyzing this dataset: i) biomarkers cluster among the cell lines; ii) clusters can overlap (e.g., a cell line may belong to multiple clusters); iii) drugs should be modeled jointly. I introduce a new multivariate regression model with a latent overlapping cluster indicator variable to address these issues. I then propose generalized mixture of multivariate regression (GMMR) models and establish their connection to the new model. I develop a new EM algorithm for numerical computation in the GMMR model.
The proposed model can answer specific questions about the GDSC data: i) can cancer-specific therapeutic biomarkers be detected, and ii) can drug resistance patterns be identified, along with predictive strategies to circumvent resistance using alternative drugs? In the third part of the thesis, I tackle another challenging problem related to the GDSC data: validating models built on one dataset but tested on similar datasets generated in other laboratories. The Genomics of Drug Sensitivity in Cancer (GDSC) and the Cancer Cell Line Encyclopedia (CCLE) are two major resources that can be mined for therapeutic biomarkers across a wide variety of cancers. Recent studies found that while the genomic profiling appears consistent between the two, the drug response data are not. As a result, neither predictions nor signatures validate well for models built on one dataset and tested on the other. I present a partitioning strategy based on a data-sharing concept, which directly estimates the amount of discordance between datasets and, in doing so, also allows reproducible signals to be extracted. I show significantly improved test-set prediction accuracy and signature validation compared with other approaches that have been tried.
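
As a rough illustration of the maximum frequency (MF) idea from the first part of the abstract, the sketch below tallies, for each model dimension, which set of variables appears most often across bootstrap resamples. It is not the thesis implementation. The function name `max_frequency_models`, the toy entry orders, and the simplification that each bootstrap LARS path is just an ordered list of entering variables (no variables dropping out) are all assumptions made for illustration, and the prediction-weighting step is omitted because the abstract does not specify it.

```python
from collections import Counter


def max_frequency_models(bootstrap_paths):
    """For each model dimension k, return the most frequent size-k model.

    bootstrap_paths: one path per bootstrap resample; each path lists
    variable indices in the order they entered the solution path
    (a simplifying assumption: no variable ever leaves the path).
    """
    counts = {}  # dimension k -> Counter over candidate models
    for path in bootstrap_paths:
        for k in range(1, len(path) + 1):
            model = frozenset(path[:k])  # the size-k model on this path
            counts.setdefault(k, Counter())[model] += 1

    n = len(bootstrap_paths)
    result = {}
    for k, counter in sorted(counts.items()):
        model, freq = counter.most_common(1)[0]
        # Store the MF model and its bootstrap frequency at dimension k.
        result[k] = (sorted(model), freq / n)
    return result


# Hypothetical entry orders from four bootstrap resamples.
paths = [[2, 0, 5], [2, 0, 3], [0, 2, 5], [2, 5, 0]]
mf = max_frequency_models(paths)
for k, (model, freq) in mf.items():
    print(k, model, freq)
```

Conditioning on dimension means that models are compared only with others of the same size, so a larger model cannot win merely by containing more variables; choosing among the resulting per-dimension MF models is where a prediction-based criterion would enter.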


L1 shrinkage estimation; variable selection; overlapping clustering; model validation