Publication Date



Open access

Embargo Period


Degree Type


Degree Name

Doctor of Philosophy (PHD)


Department

Human Genetics and Genomics (Medicine)

Date of Defense


First Committee Member

Eden R. Martin

Second Committee Member

William K. Scott

Third Committee Member

Mitsunori Ogihara

Fourth Committee Member

Miroslav Kubat


Case-control resequencing studies are growing in popularity as investigators apply novel massively parallel sequencing technologies to existing case-control data sets. However, the sequence data generated by these studies present several daunting analytic challenges. The present study focuses on addressing the challenges posed by rare variants and missing genotypes when performing a test for association between a disease and a locus using data from a case-control resequencing study. Association tests that pool minor alleles into a measure of burden at a locus have been proposed to address allelic heterogeneity in the presence of rare variants. However, such pooling tests are not robust to the inclusion of neutral and protective variants, which can mask the association signal from risk variants, and may not be robust to randomly missing genotypes. In contrast, methods for locus-wide inference using nonnegative single-variant test statistics are robust to both the inclusion of neutral and protective variants and randomly missing genotypes. Therefore, three existing methods for locus-wide inference using nonnegative single-variant test statistics were compared to two widely cited pooling tests under realistic conditions. Analytic results for a simple model with one rare risk and one rare neutral variant demonstrated that pooling tests are less powerful than even Bonferroni-corrected single-variant tests in most situations. These results were extended by Monte Carlo simulations using variants with realistic minor allele frequency and linkage disequilibrium spectra, disease models with multiple rare risk variants and extensive neutral variation, and varying rates of randomly missing genotypes. In all scenarios considered, at least one existing method using nonnegative single-variant test statistics had power comparable to or greater than the two pooling tests considered. These results suggest that efficient locus-wide inference using single-variant test statistics should be reconsidered as a useful framework for addressing the challenge posed by rare variants in case-control resequencing studies.
Methods that perform efficient locus-wide inference using nonnegative single-variant test statistics also partially address the challenge posed by missing genotypes because they can use all available genotype data. When these methods are based on permutation tests, inferences will be valid if genotypes are randomly missing; that is, if the probability of a missing genotype at a variant does not depend on other observed or unobserved variables in the study. However, it was unclear whether methods based on permutation tests would yield valid inferences for nonrandomly missing genotypes. Therefore, a rigorous theoretical framework for constructing valid permutation tests was developed for genetic case-control studies with unrelated subjects and missing genotypes arising from a variety of missing data processes. The development began with the specification of a nonparametric probability model for the observed data in such a study. Group-theoretic arguments were then used to establish two conditions that together guarantee an exact level-α Monte Carlo permutation test for data generated under this nonparametric probability model. One of these conditions is not satisfied for the most frequently used Monte Carlo permutation test, and this test is guaranteed to be level α only for missing data processes with certain characteristics. An alternative Monte Carlo permutation test, which is exact level α as long as all covariates influencing the missing data process are identified and recorded, was therefore proposed. The theoretical development was supplemented with Monte Carlo simulations for a variety of test statistics and missing data processes. These results demonstrate that Monte Carlo permutation tests must be constructed with careful consideration of the missing data process to adequately address the challenge posed by missing genotypes and avoid inferential errors.
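As an illustration of the general idea of locus-wide inference from nonnegative single-variant statistics, the following is a minimal sketch and not the dissertation's own implementation. It assumes genotypes coded as minor allele counts (0, 1, 2) with missing values stored as NaN, uses an allelic chi-square statistic per variant, and takes the maximum over variants as the locus-wide statistic. The function names, the choice of statistic, and the data layout are all hypothetical. Case-control labels are permuted freely across subjects, which corresponds to the commonly used scheme that the abstract describes as valid when genotypes are randomly missing.

```python
import numpy as np

def single_variant_stats(genotypes, is_case):
    """Allelic chi-square statistic per variant, computed only from
    the genotypes observed at that variant (NaN marks a missing call).

    genotypes: (subjects x variants) float array of minor allele counts.
    is_case:   boolean array, True for cases and False for controls.
    """
    observed = ~np.isnan(genotypes)
    g = np.where(observed, genotypes, 0.0)
    case = is_case[:, None] & observed
    ctrl = ~is_case[:, None] & observed
    # 2x2 allele table per variant: minor/major by case/control.
    a = (g * case).sum(axis=0)       # minor alleles in cases
    b = 2 * case.sum(axis=0) - a     # major alleles in cases
    c = (g * ctrl).sum(axis=0)       # minor alleles in controls
    d = 2 * ctrl.sum(axis=0) - c     # major alleles in controls
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    with np.errstate(divide="ignore", invalid="ignore"):
        # Uninformative variants (monomorphic or fully missing) get 0.
        stat = np.where(denom > 0, n * (a * d - b * c) ** 2 / denom, 0.0)
    return stat

def max_stat_permutation_test(genotypes, is_case, n_perm=999, seed=0):
    """Monte Carlo permutation p-value for the maximum single-variant
    statistic at a locus, permuting case-control labels across subjects."""
    rng = np.random.default_rng(seed)
    observed_max = single_variant_stats(genotypes, is_case).max()
    exceed = 0
    for _ in range(n_perm):
        permuted = rng.permutation(is_case)
        if single_variant_stats(genotypes, permuted).max() >= observed_max:
            exceed += 1
    # Counting the observed data as one arrangement keeps the p-value
    # strictly positive.
    return (exceed + 1) / (n_perm + 1)
```

Because each variant's statistic is nonnegative and uses whatever genotypes are available at that variant, neutral variants add little to the maximum and subjects with partially missing data are not discarded. When missingness depends on other variables, the abstract indicates that free permutation of labels is not guaranteed to be level α, so this sketch would need a different permutation scheme in that setting.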


Keywords

case-control genetic association studies; resequencing; rare variants; missing genotypes; permutation tests