
%!TEX root = ./main.tex


Advances in high-throughput genomic technologies have facilitated the collection of DNA information for thousands of individuals, providing unprecedented opportunities to explore the genetic architecture of complex disease.
One important finding has been that the majority of variants in the human genome are low in frequency or rare.
It has been hypothesised that the recent explosive growth of the human population gave rise to an unexpectedly large number of rare variants with potentially deleterious effects, suggesting that rare variants may play a role in disease predisposition.
Importantly, rare variants also constitute a source of information from which we may learn more about our recent evolutionary history.
In this thesis, I developed several statistical and computational methods to address problems associated with the analysis of rare variants and, foremost, to leverage the genealogical information they encode.

First, one constraint in genome-wide association studies is that lower-frequency variants are not well captured by genotyping methods and must instead be predicted through imputation from a reference dataset.
I developed the \emph{meta-imputation} method to improve imputation accuracy by integrating genotype data from multiple, independent reference panels; this approach outperformed imputation from the separate references in almost all comparisons (mean correlation with masked genotypes ${r^2 > 0.9}$).
I further demonstrated in simulated case-control studies that meta-imputation increased the statistical power to identify low-frequency variants of intermediate or high penetrance by 2.2--3.6\%.

Second, rare variants are likely to have originated recently through mutation and therefore reside on relatively long haplotype segments that are identical by descent (IBD).
I developed a method that uses rare variants as identifiers of shared haplotype segments and detects the recombination breakpoints flanking them using non-probabilistic approaches.
In coalescent simulations, I show that such breakpoints can be inferred with high accuracy (${r^2 > 0.99}$) around rare variants at frequencies ${\leq 0.05\%}$, using either haplotype or genotype data.

Third, I show that technical error poses a major problem for the analysis of whole-genome sequencing or genotyping data, particularly for alleles below 0.05\% frequency (false positive rate, ${\text{FPR}=0.1}$).
I therefore propose a novel approach to infer IBD segments using a Hidden Markov Model (HMM) which operates on genotype data alone.
I incorporated an empirical error model constructed from error rates I estimated in publicly available sequencing and genotyping datasets.
The HMM was robust in the presence of error in simulated data (${r^2>0.98}$), whereas non-probabilistic methods failed (${r^2<0.02}$).

Lastly, the age of an allele (the time since its creation through mutation) may provide clues about demographic processes that resulted in its observed frequency.
I present a novel method to estimate (rare) allele age based on the inferred shared haplotype structure of the sample.
The method operates in a Bayesian framework to infer pairwise coalescent times from which the age is estimated using a composite posterior approach.
I show in simulated data that coalescent time can be inferred with high accuracy (rank correlation ${>0.91}$), which resulted in similarly high accuracy for the estimated age (${>0.94}$).
When applied to data from the 1000 Genomes Project, the estimated age distributions were overall consistent with frequency-dependent expectations under neutrality, although patterns of low frequency and old age may hint at signatures of selection at certain sites.
Thus, this method may prove useful in the analysis of large cohorts when linked to biomedical phenotype data.


% I further show that it would be possible to use IBD information to locally estimate haplotypes from genotype data (phasing).


% Recent advances in high-throughput genomic technologies have enabled the large-scale collection of massive amounts of whole-genome data for thousands of individuals, which provide unprecedented opportunities to learn more about the genetic architecture of complex diseases.
% One important finding was that the majority of genetic variants in the human genome is low in frequency or rare, each variant being shared by only a small number of individuals.
% It has been hypothesised that these endow low, but deleterious effects, possibly emanating from the recent explosive growth of the human population.
% Existing methodologies are not designed to detect such minor effects, but which nonetheless may play a significant role in the aetiology of complex diseases.
% For example, rare variants are generally too low in frequency to expect statistical significance in GWAS, and traditional linkage methods are underpowered to locate variants with low or modest penetrance.
%
% In this thesis, I developed several statistical methods [...]
%
% A caveat of GWAS is that genotyping methods are designed to capture common variants, while variants at lower frequencies have to be predicted through imputation from a reference panel.
% I developed a method to improve imputation accuracy by integrating genotype data from multiple reference datasets.
% In a series of simulated case-control experiments, I demonstrate that this approach, called meta-imputation, is able to improve power to detect low-frequency variants of intermediate or high penetrance.
%
% Despite the problems to interrogate rare variants using existing approaches, they provide a useful source of information about recent demographic history, as they are likely to have originated recently through mutation, making them highly population-specific.
% I developed a non-probabilistic method to detect shared haplotype segments that are identical by descent (IBD) from patterns of rare allele sharing, using either haplotype or genotype data.
% I further show that it would be possible to use IBD information to locally estimate haplotypes from genotype data (phasing).
%
% -- Genotype error is a major problem
% -- I propose a novel approach to infer IBD using a \gls{hmm} under an empirical error model, which I constructed by identifying misclassified genotypes in different genotyping and sequencing datasets.
%
% -- The age of a rare allele (the time since it was created through mutation) may provide clues about the selective forces that [resulted/allowed/have granted/have led] it to be observed at specific frequency and its impact on fitness.
% -- I developed a novel method to estimate rare allele age, based on the inferred IBD structure of a sample.
% -- I demonstrate that the age of particular alleles can be estimated with high accuracy using the HMM-based approach for IBD detection, which is robust towards phasing or genotype errors.
% -- I apply this method to data from the 1000 Genomes Project and show that there are significant age differences between rare alleles predicted to have high or low consequences on the phenotype.


% Despite these problems, rare variants provide a useful source of information about recent demographic history, as they are likely to have originated recently through mutation, making them highly population-specific.
% Hence, the patterns of rare allele sharing
%
% I demonstrate that this method (referred to as the tidy algorithm) is able to detect recombination breakpoints of \gls{ibd} segments
%
% Classical approaches such as the four-gamete test
%
% I describe \n{2} implementations of the algorithm that can be
% applied to datasets consisting of thousands of samples.
%
% in presence of genotype error.
%
% quantified genotype error in
% sequencing and genotyping datasets
%
% IBD sharing across purportedly unrelated individuals
%
% \gls{hmm}
% empirical error model
%
% composite likelihood approach to estimate the age of an allele
%
% the development of novel strategies that enable future discoveries.