This repo provides a template for processing WGS data in Hail
- README: Considerations of WGS data and QC pipeline steps
- Resources: Links and other things relevant to WGS pipelines
- Modules: Hail script templates for each pipeline step
Relevant gnomAD resources:
- gnomAD resources: gs://gcp-public-data--gnomad/
- gnomad readthedocs
- gnomAD v2.1
- gnomAD v3.0
All of Us QC report (Spring 2023)
File size:
- WGS data is orders of magnitude larger than GWAS array data (fixed variant content) and exome capture sequencing (~1% of the genome)
- File size will scale with sample size
- Strategies to deal with large files (see the sketch after this list):
- Use Hail
- Run everything on the cloud
- Re-partitioning the Hail matrix table
- Use a Sparse matrix table
- Use a hail VDS (variant dataset), which retains only non-ref calls
- GVS file format: a Google BigQuery solution to save space, given the size of joint-called VCFs
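A minimal Hail sketch of two of these strategies (re-partitioning a matrix table and reading a VDS); the bucket paths and partition count below are placeholders:

```python
import hail as hl

hl.init()  # on the cloud, point this at the appropriate backend/billing project

# Re-partition a dense matrix table so downstream steps parallelize well
# (the path and partition count are placeholders)
mt = hl.read_matrix_table('gs://my-bucket/wgs_dataset.mt')
mt = mt.repartition(5000)
mt.write('gs://my-bucket/wgs_dataset.repart.mt', overwrite=True)

# Read a sparse VDS (variant dataset), which stores only non-ref calls
vds = hl.vds.read_vds('gs://my-bucket/wgs_dataset.vds')
```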
File formats:
- Per-sample gVCFs: VCF format data for each sample, where all variant sites and reference blocks are listed for all contigs
- Ref-blocks: Summary depth/genotype quality information in the gVCF across a genomic interval where the sample has no variant sites
Common and rare variation:
- WGS data contains SNP/indel calls across the allele frequency spectrum, whereas array data is largely limited to common variants (MAF > 0.1%)
- Low coverage WGS (< 10x) aided by imputation
- High coverage WGS (> 20x) robust across all SNP/indels
- Multiple levels of analysis
- Pulling out common variants to include in GWAS meta-analysis (GWAMA)
- Pulling out rare coding variants to include in Exome meta-analysis
- Looking at rare non-coding variation (good luck!)
Repetitive regions:
- WGS contains challenging genomic regions often ignored in array and exome capture sequencing
- Regions with highly repetitive sequence that are difficult for short-read sequencing to align properly to the reference genome
- Telomeres/centromeres of each chromosome
- Segmental duplications - large repetitive chunks
- Low complexity regions (LCRs)
- These regions are often excluded early on in an analysis
- Dataset QC prior to manipulating VCF
- Dataset pre-filtering
- High quality hard call subset of the data
- Outlier sample QC part 1
- Sex check sample QC
- Identity-by-descent filtering
- Principal components analysis
- Principal components filtering
- Outlier sample QC part 2
- Variant QC
- Assessing variant QC
- GOAL: Understanding the project/phenotype data you are working with
- List and understand all sample phenotypes provided
- Match up phenotype file IDs with genetic data IDs
- Resolve any inconsistencies before moving forward
- Start a spreadsheet/table of datasets and sequencing platforms
- Know which genome reference your sequence data is mapped to (most WGS data is hg38 / GRCh38)
- Determine whether you are using a dense or sparse matrix table
- Look over the VCF meta-data at the top of the VCF file
- Read the VCF and phenotype/sample file into Hail (see the sketch after this list)
- Generate sample QC metrics from the raw VCF
- Write up paragraph of sample collection and descriptives
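A minimal sketch of reading the data into Hail, matching phenotype IDs to genetic IDs, and generating raw sample QC metrics; the file paths and the 'sample_id' phenotype column are placeholder names:

```python
import hail as hl

hl.init()

# Import the joint-called VCF (placeholder path; most WGS data is GRCh38)
mt = hl.import_vcf('gs://my-bucket/raw_callset.vcf.bgz',
                   reference_genome='GRCh38', force_bgz=True)

# Import the phenotype/sample file and key it by the sample ID column
# ('sample_id' is a placeholder column name)
pheno = hl.import_table('gs://my-bucket/phenotypes.tsv',
                        impute=True, key='sample_id')

# Match phenotype IDs to genetic data IDs; unmatched samples get missing values,
# which is worth checking explicitly before moving forward
mt = mt.annotate_cols(pheno=pheno[mt.s])
n_unmatched = mt.aggregate_cols(hl.agg.count_where(hl.is_missing(mt.pheno)))
print(f'samples without phenotype records: {n_unmatched}')

# Generate per-sample QC metrics (call rate, n_snp, n_singleton, r_het_hom_var, ...)
mt = hl.sample_qc(mt)
mt.write('gs://my-bucket/raw_callset.qc.mt', overwrite=True)
```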
Categorical variables often analyzed in datasets
- Cohort
- Sequencing wave
- C-Project (Broad specific)
- Sequencing Plate
- Sex
- Affection status
- Continental ancestry groupings / reported ancestry
Quantitative parameters often analyzed in datasets
- Age
- Top genetic principal components
- Number of variant sites (n_snps)
- Singleton rate (n_singleton)
- Het / hom variant ratio (r_het_hom_var)
- GOAL: Remove variants that are highly unlikely to be analyzed (see the sketch after this list)
- Remove variants that fail VQSR (or AS-VQSR)
- Remove variants in telomeres / centromeres
- Remove variants in low-complexity regions
- Remove variants in segmental duplication regions
- Generate sample QC metrics from pre-filtered VCF
- VEP annotate remaining sites
- Usually just need gene name and consequence
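A minimal sketch of this pre-filtering step, assuming VQSR status is carried in the VCF filters field and that telomere/centromere, LCR, and segmental-duplication intervals are available as BED files (placeholder paths):

```python
import hail as hl

mt = hl.read_matrix_table('gs://my-bucket/raw_callset.qc.mt')

# Keep only variants that PASS VQSR / AS-VQSR (empty 'filters' set)
mt = mt.filter_rows(hl.len(mt.filters) == 0)

# Remove variants in telomeres/centromeres, LCRs, and segmental duplications
# (BED paths are placeholders; gnomAD publishes similar interval resources)
for bed_path in ['gs://my-bucket/telomeres_centromeres.bed',
                 'gs://my-bucket/lcr_regions.bed',
                 'gs://my-bucket/segdup_regions.bed']:
    bad = hl.import_bed(bed_path, reference_genome='GRCh38')
    mt = mt.filter_rows(hl.is_defined(bad[mt.locus]), keep=False)

# Re-generate sample QC metrics on the pre-filtered data
mt = hl.sample_qc(mt, name='sample_qc_prefiltered')

# VEP annotate remaining sites (requires a VEP-enabled Hail environment and may
# need a config argument); usually only gene name and consequence are needed
mt = hl.vep(mt)

mt.write('gs://my-bucket/prefiltered.mt', overwrite=True)
```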
- GOAL: Use a smaller subset of variants (i.e. 50-500k variants) to analyze relatedness (IBD) and population ancestry (PCA)
- Bi-allelic SNVs only
- Common variants (MAF > 0.1%)
- Call rate > 99%
- LD-pruned with a cutoff of r2 = 0.1
- PROTIP: Use gnomAD / CCDG hg38 PCA variant list
- HQ calls selected from large multi-ancestry WGS cohort
- 224,591 variants ld-pruned with gnomAD v3:
gs://gnomad/sample_qc/ht/genomes_v3.1/ld_pruned_combined_variants.ht
- 259,482 variants pre-ld-pruning:
gs://gnomad/sample_qc/ht/genomes_v3.1/pre_ld_pruning_combined_variants_without_washu.ht
- If running from raw VCF calls (see the sketch after this list)
- Run split_multi_hts to maximize the number of usable bi-allelic sites
- Filter to PASS sites and SNVs
- Run variant_qc to get MAF
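A minimal sketch of building the high-quality hard-call subset from the pre-filtered calls (paths are placeholders; thresholds follow the values above):

```python
import hail as hl

mt = hl.read_matrix_table('gs://my-bucket/prefiltered.mt')

# Split multi-allelic sites into bi-allelic records
mt = hl.split_multi_hts(mt)

# Keep bi-allelic SNVs only
mt = mt.filter_rows(hl.is_snp(mt.alleles[0], mt.alleles[1]))

# Compute per-variant metrics, then keep common, well-called sites
mt = hl.variant_qc(mt)
mt = mt.filter_rows((mt.variant_qc.AF[1] > 0.001) &     # MAF > 0.1%
                    (mt.variant_qc.AF[1] < 0.999) &
                    (mt.variant_qc.call_rate > 0.99))   # call rate > 99%

# LD-prune with an r2 cutoff of 0.1 and keep only the pruned variants
pruned = hl.ld_prune(mt.GT, r2=0.1)
mt = mt.filter_rows(hl.is_defined(pruned[mt.row_key]))

mt.write('gs://my-bucket/hq_hardcalls.mt', overwrite=True)
```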
- GOAL: Remove samples that are contaminated or have poor sequencing levels
- Use pre-filtered dataset
- Plot the values below before applying the DEFAULT filters to ensure you are not throwing away large numbers of samples (see the sketch after this list)
- freemix contamination filtering (DEFAULT > 0.05)
- Chimeric read filtering (DEFAULT > 0.05)
- Call rate filtering (DEFAULT < 0.85)
- Mean Depth coverage filtering (DEFAULT < 15)
- Small insert size: Median insert size < 250bp
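A minimal sketch of the DEFAULT sample filters above, assuming the per-sample GATK/Picard metadata (freemix, chimeric read rate, mean depth, median insert size) has already been joined onto the matrix table columns as a seq_meta struct; that struct and its field names are placeholders:

```python
import hail as hl

mt = hl.read_matrix_table('gs://my-bucket/prefiltered.mt')
mt = hl.sample_qc(mt)

# seq_meta and its fields are placeholder column annotations assumed to have
# been joined from GATK/Picard metadata; call_rate comes from hl.sample_qc
keep = ((mt.seq_meta.freemix <= 0.05) &             # contamination
        (mt.seq_meta.pct_chimeras <= 0.05) &        # chimeric reads
        (mt.sample_qc.call_rate >= 0.85) &          # call rate
        (mt.seq_meta.mean_dp >= 15) &               # mean depth
        (mt.seq_meta.median_insert_size >= 250))    # small insert size
mt = mt.filter_cols(keep)
```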
- GOAL: remove samples where genotype sex does not equal reported sex
- Filter out variants within PAR coordinates (https://en.wikipedia.org/wiki/Pseudoautosomal_region)
- Reported males should have an X chromosome F-statistic from 0.8 to 1
- Reported females should have an X chromosome F-statistic from -0.2 to 0.4
- Remove samples with ambiguous imputed sex (X chromosome F-statistic between 0.4 and 0.8); see the sketch after this list
- Large-scale sex check errors are indicative of ID mismatching or upstream data mishandling
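A minimal sketch of the sex check using Hail's impute_sex, assuming reported sex is available as a placeholder column field (mt.pheno.reported_female):

```python
import hail as hl

mt = hl.read_matrix_table('gs://my-bucket/prefiltered.mt')

# Impute sex from the X-chromosome F-statistic; include_par=False excludes
# the pseudoautosomal regions, and the thresholds follow the ranges above
# (ambiguous samples, 0.4 < F < 0.8, come back with a missing is_female)
sex_ht = hl.impute_sex(mt.GT, aaf_threshold=0.01, include_par=False,
                       female_threshold=0.4, male_threshold=0.8)
mt = mt.annotate_cols(imputed_sex=sex_ht[mt.s])

# Flag mismatches against reported sex (mt.pheno.reported_female is a placeholder)
mismatch = (hl.is_defined(mt.imputed_sex.is_female) &
            (mt.imputed_sex.is_female != mt.pheno.reported_female))
ambiguous = hl.is_missing(mt.imputed_sex.is_female)
mt = mt.filter_cols(mismatch | ambiguous, keep=False)
```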
- GOAL: remove 1st and 2nd degree relatives from population based sample
- Use HQ hardcall dataset
- Consider which IBD algorithm to use (KING, PC-AiR, PC-relate)
- Which algorithm you use will depend on sample size, ancestry composition, and knowledge of family pedigrees in data
- Unsure? Start by estimating genetic relatedness with Hail's pc_relate (see the sketch after this list)
- Plot the proportion of 0 shared vs 1 shared alleles (IBD0 vs IBD1)
- IBD filtering on the PI-HAT value (DEFAULT > 0.2)
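A minimal sketch of relatedness filtering on the HQ hard-call subset, shown here with Hail's identity_by_descent and the DEFAULT PI-HAT cutoff (pc_relate follows the same pattern with a kinship threshold):

```python
import hail as hl

mt = hl.read_matrix_table('gs://my-bucket/hq_hardcalls.mt')

# PLINK-style IBD estimation: one row per sample pair with Z0, Z1, Z2, PI_HAT
ibd = hl.identity_by_descent(mt)

# Pairs of 1st/2nd degree relatives at the DEFAULT PI-HAT cutoff
related_pairs = ibd.filter(ibd.ibd.PI_HAT > 0.2)

# Pick samples to drop so that no related pairs remain
# (keep=False returns the complement of the maximal independent set)
to_remove = hl.maximal_independent_set(related_pairs.i, related_pairs.j, keep=False)
mt = mt.filter_cols(hl.is_defined(to_remove[mt.s]), keep=False)
```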
- GOAL: Determine general ancestry of cohort
- Use HQ hardcall dataset
- Consider which PCA strategy to use
- PCA comparing against self-identified ancestry information (PROS: easy if you have this info; CONS: dependent on the quality of self-ID info)
- PCA including reference datasets with known ancestry (PROS: good truth dataset; CONS: can be hard to harmonize variants)
- PCA using gnomAD PCs (PROS: no need to merge variants; CONS: shrinkage in PCs due to variant mismatch)
- Unsure? Start with Hail's hwe_normalized_pca (see the sketch after this list)
- Run with and without reference panel data (1KG or 1KG+HGDP genome reference panel)
- Create plots of PCs
- Case / control coloring
- Cohort coloring
- Assigning samples to a particular ancestry
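A minimal sketch of running hwe_normalized_pca on the HQ hard-call subset and exporting scores for plotting (paths are placeholders; the same call can be run on a dataset merged with a 1KG/HGDP reference panel):

```python
import hail as hl

mt = hl.read_matrix_table('gs://my-bucket/hq_hardcalls.mt')

# HWE-normalized PCA on the HQ hard-call genotypes
eigenvalues, scores, _ = hl.hwe_normalized_pca(mt.GT, k=10)

# Attach PC scores to the samples for downstream annotation/filtering
mt = mt.annotate_cols(pca=scores[mt.s])

# Flatten scores to PC1..PC10 and export for plotting
# (color by case/control status, cohort, or assigned ancestry)
scores_flat = scores.transmute(**{f'PC{i + 1}': scores.scores[i] for i in range(10)})
scores_flat.export('gs://my-bucket/pca_scores.tsv')
```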
- GOAL: match case and controls within a common genetic ancestry
- If retaining multiple ancestries, make sure to define ancestry groups in phenotype file
- PCA filtering (no DEFAULT filtering parameters)
- 2-dimensional centroid approach (using distance from the 1KG ancestry mean PCs; used by Chia-Yen Chen; see the sketch after this list)
- pair-matching cases to controls (R package: optmatch; R function: pairmatch(); used by Mitja Kurki)
- Within each ancestry group, re-run PCA
- Re-evaluate PC dispersion
- Add these PCs as covariates to phenotype file
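A minimal sketch of the 2-dimensional centroid idea on exported PC scores, assuming a table with PC1/PC2 columns and a reference-panel ancestry label that is missing for study samples; the column names and the distance cutoff are placeholders, not recommended values:

```python
import numpy as np
import pandas as pd

# Assumed columns: sample_id, PC1, PC2, ref_pop (e.g. 1KG superpopulation,
# missing for study samples); 'cutoff' is a project-specific choice
df = pd.read_csv('pca_scores_with_ref.tsv', sep='\t')
cutoff = 0.05

# Centroid (mean PC1/PC2) of each reference ancestry group
centroids = df.dropna(subset=['ref_pop']).groupby('ref_pop')[['PC1', 'PC2']].mean()

# Distance from each study sample to each centroid; assign the nearest one
study = df[df['ref_pop'].isna()].copy()
dists = pd.DataFrame(
    {pop: np.hypot(study['PC1'] - c['PC1'], study['PC2'] - c['PC2'])
     for pop, c in centroids.iterrows()},
    index=study.index)
study['assigned_pop'] = dists.idxmin(axis=1)
study['centroid_dist'] = dists.min(axis=1)

# Keep samples that fall close enough to their assigned ancestry centroid
study_kept = study[study['centroid_dist'] <= cutoff]
```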
- GOAL: remove samples within continental ancestry groupings that have unusual variant properties
- Examine variation in:
- TiTv ratio
- Het/Hom ratio
- Insertion/Deletion ratio
- Plots with colors defined by assigned ancestry
- Different ancestries have significant mean differences in these ratios
- Filter out within-cohort outliers (DEFAULT: > 4 standard deviations from the ancestry-specific mean; see the sketch after this list)
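A minimal sketch of ancestry-stratified outlier filtering on the sample QC metrics, assuming each sample carries a placeholder assigned_pop annotation from the PCA step:

```python
import hail as hl

mt = hl.read_matrix_table('gs://my-bucket/prefiltered.mt')
mt = hl.sample_qc(mt)

# Per-ancestry mean/stdev for each metric of interest
# (mt.assigned_pop is a placeholder annotation from the PCA step)
cols = mt.cols()
stats = cols.group_by(cols.assigned_pop).aggregate(
    ti_tv=hl.agg.stats(cols.sample_qc.r_ti_tv),
    het_hom=hl.agg.stats(cols.sample_qc.r_het_hom_var),
    ins_del=hl.agg.stats(cols.sample_qc.r_insertion_deletion))
mt = mt.annotate_cols(pop_stats=stats[mt.assigned_pop])

# Flag samples more than 4 SD from their ancestry-specific mean in any metric
def is_outlier(value, s):
    return hl.abs(value - s.mean) > 4 * s.stdev

mt = mt.filter_cols(
    is_outlier(mt.sample_qc.r_ti_tv, mt.pop_stats.ti_tv) |
    is_outlier(mt.sample_qc.r_het_hom_var, mt.pop_stats.het_hom) |
    is_outlier(mt.sample_qc.r_insertion_deletion, mt.pop_stats.ins_del),
    keep=False)
```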
- GOAL: Remove low quality/somatic variants
- Use pre-filtered VCF with outlier samples removed
- Filter variants with low call rate (DEFAULT < 95%)
- Split variant call rate by capture, case/control status, or other category
- Remove monoallelic variants: no alt or no ref calls
- Genotype filtering (set filtered genotypes to missing); see the sketch after this list
- Filter by Depth (DEFAULT < 10)
- Filter by GQ (DEFAULT < 25)
- HET calls: Filter by allele depth/balance (AB/AD; DEFAULT < 0.25)
- Additional pAB filter on higher depth (binomial test - DEFAULT p < 1e-9)
- HOMREF calls: (AB/AD; DEFAULT > 0.1)
- HOMALT calls: (AB/AD; DEFAULT < 0.9)
- Remove variants not in HWE (DEFAULT p < 1e-6)
- Generate sample QC metrics
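A minimal sketch of the genotype- and variant-level filters above at their DEFAULT thresholds; the higher-depth pAB binomial test and the stratified call-rate checks are left out for brevity:

```python
import hail as hl

mt = hl.read_matrix_table('gs://my-bucket/prefiltered.samples_qced.mt')

# Genotype-level filters: set failing genotypes to missing
ab = mt.AD[1] / hl.sum(mt.AD)   # alternate allele balance
keep_entry = (
    (mt.DP >= 10) &
    (mt.GQ >= 25) &
    (~mt.GT.is_het() | (ab >= 0.25)) &       # HET allele balance
    (~mt.GT.is_hom_ref() | (ab <= 0.1)) &    # HOMREF allele balance
    (~mt.GT.is_hom_var() | (ab >= 0.9)))     # HOMALT allele balance
mt = mt.filter_entries(keep_entry)

# Variant-level filters computed after genotype filtering
mt = hl.variant_qc(mt)
mt = mt.filter_rows(
    (mt.variant_qc.call_rate >= 0.95) &      # low call rate
    (mt.variant_qc.AC[1] > 0) &              # monoallelic: no alt calls
    (mt.variant_qc.AC[0] > 0) &              # monoallelic: no ref calls
    (mt.variant_qc.p_value_hwe >= 1e-6))     # HWE

# Re-generate sample QC metrics on the variant-QC'ed data
mt = hl.sample_qc(mt, name='sample_qc_postqc')
mt.write('gs://my-bucket/variant_qced.mt', overwrite=True)
```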
- GOAL: Determine if more stringent variant QC is needed
- Examine QC parameters across 3 filtering steps:
- Pre-filtered VCF
- Sample QC'ed VCF
- Variant QC'ed VCF
- QC parameters:
- Number of SNPs / Indels
- TiTv ratio
- Het/Hom ratio
- Ins/Del ratio
- Singleton synonymous rate
- Primary categories:
- Case/control (should be equal between groups)
- Cohort (should vary predictably)
- Determine whether additional variant filtering needs to be done
From GATK/Picard metadata
- Freemix Contamination
- Chimeric read percentage
From Hail Sample QC Annotation Table
Primary QC parameters
- callRate
- rTiTv
- rHetHomVar
- rInsertionDeletion
- nSingleton
- dpMean
Secondary QC parameters
- nCalled
- nNotCalled
- nHomRef
- nHet
- nHomVar
- nSNP
- nInsertion
- nDeletion
- nTransition
- nTransversion
- dpStDev
- gqMean
- gqStDev
- nNonRef
Variant Quality
- VQSLOD - Variant Quality Score in LOD space (or new RF posterior probabilities)
- pHWE - Hardy-Weinberg Equilibrium
- AC - allele count
- Median depth
- QD - quality by depth
Genotype Quality
- Depth
- PHRED likelihood (PL)
- Genotype Quality (GQ - same as PL in joint-called VCF)
- Allele Depth (AD)
Basic Requirements:
- A QC-ready annotated matrix table
- Samples annotated with PCs
Primary Hypotheses:
- Do cases have a higher overall burden of rare deleterious variants than controls?
- Does this burden decrease as allele frequency increases?
- Is the burden concentrated in genes with evidence of selective constraint?
- Does the burden effect attenuate outside of these constrained genes?
- Sanity check: Is the rate of rare variants similar in cases and controls?
- Highly significant differences suggest QC issues still persist
General Method:
- Aggregate per-individual variant counts for a selected annotation/allele-frequency class
- Test deleterious coding variant counts using logistic regression (see the sketch after this list)
- Include the first 10 PCs and sex as covariates
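A minimal sketch of the per-sample burden regression, assuming placeholder row annotations for consequence (csq) and gnomAD allele count (gnomad_ac) and placeholder column fields for phenotype and PCs; the regression itself is fit in statsmodels after exporting the per-sample counts:

```python
import hail as hl
import statsmodels.formula.api as smf

mt = hl.read_matrix_table('gs://my-bucket/variant_qced.mt')
mt = hl.variant_qc(mt)

# Qualifying variants: rare protein-truncating variants
# ('csq' and 'gnomad_ac' are placeholder row annotations from VEP/gnomAD;
#  variant_qc.AC[1] is the dataset alternate allele count)
ptv = hl.set(['frameshift_variant', 'splice_acceptor_variant',
              'splice_donor_variant', 'stop_gained'])
mt = mt.filter_rows(ptv.contains(mt.csq) &
                    ((mt.variant_qc.AC[1] + mt.gnomad_ac) < 5))

# Per-individual count of qualifying alternate alleles
mt = mt.annotate_cols(burden=hl.agg.sum(mt.GT.n_alt_alleles()))

# Export per-sample data and fit a logistic regression with sex + 10 PCs
# ('pheno.is_case', 'pheno.is_female', 'pca.scores' are placeholder column fields)
cols = mt.cols()
df = cols.select(
    burden=cols.burden,
    is_case=hl.int(cols.pheno.is_case),
    is_female=hl.int(cols.pheno.is_female),
    **{f'PC{i + 1}': cols.pca.scores[i] for i in range(10)}
).to_pandas()

covariates = ' + '.join(['is_female'] + [f'PC{i + 1}' for i in range(10)])
fit = smf.logit(f'is_case ~ burden + {covariates}', data=df).fit()
print(fit.summary())
```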
Comparisons:
- Stratify by cohort
- Require cases and controls within any cohort designation
- Stratify by commonly assessed variant annotations
- Genic / non-genic
- Conservation/constraint
- CADD severity prediction
- Damaging missense (PolyPhen damaging, SIFT deleterious, MPC > 2)
- Protein truncating variants (frameshift_variant, splice_acceptor_variant, splice_donor_variant, stop_gained)
- Stratify by Allele frequency
- Ultra-rare variation: non-gnomAD singletons
- Dataset + gnomAD AC < 5
- Doubleton distributions
Informative Graphics:
- Forest plots of Odds ratio, 95% CI, and p-value
Primary Hypotheses:
- Do any genes associate with our phenotype after correcting for multiple testing?
- Is there inflation of the median test statistic (lambda)?
- Inflation suggests that there is underlying population stratification/QC not accounted for
- Can also mean a polygenic signal in well-powered cohorts
General Method:
- Aggregate per-individual variant counts within each gene for a selected annotation/allele-frequency class
- Test the per-gene variant count using logistic regression (see the sketch after this list)
- Include the first 10 PCs and sex as covariates
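A minimal sketch of a per-gene burden test in Hail, collapsing qualifying variants by gene and running logistic regression with covariates; gene_symbol, the phenotype fields, and the PC annotation are placeholders:

```python
import hail as hl

mt = hl.read_matrix_table('gs://my-bucket/variant_qced.rare_deleterious.mt')

# Collapse qualifying variants to a per-sample alt-allele count within each gene
# ('gene_symbol' is a placeholder row annotation from VEP)
gene_mt = (mt.group_rows_by(gene=mt.gene_symbol)
             .aggregate(n_alt=hl.agg.sum(mt.GT.n_alt_alleles())))

# Logistic (Wald) regression of case status on the per-gene count,
# with intercept, sex, and the first 10 PCs as covariates
# ('pheno.is_case', 'pheno.is_female', 'pca.scores' are placeholder column fields)
results = hl.logistic_regression_rows(
    test='wald',
    y=hl.float(gene_mt.pheno.is_case),
    x=hl.float(gene_mt.n_alt),
    covariates=[1.0, hl.float(gene_mt.pheno.is_female)] +
               [gene_mt.pca.scores[i] for i in range(10)])

results.order_by(results.p_value).show(20)
```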
Informative Graphics:
- QQ plots of expected/actual p-values example .ipynb
- Gene-level Manhattan plot example .R function
- Forest plots of Odds ratio, 95% CI, and p-value
- Example Case/Control WES Portal: SCHEMA
Fisher's Exact test
- Separate individuals into discrete binary categories (carrier / non-carrier)
- fisher.test() in R
- in Hail: http://discuss.hail.is/t/simple-rare-variant-burden-testing-with-fisher-exact-test/210 (see the sketch below)
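A minimal sketch of a per-gene carrier Fisher's exact test with hl.fisher_exact_test, assuming a gene-collapsed matrix table like the one in the gene-based sketch above and a placeholder pheno.is_case column field:

```python
import hail as hl

# gene_mt: rows = genes, entries = per-sample qualifying allele counts (n_alt),
# as in the gene-based sketch above; pheno.is_case is a placeholder field
gene_mt = hl.read_matrix_table('gs://my-bucket/gene_burden.mt')

# 2x2 carrier counts per gene: carrier/non-carrier x case/control
counts = gene_mt.annotate_rows(
    case_carriers=hl.agg.count_where(gene_mt.pheno.is_case & (gene_mt.n_alt > 0)),
    case_noncarriers=hl.agg.count_where(gene_mt.pheno.is_case & (gene_mt.n_alt == 0)),
    ctrl_carriers=hl.agg.count_where(~gene_mt.pheno.is_case & (gene_mt.n_alt > 0)),
    ctrl_noncarriers=hl.agg.count_where(~gene_mt.pheno.is_case & (gene_mt.n_alt == 0))
).rows()

# Fisher's exact test on the 2x2 table for each gene
results = counts.annotate(
    fet=hl.fisher_exact_test(hl.int32(counts.case_carriers),
                             hl.int32(counts.case_noncarriers),
                             hl.int32(counts.ctrl_carriers),
                             hl.int32(counts.ctrl_noncarriers)))
results.order_by(results.fet.p_value).show(20)
```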
Poisson rate test
- Compares variant counts when the mean and variance are equal (generally for de-novo / ultra-rare variants only)
- poisson.test() in R
Logistic (or Firth regression) in Hail
- Predicts case/control status by allele frequency and allows for covariates
- Requires some asymptotic assumptions, which can be unmet at low allele counts (< 20)
Mixed-model association with saddle-point approximation (SPA)
- Controls for an imbalanced case/control ratio
- Both individual variant and gene-based tests (using SKAT) are available
Kernel association - SKAT test
- R implementation: SKAT
- Random effects model with covariates
- Handles alleles at various frequencies
Example of logistic and Fisher's exact tests using Hail (courtesy of Mitja Kurki): Hail Topic
R packages for rare variant testing
- A walkthrough of various gene-based tests, strategies, and Hail commands for testing rare variant burden