This repo provides a template for processing WGS data in Hail
- README: Considerations of WGS data and QC pipeline steps
- Resources: Links and other things relevant to WGS pipelines
- Modules: Hail script templates for each pipeline step
Relevant gnomAD resources:
- gnomAD resources: gs://gcp-public-data--gnomad/
- gnomad readthedocs
- gnomAD v2.1
- gnomAD v3.0
All of Us QC report (Spring 2023)
File size:
- WGS data is orders of magnitude larger than GWAS array data (fixed variant content) and exome capture sequencing (~1% of the genome)
- File size will scale with sample size
- Strategies to deal with large files (see the sketch after this list):
- Use Hail
- Run everything on the cloud
- Re-partitioning the Hail matrix table
- Use a Sparse matrix table
- Use a hail VDS (variant dataset), which retains only non-ref calls
- GVS file format: a Google BigQuery solution to save space, given the size of joint-called VCFs
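A minimal Hail sketch of two of these strategies (re-partitioning a matrix table and reading a VDS); the bucket paths and partition count below are placeholders:

```python
import hail as hl

hl.init()  # on the cloud, point this at the appropriate backend/billing project

# Re-partition a dense matrix table so downstream steps parallelize well
# (the path and partition count are placeholders)
mt = hl.read_matrix_table('gs://my-bucket/wgs_dataset.mt')
mt = mt.repartition(5000)
mt.write('gs://my-bucket/wgs_dataset.repart.mt', overwrite=True)

# Read a sparse VDS (variant dataset), which stores only non-ref calls
vds = hl.vds.read_vds('gs://my-bucket/wgs_dataset.vds')
```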
File formats:
- Per-sample gVCFs: VCF format data for each sample, where all variant sites and reference blocks are listed for all contigs
- Ref-blocks: Summary depth/genotype quality information in the gVCF across a genomic interval where the sample has no variant sites
Common and rare variation:
- WGS data contains SNP/indel calls across the allele frequency spectrum, whereas array data is largely limited to common variants (MAF > 0.1%)
- Low coverage WGS (< 10x) aided by imputation
- High coverage WGS (> 20x) robust across all SNP/indels
- Multiple levels of analysis
- Pulling out common variants to include in GWAS meta-analysis (GWAMA)
- Pulling out rare coding variants to include in Exome meta-analysis
- Looking at rare non-coding variation (good luck!)
Repetitive regions:
- WGS contains challenging genomic regions often ignored in array and exome capture sequencing
- Regions with highly repetitive sequence that are difficult for short-read sequencing to align properly to the reference genome
- Telomeres/centromeres of each chromosome
- Segmental duplications - large repetitive chunks
- Low complexity regions (LCRs)
- These regions are often excluded early on in an analysis
- Dataset QC prior to manipulating VCF
- Dataset pre-filtering
- High quality hard call subset of the data
- Outlier sample QC part 1
- Sex check sample QC
- Identity-by-descent filtering
- Principal components analysis
- Principal components filtering
- Outlier sample QC part 2
- Variant QC
- Assessing variant QC
- GOAL: Understanding the project/phenotype data you are working with
- List and understand all sample phenotypes provided
- Match up phenotype file IDs with genetic data IDs
- Resolve any inconsistencies before moving forward
- Start a spreadsheet/table of datasets and sequencing platforms
- Know which genome reference your sequence data is mapped to (most WGS data is hg38 / GRCh38)
- Determine whether you are using a dense or sparse matrix table
- Look over the VCF meta-data at the top of the VCF file
- Read the VCF and phenotype/sample file into Hail (see the sketch after this list)
- Generate sample QC metrics from the raw VCF
- Write up paragraph of sample collection and descriptives
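A minimal sketch of reading the data into Hail, matching phenotype IDs to genetic IDs, and generating raw sample QC metrics; the file paths and the 'sample_id' phenotype column are placeholder names:

```python
import hail as hl

hl.init()

# Import the joint-called VCF (placeholder path; most WGS data is GRCh38)
mt = hl.import_vcf('gs://my-bucket/raw_callset.vcf.bgz',
                   reference_genome='GRCh38', force_bgz=True)

# Import the phenotype/sample file and key it by the sample ID column
# ('sample_id' is a placeholder column name)
pheno = hl.import_table('gs://my-bucket/phenotypes.tsv',
                        impute=True, key='sample_id')

# Match phenotype IDs to genetic data IDs; unmatched samples get missing values,
# which is worth checking explicitly before moving forward
mt = mt.annotate_cols(pheno=pheno[mt.s])
n_unmatched = mt.aggregate_cols(hl.agg.count_where(hl.is_missing(mt.pheno)))
print(f'samples without phenotype records: {n_unmatched}')

# Generate per-sample QC metrics (call rate, n_snp, n_singleton, r_het_hom_var, ...)
mt = hl.sample_qc(mt)
mt.write('gs://my-bucket/raw_callset.qc.mt', overwrite=True)
```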
Categorical variables often analyzed in datasets
- Cohort
- Sequencing wave
- C-Project (Broad specific)
- Sequencing Plate
- Sex
- Affection status
- Continental ancestry groupings / reported ancestry
Quantitative parameters often analyzed in datasets
- Age
- Top genetic principal components
- Number of variant sites (n_snps)
- Singleton rate (n_singleton)
- Het / hom variant ratio (r_het_hom_var)
- GOAL: Remove variants that are highly unlikely to be analyzed (see the sketch after this list)
- Remove variants that fail VQSR (or AS-VQSR)
- Remove variants in telomeres / centromeres
- Remove variants in low-complexity regions
- Remove variants in segmental duplication regions
- Generate sample QC metrics from pre-filtered VCF
- VEP annotate remaining sites
- Usually just need gene name and consequence
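A minimal sketch of this pre-filtering step, assuming VQSR status is carried in the VCF filters field and that telomere/centromere, LCR, and segmental-duplication intervals are available as BED files (placeholder paths):

```python
import hail as hl

mt = hl.read_matrix_table('gs://my-bucket/raw_callset.qc.mt')

# Keep only variants that PASS VQSR / AS-VQSR (empty 'filters' set)
mt = mt.filter_rows(hl.len(mt.filters) == 0)

# Remove variants in telomeres/centromeres, LCRs, and segmental duplications
# (BED paths are placeholders; gnomAD publishes similar interval resources)
for bed_path in ['gs://my-bucket/telomeres_centromeres.bed',
                 'gs://my-bucket/lcr_regions.bed',
                 'gs://my-bucket/segdup_regions.bed']:
    bad = hl.import_bed(bed_path, reference_genome='GRCh38')
    mt = mt.filter_rows(hl.is_defined(bad[mt.locus]), keep=False)

# Re-generate sample QC metrics on the pre-filtered data
mt = hl.sample_qc(mt, name='sample_qc_prefiltered')

# VEP annotate remaining sites (requires a VEP-enabled Hail environment and may
# need a config argument); usually only gene name and consequence are needed
mt = hl.vep(mt)

mt.write('gs://my-bucket/prefiltered.mt', overwrite=True)
```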
- GOAL: Use a smaller subset of variants (i.e. 50-500k variants) to analyze relatedness (IBD) and population ancestry (PCA)
- Bi-allelic SNVs only
- Common variants (MAF > 0.1%)
- Call rate > 99%
- LD-pruned with a cutoff of r2 = 0.1
- PROTIP: Use gnomAD / CCDG hg38 PCA variant list
- HQ calls selected from large multi-ancestry WGS cohort
- 224,591 variants ld-pruned with gnomAD v3:
gs://gnomad/sample_qc/ht/genomes_v3.1/ld_pruned_combined_variants.ht
- 259,482 variants pre-ld-pruning:
gs://gnomad/sample_qc/ht/genomes_v3.1/pre_ld_pruning_combined_variants_without_washu.ht
- If running from raw VCF calls (see the sketch after this list)
- Run split_multi_hts to maximize the number of usable bi-allelic sites
- Filter to PASS sites and SNVs
- Run variant_qc to get MAF
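A minimal sketch of building the high-quality hard-call subset from the pre-filtered calls (paths are placeholders; thresholds follow the values above):

```python
import hail as hl

mt = hl.read_matrix_table('gs://my-bucket/prefiltered.mt')

# Split multi-allelic sites into bi-allelic records
mt = hl.split_multi_hts(mt)

# Keep bi-allelic SNVs only
mt = mt.filter_rows(hl.is_snp(mt.alleles[0], mt.alleles[1]))

# Compute per-variant metrics, then keep common, well-called sites
mt = hl.variant_qc(mt)
mt = mt.filter_rows((mt.variant_qc.AF[1] > 0.001) &     # MAF > 0.1%
                    (mt.variant_qc.AF[1] < 0.999) &
                    (mt.variant_qc.call_rate > 0.99))   # call rate > 99%

# LD-prune with an r2 cutoff of 0.1 and keep only the pruned variants
pruned = hl.ld_prune(mt.GT, r2=0.1)
mt = mt.filter_rows(hl.is_defined(pruned[mt.row_key]))

mt.write('gs://my-bucket/hq_hardcalls.mt', overwrite=True)
```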
- GOAL: Remove samples that are contaminated or have poor sequencing levels
- Use pre-filtered dataset
- Plot the values below before applying the DEFAULT filters to ensure you are not throwing away large numbers of samples (see the sketch after this list)
- freemix contamination filtering (DEFAULT > 0.05)
- Chimeric read filtering (DEFAULT > 0.05)
- Call rate filtering (DEFAULT < 0.85)
- Mean Depth coverage filtering (DEFAULT < 15)
- Small insert size: Median insert size < 250bp
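A minimal sketch of the DEFAULT sample filters above, assuming the per-sample GATK/Picard metadata (freemix, chimeric read rate, mean depth, median insert size) has already been joined onto the matrix table columns as a seq_meta struct; that struct and its field names are placeholders:

```python
import hail as hl

mt = hl.read_matrix_table('gs://my-bucket/prefiltered.mt')
mt = hl.sample_qc(mt)

# seq_meta and its fields are placeholder column annotations assumed to have
# been joined from GATK/Picard metadata; call_rate comes from hl.sample_qc
keep = ((mt.seq_meta.freemix <= 0.05) &             # contamination
        (mt.seq_meta.pct_chimeras <= 0.05) &        # chimeric reads
        (mt.sample_qc.call_rate >= 0.85) &          # call rate
        (mt.seq_meta.mean_dp >= 15) &               # mean depth
        (mt.seq_meta.median_insert_size >= 250))    # small insert size
mt = mt.filter_cols(keep)
```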
- GOAL: remove samples where genotype sex does not equal reported sex
- Filter out variants within PAR coordinates (https://en.wikipedia.org/wiki/Pseudoautosomal_region)
- Reported males should have an X chromosome F-statistic from 0.8 to 1
- Reported females should have an X chromosome F-statistic from -0.2 to 0.4
- Remove samples with ambiguous imputed sex (X chromosome F-statistic between 0.4 and 0.8); see the sketch after this list
- Large-scale sex check errors are indicative of ID mismatching or upstream data mishandling
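A minimal sketch of the sex check using Hail's impute_sex, assuming reported sex is available as a placeholder column field (mt.pheno.reported_female):

```python
import hail as hl

mt = hl.read_matrix_table('gs://my-bucket/prefiltered.mt')

# Impute sex from the X-chromosome F-statistic; include_par=False excludes
# the pseudoautosomal regions, and the thresholds follow the ranges above
# (ambiguous samples, 0.4 < F < 0.8, come back with a missing is_female)
sex_ht = hl.impute_sex(mt.GT, aaf_threshold=0.01, include_par=False,
                       female_threshold=0.4, male_threshold=0.8)
mt = mt.annotate_cols(imputed_sex=sex_ht[mt.s])

# Flag mismatches against reported sex (mt.pheno.reported_female is a placeholder)
mismatch = (hl.is_defined(mt.imputed_sex.is_female) &
            (mt.imputed_sex.is_female != mt.pheno.reported_female))
ambiguous = hl.is_missing(mt.imputed_sex.is_female)
mt = mt.filter_cols(mismatch | ambiguous, keep=False)
```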
- GOAL: remove 1st and 2nd degree relatives from population based sample
- Use HQ hardcall dataset
- Consider which IBD algorithm to use (KING, PC-AiR, PC-relate)
- Which algorithm you use will depend on sample size, ancestry composition, and knowledge of family pedigrees in data
- Unsure? Start by estimating genetic relatedness with Hail's pc_relate (see the sketch after this list)
- Plot the proportion of 0 shared vs 1 shared alleles (IBD0 vs IBD1)
- IBD filtering on the PI-HAT value (DEFAULT > 0.2)
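A minimal sketch of relatedness filtering on the HQ hard-call subset, shown here with Hail's identity_by_descent and the DEFAULT PI-HAT cutoff (pc_relate follows the same pattern with a kinship threshold):

```python
import hail as hl

mt = hl.read_matrix_table('gs://my-bucket/hq_hardcalls.mt')

# PLINK-style IBD estimation: one row per sample pair with Z0, Z1, Z2, PI_HAT
ibd = hl.identity_by_descent(mt)

# Pairs of 1st/2nd degree relatives at the DEFAULT PI-HAT cutoff
related_pairs = ibd.filter(ibd.ibd.PI_HAT > 0.2)

# Pick samples to drop so that no related pairs remain
# (keep=False returns the complement of the maximal independent set)
to_remove = hl.maximal_independent_set(related_pairs.i, related_pairs.j, keep=False)
mt = mt.filter_cols(hl.is_defined(to_remove[mt.s]), keep=False)
```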
- GOAL: Determine general ancestry of cohort
- Use HQ hardcall dataset
- Consider which PCA strategy to use
- PCA comparing against self-identified ancestry information (PROS: easy if you have this info; CONS: dependent on the quality of self-ID info)
- PCA including reference datasets with known ancestry (PROS: good truth dataset; CONS: can be hard to harmonize variants)
- PCA using gnomAD PCs (PROS: no need to merge variants; CONS: shrinkage in PCs due to variant mismatch)
- Unsure? Start with Hail's hwe_normalized_pca (see the sketch after this list)
- Run with and without reference panel data (1KG or 1KG+HGDP genome reference panel)
- Create plots of PCs
- Case / control coloring
- Cohort coloring
- Assigning samples to a particular ancestry
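A minimal sketch of running hwe_normalized_pca on the HQ hard-call subset and exporting scores for plotting (paths are placeholders; the same call can be run on a dataset merged with a 1KG/HGDP reference panel):

```python
import hail as hl

mt = hl.read_matrix_table('gs://my-bucket/hq_hardcalls.mt')

# HWE-normalized PCA on the HQ hard-call genotypes
eigenvalues, scores, _ = hl.hwe_normalized_pca(mt.GT, k=10)

# Attach PC scores to the samples for downstream annotation/filtering
mt = mt.annotate_cols(pca=scores[mt.s])

# Flatten scores to PC1..PC10 and export for plotting
# (color by case/control status, cohort, or assigned ancestry)
scores_flat = scores.transmute(**{f'PC{i + 1}': scores.scores[i] for i in range(10)})
scores_flat.export('gs://my-bucket/pca_scores.tsv')
```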
- GOAL: match case and controls within a common genetic ancestry
- If retaining multiple ancestries, make sure to define ancestry groups in phenotype file
- PCA filtering (no DEFAULT filtering parameters)
- 2-dimensional centroid approach (using distance from the 1KG ancestry mean PCs; used by Chia-Yen Chen; see the sketch after this list)
- pair-matching cases to controls (R package: optmatch; R function: pairmatch(); used by Mitja Kurki)
- Within each ancestry group, re-run PCA
- Re-evaluate PC dispersion
- Add these PCs as covariates to phenotype file
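A minimal sketch of the 2-dimensional centroid idea on exported PC scores, assuming a table with PC1/PC2 columns and a reference-panel ancestry label that is missing for study samples; the column names and the distance cutoff are placeholders, not recommended values:

```python
import numpy as np
import pandas as pd

# Assumed columns: sample_id, PC1, PC2, ref_pop (e.g. 1KG superpopulation,
# missing for study samples); 'cutoff' is a project-specific choice
df = pd.read_csv('pca_scores_with_ref.tsv', sep='\t')
cutoff = 0.05

# Centroid (mean PC1/PC2) of each reference ancestry group
centroids = df.dropna(subset=['ref_pop']).groupby('ref_pop')[['PC1', 'PC2']].mean()

# Distance from each study sample to each centroid; assign the nearest one
study = df[df['ref_pop'].isna()].copy()
dists = pd.DataFrame(
    {pop: np.hypot(study['PC1'] - c['PC1'], study['PC2'] - c['PC2'])
     for pop, c in centroids.iterrows()},
    index=study.index)
study['assigned_pop'] = dists.idxmin(axis=1)
study['centroid_dist'] = dists.min(axis=1)

# Keep samples that fall close enough to their assigned ancestry centroid
study_kept = study[study['centroid_dist'] <= cutoff]
```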
- GOAL: remove samples within continental ancestry groupings that have unusual variant properties
- Examine variation in:
- TiTv ratio
- Het/Hom ratio
- Insertion/Deletion ratio
- Plots with colors defined by assigned ancestry
- Different ancestries have significant mean differences in these ratios
- Filter out within-cohort outliers (DEFAULT: > 4 standard deviations from the ancestry-specific mean; see the sketch after this list)
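A minimal sketch of ancestry-stratified outlier filtering on the sample QC metrics, assuming each sample carries a placeholder assigned_pop annotation from the PCA step:

```python
import hail as hl

mt = hl.read_matrix_table('gs://my-bucket/prefiltered.mt')
mt = hl.sample_qc(mt)

# Per-ancestry mean/stdev for each metric of interest
# (mt.assigned_pop is a placeholder annotation from the PCA step)
cols = mt.cols()
stats = cols.group_by(cols.assigned_pop).aggregate(
    ti_tv=hl.agg.stats(cols.sample_qc.r_ti_tv),
    het_hom=hl.agg.stats(cols.sample_qc.r_het_hom_var),
    ins_del=hl.agg.stats(cols.sample_qc.r_insertion_deletion))
mt = mt.annotate_cols(pop_stats=stats[mt.assigned_pop])

# Flag samples more than 4 SD from their ancestry-specific mean in any metric
def is_outlier(value, s):
    return hl.abs(value - s.mean) > 4 * s.stdev

mt = mt.filter_cols(
    is_outlier(mt.sample_qc.r_ti_tv, mt.pop_stats.ti_tv) |
    is_outlier(mt.sample_qc.r_het_hom_var, mt.pop_stats.het_hom) |
    is_outlier(mt.sample_qc.r_insertion_deletion, mt.pop_stats.ins_del),
    keep=False)
```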
- GOAL: Remove low quality/somatic variants
- Use pre-filtered VCF with outlier samples removed
- Filter variants with low call rate (DEFAULT < 95%)
- Split variant call rate by capture, case/control status, or other category
- Remove monoallelic variants: no alt or no ref calls
- Genotype filtering (set filtered genotypes to missing); see the sketch after this list
- Filter by Depth (DEFAULT < 10)
- Filter by GQ (DEFAULT < 25)
- HET calls: Filter by allele depth/balance (AB/AD; DEFAULT < 0.25)
- Additional pAB filter on higher depth (binomial test - DEFAULT p < 1e-9)
- HOMREF calls: (AB/AD; DEFAULT > 0.1)
- HOMALT calls: (AB/AD; DEFAULT < 0.9)
- Remove variants not in HWE (DEFAULT p < 1e-6)
- Generate sample QC metrics
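A minimal sketch of the genotype- and variant-level filters above at their DEFAULT thresholds; the higher-depth pAB binomial test and the stratified call-rate checks are left out for brevity:

```python
import hail as hl

mt = hl.read_matrix_table('gs://my-bucket/prefiltered.samples_qced.mt')

# Genotype-level filters: set failing genotypes to missing
ab = mt.AD[1] / hl.sum(mt.AD)   # alternate allele balance
keep_entry = (
    (mt.DP >= 10) &
    (mt.GQ >= 25) &
    (~mt.GT.is_het() | (ab >= 0.25)) &       # HET allele balance
    (~mt.GT.is_hom_ref() | (ab <= 0.1)) &    # HOMREF allele balance
    (~mt.GT.is_hom_var() | (ab >= 0.9)))     # HOMALT allele balance
mt = mt.filter_entries(keep_entry)

# Variant-level filters computed after genotype filtering
mt = hl.variant_qc(mt)
mt = mt.filter_rows(
    (mt.variant_qc.call_rate >= 0.95) &      # low call rate
    (mt.variant_qc.AC[1] > 0) &              # monoallelic: no alt calls
    (mt.variant_qc.AC[0] > 0) &              # monoallelic: no ref calls
    (mt.variant_qc.p_value_hwe >= 1e-6))     # HWE

# Re-generate sample QC metrics on the variant-QC'ed data
mt = hl.sample_qc(mt, name='sample_qc_postqc')
mt.write('gs://my-bucket/variant_qced.mt', overwrite=True)
```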
- GOAL: Determine if more stringent variant QC is needed
- Examine QC parameters across 3 filtering steps:
- Pre-filtered VCF
- Sample QC'ed VCF
- Variant QC'ed VCF
- QC parameters:
- Number of SNPs / Indels
- TiTv ratio
- Het/Hom ratio
- Ins/Del ratio
- Singleton synonymous rate
- Primary categories:
- Case/control (should be equal between groups)
- Cohort (should vary predictably)
- Determine whether additional variant filtering needs to be done
From GATK/Picard metadata
- Freemix Contamination
- Chimeric read percentage
From Hail Sample QC Annotation Table
Primary QC parameters
- callRate
- rTiTv
- rHetHomVar
- rInsertionDeletion
- nSingleton
- dpMean
Secondary QC parameters
- nCalled
- nNotCalled
- nHomRef
- nHet
- nHomVar
- nSNP
- nInsertion
- nDeletion
- nTransition
- nTransversion
- dpStDev
- gqMean
- gqStDev
- nNonRef
Variant Quality
- VQSLOD - Variant Quality Score in LOD space (or new RF posterior probabilities)
- pHWE - Hardy-Weinberg Equilibrium
- AC - allele count
- Median depth
- QD - quality by depth
Genotype Quality
- Depth
- PHRED likelihood (PL)
- Genotype Quality (GQ - same as PL in joint-called VCF)
- Allele Depth (AD)
Basic Requirements:
- A QC-ready annotated matrix table
- Samples annotated with PCs
Primary Hypotheses:
- Do cases have a higher overall burden of rare deleterious variants than controls?
- Does this burden decrease as allele frequency increases?
- Is the burden concentrated in genes with evidence of selective constraint?
- Does the burden effect attenuate outside of these constrained genes?
- Sanity check: Is the rate of rare variants similar in cases and controls?
- Highly significant differences suggest QC issues still persist
General Method:
- Aggregate per-individual variant counts for a selected annotation/allele-frequency class
- Test deleterious coding variant counts using logistic regression (see the sketch after this list)
- Include the first 10 PCs and sex as covariates
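A minimal sketch of the per-sample burden regression, assuming placeholder row annotations for consequence (csq) and gnomAD allele count (gnomad_ac) and placeholder column fields for phenotype and PCs; the regression itself is fit in statsmodels after exporting the per-sample counts:

```python
import hail as hl
import statsmodels.formula.api as smf

mt = hl.read_matrix_table('gs://my-bucket/variant_qced.mt')
mt = hl.variant_qc(mt)

# Qualifying variants: rare protein-truncating variants
# ('csq' and 'gnomad_ac' are placeholder row annotations from VEP/gnomAD;
#  variant_qc.AC[1] is the dataset alternate allele count)
ptv = hl.set(['frameshift_variant', 'splice_acceptor_variant',
              'splice_donor_variant', 'stop_gained'])
mt = mt.filter_rows(ptv.contains(mt.csq) &
                    ((mt.variant_qc.AC[1] + mt.gnomad_ac) < 5))

# Per-individual count of qualifying alternate alleles
mt = mt.annotate_cols(burden=hl.agg.sum(mt.GT.n_alt_alleles()))

# Export per-sample data and fit a logistic regression with sex + 10 PCs
# ('pheno.is_case', 'pheno.is_female', 'pca.scores' are placeholder column fields)
cols = mt.cols()
df = cols.select(
    burden=cols.burden,
    is_case=hl.int(cols.pheno.is_case),
    is_female=hl.int(cols.pheno.is_female),
    **{f'PC{i + 1}': cols.pca.scores[i] for i in range(10)}
).to_pandas()

covariates = ' + '.join(['is_female'] + [f'PC{i + 1}' for i in range(10)])
fit = smf.logit(f'is_case ~ burden + {covariates}', data=df).fit()
print(fit.summary())
```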
Comparisons:
- Stratify by cohort
- Require cases and controls within any cohort designation
- Stratify by commonly assessed variant annotations
- Genic / non-genic
- Conservation/constraint
- CADD severity prediction
- Damaging missense (PolyPhen damaging, SIFT deleterious, MPC > 2)
- Protein truncating variants (frameshift_variant, splice_acceptor_variant, splice_donor_variant, stop_gained)
- Stratify by Allele frequency
- Ultra-rare variation: non-gnomAD singletons
- Dataset + gnomAD AC < 5
- Doubleton distributions
Informative Graphics:
- Forest plots of Odds ratio, 95% CI, and p-value
Primary Hypotheses:
- Do any genes associate with our phenotype after correcting for multiple testing?
- Is there inflation of the median test statistic (lambda)?
- Inflation suggests that there is underlying population stratification/QC not accounted for
- Can also mean a polygenic signal in well-powered cohorts
General Method:
- Aggregate per-individual variant counts within each gene for a selected annotation/allele-frequency class
- Test the per-gene variant count using logistic regression (see the sketch after this list)
- Include the first 10 PCs and sex as covariates
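A minimal sketch of a per-gene burden test in Hail, collapsing qualifying variants by gene and running logistic regression with covariates; gene_symbol, the phenotype fields, and the PC annotation are placeholders:

```python
import hail as hl

mt = hl.read_matrix_table('gs://my-bucket/variant_qced.rare_deleterious.mt')

# Collapse qualifying variants to a per-sample alt-allele count within each gene
# ('gene_symbol' is a placeholder row annotation from VEP)
gene_mt = (mt.group_rows_by(gene=mt.gene_symbol)
             .aggregate(n_alt=hl.agg.sum(mt.GT.n_alt_alleles())))

# Logistic (Wald) regression of case status on the per-gene count,
# with intercept, sex, and the first 10 PCs as covariates
# ('pheno.is_case', 'pheno.is_female', 'pca.scores' are placeholder column fields)
results = hl.logistic_regression_rows(
    test='wald',
    y=hl.float(gene_mt.pheno.is_case),
    x=hl.float(gene_mt.n_alt),
    covariates=[1.0, hl.float(gene_mt.pheno.is_female)] +
               [gene_mt.pca.scores[i] for i in range(10)])

results.order_by(results.p_value).show(20)
```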
Informative Graphics:
- QQ plots of expected/actual p-values example .ipynb
- Gene-level Manhattan plot example .R function
- Forest plots of Odds ratio, 95% CI, and p-value
- Example Case/Control WES Portal: SCHEMA
Fisher's Exact test
- Separate individuals into discrete binary categories (carrier / non-carrier)
- fisher.test() in R
- in Hail: http://discuss.hail.is/t/simple-rare-variant-burden-testing-with-fisher-exact-test/210 (see the sketch below)
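A minimal sketch of a per-gene carrier Fisher's exact test with hl.fisher_exact_test, assuming a gene-collapsed matrix table like the one in the gene-based sketch above and a placeholder pheno.is_case column field:

```python
import hail as hl

# gene_mt: rows = genes, entries = per-sample qualifying allele counts (n_alt),
# as in the gene-based sketch above; pheno.is_case is a placeholder field
gene_mt = hl.read_matrix_table('gs://my-bucket/gene_burden.mt')

# 2x2 carrier counts per gene: carrier/non-carrier x case/control
counts = gene_mt.annotate_rows(
    case_carriers=hl.agg.count_where(gene_mt.pheno.is_case & (gene_mt.n_alt > 0)),
    case_noncarriers=hl.agg.count_where(gene_mt.pheno.is_case & (gene_mt.n_alt == 0)),
    ctrl_carriers=hl.agg.count_where(~gene_mt.pheno.is_case & (gene_mt.n_alt > 0)),
    ctrl_noncarriers=hl.agg.count_where(~gene_mt.pheno.is_case & (gene_mt.n_alt == 0))
).rows()

# Fisher's exact test on the 2x2 table for each gene
results = counts.annotate(
    fet=hl.fisher_exact_test(hl.int32(counts.case_carriers),
                             hl.int32(counts.case_noncarriers),
                             hl.int32(counts.ctrl_carriers),
                             hl.int32(counts.ctrl_noncarriers)))
results.order_by(results.fet.p_value).show(20)
```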
Poisson rate test
- Compares variant counts when the mean and variance are equal (generally for de-novo / ultra-rare variants only)
- poisson.test() in R
Logistic (or Firth regression) in Hail
- Predicts case/control status by allele frequency and allows for covariates
- Requires some asymptotic assumptions, which can be unmet at low allele counts (< 20)
Mixed-model association with saddle-point approximation (SPA)
- Controls for an imbalanced case/control ratio
- Both individual variant and gene-based tests (using SKAT) are available
Kernel association - SKAT test
- R implementation: SKAT
- Random effects model with covariates
- Handles alleles at various frequencies
Example of logistic and Fisher's exact tests using Hail (courtesy of Mitja Kurki): Hail Topic
R packages for rare variant testing
- A walkthrough of various gene-based tests, strategies, and Hail commands for testing rare variant burden