BAF from BAM #18

Closed
mwalker174 opened this issue Jul 1, 2020 · 3 comments

@mwalker174
Collaborator

BAF generation could be improved by using GATK's CollectAllelicCounts on sample BAMs rather than on gVCFs/VCFs, which are not only more costly but also operationally more complicated (e.g., see #8). I propose collecting only at gnomAD SNP sites with >0.1% AF.
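
A minimal sketch of how such a sites list might be derived with pysam (the file paths, the biallelic-SNP restriction, and reliance on the gnomAD `AF` INFO field are illustrative assumptions, not a committed design):

```python
# Hypothetical sketch: filter a gnomAD sites VCF down to common biallelic
# SNPs (AF > 0.1%) for use as a CollectAllelicCounts sites list.
# Paths and the "AF" INFO key are assumptions for illustration.
import pysam

MIN_AF = 0.001  # 0.1% allele frequency threshold

with pysam.VariantFile("gnomad.sites.vcf.gz") as vcf_in, \
        pysam.VariantFile("common_snps.sites.vcf", "w",
                          header=vcf_in.header) as vcf_out:
    for rec in vcf_in:
        alts = rec.alts or ()
        # Keep biallelic SNPs only (single REF base, single ALT base)
        if len(alts) != 1 or len(rec.ref) != 1 or len(alts[0]) != 1:
            continue
        af = rec.info.get("AF")  # per-alt tuple in gnomAD releases
        if af and af[0] > MIN_AF:
            vcf_out.write(rec)
```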

@mwalker174
Collaborator Author

Following up on this with more detail.

We currently generate BAF in GatherBatchEvidence (formerly Module00c) from GATK SNP calls using one of two workflows: BAFFromGVCFs if the user provides gVCFs, and BAFFromShardedVCF if they provide a joint-called VCF (or a set of sharded VCFs). This has been a pain point for a number of reasons:

  1. Obviously, the SNP calls must be generated first. This introduces delays for projects where SNV and SV generation are planned in parallel, and often creates extra work obtaining SNV calls in a format digestible by the pipeline.
  2. gVCFs are extremely large and storage is expensive.
  3. Joint VCFs can also be extremely large, and sharding them creates additional work.
  4. For larger cohorts especially, SNV calls may only be available for a subset of samples. The pipeline does accommodate this scenario, but it is somewhat confusing and error-prone to set up BAF generation, particularly in Terra (mistakes have happened here in the past).
  5. BAF is generated per-batch, which makes it difficult to share BAF records for individual samples across overlapping cohorts.
  6. GenerateBatchMetrics is already a large workflow, and it would be good to move work to other modules where possible.

For these reasons, it would be a huge improvement to generate BAF directly from the CRAMs. Since BAF is only used for random forest training in FilterBatch, we don't expect slight changes in methodology to affect results substantially. The new strategy will be:

  1. Move BAF generation to PESRCollection and rename the workflow CollectSVEvidence. This workflow should use a sites list (in VCF format) of common SNPs from gnomAD to collect allelic counts (LocusDepth features) with GATK's CollectSVEvidence tool. BAF generation should be optional, however, as it is not needed for the single-sample mode. Delete the old BAF workflows from GatherBatchEvidence.
  2. Create a new GATK tool to digest LocusDepth feature files from multiple samples and join them into a batch BafEvidence feature file, matching the logic/math from src/sv-pipeline/02_evidence_assessment/02d_baftest/scripts/Filegenerate/generate_baf.py (output in compressed text format for now); a sketch of the core transformation follows this list. Create a new workflow in GATK-SV, GenerateBAF, to run this tool as part of GatherBatchEvidence (again optional, since it is not required for single-sample mode).
  3. Test changes by running the pipeline through FilterBatch and evaluating any effects on site filtering.
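
At each retained SNP site, the math reduces to the alt-allele fraction for samples that look heterozygous. A minimal sketch of the joining step in step 2, with assumed record layouts (the real LocusDepth/BafEvidence schemas are defined in GATK, and generate_baf.py remains the authoritative reference for the math):

```python
# Hypothetical sketch only: the tuple layouts below are assumptions for
# illustration; the real feature files are GATK-defined formats.
from typing import Iterable, Iterator, NamedTuple

class LocusDepth(NamedTuple):
    contig: str
    pos: int
    sample: str
    ref_count: int
    alt_count: int

class BafEvidence(NamedTuple):
    contig: str
    pos: int
    sample: str
    baf: float

def join_to_baf(records: Iterable[LocusDepth],
                min_depth: int = 10) -> Iterator[BafEvidence]:
    """Turn per-sample allelic counts into BAF records at het-looking sites."""
    for rec in records:
        depth = rec.ref_count + rec.alt_count
        if depth < min_depth:
            continue  # too shallow to call het reliably
        baf = rec.alt_count / depth
        # Crude het filter on allele balance; see the discussion of het
        # classification from DP/AD in the comment below.
        if 0.25 <= baf <= 0.75:
            yield BafEvidence(rec.contig, rec.pos, rec.sample, baf)
```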

Some of this depends on a GATK PR that adds sample metadata to LocusDepth features.

@tedsharpe
Contributor

The current BAF calculation relies on the genotype specified in the VCF (the _is_het function). We'll need some way of classifying samples as hets from DP and AD alone.
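
One option, sketched below, would be a binomial test of the alt-read count against the 0.5 fraction expected under a het; the depth and p-value cutoffs here are illustrative assumptions, not a committed design:

```python
# Sketch of classifying a sample as heterozygous from DP and AD alone.
# Thresholds are illustrative assumptions.
from scipy.stats import binomtest

def looks_het(ref_count: int, alt_count: int,
              min_depth: int = 10, p_cutoff: float = 0.05) -> bool:
    """Classify a site as heterozygous from raw allelic counts alone."""
    depth = ref_count + alt_count
    if depth < min_depth:
        return False  # too shallow to distinguish het from hom + noise
    # Under a true het, alt reads ~ Binomial(depth, 0.5); reject if the
    # observed allele balance is too extreme to be consistent with 0.5.
    return binomtest(alt_count, n=depth, p=0.5).pvalue >= p_cutoff
```

For example, looks_het(12, 10) would return True (balanced counts), while looks_het(20, 2) would return False (allele balance inconsistent with a het).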

VJalili added a commit to talkowski-lab/gnomad-sv-v3-qc that referenced this issue Mar 31, 2022
@mwalker174
Collaborator Author

Addressed in #351
