BAF from BAM #18

Closed
mwalker174 opened this issue Jul 1, 2020 · 3 comments

@mwalker174
Collaborator

BAF generation could be improved by using GATK's CollectAllelicCounts on sample BAMs rather than on gVCFs/VCFs, which are not only more costly but also operationally more complicated (e.g., see #8). I propose collecting only at gnomAD SNP sites with >0.1% AF.
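
A minimal sketch of how such a sites list might be derived with pysam (the file paths, the biallelic-SNP restriction, and reliance on the gnomAD `AF` INFO field are illustrative assumptions, not a committed design):

```python
# Hypothetical sketch: filter a gnomAD sites VCF down to common biallelic
# SNPs (AF > 0.1%) for use as a CollectAllelicCounts sites list.
# Paths and the "AF" INFO key are assumptions for illustration.
import pysam

MIN_AF = 0.001  # 0.1% allele frequency threshold

with pysam.VariantFile("gnomad.sites.vcf.gz") as vcf_in, \
        pysam.VariantFile("common_snps.sites.vcf", "w",
                          header=vcf_in.header) as vcf_out:
    for rec in vcf_in:
        alts = rec.alts or ()
        # Keep biallelic SNPs only (single REF base, single ALT base)
        if len(alts) != 1 or len(rec.ref) != 1 or len(alts[0]) != 1:
            continue
        af = rec.info.get("AF")  # per-alt tuple in gnomAD releases
        if af and af[0] > MIN_AF:
            vcf_out.write(rec)
```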

@mwalker174
Collaborator Author

Following up on this with more detail.

We currently generate BAF in GatherBatchEvidence (formerly Module00c) from GATK SNP calls using one of two workflows: BAFFromGVCFs if the user provides gVCFs, and BAFFromShardedVCF if they provide a joint-called VCF (or a set of sharded VCFs). This has been a pain point for a number of reasons:

  1. Obviously, the SNP calls must be generated first. This introduces delays for projects where SNV and SV generation are planned in parallel, and often creates extra work obtaining SNV calls in a format digestible by the pipeline.
  2. gVCFs are extremely large and storage is expensive.
  3. Joint VCFs can also be extremely large, and sharding them creates additional work.
  4. For larger cohorts especially, SNV calls may only be available for a subset of samples. The pipeline does accommodate this scenario, but it is somewhat confusing and error-prone to set up BAF generation, particularly in Terra (mistakes have happened here in the past).
  5. BAF is generated per-batch, which makes it difficult to share BAF records for individual samples across overlapping cohorts.
  6. GenerateBatchMetrics is already a large workflow, and it would be good to move work to other modules where possible.

For these reasons, it would be a huge improvement to generate BAF directly from the CRAMs. Since BAF is only used for random forest training in FilterBatch, we don't expect slight changes in methodology to affect results substantially. The new strategy will be:

  1. Move BAF generation to PESRCollection and rename the workflow CollectSVEvidence. This workflow should use a sites list (in VCF format) of common SNPs from gnomAD to collect allelic counts (LocusDepth features) with GATK's CollectSVEvidence tool. BAF generation should be optional, however, as it is not needed for the single-sample mode. Delete the old BAF workflows from GatherBatchEvidence.
  2. Create a new GATK tool to digest LocusDepth feature files from multiple samples and join them into a batch BafEvidence feature file, matching the logic/math from src/sv-pipeline/02_evidence_assessment/02d_baftest/scripts/Filegenerate/generate_baf.py (output in compressed text format for now); a sketch of the core transformation follows this list. Create a new workflow in GATK-SV, GenerateBAF, to run this tool as part of GatherBatchEvidence (again optional, since it is not required for single-sample mode).
  3. Test changes by running the pipeline through FilterBatch and evaluating any effects on site filtering.
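
At each retained SNP site, the math reduces to the alt-allele fraction for samples that look heterozygous. A minimal sketch of the joining step in step 2, with assumed record layouts (the real LocusDepth/BafEvidence schemas are defined in GATK, and generate_baf.py remains the authoritative reference for the math):

```python
# Hypothetical sketch only: the tuple layouts below are assumptions for
# illustration; the real feature files are GATK-defined formats.
from typing import Iterable, Iterator, NamedTuple

class LocusDepth(NamedTuple):
    contig: str
    pos: int
    sample: str
    ref_count: int
    alt_count: int

class BafEvidence(NamedTuple):
    contig: str
    pos: int
    sample: str
    baf: float

def join_to_baf(records: Iterable[LocusDepth],
                min_depth: int = 10) -> Iterator[BafEvidence]:
    """Turn per-sample allelic counts into BAF records at het-looking sites."""
    for rec in records:
        depth = rec.ref_count + rec.alt_count
        if depth < min_depth:
            continue  # too shallow to call het reliably
        baf = rec.alt_count / depth
        # Crude het filter on allele balance; see the discussion of het
        # classification from DP/AD in the comment below.
        if 0.25 <= baf <= 0.75:
            yield BafEvidence(rec.contig, rec.pos, rec.sample, baf)
```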

Some of this depends on a GATK PR that adds sample metadata to LocusDepth features.

@tedsharpe
Contributor

The current BAF calculation relies on the genotype specified in the VCF (the _is_het function). We'll need some way of classifying samples as hets from DP and AD alone.
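
One option, sketched below, would be a binomial test of the alt-read count against the 0.5 fraction expected under a het; the depth and p-value cutoffs here are illustrative assumptions, not a committed design:

```python
# Sketch of classifying a sample as heterozygous from DP and AD alone.
# Thresholds are illustrative assumptions.
from scipy.stats import binomtest

def looks_het(ref_count: int, alt_count: int,
              min_depth: int = 10, p_cutoff: float = 0.05) -> bool:
    """Classify a site as heterozygous from raw allelic counts alone."""
    depth = ref_count + alt_count
    if depth < min_depth:
        return False  # too shallow to distinguish het from hom + noise
    # Under a true het, alt reads ~ Binomial(depth, 0.5); reject if the
    # observed allele balance is too extreme to be consistent with 0.5.
    return binomtest(alt_count, n=depth, p=0.5).pvalue >= p_cutoff
```

For example, looks_het(12, 10) would return True (balanced counts), while looks_het(20, 2) would return False (allele balance inconsistent with a het).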

VJalili added a commit to talkowski-lab/gnomad-sv-v3-qc that referenced this issue Mar 31, 2022
@mwalker174
Collaborator Author

Addressed in #351
