Improve Scramble accuracy for BWA and Dragen 3.7.8 #722

mwalker174 · 2024-09-09T16:45:26Z

Updates methods for Scramble calling. This includes several critical components:

Filtering of calls within reference MEIs overlapping indels and small deletion SVs. This was a major source of false positives for Scramble with WGS. The filter identifies such cases from the BAM/CRAM and Manta calls.
Improved SVLEN estimation. The previous method was overly simplistic, but length estimation now accounts for the orientation of soft clips and alignment strand. Note that this method's accuracy is greatly limited in cases where split reads are found on only one side of the breakpoint.
Deduplication. Scramble seems to emit duplicated calls in many cases. Now calls within a short window are collapsed appropriately.
Dragen 3.7.8 realignment. An apparent regression was introduced somewhere after Dragen 3.4.12 and after or as of 3.7.8 where small indels cause erroneous soft-clipping. This caused a massive number of false positive calls with Scramble. GatherSampleEvidence now automatically detects if Dragen 3.7.8 is used using the CRAM header (or if the is_dragen_3_7_8 flag is set explicitly). If so, it performs realignment of all soft-clipped reads near Scramble calls using BWA and re-runs Scramble off the realignment. The cost of this is very small and eliminates this artifact.
Updates the overlap breakpoint filter in ResolveComplexVariants to prioritize Manta over Scramble. A major source of Scramble false positives was at deletion sites. This filter catches such cases missed in the raw call filter from part (1).
SVA precision was initially low even after implementation of the above filters. A hard filter has been added to ApplyManualVariantFilter to remove Scramble-only SVAs flagged as HIGH_SR_BACKGROUND. This raised SVA precision from ~0.65 to ~0.85 in our tests, with a tolerably small loss of sensitivity. In the future, improvements and/or retraining of the GQ Recalibrator model may obviate the need for this hard filter.
Filter secondary/supplementary reads from consideration in the Scramble tool. This was bringing in further false positives. The Dockerfile has been updated.

This PR also includes changes to the README to recommend Scramble for MEI calling and deprecate MELT support.

These changes were extensively tested and validated on CRAMS aligned with both BWA and Dragen-3.7.8, including a full run of 1000 Genomes Phase 3 on 3.7.8.

cwhelan

Looks good and clearly represents a lot of work, nice job. I had a few questions, especially about the way we search for nearby indels, mostly just to understand what went into all the choices. I was wondering if it might be worthwhile to package up some of the documentation / figures / slides you made while doing this, put them on github in an issue or wiki page, and link to it from the code in case people have questions about why things are implemented this way (or for future maintainers).

cwhelan · 2024-10-15T21:09:03Z

README.md

@@ -260,18 +261,21 @@ The following sections briefly describe each module and highlights inter-depende
 ## <a name="gather-sample-evidence">GatherSampleEvidence</a>
 *Formerly Module00a*

-Runs raw evidence collection on each sample with the following SV callers: [Manta](https://github.com/Illumina/manta), [Wham](https://github.com/zeeev/wham), and/or [MELT](https://melt.igs.umaryland.edu/). For guidance on pre-filtering prior to `GatherSampleEvidence`, refer to the [Sample Exclusion](#sample-exclusion) section.
+Runs raw evidence collection on each sample with the following SV callers: [Manta](https://github.com/Illumina/manta), [Wham](https://github.com/zeeev/wham), [Scramble](https://github.com/GeneDx/scramble), and/or [MELT](https://melt.igs.umaryland.edu/). For guidance on pre-filtering prior to `GatherSampleEvidence`, refer to the [Sample Exclusion](#sample-exclusion) section.


I know you say not to above but this list makes it sound like Scramble + MELT would be OK.

Maybe rephrase to,

Runs raw evidence collection on each sample with the following SV callers: [Manta](https://github.com/Illumina/manta), [Wham](https://github.com/zeeev/wham), and an MEI caller ([Scramble](https://github.com/GeneDx/scramble) or [MELT](https://melt.igs.umaryland.edu/)). For guidance on pre-filtering prior to `GatherSampleEvidence`, refer to the [Sample Exclusion](#sample-exclusion) section.```

I noticed this too and am fixing it in a separate PR transferring this over to the website

cwhelan · 2024-10-15T21:12:29Z

README.md


 Note: a list of sample IDs must be provided. Refer to the [sample ID requirements](#sampleids) for specifications of allowable sample IDs. IDs that do not meet these requirements may cause errors.

 #### Inputs:
 * Per-sample BAM or CRAM files aligned to hg38. Index files (`.bai`) must be provided if using BAMs.

 #### Outputs:
-* Caller VCFs (Manta, MELT, and/or Wham)
+* Caller VCFs (Manta, Scramble, MELT, and/or Wham)


Again the and/or is a little confusing I think. Maybe just provide the list of callers supported?

cwhelan · 2024-10-16T13:42:22Z

README.md

 * Binned read counts file
 * Split reads (SR) file
 * Discordant read pairs (PE) file
+* Scramble intermediate clusters file and table (not needed downstream)


"not needed downstream but useful for examining candidate sites when high sensitivity is required"?

Or something similar describing the main use of having the file as an output?

cwhelan · 2024-10-16T13:45:19Z

dockerfiles/sv-base-mini/Dockerfile

@@ -51,6 +52,19 @@ RUN wget -q https://github.com/arq5x/bedtools2/releases/download/v$BEDTOOLS_VERS
    mv bedtools.static /opt/bedtools/bin/bedtools && \
    chmod a+x /opt/bedtools/bin/bedtools

+# install bwa
+# must do from source because of compiler error in latest release (see https://github.com/lh3/bwa/issues/387)


I looked up this issue and it looks like it might have been fixed in a new release 0.7.18 in the last few weeks (https://github.com/lh3/bwa/releases/tag/v0.7.18). Might be worth switching this back to installing a release rather than building from source for simplicity and build time sake?

It doesn't look like a build got included with that release unfortunately

cwhelan · 2024-10-16T13:47:37Z

inputs/templates/test/ApplyManualVariantFilter/ApplyManualVariantFilter.json.tmpl

-  "ApplyManualVariantFilter.bcftools_filter": "SVTYPE==\"DEL\" && COUNT(ALGORITHMS)==1 && ALGORITHMS==\"wham\"",
-  "ApplyManualVariantFilter.filter_name": "filter_wham_only_del"
+  "ApplyManualVariantFilter.bcftools_filter": "(SVTYPE==\"DEL\" && COUNT(ALGORITHMS)==1 && ALGORITHMS==\"wham\") || (ALT==\"<INS:ME:SVA>\" && COUNT(ALGORITHMS)==1 && ALGORITHMS==\"scramble\" && HIGH_SR_BACKGROUND==1)",
+  "ApplyManualVariantFilter.filter_name": "hard_filter"


I'd prefer something a little more descriptive but I'm not sure what the best name would be. What about something like "high_algorithm_fp_rate"?

cwhelan · 2024-10-17T18:38:07Z

src/sv-pipeline/scripts/make_scramble_vcf.py

+
+    header = make_header(reference_path=arguments.reference, sample_id=arguments.sample)
+    logging.info("Loading Scramble table...")
+    sorted_bed_lines = preprocess_scramble_table(path=arguments.table)


The method doesn't actually sort the records, do we need to?

Right I think maybe they used to be but the sorting is not necessary. Changed to just bed_lines.

cwhelan · 2024-10-17T18:46:30Z

src/sv-pipeline/scripts/make_scramble_vcf.py

+        ref_sequences_index_dict = {seq: i for i, seq in enumerate(f.references)}
+
+    # Buffer of active items within cluster_distance
+    buffer = list()


MIght not make any practical difference in runtim but buffer, new_buffer and data feel like they might be more efficient as deque objects rather than lists since they have insertions to the front and end of the lists.

Good suggestion, I've changed these to deque

cwhelan · 2024-10-17T18:54:59Z

src/sv-pipeline/scripts/make_scramble_vcf.py

+        return False
+    start = max(record.pos - indel_window, 1)
+    end = record.pos + indel_window
+    num_indels = _get_num_indels(reads=samfile.fetch(record.chrom, start, end), min_reads=min_reads,


This will fetch any reads that overlap the window, so you'll be counting indels that could technically be more than indel_window bases away from record.pos if eg a read only overlaps the window by a few bases.

Is this what you want actually? Seems like you might want to only look at reads that actually overlap the position itself, right?

Either way, it would be good to clarify in the comments or argument descriptions what is really happening.

Yeah this is another case where repetitive contexts dominated some choices, but they were are a substantial source of false positives. I found that looking at just the position was too stringent.

It's a good point that the search extends past the window since we pull in reads that overlap the window and count all indels, so I will clarify in the comment.

cwhelan · 2024-10-17T19:03:12Z

wdl/GatherSampleEvidence.wdl

+  }
+}
+
+task RealignSoftClippedReads {


This task is pretty tied to scramble, maybe move it to Scramble.wdl?

I tried moving it in but it adds a lot of wdl code. One thing that's kind of awkward is having to pass in the scramble_table from the previous step. I think there's less code this way and avoids adding a lot of mutually exclusive inputs to the Scramble wdl itself (all the bwa files etc).

cwhelan · 2024-10-17T19:14:14Z

wdl/Scramble.wdl

+    #   DRAGEN-3.7.8: 60
+    Int alignment_score_cutoff
+
+    Float min_clipped_reads_fraction = 0.22


What went into choosing this parameter? It seems pretty important. Is it worth considering an alternate approach of setting it lower and then applying a filter to calls that don't meet this threshold, so that they'll be there in the VCF for cohorts where high sensitivity is desired?

I agree this would be a useful feature. The only issue with increasing the threshold is cost. A lower threshold can bring in a lot more clusters which will drive up run time proportionally for all users. I think it would be a reasonable compromise to lower the threshold fed to Scramble by a tad (say to 0.15) so that the variants at least appear in the Scramble table, and then filter at 0.22 for the vcf. I don't think this is critical but I'm going to open a ticket for this.

parent 2b9af68 author Mark Walker <[email protected]> 1701883462 -0500 committer Mark Walker <[email protected]> 1718121961 -0400 parent ceb41fc author Mark Walker <[email protected]> 1701883462 -0500 committer Mark Walker <[email protected]> 1715005950 -0400 parent 9ad640d author Mark Walker <[email protected]> 1701883462 -0500 committer Mark Walker <[email protected]> 1704486807 -0500 Add scramble filter Realignment Add bwa to mini docker Build bwa from git Use /opt/bin Start in /opt Runtime parameters Fix alignment command bump memory Make scramble vcf python script Scramble make vcf task added Fix realignment bai path Missing sv_pipeline_docker input Fix scramble table output Buffered clustering Fix reverses Use samtools view -M; realign from Scramble table instead of vcf Fix interval generation and only realign soft-clipped reads Slop 1 base Slop 150 bases And merge Change scramble priority in overlap bp filter; update dockers Reduce realignment cores to 4 Sort scramble table Fix wdl Update json templates Bump realignment memory to 12gb Update script to filter on indels Indel filtering switch Update Add MEI intervals and manta/wham filtering Delete comments Optimize mei tree loading; consume table directly Add scramble_min_clipped_reads Fix reads_index_ Update scramble vcf script args Add missing vcf script input in GatherSampleEvidence LINE1 to L1 alt Give preference to MEI over Manta insertions in breakpoint overlap filter Update dockers with last build Update dockers Reduce memory usage in RealignSoftClippedReads Bump realignment memory Add manta_vcf_input in case manta isn't run but scramble is Increase min_clipped_reads to 5 Bump memory Update scramble commit Bump MakeScrambleVcf memory to 4 Reduce min reads to 3 Add mei tree padding and alignment_score_cutoff Select alignment score cutoff based on aligner Update dockers Remove debugging lines Add scramble_mei_bed Add scramble_mei_bed to test template Bump ScramblePart1 to 3 GB Adjust scramble clipped reads threshold based on coverage Top level scramble_min_clipped_reads_fraction optional Update dockers Add mw_scramble_filter to ResolveComplexVariants in dockstore.yml Update dockers Update dockstore yml Final touches Update templates; add SVA filter Revert unnecessary changes Lint Update readme Update single sample pipeline and gathersampleevidence batch Update readme Readme grammar Scramble intermedates description in README Add comments to scramble vcf script Minor fix Fix GatherSampleEvidence optional comparison Fix double declaration of mei_bed in GATKSVPipelineSingleSample Update Scramble commit to latest master Update ref panel vcfs Reviewer comments

cwhelan

Looks good, thanks for addressing my comments.

mwalker174 mentioned this pull request Sep 30, 2024

Add genotype filtering Terra workflow configs and documentation #695

Merged

cwhelan requested changes Oct 17, 2024

View reviewed changes

mwalker174 mentioned this pull request Oct 21, 2024

Scramble sensitivity tuning #735

Open

mwalker174 force-pushed the mw_scramble_filter branch from 0024d24 to dca0b44 Compare October 21, 2024 15:16

cwhelan approved these changes Oct 25, 2024

View reviewed changes

mwalker174 merged commit 5ea22a5 into main Oct 25, 2024
9 checks passed

mwalker174 deleted the mw_scramble_filter branch October 25, 2024 15:41

mwalker174 mentioned this pull request Oct 25, 2024

Throw error for untested Dragen versions #739

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Improve Scramble accuracy for BWA and Dragen 3.7.8 #722

Improve Scramble accuracy for BWA and Dragen 3.7.8 #722

mwalker174 commented Sep 9, 2024

cwhelan left a comment

cwhelan Oct 15, 2024

mwalker174 Oct 21, 2024

cwhelan Oct 15, 2024

mwalker174 Oct 21, 2024

cwhelan Oct 16, 2024

mwalker174 Oct 21, 2024

cwhelan Oct 16, 2024

mwalker174 Oct 21, 2024

cwhelan Oct 16, 2024

cwhelan Oct 17, 2024

mwalker174 Oct 21, 2024

cwhelan Oct 17, 2024

mwalker174 Oct 21, 2024

cwhelan Oct 17, 2024

mwalker174 Oct 21, 2024

cwhelan Oct 17, 2024

mwalker174 Oct 21, 2024

cwhelan Oct 17, 2024

mwalker174 Oct 21, 2024

cwhelan left a comment

Improve Scramble accuracy for BWA and Dragen 3.7.8 #722

Improve Scramble accuracy for BWA and Dragen 3.7.8 #722

Conversation

mwalker174 commented Sep 9, 2024

cwhelan left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

cwhelan left a comment

Choose a reason for hiding this comment