Optionally filter off-target calls for exomes [VS-1375] #8915

mcovarr · 2024-07-16T17:13:35Z

Currently two commits here: Megan's commit that is already on master and was approved in #8831, and the other commit with all of my changes for this ticket.

Successful integration run here.

mcovarr · 2024-07-22T16:25:50Z

scripts/variantstore/docs/RUNNING_EXOMES_ON_GVS.md

+          - `input_vcfs`: `"gs://broad-gotc-test-storage/ExomeGermlineSingleSample/truth/scientific/master/D5327.NA12878/NA12878.rb.g.vcf.gz"`
+          - `input_vcf_indexes`: `"gs://broad-gotc-test-storage/ExomeGermlineSingleSample/truth/scientific/master/D5327.NA12878/NA12878.rb.g.vcf.gz.tbi"`


these moved!

mcovarr · 2024-07-22T16:26:55Z

scripts/variantstore/docs/RUNNING_EXOMES_ON_GVS.md

 1. `GvsExtractCallset` workflow
+   - This workflow is used to extract non-control samples only. See the `GvsCalculatePrecisionAndSensitivity` instructions below if running Precision and Sensitivity on control samples.
   - This workflow takes the tables created in the `Prepare` step to output joint VCF shards.
-   - This workflow will need to be run once to extract the callset into VCF shards and an additional time to calculate Precision and Sensitivity (with `control_samples` set to `true`).
+   - This workflow is run only to extract the callset into VCF shards (`control_samples` set to `false` explicitly or left at its default `false` value).
   - This workflow does not use the Terra Data Entity Model to run, so be sure to select the `Run workflow with inputs defined by file paths` workflow submission option.
-   - Set the parameter `use_interval_weights` to `false`.  This avoids a bug seen in WeightedSplitIntervals when using exomes, forcing it to use the standard version of SplitIntervals.
-   - Set the `output_gcs_dir` to a GCS location to collect all the VCF shards into one place.  If you are running it twice (to calculate Precision and Sensitivity), be sure to provide distinct locations for each.
+   - Set the `output_gcs_dir` to a GCS location to collect all the VCF shards into one place.


the current version of the GvsCalculatePrecisionAndSensitivity workflow now includes the cohort extraction, so no separate extraction step for control samples is required anymore.

mcovarr · 2024-07-22T16:28:53Z

scripts/variantstore/wdl/GvsExtractCohortFromSampleNames.wdl

+  Int effective_sample_count = length(read_lines(cohort_sample_names_file))
+
+  Int effective_scatter_count = if defined(extract_scatter_count_override) then select_first([extract_scatter_count_override])
+                                else if is_wgs then
+                                     if effective_sample_count < 5000 then 1 # This results in 1 VCF per chromosome.
+                                     else if effective_sample_count < 20000 then 2000 # Stroke Anderson
+                                          else if effective_sample_count < 50000 then 10000
+                                               else 20000
+                                     else
+                                     if effective_sample_count < 5000 then 1 # This results in 1 VCF per chromosome.
+                                     else if effective_sample_count < 20000 then 1000
+                                          else if effective_sample_count < 50000 then 2500
+                                               else 7500


Ridiculously wide scatters during cohort extract have been super annoying for a long time, and in this work they were actually breaking P&S runs when trying to merge thousands of VCFs that were >99% header lines. Simple fixes here largely copy/pasted from the regular extract workflow.

Smart -- we know VCF creation is very much NOT linear, so building our step function this way makes more sense

mcovarr · 2024-07-22T16:29:42Z

scripts/variantstore/wdl/GvsExtractCohortFromSampleNames.wdl


  # allow an interval list to be passed in, but default it to our standard one if no args are here
-  File working_interval_list = select_first([interval_list, "gs://gcp-public-data--broad-references/hg38/v0/wgs_calling_regions.hg38.noCentromeres.noTelomeres.interval_list"])
+  File effective_interval_list = select_first([interval_list, "gs://gcp-public-data--broad-references/hg38/v0/wgs_calling_regions.hg38.noCentromeres.noTelomeres.interval_list"])


name change for variable to match pretty much all other GVS code

mcovarr · 2024-07-22T16:30:36Z

scripts/variantstore/wdl/GvsQuickstartIntegration.wdl

@@ -34,7 +35,7 @@ workflow GvsQuickstartIntegration {
    File full_wgs_interval_list = "gs://gcp-public-data--broad-references/hg38/v0/wgs_calling_regions.hg38.noCentromeres.noTelomeres.interval_list"
    File full_exome_interval_list = "gs://gcp-public-data--broad-references/hg38/v0/bge_exome_calling_regions.v1.1.interval_list"
    String expected_subdir = if (!chr20_X_Y_only) then "all_chrs/"  else ""
-    File expected_output_prefix = "gs://gvs-internal-quickstart/integration/2024-05-23/" + expected_subdir
+    File expected_output_prefix = "gs://gvs-internal-quickstart/integration/2024-07-03/" + expected_subdir


updated truth is just an FT header line change for most branches, but extensive OUTSIDE_OF_TARGETS filtering changes for the exome branch.

mcovarr · 2024-07-22T16:31:25Z

scripts/variantstore/wdl/GvsQuickstartVcfIntegration.wdl

@@ -231,7 +233,7 @@ task AssertIdenticalOutputs {
          actual="actual/$vcf"
          expected="expected/$vcf"
          set +o errexit
-          cmp <(grep '^#' $actual) <(grep '^#' $expected)
+          cmp <(grep '^#' $actual | grep -E -v '^##GATKCommandLine=') <(grep '^#' $expected | grep -E -v '^##GATKCommandLine=')


Running VariantFiltration puts a ##GATKCommandLine header in the headers which has different datetimes and filepaths for every run, so exclude this from consideration.

mcovarr · 2024-07-22T16:32:08Z

src/main/java/org/broadinstitute/hellbender/tools/gvs/extract/ExtractCohort.java

@@ -393,7 +393,7 @@ else if (vqScoreFilteringType.equals(VQScoreFilteringType.GENOTYPE)) {
        }

        if (vqScoreFilteringType.equals(VQScoreFilteringType.GENOTYPE)) {
-            extraHeaderLines.add(new VCFFormatHeaderLine("FT", 1, VCFHeaderLineType.String, "Sample Genotype Filter Field"));
+            extraHeaderLines.add(VCFStandardHeaderLines.getFormatLine(VCFConstants.GENOTYPE_FILTER_KEY));


Switch from Variants-proprietary FT header line to the standard one

mcovarr · 2024-07-22T16:32:49Z

src/main/java/org/broadinstitute/hellbender/tools/walkers/filters/VariantFiltration.java

@@ -114,6 +114,7 @@ public final class VariantFiltration extends VariantWalker {
    public static final String CLUSTER_WINDOW_SIZE_LONG_NAME = "cluster-window-size";
    public static final String MASK_EXTENSION_LONG_NAME = "mask-extension";
    public static final String MASK_NAME_LONG_NAME = "mask-name";
+    public static final String MASK_DESCRIPTION_LONG_NAME = "mask-description";


As of now everything from here on down is from Megan's PR

rsasch · 2024-07-23T13:46:51Z

Is there a successful integration run?

RoriCremer · 2024-07-23T14:01:49Z

scripts/variantstore/wdl/GvsExtractCallset.wdl

+          -O ~{output_file} \
+          -V ${pre_off_target_vcf}
+    fi
+


can we not just give an interval list to the GATK extract itself?

I don't think this is the same interval list we want to use there, plus the VariantFiltration tool writes the specific OUTSIDE_OF_TARGETS genotype filter we want with the VCF header and everything.

RoriCremer · 2024-07-23T14:02:41Z

scripts/variantstore/wdl/GvsExtractCohortFromSampleNames.wdl

+  Int effective_scatter_count = if defined(extract_scatter_count_override) then select_first([extract_scatter_count_override])
+                                else if is_wgs then
+                                     if effective_sample_count < 5000 then 1 # This results in 1 VCF per chromosome.
+                                     else if effective_sample_count < 20000 then 2000 # Stroke Anderson


I think we can retire the Stroke Anderson comment since the data has long been taken

RoriCremer · 2024-07-23T14:03:56Z

scripts/variantstore/wdl/GvsExtractCohortFromSampleNames.wdl

+  Int effective_sample_count = length(read_lines(cohort_sample_names_file))
+
+  Int effective_scatter_count = if defined(extract_scatter_count_override) then select_first([extract_scatter_count_override])
+                                else if is_wgs then
+                                     if effective_sample_count < 5000 then 1 # This results in 1 VCF per chromosome.
+                                     else if effective_sample_count < 20000 then 2000 # Stroke Anderson
+                                          else if effective_sample_count < 50000 then 10000
+                                               else 20000
+                                     else
+                                     if effective_sample_count < 5000 then 1 # This results in 1 VCF per chromosome.
+                                     else if effective_sample_count < 20000 then 1000
+                                          else if effective_sample_count < 50000 then 2500
+                                               else 7500


Smart -- we know VCF creation is very much NOT linear, so building our step function this way makes more sense

scripts/variantstore/beta_docs/gvs-exome-workspace-description.md

kbergin

I'm going to want to put a few updates into the documentation regarding BGE but that can be a separate PR.

Note I only reviewed documentation changes and had one suggestion.

kbergin · 2024-07-23T15:02:43Z

scripts/variantstore/beta_docs/gvs-exome-workspace-description.md

-Note that, by default, the Exomes Beta workflow uses the Blended Genomes Interval List: `gs://gcp-public-data--broad-references/hg38/v0/bge_exome_calling_regions.v1.1.interval_list` for its calling regions. For information on how that interval list was generated, see `gs://gcp-public-data--broad-references/hg38/v0/bge_exome_calling_regions.v1.1.interval_list.README.md`.
+Note that, by default, the Exomes Beta workflow uses two interval lists by default:
+1. The calling region interval list defaults to the Blended Genomes Interval List: `gs://gcp-public-data--broad-references/hg38/v0/bge_exome_calling_regions.v1.1.interval_list` . For information on how that interval list was generated, see `gs://gcp-public-data--broad-references/hg38/v0/bge_exome_calling_regions.v1.1.interval_list.README.md`.
+1. The target region interval list defaults to `gs://broad-gotc-test-storage/JointGenotyping/inputs/scientific/bge/TwistAllianceClinicalResearchExome_Covered_Targets_hg38.interval_list`. Any variants called outside these interval will be *genotype* filtered with `OUTSIDE_OF_TARGETS`.  


This would need to be updated later when it's public right?

mcovarr · 2024-07-23T15:59:05Z

Is there a successful integration run?

yes here

…in VCF header (#8831) Added a --mask-description argument to VariantFiltration to write a custom description for the mask filter in the VCF header

mcovarr changed the title ~~Vs 1375 off target~~ Optionally filter off-target calls for exomes [VS-1375] Jul 17, 2024

mcovarr mentioned this pull request Jul 17, 2024

Update GVS Docs to cover additional use cases and troubleshooting suggestions #8917

Merged

mcovarr force-pushed the vs_1375_off_target branch 3 times, most recently from 23dd8c0 to 1bb5f67 Compare July 19, 2024 21:52

mcovarr marked this pull request as ready for review July 19, 2024 22:57

mcovarr commented Jul 22, 2024

View reviewed changes

mcovarr requested review from gbggrant, rsasch and RoriCremer July 22, 2024 17:19

mcovarr force-pushed the vs_1375_off_target branch 2 times, most recently from 335bb6f to f3c6ad2 Compare July 22, 2024 18:06

RoriCremer approved these changes Jul 23, 2024

View reviewed changes

kbergin reviewed Jul 23, 2024

View reviewed changes

scripts/variantstore/beta_docs/gvs-exome-workspace-description.md Outdated Show resolved Hide resolved

kbergin approved these changes Jul 23, 2024

View reviewed changes

kbergin reviewed Jul 23, 2024

View reviewed changes

gbggrant approved these changes Jul 23, 2024

View reviewed changes

meganshand and others added 2 commits July 23, 2024 17:40

VariantFiltration: added arg to write custom mask filter description …

d25cc3e

…in VCF header (#8831) Added a --mask-description argument to VariantFiltration to write a custom description for the mask filter in the VCF header

Run VariantFiltering for Exomes [VS-1305]

00b7c69

mcovarr force-pushed the vs_1375_off_target branch from 2e5b438 to 00b7c69 Compare July 23, 2024 21:41

mcovarr merged commit 00b7c69 into ah_var_store Jul 23, 2024
5 of 16 checks passed

mcovarr deleted the vs_1375_off_target branch July 23, 2024 21:41

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Optionally filter off-target calls for exomes [VS-1375] #8915

Optionally filter off-target calls for exomes [VS-1375] #8915

mcovarr commented Jul 16, 2024 •

edited

Loading

mcovarr Jul 22, 2024

mcovarr Jul 22, 2024

mcovarr Jul 22, 2024 •

edited

Loading

RoriCremer Jul 23, 2024

mcovarr Jul 22, 2024

mcovarr Jul 22, 2024

mcovarr Jul 22, 2024

mcovarr Jul 22, 2024

mcovarr Jul 22, 2024

rsasch commented Jul 23, 2024

RoriCremer Jul 23, 2024

mcovarr Jul 23, 2024

RoriCremer Jul 23, 2024

mcovarr Jul 23, 2024

RoriCremer Jul 23, 2024

kbergin left a comment

kbergin Jul 23, 2024

mcovarr Jul 23, 2024

mcovarr commented Jul 23, 2024

		- `input_vcfs`: `"gs://broad-gotc-test-storage/ExomeGermlineSingleSample/truth/scientific/master/D5327.NA12878/NA12878.rb.g.vcf.gz"`
		- `input_vcf_indexes`: `"gs://broad-gotc-test-storage/ExomeGermlineSingleSample/truth/scientific/master/D5327.NA12878/NA12878.rb.g.vcf.gz.tbi"`

Optionally filter off-target calls for exomes [VS-1375] #8915

Optionally filter off-target calls for exomes [VS-1375] #8915

Conversation

mcovarr commented Jul 16, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

mcovarr Jul 22, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

rsasch commented Jul 23, 2024

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

kbergin left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

mcovarr commented Jul 23, 2024

mcovarr commented Jul 16, 2024 •

edited

Loading

mcovarr Jul 22, 2024 •

edited

Loading