Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optionally filter off-target calls for exomes [VS-1375] #8915

Merged
merged 2 commits into from
Jul 23, 2024

Conversation

mcovarr
Copy link
Collaborator

@mcovarr mcovarr commented Jul 16, 2024

Currently two commits here: Megan's commit that is already on master and was approved in #8831, and the other commit with all of my changes for this ticket.

Successful integration run here.

@mcovarr mcovarr changed the title Vs 1375 off target Optionally filter off-target calls for exomes [VS-1375] Jul 17, 2024
@mcovarr mcovarr force-pushed the vs_1375_off_target branch 3 times, most recently from 23dd8c0 to 1bb5f67 Compare July 19, 2024 21:52
@mcovarr mcovarr marked this pull request as ready for review July 19, 2024 22:57
Comment on lines +34 to +35
- `input_vcfs`: `"gs://broad-gotc-test-storage/ExomeGermlineSingleSample/truth/scientific/master/D5327.NA12878/NA12878.rb.g.vcf.gz"`
- `input_vcf_indexes`: `"gs://broad-gotc-test-storage/ExomeGermlineSingleSample/truth/scientific/master/D5327.NA12878/NA12878.rb.g.vcf.gz.tbi"`
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

these moved!

Comment on lines 48 to +53
1. `GvsExtractCallset` workflow
- This workflow is used to extract non-control samples only. See the `GvsCalculatePrecisionAndSensitivity` instructions below if running Precision and Sensitivity on control samples.
- This workflow takes the tables created in the `Prepare` step to output joint VCF shards.
- This workflow will need to be run once to extract the callset into VCF shards and an additional time to calculate Precision and Sensitivity (with `control_samples` set to `true`).
- This workflow is run only to extract the callset into VCF shards (`control_samples` set to `false` explicitly or left at its default `false` value).
- This workflow does not use the Terra Data Entity Model to run, so be sure to select the `Run workflow with inputs defined by file paths` workflow submission option.
- Set the parameter `use_interval_weights` to `false`. This avoids a bug seen in WeightedSplitIntervals when using exomes, forcing it to use the standard version of SplitIntervals.
- Set the `output_gcs_dir` to a GCS location to collect all the VCF shards into one place. If you are running it twice (to calculate Precision and Sensitivity), be sure to provide distinct locations for each.
- Set the `output_gcs_dir` to a GCS location to collect all the VCF shards into one place.
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the current version of the GvsCalculatePrecisionAndSensitivity workflow now includes the cohort extraction, so no separate extraction step for control samples is required anymore.

Comment on lines +96 to +108
Int effective_sample_count = length(read_lines(cohort_sample_names_file))

Int effective_scatter_count = if defined(extract_scatter_count_override) then select_first([extract_scatter_count_override])
else if is_wgs then
if effective_sample_count < 5000 then 1 # This results in 1 VCF per chromosome.
else if effective_sample_count < 20000 then 2000 # Stroke Anderson
else if effective_sample_count < 50000 then 10000
else 20000
else
if effective_sample_count < 5000 then 1 # This results in 1 VCF per chromosome.
else if effective_sample_count < 20000 then 1000
else if effective_sample_count < 50000 then 2500
else 7500
Copy link
Collaborator Author

@mcovarr mcovarr Jul 22, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ridiculously wide scatters during cohort extract have been super annoying for a long time, and in this work they were actually breaking P&S runs when trying to merge thousands of VCFs that were >99% header lines. Simple fixes here largely copy/pasted from the regular extract workflow.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Smart -- we know VCF creation is very much NOT linear, so building our step function this way makes more sense


# allow an interval list to be passed in, but default it to our standard one if no args are here
File working_interval_list = select_first([interval_list, "gs://gcp-public-data--broad-references/hg38/v0/wgs_calling_regions.hg38.noCentromeres.noTelomeres.interval_list"])
File effective_interval_list = select_first([interval_list, "gs://gcp-public-data--broad-references/hg38/v0/wgs_calling_regions.hg38.noCentromeres.noTelomeres.interval_list"])
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

name change for variable to match pretty much all other GVS code

@@ -34,7 +35,7 @@ workflow GvsQuickstartIntegration {
File full_wgs_interval_list = "gs://gcp-public-data--broad-references/hg38/v0/wgs_calling_regions.hg38.noCentromeres.noTelomeres.interval_list"
File full_exome_interval_list = "gs://gcp-public-data--broad-references/hg38/v0/bge_exome_calling_regions.v1.1.interval_list"
String expected_subdir = if (!chr20_X_Y_only) then "all_chrs/" else ""
File expected_output_prefix = "gs://gvs-internal-quickstart/integration/2024-05-23/" + expected_subdir
File expected_output_prefix = "gs://gvs-internal-quickstart/integration/2024-07-03/" + expected_subdir
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated truth is just an FT header line change for most branches, but extensive OUTSIDE_OF_TARGETS filtering changes for the exome branch.

@@ -231,7 +233,7 @@ task AssertIdenticalOutputs {
actual="actual/$vcf"
expected="expected/$vcf"
set +o errexit
cmp <(grep '^#' $actual) <(grep '^#' $expected)
cmp <(grep '^#' $actual | grep -E -v '^##GATKCommandLine=') <(grep '^#' $expected | grep -E -v '^##GATKCommandLine=')
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Running VariantFiltration puts a ##GATKCommandLine header in the headers which has different datetimes and filepaths for every run, so exclude this from consideration.

@@ -393,7 +393,7 @@ else if (vqScoreFilteringType.equals(VQScoreFilteringType.GENOTYPE)) {
}

if (vqScoreFilteringType.equals(VQScoreFilteringType.GENOTYPE)) {
extraHeaderLines.add(new VCFFormatHeaderLine("FT", 1, VCFHeaderLineType.String, "Sample Genotype Filter Field"));
extraHeaderLines.add(VCFStandardHeaderLines.getFormatLine(VCFConstants.GENOTYPE_FILTER_KEY));
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Switch from Variants-proprietary FT header line to the standard one

@@ -114,6 +114,7 @@ public final class VariantFiltration extends VariantWalker {
public static final String CLUSTER_WINDOW_SIZE_LONG_NAME = "cluster-window-size";
public static final String MASK_EXTENSION_LONG_NAME = "mask-extension";
public static final String MASK_NAME_LONG_NAME = "mask-name";
public static final String MASK_DESCRIPTION_LONG_NAME = "mask-description";
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As of now everything from here on down is from Megan's PR

@mcovarr mcovarr force-pushed the vs_1375_off_target branch 2 times, most recently from 335bb6f to f3c6ad2 Compare July 22, 2024 18:06
@rsasch
Copy link

rsasch commented Jul 23, 2024

Is there a successful integration run?

-O ~{output_file} \
-V ${pre_off_target_vcf}
fi

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we not just give an interval list to the GATK extract itself?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this is the same interval list we want to use there, plus the VariantFiltration tool writes the specific OUTSIDE_OF_TARGETS genotype filter we want with the VCF header and everything.

Int effective_scatter_count = if defined(extract_scatter_count_override) then select_first([extract_scatter_count_override])
else if is_wgs then
if effective_sample_count < 5000 then 1 # This results in 1 VCF per chromosome.
else if effective_sample_count < 20000 then 2000 # Stroke Anderson
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can retire the Stroke Anderson comment since the data has long been taken

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

😭

Comment on lines +96 to +108
Int effective_sample_count = length(read_lines(cohort_sample_names_file))

Int effective_scatter_count = if defined(extract_scatter_count_override) then select_first([extract_scatter_count_override])
else if is_wgs then
if effective_sample_count < 5000 then 1 # This results in 1 VCF per chromosome.
else if effective_sample_count < 20000 then 2000 # Stroke Anderson
else if effective_sample_count < 50000 then 10000
else 20000
else
if effective_sample_count < 5000 then 1 # This results in 1 VCF per chromosome.
else if effective_sample_count < 20000 then 1000
else if effective_sample_count < 50000 then 2500
else 7500
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Smart -- we know VCF creation is very much NOT linear, so building our step function this way makes more sense

Copy link
Collaborator

@kbergin kbergin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm going to want to put a few updates into the documentation regarding BGE but that can be a separate PR.

Note I only reviewed documentation changes and had one suggestion.

Note that, by default, the Exomes Beta workflow uses the Blended Genomes Interval List: `gs://gcp-public-data--broad-references/hg38/v0/bge_exome_calling_regions.v1.1.interval_list` for its calling regions. For information on how that interval list was generated, see `gs://gcp-public-data--broad-references/hg38/v0/bge_exome_calling_regions.v1.1.interval_list.README.md`.
Note that, by default, the Exomes Beta workflow uses two interval lists by default:
1. The calling region interval list defaults to the Blended Genomes Interval List: `gs://gcp-public-data--broad-references/hg38/v0/bge_exome_calling_regions.v1.1.interval_list` . For information on how that interval list was generated, see `gs://gcp-public-data--broad-references/hg38/v0/bge_exome_calling_regions.v1.1.interval_list.README.md`.
1. The target region interval list defaults to `gs://broad-gotc-test-storage/JointGenotyping/inputs/scientific/bge/TwistAllianceClinicalResearchExome_Covered_Targets_hg38.interval_list`. Any variants called outside these interval will be *genotype* filtered with `OUTSIDE_OF_TARGETS`.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This would need to be updated later when it's public right?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes

@mcovarr
Copy link
Collaborator Author

mcovarr commented Jul 23, 2024

Is there a successful integration run?

yes here

meganshand and others added 2 commits July 23, 2024 17:40
…in VCF header (#8831)

Added a --mask-description argument to VariantFiltration to write a custom description for the mask filter in the VCF header
@mcovarr mcovarr force-pushed the vs_1375_off_target branch from 2e5b438 to 00b7c69 Compare July 23, 2024 21:41
@mcovarr mcovarr merged commit 00b7c69 into ah_var_store Jul 23, 2024
5 of 16 checks passed
@mcovarr mcovarr deleted the vs_1375_off_target branch July 23, 2024 21:41
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants