-
Notifications
You must be signed in to change notification settings - Fork 594
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Optionally filter off-target calls for exomes [VS-1375] #8915
Conversation
23dd8c0
to
1bb5f67
Compare
- `input_vcfs`: `"gs://broad-gotc-test-storage/ExomeGermlineSingleSample/truth/scientific/master/D5327.NA12878/NA12878.rb.g.vcf.gz"` | ||
- `input_vcf_indexes`: `"gs://broad-gotc-test-storage/ExomeGermlineSingleSample/truth/scientific/master/D5327.NA12878/NA12878.rb.g.vcf.gz.tbi"` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
these moved!
1. `GvsExtractCallset` workflow | ||
- This workflow is used to extract non-control samples only. See the `GvsCalculatePrecisionAndSensitivity` instructions below if running Precision and Sensitivity on control samples. | ||
- This workflow takes the tables created in the `Prepare` step to output joint VCF shards. | ||
- This workflow will need to be run once to extract the callset into VCF shards and an additional time to calculate Precision and Sensitivity (with `control_samples` set to `true`). | ||
- This workflow is run only to extract the callset into VCF shards (`control_samples` set to `false` explicitly or left at its default `false` value). | ||
- This workflow does not use the Terra Data Entity Model to run, so be sure to select the `Run workflow with inputs defined by file paths` workflow submission option. | ||
- Set the parameter `use_interval_weights` to `false`. This avoids a bug seen in WeightedSplitIntervals when using exomes, forcing it to use the standard version of SplitIntervals. | ||
- Set the `output_gcs_dir` to a GCS location to collect all the VCF shards into one place. If you are running it twice (to calculate Precision and Sensitivity), be sure to provide distinct locations for each. | ||
- Set the `output_gcs_dir` to a GCS location to collect all the VCF shards into one place. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the current version of the GvsCalculatePrecisionAndSensitivity
workflow now includes the cohort extraction, so no separate extraction step for control samples is required anymore.
Int effective_sample_count = length(read_lines(cohort_sample_names_file)) | ||
|
||
Int effective_scatter_count = if defined(extract_scatter_count_override) then select_first([extract_scatter_count_override]) | ||
else if is_wgs then | ||
if effective_sample_count < 5000 then 1 # This results in 1 VCF per chromosome. | ||
else if effective_sample_count < 20000 then 2000 # Stroke Anderson | ||
else if effective_sample_count < 50000 then 10000 | ||
else 20000 | ||
else | ||
if effective_sample_count < 5000 then 1 # This results in 1 VCF per chromosome. | ||
else if effective_sample_count < 20000 then 1000 | ||
else if effective_sample_count < 50000 then 2500 | ||
else 7500 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ridiculously wide scatters during cohort extract have been super annoying for a long time, and in this work they were actually breaking P&S runs when trying to merge thousands of VCFs that were >99% header lines. Simple fixes here largely copy/pasted from the regular extract workflow.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Smart -- we know VCF creation is very much NOT linear, so building our step function this way makes more sense
|
||
# allow an interval list to be passed in, but default it to our standard one if no args are here | ||
File working_interval_list = select_first([interval_list, "gs://gcp-public-data--broad-references/hg38/v0/wgs_calling_regions.hg38.noCentromeres.noTelomeres.interval_list"]) | ||
File effective_interval_list = select_first([interval_list, "gs://gcp-public-data--broad-references/hg38/v0/wgs_calling_regions.hg38.noCentromeres.noTelomeres.interval_list"]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
name change for variable to match pretty much all other GVS code
@@ -34,7 +35,7 @@ workflow GvsQuickstartIntegration { | |||
File full_wgs_interval_list = "gs://gcp-public-data--broad-references/hg38/v0/wgs_calling_regions.hg38.noCentromeres.noTelomeres.interval_list" | |||
File full_exome_interval_list = "gs://gcp-public-data--broad-references/hg38/v0/bge_exome_calling_regions.v1.1.interval_list" | |||
String expected_subdir = if (!chr20_X_Y_only) then "all_chrs/" else "" | |||
File expected_output_prefix = "gs://gvs-internal-quickstart/integration/2024-05-23/" + expected_subdir | |||
File expected_output_prefix = "gs://gvs-internal-quickstart/integration/2024-07-03/" + expected_subdir |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
updated truth is just an FT
header line change for most branches, but extensive OUTSIDE_OF_TARGETS
filtering changes for the exome branch.
@@ -231,7 +233,7 @@ task AssertIdenticalOutputs { | |||
actual="actual/$vcf" | |||
expected="expected/$vcf" | |||
set +o errexit | |||
cmp <(grep '^#' $actual) <(grep '^#' $expected) | |||
cmp <(grep '^#' $actual | grep -E -v '^##GATKCommandLine=') <(grep '^#' $expected | grep -E -v '^##GATKCommandLine=') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Running VariantFiltration
puts a ##GATKCommandLine
header in the headers which has different datetimes and filepaths for every run, so exclude this from consideration.
@@ -393,7 +393,7 @@ else if (vqScoreFilteringType.equals(VQScoreFilteringType.GENOTYPE)) { | |||
} | |||
|
|||
if (vqScoreFilteringType.equals(VQScoreFilteringType.GENOTYPE)) { | |||
extraHeaderLines.add(new VCFFormatHeaderLine("FT", 1, VCFHeaderLineType.String, "Sample Genotype Filter Field")); | |||
extraHeaderLines.add(VCFStandardHeaderLines.getFormatLine(VCFConstants.GENOTYPE_FILTER_KEY)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Switch from Variants-proprietary FT
header line to the standard one
@@ -114,6 +114,7 @@ public final class VariantFiltration extends VariantWalker { | |||
public static final String CLUSTER_WINDOW_SIZE_LONG_NAME = "cluster-window-size"; | |||
public static final String MASK_EXTENSION_LONG_NAME = "mask-extension"; | |||
public static final String MASK_NAME_LONG_NAME = "mask-name"; | |||
public static final String MASK_DESCRIPTION_LONG_NAME = "mask-description"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As of now everything from here on down is from Megan's PR
335bb6f
to
f3c6ad2
Compare
Is there a successful integration run? |
-O ~{output_file} \ | ||
-V ${pre_off_target_vcf} | ||
fi | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we not just give an interval list to the GATK extract itself?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this is the same interval list we want to use there, plus the VariantFiltration
tool writes the specific OUTSIDE_OF_TARGETS
genotype filter we want with the VCF header and everything.
Int effective_scatter_count = if defined(extract_scatter_count_override) then select_first([extract_scatter_count_override]) | ||
else if is_wgs then | ||
if effective_sample_count < 5000 then 1 # This results in 1 VCF per chromosome. | ||
else if effective_sample_count < 20000 then 2000 # Stroke Anderson |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we can retire the Stroke Anderson comment since the data has long been taken
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
😭
Int effective_sample_count = length(read_lines(cohort_sample_names_file)) | ||
|
||
Int effective_scatter_count = if defined(extract_scatter_count_override) then select_first([extract_scatter_count_override]) | ||
else if is_wgs then | ||
if effective_sample_count < 5000 then 1 # This results in 1 VCF per chromosome. | ||
else if effective_sample_count < 20000 then 2000 # Stroke Anderson | ||
else if effective_sample_count < 50000 then 10000 | ||
else 20000 | ||
else | ||
if effective_sample_count < 5000 then 1 # This results in 1 VCF per chromosome. | ||
else if effective_sample_count < 20000 then 1000 | ||
else if effective_sample_count < 50000 then 2500 | ||
else 7500 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Smart -- we know VCF creation is very much NOT linear, so building our step function this way makes more sense
scripts/variantstore/beta_docs/gvs-exome-workspace-description.md
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm going to want to put a few updates into the documentation regarding BGE but that can be a separate PR.
Note I only reviewed documentation changes and had one suggestion.
Note that, by default, the Exomes Beta workflow uses the Blended Genomes Interval List: `gs://gcp-public-data--broad-references/hg38/v0/bge_exome_calling_regions.v1.1.interval_list` for its calling regions. For information on how that interval list was generated, see `gs://gcp-public-data--broad-references/hg38/v0/bge_exome_calling_regions.v1.1.interval_list.README.md`. | ||
Note that, by default, the Exomes Beta workflow uses two interval lists by default: | ||
1. The calling region interval list defaults to the Blended Genomes Interval List: `gs://gcp-public-data--broad-references/hg38/v0/bge_exome_calling_regions.v1.1.interval_list` . For information on how that interval list was generated, see `gs://gcp-public-data--broad-references/hg38/v0/bge_exome_calling_regions.v1.1.interval_list.README.md`. | ||
1. The target region interval list defaults to `gs://broad-gotc-test-storage/JointGenotyping/inputs/scientific/bge/TwistAllianceClinicalResearchExome_Covered_Targets_hg38.interval_list`. Any variants called outside these interval will be *genotype* filtered with `OUTSIDE_OF_TARGETS`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This would need to be updated later when it's public right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes
yes here |
…in VCF header (#8831) Added a --mask-description argument to VariantFiltration to write a custom description for the mask filter in the VCF header
2e5b438
to
00b7c69
Compare
Currently two commits here: Megan's commit that is already on
master
and was approved in #8831, and the other commit with all of my changes for this ticket.Successful integration run here.