-
Notifications
You must be signed in to change notification settings - Fork 75
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Update GermlineCNVCaller to gatk 4.1.8.1 #58
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @TedBrookings this is looking really good. I have some questions/comments below.
@@ -140,11 +142,13 @@ workflow Module00c { | |||
|
|||
# Runtime parameters | |||
String sv_base_mini_docker | |||
String sv_base_docker |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This input is breaking the single-sample wdl. Can you run scripts/test/validate.sh
? It'll tell you which dependent WDLs you need to update. I think you should also run the single-sample pipeline json test/single-sample/GATKSVPipelineSingleSampleTest.test_na19240.json
to make sure everything works okay.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just remembered to remind you to update the gCNV model for the single-sample run
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I ran the single-sample pipeline. There were some problems with the bincov matrix which I corrected. It ran through until Module05_06 where it crashed in CleanVcf1a due to my not having rebuilt the sv-pipeline docker after Chris and I made incompatible changes to it (the WDL is looking for the output "includelist.txt" but my old docker is providing whitelist.txt). I'm rebuilding it now because it sounds like there's immediate interest in running this.
@@ -90,10 +91,14 @@ task BAFTest { | |||
} | |||
} | |||
|
|||
Int disk_gb = disk_gb_baseline + ceil( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How many / what size batches did you use for the scaling regressions?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I didn't do any scaling regressions. I had (just a few) shards crash from out-of-disk errors, so I just added up the size of the input files. I think a few shards must wind up taking down a significant fraction of the baf_metrics file, because when I forgot to double it (for the 2x tabix) the shards still crashed. Incidentally, this is why I was removing those second tabix files in so many tasks. You didn't ask me to undo that for this task: should I assume it needs to be removed here too?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, I spoke too soon. PETest and SRTest were having rare out-of-disk crashes, and I did have crashes in BAFTest as well. So I just implemented the same disk scaling in every test (PE, SR, RD, BAF). I just downloaded my monitoring logs and at least in my panel, BAFTest never used much of its disk. Given that it's 27 GiB on my 500 sample panel, I think it's reasonable leaving this in for safety sake. That said, it could probably could be removed for BAFTest (and RDTest as well).
It turns out BAFTest was crashing out of memory on one shard. Since I didn't have much to go on for scaling the memory usage of BAFTest, I just passed a runtime_attr override to my input json.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can revisit this once #51 goes in
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also yes delete the rm
statement below
@@ -98,7 +98,7 @@ workflow CleanVcf { | |||
input: | |||
whole_file=CleanVcf1a.include_list[0], | |||
lines_per_shard=samples_per_step2_shard, | |||
shard_prefix="whiteblack.", | |||
shard_prefix="includeexclude.", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for catching these
| awk '{print "--input "$0}' \ | ||
> read_count_files.args | ||
|
||
function get_seeded_random() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you open a quick ticket to remind us this needs to be replaced with the changes in Andrey's branch?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added Issue #59 . I tried to assign Andrey but for whatever reason Github wouldn't select him. Maybe he's not set up to edit our repo? For now I assigned myself.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just added Andrey to the dev list
I'm about to get started on the single-sample branch. I can't believe I forgot to update that! Thanks also for pointing out the validation script which I was unaware of. Quick question: should I remove my added line deleting of the second tabix file on RDTestChromosome.wdl::RDTest and BAFTestChromosome.wdl::BAFTest? |
Yes, I missed those |
7c1ce0e
to
77bfceb
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Everything is looking good aside from the whitelist naming issue. I think it's safe to merge when the batch and single-sample tests complete.
9a769a6
to
9e723bd
Compare
* Updated gcnv to latest version for cohort and case mode * Addressed small risk of NaN errors by try-catch in task (bash) and reproducibly scrambling sample order to get different random samples * Decreased wall-clock time of MakeBincovMatrix and call only once * Fixed crashes and improved output of RawVcfQc * Updated VM scaling to prevent crashes with larger cohorts * Updated dockers and input jsons to work with new gcnv inputs and script changes * Added allowNestedInputs=true to single sample pipeline
9e723bd
to
2f522d7
Compare
reproducibly scrambling sample order to get different random samples