Re-organization of gCNV WDL output for more efficient post-processing #4397
@asmirnov239 can I ask you to take care of this at some point? Might make sense to address it before you get too far with #4738.
@mwalker174 also mentioned to me that he's adjusted the scattering strategy so that he can run multiple blocks in one shard. If we can update the WDL with both fixes, that would be great.
Yes, for WGS, 250bp bins give 10.8M intervals, which at 10k intervals per shard requires ~1,000 tasks, and cloud quotas start becoming a problem (I would prefer to run at 1k intervals per block). I'm currently working on running multiple blocks in one task, basically by scattering the ScatterIntervals task by shard and then looping over an array of interval lists in GermlineCNVCallerCohortMode to run multiple blocks.
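For concreteness, here is a minimal Python sketch (not the actual WDL) of the grouping being described: per-block interval lists are chunked so that each scattered task processes several blocks. The names `block_interval_lists` and `blocks_per_task` are illustrative, not GATK or WDL parameters.

```python
from typing import List


def group_blocks_into_shards(block_interval_lists: List[str],
                             blocks_per_task: int) -> List[List[str]]:
    """Chunk per-block interval lists so that one scattered task runs several blocks."""
    return [block_interval_lists[i:i + blocks_per_task]
            for i in range(0, len(block_interval_lists), blocks_per_task)]


if __name__ == "__main__":
    # e.g. ~10.8M intervals at 1k intervals per block -> ~10,800 blocks;
    # grouping 10 blocks per task keeps the scatter width near 1,000.
    blocks = [f"block_{i}.interval_list" for i in range(10_800)]
    shards = group_blocks_into_shards(blocks, blocks_per_task=10)
    print(len(shards))  # ~1,080 scattered tasks instead of ~10,800
```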
OK. However, don't forget that the denoising model is fit independently in each block. So introducing too many blocks could cause overfitting, in a sense. Also, you want to make sure that you have enough bins in each block to learn the model. 10k seems safe, but I'd spot-check results first if you want to go down to 1k.
I will try that as well. I just finished building a PoN at 250bp bin size with 1k intervals per block. This produces ~10k models, and the PostprocessGermlineCNVCalls WDL task then fails with an error from Cromwell.
Who knew? So we are also going to have to modify PostprocessGermlineCNVCalls and the case-mode calling task to accept a tar archive containing all the models. @samuelklee @mbabadi Let me know if you have any opinions on this.
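A minimal sketch, assuming the per-shard model outputs are available as local directories, of how they could be bundled into a single tar archive so that the postprocessing task receives one file instead of thousands of separate inputs. The directory naming and function name here are illustrative, not the actual workflow outputs.

```python
import tarfile
from pathlib import Path


def bundle_models(model_dirs, output_tar="gcnv_models.tar.gz"):
    """Pack every shard model directory into one gzipped tarball,
    keeping each directory's name as its top-level entry."""
    with tarfile.open(output_tar, "w:gz") as tar:
        for model_dir in model_dirs:
            model_dir = Path(model_dir)
            tar.add(model_dir, arcname=model_dir.name)
    return output_tar


if __name__ == "__main__":
    # Hypothetical local naming for per-shard model directories.
    shard_models = sorted(Path(".").glob("shard_*-model"))
    print(bundle_models(shard_models))
```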
Oof, that's good to know as well. Actually, can I ask why you are running at 1k intervals per block? That seems a little on the thin side to me. I think @mbabadi's example of ~200 samples x ~10,000 bins is the typical use case we expected, either for WGS or WES.
I was hoping it would train faster, but I'm not sure that turns out to be the case.
Yes, I'm not sure I would expect that to be true. My guess is that we continue to run long after suitable convergence with the current inference parameters (which we have not had a chance to optimize for runtime). I think we can probably afford to be less conservative. @mbabadi do you have a better handle on this?
@asmirnov239 @mwalker174 @vruano I'm not sure if we have completely addressed this with what's been merged so far. If any of you still has fixes that haven't been merged, please feel free to add a note here.
@asmirnov239 I think that some of the optimizations that @vruano made to the postprocessing step concern the config JSONs, gCNV version, and interval list output added in #5176. These take a lot of time to localize when the number of shards is large but, aside from the interval list, aren't really used for anything, correct? Were these just added for debugging purposes, or for reproducibility/provenance? Let's revisit whether it's necessary to pass these files on when we merge @vruano's changes into the canonical WDL.
I think issues with gCNV postprocessing initially stemmed from things like 1) taking large sample x shard transposes (which Cromwell may have had trouble with at some point), 2) localizing more files than necessary for each sample, 3) introduction of lexicographical globbing bugs, etc. I think various people have tried to address this via things like bundling, which we ended up reverting in favor of transposing. Unfortunately, I'm afraid I've lost the plot on this after so long, and I'm not sure I ever had it. Is there any documentation (e.g., of things like how bundling improved performance) elsewhere that might help us decide what's left to be done? Seems like I had some more coherent thoughts in #6607 (comment). As before, if any remaining issues like the call caching mentioned above would best be addressed at the Cromwell level, I think it's at least worth investigating what might be involved.
Most of the scaling issues in Cromwell/Terra have been resolved. Terra still has limitations on workflow metadata size, and passing long file arrays to every task in large scatters (i.e. the full list of counts files is passed into every gCNV shard) can limit our batch sizes for workflows that embed gCNV (e.g. GatherBatchEvidence in gatk-sv). gCNV workflows also don't call-cache reliably (presumably due to timeouts), probably again because of the large file arrays, including 2D arrays.
Since `PostprocessGermlineCNVCalls` processes individual samples, it makes sense to reorganize gCNV calls into separate tarballs for each sample in the COHORT WDL (vs. one tarball containing calls for all samples in the shard), i.e. per-sample tarballs along the lines of the sketch below. It is crucial to preserve the directory structure for the `gcnvkernel` Viterbi segmenter, e.g. it looks for `CALLS_SHARD_ROOT/SAMPLE_<sample_index>/log_q_c_tc.tsv`, as well as the common files in the root, e.g. `CALLS_SHARD_ROOT/interval_list.tsv`, `CALLS_SHARD_ROOT/gcnvkernel_version.json`, etc. For the record, ~200 WES samples x 10,000 targets per shard produce a ~150 MB calls tarball and a ~3.5 MB model tarball (per shard). Localizing all calls requires ~2.9 GB of data transfer for single-sample post-processing (vs. ~15 MB if one only moves `shard_x_calls_sample_y.tar.gz`, x = 0, 1, ...).
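The proposed repackaging could look roughly like the following Python sketch, which writes one tarball per sample while preserving the `SAMPLE_<sample_index>/` subdirectory and the common root files that the `gcnvkernel` Viterbi segmenter expects. The function name and output naming are illustrative; only the directory layout follows the description above.

```python
import tarfile
from pathlib import Path


def split_calls_by_sample(calls_shard_root: str, shard_index: int, out_dir: str = "."):
    """Repackage one shard's calls directory into per-sample tarballs,
    each keeping the common root files plus that sample's subdirectory."""
    root = Path(calls_shard_root)
    # Common files (e.g. interval_list.tsv, gcnvkernel_version.json) that every
    # per-sample tarball must carry in its root.
    common_files = [p for p in root.iterdir() if p.is_file()]
    for sample_dir in sorted(root.glob("SAMPLE_*")):
        sample_index = sample_dir.name.split("_")[1]
        out_tar = Path(out_dir) / f"shard_{shard_index}_calls_sample_{sample_index}.tar.gz"
        with tarfile.open(out_tar, "w:gz") as tar:
            for f in common_files:
                tar.add(f, arcname=f.name)                # root-level shared files
            tar.add(sample_dir, arcname=sample_dir.name)  # SAMPLE_<index>/log_q_c_tc.tsv, ...


if __name__ == "__main__":
    split_calls_by_sample("calls_shard_0", shard_index=0)
```

With a layout like this, single-sample post-processing only needs to localize one small per-sample tarball from each shard instead of every shard's full calls tarball, which is what drives the ~2.9 GB vs. ~15 MB difference quoted above.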