Re-organization of gCNV WDL output for more efficient post-processing #4397

Open
mbabadi opened this issue Feb 13, 2018 · 12 comments

mbabadi commented Feb 13, 2018

Since PostprocessGermlineCNVCalls processes individual samples, it makes sense to reorganize gCNV calls into separate tarballs for each sample in the COHORT WDL (vs. one tarball containing the calls for all samples in the shard), i.e.:

   shard_0_model.tar.gz
   shard_0_calls_sample_0.tar.gz
   shard_0_calls_sample_1.tar.gz
   shard_0_calls_sample_2.tar.gz
   ...
   shard_1_model.tar.gz
   shard_1_calls_sample_0.tar.gz
   shard_1_calls_sample_1.tar.gz
   shard_1_calls_sample_2.tar.gz
   ...

It is crucial to preserve the directory structure for gcnvkernel Viterbi segmenter, e.g. it looks for CALLS_SHARD_ROOT/SAMPLE_<sample_index>/log_q_c_tc.tsv, as well as the common files in the root, e.g. CALLS_SHARD_ROOT/interval_list.tsv, CALLS_SHARD_ROOT/gcnvkernel_version.json, etc.

For the record, ~200 WES samples x 10,000 targets per shard produces a ~150 MB calls tarball and a ~3.5 MB model tarball (per shard). Localizing all calls requires ~2.9 GB of data transfer for single-sample post-processing (vs. ~15 MB if one only moves shard_x_calls_sample_y.tar.gz for x = 0, 1, ...).
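The per-sample repackaging described above could be sketched roughly as follows. This is an illustrative helper, not the actual WDL task: the function name and directory-layout assumptions (a SAMPLE_&lt;i&gt; subdirectory per sample, plus root-level common files like interval_list.tsv and gcnvkernel_version.json) are taken from the comment above. Each per-sample tarball bundles the common files together with that sample's directory, so the layout the gcnvkernel Viterbi segmenter expects is preserved.

```python
import os
import tarfile


def split_calls_by_sample(calls_root: str, out_dir: str, shard_index: int) -> list:
    """Repackage an extracted calls-shard directory into per-sample tarballs.

    Hypothetical sketch: each output tarball contains the root-level common
    files (e.g. interval_list.tsv, gcnvkernel_version.json) alongside a single
    SAMPLE_<i> subdirectory, preserving the CALLS_SHARD_ROOT layout.
    """
    entries = sorted(os.listdir(calls_root))
    common = [e for e in entries if not e.startswith("SAMPLE_")]
    samples = [e for e in entries if e.startswith("SAMPLE_")]
    outputs = []
    for sample_dir in samples:
        sample_index = sample_dir.split("_")[1]
        out_path = os.path.join(
            out_dir, f"shard_{shard_index}_calls_sample_{sample_index}.tar.gz")
        with tarfile.open(out_path, "w:gz") as tar:
            # common files go at the archive root; the sample directory is
            # added recursively under its original name
            for entry in common + [sample_dir]:
                tar.add(os.path.join(calls_root, entry), arcname=entry)
        outputs.append(out_path)
    return outputs
```

With per-sample tarballs like these, single-sample post-processing only needs to localize one small archive per shard rather than the whole shard's calls.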

@samuelklee

@asmirnov239 can I ask you to take care of this at some point? Might make sense to address it before you get too far with #4738.

@samuelklee

@mwalker174 also mentioned to me that he's adjusted the scattering strategy so that he can run multiple blocks in one shard. If we can update the WDL with both fixes, that would be great.

@mwalker174

Yes: for WGS, 250bp bins give 10.8M intervals, which at 10k intervals per shard requires ~1,000 tasks, and cloud quotas start becoming a problem (I would prefer to run at 1k intervals per block). I'm currently working on running multiple blocks in one task, basically by scattering the ScatterIntervals task by shard and then looping over an array of interval lists in GermlineCNVCallerCohortMode to run multiple blocks.
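The block-grouping arithmetic can be sketched as follows (a hypothetical helper, not the actual WDL scattering code): partition the intervals into fixed-size blocks, then group consecutive blocks into scatter tasks, so each task loops over several blocks' interval lists.

```python
def group_blocks_into_tasks(num_intervals: int,
                            intervals_per_block: int,
                            blocks_per_task: int) -> list:
    """Partition intervals into half-open (start, end) blocks, then group
    consecutive blocks into scatter tasks. Illustrative sketch only."""
    blocks = [(start, min(start + intervals_per_block, num_intervals))
              for start in range(0, num_intervals, intervals_per_block)]
    return [blocks[i:i + blocks_per_task]
            for i in range(0, len(blocks), blocks_per_task)]
```

For example, with 10.8M intervals at 1k intervals per block and 10 blocks per task, this yields 1,080 tasks instead of the 10,800 a one-block-per-task scatter would require.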

@samuelklee

OK. However, don't forget that the denoising model is fit independently in each block. So introducing too many blocks could cause overfitting, in a sense. Also, you want to make sure that you have enough bins in each block to learn the model. 10k seems safe, but I'd spot check results first if you want to go down to 1k.

@mwalker174

I will try that as well. I just finished building a PoN at 250bp bin size with 1k intervals per block. This produces ~10k models and the PostprocessGermlineCNVCalls WDL task gets us the following error from Cromwell:

The task run request has exceeded the maximum PAPI request size. If you have a task with a very large number of inputs and / or outputs in your workflow you should try
to reduce it. Depending on your case you could: 1) Zip your input files together and unzip them in the command. 2) Use a file of file names and localize the files yourself.

Who knew? So, we are also going to have to modify PostprocessGermlineCNVCalls and the case mode calling task to accept a tar archive containing all the models. @samuelklee @mbabadi Let me know if you have any opinions on this.
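The model-bundling workaround (option 1 in the Cromwell error above) could look something like the following sketch; bundle_models is a hypothetical name, and the real fix would live in the WDL tasks themselves. The idea is that the task takes a single archive as input instead of ~10k individual model tarballs, keeping the PAPI request small, and unpacks the bundle in its command.

```python
import os
import tarfile


def bundle_models(model_tarballs: list, bundle_path: str) -> str:
    """Pack per-block model tarballs into one uncompressed archive.

    Hypothetical sketch: the members are already gzipped, so the outer
    archive is written uncompressed; the consuming task would untar the
    bundle in its command before running post-processing.
    """
    with tarfile.open(bundle_path, "w") as bundle:
        for path in model_tarballs:
            bundle.add(path, arcname=os.path.basename(path))
    return bundle_path
```

A file-of-file-names approach (option 2 in the error message) would be the alternative if re-tarring ~10k model archives proves too slow.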

@samuelklee

Oof, that's good to know as well. Actually, can I ask why you are running at 1k intervals per block? That seems a little on the thin side to me. I think @mbabadi's example of ~200 samples x ~10000 bins is the typical use case we expected, either for WGS or WES.

@mwalker174

I was hoping it would train faster but I'm not sure that turns out to be the case.

@samuelklee

Yes, I'm not sure I would expect that to be true.

My guess is that we continue to run long after suitable convergence with the current inference parameters (which we have not had a chance to optimize over for runtime). I think we can probably afford to be less conservative. @mbabadi do you have a better handle on this?

@samuelklee

@asmirnov239 @mwalker174 @vruano I'm not sure if we have completely addressed this with what's been merged so far. If any of you still has fixes that haven't been merged, please feel free to add a note here.

@samuelklee

@asmirnov239 I think that some of the optimizations that @vruano made to the postprocessing step concern the config JSONs, gCNV version, and interval list output added in #5176. These take a lot of time to localize when the number of shards is large but, aside from the interval list, aren't really used for anything, correct?

Were these just added for debugging purposes, or for reproducibility/provenance? Let's revisit whether it's necessary to pass these files on when we merge @vruano's changes into the canonical WDL.


samuelklee commented Aug 17, 2021

I think issues with gCNV postprocessing initially stemmed from things like 1) taking large sample x shard transposes (which Cromwell may have had trouble with at some point), 2) localizing more files than necessary for each sample, 3) the introduction of lexicographical globbing bugs, etc. Various people have tried to address this via things like bundling, which we ended up reverting in favor of transposing. Unfortunately, I'm afraid I've lost the plot on this after so long (and I'm not sure I ever had it): is there any documentation elsewhere (e.g., of how bundling improved performance) that might help us decide what's left to be done? It seems I had some more coherent thoughts in #6607 (comment).

As before, if any remaining issues like the call caching mentioned above would best be addressed at the Cromwell level, I think it's at least worth an investigation of what might be involved.

@mwalker174

Most of the scaling issues in Cromwell/Terra have been resolved. Terra still has limitations on workflow metadata size, and passing long file arrays to every task in large scatters (e.g. the full list of counts files is passed into every gCNV shard) can limit our batch sizes for workflows that embed gCNV (e.g. GatherBatchEvidence in gatk-sv). gCNV workflows also don't call-cache reliably (presumably due to timeouts), again probably because of the large file arrays, including 2D arrays.
