Re-organization of gCNV WDL output for more efficient post-processing #4397

Open
mbabadi opened this issue Feb 13, 2018 · 12 comments

mbabadi commented Feb 13, 2018

Since PostprocessGermlineCNVCalls processes individual samples, it makes sense to reorganize gCNV calls into separate tarballs for each sample in the COHORT WDL (vs. one tarball containing the calls for all samples in the shard), i.e.:

   shard_0_model.tar.gz
   shard_0_calls_sample_0.tar.gz
   shard_0_calls_sample_1.tar.gz
   shard_0_calls_sample_2.tar.gz
   ...
   shard_1_model.tar.gz
   shard_1_calls_sample_0.tar.gz
   shard_1_calls_sample_1.tar.gz
   shard_1_calls_sample_2.tar.gz
   ...

It is crucial to preserve the directory structure for gcnvkernel Viterbi segmenter, e.g. it looks for CALLS_SHARD_ROOT/SAMPLE_<sample_index>/log_q_c_tc.tsv, as well as the common files in the root, e.g. CALLS_SHARD_ROOT/interval_list.tsv, CALLS_SHARD_ROOT/gcnvkernel_version.json, etc.

For the record, ~200 WES samples x 10,000 targets per shard produces a ~150 MB calls tarball and a ~3.5 MB model tarball (per shard). Localizing all calls requires ~2.9 GB of data transfer for single-sample post-processing (vs. ~15 MB if one only moves shard_x_calls_sample_y.tar.gz for x = 0, 1, ...).
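The per-sample repackaging described above could be sketched roughly as follows. This is an illustrative helper, not the actual WDL task: the function name and directory-layout assumptions (a SAMPLE_&lt;i&gt; subdirectory per sample, plus root-level common files like interval_list.tsv and gcnvkernel_version.json) are taken from the comment above. Each per-sample tarball bundles the common files together with that sample's directory, so the layout the gcnvkernel Viterbi segmenter expects is preserved.

```python
import os
import tarfile


def split_calls_by_sample(calls_root: str, out_dir: str, shard_index: int) -> list:
    """Repackage an extracted calls-shard directory into per-sample tarballs.

    Hypothetical sketch: each output tarball contains the root-level common
    files (e.g. interval_list.tsv, gcnvkernel_version.json) alongside a single
    SAMPLE_<i> subdirectory, preserving the CALLS_SHARD_ROOT layout.
    """
    entries = sorted(os.listdir(calls_root))
    common = [e for e in entries if not e.startswith("SAMPLE_")]
    samples = [e for e in entries if e.startswith("SAMPLE_")]
    outputs = []
    for sample_dir in samples:
        sample_index = sample_dir.split("_")[1]
        out_path = os.path.join(
            out_dir, f"shard_{shard_index}_calls_sample_{sample_index}.tar.gz")
        with tarfile.open(out_path, "w:gz") as tar:
            # common files go at the archive root; the sample directory is
            # added recursively under its original name
            for entry in common + [sample_dir]:
                tar.add(os.path.join(calls_root, entry), arcname=entry)
        outputs.append(out_path)
    return outputs
```

With per-sample tarballs like these, single-sample post-processing only needs to localize one small archive per shard rather than the whole shard's calls.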

@samuelklee

@asmirnov239 can I ask you to take care of this at some point? Might make sense to address it before you get too far with #4738.

@samuelklee

@mwalker174 also mentioned to me that he's adjusted the scattering strategy so that he can run multiple blocks in one shard. If we can update the WDL with both fixes, that would be great.

@mwalker174

Yes: for WGS, 250bp bins give 10.8M intervals, which at 10k intervals per shard requires ~1,000 tasks, and cloud quotas start becoming a problem (I would prefer to run at 1k intervals per block). I'm currently working on running multiple blocks in one task, basically by scattering the ScatterIntervals task by shard and then looping over an array of interval lists in GermlineCNVCallerCohortMode to run multiple blocks.
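The block-grouping arithmetic can be sketched as follows (a hypothetical helper, not the actual WDL scattering code): partition the intervals into fixed-size blocks, then group consecutive blocks into scatter tasks, so each task loops over several blocks' interval lists.

```python
def group_blocks_into_tasks(num_intervals: int,
                            intervals_per_block: int,
                            blocks_per_task: int) -> list:
    """Partition intervals into half-open (start, end) blocks, then group
    consecutive blocks into scatter tasks. Illustrative sketch only."""
    blocks = [(start, min(start + intervals_per_block, num_intervals))
              for start in range(0, num_intervals, intervals_per_block)]
    return [blocks[i:i + blocks_per_task]
            for i in range(0, len(blocks), blocks_per_task)]
```

For example, with 10.8M intervals at 1k intervals per block and 10 blocks per task, this yields 1,080 tasks instead of the 10,800 a one-block-per-task scatter would require.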

@samuelklee

OK. However, don't forget that the denoising model is fit independently in each block. So introducing too many blocks could cause overfitting, in a sense. Also, you want to make sure that you have enough bins in each block to learn the model. 10k seems safe, but I'd spot check results first if you want to go down to 1k.

@mwalker174

I will try that as well. I just finished building a PoN at 250bp bin size with 1k intervals per block. This produces ~10k models and the PostprocessGermlineCNVCalls WDL task gets us the following error from Cromwell:

The task run request has exceeded the maximum PAPI request size. If you have a task with a very large number of inputs and / or outputs in your workflow you should try
to reduce it. Depending on your case you could: 1) Zip your input files together and unzip them in the command. 2) Use a file of file names and localize the files yourself.

Who knew? So, we are also going to have to modify PostprocessGermlineCNVCalls and the case mode calling task to accept a tar archive containing all the models. @samuelklee @mbabadi Let me know if you have any opinions on this.
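The model-bundling workaround (option 1 in the Cromwell error above) could look something like the following sketch; bundle_models is a hypothetical name, and the real fix would live in the WDL tasks themselves. The idea is that the task takes a single archive as input instead of ~10k individual model tarballs, keeping the PAPI request small, and unpacks the bundle in its command.

```python
import os
import tarfile


def bundle_models(model_tarballs: list, bundle_path: str) -> str:
    """Pack per-block model tarballs into one uncompressed archive.

    Hypothetical sketch: the members are already gzipped, so the outer
    archive is written uncompressed; the consuming task would untar the
    bundle in its command before running post-processing.
    """
    with tarfile.open(bundle_path, "w") as bundle:
        for path in model_tarballs:
            bundle.add(path, arcname=os.path.basename(path))
    return bundle_path
```

A file-of-file-names approach (option 2 in the error message) would be the alternative if re-tarring ~10k model archives proves too slow.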

@samuelklee

Oof, that's good to know as well. Actually, can I ask why you are running at 1k intervals per block? That seems a little on the thin side to me. I think @mbabadi's example of ~200 samples x ~10000 bins is the typical use case we expected, either for WGS or WES.

@mwalker174

I was hoping it would train faster but I'm not sure that turns out to be the case.

@samuelklee

Yes, I'm not sure I would expect that to be true.

My guess is that we continue to run long after suitable convergence with the current inference parameters (which we have not had a chance to optimize over for runtime). I think we can probably afford to be less conservative. @mbabadi do you have a better handle on this?

@samuelklee

@asmirnov239 @mwalker174 @vruano I'm not sure if we have completely addressed this with what's been merged so far. If any of you still has fixes that haven't been merged, please feel free to add a note here.

@samuelklee

@asmirnov239 I think that some of the optimizations that @vruano made to the postprocessing step concern the config JSONs, gCNV version, and interval list output added in #5176. These take a lot of time to localize when the number of shards is large but, aside from the interval list, aren't really used for anything, correct?

Were these just added for debugging purposes, or for reproducibility/provenance? Let's revisit whether it's necessary to pass these files on when we merge @vruano's changes into the canonical WDL.


samuelklee commented Aug 17, 2021

I think issues with gCNV postprocessing initially stemmed from things like 1) taking large sample x shard transposes (which Cromwell may have had trouble with at some point), 2) localizing more files than necessary for each sample, 3) the introduction of lexicographical globbing bugs, etc. Various people have tried to address this via things like bundling, which we ended up reverting in favor of transposing. Unfortunately, I'm afraid I've lost the plot on this after so long (and I'm not sure I ever had it): is there any documentation elsewhere (e.g., of how bundling improved performance) that might help us decide what's left to be done? It seems I had some more coherent thoughts in #6607 (comment).

As before, if any remaining issues like the call caching mentioned above would best be addressed at the Cromwell level, I think it's at least worth an investigation of what might be involved.

@mwalker174

Most of the scaling issues in Cromwell/Terra have been resolved. Terra still has limitations on workflow metadata size, and passing long file arrays to every task in large scatters (e.g. the full list of counts files is passed into every gCNV shard) can limit our batch sizes for workflows that embed gCNV (e.g. GatherBatchEvidence in gatk-sv). gCNV workflows also don't call-cache reliably (presumably due to timeouts), again probably because of the large file arrays, including 2D arrays.
