
Germline CNV WDLs for WGS #6607

Merged: 10 commits, Aug 11, 2020

Conversation

mwalker174 (Contributor):

  • Modifies gCNV WDLs to improve Cromwell performance when running on a large number of intervals, as in WGS
  • Adds optional disabled_read_filters input to CollectCounts
  • Enables GCS streaming for CollectCounts and CollectAllelicCounts

@asmirnov239 (Collaborator) left a comment:

Looks good @mwalker174! Could you also make the necessary changes to the README.md file in the gatk/scripts/cnv_wdl/germline/ directory?

@@ -413,99 +441,82 @@ task ScatterIntervals {
}
}

task PostprocessGermlineCNVCalls {
task BundledPostprocessGermlineCNVCalls {
Collaborator:

Can we keep it named PostprocessGermlineCNVCalls, since it's basically doing the same work as before?

Contributor:

Yes, WDL tasks are pretty much 1:1 with tools and should just reflect the tool name, if possible.

Contributor Author:

Done

Boolean use_ssd = false
Int? cpu
Int? preemptible_attempts
File invariants_tar
Collaborator:

Can we change invariants in the name here and elsewhere to something more explicit, like bundled_gcnv_posteriors or bundled_gcnv_outputs?

Contributor:

I agree. I still need to read on to figure out what exactly is being bundled, but the name is not super descriptive...

Contributor Author:

Changed to bundled_gcnv_outputs

gatk --java-options "-Xmx~{command_mem_mb}m" PostprocessGermlineCNVCalls \
$calls_args \
$model_args \
time gatk --java-options "-Xmx~{command_mem_mb}m" PostprocessGermlineCNVCalls \
Collaborator:

What's the reason for calling time here?

Contributor Author:

Looks like a relic from when we were optimizing for WGS. Removed time.

cat $calling_configs_list | sort -V > calling_configs_list.sorted
cat $denoising_configs_list | sort -V > denoising_configs_list.sorted
cat $gcnvkernel_version_list | sort -V > gcnvkernel_version_list.sorted
cat $sharded_interval_lists_list | sort -V > sharded_interval_lists_list.sorted
Collaborator:

We seem to be hitting an error in our Travis gCNV WDL tests:
The number of entries in the copy-number posterior file for shard 0 does not match the number of entries in the shard interval list (posterior list size: 21, interval list size: 20)

Is it possible that some mismatch between shards is introduced here?

Contributor Author:

Hmm, not sure what was happening there, especially since my own tests completed successfully. But my latest changes don't seem to be triggering this error.
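As an aside, the `sort -V` used in the snippet above matters once shard indices reach double digits: a plain lexicographic sort would interleave them, which is exactly the kind of shard/interval mismatch being debugged here. A minimal illustration (filenames hypothetical, not from the PR):

```shell
# Plain sort is lexicographic, so "shard-10" sorts before "shard-2";
# sort -V (version sort) compares the embedded numbers instead.
printf 'shard-10\nshard-2\nshard-1\n' | sort    # shard-1, shard-10, shard-2
printf 'shard-10\nshard-2\nshard-1\n' | sort -V # shard-1, shard-2, shard-10
```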

@@ -605,3 +616,122 @@ task CollectModelQualityMetrics {
String qc_status_string = read_string("qcStatus.txt")
}
}

task BundlePostprocessingInvariants {
Collaborator:

Same here for invariant name as above

Contributor:

Maybe some comments to indicate what this task is doing.

Contributor Author:

Renamed to BundleCallerOutputs

Contributor Author:

Actually, in my latest commit this is now TransposeCallerOutputs.

@samuelklee (Contributor) left a comment:

Hmm, I'm only halfway through my review, but I'm not sure I understand the need for a lot of these changes. Which changes are critical for improving performance (e.g., removing the samples x shards transpose), and which are just rearranging output?

For the transpose, did we check whether recent versions of Cromwell/Terra are OK?

It also seems like we are bundling up some things (global configs/interval lists, all gCNV calls by sample) and then splitting up others (contig ploidy calls per sample). Unless I'm misunderstanding the code, it seems that we are no longer consistent in how various quantities are split up (i.e., by shard or by sample). There are also some flattening operations that I don't quite understand. As we've discussed, we should see which of these global quantities can be scrapped; we probably only need to keep the sharded intervals.

Perhaps we can discuss at today's gCNV meeting?

Int machine_mem_mb = select_first([mem_gb, 7]) * 1000
Int command_mem_mb = machine_mem_mb - 1000

Boolean enable_indexing_ = select_first([enable_indexing, false])
Array[String] disabled_read_filters_arr = if(defined(disabled_read_filters))
Contributor:

I might just put this ternary all on one line (or make the style of optional arrays like allosomal-contigs below consistent).

Contributor Author:

Done

String genotyped_intervals_vcf_filename = "genotyped-intervals-~{entity_id}.vcf.gz"
String genotyped_segments_vcf_filename = "genotyped-segments-~{entity_id}.vcf.gz"
String denoised_copy_ratios_filename = "denoised_copy_ratios-~{entity_id}.tsv"

Array[String] allosomal_contigs_args = if defined(allosomal_contigs) then prefix("--allosomal-contig ", select_first([allosomal_contigs])) else []

command <<<
set -eu
export GATK_LOCAL_JAR=~{default="/root/gatk.jar" gatk4_jar_override}
set -euo pipefail
Contributor:

Is this something that should be changed throughout all CNV WDLs?

Contributor Author:

It is good practice, although -o pipefail will have no effect in this task since piping isn't used anywhere.
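For reference, a toy sketch (not from the PR) of what the pipefail flag changes: without it, a pipeline's exit status is that of its last command, so an earlier failing stage is silently masked.

```shell
#!/usr/bin/env bash
# Without pipefail, the pipeline's status is that of the LAST command,
# so cat's success (0) masks false's failure.
st=$( (set +o pipefail; false | cat; echo $?) )
echo "without pipefail: exit status $st"   # prints 0

# With pipefail, the pipeline fails if ANY stage fails; this is what
# `set -euo pipefail` relies on to abort a task early.
st=$( (set -o pipefail; false | cat; echo $?) )
echo "with pipefail: exit status $st"      # prints 1
```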

###################################################
#### arguments for PostprocessGermlineCNVCalls ####
###################################################
Int ref_copy_number_autosomal_contigs
Array[String]? allosomal_contigs
Int? disk_space_gb_for_postprocess_germline_cnv_calls
Contributor:

The task name in the comment section header should match the name of the WDL task.

Contributor Author:

Done

for (( i=0; i<~{num_samples}; i++ ))
do
sample_id=${sample_ids[$i]}
sample_no=`printf %04d $i`
Contributor:

I think the way this is originally done in the gCNV tasks is more robust to the maximum number of samples.

Contributor:

Also, why is the renaming added to the ploidy calls here, but removed from the gCNV calls? Are those guaranteed to line up?

Contributor Author:

Added dynamic scaling of the sample index padding. I don't think we need sample ids in the call tars since these are only consumed internally. The sample order will be the same between the input bam array and SAMPLE_ subdirectories that gCNV creates, so they will line up.
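A sketch of the dynamic padding idea described here, with hypothetical variable names (the actual WDL task may compute the width differently): the zero-padding width is derived from the sample count, so a plain lexicographic sort of the tar names preserves sample order regardless of cohort size.

```shell
#!/usr/bin/env bash
set -euo pipefail
# Scale the zero-padding width with the sample count. Using the digit
# count of num_samples itself is slightly generous (e.g., 4 digits for
# exactly 1000 samples) but always sufficient for indices 0..num_samples-1.
num_samples=150
num_digits=${#num_samples}   # 3 digits cover indices 0..149

for i in 0 1 149; do
  padded_sample_index=$(printf "%0${num_digits}d" "$i")
  echo "sample_${padded_sample_index}.contig_ploidy_calls.tar.gz"
done
# sample_000.contig_ploidy_calls.tar.gz
# sample_001.contig_ploidy_calls.tar.gz
# sample_149.contig_ploidy_calls.tar.gz
```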

Array[File] read_counts_entity_id = flatten(CNVGermlineCaseWorkflow.read_counts_entity_id)
Array[File] read_counts = flatten(CNVGermlineCaseWorkflow.read_counts)
Array[File] sample_contig_ploidy_calls_tars = flatten(CNVGermlineCaseWorkflow.sample_contig_ploidy_calls_tars)
Array[File] gcnv_calls_tars = flatten(CNVGermlineCaseWorkflow.gcnv_calls_tars)
Contributor:

Does it make sense to flatten some of these quantities?

Contributor Author:

You're right, I shouldn't flatten sample_contig_ploidy_calls_tars and gcnv_calls_tars, but the rest are sample-wise and will correspond to the size/order of the input bams.

@samuelklee (Contributor) commented May 21, 2020:

OK, finally tracked down that original issue from Mehrtash concerning the bundling: #4397. As we discussed, there was a lot of back and forth to try to resolve this issue, and it was confounded by a lexicographical bug (which may have been reintroduced here). The last chapter in this saga was #5490.

If the matrix transpose is still troublesome and we can avoid it by being more clever with WDL indexing, then maybe we can explore that. Or we can just see if there are analogous existing WDLs and borrow their solution.

However, note that @mwalker174 indicated that the creation of the matrix itself is troublesome for call caching. If bundling is the only answer and we are willing to pay the cost of localizing all gCNV results to all shards, it might make things easier to first bundle everything up at the end of each gCNV task.

Also, would Cromwell be able to handle things if we change the bundling from a) all gCNV results (i.e., across all samples and shards) to b) a single bundled global quantity (model + interval lists) + calls bundled (across shards) per sample? Each postprocessing task would then take the global bundle + the bundle containing calls for that sample. That seems like it would resolve Mehrtash's original complaint, while still minimizing the number of files whizzing around.

We also discussed batching by sample at the postprocessing task level, but I think we want to keep this task at the per-sample level for parallelism.

@samuelklee (Contributor):

Also, if we can't figure this out, then I really think it's worth kicking it up to Cromwell to see if call caching can be made more scalable.

@mwalker174 mwalker174 force-pushed the mw_gcnv_wdl_upgrade branch from f146f66 to 3931c8d Compare June 1, 2020 20:43
@mwalker174 mwalker174 force-pushed the mw_gcnv_wdl_upgrade branch from daeda8e to 13317db Compare June 11, 2020 19:42
@TedBrookings (Contributor) left a comment:

I mainly looked over the transpose operation. It looks pretty good!

Comment on lines 651 to 654
runtime {
docker: docker
memory: select_first([mem_gb, 2]) + " GiB"
disks: "local-disk " + select_first([disk_space_gb, 150]) + if use_ssd then " SSD" else " HDD"
Contributor:

Suggested change
runtime {
docker: docker
memory: select_first([mem_gb, 2]) + " GiB"
disks: "local-disk " + select_first([disk_space_gb, 150]) + if use_ssd then " SSD" else " HDD"
Int disk_baseline_gb_ = 10
Float compression_factor = 10.0
Int disk_needed_gb = disk_baseline_gb_ + ceil(compression_factor * size(gcnv_calls_tars, "GiB"))
runtime {
docker: docker
memory: select_first([mem_gb, 2]) + " GiB"
disks: "local-disk " + select_first([disk_space_gb, disk_needed_gb]) + if use_ssd then " SSD" else " HDD"

I don't see memory being a problem here, but these kinds of tasks have been running out of disk on larger panels. This should be conservative enough without being super wasteful.

Contributor Author:

Thanks. This would have been good, but it turns out we aren't using this task. It might be useful to have automatic scaling for the rest of the gCNV tasks, but I'll leave this for later work.

@mwalker174 mwalker174 force-pushed the mw_gcnv_wdl_upgrade branch from 13317db to 7084775 Compare July 29, 2020 00:11
@broadinstitute deleted 5 comments from gatk-bot Jul 29, 2020
@mwalker174 (Contributor Author):

The transpose operation appears to work with Cromwell v51, so I've reverted a good chunk of the code.

@asmirnov239 (Collaborator) left a comment:

@mwalker174 This looks good! Just a few minor comments.
What happened to the bundling performance improvement changes, by the way?

padded_sample_index=$(printf "%0${num_digits}d" $i)
tar -czf sample_${padded_sample_index}.${sample_id}.contig_ploidy_calls.tar.gz -C calls/SAMPLE_${i} .
done
>>>
Collaborator:

Space here for consistency

Contributor Author:

Done

@@ -314,7 +326,7 @@ task DetermineGermlineContigPloidyCaseMode {
--mapping-error-rate ~{default="0.01" mapping_error_rate} \
--sample-psi-scale ~{default="0.0001" sample_psi_scale}

tar czf case-contig-ploidy-calls.tar.gz -C ~{output_dir_}/case-calls .
tar c -C ~{output_dir_}/case-calls . | gzip -1 > case-contig-ploidy-calls.tar.gz
Collaborator:

Can you explain this change? Also, why is it not in cohort mode as well?

Contributor Author:

Faster compression with gzip -1, I believe. This is okay in case mode since the calls tars aren't usually kept in storage except as intermediates, so the larger file size doesn't outweigh the time saved compressing/decompressing on the VMs.
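A minimal sketch of the two forms (paths and contents hypothetical): both archives unpack to identical contents; gzip -1 just trades compression ratio for speed, which suits short-lived intermediates.

```shell
#!/usr/bin/env bash
set -euo pipefail
workdir=$(mktemp -d)
mkdir -p "$workdir/case-calls"
echo "contig ploidy call data" > "$workdir/case-calls/calls.tsv"

# Default: tar czf compresses at gzip's default level (better ratio, slower).
tar czf "$workdir/default.tar.gz" -C "$workdir/case-calls" .

# Faster: pipe through gzip -1 (worse ratio, faster); acceptable for
# intermediate outputs that are discarded after the workflow completes.
tar c -C "$workdir/case-calls" . | gzip -1 > "$workdir/fast.tar.gz"

# Both unpack to the same contents.
mkdir "$workdir/out"
tar xzf "$workdir/fast.tar.gz" -C "$workdir/out"
cmp "$workdir/case-calls/calls.tsv" "$workdir/out/calls.tsv" && echo "identical"
rm -rf "$workdir"
```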

@mwalker174 (Contributor Author):

> What happened to the bundling performance improvement changes, by the way?

The large 2D file array can be handled by the latest Cromwell versions, so we do not need to bundle. It is much more elegant and readable this way and should actually improve performance.

@mwalker174 mwalker174 merged commit b1688d9 into master Aug 11, 2020
@mwalker174 mwalker174 deleted the mw_gcnv_wdl_upgrade branch August 11, 2020 16:52
mwalker174 added a commit that referenced this pull request Nov 3, 2020