
Germline CNV WDLs for WGS #6607

Merged: 10 commits, Aug 11, 2020

Conversation

mwalker174 (Contributor):

  • Modifies gCNV WDLs to improve Cromwell performance when running on a large number of intervals, as in WGS
  • Adds optional disabled_read_filters input to CollectCounts
  • Enables GCS streaming for CollectCounts and CollectAllelicCounts

@asmirnov239 (Collaborator) left a comment:

Looks good @mwalker174! Could you also make the necessary changes to the README.md file in the gatk/scripts/cnv_wdl/germline/ directory?

@@ -413,99 +441,82 @@ task ScatterIntervals {
}
}

task PostprocessGermlineCNVCalls {
task BundledPostprocessGermlineCNVCalls {
Collaborator:

Can we keep it named PostprocessGermlineCNVCalls, since it's basically doing the same work as before?

Contributor:

Yes, WDL tasks are pretty much 1:1 with tools and should just reflect the tool name, if possible.

Contributor Author:

Done

Boolean use_ssd = false
Int? cpu
Int? preemptible_attempts
File invariants_tar
Collaborator:

Can we change invariants in the name here and elsewhere to something more explicit, like bundled_gcnv_posteriors or bundled_gcnv_outputs?

Contributor:

I agree. I still need to read on to figure out what exactly is being bundled, but the name is not super descriptive...

Contributor Author:

Changed to bundled_gcnv_outputs

gatk --java-options "-Xmx~{command_mem_mb}m" PostprocessGermlineCNVCalls \
$calls_args \
$model_args \
time gatk --java-options "-Xmx~{command_mem_mb}m" PostprocessGermlineCNVCalls \
Collaborator:

What's the reason for calling time here?

Contributor Author:

Looks like a relic from when we were optimizing for WGS. Removed time.

cat $calling_configs_list | sort -V > calling_configs_list.sorted
cat $denoising_configs_list | sort -V > denoising_configs_list.sorted
cat $gcnvkernel_version_list | sort -V > gcnvkernel_version_list.sorted
cat $sharded_interval_lists_list | sort -V > sharded_interval_lists_list.sorted
Collaborator:

We seem to be hitting an error in our Travis gCNV WDL tests:
The number of entries in the copy-number posterior file for shard 0 does not match the number of entries in the shard interval list (posterior list size: 21, interval list size: 20)

Is it possible that some mismatch between shards is introduced here?

Contributor Author:

Hmm, not sure what was happening there, especially since my own tests completed successfully. But my latest changes don't seem to be triggering this error.
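As an aside, the `sort -V` used in the snippet above matters once shard indices reach double digits: a plain lexicographic sort would interleave them, which is exactly the kind of shard/interval mismatch being debugged here. A minimal illustration (filenames hypothetical, not from the PR):

```shell
# Plain sort is lexicographic, so "shard-10" sorts before "shard-2";
# sort -V (version sort) compares the embedded numbers instead.
printf 'shard-10\nshard-2\nshard-1\n' | sort    # shard-1, shard-10, shard-2
printf 'shard-10\nshard-2\nshard-1\n' | sort -V # shard-1, shard-2, shard-10
```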

@@ -605,3 +616,122 @@ task CollectModelQualityMetrics {
String qc_status_string = read_string("qcStatus.txt")
}
}

task BundlePostprocessingInvariants {
Collaborator:

Same here for invariant name as above

Contributor:

Maybe some comments to indicate what this task is doing.

Contributor Author:

Renamed to BundleCallerOutputs

Contributor Author:

Actually, in my latest commit this is now TransposeCallerOutputs.

@samuelklee (Contributor) left a comment:

Hmm, I'm only halfway through my review, but I'm not sure I understand the need for a lot of these changes. Which changes are critical for improving performance (e.g., removing the samples x shards transpose), and which are just rearranging output?

For the transpose, did we check whether recent versions of Cromwell/Terra are OK?

It also seems like we are bundling up some things (global configs/interval lists, all gCNV calls by sample) and then splitting up others (contig ploidy calls per sample). Unless I'm misunderstanding the code, it seems that we are no longer consistent in how various quantities are split up (i.e., by shard or by sample). There are also some flattening operations that I don't quite understand. As we've discussed, we should see which of these global quantities can be scrapped; we probably only need to keep the sharded intervals.

Perhaps we can discuss at today's gCNV meeting?

Int machine_mem_mb = select_first([mem_gb, 7]) * 1000
Int command_mem_mb = machine_mem_mb - 1000

Boolean enable_indexing_ = select_first([enable_indexing, false])
Array[String] disabled_read_filters_arr = if(defined(disabled_read_filters))
Contributor:

I might just put this ternary all on one line (or make the style of optional arrays like allosomal-contigs below consistent).

Contributor Author:

Done

String genotyped_intervals_vcf_filename = "genotyped-intervals-~{entity_id}.vcf.gz"
String genotyped_segments_vcf_filename = "genotyped-segments-~{entity_id}.vcf.gz"
String denoised_copy_ratios_filename = "denoised_copy_ratios-~{entity_id}.tsv"

Array[String] allosomal_contigs_args = if defined(allosomal_contigs) then prefix("--allosomal-contig ", select_first([allosomal_contigs])) else []

command <<<
set -eu
export GATK_LOCAL_JAR=~{default="/root/gatk.jar" gatk4_jar_override}
set -euo pipefail
Contributor:

Is this something that should be changed throughout all CNV WDLs?

Contributor Author:

It is good practice, although -o pipefail will have no effect in this task since piping isn't used anywhere.
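For reference, a toy sketch (not from the PR) of what the pipefail flag changes: without it, a pipeline's exit status is that of its last command, so an earlier failing stage is silently masked.

```shell
#!/usr/bin/env bash
# Without pipefail, the pipeline's status is that of the LAST command,
# so cat's success (0) masks false's failure.
st=$( (set +o pipefail; false | cat; echo $?) )
echo "without pipefail: exit status $st"   # prints 0

# With pipefail, the pipeline fails if ANY stage fails; this is what
# `set -euo pipefail` relies on to abort a task early.
st=$( (set -o pipefail; false | cat; echo $?) )
echo "with pipefail: exit status $st"      # prints 1
```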

###################################################
#### arguments for PostprocessGermlineCNVCalls ####
###################################################
Int ref_copy_number_autosomal_contigs
Array[String]? allosomal_contigs
Int? disk_space_gb_for_postprocess_germline_cnv_calls
Contributor:

The task name in the comment section header should match the name of the WDL task.

Contributor Author:

Done

for (( i=0; i<~{num_samples}; i++ ))
do
sample_id=${sample_ids[$i]}
sample_no=`printf %04d $i`
Contributor:

I think the way this is originally done in the gCNV tasks is more robust to the maximum number of samples.

Contributor:

Also, why is the renaming added to the ploidy calls here, but removed from the gCNV calls? Are those guaranteed to line up?

Contributor Author:

Added dynamic scaling of the sample index padding. I don't think we need sample ids in the call tars since these are only consumed internally. The sample order will be the same between the input bam array and SAMPLE_ subdirectories that gCNV creates, so they will line up.
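A sketch of the dynamic padding idea described here, with hypothetical variable names (the actual WDL task may compute the width differently): the zero-padding width is derived from the sample count, so a plain lexicographic sort of the tar names preserves sample order regardless of cohort size.

```shell
#!/usr/bin/env bash
set -euo pipefail
# Scale the zero-padding width with the sample count. Using the digit
# count of num_samples itself is slightly generous (e.g., 4 digits for
# exactly 1000 samples) but always sufficient for indices 0..num_samples-1.
num_samples=150
num_digits=${#num_samples}   # 3 digits cover indices 0..149

for i in 0 1 149; do
  padded_sample_index=$(printf "%0${num_digits}d" "$i")
  echo "sample_${padded_sample_index}.contig_ploidy_calls.tar.gz"
done
# sample_000.contig_ploidy_calls.tar.gz
# sample_001.contig_ploidy_calls.tar.gz
# sample_149.contig_ploidy_calls.tar.gz
```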

Array[File] read_counts_entity_id = flatten(CNVGermlineCaseWorkflow.read_counts_entity_id)
Array[File] read_counts = flatten(CNVGermlineCaseWorkflow.read_counts)
Array[File] sample_contig_ploidy_calls_tars = flatten(CNVGermlineCaseWorkflow.sample_contig_ploidy_calls_tars)
Array[File] gcnv_calls_tars = flatten(CNVGermlineCaseWorkflow.gcnv_calls_tars)
Contributor:

Does it make sense to flatten some of these quantities?

Contributor Author:

You're right, I shouldn't flatten sample_contig_ploidy_calls_tars and gcnv_calls_tars, but the rest are sample-wise and will correspond to the size/order of the input bams.

@samuelklee (Contributor) commented May 21, 2020:

OK, finally tracked down that original issue from Mehrtash concerning the bundling: #4397. As we discussed, there was a lot of back and forth to try to resolve this issue, and it was confounded by a lexicographical bug (which may have been reintroduced here). The last chapter in this saga was #5490.

If the matrix transpose is still troublesome and we can avoid it by being more clever with WDL indexing, then maybe we can explore that. Or we can just see if there are analogous existing WDLs and borrow their solution.

However, note that @mwalker174 indicated that the creation of the matrix itself is troublesome for call caching. If bundling is the only answer and we are willing to pay the cost of localizing all gCNV results to all shards, it might make things easier to first bundle everything up at the end of each gCNV task.

Also, would Cromwell be able to handle things if we change the bundling from a) all gCNV results (i.e., across all samples and shards) to b) a single bundled global quantity (model + interval lists) + calls bundled (across shards) per sample? Each postprocessing task would then take the global bundle + the bundle containing calls for that sample. That seems like it would resolve Mehrtash's original complaint, while still minimizing the number of files whizzing around.

We also discussed batching by sample at the postprocessing task level, but I think we want to keep this task at the per-sample level for parallelism.

@samuelklee (Contributor):

Also, if we can't figure this out, then I really think it's worth kicking it up to Cromwell to see if call caching can be made more scalable.

@mwalker174 mwalker174 force-pushed the mw_gcnv_wdl_upgrade branch from f146f66 to 3931c8d Compare June 1, 2020 20:43
@mwalker174 mwalker174 force-pushed the mw_gcnv_wdl_upgrade branch from daeda8e to 13317db Compare June 11, 2020 19:42
@TedBrookings (Contributor) left a comment:

I mainly looked over the transpose operation. It looks pretty good!

Comment on lines 651 to 654
runtime {
docker: docker
memory: select_first([mem_gb, 2]) + " GiB"
disks: "local-disk " + select_first([disk_space_gb, 150]) + if use_ssd then " SSD" else " HDD"
Contributor:

Suggested change
runtime {
docker: docker
memory: select_first([mem_gb, 2]) + " GiB"
disks: "local-disk " + select_first([disk_space_gb, 150]) + if use_ssd then " SSD" else " HDD"
Int disk_baseline_gb_ = 10
Float compression_factor = 10.0
Int disk_needed_gb = disk_baseline_gb_ + ceil(compression_factor * size(gcnv_calls_tars, "GiB"))
runtime {
docker: docker
memory: select_first([mem_gb, 2]) + " GiB"
disks: "local-disk " + select_first([disk_space_gb, disk_needed_gb]) + if use_ssd then " SSD" else " HDD"

I don't see memory being a problem here, but these kinds of tasks have been running out of disk on larger panels. This should be conservative enough without being super wasteful.

Contributor Author:

Thanks. This would have been good, but it turns out we aren't using this task. It might be useful to have automatic scaling for the rest of the gCNV tasks, but I'll leave this for later work.

@mwalker174 mwalker174 force-pushed the mw_gcnv_wdl_upgrade branch from 13317db to 7084775 Compare July 29, 2020 00:11
@broadinstitute deleted 5 comments from gatk-bot Jul 29, 2020
@mwalker174 (Contributor Author):

The transpose operation appears to work with Cromwell v51, so I've reverted a good chunk of the code.

@asmirnov239 (Collaborator) left a comment:

@mwalker174 This looks good! Just a few minor comments.
What happened to the bundling performance improvement changes, by the way?

padded_sample_index=$(printf "%0${num_digits}d" $i)
tar -czf sample_${padded_sample_index}.${sample_id}.contig_ploidy_calls.tar.gz -C calls/SAMPLE_${i} .
done
>>>
Collaborator:

Space here for consistency

Contributor Author:

Done

@@ -314,7 +326,7 @@ task DetermineGermlineContigPloidyCaseMode {
--mapping-error-rate ~{default="0.01" mapping_error_rate} \
--sample-psi-scale ~{default="0.0001" sample_psi_scale}

tar czf case-contig-ploidy-calls.tar.gz -C ~{output_dir_}/case-calls .
tar c -C ~{output_dir_}/case-calls . | gzip -1 > case-contig-ploidy-calls.tar.gz
Collaborator:

Can you explain this change? Also, why is it not in cohort mode as well?

Contributor Author:

Faster compression with gzip -1, I believe. This is okay in case mode since the calls tars aren't usually kept in storage except as intermediates, so the larger file size doesn't outweigh the time saved compressing/decompressing on the VMs.
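A minimal sketch of the two forms (paths and contents hypothetical): both archives unpack to identical contents; gzip -1 just trades compression ratio for speed, which suits short-lived intermediates.

```shell
#!/usr/bin/env bash
set -euo pipefail
workdir=$(mktemp -d)
mkdir -p "$workdir/case-calls"
echo "contig ploidy call data" > "$workdir/case-calls/calls.tsv"

# Default: tar czf compresses at gzip's default level (better ratio, slower).
tar czf "$workdir/default.tar.gz" -C "$workdir/case-calls" .

# Faster: pipe through gzip -1 (worse ratio, faster); acceptable for
# intermediate outputs that are discarded after the workflow completes.
tar c -C "$workdir/case-calls" . | gzip -1 > "$workdir/fast.tar.gz"

# Both unpack to the same contents.
mkdir "$workdir/out"
tar xzf "$workdir/fast.tar.gz" -C "$workdir/out"
cmp "$workdir/case-calls/calls.tsv" "$workdir/out/calls.tsv" && echo "identical"
rm -rf "$workdir"
```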

@mwalker174 (Contributor Author):

> What happened to the bundling performance improvement changes, by the way?

The large 2D file array can be handled by the latest Cromwell versions, so we do not need to bundle. It is much more elegant and readable this way and should actually improve performance.

@mwalker174 mwalker174 merged commit b1688d9 into master Aug 11, 2020
@mwalker174 mwalker174 deleted the mw_gcnv_wdl_upgrade branch August 11, 2020 16:52
mwalker174 added a commit that referenced this pull request Nov 3, 2020