Migrate to 1KGP testing #10

Closed
mwalker174 opened this issue Jun 30, 2020 · 1 comment

@mwalker174
Collaborator

mwalker174 commented Jun 30, 2020

The test data currently include sensitive data that cannot be released publicly. We should migrate to a 2-batch test dataset built from the 1000 Genomes Project (1KGP) high-coverage data that we are using for the cohort mode Terra workspace (not to be confused with the single-sample mode 1KGP reference panel, which is a single batch composed of a different set of samples from 1KGP).

Ideally, all our testing should be self-contained, meaning that prerequisite cohort-dependent inputs for all modules (e.g. VCFs, metrics files, etc.) can be generated from the tests of earlier modules. Therefore, we will need separate tests for batch1 and batch2 from GenerateSampleMetricsBatch through FilterBatch and GenotypeBatch. Other downstream modules are run on the whole cohort (batch1 and batch2 together).
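
To make the per-batch versus cohort-level split concrete, here is a minimal Python sketch of how the test cases could be enumerated. It is illustrative only: it lists just the modules named in this thread (the modules between GenerateSampleMetricsBatch and FilterBatch are left out rather than guessed), it assumes MergeCohortVcfs is one of the cohort-level steps, and `build_test_plan` is a hypothetical helper, not part of the repository.

```python
from typing import List, Tuple

# Modules named in this issue only; intermediate modules are intentionally omitted.
PER_BATCH_MODULES = ["GenerateSampleMetricsBatch", "FilterBatch", "GenotypeBatch"]
# Assumption: MergeCohortVcfs is run once on the whole cohort.
COHORT_MODULES = ["MergeCohortVcfs"]
BATCHES = ["batch1", "batch2"]
COHORT = "1kgp_test"


def build_test_plan() -> List[Tuple[str, str]]:
    """Return (module, input set) pairs: one per batch for per-batch modules,
    and one for the whole cohort for cohort-level modules."""
    plan = [(module, batch) for module in PER_BATCH_MODULES for batch in BATCHES]
    plan += [(module, COHORT) for module in COHORT_MODULES]
    return plan


if __name__ == "__main__":
    for module, input_set in build_test_plan():
        print(f"{module}: {input_set}")
```

Each per-batch module appears once per batch, so its outputs can feed the same batch's next module, while cohort-level modules appear once with the combined input set.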

We will replace the small/large test set designations. In the future, we can consider options to run on a subset of chromosomes to speed up testing. The one exception would be GenerateSampleMetrics. Currently we test the batch version of this (GenerateSampleMetricsBatch); we should add another template for GenerateSampleMetrics itself that runs on one sample, since this workflow is quite expensive.

A few technical notes:

  • New input values need to be defined for batch1 and batch2 in /input_values. For cohort-level steps (mentioned above), let's define a third inputs file for a 1kgp_test cohort (i.e. 1kgp_test.json).
  • Input data and configurations can be found in inputs/terra_workspaces/cohort_mode (after running scripts/inputs/build_default_inputs.sh). This includes CRAM and gVCF paths, batch membership assignments, and cohort-specific resource files (e.g. ped file).
  • Copy and organize workflow inputs/outputs in gs://gatk-sv-resources-public/test, including metrics generated by enabling run_module_metrics.
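
As a rough illustration of the first note, the following Python sketch writes one input-values file per batch plus a third for the cohort. The file names follow the note above (batch1, batch2, 1kgp_test.json), but the keys `name` and `batches` are placeholders invented for this example, and both functions are hypothetical helpers rather than existing repository scripts. The real values would come from the cohort mode workspace inputs.

```python
import json
from pathlib import Path
from typing import Dict, List


def build_input_values(batches: List[str], cohort: str) -> Dict[str, dict]:
    """Return one placeholder values dict per batch plus one for the cohort.

    The keys used here ("name", "batches") are illustrative assumptions.
    """
    values: Dict[str, dict] = {batch: {"name": batch} for batch in batches}
    values[cohort] = {"name": cohort, "batches": list(batches)}
    return values


def write_input_values(out_dir: Path, values: Dict[str, dict]) -> List[Path]:
    """Write each values dict to <out_dir>/<name>.json and return the paths."""
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for name, content in values.items():
        path = out_dir / f"{name}.json"
        path.write_text(json.dumps(content, indent=2) + "\n")
        paths.append(path)
    return paths


if __name__ == "__main__":
    vals = build_input_values(["batch1", "batch2"], "1kgp_test")
    for p in write_input_values(Path("input_values_sketch"), vals):
        print(p)
```

Keeping the cohort file separate from the two batch files mirrors the split between per-batch and cohort-level steps described earlier in this comment.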
@epiercehoffman
Collaborator

As part of the migration to 1KGP testing, we should test with multiple batches for increased robustness and, specifically, to test MergeCohortVcfs.wdl. This will also be beneficial for testing a future batch-combine workflow needed for Terra.
