
Commit 290ad0f

Documentation Updates For Sample Subset Deliverables [VS-1448] (#8922)


scripts/variantstore/docs/aou/AOU_DELIVERABLES.md

+49 −21
@@ -1,4 +1,4 @@
# Running the Genome Variant Store (GVS) Pipelines for AoU

## Setup
- Create a Terra workspace
@@ -35,7 +35,7 @@
- Make a note of the Google project ID ("aou-genomics-curation-prod"), dataset name (e.g. "aou_wgs" — if it does not exist, be sure to create one before running any workflows; see the sketch at the end of this list) and callset identifier (e.g. "Bravo"), as these will be inputs (`project_id`, `dataset_name` and `call_set_identifier`) to all or most of the GVS workflows. The [naming conventions for other aspects of GVS datasets are outlined here](https://docs.google.com/document/d/1pNtuv7uDoiOFPbwe4zx5sAGH7MyxwKqXkyrpNmBxeow).
- Once the **non-control** samples have been fully ingested into BQ using the `GvsBulkIngestGenomes` workflow, the **control** samples can be manually added to the workspace and loaded in separately.
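
If the BigQuery dataset does not already exist, it can be created in the BigQuery console or programmatically. A minimal sketch using the `google-cloud-bigquery` client (not part of any GVS workflow; the dataset location is an assumption to confirm for the project):

```python
from google.cloud import bigquery

project_id = "aou-genomics-curation-prod"  # Google project ID noted above
dataset_name = "aou_wgs"                   # example dataset name; use the one chosen for this callset

client = bigquery.Client(project=project_id)
dataset = bigquery.Dataset(f"{project_id}.{dataset_name}")
dataset.location = "US"  # assumption: use whatever location the project requires

# exists_ok=True makes this a no-op if the dataset has already been created.
client.create_dataset(dataset, exists_ok=True)
print(f"Dataset {project_id}.{dataset_name} is ready")
```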

## The Main Pipeline
1. `GvsBulkIngestGenomes` workflow
    - For use with **non-control** samples only! To ingest control samples (required for running `GvsCalculatePrecisionAndSensitivity`), use the `GvsAssignIds` and `GvsImportGenomes` workflows described below.
    - Set `sample_id_column_name` to "research_id" to use the shorter unique ID from AoU for the `sample_name` column.
@@ -83,42 +83,70 @@
- This workflow needs to be run with the `extract_table_prefix` input from the `GvsPrepareRangesCallset` step.
- This workflow needs to be run with the `filter_set_name` input from the `GvsCreateFilterSet` step.
- This workflow does not use the Terra Data Entity Model to run, so be sure to select the `Run workflow with inputs defined by file paths` workflow submission option.
1. `GvsCalculatePrecisionAndSensitivity` workflow
    - Please see the detailed instructions for running the Precision and Sensitivity workflow [here](../../tieout/AoU_PRECISION_SENSITIVITY.md).
1. `GvsCallsetCost` workflow
    - This workflow calculates the total BigQuery cost of generating this callset with the GVS workflows above (this cost is not represented in the Terra UI total workflow cost); it is used to calculate the cost both as a whole and per sample.

## Main Deliverables (via email to stakeholders once the above steps are complete)

The Callset Stats and S&P (Precision and Sensitivity) files are much smaller than the VDS, so they can simply be `gsutil cp`ed to the AoU delivery bucket (see the sketch after this list).
1. GCS location of the VDS in the AoU delivery bucket
2. Fully qualified name of the BigQuery dataset (composed of the `project_id` and `dataset_name` inputs from the workflows)
3. GCS location of the CSV output from the `GvsCallsetStatistics` workflow in the AoU delivery bucket
4. GCS location of the TSV output from `GvsCalculatePrecisionAndSensitivity` in the AoU delivery bucket
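
As an illustrative sketch of the `gsutil cp` step (the source and destination paths below are hypothetical placeholders, not the real AoU delivery bucket):

```python
import subprocess

# Hypothetical outputs from GvsCallsetStatistics and GvsCalculatePrecisionAndSensitivity,
# plus a hypothetical delivery destination -- substitute the real GCS paths.
callset_stats_csv = "gs://example-workspace-bucket/outputs/bravo_callset_statistics.csv"
precision_sensitivity_tsv = "gs://example-workspace-bucket/outputs/bravo_precision_sensitivity.tsv"
delivery_prefix = "gs://example-aou-delivery-bucket/bravo/"

for src in (callset_stats_csv, precision_sensitivity_tsv):
    # GCS-to-GCS copies are performed server side, so nothing is downloaded locally.
    subprocess.run(["gsutil", "cp", src, delivery_prefix], check=True)
```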
## Running the VAT pipeline
To create a BigQuery table of variant annotations, you may follow the instructions here:
[process to create variant annotations table](../../variant_annotations_table/README.md)

The pipeline takes in the VDS and outputs a variant annotations table in BigQuery.

## Additional Deliverables

### Smaller Interval Lists

1. You will need to run the `GvsPrepareRangesCallset` workflow for each "[Region](https://support.researchallofus.org/hc/en-us/articles/14929793660948-Smaller-Callsets-for-Analyzing-Short-Read-WGS-SNP-Indel-Data-with-Hail-MT-VCF-and-PLINK)" (interval list) for which a PGEN or VCF deliverable is required for the callset.
    - This workflow transforms the data in the vet, ref_ranges, and samples tables into a schema optimized for extract.
    - The `enable_extract_table_ttl` input should be set to `true` (the default value is `false`), which will add a TTL of two weeks to the tables it creates.
    - `extract_table_prefix` should be set to a name that is unique to the given Region / interval list. See the [naming conventions doc](https://docs.google.com/document/d/1pNtuv7uDoiOFPbwe4zx5sAGH7MyxwKqXkyrpNmBxeow) for guidance on what to use.
    - Specify the `interval_list` appropriate for the PGEN / VCF extraction run you are performing.
    - This workflow does not use the Terra Data Entity Model to run, so be sure to select the `Run workflow with inputs defined by file paths` workflow submission option.
1. `GvsExtractCallset` / `GvsExtractCallsetPgenMerged` workflows ("small callset" Exome, ClinVar, and ACAF threshold extracts in VCF and PGEN formats, respectively)
    - Specify the same `call_set_identifier`, `dataset_name`, `project_id`, `extract_table_prefix`, and `interval_list` that were used in the `GvsPrepareRangesCallset` run documented above (see the inputs sketch after this list).
    - Specify the `interval_weights_bed` appropriate for the PGEN / VCF extraction run you are performing. `gs://gvs_quickstart_storage/weights/gvs_full_vet_weights_1kb_padded_orig.bed` is the interval weights BED used for Quickstart.
    - For both `GvsExtractCallset` and `GvsExtractCallsetPgenMerged`, select the workflow option "Retry with more memory" and choose a "Memory retry factor" of 1.5.
    - `GvsExtractCallsetPgen` currently defaults to a maximum of 100 alt alleles, which means that any sites with more than that number of alt alleles will be dropped.
    - For both `GvsExtractCallset` and `GvsExtractCallsetPgenMerged`, be sure to set `output_gcs_dir` to the proper path in the AoU delivery bucket so you don't need to copy the output files there yourself once the workflows have finished.
    - For `GvsExtractCallset`, make sure to specify the appropriate `maximum_alternate_alleles` value (currently 100).
    - For `GvsExtractCallset`, you will probably (check the requirements to confirm) want to set the `bgzip_output_vcfs` input to `true`.
    - For `GvsExtractCallsetPgen` (which is called by `GvsExtractCallsetPgenMerged`), if one (or several) of the `PgenExtractTask` shards fail because of angry cloud, you can re-run the workflow with the exact same inputs with call caching turned on; the successful shards will cache and only the failed ones will re-run.
    - If you want to collect the monitoring logs from a large number of `Extract` shards, the `summarize_task_monitor_logs.py` script will not work if the task is scattered too wide. Use the `summarize_task_monitor_logs_from_file.py` script instead, which takes a FOFN of GCS paths rather than a space-separated series of localized files.
    - These workflows do not use the Terra Data Entity Model to run, so be sure to select the `Run workflow with inputs defined by file paths` workflow submission option.
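
Because these workflows are run with inputs defined by file paths rather than the Terra Data Entity Model, it can help to assemble the shared values once. A minimal sketch that writes illustrative Terra-style inputs JSON files for the prepare/extract pair; the values are placeholders and the fully qualified input names (`WorkflowName.input_name`) are assumptions to be confirmed against the WDLs before submitting:

```python
import json

# Placeholder values -- substitute the real identifiers for this callset and Region.
shared = {
    "call_set_identifier": "Bravo",
    "project_id": "aou-genomics-curation-prod",
    "dataset_name": "aou_wgs",
    "extract_table_prefix": "bravo_exome",  # unique per Region / interval list
    "interval_list": "gs://example-bucket/intervals/exome.interval_list",  # hypothetical path
}

prepare_inputs = {f"GvsPrepareRangesCallset.{k}": v for k, v in shared.items()}
prepare_inputs["GvsPrepareRangesCallset.enable_extract_table_ttl"] = True

extract_inputs = {f"GvsExtractCallset.{k}": v for k, v in shared.items()}
extract_inputs.update({
    "GvsExtractCallset.interval_weights_bed":
        "gs://gvs_quickstart_storage/weights/gvs_full_vet_weights_1kb_padded_orig.bed",
    "GvsExtractCallset.maximum_alternate_alleles": 100,
    "GvsExtractCallset.bgzip_output_vcfs": True,
    "GvsExtractCallset.output_gcs_dir": "gs://example-aou-delivery-bucket/bravo/exome/",  # hypothetical
})

# Write one inputs file per workflow submission.
with open("prepare_inputs.json", "w") as f:
    json.dump(prepare_inputs, f, indent=2)
with open("extract_inputs.json", "w") as f:
    json.dump(extract_inputs, f, indent=2)
```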
### Smaller Sample Lists

#### VCF Outputs

You can take advantage of our existing sub-cohort WDL, `GvsExtractCohortFromSampleNames.wdl`, to create VCFs for a subset of callset samples.
- Specify the same `call_set_identifier`, `gvs_dataset` (same as `dataset_name` in other runs), `gvs_project` (same as `project_id` in other runs), and `filter_set_name` that were used in the creation of the main callset.
- Specify a `cohort_table_prefix` that is unique to this subset of samples so that the prepare tables for the full callset are not overwritten.
- Fill out either `cohort_sample_names` with a GCS path to a newline-delimited list of the sample names (see the sketch after this list) or `cohort_sample_names_array` with an Array of sample name Strings. The `cohort_sample_names_array` input will take precedence over `cohort_sample_names` if both are set.
- Be sure to set `output_gcs_dir` to the proper path in the AoU delivery bucket so you don't need to copy the output files there yourself once the workflow has finished.
- This workflow does not use the Terra Data Entity Model to run, so be sure to select the `Run workflow with inputs defined by file paths` workflow submission option.
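
As a minimal sketch of producing the newline-delimited sample list for `cohort_sample_names` (the sample names, bucket, and object path are hypothetical placeholders):

```python
from google.cloud import storage

# Hypothetical research IDs and destination -- substitute the real subset and workspace bucket.
sample_names = ["1000123", "1000456", "1000789"]
bucket_name = "example-workspace-bucket"
blob_path = "sample_subsets/bravo_subset_sample_names.txt"

# Write one sample name per line and upload; pass the resulting gs:// URI
# to the workflow's `cohort_sample_names` input.
client = storage.Client()
blob = client.bucket(bucket_name).blob(blob_path)
blob.upload_from_string("\n".join(sample_names) + "\n")
print(f"gs://{bucket_name}/{blob_path}")
```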

#### PGEN Outputs
1. You will need to run the `GvsPrepareRangesCallset` workflow for each subset of samples for which a PGEN or VCF deliverable is required for the callset.
    - This workflow transforms the data in the vet, ref_ranges, and samples tables into a schema optimized for extract.
    - The `enable_extract_table_ttl` input should be set to `true` (the default value is `false`), which will add a TTL of two weeks to the tables it creates.
    - `extract_table_prefix` should be set to a name that is unique to the given sample list. See the [naming conventions doc](https://docs.google.com/document/d/1pNtuv7uDoiOFPbwe4zx5sAGH7MyxwKqXkyrpNmBxeow) for guidance on what to use.
    - Specify the `interval_list` appropriate for the PGEN / VCF extraction run you are performing.
    - This workflow does not use the Terra Data Entity Model to run, so be sure to select the `Run workflow with inputs defined by file paths` workflow submission option.
1. `GvsExtractCallsetPgenMerged` workflow
    - Specify the same `call_set_identifier`, `dataset_name`, `project_id`, `extract_table_prefix`, and `interval_list` that were used in the `GvsPrepareRangesCallset` run documented above.
    - Specify the `interval_weights_bed` appropriate for the PGEN extraction run you are performing. `gs://gvs_quickstart_storage/weights/gvs_full_vet_weights_1kb_padded_orig.bed` is the interval weights BED used for Quickstart.
    - Select the workflow option "Retry with more memory" and choose a "Memory retry factor" of 1.5.
    - `GvsExtractCallsetPgen` currently defaults to a maximum of 100 alt alleles, which means that any sites with more than that number of alt alleles will be dropped.
    - Be sure to set `output_gcs_dir` to the proper path in the AoU delivery bucket so you don't need to copy the output files there yourself once the workflow has finished.
    - For `GvsExtractCallsetPgen` (which is called by `GvsExtractCallsetPgenMerged`), if one (or several) of the `PgenExtractTask` shards fail because of angry cloud, you can re-run the workflow with the exact same inputs with call caching turned on; the successful shards will cache and only the failed ones will re-run.
    - If you want to collect the monitoring logs from a large number of `Extract` shards, the `summarize_task_monitor_logs.py` script will not work if the task is scattered too wide. Use the `summarize_task_monitor_logs_from_file.py` script instead, which takes a FOFN of GCS paths rather than a space-separated series of localized files (see the sketch after this list).
    - This workflow does not use the Terra Data Entity Model to run, so be sure to select the `Run workflow with inputs defined by file paths` workflow submission option.
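
As a minimal sketch of building the FOFN of monitoring-log GCS paths for `summarize_task_monitor_logs_from_file.py` (the submission directory and log file name are assumptions about where Cromwell wrote the monitoring output; adjust them to the actual submission):

```python
import subprocess

# Hypothetical execution directory of the scattered PgenExtractTask shards.
submission_dir = "gs://example-workspace-bucket/submissions/SUBMISSION_ID/GvsExtractCallsetPgenMerged"
fofn_path = "monitoring_log_paths.fofn"

# List every monitoring log under the submission and write one GCS path per line;
# the resulting file is what the *_from_file.py script consumes instead of localized logs.
listing = subprocess.run(
    ["gsutil", "ls", f"{submission_dir}/**/monitoring.log"],
    check=True, capture_output=True, text=True,
)
with open(fofn_path, "w") as fofn:
    fofn.write(listing.stdout)
print(f"Wrote {len(listing.stdout.splitlines())} paths to {fofn_path}")
```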
