Basic joint genotyping with GATK4. NOT Best Practices, only for teaching/demo purposes.
- One or more per-sample GVCF files (.g.vcf), provided as an array
- Genomic resources: reference genome in FASTA format (.fasta) and its accessory files (.fasta.fai and .dict)
- List of intervals to process in GATK intervals list format (.list)
- Resourcing and environment parameters, including memory, disk space, and container, are all customizable
- Output: a multi-sample VCF of variants joint-called across the cohort, block-gzipped (.gz) with a tabix index (.gz.tbi)
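Because two of the steps are scattered by genome interval, it can help to see how an intervals list becomes a set of scatter units. The following is a minimal sketch, assuming the .list file holds one interval per line (for example `chr20:1-1000000` or a bare contig name such as `chr21`). The function name `read_intervals` is hypothetical and not part of the workflow.

```python
def read_intervals(lines):
    """Return the non-empty entries of an intervals list, one per scatter unit."""
    intervals = []
    for raw in lines:
        entry = raw.strip()
        if entry:  # skip blank lines
            intervals.append(entry)
    return intervals


# Hypothetical file content, given here as a list of lines.
example_lines = ["chr20:1-1000000\n", "\n", "chr21\n"]
for interval in read_intervals(example_lines):
    print(interval)
```

Each returned entry would correspond to one scattered import job and one scattered genotyping job.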
This workflow consists of four steps:
Ensures that the input GVCF files have the appropriate file extensions (.g.vcf.gz) and creates an index file (.tbi).
- Per file, scattered by input file
- Expects an input GVCF
- Outputs a copy of the GVCF (renamed if it did not have the right extension) and its index file.
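The renaming and indexing step above can be sketched as follows. This is an illustration only, not the workflow's actual implementation: it assumes the input is already block-gzipped, that only the file name needs adjusting, and that indexing is done with `tabix -p vcf`. The helper names and the list of accepted suffixes are hypothetical.

```python
import os

# Suffixes this sketch treats as "a compressed GVCF under another name" (assumption).
KNOWN_SUFFIXES = (".g.vcf.gz", ".gvcf.gz", ".vcf.gz")


def normalized_gvcf_name(path):
    """Return the file name with a .g.vcf.gz extension."""
    base = os.path.basename(path)
    for suffix in KNOWN_SUFFIXES:
        if base.endswith(suffix):
            return base[: -len(suffix)] + ".g.vcf.gz"
    raise ValueError("not a block-gzipped VCF/GVCF name: " + base)


def index_command(gvcf_path):
    """Build (but do not run) a tabix command; tabix writes <gvcf_path>.tbi."""
    return ["tabix", "-p", "vcf", gvcf_path]


print(normalized_gvcf_name("/data/sample1.vcf.gz"))
print(index_command("sample1.g.vcf.gz"))
```

Returning the command as an argument list keeps the sketch runnable without tabix installed; it could be passed to `subprocess.run` in a real script.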
Imports data from GVCF into a GenomicsDB datastore
- Across all inputs, scattered by genome interval
- Expects an array of input GVCFs
- Outputs a tarred GenomicsDB datastore
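For one interval, the import step above might translate into commands like the ones built below. This is a hedged sketch: the workspace name, file names, and function name are hypothetical, and the workflow itself may pass additional options. The GATK arguments shown (`--genomicsdb-workspace-path`, `-V`, `-L`) are standard GenomicsDBImport arguments.

```python
def import_commands(gvcfs, interval, workspace):
    """Build (but do not run) the import and tar commands for one interval."""
    import_cmd = [
        "gatk", "GenomicsDBImport",
        "--genomicsdb-workspace-path", workspace,
        "-L", interval,
    ]
    for gvcf in gvcfs:
        import_cmd += ["-V", gvcf]  # one -V per input GVCF
    # Pack the datastore directory so it can be handed to the next step as one file.
    tar_cmd = ["tar", "-cf", workspace + ".tar", workspace]
    return import_cmd, tar_cmd


import_cmd, tar_cmd = import_commands(
    ["s1.g.vcf.gz", "s2.g.vcf.gz"], "chr20:1-1000000", "genomicsdb"
)
print(" ".join(import_cmd))
print(" ".join(tar_cmd))
```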
Applies joint genotyping to all samples present in the datastore
- Across all inputs, scattered by genome interval
- Expects a tarred GenomicsDB datastore
- Outputs a VCF file with variant calls made across the cohort
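The genotyping step above can be sketched in the same way: unpack the datastore, then point GenotypeGVCFs at it through a `gendb://` URI. The file names and the function name are hypothetical, and the real workflow may set further options not shown here.

```python
def genotype_commands(workspace_tar, workspace, reference, interval, out_vcf):
    """Build (but do not run) the untar and GenotypeGVCFs commands for one interval."""
    untar_cmd = ["tar", "-xf", workspace_tar]
    genotype_cmd = [
        "gatk", "GenotypeGVCFs",
        "-R", reference,
        "-V", "gendb://" + workspace,  # read variants from the GenomicsDB datastore
        "-L", interval,
        "-O", out_vcf,
    ]
    return untar_cmd, genotype_cmd


untar_cmd, genotype_cmd = genotype_commands(
    "genomicsdb.tar", "genomicsdb", "ref.fasta", "chr20:1-1000000", "chr20.vcf.gz"
)
print(" ".join(untar_cmd))
print(" ".join(genotype_cmd))
```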
Merges VCF files across intervals generated by the scatter above
- Across genomic intervals
- Expects an array of per-interval VCFs
- Outputs the final cohort VCF
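The merge step above could be expressed as a single MergeVcfs call with one `-I` per per-interval VCF. This is a sketch under the assumption that the per-interval files are supplied in genomic order; the file names and function name are hypothetical.

```python
def merge_command(interval_vcfs, out_vcf):
    """Build (but do not run) a MergeVcfs command over per-interval VCFs."""
    if not interval_vcfs:
        raise ValueError("at least one per-interval VCF is required")
    cmd = ["gatk", "MergeVcfs"]
    for vcf in interval_vcfs:
        cmd += ["-I", vcf]  # one -I per per-interval VCF
    cmd += ["-O", out_vcf]
    return cmd


print(" ".join(merge_command(["chr20.vcf.gz", "chr21.vcf.gz"], "cohort.vcf.gz")))
```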