4.5.0.0

@droazen released this 13 Dec 22:53
8317d8b

Download release: gatk-4.5.0.0.zip
Docker image: https://hub.docker.com/r/broadinstitute/gatk/

Highlights of the 4.5.0.0 release:

  • HaplotypeCaller now supports custom ploidy regions that can be specified via a new --ploidy-regions argument, overriding the global -ploidy setting

  • The default SmithWaterman implementation for HaplotypeCaller and Mutect2 is now the hardware-accelerated version, resulting in a significant speedup

  • Funcotator has a new datasource release that brings in the latest version of Gencode and several other key data sources

  • We've updated our dependencies and our docker environment to greatly cut down on known security vulnerabilities

  • We've greatly improved support for http/https inputs in GATK-native tools (though most Picard tools bundled with GATK do not yet support them)

  • We've ported some additional DRAGEN features to HaplotypeCaller that bring us closer to functional equivalence with DRAGEN v3.7.8

  • GenomicsDBImport now has support for Azure storage az:// URIs

  • GnarlyGenotyper now has haploid support

  • Lots of important bug fixes, including a fix for a bug in the Intel GKL that could cause output files to intermittently fail to be compressed properly

Full list of changes:

  • HaplotypeCaller

    • HaplotypeCaller now supports custom ploidy regions (#8609)
      • Added a new argument to HaplotypeCaller called --ploidy-regions which allows the user to input a .bed or .interval_list with the "name" column equal to a positive integer for the ploidy to use when calling variants in that region
      • The main use case is for calling haploid variants outside the PAR for XY individuals as required by the VCF spec, but this provides a much more flexible interface for other similar niche applications, like genotyping individuals with other known aneuploidies
      • The global -ploidy flag will still provide the background default (or the built-in ploidy of 2 for humans), but the user-supplied values will supersede these in overlapping regions
    • Changed the SmithWaterman implementation to default to FASTEST_AVAILABLE (#8485)
    • Fixed a bug in pileup calling mode relating to the number of haplotypes (#8489)
    • Huge simplification of genotyping likelihoods calculations -- no change in output (#6351)
    • Be explicit about when variants are biallelic (#8332)
    • Fixed debug log severity for read threading assembler messages (#8419)
    • Fixed issue with visibility of the --dont-use-softclipped-bases argument (#8271)
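
As a rough illustration of the --ploidy-regions override described above, the sketch below parses BED-style lines whose "name" column is a positive integer and falls back to a global ploidy everywhere else. The function names (parse_ploidy_regions, ploidy_at), the example coordinates, and the simple linear lookup are assumptions made for illustration; this is not GATK's implementation.

```python
from typing import List, Tuple

# contig, start (0-based, inclusive), end (exclusive), ploidy
Region = Tuple[str, int, int, int]


def parse_ploidy_regions(bed_text: str) -> List[Region]:
    """Parse BED lines of the form: contig, start, end, name (ploidy)."""
    regions = []
    for line in bed_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        contig, start, end, name = line.split()[:4]
        ploidy = int(name)  # raises ValueError if the name is not an integer
        if ploidy <= 0:
            raise ValueError(f"ploidy must be a positive integer, got {name!r}")
        regions.append((contig, int(start), int(end), ploidy))
    return regions


def ploidy_at(contig: str, pos0: int, regions: List[Region],
              global_ploidy: int = 2) -> int:
    """Return the ploidy of the region covering pos0, else the global default."""
    for r_contig, start, end, ploidy in regions:
        if r_contig == contig and start <= pos0 < end:
            return ploidy  # user-supplied value supersedes the global setting
    return global_ploidy


# Hypothetical coordinates, for illustration only.
EXAMPLE_BED = """\
chrX\t3000000\t150000000\t1
chrY\t3000000\t50000000\t1
"""

regions = parse_ploidy_regions(EXAMPLE_BED)
print(ploidy_at("chrX", 5_000_000, regions))  # inside a haploid region
print(ploidy_at("chr1", 5_000_000, regions))  # falls back to the global ploidy
```

A real ploidy-regions file for an XY individual would list the non-PAR intervals of chrX and chrY for the reference build in use.
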
  • Mutect2

    • Added a --base-qual-correction-factor to allow a scale factor to be provided to modify the base qualities reported by the sequencer and used in the Mutect2 substitution error model (#8447)
      • Set to zero to turn off the error model changes introduced in GATK 4.1.9.0
    • Fixed a bug in FilterMutectCalls for GVCFs (#8458)
      • When using GVCFs with Mutect2 (for example with the Mitochondria mode), the filtering step sets the ADs for symbolic alleles to 0 so that they don't contribute to the overall AD. An off-by-one error removed the alt allele AD rather than the <NON_REF> allele AD, which led to NaNs and errors when a site had no ref reads. For example, a GT of [ref,alt,<NON_REF>] with an AD of [0,300,0] would accidentally be changed to an AD of [0,0,0], because the alt index was removed instead of the <NON_REF> index.
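
The FilterMutectCalls fix above can be pictured with a small sketch that zeroes the AD of symbolic alleles. The two functions are hypothetical reconstructions of the intended behavior and of this class of off-by-one error; they are not taken from the GATK source.

```python
from typing import List

NON_REF = "<NON_REF>"


def zero_symbolic_ad(alleles: List[str], ad: List[int]) -> List[int]:
    """Intended behavior: zero the AD entry of each symbolic allele."""
    return [0 if allele == NON_REF else depth
            for allele, depth in zip(alleles, ad)]


def zero_symbolic_ad_off_by_one(alleles: List[str], ad: List[int]) -> List[int]:
    """Buggy variant: zeroes the entry one position before the symbolic allele."""
    result = list(ad)
    for i, allele in enumerate(alleles):
        if allele == NON_REF:
            result[i - 1] = 0  # off by one: hits the alt allele instead
    return result


alleles = ["ref", "alt", NON_REF]
ad = [0, 300, 0]

print(zero_symbolic_ad(alleles, ad))             # [0, 300, 0]
print(zero_symbolic_ad_off_by_one(alleles, ad))  # [0, 0, 0]
```

With the buggy variant the total depth becomes 0, so any allele fraction computed as AD divided by total depth is 0/0, which is consistent with the NaNs mentioned above.
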
  • DRAGEN-GATK

    • Added implementations of the "columnwise detection" and "PDHMM" (partially-determined HMM) features from DRAGEN to bring us much closer to functional equivalence with DRAGEN v3.7.8 (#8083)
    • Development work to prepare the way for the final missing DRAGEN 3.7.8 feature, "joint detection":
      • Graph method for PDHMM event groups that unifies finding/merging and overlap/mutual exclusion (#8366)
      • Rewrote haplotype construction methods in PartiallyDeterminedHaplotypeComputationEngine (#8367)
      • More refactoring in PartiallyDeterminedHaplotypeComputationEngine and preparing for joint detection (#8492)
      • Innocuous housekeeping changes in the partially-determined haplotypes code (#8361)
      • Clarify cryptic bitwise operations in the partially-determined haplotype EventGroup subclass (#8400)
  • Joint Calling

    • Added haploid support to GnarlyGenotyper (#7750)
    • Fix to allow GenotypeGVCFs to properly handle events not in minimal representation (#8567)
    • ReblockGVCF: added a --keep-site-filters argument to keep site-level filters (#8304) (#8308)
    • ReblockGVCF: added a --add-site-filters-to-genotype argument to move site-level filters to genotype-level filters (#8484)
    • ReblockGVCF: added a --format-annotations-to-remove argument to specify format-level annotations to remove from all genotypes in final GVCF (#8411)
    • ReblockGVCF: added a check to make sure the input VCF is a GVCF rather than a single sample VCF (#8411)
    • Improved an error message in GnarlyGenotyper (#8270)
    • Added a mergeWithRemapping() method in ReferenceConfidenceVariantContextMerger to perform allele remapping prior to genotyping (#8318)
    • GVS (Genomic Variant Store) development:
      • Incorporated changes from the GVS branch to existing files (#8256)
      • Incorporated build changes from the GVS branch (#8249)
      • Merged non-GVS bits required by the GVS branch [VS-971] (#8362)
  • GenomicsDB

    • Allow GenomicsDBImport to accept Azure az:// URIs as input (#8438)
    • Updated to a newer GenomicsDB release with Java 17 support, improved error messages/logging, and generally improved performance (#8358)
  • Funcotator

    • New data source release V1.8 (#8512)
      • Updated Gencode to version 43, and also updated COSMIC, Clinvar, and several other datasources to their latest versions
      • The data sources are now split by reference into separate hg19 and hg38 bundles to cut down on size
    • Fixed support for newer Gencode GTF versions by making the GencodeGTFField parsing more permissive (#8351)
    • Fixed Funcotator VCF output renderer to correctly preserve B37 contig names on output for B37 aligned files (#8539)
    • Fixed a bug in the VCF comparison code that caused Funcotator to crash with certain datasources (#8445)
    • Connected the splice site window size to CLI parameters (#8463)
    • Allow LocatableXsvFuncotationFactory to read gzipped files (#8363)
  • CNV Calling

    • Matched gCNV pipeline arguments to those that were shown to have good performance in running large exome cohorts (#8234)
    • Added resource usage section to the GermlineCNVCaller java doc (#8064)
  • SV Calling

    • Added support for breakend replacement alleles in SVCluster (#8408)
      • Implements allele collapsing for "breakend replacement" BND alleles, as described in section 5.4 of the VCFv4.2 spec
    • Size similarity linkage and bug fixes for SV matching tools (#8257)
      • Added size similarity criterion to the SVConcordance and SVCluster tools. This is particularly useful for accurately matching smaller SVs that have a high degree of breakpoint uncertainty, in which case reciprocal overlap does not work well. PESR/mixed variant types must have size similarity, reciprocal overlap, and breakend window criteria met. Depth-only variants may have either size similarity + reciprocal overlap OR breakend window criteria met (or both).
    • Updated SV split-read strand validation and clustering (#8378)
      • Adds some flexibility to the allowed split-read strand annotations on SV records:
        • Allow INS -+ strands
        • Allow INV null strands
        • When clustering, only require that strands match for INV/BND records
    • Sample set and annotation improvements for SVConcordance (#8211)
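
The matching rules described for SVConcordance and SVCluster (#8257) can be sketched as follows. The SV class, the threshold parameters, and the definition of size similarity as the ratio of the smaller length to the larger are assumptions made for illustration; the tools' actual definitions and defaults may differ. Both records are assumed to be on the same contig and of the same type.

```python
from dataclasses import dataclass


@dataclass
class SV:
    start: int
    end: int
    depth_only: bool  # True for depth-only calls, False for PESR/mixed

    @property
    def length(self) -> int:
        return self.end - self.start


def reciprocal_overlap(a: SV, b: SV) -> float:
    """Smaller of the two fractions of each interval covered by the overlap."""
    if a.length <= 0 or b.length <= 0:
        return 0.0
    overlap = max(0, min(a.end, b.end) - max(a.start, b.start))
    return min(overlap / a.length, overlap / b.length)


def size_similarity(a: SV, b: SV) -> float:
    """Assumed definition: smaller length divided by larger length."""
    larger = max(a.length, b.length)
    return min(a.length, b.length) / larger if larger > 0 else 0.0


def within_breakend_window(a: SV, b: SV, window: int) -> bool:
    """Both breakpoints must lie within the window of each other."""
    return abs(a.start - b.start) <= window and abs(a.end - b.end) <= window


def is_match(a: SV, b: SV, min_overlap: float, min_size_sim: float,
             window: int) -> bool:
    overlap_ok = reciprocal_overlap(a, b) >= min_overlap
    size_ok = size_similarity(a, b) >= min_size_sim
    window_ok = within_breakend_window(a, b, window)
    if a.depth_only and b.depth_only:
        # Depth-only: (size similarity + reciprocal overlap) OR breakend window
        return (size_ok and overlap_ok) or window_ok
    # PESR/mixed: all three criteria must be met
    return size_ok and overlap_ok and window_ok


# Two small, equally sized SVs whose breakpoints are shifted by 200 bp.
pesr_a, pesr_b = SV(1000, 1300, False), SV(1200, 1500, False)
print(is_match(pesr_a, pesr_b, min_overlap=0.1, min_size_sim=0.5, window=500))
print(is_match(pesr_a, pesr_b, min_overlap=0.5, min_size_sim=0.5, window=500))
```

In this example the two records overlap by only one third of their length, so a strict reciprocal-overlap threshold rejects them even though their sizes are identical, which is the situation the size similarity criterion is meant to help with.
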
  • Mitochondrial pipeline

    • Added a variable for the user to specify the java heap size in Picard in the MT pipeline (#8406)
    • Exposed runtime attributes as arguments in the MT pipeline (#8413) (#8417)
  • Flow-based Calling

    • New/updated flow-based read tools (#8579)
      • Added a new GroundTruthScorer tool to score reads against a reference/ground truth
      • Updated FlowFeatureMapper
    • Created an AddFlowBaseQuality tool that writes reads from flow-based SAM/BAM/CRAM files that pass criteria to a new file while adding a base-quality attribute (BQ) (#8235)
    • Added an experimental tool FlowPairHMMAlignReadsToHaplotypes that aligns flow-based reads to a set of haplotypes / templates (#8305)
    • Fixed an issue with reads that contain the tp tag sometimes being incorrectly identified as flow-based (#8337)
    • Minor changes and fixes to flow-based annotations (#8442)
    • Removed a line in FlowBasedAnnotation that contained a bug and thus was meaningless (#8421)
    • Additional annotation in FeatureMap (#8347)
    • Removed unnecessary flow-based argument and option (#8342)
    • GroundTruthScorer doc update (#8597)
    • Removed unnecessary and buggy validation check (#8580)
  • Notable Enhancements

    • Major security fixes in our dependencies and docker environment
      • Updated the GATK base docker image to Ubuntu 22.04 for security fixes and newer versions of genomics packages like samtools and bcftools (#8610)
      • Updated GATK dependencies to address known security vulnerabilities, and added a vulnerability scanner to build.gradle (#8607)
    • Greatly improved HTTP support (#8611)
      • Updated the http-nio library and made tweaks to HTSJDK to make it available in more places. The new version of http-nio should provide much more reliable access to http(s) file paths. This is supported by all methods accessing Paths, and includes SAM/BAM/CRAM and VCF/Feature files. It includes a new retry mechanism which retries after transient errors. It also includes bug fixes and various other minor improvements, such as making encoded Path handling more consistent.
    • Added a new PrintFileDiagnostics tool that can output the internal metadata of CRAM, CRAI and BAI files for diagnostic purposes (#8577)
    • Added a new TransmittedSingleton annotation and added quality threshold arguments to the PossibleDenovo annotation (#8329)
    • Support multiple read name inputs in ReadNameReadFilter (#8405)
    • Added a native GATK implementation for 2bit references, and removed the dependency on the ADAM library (#8606)
  • Bug Fixes

    • Fixed a major bug in the Intel GKL that could cause output files to intermittently fail to be compressed properly (#8409)
  • Miscellaneous Changes

    • CNNVariantTrain: exposed more CNN training parameters as arguments (#8483)
    • Support underscores in bucket names on Google Cloud (#8439)
    • Performed some refactoring on the new annotation-based filtering tools (#8131)
    • Added tags to dockstore.yaml (#8323)
    • Added the ability to specify the RELEASE arg to the cloud-based docker build, and added a new docker release script (#8247)
    • Added an option to AnalyzeSaturationMutagenesis to keep disjoint mates (#8557)
    • Exit with code 137 when we get an OutOfMemoryError (#8277)
    • Updates to reduce size of docker image (#8259)
    • Free up space on Github Actions runners for all jobs (#8386) (#8371) (#8373)
    • Fixed warnings in Github Actions (#8241)
    • Disabled line-by-line codecov comments (#8613)
    • Fixed a bug in the GATK download metrics script (#8418)
    • Updated the Spark version in the GATK jar manifest, and hooked up the Spark version constant in build.gradle (#8625)
    • Fixed a warning in Gradle (#8431)
    • Pinned joblib to v1.1.1 in the python environment (#8391)
    • Updated the Ubuntu version for the Carrot github action because github dropped support for 18.04 (#8299)
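
Regarding the exit code 137 change (#8277): 137 equals 128 + 9, the status that shells conventionally report for a process terminated by SIGKILL, which is also how an out-of-memory kill by the operating system typically shows up. A wrapper script could therefore treat both cases the same way. The sketch below is a hypothetical helper, not part of GATK, showing one way to interpret the code.

```python
OOM_EXIT_CODE = 137  # from the release note: exit code used on OutOfMemoryError
SIGKILL = 9          # POSIX signal number; shells report 128 + signal number


def describe_exit(code: int) -> str:
    """Map a process exit code to a short, human-readable outcome."""
    if code == 0:
        return "success"
    if code == OOM_EXIT_CODE:
        return "out of memory: consider retrying with more memory"
    return f"failed with exit code {code}"


print(OOM_EXIT_CODE == 128 + SIGKILL)  # True
print(describe_exit(137))
print(describe_exit(1))
```

A workflow engine or job scheduler could use a check like this to decide whether to resubmit a task with a larger memory request.
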
  • Documentation

    • Major update to documentation generation for Metrics classes (#7749)
    • Updated some dead links to the GATK forums in the docs (#8273)
  • Dependencies

    • Updated Picard to 3.1.1 (#8585)
    • Updated HTSJDK to 4.1.0 (#8620)
    • Updated the Intel GKL to 0.8.11 (#8409)
    • Updated Apache Spark to 3.5.0 (#8607)
    • Updated Hadoop to 3.3.6 (#8607)
    • Updated google-cloud-nio to 0.127.8
    • Updated http-nio to 1.1.0 (#8626)