-
Notifications
You must be signed in to change notification settings - Fork 128
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add support for Nextclade GFFs to augur translate #1017
Conversation
Adds Cram tests for augur translate with GenBank- and GFF-based gene maps. Much of the data are copied from the Zika and TB builds. These tests capture the current behavior of augur translate in preparation for future improvements to this behavior. This commit adds support to `diff_jsons.py` for excluding paths in JSON files by regular expression, so we can ignore differences in temporary file names. Version DeepDiff 5.2.0 introduced a command line interface [1] that provides this same functionality and more, eliminating the need for custom scripts like `diff_jsons.py`. In the future, we may wish to update Augur's deepdiff dependency [2] and switch to this CLI. [1] https://zepworks.com/deepdiff/current/commandline.html [2] https://github.com/nextstrain/augur/blob/67027a6ae6a7a324c8eb646593cc5c0a89650ccf/setup.py#L72
Adds check for "gene_name" qualifier when loading features from a GFF file. This change allows us to read gene maps from Nextclade datasets. We check for "gene_name" immediately after looking for the "gene" qualifier, since these qualifiers are most commonly used with viral pathogen GFFs files we encounter. Note that there is still an inconsistency in treatment of GFF and GenBank files where we check for "locus_tag" in GenBank files first instead of "gene".
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #1017 +/- ##
==========================================
+ Coverage 59.36% 61.75% +2.38%
==========================================
Files 52 52
Lines 6271 6275 +4
Branches 1579 1581 +2
==========================================
+ Hits 3723 3875 +152
+ Misses 2285 2129 -156
- Partials 263 271 +8 ☔ View full report in Codecov by Sentry. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not familiar with augur translate
but the behavioral change seems simple and straightforward. Also good to have functional tests here 👍
pushd "$TESTDIR/../../" > /dev/null | ||
export AUGUR="${AUGUR:-../../bin/augur}" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(non-blocking)
It would be nice to have just one of these files (the same as tests/functional/filter/cram/_setup.sh), but this is fine for now.
Thanks for the quick check, @victorlin! Agreed that the functional tests could use some consolidation. Maybe we could brainstorm at the retreat a bit, if that's a topic you're down to discuss... :D |
Description of proposed changes
Nextclade's datasets provide gene maps for SARS-CoV-2 and influenza subtypes in GFF format. These files are easy to fetch with
nextclade dataset get
making them especially useful for bootstrapping training environments and supporting any custom analysis a user might want to perform with these pathogens. Unfortunately, Nextclade's gene maps store gene names in a GFF qualifier namedgene_name
, whileaugur translate
only looks for gene names in qualifiers namedgene
andlocus_tag
. This PR adds support for Nextclade's GFFs toaugur translate
through two steps:load_features
function inutils.py
that gets called byaugur translate
to support thegene_name
qualifier (8cdca95).Related issues
Testing
augur translate
behavioraugur translate
behaviorChecklist