Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for Nextclade GFFs to augur translate #1017

Merged
merged 3 commits into from
Aug 9, 2022

Conversation

huddlej
Copy link
Contributor

@huddlej huddlej commented Aug 8, 2022

Description of proposed changes

Nextclade's datasets provide gene maps for SARS-CoV-2 and influenza subtypes in GFF format. These files are easy to fetch with nextclade dataset get making them especially useful for bootstrapping training environments and supporting any custom analysis a user might want to perform with these pathogens. Unfortunately, Nextclade's gene maps store gene names in a GFF qualifier named gene_name, while augur translate only looks for gene names in qualifiers named gene and locus_tag. This PR adds support for Nextclade's GFFs to augur translate through two steps:

  1. Add functional tests for current augur translate behavior (c23f927).
  2. Add functional test for expected behavior with Nextclade GFFs and add logic to the underlying load_features function in utils.py that gets called by augur translate to support the gene_name qualifier (8cdca95).

Related issues

Testing

  • Adds functional tests for original augur translate behavior
  • Adds functional test for new augur translate behavior
  • Tested by CI

Checklist

  • Add a message in CHANGES.md summarizing the changes in this PR. Keep headers and formatting consistent with the rest of the file.

huddlej added 3 commits August 8, 2022 15:26
Adds Cram tests for augur translate with GenBank- and GFF-based gene
maps. Much of the data are copied from the Zika and TB builds. These
tests capture the current behavior of augur translate in preparation for
future improvements to this behavior.

This commit adds support to `diff_jsons.py` for excluding paths in JSON
files by regular expression, so we can ignore differences in temporary
file names. Version DeepDiff 5.2.0 introduced a command line interface
[1] that provides this same functionality and more, eliminating the need
for custom scripts like `diff_jsons.py`. In the future, we may wish to
update Augur's deepdiff dependency [2] and switch to this CLI.

[1] https://zepworks.com/deepdiff/current/commandline.html
[2] https://github.com/nextstrain/augur/blob/67027a6ae6a7a324c8eb646593cc5c0a89650ccf/setup.py#L72
Adds check for "gene_name" qualifier when loading features from a GFF
file. This change allows us to read gene maps from Nextclade datasets.
We check for "gene_name" immediately after looking for the "gene"
qualifier, since these qualifiers are most commonly used with viral
pathogen GFFs files we encounter. Note that there is still an
inconsistency in treatment of GFF and GenBank files where we check for
"locus_tag" in GenBank files first instead of "gene".
@codecov
Copy link

codecov bot commented Aug 8, 2022

Codecov Report

Attention: Patch coverage is 40.00000% with 3 lines in your changes missing coverage. Please review.

Project coverage is 61.75%. Comparing base (67027a6) to head (ba142cd).
Report is 1272 commits behind head on master.

Files Patch % Lines
augur/utils.py 40.00% 1 Missing and 2 partials ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #1017      +/-   ##
==========================================
+ Coverage   59.36%   61.75%   +2.38%     
==========================================
  Files          52       52              
  Lines        6271     6275       +4     
  Branches     1579     1581       +2     
==========================================
+ Hits         3723     3875     +152     
+ Misses       2285     2129     -156     
- Partials      263      271       +8     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@huddlej huddlej requested a review from victorlin August 8, 2022 22:58
Copy link
Member

@victorlin victorlin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not familiar with augur translate but the behavioral change seems simple and straightforward. Also good to have functional tests here 👍

Comment on lines +1 to +2
pushd "$TESTDIR/../../" > /dev/null
export AUGUR="${AUGUR:-../../bin/augur}"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(non-blocking)

It would be nice to have just one of these files (the same as tests/functional/filter/cram/_setup.sh), but this is fine for now.

@huddlej
Copy link
Contributor Author

huddlej commented Aug 9, 2022

Thanks for the quick check, @victorlin! Agreed that the functional tests could use some consolidation. Maybe we could brainstorm at the retreat a bit, if that's a topic you're down to discuss... :D

@huddlej huddlej merged commit 3cbd593 into master Aug 9, 2022
@huddlej huddlej deleted the support-nextclade-gffs branch August 9, 2022 04:37
@corneliusroemer corneliusroemer mentioned this pull request Aug 10, 2022
1 task
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants