Add support for Nextclade GFFs to augur translate #1017

huddlej · 2022-08-08T22:46:07Z

Description of proposed changes

Nextclade's datasets provide gene maps for SARS-CoV-2 and influenza subtypes in GFF format. These files are easy to fetch with nextclade dataset get making them especially useful for bootstrapping training environments and supporting any custom analysis a user might want to perform with these pathogens. Unfortunately, Nextclade's gene maps store gene names in a GFF qualifier named gene_name, while augur translate only looks for gene names in qualifiers named gene and locus_tag. This PR adds support for Nextclade's GFFs to augur translate through two steps:

Add functional tests for current augur translate behavior (c23f927).
Add functional test for expected behavior with Nextclade GFFs and add logic to the underlying load_features function in utils.py that gets called by augur translate to support the gene_name qualifier (8cdca95).

Related issues

Replaces part of Add functional tests for expected behavior of translate command #1009

Testing

Adds functional tests for original augur translate behavior
Adds functional test for new augur translate behavior
Tested by CI

Checklist

Add a message in CHANGES.md summarizing the changes in this PR. Keep headers and formatting consistent with the rest of the file.

Adds Cram tests for augur translate with GenBank- and GFF-based gene maps. Much of the data are copied from the Zika and TB builds. These tests capture the current behavior of augur translate in preparation for future improvements to this behavior. This commit adds support to `diff_jsons.py` for excluding paths in JSON files by regular expression, so we can ignore differences in temporary file names. Version DeepDiff 5.2.0 introduced a command line interface [1] that provides this same functionality and more, eliminating the need for custom scripts like `diff_jsons.py`. In the future, we may wish to update Augur's deepdiff dependency [2] and switch to this CLI. [1] https://zepworks.com/deepdiff/current/commandline.html [2] https://github.com/nextstrain/augur/blob/67027a6ae6a7a324c8eb646593cc5c0a89650ccf/setup.py#L72

Adds check for "gene_name" qualifier when loading features from a GFF file. This change allows us to read gene maps from Nextclade datasets. We check for "gene_name" immediately after looking for the "gene" qualifier, since these qualifiers are most commonly used with viral pathogen GFFs files we encounter. Note that there is still an inconsistency in treatment of GFF and GenBank files where we check for "locus_tag" in GenBank files first instead of "gene".

codecov · 2022-08-08T22:57:41Z

Codecov Report

Attention: Patch coverage is 40.00000% with 3 lines in your changes missing coverage. Please review.

Project coverage is 61.75%. Comparing base (67027a6) to head (ba142cd).
Report is 1272 commits behind head on master.

Files	Patch %	Lines
augur/utils.py	40.00%	1 Missing and 2 partials ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##           master    #1017      +/-   ##
==========================================
+ Coverage   59.36%   61.75%   +2.38%     
==========================================
  Files          52       52              
  Lines        6271     6275       +4     
  Branches     1579     1581       +2     
==========================================
+ Hits         3723     3875     +152     
+ Misses       2285     2129     -156     
- Partials      263      271       +8

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

victorlin

I'm not familiar with augur translate but the behavioral change seems simple and straightforward. Also good to have functional tests here 👍

victorlin · 2022-08-08T23:21:46Z

tests/functional/translate/cram/_setup.sh

+pushd "$TESTDIR/../../" > /dev/null
+export AUGUR="${AUGUR:-../../bin/augur}"


(non-blocking)

It would be nice to have just one of these files (the same as tests/functional/filter/cram/_setup.sh), but this is fine for now.

huddlej · 2022-08-09T04:35:00Z

Thanks for the quick check, @victorlin! Agreed that the functional tests could use some consolidation. Maybe we could brainstorm at the retreat a bit, if that's a topic you're down to discuss... :D

huddlej added 3 commits August 8, 2022 15:26

Update change log

ba142cd

huddlej mentioned this pull request Aug 8, 2022

Add functional tests for expected behavior of translate command #1009

Draft

3 tasks

huddlej requested a review from victorlin August 8, 2022 22:58

victorlin approved these changes Aug 8, 2022

View reviewed changes

huddlej merged commit 3cbd593 into master Aug 9, 2022

huddlej deleted the support-nextclade-gffs branch August 9, 2022 04:37

corneliusroemer mentioned this pull request Aug 10, 2022

Upgrade treetime 0.9 #1018

Merged

1 task

huddlej mentioned this pull request May 17, 2023

utils.load_features cannot parse Nextclade gene map GFFs #1007

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add support for Nextclade GFFs to augur translate #1017

Add support for Nextclade GFFs to augur translate #1017

huddlej commented Aug 8, 2022 •

edited

Loading

codecov bot commented Aug 8, 2022 •

edited

Loading

victorlin left a comment

victorlin Aug 8, 2022

huddlej commented Aug 9, 2022

		pushd "$TESTDIR/../../" > /dev/null
		export AUGUR="${AUGUR:-../../bin/augur}"

Add support for Nextclade GFFs to augur translate #1017

Add support for Nextclade GFFs to augur translate #1017

Conversation

huddlej commented Aug 8, 2022 • edited Loading

Description of proposed changes

Related issues

Testing

Checklist

codecov bot commented Aug 8, 2022 • edited Loading

Codecov Report

victorlin left a comment

Choose a reason for hiding this comment

victorlin Aug 8, 2022

Choose a reason for hiding this comment

huddlej commented Aug 9, 2022

huddlej commented Aug 8, 2022 •

edited

Loading

codecov bot commented Aug 8, 2022 •

edited

Loading