Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add functional tests for expected behavior of translate command #1009

Draft
wants to merge 4 commits into
base: master
Choose a base branch
from

Conversation

huddlej
Copy link
Contributor

@huddlej huddlej commented Jul 25, 2022

Description of proposed changes

Adds functional tests for the expected behavior of the augur translate command and then updates related translate logic to make tests pass. The main changes to behavior include:

  • Added support for "gene_name" qualifier in GFF inputs (in addition to "gene" and "locus_tag")
  • Changed sequence id in JSON output to match the sequence id defined in the GenBank or GFF inputs instead of the file name of those inputs

Testing

  • Adds four new functional tests for GFF and GenBank inputs to augur translate
  • Tested by CI

Checklist

  • Add a message in CHANGES.md summarizing the changes in this PR. Keep headers and formatting consistent with the rest of the file.

huddlej added 4 commits July 25, 2022 15:34
Adds Cram tests for augur translate with GenBank- and GFF-based gene
maps. Much of the data are copied from the Zika and TB builds. These
tests capture the expected behavior of augur translate which is not yet
the actual behavior.
Adds check for "gene_name" qualifier when loading features from a GFF
file. This change allows us to read gene maps from Nextclade datasets.
Uses the actual sequence id for the reference that corresponds to the
coordinates written by augur translate, instead of writing out the
"seqid" as the filename of the gene map. The file name does not always
represent the id of the sequence (e.g., when the name is "genemap.gff"),
whereas the sequence id in the given GenBank or GFF record should be
more accurate.
Updates sequence ids for Zika translate output to actual sequence id
instead of file name of the gene map.
Copy link
Member

@tsibley tsibley left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These tests capture the expected behavior of augur translate which is not yet the actual behavior.

Love this phrasing!

tests/builds/zika/results/aa_muts.json Show resolved Hide resolved
for feat in SeqIO.read(reference, 'genbank').features:
reference_record = SeqIO.read(reference, 'genbank')
for feat in reference_record.features:
feat.seqid = reference_record.name
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had to check that .name was the right attribute here. It seems like it should be equal to .id given we don't use a custom title2ids parser, but I think it'd still be clearer to use .id since we're assigning it to .seqid.

For those playing the home game, Bio.SeqRecord (what's returned by Bio.SeqIO.read()) has id, name, and description attributes, which can all be different. For FASTA, by default, id and name are the same (the first word of the FASTA defline, sans >) and description is the whole defline, sans >. This can be changed by providing a title2ids function to the parser which splits the defline into the three attributes.

augur/utils.py Show resolved Hide resolved
@huddlej
Copy link
Contributor Author

huddlej commented Jul 28, 2022

Thank you for the review, @tsibley! I'm rethinking my approach to this PR, actually, since I think it does too much at once. A simpler/clearer solution would be to:

  • add tests for how the code actually works (instead of how I want it to work, which could be more controversial)
  • add the small fix for parsing gene_name that I need for the upcoming workshop
  • update the tests to reflect the behavior I'd prefer and change the code to implement that behavior

The first two could be in one smaller PR and then the last item could be in a separate maybe larger PR. I'll need to follow up on your comments above for these PRs anyway, though, so thank you again!

@tsibley
Copy link
Member

tsibley commented Jul 28, 2022

Ah, ok, sounds good!

@huddlej huddlej marked this pull request as draft August 8, 2022 22:57
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
No open projects
Status: In Review
Development

Successfully merging this pull request may close these issues.

2 participants