-
Notifications
You must be signed in to change notification settings - Fork 128
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add functional tests for expected behavior of translate command #1009
base: master
Are you sure you want to change the base?
Conversation
Adds Cram tests for augur translate with GenBank- and GFF-based gene maps. Much of the data are copied from the Zika and TB builds. These tests capture the expected behavior of augur translate which is not yet the actual behavior.
Adds check for "gene_name" qualifier when loading features from a GFF file. This change allows us to read gene maps from Nextclade datasets.
Uses the actual sequence id for the reference that corresponds to the coordinates written by augur translate, instead of writing out the "seqid" as the filename of the gene map. The file name does not always represent the id of the sequence (e.g., when the name is "genemap.gff"), whereas the sequence id in the given GenBank or GFF record should be more accurate.
Updates sequence ids for Zika translate output to actual sequence id instead of file name of the gene map.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These tests capture the expected behavior of augur translate which is not yet the actual behavior.
Love this phrasing!
for feat in SeqIO.read(reference, 'genbank').features: | ||
reference_record = SeqIO.read(reference, 'genbank') | ||
for feat in reference_record.features: | ||
feat.seqid = reference_record.name |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I had to check that .name
was the right attribute here. It seems like it should be equal to .id
given we don't use a custom title2ids
parser, but I think it'd still be clearer to use .id
since we're assigning it to .seqid
.
For those playing the home game, Bio.SeqRecord
(what's returned by Bio.SeqIO.read()
) has id
, name
, and description
attributes, which can all be different. For FASTA, by default, id
and name
are the same (the first word of the FASTA defline, sans >
) and description
is the whole defline, sans >
. This can be changed by providing a title2ids
function to the parser which splits the defline into the three attributes.
Thank you for the review, @tsibley! I'm rethinking my approach to this PR, actually, since I think it does too much at once. A simpler/clearer solution would be to:
The first two could be in one smaller PR and then the last item could be in a separate maybe larger PR. I'll need to follow up on your comments above for these PRs anyway, though, so thank you again! |
Ah, ok, sounds good! |
Description of proposed changes
Adds functional tests for the expected behavior of the
augur translate
command and then updates related translate logic to make tests pass. The main changes to behavior include:Testing
Checklist