Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Compare sequence objects for equality. #468

Merged
merged 1 commit into from
Mar 30, 2020

Conversation

groutr
Copy link
Contributor

@groutr groutr commented Mar 24, 2020

I became aware of this project from the COVID-19 Open Source Help Desk. Looking at #465 I feel I discovered a subtle bug in how align.py detects duplicate sequences when reading from files. I look forward to any feedback on this PR.

str(record) is the string representation of the entire record and
comparing the string representations compares more than the sequences.

Description of proposed changes

This PR addresses a subtle bug in align.py when checking duplicate sequences when reading from files.

The changes in this PR fix sequence comparison in align.py. Before these changes, the wrong objects were being compared.

Example:

from Bio import SeqIO
a = SeqIO.parse('test.fasta', 'fasta')
p = next(a)

p
# SeqRecord(seq=Seq('GAATTCTCATACACAGATGACAAGAGATGGCATATAAAGTATGGTGCGGAATAC...TTC', SingleLetterAlphabet()), id='EM_INV:AC008368', name='EM_INV:AC008368', description='EM_INV:AC008368 AC008368.21 STD:Trypanosoma brucei chromosome 2 clone RPCI93-1F7, complete sequence.', dbxrefs=[])
str(p)
# "ID: EM_INV:AC008368\nName: EM_INV:AC008368\nDescription: EM_INV:AC008368 AC008368.21 STD:Trypanosoma brucei chromosome 2 clone RPCI93-1F7, complete sequence.\nNumber of features: 0\nSeq('GAATTCTCATACACAGATGACAAGAGATGGCATATAAAGTATGGTGCGGAATAC...TTC', SingleLetterAlphabet())"
# With str(p) == str(q) it's possible for the sequences to be the same, but the metadata cause the comparison to return False.

p.seq    # These are the objects that I think we are actually wanting to compare
# Seq('GAATTCTCATACACAGATGACAAGAGATGGCATATAAAGTATGGTGCGGAATAC...TTC', SingleLetterAlphabet())

# We could compare str(p.seq) == str(q.seq) however, the seq objects also check the alphabet.
# The correct comparison should be p.seq == q.seq

Testing

I would love to get some help on how to properly test these changes. The instructions in DEV_DOCS didn't seem to work on my setup.

Thank you for contributing to Nextstrain!

@tolot27
Copy link
Contributor

tolot27 commented Mar 24, 2020

@groutr You are right! Only the Seq objects should be compared at that time/line. Comparing the alphabet is not necessary, IMHO. It is more interessing, what the outcome of the comparison is, if one of the sequences is represented as RNA and the other one as cDNA. str(p.seq) == str(q.seq) would return false in that case. The alphabet would be different as well (RNAAlphabet and DNAAlphabet). However, should these two sequences be considered identical?

@groutr
Copy link
Contributor Author

groutr commented Mar 25, 2020

@tolot27 the Seq.__eq__ method checks to see if alphabets are compatible and issues a warning if they are not compatible. Ultimately, the comparison that is returned is str(self) == str(other)

As far as alphabet compatibility, I'm afraid that is beyond my expertise.

@groutr
Copy link
Contributor Author

groutr commented Mar 27, 2020

@tolot27 Unless we are setting the alphabet, biopython defaults to SingleLetterAlphabet when reading fasta files. In this case, if the strings are equal, then the sequences will be considered identical.

Is there anything else that I need to do?

@tolot27
Copy link
Contributor

tolot27 commented Mar 27, 2020

Is there anything else that I need to do?

@groutr No, thanks. The PR looks good, IMHO.

@tolot27
Copy link
Contributor

tolot27 commented Mar 27, 2020

@groutr Maybe, you can write a test case for it.

Copy link
Contributor

@huddlej huddlej left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@groutr Thank you for catching this subtle bug! The new code and test work for me and look good.

Would you mind rebasing and merging these six commits into a single commit with a commit message that summarizes all changes? Let me know if you aren't comfortable doing this and I can do it instead.

str(record) is the string representation of the entire record and
comparing the string representations compares more than the sequences.
@groutr groutr requested a review from huddlej March 30, 2020 19:36
@huddlej huddlej merged commit 7fff5b1 into nextstrain:master Mar 30, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants