Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Exception for Duplicate Strain Names #356

Merged
merged 2 commits into from
Sep 3, 2019
Merged

Conversation

emmahodcroft
Copy link
Member

@emmahodcroft emmahodcroft commented Aug 16, 2019

A collaborator noticed that a bunch of sequences from Germany were missing from a big run - I'd never have noticed they were gone on my own given my unfamiliarity with the data, and the size of the run (about 70 missing from ~1400). (I'm also downsampling, so the final number isn't ~1400, either!)

Turns out due to a mess-up in metadata on their end, a bunch of sequences were being given the same strain name. When these are read in (in filter and align, they just overwrite each other in the dict, with no error. We also don't error if there are duplicate keys when reading in metadata (from read_metadata in utils.py). I think this would be a sensible thing to do.

For sequences this is pretty easy - I switched from using our own dict to using Bio.SeqIO's function to_dict which raises a ValueError for duplicate keys.

For metadata, I modified read_metadata so that it raises ValueErrors if there are duplicate keys. I added a catch to this in filter since I was working on it, to show a prettier error message, but this isn't necessary.

Should avoid having sneaky things like this happen in the background in future!

@emmahodcroft
Copy link
Member Author

Oh - one more thought: I put this in filter and align as those are rather obvious starting points. I assume most Treebuilder programs handle this kind of thing somehow (though how shouty they are about it, and whether we hide this somewhere, I don't know), and we don't always read in alignments/etc before we pass it to Treebuilders.

However, perhaps something should also go into parse? I never use this myself so I'm not familiar, but it seems like it's not unimaginable that after stripping away some stuff that might make the sequence unique (date, location, etc), you might be left with a string that's not unique. @huddlej I thought you might be more familiar with this function and able to weigh in.

On the other hand, perhaps it's unlikely to use parse and not then use filter or align - or in worse case, a Treebuilder which will hopefully shout somehow?

@emmahodcroft emmahodcroft requested a review from huddlej August 16, 2019 13:05
@emmahodcroft
Copy link
Member Author

All three Treebuilders error noisily, but for none the reason comes up on the screen. IQTree example:
image

All three put it in the log files.

IQTree

It is in the log file, but if some IQTree-incompatible-symbols (here /) have been replaced so that they can be restored, it may not be super clear to the user what in the world has happened....
image
(Actual sequence name is GB/England_2018_127/CSF/2018)

FastTree

It is in the log file, and this time looks more sensible:
image

RAxML:

In the log file and also looks reasonable.
image

I added a little bit to the error messages for when TreeBuilders fail, so that users are directed to look at the log file (if it exists).

@jameshadfield
Copy link
Member

👍 Seems great. These things are the bane of bioinformatics.

@emmahodcroft
Copy link
Member Author

@huddlej If you don't want to add anything at the moment RE parse, I'll go ahead and merge?

@emmahodcroft emmahodcroft merged commit 5b3929a into master Sep 3, 2019
@rneher rneher deleted the detect_seq_dups branch September 5, 2019 12:17
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants