Exception for Duplicate Strain Names #356

emmahodcroft · 2019-08-16T12:54:31Z

A collaborator noticed that a bunch of sequences from Germany were missing from a big run - I'd never have noticed they were gone on my own given my unfamiliarity with the data, and the size of the run (about 70 missing from ~1400). (I'm also downsampling, so the final number isn't ~1400, either!)

Turns out that, due to a mess-up in metadata on their end, a bunch of sequences were being given the same strain name. When these are read in (in filter and align), they just overwrite each other in the dict, with no error. We also don't error if there are duplicate keys when reading in metadata (from read_metadata in utils.py). I think raising an error in both cases would be a sensible thing to do.

For sequences this is pretty easy - I switched from building our own dict to using Bio.SeqIO's to_dict function, which raises a ValueError for duplicate keys.
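To illustrate the difference, here is a minimal standard-library sketch, not the code in this PR. `read_fasta`, `to_dict_silent` and `to_dict_strict` are hypothetical names, and the tiny FASTA parser is only there to keep the example self-contained. The strict variant mimics the duplicate-key check that Bio.SeqIO's to_dict performs.

```python
import io

# Toy FASTA input in which "strainA" appears twice (made-up data).
FASTA = """>strainA
ACGT
>strainB
ACGA
>strainA
TTTT
"""

def read_fasta(handle):
    """Yield (name, sequence) pairs from a FASTA handle (minimal parser)."""
    name, chunks = None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            # Use the first whitespace-delimited token of the header as the name.
            name, chunks = line[1:].split()[0], []
        elif line:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)

def to_dict_silent(records):
    """Plain dict assignment: a later duplicate silently replaces the earlier one."""
    seqs = {}
    for name, seq in records:
        seqs[name] = seq
    return seqs

def to_dict_strict(records):
    """Raise ValueError on the first repeated name instead of overwriting."""
    seqs = {}
    for name, seq in records:
        if name in seqs:
            raise ValueError(f"Duplicate strain name '{name}'")
        seqs[name] = seq
    return seqs
```

With the toy input, `to_dict_silent` returns two records and quietly drops the first `strainA`, while `to_dict_strict` stops with a ValueError that names the offending strain.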

For metadata, I modified read_metadata so that it raises ValueErrors if there are duplicate keys. I added a catch to this in filter since I was working on it, to show a prettier error message, but this isn't necessary.
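As a rough sketch of the idea (assuming a tab-separated file with a `strain` column; `read_metadata_strict` and `load_or_report` are hypothetical names, not the functions in utils.py or filter):

```python
import csv
import io

def read_metadata_strict(handle, key="strain", sep="\t"):
    """Read delimited metadata into a dict keyed by `key`, refusing duplicates."""
    metadata = {}
    for row in csv.DictReader(handle, delimiter=sep):
        name = row[key]
        if name in metadata:
            # Fail loudly rather than letting the later row overwrite the earlier one.
            raise ValueError(f"Duplicate {key} '{name}' in metadata")
        metadata[name] = row
    return metadata

def load_or_report(handle):
    """Caller-side catch: turn the ValueError into a short, readable message."""
    try:
        return read_metadata_strict(handle)
    except ValueError as error:
        print(f"ERROR: {error}")
        return None
```

The catch in the caller is optional, as noted above: without it the ValueError still stops the run, just with a traceback instead of a one-line message.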

Should avoid having sneaky things like this happen in the background in future!

emmahodcroft · 2019-08-16T13:05:03Z

Oh - one more thought: I put this in filter and align as those are rather obvious starting points. I assume most Treebuilder programs handle this kind of thing somehow (though how shouty they are about it, and whether we hide this somewhere, I don't know), and we don't always read in alignments/etc before we pass them to Treebuilders.

However, perhaps something should also go into parse? I never use this myself so I'm not familiar, but it seems like it's not unimaginable that after stripping away some stuff that might make the sequence unique (date, location, etc), you might be left with a string that's not unique. @huddlej I thought you might be more familiar with this function and able to weigh in.

On the other hand, perhaps it's unlikely that someone would use parse and not then use filter or align - or, in the worst case, a Treebuilder, which will hopefully shout somehow?

emmahodcroft · 2019-08-16T13:39:27Z

All three Treebuilders error noisily, but none of them shows the reason on the screen. IQTree example:

All three put it in the log files.

IQTree

It is in the log file, but if some IQTree-incompatible symbols (here /) have been replaced so that they can be restored later, it may not be super clear to the user what in the world has happened...

(Actual sequence name is GB/England_2018_127/CSF/2018)

FastTree

It is in the log file, and this time looks more sensible:

RAxML

In the log file and also looks reasonable.

I added a little bit to the error messages for when TreeBuilders fail, so that users are directed to look at the log file (if it exists).
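Roughly, the idea looks like the sketch below. `tree_failure_message` is a hypothetical helper and the message wording is illustrative, not the exact text added in this PR.

```python
import os

def tree_failure_message(builder, log_file=None):
    """Build an error message for a failed tree-builder run.

    The pointer to the log file is only appended when that file actually exists.
    """
    message = f"ERROR: tree building with {builder} failed."
    if log_file and os.path.isfile(log_file):
        message += f" Please see the log file for more details: {log_file}"
    return message
```

Checking that the file exists avoids sending users to a log that was never written, for example when the program failed before it could start.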

jameshadfield · 2019-08-27T07:29:07Z

👍 Seems great. These things are the bane of bioinformatics.

emmahodcroft · 2019-08-27T09:28:39Z

@huddlej If you don't want to add anything at the moment RE parse, I'll go ahead and merge?

throw error for duplicate strains in fasta or metadata

e6e24c0

emmahodcroft requested a review from huddlej August 16, 2019 13:05

add to treebuilding error message

d9a7e5a

emmahodcroft merged commit 5b3929a into master Sep 3, 2019

rneher deleted the detect_seq_dups branch September 5, 2019 12:17
