-
Notifications
You must be signed in to change notification settings - Fork 128
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Errors and warnings for trees without internal node names #283
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@emmahodcroft this is great -- thanks for doing this 🎉
My comments (below) are generally aimed at forcing users to add node names via augur refine
and nowhere else. I think this makes things less error-prone (e.g. it avoids the subtle bug where the wrong tree was provided to augur traits
and your warning message ignored). Naming things in refine
also means that it's only ever done once -- whether you infer temporal phylogeny or not.
This doesn't detect situations where the order/topology of the tree has changed since running ancestral but all the node names are the same. (But I'm not sure we could detect this?)
This would be avoided if we only ever added internal nodes via refine
and all downstream commands checked for these.
augur refine --tree results/tree_raw.nwk --output-tree results/tree.nwk
This is much much much nicer 👍
Perhaps we could detect this case (i.e. no alignment provided) and print some output indicating that refine is only adding internal node labels.
For Fasta-input, if the root node doesn't have a match in the alignment, it won't have the full translated sequence(s) in the resulting JSON. This doesn't seem to affect how it views in auspice - could this cause errors elsewhere? (If so, easy to add an extra check here.)
Hmm not sure I understand this. How can we infer mutations on the root node?
print("*** If you want to use 'augur export' or `augur translate` later, re-run this command with the output of 'augur refine'.") | ||
print("*** If you haven't run 'augur refine', you can add node names to your tree by running:") | ||
print("*** augur refine --tree %s --output-tree <filename>.nwk"%(args.tree) ) | ||
print("*** And use <filename>.nwk as the tree when running 'ancestral', 'translate', and 'traits'") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is great 👍
I would be strongly in favor of changing this to an ERROR
and exiting at this point, thus forcing the user to generate these internal node names through augur refine
, then rerunning ancestral / traits using the correct tree.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this might raise a bigger question of how we want these individual parts to operate - more independently or always together? Should it be ok just to run ancestral
or translate
to get some information, without the intention of putting it together or through export
(in which case node names don't matter), or do we want to make people always keep the pieces so that they could be put together, because they may make that decision later without remembering the warnings? I can think for both sides, myself - it would be good to hear thoughts of others. (Should I ask in Slack?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm all for being able to use these features as independent parts, with no intention to run export
. But if one runs augur traits
(e.g.) using a tree without internal node names, then you get an output (json) which refers to things like NODE00001 which do not exist in the tree. I think in these cases we want to exit and say "We can only infer things onto a tree with internal node names labelled, else you cannot match the results to your tree! One way to do this is to run augur refine --tree A --output-tree B
"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm happy to change to this. Feel free to nudge me if I haven't done it in a few days...!
print("\n*** ERROR: Mismatch between node names in %s and in %s"%(args.tree, args.ancestral_sequences)) | ||
print("*** Ensure you are using the same tree you used to run 'ancestral' as input here.") | ||
print("*** Or, re-run 'ancestral' using %s, then use the new %s as input here."%(args.tree, args.ancestral_sequences)) | ||
return 1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
if args.output_node_data: | ||
node_data_fname = args.output_node_data | ||
else: | ||
elif args.alignment: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Node-data is necessary (it seems) even if it only contains branch lengths, because export looks for branch lengths in metadata read in from such files. However, I played around and export can quite easily be modified to look for branch lengths in the Newick if it doesn't find it in the metadata.
I think it's sensible for export
to work without any node_data JSONs, and look in the newick file for divergence. (If divergence is provided in the node_data files then it should preferentially use that.) Certainly if I just used refine
to add node names it seems unnecessarily complicated to have a node data JSON produced.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, I'll make a PR with this change in export
, and we can take a look and ensure it seems like it's going to work as intended, then we can modify refine
here so that it doesn't output this if only node-names are added.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I played around with this in export
. It's simple to get branch lengths from the tree rather than branch_lengths node-data (but preferentially from branch_lengths node-data, if available). However, to test this, I'm inputting a 'raw' tree, then providing node-data (some is required) for only aa_muts (for example).
So the deeper question is, if you export
with a raw tree and some ancestral
, or trait
info also generated on a raw tree (so no internal node names match the tree), how should export
handle this?
I played around and for V1, changed recursively_decorate_tree_json_nextflu_schema
so that if node['strain']
is None
(this is an internal node, so has no name on a raw tree), it doesn't error, but displays a warning saying it can't find the node, and so metadata for this node won't be written. Essentially, the only node-data traits attached to the tree are those on the tips.
Thanks to the wonderful raw tree I am using, this looks like a hot mess, but it works...:
By 'works' I mean that you can see different AA mutations at the tips, for example. Branches always seem to be just coloured by the first AA in the legend, I guess that's a default behaviour somewhere for if there's no other information for the branch? But assume this could be changed to default to grey?
For traits there ends up being no difference between just adding the trait from the metadata.
Is this the kind of behaviour we want? Or do we not want to support this?
(I tried to test this for V2 as well, but didn't realise it required reinstalling Auspice... so I've modified export
to give me V2 JSONS without error, but I haven't yet tested them. Same theory applies as above.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Both the V1 exports that work in 'current' auspice and the V2 export do NOT work in V2 auspice.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So no, it doesn't work for mutations, in V1 or V2. --node-data
in export
that can't be matched with nodes in the tree should cause an error (a reasonable one, unlike currently).
We could in theory still support export
reading branch length from the tree instead of from a node-data file, in case they ran refine
just to get node-names but didn't include the branch-length node-data JSON in export
, but this is a question of how much we want to coddle users I suppose. If not, we should at least try to give a reasonable error probably.
I'm less familiar with Fasta-input, and I've already forgotten a lot over the weekend - I'll look again. But from what I remember, normally the whole translated sequence of the root node is put in aa_muts, whereas on other nodes it's just the mutations. If the root node in the tree doesn't have a corresponding sequence in Alternatively we could make the Fasta-input as strict as the VCF-input, where even one missing internal node will cause the whole thing to error, which would intrinsically solve the problem. |
I changed to using Exceptions as suggested in Remaining issues:
|
These are changes designed to detect cases where
refine
hasn't been run in order to give internal node names, and other functions that could or do rely on node-names are called. Some give warnings, others give errors and fail.The warnings/errors I implemented are quite wordy. If there's a better way to clearly convey the problem/solution to users, please modify :)
1. Make adding node labels easier in
refine
Currently just to add node labels you must provide
--tree
,--output-tree
,--alignment
, and, if VCF,--vcf-reference
(because this is checked if you've provided a VCF alignment). The alignment isn't actually used - the code is there for a fake alignment to be made if only renaming nodes.... but if you haven't provided--output-node-data
(why would you when just naming nodes) then it tries to make the filename fromargs.alignment
...Node-data is necessary (it seems) even if it only contains branch lengths, because
export
looks for branch lengths in metadata read in from such files. However, I played around andexport
can quite easily be modified to look for branch lengths in the Newick if it doesn't find it in the metadata. @jameshadfield , is there any other reason we couldn't move to this?For the moment, I've modified this so that if
args.alignment
isn't present, it makes the node-data filename fromargs.tree
. So, you can run just name nodes with:augur refine --tree results/tree_raw.nwk --output-tree results/tree.nwk
Similarly,
--output-tree
is also not a required argument but looks toargs.alignment
to create the output file name, so I also modified this so that ifargs.alignment
is missing, it'll useargs.tree
(required). (so you can actually now just run:augur refine --tree results/tree_raw.nwk
)2. Added errors with likely problem and solution to
translate
a) If some/all node names are missing, advises user to check they are using the tree output by
refine
. Ifrefine
hasn't been run, gives the run command above.b) If a node in the tree can't be found in the alignment, advises user to ensure they've used the same tree as input for
ancestral
andtranslate
, or to re-runancestral
with the tree they provided here, then runtranslate
with the new alignment output.This doesn't detect situations where the order/topology of the tree has changed since running
ancestral
but all the node names are the same. (But I'm not sure we could detect this?)For Fasta-input, there is a message if nodes can't be found in the translation: "no sequence pair for nodes
node1
-node2
", but even if this message displays for every or almost every pair, it doesn't give an error and creates the output files anyway. I don't think that's necessarily obvious that something's gone wrong!I added an extra catch that if it can't find both nodes in the alignment, it breaks and gives the same error as b) above. However, this allows some nodes to be missing, unlike with VCF (where any missing cause an error). We could be more strict here with Fasta-input as well - unsure if there is a reason for it allowing lots of missing nodes?
For Fasta-input, if the root node doesn't have a match in the alignment, it won't have the full translated sequence(s) in the resulting JSON. This doesn't seem to affect how it views in
auspice
- could this cause errors elsewhere? (If so, easy to add an extra check here.)I used some return strings here to separate out the errors. I don't know if that's good Python practice - feel free to edit if there's a better way! (I wanted to be able to include the filename args in the error messages.)
3. Added warnings to
traits
andancestral
If it's detected that the tree has no internal node names, a warning is displayed that tells the user if they'd like to use
export
ortranslate
later, they should either use the tree fromrefine
as input, or runrefine
to add node names (and gives the simple command)4. Fixed and added error messages in
export
(minor)Minor fixes
I've tested all of the above with Fasta and VCF, but probably not exhaustively, so other eyes would be appreciated.