Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Errors and warnings for trees without internal node names #283

Merged
merged 2 commits into from
Jul 10, 2019

Conversation

emmahodcroft
Copy link
Member

@emmahodcroft emmahodcroft commented May 9, 2019

These are changes designed to detect cases where refine hasn't been run in order to give internal node names, and other functions that could or do rely on node-names are called. Some give warnings, others give errors and fail.

The warnings/errors I implemented are quite wordy. If there's a better way to clearly convey the problem/solution to users, please modify :)

1. Make adding node labels easier in refine

Currently just to add node labels you must provide --tree, --output-tree, --alignment, and, if VCF, --vcf-reference (because this is checked if you've provided a VCF alignment). The alignment isn't actually used - the code is there for a fake alignment to be made if only renaming nodes.... but if you haven't provided --output-node-data (why would you when just naming nodes) then it tries to make the filename from args.alignment...

Node-data is necessary (it seems) even if it only contains branch lengths, because export looks for branch lengths in metadata read in from such files. However, I played around and export can quite easily be modified to look for branch lengths in the Newick if it doesn't find it in the metadata. @jameshadfield , is there any other reason we couldn't move to this?

For the moment, I've modified this so that if args.alignment isn't present, it makes the node-data filename from args.tree. So, you can run just name nodes with:
augur refine --tree results/tree_raw.nwk --output-tree results/tree.nwk

Similarly, --output-tree is also not a required argument but looks to args.alignment to create the output file name, so I also modified this so that if args.alignment is missing, it'll use args.tree (required). (so you can actually now just run:
augur refine --tree results/tree_raw.nwk )

2. Added errors with likely problem and solution to translate

a) If some/all node names are missing, advises user to check they are using the tree output by refine. If refine hasn't been run, gives the run command above.

b) If a node in the tree can't be found in the alignment, advises user to ensure they've used the same tree as input for ancestral and translate, or to re-run ancestral with the tree they provided here, then run translate with the new alignment output.

This doesn't detect situations where the order/topology of the tree has changed since running ancestral but all the node names are the same. (But I'm not sure we could detect this?)

For Fasta-input, there is a message if nodes can't be found in the translation: "no sequence pair for nodes node1-node2", but even if this message displays for every or almost every pair, it doesn't give an error and creates the output files anyway. I don't think that's necessarily obvious that something's gone wrong!
I added an extra catch that if it can't find both nodes in the alignment, it breaks and gives the same error as b) above. However, this allows some nodes to be missing, unlike with VCF (where any missing cause an error). We could be more strict here with Fasta-input as well - unsure if there is a reason for it allowing lots of missing nodes?

For Fasta-input, if the root node doesn't have a match in the alignment, it won't have the full translated sequence(s) in the resulting JSON. This doesn't seem to affect how it views in auspice - could this cause errors elsewhere? (If so, easy to add an extra check here.)

I used some return strings here to separate out the errors. I don't know if that's good Python practice - feel free to edit if there's a better way! (I wanted to be able to include the filename args in the error messages.)

3. Added warnings to traits and ancestral

If it's detected that the tree has no internal node names, a warning is displayed that tells the user if they'd like to use export or translate later, they should either use the tree from refine as input, or run refine to add node names (and gives the simple command)

4. Fixed and added error messages in export (minor)
Minor fixes

I've tested all of the above with Fasta and VCF, but probably not exhaustively, so other eyes would be appreciated.

@emmahodcroft emmahodcroft requested a review from jameshadfield May 9, 2019 15:50
Copy link
Member

@jameshadfield jameshadfield left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@emmahodcroft this is great -- thanks for doing this 🎉

My comments (below) are generally aimed at forcing users to add node names via augur refine and nowhere else. I think this makes things less error-prone (e.g. it avoids the subtle bug where the wrong tree was provided to augur traits and your warning message ignored). Naming things in refine also means that it's only ever done once -- whether you infer temporal phylogeny or not.

This doesn't detect situations where the order/topology of the tree has changed since running ancestral but all the node names are the same. (But I'm not sure we could detect this?)

This would be avoided if we only ever added internal nodes via refine and all downstream commands checked for these.

augur refine --tree results/tree_raw.nwk --output-tree results/tree.nwk

This is much much much nicer 👍
Perhaps we could detect this case (i.e. no alignment provided) and print some output indicating that refine is only adding internal node labels.

For Fasta-input, if the root node doesn't have a match in the alignment, it won't have the full translated sequence(s) in the resulting JSON. This doesn't seem to affect how it views in auspice - could this cause errors elsewhere? (If so, easy to add an extra check here.)

Hmm not sure I understand this. How can we infer mutations on the root node?

print("*** If you want to use 'augur export' or `augur translate` later, re-run this command with the output of 'augur refine'.")
print("*** If you haven't run 'augur refine', you can add node names to your tree by running:")
print("*** augur refine --tree %s --output-tree <filename>.nwk"%(args.tree) )
print("*** And use <filename>.nwk as the tree when running 'ancestral', 'translate', and 'traits'")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is great 👍

I would be strongly in favor of changing this to an ERROR and exiting at this point, thus forcing the user to generate these internal node names through augur refine, then rerunning ancestral / traits using the correct tree.

Copy link
Member Author

@emmahodcroft emmahodcroft May 13, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this might raise a bigger question of how we want these individual parts to operate - more independently or always together? Should it be ok just to run ancestral or translate to get some information, without the intention of putting it together or through export (in which case node names don't matter), or do we want to make people always keep the pieces so that they could be put together, because they may make that decision later without remembering the warnings? I can think for both sides, myself - it would be good to hear thoughts of others. (Should I ask in Slack?)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm all for being able to use these features as independent parts, with no intention to run export. But if one runs augur traits (e.g.) using a tree without internal node names, then you get an output (json) which refers to things like NODE00001 which do not exist in the tree. I think in these cases we want to exit and say "We can only infer things onto a tree with internal node names labelled, else you cannot match the results to your tree! One way to do this is to run augur refine --tree A --output-tree B"

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm happy to change to this. Feel free to nudge me if I haven't done it in a few days...!

print("\n*** ERROR: Mismatch between node names in %s and in %s"%(args.tree, args.ancestral_sequences))
print("*** Ensure you are using the same tree you used to run 'ancestral' as input here.")
print("*** Or, re-run 'ancestral' using %s, then use the new %s as input here."%(args.tree, args.ancestral_sequences))
return 1
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

if args.output_node_data:
node_data_fname = args.output_node_data
else:
elif args.alignment:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Node-data is necessary (it seems) even if it only contains branch lengths, because export looks for branch lengths in metadata read in from such files. However, I played around and export can quite easily be modified to look for branch lengths in the Newick if it doesn't find it in the metadata.

I think it's sensible for export to work without any node_data JSONs, and look in the newick file for divergence. (If divergence is provided in the node_data files then it should preferentially use that.) Certainly if I just used refine to add node names it seems unnecessarily complicated to have a node data JSON produced.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I'll make a PR with this change in export, and we can take a look and ensure it seems like it's going to work as intended, then we can modify refine here so that it doesn't output this if only node-names are added.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I played around with this in export. It's simple to get branch lengths from the tree rather than branch_lengths node-data (but preferentially from branch_lengths node-data, if available). However, to test this, I'm inputting a 'raw' tree, then providing node-data (some is required) for only aa_muts (for example).

So the deeper question is, if you export with a raw tree and some ancestral, or trait info also generated on a raw tree (so no internal node names match the tree), how should export handle this?

I played around and for V1, changed recursively_decorate_tree_json_nextflu_schema so that if node['strain'] is None (this is an internal node, so has no name on a raw tree), it doesn't error, but displays a warning saying it can't find the node, and so metadata for this node won't be written. Essentially, the only node-data traits attached to the tree are those on the tips.

Thanks to the wonderful raw tree I am using, this looks like a hot mess, but it works...:
image

By 'works' I mean that you can see different AA mutations at the tips, for example. Branches always seem to be just coloured by the first AA in the legend, I guess that's a default behaviour somewhere for if there's no other information for the branch? But assume this could be changed to default to grey?

For traits there ends up being no difference between just adding the trait from the metadata.

Is this the kind of behaviour we want? Or do we not want to support this?

(I tried to test this for V2 as well, but didn't realise it required reinstalling Auspice... so I've modified export to give me V2 JSONS without error, but I haven't yet tested them. Same theory applies as above.)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Both the V1 exports that work in 'current' auspice and the V2 export do NOT work in V2 auspice.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So no, it doesn't work for mutations, in V1 or V2. --node-data in export that can't be matched with nodes in the tree should cause an error (a reasonable one, unlike currently).

We could in theory still support export reading branch length from the tree instead of from a node-data file, in case they ran refine just to get node-names but didn't include the branch-length node-data JSON in export, but this is a question of how much we want to coddle users I suppose. If not, we should at least try to give a reasonable error probably.

augur/translate.py Outdated Show resolved Hide resolved
@emmahodcroft
Copy link
Member Author

For Fasta-input, if the root node doesn't have a match in the alignment, it won't have the full translated sequence(s) in the resulting JSON. This doesn't seem to affect how it views in auspice - could this cause errors elsewhere? (If so, easy to add an extra check here.)

I'm less familiar with Fasta-input, and I've already forgotten a lot over the weekend - I'll look again. But from what I remember, normally the whole translated sequence of the root node is put in aa_muts, whereas on other nodes it's just the mutations. If the root node in the tree doesn't have a corresponding sequence in translate, then it won't have a sequence written out. From my trials on Friday, it seems this didn't actually impact how things ended up looking in auspice - I don't think we actually export that translated sequence at the moment, an as mutations are already inferred, it's not needed? (But I could be wrong.) However, from other discussions we've had it seems we may assume this is present in future (to allow auspice to show the base at a constant site, for example), so perhaps better to check for this now...

Alternatively we could make the Fasta-input as strict as the VCF-input, where even one missing internal node will cause the whole thing to error, which would intrinsically solve the problem.

@emmahodcroft
Copy link
Member Author

I changed to using Exceptions as suggested in translate.

Remaining issues:

  • Should translate always output a full translated sequence on the root node for Fasta-input? Should Fasta-input be as strict about missing nodes as VCF-input? (Which would solve the first issue.)
  • How independent do we want augur commands to be? Should users be able to run translate without internal node names if they wish (just give a warning)? Or should they be forced to generate node names to ensure future stuff works predictably (give an error)?
  • What do we want to require in export? If it can't find internal node info, is it ok to leave it out? (see comment above)

@rneher rneher merged commit fddb5b3 into master Jul 10, 2019
@rneher rneher deleted the better_node_naming branch July 10, 2019 12:43
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants