Errors and warnings for trees without internal node names #283

Merged
merged 2 commits into from
Jul 10, 2019
10 changes: 10 additions & 0 deletions augur/ancestral.py
@@ -70,6 +70,16 @@ def run(args):
         print("ERROR: reading tree from %s failed."%args.tree)
         return 1

+    import numpy as np
+    missing_internal_node_names = [n.name is None for n in T.get_nonterminals()]
+    if np.all(missing_internal_node_names):
+        print("\n*** WARNING: Tree has no internal node names!")
+        print("*** Without internal node names, ancestral sequences can't be linked up to the correct node later.")
+        print("*** If you want to use 'augur export' or `augur translate` later, re-run this command with the output of 'augur refine'.")
+        print("*** If you haven't run 'augur refine', you can add node names to your tree by running:")
+        print("*** augur refine --tree %s --output-tree <filename>.nwk"%(args.tree) )
+        print("*** And use <filename>.nwk as the tree when running 'ancestral', 'translate', and 'traits'")
Member

This is great 👍

I would be strongly in favor of changing this to an ERROR and exiting at this point, thus forcing the user to generate these internal node names through augur refine, then rerunning ancestral / traits using the correct tree.

Member Author (@emmahodcroft, May 13, 2019)

I think this might raise a bigger question of how we want these individual parts to operate: more independently, or always together? Should it be OK to run just ancestral or translate to get some information, without intending to combine the outputs or pass them through export (in which case node names don't matter)? Or do we want to make people always keep the pieces compatible so that they could be put together, because they may make that decision later without remembering the warnings? I can see arguments for both sides myself, so it would be good to hear others' thoughts. (Should I ask in Slack?)

Member

I'm all for being able to use these features as independent parts, with no intention to run export. But if you run, say, augur traits using a tree without internal node names, you get a JSON output which refers to names like NODE00001 that do not exist in the tree. I think in these cases we want to exit and say "We can only infer things onto a tree with internal node names labelled, else you cannot match the results to your tree! One way to do this is to run augur refine --tree A --output-tree B"

Member Author

I'm happy to change to this. Feel free to nudge me if I haven't done it in a few days...!
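
A minimal sketch of the stricter check proposed in this thread: exit with an error as soon as any internal node is unnamed, instead of warning only when all of them are (which is what `np.all` does in the hunk). The `Node` class and `check_internal_node_names` are hypothetical stand-ins written for illustration, not augur or Bio.Phylo code; only the suggested `augur refine` command is taken from the thread.

```python
import sys
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """Hypothetical stand-in for a tree clade (not Bio.Phylo's Clade)."""
    name: Optional[str] = None
    children: List["Node"] = field(default_factory=list)


def internal_nodes(root):
    """Yield every node that has at least one child."""
    stack = [root]
    while stack:
        node = stack.pop()
        if node.children:
            yield node
            stack.extend(node.children)


def check_internal_node_names(root, tree_path):
    """Return 0 if every internal node is named, otherwise report and return 1."""
    unnamed = [n for n in internal_nodes(root) if n.name is None]
    if unnamed:
        print("ERROR: %d internal node(s) in %s have no name." % (len(unnamed), tree_path),
              file=sys.stderr)
        print("One way to add names: augur refine --tree %s --output-tree <filename>.nwk" % tree_path,
              file=sys.stderr)
        return 1
    return 0
```

Checking with "any" rather than "all" is an assumption about the desired behaviour: a tree with only some internal names would still produce results that cannot be matched back to it.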


     if any([args.alignment.lower().endswith(x) for x in ['.vcf', '.vcf.gz']]):
         if not args.vcf_reference:
             print("ERROR: a reference Fasta is required with VCF-format alignments")
4 changes: 3 additions & 1 deletion augur/export.py
@@ -53,6 +53,8 @@ def convert_tree_to_json_structure(node, metadata, div=0, nextflu_schema=False,
                 cdiv = div + metadata[child.name]['mutation_length']
             elif 'branch_length' in metadata[child.name]:
                 cdiv = div + metadata[child.name]['branch_length']
+            else:
+                print("ERROR: Cannot find branch length information for %s"%(child.name))
             node_struct["children"].append(convert_tree_to_json_structure(child, metadata, div=cdiv, nextflu_schema=nextflu_schema, strains=strains)[0])

     return (node_struct, strains)
@@ -77,7 +79,7 @@ def recursively_decorate_tree_json_nextflu_schema(node, node_metadata, decoratio
         metadata = node_metadata[node["strain"]]
         metadata["strain"] = node["strain"]
     except KeyError:
-        raise Exception("ERROR: node %s is not found in the node metadata."%n.name)
+        raise Exception("ERROR: node %s is not found in the node metadata."%node.name)

     for data in decorations:
         val = None
10 changes: 8 additions & 2 deletions augur/refine.py
@@ -128,6 +128,7 @@ def run(args):
     attributes = ['branch_length']

     # check if tree is provided an can be read
+    T = None #otherwise get 'referenced before assignment' error if reading fails
     for fmt in ["newick", "nexus"]:
         try:
             T = Phylo.read(args.tree, fmt)
@@ -165,8 +166,10 @@ def run(args):
     # if not specified, construct default output file name with suffix _tt.nwk
     if args.output_tree:
         tree_fname = args.output_tree
-    else:
+    elif args.alignment:
         tree_fname = '.'.join(args.alignment.split('.')[:-1]) + '_tt.nwk'
+    else:
+        tree_fname = '.'.join(args.tree.split('.')[:-1]) + '_tt.nwk'

     if args.timetree:
         # load meta data and covert dates to numeric
@@ -215,10 +218,13 @@ def run(args):
     import json
     tree_success = Phylo.write(T, tree_fname, 'newick', format_branch_length='%1.8f')
     print("updated tree written to",tree_fname, file=sys.stdout)
+
     if args.output_node_data:
         node_data_fname = args.output_node_data
-    else:
+    elif args.alignment:
Member

Node-data is necessary (it seems) even if it only contains branch lengths, because export looks for branch lengths in metadata read in from such files. However, I played around and export can quite easily be modified to look for branch lengths in the Newick if it doesn't find it in the metadata.

I think it's sensible for export to work without any node_data JSONs, and look in the newick file for divergence. (If divergence is provided in the node_data files then it should preferentially use that.) Certainly if I just used refine to add node names it seems unnecessarily complicated to have a node data JSON produced.
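
The lookup order suggested in this comment (node-data first, the Newick branch length as a fallback) could be sketched as below. `child_divergence` and the `tree_branch_lengths` mapping are hypothetical names chosen for illustration, not the export.py implementation; the two metadata keys are the ones the export.py hunk checks.

```python
def child_divergence(parent_div, child_name, metadata, tree_branch_lengths):
    """Cumulative divergence of a child node, preferring node-data over the tree.

    metadata: dict of node name -> node-data dict (may lack the node entirely).
    tree_branch_lengths: hypothetical dict of node name -> branch length, as
        read from the Newick file.
    """
    node_meta = metadata.get(child_name, {})
    # Same preference order as the export.py hunk: mutation_length, then branch_length.
    for key in ("mutation_length", "branch_length"):
        if key in node_meta:
            return parent_div + node_meta[key]
    # Fallback proposed in the comment: use the length stored in the tree itself.
    if child_name in tree_branch_lengths:
        return parent_div + tree_branch_lengths[child_name]
    raise KeyError("Cannot find branch length information for %s" % child_name)
```

Raising when no source has a length, rather than printing and continuing, is a design choice of this sketch; it avoids computing divergence from an undefined value.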

Member Author

Ok, I'll make a PR with this change in export, and we can take a look and ensure it seems like it's going to work as intended, then we can modify refine here so that it doesn't output this if only node-names are added.

Member Author

I played around with this in export. It's simple to get branch lengths from the tree rather than branch_lengths node-data (but preferentially from branch_lengths node-data, if available). However, to test this, I'm inputting a 'raw' tree, then providing node-data (some is required) for only aa_muts (for example).

So the deeper question is, if you export with a raw tree and some ancestral, or trait info also generated on a raw tree (so no internal node names match the tree), how should export handle this?

I played around and for V1, changed recursively_decorate_tree_json_nextflu_schema so that if node['strain'] is None (this is an internal node, so has no name on a raw tree), it doesn't error, but displays a warning saying it can't find the node, and so metadata for this node won't be written. Essentially, the only node-data traits attached to the tree are those on the tips.

Thanks to the wonderful raw tree I am using, this looks like a hot mess, but it works...:
[image: screenshot not preserved]

By 'works' I mean that you can see different AA mutations at the tips, for example. Branches always seem to be coloured by the first AA in the legend; I guess that's a default behaviour somewhere for when there's no other information for the branch? But I assume this could be changed to default to grey?

For traits, the result ends up no different from just adding the trait from the metadata.

Is this the kind of behaviour we want? Or do we not want to support this?

(I tried to test this for V2 as well, but didn't realise it required reinstalling Auspice... so I've modified export to give me V2 JSONs without error, but I haven't yet tested them. The same theory applies as above.)

Member Author

Both the V1 exports that work in 'current' auspice and the V2 export do NOT work in V2 auspice.

Member Author

So no, it doesn't work for mutations, in V1 or V2. --node-data in export that can't be matched with nodes in the tree should cause an error (a reasonable one, unlike currently).

We could in theory still support export reading branch length from the tree instead of from a node-data file, in case they ran refine just to get node-names but didn't include the branch-length node-data JSON in export, but this is a question of how much we want to coddle users I suppose. If not, we should at least try to give a reasonable error probably.
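
One way the "reasonable error" for unmatched node-data might look, as a sketch only: compare the node-data keys with the names present in the tree before decorating anything. `check_node_data_matches_tree` is a hypothetical helper, not part of augur, and the wording of the message is an assumption.

```python
def check_node_data_matches_tree(tree_node_names, node_data):
    """Raise ValueError if node-data refers to names that are not in the tree.

    tree_node_names: iterable of node names from the tree (None for unnamed nodes).
    node_data: dict of node name -> attributes, e.g. the 'nodes' part of a node-data JSON.
    """
    tree_names = {name for name in tree_node_names if name is not None}
    unmatched = sorted(set(node_data) - tree_names)
    if unmatched:
        raise ValueError(
            "%d node-data entries do not match any node in the tree (e.g. %s). "
            "Was the node-data generated from this tree, such as the output of 'augur refine'?"
            % (len(unmatched), ", ".join(unmatched[:5]))
        )
```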

         node_data_fname = '.'.join(args.alignment.split('.')[:-1]) + '.node_data.json'
+    else:
+        node_data_fname = '.'.join(args.tree.split('.')[:-1]) + '.node_data.json'

     write_json(node_data, node_data_fname)
     print("node attributes written to",node_data_fname, file=sys.stdout)
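
The default-name logic in the refine.py hunks (explicit output first, then a name derived from the alignment, then one derived from the tree) can be sketched with `os.path.splitext`. The helper names below are hypothetical and this is not the code in the PR. One difference worth noting: for a path with no dot, the `'.'.join(path.split('.')[:-1])` idiom yields an empty stem, while `splitext` keeps the whole name.

```python
import os


def default_output_name(source_path, suffix):
    """Drop the final extension of source_path, if any, and append suffix."""
    stem, _ext = os.path.splitext(source_path)
    return stem + suffix


def choose_tree_output(output_tree, alignment, tree):
    """Mirror the if/elif/else order of the hunk: explicit name, alignment, tree."""
    if output_tree:
        return output_tree
    return default_output_name(alignment or tree, "_tt.nwk")
```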
11 changes: 11 additions & 0 deletions augur/traits.py
@@ -165,6 +165,17 @@ def run(args):
     tree_fname = args.tree
     traits, columns = read_metadata(args.metadata)

+    from Bio import Phylo
+    T = Phylo.read(tree_fname, 'newick')
+    missing_internal_node_names = [n.name is None for n in T.get_nonterminals()]
+    if np.all(missing_internal_node_names):
+        print("\n*** WARNING: Tree has no internal node names!")
+        print("*** Without internal node names, ancestral traits can't be linked up to the correct node later.")
+        print("*** If you want to use 'augur export' later, re-run this command with the output of 'augur refine'.")
+        print("*** If you haven't run 'augur refine', you can add node names to your tree by running:")
+        print("*** augur refine --tree %s --output-tree <filename>.nwk"%(tree_fname) )
+        print("*** And use <filename>.nwk as the tree when running 'ancestral', 'translate', and 'traits'")
+
     mugration_states = defaultdict(dict)
     for column in args.columns:
         T, gtr, alphabet = mugration_inference(tree=tree_fname, seq_meta=traits,
96 changes: 70 additions & 26 deletions augur/translate.py
@@ -8,6 +8,12 @@
 from .utils import read_node_data, load_features, write_json, write_VCF_translation
 from treetime.vcf_utils import read_vcf

+class MissingNodeError(Exception):
+    pass
+
+class MismatchNodeError(Exception):
+    pass
+
 def safe_translate(sequence, report_exceptions=False):
     """Returns an amino acid translation of the given nucleotide sequence accounting
     for gaps in the given sequence.
@@ -181,7 +187,15 @@ def assign_aa_vcf(tree, translations):
     #get mutations on the root
     root = tree.root
     aa_muts[root.name]={"aa_muts":{}}
+    #If has no root node name, exit with error
+    if root.name is None:
+        print("\n*** Can't find node name for the tree root!")
+        raise MissingNodeError()
+
     for fname, prot in translations.items():
+        if root.name not in prot['sequences']:
+            print("\n*** Can't find %s in the alignment provided!"%(root.name))
+            raise MismatchNodeError()
         root_muts = prot['sequences'][root.name]
         tmp = []
         for pos in prot['positions']:
@@ -193,9 +207,15 @@ def assign_aa_vcf(tree, translations):
         for c in n:
             aa_muts[c.name]={"aa_muts":{}}
         for fname, prot in translations.items():
+            if n.name not in prot['sequences']:
+                print("\n*** Can't find %s in the alignment provided!"%(n.name))
+                raise MismatchNodeError()
             n_muts = prot['sequences'][n.name]
             for c in n:
                 tmp = []
+                if c.name is None:
+                    print("\n*** Internal node missing a node name!")
+                    raise MissingNodeError()
                 c_muts = prot['sequences'][c.name]
                 for pos in prot['positions']:
                     #if pos in both, check if same
@@ -211,6 +231,39 @@ def assign_aa_vcf(tree, translations):

     return aa_muts

+def assign_aa_fasta(tree, translations):
+    aa_muts = {}
+
+    #fasta input shouldn't have mutations on root, so give empty entry
+    root = tree.get_nonterminals()[0]
+    aa_muts[root.name]={"aa_muts":{}}
+
+    for n in tree.get_nonterminals():
+        if n.name is None:
+            print("\n*** Tree is missing node names!")
+            raise MissingNodeError()
+        for c in n:
+            aa_muts[c.name]={"aa_muts":{}}
+        for fname, aln in translations.items():
+            for c in n:
+                if c.name in aln and n.name in aln:
+                    tmp = [construct_mut(a, int(pos+1), d) for pos, (a,d) in
+                            enumerate(zip(aln[n.name], aln[c.name])) if a!=d]
+                    aa_muts[c.name]["aa_muts"][fname] = tmp
+                elif c.name not in aln and n.name not in aln:
+                    print("\n*** Can't find %s OR %s in the alignment provided!"%(c.name, n.name))
+                    raise MismatchNodeError()
+                else:
+                    print("no sequence pair for nodes %s-%s"%(c.name, n.name))
+
+        if n==tree.root:
+            aa_muts[n.name]={"aa_muts":{}, "aa_sequences":{}}
+            for fname, aln in translations.items():
+                if n.name in aln:
+                    aa_muts[n.name]["aa_sequences"][fname] = "".join(aln[n.name])
+
+    return aa_muts
+
 def get_genes_from_file(fname):
     genes = []
     if os.path.isfile(fname):
@@ -314,32 +367,23 @@ def run(args):
                             'strand': 1}

     ## determine amino acid mutations for each node
-    if is_vcf:
-        aa_muts = assign_aa_vcf(tree, translations)
-    else:
-        aa_muts = {}
-
-        #fasta input shouldn't have mutations on root, so give empty entry
-        root = tree.get_nonterminals()[0]
-        aa_muts[root.name]={"aa_muts":{}}
-
-        for n in tree.get_nonterminals():
-            for c in n:
-                aa_muts[c.name]={"aa_muts":{}}
-            for fname, aln in translations.items():
-                for c in n:
-                    if c.name in aln and n.name in aln:
-                        tmp = [construct_mut(a, int(pos+1), d) for pos, (a,d) in
-                                enumerate(zip(aln[n.name], aln[c.name])) if a!=d]
-                        aa_muts[c.name]["aa_muts"][fname] = tmp
-                    else:
-                        print("no sequence pair for nodes %s-%s"%(c.name, n.name))
-            if n==tree.root:
-                aa_muts[n.name]={"aa_muts":{}, "aa_sequences":{}}
-                for fname, aln in translations.items():
-                    if n.name in aln:
-                        aa_muts[n.name]["aa_sequences"][fname] = "".join(aln[n.name])
-
+    try:
+        if is_vcf:
+            aa_muts = assign_aa_vcf(tree, translations)
+        else:
+            aa_muts = assign_aa_fasta(tree, translations)
+    except MissingNodeError as err:
+        print("\n*** ERROR: Some/all nodes have no node names!")
+        print("*** Please check you are providing the tree output by 'augur refine'.")
+        print("*** If you haven't run 'augur refine', please add node names to your tree by running:")
+        print("*** augur refine --tree %s --output-tree <filename>.nwk"%(args.tree) )
+        print("*** And use <filename>.nwk as the tree when running 'ancestral', 'translate', and 'traits'")
+        return 1
+    except MismatchNodeError as err:
+        print("\n*** ERROR: Mismatch between node names in %s and in %s"%(args.tree, args.ancestral_sequences))
+        print("*** Ensure you are using the same tree you used to run 'ancestral' as input here.")
+        print("*** Or, re-run 'ancestral' using %s, then use the new %s as input here."%(args.tree, args.ancestral_sequences))
+        return 1
Member

👍
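
The pattern in this hunk (helpers raise a dedicated exception, `run` turns it into a message and a non-zero return code) can be shown in a self-contained form. This is a simplified sketch, not the translate.py code: `mutations_per_node`, its arguments, and the mutation string format are hypothetical, and unlike the hunk it carries the detail in the exception message instead of printing before raising.

```python
import sys


class MissingNodeError(Exception):
    pass


class MismatchNodeError(Exception):
    pass


def mutations_per_node(parent_child_pairs, sequences):
    """Hypothetical helper: list differences between each parent and child sequence."""
    muts = {}
    for parent, child in parent_child_pairs:
        if parent is None or child is None:
            raise MissingNodeError("the tree has a node without a name")
        for name in (parent, child):
            if name not in sequences:
                raise MismatchNodeError("no sequence found for node %s" % name)
        muts[child] = ["%s%d%s" % (a, pos + 1, d)
                       for pos, (a, d) in enumerate(zip(sequences[parent], sequences[child]))
                       if a != d]
    return muts


def run(parent_child_pairs, sequences):
    """Return 0 on success and 1 on a node-name problem, as a CLI entry point would."""
    try:
        muts = mutations_per_node(parent_child_pairs, sequences)
    except MissingNodeError as err:
        print("ERROR: %s; add names with 'augur refine'." % err, file=sys.stderr)
        return 1
    except MismatchNodeError as err:
        print("ERROR: %s; tree and sequences do not match." % err, file=sys.stderr)
        return 1
    print(muts)
    return 0
```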


     write_json({'annotations':annotations, 'nodes':aa_muts}, args.output)
     print("amino acid mutations written to",args.output, file=sys.stdout)