[augur ancestral] create annotation block

For a detailed write-up of the bug which motivated this commit, see nextstrain#881.

By storing the (nucleotide) genome annotation in the node-data JSON produced by `augur ancestral`, we make this information available for export. Previously it was only exported by `augur translate`, which was a problem for workflows that didn't perform translation.

No changes are needed to `augur export v2`, which may now process multiple "annotations" blocks. This is due to the behavior of `NodeData.deep_add_or_update`, which recurses into dicts in annotation blocks and overwrites non-dict values that already exist.

This poses a potential problem: two node-data JSONs which (e.g.) define different `annotations['nuc']` coordinates will not raise any error, and the output coordinates depend on the order in which the node-data JSONs were provided to `augur export v2`.

Closes nextstrain#881.
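
The order dependence described in this message can be illustrated with a minimal sketch. It assumes a recursive merge with overwrite semantics like those described for `NodeData.deep_add_or_update`; it is not augur's actual implementation. The names `merge_overwrite` and `merge_all` are hypothetical, and the `end` value of 10000 in the second block is made up for illustration (10769 is the value from the test data in this commit).

```python
import copy


def merge_overwrite(target, source):
    """Hypothetical stand-in for the merge described above (not augur's code).

    Recurse when both sides hold a dict under the same key; otherwise the
    incoming value replaces whatever was already there.
    """
    for key, value in source.items():
        if isinstance(value, dict) and isinstance(target.get(key), dict):
            merge_overwrite(target[key], value)
        else:
            # Copy so that later merges never mutate the input blocks.
            target[key] = copy.deepcopy(value)
    return target


def merge_all(blocks):
    """Merge node-data dicts in the order given; later blocks win conflicts."""
    merged = {}
    for block in blocks:
        merge_overwrite(merged, block)
    return merged


# Annotation block shaped like the one this commit adds to the ancestral output.
ancestral = {"annotations": {"nuc": {"start": 1, "end": 10769, "strand": "+"}}}
# Illustrative conflicting block; the end coordinate here is invented.
other = {"annotations": {"nuc": {"start": 1, "end": 10000, "strand": "+"}}}

first = merge_all([ancestral, other])
second = merge_all([other, ancestral])

# No error is raised in either case; the last block provided wins.
print(first["annotations"]["nuc"]["end"])
print(second["annotations"]["nuc"]["end"])
```

Under these assumptions the two orderings yield different `end` coordinates and neither raises an error, which is the silent conflict the message warns about.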
victorlin · Jun 30, 2022 · af35e77
1 parent 93dc8d1
Showing 3 changed files with 19 additions and 0 deletions.
diff --git a/augur/ancestral.py b/augur/ancestral.py
@@ -207,6 +207,8 @@ def run(args):
     if anc_seqs.get("mask") is not None:
         anc_seqs["mask"] = "".join(['1' if x else '0' for x in anc_seqs["mask"]])
 
+    anc_seqs['annotations'] = {'nuc': {'start': 1, 'end': len(anc_seqs['reference']['nuc']), 'strand': '+'}}
+
     out_name = get_json_name(args, '.'.join(args.alignment.split('.')[:-1]) + '_mutations.json')
     write_json(anc_seqs, out_name)
     print("ancestral mutations written to", out_name, file=sys.stdout)

diff --git a/tests/builds/zika/results/nt_muts.json b/tests/builds/zika/results/nt_muts.json
@@ -1,4 +1,11 @@
 {
+  "annotations": {
+    "nuc": {
+      "end": 10769,
+      "start": 1,
+      "strand": "+"
+    }
+  },
   "generated_by": {
     "program": "augur",
     "version": "7.0.2"

diff --git a/tests/functional/ancestral.t b/tests/functional/ancestral.t
@@ -15,6 +15,16 @@ The default is to infer ambiguous bases, so there should not be N bases in the i
   $ grep N "$TMP/ancestral_sequences.fasta"
   >NODE_0000000
 
+Check that the reference length was correctly exported as the nuc annotation
+  $ grep -A 6 'annotations' "$TMP/ancestral_mutations.json"
+    "annotations": {
+      "nuc": {
+        "end": 10769,
+        "start": 1,
+        "strand": "+"
+      }
+    },
+
 Infer ancestral sequences for the given tree and alignment, explicitly requesting that ambiguous bases are inferred.
 There should not be N bases in the inferred output sequences.