feat: use Auspice JSON as a dataset #1455
I had to derive a bunch of Eq and PartialEq traits to satisfy parent type requirements
This allows passing a path to an Auspice JSON v2 file to the `--input-dataset` CLI argument. In this case we attempt to read not only the tree, but also the reference sequence, genome annotation and pathogen properties from that file, rather than from a conventional dataset.
This allows using an Auspice JSON as a Nextclade dataset in the web app.
Let's add an explicit `Accept` HTTP header when fetching Auspice JSON. This is required for nextstrain.org links to work - the server sends different content depending on `Accept` header.
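A minimal sketch of what sending an explicit `Accept` header could look like, written in Python for illustration only (Nextclade itself is not written in Python). The function name `build_auspice_request` and the example URL are hypothetical, and the media type for the main dataset JSON is an assumption that does not come from this thread.

```python
from urllib.request import Request

# Assumption: media type for the main Auspice dataset JSON. It is not stated
# in this thread; check the nextstrain.org API documentation before relying on it.
MAIN_DATASET_MEDIA_TYPE = "application/vnd.nextstrain.dataset.main+json"


def build_auspice_request(url: str, media_type: str = MAIN_DATASET_MEDIA_TYPE) -> Request:
    """Build (but do not send) a GET request that states the wanted content type.

    Without an explicit Accept header the server may answer the same URL
    with different content, for example an HTML page instead of JSON.
    """
    return Request(url, headers={"Accept": media_type}, method="GET")


# Hypothetical dataset URL, used only to show the call.
request = build_auspice_request("https://nextstrain.org/some/dataset")
print(request.get_header("Accept"))
```

The request can then be passed to `urllib.request.urlopen` to actually fetch the JSON.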
We decided to merge it to master. We found a tree which crashes Nextclade Web, but it seems unrelated to this feature. Rather, the tree is too deep and we overflow the call stack in a recursive call when converting the internal representation into tree JSON for output. We will try to remove the recursion and see how it goes. The dataset trees are rather small, but this is not generally true for the Auspice trees out there. In the meantime, excessively big trees should be avoided. We would also need to follow up with CLI and Web docs: for args and URL parameters, as well as some basic explanations. The feature will only be used internally for some time, though, and we might first need to figure out from experience how exactly to document it.
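To illustrate the idea of removing recursion, here is a small Python sketch that converts a tree into a JSON-like structure with an explicit stack, so the depth of the tree no longer consumes call-stack frames. The `Node` type and `to_json_iterative` function are hypothetical stand-ins, not Nextclade's internal representation or its actual fix.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """Hypothetical internal tree node (not Nextclade's real type)."""
    name: str
    children: List["Node"] = field(default_factory=list)


def to_json_iterative(root: Node) -> dict:
    """Convert a tree to nested dicts without recursive calls.

    Each stack entry pairs an input node with the output dict that
    represents it, so children can be attached as they are visited.
    """
    out_root = {"name": root.name, "children": []}
    stack = [(root, out_root)]
    while stack:
        node, out = stack.pop()
        for child in node.children:
            out_child = {"name": child.name, "children": []}
            out["children"].append(out_child)  # preserves child order
            stack.append((child, out_child))
    return out_root
```

The work list lives on the heap, so a very deep tree costs memory rather than stack depth. Note that serializing the result with a recursive encoder can still hit the same limit.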
Awesome! I'll work on adding a link-out within auspice-on-nextstrain.org to kick the dataset over to nextclade (if there's a root-sequence and if the root-sequence is actually the root sequence and not a reference, although that's hard to tell)
Any back-of-the-envelope numbers here which I could use to disable the link / add a warning?
Nope. I just wanted to link the known problematic tree here, but the problem is no longer reproducible. Here is the link: I think the tree might have changed at that address. But I have a copy of the old tree that crashes Nextclade: Too big to upload to GitHub as plaintext (71 MB against a 25 MB limit), so sadly no Nextclade link is possible to try it. P.S. Interesting that the new tree is half the size: 34 MB. P.P.S. I formatted both trees with prettier for readability and comparison.
P.P.P.S you don't have to use the charon API - because you (Nextclade) use the appropriate And you can make use of the snapshot/version identifier to access past trees, which should expose the broken tree. Something like this: |
I'm just catching up on activity here (cool!) so if any of this is not applicable any more, my apologies!
This seems like a huge hazard, yeah? Especially once this feature is documented or there's greater awareness/usage of it outside ourselves. (As soon as we start linking to Nextclade from Auspice, I'm sure folks will notice and try it themselves.) Instead of Nextclade accepting any Auspice JSON and requiring this implicit assertion to hold true without having any way to verify it, could we take a different approach? For example, we could have a flag in the JSON (either in
Is there a plan to support fetching the root sequence from the sidecar file, e.g. with If this won't be supported, it seems to me that the decision to inline the root sequence in the main JSON or relegate it to a sidecar will no longer be determined by Auspice performance and genome size but instead by "do folks want this Nextstrain dataset to work with Nextclade?" And the answer to that will always hew towards "yes, of course", so every root sequence will end up inlined. If we're ok with that (???), can we get ahead of it and ditch the sidecar entirely then? (or rather, recommend always inlining going forward and change Augur's default) |
I have confused myself in the past with the differences here, and our language in augur is itself confusing / inconsistent, largely because this subtlety was added to accommodate nextclade dataset construction in the first place. So I was thinking of starting by writing up good docs here (along the lines of these ones) - there's already commentary on this, but it's scattered across slack / issues. Totally support the aim here which is to not jump into nextclade if the dataset's not valid (but still "works") |
Follow-up of #1455. If `.root_sequence` is not available in the Auspice JSON, let's attempt to fetch the reference sequence from the sidecar Auspice JSON. To do that, let's GET the same URL, but with the `Accept: application/vnd.nextstrain.dataset.root-sequence+json` header.
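The fallback described here could be sketched as follows, in Python for illustration. The function `resolve_root_sequence` and the injected `fetch_json` callable are hypothetical. The sketch also assumes the sidecar JSON has the same shape as the inline `.root_sequence` object (a `nuc` key at the top level), which this thread does not confirm.

```python
from typing import Callable, Optional

# Media type quoted in this thread for the sidecar root-sequence JSON.
ROOT_SEQUENCE_MEDIA_TYPE = "application/vnd.nextstrain.dataset.root-sequence+json"

# Hypothetical fetcher: takes (url, accept_header), returns parsed JSON or None.
FetchJson = Callable[[str, str], Optional[dict]]


def resolve_root_sequence(main_json: dict, url: str, fetch_json: FetchJson) -> Optional[str]:
    """Return the nucleotide root sequence, preferring the inline copy."""
    inline = main_json.get("root_sequence") or {}
    if inline.get("nuc") is not None:
        return inline["nuc"]
    # Inline sequence missing: GET the same URL with a different Accept header.
    sidecar = fetch_json(url, ROOT_SEQUENCE_MEDIA_TYPE)
    if sidecar is None:
        return None
    # Assumption: the sidecar mirrors the inline object, i.e. {"nuc": "..."}.
    return sidecar.get("nuc")
```

Passing the fetcher as a parameter keeps the sketch independent of any HTTP library and makes the fallback easy to exercise with a fake.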
@tsibley @jameshadfield my attempt to add sidecar JSON is here: #1460 |
Regarding the "yes, I understand what I am doing" flag - it could be useful. This rule is universal for all trees, not just trees which happen to also be full datasets. For official datasets this flag can be added easily. I'd ask our scientists what they think. |
Nextclade does check for consistency of the provided root sequence with the mutations in the tree (we didn't initially, but we do in v3). If there is a mutation on the tree like I don't think there is a reason not to inline the root sequence for most of the viruses. The root sequence is 100x smaller than the rest of the tree. For bacteria this is a different consideration. |
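As an illustration of what a consistency check between a root sequence and tree mutations can look like, here is a Python sketch. It is not Nextclade's actual implementation. It assumes Auspice JSON v2 nodes that carry nucleotide mutations as strings such as `A123T` under `branch_attrs.mutations.nuc`, with 1-based positions, and the function name is hypothetical.

```python
import re
from typing import List, Tuple

# Assumed mutation notation: <from-state><1-based position><to-state>, e.g. "A123T".
MUTATION_RE = re.compile(r"^([A-Za-z-])(\d+)([A-Za-z-])$")


def find_inconsistent_mutations(tree: dict, root_seq: str) -> List[Tuple[str, str]]:
    """Return (node name, mutation) pairs whose from-state contradicts the
    sequence state inherited from the root along that node's path."""
    problems = []
    stack = [(tree, {})]  # node plus positions already changed on its path
    while stack:
        node, inherited = stack.pop()
        state = dict(inherited)  # copy so siblings do not share changes
        muts = node.get("branch_attrs", {}).get("mutations", {}).get("nuc", [])
        for mut in muts:
            match = MUTATION_RE.match(mut)
            if match is None:
                problems.append((node.get("name", ""), mut))
                continue
            ref, pos, alt = match.group(1), int(match.group(2)), match.group(3)
            in_range = 1 <= pos <= len(root_seq)
            current = state.get(pos, root_seq[pos - 1] if in_range else None)
            if current != ref:
                problems.append((node.get("name", ""), mut))
            state[pos] = alt
        for child in node.get("children", []):
            stack.append((child, state))
    return problems
```

An empty result means every mutation starts from the state implied by the root sequence and the mutations above it.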
Nod. Size-wise, sure, but I think the other reasons boiled down to an Auspice load-time optimization. @jameshadfield probably has the most historical context here on hand without further digging. |
A couple of UI things I noted. External datasets like this are tagged "community", e.g. and while I get why this is, it feels potentially pretty confusing to users. In particular because a) this is an official nextstrain.org core dataset and b) nextstrain.org has its own meaning for "community dataset". Should we rethink the UI here a little? Mark https://nextstrain.org core datasets as "official"? Or with some other label than "community"? (These are unification/integration pains, which feel hard, but also worth it.) For me, the "Suggest automatically" feature of Nextclade was enabled by default (though it seems like a sticky preference?). With it enabled:
This seems confusing? |
Following up on the conversation around root-sequences vs reference sequence. All usages of For nextclade usage from a nextstrain dataset JSON (i.e. this PR) this means any dataset produced by an augur pipeline will be internally consistent. For nextclade usage from a nextclade dataset, there's two paths to inconsistency I can see:
The nextclade docs do provide instructions to avoid these mistakes. ¹ See this issue for behaviour of |
Nextclade does not read When pasting pathogen info into Auspice JSON (which is also optional), I suggest to not put the |
Yup, that quote was from the section on "nextclade usage from a nextclade dataset" (i.e. not an auspice JSON dataset) -- in that case I think you should be comparing the files reference sequence against the auspice JSON root sequence, as this would guard against a misconfigured augur pipeline which produced the dataset.
Agreed |
Following conversation in #1455 (comment)

Let's add a warning if the reference sequence provided in any of the possible ways (FASTA in dataset files, CLI argument, Web URL param, or Web "Customization" interface) does not exactly match (as in string comparison) the `.root_sequence.nuc` in Auspice JSON.

The warning message is the following. Please suggest improvements (paste a full quote into a reply message or feel free to modify it in the code).

<details>
<summary>Click to expand</summary>

> Nextclade detected that reference sequence provided does not exactly match reference (root) sequence in Auspice JSON.
>
> This could be due to one of the reasons:
>
> - Nextclade dataset author provided reference sequence and reference tree that are incompatible
> - The reference tree has been constructed incorrectly
> - The reference sequence provided using `--input-ref` CLI argument is not compatible with the reference tree in the dataset
> - The reference tree provided using `--input-tree` CLI argument is not compatible with the reference sequence in the dataset
> - The reference sequence provided using `&input-ref` parameter in Nextclade Web URL is not compatible with the reference tree in the dataset
> - The reference tree provided using `&input-tree` parameter in Nextclade Web URL is not compatible with the reference sequence in the dataset
>
> This warning signals that there is a potential for failures if the mismatch is not intended.

</details>
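The comparison itself is simple. A Python sketch of it, with a hypothetical function name and only the first sentence of the message, could be:

```python
from typing import Optional

# First sentence of the warning text proposed in this thread.
MISMATCH_WARNING = (
    "Nextclade detected that reference sequence provided does not exactly "
    "match reference (root) sequence in Auspice JSON."
)


def reference_mismatch_warning(provided_ref: str, auspice_json: dict) -> Optional[str]:
    """Return the warning text when the sequences differ, otherwise None.

    The comparison is a plain string comparison: case, gaps and length
    all matter, and no alignment is attempted.
    """
    root_nuc = (auspice_json.get("root_sequence") or {}).get("nuc")
    if root_nuc is None:
        return None  # nothing to compare against
    return None if provided_ref == root_nuc else MISMATCH_WARNING
```

Returning the message instead of printing it leaves it to the caller (CLI or Web) to decide how to show the warning.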
Most discussion about this functionality has been happening within the nextclade repo, see <nextstrain/nextclade#1455> and <nextstrain/nextclade#1460> for a good summary.
This allows using an Auspice JSON v2 file as the input dataset. In this case we attempt to read not only the tree, but also the reference sequence, genome annotation and pathogen properties from that file, rather than from a conventional dataset.
Work items
- `--input-dataset` CLI argument.
- `?dataset-json-url` URL param. Note that the URL passed as an argument to the URL param might need to be percent-encoded (urlencoded).
- `read-annotation` command. This allows visualizing the genome annotation from Auspice JSON the way Nextclade sees it. Might be useful for debugging.

Requirements
- `.root_sequence.nuc` field. The `clade_membership` node attribute was also required at first; that requirement has been lifted (#1457) and `clade_membership` is no longer required.

Data sources within Auspice JSON
- Reference sequence: `.root_sequence.nuc`
- Genome annotation: `.meta.genome_annotations`
- Pathogen info: `.meta.extensions.nextclade.pathogen`

Notes:
- Reference sequence. If `.root_sequence.nuc` is missing and the reference sequence is not provided otherwise (e.g. using individual args/params), then an error is thrown. Auspice JSON does not contain the name of the reference in `.root_sequence.nuc`. When writing the reference sequence to outputs (with `--include-reference` in CLI and always in Web), the name is taken from pathogen info at `.meta.extensions.nextclade.pathogen.attributes["reference name"]` if present. Otherwise a hardcoded value "reference" is used.
- Genome annotation. Genome annotation in the Auspice format (Gff annotations augur#354, Entropy panel mk2 auspice#1684). If `.meta.genome_annotations` is missing, similarly to when `genome_annotation.gff3` is missing, an empty annotation will be used, meaning translation, aa mutation calling and anything else related to amino acids will not run.
- Pathogen info. Object of the same format as `pathogen.json`. Just paste the contents of `pathogen.json` into a new field: `.meta.extensions.nextclade.pathogen`. If pathogen info is missing, then QC and other features configurable in `pathogen.json` will be disabled. There will be no pretty dataset name and ref sequence name.

Examples
- From nextstrain.org: https://nextclade-git-feat-ref-and-ann-from-tree-json-nextstrain.vercel.app?dataset-json-url=https://nextstrain.org/charon/getDataset?prefix=ncov/gisaid/global/all-time (does not contain the required `.root_sequence.nuc`. Please suggest a JSON that has it)
- ✔️ SC2 from GitHub (the tree originally contains `.root_sequence.nuc`, as well as `.meta.genome_annotations`, and I added `.meta.extensions.nextclade.pathogen` for pretty dataset name display): https://nextclade-git-feat-ref-and-ann-from-tree-json-nextstrain.vercel.app?dataset-json-url=https://github.com/nextstrain/nextclade_data/blob/feat/ref-and-ann-from-tree-json/data/nextstrain/sars-cov-2/wuhan-hu-1/orfs/tree.json
- 💥 crashing H5 from GitHub: https://nextclade-git-feat-ref-and-ann-from-tree-json-nextstrain.vercel.app/?dataset-json-url=https://github.com/rneher/h5_cattle_genome/blob/master/tree.json
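On the percent-encoding note for `?dataset-json-url`: a dataset URL that carries its own query string (like the charon URL with `?prefix=...`) is safer to embed when encoded. A small Python sketch, where the helper name and the base URL in the usage line are hypothetical:

```python
from urllib.parse import urlencode


def nextclade_web_link(base_url: str, dataset_json_url: str) -> str:
    """Build a Nextclade Web link with a percent-encoded dataset-json-url."""
    # urlencode escapes ':', '/', '?', '=' and '&' inside the parameter
    # value, so the nested URL cannot be mistaken for extra outer parameters.
    query = urlencode({"dataset-json-url": dataset_json_url})
    return f"{base_url}?{query}"


# Hypothetical base URL; the dataset URL is the charon example from above.
link = nextclade_web_link(
    "https://nextclade.example",
    "https://nextstrain.org/charon/getDataset?prefix=ncov/gisaid/global/all-time",
)
print(link)
```

The receiving side decodes the value back to the original URL when it parses the query string.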
Future work
- `.meta.extensions.nextclade.pathogen.files` or another similar field could contain URLs to (1) data which cannot be in JSON (e.g. example sequences, readme, changelog), or (2) separate reference sequence, annotation and `pathogen.json` files, if for some reason you decide that the Auspice fields do not suit you.
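To tie together the data sources and fallbacks listed in this description, here is a Python sketch of reading the three parts from a parsed Auspice JSON. The function name, the returned dict layout and the `ref_override` parameter are hypothetical, and the rule that an explicitly provided reference takes precedence over `.root_sequence.nuc` is an assumption, not something this description states.

```python
from typing import Optional


def read_dataset_parts(auspice: dict, ref_override: Optional[str] = None) -> dict:
    """Collect reference sequence, annotation and pathogen info from Auspice JSON."""
    # Assumption: an explicitly provided reference wins over the inline one.
    ref_seq = ref_override or (auspice.get("root_sequence") or {}).get("nuc")
    if ref_seq is None:
        raise ValueError("no .root_sequence.nuc and no reference sequence provided")

    meta = auspice.get("meta") or {}
    # Missing annotation falls back to an empty one (amino acid features are off).
    annotation = meta.get("genome_annotations") or {}
    # Missing pathogen info stays None (QC and related features are off).
    pathogen = ((meta.get("extensions") or {}).get("nextclade") or {}).get("pathogen")

    # Reference name comes from pathogen attributes, else the fixed "reference".
    attributes = (pathogen or {}).get("attributes") or {}
    ref_name = attributes.get("reference name", "reference")

    return {
        "ref_seq": ref_seq,
        "ref_name": ref_name,
        "annotation": annotation,
        "pathogen": pathogen,
    }
```

Only the missing reference sequence is treated as an error; the other two parts degrade to defaults, matching the notes above.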