Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: use Auspice JSON as a dataset #1455

Merged
merged 20 commits into from
May 29, 2024

Conversation

ivan-aksamentov
Copy link
Member

@ivan-aksamentov ivan-aksamentov commented May 16, 2024

Slack threads:

This allows to use Auspice JSON v2 as input dataset. In this case we attempt to read not only tree, but also ref sequence, genome annotation and pathogen properties from that file, rather than from a conventional dataset.

Work items

  • parse genome annotation, reference and pathogen info from Auspice JSON in CLI, when passing Auspice JSON filepath to --input-dataset
  • parse genome annotation, reference and pathogen info from Auspice JSON in Web, when passing URL to Auspice JSON file into ?dataset-json-url URL param. Note that the URL passed as an argument to the URL param might need to be percent-encoded (urlencoded).
  • accept Auspice JSON in read-annotation command. This allows to visualize genome annotation from Auspice JSON the way Nextclade sees it. Might be useful for debugging.

Requirements

  • JSONs must contain ref nuc sequence .root_sequence.nuc field and clade_membership node attributes. The clade_membership requirement will be lifted in the near future #1457. (clade_membership is no longer required)
  • Scientifically, root node of the tree should either correspond to reference sequence or to contain all mutations between reference sequence and sequence corresponding to the root node. This is ensured by authors of Nextclade datasets, but this is not generally true for Auspice JSONs. Nextclade has no possibility to verify this correspondence. If this assumption is violated, it will produce incorrect results.

Data sources within Auspice JSON

required? what path in Auspice JSON Notes
yes reference sequence .root_sequence.nuc 1, 2
no genome annotation .meta.genome_annotations 3, 4
no pathogen info .meta.extensions.nextclade.pathogen 5
no examples ???
no readme ???
no changelog ???

Notes:

  1. If .root_sequence.nuc is missing and reference sequence is not provided otherwise (e.g. using individual args/params), then an error is thrown.

  2. Auspice JSON does not contain name of the reference in .root_sequence.nuc. When writing reference sequence to outputs (with --include-reference in CLI and always in Web), the name is taken from pathogen info at .meta.extensions.nextclade.pathogen.attributes["reference name"] if present. Otherwise a hardcoded value "reference" is used.

  3. Genome annotation in the Auspice format (Gff annotations augur#354, Entropy panel mk2 auspice#1684)

  4. If .meta.genome_annotations is missing, similarly to when genome_annotation.gff3 is missing, an empty annotation will be used, meaning translation, aa mutation calling and anything else related to amino acids will not run.

  5. Object of the same format as pathogen.json. Just paste contents of pathogen.json into a new field: .meta.extensions.nextclade.pathogen. If pathogen info is missing, then QC and other features configurable in pathogen.json will be disabled. There will be no pretty dataset name and ref sequence name.

Examples

Future work

  • .meta.extensions.nextclade.pathogen.files or another similar field could contain URLs to the (1) data which cannot be in JSON (e.g. example sequences, readme, changelog) or (2) you could have a separate ref sequence, annotation and pathogen.json files if for some reason you decide that Auspice fields do not suite you.

I had to derive a bunch of Eq and PartialEq traits to satisfy parent type requirements
This allows to pass a path to Auspice JSON v2 to `--input-dataset` CLI argument. In this case we attempt to read not only tree, but also ref sequence, genome annotation and pathogen properties from that file, rather than from a conventional dataset.
Copy link

vercel bot commented May 16, 2024

The latest updates on your projects. Learn more about Vercel for Git ↗︎

Name Status Preview Updated (UTC)
nextclade ✅ Ready (Inspect) Visit Preview May 24, 2024 2:50pm

Let's add an explicit `Accept` HTTP header when fetching Auspice JSON. This is required for nextstrain.org links to work - the server sends different content depending on `Accept` header.
@ivan-aksamentov
Copy link
Member Author

ivan-aksamentov commented May 29, 2024

We decided to merge it to master.

We found a tree which crashes Nextclade Web. But it seems unrelated to this feature. Rather, the tree is too deep and we overflow the call stack in a recursive call when converting internal representation into tree json for output. We will try to remove recursion and see how it goes.

The dataset trees are rather small, but this is not generally true for the auspice trees out there. In the meantime, the excessively big trees should be avoided.

We would also need to followup with CLI and Web docs - for args and URL parameters, as well as some basic explanations. Though the feature will only be used internally for some time and we might need to figure out how exactly to document it first, from experience.

@jameshadfield
Copy link
Member

We decided to merge it to master.

Awesome! I'll work on adding a link-out within auspice-on-nextstrain.org to kick the dataset over to nextclade (if there's a root-sequence and if the root-sequence is actually the root sequence and not a reference, although that's hard to tell)

The dataset trees are rather small, but this is not generally true for the auspice trees out there. In the meantime, the excessively big trees should be avoided.

Any back-of-the envelope numbers here which I could use to disable the link / add a warning?

@ivan-aksamentov
Copy link
Member Author

ivan-aksamentov commented May 29, 2024

@jameshadfield

Any back-of-the envelope numbers here which I could use to disable the link / add a warning?

Nope, and I just wanted to link the known problematic tree here, but the problem is no longer reproducible. Here is the link:

https://nextclade-git-feat-ref-and-ann-from-tree-json-nextstrain.vercel.app/?dataset-json-url=https://nextstrain.org/charon/getDataset?prefix=rsv/a/F

I think the tree might have changed at that address. But I have a copy of the old tree that crashes Nextclade:

tree.json.gz

Too big to upload to GitHub plaintext (71 MB out of 25 MB), so sadly no Nextclade link possible to try it.

P.S. Interesting that the new tree is twice as small: 34 MB.

tree_new.json.gz

P.P.S. I formatted both trees with prettier for readability and comparison.

@jameshadfield
Copy link
Member

P.P.P.S you don't have to use the charon API - because you (Nextclade) use the appropriate Accept headers you can hit the dataset address directly:

https://nextclade-git-feat-ref-and-ann-from-tree-json-nextstrain.vercel.app/?dataset-json-url=https://nextstrain.org/rsv/a/F

And you can make use of the snapshot/version identifier to access past trees, which should expose the broken tree. Something like this:

https://nextclade-git-feat-ref-and-ann-from-tree-json-nextstrain.vercel.app/?dataset-json-url=https://nextstrain.org/rsv/a/F@2024-05-15

@tsibley
Copy link
Member

tsibley commented May 29, 2024

I'm just catching up on activity here (cool!) so if any of this is not applicable any more, my apologies!

Scientifically, root node of the tree should either correspond to reference sequence or to contain all mutations between reference sequence and sequence corresponding to the root node. This is ensured by authors of Nextclade datasets, but this is not generally true for Auspice JSONs. Nextclade has no possibility to verify this correspondence. If this assumption is violated, it will produce incorrect results.

This seems like a huge hazard, yeah? Especially once this feature is documented or there's greater awareness/usage of it outside ourselves. (As soon as we start linking to Nextclade from Auspice, I'm sure folks will notice and try it themselves.)

Instead of Nextclade accepting any Auspice JSON and requiring this implicit assertion to hold true without having any way to verify it, could we take a different approach? For example, we could have a flag in the JSON (either in .meta somewhere or dangling off .root_sequence somewhere) that allows a Nextstrain dataset to opt-into Nextclade compatibility, e.g. declares "yes, my .root_sequence.nuc was correctly generated for use with Nextclade". This would require co-operation from Augur, but that's very doable. I can see still having a way to force/override this flag (e.g. via a query param/CLI option) for older Nextstrain datasets without it, but it should be obvious that you must know what you're doing.

JSONs must contain ref nuc sequence .root_sequence.nuc field

Is there a plan to support fetching the root sequence from the sidecar file, e.g. with Accept: application/vnd.nextstrain.dataset.root-sequence+json? If this support existed, for example, then pointing Nextclade at https://nextstrain.org/ncov/gisaid/global/all-time would work.

If this won't be supported, it seems to me that the decision to inline the root sequence in the main JSON or relegate it to a sidecar will no longer be determined by Auspice performance and genome size but instead by "do folks want this Nextstrain dataset to work with Nextclade?" And the answer to that will always hew towards "yes, of course", so every root sequence will end up inlined. If we're ok with that (???), can we get ahead of it and ditch the sidecar entirely then? (or rather, recommend always inlining going forward and change Augur's default)

@jameshadfield
Copy link
Member

This would require co-operation from Augur, but that's very doable.

I have confused myself in the past with the differences here, and our language in augur is itself confusing / inconsistent, largely because this subtlety was added to accommodate nextclade dataset construction in the first place. So I was thinking of starting by writing up good docs here (along the lines of these ones) - there's already commentary on this, but it's scattered across slack / issues.

Totally support the aim here which is to not jump into nextclade if the dataset's not valid (but still "works")

ivan-aksamentov added a commit that referenced this pull request May 30, 2024
Followup of #1455

If `.root_sequence` is not available on Auspice JSON, let's attempt to fetch ref sequence from sidecar Auspice JSON. For that let's GET from the same URL, but with `Accept: application/vnd.nextstrain.dataset.root-sequence+json` header.
@ivan-aksamentov
Copy link
Member Author

@tsibley @jameshadfield my attempt to add sidecar JSON is here: #1460

@ivan-aksamentov
Copy link
Member Author

ivan-aksamentov commented May 30, 2024

Regarding the "yes, I understand what I am doing" flag - it could be useful. This rule is universal for all trees, not just trees which happen to also be full datasets. For official datasets this flag can be added easily. I'd ask our scientists what they think.

@rneher
Copy link
Member

rneher commented May 30, 2024

Nextclade does check for consistency of the provided root sequence with the mutations in the tree (we didn't initially, but we do in v3). If there is a mutation on the tree like A321G and the root sequence is not A at this position, it will error. But it can't do this consistency check for positions when there is no mutation anywhere on the tree.

I don't think there is a reason not to inline the root sequence for most of the viruses. The root sequence is 100x smaller than the rest of the tree. For bacteria this is a different consideration.

@tsibley
Copy link
Member

tsibley commented May 30, 2024

I don't think there is a reason not to inline the root sequence for most of the viruses. The root sequence is 100x smaller than the rest of the tree. For bacteria this is a different consideration.

Nod. Size-wise, sure, but I think the other reasons boiled down to an Auspice load-time optimization. @jameshadfield probably has the most historical context here on hand without further digging.

@tsibley
Copy link
Member

tsibley commented May 30, 2024

A couple of UI things I noted.

External datasets like this are tagged "community", e.g.

image

and while I get why this is, it feels potentially pretty confusing to users. In particular because a) this is an official nextstrain.org core dataset and b) nextstrain.org has its own meaning for "community dataset". Should we rethink the UI here a little? Mark https://nextstrain.org core datasets as "official"? Or with some other label than "community"? (These are unification/integration pains, which feel hard, but also worth it.)

For me, the "Suggest automatically" feature of Nextclade was enabled by default (though it seems like a sticky preference?). With it enabled:

  1. Load a Nextstrain dataset via the URL, e.g. https://nextclade-git-feat-fetch-auspice-sidecar-json-nextstrain.vercel.app/?dataset-json-url=https://nextstrain.org/ncov/gisaid/global/all-time
  2. Load a file of sequences
  3. Notice suggestions are made for an official dataset, but the one manually loaded via the URL remains selected.
  4. Run the analysis.
  5. Flip back to "Start".
  6. Notice that the dataset has been switched out for one of the suggested ones.

This seems confusing?

@ivan-aksamentov
Copy link
Member Author

ivan-aksamentov commented May 30, 2024

@tsibley The second one sounds like a serious bug. I made a new issue: #1462

And for first one as well: #1463

Thanks for feedback!

@jameshadfield
Copy link
Member

Following up on the conversation around root-sequences vs reference sequence.

All usages of augur ancestral, augur translate (and exporting those node-data JSONs via augur export) will result in root-sequence data that correctly corresponds to (the parent of) the tree's root node. The language used within augur itself is a little inconsistent, and this issue and, separately, this PR aim to address this. For nextstrain usage I think all augur-produced datasets will be internally consistent. (I fixed a bunch of bugs here a few months back with my work on VCF inputs.)

For nextclade usage from a nextstrain dataset JSON (i.e. this PR) this means any dataset produced by an augur pipeline will be internally consistent.

For nextclade usage from a nextclade dataset, there's two paths to inconsistency I can see:

  1. the nextclade dataset's files.reference doesn't correspond to the reference sequence handed to augur ancestral --root-sequence¹. Nextclade should (and perhaps does?) assert that <AuspiceJSON>.reference.nuc is the same as the sequence in files.reference
  2. augur ancestral is run without --root-sequence¹. Here the same assertion as (1) should be enough. The additional consistency checking of the ancestral state of mutations vs the reference is technically not needed but can't hurt.

The nextclade docs do provide instructions to avoid these mistakes.

¹ See this issue for behaviour of augur ancestral under different argument combinations.

@ivan-aksamentov
Copy link
Member Author

ivan-aksamentov commented Jun 4, 2024

Nextclade should (and perhaps does?) assert that .reference.nuc is the same as the sequence in files.reference

Nextclade does not read .files from Auspice JSON. It's unclear what to do with them anyways. The single files are only used with directory-based datasets and zips.

When pasting pathogen info into Auspice JSON (which is also optional), I suggest to not put the .files object there to avoid confusion.

@jameshadfield
Copy link
Member

Nextclade does not read .files from Auspice JSON. It's unclear what to do with them anyways.

Yup, that quote was from the section on "nextclade usage from a nextclade dataset" (i.e. not an auspice JSON dataset) -- in that case I think you should be comparing the files reference sequence against the auspice JSON root sequence, as this would guard against a misconfigured augur pipeline which produced the dataset.

When pasting pathogen info into Auspice JSON (which is also optional), I suggest to not put the .files object there to avoid confusion.

Agreed

ivan-aksamentov added a commit that referenced this pull request Jun 4, 2024
Following conversation in #1455 (comment)

Let's add a warning if the reference sequence provided any of the possible ways (fasta in dataset files, through CLI argument, Web URL param, or Web "Customization" interface) does not exactly match (as in string comparison) the `.root_sequence.nuc` in Auspice JSON.

The warning message is the following. Please suggest improvements (paste a full quote into reply message or feel free to modify in the code).

<details>
<summary>Click to expand</summary>

> Nextclade detected that reference sequence provided does not exactly match reference (root) sequence in Auspice JSON.
>
> This could be due to one of the reasons:
>
> - Nextclade dataset author provided reference sequence and reference tree that are incompatible
> - The reference tree has been constructed incorrectly
> - The reference sequence provided using `--input-ref` CLI argument is not compatible with the reference tree in the dataset
> - The reference tree provided using `--input-tree` CLI argument is not compatible with the reference sequence in the dataset
> - The reference sequence provided using `&input-ref` parameter in Nextclade Web URL is not compatible with the reference tree in the dataset
> - The reference tree provided using `&input-tree` parameter in Nextclade Web URL is not compatible with the reference sequence in the dataset
>
> This warning signals that there is a potential for failures if the mismatch is not intended.

</details>
jameshadfield added a commit to nextstrain/auspice that referenced this pull request Jun 5, 2024
Most discussion about this functionality has been happening within the
nextclade repo, see <nextstrain/nextclade#1455>
and <nextstrain/nextclade#1460> for a good
summary.
jameshadfield added a commit to nextstrain/auspice that referenced this pull request Jun 5, 2024
Most discussion about this functionality has been happening within the
nextclade repo, see <nextstrain/nextclade#1455>
and <nextstrain/nextclade#1460> for a good
summary.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants