Annotations schema updates #1281

Merged (3 commits) on Aug 20, 2023
Changes from 1 commit
50 changes: 30 additions & 20 deletions augur/data/schema-annotations.json
@@ -1,34 +1,44 @@
{
"type" : "object",
"$schema": "http://json-schema.org/draft-06/schema#",
"title": "JSON object for the `annotations` key, typically produced by `augur translate`",
"description": "Coordinates etc of genes / genome",
"title": "Schema for the 'annotations' property (node-data JSON) or the 'genome_annotations' property (auspice JSON)",
"properties": {
"nuc": {
"type": "object",
"allOf": [{ "$ref": "#/$defs/startend" }],
"additionalProperties": true,
"$comment": "All other properties are unused by Auspice. Strand is always considered to be positive."
}
},
"required": ["nuc"],
"patternProperties": {
"^[a-zA-Z0-9*_-]+$": {
"type": "object",
"allOf": [{ "$ref": "#/$defs/startend" }],
"additionalProperties": true,
"properties": {
"seqid":{
"description": "Sequence on which the coordinates below are valid. Could be viral segment, bacterial contig, etc",
"$comment": "Unused by Auspice 2.0",
"type": "string"
},
"type": {
"description": "Type of the feature. could be mRNA, CDS, or similar",
"$comment": "Unused by Auspice 2.0",
"strand": {
Review comment (Member):

Late to the party, but shouldn't we in principle allow each CDS fragment to have its own strandedness? Here the strand is fixed for all fragments, which will be fine in most cases, but who knows: sometimes fragments might come from opposite strands.

Review comment (Member):

ChatGPT tells me that strandedness never changes within a CDS, so what we have here should work in all cases.

Review comment (Member):

With trans-splicing, fragments from opposite strands might occur, but that doesn't seem to happen in viruses, so we should be good for now: https://en.wikipedia.org/wiki/Trans-splicing

"description": "Is the gene on the positive ('+') or negative ('-') strand.",
"$comment": "Auspice assumes positive strand unless strand is '-'",
"type": "string"
},
}
}
}
},
"$defs": {
"startend": {
"type": "object",
"required": ["start", "end"],
"properties": {
"start": {
"description": "Gene start position (one-based, following GFF format)",
"type": "number"
"type": "integer",
"minimum": 1,
"description": "Start position (one-based, following GFF format)"
},
"end": {
"description": "Gene end position (one-based closed, last position of feature, following GFF format)",
"type": "number"
},
"strand": {
"description": "Positive or negative strand",
"type": "string",
"enum": ["-","+"]
Review comment (Member):

Suggestion: Add enum back with additional values.

Even though Auspice only cares if the value is '-' or not, Augur can also export values '?' and None as suggested and implemented in 31f0b26. Defining the possible values here will improve the effectiveness of validation.

I started this in #1279 but that PR can be closed and the change included in here.

Review comment (Member Author, @jameshadfield, Aug 17, 2023):

Thanks for the review -- really appreciated!

We're going to have to think through this. All annotations are interpreted by Auspice as CDSs, so a strand of "?" or "." (which we represent as None) doesn't make sense as a CDS - neither for Auspice to display nor for Augur to translate.

I can allow them in the schema and then have Auspice filter to ["+", "-"], which is probably the most technically correct, but I would think that happily translating "?" / "." features is questionable. For augur translate + GenBank annotations, only CDS features are read and (I believe) it's not possible to encode a CDS in GenBank that's not +/-. I'm not sure what we do for augur translate + GFF annotation.

Update: the Auspice PR now ignores any non-nuc annotation that is not explicitly on the + or - strand.
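
The filtering rule described in this update can be sketched in Python as follows. This is only an illustration of the stated behaviour, not Auspice's actual code (Auspice is a JavaScript application); the function name `displayable_annotations` and the gene names in the usage note are hypothetical.

```python
def displayable_annotations(annotations):
    """Keep 'nuc' plus any annotation whose strand is explicitly '+' or '-'.

    Hypothetical helper: it mirrors the rule stated in the comment above,
    under the assumption that a missing strand counts as "not explicit".
    """
    kept = {}
    for name, feature in annotations.items():
        # 'nuc' is always kept; the schema treats it as positive strand.
        if name == "nuc" or feature.get("strand") in ("+", "-"):
            kept[name] = feature
    return kept
```

For example, given annotations named `nuc`, `geneA` (strand `"+"`), `geneB` (strand `"-"`) and `geneC` (strand `"?"`), this sketch would keep the first three and drop `geneC`, as well as any annotation whose strand is `None` or absent.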

Review comment (Member Author):

Have shifted this conversation to #1279

"type": "integer",
"minimum": 2,
"description": "End position (one-based, following GFF format). This value _must_ be greater than the start."
}
}
}
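
The `startend` definition above constrains `start` and `end` individually, while the rule that `end` must be greater than `start` lives only in the description text. A minimal Python sketch of checking both might look like the following; `check_startend` is a hypothetical helper, not part of Augur, and it is slightly stricter than JSON Schema's `integer` type because it rejects floats such as `5.0`.

```python
def check_startend(feature):
    """Return a list of problems with a feature's 'start'/'end' coordinates."""
    problems = []
    # Minimums taken from the schema: start >= 1, end >= 2.
    for key, minimum in (("start", 1), ("end", 2)):
        value = feature.get(key)
        # bool is a subclass of int in Python, so exclude it explicitly.
        if not isinstance(value, int) or isinstance(value, bool):
            problems.append(f"'{key}' must be an integer")
        elif value < minimum:
            problems.append(f"'{key}' must be >= {minimum}")
    # This cross-field rule is stated in the description but not enforced
    # by the schema keywords themselves.
    if not problems and feature["end"] <= feature["start"]:
        problems.append("'end' must be greater than 'start'")
    return problems
```

A feature such as `{"start": 1, "end": 100}` yields no problems, whereas `{"start": 10, "end": 5}` satisfies both minimums yet is still reported because the end does not exceed the start.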
2 changes: 1 addition & 1 deletion tests/util_support/test_node_data_file.py
@@ -38,7 +38,7 @@ def test_validate_valid(self, build_node_data_file):
build_node_data_file(
f"""
{{
"annotations": {{ "a": {{ "start": 5 }} }},
"annotations": {{ "nuc": {{ "start": 1, "end": 100 }} }},
"generated_by": {{ "program": "augur", "version": "{__version__}" }},
"nodes": {{ "a": 5 }}
}}
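
To illustrate why the test fixture changed, here is a rough Python sketch that checks only the `required` keys from the updated schema (`nuc` at the top level, `start` and `end` on each annotation). The helper `missing_required` is hypothetical and deliberately ignores the rest of the schema, such as types, minimums and the key pattern.

```python
import json


def missing_required(node_data):
    """List required keys that are absent from a node-data 'annotations' block."""
    annotations = node_data.get("annotations", {})
    problems = []
    # The updated schema lists "nuc" as a required property.
    if "nuc" not in annotations:
        problems.append("annotations: missing 'nuc'")
    # Every annotation must carry both coordinates.
    for name, feature in annotations.items():
        for key in ("start", "end"):
            if key not in feature:
                problems.append(f"annotations.{name}: missing '{key}'")
    return problems


# The annotations from the old and new fixtures in the diff above.
old_fixture = json.loads('{"annotations": {"a": {"start": 5}}, "nodes": {"a": 5}}')
new_fixture = json.loads(
    '{"annotations": {"nuc": {"start": 1, "end": 100}}, "nodes": {"a": 5}}'
)
```

Under these assumptions, the old fixture is flagged twice (no `nuc`, and no `end` on `a`), while the new fixture passes.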