SILVA #1306

Closed
jplfaria opened this issue Dec 12, 2024 · 17 comments · Fixed by #1307
Labels
New, Prefix

Comments

@jplfaria
Contributor

jplfaria commented Dec 12, 2024

Prefix

silva

Name

SILVA rRNA database project

Homepage

https://www.arb-silva.de/

Source Code Repository

No response

Description

SILVA is a comprehensive, quality-checked, and regularly updated resource of aligned ribosomal RNA (rRNA) gene sequences for Bacteria, Archaea, and Eukaryotes. It provides a consistent taxonomic framework, commonly used in microbial ecology and diversity studies.

License

CC-BY-SA-4.0

Publications

doi:10.1093/nar/gks1219 | doi:10.1093/nar/gkt1209

Example Local Unique Identifier

11084

Regular Expression Pattern for Local Unique Identifier

^\d+$

URI Format String

https://www.arb-silva.de/search?q=$1

Wikidata Property

No response

Contributor Name

Jose P. Faria

Contributor GitHub

jplfaria

Contributor ORCiD

0000-0001-9302-7250

Contributor Email

[email protected]

Contact Name

Frank Oliver Glöckner

Contact ORCiD

0000-0001-8528-9023

Contact GitHub

frankolivergloeckner

Contact Email

[email protected]

Additional Comments

No response
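
As a quick sanity check of the fields above, here is a minimal sketch that validates a local unique identifier against the submitted pattern and expands the URI format string. The helper name `expand_uri` is hypothetical and not part of Bioregistry; it only illustrates how `$1` gets substituted.

```python
import re

# Values copied from the request above
PATTERN = re.compile(r"^\d+$")
URI_FORMAT = "https://www.arb-silva.de/search?q=$1"


def expand_uri(identifier: str) -> str:
    """Validate the identifier, then substitute it for ``$1`` (hypothetical helper)."""
    if not PATTERN.match(identifier):
        raise ValueError(f"{identifier!r} does not match {PATTERN.pattern}")
    return URI_FORMAT.replace("$1", identifier)


print(expand_uri("11084"))
```

This only checks the syntax of the identifier; whether the resulting page shows a meaningful record is a separate question.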

jplfaria added the New and Prefix labels Dec 12, 2024
@cthoyt
Member

cthoyt commented Dec 12, 2024

Thanks @jplfaria. I made a few updates, including putting @frankolivergloeckner as the contact person - it's Bioregistry policy to have a single point of contact and not a group email.

@cthoyt
Member

cthoyt commented Dec 12, 2024

@jplfaria however, it's not clear what the actual semantic space is here. The URI format doesn't seem to work with your example. Can you link me to a page that shows something for the example accession number 11084?

@cthoyt
Member

cthoyt commented Dec 12, 2024

Is it possible that the AJ001010 in https://www.arb-silva.de/browser/ssu-138.2/AJ001010 is actually the kind of accession number that SILVA creates, and 11084 was an NCBI Taxonomy ID?

@jplfaria
Contributor Author

jplfaria commented Dec 12, 2024

I was going off this folder, which I believe has the taxonomy data:
https://www.arb-silva.de/no_cache/download/archive/current/Exports/taxonomy/
Specifically, my plan was to work with the file https://www.arb-silva.de/fileadmin/silva_databases/current/Exports/taxonomy/tax_slv_ssu_138.2.txt.gz. Take a look at the contents there and let me know if that makes any sense. There are additional files with mappings, etc., but I was going to start with just that one. Also, I decided to make this ontology file only for the SSU data, since that is the most commonly used. We could, in theory, have a separate ontology file for the LSU data or merge both, which I am not sure is a good idea.

@cthoyt
Copy link
Member

cthoyt commented Dec 12, 2024

Archaea;Crenarchaeota;Nitrososphaeria;Nitrosotaleales;	42950	order		138
Archaea;Crenarchaeota;Nitrososphaeria;Nitrosotaleales;Nitrosotaleaceae;	42951	family		138
Archaea;Crenarchaeota;Nitrososphaeria;Nitrosotaleales;Nitrosotaleaceae;Candidatus Nitrosotalea;	42952	genus		138

Here are a few lines from that file. I think they correspond to https://www.arb-silva.de/browser/ssu-121/AB050229/ but I can't figure out where any of the numbers (42950, 42951, 42952) appear on the web page. Also, what does AB050229 mean?
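
To make the column layout of those lines explicit, here is a minimal parsing sketch. It assumes the export is tab-separated with five columns, as in the sample above; the field names (`path`, `identifier`, `rank`, `release`) and the treatment of the empty fourth column are assumptions for illustration, not documented SILVA names.

```python
from typing import NamedTuple


class TaxonRow(NamedTuple):
    path: tuple[str, ...]  # lineage from the domain down to this taxon
    identifier: str        # numeric ID from the second column
    rank: str              # rank label from the third column
    release: str           # last column; its meaning is assumed here


def parse_line(line: str) -> TaxonRow:
    """Split one tab-separated line of the taxonomy export."""
    path, identifier, rank, _unused, release = line.rstrip("\n").split("\t")
    # The lineage ends with ";", so drop the empty trailing element
    names = tuple(part for part in path.split(";") if part)
    return TaxonRow(names, identifier, rank, release)


SAMPLE = (
    "Archaea;Crenarchaeota;Nitrososphaeria;Nitrosotaleales;\t42950\torder\t\t138\n"
    "Archaea;Crenarchaeota;Nitrososphaeria;Nitrosotaleales;Nitrosotaleaceae;\t42951\tfamily\t\t138\n"
    "Archaea;Crenarchaeota;Nitrososphaeria;Nitrosotaleales;Nitrosotaleaceae;Candidatus Nitrosotalea;\t42952\tgenus\t\t138\n"
)

for line in SAMPLE.splitlines():
    row = parse_line(line)
    print(row.identifier, row.rank, row.path[-1])
```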

@jplfaria
Contributor Author

jplfaria commented Dec 12, 2024

We are both having trouble understanding what is going on here. Maybe I started this entry prematurely. I will spend more time trying to understand the resource so that I can represent this entry properly.

@jplfaria
Contributor Author

The IDs in the file (42950, 42951, etc.) seem to be taxonomy IDs that exist only in that file; I can't find them anywhere else. When I search for those IDs as NCBI Taxonomy IDs, the organisms are different.
Another issue I don't understand is that the taxonomy file has no taxa at the "species" level. I understand that SILVA doesn't always have all taxonomy levels, since the taxonomy may not be fully resolved with just the SSU sequence, but in the GTDB metadata file we use to build the GTDB ontology I see mappings to SILVA at the species level.
I may need to email the authors to understand what is happening.

@jplfaria
Contributor Author

jplfaria commented Jan 21, 2025

I think I now have more clarity on how to represent SILVA as a taxonomy ontology.

The prefix should be adjusted to silva_ssu (small subunit) since, according to SILVA, the large subunit (silva_lsu) taxonomy is different from the small subunit one. Fortunately, SILVA has every file in "duplicate", differing only by _ssu and _lsu in the file names, so the same code will be able to create taxonomy ontology files for both the SSU and LSU taxonomies.

Regarding the taxonomy ids, the main file is still the one we discussed above:
https://www.arb-silva.de/fileadmin/silva_databases/current/Exports/taxonomy/tax_slv_lsu_138.2.txt.gz
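
The `_ssu`/`_lsu` naming pattern can be captured in one small helper, sketched below. The function name `taxonomy_url` is hypothetical, and hard-coding the version `138.2` under the `current` directory is an assumption that would need revisiting when SILVA publishes a new release.

```python
BASE = "https://www.arb-silva.de/fileadmin/silva_databases/current/Exports/taxonomy"


def taxonomy_url(subunit: str, version: str = "138.2") -> str:
    """Build the taxonomy export URL for the given subunit (hypothetical helper)."""
    if subunit not in {"ssu", "lsu"}:
        raise ValueError(f"subunit must be 'ssu' or 'lsu', got {subunit!r}")
    return f"{BASE}/tax_slv_{subunit}_{version}.txt.gz"


for subunit in ("ssu", "lsu"):
    print(taxonomy_url(subunit))
```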

SILVA taxonomy is only assigned up to the genus level, since the taxonomy may not be fully resolved below that using only the rRNA subunit sequences.

We can then use this file to get the link to the accession as the level below genus.

While we don't have a URL linking to the IDs present in the tax_slv_lsu file, there is a permanent URL for each accession.

For example: https://www.arb-silva.de/browser/ssu/silva/CP019636, where the full taxonomy is shown:
Bacteria->Cyanobacteriota->Cyanobacteriia->Cyanobacteriales->Nostocaceae->Scytonema VB-61278->CP019636

So for the aforementioned example and using the two files mentioned above, we could get a representation like this:

<!-- Domain: Bacteria (ID: 3) -->
<owl:Class rdf:about="http://example.org/silva_ssu/3">
    <rdfs:label>Bacteria</rdfs:label>
    <obo:TAXRANK_1000000 rdf:resource="http://purl.obolibrary.org/obo/TAXRANK_0000090"/>
</owl:Class>

<!-- ... intermediary taxon ranks ... -->

<!-- Genus: Scytonema VB-61278 (ID: 60450) -->
<owl:Class rdf:about="http://example.org/silva_ssu/60450">
    <rdfs:subClassOf rdf:resource="http://example.org/silva_ssu/60382"/>
    <rdfs:label>Scytonema VB-61278</rdfs:label>
    <obo:TAXRANK_1000000 rdf:resource="http://purl.obolibrary.org/obo/TAXRANK_0000006"/>
</owl:Class>

<!-- Accession Information -->
<owl:Class rdf:about="https://www.arb-silva.de/browser/ssu/silva/CP019636">
    <rdfs:subClassOf rdf:resource="http://example.org/silva_ssu/60450"/>
    <rdfs:label>Nostocales cyanobacterium HT-58-2</rdfs:label>
</owl:Class>

Please advise

@cthoyt
Member

cthoyt commented Jan 21, 2025

Okay, so let me try to summarize what I understand. SILVA has two different semantic spaces:

Accession Numbers

SILVA has accession numbers like CP019636. This can be resolved with two different URL schemes corresponding to the "small subunit" (https://www.arb-silva.de/browser/ssu/silva/CP019636) and "large subunit" (https://www.arb-silva.de/browser/lsu/silva/CP019636) taxonomies.

However, this appears to link to ENA: https://www.ebi.ac.uk/ena/browser/view/CP019636. ENA has already been registered at https://bioregistry.io/ena.embl, and CP019636 can get resolved with https://bioregistry.io/ena.embl:CP019636
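
For reference, the URL schemes mentioned for accession numbers can be written as small formatting helpers. This is only a sketch: the function names are hypothetical, and the URL templates are taken from the links in this comment rather than from any documented API.

```python
def silva_browser_url(accession: str, subunit: str) -> str:
    """SILVA browser page for an accession under the SSU or LSU taxonomy."""
    if subunit not in {"ssu", "lsu"}:
        raise ValueError(f"subunit must be 'ssu' or 'lsu', got {subunit!r}")
    return f"https://www.arb-silva.de/browser/{subunit}/silva/{accession}"


def ena_url(accession: str) -> str:
    """ENA browser page for the same accession."""
    return f"https://www.ebi.ac.uk/ena/browser/view/{accession}"


def bioregistry_url(prefix: str, identifier: str) -> str:
    """Bioregistry resolver link for a CURIE such as ena.embl:CP019636."""
    return f"https://bioregistry.io/{prefix}:{identifier}"


accession = "CP019636"
print(silva_browser_url(accession, "ssu"))
print(silva_browser_url(accession, "lsu"))
print(ena_url(accession))
print(bioregistry_url("ena.embl", accession))
```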

@jplfaria can you confirm that all SILVA accession numbers correspond to ENA records?

SILVA Internal IDs

IDs that appear in the second column of https://www.arb-silva.de/fileadmin/silva_databases/current/Exports/taxonomy/tax_slv_lsu_138.2.txt.gz, like 2 for Archaea. These can't be resolved anywhere as far as I know. However, this isn't a strict requirement for making a prefix.

How they can be ontologized is a discussion to pick back up on the PyOBO repo, something like

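import pandas as pd
from pyobo import Reference, Term, TypeDef

url = "https://www.arb-silva.de/fileadmin/silva_databases/current/Exports/taxonomy/tax_slv_lsu_138.2.txt.gz"

PREFIX = "silva"
TAXRANK_PREFIX = "TAXRANK"
#: String appearing in SILVA to TAXRANK identifier
STRING_TO_TAXRANK: dict[str, str] = {
    "phylum": "0000001",
    # TODO add the remaining ranks that appear in the file
}

#: The rank property, matching obo:TAXRANK_1000000 in the RDF example above
TD = TypeDef.from_triple(TAXRANK_PREFIX, "1000000")


def iter_terms():
    # TODO how to get current version of SILVA?
    # The export is tab-separated and has no header row
    df = pd.read_csv(url, sep="\t", header=None, dtype=str)
    for hierarchy, identifier, taxrank, _, _unknown in df.values:
        # Lineages end with ";", so strip it before taking the last name
        name = hierarchy.rstrip(";").split(";")[-1]
        term = Term(
            reference=Reference(prefix=PREFIX, identifier=identifier, name=name),
        )
        # Ranks not yet in the mapping are left unannotated instead of raising KeyError
        if taxrank in STRING_TO_TAXRANK:
            term.annotate_object(TD, Reference(prefix=TAXRANK_PREFIX, identifier=STRING_TO_TAXRANK[taxrank]))
        yield term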

@jplfaria
Contributor Author

Accession numbers

I ran a script calling the ENA API against the list of all accession numbers present in SILVA (https://www.arb-silva.de/fileadmin/silva_databases/current/Exports/taxonomy/tax_slv_lsu_138.2.acc_taxid.gz) and verified that about 99.995% match an ENA record. Only 25 out of 458470 accessions failed to find a match.

AAHC01014171 is an example of an accession that does not match an ENA record. On its SILVA page there is a "Link to EBI/ENA", but it returns "no records found".
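
The match rate follows directly from the two counts quoted above; by this arithmetic the matched share comes out near 99.995%. A short check, using only those two numbers:

```python
total = 458470   # accessions checked against ENA
unmatched = 25   # accessions with no ENA record

matched_pct = 100 * (total - unmatched) / total
unmatched_pct = 100 * unmatched / total

print(f"matched:   {matched_pct:.4f}%")
print(f"unmatched: {unmatched_pct:.4f}%")
```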

SILVA Internal IDs

I can also confirm I can't resolve these IDs anywhere. In the file you mentioned the IDs are indeed in the second column and the tax rank is in the third column.

A little more information on these IDs I got directly from SILVA via email communication:

"Currently, these IDs are automatically generated by our internal taxonomy database (auto-increment field) and are not identical for our SSU and LSU datasets as each sub-unit has a distinct taxonomy. The IDs do not change when the sequences composition of a taxon changes as long as the name is kept. However, they always change when the name is changed, i.e. even when only a typo in the name is fixed. Additionally, they change if one of their ancestor  taxa changes."

"IDs will never be re-assigned. Actually, the old entries remain in our internal database and are only marked as deleted. Therefore, we can guarantee that the one ID will ever be assigned to exactly one taxon"

Alternatively we could use the tax rank name as the "ID" similarly to the GTDB taxonomy ontology, but my understanding was that numerical IDs were preferable when available. I might be wrong in that assumption.

@cthoyt
Member

cthoyt commented Jan 22, 2025

Do you know if entries that remain in the internal database and are marked as deleted have some kind of link to the new identifier for the same taxon?

@jplfaria
Contributor Author

I do not know, unfortunately.

I will try to get clarification and get back to you. If you have any other questions that you think would help clarify potential issues I am not foreseeing, please let me know; I can take the opportunity to ask them in the same email exchange.

@jplfaria
Contributor Author

@cthoyt this is what I heard back on your question:

"Unfortunately, the relation between the deleted and the new taxon is not tracked in the database."

@cthoyt
Member

cthoyt commented Jan 24, 2025

This doesn't follow the guidelines for identifiers being persistent as described in https://doi.org/10.1371/journal.pbio.2001414, but this leaves an opportunity for building a supporting service for mapping outdated SILVA IDs. I'd suggest the following:

  1. finish putting the identifier into Bioregistry
  2. create a rudimentary PyOBO source
  3. build a reconciliation service that can try and reconstruct mappings from deleted records back to updated records.
  4. petition SILVA to do something more standard. They have both the GCBR and the ELIXIR Core Data Resource stickers, so hopefully they care about improvements

@jgerken

jgerken commented Jan 24, 2025

> Thanks @jplfaria. I made a few updates, including putting @frankolivergloeckner as the contact person - it's Bioregistry policy to have a single point of contact and not a group email.

Please remove Frank Oliver as contact; he is no longer the PI of SILVA and left the MPI several years ago, as has SILVA. SILVA is now owned and operated by the DSMZ. The new PI is Lorenz Reimer and I am the head of SILVA. But I would ask to put [email protected] as the contact point. Our policy is that user requests must be directed to [email protected] and not to individual team members. [email protected] is also not a group email but the email address associated with our helpdesk. This ensures that the whole team is informed about user requests and that requests can be handled when a team member is unavailable. Adding Lorenz's or my contact details would only cause friction for the user, as we would tell them to resend their request to [email protected]

@jplfaria
Contributor Author

@cthoyt, your proposed plan sounds good to me. What do we need to "finish putting the identifier into Bioregistry"? For the other points you brought up:

  1. I have started by implementing the Getter function but it seems it's failing some coverage tests.

  2. I am still trying to wrap my head around it; where would we maintain the "reconciliation service" you described?

@cthoyt
Member

cthoyt commented Jan 27, 2025

Hi @jgerken, thanks for the note. The Bioregistry project's value system places greater emphasis on transparency, so if you're not comfortable with us curating an explicit contact person, we can leave this field blank for now.

@jplfaria I updated the PR for the new prefix to specifically be about the fact that this is for a taxonomy identifier, and leave a few warnings based on the discussion from this thread related to the persistence/uniqueness of IDs.

WRT bioversions, start by addressing all CI feedback, and if you get stuck, ping me there

Next time we chat, I can explain what I meant by reconciliation service. I wouldn't worry about that for now

cthoyt added a commit that referenced this issue Jan 28, 2025
Closes #1306

---------

Co-authored-by: jplfaria <[email protected]>
Co-authored-by: Charles Tapley Hoyt <[email protected]>

3 participants