SILVA #1306

jplfaria · 2024-12-12T06:15:37Z

Prefix

silva

Name

SILVA rRNA database project

Homepage

https://www.arb-silva.de/

Source Code Repository

No response

Description

SILVA is a comprehensive, quality-checked, and regularly updated resource of aligned ribosomal RNA (rRNA) gene sequences for Bacteria, Archaea, and Eukaryotes. It provides a consistent taxonomic framework, commonly used in microbial ecology and diversity studies.

License

CC-BY-SA-4.0

Publications

doi:10.1093/nar/gks1219 | doi:10.1093/nar/gkt1209

Example Local Unique Identifier

11084

Regular Expression Pattern for Local Unique Identifier

^\d+$

URI Format String

https://www.arb-silva.de/search?q=$1

Wikidata Property

No response

Contributor Name

Jose P. Faria

Contributor GitHub

jplfaria

Contributor ORCiD

0000-0001-9302-7250

Contributor Email

[email protected]

Contact Name

Frank Oliver Glöckner

Contact ORCiD

0000-0001-8528-9023

Contact GitHub

frankolivergloeckner

Contact Email

[email protected]

Additional Comments

No response

cthoyt · 2024-12-12T09:41:44Z

Thanks @jplfaria. I made a few updates, including putting @frankolivergloeckner as the contact person - it's Bioregistry policy to have a single point of contact and not a group email.

cthoyt · 2024-12-12T09:43:28Z

@jplfaria however, it's not clear what the actual semantic space is here. The URI format doesn't seem to work when using your example. Can you link me to a page that shows something for the example accession number 11084?
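For reference, a minimal sketch of how the submitted pattern and URI format string combine for the example identifier. The helper name `expand` is made up for illustration, and this is not necessarily how the Bioregistry does the expansion internally.

```python
import re

# Values copied from the request form above
PATTERN = re.compile(r"^\d+$")
URI_FORMAT = "https://www.arb-silva.de/search?q=$1"


def expand(identifier: str) -> str:
    """Check a local unique identifier against the pattern, then substitute it for $1."""
    if not PATTERN.match(identifier):
        raise ValueError(f"{identifier!r} does not match {PATTERN.pattern}")
    return URI_FORMAT.replace("$1", identifier)


print(expand("11084"))
```

The expansion itself is mechanical; whether the resulting search URL shows anything useful for 11084 is the open question.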

cthoyt · 2024-12-12T12:17:17Z

Is it possible that AJ001010, as in https://www.arb-silva.de/browser/ssu-138.2/AJ001010, is actually the kind of accession number that SILVA uses, and 11084 was an NCBI Taxonomy ID?

jplfaria · 2024-12-12T16:57:16Z

I was going off this folder, which I believe has the taxonomy data:
https://www.arb-silva.de/no_cache/download/archive/current/Exports/taxonomy/
Specifically, my plan was to work with the file https://www.arb-silva.de/fileadmin/silva_databases/current/Exports/taxonomy/tax_slv_ssu_138.2.txt.gz. Take a look at the contents there and let me know if that makes any sense. There are additional files with mappings, etc., but I was going to start with just that one. Also, I decided to make this ontology file only for the SSU data since that is the most commonly used. We could, in theory, have a separate ontology file for the LSU data or merge both, though I am not sure merging is a good idea.

cthoyt · 2024-12-12T17:01:51Z

Archaea;Crenarchaeota;Nitrososphaeria;Nitrosotaleales;	42950	order		138
Archaea;Crenarchaeota;Nitrososphaeria;Nitrosotaleales;Nitrosotaleaceae;	42951	family		138
Archaea;Crenarchaeota;Nitrososphaeria;Nitrosotaleales;Nitrosotaleaceae;Candidatus Nitrosotalea;	42952	genus		138

here are a few lines in that file, I think it corresponds to https://www.arb-silva.de/browser/ssu-121/AB050229/ but I can't figure out where any of the numbers (42950, 42951, 42952) are on the web page. Or, what does AB050229 mean?
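To make the layout of those lines explicit, here is a small sketch that parses them, assuming the file is tab-separated with five columns. Only the first three columns (path, ID, rank) are identified in this thread; the names `remark` and `release` for the last two are guesses.

```python
import csv
import io

# The three lines quoted above, with explicit tab characters
SAMPLE = (
    "Archaea;Crenarchaeota;Nitrososphaeria;Nitrosotaleales;\t42950\torder\t\t138\n"
    "Archaea;Crenarchaeota;Nitrososphaeria;Nitrosotaleales;Nitrosotaleaceae;\t42951\tfamily\t\t138\n"
    "Archaea;Crenarchaeota;Nitrososphaeria;Nitrosotaleales;Nitrosotaleaceae;Candidatus Nitrosotalea;\t42952\tgenus\t\t138\n"
)


def parse(text: str) -> list[dict]:
    """Split each row into its columns and break the path into lineage and name."""
    rows = []
    reader = csv.reader(io.StringIO(text), delimiter="\t", quoting=csv.QUOTE_NONE)
    for path, identifier, rank, remark, release in reader:
        # Paths end with ";", so drop it before splitting
        names = path.rstrip(";").split(";")
        rows.append(
            {
                "id": identifier,
                "name": names[-1],
                "rank": rank,
                "lineage": names[:-1],
                "remark": remark,  # guessed column name
                "release": release,  # guessed column name
            }
        )
    return rows


for row in parse(SAMPLE):
    print(row["id"], row["rank"], row["name"])
```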

jplfaria · 2024-12-12T18:12:05Z

We are both having issues understanding what is going on here. Maybe I started this entry prematurely. I will spend more time trying to understand the resource so that I can represent this entry properly.

jplfaria · 2024-12-13T06:51:37Z

The IDs in the file (42950, 42951, etc.) seem to be taxonomy-specific unique IDs that only exist in that file; I can't find them anywhere else. When I search for those IDs as NCBI Taxonomy IDs, the organisms are different.
Another issue I don't understand is that the taxonomy file has no taxa at the "species" level. I understand that SILVA doesn't always have all taxonomy levels, since the taxonomy may not be fully resolved with just the SSU sequence, but in the GTDB metadata file we use to build the GTDB ontology I see mappings to SILVA at the species level.
I may need to email the authors to understand what is happening.

jplfaria · 2025-01-21T18:36:54Z

I think I have now more clarity on how to represent silva as a taxonomy ontology.

The prefix should be adjusted to silva_ssu (small subunit) since, according to SILVA, the large subunit (silva_lsu) taxonomy is different from the small subunit one. Fortunately, SILVA provides every file in "duplicate", differing only by _ssu and _lsu in the file names, so the same code will be able to create taxonomy ontology files for both the SSU and LSU taxonomies.

Regarding the taxonomy ids, the main file is still the one we discussed above:
https://www.arb-silva.de/fileadmin/silva_databases/current/Exports/taxonomy/tax_slv_lsu_138.2.txt.gz

SILVA taxonomy is only assigned up to the genus level, since the rRNA subunit sequences alone do not fully resolve the taxonomy below that.

We can then use this file to get the link to the accession as the level below genus.

While we don't have a URL linking to the IDs present in the tax_slv_lsu file, there is a permanent URL for each accession.

For example: https://www.arb-silva.de/browser/ssu/silva/CP019636, where the full taxonomy is shown:
Bacteria->Cyanobacteriota->Cyanobacteriia->Cyanobacteriales->Nostocaceae->Scytonema VB-61278->CP019636

So for the aforementioned example and using the two files mentioned above, we could get a representation like this:

<!-- Domain: Bacteria (ID: 3) -->
<owl:Class rdf:about="http://example.org/silva_ssu/3">
    <rdfs:label>Bacteria</rdfs:label>
    <obo:TAXRANK_1000000 rdf:resource="http://purl.obolibrary.org/obo/TAXRANK_0000090"/>
</owl:Class>

intermediary taxon ranks

<!-- Genus: Scytonema VB-61278 (ID: 60450) -->
<owl:Class rdf:about="http://example.org/silva_ssu/60450">
    <rdfs:subClassOf rdf:resource="http://example.org/silva_ssu/60382"/>
    <rdfs:label>Scytonema VB-61278</rdfs:label>
    <obo:TAXRANK_1000000 rdf:resource="http://purl.obolibrary.org/obo/TAXRANK_0000006"/>
</owl:Class>

<!-- Accession Information -->
<owl:Class rdf:about="https://www.arb-silva.de/browser/ssu/silva/CP019636">
    <rdfs:subClassOf rdf:resource="http://example.org/silva_ssu/60450"/>
    <rdfs:label>Nostocales cyanobacterium HT-58-2</rdfs:label>
</owl:Class>
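The rdfs:subClassOf links could be derived from the paths in the taxonomy file. A minimal sketch, assuming each taxon's parent is simply its path with the last element removed; the function name `parent_ids` is illustrative, and the rows are the ones quoted earlier in this thread.

```python
# (path, ID) pairs taken from the lines quoted earlier in this thread
ROWS = [
    ("Archaea;Crenarchaeota;Nitrososphaeria;Nitrosotaleales;", "42950"),
    ("Archaea;Crenarchaeota;Nitrososphaeria;Nitrosotaleales;Nitrosotaleaceae;", "42951"),
    (
        "Archaea;Crenarchaeota;Nitrososphaeria;Nitrosotaleales;Nitrosotaleaceae;Candidatus Nitrosotalea;",
        "42952",
    ),
]


def parent_ids(rows):
    """Map each taxon ID to its parent's ID, or None if the parent path is not in `rows`."""
    id_by_path = {path: identifier for path, identifier in rows}
    parents = {}
    for path, identifier in rows:
        parts = path.rstrip(";").split(";")
        # Rebuild the parent path in the same "a;b;c;" form used by the file
        parent_path = ";".join(parts[:-1]) + ";" if len(parts) > 1 else None
        parents[identifier] = id_by_path.get(parent_path)
    return parents


print(parent_ids(ROWS))
```

With the full file loaded, only the domain-level taxa should end up without a parent.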

Please advise

cthoyt · 2025-01-21T22:30:33Z

Okay, so let me try to summarize what I understand. SILVA has two different semantic spaces:

Accession Numbers

SILVA has accession numbers like CP019636. This can be resolved with two different URL schemes corresponding to the "small subunit" (https://www.arb-silva.de/browser/ssu/silva/CP019636) and "large subunit" (https://www.arb-silva.de/browser/lsu/silva/CP019636) taxonomies.

However, this appears to link to ENA: https://www.ebi.ac.uk/ena/browser/view/CP019636. ENA has already been registered at https://bioregistry.io/ena.embl, and CP019636 can get resolved with https://bioregistry.io/ena.embl:CP019636
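To keep the URL schemes straight, here is a short sketch that builds all four links for one accession. The templates are copied from the URLs above; the function name `accession_links` is just for illustration.

```python
# URL templates copied from the links in this comment
TEMPLATES = {
    "silva_ssu": "https://www.arb-silva.de/browser/ssu/silva/{}",
    "silva_lsu": "https://www.arb-silva.de/browser/lsu/silva/{}",
    "ena": "https://www.ebi.ac.uk/ena/browser/view/{}",
    "bioregistry": "https://bioregistry.io/ena.embl:{}",
}


def accession_links(accession: str) -> dict[str, str]:
    """Fill the accession into every known URL template."""
    return {name: template.format(accession) for name, template in TEMPLATES.items()}


for name, link in accession_links("CP019636").items():
    print(name, link)
```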

@jplfaria can you confirm that all SILVA accession numbers correspond to ENA records?

SILVA Internal IDs

IDs that appear in the second column of https://www.arb-silva.de/fileadmin/silva_databases/current/Exports/taxonomy/tax_slv_lsu_138.2.txt.gz, like 2 for Archaea. These can't be resolved anywhere as far as I know. However, this isn't a strict requirement for making a prefix.

How they can be ontologized is a discussion to pick back up on the PyOBO repo, something like

from pyobo import Reference, Term, TypeDef

import pandas as pd

url = "https://www.arb-silva.de/fileadmin/silva_databases/current/Exports/taxonomy/tax_slv_lsu_138.2.txt.gz"

PREFIX = "silva"
TAXRANK_PREFIX = "TAXRANK"
#: String appearing in SILVA to TAXRANK identifier
STRING_TO_TAXRANK: dict[str, str] = {
    "phylum": "0000001",
    # TODO add the remaining ranks that appear in the SILVA file
}

TD = TypeDef.from_triple(TAXRANK_PREFIX, "0000000")


def iter_terms():
    # TODO how to get current version of SILVA?
    # The export is tab-separated with no header row; keep empty cells as ""
    df = pd.read_csv(url, sep="\t", header=None, dtype=str, keep_default_na=False)
    for hierarchy, identifier, taxrank, _, _unknown in df.values:
        # Paths end with ";", so strip it before taking the last element
        name = hierarchy.rstrip(";").split(";")[-1]
        term = Term(
            reference=Reference(prefix=PREFIX, identifier=identifier, name=name),
        )
        # Skip the rank annotation for ranks that are not mapped yet
        if taxrank in STRING_TO_TAXRANK:
            term.annotate_object(TD, Reference(prefix=TAXRANK_PREFIX, identifier=STRING_TO_TAXRANK[taxrank]))
        yield term

jplfaria · 2025-01-22T14:04:53Z

Accession numbers

I ran a script calling the ENA API against the list of all accession numbers present in SILVA (https://www.arb-silva.de/fileadmin/silva_databases/current/Exports/taxonomy/tax_slv_lsu_138.2.acc_taxid.gz) and verified that about 99.995% match an ENA record. Only 25 out of 458470 accessions failed to find a match.

AAHC01014171 is an example of an accession that does not match an ENA record. On its SILVA page there is a "Link to EBI/ENA", but it returns "no records found".
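The shape of that check, and the match rate implied by the counts above, can be sketched as follows. This is not the script I ran: `has_record` is a hypothetical callable standing in for the ENA lookup, so the example runs without network access.

```python
from typing import Callable, Iterable


def unmatched(accessions: Iterable[str], has_record: Callable[[str], bool]) -> list[str]:
    """Return the accessions for which the lookup finds no record."""
    return [accession for accession in accessions if not has_record(accession)]


def match_rate(total: int, failures: int) -> float:
    """Percentage of accessions that were matched."""
    return 100 * (total - failures) / total


# Stand-in lookup for illustration only: pretends that just this accession is missing
def fake_lookup(accession: str) -> bool:
    return accession != "AAHC01014171"


print(unmatched(["CP019636", "AAHC01014171"], fake_lookup))
print(f"{match_rate(458470, 25):.4f}%")
```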

SILVA Internal IDs

I can also confirm I can't resolve these IDs anywhere. In the file you mentioned the IDs are indeed in the second column and the tax rank is in the third column.

A little more information on these IDs I got directly from SILVA via email communication:

"Currently, these IDs are automatically generated by our internal taxonomy database (auto-increment field) and are not identical for our SSU and LSU datasets as each sub-unit has a distinct taxonomy. The IDs do not change when the sequences composition of a taxon changes as long as the name is kept. However, they always change when the name is changed, i.e. even when only a typo in the name is fixed. Additionally, they change if one of their ancestor taxa changes."

"IDs will never be re-assigned. Actually, the old entries remain in our internal database and are only marked as deleted. Therefore, we can guarantee that the one ID will ever be assigned to exactly one taxon"

Alternatively we could use the tax rank name as the "ID" similarly to the GTDB taxonomy ontology, but my understanding was that numerical IDs were preferable when available. I might be wrong in that assumption.

cthoyt · 2025-01-22T14:18:00Z

Do you know if entries that remain in the internal database and are marked as deleted have some kind of link to the new identifier for the same taxon?

jplfaria · 2025-01-22T14:26:36Z

I do not know unfortunately.

I will try to get clarification and get back to you. If you have any other questions that you think would help clarify potential issues I am not foreseeing, please let me know, and I can ask them in the same email exchange.

jplfaria · 2025-01-23T17:25:10Z

@cthoyt this is what I heard back on your question:

"Unfortunately, the relation between the deleted and the new taxon is not tracked in the database."

cthoyt · 2025-01-24T09:58:08Z

This doesn't follow the guidelines for identifiers being persistent as described in https://doi.org/10.1371/journal.pbio.2001414, but this leaves an opportunity for building a supporting service for mapping outdated SILVA IDs. I'd suggest the following:

1. finish putting the identifier into Bioregistry
2. create a rudimentary PyOBO source
3. build a reconciliation service that can try to reconstruct mappings from deleted records back to updated records
4. petition SILVA to do something more standard. They have both the GCBR and the ELIXIR Core Data Resource sticker, so hopefully they care about improvements
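One possible reading of the reconciliation idea, as a rough sketch rather than a design: compare two releases of the taxonomy file and, for every ID that disappeared, propose the new ID whose path is most similar. The function name, the similarity cutoff, and the IDs in the example are all made up for illustration.

```python
import difflib
from typing import Optional


def reconcile(old: dict[str, str], new: dict[str, str], cutoff: float = 0.8) -> dict[str, Optional[str]]:
    """Propose a replacement for each ID present in `old` but missing from `new`.

    Both arguments map a taxon ID to its path string. The value is None when no
    path in `new` is similar enough to the old one.
    """
    new_id_by_path = {path: identifier for identifier, path in new.items()}
    proposals: dict[str, Optional[str]] = {}
    for identifier, path in old.items():
        if identifier in new:
            continue  # ID survived, nothing to reconcile
        close = difflib.get_close_matches(path, list(new_id_by_path), n=1, cutoff=cutoff)
        proposals[identifier] = new_id_by_path[close[0]] if close else None
    return proposals


# Made-up IDs: a renamed taxon gets a fresh ID in the newer release
old_release = {"3": "Bacteria;", "100": "Bacteria;Cyanobacteria;"}
new_release = {"3": "Bacteria;", "200": "Bacteria;Cyanobacteriota;"}
print(reconcile(old_release, new_release))
```

String similarity only yields candidates, so anything it proposes would still need curation before being published as a mapping.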

jgerken · 2025-01-24T13:30:47Z

> Thanks @jplfaria. I made a few updates, including putting @frankolivergloeckner as the contact person - it's Bioregistry policy to have a single point of contact and not a group email.

Please remove Frank Oliver as contact; he is no longer the PI of SILVA and left the MPI several years ago, as has SILVA. SILVA is now owned and operated by the DSMZ. The new PI is Lorenz Reimer and I am the head of SILVA. But I would ask you to put [email protected] as the contact point. Our policy is that user requests must be directed to [email protected] and not to individual team members. [email protected] is also not a group email but the email address associated with our helpdesk. This ensures that the whole team is informed about user requests and that requests can be handled when a team member is unavailable. Adding Lorenz's or my contact details would only cause friction for users, as we would tell them to resend their request to [email protected]

jplfaria · 2025-01-27T16:42:33Z

@cthoyt, your proposed plan sounds good to me. What do we need to "finish putting the identifier into Bioregistry"? For the other points you brought up:

I have started by implementing the Getter function but it seems it's failing some coverage tests.
I am still trying to wrap my head around it; where would we maintain the "reconciliation service" you described?

cthoyt · 2025-01-27T17:42:18Z

Hi @jgerken, thanks for the note. The Bioregistry project's value system places greater emphasis on transparency, so if you're not comfortable with us curating an explicit contact person, we can leave this field blank for now.

@jplfaria I updated the PR for the new prefix to specifically be about the fact that this is for a taxonomy identifier, and leave a few warnings based on the discussion from this thread related to the persistence/uniqueness of IDs.

WRT bioversions, start by addressing all CI feedback, and if you get stuck, ping me there

Next time we chat, I can explain what I meant by reconciliation service. I wouldn't worry about that for now

Closes #1306 --------- Co-authored-by: jplfaria <[email protected]> Co-authored-by: Charles Tapley Hoyt <[email protected]>

jplfaria added the New and Prefix labels Dec 12, 2024

github-actions bot mentioned this issue Dec 12, 2024

Add prefix: silva #1307

Merged

jplfaria mentioned this issue Jan 24, 2025

Add getter for SILVA biopragmatics/bioversions#69

Merged

cthoyt closed this as completed in #1307 Jan 28, 2025

cthoyt added a commit that referenced this issue Jan 28, 2025

Add prefix: silva (#1307)

0c87a55

jplfaria mentioned this issue Feb 12, 2025

New source: SILVA taxonomy biopragmatics/pyobo#348

Open
