SILVA #1306
Comments
Thanks @jplfaria. I made a few updates, including putting @frankolivergloeckner as the contact person - it's Bioregistry policy to have a single point of contact and not a group email. |
@jplfaria however, it's not clear what the actual semantic space is here. The URI format doesn't seem to work when using your example. Can you link me to a page that shows something for the example accession number?
Is it possible that https://www.arb-silva.de/browser/ssu-138.2/AJ001010 is actually the accession numbers that SILVA creates, and 11084 was an NCBI Taxonomy id? |
I was going off the folder I believe it has taxonomy data: |
Here are a few lines in that file. I think it corresponds to https://www.arb-silva.de/browser/ssu-121/AB050229/ but I can't figure out where any of the numbers (42950, 42951, 42952) are on the web page. Or, what does AB050229 mean?
We are both having issues understanding what is going on here. Maybe starting this entry could have been more timely. I will spend more time trying to understand the resource to be able to properly represent this entry.
The IDs in the file (42950, 42951, etc.) seem to be some taxonomy unique ID that only exists in that file; I can't find it anywhere else. When I search for those IDs as NCBI Taxonomy IDs, the organisms are different.
I think I now have more clarity on how to represent SILVA as a taxonomy ontology. The prefix should be adjusted for

Regarding the taxonomy IDs, the main file is still the one we discussed above:

SILVA taxonomy is only assigned up to the genus level since resolution by only using RNA subunits

We can then use this file to get the link to the accession as the level below genus. While we don't have a URL linking to the IDs present in the tax_slv_lsu file, there is a permanent URL for the accession. For example: https://www.arb-silva.de/browser/ssu/silva/CP019636 and the full taxonomy is shown there:

So for the aforementioned example and using the two files mentioned above, we could get a representation like this:

<!-- Domain: Bacteria (ID: 3) -->
<owl:Class rdf:about="http://example.org/silva_ssu/3">
    <rdfs:label>Bacteria</rdfs:label>
    <obo:TAXRANK_1000000 rdf:resource="http://purl.obolibrary.org/obo/TAXRANK_0000090"/>
</owl:Class>

intermediary taxon ranks

<!-- Genus: Scytonema VB-61278 (ID: 60450) -->
<owl:Class rdf:about="http://example.org/silva_ssu/60450">
    <rdfs:subClassOf rdf:resource="http://example.org/silva_ssu/60382"/>
    <rdfs:label>Scytonema VB-61278</rdfs:label>
    <obo:TAXRANK_1000000 rdf:resource="http://purl.obolibrary.org/obo/TAXRANK_0000006"/>
</owl:Class>
<!-- Accession Information -->
<owl:Class rdf:about="https://www.arb-silva.de/browser/ssu/silva/CP019636">
    <rdfs:subClassOf rdf:resource="http://example.org/silva_ssu/60450"/>
    <rdfs:label>Nostocales cyanobacterium HT-58-2</rdfs:label>
</owl:Class>

Please advise.
Okay, so let me try to summarize what I understand. SILVA has two different semantic spaces:

Accession Numbers

SILVA has accession numbers like

However, this appears to link to ENA: https://www.ebi.ac.uk/ena/browser/view/CP019636. ENA has already been registered at https://bioregistry.io/ena.embl, and @jplfaria can you confirm that all SILVA accession numbers correspond to ENA records?

SILVA Internal IDs

IDs that appear in the second column of https://www.arb-silva.de/fileadmin/silva_databases/current/Exports/taxonomy/tax_slv_lsu_138.2.txt.gz, like

How they can be ontologized is a discussion to pick back up on the PyOBO repo, something like

from pyobo import Reference, Term, TypeDef
import pandas as pd

url = "https://www.arb-silva.de/fileadmin/silva_databases/current/Exports/taxonomy/tax_slv_lsu_138.2.txt.gz"
PREFIX = "silva"
TAXRANK_PREFIX = "TAXRANK"

#: String appearing in SILVA to TAXRANK identifier
STRING_TO_TAXRANK: dict[str, str] = {
    "phylum": "0000001",
}
TD = TypeDef.from_triple(TAXRANK_PREFIX, "0000000")


def iter_terms():
    # TODO how to get current version of SILVA?
    # The export appears to be tab-separated with no header row
    df = pd.read_csv(url, sep="\t", header=None, dtype=str)
    for hierarchy, identifier, taxrank, _, _unknown in df.values:
        # Strip any trailing ";" so the last path element is the taxon name
        name = hierarchy.rstrip(";").split(";")[-1]
        term = Term(
            reference=Reference(prefix=PREFIX, identifier=identifier, name=name),
        )
        term.annotate_object(TD, Reference(prefix=TAXRANK_PREFIX, identifier=STRING_TO_TAXRANK[taxrank]))
        yield term
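As a rough illustration of how the parent of each taxon could be derived from the path column, here is a standard-library sketch that does not depend on PyOBO. It assumes the export is tab-separated with the path in the first column, the ID in the second, and the rank in the third. The sample rows are illustrative only: the `Bacteria` / `3` row follows the IDs quoted in this thread, while `Examplephylum` / `999999` is a made-up placeholder. `parse_taxonomy` is a hypothetical helper name.

```python
import csv
import io

# Illustrative rows only. "Bacteria" with ID 3 follows the thread;
# "Examplephylum" with ID 999999 is a made-up placeholder.
SAMPLE = "Bacteria;\t3\tdomain\t\t\nBacteria;Examplephylum;\t999999\tphylum\t\t\n"


def parse_taxonomy(handle):
    """Return {identifier: {"name", "rank", "parent"}} from a tax_slv-style export."""
    by_path = {}
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for path, identifier, rank, *_rest in reader:
        parts = path.rstrip(";").split(";")
        # The parent path is the same path minus its last element
        parent_path = ";".join(parts[:-1]) + ";" if len(parts) > 1 else None
        by_path[path] = {
            "identifier": identifier,
            "name": parts[-1],
            "rank": rank,
            "parent_path": parent_path,
        }
    terms = {}
    for row in by_path.values():
        parent = by_path.get(row["parent_path"])
        terms[row["identifier"]] = {
            "name": row["name"],
            "rank": row["rank"],
            "parent": parent["identifier"] if parent else None,
        }
    return terms


terms = parse_taxonomy(io.StringIO(SAMPLE))
print(terms["999999"])
```

The parent lookup relies on every ancestor path also appearing as its own row, which is an assumption about the file rather than something confirmed in this thread.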
Accession numbers

I ran a script calling the ENA API against the list of all accession numbers present in SILVA (https://www.arb-silva.de/fileadmin/silva_databases/current/Exports/taxonomy/tax_slv_lsu_138.2.acc_taxid.gz) and verified that about 99.995% match an ENA record. Only 25 out of 458470 accessions failed to find a match. AAHC01014171 is an example of an accession that does not match an ENA record. In the SILVA page for there is a "Link to EBI/ENA" but it returns "no records found".

SILVA Internal IDs

I can also confirm I can't resolve these IDs anywhere. In the file you mentioned, the IDs are indeed in the second column and the tax rank is in the third column. A little more information on these IDs I got directly from SILVA via email communication:
Alternatively we could use the tax rank name as the "ID" similarly to the GTDB taxonomy ontology, but my understanding was that numerical IDs were preferable when available. I might be wrong in that assumption. |
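For anyone wanting to repeat the accession check, a minimal sketch follows. It is not the script that was actually run. The ENA Browser API endpoint in `ENA_URL` is an assumption, and `ena_has_record` and `match_rate` are hypothetical helper names. The lookup function is injectable so the logic can be tried without network access.

```python
import urllib.error
import urllib.request

# Assumed ENA Browser API endpoint; check the ENA documentation before relying on it.
ENA_URL = "https://www.ebi.ac.uk/ena/browser/api/xml/{}"


def ena_has_record(accession, timeout=30):
    """Return True if the assumed ENA endpoint answers with HTTP 200."""
    try:
        with urllib.request.urlopen(ENA_URL.format(accession), timeout=timeout) as response:
            return response.status == 200
    except urllib.error.HTTPError:
        return False


def match_rate(accessions, has_record=ena_has_record):
    """Return (accessions without a record, percentage of accessions with one)."""
    missing = [accession for accession in accessions if not has_record(accession)]
    total = len(accessions)
    return missing, 100 * (total - len(missing)) / total


# Offline demonstration with a stand-in lookup instead of real API calls
missing, rate = match_rate(
    ["CP019636", "AAHC01014171"],
    has_record=lambda accession: accession != "AAHC01014171",
)
print(missing, rate)

# The counts reported in this thread: 25 misses out of 458470 accessions
print(round(100 * (458470 - 25) / 458470, 4))
```

With the counts reported above, the match rate works out to roughly 99.9945%.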
Do you know if entries that remain in the internal database and are marked as deleted have some kind of link to the new identifier for the same taxon? |
I do not know, unfortunately. I will try to get clarification and get back to you. If you have any other questions you think are pertinent to help clarify any potential issues I am not foreseeing, please let me know. I can take the opportunity and inquire in the same email communication.
@cthoyt this is what I heard back on your question:
|
This doesn't follow the guidelines for identifiers being persistent as described in https://doi.org/10.1371/journal.pbio.2001414, but this leaves an opportunity for building a supporting service for mapping outdated SILVA IDs. I'd suggest the following:
|
Please remove Frank Oliver as contact. He is no longer the PI of SILVA and left the MPI several years ago, as has SILVA. SILVA is now owned and operated by the DSMZ. The new PI is Lorenz Reimer and I am the head of SILVA. But I would ask to put [email protected] as contact point. Our policy is that user requests must be directed to [email protected] and not to individual team members. [email protected] is also not a group email but the email address associated with our helpdesk. This ensures that the whole team is informed about user requests and that requests can be handled when a team member is unavailable. Adding Lorenz's or my contact details would only cause friction for the user, as we would tell them to resend their request to [email protected]
@cthoyt, your proposed plan sounds good to me. What do we need to "finish putting the identifier into Bioregistry"? For the other points you brought up:
|
Hi @jgerken, thanks for the note. The Bioregistry project's value system places greater emphasis on transparency, so if you're not comfortable with us curating an explicit contact person, we can leave this field blank for now.

@jplfaria I updated the PR for the new prefix to specifically be about the fact that this is for a taxonomy identifier, and left a few warnings based on the discussion from this thread related to the persistence/uniqueness of IDs.

WRT bioversions, start by addressing all CI feedback, and if you get stuck, ping me there.

Next time we chat, I can explain what I meant by reconciliation service. I wouldn't worry about that for now.
Closes #1306 --------- Co-authored-by: jplfaria <[email protected]> Co-authored-by: Charles Tapley Hoyt <[email protected]>
Prefix
silva
Name
SILVA rRNA database project
Homepage
https://www.arb-silva.de/
Source Code Repository
No response
Description
SILVA is a comprehensive, quality-checked, and regularly updated resource of aligned ribosomal RNA (rRNA) gene sequences for Bacteria, Archaea, and Eukaryotes. It provides a consistent taxonomic framework, commonly used in microbial ecology and diversity studies.
License
CC-BY-SA-4.0
Publications
doi:10.1093/nar/gks1219 | doi:10.1093/nar/gkt1209
Example Local Unique Identifier
11084
Regular Expression Pattern for Local Unique Identifier
^\d+$
URI Format String
https://www.arb-silva.de/search?q=$1
Wikidata Property
No response
Contributor Name
Jose P. Faria
Contributor GitHub
jplfaria
Contributor ORCiD
0000-0001-9302-7250
Contributor Email
[email protected]
Contact Name
Frank Oliver Glöckner
Contact ORCiD
0000-0001-8528-9023
Contact GitHub
frankolivergloeckner
Contact Email
[email protected]
Additional Comments
No response
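To make the roles of the regular expression pattern and the URI format string concrete, here is a small sketch of how the two fields in this request combine. `expand` is a hypothetical helper, not a Bioregistry function, and it only substitutes the identifier for `$1`. As noted in the comments, the resulting search URL did not appear to resolve for the example identifier.

```python
import re

# Values copied from the request form above
PATTERN = re.compile(r"^\d+$")
URI_FORMAT = "https://www.arb-silva.de/search?q=$1"


def expand(identifier):
    """Validate a local unique identifier and substitute it into the URI format."""
    if not PATTERN.match(identifier):
        raise ValueError(f"{identifier!r} does not match {PATTERN.pattern}")
    return URI_FORMAT.replace("$1", identifier)


print(expand("11084"))
```

An accession such as `CP019636` is rejected by this pattern, which reflects the point made in the comments that accession numbers and internal taxonomy IDs are separate semantic spaces.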