How can I build a Sourmash LCA database for 16S rRNA gene sequence datasets #1421
hi @LZC0034 see also and also #548 which whoops we never answered. In terms of documentation, are you talking about the tutorial, here? https://sourmash.readthedocs.io/en/latest/tutorials-lca.html Off the top of my head,
I'll try to post more advice here soon. |
Hi @ctb, I have built an LCA database from the SILVA Release 138 sequence database using 7-mer (k=7) hashes. For preliminary tests, I built LCA databases from 10, 50, 100, and 500 sequences/signatures randomly selected from the SILVA signature pool. The database built from 10 signatures worked best: all 10 query signatures were classified at the species level. However, as the number of signatures increased, classification got worse. With the database built from 500 signatures, the query signatures did not match any taxonomy at all. LCA classification steps:
Classification results:
Question: Thanks, |
hi chao, apologies for taking so long to reply! Please feel free to come chat with us over at https://gitter.im/sourmash-bio/community# if you like, too; we're hoping to make that a more interactive forum for discussion! I think you are running into a problem that we have only really understood ourselves in the past two years or so: k-mer based LCA approaches saturate taxonomic signal quite quickly, and so for large databases you may end up with very poor classification results. This is discussed here, and here, and also in Nasko et al. 2018. A few things you can try:
I hope this isn't too confusing and I hope that we can help you get this working, too! |
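To make the saturation point concrete: with k=7 there are only 4^7 = 16,384 possible k-mers, while a 16S sequence of roughly 1,500 bp contains close to 1,500 of them. In an LCA database, a k-mer shared by several lineages is assigned to their lowest common ancestor, so shared k-mers carry no species-level signal. The sketch below is a toy illustration, not sourmash code. It assumes random sequences (real 16S genes are much more similar to each other, which should make sharing worse) and reports the fraction of distinct k-mers that occur in exactly one sequence as the database grows.

```python
import random


def kmers(seq, k):
    """Return the set of all k-mers in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def unique_fraction(n_seqs, k, length=1500, seed=1):
    """Fraction of distinct k-mers found in exactly one of n_seqs random sequences.

    Random sequences are an assumption made for illustration only.
    """
    rng = random.Random(seed)
    owners = {}  # k-mer -> number of sequences that contain it
    for _ in range(n_seqs):
        seq = "".join(rng.choice("ACGT") for _ in range(length))
        for km in kmers(seq, k):
            owners[km] = owners.get(km, 0) + 1
    unique = sum(1 for count in owners.values() if count == 1)
    return unique / len(owners)


if __name__ == "__main__":
    # Same database sizes as in the tests described in this thread.
    for n in (10, 50, 100, 500):
        print(n, round(unique_fraction(n, 7), 3), round(unique_fraction(n, 21), 3))
```

Under these assumptions the k=7 column should fall towards zero as sequences are added, while the k=21 column stays near one, which is in line with 21-mers still classifying at 500 signatures when 7-mers do not.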
Sourmash lca calssify-gather-search test for 16S sequence-Chao-07-12-2021.pdf
Hi @ctb, I have tried the lca classify, gather, and search commands on 7-mer and 21-mer hash signatures computed from SILVA database sequences.
First, I randomly selected 500 sequences from the SILVA database and computed 7-mer and 21-mer hash signatures for them. I then built LCA databases from the 500 7-mer and 500 21-mer signatures, and used the lca classify, gather, and search commands to match the 500 signatures against the databases built from them.
For the 7-mer signatures, lca classify and gather found no matches between the queries and the database, and search found three matches. For the 21-mer signatures, lca classify matched all the queries to the LCA database, while gather found no matches and search found two.
Next, I ran gather and search to match the 500 queries against an SBT database built from the 21-mer signatures. gather reported the error “requested threshold_bp is unattainable with this query”, while search found two matches. Without gather output, I cannot use the new taxonomy commands from sourmash v4.2.0.
In addition, I used lca classify on a 21-mer hash signature computed from a real 16S sequence sample containing about 100 OTUs. The output showed that the classification disagreed. Please see the attached file for details.
Did I do something wrong during the procedure? Any suggestions are appreciated! Thanks, |
neat! Thank you for troubleshooting! for 'gather', you can run it with I think the lack of matches with 7-mers is because you're using a small subset of silva only, which is consistent with the other results. I suspect 'gather' with a much larger 7-mer database will work. |
Thanks for your suggestion! I added the Do you have any suggestions about solving this? Thanks! |
eep! that's a bug all right... apologies! While I track that down, please try using |
figured out the bug - apparently we never tested the latest gather code from #1613 with a scaled of 1! Working on a test and a fix now. |
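For context on why scaled=1 is a special case: a scaled sketch keeps only the k-mer hashes that fall in the lowest 1/scaled of the 64-bit hash space, so scaled=1 keeps every hash, which matters for sequences as short as 16S. The following is a rough sketch of that idea only. It uses `hashlib.blake2b` as a stand-in hash (an assumption for illustration; it is not the hash function sourmash uses), and `kmer_hash` and `scaled_sketch` are hypothetical helper names.

```python
import hashlib


def kmer_hash(kmer):
    """64-bit hash of a k-mer (blake2b stand-in, NOT sourmash's hash function)."""
    digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def scaled_sketch(seq, k, scaled):
    """Keep only k-mer hashes in the lowest 1/scaled of the 64-bit hash space."""
    max_hash = 2 ** 64 // scaled  # with scaled=1 every 64-bit hash is below this
    hashes = {kmer_hash(seq[i:i + k]) for i in range(len(seq) - k + 1)}
    return {h for h in hashes if h < max_hash}
```

With a large scaled value, a 1,500 bp sequence may keep only a handful of hashes or none at all, so gather and search have very little to compare.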
Fixed in #1670. My suggestion of trying If you want to try the code out before the next release, you can either install from the
but, again, you'll need to wait for merge... thanks for reporting! |
bug is fixed in latest branch 🎉 but you'll have to wait for a release to update your conda install, I'm afraid :(. please do let me know if you want to try out the |
Great! I do want to try out the pip install. Thanks! |
the fix from #1670 is included in sourmash 4.2.1, which is now available via bioconda 🎉 |
note that recent results (and some downstream work to figure out why, exactly, sourmash works so well) suggest to me that sourmash gather should be particularly good at 16S analysis with scaled=1. In brief, the min-set-cov approach should be really good. @brooksph told me this a long time ago, so this is me telling him he was right all along (as usual). |
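As a rough picture of what a greedy min-set-cov looks like, here is a minimal sketch on plain Python sets of hashes. It is not the sourmash implementation: `greedy_gather`, the `database` dict, and the `threshold` parameter (a minimum overlap counted in hashes, loosely analogous to a base-pair threshold) are hypothetical names chosen for illustration.

```python
def greedy_gather(query, database, threshold=1):
    """Greedy minimum-set-cover over sets of hashes.

    Repeatedly pick the reference that covers the most still-unexplained
    query hashes, record it, and remove those hashes from the query.
    """
    remaining = set(query)
    results = []
    while remaining:
        name, overlap = max(
            ((n, len(remaining & hashes)) for n, hashes in database.items()),
            key=lambda item: item[1],
            default=(None, 0),
        )
        # Stop when nothing overlaps or the best overlap is below the threshold.
        if overlap == 0 or overlap < threshold:
            break
        results.append((name, overlap))
        remaining -= database[name]
    return results, remaining


if __name__ == "__main__":
    query = {1, 2, 3, 4, 5, 6}
    database = {"A": {1, 2, 3, 4}, "B": {3, 4, 5}, "C": {6, 7}}
    print(greedy_gather(query, database))
```

Because each reference is only credited with hashes not already explained by an earlier match, hashes shared between references are assigned once rather than pushed up to a common ancestor.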
Thanks for your comments @ctb. My recently posted preprint also showed 7-mer hash data generated by sourmash with scaled=1 worked better for fresh produce safety and quality prediction than amplicon sequence variant data. Since the sourmash taxonomic analysis issue for 16S rRNA gene sequencing data was not solved, I did not include this part in my manuscript. I did use sourmash gather with scaled=1, but the taxonomic outcomes were not good. I will try to figure it out. |
Hi @ctb ! |
sourmash gather is the min-set-cov approach ;). (per @bryshalm is actually going through this all right now, I will ask her to give you some tips! |
(but also I think you should use something other than an LCA database. More on that soon.) |
Hi, I am trying to build my own sourmash LCA database from the NCBI, SILVA, or Greengenes databases, for taxonomic classification of my k-mer hash dataset (e.g. 7-mer) computed from a 16S rRNA gene sequence dataset. The simple example given on the sourmash website shows that pre-calculated signatures and a taxonomy spreadsheet need to be created. I am wondering whether there are detailed instructions on how to create these two files, or on how you built the genbank-k31.lca.json.gz database. Thanks!
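For the taxonomy spreadsheet, a minimal sketch of one way to generate it from SILVA-style FASTA headers is below. Several things here are assumptions rather than confirmed details: that each header looks like `>ACCESSION.start.stop rank1;rank2;...;species`, that the ranks map onto seven levels in order, and that the column names (`identifiers` plus the seven ranks) match what `sourmash lca index` expects, so check the LCA tutorial before relying on them. `silva_header_to_row` and `write_taxonomy_csv` are hypothetical helper names.

```python
import csv
import io

RANKS = ["superkingdom", "phylum", "class", "order", "family", "genus", "species"]


def silva_header_to_row(header):
    """Split one SILVA-style FASTA header into an identifier plus seven rank columns."""
    ident, _, lineage = header.lstrip(">").strip().partition(" ")
    names = [name.strip() for name in lineage.split(";") if name.strip()]
    # Naive assumption: ranks appear in order. Short lineages are padded with
    # empty strings and extra levels are dropped; real data needs a proper mapping.
    names = (names + [""] * len(RANKS))[:len(RANKS)]
    return [ident] + names


def write_taxonomy_csv(headers, out):
    """Write a lineage spreadsheet with one row per FASTA header."""
    writer = csv.writer(out)
    writer.writerow(["identifiers"] + RANKS)
    for header in headers:
        writer.writerow(silva_header_to_row(header))


if __name__ == "__main__":
    example = (">AB001438.1.1662 Bacteria;Proteobacteria;Gammaproteobacteria;"
               "Enterobacterales;Enterobacteriaceae;Escherichia-Shigella;Escherichia coli")
    buffer = io.StringIO()
    write_taxonomy_csv([example], buffer)
    print(buffer.getvalue())
```

SILVA lineages do not always have exactly seven levels, so the padding and truncation here are only a placeholder. As I understand it, the identifiers in the spreadsheet also need to match the names of the signatures being indexed.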