How can I build a Sourmash LCA database for 16S rRNA gene sequence datasets #1421

LZC0034 · 2021-03-29T15:01:42Z

Hi, I am trying to build my own Sourmash LCA database based on NCBI, SILVA, or Greengenes databases for taxonomic classification for my k-mer hash dataset (e.g. 7-mer) computed from 16S rRNA gene sequence dataset. The simple example given in Sourmash website shows the pre-calculated signatures and taxonomy spreadsheet should be created. I am wondering if there is any detailed instruction on how to create these two files or how you built the genbank-k31.lca.json.gz database. Thanks!

ctb · 2021-03-29T16:40:08Z

hi @LZC0034 see also

#1000

and also #548 which whoops we never answered.

In terms of documentation, are you talking about the tutorial, here?

https://sourmash.readthedocs.io/en/latest/tutorials-lca.html

Off the top of my head,

you probably want to use sourmash sketch dna -p scaled=1 for calculating the signatures;
you should make sure that each 16s rRNA input sequence has a name with a unique identifier in it
use --singleton and --name-from-first with sourmash sketch to name the sketches properly

I'll try to post more advice here soon.

LZC0034 · 2021-05-12T07:21:46Z

hi @LZC0034 see also

#1000

and also #548 which whoops we never answered.

In terms of documentation, are you talking about the tutorial, here?

https://sourmash.readthedocs.io/en/latest/tutorials-lca.html

Off the top of my head,

you probably want to use sourmash sketch dna -p scaled=1 for calculating the signatures;

you should make sure that each 16s rRNA input sequence has a name with a unique identifier in it

use --singleton and --name-from-first with sourmash sketch to name the sketches properly

I'll try to post more advice here soon.

Hi @ctb,

I have established the LCA database based on Silva Release 138 sequence database in 7-mer (k=7) hash. For preliminary tests, I established the LCA databases based on randomly selected 10, 50, 100, and 500 sequences/signatures from the signature pool of the Silva database. The LCA database established based on 10 signatures worked the best, where all the 10 query signatures were classified to the taxa at the species level. However, as the amount of signatures increases, the classification became worse. For the LCA database based on 500 signatures, the query signatures were even not matched with any taxonomy.

LCA classification steps:
After prepared the Silva sequence and corresponding taxomomy, I used the following scripts to do the sourmash lca classification

compute signature
sourmash sketch dna -p k=7,scaled=1 ./*.fasta --name-from-first
Build sourmash LCA database
sourmash lca index --scaled=1 -k=7 -f silva_tax.csv silva.lca.json ./*.sig
Classify signatures
sourmash lca classify --db silva.lca.json --query ./*.sig -o out.csv

Classification results:

The randomly selected 10 signatures were classified to taxa with the full taxonomic rank based on the LCA database established based on the 10 signatures.
In the classification of randomly selected 10 signatures against LCA database built based on 50 signatures, most of the signatures were classified to taxa down to species level, but a few were classified to taxa at a lower taxonomic level.
In the LCA database built based on 100 signatures, most of the signatures were classified to taxa at the domain level. A few signatures were even unclassified to any taxa.
In the LCA database built based on 500 signatures, most of the signatures were unclassified to any taxa.

Question:
Did I do something wrong during the process of LCA classification? Please let me know if you have any questions or suggestions.

Thanks,
Chao

ctb · 2021-06-19T17:27:15Z

hi chao, apologies for taking so long to reply! Please feel free to come chat with us over at https://gitter.im/sourmash-bio/community# if you like, too; we're hoping to make that a more interactive forum to discussion!

I think you are running into a problem that we have only really understood ourselves in the past two years or so: k-mer based LCA approaches saturate taxonomic signal quite quickly, and so for large databases you may end up with very poor classification approaches. This is discussed here, and here, and also in Nasko et al. 2018.

A few things you can try:

first, we are just now implementing a standard way to use a gather-based algorithm to assign taxonomy. Please see [MRG] add taxonomy subcommand #1543; if you are interested in trying it out, let us know and we can help you! (You would not need to change your database, just the commands you are running!)
note also that in [MRG] add taxonomy subcommand #1543 we have added the (otherwise unavailable) functionality in [WIP] improve identifier & taxonomy parsing for lca index #1542, which considerably improves options for using identifiers (and associated error reporting) when using sourmash lca index.
second, you should try increasing k-mer size well past 7, to 21 or even 31 or 51. With a sourmash gather based approach as we are suggesting in [MRG] add taxonomy subcommand #1543, you should get more specific results with larger k-mer sizes, although sensitivity might go down past 21 (that's a guess, though).

I hope this isn't too confusing and I hope that we can help you get this working, too!

LZC0034 · 2021-07-12T21:01:00Z

Sourmash lca calssify-gather-search test for 16S sequence-Chao-07-12-2021.pdf

Hi @ctb,

I have tried the lca classify, gather, and search commands for 7-mer and 21-mer hash signatures computed from the sequences of the SILVA database.

First, I randomly selected 500 sequences from SILVA database and computed for 7-mer and 21-mer hash signatures. Afterwards, I established lca databases using the 500 of 7-mer and 21-mer signatures. Then, I used the lca classify, gather, and search commands to match the 500 signatures against the built database. For the 7-mer hash signatures, the results showed that no match between the queries and database was obtained by using lca classify and gather commands, and three matches was found by using search command. For the 21-mer hash signatures, all the queries were matched with the lca database by using lca classify, while no match and two matches were found by using gather and search commands, respectively.

Then, I tried the gather and search commands to match the 500 queries against a built STB database of 21-mer hash signatures. The output reported an error “requested threshold_bp is unattainable with this query” after using gather command. But two matches were found after using search command. Without the gather output, I cannot use the new taxonomy commands from the Sourmash v4.2.0.

In addition, I tried to use the lca classify to match a query of a 21-mer hash signature computed from a real 16S sequence sample containing about 100 OTUs. The output showed the classification was disagreed. Please see the attached file for the details. Did I do something wrong during the procedure? Any suggestion is appreciated!

Thanks,
Chao

ctb · 2021-07-13T14:50:52Z

neat! Thank you for troubleshooting!

for 'gather', you can run it with --threshold-bp=0 to disable that warning. The default is 50kb which won't work for 16s :).

I think the lack of matches with 7-mers is because you're using a small subset of silva only, which is consistent with the other results. I suspect 'gather' with a much larger 7-mer database will work.

LZC0034 · 2021-07-14T08:00:10Z

neat! Thank you for troubleshooting!

for 'gather', you can run it with --threshold-bp=0 to disable that warning. The default is 50kb which won't work for 16s :).

I think the lack of matches with 7-mers is because you're using a small subset of silva only, which is consistent with the other results. I suspect 'gather' with a much larger 7-mer database will work.

Thanks for your suggestion!

I added the --threshold-bp=0 to the 'gather' command and got an error report as below

Do you have any suggestions about solving this? Thanks!

ctb · 2021-07-14T13:04:33Z

eep! that's a bug all right... apologies!

While I track that down, please try using --no-prefetch and/or --linear on the command line.

ctb · 2021-07-14T13:14:56Z

figured out the bug - apparently we never tested the latest gather code from #1613 with a scaled of 1! Working on a test and a fix now.

ctb · 2021-07-14T13:28:47Z

eep! that's a bug all right... apologies!

While I track that down, please try using --no-prefetch and/or --linear on the command line.

Fixed in #1670. My suggestion of trying --no-prefetch and --linear will not work, unfortunately.

If you want to try the code out before the next release, you can either install from the latest branch (once #1670 is merged) or try with --scaled=2, which will not trigger the bug. Let me know if you want help using the latest branch - it should work if you do this:

git clone https://github.com/sourmash-bio/sourmash
cd sourmash
pip install -e .

but, again, you'll need to wait for merge...

thanks for reporting!

ctb · 2021-07-14T15:54:55Z

bug is fixed in latest branch 🎉 but you'll have to wait for a release to update your conda install, I'm afraid :(. please do let me know if you want to try out the pip install stuff above.

LZC0034 · 2021-07-14T15:59:38Z

bug is fixed in latest branch 🎉 but you'll have to wait for a release to update your conda install, I'm afraid :(. please do let me know if you want to try out the pip install stuff above.

Great! I do want to try out the pip install. Thanks!

ctb · 2021-07-17T16:00:57Z

the fix from #1670 is included in sourmash 4.2.1, which is now available via bioconda 🎉

ctb · 2022-12-16T15:35:14Z

note that recent results (and some downstream work to figure out why, exactly, sourmash works so well) suggests to me that sourmash gather should be particularly good at 16S analysis with scaled=1. In brief, the min-set-cov approach should be really good.

@brooksph told me this a long time ago, so this is me telling him he was right all along (as usual).

LZC0034 · 2022-12-19T19:10:27Z

note that recent results (and some downstream work to figure out why, exactly, sourmash works so well) suggests to me that sourmash gather should be particularly good at 16S analysis with scaled=1. In brief, the min-set-cov approach should be really good.

@brooksph told me this a long time ago, so this is me telling him he was right all along (as usual).

Thanks for your comments @ctb. My recently posted preprint also showed 7-mer hash data generated by sourmash with scaled=1 worked better for fresh produce safety and quality prediction than amplicon sequence variant data. Since the sourmash taxonomic analysis issue for 16S rRNA gene sequencing data was not solved, I did not include this part in my manuscript. I did use sourmash gather with scaled=1, but the taxonomic outcomes were not good. I will try to figure it out.

ReneKat · 2023-06-03T02:33:53Z

note that recent results (and some downstream work to figure out why, exactly, sourmash works so well) suggests to me that sourmash gather should be particularly good at 16S analysis with scaled=1. In brief, the min-set-cov approach should be really good.

@brooksph told me this a long time ago, so this is me telling him he was right all along (as usual).

Hi @ctb !
Can you point me in the right direction for using the SILVA database to classify 16S sequences? From what I've read, I should use scaled=1 and -k=7 when building the SILVA database. Then use sourmash gather for the classifiction part. What is the min-set-cov approach?
Thanks for your guidance!
-René

ctb · 2023-06-03T15:02:45Z

sourmash gather is the min-set-cov approach ;). (per

@bryshalm is actually going through this all right now, I will ask her to give you some tips!

ctb · 2023-06-03T15:03:07Z

(but also I think you should use something other than an LCA database. More on that soon.)

ReneKat · 2023-07-03T23:14:25Z

(but also I think you should use something other than an LCA database. More on that soon.)

Thanks @ctb Looking forward to hearing more tips. Is there anyway to use the GTDB for better classification?

@bryshalm if you need a guinea pig to try out any code, I volunteer. :)

ctb mentioned this issue Apr 23, 2021

what about including marker gene "sketches" in sourmash? #670

Open

LZC0034 closed this as completed May 8, 2021

LZC0034 reopened this May 8, 2021

bluegenes added the use-case label Jun 11, 2021

ctb mentioned this issue Jun 19, 2021

Would sourmash lca classify be a good tool to distinguish 16s from 18s? #548

Closed

ctb mentioned this issue Jul 14, 2021

[MRG] fix scaled=1 gather bug #1670

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

How can I build a Sourmash LCA database for 16S rRNA gene sequence datasets #1421

How can I build a Sourmash LCA database for 16S rRNA gene sequence datasets #1421

LZC0034 commented Mar 29, 2021

ctb commented Mar 29, 2021

LZC0034 commented May 12, 2021

ctb commented Jun 19, 2021

LZC0034 commented Jul 12, 2021

ctb commented Jul 13, 2021

LZC0034 commented Jul 14, 2021 •

edited by ctb

Loading

ctb commented Jul 14, 2021

ctb commented Jul 14, 2021

ctb commented Jul 14, 2021

ctb commented Jul 14, 2021

LZC0034 commented Jul 14, 2021

ctb commented Jul 17, 2021

ctb commented Dec 16, 2022

LZC0034 commented Dec 19, 2022

ReneKat commented Jun 3, 2023 •

edited

Loading

ctb commented Jun 3, 2023

ctb commented Jun 3, 2023

ReneKat commented Jul 3, 2023

How can I build a Sourmash LCA database for 16S rRNA gene sequence datasets #1421

How can I build a Sourmash LCA database for 16S rRNA gene sequence datasets #1421

Comments

LZC0034 commented Mar 29, 2021

ctb commented Mar 29, 2021

LZC0034 commented May 12, 2021

ctb commented Jun 19, 2021

LZC0034 commented Jul 12, 2021

ctb commented Jul 13, 2021

LZC0034 commented Jul 14, 2021 • edited by ctb Loading

ctb commented Jul 14, 2021

ctb commented Jul 14, 2021

ctb commented Jul 14, 2021

ctb commented Jul 14, 2021

LZC0034 commented Jul 14, 2021

ctb commented Jul 17, 2021

ctb commented Dec 16, 2022

LZC0034 commented Dec 19, 2022

ReneKat commented Jun 3, 2023 • edited Loading

ctb commented Jun 3, 2023

ctb commented Jun 3, 2023

ReneKat commented Jul 3, 2023

LZC0034 commented Jul 14, 2021 •

edited by ctb

Loading

ReneKat commented Jun 3, 2023 •

edited

Loading