Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

How can I build a Sourmash LCA database for 16S rRNA gene sequence datasets #1421

Open
LZC0034 opened this issue Mar 29, 2021 · 18 comments
Open
Labels

Comments

@LZC0034
Copy link

LZC0034 commented Mar 29, 2021

Hi, I am trying to build my own Sourmash LCA database based on NCBI, SILVA, or Greengenes databases for taxonomic classification for my k-mer hash dataset (e.g. 7-mer) computed from 16S rRNA gene sequence dataset. The simple example given in Sourmash website shows the pre-calculated signatures and taxonomy spreadsheet should be created. I am wondering if there is any detailed instruction on how to create these two files or how you built the genbank-k31.lca.json.gz database. Thanks!

@ctb
Copy link
Contributor

ctb commented Mar 29, 2021

hi @LZC0034 see also

#1000

and also #548 which whoops we never answered.

In terms of documentation, are you talking about the tutorial, here?

https://sourmash.readthedocs.io/en/latest/tutorials-lca.html

Off the top of my head,

  • you probably want to use sourmash sketch dna -p scaled=1 for calculating the signatures;
  • you should make sure that each 16s rRNA input sequence has a name with a unique identifier in it
  • use --singleton and --name-from-first with sourmash sketch to name the sketches properly

I'll try to post more advice here soon.

@LZC0034
Copy link
Author

LZC0034 commented May 12, 2021

hi @LZC0034 see also

#1000

and also #548 which whoops we never answered.

In terms of documentation, are you talking about the tutorial, here?

https://sourmash.readthedocs.io/en/latest/tutorials-lca.html

Off the top of my head,

  • you probably want to use sourmash sketch dna -p scaled=1 for calculating the signatures;
  • you should make sure that each 16s rRNA input sequence has a name with a unique identifier in it
  • use --singleton and --name-from-first with sourmash sketch to name the sketches properly

I'll try to post more advice here soon.

Hi @ctb,

I have established the LCA database based on Silva Release 138 sequence database in 7-mer (k=7) hash. For preliminary tests, I established the LCA databases based on randomly selected 10, 50, 100, and 500 sequences/signatures from the signature pool of the Silva database. The LCA database established based on 10 signatures worked the best, where all the 10 query signatures were classified to the taxa at the species level. However, as the amount of signatures increases, the classification became worse. For the LCA database based on 500 signatures, the query signatures were even not matched with any taxonomy.

LCA classification steps:
After prepared the Silva sequence and corresponding taxomomy, I used the following scripts to do the sourmash lca classification

  1. compute signature
    sourmash sketch dna -p k=7,scaled=1 ./*.fasta --name-from-first

  2. Build sourmash LCA database
    sourmash lca index --scaled=1 -k=7 -f silva_tax.csv silva.lca.json ./*.sig

  3. Classify signatures
    sourmash lca classify --db silva.lca.json --query ./*.sig -o out.csv

Classification results:

  1. The randomly selected 10 signatures were classified to taxa with the full taxonomic rank based on the LCA database established based on the 10 signatures.
  2. In the classification of randomly selected 10 signatures against LCA database built based on 50 signatures, most of the signatures were classified to taxa down to species level, but a few were classified to taxa at a lower taxonomic level.
  3. In the LCA database built based on 100 signatures, most of the signatures were classified to taxa at the domain level. A few signatures were even unclassified to any taxa.
  4. In the LCA database built based on 500 signatures, most of the signatures were unclassified to any taxa.

Question:
Did I do something wrong during the process of LCA classification? Please let me know if you have any questions or suggestions.

Thanks,
Chao

@ctb
Copy link
Contributor

ctb commented Jun 19, 2021

hi chao, apologies for taking so long to reply! Please feel free to come chat with us over at https://gitter.im/sourmash-bio/community# if you like, too; we're hoping to make that a more interactive forum to discussion!

I think you are running into a problem that we have only really understood ourselves in the past two years or so: k-mer based LCA approaches saturate taxonomic signal quite quickly, and so for large databases you may end up with very poor classification approaches. This is discussed here, and here, and also in Nasko et al. 2018.

A few things you can try:

  • first, we are just now implementing a standard way to use a gather-based algorithm to assign taxonomy. Please see [MRG] add taxonomy subcommand #1543; if you are interested in trying it out, let us know and we can help you! (You would not need to change your database, just the commands you are running!)
  • note also that in [MRG] add taxonomy subcommand #1543 we have added the (otherwise unavailable) functionality in [WIP] improve identifier & taxonomy parsing for lca index #1542, which considerably improves options for using identifiers (and associated error reporting) when using sourmash lca index.
  • second, you should try increasing k-mer size well past 7, to 21 or even 31 or 51. With a sourmash gather based approach as we are suggesting in [MRG] add taxonomy subcommand #1543, you should get more specific results with larger k-mer sizes, although sensitivity might go down past 21 (that's a guess, though).

I hope this isn't too confusing and I hope that we can help you get this working, too!

@LZC0034
Copy link
Author

LZC0034 commented Jul 12, 2021

Sourmash lca calssify-gather-search test for 16S sequence-Chao-07-12-2021.pdf

Hi @ctb,

I have tried the lca classify, gather, and search commands for 7-mer and 21-mer hash signatures computed from the sequences of the SILVA database.

First, I randomly selected 500 sequences from SILVA database and computed for 7-mer and 21-mer hash signatures. Afterwards, I established lca databases using the 500 of 7-mer and 21-mer signatures. Then, I used the lca classify, gather, and search commands to match the 500 signatures against the built database. For the 7-mer hash signatures, the results showed that no match between the queries and database was obtained by using lca classify and gather commands, and three matches was found by using search command. For the 21-mer hash signatures, all the queries were matched with the lca database by using lca classify, while no match and two matches were found by using gather and search commands, respectively.

Then, I tried the gather and search commands to match the 500 queries against a built STB database of 21-mer hash signatures. The output reported an error “requested threshold_bp is unattainable with this query” after using gather command. But two matches were found after using search command. Without the gather output, I cannot use the new taxonomy commands from the Sourmash v4.2.0.

In addition, I tried to use the lca classify to match a query of a 21-mer hash signature computed from a real 16S sequence sample containing about 100 OTUs. The output showed the classification was disagreed. Please see the attached file for the details. Did I do something wrong during the procedure? Any suggestion is appreciated!

Thanks,
Chao

@ctb
Copy link
Contributor

ctb commented Jul 13, 2021

neat! Thank you for troubleshooting!

for 'gather', you can run it with --threshold-bp=0 to disable that warning. The default is 50kb which won't work for 16s :).

I think the lack of matches with 7-mers is because you're using a small subset of silva only, which is consistent with the other results. I suspect 'gather' with a much larger 7-mer database will work.

@LZC0034
Copy link
Author

LZC0034 commented Jul 14, 2021

neat! Thank you for troubleshooting!

for 'gather', you can run it with --threshold-bp=0 to disable that warning. The default is 50kb which won't work for 16s :).

I think the lack of matches with 7-mers is because you're using a small subset of silva only, which is consistent with the other results. I suspect 'gather' with a much larger 7-mer database will work.

Thanks for your suggestion!

I added the --threshold-bp=0 to the 'gather' command and got an error report as below

image

Do you have any suggestions about solving this? Thanks!

@ctb
Copy link
Contributor

ctb commented Jul 14, 2021

eep! that's a bug all right... apologies!

While I track that down, please try using --no-prefetch and/or --linear on the command line.

@ctb
Copy link
Contributor

ctb commented Jul 14, 2021

figured out the bug - apparently we never tested the latest gather code from #1613 with a scaled of 1! Working on a test and a fix now.

@ctb
Copy link
Contributor

ctb commented Jul 14, 2021

eep! that's a bug all right... apologies!

While I track that down, please try using --no-prefetch and/or --linear on the command line.

Fixed in #1670. My suggestion of trying --no-prefetch and --linear will not work, unfortunately.

If you want to try the code out before the next release, you can either install from the latest branch (once #1670 is merged) or try with --scaled=2, which will not trigger the bug. Let me know if you want help using the latest branch - it should work if you do this:

git clone https://github.com/sourmash-bio/sourmash
cd sourmash
pip install -e .

but, again, you'll need to wait for merge...

thanks for reporting!

@ctb
Copy link
Contributor

ctb commented Jul 14, 2021

bug is fixed in latest branch 🎉 but you'll have to wait for a release to update your conda install, I'm afraid :(. please do let me know if you want to try out the pip install stuff above.

@LZC0034
Copy link
Author

LZC0034 commented Jul 14, 2021

bug is fixed in latest branch 🎉 but you'll have to wait for a release to update your conda install, I'm afraid :(. please do let me know if you want to try out the pip install stuff above.

Great! I do want to try out the pip install. Thanks!

@ctb
Copy link
Contributor

ctb commented Jul 17, 2021

the fix from #1670 is included in sourmash 4.2.1, which is now available via bioconda 🎉

@ctb
Copy link
Contributor

ctb commented Dec 16, 2022

note that recent results (and some downstream work to figure out why, exactly, sourmash works so well) suggests to me that sourmash gather should be particularly good at 16S analysis with scaled=1. In brief, the min-set-cov approach should be really good.

@brooksph told me this a long time ago, so this is me telling him he was right all along (as usual).

@LZC0034
Copy link
Author

LZC0034 commented Dec 19, 2022

note that recent results (and some downstream work to figure out why, exactly, sourmash works so well) suggests to me that sourmash gather should be particularly good at 16S analysis with scaled=1. In brief, the min-set-cov approach should be really good.

@brooksph told me this a long time ago, so this is me telling him he was right all along (as usual).

Thanks for your comments @ctb. My recently posted preprint also showed 7-mer hash data generated by sourmash with scaled=1 worked better for fresh produce safety and quality prediction than amplicon sequence variant data. Since the sourmash taxonomic analysis issue for 16S rRNA gene sequencing data was not solved, I did not include this part in my manuscript. I did use sourmash gather with scaled=1, but the taxonomic outcomes were not good. I will try to figure it out.

@ReneKat
Copy link

ReneKat commented Jun 3, 2023

note that recent results (and some downstream work to figure out why, exactly, sourmash works so well) suggests to me that sourmash gather should be particularly good at 16S analysis with scaled=1. In brief, the min-set-cov approach should be really good.

@brooksph told me this a long time ago, so this is me telling him he was right all along (as usual).

Hi @ctb !
Can you point me in the right direction for using the SILVA database to classify 16S sequences? From what I've read, I should use scaled=1 and -k=7 when building the SILVA database. Then use sourmash gather for the classifiction part. What is the min-set-cov approach?
Thanks for your guidance!
-René

@ctb
Copy link
Contributor

ctb commented Jun 3, 2023

sourmash gather is the min-set-cov approach ;). (per

@bryshalm is actually going through this all right now, I will ask her to give you some tips!

@ctb
Copy link
Contributor

ctb commented Jun 3, 2023

(but also I think you should use something other than an LCA database. More on that soon.)

@ReneKat
Copy link

ReneKat commented Jul 3, 2023

(but also I think you should use something other than an LCA database. More on that soon.)

Thanks @ctb Looking forward to hearing more tips. Is there anyway to use the GTDB for better classification?

@bryshalm if you need a guinea pig to try out any code, I volunteer. :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

4 participants