sketching files containing many small sequences: manysketch is astonishingly fast #3252

Open
ctb opened this issue Jul 14, 2024 · 5 comments

Comments

@ctb
Contributor

ctb commented Jul 14, 2024

I'm trying to sketch the RVDB, the Reference Viral Genome Database. The clustered file is ~600 MB.

sourmash scripts manysketch C-RVDBvCurrent.manysketch.csv -o C-RVDBvCurrent.manysketch.zip -p dna,k=21,scaled=1000 --singleton

took about 5 minutes.

sourmash sketch dna -p k=21 C-RVDBvCurrent.fasta.gz -o C-RVDBvCurrent.sig.zip --singleton

didn't finish in 24 hours.

What's the reason!? As I understand it, manysketch isn't multithreaded when reading a single FASTA file, so multithreading doesn't explain the difference. Presumably it's just the Python for-loop penalty and/or the use of screed!? Wow.
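To make the "Python for loop" guess concrete, here is a minimal pure-Python sketch of what per-record (`--singleton`) sketching involves: one pass over the FASTA records, and for each record one hash per k-mer, keeping only hashes below `2**64 // scaled`. It is an illustration under stated assumptions, not sourmash's implementation: `read_fasta`, `toy_sketch`, and `sketch_singletons` are hypothetical names, BLAKE2b stands in for the MurmurHash-based hashing sourmash uses, and reverse-complement canonicalization is left out.

```python
import hashlib
from typing import Dict, Iterable, Iterator, Set, Tuple

HASH_SPACE = 2**64  # size of the 64-bit hash space used by this toy example


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (name, sequence) for each record in an iterable of FASTA lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:], []
        elif name is not None:
            chunks.append(line.upper())
    if name is not None:
        yield name, "".join(chunks)


def toy_sketch(seq: str, ksize: int = 21, scaled: int = 1000) -> Set[int]:
    """Keep the k-mer hashes that fall in the lowest 1/scaled of the hash space.

    Assumption: BLAKE2b (8-byte digest) is a stand-in hash, and k-mers are
    hashed as-is, without taking the reverse complement into account.
    """
    threshold = HASH_SPACE // scaled
    kept = set()
    for i in range(len(seq) - ksize + 1):
        kmer = seq[i:i + ksize].encode()
        h = int.from_bytes(hashlib.blake2b(kmer, digest_size=8).digest(), "big")
        if h < threshold:
            kept.add(h)
    return kept


def sketch_singletons(lines: Iterable[str], ksize: int = 21,
                      scaled: int = 1000) -> Dict[str, Set[int]]:
    """Build one toy sketch per FASTA record, as --singleton does per sequence."""
    return {name: toy_sketch(seq, ksize, scaled)
            for name, seq in read_fasta(lines)}
```

For example, `sketch_singletons([">a", "ACGT" * 10], ksize=5, scaled=1)` returns a dict with a single key `"a"`. With many small sequences, the per-record work in the outer loop (parsing, object creation, bookkeeping) is repeated once per sequence, which is the kind of overhead the comment above is speculating about.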

On a mostly unrelated note, the sig.zip file is larger than the FASTA file. So that sucks.

@ctb
Contributor Author

ctb commented Jul 14, 2024

and on a further somewhat unrelated note, fastgather took even less time than sketching.

@ctb
Contributor Author

ctb commented Jul 14, 2024

and even more so, to add a sketch it is faster to

  • add a sequence to the fa gz file
  • rerun manysketch

than it is to run sig cat to combine the old database with a new sketch 😭

@lxsteiner

Hi @ctb
How does this compare with #2537 ?

In case of a multifasta file, to work with manysketch, we'd first have to break down the entries into individual files, and compile a .csv list for the input, or what would the recommendation be?
Thanks.

@ctb
Contributor Author

ctb commented Sep 27, 2024

Try using --singleton - that should do what you want :). See: link
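For the multi-FASTA case asked about above, a rough sketch of building the manysketch input: with `--singleton`, a one-row CSV pointing at the whole file should be enough, with no need to split entries into separate files. The column header `name,genome_filename,protein_filename` is an assumption based on the branchwater plugin's documented CSV layout, so check it against the installed version; `write_manysketch_csv` is a hypothetical helper, not part of sourmash.

```python
import csv


def write_manysketch_csv(csv_path: str, name: str, fasta_path: str) -> None:
    """Write a one-row manysketch input CSV pointing at a single FASTA file.

    Assumption: the header below matches what the installed
    sourmash_plugin_branchwater version expects.
    """
    with open(csv_path, "w", newline="") as fp:
        writer = csv.writer(fp)
        writer.writerow(["name", "genome_filename", "protein_filename"])
        # DNA input only, so the protein column is left empty.
        writer.writerow([name, fasta_path, ""])


if __name__ == "__main__":
    write_manysketch_csv(
        "C-RVDBvCurrent.manysketch.csv",
        "C-RVDBvCurrent",
        "C-RVDBvCurrent.fasta.gz",
    )
```

The resulting CSV can then be passed to the `sourmash scripts manysketch ... --singleton` command shown at the top of this issue.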

@ctb
Contributor Author

ctb commented Nov 11, 2024

singlesketch in sourmash_plugin_branchwater is even faster now ;)
