Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

utility: sourmash-uniqify, an iterative greedy clustering of sourmash signatures, mark 2 #1265

Open
ctb opened this issue Dec 28, 2020 · 6 comments
Labels
plugin_todo Write a plugin for this!

Comments

@ctb
Copy link
Contributor

ctb commented Dec 28, 2020

a better? or at least more understandable? version of #333

see gist at https://gist.github.com/ctb/85fe2efb23b70bbd72ea6a69750d1284

the README from the gist: iterative greedy clustering of sourmash signatures

ref sourmash

please post questions and comments and thoughts over at sourmash#1265

inspired by this sourmash issue: #1251

note: this requires sourmash 3.5.x or sourmash 4.x.

obtaining this script

  1. click on the 'raw' link below for sourmash-uniqify.

  2. you can grab the script directly from the repo:

wget https://gist.github.com/ctb/85fe2efb23b70bbd72ea6a69750d1284/raw/5953c1025bab0c23e7a06f316fc17dbcec299720/sourmash-uniqify.py

- but this may not get the latest version...

  1. perhaps better, you can clone the repository: recommended
git clone https://gist.github.com/ctb/85fe2efb23b70bbd72ea6a69750d1284
  1. and/or or you can download the repository as a zip, but again, this may not get the latest version.
wget https://gist.github.com/ctb/85fe2efb23b70bbd72ea6a69750d1284/archive/5953c1025bab0c23e7a06f316fc17dbcec299720.zip

usage:

python3 sourmash-uniqify.py [ multiple filenames, directory paths, or sourmash databases ]

Optional arguments:

  • --prefix - output prefix, e.g. --prefix=output_dir/clustme will put all output in output_dir/clustme.*
  • --threshold - Jaccard similarity threshold above which two signatures will be clustered. default 0.2

sourmash-uniqify shuffles the sequences using a random number generator seeded with --seed. by default, this is fixed at 1, so unless you change the seed you will always get the same output for the same input arguments. set --seed to 0 to change the seed each time.

you can specify -k/--ksize and --moltype to load specific types of signatures; default is k=31, DNA.

note, all signatures loaded must be compatible for comparison in terms of scaled/num; right now there is no way to pick just the scaled signatures, or just the num signatures.

output:

output of this script consists of a founder file for each cluster, a cluster
file for each non-singleton cluster, and a summary CSV.

summary CSV

the summary CSV output at '{prefix}.summary.csv' contains the
following columns:

  • 'origin_path' - the file or path from which the signature was loaded (corresponds to command-line arguments)
  • 'name' - signature name, as generated by sourmash sketch. may be duplicate.
  • 'filename' - filename of source sequences, as generated by sourmash sketch. may be duplicate.
  • 'md5sum' - ~unique hexadecimal string for each signature. should be unique.
  • 'cluster' - cluster number (integer).
  • 'member_type' - either 'founder' or 'member'. If 'founder', this is the signature to which all others were clustered; if 'member', this is a duplicate signature to the founder.

founder file

for each cluster, there is a founder signature output with the name
'{prefix}.cluster.{n}.founder.sig'. This is the founder signature for this
cluster, to which all members of this cluster match with similarity >= threshold (as specified with --threshold).

cluster file

for each cluster with more than one signature in it, there will be a
cluster file containing 1 or more signatures, under the name
'{prefix}.cluster.{n}.cluster.sig'.

These are the signatures who compare to the founder signature with
similarity >= threshold (as specified with --threshold).

using this output to cluster actual genomes

In brief, you should be able to:

  • take a directory with a bunch of genome files
  • run sourmash sketch dna * to turn them into sourmash signatures.
  • make a directory for the output: mkdir output
  • then run this script, sourmash-uniqify.py, on the signatures in the directory: sourmash-uniqify.py *.sig --prefix=output/myproj
  • examine output/myproj.summary.csv

To select out the "unique" set of representative genomes, then do:

mkdir unique
ln $(grep ',founder' output/myproj.summary.csv | cut -d, -f2) unique/

et voila!

other notes

this uses an n-squared algorithm so you probably don't want to run this on more than a few thousand signatures.

it's also kinda brute-force dumb, for simplicity of prototype implementation.

file an issue in the sourmash tracker if you'd like it to scale better.

@ctb
Copy link
Contributor Author

ctb commented Dec 28, 2020

this could be updated and made more generic by using pairwise evolutionary distance instead. cc @bluegenes

@ctb
Copy link
Contributor Author

ctb commented Dec 28, 2020

some random explanations:

  • we cannot guarantee that there is a way to cluster genome into natural clusters in a robust and reproducible way
  • specifically, imagine genomes a, b, and c, where a ~ b and b ~ c but a ~ c (with ~ being whatever pairwise similarity metric you want to use)
  • should this be a cluster? maybe, maybe not. it depends!
  • this script basically randomly picks a founder.
  • if the founder is a, you end up with a cluster a, b and a singleton c
  • if the founder is b, you end up with a cluster a, b, c
  • if the founder is c, you end up with a cluster c, b and a singleton a

this resolves a confounding issue I had with the sourmash clustering approach in #333 where I couldn't really figure out what the cutoff meant 😄 , by making the cutoff mean something obvious (here, Jaccard similarity, or in the future something like ANI or whatever) and then doing the random founder picking with a fixed random number seed.

@ctb
Copy link
Contributor Author

ctb commented Dec 28, 2020

the simplest options for scaling are to -

  • compute the distances only once, which will use more memory but should speed things up
  • do the similarity computation in rust
  • do a pre-clustering at a very low threshold and then parallelize finder-grained clustering

it's fundamentally an n**2 algorithm tho ;)

@ctb
Copy link
Contributor Author

ctb commented Dec 28, 2020

some more reasonably obvious things to say -

  • could easily provide a --scaffold option that would take in a set of (presumed) singletons, and prefilter against those
  • this would allow iteration of large collections of signatures without n**2 behavior...

@ctb
Copy link
Contributor Author

ctb commented Jan 7, 2021

@ctb
Copy link
Contributor Author

ctb commented Jan 27, 2022

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
plugin_todo Write a plugin for this!
Projects
None yet
Development

No branches or pull requests

1 participant