utility: sourmash-uniqify, an iterative greedy clustering of sourmash signatures, mark 2 #1265

ctb · 2020-12-28T16:06:01Z

a better? or at least more understandable? version of #333

see gist at https://gist.github.com/ctb/85fe2efb23b70bbd72ea6a69750d1284

the README from the gist: iterative greedy clustering of sourmash signatures

ref sourmash

please post questions and comments and thoughts over at sourmash#1265

inspired by this sourmash issue: #1251

note: this requires sourmash 3.5.x or sourmash 4.x.

obtaining this script

click on the 'raw' link below for sourmash-uniqify.
you can grab the script directly from the repo:

wget https://gist.github.com/ctb/85fe2efb23b70bbd72ea6a69750d1284/raw/5953c1025bab0c23e7a06f316fc17dbcec299720/sourmash-uniqify.py

- but this may not get the latest version...

perhaps better, you can clone the repository: recommended

git clone https://gist.github.com/ctb/85fe2efb23b70bbd72ea6a69750d1284

and/or or you can download the repository as a zip, but again, this may not get the latest version.

wget https://gist.github.com/ctb/85fe2efb23b70bbd72ea6a69750d1284/archive/5953c1025bab0c23e7a06f316fc17dbcec299720.zip

usage:

python3 sourmash-uniqify.py [ multiple filenames, directory paths, or sourmash databases ]

Optional arguments:

--prefix - output prefix, e.g. --prefix=output_dir/clustme will put all output in output_dir/clustme.*
--threshold - Jaccard similarity threshold above which two signatures will be clustered. default 0.2

sourmash-uniqify shuffles the sequences using a random number generator seeded with --seed. by default, this is fixed at 1, so unless you change the seed you will always get the same output for the same input arguments. set --seed to 0 to change the seed each time.

you can specify -k/--ksize and --moltype to load specific types of signatures; default is k=31, DNA.

note, all signatures loaded must be compatible for comparison in terms of scaled/num; right now there is no way to pick just the scaled signatures, or just the num signatures.

output:

output of this script consists of a founder file for each cluster, a cluster
file for each non-singleton cluster, and a summary CSV.

summary CSV

the summary CSV output at '{prefix}.summary.csv' contains the
following columns:

'origin_path' - the file or path from which the signature was loaded (corresponds to command-line arguments)
'name' - signature name, as generated by sourmash sketch. may be duplicate.
'filename' - filename of source sequences, as generated by sourmash sketch. may be duplicate.
'md5sum' - ~unique hexadecimal string for each signature. should be unique.
'cluster' - cluster number (integer).
'member_type' - either 'founder' or 'member'. If 'founder', this is the signature to which all others were clustered; if 'member', this is a duplicate signature to the founder.

founder file

for each cluster, there is a founder signature output with the name
'{prefix}.cluster.{n}.founder.sig'. This is the founder signature for this
cluster, to which all members of this cluster match with similarity >= threshold (as specified with --threshold).

cluster file

for each cluster with more than one signature in it, there will be a
cluster file containing 1 or more signatures, under the name
'{prefix}.cluster.{n}.cluster.sig'.

These are the signatures who compare to the founder signature with
similarity >= threshold (as specified with --threshold).

using this output to cluster actual genomes

In brief, you should be able to:

take a directory with a bunch of genome files
run sourmash sketch dna * to turn them into sourmash signatures.
make a directory for the output: mkdir output
then run this script, sourmash-uniqify.py, on the signatures in the directory: sourmash-uniqify.py *.sig --prefix=output/myproj
examine output/myproj.summary.csv

To select out the "unique" set of representative genomes, then do:

mkdir unique
ln $(grep ',founder' output/myproj.summary.csv | cut -d, -f2) unique/

et voila!

other notes

this uses an n-squared algorithm so you probably don't want to run this on more than a few thousand signatures.

it's also kinda brute-force dumb, for simplicity of prototype implementation.

file an issue in the sourmash tracker if you'd like it to scale better.

The text was updated successfully, but these errors were encountered:

ctb · 2020-12-28T16:07:30Z

this could be updated and made more generic by using pairwise evolutionary distance instead. cc @bluegenes

ctb · 2020-12-28T16:39:36Z

some random explanations:

we cannot guarantee that there is a way to cluster genome into natural clusters in a robust and reproducible way
specifically, imagine genomes a, b, and c, where a ~ b and b ~ c but a ~ c (with ~ being whatever pairwise similarity metric you want to use)
should this be a cluster? maybe, maybe not. it depends!
this script basically randomly picks a founder.
if the founder is a, you end up with a cluster a, b and a singleton c
if the founder is b, you end up with a cluster a, b, c
if the founder is c, you end up with a cluster c, b and a singleton a

this resolves a confounding issue I had with the sourmash clustering approach in #333 where I couldn't really figure out what the cutoff meant 😄 , by making the cutoff mean something obvious (here, Jaccard similarity, or in the future something like ANI or whatever) and then doing the random founder picking with a fixed random number seed.

ctb · 2020-12-28T16:41:09Z

the simplest options for scaling are to -

compute the distances only once, which will use more memory but should speed things up
do the similarity computation in rust
do a pre-clustering at a very low threshold and then parallelize finder-grained clustering

it's fundamentally an n**2 algorithm tho ;)

ctb · 2020-12-28T19:47:16Z

some more reasonably obvious things to say -

could easily provide a --scaffold option that would take in a set of (presumed) singletons, and prefilter against those
this would allow iteration of large collections of signatures without n**2 behavior...

ctb · 2021-01-07T19:43:42Z

dRep seems relevant ;) https://twitter.com/MattagenOlmics/status/1347260217516453888

ctb · 2022-01-27T05:15:15Z

moved here: https://github.com/ctb/2022-sourmash-uniqify

This was referenced Jan 10, 2021

add 'cluster' and 'cocluster' to sourmash #700

Open

Identifying groups of samples for co-assembly #1212

Closed

ctb mentioned this issue Feb 2, 2022

adding sourmash cluster - some specific thoughts #1814

Open

ctb mentioned this issue Jul 31, 2023

sourmash plugins - ideas dumping ground #2453

Open

ctb added the plugin_todo Write a plugin for this! label Sep 23, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

utility: sourmash-uniqify, an iterative greedy clustering of sourmash signatures, mark 2 #1265

utility: sourmash-uniqify, an iterative greedy clustering of sourmash signatures, mark 2 #1265

ctb commented Dec 28, 2020 •

edited

Loading

ctb commented Dec 28, 2020

ctb commented Dec 28, 2020

ctb commented Dec 28, 2020

ctb commented Dec 28, 2020

ctb commented Jan 7, 2021

ctb commented Jan 27, 2022

utility: sourmash-uniqify, an iterative greedy clustering of sourmash signatures, mark 2 #1265

utility: sourmash-uniqify, an iterative greedy clustering of sourmash signatures, mark 2 #1265

Comments

ctb commented Dec 28, 2020 • edited Loading

the README from the gist: iterative greedy clustering of sourmash signatures

obtaining this script

usage:

output:

summary CSV

founder file

cluster file

using this output to cluster actual genomes

other notes

ctb commented Dec 28, 2020

ctb commented Dec 28, 2020

ctb commented Dec 28, 2020

ctb commented Dec 28, 2020

ctb commented Jan 7, 2021

ctb commented Jan 27, 2022

ctb commented Dec 28, 2020 •

edited

Loading