utility: sourmash-uniqify, an iterative greedy clustering of sourmash signatures, mark 2 #1265
this could be updated and made more generic by using pairwise evolutionary distance instead. cc @bluegenes
some random explanations:
this resolves a confounding issue I had with the sourmash clustering approach in #333, where I couldn't really figure out what the cutoff meant 😄, by making the cutoff mean something obvious (here, Jaccard similarity, or in the future something like ANI or whatever) and then doing the random founder picking with a fixed random number seed.
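A minimal sketch of the idea, assuming each signature can be stood in for by a plain Python set of hashes. This is not the script's actual code: `jaccard` and `greedy_cluster` are hypothetical names, and the real script works on sourmash signatures rather than sets.

```python
import random


def jaccard(a, b):
    """Jaccard similarity of two sets: |A & B| / |A | B|."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def greedy_cluster(items, threshold, seed=1):
    """Greedy clustering sketch (hypothetical helper, not the script's API).

    items: dict mapping a name to a set of hashes.
    Returns a list of (founder, members) tuples; members excludes the founder.
    """
    names = sorted(items)                # deterministic starting order
    random.Random(seed).shuffle(names)   # fixed seed -> reproducible founders
    clusters = []
    while names:
        founder = names.pop(0)           # next unclustered item becomes a founder
        members, rest = [], []
        for name in names:
            if jaccard(items[founder], items[name]) >= threshold:
                members.append(name)     # close enough: joins this cluster
            else:
                rest.append(name)        # stays in the pool for later rounds
        clusters.append((founder, members))
        names = rest
    return clusters
```

With this framing the cutoff has a direct meaning: every member of a cluster has Jaccard similarity >= threshold to that cluster's founder.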
the simplest options for scaling are to -
it's fundamentally an n**2 algorithm tho ;)
some more reasonably obvious things to say -
dRep seems relevant ;) https://twitter.com/MattagenOlmics/status/1347260217516453888
a better? or at least more understandable? version of #333
see gist at https://gist.github.com/ctb/85fe2efb23b70bbd72ea6a69750d1284
the README from the gist: iterative greedy clustering of sourmash signatures
ref sourmash
please post questions and comments and thoughts over at sourmash#1265
inspired by this sourmash issue: #1251
note: this requires sourmash 3.5.x or sourmash 4.x.
obtaining this script
click on the 'raw' link below for sourmash-uniqify.
you can grab the script directly from the repo:
- but this may not get the latest version...
usage:
Optional arguments:
--prefix=output_dir/clustme will put all output in output_dir/clustme.*
sourmash-uniqify shuffles the sequences using a random number generator seeded with --seed. by default, this is fixed at 1, so unless you change the seed you will always get the same output for the same input arguments. set --seed to 0 to change the seed each time.
you can specify -k/--ksize and --moltype to load specific types of signatures; default is k=31, DNA.
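The --seed behaviour can be illustrated with Python's standard `random` module. This is only a sketch under the assumption that seed 0 is mapped to "seed from system entropy"; the helper name `shuffled` is hypothetical and the script may implement this differently.

```python
import random


def shuffled(names, seed=1):
    """Return a shuffled copy of names (hypothetical helper).

    A non-zero seed gives the same order on every run; seed=0 is treated
    here as "pick a fresh seed each time" (an assumption about the script).
    """
    rng = random.Random(seed if seed != 0 else None)  # None -> system entropy
    out = list(names)
    rng.shuffle(out)
    return out
```

Calling `shuffled(names, seed=1)` twice yields the same order, which is why the same input arguments give the same clusters.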
note, all signatures loaded must be compatible for comparison in terms of scaled/num; right now there is no way to pick just the scaled signatures, or just the num signatures.
output:
output of this script consists of a founder file for each cluster, a cluster
file for each non-singleton cluster, and a summary CSV.
summary CSV
the summary CSV output at '{prefix}.summary.csv' contains the
following columns:
- [...] sourmash sketch. may be duplicate.
- [...] sourmash sketch. may be duplicate.
founder file
for each cluster, there is a founder signature output with the name
'{prefix}.cluster.{n}.founder.sig'. This is the founder signature for this
cluster, to which all members of this cluster match with similarity >= threshold (as specified with --threshold).
cluster file
for each cluster with more than one signature in it, there will be a
cluster file containing 1 or more signatures, under the name
'{prefix}.cluster.{n}.cluster.sig'.
These are the signatures that match the founder signature with
similarity >= threshold (as specified with --threshold).
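The naming scheme above can be summarized in a few lines of Python. This is a sketch only: `output_names` and `summary_name` are hypothetical helpers, and the cluster number `n` is passed in because the numbering the script uses is not stated here.

```python
def summary_name(prefix):
    """Path of the summary CSV for a given --prefix."""
    return f"{prefix}.summary.csv"


def output_names(prefix, n, cluster_size):
    """Files expected for cluster n (hypothetical helper).

    cluster_size counts the founder too, so a singleton has size 1
    and gets a founder file but no cluster file.
    """
    names = [f"{prefix}.cluster.{n}.founder.sig"]
    if cluster_size > 1:
        names.append(f"{prefix}.cluster.{n}.cluster.sig")
    return names
```

For example, with `--prefix=output/myproj`, a cluster of three signatures would have both a founder file and a cluster file, while a singleton would have only the founder file.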
using this output to cluster actual genomes
In brief, you should be able to:
- run sourmash sketch dna * on your genomes to turn them into sourmash signatures.
- mkdir output
- run sourmash-uniqify.py on the signatures in the directory:
  sourmash-uniqify.py *.sig --prefix=output/myproj
- look at the summary CSV in output/myproj.summary.csv
To select out the "unique" set of representative genomes, then do:
et voila!
other notes
this uses an n-squared algorithm so you probably don't want to run this on more than a few thousand signatures.
it's also kinda brute-force dumb, for simplicity of prototype implementation.
file an issue in the sourmash tracker if you'd like it to scale better.
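To get a feel for the n-squared cost, here is a rough worst-case count, assuming every signature ends up as its own founder so each founder is compared against everything still unclustered. `max_comparisons` is a hypothetical helper, not part of the script.

```python
def max_comparisons(n):
    """Worst-case number of pairwise comparisons for n signatures.

    If nothing clusters, the first founder is compared with n-1 others,
    the next with n-2, and so on: n * (n - 1) / 2 in total.
    """
    return n * (n - 1) // 2


for n in (100, 1000, 5000):
    print(n, max_comparisons(n))
```

At 1,000 signatures that is 499,500 comparisons, and at 5,000 it is already about 12.5 million, so the count grows quickly once you go past a few thousand.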