
Adapt the kSpider's algorithm in pairwise comparisons #219

Open
mr-eyes opened this issue Feb 17, 2024 · 6 comments

Comments

@mr-eyes
Member

mr-eyes commented Feb 17, 2024

I plan to implement kSpider's algorithm in Rust, utilizing the RocksDB inverted index. I wanted to ask whether it should be another sub-command of the branchwater `pairwise`, a separate command, or perhaps a different repo (i.e., the original kSpider project).

@mr-eyes
Member Author

mr-eyes commented Feb 17, 2024

The kSpider algorithm is particularly good at comparing large, sparse hash-sets.

@ctb
Collaborator

ctb commented Feb 17, 2024

A few hot takes:

  • I think in this repo (branchwater plugin) is fine, as long as it is clearly documented and tested 😁
  • if it takes unique/restricted/limited database types, e.g. ONLY a rocksdb index, I would suggest a new command
  • if it behaves EXACTLY like pairwise and can autodetect a rocksdb, then leave it as pairwise
  • note that the rocksdb database format has been evolving a bit, e.g. between v0.8.6 and v0.9.0 of this plugin it changed. That may be because of the update to sourmash core.

@bluegenes
Contributor

bluegenes commented Feb 22, 2024

hey @mr-eyes -- I already had this draft in progress based on our discussions in January, just hadn't committed/pushed my changes yet. Would love to have your contributions/insight/clustering expertise/etc!

I haven't had a chance to verify clusters are created correctly yet, just wanted to get this up here to avoid you re-doing any of the work I already put in.

#234

@mr-eyes
Member Author

mr-eyes commented Feb 22, 2024

Hi @bluegenes, I would absolutely love to contribute to this as needed. I want to clarify that #234 is not what I mean by kSpider's algorithm. The kSpider algorithm utilizes the index to generate a sparse pairwise matrix rather than the brute-force comparisons done in #181. This is expected to work well when comparing highly diverse/sparse data.
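For readers unfamiliar with the approach, here is a minimal std-only sketch of the inverted-index idea (illustrative only; this is not kSpider's actual code, and all function and variable names here are made up): walk a hash → sample-ids index once and count shared hashes only for pairs that actually co-occur, so the resulting matrix stays sparse.

```rust
use std::collections::HashMap;

// Hypothetical sketch of inverted-index pairwise counting (not kSpider's
// actual code). `samples[i]` is the hash-set of sample i.
fn sparse_pairwise(samples: &[Vec<u64>]) -> HashMap<(usize, usize), usize> {
    // Inverted index: hash -> ids of the samples containing it.
    let mut index: HashMap<u64, Vec<usize>> = HashMap::new();
    for (id, hashes) in samples.iter().enumerate() {
        for &h in hashes {
            index.entry(h).or_default().push(id);
        }
    }
    // One pass over the index: only co-occurring pairs ever get an entry,
    // so the pairwise matrix stays sparse for diverse data.
    let mut shared: HashMap<(usize, usize), usize> = HashMap::new();
    for ids in index.values() {
        for i in 0..ids.len() {
            for j in (i + 1)..ids.len() {
                *shared.entry((ids[i], ids[j])).or_insert(0) += 1;
            }
        }
    }
    shared
}
```

Shared-hash counts per pair can then be turned into containment/Jaccard estimates; pairs with zero overlap never appear in the output at all.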

@bluegenes
Contributor

Whoops - completely misread that somehow. @ctb's comments make a lot more sense now 😂. Sparse pairwise would be great too!

In any case, would love to have your input on the cluster PR :). I'll tag you for review when I think it's ready.

@ctb
Collaborator

ctb commented Feb 26, 2024

ref sourmash-bio/sourmash#2271

bluegenes added a commit that referenced this issue Feb 27, 2024
This PR adds a new command, `cluster`, that can be used to cluster the output from `pairwise` and `multisearch`.

`cluster` uses `rustworkx-core` (which internally uses `petgraph`) to build a graph, adding edges between nodes when the similarity exceeds the user-defined threshold. It can work on any of the similarity columns output by `pairwise` or `multisearch`, and will add all nodes to the graph to preserve singleton 'clusters' in the output.
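The node/edge logic can be shown with a std-only sketch. The PR itself builds the graph with `rustworkx-core`/`petgraph` and uses `connected_components`; here a plain union-find stands in for that (all identifiers below are illustrative, not the PR's code). Note that every name is registered as a node before edges are merged, which is what preserves singleton clusters.

```rust
use std::collections::HashMap;

// Union-find `find` with path compression.
fn find(parent: &mut [usize], x: usize) -> usize {
    let p = parent[x];
    if p == x {
        x
    } else {
        let root = find(parent, p);
        parent[x] = root;
        root
    }
}

// Illustrative stand-in for the connected-components step: merge nodes whose
// edge similarity meets the threshold, then group names by their root.
fn cluster(names: &[&str], edges: &[(usize, usize, f64)], threshold: f64) -> Vec<Vec<String>> {
    // Every name becomes a node up front, so singletons survive.
    let mut parent: Vec<usize> = (0..names.len()).collect();
    for &(a, b, sim) in edges {
        if sim >= threshold {
            let (ra, rb) = (find(&mut parent, a), find(&mut parent, b));
            if ra != rb {
                parent[ra] = rb;
            }
        }
    }
    let mut groups: HashMap<usize, Vec<String>> = HashMap::new();
    for (i, name) in names.iter().enumerate() {
        let root = find(&mut parent, i);
        groups.entry(root).or_default().push(name.to_string());
    }
    groups.into_values().collect()
}
```

Each returned group corresponds to one `Component_X` line in the cluster identities file; groups of size one are the preserved singletons.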

`cluster` outputs two files:
1. a cluster identities file: `Component_X, name1;name2;name3...`
2. a cluster size histogram: `cluster_size, count`

context for some things I tried:
- try using petgraph directly and removing rustworkx dependency
> nope, `rustworkx-core` adds `connected_components`, which returns the connected components themselves rather than just their number. Could reimplement if `rustworkx-core` brings in a lot of deps
- try using 'extend_with_edges' instead of add_edge logic.
> nope, only in `petgraph`

**Punted Issues:**
- develop clustering visualizations (ref @mr-eyes kSpider/dbretina work). Optionally output dot file of graph? (#248)
- enable updating clusters, rather than always regenerating from scratch (#249)
- benchmark `cluster` (#247)
>  `pairwise` files can be millions of lines long. Would it be faster to parallel-read them, store them in an `edges` vector, and then add nodes/edges sequentially? Note that we would probably need to either 1. store all edges (including those that do not pass the threshold), or 2. after building the graph from edges, add nodes from `names_to_node` that are not already in the graph to preserve singletons.
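The two-phase idea could look roughly like this (a hypothetical sketch: the `name1,name2,similarity` row format and all names are assumptions, and the parallel read itself is elided). Phase 1 parses rows into an edge vector while interning every name it sees, so phase 2 can add nodes and edges sequentially without losing singletons.

```rust
use std::collections::HashMap;

// Intern a sample name, assigning a dense node id on first sight.
fn intern(name: &str, ids: &mut HashMap<String, usize>, names: &mut Vec<String>) -> usize {
    if let Some(&id) = ids.get(name) {
        return id;
    }
    let id = names.len();
    names.push(name.to_string());
    ids.insert(name.to_string(), id);
    id
}

// Phase 1: collect edges. Every name is registered even when its edge fails
// the threshold, so singletons survive when the graph is built in phase 2.
fn parse_edges(rows: &[&str], threshold: f64) -> (Vec<String>, Vec<(usize, usize)>) {
    let mut ids = HashMap::new();
    let mut names = Vec::new();
    let mut edges = Vec::new();
    for row in rows {
        // Assumed row format: "name1,name2,similarity".
        let mut cols = row.split(',');
        let (a, b, sim) = (
            cols.next().unwrap(),
            cols.next().unwrap(),
            cols.next().unwrap(),
        );
        let (ia, ib) = (intern(a, &mut ids, &mut names), intern(b, &mut ids, &mut names));
        if sim.parse::<f64>().unwrap() >= threshold {
            edges.push((ia, ib));
        }
    }
    (names, edges)
}
```

With this split, the per-row parsing is embarrassingly parallel, and the sequential phase 2 only touches the (much smaller) filtered edge list plus the full name table.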


Related issues:

* #219
* sourmash-bio/sourmash#2271
* sourmash-bio/sourmash#700
* sourmash-bio/sourmash#225
* sourmash-bio/sourmash#274


---------

Co-authored-by: C. Titus Brown <[email protected]>