[WIP] SBT scaffold #1201

luizirber · 2020-09-24T04:39:28Z

Add a new function scaffold that takes a list of signatures and build an SBT clustered by shared hashes.

At the moment still uses the same amount of memory, but building by levels (from bottom up) allows saving each node as they are built, and then unload them as the next level is built.

This replaces the tree.insert(sig) approach in sourmash index, with a fallback to the regular insertion code if --append is set.

TODO

set up storage properly, so we can start saving leaves (and later internal levels) as they are built
more benchmarks
Test more corner cases
more comments about how it works (HowDe SBT does distances-with-maxheap too)
Use an HLL to count unique k-mers when loading sigs, and then set the GraphFactory with it

Checklist

Is it mergeable?
make test Did it pass the tests?
make coverage Is the new code covered?
Did it change the command-line interface? Only additions are allowed
without a major version increment. Changing file formats also requires a
major version number increment.
Was a spellchecker run on the source code and documentation after
changes were made?

codecov · 2020-09-24T04:44:51Z

Codecov Report

Merging #1201 (a3cbdd4) into latest (95bd546) will decrease coverage by 40.55%.
The diff coverage is 19.94%.

@@             Coverage Diff             @@
##           latest    #1201       +/-   ##
===========================================
- Coverage   83.31%   42.75%   -40.56%     
===========================================
  Files         103      103               
  Lines        9601     9851      +250     
===========================================
- Hits         7999     4212     -3787     
- Misses       1602     5639     +4037

Flag	Coverage Δ
rusttests	`67.51% <8.33%> (-0.04%)`	⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files	Coverage Δ
sourmash/cli/index.py	`21.05% <ø> (-78.95%)`	⬇️
sourmash/commands.py	`4.39% <0.00%> (-83.97%)`	⬇️
sourmash/sbt_storage.py	`45.19% <0.00%> (-46.33%)`	⬇️
src/core/src/ffi/nodegraph.rs	`0.00% <0.00%> (ø)`
src/core/src/sketch/nodegraph.rs	`84.41% <0.00%> (+0.54%)`	⬆️
sourmash/sbt.py	`36.83% <17.34%> (-52.18%)`	⬇️
sourmash/nodegraph.py	`74.28% <18.18%> (-19.36%)`	⬇️
sourmash/logging.py	`83.47% <50.00%> (-12.57%)`	⬇️
sourmash/sbtmh.py	`50.81% <52.17%> (-38.90%)`	⬇️
src/core/src/sketch/hyperloglog/estimators.rs	`93.47% <100.00%> (ø)`
... and 72 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 95bd546...a3cbdd4. Read the comment docs.

olgabot · 2020-09-24T04:57:16Z

Very interesting! Could this be used in #925 (sorry this PR has been taking a while)

luizirber · 2020-09-24T14:53:27Z

(it was getting late last night, talking more now =])

Very interesting! Could this be used in #925 (sorry this PR has been taking a while)

I think it can be used indirectly, but they are very different (and complementary) approaches. scaffold is an offline (batch) method, it needs all the data available beforehand. The current SBT.insert approach is online (streaming), because we can add data one at a time.

In the sourmash index case we do have all the data, so might as well use it. But building SBTs (and other indices) as new data is available is very useful for exploratory research, and also for updating previously existing SBTs (the scaffold approach would need to redo all the calculations).

So, I think this PR is useful for #925 as a baseline for comparison, but the bottom-up building and "calculate all pairwise distances for each level" approach are not going to work over there...

luizirber · 2020-10-12T15:15:01Z

(This PR now includes a HLL implementation. whoops.)

luizirber · 2020-10-22T15:48:20Z

Linking back to #545, since this PR implements part of the HowDe SBT approach (the clustering), but is still missing the HowDe-like nodes (two compressed bitvectors instead of a Nodegraph)

luizirber · 2020-10-30T17:29:41Z

Some intriguing results: in #1221 (comment) I reported running times and memory consumption for the bitmagic-based Nodegraph. Here is an updated table with the latest changes in this PR:

	Runtime (s)	Memory (MB)	Index size (MB)
original	7.36	7,367	233
bitmagic	24.94	334	241
#1137 + #1138 + bitmagic	6.51	358	241
#1021	12.95	195	134

The clustering of the SBT is helping a lot with index size, and also with memory consumption (since less nodes have to be checked). I'm a bit surprised with the runtime increase, but the machine where I'm running tests is a bit overworked at the moment, so it might be related.

The main issue? I'm seeing 22 matches, instead of 21 like before. So I might have found a bug with the current SBT? 😨
I'll run gather with a LinearIndex built from the same sigs and see what the ground truth should be... (but, damning evidence, I built the query from 22 signatures, so it is probably a bug with current SBTs 😞 )

Update: ground truth is 22 matches, so current SBT has a bug.

Implement a HyperLogLog sketch based on the `khmer` implementation but using the estimator from ["New cardinality estimation algorithms for HyperLogLog sketches"](http://oertl.github.io/hyperloglog-sketch-estimation-paper/paper/paper.pdf) (also implemented in `dashing`). This PR also moves `add_sequence` and `add_protein` to `SigsTrait`, closing #1057. The encoding data and methods (`hp`, `dayhoff`, `aa` and `HashFunctions`) was in the MinHash source file, and since it is more general-purpose it was moved to a new module `encodings`, which is then used by `SigsTrait`. (these changes are both spun off #1201)

broken impl with tests want union, not intersection; one-level rotation for subtrees use it in index CLI, fix bug for single-leaf SBT add a HLL impl, move alphabet stuff to encodings hll ffi expose cardinality, add_hash and add_sequence new ertl ml estimator for HLL implemented joint mle reorganize hll and estimators fix rust checks and add python tests for hll working on fixing scaffolding issues new hll methods working on partial saving add trace to logging

luizirber · 2023-02-13T04:11:19Z

This is still useful, but I'm mostly not working on SBTs anymore... Happy to discuss if anyone wants to pick it up, but closing for now.

luizirber force-pushed the scaffold_py branch from c02451a to 63ac76d Compare October 12, 2020 15:14

luizirber force-pushed the scaffold_py branch 5 times, most recently from 5522636 to a572578 Compare October 12, 2020 15:57

This was referenced Oct 22, 2020

Move add_sequence and add_protein to SigsTrait #1057

Closed

Expose a function to produce a list of all hashes from a sequence #1214

Closed

luizirber mentioned this pull request Oct 27, 2020

[WIP] Using bitmagic for internal Nodegraph representation #1221

Closed

5 tasks

luizirber force-pushed the scaffold_py branch from e9df14e to e3d260e Compare October 28, 2020 03:01

This was referenced Oct 28, 2020

[MRG] Add a HLL implementation #1223

Merged

[MRG] Add a tracing function to logging #1224

Closed

luizirber force-pushed the scaffold_py branch 3 times, most recently from 0b26611 to 2709d82 Compare October 30, 2020 17:22

luizirber force-pushed the scaffold_py branch from 02e907a to e21da04 Compare October 31, 2020 22:00

luizirber force-pushed the scaffold_py branch 2 times, most recently from 935d219 to 218ae49 Compare November 1, 2020 15:53

luizirber added 5 commits November 8, 2020 20:14

trace

92a4017

use Nodegraphs for intersection

33c958f

build Nodegraph earlier

ed8ffa8

take factory instead of bf_size for scaffold

a6f72e6

luizirber added 3 commits November 8, 2020 20:14

wip split internal and leaf nodes in storages

3a65fb8

split storage setup to new method

2701a83

comments

a3cbdd4

luizirber force-pushed the scaffold_py branch from 218ae49 to a3cbdd4 Compare November 9, 2020 04:14

luizirber mentioned this pull request Feb 26, 2021

[MRG] implement a simple ZipFileLinearIndex class #1349

Merged

5 tasks

luizirber mentioned this pull request Mar 10, 2021

support/provide/require manifests for collections of signatures? #1352

Closed

This was referenced Apr 19, 2021

[MRG] Rework the find functionality for Index classes #1392

Merged

[MRG] Add MinHash.intersection(...) #1474

Merged

luizirber mentioned this pull request May 2, 2021

[MRG] refactor gather functionality for speed & modularity; provide prefetch functionality. #1370

Merged

16 tasks

luizirber closed this Feb 13, 2023

luizirber deleted the scaffold_py branch February 13, 2023 04:11

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[WIP] SBT scaffold #1201

[WIP] SBT scaffold #1201

luizirber commented Sep 24, 2020 •

edited

Loading

codecov bot commented Sep 24, 2020 •

edited

Loading

olgabot commented Sep 24, 2020

luizirber commented Sep 24, 2020

luizirber commented Oct 12, 2020 •

edited

Loading

luizirber commented Oct 22, 2020

luizirber commented Oct 30, 2020 •

edited

Loading

luizirber commented Feb 13, 2023

[WIP] SBT scaffold #1201

[WIP] SBT scaffold #1201

Conversation

luizirber commented Sep 24, 2020 • edited Loading

TODO

Checklist

codecov bot commented Sep 24, 2020 • edited Loading

Codecov Report

olgabot commented Sep 24, 2020

luizirber commented Sep 24, 2020

luizirber commented Oct 12, 2020 • edited Loading

luizirber commented Oct 22, 2020

luizirber commented Oct 30, 2020 • edited Loading

luizirber commented Feb 13, 2023

luizirber commented Sep 24, 2020 •

edited

Loading

codecov bot commented Sep 24, 2020 •

edited

Loading

luizirber commented Oct 12, 2020 •

edited

Loading

luizirber commented Oct 30, 2020 •

edited

Loading