Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[WIP] SBT scaffold #1201

Closed
wants to merge 8 commits into from
Closed

[WIP] SBT scaffold #1201

wants to merge 8 commits into from

Conversation

luizirber
Copy link
Member

@luizirber luizirber commented Sep 24, 2020

Add a new function scaffold that takes a list of signatures and build an SBT clustered by shared hashes.

At the moment still uses the same amount of memory, but building by levels (from bottom up) allows saving each node as they are built, and then unload them as the next level is built.

This replaces the tree.insert(sig) approach in sourmash index, with a fallback to the regular insertion code if --append is set.

TODO

  • set up storage properly, so we can start saving leaves (and later internal levels) as they are built
  • more benchmarks
  • Test more corner cases
  • more comments about how it works (HowDe SBT does distances-with-maxheap too)
  • Use an HLL to count unique k-mers when loading sigs, and then set the GraphFactory with it

Checklist

  • Is it mergeable?
  • make test Did it pass the tests?
  • make coverage Is the new code covered?
  • Did it change the command-line interface? Only additions are allowed
    without a major version increment. Changing file formats also requires a
    major version number increment.
  • Was a spellchecker run on the source code and documentation after
    changes were made?

@codecov
Copy link

codecov bot commented Sep 24, 2020

Codecov Report

Merging #1201 (a3cbdd4) into latest (95bd546) will decrease coverage by 40.55%.
The diff coverage is 19.94%.

Impacted file tree graph

@@             Coverage Diff             @@
##           latest    #1201       +/-   ##
===========================================
- Coverage   83.31%   42.75%   -40.56%     
===========================================
  Files         103      103               
  Lines        9601     9851      +250     
===========================================
- Hits         7999     4212     -3787     
- Misses       1602     5639     +4037     
Flag Coverage Δ
rusttests 67.51% <8.33%> (-0.04%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
sourmash/cli/index.py 21.05% <ø> (-78.95%) ⬇️
sourmash/commands.py 4.39% <0.00%> (-83.97%) ⬇️
sourmash/sbt_storage.py 45.19% <0.00%> (-46.33%) ⬇️
src/core/src/ffi/nodegraph.rs 0.00% <0.00%> (ø)
src/core/src/sketch/nodegraph.rs 84.41% <0.00%> (+0.54%) ⬆️
sourmash/sbt.py 36.83% <17.34%> (-52.18%) ⬇️
sourmash/nodegraph.py 74.28% <18.18%> (-19.36%) ⬇️
sourmash/logging.py 83.47% <50.00%> (-12.57%) ⬇️
sourmash/sbtmh.py 50.81% <52.17%> (-38.90%) ⬇️
src/core/src/sketch/hyperloglog/estimators.rs 93.47% <100.00%> (ø)
... and 72 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 95bd546...a3cbdd4. Read the comment docs.

@olgabot
Copy link
Collaborator

olgabot commented Sep 24, 2020

Very interesting! Could this be used in #925 (sorry this PR has been taking a while)

@luizirber
Copy link
Member Author

(it was getting late last night, talking more now =])

Very interesting! Could this be used in #925 (sorry this PR has been taking a while)

I think it can be used indirectly, but they are very different (and complementary) approaches. scaffold is an offline (batch) method, it needs all the data available beforehand. The current SBT.insert approach is online (streaming), because we can add data one at a time.

In the sourmash index case we do have all the data, so might as well use it. But building SBTs (and other indices) as new data is available is very useful for exploratory research, and also for updating previously existing SBTs (the scaffold approach would need to redo all the calculations).

So, I think this PR is useful for #925 as a baseline for comparison, but the bottom-up building and "calculate all pairwise distances for each level" approach are not going to work over there...

@luizirber
Copy link
Member Author

luizirber commented Oct 12, 2020

(This PR now includes a HLL implementation. whoops.)

@luizirber
Copy link
Member Author

Linking back to #545, since this PR implements part of the HowDe SBT approach (the clustering), but is still missing the HowDe-like nodes (two compressed bitvectors instead of a Nodegraph)

@luizirber
Copy link
Member Author

luizirber commented Oct 30, 2020

Some intriguing results: in #1221 (comment) I reported running times and memory consumption for the bitmagic-based Nodegraph. Here is an updated table with the latest changes in this PR:

Runtime (s) Memory (MB) Index size (MB)
original 7.36 7,367 233
bitmagic 24.94 334 241
#1137 + #1138 + bitmagic 6.51 358 241
#1021 12.95 195 134

The clustering of the SBT is helping a lot with index size, and also with memory consumption (since less nodes have to be checked). I'm a bit surprised with the runtime increase, but the machine where I'm running tests is a bit overworked at the moment, so it might be related.

The main issue? I'm seeing 22 matches, instead of 21 like before. So I might have found a bug with the current SBT? 😨
I'll run gather with a LinearIndex built from the same sigs and see what the ground truth should be... (but, damning evidence, I built the query from 22 signatures, so it is probably a bug with current SBTs 😞 )

Update: ground truth is 22 matches, so current SBT has a bug.

luizirber added a commit that referenced this pull request Oct 31, 2020
Implement a HyperLogLog sketch based on the `khmer` implementation but using the estimator from ["New cardinality estimation algorithms for HyperLogLog sketches"](http://oertl.github.io/hyperloglog-sketch-estimation-paper/paper/paper.pdf) (also implemented in `dashing`).

This PR also moves `add_sequence` and `add_protein` to `SigsTrait`, closing #1057.

The encoding data and methods (`hp`, `dayhoff`, `aa` and `HashFunctions`) was in the MinHash source file, and since it is more general-purpose it was moved to a new module `encodings`, which is then used by `SigsTrait`.

(these changes are both spun off #1201)
@luizirber luizirber force-pushed the scaffold_py branch 2 times, most recently from 935d219 to 218ae49 Compare November 1, 2020 15:53
broken impl with tests

want union, not intersection; one-level rotation for subtrees

use it in index CLI, fix bug for single-leaf SBT

add a HLL impl, move alphabet stuff to encodings
hll ffi
expose cardinality, add_hash and add_sequence

new ertl ml estimator for HLL

implemented joint mle

reorganize hll and estimators

fix rust checks and add python tests for hll

working on fixing scaffolding issues

new hll methods

working on partial saving

add trace to logging
@luizirber
Copy link
Member Author

This is still useful, but I'm mostly not working on SBTs anymore... Happy to discuss if anyone wants to pick it up, but closing for now.

@luizirber luizirber closed this Feb 13, 2023
@luizirber luizirber deleted the scaffold_py branch February 13, 2023 04:11
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants