[WIP] SBT scaffold #1201
Codecov Report
@@ Coverage Diff @@
## latest #1201 +/- ##
===========================================
- Coverage 83.31% 42.75% -40.56%
===========================================
Files 103 103
Lines 9601 9851 +250
===========================================
- Hits 7999 4212 -3787
- Misses 1602 5639 +4037
Very interesting! Could this be used in #925? (Sorry this PR has been taking a while.)
(it was getting late last night, talking more now =])
I think it can be used indirectly, but they are very different (and complementary) approaches. So, I think this PR is useful for #925 as a baseline for comparison, but the bottom-up building and the "calculate all pairwise distances for each level" approach are not going to work over there...
c02451a to 63ac76d
(This PR now includes an HLL implementation. whoops.)
5522636 to a572578
Linking back to #545, since this PR implements part of the HowDe SBT approach (the clustering), but is still missing the HowDe-like nodes (two compressed bitvectors instead of a Nodegraph).
e9df14e to e3d260e
0b26611 to 2709d82
Some intriguing results: in #1221 (comment) I reported running times and memory consumption for the bitmagic-based Nodegraph. Here is an updated table with the latest changes in this PR:
The clustering of the SBT is helping a lot with index size, and also with memory consumption (since fewer nodes have to be checked). I'm a bit surprised by the runtime increase, but the machine where I'm running tests is a bit overworked at the moment, so it might be related. The main issue? I'm seeing 22 matches, instead of 21 like before. So I might have found a bug in the current SBT? 😨 Update: ground truth is 22 matches, so the current SBT has a bug.
02e907a to e21da04
Implement a HyperLogLog sketch based on the `khmer` implementation but using the estimator from ["New cardinality estimation algorithms for HyperLogLog sketches"](http://oertl.github.io/hyperloglog-sketch-estimation-paper/paper/paper.pdf) (also implemented in `dashing`). This PR also moves `add_sequence` and `add_protein` to `SigsTrait`, closing #1057. The encoding data and methods (`hp`, `dayhoff`, `aa` and `HashFunctions`) were in the MinHash source file; since they are more general-purpose, they were moved to a new module, `encodings`, which is then used by `SigsTrait`. (These changes are both spun off from #1201.)
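For readers unfamiliar with HyperLogLog, here is a minimal Python sketch of the general idea. It is not the code from this PR: it uses the classic estimator with a linear-counting correction for small counts rather than Ertl's improved estimator, it hashes with `blake2b` from the standard library instead of the hash function sourmash uses, and it does not canonicalize k-mers. The class name and the `add_kmer` helper are hypothetical; `add_hash`, `add_sequence` and `cardinality` only mirror the method names mentioned in this thread.

```python
import hashlib
import math


class HyperLogLog:
    """Toy HyperLogLog with the classic estimator (an illustration only)."""

    def __init__(self, p=12):
        # The bias constant used in cardinality() is only valid for m >= 128.
        if not 7 <= p <= 16:
            raise ValueError("p must be between 7 and 16")
        self.p = p
        self.m = 1 << p                 # number of registers
        self.registers = [0] * self.m

    def add_hash(self, h):
        """Add one 64-bit hash value to the sketch."""
        width = 64 - self.p
        idx = h >> width                # top p bits pick the register
        rest = h & ((1 << width) - 1)   # remaining bits
        # 1-based position of the leftmost set bit in `rest`;
        # width + 1 when `rest` is zero.
        rank = width - rest.bit_length() + 1
        if rank > self.registers[idx]:
            self.registers[idx] = rank

    def add_kmer(self, kmer):
        """Hash a k-mer string to 64 bits (assumed hash: blake2b) and add it."""
        digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
        self.add_hash(int.from_bytes(digest, "big"))

    def add_sequence(self, seq, ksize):
        """Add every k-mer of length `ksize` found in `seq`."""
        for i in range(len(seq) - ksize + 1):
            self.add_kmer(seq[i:i + ksize])

    def cardinality(self):
        """Estimate the number of distinct items added so far."""
        m = self.m
        alpha = 0.7213 / (1 + 1.079 / m)
        raw = alpha * m * m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * m and zeros:
            # Small-range correction: fall back to linear counting.
            return m * math.log(m / zeros)
        return raw
```

With `p=12` the sketch keeps 4096 registers, so the expected relative error of the raw estimator is roughly 1.04 / sqrt(4096), about 1.6%. Adding the same item twice never changes the registers, which is why the estimate counts distinct items.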
935d219 to 218ae49
broken impl with tests
want union, not intersection; one-level rotation for subtrees
use it in index CLI, fix bug for single-leaf SBT
add a HLL impl, move alphabet stuff to encodings
hll ffi
expose cardinality, add_hash and add_sequence
new ertl ml estimator for HLL
implemented joint mle
reorganize hll and estimators
fix rust checks and add python tests for hll
working on fixing scaffolding issues
new hll methods
working on partial saving
add trace to logging
218ae49 to a3cbdd4
This is still useful, but I'm mostly not working on SBTs anymore... Happy to discuss if anyone wants to pick it up, but closing for now.
Add a new function `scaffold` that takes a list of signatures and builds an SBT clustered by shared hashes. At the moment it still uses the same amount of memory, but building by levels (from the bottom up) allows saving each node as it is built and then unloading it as the next level is built.
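The bottom-up, level-by-level idea can be illustrated with a small Python sketch. This is an assumption-laden toy, not the PR's implementation: signatures are plain `(name, set_of_hashes)` pairs, nodes are nested tuples rather than Nodegraphs, siblings are chosen greedily by the number of shared hashes after computing all pairwise overlaps for the level, and everything stays in memory (no saving or unloading of nodes). Each parent holds the union of its children's hashes.

```python
from itertools import combinations


def scaffold(signatures):
    """Build a binary tree bottom-up from a list of (name, hashes) pairs.

    A leaf is (name, hashes); an internal node is (left, right, hashes),
    where hashes is the union of the two children. Hypothetical layout,
    chosen only for this illustration.
    """
    if not signatures:
        raise ValueError("need at least one signature")

    level = [(name, frozenset(hashes)) for name, hashes in signatures]
    while len(level) > 1:
        # All pairwise shared-hash counts for this level, largest first.
        pairs = sorted(
            combinations(range(len(level)), 2),
            key=lambda ij: len(level[ij[0]][-1] & level[ij[1]][-1]),
            reverse=True,
        )
        used = set()
        next_level = []
        for i, j in pairs:
            if i in used or j in used:
                continue
            used.update((i, j))
            union = level[i][-1] | level[j][-1]
            next_level.append((level[i], level[j], union))
        # With an odd number of nodes, the leftover one moves up unchanged.
        next_level.extend(level[k] for k in range(len(level)) if k not in used)
        level = next_level
    return level[0]
```

For example, given four signatures where `a` and `b` share two hashes and `c` and `d` share one, the first level pairs `a` with `b` and `c` with `d`, and the root contains the union of all hashes. Computing every pairwise overlap is quadratic in the number of nodes per level, which is the cost noted earlier in this thread.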
This replaces the `tree.insert(sig)` approach in `sourmash index`, with a fallback to the regular insertion code if `--append` is set.

TODO
GraphFactory with it

Checklist
`make test` Did it pass the tests?
`make coverage` Is the new code covered?
without a major version increment. Changing file formats also requires a major version number increment.
changes were made?