[WIP] Remove min_n_below from search code #1137

luizirber · 2020-08-02T15:49:31Z

There are three main cases when searching an index in sourmash (I'll use Q as query signature and s as a signature/internal node for the analysis):

Case 1: similarity searches

We want to find any signature that is above a similarity threshold in a collection:
J(Q, s) = |Q ∩ s| / |Q ∪ s|. Since |Q| <= |Q ∪ s|, we can use |Q| as an upper bound (because it can overestimate the similarity, but never underestimate).

Case 2: containment searches

We want to find what signatures are above a containment threshold to the query (with sourmash search --containment) or which one is the best match (maximizes the query containment in the signature), as in sourmash gather.
C(Q, s) = |Q ∩ s| / |Q|. Similar to case 1, we just need |Q|.

Case 3: reverse containment searches

We want to find what signatures are above a containment threshold to the query like case 2, but we want to maximize the signature containment (not the query containment).
C(s, Q) = |Q ∩ s| / |s|.
This is the case where min_n_below is necessary to bound |s|. It is also not used in any place in sourmash.

`min_n_below` is too pessimistic, and ends up making search slower.

Because the only place we really need min_n_below is not a use case for sourmash, the consequence is that min_n_below is too pessimistic when pruning subtrees during search, and it ends up making search slower.

This all came from the realization that I was thinking about gather backwards. The best match in gather maximizes C(Q,s) and not C(s,Q). An example: let's say there is a small viral scaled MinHash s with only one hash. If gather was implemented with C(s, Q) and the hash is in Q, then s would be the best match (because C(s, Q) = 1). But maximizing C(Q, s) is really finding the largest intersection between the query and a signature in an index, and so if there is a larger scaled MinHash s' with smaller C(s', Q) but larger C(Q, s'), then it is going to be selected (and its hashes will be removed from the query for the next round).

TODO

More tests, I'm not so sure current tests are checking for corner cases...

Checklist

Is it mergeable?
make test Did it pass the tests?
make coverage Is the new code covered?
Did it change the command-line interface? Only additions are allowed
without a major version increment. Changing file formats also requires a
major version number increment.
Was a spellchecker run on the source code and documentation after
changes were made?

codecov · 2020-08-02T15:56:09Z

Codecov Report

Merging #1137 (cf61c97) into latest (eb2b210) will decrease coverage by 0.18%.
The diff coverage is 59.25%.

@@            Coverage Diff             @@
##           latest    #1137      +/-   ##
==========================================
- Coverage   89.58%   89.39%   -0.19%     
==========================================
  Files         122      122              
  Lines       18989    19001      +12     
  Branches     1455     1448       -7     
==========================================
- Hits        17011    16986      -25     
- Misses       1750     1796      +46     
+ Partials      228      219       -9

Flag	Coverage Δ
python	`94.68% <94.11%> (-0.12%)`	⬇️
rust	`67.00% <0.00%> (-0.38%)`	⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files	Coverage Δ
src/core/src/ffi/nodegraph.rs	`0.00% <0.00%> (ø)`
src/core/src/sketch/nodegraph.rs	`79.51% <0.00%> (-4.37%)`	⬇️
src/sourmash/sbt.py	`81.24% <ø> (-2.72%)`	⬇️
tests/test_sourmash.py	`99.71% <ø> (-0.01%)`	⬇️
src/sourmash/nodegraph.py	`71.05% <50.00%> (-0.77%)`	⬇️
src/sourmash/sbtmh.py	`93.69% <100.00%> (+1.32%)`	⬆️
tests/test_sbt.py	`100.00% <100.00%> (ø)`

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update eb2b210...cf61c97. Read the comment docs.

ctb · 2020-08-02T16:13:38Z

ok, this is pretty neat :). On first glance, I agree with your calculations!

ctb · 2020-08-02T16:13:53Z

(I also agree about More Testing Needed for edge cases...)

luizirber · 2020-08-02T16:45:07Z

Need more checks, and the machine I ran the benchmarks was running other processes too, but... wow.

luizirber · 2021-04-19T17:28:23Z

I updated the notebook in luizirber/2021-04-17-angular-bound, where I'm doing some analysis on the proposed upper bound for angular similarity. I ended up going with the second approach because it overestimates way less than the first one.

I used hypothesis to generate 5000 examples and then selected cases where the angular similarity was larger than 0.1 and the difference between angular similarity and the upper bound was larger than 0.1. When I plot an histogram for the first approach (a_sq += 1), this is the result:

and for the second approach (a_sq += abund * abund):

ctb · 2022-08-21T14:25:27Z

@luizirber is this worth pursuing (or at least keeping the PR up to date :), or has this been superseded by other changes?

luizirber · 2023-02-13T04:33:18Z

@luizirber is this worth pursuing (or at least keeping the PR up to date :), or has this been superseded by other changes?

Maybe? But I'm closing this PR because I won't work on it.

luizirber mentioned this pull request Aug 2, 2020

[WIP] Use a Nodegraph for searching in internal nodes #1138

Closed

6 tasks

luizirber force-pushed the remove_min_n_below branch from 5c18d7b to b5db252 Compare October 27, 2020 21:50

This was referenced Oct 27, 2020

[WIP] Using bitmagic for internal Nodegraph representation #1221

Closed

[WIP] SBT scaffold #1201

Closed

luizirber force-pushed the remove_min_n_below branch from b5db252 to 85f707f Compare March 26, 2021 02:57

luizirber changed the base branch from master to latest March 26, 2021 03:17

luizirber mentioned this pull request Apr 16, 2021

sourmash index does not flatten the signatures when building an SBT #1454

Closed

luizirber force-pushed the remove_min_n_below branch from 85f707f to a689d17 Compare April 17, 2021 02:47

ctb mentioned this pull request Apr 19, 2021

[MRG] Rework the find functionality for Index classes #1392

Merged

15 tasks

luizirber mentioned this pull request Apr 20, 2021

[MRG] Add MinHash.intersection(...) #1474

Merged

luizirber added 6 commits April 21, 2021 17:43

step 1: no need for min_n_below

8f01ec6

fix comments

2389eca

add min_n_below removal tests

7712066

fix test data by removing abundance

37b94c6

implement an angular similarity upper bound for nodegraph/minhash

08601ab

seems like abund * abund overestimates less

cf61c97

luizirber force-pushed the remove_min_n_below branch from 7cbc764 to cf61c97 Compare April 22, 2021 00:43

luizirber closed this Feb 13, 2023

luizirber deleted the remove_min_n_below branch February 13, 2023 04:33

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[WIP] Remove min_n_below from search code #1137

[WIP] Remove min_n_below from search code #1137

luizirber commented Aug 2, 2020 •

edited

Loading

codecov bot commented Aug 2, 2020 •

edited

Loading

ctb commented Aug 2, 2020

ctb commented Aug 2, 2020

luizirber commented Aug 2, 2020

luizirber commented Apr 19, 2021 •

edited

Loading

ctb commented Aug 21, 2022

luizirber commented Feb 13, 2023

[WIP] Remove min_n_below from search code #1137

[WIP] Remove min_n_below from search code #1137

Conversation

luizirber commented Aug 2, 2020 • edited Loading

Case 1: similarity searches

Case 2: containment searches

Case 3: reverse containment searches

min_n_below is too pessimistic, and ends up making search slower.

TODO

Checklist

codecov bot commented Aug 2, 2020 • edited Loading

Codecov Report

ctb commented Aug 2, 2020

ctb commented Aug 2, 2020

luizirber commented Aug 2, 2020

luizirber commented Apr 19, 2021 • edited Loading

ctb commented Aug 21, 2022

luizirber commented Feb 13, 2023

luizirber commented Aug 2, 2020 •

edited

Loading

`min_n_below` is too pessimistic, and ends up making search slower.

codecov bot commented Aug 2, 2020 •

edited

Loading

luizirber commented Apr 19, 2021 •

edited

Loading