Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[WIP] Remove min_n_below from search code #1137

Closed
wants to merge 6 commits into from
Closed

Conversation

luizirber
Copy link
Member

@luizirber luizirber commented Aug 2, 2020

There are three main cases when searching an index in sourmash (I'll use Q as query signature and s as a signature/internal node for the analysis):

Case 1: similarity searches

We want to find any signature that is above a similarity threshold in a collection:
J(Q, s) = |Q ∩ s| / |Q ∪ s|. Since |Q| <= |Q ∪ s|, we can use |Q| as an upper bound (because it can overestimate the similarity, but never underestimate).

Case 2: containment searches

We want to find what signatures are above a containment threshold to the query (with sourmash search --containment) or which one is the best match (maximizes the query containment in the signature), as in sourmash gather.
C(Q, s) = |Q ∩ s| / |Q|. Similar to case 1, we just need |Q|.

Case 3: reverse containment searches

We want to find what signatures are above a containment threshold to the query like case 2, but we want to maximize the signature containment (not the query containment).
C(s, Q) = |Q ∩ s| / |s|.
This is the case where min_n_below is necessary to bound |s|. It is also not used in any place in sourmash.

min_n_below is too pessimistic, and ends up making search slower.

Because the only place we really need min_n_below is not a use case for sourmash, the consequence is that min_n_below is too pessimistic when pruning subtrees during search, and it ends up making search slower.

This all came from the realization that I was thinking about gather backwards. The best match in gather maximizes C(Q,s) and not C(s,Q). An example: let's say there is a small viral scaled MinHash s with only one hash. If gather was implemented with C(s, Q) and the hash is in Q, then s would be the best match (because C(s, Q) = 1). But maximizing C(Q, s) is really finding the largest intersection between the query and a signature in an index, and so if there is a larger scaled MinHash s' with smaller C(s', Q) but larger C(Q, s'), then it is going to be selected (and its hashes will be removed from the query for the next round).

TODO

  • More tests, I'm not so sure current tests are checking for corner cases...

Checklist

  • Is it mergeable?
  • make test Did it pass the tests?
  • make coverage Is the new code covered?
  • Did it change the command-line interface? Only additions are allowed
    without a major version increment. Changing file formats also requires a
    major version number increment.
  • Was a spellchecker run on the source code and documentation after
    changes were made?

@codecov
Copy link

codecov bot commented Aug 2, 2020

Codecov Report

Merging #1137 (cf61c97) into latest (eb2b210) will decrease coverage by 0.18%.
The diff coverage is 59.25%.

Impacted file tree graph

@@            Coverage Diff             @@
##           latest    #1137      +/-   ##
==========================================
- Coverage   89.58%   89.39%   -0.19%     
==========================================
  Files         122      122              
  Lines       18989    19001      +12     
  Branches     1455     1448       -7     
==========================================
- Hits        17011    16986      -25     
- Misses       1750     1796      +46     
+ Partials      228      219       -9     
Flag Coverage Δ
python 94.68% <94.11%> (-0.12%) ⬇️
rust 67.00% <0.00%> (-0.38%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
src/core/src/ffi/nodegraph.rs 0.00% <0.00%> (ø)
src/core/src/sketch/nodegraph.rs 79.51% <0.00%> (-4.37%) ⬇️
src/sourmash/sbt.py 81.24% <ø> (-2.72%) ⬇️
tests/test_sourmash.py 99.71% <ø> (-0.01%) ⬇️
src/sourmash/nodegraph.py 71.05% <50.00%> (-0.77%) ⬇️
src/sourmash/sbtmh.py 93.69% <100.00%> (+1.32%) ⬆️
tests/test_sbt.py 100.00% <100.00%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update eb2b210...cf61c97. Read the comment docs.

@ctb
Copy link
Contributor

ctb commented Aug 2, 2020

ok, this is pretty neat :). On first glance, I agree with your calculations!

@ctb
Copy link
Contributor

ctb commented Aug 2, 2020

(I also agree about More Testing Needed for edge cases...)

@luizirber
Copy link
Member Author

Need more checks, and the machine I ran the benchmarks was running other processes too, but... wow.

image

@luizirber
Copy link
Member Author

luizirber commented Apr 19, 2021

I updated the notebook in luizirber/2021-04-17-angular-bound, where I'm doing some analysis on the proposed upper bound for angular similarity. I ended up going with the second approach because it overestimates way less than the first one.

I used hypothesis to generate 5000 examples and then selected cases where the angular similarity was larger than 0.1 and the difference between angular similarity and the upper bound was larger than 0.1. When I plot an histogram for the first approach (a_sq += 1), this is the result:
image

and for the second approach (a_sq += abund * abund):
image

@ctb
Copy link
Contributor

ctb commented Aug 21, 2022

@luizirber is this worth pursuing (or at least keeping the PR up to date :), or has this been superseded by other changes?

@luizirber
Copy link
Member Author

@luizirber is this worth pursuing (or at least keeping the PR up to date :), or has this been superseded by other changes?

Maybe? But I'm closing this PR because I won't work on it.

@luizirber luizirber closed this Feb 13, 2023
@luizirber luizirber deleted the remove_min_n_below branch February 13, 2023 04:33
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants