[WIP] Remove min_n_below from search code #1137
Codecov Report
@@ Coverage Diff @@
## latest #1137 +/- ##
==========================================
- Coverage 89.58% 89.39% -0.19%
==========================================
Files 122 122
Lines 18989 19001 +12
Branches 1455 1448 -7
==========================================
- Hits 17011 16986 -25
- Misses 1750 1796 +46
+ Partials 228 219 -9
ok, this is pretty neat :). On first glance, I agree with your calculations!
(I also agree about More Testing Needed for edge cases...)
I updated the notebook in luizirber/2021-04-17-angular-bound, where I'm doing some analysis on the proposed upper bound for angular similarity. I ended up going with the second approach because it overestimates way less than the first one. I used and for the second approach (
@luizirber is this worth pursuing (or at least keeping the PR up to date :), or has this been superseded by other changes?
Maybe? But I'm closing this PR because I won't work on it.
There are three main cases when searching an index in sourmash (I'll use `Q` as the query signature and `s` as a signature/internal node for the analysis):

Case 1: similarity searches
We want to find any signature in a collection that is above a similarity threshold: `J(Q, s) = |Q ∩ s| / |Q ∪ s|`. Since `|Q| <= |Q ∪ s|`, we can use `|Q|` as the denominator to get an upper bound (it can overestimate the similarity, but never underestimate it).

Case 2: containment searches
We want to find which signatures are above a containment threshold to the query (with `sourmash search --containment`) or which one is the best match (maximizes the query containment in the signature), as in `sourmash gather`. `C(Q, s) = |Q ∩ s| / |Q|`. Similar to case 1, we just need `|Q|`.

Case 3: reverse containment searches
We want to find which signatures are above a containment threshold to the query like case 2, but we want to maximize the signature containment (not the query containment). `C(s, Q) = |Q ∩ s| / |s|`.

This is the case where `min_n_below` is necessary to bound `|s|`. This case is also not used anywhere in sourmash.

`min_n_below` is too pessimistic, and ends up making search slower.

Because the only place we really need `min_n_below` is not a use case for sourmash, the consequence is that `min_n_below` is too pessimistic when pruning subtrees during search, and it ends up making search slower.

This all came from the realization that I was thinking about `gather` backwards. The best match in `gather` maximizes `C(Q, s)` and not `C(s, Q)`. An example: let's say there is a small viral scaled MinHash `s` with only one hash. If `gather` was implemented with `C(s, Q)` and the hash is in `Q`, then `s` would be the best match (because `C(s, Q) = 1`). But maximizing `C(Q, s)` is really finding the largest intersection between the query and a signature in an index, and so if there is a larger scaled MinHash `s'` with smaller `C(s', Q)` but larger `C(Q, s')`, then it is going to be selected (and its hashes will be removed from the query for the next round).

TODO
Checklist
`make test` Did it pass the tests?
`make coverage` Is the new code covered?
without a major version increment. Changing file formats also requires a major version number increment.
changes were made?
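A minimal sketch of the reasoning in the description above, using plain Python sets as stand-ins for scaled MinHash sketches. Nothing here is sourmash API: `jaccard`, `containment`, `node_upper_bound` and the toy hash values are hypothetical, and the bound assumes an internal node holds the union of the hashes in the leaves below it. It checks that a bound built only from `|Q|` never underestimates cases 1 and 2, and replays the `gather` example with a one-hash signature.

```python
def jaccard(q, s):
    """Case 1: J(Q, s) = |Q ∩ s| / |Q ∪ s|."""
    return len(q & s) / len(q | s)


def containment(a, b):
    """C(a, b) = |a ∩ b| / |a|, the fraction of a that is found in b."""
    return len(a & b) / len(a)


def node_upper_bound(q, node):
    """Upper bound for cases 1 and 2 at an internal node (assumption: the
    node is the union of its leaves, so |Q ∩ node| >= |Q ∩ s| for every
    leaf s below it, and |Q| <= |Q ∪ s|)."""
    return len(q & node) / len(q)


# Toy data: integers stand in for hash values.
query = {1, 2, 3, 4, 5, 6, 7, 8}
small = {3}                               # one-hash "viral" signature s
large = {1, 2, 3, 4, 20, 21, 22, 23}      # larger signature s'
leaves = [small, large]
node = small | large                      # hypothetical internal node

bound = node_upper_bound(query, node)
for leaf in leaves:
    # The bound may overestimate, but never underestimate.
    assert jaccard(query, leaf) <= bound
    assert containment(query, leaf) <= bound

# A subtree can be skipped when even the bound is below the threshold.
threshold = 0.6
can_prune = bound < threshold

# gather picks the largest C(Q, s), not the largest C(s, Q).
best_by_signature = max(leaves, key=lambda s: containment(s, query))
best_by_query = max(leaves, key=lambda s: containment(query, s))

print("bound:", bound, "prune:", can_prune)
print("max C(s, Q):", sorted(best_by_signature))
print("max C(Q, s):", sorted(best_by_query))
```

With these toy sets, maximizing `C(s, Q)` picks the one-hash signature, while maximizing `C(Q, s)` picks the larger one, which matches the `gather` behavior described above. Case 3 is left out on purpose: bounding `C(s, Q)` needs a lower bound on `|s|`, which is the role `min_n_below` played.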