Speed up disjunctions by computing estimations of the score of the k-th top hit up-front. #12526
jpountz wants to merge 6 commits into apache:main
…th top hit up-front. Currently, our dynamic pruning logic for disjunctions updates the minimum competitive score as it sees more and more competitive hits. However this process can take time if some of the high-scoring clauses don't have many hits, or are very sparse at the beginning of the doc ID space. It is possible to do better by trying to estimate a lower bound of the score of the k-th top hit up-front in order to bootstrap the minimum competitive score to a value that will immediately enable efficient dynamic pruning. The proposed approach computes this initial minimum score by only using clauses that have not evaluated k hits yet to drive iteration.
|
Here are results on |
|
I added a few tasks, listed here for reference, to see how this plays with disjunctions that have more terms or different document frequencies: While it tends to help queries that are already fast, it also helped OrHighVeryLow above, which is not among the fastest. I also like that none of the queries gets a major slowdown. |
I think you meant OrHighLow, which is indeed very nicely improved |
|
Oops, yes indeed OrHighLow. |
|
Wow, impressive! Maybe we should add |
|
We could. These tasks are a bit malicious as the doc freq is slightly greater than the value of I still need to figure out a way to avoid referencing readers in weight, I think we had issues with that in the past though I can't remember exactly what the issue was. |
|
FYI there was an interesting observation on another benchmark that took advantage of recursive graph bisection: https://jpountz.github.io/lucene-9.7-vs-9.8/. One query ( |
@mikemccand I started looking into this, but my enwiki ( |
|
@mikemccand FYI I gave a try at adding some interesting boolean queries to nightly benchmarks at mikemccand/luceneutil#240. |
|
This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the dev@lucene.apache.org list. Thank you for your contribution! |
|
I'll reopen when I have time to get back to this. It could be a useful optimization, though the benefit has shrunk thanks to other optimizations to disjunctions.
Very late answer (sorry!): hmm, indeed the frequencies reported in those task files (as comments) are likely from a different (older?) enwiki snapshot. It looks like you muscled through this and added the new tasks to nightly tasks, thanks!
Currently, our dynamic pruning logic for disjunctions updates the minimum competitive score as it sees more and more competitive hits. However this process can take time if some of the high-scoring clauses don't have many hits, or are very sparse at the beginning of the doc ID space. It is possible to do better by trying to estimate a lower bound of the score of the k-th top hit up-front in order to bootstrap the minimum competitive score to a value that will immediately enable efficient dynamic pruning.
The proposed approach computes this initial minimum score by only using clauses that have not evaluated 2*k hits yet to drive iteration.
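To make the idea concrete, here is a minimal, self-contained sketch of how such an up-front estimate could be computed. It is not the code from this PR and does not use Lucene's scorer APIs: `KthScoreEstimate`, `Clause`, `estimateKthScore` and the `budget` parameter are hypothetical names, and clauses are modeled as plain arrays of sorted doc IDs with non-negative scores. Only clauses that have evaluated fewer than `budget` hits drive iteration; every visited doc is scored against all clauses, so the k-th best score among visited docs cannot exceed the true k-th top score.

```java
import java.util.PriorityQueue;

final class KthScoreEstimate {

  /** Toy clause (an assumption for illustration): sorted doc IDs with non-negative scores. */
  static final class Clause {
    final int[] docs;
    final float[] scores;
    int pos = 0;       // cursor into docs/scores
    int evaluated = 0; // number of hits of this clause scored so far

    Clause(int[] docs, float[] scores) {
      this.docs = docs;
      this.scores = scores;
    }
  }

  /**
   * Returns a lower bound of the k-th top score, or 0 if fewer than k docs were visited.
   * Only clauses with fewer than {@code budget} evaluated hits pick the next doc.
   */
  static float estimateKthScore(Clause[] clauses, int k, int budget) {
    PriorityQueue<Float> topK = new PriorityQueue<>(); // min-heap of the k best scores seen
    while (true) {
      // The next doc is the smallest current doc among clauses still within budget.
      int doc = Integer.MAX_VALUE;
      for (Clause c : clauses) {
        if (c.evaluated < budget && c.pos < c.docs.length) {
          doc = Math.min(doc, c.docs[c.pos]);
        }
      }
      if (doc == Integer.MAX_VALUE) {
        break; // every clause is exhausted or has used up its budget
      }
      // Score the doc against all clauses so that the score is exact.
      float score = 0f;
      for (Clause c : clauses) {
        while (c.pos < c.docs.length && c.docs[c.pos] < doc) {
          c.pos++; // skip docs that were never selected by a driving clause
        }
        if (c.pos < c.docs.length && c.docs[c.pos] == doc) {
          score += c.scores[c.pos];
          c.pos++;
          c.evaluated++;
        }
      }
      topK.add(score);
      if (topK.size() > k) {
        topK.poll(); // drop the smallest, keeping only the k best
      }
    }
    return topK.size() == k ? topK.peek() : 0f;
  }

  public static void main(String[] args) {
    // A rare, high-scoring clause whose hits are sparse at the start of the doc ID space.
    Clause rare = new Clause(new int[] {5, 90}, new float[] {3f, 3f});
    // A frequent, low-scoring clause.
    Clause common = new Clause(
        new int[] {1, 2, 3, 5, 7, 8, 9, 90},
        new float[] {1f, 1f, 1f, 1f, 1f, 1f, 1f, 1f});
    int k = 2;
    System.out.println(estimateKthScore(new Clause[] {rare, common}, k, 2 * k));
  }
}
```

In this toy example the frequent clause stops driving iteration after 4 hits, so docs 7, 8 and 9 are never scored, while the rare clause still leads iteration to docs 5 and 90. The returned value could then serve as the initial minimum competitive score for the real collection pass.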