Speed up disjunctions by computing estimations of the score of the k-th top hit up-front. by jpountz · Pull Request #12526 · apache/lucene

jpountz · 2023-08-29T21:34:24Z

Currently, our dynamic pruning logic for disjunctions updates the minimum competitive score as it sees more and more competitive hits. However, this process can take time if some of the high-scoring clauses don't have many hits, or are very sparse at the beginning of the doc ID space. It is possible to do better by estimating a lower bound of the score of the k-th top hit up-front, in order to bootstrap the minimum competitive score to a value that immediately enables efficient dynamic pruning.

The proposed approach computes this initial minimum score by only using clauses that have not evaluated 2*k hits yet to drive iteration.
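
To illustrate why such an estimate is safe, here is a minimal sketch and not the code from this PR. It assumes we already hold the true scores of some distinct matching documents (for example, those produced by the sparse clauses). The k-th largest of those scores can never exceed the k-th largest score over all matches, so it is a valid lower bound to start pruning with. The class and method names are hypothetical.

```java
import java.util.PriorityQueue;

/** Illustrative sketch only; names are hypothetical and this is not the PR's implementation. */
final class MinScoreBootstrap {

  /**
   * Returns the k-th largest value in {@code scores}, or 0 if fewer than k scores are available.
   * Assumption: every entry is the real score of a distinct matching document, so the result
   * is a lower bound on the score of the k-th top hit of the whole query.
   */
  static float estimateMinCompetitiveScore(float[] scores, int k) {
    if (k <= 0) {
      throw new IllegalArgumentException("k must be positive");
    }
    // Min-heap holding the k best scores seen so far; its head is the k-th best.
    PriorityQueue<Float> topK = new PriorityQueue<>(k);
    for (float score : scores) {
      if (topK.size() < k) {
        topK.add(score);
      } else if (score > topK.peek()) {
        topK.poll();
        topK.add(score);
      }
    }
    // With fewer than k scores we know nothing yet, so no pruning is possible.
    return topK.size() == k ? topK.peek() : 0f;
  }
}
```

With scores {1.5, 4.0, 2.5, 3.0} and k=2, the two best scores are 4.0 and 3.0, so the bootstrap value would be 3.0; with k=5 there are too few scores and the sketch falls back to 0.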

jpountz · 2023-08-29T21:36:09Z

Here are results on wikimedium10m. OrHighHigh and OrHighMed don't get a speedup because their minimum competitive scores compute pretty quickly anyway, but OrHighHigh sees a major speedup:

                            Task   QPS baseline   StdDev   QPS my_modified_version   StdDev                Pct diff p-value
                      AndHighLow     1452.84      (2.0%)     1410.03      (3.8%)   -2.9% (  -8% -    2%) 0.017
                          Fuzzy1       98.38      (1.6%)       96.51      (1.1%)   -1.9% (  -4% -    0%) 0.001
            HighIntervalsOrdered        6.24      (5.8%)        6.15      (4.4%)   -1.4% ( -11% -    9%) 0.494
                      OrHighHigh       61.69      (6.4%)       60.87      (5.4%)   -1.3% ( -12% -   11%) 0.585
             MedIntervalsOrdered       44.82      (5.1%)       44.23      (3.9%)   -1.3% (  -9% -    8%) 0.476
             LowIntervalsOrdered       57.23      (5.4%)       56.48      (4.0%)   -1.3% ( -10% -    8%) 0.497
                       OrHighMed      190.42      (3.8%)      188.10      (3.8%)   -1.2% (  -8% -    6%) 0.430
                      AndHighMed      236.92      (4.1%)      234.25      (4.1%)   -1.1% (  -8% -    7%) 0.500
                    OrHighNotMed      425.77      (6.6%)      421.99      (5.2%)   -0.9% ( -11% -   11%) 0.715
                         MedTerm      788.26      (7.2%)      781.68      (3.4%)   -0.8% ( -10% -   10%) 0.716
                   OrHighNotHigh      317.53      (6.6%)      314.90      (5.5%)   -0.8% ( -12% -   12%) 0.738
                        HighTerm      593.70      (7.6%)      589.22      (3.9%)   -0.8% ( -11% -   11%) 0.760
                          Fuzzy2       73.16      (1.3%)       72.68      (1.3%)   -0.7% (  -3% -    1%) 0.206
                   OrNotHighHigh      413.61      (6.0%)      411.20      (5.2%)   -0.6% ( -11% -   11%) 0.798
                       LowPhrase       43.15      (2.9%)       42.90      (1.4%)   -0.6% (  -4% -    3%) 0.526
                    OrNotHighMed      425.13      (4.4%)      422.86      (3.3%)   -0.5% (  -7% -    7%) 0.735
                HighSloppyPhrase       12.59      (4.7%)       12.53      (5.6%)   -0.5% ( -10% -   10%) 0.808
                     LowSpanNear       28.72      (2.1%)       28.57      (2.2%)   -0.5% (  -4% -    3%) 0.559
                    OrHighNotLow      475.44      (7.1%)      473.03      (5.2%)   -0.5% ( -11% -   12%) 0.841
                        PKLookup      245.49      (3.5%)      244.36      (3.8%)   -0.5% (  -7% -    7%) 0.759
                 LowSloppyPhrase       67.32      (2.7%)       67.06      (2.8%)   -0.4% (  -5% -    5%) 0.730
                         LowTerm     1124.64      (6.8%)     1120.58      (3.5%)   -0.4% (  -9% -   10%) 0.870
                        Wildcard      172.10      (2.7%)      171.49      (2.4%)   -0.4% (  -5% -    4%) 0.735
                       MedPhrase       59.34      (3.1%)       59.16      (1.4%)   -0.3% (  -4% -    4%) 0.765
           HighTermDayOfYearSort      457.23      (1.2%)      456.10      (1.2%)   -0.2% (  -2% -    2%) 0.611
                     MedSpanNear       29.71      (3.0%)       29.64      (2.6%)   -0.2% (  -5% -    5%) 0.859
                    OrNotHighLow     1283.05      (2.7%)     1282.59      (1.9%)   -0.0% (  -4% -    4%) 0.971
               HighTermMonthSort     4728.97      (2.8%)     4729.28      (1.9%)    0.0% (  -4% -    4%) 0.995
                     AndHighHigh       63.31      (4.8%)       63.31      (4.8%)    0.0% (  -9% -   10%) 0.997
                         Prefix3      346.29      (4.3%)      346.43      (3.9%)    0.0% (  -7% -    8%) 0.980
                      TermDTSort      192.60      (1.1%)      192.76      (0.9%)    0.1% (  -1% -    2%) 0.830
                         Respell       96.59      (1.6%)       96.73      (1.3%)    0.2% (  -2% -    3%) 0.798
               HighTermTitleSort      161.71      (3.7%)      162.22      (4.8%)    0.3% (  -7% -    9%) 0.860
            HighTermTitleBDVSort       15.55      (3.3%)       15.61      (2.5%)    0.4% (  -5% -    6%) 0.748
                      HighPhrase       97.18      (4.1%)       97.75      (2.1%)    0.6% (  -5% -    7%) 0.659
                    HighSpanNear        6.86      (6.0%)        6.91      (6.6%)    0.8% ( -11% -   14%) 0.754
                 MedSloppyPhrase       38.05      (5.0%)       38.38      (4.0%)    0.9% (  -7% -   10%) 0.646
                          IntNRQ       94.63     (21.2%)       98.52     (21.0%)    4.1% ( -31% -   58%) 0.634
                       OrHighLow      430.69      (5.4%)      617.92      (4.7%)   43.5% (  31% -   56%) 0.000
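
For readers unfamiliar with this output, the "Pct diff" column can be reproduced from the two QPS columns as a plain relative change. The sketch below is illustrative arithmetic only (the class name is hypothetical, and it does not reproduce the range shown in parentheses):

```java
/** Illustrative arithmetic only; the class name is hypothetical. */
final class PctDiff {

  /** Relative change of the candidate QPS over the baseline QPS, in percent. */
  static double pctDiff(double baselineQps, double candidateQps) {
    return (candidateQps - baselineQps) / baselineQps * 100.0;
  }
}
```

For the OrHighLow row, `pctDiff(430.69, 617.92)` rounds to 43.5, matching the table.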

jpountz · 2023-08-30T19:46:53Z

I added a few tasks to see how this plays with disjunctions that have more terms or different document frequencies. I'm listing them here for reference:

OrHighVeryLow: 2005 mousehole # freq=835460 freq=123
OrHighVeryLow: until motorboats # freq=425389 freq=128
OrHighVeryLow: made monceau # freq=742313 freq=126
OrHighVeryLow: do bush's # freq=511178 freq=2681
OrHighVeryLow: 10 mikup # freq=918339 freq=119
OrHighMedLow: international chris valois
OrHighMedLow: right million universalist
OrHighMedLow: known created forays
OrHighMedLow: its universal bush's
OrHighMedLow: 9 network racedetail.html
OrHighHighHigh: 2005 until made
OrHighHighHigh: do 10 international
OrHighHighHigh: right known its
OrHighHighHigh: until 10 known
OrHighHighHigh: made international its
OrHighMedMed: international chris million
OrHighMedMed: right million created
OrHighMedMed: known created universal
OrHighMedMed: its universal network
OrHighMedMed: 9 network chris
OrHighHighLow: several following valois
OrHighHighLow: publisher end universalist
OrHighHighLow: 2009 film forays
OrHighHighLow: http known bush's
OrHighHighLow: south county racedetail.html
OrHighHighMed: international right million
OrHighHighMed: right known created
OrHighMighMed: known its universal
OrHighHighMed: its 9 network
OrHighHighMed: 9 international chris

                            Task   QPS baseline   StdDev   QPS my_modified_version   StdDev                Pct diff p-value
                    OrHighMedMed      158.53      (3.6%)      155.92      (4.4%)   -1.7% (  -9% -    6%) 0.193
                  OrHighHighHigh       53.97      (5.0%)       53.13      (4.9%)   -1.6% ( -10% -    8%) 0.324
                   OrHighHighMed      106.81      (4.0%)      105.37      (4.3%)   -1.3% (  -9% -    7%) 0.306
                      OrHighHigh       64.42      (5.6%)       63.64      (4.0%)   -1.2% ( -10% -    8%) 0.433
                   OrHighMighMed      201.12      (3.7%)      198.74      (3.5%)   -1.2% (  -8% -    6%) 0.298
                    OrHighMedLow      323.10      (3.7%)      319.32      (4.2%)   -1.2% (  -8% -    6%) 0.349
                       OrHighMed      227.13      (3.9%)      225.41      (3.0%)   -0.8% (  -7% -    6%) 0.487
                        HighTerm      652.70      (4.2%)      659.51      (5.3%)    1.0% (  -8% -   11%) 0.491
                        PKLookup      248.57      (3.4%)      251.38      (1.9%)    1.1% (  -4% -    6%) 0.198
                         MedTerm     1060.67      (4.5%)     1076.33      (5.4%)    1.5% (  -8% -   11%) 0.350
                         LowTerm     1639.65      (7.0%)     1667.48      (4.9%)    1.7% (  -9% -   14%) 0.377
                   OrHighVeryLow      172.35      (8.2%)      196.54      (8.4%)   14.0% (  -2% -   33%) 0.000
                   OrHighHighLow      449.76      (3.0%)      633.61      (3.5%)   40.9% (  33% -   48%) 0.000
                       OrHighLow      546.08      (5.4%)     1187.98      (5.1%)  117.5% ( 101% -  135%) 0.000

While it tends to help queries that are already fast, it also helped OrHighVeryLow above, which is not among the fastest. I also like that none of the queries is getting a major slowdown.

msokolov · 2023-08-30T20:30:52Z

OrHighHigh sees a major speedup:

I think you meant OrHighLow, which is indeed very nicely improved

jpountz · 2023-08-30T20:43:15Z

Oops, yes indeed OrHighLow.

mikemccand · 2023-08-31T09:52:33Z

Wow, impressive! Maybe we should add OrHighVeryLow to nightly benchy too?

jpountz · 2023-09-11T19:39:35Z

We could. These tasks are a bit malicious, as the doc freq is slightly greater than the value of k=100, so it takes lots of collected matches to find k documents that have this term. I suspect that another interesting value for the document frequency is when it is a bit less than k.

I still need to figure out a way to avoid referencing readers in weight. I think we had issues with that in the past, though I can't remember exactly what the issue was.

jpountz · 2023-09-14T07:16:36Z

FYI there was an interesting observation on another benchmark that took advantage of recursive graph bisection: https://jpountz.github.io/lucene-9.7-vs-9.8/. One query (the incredibles) became more than 7x (!) slower because recursive graph bisection had moved matches of the term with the highest score weight towards the end of the doc ID space. This should get addressed by a change like this PR.

jpountz · 2023-09-22T10:09:09Z

Maybe we should add OrHighVeryLow to nightly benchy too?

@mikemccand I started looking into this, but my enwiki (enwiki-20120502-lines-with-random-label.txt) seems to have slightly different frequencies compared to the frequencies reported in wikinightly.tasks. Are nightly benchmarks using the same export or a different one? I think it could make sense to have two new tasks: OrHighLow110, where the low-frequency term always has a frequency of 110 > k, and OrHighLow90, where the low-frequency term always has a frequency of 90 < k. These two cases are interesting because in one case it takes very long to collect k matches of the highest scoring clause, and in the other case this never happens.
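
A rough back-of-the-envelope sketch of why these two frequencies behave so differently. It assumes, purely for illustration, that the matches of the low-frequency term are evenly spaced over the doc ID space and that the index holds about 10 million documents (as the wikimedium10m name suggests); the class and method names are hypothetical.

```java
/** Illustrative arithmetic only; assumes evenly spaced matches, which real postings are not. */
final class SparseClauseCost {

  /**
   * Approximate number of doc IDs that must be traversed before k matches of a term with the
   * given document frequency have been seen, or -1 if the term has fewer than k matches.
   */
  static long docIdsBeforeKMatches(long maxDoc, int docFreq, int k) {
    if (k <= 0) {
      throw new IllegalArgumentException("k must be positive");
    }
    if (docFreq < k) {
      return -1; // k matches of this clause are never collected
    }
    // With evenly spaced matches, the k-th match sits at a fraction k/docFreq of the doc ID space.
    return (long) k * maxDoc / docFreq;
  }
}
```

Under these assumptions, a frequency of 110 with k=100 means roughly 9.09 million of 10 million doc IDs are traversed before the 100th match of the term shows up, while a frequency of 90 never yields 100 matches at all.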

jpountz · 2023-11-01T20:50:57Z

@mikemccand FYI I gave a try at adding some interesting boolean queries to nightly benchmarks at mikemccand/luceneutil#240.

github-actions · 2024-01-08T12:23:47Z

This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the dev@lucene.apache.org list. Thank you for your contribution!

jpountz · 2024-01-08T13:58:34Z

I'll reopen when I have time to get back to this. This could be a useful optimization, though the benefit has become lower thanks to other optimizations to disjunctions.

mikemccand · 2024-01-08T14:12:43Z

Maybe we should add OrHighVeryLow to nightly benchy too?

@mikemccand I started looking into this, but my enwiki (enwiki-20120502-lines-with-random-label.txt) seems to have slightly different frequencies compared to the frequencies reported in wikinightly.tasks. Are nightly benchmarks using the same export or a different one? I think it could make sense to have two new tasks: OrHighLow110, where the low-frequency term always has a frequency of 110 > k, and OrHighLow90, where the low-frequency term always has a frequency of 90 < k. These two cases are interesting because in one case it takes very long to collect k matches of the highest scoring clause, and in the other case this never happens.

Very late answer (sorry!): hmm, indeed the frequencies reported in those task files (as comments) are likely from a different (older?) enwiki snapshot. It looks like you muscled through this and added the new tasks to nightly tasks, thanks!

jpountz mentioned this pull request Aug 29, 2023

Add support for recursive graph bisection. #12489

Merged

iter

5ac983c

Merge branch 'main' into bootstrap_min_score

37d332a

Simplify and add tests.

7b90764

jpountz marked this pull request as ready for review September 22, 2023 10:32

jpountz added 2 commits September 26, 2023 16:05

Merge branch 'main' into bootstrap_min_score

c9af5fc

Fix conditions.

9a72984

github-actions bot added the Stale label Jan 8, 2024

jpountz closed this Jan 8, 2024

jpountz wants to merge 6 commits into apache:main from jpountz:bootstrap_min_score
