
Simplify ForDeltaUtil's prefix sum. #14979

Merged
jpountz merged 1 commit into apache:main from jpountz:simplify_prefix_sum on Jul 23, 2025

Conversation

@jpountz (Contributor) commented Jul 21, 2025

I remember benchmarking prefix sums quite extensively, and unrolled loops performed significantly better than their rolled counterparts, both on micro and macro benchmarks:

```java
private static void prefixSum(int[] arr, int len) {
  for (int i = 1; i < len; ++i) {
    arr[i] += arr[i-1];
  }
}
```

However, I recently discovered that rewriting the loop this way performs much better, almost on par with the unrolled variant:

```java
private static void prefixSum(int[] arr, int len) {
  int sum = 0;
  for (int i = 0; i < len; ++i) {
    sum += arr[i];
    arr[i] = sum;
  }
}
```

I haven't checked the assembly yet, but both a JMH benchmark and luceneutil agree that it doesn't introduce a slowdown, so I cut over prefix sums to this approach.
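For context, the unrolled variant referred to above is not shown in the PR description; a minimal sketch of what a 4x-unrolled prefix sum could look like is below (illustrative only — the actual ForDeltaUtil code differs, and typically unrolls more aggressively):

```java
// Illustrative 4x-unrolled prefix sum; NOT the actual ForDeltaUtil code.
// Assumes len is a multiple of 4, as Lucene blocks have a fixed size (e.g. 128).
class PrefixSumSketch {
  static void prefixSumUnrolled4(int[] arr, int len) {
    int sum = 0;
    for (int i = 0; i < len; i += 4) {
      arr[i] = sum += arr[i];         // fold the next delta into the running
      arr[i + 1] = sum += arr[i + 1]; // sum, then store the partial sum back
      arr[i + 2] = sum += arr[i + 2];
      arr[i + 3] = sum += arr[i + 3];
    }
  }
}
```

Unrolling reduces loop overhead, but each statement still depends on the previous one through `sum`, which is why the simple single-accumulator loop can get close to it.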

@jpountz jpountz added this to the 10.3.0 milestone Jul 21, 2025
@jpountz jpountz added the skip-changelog label (Apply to PRs that don't need a changelog entry, stopping the automated changelog check) Jul 21, 2025
@jpountz (Contributor, Author) commented Jul 21, 2025

                            TaskQPS baseline      StdDevQPS my_modified_version      StdDev                Pct diff p-value
                      TermDTSort      395.64      (5.8%)      390.57      (4.0%)   -1.3% ( -10% -    9%) 0.482
                      OrHighRare      303.65      (7.1%)      300.02      (5.2%)   -1.2% ( -12% -   11%) 0.599
                       CountTerm     9330.02      (3.5%)     9238.82      (3.2%)   -1.0% (  -7% -    5%) 0.422
                  FilteredPhrase       32.40      (1.5%)       32.19      (1.4%)   -0.7% (  -3% -    2%) 0.215
                   TermTitleSort       84.19      (4.2%)       83.64      (4.4%)   -0.7% (  -8% -    8%) 0.676
               CombinedOrHighMed       88.57      (0.7%)       88.17      (2.0%)   -0.4% (  -3% -    2%) 0.419
              CombinedOrHighHigh       23.30      (0.8%)       23.21      (3.2%)   -0.4% (  -4% -    3%) 0.638
               FilteredOrHighMed      153.60      (1.0%)      153.10      (1.1%)   -0.3% (  -2% -    1%) 0.401
                     CountPhrase        4.24      (2.1%)        4.23      (3.5%)   -0.3% (  -5% -    5%) 0.772
                  CountOrHighMed      358.51      (1.0%)      357.43      (1.9%)   -0.3% (  -3% -    2%) 0.584
      FilteredOr2Terms2StopWords      147.65      (0.9%)      147.28      (1.2%)   -0.2% (  -2% -    1%) 0.538
                 FilteredPrefix3      151.73      (2.5%)      151.40      (1.7%)   -0.2% (  -4% -    4%) 0.776
              FilteredOrHighHigh       67.41      (1.9%)       67.29      (1.6%)   -0.2% (  -3% -    3%) 0.777
                FilteredOr3Terms      167.05      (0.8%)      166.74      (1.0%)   -0.2% (  -2% -    1%) 0.592
                          OrMany       23.50      (3.0%)       23.46      (2.6%)   -0.2% (  -5% -    5%) 0.862
             And2Terms2StopWords      206.60      (1.4%)      206.31      (1.3%)   -0.1% (  -2% -    2%) 0.770
               TermDayOfYearSort      282.79      (4.1%)      282.54      (3.7%)   -0.1% (  -7% -    8%) 0.950
             CountFilteredPhrase       25.43      (2.3%)       25.41      (2.1%)   -0.1% (  -4% -    4%) 0.922
             FilteredOrStopWords       45.74      (1.9%)       45.73      (1.9%)   -0.0% (  -3% -    3%) 0.990
                AndMedOrHighHigh       88.27      (1.9%)       88.28      (1.7%)    0.0% (  -3% -    3%) 0.980
                  FilteredIntNRQ      297.19      (0.7%)      297.43      (0.8%)    0.1% (  -1% -    1%) 0.783
                 CountOrHighHigh      340.83      (1.8%)      341.27      (2.9%)    0.1% (  -4% -    4%) 0.884
          CountFilteredOrHighMed      149.06      (0.6%)      149.26      (0.7%)    0.1% (  -1% -    1%) 0.559
                    CombinedTerm       39.45      (0.9%)       39.51      (0.5%)    0.1% (  -1% -    1%) 0.586
                  FilteredOrMany       16.55      (1.1%)       16.57      (1.2%)    0.2% (  -2% -    2%) 0.715
                     CountOrMany       29.11      (1.3%)       29.17      (1.6%)    0.2% (  -2% -    3%) 0.721
         CountFilteredOrHighHigh      136.99      (0.8%)      137.25      (1.0%)    0.2% (  -1% -    1%) 0.547
              CombinedAndHighMed       89.73      (0.8%)       89.93      (0.6%)    0.2% (  -1% -    1%) 0.382
                    AndStopWords       47.24      (2.7%)       47.35      (2.1%)    0.2% (  -4% -    5%) 0.789
             CountFilteredOrMany       27.25      (1.2%)       27.32      (1.5%)    0.2% (  -2% -    2%) 0.617
                      AndHighMed      202.48      (2.5%)      202.99      (1.9%)    0.3% (  -3% -    4%) 0.750
              Or2Terms2StopWords      206.67      (1.4%)      207.22      (1.9%)    0.3% (  -3% -    3%) 0.664
                    FilteredTerm      162.69      (2.2%)      163.18      (2.7%)    0.3% (  -4% -    5%) 0.744
                     AndHighHigh       69.16      (3.1%)       69.37      (2.4%)    0.3% (  -5% -    6%) 0.758
               FilteredAnd3Terms      189.84      (1.5%)      190.44      (1.0%)    0.3% (  -2% -    2%) 0.496
     FilteredAnd2Terms2StopWords      214.48      (2.4%)      215.19      (1.2%)    0.3% (  -3% -    4%) 0.631
                       And3Terms      240.86      (2.3%)      241.78      (1.5%)    0.4% (  -3% -    4%) 0.593
                 AndHighOrMedMed       51.39      (1.4%)       51.62      (1.2%)    0.4% (  -2% -    3%) 0.359
             CombinedAndHighHigh       23.50      (1.1%)       23.61      (0.7%)    0.5% (  -1% -    2%) 0.149
                CountAndHighHigh      357.29      (1.8%)      359.20      (2.5%)    0.5% (  -3% -    4%) 0.507
                     OrStopWords       48.86      (2.2%)       49.19      (2.2%)    0.7% (  -3% -    5%) 0.413
                       OrHighMed      258.66      (1.8%)      260.44      (1.6%)    0.7% (  -2% -    4%) 0.272
            FilteredAndStopWords       64.59      (4.0%)       65.06      (2.5%)    0.7% (  -5% -    7%) 0.555
                 CountAndHighMed      307.15      (0.7%)      309.50      (1.3%)    0.8% (  -1% -    2%) 0.044
                      OrHighHigh       78.09      (2.2%)       78.75      (2.1%)    0.8% (  -3% -    5%) 0.280
              FilteredAndHighMed      155.05      (2.9%)      156.38      (1.5%)    0.9% (  -3% -    5%) 0.307
             FilteredAndHighHigh       77.96      (4.5%)       78.64      (2.5%)    0.9% (  -5% -    8%) 0.506
                   TermMonthSort     3341.21      (1.3%)     3373.59      (2.0%)    1.0% (  -2% -    4%) 0.111
                        Or3Terms      230.62      (1.8%)      233.16      (1.7%)    1.1% (  -2% -    4%) 0.090
                            Term      666.11      (5.6%)      677.28      (3.6%)    1.7% (  -7% -   11%) 0.328

@gf2121 (Contributor) left a comment

I played with your benchmark and can reproduce the speedup locally (prefixSumScalarNew).

Benchmark                                        (size)   Mode  Cnt   Score   Error   Units
PrefixSumBenchmark.prefixSumScalar                  128  thrpt    5  17.567 ± 0.117  ops/us
PrefixSumBenchmark.prefixSumScalarInlined           128  thrpt    5  26.228 ± 0.086  ops/us
PrefixSumBenchmark.prefixSumScalarNew               128  thrpt    5  25.864 ± 0.043  ops/us
PrefixSumBenchmark.prefixSumVector128               128  thrpt    5  20.668 ± 0.350  ops/us
PrefixSumBenchmark.prefixSumVector128_v2            128  thrpt    5  26.103 ± 0.176  ops/us
PrefixSumBenchmark.prefixSumVector256               128  thrpt    5  28.632 ± 0.956  ops/us
PrefixSumBenchmark.prefixSumVector256_v2            128  thrpt    5  44.185 ± 0.978  ops/us
PrefixSumBenchmark.prefixSumVector256_v2_inline     128  thrpt    5  43.949 ± 0.225  ops/us
PrefixSumBenchmark.prefixSumVector256_v3            128  thrpt    5  20.108 ± 1.157  ops/us
PrefixSumBenchmark.prefixSumVector512               128  thrpt    5  32.676 ± 0.266  ops/us
PrefixSumBenchmark.prefixSumVector512_v2            128  thrpt    5  57.176 ± 0.413  ops/us

I checked the assembly, and the only difference I can see is that the baseline's unrolled (8x) loop body re-reads the previous value from the array before each addition, while this PR keeps the running sum in a register across iterations.

@jpountz (Contributor, Author) commented Jul 22, 2025

Thanks for checking! For reference here's what it gives on my machine (AMD Ryzen 9 3900X):

Benchmark                                          (size)   Mode  Cnt   Score    Error   Units
PrefixSumBenchmark.prefixSumScalar                    128  thrpt    5  19.081 ±  0.550  ops/us
PrefixSumBenchmark.prefixSumScalar                   1024  thrpt    5   2.180 ±  0.097  ops/us
PrefixSumBenchmark.prefixSumScalarUnrolled            128  thrpt    5  32.679 ±  1.819  ops/us
PrefixSumBenchmark.prefixSumScalarUnrolled           1024  thrpt    5  31.804 ±  0.067  ops/us
PrefixSumBenchmark.prefixSumScalar_v2                 128  thrpt    5  30.677 ±  0.308  ops/us
PrefixSumBenchmark.prefixSumScalar_v2                1024  thrpt    5   3.501 ±  0.035  ops/us
PrefixSumBenchmark.prefixSumVector128                 128  thrpt    5  16.519 ±  0.724  ops/us
PrefixSumBenchmark.prefixSumVector128                1024  thrpt    5   1.845 ±  0.003  ops/us
PrefixSumBenchmark.prefixSumVector128_v2              128  thrpt    5  19.237 ±  0.518  ops/us
PrefixSumBenchmark.prefixSumVector128_v2             1024  thrpt    5   1.883 ±  0.014  ops/us
PrefixSumBenchmark.prefixSumVector256                 128  thrpt    5  23.473 ±  0.164  ops/us
PrefixSumBenchmark.prefixSumVector256                1024  thrpt    5   3.029 ±  0.021  ops/us
PrefixSumBenchmark.prefixSumVector256_v2              128  thrpt    5  27.053 ±  0.129  ops/us
PrefixSumBenchmark.prefixSumVector256_v2             1024  thrpt    5   3.162 ±  0.093  ops/us
PrefixSumBenchmark.prefixSumVector256_v2_unrolled     128  thrpt    5  26.211 ±  0.156  ops/us
PrefixSumBenchmark.prefixSumVector256_v2_unrolled    1024  thrpt    5  25.478 ±  0.185  ops/us
PrefixSumBenchmark.prefixSumVector256_v3              128  thrpt    5  14.690 ±  0.037  ops/us
PrefixSumBenchmark.prefixSumVector256_v3             1024  thrpt    5   1.920 ±  0.057  ops/us
PrefixSumBenchmark.prefixSumVector512                 128  thrpt    5   0.052 ±  0.005  ops/us
PrefixSumBenchmark.prefixSumVector512                1024  thrpt    5   0.006 ±  0.001  ops/us
PrefixSumBenchmark.prefixSumVector512_v2              128  thrpt    5   0.082 ±  0.005  ops/us
PrefixSumBenchmark.prefixSumVector512_v2             1024  thrpt    5   0.010 ±  0.001  ops/us

@jpountz jpountz merged commit a2a9a3b into apache:main Jul 23, 2025
8 checks passed
@jpountz jpountz deleted the simplify_prefix_sum branch July 23, 2025 19:25
jpountz added a commit that referenced this pull request Jul 23, 2025
yossev added a commit to yossev/lucene that referenced this pull request Aug 1, 2025
Replaced the two-step prefix sum loop in `Lucene99HnswVectorsReader` with a single-loop variant that avoids redundant memory access and improves performance.

Previous approach:
- Read first value separately.
- Then used previous buffer element + readVInt().

New approach:
- Accumulates sum in a single pass and assigns directly.

This change follows the suggestion from issue apache#14979 and has the same functional behavior with slightly better efficiency.
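The two styles that commit describes can be sketched as follows (hypothetical code — an `int[]` of deltas stands in for successive `DataInput.readVInt()` calls, and these names are not the actual `Lucene99HnswVectorsReader` internals):

```java
// Hypothetical sketch of the two delta-decoding styles; an int[] of deltas
// stands in for successive readVInt() calls on a DataInput.
class VIntPrefixSumSketch {
  // Old style: read the first value separately, then add the previous
  // buffer element to each subsequent delta (reads the buffer back each time).
  static void decodeOld(long[] buf, int[] deltas) {
    buf[0] = deltas[0];
    for (int i = 1; i < buf.length; i++) {
      buf[i] = buf[i - 1] + deltas[i];
    }
  }

  // New style: accumulate the sum in a local variable in a single pass,
  // assigning directly and avoiding the read-back from the buffer.
  static void decodeNew(long[] buf, int[] deltas) {
    long sum = 0;
    for (int i = 0; i < buf.length; i++) {
      sum += deltas[i];
      buf[i] = sum;
    }
  }
}
```

Both variants produce identical output; only the location of the running sum (buffer element vs. local variable) differs.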
jpountz added a commit to jpountz/lucene that referenced this pull request Sep 5, 2025
This applies the same change as apache#14979 to two more prefix sums. I was not able
to measure speedups (or slowdowns) with luceneutil, but I think it's still
better to write our prefix sums this way.
jpountz added a commit to jpountz/elasticsearch that referenced this pull request Sep 5, 2025
Benchmarks at apache/lucene#14979 suggested that
tracking the sum in a variable performs faster than adding the previous value
to each array element.
jpountz added a commit that referenced this pull request Sep 7, 2025
This applies the same change as #14979 to two more prefix sums. I was not able
to measure speedups (or slowdowns) with luceneutil, but I think it's still
better to write our prefix sums this way.
jpountz added a commit to elastic/elasticsearch that referenced this pull request Sep 9, 2025
Benchmarks at apache/lucene#14979 suggested that
tracking the sum in a variable performs faster than adding the previous value
to each array element.
rjernst pushed a commit to rjernst/elasticsearch that referenced this pull request Sep 9, 2025
Kubik42 pushed a commit to Kubik42/elasticsearch that referenced this pull request Sep 9, 2025

Labels

module:core/codecs, skip-changelog