
Simplify ForDeltaUtil's prefix sum. #14979

Merged
jpountz merged 1 commit into apache:main from jpountz:simplify_prefix_sum on Jul 23, 2025

Conversation

@jpountz (Contributor) commented Jul 21, 2025

I remember benchmarking prefix sums quite extensively, and unrolled loops performed significantly better than their rolled counterparts, both on micro and macro benchmarks:

```java
private static void prefixSum(int[] arr, int len) {
  for (int i = 1; i < len; ++i) {
    arr[i] += arr[i-1];
  }
}
```

However, I recently discovered that rewriting the loop this way performs much better, almost on par with the unrolled variant:

```java
private static void prefixSum(int[] arr, int len) {
  int sum = 0;
  for (int i = 0; i < len; ++i) {
    sum += arr[i];
    arr[i] = sum;
  }
}
```

I haven't checked the assembly yet, but both a JMH benchmark and luceneutil agree that it doesn't introduce a slowdown, so I cut over prefix sums to this approach.
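For context, the unrolled variant referred to above is not shown in the PR description; a minimal sketch of what a 4x-unrolled prefix sum could look like is below (illustrative only — the actual ForDeltaUtil code differs, and typically unrolls more aggressively):

```java
// Illustrative 4x-unrolled prefix sum; NOT the actual ForDeltaUtil code.
// Assumes len is a multiple of 4, as Lucene blocks have a fixed size (e.g. 128).
class PrefixSumSketch {
  static void prefixSumUnrolled4(int[] arr, int len) {
    int sum = 0;
    for (int i = 0; i < len; i += 4) {
      arr[i] = sum += arr[i];         // fold the next delta into the running
      arr[i + 1] = sum += arr[i + 1]; // sum, then store the partial sum back
      arr[i + 2] = sum += arr[i + 2];
      arr[i + 3] = sum += arr[i + 3];
    }
  }
}
```

Unrolling reduces loop overhead, but each statement still depends on the previous one through `sum`, which is why the simple single-accumulator loop can get close to it.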

@jpountz jpountz added this to the 10.3.0 milestone Jul 21, 2025
@jpountz jpountz added the skip-changelog label (Apply to PRs that don't need a changelog entry, stopping the automated changelog check) Jul 21, 2025
@jpountz (Contributor, Author) commented Jul 21, 2025

                            TaskQPS baseline      StdDevQPS my_modified_version      StdDev                Pct diff p-value
                      TermDTSort      395.64      (5.8%)      390.57      (4.0%)   -1.3% ( -10% -    9%) 0.482
                      OrHighRare      303.65      (7.1%)      300.02      (5.2%)   -1.2% ( -12% -   11%) 0.599
                       CountTerm     9330.02      (3.5%)     9238.82      (3.2%)   -1.0% (  -7% -    5%) 0.422
                  FilteredPhrase       32.40      (1.5%)       32.19      (1.4%)   -0.7% (  -3% -    2%) 0.215
                   TermTitleSort       84.19      (4.2%)       83.64      (4.4%)   -0.7% (  -8% -    8%) 0.676
               CombinedOrHighMed       88.57      (0.7%)       88.17      (2.0%)   -0.4% (  -3% -    2%) 0.419
              CombinedOrHighHigh       23.30      (0.8%)       23.21      (3.2%)   -0.4% (  -4% -    3%) 0.638
               FilteredOrHighMed      153.60      (1.0%)      153.10      (1.1%)   -0.3% (  -2% -    1%) 0.401
                     CountPhrase        4.24      (2.1%)        4.23      (3.5%)   -0.3% (  -5% -    5%) 0.772
                  CountOrHighMed      358.51      (1.0%)      357.43      (1.9%)   -0.3% (  -3% -    2%) 0.584
      FilteredOr2Terms2StopWords      147.65      (0.9%)      147.28      (1.2%)   -0.2% (  -2% -    1%) 0.538
                 FilteredPrefix3      151.73      (2.5%)      151.40      (1.7%)   -0.2% (  -4% -    4%) 0.776
              FilteredOrHighHigh       67.41      (1.9%)       67.29      (1.6%)   -0.2% (  -3% -    3%) 0.777
                FilteredOr3Terms      167.05      (0.8%)      166.74      (1.0%)   -0.2% (  -2% -    1%) 0.592
                          OrMany       23.50      (3.0%)       23.46      (2.6%)   -0.2% (  -5% -    5%) 0.862
             And2Terms2StopWords      206.60      (1.4%)      206.31      (1.3%)   -0.1% (  -2% -    2%) 0.770
               TermDayOfYearSort      282.79      (4.1%)      282.54      (3.7%)   -0.1% (  -7% -    8%) 0.950
             CountFilteredPhrase       25.43      (2.3%)       25.41      (2.1%)   -0.1% (  -4% -    4%) 0.922
             FilteredOrStopWords       45.74      (1.9%)       45.73      (1.9%)   -0.0% (  -3% -    3%) 0.990
                AndMedOrHighHigh       88.27      (1.9%)       88.28      (1.7%)    0.0% (  -3% -    3%) 0.980
                  FilteredIntNRQ      297.19      (0.7%)      297.43      (0.8%)    0.1% (  -1% -    1%) 0.783
                 CountOrHighHigh      340.83      (1.8%)      341.27      (2.9%)    0.1% (  -4% -    4%) 0.884
          CountFilteredOrHighMed      149.06      (0.6%)      149.26      (0.7%)    0.1% (  -1% -    1%) 0.559
                    CombinedTerm       39.45      (0.9%)       39.51      (0.5%)    0.1% (  -1% -    1%) 0.586
                  FilteredOrMany       16.55      (1.1%)       16.57      (1.2%)    0.2% (  -2% -    2%) 0.715
                     CountOrMany       29.11      (1.3%)       29.17      (1.6%)    0.2% (  -2% -    3%) 0.721
         CountFilteredOrHighHigh      136.99      (0.8%)      137.25      (1.0%)    0.2% (  -1% -    1%) 0.547
              CombinedAndHighMed       89.73      (0.8%)       89.93      (0.6%)    0.2% (  -1% -    1%) 0.382
                    AndStopWords       47.24      (2.7%)       47.35      (2.1%)    0.2% (  -4% -    5%) 0.789
             CountFilteredOrMany       27.25      (1.2%)       27.32      (1.5%)    0.2% (  -2% -    2%) 0.617
                      AndHighMed      202.48      (2.5%)      202.99      (1.9%)    0.3% (  -3% -    4%) 0.750
              Or2Terms2StopWords      206.67      (1.4%)      207.22      (1.9%)    0.3% (  -3% -    3%) 0.664
                    FilteredTerm      162.69      (2.2%)      163.18      (2.7%)    0.3% (  -4% -    5%) 0.744
                     AndHighHigh       69.16      (3.1%)       69.37      (2.4%)    0.3% (  -5% -    6%) 0.758
               FilteredAnd3Terms      189.84      (1.5%)      190.44      (1.0%)    0.3% (  -2% -    2%) 0.496
     FilteredAnd2Terms2StopWords      214.48      (2.4%)      215.19      (1.2%)    0.3% (  -3% -    4%) 0.631
                       And3Terms      240.86      (2.3%)      241.78      (1.5%)    0.4% (  -3% -    4%) 0.593
                 AndHighOrMedMed       51.39      (1.4%)       51.62      (1.2%)    0.4% (  -2% -    3%) 0.359
             CombinedAndHighHigh       23.50      (1.1%)       23.61      (0.7%)    0.5% (  -1% -    2%) 0.149
                CountAndHighHigh      357.29      (1.8%)      359.20      (2.5%)    0.5% (  -3% -    4%) 0.507
                     OrStopWords       48.86      (2.2%)       49.19      (2.2%)    0.7% (  -3% -    5%) 0.413
                       OrHighMed      258.66      (1.8%)      260.44      (1.6%)    0.7% (  -2% -    4%) 0.272
            FilteredAndStopWords       64.59      (4.0%)       65.06      (2.5%)    0.7% (  -5% -    7%) 0.555
                 CountAndHighMed      307.15      (0.7%)      309.50      (1.3%)    0.8% (  -1% -    2%) 0.044
                      OrHighHigh       78.09      (2.2%)       78.75      (2.1%)    0.8% (  -3% -    5%) 0.280
              FilteredAndHighMed      155.05      (2.9%)      156.38      (1.5%)    0.9% (  -3% -    5%) 0.307
             FilteredAndHighHigh       77.96      (4.5%)       78.64      (2.5%)    0.9% (  -5% -    8%) 0.506
                   TermMonthSort     3341.21      (1.3%)     3373.59      (2.0%)    1.0% (  -2% -    4%) 0.111
                        Or3Terms      230.62      (1.8%)      233.16      (1.7%)    1.1% (  -2% -    4%) 0.090
                            Term      666.11      (5.6%)      677.28      (3.6%)    1.7% (  -7% -   11%) 0.328

@gf2121 (Contributor) left a comment

I played with your benchmark and can reproduce the speedup locally (prefixSumScalarNew).

Benchmark                                        (size)   Mode  Cnt   Score   Error   Units
PrefixSumBenchmark.prefixSumScalar                  128  thrpt    5  17.567 ± 0.117  ops/us
PrefixSumBenchmark.prefixSumScalarInlined           128  thrpt    5  26.228 ± 0.086  ops/us
PrefixSumBenchmark.prefixSumScalarNew               128  thrpt    5  25.864 ± 0.043  ops/us
PrefixSumBenchmark.prefixSumVector128               128  thrpt    5  20.668 ± 0.350  ops/us
PrefixSumBenchmark.prefixSumVector128_v2            128  thrpt    5  26.103 ± 0.176  ops/us
PrefixSumBenchmark.prefixSumVector256               128  thrpt    5  28.632 ± 0.956  ops/us
PrefixSumBenchmark.prefixSumVector256_v2            128  thrpt    5  44.185 ± 0.978  ops/us
PrefixSumBenchmark.prefixSumVector256_v2_inline     128  thrpt    5  43.949 ± 0.225  ops/us
PrefixSumBenchmark.prefixSumVector256_v3            128  thrpt    5  20.108 ± 1.157  ops/us
PrefixSumBenchmark.prefixSumVector512               128  thrpt    5  32.676 ± 0.266  ops/us
PrefixSumBenchmark.prefixSumVector512_v2            128  thrpt    5  57.176 ± 0.413  ops/us

I checked the assembly, and the only difference I can see is that the baseline's unrolled (8x) loop body re-reads the previous value from the array before each addition, while this PR keeps the running sum in a register across iterations.

@jpountz (Contributor, Author) commented Jul 22, 2025

Thanks for checking! For reference here's what it gives on my machine (AMD Ryzen 9 3900X):

Benchmark                                          (size)   Mode  Cnt   Score    Error   Units
PrefixSumBenchmark.prefixSumScalar                    128  thrpt    5  19.081 ±  0.550  ops/us
PrefixSumBenchmark.prefixSumScalar                   1024  thrpt    5   2.180 ±  0.097  ops/us
PrefixSumBenchmark.prefixSumScalarUnrolled            128  thrpt    5  32.679 ±  1.819  ops/us
PrefixSumBenchmark.prefixSumScalarUnrolled           1024  thrpt    5  31.804 ±  0.067  ops/us
PrefixSumBenchmark.prefixSumScalar_v2                 128  thrpt    5  30.677 ±  0.308  ops/us
PrefixSumBenchmark.prefixSumScalar_v2                1024  thrpt    5   3.501 ±  0.035  ops/us
PrefixSumBenchmark.prefixSumVector128                 128  thrpt    5  16.519 ±  0.724  ops/us
PrefixSumBenchmark.prefixSumVector128                1024  thrpt    5   1.845 ±  0.003  ops/us
PrefixSumBenchmark.prefixSumVector128_v2              128  thrpt    5  19.237 ±  0.518  ops/us
PrefixSumBenchmark.prefixSumVector128_v2             1024  thrpt    5   1.883 ±  0.014  ops/us
PrefixSumBenchmark.prefixSumVector256                 128  thrpt    5  23.473 ±  0.164  ops/us
PrefixSumBenchmark.prefixSumVector256                1024  thrpt    5   3.029 ±  0.021  ops/us
PrefixSumBenchmark.prefixSumVector256_v2              128  thrpt    5  27.053 ±  0.129  ops/us
PrefixSumBenchmark.prefixSumVector256_v2             1024  thrpt    5   3.162 ±  0.093  ops/us
PrefixSumBenchmark.prefixSumVector256_v2_unrolled     128  thrpt    5  26.211 ±  0.156  ops/us
PrefixSumBenchmark.prefixSumVector256_v2_unrolled    1024  thrpt    5  25.478 ±  0.185  ops/us
PrefixSumBenchmark.prefixSumVector256_v3              128  thrpt    5  14.690 ±  0.037  ops/us
PrefixSumBenchmark.prefixSumVector256_v3             1024  thrpt    5   1.920 ±  0.057  ops/us
PrefixSumBenchmark.prefixSumVector512                 128  thrpt    5   0.052 ±  0.005  ops/us
PrefixSumBenchmark.prefixSumVector512                1024  thrpt    5   0.006 ±  0.001  ops/us
PrefixSumBenchmark.prefixSumVector512_v2              128  thrpt    5   0.082 ±  0.005  ops/us
PrefixSumBenchmark.prefixSumVector512_v2             1024  thrpt    5   0.010 ±  0.001  ops/us

@jpountz jpountz merged commit a2a9a3b into apache:main Jul 23, 2025
8 checks passed
@jpountz jpountz deleted the simplify_prefix_sum branch July 23, 2025 19:25
jpountz added a commit that referenced this pull request Jul 23, 2025
yossev added a commit to yossev/lucene that referenced this pull request Aug 1, 2025
Replaced the two-step prefix sum loop in `Lucene99HnswVectorsReader` with a single-loop variant that avoids redundant memory access and improves performance.

Previous approach:
- Read first value separately.
- Then used previous buffer element + readVInt().

New approach:
- Accumulates sum in a single pass and assigns directly.

This change follows the suggestion from issue apache#14979 and has the same functional behavior with slightly better efficiency.
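The two styles that commit describes can be sketched as follows (hypothetical code — an `int[]` of deltas stands in for successive `DataInput.readVInt()` calls, and these names are not the actual `Lucene99HnswVectorsReader` internals):

```java
// Hypothetical sketch of the two delta-decoding styles; an int[] of deltas
// stands in for successive readVInt() calls on a DataInput.
class VIntPrefixSumSketch {
  // Old style: read the first value separately, then add the previous
  // buffer element to each subsequent delta (reads the buffer back each time).
  static void decodeOld(long[] buf, int[] deltas) {
    buf[0] = deltas[0];
    for (int i = 1; i < buf.length; i++) {
      buf[i] = buf[i - 1] + deltas[i];
    }
  }

  // New style: accumulate the sum in a local variable in a single pass,
  // assigning directly and avoiding the read-back from the buffer.
  static void decodeNew(long[] buf, int[] deltas) {
    long sum = 0;
    for (int i = 0; i < buf.length; i++) {
      sum += deltas[i];
      buf[i] = sum;
    }
  }
}
```

Both variants produce identical output; only the location of the running sum (buffer element vs. local variable) differs.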
jpountz added a commit to jpountz/lucene that referenced this pull request Sep 5, 2025
This applies the same change as apache#14979 to two more prefix sums. I was not able
to measure speedups (or slowdowns) with luceneutil, but I think it's still
better to write our prefix sums this way.
jpountz added a commit to jpountz/elasticsearch that referenced this pull request Sep 5, 2025
Benchmarks at apache/lucene#14979 suggested that
tracking the sum in a variable performs faster than adding the previous value
to each array element.
jpountz added a commit that referenced this pull request Sep 7, 2025
This applies the same change as #14979 to two more prefix sums. I was not able
to measure speedups (or slowdowns) with luceneutil, but I think it's still
better to write our prefix sums this way.
jpountz added a commit to elastic/elasticsearch that referenced this pull request Sep 9, 2025
Benchmarks at apache/lucene#14979 suggested that
tracking the sum in a variable performs faster than adding the previous value
to each array element.
rjernst pushed a commit to rjernst/elasticsearch that referenced this pull request Sep 9, 2025
Kubik42 pushed a commit to Kubik42/elasticsearch that referenced this pull request Sep 9, 2025

Labels

module:core/codecs, skip-changelog