Simplify ForDeltaUtil's prefix sum. #14979
Merged
jpountz merged 1 commit into apache:main on Jul 23, 2025
Conversation
I remember benchmarking prefix sums quite extensively, and unrolled loops
performed significantly better than their rolled counterpart, both in micro
and macro benchmarks:
```java
private static void prefixSum(int[] arr, int len) {
  for (int i = 1; i < len; ++i) {
    arr[i] += arr[i - 1];
  }
}
```
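For context, a 4x-unrolled version of this loop could look like the sketch below. This is only an illustration of the unrolling idea, not the actual ForDeltaUtil code, and the class and method names are mine:

```java
// Hypothetical 4x-unrolled variant of the rolled loop above.
// Each main-loop iteration performs four dependent additions,
// with a scalar tail loop for the remaining elements.
class PrefixSums {
  static void prefixSumUnrolled(int[] arr, int len) {
    int i = 1;
    // Main loop: process four elements per iteration.
    for (; i + 3 < len; i += 4) {
      arr[i] += arr[i - 1];
      arr[i + 1] += arr[i];
      arr[i + 2] += arr[i + 1];
      arr[i + 3] += arr[i + 2];
    }
    // Tail loop for the leftover elements.
    for (; i < len; ++i) {
      arr[i] += arr[i - 1];
    }
  }
}
```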
However, I recently discovered that rewriting the loop this way performs much
better, and almost on par with the unrolled variant:
```java
private static void prefixSum(int[] arr, int len) {
  int sum = 0;
  for (int i = 0; i < len; ++i) {
    sum += arr[i];
    arr[i] = sum;
  }
}
```
I haven't checked the assembly yet, but both a JMH benchmark and luceneutil
agree that it doesn't introduce a slowdown, so I cut over prefix sums to this
approach.
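For anyone reproducing this locally, here is a quick self-contained check that the two variants compute identical results; the class and method names are illustrative, not taken from the PR:

```java
import java.util.Arrays;
import java.util.Random;

// Sanity check that the rolled loop and the accumulator-based loop
// compute the same prefix sums on random input.
class PrefixSumEquivalence {
  static void prefixSumOld(int[] arr, int len) {
    for (int i = 1; i < len; ++i) {
      arr[i] += arr[i - 1];
    }
  }

  static void prefixSumNew(int[] arr, int len) {
    int sum = 0;
    for (int i = 0; i < len; ++i) {
      sum += arr[i];
      arr[i] = sum;
    }
  }

  public static void main(String[] args) {
    Random random = new Random(42);
    int[] a = new int[128]; // same size as in the benchmark below
    for (int i = 0; i < a.length; ++i) {
      a[i] = random.nextInt(1 << 20);
    }
    int[] b = a.clone();
    prefixSumOld(a, a.length);
    prefixSumNew(b, b.length);
    if (!Arrays.equals(a, b)) {
      throw new AssertionError("prefix sum variants disagree");
    }
    System.out.println("ok");
  }
}
```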
Contributor
Author
gf2121
approved these changes
Jul 22, 2025
Contributor
gf2121
left a comment
I played with your benchmark and can reproduce the speed up locally (prefixSumScalarNew).
```
Benchmark                                        (size)   Mode  Cnt   Score   Error  Units
PrefixSumBenchmark.prefixSumScalar                  128  thrpt    5  17.567 ± 0.117  ops/us
PrefixSumBenchmark.prefixSumScalarInlined           128  thrpt    5  26.228 ± 0.086  ops/us
PrefixSumBenchmark.prefixSumScalarNew               128  thrpt    5  25.864 ± 0.043  ops/us
PrefixSumBenchmark.prefixSumVector128               128  thrpt    5  20.668 ± 0.350  ops/us
PrefixSumBenchmark.prefixSumVector128_v2            128  thrpt    5  26.103 ± 0.176  ops/us
PrefixSumBenchmark.prefixSumVector256               128  thrpt    5  28.632 ± 0.956  ops/us
PrefixSumBenchmark.prefixSumVector256_v2            128  thrpt    5  44.185 ± 0.978  ops/us
PrefixSumBenchmark.prefixSumVector256_v2_inline     128  thrpt    5  43.949 ± 0.225  ops/us
PrefixSumBenchmark.prefixSumVector256_v3            128  thrpt    5  20.108 ± 1.157  ops/us
PrefixSumBenchmark.prefixSumVector512               128  thrpt    5  32.676 ± 0.266  ops/us
PrefixSumBenchmark.prefixSumVector512_v2            128  thrpt    5  57.176 ± 0.413  ops/us
```
I checked the assembly, and the only difference I can see is that the baseline reads from the array at the start of each iteration of the unrolled (8x) loop body, while this PR keeps the running sum in a register across iterations.
Contributor
Author
Thanks for checking! For reference, here's what it gives on my machine (AMD Ryzen 9 3900X):
jpountz
added a commit
that referenced
this pull request
Jul 23, 2025
yossev
added a commit
to yossev/lucene
that referenced
this pull request
Aug 1, 2025
Replaced the two-step prefix sum loop in `Lucene99HnswVectorsReader` with a single-loop variant that avoids redundant memory access and improves performance.
Previous approach:
- Read the first value separately.
- Then used the previous buffer element + readVInt().
New approach:
- Accumulates the sum in a single pass and assigns directly.
This change follows the suggestion from issue apache#14979 and has the same functional behavior with slightly better efficiency.
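The shape of that change can be sketched as follows. The real reader decodes deltas via `DataInput.readVInt()`; here the delta stream is stubbed as an `int[]`, and all identifiers are illustrative rather than the actual reader code:

```java
import java.util.Arrays;

// Illustrative comparison of the two decoding styles: turning a
// stream of deltas (stubbed as an int[]) into cumulative offsets.
class DeltaDecode {
  // Previous approach: seed the first slot separately, then add each
  // delta to the previously written buffer element.
  static long[] twoStep(int[] deltas) {
    long[] offsets = new long[deltas.length];
    if (deltas.length > 0) {
      offsets[0] = deltas[0];
      for (int i = 1; i < deltas.length; ++i) {
        offsets[i] = offsets[i - 1] + deltas[i];
      }
    }
    return offsets;
  }

  // New approach: accumulate the sum in a local variable and assign
  // directly, avoiding the read-back from the buffer.
  static long[] singlePass(int[] deltas) {
    long[] offsets = new long[deltas.length];
    long sum = 0;
    for (int i = 0; i < deltas.length; ++i) {
      sum += deltas[i];
      offsets[i] = sum;
    }
    return offsets;
  }
}
```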
This was referenced Aug 1, 2025
jpountz
added a commit
to jpountz/lucene
that referenced
this pull request
Sep 5, 2025
This applies the same change as apache#14979 to two more prefix sums. I was not able to measure speedups (or slowdowns) with luceneutil, but I think it's still better to write our prefix sums this way.
jpountz
added a commit
to jpountz/elasticsearch
that referenced
this pull request
Sep 5, 2025
Benchmarks at apache/lucene#14979 suggested that tracking the sum in a variable performs faster than adding the previous value to each array element.
jpountz
added a commit
that referenced
this pull request
Sep 7, 2025
This applies the same change as #14979 to two more prefix sums. I was not able to measure speedups (or slowdowns) with luceneutil, but I think it's still better to write our prefix sums this way.
jpountz
added a commit
that referenced
this pull request
Sep 7, 2025
This applies the same change as #14979 to two more prefix sums. I was not able to measure speedups (or slowdowns) with luceneutil, but I think it's still better to write our prefix sums this way.
jpountz
added a commit
to elastic/elasticsearch
that referenced
this pull request
Sep 9, 2025
Benchmarks at apache/lucene#14979 suggested that tracking the sum in a variable performs faster than adding the previous value to each array element.
rjernst
pushed a commit
to rjernst/elasticsearch
that referenced
this pull request
Sep 9, 2025
Benchmarks at apache/lucene#14979 suggested that tracking the sum in a variable performs faster than adding the previous value to each array element.
Kubik42
pushed a commit
to Kubik42/elasticsearch
that referenced
this pull request
Sep 9, 2025
Benchmarks at apache/lucene#14979 suggested that tracking the sum in a variable performs faster than adding the previous value to each array element.