[Native] BBQ Int4 to 1-bit dot product functions#140264

Merged
ldematte merged 48 commits into elastic:main from ldematte:simd/int4-bit1-dot
Jan 21, 2026

Conversation

Contributor
@ldematte ldematte commented Jan 7, 2026

This PR introduces the scaffolding needed for a native dot product of an int4 query vector against 1-bit vectors, in BBQ.
The native function implementations already ship with optimized, platform-specific versions, as the Panama versions were faster than vanilla native implementations.

The speedup ranges from none to 20% for single scoring, and from 20% to 40% for bulk scoring. This is without score adjustment, which is still done on the Java side (with Panama). That will be tackled in a follow-up, and is expected to give a further performance boost.

Benchmarks for various architectures and vendors can be found below.

Closes #128523
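For context, the kernel computes a dot product between a query whose components are 4-bit integers and a document vector whose components are single bits. A common way to implement this (a portable sketch of the general technique, not the PR's actual code; the function name and layout are illustrative) is to store the int4 query as four bit-planes, so each plane contributes an AND + POPCNT against the document bits, weighted by a power of two:

```c
#include <stdint.h>
#include <stddef.h>

/* Illustrative sketch, not the PR's implementation: dot product of an
 * int4 query, stored as 4 bit-planes, against a 1-bit document vector.
 * q[p * words + i] holds bit p of 64 query components; d holds one bit
 * per document component. Plane p contributes popcount(q_p & d) << p,
 * so the total equals sum_j(query[j] * docbit[j]). */
static int64_t int4_bit_dot(const uint64_t *q, const uint64_t *d, size_t words) {
    int64_t sum = 0;
    for (int p = 0; p < 4; p++) {
        int64_t plane = 0;
        for (size_t i = 0; i < words; i++) {
            plane += __builtin_popcountll(q[p * words + i] & d[i]);
        }
        sum += plane << p;
    }
    return sum;
}
```

The native and Panama versions vectorize exactly this AND + POPCNT core; the score adjustment mentioned above happens afterwards, outside this kernel.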

Contributor Author

ldematte commented Jan 9, 2026

Benchmarks on ARM (Mac) show that the "basic" native functions are on-par with the current Panama implementations.
Before:

Benchmark                           (bits)  (dims)  (directoryType)  (implementation)   Mode  Cnt  Score   Error   Units
VectorScorerOSQBenchmark.bulkScore       1    1024             MMAP        VECTORIZED  thrpt    5  31.349 ± 1.977  ops/ms
VectorScorerOSQBenchmark.score           1    1024             MMAP        VECTORIZED  thrpt    5  22.184 ± 1.393  ops/ms

After (this PR):

Benchmark                           (bits)  (dims)  (directoryType)  (implementation)   Mode  Cnt   Score   Error   Units
VectorScorerOSQBenchmark.bulkScore       1    1024             MMAP        VECTORIZED  thrpt    5  39.980 ± 3.399  ops/ms
VectorScorerOSQBenchmark.score           1    1024             MMAP        VECTORIZED  thrpt    5  23.119 ± 1.109  ops/ms

So basically the same for non-bulk, and ~25% better with bulk.

That looks promising, but I'm going to benchmark on x64 before marking this PR as ready.

Contributor Author

ldematte commented Jan 9, 2026

Performance on Intel shows a decrease with respect to Panama, confirming that Panama vectorization enjoys better support on x64 than on ARM.
Before:

Benchmark                           (bits)  (dims)  (directoryType)  (implementation)   Mode  Cnt   Score   Error   Units
VectorScorerOSQBenchmark.bulkScore       1    1024             MMAP            SCALAR  thrpt    5   8.502 ± 0.268  ops/ms
VectorScorerOSQBenchmark.bulkScore       1    1024             MMAP        VECTORIZED  thrpt    5  17.727 ± 0.457  ops/ms
VectorScorerOSQBenchmark.score           1    1024             MMAP            SCALAR  thrpt    5   7.647 ± 0.353  ops/ms
VectorScorerOSQBenchmark.score           1    1024             MMAP        VECTORIZED  thrpt    5  10.336 ± 0.800  ops/ms

After:

Benchmark                           (bits)  (dims)  (directoryType)  (implementation)   Mode  Cnt   Score   Error   Units
VectorScorerOSQBenchmark.bulkScore       1    1024             MMAP        VECTORIZED  thrpt    5  15.985 ± 1.561  ops/ms
VectorScorerOSQBenchmark.score           1    1024             MMAP        VECTORIZED  thrpt    5   8.257 ± 1.131  ops/ms

To prevent a regression, I'm going to tweak the x64 code to use SIMD already in this PR, instead of deferring it to a follow-up.

Contributor Author

ldematte commented Jan 9, 2026

A simple AVX2 implementation brings things roughly on par with the Panama implementation:

Benchmark                           (bits)  (dims)  (directoryType)  (implementation)   Mode  Cnt   Score   Error   Units
VectorScorerOSQBenchmark.bulkScore       1    1024             MMAP        VECTORIZED  thrpt    5  18.506 ± 0.654  ops/ms
VectorScorerOSQBenchmark.score           1    1024             MMAP        VECTORIZED  thrpt    5   9.865 ± 0.538  ops/ms

A more advanced one gives a ~15% performance boost in bulk over Panama.
EDIT: after another small optimization, it's actually a bit more than 20%.

Benchmark                           (bits)  (dims)  (directoryType)  (implementation)   Mode  Cnt   Score   Error   Units
VectorScorerOSQBenchmark.bulkScore       1    1024             MMAP        VECTORIZED  thrpt    5  23.068 ± 1.476  ops/ms
VectorScorerOSQBenchmark.score           1    1024             MMAP        VECTORIZED  thrpt    5  11.331 ± 0.250  ops/ms

I think it's worth using it.
Next I'll check how this behaves on more modern AVX-512 processors. If it's OK, I'm going to finalize this PR.

Contributor Author

ldematte commented Jan 9, 2026

AVX-512 (on c8i) brings mixed news. The Panama version is fast!

Benchmark                           (bits)  (dims)  (directoryType)  (implementation)   Mode  Cnt   Score   Error   Units
VectorScorerOSQBenchmark.bulkScore       1    1024             MMAP            SCALAR  thrpt    5  12.363 ± 0.074  ops/ms
VectorScorerOSQBenchmark.bulkScore       1    1024             MMAP        VECTORIZED  thrpt    5  36.807 ± 0.313  ops/ms
VectorScorerOSQBenchmark.score           1    1024             MMAP            SCALAR  thrpt    5  10.930 ± 0.120  ops/ms
VectorScorerOSQBenchmark.score           1    1024             MMAP        VECTORIZED  thrpt    5  17.632 ± 0.559  ops/ms

That is 3x on bulk and 1.7x in the single case, probably because it uses _mm512_popcnt_epi64 underneath (which makes a lot of sense).
In fact, changing the native AVX-512 implementation to use that same instruction brings us close:

Benchmark                           (bits)  (dims)  (directoryType)  (implementation)   Mode  Cnt   Score   Error   Units
VectorScorerOSQBenchmark.bulkScore       1    1024             MMAP        VECTORIZED  thrpt    5  40.063 ± 0.378  ops/ms
VectorScorerOSQBenchmark.score           1    1024             MMAP        VECTORIZED  thrpt    5  13.395 ± 0.403  ops/ms

A bit better for bulk (~ +10%), a bit worse for single scoring (~ -20%).
Why worse for single scoring? I think we are so close here that the costs of transitioning to native code (e.g. the CAS instruction on MemorySegment) are starting to hurt.

I'll test on AMD next; so far, no big win on x64 despite the optimizations. Panama on x64 is already quite good for simple things like this (in the end, this is just AND + POPCNT). ARM, in our experience, is a different story.
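Since the comment above reduces the kernel to "AND + POPCNT", here is a scalar model of that core (illustrative only, not the PR's code): AVX-512 with VPOPCNTDQ can execute the popcount step on eight 64-bit lanes per _mm512_popcnt_epi64 instruction, which is presumably what gives Panama its edge here.

```c
#include <stdint.h>
#include <stddef.h>

/* Scalar model of the 1-bit kernel: AND the two bit vectors together
 * and count the surviving bits. This is the loop that AVX-512
 * VPOPCNTDQ (_mm512_popcnt_epi64) accelerates, eight 64-bit lanes at
 * a time. */
static int64_t bit_and_popcount(const uint64_t *a, const uint64_t *b, size_t words) {
    int64_t sum = 0;
    for (size_t i = 0; i < words; i++) {
        sum += __builtin_popcountll(a[i] & b[i]);
    }
    return sum;
}
```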

@elasticsearchmachine
Collaborator

Hi @ldematte, I've updated the changelog YAML for you.

Contributor

@ChrisHegarty ChrisHegarty left a comment

LGTM

* Produces a method handle returning the dot product of an int4 (half-byte) vector and
* a bit vector (one bit per element)
*
*
Contributor

extra new line.

@ldematte ldematte requested a review from iverase January 20, 2026 11:35
@ldematte
Contributor Author

Turns out we need ARM optimizations right away: while on Mac the vanilla native implementation is enough, on Graviton it would have introduced a performance regression:

c8g (Graviton 4)

Scalar

Benchmark                           (bits)  (dims)  (directoryType)  (implementation)   Mode  Cnt  Score   Error   Units
VectorScorerOSQBenchmark.bulkScore       1    1024             MMAP            SCALAR  thrpt    5  2.952 ± 0.011  ops/ms
VectorScorerOSQBenchmark.score           1    1024             MMAP            SCALAR  thrpt    5  2.849 ± 0.021  ops/ms

Panama

Benchmark                           (bits)  (dims)  (directoryType)  (implementation)   Mode  Cnt   Score   Error   Units
VectorScorerOSQBenchmark.bulkScore       1    1024             MMAP        VECTORIZED  thrpt    5  17.145 ± 0.071  ops/ms
VectorScorerOSQBenchmark.score           1    1024             MMAP        VECTORIZED  thrpt    5  10.287 ± 0.092  ops/ms

Native builtin

Benchmark                           (bits)  (dims)  (directoryType)  (implementation)   Mode  Cnt  Score   Error   Units
VectorScorerOSQBenchmark.bulkScore       1    1024             MMAP        VECTORIZED  thrpt    5  8.610 ± 0.048  ops/ms
VectorScorerOSQBenchmark.score           1    1024             MMAP        VECTORIZED  thrpt    5  5.634 ± 0.012  ops/ms

NEON

Benchmark                           (bits)  (dims)  (directoryType)  (implementation)   Mode  Cnt   Score   Error   Units
VectorScorerOSQBenchmark.bulkScore       1    1024             MMAP        VECTORIZED  thrpt    5  20.772 ± 0.186  ops/ms
VectorScorerOSQBenchmark.score           1    1024             MMAP        VECTORIZED  thrpt    5   9.377 ± 0.294  ops/ms

@ldematte
Contributor Author

Very similar on Graviton 2 (c6g)

Panama

Benchmark                           (bits)  (dims)  (directoryType)  (implementation)   Mode  Cnt  Score   Error   Units
VectorScorerOSQBenchmark.bulkScore       1    1024             MMAP        VECTORIZED  thrpt    5  5.500 ± 0.036  ops/ms
VectorScorerOSQBenchmark.score           1    1024             MMAP        VECTORIZED  thrpt    5  3.624 ± 0.029  ops/ms

Native builtin

Benchmark                           (bits)  (dims)  (directoryType)  (implementation)   Mode  Cnt  Score   Error   Units
VectorScorerOSQBenchmark.bulkScore       1    1024             MMAP        VECTORIZED  thrpt    5  3.428 ± 0.038  ops/ms
VectorScorerOSQBenchmark.score           1    1024             MMAP        VECTORIZED  thrpt    5  2.520 ± 0.021  ops/ms

NEON

Benchmark                           (bits)  (dims)  (directoryType)  (implementation)   Mode  Cnt  Score   Error   Units
VectorScorerOSQBenchmark.bulkScore       1    1024             MMAP        VECTORIZED  thrpt    5  9.878 ± 0.092  ops/ms
VectorScorerOSQBenchmark.score           1    1024             MMAP        VECTORIZED  thrpt    5  4.408 ± 0.018  ops/ms

@ldematte
Contributor Author

The upside is that we can claim a performance boost on ARM of up to 20% for single scoring, and between 15% and a very nice 70% for bulk.


// Fast AVX2 popcount, based on "Faster Population Counts Using AVX2 Instructions"
// See https://arxiv.org/abs/1611.07612 and https://github.com/WojciechMula/sse-popcount
static inline __m256i dot_bit_256(const __m256i a, const int8_t* b) {
Member


I'm surprised this isn't named popcount in some fashion, given its relation to __builtin_popcount.

Contributor Author

It's a popcount after an AND between the operands, so effectively it's a partial bitwise dot product.
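For readers unfamiliar with the cited technique: the paper's trick is to popcount bytes via a 16-entry nibble table, which vpshufb (_mm256_shuffle_epi8) can apply to 32 bytes in parallel. A portable scalar rendering of the same idea (a sketch for illustration, not the actual dot_bit_256):

```c
#include <stdint.h>
#include <stddef.h>

/* Popcount of each possible 4-bit value; the AVX2 version holds this
 * table in a vector register and indexes it with vpshufb. */
static const uint8_t NIBBLE_POPCNT[16] = {
    0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4
};

/* Scalar sketch of the bitwise dot product: AND the operands, then
 * popcount the result one nibble at a time via the lookup table. */
static int dot_bit_scalar(const uint8_t *a, const uint8_t *b, size_t n) {
    int sum = 0;
    for (size_t i = 0; i < n; i++) {
        uint8_t v = a[i] & b[i];
        sum += NIBBLE_POPCNT[v & 0x0F] + NIBBLE_POPCNT[v >> 4];
    }
    return sum;
}
```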

Member

@thecoop thecoop left a comment


A few questions/comments, but otherwise ok

@ldematte ldematte enabled auto-merge (squash) January 21, 2026 11:38
@ldematte ldematte merged commit 49f7563 into elastic:main Jan 21, 2026
41 checks passed
@ldematte ldematte deleted the simd/int4-bit1-dot branch January 21, 2026 13:17
ldematte added a commit that referenced this pull request Jan 26, 2026
…uct (#141047)

SVE implementation of the dot product between int4 and int1 (single bit). Follows #140264

SVE is supported by e.g. Graviton 3 and 4 processors; it supports variable-length SIMD registers, so on some hardware it should give a performance boost over NEON (which has a fixed width of 128 bits).
Graviton 3 has a register width of 256 bits, but Graviton 4 has a width of just 128 bits, so there we should see little to no performance gain. However, SVE is more future-proof: if upcoming processors have significantly wider SIMD registers, the SVE implementation should take advantage of them automatically.

As expected, the SVE implementation is 40% faster than NEON on Graviton 3 (and 10% faster than the Panama version), and just 5% faster than NEON on Graviton 4 (and no faster than the Panama version).
The bulk operations, which take advantage of inlining + unrolling to give the processor a strong "hint" of which data it could prefetch, are between 15% and 30% faster than the Panama version (on Graviton 3 and 4 respectively).

Relates to: #139750

Labels

>enhancement · :Search Relevance/Vectors (Vector search) · Team:Search Relevance (Meta label for the Search Relevance team in Elasticsearch) · test-arm (Pull Requests that should be tested against arm agents) · v9.4.0


Development

Successfully merging this pull request may close these issues.

Evaluate the performance benefits of an off-heap and/or AVX 512 BBQ vector scorer

6 participants