[Native] BBQ Int4 to 1-bit dot product functions#140264

Merged
ldematte merged 48 commits into elastic:main from ldematte:simd/int4-bit1-dot
Jan 21, 2026

Conversation

Contributor
@ldematte ldematte commented Jan 7, 2026

This PR introduces the scaffolding needed for a native dot product of an int4 query vector against 1-bit vectors, in BBQ.
The native function implementations already ship with optimized, platform-specific versions, as the Panama versions were faster than vanilla native implementations.

The speedup ranges from none to 20% for single scoring, and from 20% to 40% for bulk scoring. This is without score adjustment, which is still done on the Java side (with Panama). That will be tackled in a follow-up, and is expected to give a further performance boost.

Benchmarks for various architectures and vendors can be found below.

Closes #128523
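For context, the kernel computes a dot product between a query whose components are 4-bit integers and a document vector whose components are single bits. A common way to implement this (a portable sketch of the general technique, not the PR's actual code; the function name and layout are illustrative) is to store the int4 query as four bit-planes, so each plane contributes an AND + POPCNT against the document bits, weighted by a power of two:

```c
#include <stdint.h>
#include <stddef.h>

/* Illustrative sketch, not the PR's implementation: dot product of an
 * int4 query, stored as 4 bit-planes, against a 1-bit document vector.
 * q[p * words + i] holds bit p of 64 query components; d holds one bit
 * per document component. Plane p contributes popcount(q_p & d) << p,
 * so the total equals sum_j(query[j] * docbit[j]). */
static int64_t int4_bit_dot(const uint64_t *q, const uint64_t *d, size_t words) {
    int64_t sum = 0;
    for (int p = 0; p < 4; p++) {
        int64_t plane = 0;
        for (size_t i = 0; i < words; i++) {
            plane += __builtin_popcountll(q[p * words + i] & d[i]);
        }
        sum += plane << p;
    }
    return sum;
}
```

The native and Panama versions vectorize exactly this AND + POPCNT core; the score adjustment mentioned above happens afterwards, outside this kernel.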

Contributor Author

ldematte commented Jan 9, 2026

Benchmarks on ARM (Mac) show that the "basic" native functions are on-par with the current Panama implementations.
Before:

Benchmark                           (bits)  (dims)  (directoryType)  (implementation)   Mode  Cnt  Score   Error   Units
VectorScorerOSQBenchmark.bulkScore       1    1024             MMAP        VECTORIZED  thrpt    5  31.349 ± 1.977  ops/ms
VectorScorerOSQBenchmark.score           1    1024             MMAP        VECTORIZED  thrpt    5  22.184 ± 1.393  ops/ms

After (this PR):

Benchmark                           (bits)  (dims)  (directoryType)  (implementation)   Mode  Cnt   Score   Error   Units
VectorScorerOSQBenchmark.bulkScore       1    1024             MMAP        VECTORIZED  thrpt    5  39.980 ± 3.399  ops/ms
VectorScorerOSQBenchmark.score           1    1024             MMAP        VECTORIZED  thrpt    5  23.119 ± 1.109  ops/ms

So basically the same for non-bulk, and ~25% better with bulk.

That looks promising, but I'm going to benchmark on x64 before marking this PR as ready.

Contributor Author

ldematte commented Jan 9, 2026

Performance on Intel shows a decrease with respect to Panama, confirming that Panama vectorization enjoys better support on x64 than on ARM.
Before:

Benchmark                           (bits)  (dims)  (directoryType)  (implementation)   Mode  Cnt   Score   Error   Units
VectorScorerOSQBenchmark.bulkScore       1    1024             MMAP            SCALAR  thrpt    5   8.502 ± 0.268  ops/ms
VectorScorerOSQBenchmark.bulkScore       1    1024             MMAP        VECTORIZED  thrpt    5  17.727 ± 0.457  ops/ms
VectorScorerOSQBenchmark.score           1    1024             MMAP            SCALAR  thrpt    5   7.647 ± 0.353  ops/ms
VectorScorerOSQBenchmark.score           1    1024             MMAP        VECTORIZED  thrpt    5  10.336 ± 0.800  ops/ms

After:

Benchmark                           (bits)  (dims)  (directoryType)  (implementation)   Mode  Cnt   Score   Error   Units
VectorScorerOSQBenchmark.bulkScore       1    1024             MMAP        VECTORIZED  thrpt    5  15.985 ± 1.561  ops/ms
VectorScorerOSQBenchmark.score           1    1024             MMAP        VECTORIZED  thrpt    5   8.257 ± 1.131  ops/ms

To prevent a regression, I'm going to tweak the x64 code to use SIMD already in this PR, instead of deferring it to a follow-up.

Contributor Author

ldematte commented Jan 9, 2026

A simple AVX2 implementation brings things roughly on par with the Panama implementation:

Benchmark                           (bits)  (dims)  (directoryType)  (implementation)   Mode  Cnt   Score   Error   Units
VectorScorerOSQBenchmark.bulkScore       1    1024             MMAP        VECTORIZED  thrpt    5  18.506 ± 0.654  ops/ms
VectorScorerOSQBenchmark.score           1    1024             MMAP        VECTORIZED  thrpt    5   9.865 ± 0.538  ops/ms

A more advanced one gives a ~15% performance boost in bulk over Panama.
EDIT: after another small optimization, it's actually a bit more than 20%.

Benchmark                           (bits)  (dims)  (directoryType)  (implementation)   Mode  Cnt   Score   Error   Units
VectorScorerOSQBenchmark.bulkScore       1    1024             MMAP        VECTORIZED  thrpt    5  23.068 ± 1.476  ops/ms
VectorScorerOSQBenchmark.score           1    1024             MMAP        VECTORIZED  thrpt    5  11.331 ± 0.250  ops/ms

I think it's worth using it.
Next I'll check how this behaves on more modern AVX-512 processors. If it's OK, I'm going to finalize this PR.

Contributor Author

ldematte commented Jan 9, 2026

AVX-512 (on c8i) brings mixed news. The Panama version is fast!

Benchmark                           (bits)  (dims)  (directoryType)  (implementation)   Mode  Cnt   Score   Error   Units
VectorScorerOSQBenchmark.bulkScore       1    1024             MMAP            SCALAR  thrpt    5  12.363 ± 0.074  ops/ms
VectorScorerOSQBenchmark.bulkScore       1    1024             MMAP        VECTORIZED  thrpt    5  36.807 ± 0.313  ops/ms
VectorScorerOSQBenchmark.score           1    1024             MMAP            SCALAR  thrpt    5  10.930 ± 0.120  ops/ms
VectorScorerOSQBenchmark.score           1    1024             MMAP        VECTORIZED  thrpt    5  17.632 ± 0.559  ops/ms

That is 3x on bulk and 1.7x in the single case, probably because it uses _mm512_popcnt_epi64 underneath (which makes a lot of sense).
In fact, changing the native AVX-512 implementation to use that same instruction brings us close:

Benchmark                           (bits)  (dims)  (directoryType)  (implementation)   Mode  Cnt   Score   Error   Units
VectorScorerOSQBenchmark.bulkScore       1    1024             MMAP        VECTORIZED  thrpt    5  40.063 ± 0.378  ops/ms
VectorScorerOSQBenchmark.score           1    1024             MMAP        VECTORIZED  thrpt    5  13.395 ± 0.403  ops/ms

A bit better for bulk (~ +10%), a bit worse for single scoring (~ -20%).
Why worse for single scoring? I think we are so close here that the costs of transitioning to native code (e.g. the CAS instruction on MemorySegment) are starting to hurt.

I'll test on AMD next; so far, no big win on x64 despite the optimizations. Panama on x64 is already quite good for simple things like this (in the end, this is just AND + POPCNT). ARM, in our experience, is a different story.
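Since the comment above reduces the kernel to "AND + POPCNT", here is a scalar model of that core (illustrative only, not the PR's code): AVX-512 with VPOPCNTDQ can execute the popcount step on eight 64-bit lanes per _mm512_popcnt_epi64 instruction, which is presumably what gives Panama its edge here.

```c
#include <stdint.h>
#include <stddef.h>

/* Scalar model of the 1-bit kernel: AND the two bit vectors together
 * and count the surviving bits. This is the loop that AVX-512
 * VPOPCNTDQ (_mm512_popcnt_epi64) accelerates, eight 64-bit lanes at
 * a time. */
static int64_t bit_and_popcount(const uint64_t *a, const uint64_t *b, size_t words) {
    int64_t sum = 0;
    for (size_t i = 0; i < words; i++) {
        sum += __builtin_popcountll(a[i] & b[i]);
    }
    return sum;
}
```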

@elasticsearchmachine
Collaborator

Hi @ldematte, I've updated the changelog YAML for you.

Contributor

@ChrisHegarty ChrisHegarty left a comment

LGTM

* Produces a method handle returning the dot product of an int4 (half-byte) vector and
* a bit vector (one bit per element)
*
*
Contributor

extra new line.

@ldematte ldematte requested a review from iverase January 20, 2026 11:35
@ldematte
Contributor Author

Turns out we need ARM optimizations right away: while on Mac the vanilla native implementation is enough, on Graviton it would have introduced a performance regression:

c8g (Graviton 4)

Scalar

Benchmark                           (bits)  (dims)  (directoryType)  (implementation)   Mode  Cnt  Score   Error   Units
VectorScorerOSQBenchmark.bulkScore       1    1024             MMAP            SCALAR  thrpt    5  2.952 ± 0.011  ops/ms
VectorScorerOSQBenchmark.score           1    1024             MMAP            SCALAR  thrpt    5  2.849 ± 0.021  ops/ms

Panama

Benchmark                           (bits)  (dims)  (directoryType)  (implementation)   Mode  Cnt   Score   Error   Units
VectorScorerOSQBenchmark.bulkScore       1    1024             MMAP        VECTORIZED  thrpt    5  17.145 ± 0.071  ops/ms
VectorScorerOSQBenchmark.score           1    1024             MMAP        VECTORIZED  thrpt    5  10.287 ± 0.092  ops/ms

Native builtin

Benchmark                           (bits)  (dims)  (directoryType)  (implementation)   Mode  Cnt  Score   Error   Units
VectorScorerOSQBenchmark.bulkScore       1    1024             MMAP        VECTORIZED  thrpt    5  8.610 ± 0.048  ops/ms
VectorScorerOSQBenchmark.score           1    1024             MMAP        VECTORIZED  thrpt    5  5.634 ± 0.012  ops/ms

NEON

Benchmark                           (bits)  (dims)  (directoryType)  (implementation)   Mode  Cnt   Score   Error   Units
VectorScorerOSQBenchmark.bulkScore       1    1024             MMAP        VECTORIZED  thrpt    5  20.772 ± 0.186  ops/ms
VectorScorerOSQBenchmark.score           1    1024             MMAP        VECTORIZED  thrpt    5   9.377 ± 0.294  ops/ms

@ldematte
Contributor Author

Very similar on Graviton 2 (c6g)

Panama

Benchmark                           (bits)  (dims)  (directoryType)  (implementation)   Mode  Cnt  Score   Error   Units
VectorScorerOSQBenchmark.bulkScore       1    1024             MMAP        VECTORIZED  thrpt    5  5.500 ± 0.036  ops/ms
VectorScorerOSQBenchmark.score           1    1024             MMAP        VECTORIZED  thrpt    5  3.624 ± 0.029  ops/ms

Native builtin

Benchmark                           (bits)  (dims)  (directoryType)  (implementation)   Mode  Cnt  Score   Error   Units
VectorScorerOSQBenchmark.bulkScore       1    1024             MMAP        VECTORIZED  thrpt    5  3.428 ± 0.038  ops/ms
VectorScorerOSQBenchmark.score           1    1024             MMAP        VECTORIZED  thrpt    5  2.520 ± 0.021  ops/ms

NEON

Benchmark                           (bits)  (dims)  (directoryType)  (implementation)   Mode  Cnt  Score   Error   Units
VectorScorerOSQBenchmark.bulkScore       1    1024             MMAP        VECTORIZED  thrpt    5  9.878 ± 0.092  ops/ms
VectorScorerOSQBenchmark.score           1    1024             MMAP        VECTORIZED  thrpt    5  4.408 ± 0.018  ops/ms

@ldematte
Contributor Author

The upside is that we can claim a performance boost on ARM of up to 20% for single scoring, and between 15% and a very nice 70% for bulk.


// Fast AVX2 popcount, based on "Faster Population Counts Using AVX2 Instructions"
// See https://arxiv.org/abs/1611.07612 and https://github.com/WojciechMula/sse-popcount
static inline __m256i dot_bit_256(const __m256i a, const int8_t* b) {
Member


I'm surprised this isn't named popcount in some fashion, given its relation to __builtin_popcount.

Contributor Author

It's a popcount after an AND between the operands, so effectively it's a partial bitwise dot product.
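For readers unfamiliar with the cited technique: the paper's trick is to popcount bytes via a 16-entry nibble table, which vpshufb (_mm256_shuffle_epi8) can apply to 32 bytes in parallel. A portable scalar rendering of the same idea (a sketch for illustration, not the actual dot_bit_256):

```c
#include <stdint.h>
#include <stddef.h>

/* Popcount of each possible 4-bit value; the AVX2 version holds this
 * table in a vector register and indexes it with vpshufb. */
static const uint8_t NIBBLE_POPCNT[16] = {
    0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4
};

/* Scalar sketch of the bitwise dot product: AND the operands, then
 * popcount the result one nibble at a time via the lookup table. */
static int dot_bit_scalar(const uint8_t *a, const uint8_t *b, size_t n) {
    int sum = 0;
    for (size_t i = 0; i < n; i++) {
        uint8_t v = a[i] & b[i];
        sum += NIBBLE_POPCNT[v & 0x0F] + NIBBLE_POPCNT[v >> 4];
    }
    return sum;
}
```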

Member

@thecoop thecoop left a comment


A few questions/comments, but otherwise ok

@ldematte ldematte enabled auto-merge (squash) January 21, 2026 11:38
@ldematte ldematte merged commit 49f7563 into elastic:main Jan 21, 2026
41 checks passed
@ldematte ldematte deleted the simd/int4-bit1-dot branch January 21, 2026 13:17
ldematte added a commit that referenced this pull request Jan 26, 2026
…uct (#141047)

SVE implementation of the dot product between int4 and int1 (single bit). Follows #140264

SVE is supported by e.g. Graviton 3 and 4 processors; it supports variable-length SIMD registers, so on some hardware it should give a performance boost over NEON (which has a fixed width of 128 bits).
Graviton 3 has a register width of 256 bits, but Graviton 4 has a width of just 128 bits, so there we should see little to no performance gain. However, SVE is more future-proof: if upcoming processors have significantly wider SIMD registers, the SVE implementation should take advantage of them automatically.

As expected, the SVE implementation is 40% faster than NEON on Graviton 3 (and 10% faster than the Panama version), and just 5% faster than NEON on Graviton 4 (and no faster than the Panama version).
The bulk operations, which take advantage of inlining + unrolling to give the processor a strong "hint" of which data it could prefetch, are between 15% and 30% faster than the Panama version (on Graviton 3 and 4 respectively).

Relates to: #139750

Labels

>enhancement · :Search Relevance/Vectors (Vector search) · Team:Search Relevance (Meta label for the Search Relevance team in Elasticsearch) · test-arm (Pull Requests that should be tested against arm agents) · v9.4.0


Development

Successfully merging this pull request may close these issues.

Evaluate the performance benefits of an off-heap and/or AVX 512 BBQ vector scorer

6 participants