[Native] Optimized ARM (SVE) functions for BBQ Int4 to 1-bit dot product#141047
ldematte merged 3 commits into elastic:main
Conversation
Pinging @elastic/es-search-relevance (Team:Search Relevance)
Hi @ldematte, I've created a changelog YAML for you.
const int8_t* a2 = a + mapper(c + 2, offsets) * pitch;
const int8_t* a3 = a + mapper(c + 3, offsets) * pitch;
int64_t subRet0_0 = 0;
This is quite verbose, but experiments showed that this 4x unroll is the best way to communicate the data access pattern to ARM processors.
Because each ARM implementation differs substantially in microarchitecture, explicit prefetch instructions might give more performance, but the effect is strongly processor dependent, with the risk of "overfitting" for a particular processor.
"Hinting" at which data we are going to access next (by unrolling) is more portable, as each processor will figure out what to do with the hardware it has available.
ChrisHegarty
left a comment
Very nice. And no tests!! Which is a good thing - existing tests cover this functionality when run on Graviton 3/4! :-)
Yes.. what we need is to ensure the tests run on the target HW, like we discussed yesterday. For now, I've run the relevant ones manually.
SVE implementation of the dot product between int4 and int1 (single-bit) vectors. Follows #140264.
SVE is supported by e.g. Graviton 3 and 4 processors; it supports variable-length SIMD registers, so on some hardware it should give a performance boost over NEON (which has a fixed width of 128 bits).
Graviton 3 has a register width of 256 bits, but Graviton 4 has a width of just 128 bits, so on the latter we should see little to no performance gain. However, SVE is more future proof: if future processors have significantly wider SIMD registers, the same SVE implementation should automatically take advantage of them.
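For reference, the scalar semantics of the int4-by-1-bit dot product can be sketched as below. This is a hypothetical illustration, not the PR's code: the function name is made up, the int4 values are shown unpacked (one per byte) for clarity, and the LSB-first bit ordering within each byte is an assumption.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical scalar reference: `q` holds one 4-bit value per
 * dimension (unpacked here, each in 0..15); `d` packs one bit per
 * dimension, LSB-first within each byte (an assumed layout). The
 * result is the sum of q[i] over dimensions whose doc bit is set. */
static int32_t dot_int4_bit(const uint8_t* q, const uint8_t* d, size_t dims) {
    int32_t sum = 0;
    for (size_t i = 0; i < dims; i++) {
        uint8_t bit = (d[i >> 3] >> (i & 7)) & 1;  /* doc bit for dim i */
        sum += q[i] * bit;                          /* add q[i] iff bit set */
    }
    return sum;
}
```

Since the multiplier is a single bit, the whole product reduces to a masked sum, which is what makes the SIMD versions (NEON, SVE, Panama) so effective here.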
As expected, the SVE implementation is 40% faster than NEON on Graviton 3 (and 10% faster than the Panama version), but only 5% faster than NEON on Graviton 4 (and no faster than the Panama version).
The bulk operations, which take advantage of inlining + unrolling to give the processor a strong "hint" about which data it could prefetch, are between 15% and 30% faster than the Panama version (on Graviton 3 and 4 respectively).
Relates to: #139750
Benchmarks: results attached for Graviton 3 and Graviton 4, comparing the Panama, NEON, and SVE implementations.