
[Native] Optimized ARM (SVE) functions for BBQ Int4 to 1-bit dot product #141047

Merged
ldematte merged 3 commits into elastic:main from ldematte:simd/int4-1-dot-sve
Jan 26, 2026

@ldematte ldematte commented Jan 21, 2026

SVE implementation of the dot product between int4 and int1 (single-bit) vectors. Follows #140264.
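
For readers unfamiliar with the operation, here is a minimal scalar sketch of an int4 to 1-bit dot product. It is not the code from this PR. It assumes the int4 query is stored as four consecutive bit planes of `length` bytes each (plane `p` holds bit `p` of every query component) and that the document vector is one packed bit per dimension; that layout, the function name `dot_int4_bit_ref`, and the helper `popcount8` are assumptions made for illustration.

```cpp
#include <cstddef>
#include <cstdint>

// Counts the set bits of one byte (Kernighan's method: each iteration
// clears the lowest set bit).
static inline int popcount8(uint8_t v) {
    int count = 0;
    while (v != 0) {
        v &= static_cast<uint8_t>(v - 1);
        ++count;
    }
    return count;
}

// Scalar reference for an int4 x 1-bit dot product.
// ASSUMED layout (hypothetical, not taken from the PR):
//   query: 4 * length bytes, plane p at query + p * length
//   doc:   length bytes, one bit per dimension
int64_t dot_int4_bit_ref(const uint8_t* query, const uint8_t* doc, size_t length) {
    int64_t total = 0;
    for (int plane = 0; plane < 4; ++plane) {
        const uint8_t* q = query + static_cast<size_t>(plane) * length;
        int64_t planeSum = 0;
        for (size_t i = 0; i < length; ++i) {
            // AND selects the dimensions where both bits are set.
            planeSum += popcount8(static_cast<uint8_t>(q[i] & doc[i]));
        }
        // Bit `plane` of the query contributes with weight 2^plane.
        total += planeSum << plane;
    }
    return total;
}
```

With a one-byte document `0x05` (dimensions 0 and 2 set) and query planes `{0x01, 0x04, 0x00, 0x05}` (query values 9 and 10 in those dimensions), the function returns 9 + 10 = 19.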

SVE is supported by e.g. Graviton 3 and 4 processors; it supports variable-length SIMD registers, so on some hardware it should give a performance boost over NEON (which supports a fixed width of 128 bits).
Graviton 3 has a register width of 256 bits, but Graviton 4 has a width of just 128 bits, so there we should see little to no performance gain. However, SVE is more future-proof: if the next processors have significantly wider SIMD registers, the SVE implementation should take advantage of them without changes.
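
To illustrate why the same code can run unchanged on 128-bit and 256-bit hardware, here is a scalar model of a vector-length-agnostic loop. It uses no SVE intrinsics and is not the PR's implementation: the `vectorBytes` parameter stands in for the register width that real SVE code would query at run time, and the function name is hypothetical.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>

// Scalar MODEL of a vector-length-agnostic AND + popcount loop.
// `vectorBytes` is a hypothetical stand-in for the hardware register width
// in bytes (16 for 128-bit registers, 32 for 256-bit registers).
int64_t and_popcount_vla_model(const uint8_t* a, const uint8_t* b,
                               size_t length, size_t vectorBytes) {
    if (vectorBytes == 0) {
        return 0; // guard against a zero step, which would never terminate
    }
    int64_t sum = 0;
    for (size_t i = 0; i < length; i += vectorBytes) {
        // Emulates a predicate: only lanes with i + lane < length are active,
        // so the last partial chunk needs no separate tail loop.
        const size_t active = std::min(vectorBytes, length - i);
        for (size_t lane = 0; lane < active; ++lane) {
            uint8_t v = static_cast<uint8_t>(a[i + lane] & b[i + lane]);
            while (v != 0) { // count set bits of this lane
                v &= static_cast<uint8_t>(v - 1);
                ++sum;
            }
        }
    }
    return sum;
}
```

The result is the same for any `vectorBytes`; only the number of outer iterations changes, which is where wider registers would save work.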

As expected, on the single-vector `score` benchmark the SVE implementation is about 40% faster than NEON on Graviton 3 (and about 10% faster than the Panama version), and less than 10% faster than NEON on Graviton 4 (and no faster than the Panama version).
The bulk operations, which take advantage of inlining + unrolling to give the processor a strong "hint" of which data it could prefetch, are between 15% and 30% faster than the Panama version (on Graviton 3 and 4 respectively).

Relates to: #139750

Benchmarks:

Graviton 3

Panama

Benchmark                           (bits)  (dims)  (directoryType)  (implementation)   Mode  Cnt   Score   Error   Units
VectorScorerOSQBenchmark.bulkScore       1    1024             MMAP        VECTORIZED  thrpt    5  15.462 ± 0.075  ops/ms
VectorScorerOSQBenchmark.score           1    1024             MMAP        VECTORIZED  thrpt    5   8.226 ± 0.052  ops/ms

NEON

Benchmark                           (bits)  (dims)  (directoryType)  (implementation)   Mode  Cnt   Score   Error   Units
VectorScorerOSQBenchmark.bulkScore       1    1024             MMAP        VECTORIZED  thrpt    5  16.349 ± 0.664  ops/ms
VectorScorerOSQBenchmark.score           1    1024             MMAP        VECTORIZED  thrpt    5   6.512 ± 0.061  ops/ms

SVE

Benchmark                           (bits)  (dims)  (directoryType)  (implementation)   Mode  Cnt   Score   Error   Units
VectorScorerOSQBenchmark.bulkScore       1    1024             MMAP        VECTORIZED  thrpt    5  18.031 ± 0.166  ops/ms
VectorScorerOSQBenchmark.score           1    1024             MMAP        VECTORIZED  thrpt    5   9.100 ± 0.200  ops/ms

Graviton 4

Panama

Benchmark                           (bits)  (dims)  (directoryType)  (implementation)   Mode  Cnt   Score   Error   Units
VectorScorerOSQBenchmark.bulkScore       1    1024             MMAP        VECTORIZED  thrpt    5  17.145 ± 0.071  ops/ms
VectorScorerOSQBenchmark.score           1    1024             MMAP        VECTORIZED  thrpt    5  10.287 ± 0.092  ops/ms

NEON

Benchmark                           (bits)  (dims)  (directoryType)  (implementation)   Mode  Cnt   Score   Error   Units
VectorScorerOSQBenchmark.bulkScore       1    1024             MMAP        VECTORIZED  thrpt    5  20.772 ± 0.186  ops/ms
VectorScorerOSQBenchmark.score           1    1024             MMAP        VECTORIZED  thrpt    5   9.377 ± 0.294  ops/ms

SVE

Benchmark                           (bits)  (dims)  (directoryType)  (implementation)   Mode  Cnt   Score   Error   Units
VectorScorerOSQBenchmark.bulkScore       1    1024             MMAP        VECTORIZED  thrpt    5  22.836 ± 0.619  ops/ms
VectorScorerOSQBenchmark.score           1    1024             MMAP        VECTORIZED  thrpt    5  10.144 ± 0.064  ops/ms
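
The relative gains can be recomputed from the scores above with a small helper. This is only a sketch of the arithmetic; the function name `speedup_percent` is made up for illustration.

```cpp
// Relative throughput gain of `candidate` over `baseline`, in percent.
// Both arguments are throughputs (ops/ms), so higher is better.
double speedup_percent(double candidate, double baseline) {
    return (candidate / baseline - 1.0) * 100.0;
}
```

For example, `speedup_percent(9.100, 6.512)` (SVE vs NEON `score` on Graviton 3) gives roughly 39.7, `speedup_percent(10.144, 9.377)` (the same comparison on Graviton 4) gives roughly 8.2, and `speedup_percent(22.836, 17.145)` (SVE vs Panama `bulkScore` on Graviton 4) gives roughly 33.2.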

@ldematte ldematte requested a review from a team as a code owner January 21, 2026 13:29
@ldematte ldematte added the `test-arm`, `>enhancement` and `:Search Relevance/Vectors` labels Jan 21, 2026
@elasticsearchmachine elasticsearchmachine added the `Team:Search Relevance` and `v9.4.0` labels Jan 21, 2026
@elasticsearchmachine
Collaborator

Pinging @elastic/es-search-relevance (Team:Search Relevance)

@elasticsearchmachine
Collaborator

Hi @ldematte, I've created a changelog YAML for you.

const int8_t* a2 = a + mapper(c + 2, offsets) * pitch;
const int8_t* a3 = a + mapper(c + 3, offsets) * pitch;

int64_t subRet0_0 = 0;
Contributor Author

This is quite verbose, but experiments showed that this x4 unroll is the best way to tell ARM processors about data access patterns.
Because ARM implementations differ strongly in architecture, explicit prefetch instructions might give more performance, but the gain is strongly processor-dependent, with the risk of "overfitting" to a particular processor.
"Hinting" which data we are going to access next (by unrolling) is more "portable", as each processor will figure out what to do with the hardware it has available.
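
A simplified scalar sketch of the x4 unrolling idea follows. It is not the PR's code: it scores a single bit plane, uses a fixed `pitch` between document vectors instead of the `mapper`/`offsets` indirection in the diff, and the names `bulk_and_popcount_x4` and `popcount_byte` are hypothetical.

```cpp
#include <cstddef>
#include <cstdint>

// Counts the set bits of one byte.
static inline int popcount_byte(uint8_t v) {
    int count = 0;
    while (v != 0) {
        v &= static_cast<uint8_t>(v - 1);
        ++count;
    }
    return count;
}

// Scores `query` against `count` document vectors laid out `pitch` bytes apart.
// Four documents are processed per outer iteration with independent
// accumulators, so the four read streams are visible to the processor at once.
void bulk_and_popcount_x4(const uint8_t* docs, const uint8_t* query,
                          size_t length, size_t pitch, size_t count,
                          int64_t* results) {
    size_t c = 0;
    for (; c + 4 <= count; c += 4) {
        const uint8_t* d0 = docs + (c + 0) * pitch;
        const uint8_t* d1 = docs + (c + 1) * pitch;
        const uint8_t* d2 = docs + (c + 2) * pitch;
        const uint8_t* d3 = docs + (c + 3) * pitch;
        int64_t s0 = 0, s1 = 0, s2 = 0, s3 = 0;
        for (size_t i = 0; i < length; ++i) {
            const uint8_t q = query[i]; // loaded once, reused for four documents
            s0 += popcount_byte(static_cast<uint8_t>(q & d0[i]));
            s1 += popcount_byte(static_cast<uint8_t>(q & d1[i]));
            s2 += popcount_byte(static_cast<uint8_t>(q & d2[i]));
            s3 += popcount_byte(static_cast<uint8_t>(q & d3[i]));
        }
        results[c + 0] = s0;
        results[c + 1] = s1;
        results[c + 2] = s2;
        results[c + 3] = s3;
    }
    // Remaining documents (fewer than four) are scored one at a time.
    for (; c < count; ++c) {
        const uint8_t* d = docs + c * pitch;
        int64_t s = 0;
        for (size_t i = 0; i < length; ++i) {
            s += popcount_byte(static_cast<uint8_t>(query[i] & d[i]));
        }
        results[c] = s;
    }
}
```

The unrolled body also reuses each query byte across four documents, which is one reason a bulk path can beat four separate single-vector calls.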

Contributor

@ChrisHegarty ChrisHegarty left a comment

Very nice. And no tests!! which is a good thing - existing tests cover this functionality when run on Graviton 3/4! :-)

@ldematte
Contributor Author

> And no tests!! which is a good thing - existing tests cover this functionality when run on Graviton 3/4! :-)

Yes.. what we need is to ensure the tests run on the target HW, like we discussed yesterday. For now, I've run the relevant ones manually.

@ldematte ldematte merged commit 9a349b3 into elastic:main Jan 26, 2026
41 checks passed
@ldematte ldematte deleted the simd/int4-1-dot-sve branch January 26, 2026 08:32