[Native] SIMD implementations for native int4 vector scoring#144429

Merged
ldematte merged 40 commits into elastic:main from ldematte:native/vec-i4-simd
Mar 18, 2026

Conversation

@ldematte
Contributor

@ldematte ldematte commented Mar 17, 2026

Following #144215

This PR introduces SIMD implementations for the native int4 scorers. These are now consistently faster than the Panama implementation (~2x on x64 and ~2.6x on ARM), so we can use them. For this reason, the PR also reverts 6c94b3f to re-enable the new scorers in Elasticsearch.

To get the best performance on ARM, this PR also bumps the target architecture to -march=armv8.2-a+dotprod, and updates vec_caps accordingly (minimum NEON + dotprod).
This enables us to use vdotq_u32, which gives a significant boost, between +40% and +80%, bumping the gain of our native implementation from 1.5x over Lucene to 2.5x over Lucene for int4 (on Graviton4).
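To make the gain from vdotq_u32 concrete, here is a portable scalar model of the AArch64 UDOT semantics (an illustration of the instruction, not the PR's actual kernel): each 32-bit accumulator lane receives the dot product of the corresponding group of four unsigned 8-bit lanes, so a single instruction retires 16 multiplies plus the accumulating adds that previously took a vmull_u8/vpadalq_u16 sequence.

```cpp
#include <array>
#include <cstdint>

// Scalar model of AArch64 vdotq_u32 (UDOT): acc[lane] += dot product of the
// four uint8 elements in the corresponding group of a and b. One instruction
// performs 16 widening multiplies and all the accumulation, which is where
// the +40% to +80% gain over the widening-multiply sequence comes from.
std::array<uint32_t, 4> udot_model(std::array<uint32_t, 4> acc,
                                   const std::array<uint8_t, 16>& a,
                                   const std::array<uint8_t, 16>& b) {
    for (int lane = 0; lane < 4; ++lane) {
        for (int i = 0; i < 4; ++i) {
            acc[lane] += uint32_t(a[lane * 4 + i]) * uint32_t(b[lane * 4 + i]);
        }
    }
    return acc;
}
```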

The drawback is that we drop native support for ARMv8.0. However, the only cloud instance type that lacks the dot product instructions is Graviton1 (AWS A1 instances, Cortex-A72). It is extremely unlikely that somebody would run Elasticsearch on that instance type for vector search workloads, and with the change in this PR vec_caps returns 0 on that hardware, gracefully falling back to the Lucene scorers.
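The fallback behavior can be sketched as a simple dispatch (names here are hypothetical, illustrating the vec_caps contract described above rather than the actual Elasticsearch code): a capability value of 0 means the required SIMD features are absent and the Lucene/Panama scorer is used instead.

```cpp
// Illustrative dispatch on a vec_caps-style probe (names hypothetical):
// 0 means the required features (NEON + dotprod on ARM) are missing,
// so scoring falls back to the Lucene/Panama implementation.
enum class Scorer { LUCENE_PANAMA, NATIVE_SIMD };

Scorer choose_scorer(int vec_caps) {
    return vec_caps > 0 ? Scorer::NATIVE_SIMD : Scorer::LUCENE_PANAMA;
}
```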

Bulk scorer

VectorScorerInt4BulkBenchmark, dims=1024, bulkSize=32, DOT_PRODUCT, ops/s (higher is better)

AWS ARM (Graviton, c8gd.xlarge)

| Benchmark | LUCENE | NATIVE | NATIVE vs Lucene |
|---|---:|---:|---:|
| scoreMultipleRandomBulk | 6,935 | 18,508 | +167% |
| scoreMultipleSequentialBulk | 6,981 | 18,791 | +169% |
| scoreQueryMultipleRandomBulk | 6,948 | 18,194 | +162% |

AWS AMD (c8a.xlarge)

numVectors=1500

| Benchmark | LUCENE | NATIVE | NATIVE vs Lucene |
|---|---:|---:|---:|
| scoreMultipleRandomBulk | 9,040 | 17,145 | +90% |
| scoreMultipleSequentialBulk | 9,027 | 17,810 | +97% |
| scoreQueryMultipleRandomBulk | 10,339 | 20,034 | +94% |

numVectors=128

| Benchmark | LUCENE | NATIVE | NATIVE vs Lucene |
|---|---:|---:|---:|
| scoreMultipleRandomBulk | 109,992 | 221,617 | +101% |
| scoreMultipleSequentialBulk | 109,916 | 217,933 | +98% |
| scoreQueryMultipleRandomBulk | 125,273 | 234,073 | +87% |

numVectors=130000

| Benchmark | LUCENE | NATIVE | NATIVE vs Lucene |
|---|---:|---:|---:|
| scoreMultipleRandomBulk | 636 | 1,040 | +63% |
| scoreMultipleSequentialBulk | 705 | 1,292 | +83% |
| scoreQueryMultipleRandomBulk | 617 | 1,149 | +86% |

ldematte added 30 commits March 12, 2026 14:51
Add JMH benchmarks for int4 (PACKED_NIBBLE) quantized vector scoring
to establish performance baselines before adding native C++ support.

Three benchmark levels mirror the existing int7u suite:
- VectorScorerInt4OperationBenchmark: raw dot product
- VectorScorerInt4Benchmark: single-score with correction math
- VectorScorerInt4BulkBenchmark: multi-vector scoring patterns
  including bulkScore API

Each benchmark compares SCALAR (plain loop) vs LUCENE (Panama SIMD)
implementations for DOT_PRODUCT and EUCLIDEAN similarity.

Made-with: Cursor
Introduce scalar C++ implementations for int4 packed-nibble
dot product (single, bulk, bulk-with-offsets) and wire them
through JdkVectorLibrary, Similarities, and the new
Int4VectorScorerSupplier and Int4VectorScorer classes.
Both the HNSW graph-build (scorer supplier) and query-time
(scorer) paths in ES94ScalarQuantizedVectorsFormat now use
native int4 scoring when available.

Made-with: Cursor
Deduplicate correction logic between Int4VectorScorer and
Int4VectorScorerSupplier into Int4Corrections. Both classes
are now final (no sealed subclass hierarchy) and resolve the
similarity-specific correction via method references stored
at construction time.

Made-with: Cursor
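The Int4Corrections commit above resolves the similarity-specific correction once, at construction time, via stored method references. A C++ analogue of that pattern (names and correction math are placeholders, not the actual Elasticsearch code) stores a function pointer so the per-score path carries no branch on the similarity type:

```cpp
#include <cstdint>

// C++ analogue of the Int4Corrections pattern: resolve the similarity-
// specific correction once at construction, store it as a function pointer.
// The correction formulas below are placeholders, not the real math.
struct Int4Corrections {
    using CorrectionFn = float (*)(int32_t rawDot, float scale);
    CorrectionFn correct;

    explicit Int4Corrections(bool euclidean)
        : correct(euclidean ? &euclideanCorrection : &dotProductCorrection) {}

    static float dotProductCorrection(int32_t rawDot, float scale) {
        return rawDot * scale;            // placeholder correction
    }
    static float euclideanCorrection(int32_t rawDot, float scale) {
        return 1.0f / (1.0f + rawDot * scale); // placeholder correction
    }
};
```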
The native Int4 implementation is not yet competitive
with Lucene's Panama SIMD scorer. Revert the production
integration until performance is improved. The native
implementation, tests, and benchmarks remain intact.

Made-with: Cursor
Replace the scalar C++ int4 dot product implementation with
SIMD-vectorized versions targeting NEON (aarch64) and AVX2
(amd64), to match Lucene's Panama vectorized performance.

aarch64: uses vmull_u8 for widening 8->16 bit multiply and
vpadalq_u16 for pairwise accumulation into 32-bit, with 4
independent accumulators to break dependency chains. Bulk
operations batch 4 vectors at a time.

amd64: uses _mm256_cvtepu8_epi16 for zero-extension and
_mm256_madd_epi16 for multiply-accumulate into 32-bit, with
2 accumulators. Bulk operations batch 2 vectors with explicit
cache-line prefetching of the next batch.

Both architectures extract nibbles via shift+mask, load the
two unpacked halves (high/low nibble targets), and share
query vector loads across batched document vectors.

Made-with: Cursor
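The shift+mask nibble extraction described in the commit above can be sketched as a scalar reference (the exact packed layout here is an assumption for illustration: element i in the low nibble and element i + dims/2 in the high nibble of byte i; the SIMD kernels do the same extraction 16 or 32 bytes at a time):

```cpp
#include <cstddef>
#include <cstdint>

// Scalar sketch of a packed-nibble (int4) dot product. Layout assumed for
// illustration: the document vector packs two 4-bit values per byte, element
// i in the low nibble and element i + dims/2 in the high nibble of byte i.
// The query is unpacked, one value per byte.
uint32_t int4_dot_scalar(const uint8_t* query, const uint8_t* doc_packed,
                         size_t dims) {
    const size_t half = dims / 2;
    uint32_t sum = 0;
    for (size_t i = 0; i < half; ++i) {
        uint32_t lo = doc_packed[i] & 0x0F;        // element i
        uint32_t hi = (doc_packed[i] >> 4) & 0x0F; // element i + half
        sum += lo * query[i] + hi * query[half + i];
    }
    return sum;
}
```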
Use doc_*/query_* naming consistently across both
aarch64 and amd64 implementations instead of opaque
abbreviations like p, hi, u_hi, as.

Made-with: Cursor
@ldematte ldematte requested a review from thecoop March 17, 2026 18:19
@ldematte ldematte requested a review from a team as a code owner March 17, 2026 18:19
@elasticsearchmachine elasticsearchmachine added Team:Search Relevance Meta label for the Search Relevance team in Elasticsearch v9.4.0 labels Mar 17, 2026
@elasticsearchmachine
Collaborator

Pinging @elastic/es-search-relevance (Team:Search Relevance)

@ldematte ldematte requested a review from ChrisHegarty March 18, 2026 07:17
@ldematte
Contributor Author

I have a different implementation for ARM which shows 2.5x improvements over Panama, but with a caveat. I'll open a conversation on slack about it.

Contributor

@ChrisHegarty ChrisHegarty left a comment


Very nice! LGTM

@elasticsearchmachine
Collaborator

Hi @ldematte, I've created a changelog YAML for you.

@ldematte
Contributor Author

ldematte commented Mar 18, 2026

CI is green, don't know why GH was not updated; merging (see https://buildkite.com/elastic/elasticsearch-pull-request/builds/129428)

@ldematte ldematte merged commit 7aacd5f into elastic:main Mar 18, 2026
34 of 36 checks passed
@ldematte ldematte deleted the native/vec-i4-simd branch March 18, 2026 15:23
michalborek pushed a commit to michalborek/elasticsearch that referenced this pull request Mar 23, 2026
ldematte added a commit that referenced this pull request Mar 25, 2026
This PR introduces some smaller optimizations to the x64 int4 implementations.

Now that #144429 is merged, I resumed #109238 and the detailed analysis I did there, and discovered that we were not using the optimal set of instructions.

The older PR used an inner loop that was at the theoretical maximum for most processors, with a throughput of 32 elements per CPU cycle. I applied the same scheme to the new implementations introduced in the previous PR; the bulk scoring paths show significant gains: +19% to +25% on the bulk variants, and +9% to +19% on the non-bulk variants.

Also, I implemented an AVX-512 variant; this should give us an additional theoretical 2x speedup in the inner calculation loop (over the AVX2 implementation), which should translate to a 12-50% throughput increase depending on vector dimensions (higher dimensions mean more time spent in the inner loop).
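The AVX2 inner loop this follow-up tunes pairs _mm256_cvtepu8_epi16 with _mm256_madd_epi16. For reference, a portable scalar model of madd's lane semantics (an illustration of the instruction, not the actual kernel): each 32-bit output lane is the sum of two adjacent signed 16-bit products, and since zero-extended int4 values fit comfortably in 16 bits, no product can overflow.

```cpp
#include <array>
#include <cstdint>

// Scalar model of _mm256_madd_epi16: each int32 output lane is the sum of
// two adjacent signed int16 products, so one instruction retires 16
// multiplies and 8 adds of the int4 inner loop.
std::array<int32_t, 8> madd_epi16_model(const std::array<int16_t, 16>& a,
                                        const std::array<int16_t, 16>& b) {
    std::array<int32_t, 8> out{};
    for (int lane = 0; lane < 8; ++lane) {
        out[lane] = int32_t(a[2 * lane]) * b[2 * lane]
                  + int32_t(a[2 * lane + 1]) * b[2 * lane + 1];
    }
    return out;
}
```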
seanzatzdev pushed a commit to seanzatzdev/elasticsearch that referenced this pull request Mar 27, 2026