[Native] SIMD implementations for native int4 vector scoring#144429

Merged
ldematte merged 40 commits into elastic:main from ldematte:native/vec-i4-simd
Mar 18, 2026

Conversation

@ldematte
Contributor

@ldematte ldematte commented Mar 17, 2026

Following #144215

This PR introduces SIMD implementations for the native int4 scorers. These are now consistently faster than the Panama implementation (~2x on x64 and ~2.6x on ARM), so we can use them. For this reason, the PR also reverts 6c94b3f to re-enable the new scorers in Elasticsearch.

To get the best performance on ARM, this PR also bumps the target architecture to -march=armv8.2-a+dotprod, and updates vec_caps accordingly (minimum NEON + dotprod).
This enables us to use vdotq_u32, which gives a significant boost, between +40% and +80%, bumping the gain of our native implementation from 1.5x over Lucene to 2.5x over Lucene for int4 (on Graviton4).
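To make the gain from vdotq_u32 concrete, here is a portable scalar model of the AArch64 UDOT semantics (an illustration of the instruction, not the PR's actual kernel): each 32-bit accumulator lane receives the dot product of the corresponding group of four unsigned 8-bit lanes, so a single instruction retires 16 multiplies plus the accumulating adds that previously took a vmull_u8/vpadalq_u16 sequence.

```cpp
#include <array>
#include <cstdint>

// Scalar model of AArch64 vdotq_u32 (UDOT): acc[lane] += dot product of the
// four uint8 elements in the corresponding group of a and b. One instruction
// performs 16 widening multiplies and all the accumulation, which is where
// the +40% to +80% gain over the widening-multiply sequence comes from.
std::array<uint32_t, 4> udot_model(std::array<uint32_t, 4> acc,
                                   const std::array<uint8_t, 16>& a,
                                   const std::array<uint8_t, 16>& b) {
    for (int lane = 0; lane < 4; ++lane) {
        for (int i = 0; i < 4; ++i) {
            acc[lane] += uint32_t(a[lane * 4 + i]) * uint32_t(b[lane * 4 + i]);
        }
    }
    return acc;
}
```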

The drawback is that we drop native support for ARMv8.0. However, the only cloud instance type that lacks the dot product instructions is Graviton1 (AWS A1 instances, Cortex-A72). It is extremely unlikely that somebody would run Elasticsearch on that instance type for vector search workloads, and with the change in this PR vec_caps returns 0 on that hardware, gracefully falling back to the Lucene scorers.
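The fallback behavior can be sketched as a simple dispatch (names here are hypothetical, illustrating the vec_caps contract described above rather than the actual Elasticsearch code): a capability value of 0 means the required SIMD features are absent and the Lucene/Panama scorer is used instead.

```cpp
// Illustrative dispatch on a vec_caps-style probe (names hypothetical):
// 0 means the required features (NEON + dotprod on ARM) are missing,
// so scoring falls back to the Lucene/Panama implementation.
enum class Scorer { LUCENE_PANAMA, NATIVE_SIMD };

Scorer choose_scorer(int vec_caps) {
    return vec_caps > 0 ? Scorer::NATIVE_SIMD : Scorer::LUCENE_PANAMA;
}
```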

Bulk scorer

VectorScorerInt4BulkBenchmark, dims=1024, bulkSize=32, DOT_PRODUCT, ops/s (higher is better)

AWS ARM (Graviton, c8gd.xlarge)

| Benchmark | LUCENE | NATIVE | NATIVE vs Lucene |
|---|---:|---:|---:|
| scoreMultipleRandomBulk | 6,935 | 18,508 | +167% |
| scoreMultipleSequentialBulk | 6,981 | 18,791 | +169% |
| scoreQueryMultipleRandomBulk | 6,948 | 18,194 | +162% |

AWS AMD (c8a.xlarge)

numVectors=1500

| Benchmark | LUCENE | NATIVE | NATIVE vs Lucene |
|---|---:|---:|---:|
| scoreMultipleRandomBulk | 9,040 | 17,145 | +90% |
| scoreMultipleSequentialBulk | 9,027 | 17,810 | +97% |
| scoreQueryMultipleRandomBulk | 10,339 | 20,034 | +94% |

numVectors=128

| Benchmark | LUCENE | NATIVE | NATIVE vs Lucene |
|---|---:|---:|---:|
| scoreMultipleRandomBulk | 109,992 | 221,617 | +101% |
| scoreMultipleSequentialBulk | 109,916 | 217,933 | +98% |
| scoreQueryMultipleRandomBulk | 125,273 | 234,073 | +87% |

numVectors=130000

| Benchmark | LUCENE | NATIVE | NATIVE vs Lucene |
|---|---:|---:|---:|
| scoreMultipleRandomBulk | 636 | 1,040 | +63% |
| scoreMultipleSequentialBulk | 705 | 1,292 | +83% |
| scoreQueryMultipleRandomBulk | 617 | 1,149 | +86% |

ldematte added 30 commits March 12, 2026 14:51
Add JMH benchmarks for int4 (PACKED_NIBBLE) quantized vector scoring
to establish performance baselines before adding native C++ support.

Three benchmark levels mirror the existing int7u suite:
- VectorScorerInt4OperationBenchmark: raw dot product
- VectorScorerInt4Benchmark: single-score with correction math
- VectorScorerInt4BulkBenchmark: multi-vector scoring patterns
  including bulkScore API

Each benchmark compares SCALAR (plain loop) vs LUCENE (Panama SIMD)
implementations for DOT_PRODUCT and EUCLIDEAN similarity.

Made-with: Cursor
Introduce scalar C++ implementations for int4 packed-nibble
dot product (single, bulk, bulk-with-offsets) and wire them
through JdkVectorLibrary, Similarities, and the new
Int4VectorScorerSupplier and Int4VectorScorer classes.
Both the HNSW graph-build (scorer supplier) and query-time
(scorer) paths in ES94ScalarQuantizedVectorsFormat now use
native int4 scoring when available.

Made-with: Cursor
Deduplicate correction logic between Int4VectorScorer and
Int4VectorScorerSupplier into Int4Corrections. Both classes
are now final (no sealed subclass hierarchy) and resolve the
similarity-specific correction via method references stored
at construction time.

Made-with: Cursor
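The Int4Corrections commit above resolves the similarity-specific correction once, at construction time, via stored method references. A C++ analogue of that pattern (names and correction math are placeholders, not the actual Elasticsearch code) stores a function pointer so the per-score path carries no branch on the similarity type:

```cpp
#include <cstdint>

// C++ analogue of the Int4Corrections pattern: resolve the similarity-
// specific correction once at construction, store it as a function pointer.
// The correction formulas below are placeholders, not the real math.
struct Int4Corrections {
    using CorrectionFn = float (*)(int32_t rawDot, float scale);
    CorrectionFn correct;

    explicit Int4Corrections(bool euclidean)
        : correct(euclidean ? &euclideanCorrection : &dotProductCorrection) {}

    static float dotProductCorrection(int32_t rawDot, float scale) {
        return rawDot * scale;            // placeholder correction
    }
    static float euclideanCorrection(int32_t rawDot, float scale) {
        return 1.0f / (1.0f + rawDot * scale); // placeholder correction
    }
};
```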
The native Int4 implementation is not yet competitive
with Lucene's Panama SIMD scorer. Revert the production
integration until performance is improved. The native
implementation, tests, and benchmarks remain intact.

Made-with: Cursor
Replace the scalar C++ int4 dot product implementation with
SIMD-vectorized versions targeting NEON (aarch64) and AVX2
(amd64), to match Lucene's Panama vectorized performance.

aarch64: uses vmull_u8 for widening 8->16 bit multiply and
vpadalq_u16 for pairwise accumulation into 32-bit, with 4
independent accumulators to break dependency chains. Bulk
operations batch 4 vectors at a time.

amd64: uses _mm256_cvtepu8_epi16 for zero-extension and
_mm256_madd_epi16 for multiply-accumulate into 32-bit, with
2 accumulators. Bulk operations batch 2 vectors with explicit
cache-line prefetching of the next batch.

Both architectures extract nibbles via shift+mask, load the
two unpacked halves (high/low nibble targets), and share
query vector loads across batched document vectors.

Made-with: Cursor
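The shift+mask nibble extraction described in the commit above can be sketched as a scalar reference (the exact packed layout here is an assumption for illustration: element i in the low nibble and element i + dims/2 in the high nibble of byte i; the SIMD kernels do the same extraction 16 or 32 bytes at a time):

```cpp
#include <cstddef>
#include <cstdint>

// Scalar sketch of a packed-nibble (int4) dot product. Layout assumed for
// illustration: the document vector packs two 4-bit values per byte, element
// i in the low nibble and element i + dims/2 in the high nibble of byte i.
// The query is unpacked, one value per byte.
uint32_t int4_dot_scalar(const uint8_t* query, const uint8_t* doc_packed,
                         size_t dims) {
    const size_t half = dims / 2;
    uint32_t sum = 0;
    for (size_t i = 0; i < half; ++i) {
        uint32_t lo = doc_packed[i] & 0x0F;        // element i
        uint32_t hi = (doc_packed[i] >> 4) & 0x0F; // element i + half
        sum += lo * query[i] + hi * query[half + i];
    }
    return sum;
}
```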
Use doc_*/query_* naming consistently across both
aarch64 and amd64 implementations instead of opaque
abbreviations like p, hi, u_hi, as.

Made-with: Cursor
@ldematte ldematte requested a review from thecoop March 17, 2026 18:19
@ldematte ldematte requested a review from a team as a code owner March 17, 2026 18:19
@elasticsearchmachine elasticsearchmachine added Team:Search Relevance Meta label for the Search Relevance team in Elasticsearch v9.4.0 labels Mar 17, 2026
@elasticsearchmachine
Collaborator

Pinging @elastic/es-search-relevance (Team:Search Relevance)

@ldematte ldematte requested a review from ChrisHegarty March 18, 2026 07:17
@ldematte
Contributor Author

I have a different implementation for ARM which shows 2.5x improvements over Panama, but with a caveat. I'll open a conversation on slack about it.

Contributor

@ChrisHegarty ChrisHegarty left a comment


Very nice! LGTM

@elasticsearchmachine
Collaborator

Hi @ldematte, I've created a changelog YAML for you.

@ldematte
Contributor Author

ldematte commented Mar 18, 2026

CI is green, don't know why GH was not updated; merging (see https://buildkite.com/elastic/elasticsearch-pull-request/builds/129428)

@ldematte ldematte merged commit 7aacd5f into elastic:main Mar 18, 2026
34 of 36 checks passed
@ldematte ldematte deleted the native/vec-i4-simd branch March 18, 2026 15:23
michalborek pushed a commit to michalborek/elasticsearch that referenced this pull request Mar 23, 2026
ldematte added a commit that referenced this pull request Mar 25, 2026
This PR introduces some smaller optimizations to the x64 int4 implementations.

Now that #144429 is merged, I resumed #109238 and the detailed analysis I did there, and discovered that we were not using the optimal set of instructions.

The older PR used an inner loop that was at the theoretical maximum for most processors, with a throughput of 32 elements per CPU cycle. I applied the same scheme to the new implementations introduced in the previous PR; the bulk scoring paths show significant gains: +19% to +25% on the bulk variants, and +9% to +19% on the non-bulk variants.

Also, I implemented an AVX-512 variant; this should give us an additional theoretical 2x speedup in the inner calculation loop (over the AVX2 implementation), which should translate to a 12-50% throughput increase depending on vector dimensions (higher dimensions mean more time spent in the inner loop).
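The AVX2 inner loop this follow-up tunes pairs _mm256_cvtepu8_epi16 with _mm256_madd_epi16. For reference, a portable scalar model of madd's lane semantics (an illustration of the instruction, not the actual kernel): each 32-bit output lane is the sum of two adjacent signed 16-bit products, and since zero-extended int4 values fit comfortably in 16 bits, no product can overflow.

```cpp
#include <array>
#include <cstdint>

// Scalar model of _mm256_madd_epi16: each int32 output lane is the sum of
// two adjacent signed int16 products, so one instruction retires 16
// multiplies and 8 adds of the int4 inner loop.
std::array<int32_t, 8> madd_epi16_model(const std::array<int16_t, 16>& a,
                                        const std::array<int16_t, 16>& b) {
    std::array<int32_t, 8> out{};
    for (int lane = 0; lane < 8; ++lane) {
        out[lane] = int32_t(a[2 * lane]) * b[2 * lane]
                  + int32_t(a[2 * lane + 1]) * b[2 * lane + 1];
    }
    return out;
}
```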
seanzatzdev pushed a commit to seanzatzdev/elasticsearch that referenced this pull request Mar 27, 2026