[Native] SIMD implementations for native int4 vector scoring #144429
Merged
ldematte merged 40 commits into elastic:main, Mar 18, 2026
Conversation
Add JMH benchmarks for int4 (PACKED_NIBBLE) quantized vector scoring to establish performance baselines before adding native C++ support. Three benchmark levels mirror the existing int7u suite:

- VectorScorerInt4OperationBenchmark: raw dot product
- VectorScorerInt4Benchmark: single-score with correction math
- VectorScorerInt4BulkBenchmark: multi-vector scoring patterns, including the bulkScore API

Each benchmark compares SCALAR (plain loop) vs LUCENE (Panama SIMD) implementations for DOT_PRODUCT and EUCLIDEAN similarity.

Made-with: Cursor
Introduce scalar C++ implementations for int4 packed-nibble dot product (single, bulk, bulk-with-offsets) and wire them through JdkVectorLibrary, Similarities, and the new Int4VectorScorerSupplier and Int4VectorScorer classes. Both the HNSW graph-build (scorer supplier) and query-time (scorer) paths in ES94ScalarQuantizedVectorsFormat now use native int4 scoring when available. Made-with: Cursor
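The scalar path described above can be sketched as follows. This is a hypothetical illustration, not the actual Elasticsearch C++ source: it assumes both vectors are packed two 4-bit values per byte, with the low nibble holding the even-indexed element (the real PACKED_NIBBLE layout and half-split loading may differ).

```cpp
#include <cstddef>
#include <cstdint>

// Scalar int4 packed-nibble dot product (illustrative sketch).
// Each byte holds two 4-bit elements; nibbles are extracted via
// shift+mask, as the native implementation does.
int32_t int4_dot_product_scalar(const uint8_t* a, const uint8_t* b, size_t packed_len) {
    int32_t acc = 0;
    for (size_t i = 0; i < packed_len; ++i) {
        int32_t a_lo = a[i] & 0x0F;
        int32_t a_hi = (a[i] >> 4) & 0x0F;
        int32_t b_lo = b[i] & 0x0F;
        int32_t b_hi = (b[i] >> 4) & 0x0F;
        acc += a_lo * b_lo + a_hi * b_hi;
    }
    return acc;
}
```

A bulk variant would simply run this loop once per document vector against a shared query vector, which is what makes the later SIMD batching profitable.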
Deduplicate correction logic between Int4VectorScorer and Int4VectorScorerSupplier into Int4Corrections. Both classes are now final (no sealed subclass hierarchy) and resolve the similarity-specific correction via method references stored at construction time. Made-with: Cursor
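The "resolve the similarity-specific correction at construction" pattern can be sketched in C++ (the PR's classes are Java). The correction formulas below are hypothetical placeholders, not the actual Int4Corrections math; the point is the dispatch structure.

```cpp
// Sketch: choose the correction function once, at construction, so the
// per-score path has no branching on the similarity type. Formulas are
// placeholders, not the real quantization-correction math.
struct Int4Corrections {
    using CorrectionFn = float (*)(float);

    static float dotProductCorrection(float raw) { return (1.0f + raw) / 2.0f; }  // hypothetical
    static float euclideanCorrection(float raw) { return 1.0f / (1.0f + raw); }   // hypothetical

    explicit Int4Corrections(bool euclidean)
        : correct(euclidean ? &euclideanCorrection : &dotProductCorrection) {}

    float score(float rawDotProduct) const { return correct(rawDotProduct); }

    CorrectionFn correct;
};
```

Storing the function pointer (a method reference, in the Java original) lets both the scorer and the scorer supplier share one correction implementation without a sealed subclass hierarchy.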
Made-with: Cursor
…bstractVectorTestCase
… with test code via testFixture.
The native Int4 implementation is not yet competitive with Lucene's Panama SIMD scorer. Revert the production integration until performance is improved. The native implementation, tests, and benchmarks remain intact. Made-with: Cursor
Replace the scalar C++ int4 dot product implementation with SIMD-vectorized versions targeting NEON (aarch64) and AVX2 (amd64), to match Lucene's Panama vectorized performance.

aarch64: uses vmull_u8 for a widening 8->16 bit multiply and vpadalq_u16 for pairwise accumulation into 32-bit, with 4 independent accumulators to break dependency chains. Bulk operations batch 4 vectors at a time.

amd64: uses _mm256_cvtepu8_epi16 for zero-extension and _mm256_madd_epi16 for multiply-accumulate into 32-bit, with 2 accumulators. Bulk operations batch 2 vectors with explicit cache-line prefetching of the next batch.

Both architectures extract nibbles via shift+mask, load the two unpacked halves (high/low nibble targets), and share query vector loads across batched document vectors.

Made-with: Cursor
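The "independent accumulators to break dependency chains" idea from the commit above can be shown in scalar form (an illustrative sketch, not the actual NEON/AVX2 code, where each accumulator is a full vector register):

```cpp
#include <cstddef>
#include <cstdint>

// Dot product over unpacked bytes with 4 independent accumulators.
// Consecutive additions go to different accumulators, so the adds do
// not form one long serial dependency chain and can overlap in the
// CPU's out-of-order pipeline.
int32_t dot_u8_4acc(const uint8_t* a, const uint8_t* b, size_t n) {
    int32_t acc0 = 0, acc1 = 0, acc2 = 0, acc3 = 0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        acc0 += a[i] * b[i];
        acc1 += a[i + 1] * b[i + 1];
        acc2 += a[i + 2] * b[i + 2];
        acc3 += a[i + 3] * b[i + 3];
    }
    for (; i < n; ++i) {
        acc0 += a[i] * b[i];  // scalar tail
    }
    return acc0 + acc1 + acc2 + acc3;
}
```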
Use doc_*/query_* naming consistently across both aarch64 and amd64 implementations instead of opaque abbreviations like p, hi, u_hi, as. Made-with: Cursor
Collaborator
Pinging @elastic/es-search-relevance (Team:Search Relevance)
Contributor
Author
I have a different implementation for ARM which shows a 2.5x improvement over Panama, but with a caveat. I'll open a conversation on Slack about it.
Collaborator
Hi @ldematte, I've created a changelog YAML for you.
Contributor
Author
CI is green; don't know why GH was not updated. Merging (see https://buildkite.com/elastic/elasticsearch-pull-request/builds/129428).
michalborek pushed a commit to michalborek/elasticsearch that referenced this pull request on Mar 23, 2026
…ic#144429)

Following elastic#144215. This PR introduces SIMD implementations for the native int4 scorers. These are now consistently faster than the Panama implementation (~2x on x64 and ~2.6x on ARM), so we can use them. For this reason, the PR also reverts 6c94b3f to re-enable the new scorers in Elasticsearch.

To get the best performance on ARM, this PR also bumps the target architecture to -march=armv8.2-a+dotprod, and of course updates vec_caps to reflect that (minimum NEON + dotprod). This enables us to use vdotq_u32, which gives a significant boost (between +40% and +80%), bumping the gain of our native implementation from 1.5x over Lucene to 2.5x over Lucene for int4 (on Graviton4).

The drawback is that we drop native support for ARMv8.0; however, the only instance available in the cloud that does not support it is Graviton1 (AWS A1 instances, Cortex-A72, which does not support dot product instructions). That said, it's extremely unlikely that somebody would run Elasticsearch on this instance type for vector search workloads, and the change to vec_caps in this PR will return 0 on that hardware, gracefully falling back to Lucene scorers.
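To see why vdotq_u32 helps, here is a scalar model of what that NEON instruction computes, as I understand it from the Arm intrinsics documentation (an illustration, not code from this PR): each of the four 32-bit lanes accumulates the dot product of four u8 pairs, so one instruction does 16 multiplies and the full reduction into 32-bit lanes.

```cpp
#include <cstdint>

// Scalar model of NEON vdotq_u32 semantics: for each 32-bit lane,
// multiply four pairs of u8 elements and add their sum into the
// existing lane value. One instruction covers 16 byte-pairs, replacing
// the widen-multiply-then-pairwise-accumulate sequence.
void vdotq_u32_model(uint32_t acc[4], const uint8_t a[16], const uint8_t b[16]) {
    for (int lane = 0; lane < 4; ++lane) {
        uint32_t sum = 0;
        for (int j = 0; j < 4; ++j) {
            sum += static_cast<uint32_t>(a[lane * 4 + j]) * b[lane * 4 + j];
        }
        acc[lane] += sum;
    }
}
```

This is also why the target bump to armv8.2-a+dotprod is needed: the dot product extension is optional before ARMv8.4-A, and vec_caps returning 0 on hardware without it preserves the Lucene fallback.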
ldematte added a commit that referenced this pull request on Mar 25, 2026
This PR introduces some smaller optimizations to the x64 int4 implementations. Now that #144429 is merged, I resumed #109238 and the detailed analysis I did there, and discovered that we were not using the optimal set of instructions. The older PR used an inner loop that was at the theoretical maximum for most processors, with a throughput of 32 elements per CPU cycle. I applied the same schema to the new implementations introduced in the previous PR; the bulk scoring paths show significant gains: +19% to +25% on the bulk variants, and +9% to +19% on the non-bulk variants.

Also, I implemented an AVX-512 variant; this should give us an additional theoretical speedup of 2x in the inner calculation loop (over the AVX2 implementation), which should translate to a 12-50% throughput increase depending on vector dimensions (higher dimensions --> more time spent in the inner loop).
seanzatzdev pushed a commit to seanzatzdev/elasticsearch that referenced this pull request on Mar 27, 2026
Following #144215

This PR introduces SIMD implementations for the native int4 scorers. These are now consistently faster than the Panama implementation (~2x on x64 and ~2.6x on ARM), so we can use them. For this reason, the PR also reverts 6c94b3f to re-enable the new scorers in Elasticsearch.

To get the best performance on ARM, this PR also bumps the target architecture to -march=armv8.2-a+dotprod, and of course updates vec_caps to reflect that (minimum NEON + dotprod). This enables us to use vdotq_u32, which gives a significant boost (between +40% and +80%), bumping the gain of our native implementation from 1.5x over Lucene to 2.5x over Lucene for int4 (on Graviton4).

The drawback is that we drop native support for ARMv8.0; however, the only instance available in the cloud that does not support it is Graviton1 (AWS A1 instances, Cortex-A72, which does not support dot product instructions). That said, it's extremely unlikely that somebody would run Elasticsearch on this instance type for vector search workloads, and the change to vec_caps in this PR will return 0 on that hardware, gracefully falling back to Lucene scorers.

Bulk scorer
VectorScorerInt4BulkBenchmark, dims=1024, bulkSize=32, DOT_PRODUCT, ops/s (higher is better). Results were reported for AWS ARM (Graviton, c8gd.xlarge) and AWS AMD (c8a.xlarge), at numVectors = 128, 1500, and 130000. [Benchmark result tables not preserved in this text.]