
[Native] Initial support and plumbing for native int4 vector scoring#144215

Merged
ldematte merged 27 commits into elastic:main from ldematte:native/vec-i4-impl
Mar 17, 2026
Conversation


@ldematte ldematte commented Mar 13, 2026

This PR introduces support and plumbing for native int4 vector scoring.

In particular:

  • "naive" native int4 vector scoring implementation — scalar (non-SIMD) native C implementations of packed-nibble int4 dot product for both ARM and x64 (vec_doti4 (single), vec_doti4_bulk, vec_doti4_bulk_offsets).
  • Usual Java-side plumbing (JDKVectorLibrary, Similarities, etc.) in libs/native
  • Vector scorer implementations in libs/simdvec (Int4VectorScorer and Int4VectorScorerSupplier)
  • Tests, both at scorer level (Int4VectorScorerFactoryTests, with MMap and NIO directory variants) and lower level (JDKVectorLibraryInt4Tests)
  • Updated JMH benchmarks from Add int4 vector scoring benchmarks #144105 (VectorScorerInt4Benchmark and VectorScorerInt4BulkBenchmark) to include NATIVE implementations
    • switched to IndexInput-based data for fair comparison
    • refactored to avoid duplication with tests
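As a rough illustration of what a packed-nibble int4 dot product computes, here is a scalar sketch in Java. This is not the actual vec_doti4 C code, and the nibble ordering (high nibble first) is illustrative only; it may differ from the real on-disk layout.

```java
// Illustrative sketch of a packed-nibble int4 dot product. Two 4-bit
// values are stored per byte; the scalar loop unpacks both nibbles of
// each operand byte and accumulates the products.
final class PackedNibbleDot {

    // Pack 4-bit values (0..15) two per byte: even index -> high nibble.
    // (Illustrative layout, not necessarily the production one.)
    static byte[] pack(int[] values) {
        byte[] packed = new byte[(values.length + 1) / 2];
        for (int i = 0; i < values.length; i++) {
            int shift = (i % 2 == 0) ? 4 : 0;
            packed[i / 2] |= (byte) ((values[i] & 0xF) << shift);
        }
        return packed;
    }

    // Scalar dot product over two packed-nibble vectors of equal length.
    static int dot(byte[] a, byte[] b) {
        int sum = 0;
        for (int i = 0; i < a.length; i++) {
            int ahi = (a[i] >> 4) & 0xF, alo = a[i] & 0xF;
            int bhi = (b[i] >> 4) & 0xF, blo = b[i] & 0xF;
            sum += ahi * bhi + alo * blo;
        }
        return sum;
    }

    public static void main(String[] args) {
        int[] q = {1, 2, 3, 4};
        int[] d = {5, 6, 7, 8};
        // Plain dot product: 1*5 + 2*6 + 3*7 + 4*8 = 70
        System.out.println(dot(pack(q), pack(d))); // prints 70
    }
}
```

The masking with `& 0xF` after the shift is what keeps the result correct when the high nibble makes the byte negative under Java's signed byte semantics.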

What is NOT included / future work

  • The new scorer is not (yet) used in production. The integration in ES94ScalarQuantizedVectorsFormat.java was reverted (commit 6c94b3f), as the naive scalar native implementation is not competitive against Lucene's Panama SIMD. To re-enable: revert that commit.
  • This of course means we want to add SIMD-optimized native int4 implementations, and optimized bulk operations
  • Note that we are not missing distance functions: only DOT_PRODUCT is needed for native Int4; other functions are computed by applying correction terms on top of the raw dot product result.
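The "correction terms on top of the raw dot product" point can be seen with squared Euclidean distance, since ||a - b||^2 = ||a||^2 + ||b||^2 - 2(a.b). A minimal sketch, with illustrative names rather than the Int4Corrections API:

```java
// Sketch of why only DOT_PRODUCT needs a native kernel: other similarities
// are algebraic functions of the raw dot product plus precomputed
// per-vector terms (here, squared norms).
final class DistanceFromDot {

    static int dot(int[] a, int[] b) {
        int s = 0;
        for (int i = 0; i < a.length; i++) s += a[i] * b[i];
        return s;
    }

    static int squaredNorm(int[] v) {
        return dot(v, v);
    }

    // ||a - b||^2 = ||a||^2 + ||b||^2 - 2 * (a . b)
    static int squaredEuclidean(int dotProduct, int normA2, int normB2) {
        return normA2 + normB2 - 2 * dotProduct;
    }

    public static void main(String[] args) {
        int[] a = {1, 2, 3};
        int[] b = {4, 6, 8};
        int direct = 0;
        for (int i = 0; i < a.length; i++) direct += (a[i] - b[i]) * (a[i] - b[i]);
        int viaDot = squaredEuclidean(dot(a, b), squaredNorm(a), squaredNorm(b));
        System.out.println(direct + " " + viaDot); // prints 50 50
    }
}
```

The squared norms can be computed once per stored vector, so query time only pays for the dot product itself.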

ldematte added 19 commits March 12, 2026 14:51
Add JMH benchmarks for int4 (PACKED_NIBBLE) quantized vector scoring
to establish performance baselines before adding native C++ support.

Three benchmark levels mirror the existing int7u suite:
- VectorScorerInt4OperationBenchmark: raw dot product
- VectorScorerInt4Benchmark: single-score with correction math
- VectorScorerInt4BulkBenchmark: multi-vector scoring patterns
  including bulkScore API

Each benchmark compares SCALAR (plain loop) vs LUCENE (Panama SIMD)
implementations for DOT_PRODUCT and EUCLIDEAN similarity.

Made-with: Cursor
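The bulk pattern mentioned above (one query against many stored vectors per call) has roughly this shape. Method names here are hypothetical, not the actual bulkScore API:

```java
// Illustrative shape of bulk scoring: score every document vector against
// one query in a single call, writing results into a caller-provided array,
// rather than one score() call per document.
final class BulkScoreSketch {

    // Single dot product between a query and one document vector.
    static int dot(byte[] query, byte[] doc) {
        int sum = 0;
        for (int i = 0; i < query.length; i++) sum += query[i] * doc[i];
        return sum;
    }

    // Bulk variant: one pass over all documents.
    static void bulkDot(byte[] query, byte[][] docs, int[] out) {
        for (int i = 0; i < docs.length; i++) out[i] = dot(query, docs[i]);
    }

    public static void main(String[] args) {
        byte[] q = {1, 2, 3};
        byte[][] docs = { {1, 0, 0}, {0, 1, 0}, {1, 1, 1} };
        int[] out = new int[docs.length];
        bulkDot(q, docs, out);
        System.out.println(java.util.Arrays.toString(out)); // prints [1, 2, 6]
    }
}
```

Bulk APIs matter for native code in particular: they amortize the per-call overhead of crossing the Java/native boundary across many vectors.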
Introduce scalar C++ implementations for int4 packed-nibble
dot product (single, bulk, bulk-with-offsets) and wire them
through JdkVectorLibrary, Similarities, and the new
Int4VectorScorerSupplier and Int4VectorScorer classes.
Both the HNSW graph-build (scorer supplier) and query-time
(scorer) paths in ES94ScalarQuantizedVectorsFormat now use
native int4 scoring when available.

Made-with: Cursor
Deduplicate correction logic between Int4VectorScorer and
Int4VectorScorerSupplier into Int4Corrections. Both classes
are now final (no sealed subclass hierarchy) and resolve the
similarity-specific correction via method references stored
at construction time.

Made-with: Cursor
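The "method references stored at construction time" pattern described in this commit can be sketched as follows. The names and correction formulas are illustrative, not the Int4Corrections API:

```java
import java.util.function.IntToDoubleFunction;

// Sketch of resolving a similarity-specific correction once, at
// construction time, and storing it as a functional field so the hot
// scoring path just invokes it.
final class CorrectionDispatch {

    private final IntToDoubleFunction correction;

    CorrectionDispatch(String similarity) {
        this.correction = switch (similarity) {
            case "DOT_PRODUCT" -> CorrectionDispatch::dotProductScore;
            case "EUCLIDEAN" -> CorrectionDispatch::euclideanScore;
            default -> throw new IllegalArgumentException(similarity);
        };
    }

    double score(int rawDot) {
        return correction.applyAsDouble(rawDot);
    }

    // Illustrative corrections mapping a raw integer dot product to a score.
    static double dotProductScore(int raw) { return (1.0 + raw) / 2.0; }
    static double euclideanScore(int raw) { return 1.0 / (1.0 + raw); }

    public static void main(String[] args) {
        CorrectionDispatch d = new CorrectionDispatch("DOT_PRODUCT");
        System.out.println(d.score(3)); // prints 2.0
    }
}
```

Resolving the function once avoids a per-score branch on the similarity type and keeps both classes final without a sealed subclass hierarchy.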
The native Int4 implementation is not yet competitive
with Lucene's Panama SIMD scorer. Revert the production
integration until performance is improved. The native
implementation, tests, and benchmarks remain intact.

Made-with: Cursor
@ldematte ldematte added >enhancement and removed WIP labels Mar 16, 2026
@ldematte ldematte marked this pull request as ready for review March 16, 2026 09:20
@ldematte ldematte requested a review from a team as a code owner March 16, 2026 09:20
@elasticsearchmachine elasticsearchmachine added the Team:Search Relevance Meta label for the Search Relevance team in Elasticsearch label Mar 16, 2026
@@ -0,0 +1,52 @@
/*
Contributor Author

@thecoop see what I have done here: I started to divide functions by data type. Do you think it's a good idea? Let me know if you prefer me to move these to vec_1.cpp and deal with how to separate things later.

* hi lo hi lo hi lo hi lo
* 7..4 3..0 7..4 3..0 7..4 3..0 7..4 3..0
*/
public final class Int4TestUtils {
Contributor Author

GH does not display it, but these functions and comments are just moved over from VectorScorerTestUtils and ScalarOperations so they can be shared by tests and benchmarks.

float ay = queryLower;
float ly = (queryUpper - ay) * LIMIT_SCALE;
float max = Float.NEGATIVE_INFINITY;
for (int i = 0; i < numNodes; i++) {
Member

Is the plan to move this into panama/native later?

Contributor Author

Optionally yes. For BBQ for example I did not do that in the end: the cost was negligible. Here it might be worth it, if we have bigger bulk sizes. I'll check, but it's definitely going to be a follow-up

Member

@thecoop thecoop left a comment

All looks sensible, a few comments for consideration

* Unsigned int7. Single vector score returns results as an int.
*/
INT7U(Byte.BYTES),
INT7U(Byte.BYTES * 8),
Member

Byte.SIZE and friends

Contributor Author

Oh! TIL, thanks for pointing that out!

Contributor Author

I was too trigger happy :)
I have a PR coming up following this one, I'll address that there!
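For reference, the constants discussed in this thread: `Byte.BYTES` is the width in bytes and `Byte.SIZE` the width in bits, so `Byte.BYTES * 8` can be written as `Byte.SIZE`. The same pair exists on `Short`, `Integer`, `Long`, `Float`, and `Double`:

```java
// The standard width constants on the primitive wrapper classes:
// TYPE.BYTES is the size in bytes, TYPE.SIZE the size in bits.
public class SizeConstants {
    public static void main(String[] args) {
        System.out.println(Byte.BYTES);    // prints 1
        System.out.println(Byte.SIZE);     // prints 8
        System.out.println(Integer.BYTES); // prints 4
        System.out.println(Integer.SIZE);  // prints 32
    }
}
```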

@ldematte ldematte merged commit 134356c into elastic:main Mar 17, 2026
36 checks passed
ldematte added a commit that referenced this pull request Mar 18, 2026
Following #144215

This PR introduces SIMD implementations for the native int4 scorers. These are now consistently faster than the Panama implementation (~2x on x64 and ~2.6x on ARM), so we can use them. For this reason, the PR also reverts 6c94b3f to re-enable the new scorers in Elasticsearch.

To get the best performance on ARM, this PR also bumps the target architecture to -march=armv8.2-a+dotprod, and of course it updates vec_caps to reflect that (minimum NEON + dotprod).
This enables us to use vdotq_u32, which gives a significant boost, between +40% and +80%, bumping the gain of our native implementation from 1.5x over Lucene to 2.5x over Lucene for int4 (on Graviton4).

The drawback is that we drop native support for ARMv8.0; however the only instance available in cloud that does not support it is Graviton1 (AWS A1 instances, Cortex-A72, which does not support dot product instructions). That said, it's extremely unlikely that somebody would run Elasticsearch on this instance type for vector search workloads, and the change to vec_caps in this PR will return 0 on that hardware, gracefully falling back to Lucene scorers.
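The graceful fallback described above amounts to a capability check gating scorer selection. A minimal sketch of the selection logic only, with illustrative names (not the actual plumbing in libs/simdvec):

```java
// Sketch of capability-gated scorer selection: a probe value standing in
// for vec_caps decides between the native kernels and the portable
// Lucene scorer.
final class ScorerSelection {

    // 0 = native library absent or hardware unsupported (e.g. ARMv8.0
    // without dotprod); > 0 = native int4 kernels usable.
    static String selectInt4Scorer(int vecCaps) {
        if (vecCaps > 0) {
            return "native"; // SIMD C kernels reached via downcalls
        }
        return "lucene";     // portable Panama Vector API scorer
    }

    public static void main(String[] args) {
        System.out.println(selectInt4Scorer(0) + " " + selectInt4Scorer(1)); // prints lucene native
    }
}
```

Because the probe returns 0 on unsupported hardware rather than throwing, older machines simply never see the native path.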
michalborek pushed a commit to michalborek/elasticsearch that referenced this pull request Mar 23, 2026
michalborek pushed a commit to michalborek/elasticsearch that referenced this pull request Mar 23, 2026
ChrisHegarty added a commit to ChrisHegarty/elasticsearch that referenced this pull request Mar 24, 2026
The bulk test methods in JDKVectorLibraryInt4Tests pass heap-backed
MemorySegments (via MemorySegment.ofArray) directly to native downcall
handles. On Java 21, heap segments cannot be passed to FFM downcalls
because Linker.Option.critical(true) is only available from Java 22+.

Allocate query segments from the arena (off-heap) instead, matching the
pattern used by other tests.

Introduced by elastic#144215

Labels

>enhancement :Search Relevance/Vectors Vector search Team:Search Relevance Meta label for the Search Relevance team in Elasticsearch v9.4.0
