
[Native] Initial support and plumbing for native int4 vector scoring#144215

Merged
ldematte merged 27 commits into elastic:main from ldematte:native/vec-i4-impl
Mar 17, 2026
Conversation


@ldematte ldematte commented Mar 13, 2026

This PR introduces support and plumbing for native int4 vector scoring.

In particular:

  • "naive" native int4 vector scoring implementation — scalar (non-SIMD) native C implementations of packed-nibble int4 dot product for both ARM and x64 (vec_doti4 (single), vec_doti4_bulk, vec_doti4_bulk_offsets).
  • Usual Java-side plumbing (JDKVectorLibrary, Similarities, etc.) in libs/native
  • Vector scorer implementations in libs/simdvec (Int4VectorScorer and Int4VectorScorerSupplier)
  • Tests, both at scorer level (Int4VectorScorerFactoryTests, with MMap and NIO directory variants) and lower level (JDKVectorLibraryInt4Tests)
  • Updated JMH benchmarks from Add int4 vector scoring benchmarks #144105 (VectorScorerInt4Benchmark and VectorScorerInt4BulkBenchmark) to include NATIVE implementations
    • switched to IndexInput-based data for fair comparison
    • refactored to avoid duplication with tests
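As a rough illustration of what a packed-nibble int4 dot product computes, here is a scalar sketch in Java. This is not the actual vec_doti4 C code, and the nibble ordering (high nibble first) is illustrative only; it may differ from the real on-disk layout.

```java
// Illustrative sketch of a packed-nibble int4 dot product. Two 4-bit
// values are stored per byte; the scalar loop unpacks both nibbles of
// each operand byte and accumulates the products.
final class PackedNibbleDot {

    // Pack 4-bit values (0..15) two per byte: even index -> high nibble.
    // (Illustrative layout, not necessarily the production one.)
    static byte[] pack(int[] values) {
        byte[] packed = new byte[(values.length + 1) / 2];
        for (int i = 0; i < values.length; i++) {
            int shift = (i % 2 == 0) ? 4 : 0;
            packed[i / 2] |= (byte) ((values[i] & 0xF) << shift);
        }
        return packed;
    }

    // Scalar dot product over two packed-nibble vectors of equal length.
    static int dot(byte[] a, byte[] b) {
        int sum = 0;
        for (int i = 0; i < a.length; i++) {
            int ahi = (a[i] >> 4) & 0xF, alo = a[i] & 0xF;
            int bhi = (b[i] >> 4) & 0xF, blo = b[i] & 0xF;
            sum += ahi * bhi + alo * blo;
        }
        return sum;
    }

    public static void main(String[] args) {
        int[] q = {1, 2, 3, 4};
        int[] d = {5, 6, 7, 8};
        // Plain dot product: 1*5 + 2*6 + 3*7 + 4*8 = 70
        System.out.println(dot(pack(q), pack(d))); // prints 70
    }
}
```

The masking with `& 0xF` after the shift is what keeps the result correct when the high nibble makes the byte negative under Java's signed byte semantics.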

What is NOT included / future work

  • The new scorer is not (yet) used in production. The integration in ES94ScalarQuantizedVectorsFormat.java was reverted (commit 6c94b3f), as the naive scalar native implementation is not competitive against Lucene's Panama SIMD. To re-enable: revert that commit.
  • This of course means we want to add SIMD-optimized native int4 implementations, and optimized bulk operations
  • Note that we are not missing distance functions: only DOT_PRODUCT is needed for native Int4; other functions are computed by applying correction terms on top of the raw dot product result.
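The "correction terms on top of the raw dot product" point can be seen with squared Euclidean distance, since ||a - b||^2 = ||a||^2 + ||b||^2 - 2(a.b). A minimal sketch, with illustrative names rather than the Int4Corrections API:

```java
// Sketch of why only DOT_PRODUCT needs a native kernel: other similarities
// are algebraic functions of the raw dot product plus precomputed
// per-vector terms (here, squared norms).
final class DistanceFromDot {

    static int dot(int[] a, int[] b) {
        int s = 0;
        for (int i = 0; i < a.length; i++) s += a[i] * b[i];
        return s;
    }

    static int squaredNorm(int[] v) {
        return dot(v, v);
    }

    // ||a - b||^2 = ||a||^2 + ||b||^2 - 2 * (a . b)
    static int squaredEuclidean(int dotProduct, int normA2, int normB2) {
        return normA2 + normB2 - 2 * dotProduct;
    }

    public static void main(String[] args) {
        int[] a = {1, 2, 3};
        int[] b = {4, 6, 8};
        int direct = 0;
        for (int i = 0; i < a.length; i++) direct += (a[i] - b[i]) * (a[i] - b[i]);
        int viaDot = squaredEuclidean(dot(a, b), squaredNorm(a), squaredNorm(b));
        System.out.println(direct + " " + viaDot); // prints 50 50
    }
}
```

The squared norms can be computed once per stored vector, so query time only pays for the dot product itself.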

ldematte added 19 commits March 12, 2026 14:51
Add JMH benchmarks for int4 (PACKED_NIBBLE) quantized vector scoring
to establish performance baselines before adding native C++ support.

Three benchmark levels mirror the existing int7u suite:
- VectorScorerInt4OperationBenchmark: raw dot product
- VectorScorerInt4Benchmark: single-score with correction math
- VectorScorerInt4BulkBenchmark: multi-vector scoring patterns
  including bulkScore API

Each benchmark compares SCALAR (plain loop) vs LUCENE (Panama SIMD)
implementations for DOT_PRODUCT and EUCLIDEAN similarity.

Made-with: Cursor
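The bulk pattern mentioned above (one query against many stored vectors per call) has roughly this shape. Method names here are hypothetical, not the actual bulkScore API:

```java
// Illustrative shape of bulk scoring: score every document vector against
// one query in a single call, writing results into a caller-provided array,
// rather than one score() call per document.
final class BulkScoreSketch {

    // Single dot product between a query and one document vector.
    static int dot(byte[] query, byte[] doc) {
        int sum = 0;
        for (int i = 0; i < query.length; i++) sum += query[i] * doc[i];
        return sum;
    }

    // Bulk variant: one pass over all documents.
    static void bulkDot(byte[] query, byte[][] docs, int[] out) {
        for (int i = 0; i < docs.length; i++) out[i] = dot(query, docs[i]);
    }

    public static void main(String[] args) {
        byte[] q = {1, 2, 3};
        byte[][] docs = { {1, 0, 0}, {0, 1, 0}, {1, 1, 1} };
        int[] out = new int[docs.length];
        bulkDot(q, docs, out);
        System.out.println(java.util.Arrays.toString(out)); // prints [1, 2, 6]
    }
}
```

Bulk APIs matter for native code in particular: they amortize the per-call overhead of crossing the Java/native boundary across many vectors.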
Introduce scalar C++ implementations for int4 packed-nibble
dot product (single, bulk, bulk-with-offsets) and wire them
through JdkVectorLibrary, Similarities, and the new
Int4VectorScorerSupplier and Int4VectorScorer classes.
Both the HNSW graph-build (scorer supplier) and query-time
(scorer) paths in ES94ScalarQuantizedVectorsFormat now use
native int4 scoring when available.

Made-with: Cursor
Deduplicate correction logic between Int4VectorScorer and
Int4VectorScorerSupplier into Int4Corrections. Both classes
are now final (no sealed subclass hierarchy) and resolve the
similarity-specific correction via method references stored
at construction time.

Made-with: Cursor
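The "method references stored at construction time" pattern described in this commit can be sketched as follows. The names and correction formulas are illustrative, not the Int4Corrections API:

```java
import java.util.function.IntToDoubleFunction;

// Sketch of resolving a similarity-specific correction once, at
// construction time, and storing it as a functional field so the hot
// scoring path just invokes it.
final class CorrectionDispatch {

    private final IntToDoubleFunction correction;

    CorrectionDispatch(String similarity) {
        this.correction = switch (similarity) {
            case "DOT_PRODUCT" -> CorrectionDispatch::dotProductScore;
            case "EUCLIDEAN" -> CorrectionDispatch::euclideanScore;
            default -> throw new IllegalArgumentException(similarity);
        };
    }

    double score(int rawDot) {
        return correction.applyAsDouble(rawDot);
    }

    // Illustrative corrections mapping a raw integer dot product to a score.
    static double dotProductScore(int raw) { return (1.0 + raw) / 2.0; }
    static double euclideanScore(int raw) { return 1.0 / (1.0 + raw); }

    public static void main(String[] args) {
        CorrectionDispatch d = new CorrectionDispatch("DOT_PRODUCT");
        System.out.println(d.score(3)); // prints 2.0
    }
}
```

Resolving the function once avoids a per-score branch on the similarity type and keeps both classes final without a sealed subclass hierarchy.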
The native Int4 implementation is not yet competitive
with Lucene's Panama SIMD scorer. Revert the production
integration until performance is improved. The native
implementation, tests, and benchmarks remain intact.

Made-with: Cursor
@ldematte ldematte added >enhancement and removed WIP labels Mar 16, 2026
@ldematte ldematte marked this pull request as ready for review March 16, 2026 09:20
@ldematte ldematte requested a review from a team as a code owner March 16, 2026 09:20
@elasticsearchmachine elasticsearchmachine added the Team:Search Relevance Meta label for the Search Relevance team in Elasticsearch label Mar 16, 2026
@@ -0,0 +1,52 @@
/*
Contributor Author

@thecoop see what I have done here: I started to divide functions by data type. Do you think it's a good idea? Let me know if you prefer me to move these to vec_1.cpp and deal with how to separate things later.

* hi lo hi lo hi lo hi lo
* 7..4 3..0 7..4 3..0 7..4 3..0 7..4 3..0
*/
public final class Int4TestUtils {
Contributor Author

GH does not display it, but these functions and comments are just moved over from VectorScorerTestUtils and ScalarOperations so they can be shared by tests and benchmarks.

float ay = queryLower;
float ly = (queryUpper - ay) * LIMIT_SCALE;
float max = Float.NEGATIVE_INFINITY;
for (int i = 0; i < numNodes; i++) {
Member

Is the plan to move this into panama/native later?

Contributor Author

Optionally yes. For BBQ for example I did not do that in the end: the cost was negligible. Here it might be worth it, if we have bigger bulk sizes. I'll check, but it's definitely going to be a follow-up

Member

@thecoop thecoop left a comment

All looks sensible, a few comments for consideration

* Unsigned int7. Single vector score returns results as an int.
*/
INT7U(Byte.BYTES),
INT7U(Byte.BYTES * 8),
Member

Byte.SIZE and friends

Contributor Author

Oh! TIL, thanks for pointing that out!

Contributor Author

I was too trigger happy :)
I have a PR coming up following this one, I'll address that there!
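For reference, the constants discussed in this thread: `Byte.BYTES` is the width in bytes and `Byte.SIZE` the width in bits, so `Byte.BYTES * 8` can be written as `Byte.SIZE`. The same pair exists on `Short`, `Integer`, `Long`, `Float`, and `Double`:

```java
// The standard width constants on the primitive wrapper classes:
// TYPE.BYTES is the size in bytes, TYPE.SIZE the size in bits.
public class SizeConstants {
    public static void main(String[] args) {
        System.out.println(Byte.BYTES);    // prints 1
        System.out.println(Byte.SIZE);     // prints 8
        System.out.println(Integer.BYTES); // prints 4
        System.out.println(Integer.SIZE);  // prints 32
    }
}
```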

@ldematte ldematte merged commit 134356c into elastic:main Mar 17, 2026
36 checks passed
ldematte added a commit that referenced this pull request Mar 18, 2026
Following #144215

This PR introduces SIMD implementations for the native int4 scorers. These are now consistently faster than the Panama implementation (~2x on x64 and ~2.6x on ARM), so we can use them. For this reason, the PR also reverts 6c94b3f to re-enable the new scorers in Elasticsearch.

To get the best performance on ARM, this PR also bumps the target architecture to -march=armv8.2-a+dotprod, and of course it updates vec_caps to reflect that (minimum NEON + dotprod).
This enables us to use vdotq_u32, which gives a significant boost, between +40% and +80%, bumping the gain of our native implementation from 1.5x over Lucene to 2.5x over Lucene for int4 (on Graviton4).

The drawback is that we drop native support for ARMv8.0; however the only instance available in cloud that does not support it is Graviton1 (AWS A1 instances, Cortex-A72, which does not support dot product instructions). That said, it's extremely unlikely that somebody would run Elasticsearch on this instance type for vector search workloads, and the change to vec_caps in this PR will return 0 on that hardware, gracefully falling back to Lucene scorers.
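The graceful fallback described above amounts to a capability check gating scorer selection. A minimal sketch of the selection logic only, with illustrative names (not the actual plumbing in libs/simdvec):

```java
// Sketch of capability-gated scorer selection: a probe value standing in
// for vec_caps decides between the native kernels and the portable
// Lucene scorer.
final class ScorerSelection {

    // 0 = native library absent or hardware unsupported (e.g. ARMv8.0
    // without dotprod); > 0 = native int4 kernels usable.
    static String selectInt4Scorer(int vecCaps) {
        if (vecCaps > 0) {
            return "native"; // SIMD C kernels reached via downcalls
        }
        return "lucene";     // portable Panama Vector API scorer
    }

    public static void main(String[] args) {
        System.out.println(selectInt4Scorer(0) + " " + selectInt4Scorer(1)); // prints lucene native
    }
}
```

Because the probe returns 0 on unsupported hardware rather than throwing, older machines simply never see the native path.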
michalborek pushed a commit to michalborek/elasticsearch that referenced this pull request Mar 23, 2026
michalborek pushed a commit to michalborek/elasticsearch that referenced this pull request Mar 23, 2026
ChrisHegarty added a commit to ChrisHegarty/elasticsearch that referenced this pull request Mar 24, 2026
The bulk test methods in JDKVectorLibraryInt4Tests pass heap-backed
MemorySegments (via MemorySegment.ofArray) directly to native downcall
handles. On Java 21, heap segments cannot be passed to FFM downcalls
because Linker.Option.critical(true) is only available from Java 22+.

Allocate query segments from the arena (off-heap) instead, matching the
pattern used by other tests.

Introduced by elastic#144215

Labels

>enhancement :Search Relevance/Vectors Vector search Team:Search Relevance Meta label for the Search Relevance team in Elasticsearch v9.4.0
