[Native] Initial support and plumbing for native int4 vector scoring #144215

ldematte merged 27 commits into elastic:main

Conversation
Add JMH benchmarks for int4 (PACKED_NIBBLE) quantized vector scoring to establish performance baselines before adding native C++ support. Three benchmark levels mirror the existing int7u suite:

- VectorScorerInt4OperationBenchmark: raw dot product
- VectorScorerInt4Benchmark: single-score with correction math
- VectorScorerInt4BulkBenchmark: multi-vector scoring patterns, including the bulkScore API

Each benchmark compares SCALAR (plain loop) vs LUCENE (Panama SIMD) implementations for DOT_PRODUCT and EUCLIDEAN similarity.

Made-with: Cursor
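The SCALAR baseline above is essentially a plain loop over the packed-nibble layout. A minimal sketch follows; the packing order (dimension i in the high nibble, dimension i + dims/2 in the low nibble) is an assumption mirroring Lucene's packed int4 format, not taken verbatim from this PR:

```java
// Sketch of a plain-loop (SCALAR) packed-nibble int4 dot product.
// Assumed layout (mirrors Lucene's packed int4 format): byte i stores
// dimension i in its high nibble and dimension i + dims/2 in its low nibble.
public final class Int4DotSketch {

    /** q: unpacked query, one int4 value (0..15) per byte; packedDoc: dims/2 bytes. */
    public static int dot(byte[] q, byte[] packedDoc) {
        int half = q.length / 2;
        int sum = 0;
        for (int i = 0; i < half; i++) {
            int b = packedDoc[i] & 0xFF;
            sum += q[i] * ((b >> 4) & 0x0F); // high nibble: dimension i
            sum += q[i + half] * (b & 0x0F); // low nibble: dimension i + half
        }
        return sum;
    }

    public static void main(String[] args) {
        byte[] q = {1, 2, 3, 4};
        // packed doc {5, 6, 7, 8}: byte0 = (5 << 4) | 7, byte1 = (6 << 4) | 8
        byte[] packedDoc = {(byte) 0x57, (byte) 0x68};
        System.out.println(dot(q, packedDoc)); // 1*5 + 2*6 + 3*7 + 4*8 = 70
    }
}
```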
Introduce scalar C++ implementations for the int4 packed-nibble dot product (single, bulk, and bulk-with-offsets) and wire them through JdkVectorLibrary, Similarities, and the new Int4VectorScorerSupplier and Int4VectorScorer classes. Both the HNSW graph-build (scorer supplier) and query-time (scorer) paths in ES94ScalarQuantizedVectorsFormat now use native int4 scoring when available.
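The bulk entry point can be pictured as a loop over contiguous packed vectors scored against one query. A hedged Java sketch of the shape (illustrative names, not the actual vec_doti4_bulk signature; same assumed nibble layout as the single-vector case):

```java
// Illustrative shape of a bulk packed-nibble dot product: score `count`
// contiguous packed vectors against one unpacked query.
public final class Int4BulkSketch {

    public static void dotBulk(byte[] q, byte[] packedDocs, int count, int[] out) {
        int half = q.length / 2; // bytes per packed vector
        for (int v = 0; v < count; v++) {
            int base = v * half;
            int sum = 0;
            for (int i = 0; i < half; i++) {
                int b = packedDocs[base + i] & 0xFF;
                sum += q[i] * ((b >> 4) & 0x0F); // high nibble: dimension i
                sum += q[i + half] * (b & 0x0F); // low nibble: dimension i + half
            }
            out[v] = sum;
        }
    }

    public static void main(String[] args) {
        byte[] q = {1, 2, 3, 4};
        // two packed docs: {5,6,7,8} -> 0x57 0x68, and {1,1,1,1} -> 0x11 0x11
        byte[] docs = {(byte) 0x57, (byte) 0x68, (byte) 0x11, (byte) 0x11};
        int[] out = new int[2];
        dotBulk(q, docs, 2, out);
        System.out.println(out[0] + " " + out[1]); // 70 10
    }
}
```

The bulk-with-offsets variant would differ only in taking a per-vector offset array instead of assuming contiguous storage.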
Deduplicate the correction logic shared by Int4VectorScorer and Int4VectorScorerSupplier into Int4Corrections. Both classes are now final (no sealed subclass hierarchy) and resolve the similarity-specific correction via method references stored at construction time.
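The "method references stored at construction time" pattern can be sketched like this; note the correction formulas below are placeholders, not the PR's actual scalar-quantization correction math:

```java
import java.util.function.DoubleUnaryOperator;

// Sketch: resolve a similarity-specific correction once at construction and
// store it as a method reference, so scoring has no per-call branching.
// The formulas are placeholders, not the real quantization-correction math.
public final class CorrectionsSketch {
    private final DoubleUnaryOperator correction;

    public CorrectionsSketch(String similarity) {
        this.correction = switch (similarity) {
            case "DOT_PRODUCT" -> CorrectionsSketch::dotProductCorrection;
            case "EUCLIDEAN" -> CorrectionsSketch::euclideanCorrection;
            default -> throw new IllegalArgumentException(similarity);
        };
    }

    private static double dotProductCorrection(double raw) {
        return Math.max((1.0 + raw) / 2.0, 0.0); // placeholder formula
    }

    private static double euclideanCorrection(double raw) {
        return 1.0 / (1.0 + raw); // placeholder formula
    }

    public double score(double rawDot) {
        return correction.applyAsDouble(rawDot); // no branching per call
    }

    public static void main(String[] args) {
        System.out.println(new CorrectionsSketch("EUCLIDEAN").score(3.0)); // 0.25
    }
}
```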
…bstractVectorTestCase
… with test code via testFixture.
The native int4 implementation is not yet competitive with Lucene's Panama SIMD scorer, so revert the production integration until performance improves. The native implementation, tests, and benchmarks remain intact.
@thecoop see what I have done here: I started to divide functions by data type. Do you think it's a good idea? Let me know if you prefer me to move these to vec_1.cpp and deal with how to separate things later.
```java
/*
 * hi   lo   hi   lo   hi   lo   hi   lo
 * 7..4 3..0 7..4 3..0 7..4 3..0 7..4 3..0
 */
public final class Int4TestUtils {
```
GH does not display it, but these functions and comments are just moved over from VectorScorerTestUtils and ScalarOperations so they can be shared by tests and benchmarks.
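Shared helpers of the kind described might look like the following round-trip pair; this is a hypothetical sketch under the assumed nibble layout above (dimension i in the high nibble, dimension i + half in the low nibble), not the actual Int4TestUtils code:

```java
import java.util.Arrays;

// Hypothetical pack/unpack helpers in the spirit of shared test utilities.
// Assumed layout: byte i = (dim[i] << 4) | dim[i + dims/2].
public final class Int4PackSketch {

    public static byte[] pack(byte[] unpacked) {
        int half = unpacked.length / 2;
        byte[] packed = new byte[half];
        for (int i = 0; i < half; i++) {
            packed[i] = (byte) ((unpacked[i] << 4) | (unpacked[i + half] & 0x0F));
        }
        return packed;
    }

    public static byte[] unpack(byte[] packed) {
        byte[] out = new byte[packed.length * 2];
        for (int i = 0; i < packed.length; i++) {
            out[i] = (byte) ((packed[i] >> 4) & 0x0F);          // high nibble
            out[i + packed.length] = (byte) (packed[i] & 0x0F); // low nibble
        }
        return out;
    }

    public static void main(String[] args) {
        byte[] v = {5, 6, 7, 8};
        System.out.println(Arrays.equals(v, unpack(pack(v)))); // true
    }
}
```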
```java
float ay = queryLower;
float ly = (queryUpper - ay) * LIMIT_SCALE;
float max = Float.NEGATIVE_INFINITY;
for (int i = 0; i < numNodes; i++) {
```
Is the plan to move this into panama/native later?
Optionally, yes. For BBQ, for example, I did not do that in the end: the cost was negligible. Here it might be worth it if we have bigger bulk sizes. I'll check, but it's definitely going to be a follow-up.
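The Java-side loop under discussion has roughly this shape; a self-contained sketch, where LIMIT_SCALE, the linear correction, and the variable names are illustrative only, not the PR's actual math:

```java
// Illustrative shape of the Java-side bulk max-score loop: apply a
// per-node correction to raw int dot products and track the maximum.
// LIMIT_SCALE and the correction formula are placeholders.
public final class BulkMaxSketch {
    static final float LIMIT_SCALE = 1f / 15f; // placeholder int4-range scale

    public static float maxScore(int[] rawDots, float queryLower, float queryUpper) {
        float ay = queryLower;
        float ly = (queryUpper - ay) * LIMIT_SCALE;
        float max = Float.NEGATIVE_INFINITY;
        for (int i = 0; i < rawDots.length; i++) {
            float corrected = ay + ly * rawDots[i]; // placeholder correction
            if (corrected > max) {
                max = corrected;
            }
        }
        return max;
    }

    public static void main(String[] args) {
        System.out.println(maxScore(new int[] {15, 30, 45}, 0f, 1f));
    }
}
```

This is exactly the kind of straight-line reduction that could later move into Panama or native code, as the thread suggests.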
thecoop left a comment:
All looks sensible, a few comments for consideration
```diff
 /*
  * Unsigned int7. Single vector score returns results as an int.
  */
-INT7U(Byte.BYTES),
+INT7U(Byte.BYTES * 8),
```
Oh! TIL, thanks for pointing that out!
I was too trigger happy :)
I have a PR coming up following this one, I'll address that there!
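For reference on the constant being discussed: java.lang.Byte defines both BYTES (width in bytes) and SIZE (width in bits), so Byte.BYTES * 8 is the same value as Byte.SIZE:

```java
// Byte.BYTES is the width of a byte in bytes (1); Byte.SIZE is the width
// in bits (8). Byte.BYTES * 8 therefore equals Byte.SIZE.
public final class ByteConstants {
    public static void main(String[] args) {
        System.out.println(Byte.BYTES);              // 1
        System.out.println(Byte.SIZE);               // 8
        System.out.println(Byte.BYTES * 8 == Byte.SIZE); // true
    }
}
```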
Following #144215

This PR introduces SIMD implementations for the native int4 scorers. These are now consistently faster than the Panama implementation (~2x on x64 and ~2.6x on ARM), so we can use them. For this reason, the PR also reverts 6c94b3f to re-enable the new scorers in Elasticsearch.

To get the best performance on ARM, this PR also bumps the target architecture to -march=armv8.2-a+dotprod, and of course updates vec_caps to reflect that (minimum NEON + dotprod). This enables us to use vdotq_u32, which gives a significant boost, between +40% and +80%, bumping the gain of our native implementation from 1.5x over Lucene to 2.5x over Lucene for int4 (on Graviton4).

The drawback is that we drop native support for ARMv8.0; however, the only instance type available in cloud that does not support it is Graviton1 (AWS A1 instances, Cortex-A72, which lacks the dot product instructions). It is extremely unlikely that somebody would run Elasticsearch on this instance type for vector search workloads, and the change to vec_caps in this PR will return 0 on that hardware, gracefully falling back to the Lucene scorers.
elastic#144215)

This PR introduces support and plumbing for native int4 vector scoring. In particular:

- a "naive" native int4 vector scoring implementation — scalar (non-SIMD) native C implementations of the packed-nibble int4 dot product for both ARM and x64 (vec_doti4 (single), vec_doti4_bulk, vec_doti4_bulk_offsets)
- the usual Java-side plumbing (JDKVectorLibrary, Similarities, etc.) in libs/native
- vector scorer implementations in libs/simdvec (Int4VectorScorer and Int4VectorScorerSupplier)
- tests, both at scorer level (Int4VectorScorerFactoryTests, with MMap and NIO directory variants) and lower level (JDKVectorLibraryInt4Tests)
- updated JMH benchmarks from "Add int4 vector scoring benchmarks" elastic#144105 (VectorScorerInt4Benchmark and VectorScorerInt4BulkBenchmark) to include NATIVE implementations; switched to IndexInput-based data for a fair comparison; refactored to avoid duplication with tests

What is NOT included / future work:

- The new scorer is not (yet) used in production. The integration in ES94ScalarQuantizedVectorsFormat.java was reverted (commit 6c94b3f), as the naive scalar native implementation is not competitive against Lucene's Panama SIMD. To re-enable: revert that commit.
- Which of course means we want to add SIMD-optimized native int4 implementations, and optimized bulk operations.
- Notice that we are not missing distance functions -- only DOT_PRODUCT is needed for native int4; other functions are computed by applying correction terms on top of the raw dot product result.
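The last point rests on a standard identity: with the squared norms available as stored per-vector corrections, squared Euclidean distance follows from the raw dot product alone, since |a - b|^2 = |a|^2 + |b|^2 - 2 * (a . b). A quick sketch:

```java
// Squared Euclidean distance recovered from a raw dot product plus stored
// squared norms, using |a - b|^2 = |a|^2 + |b|^2 - 2 * (a . b).
public final class DistanceFromDot {

    public static int squaredDistance(int dot, int normA2, int normB2) {
        return normA2 + normB2 - 2 * dot;
    }

    public static void main(String[] args) {
        int[] a = {1, 2, 3}, b = {4, 5, 6};
        int dot = 0, na2 = 0, nb2 = 0, direct = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            na2 += a[i] * a[i];
            nb2 += b[i] * b[i];
            int d = a[i] - b[i];
            direct += d * d;
        }
        // identity result matches the directly computed distance
        System.out.println(squaredDistance(dot, na2, nb2) == direct); // true
    }
}
```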
The bulk test methods in JDKVectorLibraryInt4Tests pass heap-backed MemorySegments (via MemorySegment.ofArray) directly to native downcall handles. On Java 21, heap segments cannot be passed to FFM downcalls because Linker.Option.critical(true) is only available from Java 22+. Allocate query segments from the arena (off-heap) instead, matching the pattern used by other tests. Introduced by elastic#144215
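The distinction the fix relies on can be shown directly: a segment from MemorySegment.ofArray is heap-backed, while one allocated from an Arena is native (off-heap) and therefore safe to hand to downcall handles on Java 21, where Linker.Option.critical(true) is unavailable. A sketch, assuming Java 22+ so the finalized FFM API compiles without preview flags:

```java
import java.lang.foreign.Arena;
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;

// Heap-backed vs arena-allocated (off-heap) segments. Only the latter is
// generally safe to pass to native downcall handles on Java 21.
public final class SegmentKinds {
    public static void main(String[] args) {
        byte[] query = {1, 2, 3, 4};
        MemorySegment heapSeg = MemorySegment.ofArray(query);
        System.out.println(heapSeg.isNative()); // false: heap-backed

        try (Arena arena = Arena.ofConfined()) {
            MemorySegment nativeSeg = arena.allocate(query.length);
            nativeSeg.copyFrom(heapSeg); // stage heap data off-heap
            System.out.println(nativeSeg.isNative()); // true: downcall-safe
            System.out.println(nativeSeg.get(ValueLayout.JAVA_BYTE, 3)); // 4
        }
    }
}
```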