Add bulk-sparse native vector scoring for searchable snapshots via DirectAccessInput #144557
Conversation
Hi @ChrisHegarty, I've created a changelog YAML for you.
ldematte
left a comment
I gave it a quick first pass, concentrating especially on the native part and how it interacts (how we pass the dataset). Looks good!
libs/simdvec/src/main21/java/org/elasticsearch/simdvec/internal/IndexInputUtils.java
long[] offsets,
int length,
int count,
long[] addrs,
I suppose this is a parameter because we'd likely reuse it, and it's not directly a MemorySegment (of size count * ADDRESS.bytes) because we want to call it from code that does not have the preview things?
yeah. This could be a premature optimisation. Lemme revert it, as it's not clear that it's worth it at this point.
No it's OK I think, just wanted to confirm I understood it correctly
libs/simdvec/src/main21/java/org/elasticsearch/simdvec/internal/IndexInputUtils.java
...ugin/blob-cache/src/main/java/org/elasticsearch/blobcache/shared/SharedBlobCacheService.java
While working on bulk sparse scoring (#144557), I noticed that INT8 and FLOAT32 were missing testBulkIllegalDims coverage that INT7U, INT4, and BBQ already have. Extracting this into a small targeted PR. Both new tests verify IOOBE for count overflow, negative count, negative dims, and undersized result buffer, matching the existing pattern in JDKVectorLibraryInt7uTests.
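As a minimal self-contained sketch of the argument validation those tests pin down (the `checkBulkArgs` name and signature here are hypothetical, not the actual JDKVectorLibrary API):

```java
// Hypothetical sketch of the checks that testBulkIllegalDims exercises:
// IOOBE for count overflow, negative count, negative dims, undersized results.
public class BulkArgsSketch {
    static void checkBulkArgs(int dims, int count, float[] results) {
        if (dims < 0) {
            throw new IndexOutOfBoundsException("negative dims: " + dims);
        }
        if (count < 0) {
            throw new IndexOutOfBoundsException("negative count: " + count);
        }
        // count * dims can overflow int; widen to long before comparing
        if ((long) count * dims > Integer.MAX_VALUE) {
            throw new IndexOutOfBoundsException("count * dims overflows int");
        }
        if (results.length < count) {
            throw new IndexOutOfBoundsException("results buffer too small: " + results.length + " < " + count);
        }
    }

    static boolean throwsIoobe(Runnable r) {
        try {
            r.run();
            return false;
        } catch (IndexOutOfBoundsException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        if (!throwsIoobe(() -> checkBulkArgs(-1, 4, new float[4]))) throw new AssertionError("negative dims");
        if (!throwsIoobe(() -> checkBulkArgs(768, -1, new float[4]))) throw new AssertionError("negative count");
        if (!throwsIoobe(() -> checkBulkArgs(Integer.MAX_VALUE, 2, new float[2]))) throw new AssertionError("overflow");
        if (!throwsIoobe(() -> checkBulkArgs(768, 4, new float[3]))) throw new AssertionError("undersized results");
        checkBulkArgs(768, 4, new float[4]); // valid arguments pass through
        System.out.println("ok");
    }
}
```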
While working on bulk sparse scoring (#144557), I noticed the existing BULK_OFFSETS tests only use random offsets. Random offsets probabilistically cover duplicates and may happen to produce a sequential pattern, but neither case is guaranteed or verified explicitly, so I added two new tests that make the patterns deterministic and assert specific properties that random offsets do not. I added these to INT7U only, since the offset dispatch logic is the same array_mapper template across all element types: a bug in offset handling would surface here, and the type-specific arithmetic is already covered by the existing per-type random-offset tests.
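A sketch of the two deterministic offset patterns the new tests assert, with hypothetical helper names and an illustrative vector size:

```java
// Hypothetical sketch of deterministic BULK_OFFSETS patterns: duplicates and
// a fully sequential walk, each guaranteed rather than probabilistic.
public class OffsetsSketch {
    // every offset points at the same vector, so duplicates are certain
    static long[] duplicateOffsets(int count, long vectorBytes) {
        long[] offsets = new long[count];
        java.util.Arrays.fill(offsets, 3 * vectorBytes); // all score vector #3
        return offsets;
    }

    // offsets walk the data segment front to back in vector-sized strides
    static long[] sequentialOffsets(int count, long vectorBytes) {
        long[] offsets = new long[count];
        for (int i = 0; i < count; i++) {
            offsets[i] = i * vectorBytes;
        }
        return offsets;
    }

    public static void main(String[] args) {
        long[] dup = duplicateOffsets(8, 768);
        for (long o : dup) {
            if (o != 3 * 768L) throw new AssertionError("not duplicated");
        }
        long[] seq = sequentialOffsets(8, 768);
        for (int i = 1; i < seq.length; i++) {
            if (seq[i] - seq[i - 1] != 768) throw new AssertionError("not sequential");
        }
        System.out.println("ok");
    }
}
```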
…144645) While working on bulk sparse scoring (#144557), I noticed that ByteVectorScorerFactoryTests only tested per-ordinal score() via the supplier path. This PR adds bulk scoring and query-side scorer coverage, extracted from that ongoing work. The test structure is designed so that SNAP directory variants can be added alongside the MMap tests once DirectAccessInput support lands.
Pinging @elastic/es-search-relevance (Team:Search Relevance)
libs/native/src/main/java/org/elasticsearch/nativeaccess/VectorSimilarityFunctions.java
libs/simdvec/src/test21/java/org/elasticsearch/simdvec/internal/IndexInputUtilsTests.java
* <ol>
* <li>Array of 8-byte longs containing the native memory address of each vector</li>
* <li>Single vector to score against</li>
* <li>Number of dimensions, or for bbq, the number of index bytes</li>
I just didn't write the native code for it yet, but given how this is progressing, the native mapper template should be trivial. Lemme take a look.
BBQ can use a similar technique, but the code is a bit more involved. Let's do it as a follow up.
Do we need to do this for BBQ/DiskBBQ? I think that in that case data is always contiguous...
libs/simdvec/src/main21/java/org/elasticsearch/simdvec/internal/IndexInputUtils.java
thecoop
left a comment
A few test tweaks, but otherwise vector side looks good
While working on bulk sparse scoring (#144557), I noticed that checkBulkOffsets and checkBBQBulkOffsets validated segment sizes but not individual offset values. An out-of-range or negative offset would silently read memory beyond the data segment, risking a crash or silently wrong results.

The fix replaces the sequential size check with per-offset validation that each offset points to a valid vector within the data segment. The O(count) loop should be negligible relative to the O(count * dims) native call, but the checks are conditional on asserts to avoid any runtime cost, and asserts should be sufficient given our testing.

Note: INT4 skips size=2 (packedLen=1) because checkBulkOffsets computes rowBytes = packedLen * 4 / 8, which truncates to 0 via integer division, making the bounds check trivially pass. This is a pre-existing issue with how INT4 passes packed byte length (not element count) as the length parameter to the generic check formula. We can address this separately, if needed.
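The per-offset validation and the INT4 truncation pitfall can be sketched as follows; the names and the rowBytes formula mirror the commit message but are illustrative, not the actual simdvec code:

```java
// Hypothetical sketch: each offset must point at a whole vector inside the
// data segment, replacing a single sequential size check.
public class OffsetCheckSketch {
    static boolean validBulkOffsets(long[] offsets, int count, long rowBytes, long dataLength) {
        for (int i = 0; i < count; i++) {
            long o = offsets[i];
            // reject negative offsets and vectors that run past the segment end
            if (o < 0 || o + rowBytes > dataLength) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        long rowBytes = 768;
        long dataLength = 10 * rowBytes;
        if (!validBulkOffsets(new long[] { 0, rowBytes, 9 * rowBytes }, 3, rowBytes, dataLength))
            throw new AssertionError("in-range offsets rejected");
        if (validBulkOffsets(new long[] { -rowBytes }, 1, rowBytes, dataLength))
            throw new AssertionError("negative offset accepted");
        if (validBulkOffsets(new long[] { 10 * rowBytes }, 1, rowBytes, dataLength))
            throw new AssertionError("out-of-range offset accepted");

        // the INT4 pitfall: packedLen=1 gives rowBytes = 1 * 4 / 8 == 0 by
        // integer division, so offset + rowBytes <= dataLength trivially passes
        int packedLen = 1;
        long int4RowBytes = packedLen * 4 / 8;
        if (int4RowBytes != 0) throw new AssertionError("expected truncation to 0");
        System.out.println("ok");
    }
}
```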
…rectAccessInput (elastic#144557) This PR builds on the zero-copy DirectAccessInput infrastructure introduced in elastic#141718 to extend native SIMD bulk vector scoring to searchable snapshot (SNAP) data. Previously, native bulk scoring was limited to memory-mapped files (MemorySegmentAccessInput); SNAP inputs fell back to one-at-a-time Java scoring.

During HNSW graph traversal, the search algorithm scores a batch of candidate neighbor vectors against the query vector in each step. With this change, those batches can now be scored in a single native SIMD call even when the underlying data lives in the shared blob cache rather than a memory-mapped file. The new `DirectAccessInput.withByteBufferSlices` API provides zero-copy access to the cached regions' direct byte buffers, allowing native memory addresses to be extracted and passed directly to the bulk-gather scoring functions without any heap copying. When any vector in a bulk batch crosses a cache region boundary, the entire batch falls back to one-at-a-time scoring. In practice this should be rare: for typical configurations the per-batch fallback probability is well under 1%.

Key changes:

* `DirectAccessInput.withByteBufferSlices` (libs/core): New bulk multi-region zero-copy access method, complementing the single-region `withByteBufferSlice` from elastic#141718. Implementations in `SharedBlobCacheService.CacheFile`, `FrozenIndexInput`, `BlobCacheIndexInput`, and `StoreMetricsIndexInput` handle offset adjustment for sliced inputs and graceful fallback (return false) when regions cross cache boundaries or are not mmap-backed.
* `BULK_GATHER` native operation (libs/simdvec/native): New C++ bulk-gather functions for aarch64 and amd64 that accept an array of native memory addresses (one per vector) instead of requiring contiguous memory. Corresponding `BULK_GATHER` operation plumbing through `VectorSimilarityFunctions` and `JdkVectorLibrary`.
* `IndexInputUtils.withSliceAddresses` (libs/simdvec): Utility that resolves file byte offsets to native memory addresses, dispatching through `MemorySegmentAccessInput` (pointer arithmetic) or `DirectAccessInput` (`withByteBufferSlices`). Includes `reachabilityFence` calls to ensure backing memory remains valid during native calls.
* `ByteVectorScorer` and `Int7SQVectorScorer` (libs/simdvec): Refactored to use `withSliceAddresses` for bulk scoring, supporting both mmap and SNAP inputs through a unified code path. `GatherScorer` extracted as a shared top-level interface.
* Test coverage: New tests across SharedBlobCacheServiceTests, FrozenIndexInputTests, BlobCacheIndexInputTests, StoreMetricsIndexInputTests, IndexInputUtilsTests, and ByteVectorScorerFactoryTests covering bulk access, offset adjustment on sliced inputs, cross-region boundary fallback, eviction scenarios, and the super.bulkScore() fallback path.
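The dispatch described above can be sketched with stand-in interfaces (these are illustrative types, not the Elasticsearch `MemorySegmentAccessInput`/`DirectAccessInput` APIs): an mmap-style resolver that is pure pointer arithmetic, and a cache-style resolver that returns false to request the one-at-a-time fallback when any vector crosses a region boundary.

```java
// Hypothetical sketch of resolving file byte offsets to native addresses,
// with a whole-batch fallback on cache region boundary crossings.
public class SliceAddressSketch {
    interface AddressResolver {
        // fills addrs[0..count) and returns true, or returns false to request
        // the one-at-a-time scoring fallback
        boolean withSliceAddresses(long[] offsets, int count, long rowBytes, long[] addrs);
    }

    // mmap-style path: base address plus pointer arithmetic, always succeeds
    static AddressResolver mmapResolver(long baseAddress) {
        return (offsets, count, rowBytes, addrs) -> {
            for (int i = 0; i < count; i++) {
                addrs[i] = baseAddress + offsets[i];
            }
            return true;
        };
    }

    // cache-style path: succeeds only if every vector fits inside one region
    static AddressResolver regionResolver(long baseAddress, long regionBytes) {
        return (offsets, count, rowBytes, addrs) -> {
            for (int i = 0; i < count; i++) {
                long start = offsets[i];
                if (start / regionBytes != (start + rowBytes - 1) / regionBytes) {
                    return false; // crosses a region boundary: whole batch falls back
                }
                addrs[i] = baseAddress + start;
            }
            return true;
        };
    }

    public static void main(String[] args) {
        long[] offsets = { 0, 96, 4096 - 32 };
        long[] addrs = new long[3];
        if (!mmapResolver(0x1000).withSliceAddresses(offsets, 3, 64, addrs))
            throw new AssertionError("mmap path should always resolve");
        // last vector spans bytes [4064, 4128), crossing the 4096-byte region boundary
        if (regionResolver(0x1000, 4096).withSliceAddresses(offsets, 3, 64, addrs))
            throw new AssertionError("cross-region batch should fall back");
        System.out.println("ok");
    }
}
```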