Enable zero-copy SIMD vector scoring on searchable snapshots (frozen tier) #141718
Merged
ChrisHegarty merged 129 commits into elastic:main on Mar 2, 2026
Conversation
Collaborator
Pinging @elastic/es-search-relevance (Team:Search Relevance)
acdd954 to
0b80022
Compare
thecoop
reviewed
Feb 3, 2026
...java/org/elasticsearch/simdvec/internal/vectorization/MSBitToInt4ESNextOSQVectorsScorer.java
Outdated
thecoop
reviewed
Feb 3, 2026
...hmarks/src/main/java/org/elasticsearch/benchmark/vector/scorer/VectorScorerOSQBenchmark.java
Outdated
ldematte
reviewed
Feb 4, 2026
...java/org/elasticsearch/simdvec/internal/vectorization/MSBitToInt4ESNextOSQVectorsScorer.java
Outdated
Collaborator
Hi @ChrisHegarty, I've created a changelog YAML for you.
benwtrent
approved these changes
Mar 2, 2026
szybia
added a commit
to szybia/elasticsearch
that referenced
this pull request
Mar 2, 2026
…locations
* upstream/main: (94 commits)
Mute org.elasticsearch.xpack.esql.qa.mixed.EsqlClientYamlIT test {p0=esql/40_tsdb/TS Command grouping on text field} elastic#142544
Mute org.elasticsearch.index.store.StoreDirectoryMetricsIT testDirectoryMetrics elastic#143419
Mute org.elasticsearch.xpack.esql.qa.multi_node.GenerativeIT test elastic#143023
TS_INFO information retrieval command (elastic#142721)
ESQL: External source parallel execution and distribution (elastic#143349)
Mute org.elasticsearch.index.mapper.blockloader.FlattenedFieldRootBlockLoaderTests testBlockLoaderForFieldInObject {preference=Params[syntheticSource=false, preference=DOC_VALUES]} elastic#143414
Mute org.elasticsearch.index.mapper.blockloader.FlattenedFieldRootBlockLoaderTests testBlockLoaderForFieldInObject {preference=Params[syntheticSource=false, preference=NONE]} elastic#143413
Mute org.elasticsearch.index.mapper.blockloader.FlattenedFieldRootBlockLoaderTests testBlockLoaderForFieldInObject {preference=Params[syntheticSource=false, preference=STORED]} elastic#143412
Removing ingest random sampling (elastic#143289)
Mute org.elasticsearch.xpack.esql.qa.single_node.GenerativeIT test elastic#143023
[Transform] Clean up internal tests (elastic#143246)
Skip time series field type merge for non-TS agg queries (elastic#143262)
Enable zero-copy SIMD vector scoring on searchable snapshots (frozen tier) (elastic#141718)
Mute org.elasticsearch.xpack.search.CrossClusterAsyncSearchIT testCancelViaExpirationOnRemoteResultsWithMinimizeRoundtrips elastic#143407
Fix MemorySegmentUtilsTests (elastic#143391)
Unmute testWorkflowsRestrictionAllowsAccess (elastic#143308)
Cancel async query on expiry (elastic#143016)
ESQL: Finish migrating error testing (elastic#143322)
Reduce LuceneOperator.Status memory consumption with large QueryDSL queries (elastic#143175)
ESQL: Generative testing with full text functions (elastic#142961)
...
ChrisHegarty
added a commit
that referenced
this pull request
Mar 3, 2026
…n Java 21 (#143479)

On Java 21, FFI disallows passing heap-backed MemorySegments to native downcalls. Java 22+ removes this restriction. IndexInputUtils.withSlice is now the single path through which all simdvec scorers obtain MemorySegments from IndexInput data, and several of those scorers pass the segment directly to native downcalls.

Rather than patching every call site across the scorer classes individually, this fix takes a "safety by design" approach: withSlice itself now guarantees that the segment it hands to callers is always native-safe, regardless of Java version. No current or future caller needs to worry about the heap-segment restriction; correctness is enforced at the source. I also added an assertion to the DirectAccessInput path that the byte buffer is direct, documenting the invariant that real implementations always provide native-backed buffers.

The tradeoff is that, on Java 21 only, the copyAndApply fallback path incurs an extra native allocation and copy (via a confined Arena) where a heap-backed segment would have sufficed, since a number of call sites only use the Panama Vector API and never touch native downcalls. On Java 22+ the behavior is unchanged. This is an acceptable cost: the fallback path is already the slowest path (mmap and direct buffer paths are preferred), and the alternative, requiring every call site to independently guard against heap segments, is fragile and has already proven easy to miss. We can do some further cleanup at the call sites after this fix has been merged.

Caused by #141718. Closes #143441.
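To make the "safety by design" idea concrete, here is a minimal sketch, not the code from #143479. It swaps `MemorySegment` and a confined `Arena` for plain `java.nio.ByteBuffer` so that it runs on any JDK, and it copies unconditionally for heap input rather than only on Java 21. `NativeSafeSlices`, `withNativeSafe`, and `sumNativeOnly` are hypothetical names invented for this illustration.

```java
import java.nio.ByteBuffer;
import java.util.function.ToIntFunction;

// Hypothetical helper: enforces the "off-heap only" invariant at one entry point,
// so individual callers never need to check it themselves.
final class NativeSafeSlices {
    private NativeSafeSlices() {}

    /** Runs the action on a buffer that is guaranteed to be direct (off-heap). */
    static int withNativeSafe(ByteBuffer data, ToIntFunction<ByteBuffer> action) {
        if (data.isDirect()) {
            // Already off-heap: hand out a view without copying.
            return action.applyAsInt(data.duplicate());
        }
        // Heap-backed: copy once into an off-heap buffer before calling the action.
        ByteBuffer direct = ByteBuffer.allocateDirect(data.remaining());
        direct.put(data.duplicate());
        direct.flip();
        return action.applyAsInt(direct);
    }

    /** Stand-in for a native downcall: rejects heap buffers, then sums the bytes. */
    static int sumNativeOnly(ByteBuffer buf) {
        if (!buf.isDirect()) {
            throw new IllegalArgumentException("heap buffer passed to native call");
        }
        int sum = 0;
        for (int i = buf.position(); i < buf.limit(); i++) {
            sum += buf.get(i);
        }
        return sum;
    }
}
```

Because the check lives in `withNativeSafe`, a caller such as `sumNativeOnly` can be handed heap-backed input without ever seeing it, which is the property the fix aims for.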
ldematte
added a commit
to ldematte/elasticsearch
that referenced
this pull request
Mar 6, 2026
This was referenced Mar 6, 2026
ChrisHegarty
added a commit
that referenced
this pull request
Mar 25, 2026
…rectAccessInput (#144557)

This PR builds on the zero-copy DirectAccessInput infrastructure introduced in #141718 to extend native SIMD bulk vector scoring to searchable snapshot (SNAP) data. Previously, native bulk scoring was limited to memory-mapped files (MemorySegmentAccessInput); SNAP inputs fell back to one-at-a-time Java scoring.

During HNSW graph traversal, the search algorithm scores a batch of candidate neighbor vectors against the query vector in each step. With this change, those batches can now be scored in a single native SIMD call even when the underlying data lives in the shared blob cache rather than a memory-mapped file. The new `DirectAccessInput.withByteBufferSlices` API provides zero-copy access to the cached regions' direct byte buffers, allowing native memory addresses to be extracted and passed directly to the bulk-gather scoring functions without any heap copying.

When any vector in a bulk batch crosses a cache region boundary, the entire batch falls back to one-at-a-time scoring. In practice this should be rare: for typical configurations the per-batch fallback probability is well under 1%.

Key changes:

* `DirectAccessInput.withByteBufferSlices` (libs/core): New bulk multi-region zero-copy access method, complementing the single-region `withByteBufferSlice` from #141718. Implementations in `SharedBlobCacheService.CacheFile`, `FrozenIndexInput`, `BlobCacheIndexInput`, and `StoreMetricsIndexInput` handle offset adjustment for sliced inputs and graceful fallback (return false) when regions cross cache boundaries or are not mmap-backed.
* `BULK_GATHER` native operation (libs/simdvec/native): New C++ bulk-gather functions for aarch64 and amd64 that accept an array of native memory addresses (one per vector) instead of requiring contiguous memory. Corresponding BULK_GATHER operation plumbing through `VectorSimilarityFunctions` and `JdkVectorLibrary`.
* `IndexInputUtils.withSliceAddresses` (libs/simdvec): Utility that resolves file byte offsets to native memory addresses, dispatching through `MemorySegmentAccessInput` (pointer arithmetic) or `DirectAccessInput` (withByteBufferSlices). Includes reachabilityFence calls to ensure backing memory remains valid during native calls.
* `ByteVectorScorer` and `Int7SQVectorScorer` (libs/simdvec): Refactored to use withSliceAddresses for bulk scoring, supporting both mmap and SNAP inputs through a unified code path. `GatherScorer` extracted as a shared top-level interface.
* Test coverage: New tests across SharedBlobCacheServiceTests, FrozenIndexInputTests, BlobCacheIndexInputTests, StoreMetricsIndexInputTests, IndexInputUtilsTests, and ByteVectorScorerFactoryTests covering bulk access, offset adjustment on sliced inputs, cross-region boundary fallback, eviction scenarios, and the super.bulkScore() fallback path.
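The cross-region fallback rule described above can be sketched as plain offset arithmetic. This is an illustrative assumption about how such a check might look, not the code from #144557. `RegionBoundaryCheck`, `fitsInOneRegion`, and `canBulkScore` are hypothetical names, and the region size is a free parameter here rather than the blob cache's real setting.

```java
// Hypothetical sketch of the "whole batch falls back" rule for bulk scoring.
final class RegionBoundaryCheck {
    private RegionBoundaryCheck() {}

    /** True if bytes [offset, offset + length) lie inside a single region; length must be > 0. */
    static boolean fitsInOneRegion(long offset, int length, long regionSize) {
        long firstRegion = offset / regionSize;
        long lastRegion = (offset + length - 1) / regionSize;
        return firstRegion == lastRegion;
    }

    /** A batch may take the bulk path only if every vector fits in one region. */
    static boolean canBulkScore(long[] vectorOffsets, int vectorBytes, long regionSize) {
        for (long offset : vectorOffsets) {
            if (!fitsInOneRegion(offset, vectorBytes, regionSize)) {
                return false; // one straddling vector sends the whole batch to the slow path
            }
        }
        return true;
    }
}
```

A vector straddles a boundary only when it starts within its own length of the end of a region, which is why the fallback is expected to be rare when regions are much larger than a single vector.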
18 tasks
## Summary

SIMD-accelerated vector scorers (OSQ / BBQ / int7) previously required an mmap-backed `MemorySegment`, meaning searchable snapshot (frozen tier) data had to be copied onto the heap before scoring. This made quantized vector scoring on frozen indices significantly slower than on locally mmap'd data.

This PR enables the blob-cache to expose its memory-mapped regions directly to the vector scorers, eliminating the heap copy and bringing frozen-tier BBQ scoring throughput on par with that of mmap.

More broadly, this lays the foundation for zero-copy access to blob-cache data beyond vector scoring. The new `DirectAccessInput` interface is a general-purpose contract: any `IndexInput` consumer can use it to obtain a direct `ByteBuffer` view of the underlying data, scoped to a callback with lifecycle managed internally. On the blob-cache side, the new `SharedBytes.IO.byteBufferSlice` and `CacheFileRegion.tryGetByteBufferSlice` APIs are not vector-specific; they can be used by any future codec or query path that would benefit from avoiding heap copies when reading from frozen-tier data (e.g. postings, doc values, stored fields). The `FrozenIndexInput`, `BlobCacheBufferedIndexInput`, and `StoreMetricsIndexInput` wrappers all propagate the interface, so it is preserved through `FilterIndexInput` chains regardless of the consumer.

### What changed

**New `DirectAccessInput` interface** (`libs/core`): a callback-style contract that any `IndexInput` can implement to offer a direct `ByteBuffer` view of its data. The buffer is scoped to the callback, so all ref-counting and lifecycle is handled internally.

**Blob-cache plumbing**: `SharedBytes.IO.byteBufferSlice` and `CacheFileRegion.tryGetByteBufferSlice` expose read-only `ByteBuffer` slices of memory-mapped cache regions, with ref-count held for the duration of the callback. `FrozenIndexInput` implements `DirectAccessInput` on top of this. `BlobCacheBufferedIndexInput` and `StoreMetricsIndexInput` propagate the interface through their wrappers so it is not lost by `FilterIndexInput` chains.

**`IndexInputSegments.withSlice`** (`libs/simdvec`, main21): a single entry point that obtains a `MemorySegment` from an `IndexInput` by trying, in order: `MemorySegmentAccessInput` (mmap) → `DirectAccessInput` (blob-cache) → heap copy (fallback). Resource lifecycle is fully internal; callers just receive a `MemorySegment` in a callback.

**Scorer refactoring**: the `MemorySegment` field is removed from all scorer classes (ES92 int7, OSQ 1-bit / 2-bit / 4-bit). Each scoring method now goes through `IndexInputSegments.withSlice`, and the core arithmetic is extracted into static `*Impl` methods for clarity. Constructors validate that the `IndexInput` is a supported type (`MemorySegmentAccessInput` or `DirectAccessInput`), failing fast otherwise.
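For readers unfamiliar with the callback-scoped pattern, the following is a simplified sketch of how a direct-access contract and a `withSlice`-style dispatcher might fit together. It is an illustration under stated assumptions, not the code in this PR: it uses `ByteBuffer` throughout instead of `MemorySegment`, omits the mmap (`MemorySegmentAccessInput`) branch, and the `Input`, `DirectInput`, `CachedRegion`, and `Slices` types are all hypothetical. The eviction check is also deliberately naive; a real ref-count would acquire the reference atomically.

```java
import java.nio.ByteBuffer;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Consumer;
import java.util.function.Function;

/** Hypothetical minimal input: can always copy bytes onto the heap. */
interface Input {
    byte[] readBytes(int offset, int length);
}

/** Simplified stand-in for a callback-scoped direct-access contract. */
interface DirectInput extends Input {
    /** Returns false when no direct view is available; the buffer is valid only inside the action. */
    boolean withByteBufferSlice(int offset, int length, Consumer<ByteBuffer> action);
}

/** A cache region backed by a direct buffer, with a ref-count held during the callback. */
final class CachedRegion implements DirectInput {
    private final ByteBuffer region;
    final AtomicInteger refs = new AtomicInteger();
    volatile boolean evicted;

    CachedRegion(byte[] data) {
        region = ByteBuffer.allocateDirect(data.length);
        region.put(data);
        region.flip();
    }

    @Override
    public boolean withByteBufferSlice(int offset, int length, Consumer<ByteBuffer> action) {
        if (evicted) {
            return false; // caller falls back to the copying path
        }
        refs.incrementAndGet(); // keep the region alive while the callback runs
        try {
            ByteBuffer view = region.asReadOnlyBuffer();
            view.position(offset);
            view.limit(offset + length);
            action.accept(view.slice());
            return true;
        } finally {
            refs.decrementAndGet();
        }
    }

    @Override
    public byte[] readBytes(int offset, int length) {
        byte[] out = new byte[length];
        ByteBuffer view = region.duplicate();
        view.position(offset);
        view.get(out);
        return out;
    }
}

final class Slices {
    private Slices() {}

    /** Tries the zero-copy direct view first, then falls back to a heap copy. */
    static <R> R withSlice(Input in, int offset, int length, Function<ByteBuffer, R> action) {
        if (in instanceof DirectInput) {
            Object[] result = new Object[1];
            boolean ok = ((DirectInput) in).withByteBufferSlice(offset, length, buf -> {
                result[0] = action.apply(buf);
            });
            if (ok) {
                @SuppressWarnings("unchecked")
                R r = (R) result[0];
                return r;
            }
        }
        return action.apply(ByteBuffer.wrap(in.readBytes(offset, length)));
    }
}
```

The point of the callback shape is that the caller never holds the buffer beyond the scoring call, so the cache can release or evict the region as soon as the callback returns.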