
Enable zero-copy SIMD vector scoring on searchable snapshots (frozen tier)#141718

Merged

ChrisHegarty merged 129 commits into elastic:main from ChrisHegarty:bbq_vec_ops on Mar 2, 2026
Conversation


@ChrisHegarty ChrisHegarty commented Feb 3, 2026

Summary

SIMD-accelerated vector scorers (OSQ / BBQ / int7) previously required an mmap-backed
MemorySegment, meaning searchable snapshot (frozen tier) data had to be copied onto the
heap before scoring. This made quantized vector scoring on frozen indices significantly slower than
on locally mmap'd data.

This PR enables the blob-cache to expose its memory-mapped regions directly to the vector
scorers, eliminating the heap copy and bringing frozen-tier BBQ scoring throughput on par with that
of mmap.

More broadly, this lays the foundation for zero-copy access to blob-cache data beyond
vector scoring. The new DirectAccessInput interface is a general-purpose contract — any
IndexInput consumer can use it to obtain a direct ByteBuffer view of the underlying
data, scoped to a callback with lifecycle managed internally. On the blob-cache side, the
new SharedBytes.IO.byteBufferSlice and CacheFileRegion.tryGetByteBufferSlice APIs are
not vector-specific; they can be used by any future codec or query path that would benefit
from avoiding heap copies when reading from frozen-tier data (e.g. postings, doc values,
stored fields). The FrozenIndexInput, BlobCacheBufferedIndexInput, and
StoreMetricsIndexInput wrappers all propagate the interface, so it is preserved through
FilterIndexInput chains regardless of the consumer.

What changed

New DirectAccessInput interface (libs/core) — a callback-style contract that any
IndexInput can implement to offer a direct ByteBuffer view of its data. The buffer is
scoped to the callback, so all ref-counting and lifecycle is handled internally.
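The callback-scoped contract described above can be sketched as follows. This is an illustrative stand-in, not the actual interface from `libs/core`: the names, signatures, and the toy heap-backed implementation are assumptions for demonstration. Real implementations hand out direct (off-heap) buffers backed by the mmap'd cache region.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

public class DirectAccessSketch {

    // Hypothetical shape of the callback-style contract: the slice is only
    // valid inside the callback, and any ref-count on the backing region is
    // held by the implementation for the duration of the call.
    interface DirectAccessInput {
        // Runs the reader against a read-only view of [pos, pos + len).
        // Returns false if a direct view is unavailable (caller falls back).
        boolean tryReadDirect(long pos, int len, SliceReader reader);
    }

    @FunctionalInterface
    interface SliceReader {
        void read(ByteBuffer slice);
    }

    // Toy in-memory implementation over a fixed byte array; a real one would
    // expose direct buffers over a memory-mapped cache region instead.
    public static DirectAccessInput over(byte[] data) {
        ByteBuffer whole = ByteBuffer.wrap(data).asReadOnlyBuffer();
        return (pos, len, reader) -> {
            if (pos < 0 || pos + len > data.length) {
                return false; // out of range: caller must use a heap copy
            }
            reader.read(whole.slice((int) pos, len));
            return true;
        };
    }

    public static String readRange(byte[] data, long pos, int len) {
        StringBuilder out = new StringBuilder();
        boolean ok = over(data).tryReadDirect(pos, len, slice -> {
            byte[] tmp = new byte[slice.remaining()];
            slice.get(tmp);
            out.append(new String(tmp, StandardCharsets.UTF_8));
        });
        return ok ? out.toString() : null;
    }

    public static void main(String[] args) {
        System.out.println(readRange("frozen-tier".getBytes(StandardCharsets.UTF_8), 0, 6));
    }
}
```

Scoping the buffer to a callback is what makes the internal lifecycle management possible: the implementation can acquire a ref-count before invoking the reader and release it immediately after, so the consumer never holds a view that could outlive the cached region.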

Blob-cache plumbing — SharedBytes.IO.byteBufferSlice and
CacheFileRegion.tryGetByteBufferSlice expose read-only ByteBuffer slices of
memory-mapped cache regions, with ref-count held for the duration of the callback.
FrozenIndexInput implements DirectAccessInput on top of this.
BlobCacheBufferedIndexInput and StoreMetricsIndexInput propagate the interface through
their wrappers so it is not lost by FilterIndexInput chains.

IndexInputSegments.withSlice (libs/simdvec, main21) — a single entry point that
obtains a MemorySegment from an IndexInput by trying, in order:
MemorySegmentAccessInput (mmap) → DirectAccessInput (blob-cache) → heap copy (fallback).
Resource lifecycle is fully internal; callers just receive a MemorySegment in a callback.
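The three-way dispatch described above can be sketched like so. `ByteBuffer` stands in for `MemorySegment` so the example runs without the FFM API, and the interface names are simplified stand-ins, not the actual Elasticsearch or Lucene types; the real entry point is `IndexInputSegments.withSlice`.

```java
import java.nio.ByteBuffer;

public class WithSliceSketch {

    // Stand-in for MemorySegmentAccessInput: can expose the mmap'd bytes directly.
    public interface MmapLike {
        ByteBuffer view(long pos, int len);
    }

    // Stand-in for DirectAccessInput: may expose a direct view, or decline.
    public interface DirectLike {
        ByteBuffer tryView(long pos, int len); // null means "not available"
    }

    // Try mmap first, then the blob-cache direct view, then a heap copy.
    public static ByteBuffer slice(Object input, byte[] raw, long pos, int len) {
        if (input instanceof MmapLike m) {
            return m.view(pos, len);
        }
        if (input instanceof DirectLike d) {
            ByteBuffer direct = d.tryView(pos, len);
            if (direct != null) {
                return direct;
            }
        }
        // Fallback: copy the requested range onto the heap.
        return ByteBuffer.wrap(raw, (int) pos, len).slice();
    }

    public static void main(String[] args) {
        byte[] data = {1, 2, 3, 4, 5};
        DirectLike declines = (pos, len) -> null; // cache region unavailable
        System.out.println(slice(declines, data, 1, 3).get(0));
    }
}
```

The ordering matters: mmap'd data is always preferred because it needs no ref-counting at access time, the blob-cache direct view avoids the copy when a region is resident, and the heap copy remains as the universal (slowest) fallback.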

Scorer refactoring — the MemorySegment field is removed from all scorer classes
(ES92 int7, OSQ 1-bit / 2-bit / 4-bit). Each scoring method now goes through
IndexInputSegments.withSlice, and the core arithmetic is extracted into static *Impl
methods for clarity. Constructors validate that the IndexInput is a supported type
(MemorySegmentAccessInput or DirectAccessInput), failing fast otherwise.

@ChrisHegarty ChrisHegarty added :Search Relevance/Vectors Vector search Team:Search Relevance Meta label for the Search Relevance team in Elasticsearch labels Feb 3, 2026
@elasticsearchmachine

Pinging @elastic/es-search-relevance (Team:Search Relevance)

@elasticsearchmachine elasticsearchmachine added the serverless-linked Added by automation, don't add manually label Feb 3, 2026
@ChrisHegarty ChrisHegarty changed the title from "Rework BBQ vector comparison ops to work with the blob-cache" to "Rework BBQ vector comparison ops to work more efficiently with the blob-cache" Feb 3, 2026
@elasticsearchmachine

Hi @ChrisHegarty, I've created a changelog YAML for you.

@ChrisHegarty ChrisHegarty merged commit a82f799 into elastic:main Mar 2, 2026
26 of 27 checks passed
@ChrisHegarty ChrisHegarty deleted the bbq_vec_ops branch March 2, 2026 17:38
szybia added a commit to szybia/elasticsearch that referenced this pull request Mar 2, 2026
…locations

* upstream/main: (94 commits)
  Mute org.elasticsearch.xpack.esql.qa.mixed.EsqlClientYamlIT test {p0=esql/40_tsdb/TS Command grouping on text field} elastic#142544
  Mute org.elasticsearch.index.store.StoreDirectoryMetricsIT testDirectoryMetrics elastic#143419
  Mute org.elasticsearch.xpack.esql.qa.multi_node.GenerativeIT test elastic#143023
  TS_INFO information retrieval command (elastic#142721)
  ESQL: External source parallel execution and distribution (elastic#143349)
  Mute org.elasticsearch.index.mapper.blockloader.FlattenedFieldRootBlockLoaderTests testBlockLoaderForFieldInObject {preference=Params[syntheticSource=false, preference=DOC_VALUES]} elastic#143414
  Mute org.elasticsearch.index.mapper.blockloader.FlattenedFieldRootBlockLoaderTests testBlockLoaderForFieldInObject {preference=Params[syntheticSource=false, preference=NONE]} elastic#143413
  Mute org.elasticsearch.index.mapper.blockloader.FlattenedFieldRootBlockLoaderTests testBlockLoaderForFieldInObject {preference=Params[syntheticSource=false, preference=STORED]} elastic#143412
  Removing ingest random sampling (elastic#143289)
  Mute org.elasticsearch.xpack.esql.qa.single_node.GenerativeIT test elastic#143023
  [Transform] Clean up internal tests (elastic#143246)
  Skip time series field type merge for non-TS agg queries (elastic#143262)
  Enable zero-copy SIMD vector scoring on searchable snapshots (frozen tier) (elastic#141718)
  Mute org.elasticsearch.xpack.search.CrossClusterAsyncSearchIT testCancelViaExpirationOnRemoteResultsWithMinimizeRoundtrips elastic#143407
  Fix MemorySegmentUtilsTests (elastic#143391)
  Unmute testWorkflowsRestrictionAllowsAccess (elastic#143308)
  Cancel async query on expiry (elastic#143016)
  ESQL: Finish migrating error testing (elastic#143322)
  Reduce LuceneOperator.Status memory consumption with large QueryDSL queries (elastic#143175)
  ESQL: Generative testing with full text functions (elastic#142961)
  ...
tballison pushed a commit to tballison/elasticsearch that referenced this pull request Mar 3, 2026
… tier) (elastic#141718)

ChrisHegarty added a commit that referenced this pull request Mar 3, 2026
…n Java 21 (#143479)

On Java 21, FFI disallows passing heap-backed MemorySegments to native downcalls. Java 22+ removes this restriction.

IndexInputUtils.withSlice is now the single path through which all simdvec scorers obtain MemorySegments from IndexInput data, and several of those scorers pass the segment directly to native downcalls. Rather than patching every call site across the scorer classes individually, this fix takes a "safety by design" approach: withSlice itself now guarantees that the segment it hands to callers is always native-safe, regardless of Java version. No current or future caller needs to worry about the heap-segment restriction; correctness is enforced at the source.

I also added an assertion to the DirectAccessInput path that the byte buffer is direct, documenting the invariant that real implementations always provide native-backed buffers.

The tradeoff is that on Java 21 only, the copyAndApply fallback path incurs an extra native allocation and copy (via a confined Arena) where a heap-backed segment would have sufficed, since a number of call sites only use the Panama Vector API and never touch native downcalls. On Java 22+ the behavior is unchanged. This is an acceptable cost: the fallback path is already the slowest path (the mmap and direct-buffer paths are preferred), and the alternative, requiring every call site to independently guard against heap segments, is fragile and has already proven easy to miss. We can do some further cleanup at the call sites after this fix has been merged.
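The "safety by design" guard described above can be sketched as follows, with `ByteBuffer.isDirect()` standing in for the MemorySegment native check so the example runs without the FFM API. The real guard lives inside `IndexInputUtils.withSlice`; the name and shape here are illustrative only.

```java
import java.nio.ByteBuffer;

public class NativeSafeSketch {

    // On Java 21, only native (off-heap) memory may be passed to native
    // downcalls, so the guard copies heap-backed data into a direct buffer.
    // Callers never see a heap-backed buffer, regardless of which path
    // produced it: correctness is enforced at this single choke point.
    public static ByteBuffer nativeSafe(ByteBuffer b) {
        if (b.isDirect()) {
            return b; // already safe to hand to a native downcall
        }
        ByteBuffer copy = ByteBuffer.allocateDirect(b.remaining());
        copy.put(b.duplicate()).flip(); // duplicate() leaves the source position untouched
        return copy;
    }

    public static void main(String[] args) {
        ByteBuffer heap = ByteBuffer.wrap(new byte[] {7, 8, 9});
        ByteBuffer safe = nativeSafe(heap);
        System.out.println(safe.isDirect() + " " + safe.get(0));
    }
}
```

Note how the direct case is a pure pass-through: on Java 22+ (or when the data already comes from mmap or a direct blob-cache buffer) the guard costs a single branch, and only the Java 21 heap fallback pays for the extra copy.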

caused by #141718
closes #143441
GalLalouche pushed a commit to GalLalouche/elasticsearch that referenced this pull request Mar 3, 2026
…n Java 21 (elastic#143479)

shmuelhanoch pushed a commit to shmuelhanoch/elasticsearch that referenced this pull request Mar 4, 2026
…n Java 21 (elastic#143479)

ldematte added a commit to ldematte/elasticsearch that referenced this pull request Mar 6, 2026
ChrisHegarty added a commit that referenced this pull request Mar 25, 2026
…rectAccessInput (#144557)

This PR builds on the zero-copy DirectAccessInput infrastructure introduced in #141718 to extend native SIMD bulk vector scoring to searchable snapshot (SNAP) data. Previously, native bulk scoring was limited to memory-mapped files (MemorySegmentAccessInput); SNAP inputs fell back to one-at-a-time Java scoring.

During HNSW graph traversal, the search algorithm scores a batch of candidate neighbor vectors against the query vector in each step. With this change, those batches can now be scored in a single native SIMD call even when the underlying data lives in the shared blob cache rather than a memory-mapped file. The new `DirectAccessInput.withByteBufferSlices` API provides zero-copy access to the cached regions' direct byte buffers, allowing native memory addresses to be extracted and passed directly to the bulk-gather scoring functions without any heap copying. When any vector in a bulk batch crosses a cache region boundary, the entire batch falls back to one-at-a-time scoring. In practice this should be rare: for typical configurations the per-batch fallback probability is well under 1%.

Key changes:
* `DirectAccessInput.withByteBufferSlices` (libs/core): New bulk multi-region zero-copy access method, complementing the single-region `withByteBufferSlice` from #141718. Implementations in `SharedBlobCacheService.CacheFile`, `FrozenIndexInput`, `BlobCacheIndexInput`, and `StoreMetricsIndexInput` handle offset adjustment for sliced inputs and graceful fallback (return false) when regions cross cache boundaries or are not mmap-backed.
* `BULK_GATHER` native operation (libs/simdvec/native): New C++ bulk-gather functions for aarch64 and amd64 that accept an array of native memory addresses (one per vector) instead of requiring contiguous memory. Corresponding BULK_GATHER operation plumbing through `VectorSimilarityFunctions` and `JdkVectorLibrary`.
* `IndexInputUtils.withSliceAddresses` (libs/simdvec): Utility that resolves file byte offsets to native memory addresses, dispatching through `MemorySegmentAccessInput` (pointer arithmetic) or `DirectAccessInput` (withByteBufferSlices). Includes reachabilityFence calls to ensure backing memory remains valid during native calls.
* `ByteVectorScorer` and `Int7SQVectorScorer` (libs/simdvec): Refactored to use withSliceAddresses for bulk scoring, supporting both mmap and SNAP inputs through a unified code path. `GatherScorer` extracted as a shared top-level interface.
* Test coverage: New tests across SharedBlobCacheServiceTests, FrozenIndexInputTests, BlobCacheIndexInputTests, StoreMetricsIndexInputTests, IndexInputUtilsTests, and ByteVectorScorerFactoryTests covering bulk access, offset adjustment on sliced inputs, cross-region boundary fallback, eviction scenarios, and the super.bulkScore() fallback path.
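The "well under 1%" per-batch fallback claim above can be checked with back-of-the-envelope arithmetic. The region size, vector size, and batch size below are assumed example values, not numbers taken from the Elasticsearch configuration.

```java
public class BoundaryFallbackEstimate {

    // A v-byte vector starting at a uniformly random offset crosses a region
    // boundary iff its start falls in the last v-1 bytes of a region, so
    // p ≈ (v - 1) / regionSize per vector. A batch falls back if any of its
    // k vectors crosses a boundary: 1 - (1 - p)^k.
    public static double batchFallbackProbability(long regionSize, int vectorBytes, int batchSize) {
        double perVector = (vectorBytes - 1) / (double) regionSize;
        return 1.0 - Math.pow(1.0 - perVector, batchSize);
    }

    public static void main(String[] args) {
        long regionSize = 16L << 20; // assume 16 MiB cache regions
        int vectorBytes = 1024;      // e.g. a 1024-dimension byte vector
        int batchSize = 32;          // candidates scored per bulk call
        System.out.printf("%.4f%%%n", 100 * batchFallbackProbability(regionSize, vectorBytes, batchSize));
    }
}
```

Under these assumptions a single vector crosses a boundary with probability roughly 6e-5, and a 32-vector batch falls back with probability around 0.2%, consistent with the claim that the one-at-a-time fallback is rare in practice.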
seanzatzdev pushed a commit to seanzatzdev/elasticsearch that referenced this pull request Mar 27, 2026
…rectAccessInput (elastic#144557)

mamazzol pushed a commit to mamazzol/elasticsearch that referenced this pull request Mar 30, 2026
…rectAccessInput (elastic#144557)


Labels

>enhancement
:Search Relevance/Vectors (Vector search)
serverless-linked (Added by automation, don't add manually)
Team:Search Relevance (Meta label for the Search Relevance team in Elasticsearch)
v9.4.0

7 participants