Enable zero-copy SIMD vector scoring on searchable snapshots (frozen tier) by ChrisHegarty · Pull Request #141718 · elastic/elasticsearch

ChrisHegarty · 2026-02-03T11:05:08Z

Summary

SIMD-accelerated vector scorers (OSQ / BBQ / int7) previously required an mmap-backed
MemorySegment, meaning searchable snapshot (frozen tier) data had to be copied onto the
heap before scoring. This made quantized vector scoring on frozen indices significantly slower than
on locally mmap'd data.

This PR enables the blob-cache to expose its memory-mapped regions directly to the vector
scorers, eliminating the heap copy and bringing frozen-tier BBQ scoring throughput on par with that
of mmap.

More broadly, this lays the foundation for zero-copy access to blob-cache data beyond
vector scoring. The new DirectAccessInput interface is a general-purpose contract — any
IndexInput consumer can use it to obtain a direct ByteBuffer view of the underlying
data, scoped to a callback with lifecycle managed internally. On the blob-cache side, the
new SharedBytes.IO.byteBufferSlice and CacheFileRegion.tryGetByteBufferSlice APIs are
not vector-specific; they can be used by any future codec or query path that would benefit
from avoiding heap copies when reading from frozen-tier data (e.g. postings, doc values,
stored fields). The FrozenIndexInput, BlobCacheBufferedIndexInput, and
StoreMetricsIndexInput wrappers all propagate the interface, so it is preserved through
FilterIndexInput chains regardless of the consumer.

What changed

New DirectAccessInput interface (libs/core) — a callback-style contract that any
IndexInput can implement to offer a direct ByteBuffer view of its data. The buffer is
scoped to the callback, so all ref-counting and lifecycle management is handled internally.
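As a rough illustration of the callback-scoped contract described above, the sketch below models a direct-access input over a single direct `ByteBuffer`. The method name `withByteBufferSlice` is the one the follow-up PR refers to. The interface name `DirectAccessSketch`, the boolean return, the `Consumer` parameter, and the `SingleBufferInput` class are assumptions made for illustration and may not match the real `DirectAccessInput` in libs/core.

```java
import java.nio.ByteBuffer;
import java.util.function.Consumer;

// Hypothetical stand-in for the DirectAccessInput contract; the real signature may differ.
interface DirectAccessSketch {
    // Runs action on a read-only view of [offset, offset + length).
    // Returns false if no direct view can be offered, so the caller can fall back.
    boolean withByteBufferSlice(long offset, int length, Consumer<ByteBuffer> action);
}

// Illustrative implementation backed by one buffer (assumed to model a mapped region).
final class SingleBufferInput implements DirectAccessSketch {
    private final ByteBuffer backing;

    SingleBufferInput(ByteBuffer backing) {
        this.backing = backing;
    }

    @Override
    public boolean withByteBufferSlice(long offset, int length, Consumer<ByteBuffer> action) {
        if (offset < 0 || length < 0 || offset + length > backing.capacity()) {
            return false; // range not available directly
        }
        ByteBuffer view = backing.duplicate(); // independent position/limit, shared bytes
        view.clear();
        view.position((int) offset);
        view.limit((int) offset + length);
        // The view is only meant to be used inside the callback.
        action.accept(view.slice().asReadOnlyBuffer());
        return true;
    }
}
```

Because the buffer never escapes the callback, an implementation is free to pin and release whatever backs it around the `action.accept` call without the consumer being involved.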

Blob-cache plumbing — SharedBytes.IO.byteBufferSlice and
CacheFileRegion.tryGetByteBufferSlice expose read-only ByteBuffer slices of
memory-mapped cache regions, with the ref-count held for the duration of the callback.
FrozenIndexInput implements DirectAccessInput on top of this.
BlobCacheBufferedIndexInput and StoreMetricsIndexInput propagate the interface through
their wrappers so it is not lost by FilterIndexInput chains.
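The "ref-count held for the duration of the callback" behaviour can be sketched as follows. This is a minimal model under stated assumptions, not the blob-cache code: `RegionSketch`, its `evict` method, and the signature given to `tryGetByteBufferSlice` are hypothetical, and only the method name is taken from the PR text.

```java
import java.nio.ByteBuffer;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Consumer;

// Hypothetical model of a cache region whose buffer may be evicted and reused.
final class RegionSketch {
    private final ByteBuffer mapped;
    // Starts at 1 (the cache's own reference); 0 means fully released.
    private final AtomicInteger refs = new AtomicInteger(1);
    private final AtomicBoolean evicted = new AtomicBoolean();

    RegionSketch(ByteBuffer mapped) {
        this.mapped = mapped;
    }

    // Takes a reference unless the count has already reached zero.
    private boolean tryIncRef() {
        int current;
        do {
            current = refs.get();
            if (current == 0) {
                return false;
            }
        } while (!refs.compareAndSet(current, current + 1));
        return true;
    }

    // Drops the cache's own reference once; readers inside a callback keep the count above zero.
    void evict() {
        if (evicted.compareAndSet(false, true)) {
            refs.decrementAndGet();
        }
    }

    int refCount() {
        return refs.get();
    }

    // Name mirrors the PR text; the signature is an assumption.
    boolean tryGetByteBufferSlice(int offset, int length, Consumer<ByteBuffer> action) {
        if (offset < 0 || length < 0 || offset + length > mapped.capacity()) {
            return false;
        }
        if (!tryIncRef()) {
            return false; // already released: caller falls back to a copying read
        }
        try {
            ByteBuffer view = mapped.duplicate();
            view.clear();
            view.position(offset);
            view.limit(offset + length);
            action.accept(view.slice().asReadOnlyBuffer());
            return true;
        } finally {
            refs.decrementAndGet(); // released only after the callback returns
        }
    }
}
```

A wrapper that wants to propagate the capability would simply forward the call to its delegate, adjusting the offset if it represents a slice.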

IndexInputSegments.withSlice (libs/simdvec, main21) — a single entry point that
obtains a MemorySegment from an IndexInput by trying, in order:
MemorySegmentAccessInput (mmap) → DirectAccessInput (blob-cache) → heap copy (fallback).
Resource lifecycle is fully internal; callers just receive a MemorySegment in a callback.
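The three-step lookup order can be sketched like this. To keep the example compilable on older JDKs it uses `ByteBuffer` where the real code hands out a `MemorySegment`; every type name here (`InputSketch`, `MappedInputSketch`, `DirectInputSketch`, `SliceDispatchSketch`) is hypothetical, and only the ordering mmap, then direct access, then heap copy comes from the description above.

```java
import java.nio.ByteBuffer;
import java.util.function.Consumer;
import java.util.function.Function;

// Hypothetical stand-ins for the three kinds of input the PR distinguishes.
interface InputSketch {
    void readBytes(long offset, byte[] dst); // plain copying read, always available
}

interface MappedInputSketch extends InputSketch {
    ByteBuffer mappedView(long offset, int length); // models MemorySegmentAccessInput (mmap)
}

interface DirectInputSketch extends InputSketch {
    // models DirectAccessInput (blob-cache); false means "not available, fall back"
    boolean withByteBufferSlice(long offset, int length, Consumer<ByteBuffer> action);
}

final class SliceDispatchSketch {
    // Tries mmap, then direct access, then a heap copy; the caller only sees the callback.
    static <R> R withSlice(InputSketch in, long offset, int length, Function<ByteBuffer, R> action) {
        if (in instanceof MappedInputSketch) {
            return action.apply(((MappedInputSketch) in).mappedView(offset, length));
        }
        if (in instanceof DirectInputSketch) {
            Object[] holder = new Object[1];
            boolean served = ((DirectInputSketch) in)
                .withByteBufferSlice(offset, length, buf -> holder[0] = action.apply(buf));
            if (served) {
                @SuppressWarnings("unchecked")
                R result = (R) holder[0];
                return result;
            }
        }
        // Fallback: copy the bytes onto the heap and wrap them.
        byte[] heap = new byte[length];
        in.readBytes(offset, heap);
        return action.apply(ByteBuffer.wrap(heap));
    }
}
```

Callers never learn which path was taken, which is what lets the scorers drop their own `MemorySegment` field.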

Scorer refactoring — the MemorySegment field is removed from all scorer classes
(ES92 int7, OSQ 1-bit / 2-bit / 4-bit). Each scoring method now goes through
IndexInputSegments.withSlice, and the core arithmetic is extracted into static *Impl
methods for clarity. Constructors validate that the IndexInput is a supported type
(MemorySegmentAccessInput or DirectAccessInput), failing fast otherwise.
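The shape of the refactored scorers might look roughly like the sketch below: no cached segment field, a constructor that rejects unsupported inputs, and arithmetic isolated in a static `*Impl` method. All names (`ScorerInput`, `MappedScorerInput`, `DirectScorerInput`, `Int7ScorerSketch`) are hypothetical, the dot product is a plain scalar loop rather than SIMD, and the slice is obtained by wrapping a byte array where the real code would go through `IndexInputSegments.withSlice`.

```java
import java.nio.ByteBuffer;

// Hypothetical input types; the two sub-interfaces stand in for
// MemorySegmentAccessInput and DirectAccessInput respectively.
interface ScorerInput {
    byte[] bytes();
}

interface MappedScorerInput extends ScorerInput {}

interface DirectScorerInput extends ScorerInput {}

final class Int7ScorerSketch {
    private final ScorerInput input; // only the input is kept, no segment field

    Int7ScorerSketch(ScorerInput input) {
        // Fail fast on inputs that cannot provide a zero-copy view.
        if (!(input instanceof MappedScorerInput) && !(input instanceof DirectScorerInput)) {
            throw new IllegalArgumentException("unsupported input type: " + input.getClass().getName());
        }
        this.input = input;
    }

    // Obtains a slice per call, then delegates to the static arithmetic.
    int score(long offset, byte[] query) {
        ByteBuffer slice = ByteBuffer.wrap(input.bytes(), (int) offset, query.length).slice();
        return dotProductImpl(slice, query);
    }

    // Pure arithmetic, independent of how the bytes were obtained.
    static int dotProductImpl(ByteBuffer vector, byte[] query) {
        int sum = 0;
        for (int i = 0; i < query.length; i++) {
            sum += vector.get(i) * query[i];
        }
        return sum;
    }
}
```

Keeping the arithmetic static makes it testable against any buffer, regardless of which access path produced it.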

elasticsearchmachine · 2026-02-03T11:05:34Z

Pinging @elastic/es-search-relevance (Team:Search Relevance)

elasticsearchmachine · 2026-02-23T15:32:03Z

Hi @ChrisHegarty, I've created a changelog YAML for you.

…n Java 21 (#143479)

On Java 21, FFI disallows passing heap-backed MemorySegments to native downcalls. Java 22+ removes this restriction. IndexInputUtils.withSlice is now the single path through which all simdvec scorers obtain MemorySegments from IndexInput data, and several of those scorers pass the segment directly to native downcalls.

Rather than patching every call site across the scorer classes individually, this fix takes a "safety by design" approach: withSlice itself now guarantees that the segment it hands to callers is always native-safe, regardless of Java version. No current or future caller needs to worry about the heap-segment restriction; correctness is enforced at the source. I also added an assertion to the DirectAccessInput path that the byte buffer is direct, documenting the invariant that real implementations always provide native-backed buffers.

The tradeoff is that, on Java 21 only, the copyAndApply fallback path incurs an extra native allocation and copy (via a confined Arena) where a heap-backed segment would have sufficed, since a number of call sites only use the Panama Vector API and never touch native downcalls. On Java 22+ the behavior is unchanged. This is an acceptable cost: the fallback path is already the slowest path (mmap and direct buffer paths are preferred), and the alternative, requiring every call site to independently guard against heap segments, is fragile and has already proven easy to miss. We can do some further cleanup at the call sites after this fix has been merged.

Caused by #141718. Closes #143441.
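The guarantee described in this fix can be approximated with the sketch below. It is not the actual change: the real code works with `MemorySegment` and a confined `Arena`, whereas this example uses `ByteBuffer` and `allocateDirect` so that it compiles without the Foreign Function and Memory API. The class `NativeSafeSketch` and both method names are hypothetical, and the Java version is passed in as a parameter purely to make the behaviour easy to exercise.

```java
import java.nio.ByteBuffer;

final class NativeSafeSketch {
    // Assumption taken from the fix description: heap-backed memory may be passed to
    // native downcalls on Java 22+, but not on Java 21.
    static boolean heapAllowedInDowncalls(int javaFeatureVersion) {
        return javaFeatureVersion >= 22;
    }

    // Returns a buffer that is safe to hand to native code for the given Java version.
    // Direct buffers pass through untouched; heap buffers are copied off-heap only when required.
    static ByteBuffer nativeSafe(ByteBuffer src, int javaFeatureVersion) {
        if (src.isDirect() || heapAllowedInDowncalls(javaFeatureVersion)) {
            return src;
        }
        ByteBuffer copy = ByteBuffer.allocateDirect(src.remaining());
        copy.put(src.duplicate()); // duplicate() leaves the caller's position untouched
        copy.flip();
        return copy;
    }
}
```

In practice the version would come from something like `Runtime.version().feature()`. Note one difference from the approach the fix describes: a confined `Arena` frees its allocation deterministically on close, while a buffer from `allocateDirect` is reclaimed by the garbage collector.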

…rectAccessInput (#144557)

This PR builds on the zero-copy DirectAccessInput infrastructure introduced in #141718 to extend native SIMD bulk vector scoring to searchable snapshot (SNAP) data. Previously, native bulk scoring was limited to memory-mapped files (MemorySegmentAccessInput); SNAP inputs fell back to one-at-a-time Java scoring.

During HNSW graph traversal, the search algorithm scores a batch of candidate neighbor vectors against the query vector in each step. With this change, those batches can now be scored in a single native SIMD call even when the underlying data lives in the shared blob cache rather than a memory-mapped file. The new `DirectAccessInput.withByteBufferSlices` API provides zero-copy access to the cached regions' direct byte buffers, allowing native memory addresses to be extracted and passed directly to the bulk-gather scoring functions without any heap copying.

When any vector in a bulk batch crosses a cache region boundary, the entire batch falls back to one-at-a-time scoring. In practice this should be rare: for typical configurations the per-batch fallback probability is well under 1%.

Key changes:

* `DirectAccessInput.withByteBufferSlices` (libs/core): New bulk multi-region zero-copy access method, complementing the single-region `withByteBufferSlice` from #141718. Implementations in `SharedBlobCacheService.CacheFile`, `FrozenIndexInput`, `BlobCacheIndexInput`, and `StoreMetricsIndexInput` handle offset adjustment for sliced inputs and graceful fallback (return false) when regions cross cache boundaries or are not mmap-backed.
* `BULK_GATHER` native operation (libs/simdvec/native): New C++ bulk-gather functions for aarch64 and amd64 that accept an array of native memory addresses (one per vector) instead of requiring contiguous memory. Corresponding BULK_GATHER operation plumbing through `VectorSimilarityFunctions` and `JdkVectorLibrary`.
* `IndexInputUtils.withSliceAddresses` (libs/simdvec): Utility that resolves file byte offsets to native memory addresses, dispatching through `MemorySegmentAccessInput` (pointer arithmetic) or `DirectAccessInput` (withByteBufferSlices). Includes reachabilityFence calls to ensure backing memory remains valid during native calls.
* `ByteVectorScorer` and `Int7SQVectorScorer` (libs/simdvec): Refactored to use withSliceAddresses for bulk scoring, supporting both mmap and SNAP inputs through a unified code path. `GatherScorer` extracted as a shared top-level interface.
* Test coverage: New tests across SharedBlobCacheServiceTests, FrozenIndexInputTests, BlobCacheIndexInputTests, StoreMetricsIndexInputTests, IndexInputUtilsTests, and ByteVectorScorerFactoryTests covering bulk access, offset adjustment on sliced inputs, cross-region boundary fallback, eviction scenarios, and the super.bulkScore() fallback path.
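The offset-to-address resolution and the all-or-nothing fallback at region boundaries can be illustrated with plain arithmetic. This is a sketch under assumptions: `GatherSketch`, `resolveAddresses`, the fixed `regionSize`, and the `regionBase` table of per-region base addresses are all hypothetical and are not taken from `IndexInputUtils.withSliceAddresses`.

```java
final class GatherSketch {
    // Fills addresses[i] with the native address of the vector starting at offsets[i].
    // Returns false (leaving the caller to score one vector at a time) if any vector
    // spans two cache regions. regionBase[r] is the assumed base address of region r.
    static boolean resolveAddresses(long[] offsets, int vectorBytes, long regionSize,
                                    long[] regionBase, long[] addresses) {
        for (int i = 0; i < offsets.length; i++) {
            long first = offsets[i] / regionSize;
            long last = (offsets[i] + vectorBytes - 1) / regionSize;
            if (first != last) {
                return false; // crosses a region boundary: whole batch falls back
            }
            addresses[i] = regionBase[(int) first] + offsets[i] % regionSize;
        }
        return true;
    }
}
```

For example, with 100-byte regions based at addresses 1000 and 5000 and 10-byte vectors, file offsets 0, 90 and 150 resolve to 1000, 1090 and 5050, while a vector starting at offset 95 straddles both regions and forces the fallback.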

ChrisHegarty added 2 commits January 30, 2026 14:55

Use searchable snapshot in vector jmh

8c42e95

itr

c938f1a

ChrisHegarty added the :Search Relevance/Vectors and Team:Search Relevance labels Feb 3, 2026

elasticsearchmachine added the v9.4.0 label Feb 3, 2026

Merge branch 'main' into bbq_vec_ops

0b80022

ChrisHegarty force-pushed the bbq_vec_ops branch from acdd954 to 0b80022 Compare February 3, 2026 11:19

ChrisHegarty added 3 commits February 3, 2026 11:39

itr

0828cd2

itr

9a97011

typo

fc23dee

thecoop reviewed Feb 3, 2026

...java/org/elasticsearch/simdvec/internal/vectorization/MSBitToInt4ESNextOSQVectorsScorer.java (outdated)

thecoop reviewed Feb 3, 2026

...hmarks/src/main/java/org/elasticsearch/benchmark/vector/scorer/VectorScorerOSQBenchmark.java (outdated)

itr

bc4bbf5

ChrisHegarty added the >refactoring label Feb 3, 2026

elasticsearchmachine added the serverless-linked label Feb 3, 2026

ChrisHegarty changed the title from "Rework BBQ vector comparison ops to work with the blob-cache" to "Rework BBQ vector comparison ops to work more efficiently with the blob-cache" Feb 3, 2026

ldematte reviewed Feb 4, 2026

...java/org/elasticsearch/simdvec/internal/vectorization/MSBitToInt4ESNextOSQVectorsScorer.java (outdated)

ldematte mentioned this pull request Feb 4, 2026

[Native] Using native scorers in BBQ #141762

Merged

ChrisHegarty and others added 11 commits February 6, 2026 17:28

Merge branch 'main' into bbq_vec_ops

8056dc2

revert

a15dc37

revert

b620ca7

[CI] Auto commit changes from spotless

d1ad5fd

itr

1d10b8b

Merge remote-tracking branch 'chegar/bbq_vec_ops' into bbq_vec_ops

e776c72

fix bench

f63bbb1

Merge branch 'main' into bbq_vec_ops

04ae17d

Bump the default BlobCacheBufferIndexInput buffer from 1k to 4k for s…

f92e1ce

…napshot builds

Merge branch 'main' into default_blob_cache_input_buffer_size

41d72f3

Merge branch 'main' into default_blob_cache_input_buffer_size

a81bd5d

ChrisHegarty added >enhancement and removed >refactoring labels Feb 23, 2026

Update docs/changelog/141718.yaml

fc4eca5

ChrisHegarty added 7 commits February 25, 2026 09:49

update MemorySegmentES91OSQVectorsScorer

c37998f

Merge remote-tracking branch 'chegar/bbq_vec_ops' into bbq_vec_ops

57c8469

Merge remote-tracking branch 'upstream/main' into bbq_vec_ops

59208b8

use a scratch buffer for on-heap

e03e5fd

Merge branch 'main' into bbq_vec_ops

7f67f81

fix

7407119

revert

d2ced8b

benwtrent approved these changes Mar 2, 2026

ChrisHegarty merged commit a82f799 into elastic:main Mar 2, 2026
26 of 27 checks passed

ChrisHegarty deleted the bbq_vec_ops branch March 2, 2026 17:38

thecoop mentioned this pull request Mar 3, 2026

[CI] ESNextOSQVectorsScorerTests class failing #143441

Closed

ChrisHegarty mentioned this pull request Mar 3, 2026

Fix IndexInputUtils.withSlice to produce native-safe MemorySegments on Java 21 #143479

Merged

ldematte added a commit to ldematte/elasticsearch that referenced this pull request Mar 6, 2026

Update scorers to use new helpers from elastic#141718

bcf6c5a

This was referenced Mar 6, 2026

[ML] Wait for cluster state in test #143767

Merged

[Transform] Disable PIT for CPS #143876

Closed

ChrisHegarty mentioned this pull request Mar 19, 2026

Add bulk-sparse native vector scoring for searchable snapshots via DirectAccessInput #144557

Merged

salvatore-campagna mentioned this pull request Mar 31, 2026

TSDB Pipeline Codec: Production Readiness #141118

Open

18 tasks

thecoop mentioned this pull request Mar 31, 2026

Only use MemorySegment scorers when slices can be obtained from the IndexInput #145343

Merged

ChrisHegarty merged 129 commits into elastic:main from ChrisHegarty:bbq_vec_ops

7 participants
