[GPU] Handle segments too big for MSAI segment access by ldematte · Pull Request #141872 · elastic/elasticsearch

ldematte · 2026-02-04T18:11:38Z

MSAI (MemorySegmentAccessInput) provides access to an index input via file memory mapping, returning a MemorySegment.
However, MemorySegmentAccessInput may return null under some circumstances. In our usage, where we try to get a memory mapping for the whole file, it does so if the file is too big (where "too big" is currently set at 16 GB).

This PR adds a fallback for this situation: if we fail to get a MemorySegment from MemorySegmentAccessInput, we assume the input as a whole is too big, and we fall back to using a temp file. Data is copied over to the temp file, and we use the FileChannel Java API directly to map the whole file.
Using a temp file is necessary: if the input as a whole is too big (> 16 GB), we cannot reliably hold it directly in on-heap or off-heap memory. We need "something" to back that memory in case the OS needs to page it out (e.g. in the likely event that we exceed the available physical memory).
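
For readers unfamiliar with the approach, the copy-then-map fallback described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the class name `TempFileMapping`, the method `copyAndMap`, and the use of a plain `InputStream` as the data source are all assumptions made for the example (the real code reads from a Lucene index input). It assumes Java 22 or later, where `FileChannel.map(MapMode, long, long, Arena)` is a final API.

```java
import java.io.IOException;
import java.io.InputStream;
import java.lang.foreign.Arena;
import java.lang.foreign.MemorySegment;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.nio.file.StandardOpenOption;

// Hypothetical helper, for illustration only (not the PR's MemorySegmentUtils).
final class TempFileMapping {

    // Copies all bytes from `source` into `tempFile`, then maps the whole file
    // as a single read-only MemorySegment whose lifetime is bound to `arena`.
    static MemorySegment copyAndMap(InputStream source, Path tempFile, Arena arena) throws IOException {
        // Step 1: copy the data to the temp file, so the OS has a file to back the pages.
        Files.copy(source, tempFile, StandardCopyOption.REPLACE_EXISTING);

        // Step 2: map the entire file in one call. Unlike MappedByteBuffer, a
        // MemorySegment mapping is not limited to 2 GB.
        try (FileChannel channel = FileChannel.open(tempFile, StandardOpenOption.READ)) {
            // The mapping remains valid after the channel is closed;
            // it is released when the arena is closed.
            return channel.map(FileChannel.MapMode.READ_ONLY, 0L, channel.size(), arena);
        }
    }
}
```

A caller would typically create the arena with `Arena.ofConfined()` or `Arena.ofShared()`, use the segment, close the arena, and only then delete the temp file.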

Closes #141746

elasticsearchmachine · 2026-02-04T18:12:28Z

Hi @ldematte, I've created a changelog YAML for you.

elasticsearchmachine · 2026-02-07T17:43:12Z

Pinging @elastic/es-search-relevance (Team:Search Relevance)

ldematte · 2026-02-07T17:45:06Z

@mayya-sharipova @ChrisHegarty I added some unit tests, but I think it would be nice to have some IT tests as well. Do you think it's worth it? I'd need your help here, I don't know how I can add cases with a different max chunk size (I could do that manually maybe, but I don't think it's the right way).

ChrisHegarty

LGTM

libs/gpu-codec/src/main/java/org/elasticsearch/gpu/codec/DatasetUtilsImpl.java

mayya-sharipova · 2026-02-09T20:37:40Z

libs/gpu-codec/src/main/java/org/elasticsearch/gpu/codec/DatasetUtils.java

-    /** Returns a Dataset over an input slice */
-    CuVSMatrix fromSlice(MemorySegmentAccessInput input, long pos, long len, int numVectors, int dims, CuVSMatrix.DataType dataType)
-        throws IOException;
+    CuVSMatrix fromInput(MemorySegment input, int numVectors, int dims, CuVSMatrix.DataType dataType);


Maybe not related to this PR, but do we need the DatasetUtils class at all? It looks like we only have a single DatasetUtilsImpl now.

libs/gpu-codec/src/main/java/org/elasticsearch/gpu/codec/MemorySegmentUtils.java

mayya-sharipova

@ldematte Thanks Lorenzo for fixing the bug.
I also like how you restructured the code.

I will think about the testing. I left some comments, and after addressing them, the PR is good to be merged.

elasticsearchmachine · 2026-02-10T11:01:23Z

💚 Backport successful

Status	Branch	Result
✅	9.3

Fix GPU merge ClassCastException with wrapped directories (#143531)

MemorySegmentUtils directly cast the Directory to FSDirectory, but Elasticsearch wraps directories in Store$StoreDirectory (which extends FilterDirectory, not FSDirectory). When vector data exceeds MMapDirectory's max chunk size, the fallback path hit this cast and threw a ClassCastException, failing all shard merges.

ClassCastException: Store$StoreDirectory cannot be cast to FSDirectory (Store$StoreDirectory is in module org.elasticsearch.server; FSDirectory is in module org.apache.lucene.core)

Use FilterDirectory.unwrap() to peel through wrapper layers before casting. Also fix log message formatting for segment size values.

Relates to #141872

… (#143924)

* Fix GPU merge ClassCastException with wrapped directories (#143531)
* Fix compilation error: update CuVSGPUSupport to GPUSupport in GPUMergeFallbackIT
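
The unwrap-before-cast fix named in the commit message above can be illustrated with a small self-contained model. The types below (`Dir`, `FsDir`, `FilterDir`, `UnwrapDemo`) are hypothetical stand-ins that only mirror the shape of Lucene's `Directory`, `FSDirectory` and `FilterDirectory`; they are not the Lucene classes and not the code from the fix. The sketch shows why a direct cast fails on a wrapped directory and how peeling the wrapper layers first avoids the ClassCastException.

```java
// Hypothetical stand-ins for Directory / FSDirectory / FilterDirectory.
abstract class Dir {}

// Plays the role of a file-system backed directory.
class FsDir extends Dir {
    final String path;

    FsDir(String path) {
        this.path = path;
    }
}

// Plays the role of a wrapper that delegates to another directory.
class FilterDir extends Dir {
    private final Dir delegate;

    FilterDir(Dir delegate) {
        this.delegate = delegate;
    }

    Dir getDelegate() {
        return delegate;
    }

    // Peels off wrapper layers until a non-wrapper directory is reached.
    static Dir unwrap(Dir dir) {
        while (dir instanceof FilterDir filter) {
            dir = filter.getDelegate();
        }
        return dir;
    }
}

final class UnwrapDemo {
    // Resolves the file-system path, tolerating any number of wrappers.
    static String pathOf(Dir dir) {
        Dir innermost = FilterDir.unwrap(dir);
        if (innermost instanceof FsDir fs) {
            return fs.path;
        }
        throw new IllegalArgumentException("not backed by a file-system directory");
    }
}
```

With a directory wrapped twice, `(FsDir) wrapped` throws a ClassCastException, while `UnwrapDemo.pathOf(wrapped)` returns the underlying path. Checking the unwrapped type with `instanceof` rather than casting blindly also keeps the failure mode explicit when the innermost directory is not file-system based.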

Fallback via temp file for huge segments

2c4f975

ldematte added the >bug, auto-backport (Automatically create backport pull requests when merged), :Search Relevance/Vectors (Vector search), and branch:9.3 labels Feb 4, 2026

elasticsearchmachine added the v9.4.0 and v9.3.1 labels and removed the branch:9.3 label Feb 4, 2026

Update docs/changelog/141872.yaml

ae64145

ldematte added the test-gpu (Run tests using a GPU) label Feb 4, 2026

ldematte added 3 commits February 4, 2026 19:14

Merge branch 'main' into gpu/fix-null-slice

78d61ce

Refactor MemorySegment helper methods to a separate utility class

7fb0086

Tests and fixes

1fc14c9

ldematte marked this pull request as ready for review February 7, 2026 17:42

elasticsearchmachine added the Team:Search Relevance (Meta label for the Search Relevance team in Elasticsearch) label Feb 7, 2026

ChrisHegarty approved these changes Feb 7, 2026

ldematte added 2 commits February 8, 2026 11:35

Forbidden APIs + spotless

039cd6e

Merge branch 'main' into gpu/fix-null-slice

5b8f451