
[GPU] Handle segments too big for MSAI segment access #141872

Merged
ldematte merged 10 commits into elastic:main from ldematte:gpu/fix-null-slice on Feb 10, 2026

@ldematte
Contributor

@ldematte ldematte commented Feb 4, 2026

MSAI (MemorySegmentAccessInput) provides access to an index input via file memory mapping, returning a MemorySegment.
However, MemorySegmentAccessInput may return null under some circumstances; in our usage, where we try to get a memory mapping for the whole file, this fails if the file is too big (where "too big" is currently set at 16GB).

This PR adds a fallback for this situation: if we fail to get a MemorySegment from MemorySegmentAccessInput, we assume the input as a whole is too big and fall back to using a temp file. Data is copied over to the temp file, and we use the FileChannel Java API directly to map the whole file.
Using a temp file is necessary: if the input as a whole is too big (> 16GB), we cannot reliably hold it directly in on-heap or off-heap memory; we need "something" to back that memory in case the OS needs to page it out (e.g. in the likely event we exceed the physical memory available).

Closes #141746
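
The fallback described above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: `TempFileMappingSketch`, `SegmentSource` and `segmentOrTempFileMapping` are hypothetical names, the `SegmentSource` interface merely stands in for the MSAI call that may return null, and the sketch assumes Java 22 or later, where `FileChannel.map` has an overload that takes an `Arena` and returns a `MemorySegment`.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.lang.foreign.Arena;
import java.lang.foreign.MemorySegment;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.nio.file.StandardOpenOption;

final class TempFileMappingSketch {

    /** Hypothetical stand-in for the MSAI call that may return null. */
    interface SegmentSource {
        MemorySegment segmentOrNull() throws IOException;
    }

    /**
     * Returns the segment from the source if one is available; otherwise copies
     * the raw data into tempFile and maps that whole file as a single segment.
     * The mapping lives as long as the given arena; the caller owns tempFile
     * and should delete it after closing the arena.
     */
    static MemorySegment segmentOrTempFileMapping(
        SegmentSource source,
        InputStream rawData,
        Path tempFile,
        Arena arena
    ) throws IOException {
        MemorySegment direct = source.segmentOrNull();
        if (direct != null) {
            return direct; // fast path: the input could be mapped as a whole
        }
        // Fallback: assume the input is too big and back the memory with a file
        Files.copy(rawData, tempFile, StandardCopyOption.REPLACE_EXISTING);
        try (FileChannel channel = FileChannel.open(tempFile, StandardOpenOption.READ)) {
            // The mapping stays valid after the channel is closed
            return channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size(), arena);
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] data = { 1, 2, 3, 4 };
        Path tempFile = Files.createTempFile("vectors-", ".tmp");
        try (Arena arena = Arena.ofConfined()) {
            // The lambda simulates the "too big" case by returning null
            MemorySegment segment = segmentOrTempFileMapping(
                () -> null, new ByteArrayInputStream(data), tempFile, arena);
            System.out.println("mapped bytes: " + segment.byteSize());
        } finally {
            Files.deleteIfExists(tempFile);
        }
    }
}
```

In this sketch the temp file is deleted only after the arena is closed, since some platforms refuse to delete a file that is still mapped.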

@ldematte ldematte added >bug auto-backport Automatically create backport pull requests when merged :Search Relevance/Vectors Vector search branch:9.3 labels Feb 4, 2026
@elasticsearchmachine
Collaborator

Hi @ldematte, I've created a changelog YAML for you.

@ldematte ldematte added the test-gpu Run tests using a GPU label Feb 4, 2026
@ldematte ldematte marked this pull request as ready for review February 7, 2026 17:42
@elasticsearchmachine elasticsearchmachine added the Team:Search Relevance Meta label for the Search Relevance team in Elasticsearch label Feb 7, 2026
@elasticsearchmachine
Collaborator

Pinging @elastic/es-search-relevance (Team:Search Relevance)

@ldematte
Contributor Author

ldematte commented Feb 7, 2026

@mayya-sharipova @ChrisHegarty I added some unit tests, but I think it would be nice to have some integration tests as well. Do you think it's worth it? I'd need your help here: I don't know how to add cases with a different max chunk size (I could maybe do that manually, but I don't think it's the right way).

Contributor

@ChrisHegarty ChrisHegarty left a comment

LGTM

/** Returns a Dataset over an input slice */
CuVSMatrix fromSlice(MemorySegmentAccessInput input, long pos, long len, int numVectors, int dims, CuVSMatrix.DataType dataType)
throws IOException;
CuVSMatrix fromInput(MemorySegment input, int numVectors, int dims, CuVSMatrix.DataType dataType);
Contributor

@mayya-sharipova mayya-sharipova Feb 9, 2026

Maybe not related to this PR, but do we need the DatasetUtils class at all? It looks like we only have a single DatasetUtilsImpl now.

Contributor

@mayya-sharipova mayya-sharipova left a comment

@ldematte Thanks Lorenzo for fixing the bug.
I also like how you restructured the code.

I will think about the testing. I left some comments, and after addressing them, the PR is good to be merged.

@ldematte ldematte enabled auto-merge (squash) February 10, 2026 08:10
@ldematte ldematte merged commit 16ed460 into elastic:main Feb 10, 2026
36 checks passed
@elasticsearchmachine
Collaborator

💚 Backport successful

Status Branch Result
9.3

ldematte added a commit to ldematte/elasticsearch that referenced this pull request Feb 10, 2026
@ldematte ldematte deleted the gpu/fix-null-slice branch February 10, 2026 12:51
ldematte added a commit that referenced this pull request Feb 10, 2026
elasticsearchmachine pushed a commit that referenced this pull request Mar 10, 2026
MemorySegmentUtils directly cast the Directory to FSDirectory, but
Elasticsearch wraps directories in Store$StoreDirectory (which extends
FilterDirectory, not FSDirectory). When vector data exceeds
MMapDirectory's max chunk size, the fallback path hit this cast and
threw a ClassCastException, failing all shard merges.

ClassCastException: Store$StoreDirectory cannot be cast to
FSDirectory (Store$StoreDirectory is in module
org.elasticsearch.server; FSDirectory is in module
org.apache.lucene.core)

Use FilterDirectory.unwrap() to peel through wrapper layers before
casting.

Also fix log message formatting for segment size values.

Relates to #141872
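
The unwrap-before-cast fix described in this commit message can be illustrated with a small self-contained sketch. The classes below are simplified, hypothetical stand-ins for Lucene's `Directory`, `FSDirectory` and `FilterDirectory` and for Elasticsearch's `Store$StoreDirectory`; only the shape of `FilterDirectory.unwrap()` (a loop that follows delegates until it reaches a non-filter directory) mirrors the real API. `asFSDirectory` is an illustrative helper name, not a method from the actual change.

```java
final class UnwrapSketch {

    // Simplified stand-ins for the Lucene/Elasticsearch types (assumptions for illustration)
    static class Directory {}

    static class FSDirectory extends Directory {
        final String path;

        FSDirectory(String path) {
            this.path = path;
        }
    }

    static class FilterDirectory extends Directory {
        final Directory in;

        FilterDirectory(Directory in) {
            this.in = in;
        }

        /** Peels through all wrapper layers and returns the innermost directory. */
        static Directory unwrap(Directory dir) {
            while (dir instanceof FilterDirectory filter) {
                dir = filter.in;
            }
            return dir;
        }
    }

    /** Plays the role of Store$StoreDirectory: a wrapper, not an FSDirectory. */
    static class StoreDirectory extends FilterDirectory {
        StoreDirectory(Directory in) {
            super(in);
        }
    }

    /** Unwraps first, then checks the type instead of casting blindly. */
    static FSDirectory asFSDirectory(Directory dir) {
        Directory unwrapped = FilterDirectory.unwrap(dir);
        if (unwrapped instanceof FSDirectory fs) {
            return fs;
        }
        throw new IllegalArgumentException(
            "expected an FSDirectory but found " + unwrapped.getClass().getName());
    }

    public static void main(String[] args) {
        Directory wrapped = new StoreDirectory(new FilterDirectory(new FSDirectory("/data/index")));
        // A direct cast, (FSDirectory) wrapped, would throw ClassCastException here
        System.out.println(asFSDirectory(wrapped).path);
    }
}
```

Checking the unwrapped type explicitly, rather than casting, also gives a clearer error if the innermost directory is not file-system based.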
mayya-sharipova added a commit to mayya-sharipova/elasticsearch that referenced this pull request Mar 10, 2026
…3531)

elasticsearchmachine pushed a commit that referenced this pull request Mar 10, 2026
) (#143924)

* Fix GPU merge ClassCastException with wrapped directories (#143531)


* Fix compilation error: update CuVSGPUSupport to GPUSupport in GPUMergeFallbackIT

Labels

auto-backport (Automatically create backport pull requests when merged)
>bug
:Search Relevance/Vectors (Vector search)
Team:Search Relevance (Meta label for the Search Relevance team in Elasticsearch)
test-gpu (Run tests using a GPU)
v9.3.1
v9.4.0

Development

Successfully merging this pull request may close these issues.

GPU codec merge fails for vector data exceeding 16GB

4 participants