[Bugfix][MLA] Add logits size budget to sparse indexer prefill chunking by LucasWilkinson · Pull Request #36178 · vllm-project/vllm

LucasWilkinson · 2026-03-05T22:02:13Z

Alternative to #35488, credit to @haosdent

Summary

Adds a logits tensor size constraint to sparse MLA indexer prefill chunking to prevent CUDA OOM
Introduces VLLM_SPARSE_INDEXER_MAX_LOGITS_MB env var (default 512 MB) to bound the [M, N] float32 logits tensor
Replaces split_prefill_chunks with split_indexer_prefill_chunks that respects both workspace and logits size constraints
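The two constraints in the summary can be illustrated with a minimal sketch. This is not the PR's implementation: `split_chunks_sketch`, `Chunk`, and the reading of the workspace constraint as a cap on total KV tokens per chunk are assumptions made for illustration, and the sub-chunking added in later commits is not modeled.

```python
from dataclasses import dataclass

BYTES_PER_FLOAT32 = 4


@dataclass
class Chunk:
    req_start: int  # index of the first request in the chunk (inclusive)
    req_stop: int   # one past the last request in the chunk (exclusive)


def split_chunks_sketch(query_lens, seq_lens, workspace_size, max_logits_bytes):
    """Greedily group consecutive requests into chunks.

    Hypothetical helper: a chunk is closed as soon as adding the next
    request would exceed either the workspace budget (assumed here to be
    total KV tokens) or the logits budget (M * N * 4 bytes).
    """
    chunks = []
    start = 0
    m = 0  # query tokens accumulated in the current chunk
    n = 0  # sequence (KV) tokens accumulated in the current chunk
    for i, (q, s) in enumerate(zip(query_lens, seq_lens)):
        new_m, new_n = m + q, n + s
        over_workspace = new_n > workspace_size
        over_logits = new_m * new_n * BYTES_PER_FLOAT32 > max_logits_bytes
        # `i > start` guarantees every chunk holds at least one request, so a
        # single oversized request still gets its own chunk instead of stalling.
        if i > start and (over_workspace or over_logits):
            chunks.append(Chunk(start, i))
            start, new_m, new_n = i, q, s
        m, n = new_m, new_n
    if start < len(query_lens):
        chunks.append(Chunk(start, len(query_lens)))
    return chunks
```

In this sketch a request that exceeds the budget on its own is still emitted as a single-request chunk, which mirrors the "avoid getting stuck" behavior described in the review; how the actual function bounds memory for such a request is not shown here.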

Test plan

Added unit tests for split_indexer_prefill_chunks covering various constraint scenarios
Run existing MLA tests: pytest tests/v1/attention/test_sparse_mla_backends.py

🤖 Generated with Claude Code

gemini-code-assist

Code Review

This pull request introduces a logits size constraint to the sparse MLA indexer's prefill chunking logic to prevent out-of-memory errors. A new environment variable VLLM_SPARSE_INDEXER_MAX_LOGITS_MB is added to control this budget. The core change is the new split_indexer_prefill_chunks function, which correctly chunks requests based on both workspace size and the new logits size constraint. The implementation is robust, handling cases where a single request might exceed the budget to avoid getting stuck. The accompanying unit tests are comprehensive and cover various scenarios, ensuring the correctness of the new logic. Overall, this is a solid improvement for memory management in the sparse indexer.

mergify · 2026-03-12T19:31:56Z

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @LucasWilkinson.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

haosdent · 2026-03-25T04:46:12Z

vllm/v1/attention/backends/mla/indexer.py

+            query_start_loc_cpu[req_slice.start : req_slice.stop + 1]
+            - query_start_loc_cpu[req_slice.start]
        )
        cu_seqlen_ks, cu_seqlen_ke = kv_spans_from_batches(


kv_spans_from_batches would be computed multiple times for the same request, right?

Good catch! But this runs once per forward pass and is overlapped thanks to async scheduling, so I don't think avoiding the redundant work here is critical.
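For readers unfamiliar with the slicing in the hunk above, here is a small sketch of what the expression does, using plain Python lists in place of the CPU tensor. The helper name `rebase_query_start_loc` is hypothetical; only the slice-and-subtract pattern comes from the diff.

```python
def rebase_query_start_loc(query_start_loc, req_slice):
    """Return cumulative query offsets for one chunk, rebased to start at 0.

    `query_start_loc` holds len(requests) + 1 cumulative offsets, so the
    slice runs to `req_slice.stop + 1` to include the chunk's closing offset.
    """
    base = query_start_loc[req_slice.start]
    window = query_start_loc[req_slice.start : req_slice.stop + 1]
    return [offset - base for offset in window]


# Four requests with query lengths 3, 5, 2, 5.
offsets = [0, 3, 8, 10, 15]
# Chunk covering requests 1 and 2 (query lengths 5 and 2).
chunk_offsets = rebase_query_start_loc(offsets, slice(1, 3))
```

The rebased offsets describe the chunk's own packed query tokens, which is why the subtraction uses the chunk's first offset rather than zero.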

mergify · 2026-03-25T06:41:23Z

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @LucasWilkinson.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

vllm/model_executor/layers/sparse_attn_indexer.py

MatthewBonanni · 2026-03-25T19:44:18Z

vllm/v1/attention/backends/mla/indexer.py

+            query_start_loc_cpu[req_slice.start : req_slice.stop + 1]
+            - query_start_loc_cpu[req_slice.start]
        )
        cu_seqlen_ks, cu_seqlen_ke = kv_spans_from_batches(


The sparse MLA indexer allocates a [M, N] float32 logits tensor during prefill, where M is the total number of query tokens and N is the total sequence length. For long sequences or large batches, this can exceed GPU memory. This adds a new constraint to split_indexer_prefill_chunks that bounds M*N*4 bytes to VLLM_SPARSE_INDEXER_MAX_LOGITS_MB (default 512 MB), preventing CUDA OOM during prefill.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
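To make the M*N*4 bound concrete, the following sketch computes the logits tensor size for a given shape and compares it with the budget. Two things here are assumptions rather than facts from the PR: the environment variable is read directly through `os.environ` (vLLM may route it through its own env handling), and 1 MB is taken to mean 1024 * 1024 bytes. The helper names are hypothetical.

```python
import os

BYTES_PER_FLOAT32 = 4
ENV_VAR = "VLLM_SPARSE_INDEXER_MAX_LOGITS_MB"


def logits_bytes(num_query_tokens, total_seq_len):
    """Size in bytes of a [M, N] float32 logits tensor."""
    return num_query_tokens * total_seq_len * BYTES_PER_FLOAT32


def max_logits_bytes(environ=None, default_mb=512):
    """Budget in bytes, assuming 1 MB == 1024 * 1024 bytes (an assumption)."""
    environ = os.environ if environ is None else environ
    return int(environ.get(ENV_VAR, default_mb)) * 1024 * 1024


# Example shape: 8192 query tokens against 65536 total sequence tokens.
size = logits_bytes(8192, 65536)    # 2**31 bytes, i.e. 2048 MB under this convention
budget = max_logits_bytes(environ={})  # default budget of 512 MB
needs_split = size > budget
```

Under these assumptions the example shape is four times the default budget, so the chunker would have to split the batch rather than allocate the tensor in one piece.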

MatthewBonanni

LGTM, thanks for the fix!

LucasWilkinson requested a review from pavanimajety as a code owner March 5, 2026 22:02

mergify bot added the v1 and bug (Something isn't working) labels Mar 5, 2026

gemini-code-assist bot reviewed Mar 5, 2026

LucasWilkinson mentioned this pull request Mar 5, 2026

[WIP][Bugfix] Fix CUDA OOM in sparse_attn_indexer prefill with high concurrency #35488

Closed

3 tasks

LucasWilkinson marked this pull request as draft March 5, 2026 22:11

nejch mentioned this pull request Mar 12, 2026

[Bug]: GLM-5 FP8 on H200 CUDA OOM in sparse_attn_indexer at High Concurrency #34553

Open

1 task

mergify bot added the needs-rebase label Mar 12, 2026

LucasWilkinson marked this pull request as ready for review March 24, 2026 14:32

LucasWilkinson force-pushed the lucas/sparse-indexer-logits-budget branch from d82abae to f950710 Compare March 24, 2026 14:51

mergify bot removed the needs-rebase label Mar 24, 2026

LucasWilkinson added the ready label (ONLY add when PR is ready to merge/full CI is needed) Mar 24, 2026

haosdent reviewed Mar 25, 2026

mergify bot added the needs-rebase label Mar 25, 2026

MatthewBonanni reviewed Mar 25, 2026

max-wittig approved these changes Mar 30, 2026

LucasWilkinson added 6 commits March 31, 2026 20:23

add subchunking

46eed97

Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>

cleanup

5c4a4c8

Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>

cleanup

7607bd5

Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>

add dummy allocations

58a1f48

Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>

cleanup

0a3cef6

Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>

LucasWilkinson force-pushed the lucas/sparse-indexer-logits-budget branch from f950710 to 0a3cef6 Compare March 31, 2026 20:24

mergify bot removed the needs-rebase label Mar 31, 2026

LucasWilkinson added 2 commits March 31, 2026 20:27

review comments

62a5ee6

Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>

cleanup

33d0a22

Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>

LucasWilkinson added this to the v0.19.0 cherry picks milestone Mar 31, 2026

MatthewBonanni approved these changes Mar 31, 2026

LucasWilkinson wants to merge 8 commits into main from lucas/sparse-indexer-logits-budget


4 participants
