[Bugfix][MLA] Add logits size budget to sparse indexer prefill chunking #36178
LucasWilkinson wants to merge 8 commits into main
Code Review
This pull request introduces a logits size constraint to the sparse MLA indexer's prefill chunking logic to prevent out-of-memory errors. A new environment variable VLLM_SPARSE_INDEXER_MAX_LOGITS_MB is added to control this budget. The core change is the new split_indexer_prefill_chunks function, which correctly chunks requests based on both workspace size and the new logits size constraint. The implementation is robust, handling cases where a single request might exceed the budget to avoid getting stuck. The accompanying unit tests are comprehensive and cover various scenarios, ensuring the correctness of the new logic. Overall, this is a solid improvement for memory management in the sparse indexer.
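The chunking behaviour described in this review can be illustrated with a minimal sketch. This is not the PR's implementation: the `Request` type, the name `split_chunks_sketch`, and the assumption that the workspace budget is measured as the summed sequence length of a chunk are all hypothetical. It only shows the general idea of closing a chunk when either budget would be exceeded, while still admitting a single oversized request on its own so the loop always makes progress.

```python
from dataclasses import dataclass


@dataclass
class Request:
    # Hypothetical per-request sizes; the real function's inputs are not shown in this thread.
    query_len: int  # new query tokens in this prefill step
    seq_len: int    # total sequence length for this request


def split_chunks_sketch(reqs, max_workspace_tokens, max_logits_bytes):
    """Greedily group consecutive requests into chunks (illustrative only).

    A chunk is closed when adding the next request would exceed either
    the assumed workspace budget (sum of seq_len) or the logits budget
    (sum(query_len) * sum(seq_len) * 4 bytes for float32).
    Returns a list of slices over the request list.
    """
    chunks = []
    start = 0
    m = n = 0  # running totals: query tokens (M) and sequence length (N)
    for i, r in enumerate(reqs):
        new_m, new_n = m + r.query_len, n + r.seq_len
        over = new_n > max_workspace_tokens or new_m * new_n * 4 > max_logits_bytes
        if over and i > start:
            # Close the current chunk and start a new one at this request.
            chunks.append(slice(start, i))
            start, new_m, new_n = i, r.query_len, r.seq_len
        # If a single request alone is over budget, it is still admitted
        # as its own chunk, so the loop never stalls.
        m, n = new_m, new_n
    if start < len(reqs):
        chunks.append(slice(start, len(reqs)))
    return chunks


reqs = [Request(4, 8), Request(4, 8), Request(100, 100), Request(2, 2)]
chunks = split_chunks_sketch(reqs, max_workspace_tokens=64, max_logits_bytes=1024)
print(chunks)
```

In this toy input the third request exceeds both budgets by itself, so it ends up alone in a chunk rather than blocking the split.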
This pull request has merge conflicts that must be resolved before it can be
d82abae to f950710
        query_start_loc_cpu[req_slice.start : req_slice.stop + 1]
        - query_start_loc_cpu[req_slice.start]
    )
    cu_seqlen_ks, cu_seqlen_ke = kv_spans_from_batches(
kv_spans_from_batches would be computed multiple times for the same request, right?
Good catch! But this runs once per forward pass and is overlapped thanks to async scheduling, so I don't think avoiding the redundant work here is critical.
The sparse MLA indexer allocates a [M, N] float32 logits tensor during prefill, where M is total query tokens and N is total sequence length. For long sequences or large batches, this can exceed GPU memory.

This adds a new constraint to split_indexer_prefill_chunks that bounds M*N*4 bytes to VLLM_SPARSE_INDEXER_MAX_LOGITS_MB (default 512 MB), preventing CUDA OOM during prefill.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
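To make the M*N*4 bound from the commit message concrete, here is a small worked calculation. It is a sketch under stated assumptions: the helper names are hypothetical, and "MB" is assumed to mean 1024 * 1024 bytes, which the thread does not confirm. Only the environment variable name and its 512 MB default come from the PR text.

```python
BYTES_PER_FLOAT32 = 4
BYTES_PER_MB = 1024 * 1024  # assumption: MB is treated as MiB


def budget_mb_from_env(env):
    # Hypothetical helper: read the budget from a mapping such as os.environ,
    # falling back to the 512 MB default stated in the PR.
    return int(env.get("VLLM_SPARSE_INDEXER_MAX_LOGITS_MB", "512"))


def logits_bytes(m, n):
    # Size of an [M, N] float32 tensor in bytes.
    return m * n * BYTES_PER_FLOAT32


def max_query_rows(n, budget_mb):
    # Largest M such that an [M, N] float32 tensor fits in the budget.
    return (budget_mb * BYTES_PER_MB) // (n * BYTES_PER_FLOAT32)


budget_mb = budget_mb_from_env({})            # no override, so the default applies
size_mb = logits_bytes(8192, 65536) // BYTES_PER_MB
rows = max_query_rows(65536, budget_mb)
print(budget_mb, size_mb, rows)
```

Under these assumptions, 8192 query tokens against a total sequence length of 65536 would need a 2048 MB logits tensor, four times the default budget, so such a prefill would have to be split into chunks of at most 2048 query rows each.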
f950710 to 0a3cef6
MatthewBonanni left a comment
LGTM, thanks for the fix!
Alternative to #35488, credit to @haosdent
Summary
- Add VLLM_SPARSE_INDEXER_MAX_LOGITS_MB env var (default 512 MB) to bound the [M, N] float32 logits tensor
- Replace split_prefill_chunks with split_indexer_prefill_chunks, which respects both workspace and logits size constraints

Test plan
- Unit tests for split_indexer_prefill_chunks covering various constraint scenarios
- pytest tests/v1/attention/test_sparse_mla_backends.py

🤖 Generated with Claude Code