[Bugfix] Zero-init MLA attention output buffers to prevent NaN from CUDA graph padding#37442
Conversation
Force-pushed from dafe6ad to 9fb8044
Code Review
This pull request addresses a critical bug causing NaN propagation in MLA attention when using CUDA graph decoding with padding. The fix involves pre-allocating and zero-initializing output buffers to prevent stale data from contaminating results. The changes in vllm/v1/attention/backends/mla/cutlass_mla.py are well-implemented. However, the implementation in vllm/v1/attention/backends/mla/flashinfer_mla.py contains a critical bug in the shape and dtype of the pre-allocated buffer, which will cause runtime errors. My review provides a correction for this issue.
Force-pushed from 9fb8044 to c8dfe2f
…graph padding

When using CUDA graph capture with batch padding, padding slots with seq_lens=0 are skipped by the attention kernel, leaving uninitialized output. This NaN/garbage in padding rows can propagate to real tokens through downstream operations.

Fix: pre-allocate output buffers with zero-init and reuse across calls. The buffer is lazily allocated on first use and reused on subsequent calls, so the zeroing cost is paid once (not per CUDA graph replay).

Changes:
- CUTLASS MLA: cache output+LSE buffers as class properties, zero-init on allocation, slice to batch size each call
- FlashInfer MLA: same pattern, pass pre-zeroed out= tensor to trtllm_batch_decode_with_kv_cache_mla
- CUTLASS workspace: zero-init at allocation (was torch.empty)

Related: Dao-AILab/flash-attention#1974

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Elvir Crncevic <elvircrn@gmail.com>
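The caching pattern described in the commit message can be sketched as follows. This is a minimal illustration, not the exact vLLM implementation; names like `MLADecodeWrapper`, `_decode_out`, and `max_batch_size` are illustrative:

```python
import torch

class MLADecodeWrapper:
    """Sketch of the lazy, zero-initialized output-buffer pattern."""

    def __init__(self, max_batch_size: int, num_heads: int, head_dim: int,
                 dtype=torch.float32, device="cpu"):
        self.max_batch_size = max_batch_size
        self.num_heads = num_heads
        self.head_dim = head_dim
        self.dtype = dtype
        self.device = device
        self._decode_out = None  # allocated lazily on first call

    def get_out_buffer(self, batch_size: int) -> torch.Tensor:
        if self._decode_out is None:
            # Zero-init exactly once: padding rows the kernel never writes
            # stay 0.0 instead of holding stale data from an earlier step.
            self._decode_out = torch.zeros(
                self.max_batch_size, self.num_heads, self.head_dim,
                dtype=self.dtype, device=self.device)
        # Reuse via slicing: no per-call memset, and the same storage (and
        # thus the same device address) is seen by every CUDA graph replay.
        return self._decode_out[:batch_size]
```

Because slicing returns a view, every call hands the kernel the same underlying storage, which is what makes the one-time zeroing cost compatible with graph capture/replay.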
Force-pushed from c8dfe2f to fe46b00
…decode)

The output buffer was 3D (B, H, kv_lora_rank) but q is 4D (B, q_len_per_req, H, D) for spec decode. The kernel wrote past the buffer, causing CUDBG_EXCEPTION_WARP_ILLEGAL_ADDRESS.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Elvir Crncevic <elvircrn@gmail.com>
FlashInfer expects out as 3D (batch, num_heads, kv_lora_rank). For multi-token queries (spec decode), let the kernel allocate its own output buffer. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Elvir Crncevic <elvircrn@gmail.com>
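The shape guard described above can be sketched like this. It is a hedged illustration of the dispatch logic, not the vLLM code: `run_decode` and `kernel_fn` are hypothetical stand-ins (the real call is `trtllm_batch_decode_with_kv_cache_mla`):

```python
import torch

def run_decode(kernel_fn, q: torch.Tensor, out_buf: torch.Tensor):
    """Pass the pre-zeroed out= buffer only for single-token decode.

    q is (B, q_len_per_req, H, D). The cached 3D buffer matches the
    kernel's expected (B, H, kv_lora_rank) output layout only when
    q_len_per_req == 1.
    """
    batch, q_len_per_req = q.shape[0], q.shape[1]
    if q_len_per_req == 1:
        # Single-token decode: shapes agree, reuse the pre-zeroed buffer.
        return kernel_fn(q, out=out_buf[:batch])
    # Spec decode (multi-token queries): let the kernel allocate its own
    # output; writing a 4D result into the 3D buffer would run past its end.
    return kernel_fn(q, out=None)
```

The guard trades buffer reuse for correctness on the multi-token path, which is the behavior the commit message describes.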
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>
Implemented a workaround. A minimal reproducer, which verifies the fix works, is here: https://gist.github.com/MatthewBonanni/5c425568a2b880edcd7b5f03a8048e2d
…UDA graph padding (vllm-project#37442) Signed-off-by: Elvir Crncevic <elvircrn@gmail.com> Signed-off-by: Matthew Bonanni <mbonanni@redhat.com> Co-authored-by: Matthew Bonanni <mbonanni@redhat.com>
…UDA graph padding (vllm-project#37442) Signed-off-by: Elvir Crncevic <elvircrn@gmail.com> Signed-off-by: Matthew Bonanni <mbonanni@redhat.com> Co-authored-by: Matthew Bonanni <mbonanni@redhat.com> (cherry picked from commit ef2c4f7)
Hi @elvircrn, the commands below hit OOM on 4× GB200. Reverting this PR makes the OOM go away. Some Claude analysis:
The NaN fix (PR vllm-project#37442) allocated a persistent _decode_out buffer per attention layer. For DeepSeek-R1 (61 layers), this totals ~15 GiB of GPU memory that is also not accounted for during profiling, causing OOM when KV cache + buffers exceed available memory. Fix: use a single module-level buffer shared across all layers. Memory drops from ~15 GiB to ~256 MiB. The buffer is only written by one layer at a time (sequential forward pass), so sharing is safe. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
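The module-level sharing described above can be sketched as follows. This is an illustrative sketch under the stated assumption that layers run sequentially, so one buffer is never written by two layers at once; `get_shared_out_buffer` is a hypothetical name:

```python
import torch

# One buffer shared by all attention layers (illustrative, not the exact
# vLLM code). Per-layer buffers multiply memory by the layer count
# (e.g. 61 layers for DeepSeek-R1); a shared buffer pays the cost once.
_shared_decode_out = None  # torch.Tensor, allocated lazily

def get_shared_out_buffer(batch_size: int, num_heads: int, head_dim: int,
                          dtype=torch.float32, device="cpu") -> torch.Tensor:
    global _shared_decode_out
    needed = (batch_size, num_heads, head_dim)
    if (_shared_decode_out is None
            or _shared_decode_out.dtype != dtype
            or any(s < n for s, n in zip(_shared_decode_out.shape, needed))):
        # (Re)allocate zero-initialized when the request outgrows the cache.
        _shared_decode_out = torch.zeros(needed, dtype=dtype, device=device)
    # Safe to share across layers: the forward pass is sequential, so each
    # layer finishes reading its output before the next layer writes.
    return _shared_decode_out[:batch_size, :num_heads, :head_dim]
```

Sharing also keeps the untracked allocation small enough that memory profiling is not badly skewed, which is the OOM mechanism described above.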
…N from CUDA graph padding (vllm-project#37442)" This reverts commit ef2c4f7.
…N from CUDA graph padding (vllm-project#37442)" This reverts commit ef2c4f7. Signed-off-by: Varun Sundar Rabindranath <vsundarr@redhat.com>
…N from CUDA graph padding (vllm-project#37442)" This reverts commit ef2c4f7. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Elvir Crncevic <elvircrn@gmail.com>
…N from CUDA graph padding (vllm-project#37442)" This reverts commit ef2c4f7. The zero-init workaround is unnecessary — the NaN was caused by a different issue (int64 expert IDs in the routing simulator). Reverting to restore the original torch.empty allocation which avoids the overhead of pre-allocated zero-init buffers. Signed-off-by: Elvir Crncevic <elvircrn@gmail.com> Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…UDA graph padding (vllm-project#37442) Signed-off-by: Elvir Crncevic <elvircrn@gmail.com> Signed-off-by: Matthew Bonanni <mbonanni@redhat.com> Co-authored-by: Matthew Bonanni <mbonanni@redhat.com> Signed-off-by: Monishver Chandrasekaran <monishverchandrasekaran@gmail.com>
Cherry-pick d4a41a9: Revert "Zero-init MLA attention output buffers to prevent NaN from CUDA graph padding (vllm-project#37442)" Apply PR vllm-project#38148: Fix NaN from stale FP4 scale padding in create_fp4_scale_tensor Signed-off-by: Travis Stephens <travis@anthropic.com> Signed-off-by: Tyler Michael Smith <tlrmchlsmth@gmail.com>
…UDA graph padding (vllm-project#37442) Signed-off-by: Elvir Crncevic <elvircrn@gmail.com> Signed-off-by: Matthew Bonanni <mbonanni@redhat.com> Co-authored-by: Matthew Bonanni <mbonanni@redhat.com> Signed-off-by: Vinay Damodaran <vrdn@hey.com>
Summary
When running CUDA graph decode with padding (e.g. a batch of 1024 with 1 real request), unused slots have seq_lens=0. The MLA decode kernels (both CUTLASS MLA and FlashInfer TRT-LLM MLA) skip writing output for these slots, leaving stale data in the output buffer. If that stale data contains NaN (from a previous iteration or uninitialized memory), it propagates to real tokens via downstream per-tensor FP8 quantization (amax over the entire batch). This was observed in production on GB200 (SM100) with DeepSeek-R1 NVFP4, causing intermittent NaN outputs.
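The amax propagation path can be demonstrated with a toy per-tensor quantizer. This is a sketch of the mechanism only; `fake_per_tensor_fp8_quant` is illustrative, not vLLM's quantization kernel:

```python
import torch

FP8_E4M3_MAX = 448.0  # max representable magnitude in FP8 E4M3

def fake_per_tensor_fp8_quant(x: torch.Tensor) -> torch.Tensor:
    """Toy per-tensor quantization: one scale from the amax of the whole batch."""
    scale = FP8_E4M3_MAX / x.abs().amax()  # a NaN anywhere makes the scale NaN
    return (x * scale).clamp(-FP8_E4M3_MAX, FP8_E4M3_MAX)

out = torch.ones(4, 3)   # rows 0-2: real tokens
out[3] = float("nan")    # row 3: stale padding slot
q = fake_per_tensor_fp8_quant(out)
print(torch.isnan(q[0]).any())  # True: the real tokens are poisoned too
```

Because the scale is computed over the entire batch, a single NaN in a padding row corrupts every quantized value, including the real requests.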
Root cause
Two mechanisms produce NaN in padding slots:

1. Skipped output writes: the decode kernel skips padding slots entirely, leaving whatever stale data the output buffer already held.
2. Uninitialized KV cache reads: for padding slots given seq_lens=1, entries 1–127 are read from uninitialized KV cache pages. If those contain NaN, softmax produces NaN. (Related: Dao-AILab/flash-attention#1974, "FlashAttention3 forward producing NaN output when NaN exist in parts of input data that it should not be reading")

While vLLM already zero-inits KV cache pages (fixing mechanism 2), mechanism 1 requires the output buffer itself to be clean.
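The softmax step of the second mechanism can be demonstrated directly; a single NaN score poisons every attention weight in the row:

```python
import torch

# Attention scores where one entry was read from an uninitialized KV page:
scores = torch.tensor([1.0, float("nan"), 2.0])
probs = torch.softmax(scores, dim=0)
print(torch.isnan(probs).all())  # every weight becomes NaN, not just one
```

The NaN spreads because the softmax normalizer sums over all entries, so even the valid scores are divided by NaN.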
Fix
Pre-allocate output buffers with `torch.zeros` once per layer (cached as instance attributes), then reuse via slicing on each call. This:

- avoids a per-call memset; the buffer is allocated once and reused, compatible with CUDA graph capture/replay
- covers both CUTLASS MLA (`cutlass_mla.py`) and FlashInfer MLA (`flashinfer_mla.py`)

Test plan
🤖 Generated with Claude Code