[Bugfix] Revert "Zero-init MLA attention output buffers to prevent NaN from CUDA graph padding"#38359

Open
elvircrn wants to merge 3 commits into vllm-project:main from elvircrn:revert-mla-zero-init
Conversation

Contributor

@elvircrn elvircrn commented Mar 27, 2026

Summary

  • Restores the original torch.empty allocation, removing the overhead of the pre-allocated zero-init buffers and the out= workaround in the FlashInfer MLA backend.

Test plan

  • Run GB200 DeepSeek-R1 NVFP4 decode with CUDA graph padding and verify that no NaNs appear
  • Verify there is no performance regression from removing the pre-allocated buffer

🤖 Generated with Claude Code

…N from CUDA graph padding (vllm-project#37442)"

This reverts commit ef2c4f7.

The zero-init workaround is unnecessary: the NaN was caused by a
different issue (int64 expert IDs in the routing simulator). Reverting
restores the original torch.empty allocation, which avoids the
overhead of pre-allocated zero-init buffers.

Signed-off-by: Elvir Crncevic <elvircrn@gmail.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
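As a rough sketch of the trade-off being reverted (NumPy stands in for torch here, and the function names are illustrative, not vLLM's actual API): the workaround zero-filled the whole padded output buffer so that CUDA-graph padding rows could never surface garbage as NaN, while the restored behavior allocates uninitialized memory and relies on callers reading only the valid rows.

```python
import numpy as np

def alloc_decode_out_zeroed(padded_tokens: int, head_dim: int) -> np.ndarray:
    # Workaround being reverted: zero-fill the entire padded buffer so
    # stale bytes in padding rows can never propagate as NaN downstream.
    # Costs a full memset on every allocation.
    return np.zeros((padded_tokens, head_dim), dtype=np.float32)

def alloc_decode_out_empty(padded_tokens: int, head_dim: int) -> np.ndarray:
    # Restored behavior: uninitialized allocation. Padding rows may hold
    # arbitrary bytes, which is safe as long as consumers only read the
    # first num_tokens rows.
    return np.empty((padded_tokens, head_dim), dtype=np.float32)

# Downstream code only ever consumes the valid (unpadded) rows:
num_tokens, padded_tokens, head_dim = 3, 8, 4
out = alloc_decode_out_empty(padded_tokens, head_dim)
valid = out[:num_tokens]  # rows num_tokens..padded_tokens-1 are never read
```

Since the NaN turned out to come from the routing simulator rather than from these padding rows, the memset buys nothing, and plain empty allocation is the cheaper choice.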
@elvircrn elvircrn requested a review from pavanimajety as a code owner March 27, 2026 12:18

@claude claude bot left a comment


Claude Code Review

This pull request is from a fork — automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.

@mergify mergify bot added nvidia v1 bug Something isn't working labels Mar 27, 2026
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request removes pre-allocated output buffers and simplifies tensor allocation across the CUTLASS and FlashInfer MLA backends. In cutlass_mla.py, the _decode_out buffer is replaced with a direct new_empty allocation; in flashinfer_mla.py, the manual buffer management and padding-zeroing workarounds in forward_mqa are removed. I have no feedback to provide.

@mgoin mgoin added the ready ONLY add when PR is ready to merge/full CI is needed label Mar 27, 2026
Labels

bug Something isn't working nvidia ready ONLY add when PR is ready to merge/full CI is needed v1

3 participants