[Bugfix] Revert "Zero-init MLA attention output buffers to prevent NaN from CUDA graph padding"#38359

Open
elvircrn wants to merge 3 commits into vllm-project:main from elvircrn:revert-mla-zero-init
Conversation

Contributor

@elvircrn elvircrn commented Mar 27, 2026

Summary

  • Restores the original torch.empty allocation, removing the overhead of the pre-allocated zero-init buffers and the out= workaround in the FlashInfer MLA backend.

Test plan

  • Run GB200 DeepSeek-R1 NVFP4 decode with CUDA graph padding and verify that no NaNs appear
  • Verify there is no performance regression from removing the pre-allocated buffer

🤖 Generated with Claude Code

…N from CUDA graph padding (vllm-project#37442)"

This reverts commit ef2c4f7.

The zero-init workaround is unnecessary: the NaN was caused by a
different issue (int64 expert IDs in the routing simulator). Reverting
restores the original torch.empty allocation, which avoids the
overhead of pre-allocated zero-init buffers.

Signed-off-by: Elvir Crncevic <elvircrn@gmail.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
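As a rough sketch of the trade-off being reverted (NumPy stands in for torch here, and the function names are illustrative, not vLLM's actual API): the workaround zero-filled the whole padded output buffer so that CUDA-graph padding rows could never surface garbage as NaN, while the restored behavior allocates uninitialized memory and relies on callers reading only the valid rows.

```python
import numpy as np

def alloc_decode_out_zeroed(padded_tokens: int, head_dim: int) -> np.ndarray:
    # Workaround being reverted: zero-fill the entire padded buffer so
    # stale bytes in padding rows can never propagate as NaN downstream.
    # Costs a full memset on every allocation.
    return np.zeros((padded_tokens, head_dim), dtype=np.float32)

def alloc_decode_out_empty(padded_tokens: int, head_dim: int) -> np.ndarray:
    # Restored behavior: uninitialized allocation. Padding rows may hold
    # arbitrary bytes, which is safe as long as consumers only read the
    # first num_tokens rows.
    return np.empty((padded_tokens, head_dim), dtype=np.float32)

# Downstream code only ever consumes the valid (unpadded) rows:
num_tokens, padded_tokens, head_dim = 3, 8, 4
out = alloc_decode_out_empty(padded_tokens, head_dim)
valid = out[:num_tokens]  # rows num_tokens..padded_tokens-1 are never read
```

Since the NaN turned out to come from the routing simulator rather than from these padding rows, the memset buys nothing, and plain empty allocation is the cheaper choice.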
@elvircrn elvircrn requested a review from pavanimajety as a code owner March 27, 2026 12:18

@claude claude bot left a comment


Claude Code Review

This pull request is from a fork — automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.

@mergify mergify bot added nvidia v1 bug Something isn't working labels Mar 27, 2026
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request removes pre-allocated output buffers and simplifies tensor allocation across the CUTLASS and FlashInfer MLA backends. In cutlass_mla.py, the _decode_out buffer is replaced with a direct new_empty allocation; in flashinfer_mla.py, the manual buffer management and padding-zeroing workarounds in forward_mqa are removed. I have no feedback to provide.

@mgoin mgoin added the ready ONLY add when PR is ready to merge/full CI is needed label Mar 27, 2026
Labels

bug Something isn't working nvidia ready ONLY add when PR is ready to merge/full CI is needed v1

3 participants