Fix NaN from stale FP4 scale padding in create_fp4_scale_tensor#38148
Merged: tlrmchlsmth merged 3 commits into vllm-project:main (Apr 1, 2026)
Conversation
Padding rows in the swizzled scale tensor were uninitialized (torch.empty), containing stale NaN from prior GPU allocations. The TRT-LLM mm_fp4 kernel with use_8x4_sf_layout=True reads padding scales and applies them to real rows, contaminating the output with NaN. Zero-filling ensures padding scales contribute 0 * data = 0.

Fixes: flashinfer-ai/flashinfer#2861

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Elvir Crncevic <elvircrn@gmail.com>
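The failure mode described above can be reproduced in miniature without any GPU code. The sketch below is a pure-Python simulation (the function and variable names are hypothetical, and it stands in for the TRT-LLM kernel's scale-weighted reduction rather than calling it): a NaN left in an uninitialized padding scale poisons the accumulated result, while a zero-filled padding scale contributes exactly 0 * data = 0.

```python
import math

def scale_weighted_sum(data, scales):
    """Stand-in for a kernel reduction that multiplies each row's value
    by its scale and accumulates over ALL rows, padding included."""
    return sum(s * d for s, d in zip(scales, data))

m, rounded_m = 3, 8                       # 3 real rows, padded to a tile boundary
data = [1.0, 2.0, 3.0] + [0.0] * (rounded_m - m)

# torch.empty-style allocation: padding scales hold stale bytes (here, NaN).
stale_scales = [1.0, 1.0, 1.0] + [math.nan] * (rounded_m - m)
nan_out = scale_weighted_sum(data, stale_scales)   # NaN * 0.0 is NaN in IEEE 754

# torch.zeros-style allocation: padding scales are 0 and contribute nothing.
zero_scales = [1.0, 1.0, 1.0] + [0.0] * (rounded_m - m)
clean_out = scale_weighted_sum(data, zero_scales)

print(nan_out, clean_out)   # nan 6.0
```

Note that multiplying NaN by zero still yields NaN, which is why "the padding rows multiply zeros anyway" is not a defense; the scale itself must be a well-defined number.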
Contributor
Code Review
This pull request modifies vllm/_custom_ops.py to initialize FP4 scale tensors with zeros using torch.zeros instead of torch.empty. This change ensures that the tensors are predictably initialized, preventing potential issues from uninitialized memory. There are no review comments to address.
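The change itself is a one-line allocation swap inside create_fp4_scale_tensor. The sketch below is illustrative only (round_up and the variable names are not vLLM's actual helpers); it shows the shape arithmetic the PR description refers to, where the row count is rounded up to a 128-row tile boundary and everything between m and rounded_m is padding:

```python
def round_up(x, multiple):
    """Round x up to the nearest multiple (e.g. the 128-row tile boundary)."""
    return (x + multiple - 1) // multiple * multiple

m = 3                        # real rows (e.g. tokens routed to one expert)
rounded_m = round_up(m, 128)
print(rounded_m)             # 128 -> rows m..127 are padding rows

# With a torch.empty-style allocation, those 125 padding rows' scales hold
# whatever bytes the allocator returns (possibly FP8 NaN, 0x7F in
# float8_e4m3fn); with torch.zeros they are guaranteed to be 0.
```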
Force-pushed from 51ccde0 to 2432520
tlrmchlsmth approved these changes on Mar 30, 2026
Auto-merge was automatically disabled on March 30, 2026 12:26 (Pull Request is not mergeable)
tlrmchlsmth added a commit to tlrmchlsmth/vllm that referenced this pull request on Mar 30, 2026:

Cherry-pick d4a41a9: Revert "Zero-init MLA attention output buffers to prevent NaN from CUDA graph padding (vllm-project#37442)"
Apply PR vllm-project#38148: Fix NaN from stale FP4 scale padding in create_fp4_scale_tensor
Signed-off-by: Travis Stephens <travis@anthropic.com>
Signed-off-by: Tyler Michael Smith <tlrmchlsmth@gmail.com>
EricccYang pushed a commit to EricccYang/vllm that referenced this pull request on Apr 1, 2026:

…-project#38148)
Signed-off-by: Elvir Crncevic <elvircrn@gmail.com>
Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com>
Signed-off-by: EricccYang <yangyang4991@gmail.com>
Summary
create_fp4_scale_tensor: torch.empty → torch.zeros

Root cause
create_fp4_scale_tensor allocates the swizzled scale tensor with torch.empty. When the number of rows m is less than rounded_m (rounded up to 128 for the tile boundary), the padding rows' scales contain stale GPU memory. If that memory holds FP8 NaN (0x7F in float8_e4m3fn), the TRT-LLM mm_fp4 kernel with use_8x4_sf_layout=True (triggered when m <= 32) reads these padding scales and applies them to real rows, producing NaN output.

In practice, this manifests as sporadic NaN in MoE layer outputs during prefill: experts receiving ≤32 tokens (common with 256 experts) hit the use_8x4_sf_layout path. The NaN then cascades through all subsequent layers, corrupts KV cache entries, and propagates to decode servers via KV transfer (NIXL).

Fix
Replace torch.empty with torch.zeros for the scale tensor allocation. Zero scales ensure padding contributes 0 × data = 0 to the real rows' output, neutralizing the kernel bug.

Related
Test plan
🤖 Generated with Claude Code