[Bugfix][Kernel] Fix mHC fused-RMSNorm big-fuse miscompile for hidden_size != 4096 by zyongye · Pull Request #44692 · vllm-project/vllm

zyongye · 2026-06-05T22:35:40Z

Purpose

mhc_pre_big_fuse_with_norm_tilelang (the RMSNorm-fused mHC pre big-fuse used
by DeepSeek-V4 when norm_weight is supplied) pipelines its fused-RMSNorm pass
at num_stages=3. Combined with the loop-carried sumsq reduction and the
persistent output_shared staging buffer written at per-iteration offsets, the
TileLang software pipeliner miscompiles the loop. The result is silently wrong
layer_input for every tested hidden size except 4096:

| hidden_size | trip count (`H // 1024`) | relative error vs fp32 reference |
|---|---|---|
| 2048 | 2 | NaN |
| 3072 | 3 | NaN |
| 4096 | 4 | 1.66e-3 (correct) |
| 5120 | 5 | 0.55 |
| 6144 | 6 | 0.23 |
| 7168 (DSv4) | 7 | 0.17 |
| 8192 | 8 | 0.36 |
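For readers who want to map a model's hidden size onto the table, the sketch below recomputes the trip-count column and pairs it with the outcome reported above. The chunk width of 1024 is inferred from the `H // 1024` column header and is an assumption about the kernel's tiling; `trip_count` and `observed` are hypothetical names used only for illustration.

```python
# Chunk width inferred from the table header (H // 1024); an assumption about
# the kernel's tiling, not a value read from the kernel source.
CHUNK = 1024


def trip_count(hidden_size: int, chunk: int = CHUNK) -> int:
    """Number of loop iterations the fused-RMSNorm pass runs for one row."""
    return hidden_size // chunk


# Outcomes copied from the table above (pre-fix, num_stages=3).
observed = {
    2048: "NaN",
    3072: "NaN",
    4096: "correct",
    5120: "wrong",
    6144: "wrong",
    7168: "wrong",
    8192: "wrong",
}

for hidden_size, outcome in observed.items():
    print(f"H={hidden_size:5d} trips={trip_count(hidden_size)} -> {outcome}")
```

Only the trip count of 4 lands on a correct result in the reported data, which is consistent with the path having been validated at a single hidden size.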

The failure is occupancy-dependent (at H=7168 the output is correct for ≤256
tokens and breaks at ≥512) and slightly non-deterministic run-to-run, which is
the signature of a pipeline prologue/epilogue codegen issue rather than an
arithmetic one. The path was evidently only validated at hidden_size=4096.

Fix

Drop the fused-RMSNorm pass to num_stages=2, matching the correct no-norm
sibling kernel mhc_pre_big_fuse_tilelang (which already uses depth 2). One line.

Root cause is in the pipeliner, not the algorithm: forcing the no-norm kernel
(no loop-carried state) to num_stages=3 stays correct, so depth-3 alone is
fine — it only breaks when combined with the loop-carried reduction + persistent
shared buffer. Depth-2 handles that combination correctly.
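To make the two ingredients named above concrete, here is a scalar model of a chunked RMSNorm pass in plain Python. It is a sketch, not the TileLang kernel: `chunked_rmsnorm`, the `staged` list standing in for output_shared, and the chunk width of 1024 are all assumptions for illustration. It shows why the loop has a loop-carried dependency (`sumsq`) and a buffer that is written at a different offset on every iteration and read only after the loop finishes.

```python
import math


def chunked_rmsnorm(x, weight, eps=1e-6, chunk=1024):
    """RMSNorm over one row, processed in fixed-width chunks (illustrative only)."""
    h = len(x)
    staged = [0.0] * h  # stands in for the persistent staging buffer
    sumsq = 0.0         # loop-carried: iteration i needs the value from i - 1

    for start in range(0, h, chunk):
        block = x[start:start + chunk]
        sumsq += sum(v * v for v in block)
        # Each iteration writes its own slice; nothing is consumed until the
        # whole row has been staged and the reduction is complete.
        staged[start:start + len(block)] = block

    inv_rms = 1.0 / math.sqrt(sumsq / h + eps)
    return [v * inv_rms * w for v, w in zip(staged, weight)]
```

A software pipeliner that overlaps iterations has to preserve both the ordering of the `sumsq` updates and the lifetime of every slice of the staging buffer across its prologue and epilogue, which is the combination the description identifies as miscompiled at depth 3.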

Not a duplicate

Checked open mHC PRs (#42735 bf16 shared staging, #44144 XPU fused_post_pre,
#43950 ROCm aiter default, #42893 / #41834 DSv4 platform fixes). None touch the
num_stages / fused-norm correctness. #42735 is a separate float32→bf16 staging
perf change and already uses num_stages=2 in its variants.

Test

Verified on GB200 (CUDA 13, TileLang) against an fp32 RMSNorm reference built on
top of the existing mhc_pre_ref in tests/kernels/test_mhc_kernels.py:

Before: hidden_size 2048/3072 → NaN; 5120/6144/7168/8192 → rel-err 0.17–0.55; only 4096 correct.
After: all hidden sizes 2048–8192 match to ~1.6e-3 (bf16 floor), deterministic, across token counts 1..16384.
Full op path torch.ops.vllm.mhc_pre_tilelang(..., norm_weight, norm_eps) (deep_gemm + kernel) at H=3072/7168: 1.67e-3, all finite.
The non-fused paths (mhc_pre, mhc_post) are unchanged and unaffected.
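The comparison above reports a relative error and a finiteness check. The PR does not state which error metric was used, so the sketch below assumes a relative L2 error and a hypothetical tolerance of 5e-3; `rel_err` and `matches` are illustrative names, not helpers from tests/kernels/test_mhc_kernels.py.

```python
import math


def rel_err(actual, expected):
    """Relative L2 error; NaN if `actual` contains any non-finite value."""
    if not all(math.isfinite(a) for a in actual):
        return math.nan
    num = math.sqrt(sum((a - e) ** 2 for a, e in zip(actual, expected)))
    den = math.sqrt(sum(e * e for e in expected))
    if den == 0.0:
        # Degenerate all-zero reference: exact match or unbounded error.
        return 0.0 if num == 0.0 else math.inf
    return num / den


def matches(actual, expected, tol=5e-3):
    """True when the output is finite and within `tol` (assumed threshold)."""
    err = rel_err(actual, expected)
    return math.isfinite(err) and err <= tol
```

Treating NaN as a failure explicitly matters here, because a plain `err <= tol` comparison and a plain `err > tol` comparison are both false for NaN, so a check written the wrong way round would pass the pre-fix 2048 and 3072 cases.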

AI assistance (Claude) was used to diagnose and write this change; the author reviewed every line and ran the verification above.

…_size != 4096

mhc_pre_big_fuse_with_norm_tilelang pipelined the fused-RMSNorm pass at num_stages=3. Combined with the loop-carried sumsq reduction and the persistent output_shared buffer, the tilelang software pipeliner miscompiles it: NaN when hidden_size // 1024 <= 3 (e.g. 2048, 3072) and finite-but-wrong, slightly non-deterministic output when hidden_size // 1024 > 4 (e.g. 7168, the DeepSeek-V4 size). Only hidden_size=4096 (trip count 4) was correct.

Drop to num_stages=2, matching the correct no-norm sibling kernel mhc_pre_big_fuse_tilelang.

Verified against an fp32 RMSNorm reference: all hidden sizes 2048-8192 now match to ~1.6e-3 (bf16 floor), deterministically, across token counts 1..16384.

num_stages=3 without the loop-carried state (the no-norm kernel) compiles correctly, confirming the issue is the pipeliner's handling of that combination.

Signed-off-by: Yongye Zhu <zyy1102000@gmail.com>
Co-authored-by: Claude <noreply@anthropic.com>

claude

Claude Code Review

This pull request is from a fork — automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.

claude Bot reviewed Jun 5, 2026


mergify Bot added the bug label Jun 5, 2026

zyongye added the ready label Jun 5, 2026

jeejeelee approved these changes Jun 6, 2026


jeejeelee merged commit ec0a31d into vllm-project:main Jun 6, 2026
52 checks passed

jeejeelee merged 1 commit into vllm-project:main from zyongye:fix-mhc-norm-num-stages
