Skip to content

[Bugfix][Kernel] Fix mHC fused-RMSNorm big-fuse miscompile for hidden_size != 4096#44692

Merged
jeejeelee merged 1 commit into
vllm-project:mainfrom
zyongye:fix-mhc-norm-num-stages
Jun 6, 2026
Merged

[Bugfix][Kernel] Fix mHC fused-RMSNorm big-fuse miscompile for hidden_size != 4096#44692
jeejeelee merged 1 commit into
vllm-project:mainfrom
zyongye:fix-mhc-norm-num-stages

Conversation

@zyongye
Copy link
Copy Markdown
Member

@zyongye zyongye commented Jun 5, 2026

Purpose

mhc_pre_big_fuse_with_norm_tilelang (the RMSNorm-fused mHC pre big-fuse used
by DeepSeek-V4 when norm_weight is supplied) pipelines its fused-RMSNorm pass
at num_stages=3. Combined with the loop-carried sumsq reduction and the
persistent output_shared staging buffer written at per-iteration offsets, the
TileLang software pipeliner miscompiles the loop. The result is silently wrong
layer_input
for every hidden size except 4096:

hidden_size trip count (H // 1024) result vs fp32 reference
2048 2 NaN
3072 3 NaN
4096 4 correct (1.66e-3)
5120 5 0.55
6144 6 0.23
7168 (DSv4) 7 0.17
8192 8 0.36

It is occupancy-dependent (at H=7168 it is correct for ≤256 tokens, breaks at
≥512) and slightly non-deterministic run-to-run — the signature of a pipeline
prologue/epilogue codegen issue, not an arithmetic one. The path was evidently
only validated at hidden_size=4096.

Fix

Drop the fused-RMSNorm pass to num_stages=2, matching the correct no-norm
sibling kernel mhc_pre_big_fuse_tilelang (which already uses depth 2). One line.

Root cause is in the pipeliner, not the algorithm: forcing the no-norm kernel
(no loop-carried state) to num_stages=3 stays correct, so depth-3 alone is
fine — it only breaks when combined with the loop-carried reduction + persistent
shared buffer. Depth-2 handles that combination correctly.

Not a duplicate

Checked open mHC PRs (#42735 bf16 shared staging, #44144 XPU fused_post_pre,
#43950 ROCm aiter default, #42893 / #41834 DSv4 platform fixes). None touch the
num_stages / fused-norm correctness. #42735 is a separate float32→bf16 staging
perf change and already uses num_stages=2 in its variants.

Test

Verified on GB200 (CUDA 13, TileLang) against an fp32 RMSNorm reference built on
top of the existing mhc_pre_ref in tests/kernels/test_mhc_kernels.py:

  • Before: hidden_size 2048/3072 → NaN; 5120/6144/7168/8192 → rel-err 0.17–0.55; only 4096 correct.
  • After: all hidden sizes 2048–8192 match to ~1.6e-3 (bf16 floor), deterministic, across token counts 1..16384.
  • Full op path torch.ops.vllm.mhc_pre_tilelang(..., norm_weight, norm_eps) (deep_gemm + kernel) at H=3072/7168: 1.67e-3, all finite.
  • The non-fused paths (mhc_pre, mhc_post) are unchanged and unaffected.

AI assistance (Claude) was used to diagnose and write this change; the author reviewed every line and ran the verification above.

…_size != 4096

mhc_pre_big_fuse_with_norm_tilelang pipelined the fused-RMSNorm pass at
num_stages=3. Combined with the loop-carried sumsq reduction and the
persistent output_shared buffer, the tilelang software pipeliner miscompiles
it: NaN when hidden_size // 1024 <= 3 (e.g. 2048, 3072) and finite-but-wrong,
slightly non-deterministic output when hidden_size // 1024 > 4 (e.g. 7168,
the DeepSeek-V4 size). Only hidden_size=4096 (trip count 4) was correct.

Drop to num_stages=2, matching the correct no-norm sibling kernel
mhc_pre_big_fuse_tilelang. Verified against an fp32 RMSNorm reference: all
hidden sizes 2048-8192 now match to ~1.6e-3 (bf16 floor), deterministically,
across token counts 1..16384. num_stages=3 without the loop-carried state
(the no-norm kernel) compiles correctly, confirming the issue is the
pipeliner's handling of that combination.

Signed-off-by: Yongye Zhu <zyy1102000@gmail.com>
Co-authored-by: Claude <noreply@anthropic.com>
Copy link
Copy Markdown

@claude claude Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Claude Code Review

This pull request is from a fork — automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.

@mergify mergify Bot added the bug Something isn't working label Jun 5, 2026
@zyongye zyongye added the ready ONLY add when PR is ready to merge/full CI is needed label Jun 5, 2026
@jeejeelee jeejeelee merged commit ec0a31d into vllm-project:main Jun 6, 2026
52 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working ready ONLY add when PR is ready to merge/full CI is needed

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants