[Bug] Fix V4-Pro NaN on Blackwell by converting fp8_einsum input scale to ue8m0 #25733

Merged: Fridge003 merged 1 commit into main from fix/fp8-einsum-ue8m0-scale on May 19, 2026

yhyang201 commented May 19, 2026

Collaborator

Summary

  • Fix DeepSeek-V4-Pro producing garbage text during autoregressive decode on Blackwell GPUs (B300/B200)
  • Root cause: a packing bug in deep_gemm's transpose_and_pack_fp32_into_ue8m0 CUDA kernel. It does not mask the mantissa bits when extracting the fp32 exponent, so non-power-of-2 scale values are corrupted. As a result, fp8_einsum produces NaN at batch=1 (single-token decode)
  • Fix: add ceil_to_ue8m0() on the activation scale before passing to fp8_einsum, ensuring mantissa bits are zero before packing
  • The scale tensor is tiny (e.g. shape (2, 32)), so the overhead is negligible
  • Detailed analysis: https://gist.github.com/yhyang201/2fce5750e44970af419f9669e10e994c
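
The fix described in the bullets above can be illustrated with a minimal pure-Python sketch. This is not sglang's ceil_to_ue8m0() implementation: `fp32_bits`, `ceil_to_power_of_two`, `mantissa_bits`, and `exponent_byte` are hypothetical helper names, and the sketch assumes positive, finite scales in the normal fp32 range. It only shows that rounding a scale up to a power of two zeroes the 23 mantissa bits, leaving just the 8-bit exponent that a ue8m0 byte stores.

```python
import math
import struct


def fp32_bits(x: float) -> int:
    """Return the IEEE-754 binary32 bit pattern of x as an unsigned int."""
    return struct.unpack("<I", struct.pack("<f", x))[0]


def ceil_to_power_of_two(scale: float) -> float:
    """Round a positive scale up to the nearest power of two.

    Hypothetical stand-in for the idea behind ceil_to_ue8m0(); the real
    function operates on tensors and may differ in its details.
    """
    if not math.isfinite(scale) or scale <= 0.0:
        raise ValueError("scale must be positive and finite")
    mantissa, exponent = math.frexp(scale)  # scale == mantissa * 2**exponent
    if mantissa == 0.5:
        return scale  # already an exact power of two
    return math.ldexp(1.0, exponent)


def mantissa_bits(x: float) -> int:
    """Low 23 bits of the fp32 pattern (the fraction field)."""
    return fp32_bits(x) & 0x7FFFFF


def exponent_byte(x: float) -> int:
    """Biased 8-bit exponent field of the fp32 pattern."""
    return (fp32_bits(x) >> 23) & 0xFF


raw = 0.75                       # not a power of two
rounded = ceil_to_power_of_two(raw)
print(hex(mantissa_bits(raw)))   # 0x400000: mantissa bits are set
print(rounded, mantissa_bits(rounded), exponent_byte(rounded))  # 1.0 0 127
```

Rounding up rather than down keeps the quantized values within range at the cost of slightly coarser resolution; that trade-off is an assumption about the intent here, inferred from the "ceil" in the function name.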

CI States

Latest PR Test (Base): 🚫 Run #26072487958
Latest PR Test (Extra): ⚠️ Not enabled -- add run-ci-extra label to opt in.

…e to ue8m0

deep_gemm's ue8m0 packing kernel has a bug where non-power-of-2 fp32
scale values get their mantissa bits leaked into packed exponent fields.
This causes fp8_einsum to produce NaN during autoregressive decode
(batch=1) for DeepSeek-V4-Pro on Blackwell GPUs.

Adding ceil_to_ue8m0() on the activation scale ensures mantissa bits
are zero before packing, which matches deep_gemm's expected input
format and fixes the NaN.
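
To make "mantissa bits leaked into packed exponent fields" concrete, here is a toy model that packs four fp32 scales into one 32-bit word, one exponent byte per lane. The faulty variant is an assumed failure mode chosen purely for illustration, not a transcription of deep_gemm's kernel, and every function name in it is hypothetical.

```python
import struct


def fp32_bits(x: float) -> int:
    """Return the IEEE-754 binary32 bit pattern of x as an unsigned int."""
    return struct.unpack("<I", struct.pack("<f", x))[0]


def pack_masked(scales):
    """Pack four positive fp32 scales, keeping only each 8-bit exponent."""
    word = 0
    for lane, s in enumerate(scales):
        exponent = (fp32_bits(s) >> 23) & 0xFF  # mask off everything else
        word |= exponent << (8 * lane)
    return word


def pack_unmasked(scales):
    """Assumed faulty variant: the exponent is shifted into its lane,
    but the mantissa bits below it are never cleared."""
    word = 0
    for lane, s in enumerate(scales):
        shift = 23 - 8 * lane  # moves the exponent field to bits [8*lane, 8*lane+7]
        bits = fp32_bits(s)
        shifted = bits >> shift if shift >= 0 else bits << -shift
        word |= shifted & 0xFFFFFFFF
    return word


def unpack(word):
    """Split a packed 32-bit word back into four exponent bytes."""
    return [(word >> (8 * lane)) & 0xFF for lane in range(4)]


powers_of_two = [1.0, 2.0, 0.5, 4.0]
mixed = [1.0, 1.5, 1.0, 1.0]  # 1.5 has a nonzero mantissa

print(unpack(pack_masked(powers_of_two)))    # [127, 128, 126, 129]
print(unpack(pack_unmasked(powers_of_two)))  # [127, 128, 126, 129]
print(unpack(pack_masked(mixed)))            # [127, 127, 127, 127]
print(unpack(pack_unmasked(mixed)))          # [255, 127, 127, 127]
```

In this model the two packers agree whenever every scale is a power of two, which is why pre-rounding the scales sidesteps the problem. With 1.5 in lane 1, a stray mantissa bit turns lane 0 into 0xFF, the byte that the E8M0 encoding reserves for NaN. Whether the real kernel fails in exactly this way is not established by this sketch.
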
@gemini-code-assist

Contributor

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

@yhyang201

Collaborator Author

/tag-and-rerun-ci

@ch-wan

ch-wan commented May 19, 2026

Collaborator

/rerun-test test_deepseek_v4_flash_fp4_b200.py

@github-actions

github-actions Bot commented May 19, 2026

Contributor

🚀 4-gpu-b200 (1 test): ❌ View workflow run

cd test/ && python3 registered/dsv4/test_deepseek_v4_flash_fp4_b200.py

@yhyang201

Collaborator Author

/rerun-test test_deepseek_v4_flash_fp4_b200.py

@github-actions

github-actions Bot commented May 19, 2026

Contributor

🚀 4-gpu-b200 (1 test): ✅ View workflow run

cd test/ && python3 registered/dsv4/test_deepseek_v4_flash_fp4_b200.py

@yhyang201

Collaborator Author

/rerun-test test_deepseek_v4_flash_fp4_megamoe_b200.py

@github-actions

github-actions Bot commented May 19, 2026

Contributor

🚀 4-gpu-b200 (1 test): ✅ View workflow run

cd test/ && python3 registered/dsv4/test_deepseek_v4_flash_fp4_megamoe_b200.py

Fridge003 merged commit 79ea30d into main on May 19, 2026
183 of 214 checks passed
Fridge003 deleted the fix/fp8-einsum-ue8m0-scale branch on May 19, 2026 06:48
Kangyan-Zhou added a commit that referenced this pull request May 22, 2026
… converting fp8_einsum input scale to ue8m0 (#25733) (#26063)

Co-authored-by: Yuhao Yang <47235274+yhyang201@users.noreply.github.com>
Shunkangz pushed a commit to Shunkangz/sglang that referenced this pull request May 27, 2026
alphabetc1 pushed a commit to alphabetc1/sglang that referenced this pull request Jun 4, 2026

3 participants