Fix flashinfer cutlass MoE output shape for non-FP4-packed inputs#14028
Merged
Fix incorrect output tensor shape calculation in ModelOptNvFp4FusedMoEMethod that caused ValueError during server startup for DeepSeek-V3-FP4 models.
Author (Collaborator): /tag-and-rerun-ci
Fridge003 approved these changes on Nov 27, 2025.
harvenstar pushed a commit to harvenstar/sglang that referenced this pull request on Dec 4, 2025.
Summary
Fixes the `nightly-test-perf-4-gpu-b200` failure caused by `ValueError: Invalid shape of output: expected (512, 7168), got torch.Size([512, 14336])` when starting DeepSeek-V3-FP4 with the flashinfer_cutlass MoE backend.

Root Cause
PR #13327 introduced a regression in `ModelOptNvFp4FusedMoEMethod.apply()` when refactoring the MoE dispatcher implementation: the output tensor allocation was changed so that the hidden dimension is always computed as `x.shape[1] * 2`.
The `* 2` multiplier is correct only when `x` is FP4-packed (each byte holds two FP4 values, so `x.shape[1]` is half the original `hidden_size`). However, when `should_use_flashinfer_cutlass_moe_fp4_allgather()` returns False (as in the failing test with `--tp 4 --ep 4`), the hidden_states are not FP4-packed and already carry the full `hidden_size`. The output was therefore allocated with double the expected width (7168 * 2 = 14336 instead of 7168).

Fix
The output size is now conditional on whether `x_sf` (hidden_states_scale) is provided:

- `x_sf is not None`: the hidden_states are FP4-packed, so use `x.shape[1] * 2`
- `x_sf is None`: the hidden_states are not packed, so use `x.shape[1]`

Test plan
`nightly-test-perf-4-gpu-b200` should pass with this fix.
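
The conditional size logic described in the Fix section can be sketched in plain Python. This is a hypothetical standalone helper for illustration only; the actual change operates on torch tensors inside `ModelOptNvFp4FusedMoEMethod.apply()`:

```python
def output_hidden_dim(x_dim1, x_sf):
    """Pick the hidden size for the MoE output buffer.

    Hypothetical helper mirroring the fix: when a scale tensor (x_sf)
    is present, the input is FP4-packed (two values per byte), so the
    true hidden size is twice the packed dimension; otherwise the
    input already carries the full hidden size.
    """
    return x_dim1 * 2 if x_sf is not None else x_dim1

# Packed path: hidden_size 7168 stored as 3584 bytes plus scales.
assert output_hidden_dim(3584, x_sf=object()) == 7168

# Unpacked path (the failing --tp 4 --ep 4 case): no scales, full width.
# The old unconditional "* 2" would have produced 14336 here, matching
# the ValueError in the failing test.
assert output_hidden_dim(7168, x_sf=None) == 7168
```

The key design point is that the presence of `x_sf` is used as the signal for whether the input is FP4-packed, so no extra flag needs to be threaded through the dispatcher.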