Fix flashinfer cutlass MoE output shape for non-FP4-packed inputs#14028
Merged
Fix incorrect output tensor shape calculation in ModelOptNvFp4FusedMoEMethod that caused ValueError during server startup for DeepSeek-V3-FP4 models.
Author (Collaborator): /tag-and-rerun-ci
Fridge003 approved these changes on Nov 27, 2025.
harvenstar pushed a commit to harvenstar/sglang that referenced this pull request on Dec 4, 2025.
Summary
Fixes the `nightly-test-perf-4-gpu-b200` failure caused by `ValueError: Invalid shape of output: expected (512, 7168), got torch.Size([512, 14336])` when starting DeepSeek-V3-FP4 with the flashinfer_cutlass MoE backend.

Root Cause
PR #13327 introduced a regression in `ModelOptNvFp4FusedMoEMethod.apply()` when refactoring the MoE dispatcher implementation: the output tensor allocation was changed so that the hidden dimension is always computed as `x.shape[1] * 2`.
The `* 2` multiplier is correct only when `x` is FP4-packed (each byte holds two FP4 values, so `x.shape[1]` is half the original `hidden_size`). However, when `should_use_flashinfer_cutlass_moe_fp4_allgather()` returns False (as in the failing test with `--tp 4 --ep 4`), the hidden_states are not FP4-packed and already carry the full `hidden_size`. The output was therefore allocated with double the expected width (7168 * 2 = 14336 instead of 7168).

Fix
The output size is now conditional on whether `x_sf` (hidden_states_scale) is provided:

- `x_sf is not None`: the hidden_states are FP4-packed, so use `x.shape[1] * 2`
- `x_sf is None`: the hidden_states are not packed, so use `x.shape[1]`

Test plan
`nightly-test-perf-4-gpu-b200` should pass with this fix.
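
The conditional size logic described in the Fix section can be sketched in plain Python. This is a hypothetical standalone helper for illustration only; the actual change operates on torch tensors inside `ModelOptNvFp4FusedMoEMethod.apply()`:

```python
def output_hidden_dim(x_dim1, x_sf):
    """Pick the hidden size for the MoE output buffer.

    Hypothetical helper mirroring the fix: when a scale tensor (x_sf)
    is present, the input is FP4-packed (two values per byte), so the
    true hidden size is twice the packed dimension; otherwise the
    input already carries the full hidden size.
    """
    return x_dim1 * 2 if x_sf is not None else x_dim1

# Packed path: hidden_size 7168 stored as 3584 bytes plus scales.
assert output_hidden_dim(3584, x_sf=object()) == 7168

# Unpacked path (the failing --tp 4 --ep 4 case): no scales, full width.
# The old unconditional "* 2" would have produced 14336 here, matching
# the ValueError in the failing test.
assert output_hidden_dim(7168, x_sf=None) == 7168
```

The key design point is that the presence of `x_sf` is used as the signal for whether the input is FP4-packed, so no extra flag needs to be threaded through the dispatcher.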