
Fix flashinfer cutlass MoE output shape for non-FP4-packed inputs#14028

Merged
Fridge003 merged 1 commit into main from fix/modelopt-fp4-output-shape on Nov 27, 2025
Conversation

@alisonshao (Collaborator)

Summary

Fixes the nightly-test-perf-4-gpu-b200 failure caused by `ValueError: Invalid shape of output: expected (512, 7168), got torch.Size([512, 14336])` when starting DeepSeek-V3-FP4 with the flashinfer_cutlass MoE backend.

Root Cause

PR #13327 introduced a regression in ModelOptNvFp4FusedMoEMethod.apply() when refactoring the MoE dispatcher implementation. The output tensor allocation was changed from:

symm_output = torch.empty(x.shape[0], original_col, ...)

to:

symm_output = torch.empty(x.shape[0], x.shape[1] * 2, ...)

The `* 2` multiplier was intended for the case where `x` is FP4-packed (each byte holds two FP4 values, so `x.shape[1]` is half the original `hidden_size`). However, when `should_use_flashinfer_cutlass_moe_fp4_allgather()` returns `False` (as in the failing test with `--tp 4 --ep 4`), the hidden_states are NOT FP4-packed and already have the full `hidden_size`. This caused the output to be allocated at double the expected size (7168 * 2 = 14336 instead of 7168).
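As a sanity check, the shape arithmetic behind the failure can be reproduced with plain numbers (values taken from the failing DeepSeek-V3-FP4 run; the variable names here are illustrative, not from the actual code):

```python
# Arithmetic behind the failure, using the shapes from the error message.
hidden_size = 7168
num_tokens = 512

# Unpacked path (allgather disabled): x already has the full hidden_size,
# so the unconditional * 2 doubles the output width.
x_cols_unpacked = hidden_size          # 7168
buggy_out_cols = x_cols_unpacked * 2   # 14336 -> shape mismatch at startup
assert buggy_out_cols == 14336

# Packed path: each byte holds two FP4 values, so the column count is
# halved and the * 2 multiplier is correct there.
x_cols_packed = hidden_size // 2       # 3584
assert x_cols_packed * 2 == hidden_size
```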

Fix

The output size is now conditional on whether `x_sf` (`hidden_states_scale`) is provided:

  • When `x_sf` is not `None`: hidden_states are FP4-packed, so use `x.shape[1] * 2`
  • When `x_sf` is `None`: hidden_states are unpacked, so use `x.shape[1]`
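The conditional allocation can be sketched as a small helper (a simplified illustration of the logic, not the actual code in `ModelOptNvFp4FusedMoEMethod.apply()`; `output_cols` and its parameters are hypothetical names):

```python
from typing import Optional


def output_cols(x_cols: int, x_sf: Optional[object]) -> int:
    """Return the output hidden dimension for the symm_output buffer.

    x_cols: number of columns of the hidden_states tensor (x.shape[1]).
    x_sf:   FP4 scale factors (hidden_states_scale), or None when the
            hidden_states are not FP4-packed.
    """
    if x_sf is not None:
        # x is FP4-packed: each byte stores two 4-bit values, so the
        # packed column count is half the real hidden_size.
        return x_cols * 2
    # x is unpacked (e.g. the fp4 allgather path is disabled):
    # it already has the full hidden_size.
    return x_cols


# DeepSeek-V3 hidden_size = 7168
assert output_cols(3584, object()) == 7168  # packed: 3584 bytes -> 7168 values
assert output_cols(7168, None) == 7168      # unpacked: keep as-is
```

With this, both the `--tp 4 --ep 4` configuration (unpacked) and the allgather configuration (packed) allocate a (num_tokens, 7168) output.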

Test plan

  • Syntax check passes
  • nightly-test-perf-4-gpu-b200 should pass with this fix

Fix incorrect output tensor shape calculation in ModelOptNvFp4FusedMoEMethod
that caused ValueError during server startup for DeepSeek-V3-FP4 models.

@github-actions github-actions bot added the quant LLM Quantization label Nov 27, 2025
@alisonshao (Collaborator, Author)

/tag-and-rerun-ci


@Fridge003 Fridge003 merged commit 6330d66 into main Nov 27, 2025
103 of 154 checks passed
@Fridge003 Fridge003 deleted the fix/modelopt-fp4-output-shape branch November 27, 2025 01:09
