[Bugfix][MoE] Unpad routed output before shared expert add [Fixes #35949] #40794
Conversation
Signed-off-by: Netanel Haber <nhaber@nvidia.com>
Code Review
This pull request introduces logic to handle hidden dimension padding in the Fused MoE runner. It records the original hidden dimension before potential padding and ensures that the fused output is sliced back to its original size if padding was applied. I have no feedback to provide as there are no review comments to evaluate.
Thanks! Approved. BTW @netanel-haber - do you know how this works with latentMoE (regardless of padding)? Are the routed hidden states added to the shared hidden states only after applying the latent up-proj to match hidden dims again?
…m-project#35949] (vllm-project#40794) Signed-off-by: Netanel Haber <nhaber@nvidia.com>
I think this change might have broken the gpt-oss tests.
Re gpt-oss test breakages,
…m-project#35949] (vllm-project#40794) Signed-off-by: Netanel Haber <nhaber@nvidia.com> Signed-off-by: Avinash Singh <avinashsingh.rcoem@gmail.com>
…m-project#35949] (vllm-project#40794) Signed-off-by: Netanel Haber <nhaber@nvidia.com> Signed-off-by: Adrian <info@zzit.ch>

Fixes https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4.
FI TRTLLM NVFP4 MoE can pad the routed hidden dim, e.g. 2688 -> 2816, via `align_trtllm_fp4_moe_hidden_dim_for_fi`. Before #35949, `FusedMoE` returned the routed and shared outputs separately: the routed output was truncated back to the original hidden dim before model code added the shared expert output, so the world looked like the first sketch below.
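A minimal sketch of that pre-#35949 ordering, using random stand-in tensors rather than vLLM's actual `FusedMoE` API (all names and shapes here are illustrative):

```python
import torch

hidden = torch.randn(4, 2688)   # hidden states at the original dim
shared = torch.randn(4, 2688)   # stand-in for the shared expert output (never padded)
routed = torch.randn(4, 2816)   # stand-in for the fused experts' padded routed output

routed = routed[..., :hidden.shape[-1]]  # truncated back to 2688 inside FusedMoE first...
out = routed + shared                    # ...so the model-side shared add saw matching shapes
```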
#35949 moved the shared/routed add into `MoERunner`, which changed the order to add first and truncate later, as in the second sketch below. Dynamo catches this during fake tensor tracing as a shape mismatch.
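A sketch of the broken post-#35949 ordering under the same illustrative shapes. Eagerly the mismatched add raises a `RuntimeError`; under `torch.compile`, Dynamo's fake tensor tracing surfaces the same mismatch at trace time:

```python
import torch

shared = torch.randn(4, 2688)   # shared expert output at the original dim
routed = torch.randn(4, 2816)   # padded routed output from the fused experts

try:
    out = routed + shared       # add happens first: 2816 vs 2688 don't broadcast
    out = out[..., :2688]       # the truncation now comes too late to help
except RuntimeError as e:
    print(e)                    # the shape mismatch Dynamo reports during fake tracing
```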
This PR records the routed hidden dim before `_maybe_pad_hidden_states()` and trims the fused routed output back to that dim before the shared expert addition. DailyOmni is on par for nano-v3-omni before and after this PR.
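A minimal sketch of the fix, assuming a hypothetical `maybe_pad` helper standing in for `_maybe_pad_hidden_states()` and a 256-element alignment chosen only to reproduce the 2688 -> 2816 example:

```python
import torch

def maybe_pad(x: torch.Tensor, multiple: int = 256) -> torch.Tensor:
    """Stand-in for _maybe_pad_hidden_states(): right-pad the hidden dim to a multiple."""
    pad = (-x.shape[-1]) % multiple
    return torch.nn.functional.pad(x, (0, pad))

hidden = torch.randn(4, 2688)
orig_dim = hidden.shape[-1]              # recorded *before* padding, as this PR does
padded = maybe_pad(hidden)               # 2688 -> 2816
routed = torch.randn_like(padded)        # stand-in for the fused experts' routed output
routed = routed[..., :orig_dim]          # trimmed back to 2688 before the shared add
out = routed + torch.randn(4, orig_dim)  # shared expert output; shapes agree again
```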