[Bugfix][MoE] Only unpad routed output before shared expert add #40865
Conversation
@bnellnm - I think this would fix the gpt-oss failure. Is there a way to manually trigger the gpt-oss test on this PR?
Code Review
This pull request adds a comment clarifying the purpose of tracking routed hidden dimensions and introduces a conditional check when truncating fused outputs. The review identifies a potential shape-mismatch bug in latent MoE configurations where shared experts are absent, and suggests updating the condition so truncation occurs when either a shared output or a routed output transform is present.
There was a similar failure in a prior PR when the truncation came before the reduce; it was fixed by moving the truncation afterwards.
I ran the GPQA test locally (B200 x2) before and after this PR, and the PR fixes it @bnellnm.
What the PR basically does is skip the truncation when there is no shared expert, which is the case for gpt-oss, so its behavior reverts to what it was before the PR, while preserving the behavior of my original PR: truncation before the addition when there is a shared expert, as in Nemotron-Nano-v3.
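A minimal sketch of the behavior described above, assuming a simplified fused-MoE forward; names like `fused_moe_output`, `routed_output`, and `shared_output` are illustrative, not the actual vLLM code:

```python
import torch


def fused_moe_output(
    routed_output: torch.Tensor,         # [num_tokens, padded_hidden_size]
    shared_output: torch.Tensor | None,  # [num_tokens, hidden_size] or None
    hidden_size: int,
) -> torch.Tensor:
    if shared_output is not None:
        # Shared expert present (e.g. Nemotron-Nano-v3): trim the padded
        # routed output to hidden_size *before* the add, or the shapes differ.
        return shared_output + routed_output[..., :hidden_size]
    # No shared expert (e.g. gpt-oss): keep the pre-#40794 behavior and leave
    # the padded output untouched here; truncation happens later, after the
    # reduce.
    return routed_output
```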
tomeras91 left a comment
After making sure truncation is applied if either shared experts or a routed_output_transform is used, LGTM
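A sketch of the condition this review asks for; `shared_output` and `routed_output_transform` are illustrative parameter names, not necessarily the exact attributes used in vLLM:

```python
def needs_early_truncation(shared_output, routed_output_transform) -> bool:
    # Truncate the padded routed output early only when something downstream
    # consumes it at the unpadded hidden size: a shared-expert add or a
    # routed-output transform (the latent-MoE up-projection).
    return shared_output is not None or routed_output_transform is not None
```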
…uted output transform (vllm-project#40865) Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com> Co-authored-by: Jee Jee Li <pandaleefree@gmail.com>
Only trim the padded routed output before the shared+routed add / latent MoE up-projection. MoE without shared experts keeps the late truncation.
Not a duplicate: #40853 (a draft auto-PR) reverts #40794; this PR preserves the shared-output fix.
I reproduced the following error locally before this PR's change, and the test passes with the change:
The PR skips fused truncation when there is no shared expert (or latent MoE up-projection), as is the case with GPT-OSS. This restores GPT-OSS behavior to the state before #40794, while preserving #40794's intent for Nemotron-Nano-v3: applying truncation before addition, which was broken by #35949.
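For illustration only, a minimal example of the shape mismatch this ordering avoids; the sizes below are made up, not actual GPT-OSS or Nemotron dimensions:

```python
import torch

# Made-up sizes: the fused MoE kernel pads the hidden dim, the shared expert
# does not.
hidden_size, padded_hidden_size, num_tokens = 4096, 4224, 8

routed = torch.randn(num_tokens, padded_hidden_size)  # padded routed output
shared = torch.randn(num_tokens, hidden_size)         # shared expert output

# `shared + routed` would raise a size-mismatch RuntimeError; trimming the
# routed output first makes the add well-defined.
out = shared + routed[..., :hidden_size]
assert out.shape == (num_tokens, hidden_size)
```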
There may still be an open Nemotron-Nano-v3 FI NVFP4 bug under investigation. This PR, like the previous one, does not attempt to fix that; it only aims to keep GPT-OSS working and make Nemotron-Nano-v3 TP=1 functional.
AI assistance was used.