[Bugfix][MoE] Only unpad routed output before shared expert add #40865
Conversation
@bnellnm - I think this would fix the gpt-oss failure. Is there a way to manually trigger the gpt-oss test on this PR?
Code Review
This pull request adds a comment clarifying the purpose of tracking routed hidden dimensions and introduces a conditional check when truncating fused outputs. The review identifies a potential shape-mismatch bug in latent MoE configurations where shared experts are absent, and suggests updating the condition so truncation occurs when either a shared output or a routed output transform is present.
There was a similar failure in a prior PR when the truncation came before the reduce; it was fixed by moving the truncation afterwards.
I ran the GPQA test locally (B200 x2) before and after this PR, and the PR fixes it @bnellnm.
What the PR basically does is skip the truncation when there is no shared expert, which is the case for gpt-oss, so its behavior reverts to what it was before the PR, while preserving the behavior of my original PR: truncation before the addition when there is a shared expert, as in Nemotron-Nano-v3.
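A minimal sketch of the behavior described above, assuming a simplified fused-MoE forward; names like `fused_moe_output`, `routed_output`, and `shared_output` are illustrative, not the actual vLLM code:

```python
import torch


def fused_moe_output(
    routed_output: torch.Tensor,         # [num_tokens, padded_hidden_size]
    shared_output: torch.Tensor | None,  # [num_tokens, hidden_size] or None
    hidden_size: int,
) -> torch.Tensor:
    if shared_output is not None:
        # Shared expert present (e.g. Nemotron-Nano-v3): trim the padded
        # routed output to hidden_size *before* the add, or the shapes differ.
        return shared_output + routed_output[..., :hidden_size]
    # No shared expert (e.g. gpt-oss): keep the pre-#40794 behavior and leave
    # the padded output untouched here; truncation happens later, after the
    # reduce.
    return routed_output
```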
tomeras91 left a comment
After making sure truncation is applied if either shared experts or a routed_output_transform is used, LGTM
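A sketch of the condition this review asks for; `shared_output` and `routed_output_transform` are illustrative parameter names, not necessarily the exact attributes used in vLLM:

```python
def needs_early_truncation(shared_output, routed_output_transform) -> bool:
    # Truncate the padded routed output early only when something downstream
    # consumes it at the unpadded hidden size: a shared-expert add or a
    # routed-output transform (the latent-MoE up-projection).
    return shared_output is not None or routed_output_transform is not None
```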
…uted output transform (vllm-project#40865) Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com> Co-authored-by: Jee Jee Li <pandaleefree@gmail.com>
Only trim the padded routed output before the shared+routed add / latent MoE up-projection. MoE without shared experts keeps the late truncation.
Not a duplicate: #40853 (a draft auto-PR) reverts #40794; this PR preserves the shared-output fix.
I reproduced the following error locally before this PR's change, and the test passes with the change:
The PR skips fused truncation when there is no shared expert (or latent MoE up-projection), as is the case with GPT-OSS. This restores GPT-OSS behavior to the state before #40794, while preserving #40794's intent for Nemotron-Nano-v3: applying truncation before addition, which was broken by #35949.
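For illustration only, a minimal example of the shape mismatch this ordering avoids; the sizes below are made up, not actual GPT-OSS or Nemotron dimensions:

```python
import torch

# Made-up sizes: the fused MoE kernel pads the hidden dim, the shared expert
# does not.
hidden_size, padded_hidden_size, num_tokens = 4096, 4224, 8

routed = torch.randn(num_tokens, padded_hidden_size)  # padded routed output
shared = torch.randn(num_tokens, hidden_size)         # shared expert output

# `shared + routed` would raise a size-mismatch RuntimeError; trimming the
# routed output first makes the add well-defined.
out = shared + routed[..., :hidden_size]
assert out.shape == (num_tokens, hidden_size)
```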
There may still be an open Nemotron-Nano-v3 FI NVFP4 bug under investigation. This PR, like the previous one, does not attempt to fix that; it only aims to keep GPT-OSS working and make Nemotron-Nano-v3 TP=1 functional.
AI assistance was used.