
[MoE Refactor] Move the shared/fused expert output sum into MoERunnerBase#35949

Merged
robertgshaw2-redhat merged 126 commits into vllm-project:main from neuralmagic:moe-runner-5
Apr 20, 2026

Conversation

@bnellnm bnellnm commented Mar 4, 2026

Purpose

  • Move the summation of the shared and fused expert outputs into MoERunnerBase and clean up the final shared/fused all-reduce code. Remove the corresponding code from models that use SharedFusedMoE. (A rough sketch of the new control flow follows this list.)
  • Move the final TP reduce code into MoERunnerBase.

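A minimal sketch of the intended control flow, assuming hypothetical hook names (only MoERunnerBase, SharedFusedMoE, and _maybe_reduce_final_output appear in this PR; everything else below is illustrative):

```python
import torch

class MoERunnerBase:
    """Sketch only; not the actual vLLM class or signatures."""

    def __init__(self, tp_size: int = 1):
        self.tp_size = tp_size

    def forward(self, hidden_states: torch.Tensor,
                router_logits: torch.Tensor) -> torch.Tensor:
        # _run_experts is a hypothetical hook standing in for the real
        # dispatch/combine path; shared_out is None without shared experts.
        shared_out, fused_out = self._run_experts(hidden_states, router_logits)
        # (1) The shared + fused expert sum now happens here, so models
        #     using SharedFusedMoE no longer do it themselves.
        if shared_out is not None:
            fused_out = fused_out + shared_out
        # (2) The final TP reduce also lives here now.
        return self._maybe_reduce_final_output(fused_out)

    def _maybe_reduce_final_output(self, out: torch.Tensor) -> torch.Tensor:
        if self.tp_size > 1 and torch.distributed.is_initialized():
            torch.distributed.all_reduce(out)
        return out

    def _run_experts(self, hidden_states, router_logits):
        raise NotImplementedError  # provided by concrete runners
```
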
Test Plan

CI MoE refactor tests

Run gsm8k evals on the following models (a representative invocation is sketched after the list):

  • arcee-ai/Trinity-Mini
  • deepseek-ai/DeepSeek-R1
  • google/gemma-4-26B-A4B-it
  • zai-org/GLM-4.7-Flash
  • openai/gpt-oss-20b
  • ibm-granite/granite-4.0-h-small
  • ai21labs/AI21-Jamba2-Mini
  • LiquidAI/LFM2.5-350M
  • meta-llama/Llama-4-Scout-17B-16E-Instruct
  • MiniMaxAI/MiniMax-M2.7
  • mistralai/Mixtral-8x7B-v0.1
  • nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4
  • allenai/OLMoE-1B-7B-0125-Instruct
  • microsoft/Phi-tiny-MoE-instruct
  • sarvamai/sarvam-30b
  • stepfun-ai/Step-3.5-Flash
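
The exact eval command is not included above; a representative gsm8k run through lm-evaluation-harness's Python API might look like the following (model name and TP size are placeholders):

```python
# Hypothetical invocation; the PR does not specify the exact eval arguments.
import lm_eval

results = lm_eval.simple_evaluate(
    model="vllm",
    model_args="pretrained=mistralai/Mixtral-8x7B-v0.1,tensor_parallel_size=2",
    tasks=["gsm8k"],
    num_fewshot=5,
    batch_size="auto",
)
print(results["results"]["gsm8k"])
```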

Test Result

| Model | Baseline | PR |
|-------|---------:|---:|
| arcee-ai/Trinity-Mini | 0.8408 | 0.8476 |
| deepseek-ai/DeepSeek-R1 | 0.9492 | 0.9522 |
| google/gemma-4-26B-A4B-it | 0.3017 | 0.3207 |
| zai-org/GLM-4.7-Flash | 0.8241 | 0.8393 |
| openai/gpt-oss-20b | 0.3154 | 0.3002 |
| ibm-granite/granite-4.0-h-small | 0.8400 | 0.8438 |
| ai21labs/AI21-Jamba2-Mini | 0.7665 | 0.7635 |
| LiquidAI/LFM2.5-350M | 0.2092 | 0.2153 |
| meta-llama/Llama-4-Scout-17B-16E-Instruct | TIMEOUT | TIMEOUT |
| MiniMaxAI/MiniMax-M2.7 | 0.9249 | 0.9242 |
| mistralai/Mixtral-8x7B-v0.1 | 0.5512 | 0.5701 |
| nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 | 0.9295 | 0.9333 |
| allenai/OLMoE-1B-7B-0125-Instruct | 0.6770 | 0.6907 |
| microsoft/Phi-tiny-MoE-instruct | 0.7020 | 0.7028 |
| sarvamai/sarvam-30b | 0.6588 | 0.6603 |
| stepfun-ai/Step-3.5-Flash | FAIL | FAIL |
| baidu/ERNIE-4.5-VL-28B-A3B-Thinking | 0.7506 | 0.7536 |

lm-eval RedHatAI/DeepSeek-Coder-V2-Lite-Instruct-FP8 with SP

main:

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.7551|±  |0.0118|
|     |       |strict-match    |     5|exact_match|↑  |0.7407|±  |0.0121|

PR:

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.7582|±  |0.0118|
|     |       |strict-match    |     5|exact_match|↑  |0.7392|±  |0.0121|

cc @robertgshaw2-redhat, @tlrmchlsmth, @yzong-rh


Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

@mergify mergify Bot added deepseek Related to DeepSeek models llama Related to Llama models qwen Related to Qwen models labels Mar 4, 2026
mergify Bot commented Mar 4, 2026

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @bnellnm.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork


@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request introduces a significant and well-designed refactoring of the Mixture-of-Experts (MoE) implementation. The introduction of the MoERunner abstraction effectively separates the MoE execution logic from the layer definition, improving modularity and maintainability. The changes across various model files to adapt to this new abstraction are consistent and simplify the model-specific code. The addition of support for zero experts and the corresponding tests are also valuable.

However, I've identified a critical issue in vllm/model_executor/models/transformers/moe.py where a removed parameter is still being passed, which will lead to a runtime error. Additionally, there's a potential regression in the same file regarding the detection of shared experts, which now relies solely on parameter names instead of also checking the model configuration. Please see the detailed comments for suggestions on how to address these issues.
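
As an illustration of the second point, a config-aware check might look like this (every name here is hypothetical; the real attribute and parameter names depend on the model):

```python
# Hypothetical sketch: consult the model config as well as parameter-name
# heuristics when deciding whether a layer has shared experts.
def has_shared_experts(config, param_names) -> bool:
    by_config = bool(getattr(config, "n_shared_experts", 0) or 0)
    by_params = any("shared_expert" in name for name in param_names)
    return by_config or by_params
```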

Note: Security Review did not run due to the size of the PR.

Comment thread vllm/model_executor/models/transformers/moe.py Outdated
Comment thread vllm/model_executor/models/transformers/moe.py Outdated
bnellnm added 6 commits March 18, 2026 16:48
Signed-off-by: Bill Nell <bnell@redhat.com>
Signed-off-by: Bill Nell <bnell@redhat.com>
Signed-off-by: Bill Nell <bnell@redhat.com>
Signed-off-by: Bill Nell <bnell@redhat.com>
Signed-off-by: Bill Nell <bnell@redhat.com>
Signed-off-by: Bill Nell <bnell@redhat.com>
bnellnm added 2 commits April 17, 2026 03:07
Signed-off-by: Bill Nell <bnell@redhat.com>
…er place

Signed-off-by: Bill Nell <bnell@redhat.com>
Comment thread vllm/model_executor/layers/fused_moe/runner/moe_runner_base.py Outdated
bnellnm added 2 commits April 17, 2026 19:01
Signed-off-by: Bill Nell <bnell@redhat.com>
Signed-off-by: Bill Nell <bnell@redhat.com>
@github-project-automation github-project-automation Bot moved this from In progress to Ready in gpt-oss Issues & Enhancements Apr 20, 2026
@github-project-automation github-project-automation Bot moved this from In review to Ready in NVIDIA Apr 20, 2026
@robertgshaw2-redhat robertgshaw2-redhat merged commit 726efe1 into vllm-project:main Apr 20, 2026
76 checks passed
@github-project-automation github-project-automation Bot moved this from Ready to Done in NVIDIA Apr 20, 2026
bnellnm added a commit to neuralmagic/vllm that referenced this pull request Apr 20, 2026
…Base (vllm-project#35949)

Signed-off-by: Bill Nell <bnell@redhat.com>
Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com>
baonudesifeizhai pushed a commit to baonudesifeizhai/vllm that referenced this pull request Apr 23, 2026
…Base (vllm-project#35949)

Signed-off-by: Bill Nell <bnell@redhat.com>
Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com>
yzong-rh pushed a commit to yzong-rh/vllm that referenced this pull request Apr 23, 2026
…Base (vllm-project#35949)

Signed-off-by: Bill Nell <bnell@redhat.com>
Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com>
Signed-off-by: Yifan <yzong@redhat.com>
guxin108 pushed a commit to guxin108/vllm-ascend that referenced this pull request Apr 24, 2026
### What this PR does / why we need it?
For the fusedmoe:
vllm-project/vllm#33049
vllm-project/vllm#35949
FusedMoe refactor

For the qwen3_vl:
vllm-project/vllm#34539
A new Triton kernel has been added for fast RoPE position encoding. I've
added a patch to fall back to the native implementation. We'll consider
registering custom operators and implementing it on Ascend later.

vllm-project/vllm#38361

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version:
- vLLM main:
vllm-project/vllm@29e4870

---------

Signed-off-by: wangli <wangli858794774@gmail.com>
Signed-off-by: guxin108 <1252896542@qq.com>
tomeras91 pushed a commit that referenced this pull request Apr 24, 2026
] (#40794)

Signed-off-by: Netanel Haber <nhaber@nvidia.com>
@bnellnm bnellnm deleted the moe-runner-5 branch April 24, 2026 19:40
hnt2601 pushed a commit to hnt2601/vllm that referenced this pull request Apr 25, 2026
avinashsingh77 pushed a commit to avinashsingh77/vllm that referenced this pull request Apr 27, 2026
…Base (vllm-project#35949)

Signed-off-by: Bill Nell <bnell@redhat.com>
Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com>
Signed-off-by: Avinash Singh <avinashsingh.rcoem@gmail.com>
avinashsingh77 pushed a commit to avinashsingh77/vllm that referenced this pull request Apr 27, 2026
…m-project#35949] (vllm-project#40794)

Signed-off-by: Netanel Haber <nhaber@nvidia.com>
Signed-off-by: Avinash Singh <avinashsingh.rcoem@gmail.com>
zouyida2052 pushed a commit to zouyida2052/vllm-ascend that referenced this pull request Apr 28, 2026
iboiko-habana pushed a commit to vllm-project/vllm-gaudi that referenced this pull request Apr 29, 2026
…stream breakages: NIXL connector, TpKVTopology rename, MoE refactor, transformers v5 (#1377)

## Summary

Compatibility fixes for vLLM bump to `3975eb6de6`. Addresses breakages
from multiple upstream PRs affecting NIXL connectors, MoE runner
refactor, offloading tests, Qwen3 MoE models, and transformers v5
upgrade.

## Root Cause

1. **NIXL import gate** — Upstream PR
vllm-project/vllm#39529 (commit `cc3993b05d`)
moved NIXL imports to `vllm/distributed/nixl_utils.py` and changed the
platform gate from `if not is_rocm()` to `if is_cuda()`. HPU is neither
CUDA nor ROCm, so it falls into the `else` branch → tries `rixl._api`
(ROCm-only) → fails → `NixlWrapper = None` → `RuntimeError("NIXL is not
available")`.

2. **TpKVTopology rename** — Same upstream PR #39529 unified
`TpKVTopology` + `HeteroTPTransferConfig` into `TransferTopology`,
breaking vllm-gaudi NIXL connector imports.

3. **Offloading tests** — Upstream PR
vllm-project/vllm#36645 changed
`OffloadingManager.lookup()` API.

4. **MoE runner refactor** — Upstream PR
vllm-project/vllm#35949 (commit `726efe177b`)
moved reduce logic into `MoERunnerBase`, removing `reduce_results`,
renaming `forward_dispatch` → `_forward_dispatch`, `forward_entry` →
`_forward_entry`, `_maybe_reduce_output` → `_maybe_reduce_final_output`.
Follow-up PR moved `MoERunnerBase` and `get_layer_from_name` to
`moe_runner_base.py`.

5. **Qwen3 MoE** — `SharedFusedMoE` returns a combined tensor (not a
tuple), and the MoE runner now handles TP reduction internally, causing a
double reduce in `qwen3_moe.py` / `qwen3_next.py` (a toy sketch of the
hazard follows this list).

6. **Transformers v5 — granite tokenizer** — Upstream PR
vllm-project/vllm#30566 updated transformers to
allow v5. GPT2Tokenizer in v5 now respects `add_bos_token=True`
(silently ignored in v4), causing degenerate outputs and 0.0 GSM8K
accuracy on granite models.

7. **Transformers v5.6.x — DeepSeek-V2-Lite tokenizer** — In
transformers v5.6.x, `LlamaTokenizerFast` was unified into
`LlamaTokenizer`, which does not apply the ByteLevel BPE decoder
declared in `tokenizer.json`. DeepSeek-V2-Lite-Chat's tokenizer decoding
strips all spaces (Ġ chars not converted back), producing garbled output
and 0.0 accuracy on GSM8K. Fixed natively in transformers v5.7.0.
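
As a toy illustration of the double-reduce hazard in items 4 and 5 (names assumed; this is not the real vLLM code):

```python
import torch

def moe_runner_forward(hidden: torch.Tensor) -> torch.Tensor:
    out = hidden * 2.0  # stand-in for the expert computation
    # After vllm-project#35949, the runner performs the TP reduction itself.
    if torch.distributed.is_initialized():
        torch.distributed.all_reduce(out)
    return out

def model_forward(hidden: torch.Tensor) -> torch.Tensor:
    # Correct: use the runner output as-is. A second all_reduce here would
    # count every rank's contribution twice, which is the bug in item 5.
    return moe_runner_forward(hidden)
```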

## Fix

1. **NIXL import patch**: Add `patch_nixl_utils_for_hpu()` in
`register_utils()` to monkey-patch `vllm.distributed.nixl_utils` so that,
on HPU, it imports from `nixl._api` instead of `rixl._api`. Update
`hetero_hpu_nixl_connector.py` to import from `vllm.distributed.nixl_utils`
instead of the hardcoded `nixl._api`. (A hypothetical shape of this patch
is sketched after the list.)
2. **TpKVTopology → TransferTopology**: Rename in NIXL connector imports
and monkey-patches.
3. **Offloading tests**: Replace `runner.manager.lookup.return_value`
with `connector_scheduler._maximal_prefix_lookup`.
4. **MoE refactor**: Update imports (`MoERunnerBase` from
`moe_runner_base`), method names (`_forward_dispatch`, `_forward_entry`,
`_maybe_reduce_final_output`), remove dead `reduce_results` /
`reduce_output()`.
5. **Qwen3 MoE**: Remove incorrect shared_expert tuple indexing and
double TP reduction.
6. **Transformers v5 — granite**: Remove hardcoded `add_bos_token=True`
from lm-eval model_args to fix GSM8K accuracy regression.
7. **Transformers v5.6.x — DeepSeek-V2-Lite**: Exclude `transformers
5.6.*` in `requirements.txt` to prevent installation of versions with
broken ByteLevel BPE tokenizer decoding. Verified on Gaudi2: gsm8k
accuracy 0.65 (expected 0.66, within tolerance) with transformers 5.7.0.
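
A hypothetical shape of the fix-1 monkey-patch, assuming only what the summary above states (the attribute looked up on `nixl._api` is a guess):

```python
import importlib

def patch_nixl_utils_for_hpu() -> None:
    # Re-point the wrapper at nixl._api on HPU, where upstream's platform
    # gate would otherwise attempt the ROCm-only rixl._api import.
    nixl_utils = importlib.import_module("vllm.distributed.nixl_utils")
    nixl_api = importlib.import_module("nixl._api")
    nixl_utils.NixlWrapper = getattr(nixl_api, "nixl_agent", None)
```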

---------

Signed-off-by: Paweł Olejniczak <pawelx.olejniczak@intel.com>

Labels

deepseek Related to DeepSeek models gpt-oss Related to GPT-OSS models llama Related to Llama models nvidia qwen Related to Qwen models ready ONLY add when PR is ready to merge/full CI is needed

Projects

Status: Done


2 participants