
[MoE Refactor] Move the shared/fused expert output sum into MoERunnerBase#35949

Merged
robertgshaw2-redhat merged 126 commits into vllm-project:main from neuralmagic:moe-runner-5
Apr 20, 2026

Conversation

@bnellnm bnellnm commented Mar 4, 2026

Purpose

  • Move the summation of the shared and fused expert outputs into MoERunnerBase and clean up the final shared/fused all-reduce code. Remove the corresponding code from models that use SharedFusedMoE. (A rough sketch of the new control flow follows this list.)
  • Move the final TP reduce code into MoERunnerBase.

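A minimal sketch of the intended control flow, assuming hypothetical hook names (only MoERunnerBase, SharedFusedMoE, and _maybe_reduce_final_output appear in this PR; everything else below is illustrative):

```python
import torch

class MoERunnerBase:
    """Sketch only; not the actual vLLM class or signatures."""

    def __init__(self, tp_size: int = 1):
        self.tp_size = tp_size

    def forward(self, hidden_states: torch.Tensor,
                router_logits: torch.Tensor) -> torch.Tensor:
        # _run_experts is a hypothetical hook standing in for the real
        # dispatch/combine path; shared_out is None without shared experts.
        shared_out, fused_out = self._run_experts(hidden_states, router_logits)
        # (1) The shared + fused expert sum now happens here, so models
        #     using SharedFusedMoE no longer do it themselves.
        if shared_out is not None:
            fused_out = fused_out + shared_out
        # (2) The final TP reduce also lives here now.
        return self._maybe_reduce_final_output(fused_out)

    def _maybe_reduce_final_output(self, out: torch.Tensor) -> torch.Tensor:
        if self.tp_size > 1 and torch.distributed.is_initialized():
            torch.distributed.all_reduce(out)
        return out

    def _run_experts(self, hidden_states, router_logits):
        raise NotImplementedError  # provided by concrete runners
```
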
Test Plan

CI MoE refactor tests

Run gsm8k evals on the following models (a representative invocation is sketched after the list):

  • arcee-ai/Trinity-Mini
  • deepseek-ai/DeepSeek-R1
  • google/gemma-4-26B-A4B-it
  • zai-org/GLM-4.7-Flash
  • openai/gpt-oss-20b
  • ibm-granite/granite-4.0-h-small
  • ai21labs/AI21-Jamba2-Mini
  • LiquidAI/LFM2.5-350M
  • meta-llama/Llama-4-Scout-17B-16E-Instruct
  • MiniMaxAI/MiniMax-M2.7
  • mistralai/Mixtral-8x7B-v0.1
  • nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4
  • allenai/OLMoE-1B-7B-0125-Instruct
  • microsoft/Phi-tiny-MoE-instruct
  • sarvamai/sarvam-30b
  • stepfun-ai/Step-3.5-Flash
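
The exact eval command is not included above; a representative gsm8k run through lm-evaluation-harness's Python API might look like the following (model name and TP size are placeholders):

```python
# Hypothetical invocation; the PR does not specify the exact eval arguments.
import lm_eval

results = lm_eval.simple_evaluate(
    model="vllm",
    model_args="pretrained=mistralai/Mixtral-8x7B-v0.1,tensor_parallel_size=2",
    tasks=["gsm8k"],
    num_fewshot=5,
    batch_size="auto",
)
print(results["results"]["gsm8k"])
```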

Test Result

| Model | Baseline | PR |
|-------|---------:|---:|
| arcee-ai/Trinity-Mini | 0.8408 | 0.8476 |
| deepseek-ai/DeepSeek-R1 | 0.9492 | 0.9522 |
| google/gemma-4-26B-A4B-it | 0.3017 | 0.3207 |
| zai-org/GLM-4.7-Flash | 0.8241 | 0.8393 |
| openai/gpt-oss-20b | 0.3154 | 0.3002 |
| ibm-granite/granite-4.0-h-small | 0.8400 | 0.8438 |
| ai21labs/AI21-Jamba2-Mini | 0.7665 | 0.7635 |
| LiquidAI/LFM2.5-350M | 0.2092 | 0.2153 |
| meta-llama/Llama-4-Scout-17B-16E-Instruct | TIMEOUT | TIMEOUT |
| MiniMaxAI/MiniMax-M2.7 | 0.9249 | 0.9242 |
| mistralai/Mixtral-8x7B-v0.1 | 0.5512 | 0.5701 |
| nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 | 0.9295 | 0.9333 |
| allenai/OLMoE-1B-7B-0125-Instruct | 0.6770 | 0.6907 |
| microsoft/Phi-tiny-MoE-instruct | 0.7020 | 0.7028 |
| sarvamai/sarvam-30b | 0.6588 | 0.6603 |
| stepfun-ai/Step-3.5-Flash | FAIL | FAIL |
| baidu/ERNIE-4.5-VL-28B-A3B-Thinking | 0.7506 | 0.7536 |

lm-eval RedHatAI/DeepSeek-Coder-V2-Lite-Instruct-FP8 with SP

main:

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.7551|±  |0.0118|
|     |       |strict-match    |     5|exact_match|↑  |0.7407|±  |0.0121|

PR:

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.7582|±  |0.0118|
|     |       |strict-match    |     5|exact_match|↑  |0.7392|±  |0.0121|

cc @robertgshaw2-redhat, @tlrmchlsmth, @yzong-rh


Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

@mergify mergify Bot added deepseek Related to DeepSeek models llama Related to Llama models qwen Related to Qwen models labels Mar 4, 2026
mergify Bot commented Mar 4, 2026

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @bnellnm.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork


@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request introduces a significant and well-designed refactoring of the Mixture-of-Experts (MoE) implementation. The introduction of the MoERunner abstraction effectively separates the MoE execution logic from the layer definition, improving modularity and maintainability. The changes across various model files to adapt to this new abstraction are consistent and simplify the model-specific code. The addition of support for zero experts and the corresponding tests are also valuable.

However, I've identified a critical issue in vllm/model_executor/models/transformers/moe.py where a removed parameter is still being passed, which will lead to a runtime error. Additionally, there's a potential regression in the same file regarding the detection of shared experts, which now relies solely on parameter names instead of also checking the model configuration. Please see the detailed comments for suggestions on how to address these issues.
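
As an illustration of the second point, a config-aware check might look like this (every name here is hypothetical; the real attribute and parameter names depend on the model):

```python
# Hypothetical sketch: consult the model config as well as parameter-name
# heuristics when deciding whether a layer has shared experts.
def has_shared_experts(config, param_names) -> bool:
    by_config = bool(getattr(config, "n_shared_experts", 0) or 0)
    by_params = any("shared_expert" in name for name in param_names)
    return by_config or by_params
```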

Note: Security Review did not run due to the size of the PR.

Comment thread vllm/model_executor/models/transformers/moe.py Outdated
Comment thread vllm/model_executor/models/transformers/moe.py Outdated
bnellnm added 6 commits March 18, 2026 16:48
Signed-off-by: Bill Nell <bnell@redhat.com>
Signed-off-by: Bill Nell <bnell@redhat.com>
Signed-off-by: Bill Nell <bnell@redhat.com>
Signed-off-by: Bill Nell <bnell@redhat.com>
Signed-off-by: Bill Nell <bnell@redhat.com>
Signed-off-by: Bill Nell <bnell@redhat.com>
bnellnm added 2 commits April 17, 2026 03:07
Signed-off-by: Bill Nell <bnell@redhat.com>
…er place

Signed-off-by: Bill Nell <bnell@redhat.com>
Comment thread vllm/model_executor/layers/fused_moe/runner/moe_runner_base.py Outdated
bnellnm added 2 commits April 17, 2026 19:01
Signed-off-by: Bill Nell <bnell@redhat.com>
Signed-off-by: Bill Nell <bnell@redhat.com>
@github-project-automation github-project-automation Bot moved this from In progress to Ready in gpt-oss Issues & Enhancements Apr 20, 2026
@github-project-automation github-project-automation Bot moved this from In review to Ready in NVIDIA Apr 20, 2026
@robertgshaw2-redhat robertgshaw2-redhat merged commit 726efe1 into vllm-project:main Apr 20, 2026
76 checks passed
@github-project-automation github-project-automation Bot moved this from Ready to Done in NVIDIA Apr 20, 2026
bnellnm added a commit to neuralmagic/vllm that referenced this pull request Apr 20, 2026
…Base (vllm-project#35949)

Signed-off-by: Bill Nell <bnell@redhat.com>
Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com>
baonudesifeizhai pushed a commit to baonudesifeizhai/vllm that referenced this pull request Apr 23, 2026
…Base (vllm-project#35949)

Signed-off-by: Bill Nell <bnell@redhat.com>
Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com>
yzong-rh pushed a commit to yzong-rh/vllm that referenced this pull request Apr 23, 2026
…Base (vllm-project#35949)

Signed-off-by: Bill Nell <bnell@redhat.com>
Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com>
Signed-off-by: Yifan <yzong@redhat.com>
guxin108 pushed a commit to guxin108/vllm-ascend that referenced this pull request Apr 24, 2026
### What this PR does / why we need it?
For the fusedmoe:
vllm-project/vllm#33049
vllm-project/vllm#35949
FusedMoe refactor

For the qwen3_vl:
vllm-project/vllm#34539
A new Triton kernel has been added for fast RoPE position encoding. I've
added a patch to fall back to the native implementation. We'll consider
registering custom operators and implementing it on Ascend later.

vllm-project/vllm#38361

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version:
- vLLM main:
vllm-project/vllm@29e4870

---------

Signed-off-by: wangli <wangli858794774@gmail.com>
Signed-off-by: guxin108 <1252896542@qq.com>
tomeras91 pushed a commit that referenced this pull request Apr 24, 2026
] (#40794)

Signed-off-by: Netanel Haber <nhaber@nvidia.com>
@bnellnm bnellnm deleted the moe-runner-5 branch April 24, 2026 19:40
hnt2601 pushed a commit to hnt2601/vllm that referenced this pull request Apr 25, 2026
avinashsingh77 pushed a commit to avinashsingh77/vllm that referenced this pull request Apr 27, 2026
…Base (vllm-project#35949)

Signed-off-by: Bill Nell <bnell@redhat.com>
Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com>
Signed-off-by: Avinash Singh <avinashsingh.rcoem@gmail.com>
avinashsingh77 pushed a commit to avinashsingh77/vllm that referenced this pull request Apr 27, 2026
…m-project#35949] (vllm-project#40794)

Signed-off-by: Netanel Haber <nhaber@nvidia.com>
Signed-off-by: Avinash Singh <avinashsingh.rcoem@gmail.com>
zouyida2052 pushed a commit to zouyida2052/vllm-ascend that referenced this pull request Apr 28, 2026
iboiko-habana pushed a commit to vllm-project/vllm-gaudi that referenced this pull request Apr 29, 2026
…stream breakages: NIXL connector, TpKVTopology rename, MoE refactor, transformers v5 (#1377)

## Summary

Compatibility fixes for vLLM bump to `3975eb6de6`. Addresses breakages
from multiple upstream PRs affecting NIXL connectors, MoE runner
refactor, offloading tests, Qwen3 MoE models, and transformers v5
upgrade.

## Root Cause

1. **NIXL import gate** — Upstream PR
vllm-project/vllm#39529 (commit `cc3993b05d`)
moved NIXL imports to `vllm/distributed/nixl_utils.py` and changed the
platform gate from `if not is_rocm()` to `if is_cuda()`. HPU is neither
CUDA nor ROCm, so it falls into the `else` branch → tries `rixl._api`
(ROCm-only) → fails → `NixlWrapper = None` → `RuntimeError("NIXL is not
available")`.

2. **TpKVTopology rename** — Same upstream PR #39529 unified
`TpKVTopology` + `HeteroTPTransferConfig` into `TransferTopology`,
breaking vllm-gaudi NIXL connector imports.

3. **Offloading tests** — Upstream PR
vllm-project/vllm#36645 changed
`OffloadingManager.lookup()` API.

4. **MoE runner refactor** — Upstream PR
vllm-project/vllm#35949 (commit `726efe177b`)
moved reduce logic into `MoERunnerBase`, removing `reduce_results`,
renaming `forward_dispatch` → `_forward_dispatch`, `forward_entry` →
`_forward_entry`, `_maybe_reduce_output` → `_maybe_reduce_final_output`.
Follow-up PR moved `MoERunnerBase` and `get_layer_from_name` to
`moe_runner_base.py`.

5. **Qwen3 MoE** — `SharedFusedMoE` returns a combined tensor (not a
tuple), and the MoE runner now handles TP reduction internally, causing a
double reduce in `qwen3_moe.py` / `qwen3_next.py` (a toy sketch of the
hazard follows this list).

6. **Transformers v5 — granite tokenizer** — Upstream PR
vllm-project/vllm#30566 updated transformers to
allow v5. GPT2Tokenizer in v5 now respects `add_bos_token=True`
(silently ignored in v4), causing degenerate outputs and 0.0 GSM8K
accuracy on granite models.

7. **Transformers v5.6.x — DeepSeek-V2-Lite tokenizer** — In
transformers v5.6.x, `LlamaTokenizerFast` was unified into
`LlamaTokenizer`, which does not apply the ByteLevel BPE decoder
declared in `tokenizer.json`. DeepSeek-V2-Lite-Chat's tokenizer decoding
strips all spaces (Ġ chars not converted back), producing garbled output
and 0.0 accuracy on GSM8K. Fixed natively in transformers v5.7.0.
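
As a toy illustration of the double-reduce hazard in items 4 and 5 (names assumed; this is not the real vLLM code):

```python
import torch

def moe_runner_forward(hidden: torch.Tensor) -> torch.Tensor:
    out = hidden * 2.0  # stand-in for the expert computation
    # After vllm-project#35949, the runner performs the TP reduction itself.
    if torch.distributed.is_initialized():
        torch.distributed.all_reduce(out)
    return out

def model_forward(hidden: torch.Tensor) -> torch.Tensor:
    # Correct: use the runner output as-is. A second all_reduce here would
    # count every rank's contribution twice, which is the bug in item 5.
    return moe_runner_forward(hidden)
```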

## Fix

1. **NIXL import patch**: Add `patch_nixl_utils_for_hpu()` in
`register_utils()` to monkey-patch `vllm.distributed.nixl_utils` so that,
on HPU, it imports from `nixl._api` instead of `rixl._api`. Update
`hetero_hpu_nixl_connector.py` to import from `vllm.distributed.nixl_utils`
instead of the hardcoded `nixl._api`. (A hypothetical shape of this patch
is sketched after the list.)
2. **TpKVTopology → TransferTopology**: Rename in NIXL connector imports
and monkey-patches.
3. **Offloading tests**: Replace `runner.manager.lookup.return_value`
with `connector_scheduler._maximal_prefix_lookup`.
4. **MoE refactor**: Update imports (`MoERunnerBase` from
`moe_runner_base`), method names (`_forward_dispatch`, `_forward_entry`,
`_maybe_reduce_final_output`), remove dead `reduce_results` /
`reduce_output()`.
5. **Qwen3 MoE**: Remove incorrect shared_expert tuple indexing and
double TP reduction.
6. **Transformers v5 — granite**: Remove hardcoded `add_bos_token=True`
from lm-eval model_args to fix GSM8K accuracy regression.
7. **Transformers v5.6.x — DeepSeek-V2-Lite**: Exclude `transformers
5.6.*` in `requirements.txt` to prevent installation of versions with
broken ByteLevel BPE tokenizer decoding. Verified on Gaudi2: gsm8k
accuracy 0.65 (expected 0.66, within tolerance) with transformers 5.7.0.
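
A hypothetical shape of the fix-1 monkey-patch, assuming only what the summary above states (the attribute looked up on `nixl._api` is a guess):

```python
import importlib

def patch_nixl_utils_for_hpu() -> None:
    # Re-point the wrapper at nixl._api on HPU, where upstream's platform
    # gate would otherwise attempt the ROCm-only rixl._api import.
    nixl_utils = importlib.import_module("vllm.distributed.nixl_utils")
    nixl_api = importlib.import_module("nixl._api")
    nixl_utils.NixlWrapper = getattr(nixl_api, "nixl_agent", None)
```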

---------

Signed-off-by: Paweł Olejniczak <pawelx.olejniczak@intel.com>

Labels

deepseek Related to DeepSeek models gpt-oss Related to GPT-OSS models llama Related to Llama models nvidia qwen Related to Qwen models ready ONLY add when PR is ready to merge/full CI is needed

Projects

Status: Done


2 participants