[MoE Refactor] Refactor ZeroExpertFusedMoE into new framework #35549

Merged

robertgshaw2-redhat merged 93 commits into vllm-project:main from neuralmagic:moe-runner-4 on Apr 14, 2026

Conversation

@bnellnm (Collaborator) commented Feb 27, 2026

Purpose

Remove the ZeroExpertFusedMoE class and move its functionality into the FusedMoE, MoERunnerBase, and new ZeroExpertRouter classes.

Based on #35326.

cc @baonudesifeizhai, @OftenDream, @yzong-rh
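
For readers new to the concept, here is a minimal, self-contained sketch of the zero-expert routing idea. All names and signatures below are hypothetical stand-ins for illustration, not vLLM's actual classes: logical expert ids beyond the physical expert count act as "zero experts" whose contribution is handled outside the fused kernels.

```python
import torch

class ZeroExpertRouter:
    """Sketch of zero-expert routing (hypothetical names; the real
    ZeroExpertRouter lives in vLLM's fused_moe layer and differs).

    Logical expert ids at or beyond the number of physical experts are
    "zero experts": tokens routed to them skip the fused-MoE kernels and
    instead contribute an identity pass-through (or nothing, depending
    on the configured zero-expert type).
    """

    def __init__(self, num_physical_experts: int, num_logical_experts: int):
        assert num_logical_experts >= num_physical_experts
        self.num_physical = num_physical_experts
        self.num_logical = num_logical_experts

    def route(self, router_logits: torch.Tensor, top_k: int):
        # Standard top-k selection over all logical experts.
        weights, ids = torch.topk(router_logits.float(), top_k, dim=-1)
        weights = torch.softmax(weights, dim=-1)
        # Flag assignments that landed on zero experts; the caller
        # handles these outside the expert kernels.
        zero_mask = ids >= self.num_physical
        return weights, ids, zero_mask


def apply_zero_experts(hidden: torch.Tensor, weights: torch.Tensor,
                       zero_mask: torch.Tensor) -> torch.Tensor:
    # Identity-style zero experts: each zero-expert slot adds the token's
    # own hidden state, scaled by its routing weight.
    zero_weight = (weights * zero_mask).sum(dim=-1, keepdim=True)
    return hidden * zero_weight
```

Factoring this into a router, rather than a FusedMoE subclass, is what allows the zero-expert path to compose with any runner, which is the modularity the review below highlights.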

Test Plan

Added new tests for zero experts.
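
The actual tests are in the PR diff; as illustration only, here is a hedged sketch of the kind of property such a test can check, reusing the hypothetical ZeroExpertRouter sketched above:

```python
import torch

def test_zero_expert_routing_flags_zero_experts():
    router = ZeroExpertRouter(num_physical_experts=4, num_logical_experts=6)
    logits = torch.randn(8, 6)  # 8 tokens, 6 logical experts
    weights, ids, zero_mask = router.route(logits, top_k=2)
    # Every assignment to a logical id >= 4 must be flagged as a zero expert.
    assert torch.equal(zero_mask, ids >= 4)
    # Routing weights form a valid distribution over the selected experts.
    assert torch.allclose(weights.sum(dim=-1), torch.ones(8))
```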

Test Result

cc @yzong-rh


Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing a test command.
  • The test results, such as a before/after results comparison or e2e results.
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

@mergify mergify Bot added the nvidia label Feb 27, 2026
@mergify Bot (Contributor) commented Feb 27, 2026

This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @bnellnm.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@gemini-code-assist Bot (Contributor) left a comment

Code Review

This pull request is a major and well-executed refactoring of the Mixture of Experts (MoE) implementation. It successfully removes the ZeroExpertFusedMoE class by introducing a more modular design with a new ZeroExpertRouter, a MoERunner abstraction, and a dedicated SharedExperts class. This significantly improves the structure and extensibility of the MoE framework. The changes are consistent across the codebase and are supported by a comprehensive new test suite for the zero-expert functionality. I've identified one critical issue in the new ChunkingMoERunner that could cause a crash when handling empty inputs, and I've provided a suggestion to fix it. Overall, this is an excellent refactoring effort.

Comment thread: vllm/model_executor/layers/fused_moe/runner/chunking_moe_runner.py (outdated)
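
As illustration of the flagged failure mode, a minimal sketch of an empty-input guard in a chunked MoE forward (hypothetical helper, not the actual ChunkingMoERunner code):

```python
import torch

def run_chunked(hidden_states: torch.Tensor, chunk_size: int, run_chunk):
    """Run a MoE forward in fixed-size chunks to bound activation memory.

    With zero input tokens, num_chunks would be 0 and torch.chunk would
    raise, so the empty case is short-circuited up front.
    """
    num_tokens = hidden_states.shape[0]
    if num_tokens == 0:
        # Empty batch: nothing to compute; return an empty result directly.
        return torch.empty_like(hidden_states)
    num_chunks = -(-num_tokens // chunk_size)  # ceil division
    chunks = torch.chunk(hidden_states, num_chunks, dim=0)
    return torch.cat([run_chunk(c) for c in chunks], dim=0)
```

With the guard, `run_chunked(torch.randn(0, 16), 4, lambda x: 2 * x)` returns an empty `(0, 16)` tensor instead of raising from `torch.chunk` with zero chunks.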
@baonudesifeizhai (Contributor) commented Mar 15, 2026

There may be more activity in sglang... so...

bnellnm added 10 commits March 18, 2026 16:48 (each Signed-off-by: Bill Nell <bnell@redhat.com>)
@mergify mergify Bot removed the needs-rebase label Apr 4, 2026
@mergify Bot (Contributor) commented Apr 6, 2026

This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @bnellnm.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify Bot added the needs-rebase label Apr 6, 2026
Signed-off-by: Bill Nell <bnell@redhat.com>
@mergify mergify Bot removed the needs-rebase label Apr 6, 2026
Signed-off-by: Bill Nell <bnell@redhat.com>
Comment thread: tests/evals/gsm8k/configs/moe-refactor/LongCat-Flash-Chat-FP8.yaml (outdated)
@robertgshaw2-redhat robertgshaw2-redhat added the ready ONLY add when PR is ready to merge/full CI is needed label Apr 14, 2026
@github-project-automation github-project-automation Bot moved this to Ready in NVIDIA Apr 14, 2026
@robertgshaw2-redhat robertgshaw2-redhat enabled auto-merge (squash) April 14, 2026 20:11
@robertgshaw2-redhat robertgshaw2-redhat merged commit 19ec9a0 into vllm-project:main Apr 14, 2026
73 of 75 checks passed
@github-project-automation github-project-automation Bot moved this from Ready to Done in NVIDIA Apr 14, 2026
zxd1997066 pushed a commit to zxd1997066/vllm that referenced this pull request Apr 15, 2026
…roject#35549)

Signed-off-by: Bill Nell <bnell@redhat.com>
Signed-off-by: zengxian <xiangdong.zeng@intel.com>
@bnellnm bnellnm deleted the moe-runner-4 branch April 15, 2026 20:49
adobrzyn pushed a commit to vllm-project/vllm-gaudi that referenced this pull request Apr 16, 2026
…stream regressions in HPU worker, MoE router, and offloading tests (#1354)

Fix three upstream regressions that break HPU unit tests.


## Changes

1. **`vllm_gaudi/v1/worker/hpu_worker.py`** — `compile_or_warm_up_model()` now returns a `CompilationTimes` NamedTuple instead of a plain `float`, matching the new upstream contract introduced in vllm-project/vllm#39240.

2. **`vllm_gaudi/ops/hpu_fused_moe.py`** — Add `zero_expert_type` and `num_logical_experts` parameters to the HPU override of `create_fused_moe_router()`, plus `ZeroExpertRouter` dispatch, matching the refactor in vllm-project/vllm#35549 (see the sketch after this list).

3. **`tests/unit_tests/kv_offload/offloading_connector/test_scheduler.py`** — Remove `block_size` from `OffloadingEvent` constructor calls and update the assertion, matching the field removal in vllm-project/vllm#36644.

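For context, a self-contained sketch of what such a dispatching override can look like. The class bodies and exact signature are hypothetical stand-ins for illustration, not the actual vllm-gaudi code:

```python
from typing import Optional

class DefaultRouter:
    """Stand-in for the platform's default MoE router."""
    def __init__(self, top_k: int, num_experts: int):
        self.top_k, self.num_experts = top_k, num_experts

class ZeroExpertRouter:
    """Stand-in for the router that also handles zero experts."""
    def __init__(self, top_k: int, num_experts: int,
                 zero_expert_type: str, num_logical_experts: int):
        self.top_k, self.num_experts = top_k, num_experts
        self.zero_expert_type = zero_expert_type
        self.num_logical_experts = num_logical_experts

def create_fused_moe_router(
    top_k: int,
    num_experts: int,
    zero_expert_type: Optional[str] = None,     # new parameter per #35549
    num_logical_experts: Optional[int] = None,  # new parameter per #35549
):
    # Dispatch to the zero-expert router only when zero experts
    # are configured; otherwise fall back to the default router.
    if zero_expert_type is not None:
        return ZeroExpertRouter(top_k, num_experts,
                                zero_expert_type, num_logical_experts)
    return DefaultRouter(top_k, num_experts)
```
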
## Fixed tests

- `tests/unit_tests/lora/test_llama_tp.py::test_llama_lora`
- `tests/unit_tests/lora/test_llm_with_multi_loras.py::test_multiple_lora_requests`
- `tests/unit_tests/test_embedding.py::test_embeddings[intfloat/e5-mistral-7b-instruct]`
- `tests/unit_tests/ops/test_hpu_fused_moe.py::test_unquantized_fused_moe_method`
- `tests/unit_tests/ops/test_hpu_compressed_tensors.py::test_compressed_tensors_wna16_moe_method`
- `tests/unit_tests/ops/test_hpu_compressed_tensors.py::test_compressed_tensors_w8a8fp8_block_moe_method`
- `tests/unit_tests/kv_offload/offloading_connector/test_scheduler.py::test_offloading_connector[True]`
- `tests/unit_tests/kv_offload/offloading_connector/test_scheduler.py::test_offloading_connector[False]`

---------

Signed-off-by: Paweł Olejniczak <pawelx.olejniczak@intel.com>
yeonsily pushed a commit to yeonsily/vllm-gaudi that referenced this pull request Apr 21, 2026
…stream regressions in HPU worker, MoE router, and offloading tests (vllm-project#1354)

Signed-off-by: Paweł Olejniczak <pawelx.olejniczak@intel.com>
Signed-off-by: Yeonsil Yoon <yeon.sil.yoon@intel.com>
bmyrcha pushed a commit to bmyrcha/vllm-gaudi that referenced this pull request Apr 22, 2026
…stream regressions in HPU worker, MoE router, and offloading tests (vllm-project#1354)

Signed-off-by: Paweł Olejniczak <pawelx.olejniczak@intel.com>
Signed-off-by: bmyrcha <bartosz.myrcha@intel.com>
whk-lab pushed a commit to whk-lab/vllm that referenced this pull request Apr 23, 2026
avinashsingh77 pushed a commit to avinashsingh77/vllm that referenced this pull request Apr 27, 2026
…roject#35549)

Signed-off-by: Bill Nell <bnell@redhat.com>
Signed-off-by: Avinash Singh <avinashsingh.rcoem@gmail.com>

Labels

nvidia, ready (ONLY add when PR is ready to merge/full CI is needed)

Projects

Status: Done


3 participants