
[MoE Refactor] EPLB refactoring for FusedMoE #41055

Merged
robertgshaw2-redhat merged 16 commits into vllm-project:main from neuralmagic:eplb-manager
May 12, 2026

Conversation

@bnellnm
Collaborator

@bnellnm bnellnm commented Apr 27, 2026

Purpose

  • Use eplb_state | None instead of enable_eplb flag + eplb_state in FusedMoE and router classes.
  • Add set method to EplbLayerState.
  • Update tests
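
A minimal sketch of the pattern described above, assuming a much simplified layer state. The class `Router`, the fields `expert_load` and `logical_to_physical`, and the `route` method are hypothetical and only illustrate how the presence of `eplb_state` can replace a separate `enable_eplb` flag; they are not the actual vLLM definitions.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class EplbLayerState:
    """Hypothetical stand-in for the per-layer EPLB state (fields are assumed)."""

    expert_load: list[int] = field(default_factory=list)
    logical_to_physical: list[int] = field(default_factory=list)

    def set(self, expert_load: list[int], logical_to_physical: list[int]) -> None:
        # Populate every field in one call so the state is never half-initialized.
        self.expert_load = list(expert_load)
        self.logical_to_physical = list(logical_to_physical)


class Router:
    """Hypothetical router: EPLB is enabled exactly when a state object is present."""

    def __init__(self, eplb_state: EplbLayerState | None = None) -> None:
        self.eplb_state = eplb_state

    @property
    def eplb_enabled(self) -> bool:
        return self.eplb_state is not None

    def route(self, expert_ids: list[int]) -> list[int]:
        if self.eplb_state is None:
            # EPLB disabled: logical expert ids are used unchanged.
            return expert_ids
        mapping = self.eplb_state.logical_to_physical
        physical = [mapping[e] for e in expert_ids]
        for p in physical:
            # Record load per physical expert for later rebalancing.
            self.eplb_state.expert_load[p] += 1
        return physical
```

With a single optional attribute there is no way to end up with the flag set but the state missing, or the reverse.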

Test Plan

CI

Test Result

cc @yzong-rh


Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.

bnellnm added 2 commits April 27, 2026 20:54
Signed-off-by: Bill Nell <bnell@redhat.com>
Signed-off-by: Bill Nell <bnell@redhat.com>

@claude claude Bot left a comment


Claude Code Review

This pull request is from a fork — automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.

Contributor

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request introduces the EplbManager class to centralize Expert Parallelism Load Balancing (EPLB) logic, refactoring MoE layers and routers to delegate state management and weight collection. Feedback recommends using p.detach() instead of p.data for safer tensor access and replacing assertions with explicit RuntimeErrors for critical validation to prevent issues in optimized Python environments.
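
The point about assertions can be illustrated with a small sketch. The function names and the `num_experts` check are hypothetical and are not taken from the PR; the only behavior relied on is that Python removes `assert` statements when run with `-O`.

```python
def check_with_assert(num_experts: int) -> None:
    # Removed entirely under `python -O`, so invalid input would pass silently.
    assert num_experts > 0, "num_experts must be positive"


def check_with_raise(num_experts: int) -> None:
    # Always executed, regardless of interpreter optimization flags.
    if num_experts <= 0:
        raise RuntimeError(f"num_experts must be positive, got {num_experts}")
```

For validation that must hold in production, the explicit `raise` form is the safer choice; `assert` remains suitable for internal invariants checked during development.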

Comment thread vllm/model_executor/layers/fused_moe/eplb_manager.py Outdated
Comment thread vllm/model_executor/layers/fused_moe/eplb_manager.py Outdated
Signed-off-by: Bill Nell <bnell@redhat.com>
@bnellnm bnellnm requested a review from WoosukKwon as a code owner April 28, 2026 19:52
Signed-off-by: Bill Nell <bnell@redhat.com>
bnellnm added 2 commits May 5, 2026 15:12
Signed-off-by: Bill Nell <bnell@redhat.com>
Signed-off-by: Bill Nell <bnell@redhat.com>
@yzong-rh
Contributor

yzong-rh commented May 6, 2026

https://github.com/neuralmagic/vllm/blob/2bc4adcc0e4d5a58cf2b69cab9d6126ba8882641/vllm/distributed/eplb/eplb_state.py#L642-L647
Still expects layer.eplb_state.

https://github.com/neuralmagic/vllm/blob/2bc4adcc0e4d5a58cf2b69cab9d6126ba8882641/vllm/model_executor/layers/quantization/modelopt.py#L1934-L1937
Still expects layer.enable_eplb.

AI also found:

tests/model_executor/test_routed_experts_capture.py

  • uses router.enable_eplb
  • uses router.eplb_state.*

tests/kernels/moe/test_routing.py

  • calls create_fused_moe_router(..., enable_eplb=..., eplb_state=...)

tests/distributed/test_eplb_fused_moe_layer_dep_nvfp4.py

  • sets fml.enable_eplb = True

Not caused by this refactor but likely a bug:
https://github.com/neuralmagic/vllm/blob/2bc4adcc0e4d5a58cf2b69cab9d6126ba8882641/vllm/model_executor/models/sarvam.py#L664-L668
which uses an incorrect set_eplb_state signature.

@yzong-rh
Contributor

yzong-rh commented May 6, 2026

Instead of creating an EplbManager wrapper, what if we flesh out EplbLayerState with set_state and get_expert_weights instead?
This moves the EPLB handling logic out of FusedMoE without introducing a manager class.

Signed-off-by: Bill Nell <bnell@redhat.com>
@bnellnm bnellnm changed the title [MoE Refactor] Add EplbManager class to handle EPLB functionality [MoE Refactor] EPLB refactoring for FusedMoE May 6, 2026
@bnellnm
Collaborator Author

bnellnm commented May 6, 2026

Instead of creating an EplbManager wrapper, what if we flesh out EplbLayerState with set_state and get_expert_weights instead? This moves the EPLB handling logic out of FusedMoE without introducing a manager class.

Good point. I've redone the PR so that there's still only EplbLayerState. It's now mostly removing the flag and using the presence of the state as an indicator of whether or not EPLB is enabled. I ended up moving the static methods on the defunct manager to other places in a later PR anyway.

Signed-off-by: Bill Nell <bnell@redhat.com>
Contributor

@yzong-rh yzong-rh left a comment


LGTM, thanks!

Comment thread vllm/model_executor/layers/fused_moe/layer.py Outdated
Comment thread vllm/model_executor/layers/fused_moe/layer.py Outdated
Comment thread vllm/model_executor/layers/fused_moe/layer.py Outdated
  )

- if self.enable_eplb and not self.quant_method.supports_eplb:
+ if enable_eplb and not self.quant_method.supports_eplb:
Collaborator


Note to self: it seems weird that it would be the quant method that determines this.

Shouldn't it be the kernel?

@robertgshaw2-redhat robertgshaw2-redhat added the ready ONLY add when PR is ready to merge/full CI is needed label May 11, 2026
@mergify
Contributor

mergify Bot commented May 11, 2026

Hi @bnellnm, the pre-commit checks have failed. Please run:

uv pip install "pre-commit>=4.5.1"
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy failing?
mypy is run differently in CI. If the failure is related to this check, please use the following command to run it locally:
# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10

@bnellnm bnellnm requested a review from tjtanaa as a code owner May 11, 2026 16:39
Contributor

@ilmarkov ilmarkov left a comment


LGTM. Just small nits.

Comment thread vllm/model_executor/layers/fused_moe/router/base_router.py Outdated
Signed-off-by: Bill Nell <bnell@redhat.com>
bnellnm added 2 commits May 12, 2026 13:45
Signed-off-by: Bill Nell <bnell@redhat.com>
Signed-off-by: Bill Nell <bnell@redhat.com>
@mergify
Contributor

mergify Bot commented May 12, 2026

Hi @bnellnm, the pre-commit checks have failed. Please run:

uv pip install "pre-commit>=4.5.1"
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy failing?
mypy is run differently in CI. If the failure is related to this check, please use the following command to run it locally:
# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10

Signed-off-by: Bill Nell <bnell@redhat.com>
@robertgshaw2-redhat robertgshaw2-redhat merged commit d9b4990 into vllm-project:main May 12, 2026
92 checks passed
aoshen02 added a commit to aoshen02/vllm that referenced this pull request May 14, 2026
…oject#41055 API

PR vllm-project#41055 ([MoE Refactor] EPLB refactoring for FusedMoE) removed the
`enable_eplb` parameter from `BaseRouter.__init__`; the new API uses
`eplb_state=None` (disabled) vs. populated `eplb_state` (enabled).

Reverting vllm-project#39917 restored the pre-vllm-project#39917 test file that still passed
`enable_eplb=False`, causing TypeError on import/instantiation. Align
the test helper with the current API: `_make_router` now takes an
optional `eplb_state` (defaults to None), and the EPLB-enabled test
builds a fully-populated state and passes it in.

Signed-off-by: Ao Shen <aoshen@inferact.ai>

Signed-off-by: aoshen02 <aoshen@inferact.ai>
wangxiyuan pushed a commit to vllm-project/vllm-ascend that referenced this pull request May 14, 2026
### What this PR does / why we need it?
1. fix vllm-project/vllm#33322
Override `gpu_modelrunner.sync_and_gather_intermediate_tensors`; for
the `pp+sp+tp` scenario, skip scattering the residual on Ascend.

2. vllm-project/vllm#35520
Adapted to the interface-level changes in `ModelRunner v2` for hybrid
attention.
TODO: Add Mamba support to ModelRunner on Ascend. Pull requests are
welcome.

3. vllm-project/vllm#40711

4. vllm-project/vllm#42121

5. vllm-project/vllm#41706

6. vllm-project/vllm#39917
Disable `async_schedule` when `enable_return_routed_experts=True`
7. vllm-project/vllm#41046
8. vllm-project/vllm#41055
9. vllm-project/vllm#41035
10. vllm-project/vllm#42434
### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.20.1
- vLLM main:
vllm-project/vllm@c7aa186

---------

Signed-off-by: wangli <wangli858794774@gmail.com>
Tian-Fantasea pushed a commit to Tian-Fantasea/vllm-ascend that referenced this pull request May 19, 2026
shaopeng-666 pushed a commit to shaopeng-666/vllm-ascend that referenced this pull request May 19, 2026
Signed-off-by: 李少鹏 <lishaopeng21@huawei.com>
iboiko-habana added a commit to vllm-project/vllm-gaudi that referenced this pull request May 19, 2026
…ltiModelEngineClient, Qwen3.5 compilation, and EPLB refactoring (#1436)

Fix upstream regressions affecting hourly CI:

1. **MultiModelEngineClient**: Added missing
`notify_kv_transfer_request_rejected` abstract method (upstream PR
vllm-project/vllm#41269)
2. **Qwen3.5 test harness**: Updated `test_common.py` to read
`enforce_eager` from model card config (with env var override), enabling
per-model compilation control
3. **EPLB refactoring**: Removed `EMPTY_EPLB_STATE` import and
`enable_eplb` parameter from `patched_create_fused_moe_router` after
upstream MoE refactor (upstream PR vllm-project/vllm#41055)

Note: The `enforce_eager: true` workaround for Qwen3.5 compilation has
been removed — the root cause (mamba_type str-vs-Enum comparison in
hybrid cache allocation) is properly fixed by #1449, which should merge
first.

Verified on HPU: unit tests pass on Gaudi 3 (MoE, FP8, compressed
tensors).

---------

Signed-off-by: Paweł Olejniczak <pawelx.olejniczak@intel.com>
Signed-off-by: Pawel Olejniczak <pawelx.olejniczak@intel.com>
Co-authored-by: Iryna Boiko <iryna.boiko@intel.com>
mfylcek pushed a commit to mfylcek/vllm that referenced this pull request May 19, 2026
Signed-off-by: Bill Nell <bnell@redhat.com>
Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com>
jhu960213 pushed a commit to jhu960213/vllm that referenced this pull request May 20, 2026
Signed-off-by: Bill Nell <bnell@redhat.com>
Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com>
h1t35h pushed a commit to h1t35h/vllm that referenced this pull request May 21, 2026
Signed-off-by: Bill Nell <bnell@redhat.com>
Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com>
