[BugFix] Fix Ascend MoE routing expert count with EPLB#8864

Open
gcanlin wants to merge 2 commits into vllm-project:main from gcanlin:moe-bugfix

Conversation


@gcanlin gcanlin commented May 2, 2026

Summary

Fix Ascend MoE dynamic EPLB routing after the upstream vLLM MoE/EPLB refactor.

Upstream vLLM now distinguishes:

  • logical experts: the experts represented by router logits
  • physical/global experts: logical experts plus redundant EPLB replicas
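The relationship between the two counts can be sketched as follows (the variable names and numbers are illustrative, not the actual vLLM config fields or values):

```python
# Illustrative relationship between expert counts under dynamic EPLB.
num_logical_experts = 128    # experts the router logits actually score
num_redundant_experts = 16   # extra EPLB replicas of hot experts
num_physical_experts = num_logical_experts + num_redundant_experts

# router_logits has one column per *logical* expert:
router_logits_dim = num_logical_experts

# Comparing that dimension against the physical count is what failed:
assert router_logits_dim != num_physical_experts  # mismatch once EPLB adds replicas
assert router_logits_dim == num_logical_experts   # the correct comparison
```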

router_logits.shape[-1] matches the logical expert count, but Ascend MoE quant paths were comparing it against moe_config.num_experts, which can include redundant physical experts when dynamic EPLB is enabled. This caused:

AssertionError: Number of global experts mismatch (excluding redundancy) in the Qwen3 MoE W8A8 dynamic EPLB TP2 test.

Changes

  • Add a helper to resolve the logical expert count from moe_config.num_logical_experts, with a fallback for older configs.
  • Use logical expert count for:
    • router logits validation
    • expert selection
    • zero expert handling
    • profile force-load-balance random routing
  • Preserve physical/global expert count for dispatch and redundant expert handling.
  • Apply the same logical/physical split to related Ascend MoE quant paths to avoid the same bug in other quant modes.
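A minimal sketch of the helper described above (the name `get_moe_num_logical_experts` appears in the review summary below, but the body here is an assumption based on the PR description, not the actual implementation):

```python
from types import SimpleNamespace


def get_moe_num_logical_experts(moe_config):
    """Resolve the logical (router-logits) expert count.

    Prefers the post-refactor num_logical_experts field; falls back to
    num_experts for older configs that predate the logical/physical split.
    Sketch only -- the real helper may differ.
    """
    num_logical = getattr(moe_config, "num_logical_experts", None)
    if num_logical is not None:
        return num_logical
    return moe_config.num_experts


# New-style config: logical and physical counts diverge under dynamic EPLB.
new_cfg = SimpleNamespace(num_logical_experts=128, num_experts=144)
# Older config: only num_experts exists, so it doubles as the logical count.
old_cfg = SimpleNamespace(num_experts=64)

assert get_moe_num_logical_experts(new_cfg) == 128
assert get_moe_num_logical_experts(old_cfg) == 64
```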

Root cause

vLLM upstream PRs such as #30623 separated router logic into dedicated router classes and made EPLB map logical expert IDs to physical expert IDs after top-k selection. Ascend code still treated moe_config.num_experts as the router-logits expert count, but with dynamic EPLB it represents physical/global experts.
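The logical-to-physical step can be sketched like this (the mapping structure and round-robin replica choice are illustrative assumptions, not vLLM's actual data layout):

```python
# Sketch: EPLB maps logical expert IDs to physical replica IDs *after*
# top-k selection. Four logical experts; two of them (1 and 3) have a
# redundant replica, giving six physical experts in total.
logical_to_physical = {0: [0], 1: [1, 4], 2: [2], 3: [3, 5]}


def route(topk_logical_ids, step=0):
    """Map top-k logical expert IDs to physical replicas (round-robin)."""
    physical_ids = []
    for lid in topk_logical_ids:
        replicas = logical_to_physical[lid]
        physical_ids.append(replicas[step % len(replicas)])
    return physical_ids


# The router selects logical experts 1 and 3; EPLB alternates replicas.
assert route([1, 3], step=0) == [1, 3]
assert route([1, 3], step=1) == [4, 5]
```

The router only ever sees the four logical IDs; the six physical IDs exist purely for dispatch, which is why validating router logits against the physical count fails.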

Test

VLLM_USE_MODELSCOPE=True pytest -sv \
  tests/e2e/multicard/2-cards/test_qwen3_moe.py::test_qwen3_moe_w8a8_distributed_tp2_ep_dynamic_eplb

Test Result

(APIServer pid=328985) INFO:     127.0.0.1:60556 - "POST /v1/completions HTTP/1.1" 200 OK
[2026-05-02 10:28:33.692579][UC][I] Shutdown initiated (timeout=0) [329130,329130][core.py:1238,_handle_shutdown]
[2026-05-02 10:28:33.692621][UC][I] Shutdown complete [329130,329130][core.py:1261,_handle_shutdown]
[2026-05-02 10:28:33.692798][UC][I] Parent process exited, terminating worker queues [329270,330271][multiproc_executor.py:775,death_pipe_monitor]
[2026-05-02 10:28:33.692812][UC][I] Parent process exited, terminating worker queues [329278,330263][multiproc_executor.py:775,death_pipe_monitor]
[2026-05-02 10:28:33.692919][UC][I] WorkerProc shutting down. [329270,329270][multiproc_executor.py:872,worker_main]
[2026-05-02 10:28:33.692960][UC][I] WorkerProc shutting down. [329278,329278][multiproc_executor.py:872,worker_main]
(APIServer pid=328985) INFO:     Shutting down
(APIServer pid=328985) INFO:     Waiting for application shutdown.
(APIServer pid=328985) INFO:     Application shutdown complete.
(APIServer pid=328985) INFO:     Finished server process [328985]
(APIServer pid=328985) sys:1: DeprecationWarning: builtin type swigvarlink has no __module__ attribute
/usr/local/python3.11.14/lib/python3.11/multiprocessing/resource_tracker.py:254: UserWarning: resource_tracker: There appear to be 1 leaked shared_memory objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
PASSED

========================================================== warnings summary ===========================================================
<frozen importlib._bootstrap>:241
  <frozen importlib._bootstrap>:241: DeprecationWarning: builtin type SwigPyPacked has no __module__ attribute

<frozen importlib._bootstrap>:241
  <frozen importlib._bootstrap>:241: DeprecationWarning: builtin type SwigPyObject has no __module__ attribute

vllm_ascend/patch/worker/patch_weight_utils.py:80
  /root/vllm-workspace2/vllm-ascend/vllm_ascend/patch/worker/patch_weight_utils.py:80: DeprecationWarning: pkg_resources is deprecated as an API. See https://setuptools.pypa.io/en/latest/pkg_resources.html
    module = original_import(name, globals, locals, fromlist, level)

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
============================================== 1 passed, 3 warnings in 342.04s (0:05:42) ==============================================
sys:1: DeprecationWarning: builtin type swigvarlink has no __module__ attribute

@gemini-code-assist

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request addresses a bug in Ascend MoE routing where the dynamic EPLB configuration caused a mismatch between logical and physical expert counts. By correctly separating these counts and updating the quantization paths, the fix ensures that router logits and expert selection logic operate on the expected logical expert count, preventing assertion failures in distributed MoE scenarios.

Highlights

  • Logical Expert Resolution: Introduced a helper function to correctly resolve the logical expert count, distinguishing it from the physical/global expert count used in dynamic EPLB configurations.
  • Quantization Path Updates: Updated multiple Ascend MoE quantization paths (W4A16, W4A4, W8A8, W8A8 MXFP8) to utilize the logical expert count for router logit validation, expert selection, and load balancing.
  • Refactoring: Cleaned up AscendFusedMoE and AscendSharedFusedMoE implementations, including removing redundant inheritance and improving shared expert consistency validation.




github-actions Bot commented May 2, 2026

👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:

  • A PR should do only one thing; smaller PRs enable faster reviews.
  • Every PR should include unit tests and end-to-end tests to ensure it works and is not broken by future PRs.
  • Write the commit message by fulfilling the PR description to help reviewers and future developers understand the change.

If CI fails, you can run linting and testing checks locally according to Contributing and Testing.


@gemini-code-assist Bot left a comment


Code Review

Suggested PR Title:

[Ops][Feature] Refactor MoE expert logic and unify SharedFusedMoE

Suggested PR Summary:

### What this PR does / why we need it?
This PR refactors the MoE implementation for Ascend by introducing a centralized `get_moe_num_logical_experts` utility to handle expert counts across various quantization methods (W4A16, W4A4, W8A8). It unifies `AscendSharedFusedMoE` into `AscendFusedMoE`, updates the runner to inherit from the standard `MoERunner`, and adds a consistency validation check for shared expert split computations. Feedback was provided regarding unresolved developer notes in Chinese and hardcoded logic in the `finalize` call within `fused_moe.py`.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
New unit tests were added in `tests/ut/quantization/methods/test_moe_logical_experts.py` to verify the logical expert calculation.

Comment thread on vllm_ascend/ops/fused_moe/fused_moe.py (outdated)
@gcanlin gcanlin changed the title [Bugfix] Fix Ascend MoE routing expert count with EPLB [BugFix] Fix Ascend MoE routing expert count with EPLB May 2, 2026
@gcanlin gcanlin added the ready and ready-for-test labels May 2, 2026
Signed-off-by: gcanlin <canlinguosdu@gmail.com>
Signed-off-by: gcanlin <canlinguosdu@gmail.com>

gcanlin commented May 2, 2026

The CI failure is unrelated to this PR. The bug comes from #8831, which has not been adapted to vLLM main; vllm-project/vllm#39446 broke it.

