[BugFix] Fix Ascend MoE routing expert count with EPLB#8864

Open
gcanlin wants to merge 2 commits into vllm-project:main from gcanlin:moe-bugfix

Conversation


@gcanlin gcanlin commented May 2, 2026

Summary

Fix Ascend MoE dynamic EPLB routing after the upstream vLLM MoE/EPLB refactor.

Upstream vLLM now distinguishes:

  • logical experts: the experts represented by router logits
  • physical/global experts: logical experts plus redundant EPLB replicas
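The relationship between the two counts can be sketched as follows (the variable names and numbers are illustrative, not the actual vLLM config fields or values):

```python
# Illustrative relationship between expert counts under dynamic EPLB.
num_logical_experts = 128    # experts the router logits actually score
num_redundant_experts = 16   # extra EPLB replicas of hot experts
num_physical_experts = num_logical_experts + num_redundant_experts

# router_logits has one column per *logical* expert:
router_logits_dim = num_logical_experts

# Comparing that dimension against the physical count is what failed:
assert router_logits_dim != num_physical_experts  # mismatch once EPLB adds replicas
assert router_logits_dim == num_logical_experts   # the correct comparison
```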

router_logits.shape[-1] matches the logical expert count, but Ascend MoE quant paths were comparing it against moe_config.num_experts, which can include redundant physical experts when dynamic EPLB is enabled. This caused:

AssertionError: Number of global experts mismatch (excluding redundancy) in the Qwen3 MoE W8A8 dynamic EPLB TP2 test.

Changes

  • Add a helper to resolve the logical expert count from moe_config.num_logical_experts, with a fallback for older configs.
  • Use logical expert count for:
    • router logits validation
    • expert selection
    • zero expert handling
    • profile force-load-balance random routing
  • Preserve physical/global expert count for dispatch and redundant expert handling.
  • Apply the same logical/physical split to related Ascend MoE quant paths to avoid the same bug in other quant modes.
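A minimal sketch of the helper described above (the name `get_moe_num_logical_experts` appears in the review summary below, but the body here is an assumption based on the PR description, not the actual implementation):

```python
from types import SimpleNamespace


def get_moe_num_logical_experts(moe_config):
    """Resolve the logical (router-logits) expert count.

    Prefers the post-refactor num_logical_experts field; falls back to
    num_experts for older configs that predate the logical/physical split.
    Sketch only -- the real helper may differ.
    """
    num_logical = getattr(moe_config, "num_logical_experts", None)
    if num_logical is not None:
        return num_logical
    return moe_config.num_experts


# New-style config: logical and physical counts diverge under dynamic EPLB.
new_cfg = SimpleNamespace(num_logical_experts=128, num_experts=144)
# Older config: only num_experts exists, so it doubles as the logical count.
old_cfg = SimpleNamespace(num_experts=64)

assert get_moe_num_logical_experts(new_cfg) == 128
assert get_moe_num_logical_experts(old_cfg) == 64
```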

Root cause

vLLM upstream PRs such as #30623 separated router logic into dedicated router classes and made EPLB map logical expert IDs to physical expert IDs after top-k selection. Ascend code still treated moe_config.num_experts as the router-logits expert count, but with dynamic EPLB it represents physical/global experts.
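The logical-to-physical step can be sketched like this (the mapping structure and round-robin replica choice are illustrative assumptions, not vLLM's actual data layout):

```python
# Sketch: EPLB maps logical expert IDs to physical replica IDs *after*
# top-k selection. Four logical experts; two of them (1 and 3) have a
# redundant replica, giving six physical experts in total.
logical_to_physical = {0: [0], 1: [1, 4], 2: [2], 3: [3, 5]}


def route(topk_logical_ids, step=0):
    """Map top-k logical expert IDs to physical replicas (round-robin)."""
    physical_ids = []
    for lid in topk_logical_ids:
        replicas = logical_to_physical[lid]
        physical_ids.append(replicas[step % len(replicas)])
    return physical_ids


# The router selects logical experts 1 and 3; EPLB alternates replicas.
assert route([1, 3], step=0) == [1, 3]
assert route([1, 3], step=1) == [4, 5]
```

The router only ever sees the four logical IDs; the six physical IDs exist purely for dispatch, which is why validating router logits against the physical count fails.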

Test

VLLM_USE_MODELSCOPE=True pytest -sv \
  tests/e2e/multicard/2-cards/test_qwen3_moe.py::test_qwen3_moe_w8a8_distributed_tp2_ep_dynamic_eplb

Test Result

(APIServer pid=328985) INFO:     127.0.0.1:60556 - "POST /v1/completions HTTP/1.1" 200 OK
[2026-05-02 10:28:33.692579][UC][I] Shutdown initiated (timeout=0) [329130,329130][core.py:1238,_handle_shutdown]
[2026-05-02 10:28:33.692621][UC][I] Shutdown complete [329130,329130][core.py:1261,_handle_shutdown]
[2026-05-02 10:28:33.692798][UC][I] Parent process exited, terminating worker queues [329270,330271][multiproc_executor.py:775,death_pipe_monitor]
[2026-05-02 10:28:33.692812][UC][I] Parent process exited, terminating worker queues [329278,330263][multiproc_executor.py:775,death_pipe_monitor]
[2026-05-02 10:28:33.692919][UC][I] WorkerProc shutting down. [329270,329270][multiproc_executor.py:872,worker_main]
[2026-05-02 10:28:33.692960][UC][I] WorkerProc shutting down. [329278,329278][multiproc_executor.py:872,worker_main]
(APIServer pid=328985) INFO:     Shutting down
(APIServer pid=328985) INFO:     Waiting for application shutdown.
(APIServer pid=328985) INFO:     Application shutdown complete.
(APIServer pid=328985) INFO:     Finished server process [328985]
(APIServer pid=328985) sys:1: DeprecationWarning: builtin type swigvarlink has no __module__ attribute
/usr/local/python3.11.14/lib/python3.11/multiprocessing/resource_tracker.py:254: UserWarning: resource_tracker: There appear to be 1 leaked shared_memory objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
PASSED

========================================================== warnings summary ===========================================================
<frozen importlib._bootstrap>:241
  <frozen importlib._bootstrap>:241: DeprecationWarning: builtin type SwigPyPacked has no __module__ attribute

<frozen importlib._bootstrap>:241
  <frozen importlib._bootstrap>:241: DeprecationWarning: builtin type SwigPyObject has no __module__ attribute

vllm_ascend/patch/worker/patch_weight_utils.py:80
  /root/vllm-workspace2/vllm-ascend/vllm_ascend/patch/worker/patch_weight_utils.py:80: DeprecationWarning: pkg_resources is deprecated as an API. See https://setuptools.pypa.io/en/latest/pkg_resources.html
    module = original_import(name, globals, locals, fromlist, level)

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
============================================== 1 passed, 3 warnings in 342.04s (0:05:42) ==============================================
sys:1: DeprecationWarning: builtin type swigvarlink has no __module__ attribute

@gemini-code-assist

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request addresses a bug in Ascend MoE routing where the dynamic EPLB configuration caused a mismatch between logical and physical expert counts. By correctly separating these counts and updating the quantization paths, the fix ensures that router logits and expert selection logic operate on the expected logical expert count, preventing assertion failures in distributed MoE scenarios.

Highlights

  • Logical Expert Resolution: Introduced a helper function to correctly resolve the logical expert count, distinguishing it from the physical/global expert count used in dynamic EPLB configurations.
  • Quantization Path Updates: Updated multiple Ascend MoE quantization paths (W4A16, W4A4, W8A8, W8A8 MXFP8) to utilize the logical expert count for router logit validation, expert selection, and load balancing.
  • Refactoring: Cleaned up AscendFusedMoE and AscendSharedFusedMoE implementations, including removing redundant inheritance and improving shared expert consistency validation.




github-actions Bot commented May 2, 2026

👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:

  • A PR should do only one thing; smaller PRs enable faster reviews.
  • Every PR should include unit tests and end-to-end tests to ensure it works and is not broken by future PRs.
  • Write the commit message by fulfilling the PR description to help reviewers and future developers understand the change.

If CI fails, you can run linting and testing checks locally according to Contributing and Testing.


@gemini-code-assist Bot left a comment


Code Review

Suggested PR Title:

[Ops][Feature] Refactor MoE expert logic and unify SharedFusedMoE

Suggested PR Summary:

### What this PR does / why we need it?
This PR refactors the MoE implementation for Ascend by introducing a centralized `get_moe_num_logical_experts` utility to handle expert counts across various quantization methods (W4A16, W4A4, W8A8). It unifies `AscendSharedFusedMoE` into `AscendFusedMoE`, updates the runner to inherit from the standard `MoERunner`, and adds a consistency validation check for shared expert split computations. Feedback was provided regarding unresolved developer notes in Chinese and hardcoded logic in the `finalize` call within `fused_moe.py`.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
New unit tests were added in `tests/ut/quantization/methods/test_moe_logical_experts.py` to verify the logical expert calculation.

Comment thread on vllm_ascend/ops/fused_moe/fused_moe.py (outdated)
@gcanlin gcanlin changed the title [Bugfix] Fix Ascend MoE routing expert count with EPLB [BugFix] Fix Ascend MoE routing expert count with EPLB May 2, 2026
@gcanlin gcanlin added the ready and ready-for-test labels May 2, 2026
Signed-off-by: gcanlin <canlinguosdu@gmail.com>
Signed-off-by: gcanlin <canlinguosdu@gmail.com>

gcanlin commented May 2, 2026

The CI failure is unrelated to this PR. The bug comes from #8831, which has not been adapted to vLLM main; vllm-project/vllm#39446 broke it.

