
Bugfix: Align expert map shapes with redundant experts in EPLB adjustment#5285

Merged
wangxiyuan merged 6 commits into vllm-project:main from Mercykid-bash:eplb-fix-main
Jan 6, 2026

Conversation

@Mercykid-bash
Contributor

@Mercykid-bash Mercykid-bash commented Dec 23, 2025

Overview

This PR fixes a shape mismatch bug between expert_placement_map and log2phy_expert_map when redundant experts are enabled in the vLLM-Ascend platform. The issue occurred during the initialization of expert maps and their updates via EPLB (Expert Load Balancer) adjustment, leading to potential tensor shape errors and incorrect expert routing in distributed MoE deployments.

Key Changes

  1. Unify expert map shape calculation logic

    • Ensure the shape of expert_placement_map and log2phy_expert_map strictly aligns with the total number of experts (including redundant experts) during initialization.
    • Update the shape adjustment logic in the EPLB dynamic update process to match the initial expert map dimensions.
  2. Add shape consistency checks

    • Add assertion statements to verify the shape consistency of the two maps after initialization and EPLB adjustment, preventing silent shape mismatches in subsequent operations.
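The two changes above can be sketched as follows. This is an illustrative sketch only, using plain Python lists in place of tensors; the function and parameter names (build_expert_maps, num_logical, num_redundant, num_ranks) are hypothetical and not the actual vllm-ascend API.

```python
def build_expert_maps(num_logical: int, num_redundant: int, num_ranks: int):
    """Size both maps to the total physical expert count (logical + redundant)."""
    num_physical = num_logical + num_redundant
    # Both maps are allocated against num_physical, not num_logical,
    # so their shapes stay aligned once redundant experts are enabled.
    expert_placement_map = [[-1] * num_physical for _ in range(num_ranks)]
    log2phy_expert_map = [[-1] * num_physical for _ in range(num_ranks)]
    # Shape consistency check mirroring the assertions this PR adds after
    # initialization and after each EPLB adjustment:
    assert len(expert_placement_map[0]) == len(log2phy_expert_map[0]) == num_physical
    return expert_placement_map, log2phy_expert_map

placement, log2phy = build_expert_maps(num_logical=64, num_redundant=8, num_ranks=4)
```

With 64 logical experts and 8 redundant replicas, both maps here end up 72 columns wide; before the fix, one map could be sized against the logical count and the other against the physical count, producing the mismatch described above.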

Impact

  • Resolves tensor shape errors when using redundant experts with EPLB on Ascend platform.

  • Ensures correct expert routing and load balancing for MoE models with redundant expert configurations.

  • No breaking changes to existing functionality; compatible with non-redundant expert deployments.

  • vLLM version: release/v0.13.0

  • vLLM main: vllm-project/vllm@ad32e3e

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request addresses a shape mismatch bug concerning redundant experts in MoE layers. The modifications correctly adjust the shapes of expert_placement_map and log2phy_expert_map, and ensure the correct number of experts is passed to the underlying operators. The test updates are consistent with these fixes. My primary feedback focuses on enhancing the implementation within token_dispatcher.py. I suggest refactoring the code to avoid passing state via an instance attribute, which will improve the code's robustness and maintainability.

```diff
- quant_mode = 0
- moe_expert_num = len(expert_map)
+ quant_mode = 2 if self.with_quant else 0
+ self.moe_expert_num = len(expert_map) + global_redundant_expert_num
```
Contributor


Severity: high

Storing moe_expert_num as an instance attribute self.moe_expert_num creates an implicit dependency between token_dispatch (via get_dispatch_mc2_kwargs) and token_combine (via get_combine_mc_kwargs). This makes the code fragile and harder to reason about, as token_combine now relies on token_dispatch having been called first to set this attribute. A more robust approach is to pass state explicitly using the context_metadata dictionary.

I recommend the following refactoring:

  1. In get_dispatch_mc2_kwargs: Calculate moe_expert_num as a local variable and do not assign it to self.

```python
# In get_dispatch_mc2_kwargs
moe_expert_num = len(expert_map) + global_redundant_expert_num
kwargs_mc2 = {
    # ...
    "moe_expert_num": moe_expert_num,
    # ...
}
```

  2. In token_dispatch: Pass global_redundant_expert_num through context_metadata.

```python
# In token_dispatch
context_metadata["global_redundant_expert_num"] = global_redundant_expert_num
```

  3. In get_combine_mc_kwargs: Recalculate moe_expert_num using the value from context_metadata instead of reading from self.

```python
# In get_combine_mc_kwargs
global_redundant_expert_num = context_metadata["global_redundant_expert_num"]
moe_expert_num = len(expert_map) + global_redundant_expert_num
kwargs_mc2 = {
    # ...
    "moe_expert_num": moe_expert_num,
    # ...
}
```

This change will make the data flow explicit and improve the component's maintainability.
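Pulled together, the reviewer's suggestion amounts to the following self-contained sketch. The class and method names are simplified stand-ins for the real token_dispatcher.py code, not the actual vllm-ascend implementation.

```python
class TokenDispatcher:
    """Sketch of explicit state passing via a context_metadata dict."""

    def token_dispatch(self, expert_map, global_redundant_expert_num):
        # State that token_combine will need travels in context_metadata,
        # not in an instance attribute set as a side effect.
        context_metadata = {"global_redundant_expert_num": global_redundant_expert_num}
        # Local variable only; nothing is stashed on self.
        moe_expert_num = len(expert_map) + global_redundant_expert_num
        dispatch_kwargs = {"moe_expert_num": moe_expert_num}
        return dispatch_kwargs, context_metadata

    def token_combine(self, expert_map, context_metadata):
        # Recompute from the explicit metadata, so there is no hidden
        # dependency on token_dispatch having run first.
        redundant = context_metadata["global_redundant_expert_num"]
        combine_kwargs = {"moe_expert_num": len(expert_map) + redundant}
        return combine_kwargs
```

With 64 logical experts and 8 redundant replicas, both calls independently arrive at a moe_expert_num of 72, which is the consistency property the refactoring is after.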

@github-actions
Contributor

👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:

  • A PR should do only one thing; smaller PRs enable faster reviews.
  • Every PR should include unit tests and end-to-end tests to ensure it works and is not broken by future PRs.
  • Write the commit message by fulfilling the PR description to help reviewers and future developers understand.

If CI fails, you can run linting and testing checks locally according to Contributing and Testing.

@Mercykid-bash
Contributor Author

Validation Result

We conducted comparative tests on the Qwen model under two configurations to verify the fix for expert map shape mismatch:

  1. Baseline (without redundant experts, EPLB disabled)
    The inference output is as follows:

    {"id":"chatcmpl-8e1df21e02cbf858","object":"chat.completion","created":1766491376,"model":"qwen","choices":[{"index":0,"message":{"role":"assistant","content":"\nOkay, the user asked, \"What is deeplearning?\" I need to explain this in a clear and simple way. Let me start by recalling what I know about deep learning.\n\nFirst, deep learning is a subset of machine learning,","refusal":null,"annotations":null,"audio":null,"function_call":null,"tool_calls":[],"reasoning":null,"reasoning_content":null},"logprobs":null,"finish_reason":"length","stop_reason":null,"token_ids":null}],"service_tier":null,"system_fingerprint":null,"usage":{"prompt_tokens":14,"total_tokens":64,"completion_tokens":50,"prompt_tokens_details":null}
  2. Test Configuration (with redundant experts enabled, EPLB adjustment triggered)
    After enabling redundant experts and triggering dynamic EPLB (Expert Load Balancer) adjustment, the inference output is:

    {"id":"chatcmpl-9a72d52057ab55e0","object":"chat.completion","created":1766492392,"model":"qwen","choices":[{"index":0,"message":{"role":"assistant","content":"\nOkay, the user asked, \"What is deeplearning?\" I need to explain this in a clear and simple way. Let me start by recalling what I know about deep learning.\n\nFirst, deep learning is a subset of machine learning,","refusal":null,"annotations":null,"audio":null,"function_call":null,"tool_calls":[],"reasoning":null,"reasoning_content":null},"logprobs":null,"finish_reason":"length","stop_reason":null,"token_ids":null}],"service_tier":null,"system_fingerprint":null,"usage":{"prompt_tokens":14,"total_tokens":64,"completion_tokens":50,"prompt_tokens_details":null}

Conclusion

The test results confirm that:

  • The content of the inference response, token count (prompt/completion/total tokens), and finish reason are completely consistent between the two configurations.
  • The fix for shape alignment of expert_placement_map and log2phy_expert_map ensures that enabling redundant experts and triggering EPLB adjustment does not introduce any accuracy degradation or output inconsistency in vLLM-Ascend.
  • The core inference logic remains stable and accurate under the MoE configuration with redundant experts and dynamic EPLB adjustment.

@shenchuxiaofugui shenchuxiaofugui force-pushed the eplb-fix-main branch 2 times, most recently from 75b4896 to 701f2f7 Compare December 25, 2025 12:29
@github-actions
Contributor

This pull request has conflicts, please resolve those before we can evaluate the pull request.


@MengqingCao MengqingCao added the ready and ready-for-test labels Dec 29, 2025

mercykid and others added 6 commits January 6, 2026 10:08
Signed-off-by: Che Ruan <cr623@ic.ac.uk>
Signed-off-by: Che Ruan <cr623@ic.ac.uk>
Signed-off-by: shenchuxiaofugui <1311027364@qq.com>
Signed-off-by: Che Ruan <cr623@ic.ac.uk>
Signed-off-by: Che Ruan <cr623@ic.ac.uk>
Signed-off-by: Che Ruan <cr623@ic.ac.uk>
Signed-off-by: Che Ruan <cr623@ic.ac.uk>

rebase
@wangxiyuan wangxiyuan merged commit 29e2f9a into vllm-project:main Jan 6, 2026
19 checks passed
Rozwel-dx pushed a commit to Rozwel-dx/vllm-ascend that referenced this pull request Jan 8, 2026
…ment (vllm-project#5285)
@Mercykid-bash Mercykid-bash deleted the eplb-fix-main branch January 13, 2026 11:04
aipaes pushed a commit to aipaes/vllm-ascend that referenced this pull request Jan 15, 2026
…ment (vllm-project#5285)
ZRJ026 pushed a commit to ZRJ026/vllm-ascend that referenced this pull request Feb 28, 2026
…ment (vllm-project#5285)
maoxx241 pushed a commit to maoxx241/vllm-ascend that referenced this pull request Mar 2, 2026
…ment (vllm-project#5285)
ZRJ026 pushed a commit to ZRJ026/vllm-ascend that referenced this pull request Mar 4, 2026
…ment (vllm-project#5285)
LCAIZJ pushed a commit to LCAIZJ/vllm-ascend that referenced this pull request Mar 7, 2026
…ment (vllm-project#5285)

Labels

module:ops, module:tests, ready, ready-for-test


6 participants