
@yiz-liu (Collaborator) commented Aug 21, 2025

What this PR does / why we need it?

This PR adds the MC2 communication method for MoE layers, replacing the previous all-gather approach for small numbers of tokens.

The key changes include:

  • A new AscendFusedMoE layer that handles token splitting, local computation, and final aggregation via all-gather.
  • Logic in the model runner to dynamically select between the new MC2 method and the existing all-gather method based on the number of input tokens (see the sketch after this list).
  • Sharding the MoE communication mask across tensor-parallel ranks.
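
As a rough illustration of the selection logic in the second bullet, a minimal sketch follows; the helper name, the token threshold, and the way the result reaches the layer are assumptions for illustration, not code from this PR.

```python
# Hypothetical sketch only: the function name, the threshold value, and the
# config plumbing are assumptions, not the PR's actual implementation.
def select_moe_comm_method(num_input_tokens: int, mc2_token_capacity: int = 512) -> str:
    """Use MC2 for small batches; fall back to all-gather for larger ones."""
    return "mc2" if num_input_tokens <= mc2_token_capacity else "allgather"


# The model runner would record the chosen identifier (e.g. in the forward
# context) so the MoE layer can dispatch to the matching communication
# implementation at runtime:
# forward_context.moe_comm_method = select_moe_comm_method(num_input_tokens)
```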

Does this PR introduce any user-facing change?

None.

How was this patch tested?

Test case fixed.

@github-actions bot commented

👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:

  • A PR should do only one thing; smaller PRs enable faster reviews.
  • Every PR should include unit tests and end-to-end tests to ensure it works and is not broken by future PRs.
  • Write the commit message by filling in the PR description to help reviewers and future developers understand.

If CI fails, you can run linting and testing checks locally according to Contributing and Testing.

@gemini-code-assist bot (Contributor) left a comment

Code Review

This pull request introduces a new MC2 communication method for MoE layers, designed to optimize performance for smaller token counts. The changes include a new AscendFusedMoE layer, dynamic selection of the communication method in the model runner, and sharding of the MoE communication mask. My review identified a critical issue in the AscendFusedMoE layer where the all_gather operation uses the input tensor instead of the computed output, effectively discarding the results of the MoE computation. This needs to be addressed for the feature to function correctly.
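
For context, the pattern the review describes would look roughly like the sketch below; the function, the `experts` callable, and the import path are illustrative assumptions rather than the PR's actual code.

```python
import torch
from vllm.distributed import tensor_model_parallel_all_gather  # assumed import path


def moe_forward_sketch(experts, hidden_states: torch.Tensor,
                       router_logits: torch.Tensor) -> torch.Tensor:
    # Local expert computation on this rank's shard of tokens.
    local_out = experts(hidden_states, router_logits)

    # Buggy pattern flagged by the review: gathering the layer *input*
    # discards the MoE results computed above.
    # final = tensor_model_parallel_all_gather(hidden_states, dim=0)

    # Intended behavior: gather the locally computed *output* across
    # tensor-parallel ranks.
    final = tensor_model_parallel_all_gather(local_out, dim=0)
    return final
```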

@github-actions bot commented

This pull request has conflicts; please resolve them before we can evaluate it.

@yiz-liu changed the title from "[WIP][Feat] Add MC2 communication method for MoE layers" to "[Feat] Add MC2 communication method for MoE layers" on Aug 25, 2025
@yiz-liu (Collaborator, Author) commented Aug 25, 2025

Look into the memory problem with DummyCommImpl.

@yiz-liu (Collaborator, Author) commented Aug 25, 2025

  1. Fix UT;
  2. Delete NaiveMulticast related codes.

@yiz-liu (Collaborator, Author) commented Aug 25, 2025

> 1. Fix UT;
> 2. Delete NaiveMulticast related codes.

Fixed; please merge it at your earliest convenience, @wangxiyuan

@yiz-liu force-pushed the feat-mc2 branch 2 times, most recently from fb26435 to f8bf600, on August 26, 2025 at 04:48
@codecov bot commented Aug 26, 2025

Codecov Report

❌ Patch coverage is 32.00000% with 102 lines in your changes missing coverage. Please review.
✅ Project coverage is 77.81%. Comparing base (5d8ec28) to head (df99292).
⚠️ Report is 648 commits behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| vllm_ascend/distributed/moe_comm_method.py | 31.73% | 71 Missing ⚠️ |
| vllm_ascend/ops/common_fused_moe.py | 25.00% | 30 Missing ⚠️ |
| vllm_ascend/ascend_forward_context.py | 0.00% | 1 Missing ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2469      +/-   ##
==========================================
- Coverage   77.99%   77.81%   -0.18%     
==========================================
  Files         134      134              
  Lines       18498    18489       -9     
==========================================
- Hits        14427    14387      -40     
- Misses       4071     4102      +31     
| Flag | Coverage Δ |
|---|---|
| unittests | 77.81% <32.00%> (-0.18%) ⬇️ |


@github-actions bot commented

This pull request has conflicts; please resolve them before we can evaluate it.

@yiz-liu changed the title from "[Feat] Add MC2 communication method for MoE layers" to "[2/N][Feat] Add MC2 communication method for MoE layers" on Aug 26, 2025
This commit refactors the MoE communication method framework to improve modularity, clarity, and extensibility.

Key changes include:

- **Revised `MoECommMethod` Interface:**
  - Renamed `_pre_process` to `permute` and `_post_process` to `unpermute` for better clarity.
  - Introduced `prepare` and `finalize` methods to encapsulate logic that happens before and after the core MoE computation, such as tensor padding/splitting for MC2 and the final AllReduce.

- **Simplified `AscendFusedMoE`:**
  - The `forward_impl` is significantly simplified by delegating pre- and post-processing logic (padding, splitting, reduction) to the specific `MoECommMethod` implementation.
  - `AscendFusedMoE` now instantiates all communication method objects at initialization and selects the appropriate one at runtime based on a string identifier.

- **Centralized Expert Logic:**
  - Removed `unified_fused_experts` and introduced a new `fused_experts` function in `common_fused_moe.py`.
  - This new function utilizes the `permute`/`unpermute` methods from the `MoECommMethod` abstraction, decoupling the core expert logic from specific communication implementations.

- **Configuration and Invocation:**
  - The communication method is now selected and passed around as a string (e.g., "mc2", "allgather") instead of a class type, simplifying the invocation in the model runner.

These changes result in a cleaner separation of concerns, making the MoE implementation easier to understand, maintain, and extend with new communication strategies.
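
A minimal sketch of the revised interface, reconstructed only from the method names listed above; the signatures, argument names, and docstrings are assumptions, not the actual `MoECommMethod` code.

```python
from abc import ABC, abstractmethod

import torch


class MoECommMethod(ABC):
    """Sketch of the revised interface; all signatures here are assumptions."""

    @abstractmethod
    def prepare(self, hidden_states: torch.Tensor) -> torch.Tensor:
        """Pad or split the token tensor before the MoE computation (e.g. for MC2)."""

    @abstractmethod
    def permute(self, hidden_states: torch.Tensor, topk_ids: torch.Tensor) -> torch.Tensor:
        """Reorder tokens so each expert receives a contiguous block."""

    @abstractmethod
    def unpermute(self, expert_output: torch.Tensor, topk_weights: torch.Tensor) -> torch.Tensor:
        """Restore the original token order and combine per-expert outputs."""

    @abstractmethod
    def finalize(self, hidden_states: torch.Tensor) -> torch.Tensor:
        """Aggregate results after the MoE computation (e.g. the final AllReduce)."""
```

With this split, `AscendFusedMoE.forward_impl` reduces to roughly prepare → permute → expert computation → unpermute → finalize, and `fused_experts` only needs the `permute`/`unpermute` pair, which is what decouples the expert logic from any particular communication scheme.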

Signed-off-by: Yizhou Liu <[email protected]>
…zations and add MC2 group integration

I am proud to say MC2 is fully supported with ACL Graph now!

Signed-off-by: Yizhou Liu <[email protected]>
The test now uses the `FusedMoEConfig` for configuration instead of a generic `PretrainedConfig`.

It also calls the `permute` and `unpermute` methods on the communication implementation instance, rather than calling the `torch.ops` functions directly.

Signed-off-by: Yizhou Liu <[email protected]>
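
A hedged sketch of how a permute/unpermute round trip might be exercised on a communication implementation instance; every name, shape, and signature below is an assumption for illustration, not the repository's actual test code.

```python
import torch


def check_permute_unpermute_roundtrip(comm_impl, num_tokens: int = 8,
                                      hidden_size: int = 16, top_k: int = 2) -> None:
    """Permute then unpermute a random batch and check the token shape is preserved."""
    hidden_states = torch.randn(num_tokens, hidden_size)
    topk_ids = torch.randint(0, 4, (num_tokens, top_k))
    topk_weights = torch.rand(num_tokens, top_k)

    permuted = comm_impl.permute(hidden_states, topk_ids)
    restored = comm_impl.unpermute(permuted, topk_weights)

    assert restored.shape == hidden_states.shape
```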
Removes the `moe_comm_pre_process` and `moe_comm_post_process` custom operators and their associated registration logic. This simplifies the MoE communication implementation by integrating the pre-processing logic directly into the communication methods.

Additionally, this change removes the unused `NaiveAll2AllManager` from the NPU communicator and refactors helper function usage for getting the MC2 communication name.

Signed-off-by: Yizhou Liu <[email protected]>
@wangxiyuan merged commit a6bb502 into vllm-project:main on Aug 26, 2025, with 20 of 23 checks passed.
@yiz-liu deleted the feat-mc2 branch on August 27, 2025 at 01:26.
wangxiaoteng888 pushed a commit to LCAIZJ/vllm-ascend that referenced this pull request Sep 25, 2025
chopper0126 pushed a commit to chopper0126/vllm-ascend that referenced this pull request Sep 26, 2025
Angazenn pushed a commit to Angazenn/vllm-ascend that referenced this pull request Oct 21, 2025