[2/N][Feat] Add MC2 communication method for MoE layers #2469
Conversation
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:
If CI fails, you can run linting and testing checks locally according to Contributing and Testing.
Code Review
This pull request introduces a new MC2 communication method for MoE layers, designed to optimize performance for smaller token counts. The changes include a new AscendFusedMoE layer, dynamic selection of the communication method in the model runner, and sharding of the MoE communication mask. My review identified a critical issue in the AscendFusedMoE layer where the all_gather operation uses the input tensor instead of the computed output, effectively discarding the results of the MoE computation. This needs to be addressed for the feature to function correctly.
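For illustration only, a minimal sketch of the failure mode the review describes; the tensor and group names (`hidden_states`, `local_out`, `tp_group`) are hypothetical and not taken from the actual layer code:

```python
import torch
import torch.distributed as dist

def moe_forward_sketch(hidden_states: torch.Tensor, experts, tp_group) -> torch.Tensor:
    # Each rank runs the experts on its local slice of tokens.
    local_out = experts(hidden_states)

    world_size = dist.get_world_size(group=tp_group)
    gathered = [torch.empty_like(local_out) for _ in range(world_size)]

    # Bug pattern flagged above: gathering the *input* would throw away the
    # expert computation entirely.
    #   dist.all_gather(gathered, hidden_states, group=tp_group)

    # Intended behavior: gather the computed output.
    dist.all_gather(gathered, local_out, group=tp_group)
    return torch.cat(gathered, dim=0)
```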
This pull request has conflicts, please resolve those before we can evaluate the pull request.
Look into the memory problem with …
Fixed, please merge it at your earliest convenience, @wangxiyuan |
Force-pushed from fb26435 to f8bf600.
Codecov Report

❌ Patch coverage is …

Additional details and impacted files:

@@            Coverage Diff             @@
##             main    #2469      +/-   ##
==========================================
- Coverage   77.99%   77.81%   -0.18%
==========================================
  Files         134      134
  Lines       18498    18489       -9
==========================================
- Hits        14427    14387      -40
- Misses       4071     4102      +31

Flags with carried forward coverage won't be shown.

☔ View full report in Codecov by Sentry.
This pull request has conflicts, please resolve those before we can evaluate the pull request.
This method replaces the previous all-gather approach for small numbers of tokens. The key changes include:

- A new `AscendFusedMoE` layer that handles token splitting, local computation, and final aggregation via all-gather.
- Logic in the model runner to dynamically select between the new MC2 method and the existing all-gather method based on the number of input tokens.
- Sharding the MoE communication mask across tensor-parallel ranks.

Signed-off-by: Yizhou Liu <[email protected]>
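A rough sketch of the dynamic selection and mask sharding described in this commit, assuming a hypothetical token threshold and helper names (the real dispatch lives in the model runner):

```python
import torch

# Hypothetical cut-over point; the actual value is decided by the model runner.
MC2_TOKEN_THRESHOLD = 256

def select_moe_comm_method(num_input_tokens: int) -> str:
    """Use MC2 for small token counts, the existing all-gather path otherwise."""
    return "mc2" if num_input_tokens <= MC2_TOKEN_THRESHOLD else "allgather"

def shard_moe_comm_mask(mask: torch.Tensor, tp_size: int, tp_rank: int) -> torch.Tensor:
    """Shard the MoE communication mask along the token dimension so each
    tensor-parallel rank only handles its own slice (assumes an even split)."""
    return mask.chunk(tp_size, dim=0)[tp_rank]
```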
This commit refactors the MoE communication method framework to improve modularity, clarity, and extensibility. Key changes include:

- **Revised `MoECommMethod` Interface:**
  - Renamed `_pre_process` to `permute` and `_post_process` to `unpermute` for better clarity.
  - Introduced `prepare` and `finalize` methods to encapsulate logic that happens before and after the core MoE computation, such as tensor padding/splitting for MC2 and the final AllReduce.
- **Simplified `AscendFusedMoE`:**
  - The `forward_impl` is significantly simplified by delegating pre- and post-processing logic (padding, splitting, reduction) to the specific `MoECommMethod` implementation.
  - `AscendFusedMoE` now instantiates all communication method objects at initialization and selects the appropriate one at runtime based on a string identifier.
- **Centralized Expert Logic:**
  - Removed `unified_fused_experts` and introduced a new `fused_experts` function in `common_fused_moe.py`.
  - This new function utilizes the `permute`/`unpermute` methods from the `MoECommMethod` abstraction, decoupling the core expert logic from specific communication implementations.
- **Configuration and Invocation:**
  - The communication method is now selected and passed around as a string (e.g., "mc2", "allgather") instead of a class type, simplifying the invocation in the model runner.

These changes result in a cleaner separation of concerns, making the MoE implementation easier to understand, maintain, and extend with new communication strategies.

Signed-off-by: Yizhou Liu <[email protected]>
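A condensed, hypothetical outline of the interface shape described above; the class names, method bodies, and the registry dict are illustrative rather than the actual definitions:

```python
from abc import ABC, abstractmethod

import torch

class MoECommMethod(ABC):
    """prepare/finalize wrap the whole MoE call; permute/unpermute wrap the
    core expert computation."""

    @abstractmethod
    def prepare(self, hidden_states: torch.Tensor) -> torch.Tensor:
        """Pre-MoE work, e.g. padding or splitting tokens for MC2."""

    @abstractmethod
    def permute(self, hidden_states: torch.Tensor, topk_ids: torch.Tensor) -> torch.Tensor:
        """Reorder tokens per expert (previously `_pre_process`)."""

    @abstractmethod
    def unpermute(self, expert_output: torch.Tensor) -> torch.Tensor:
        """Restore the original token order (previously `_post_process`)."""

    @abstractmethod
    def finalize(self, hidden_states: torch.Tensor) -> torch.Tensor:
        """Post-MoE work, e.g. the final AllReduce."""


class AscendFusedMoESketch:
    def __init__(self, comm_methods: dict):
        # All communication method objects are built at initialization...
        self._comm_methods = comm_methods

    def forward_impl(self, hidden_states, topk_ids, comm_name: str):
        # ...and one is picked at runtime by its string identifier
        # (e.g. "mc2" or "allgather").
        comm = self._comm_methods[comm_name]
        x = comm.prepare(hidden_states)
        x = comm.permute(x, topk_ids)
        x = self._run_experts(x)   # core expert computation, omitted here
        x = comm.unpermute(x)
        return comm.finalize(x)

    def _run_experts(self, x):
        return x  # placeholder for the fused expert kernel
```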
…zations and add MC2 group integration

I am proud to say MC2 is fully supported with ACL Graph now!

Signed-off-by: Yizhou Liu <[email protected]>
The test now uses the `FusedMoEConfig` for configuration instead of a generic `PretrainedConfig`. It also calls the `permute` and `unpermute` methods on the communication implementation instance, rather than calling the `torch.ops` functions directly.

Signed-off-by: Yizhou Liu <[email protected]>
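A toy pytest-style round-trip along the lines of that call pattern, using a trivial stand-in implementation rather than the real test, `FusedMoEConfig`, or communication classes:

```python
import torch

class IdentityComm:
    """Stand-in comm implementation whose permute/unpermute are no-ops."""

    def permute(self, hidden_states, topk_ids):
        return hidden_states

    def unpermute(self, expert_output):
        return expert_output

def test_permute_unpermute_roundtrip():
    comm = IdentityComm()
    x = torch.randn(8, 16)
    topk_ids = torch.zeros(8, 2, dtype=torch.long)
    assert torch.equal(comm.unpermute(comm.permute(x, topk_ids)), x)
```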
Removes the `moe_comm_pre_process` and `moe_comm_post_process` custom operators and their associated registration logic. This simplifies the MoE communication implementation by integrating the pre-processing logic directly into the communication methods. Additionally, this change removes the unused `NaiveAll2AllManager` from the NPU communicator and refactors helper function usage for getting the MC2 communication name.

Signed-off-by: Yizhou Liu <[email protected]>
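Roughly the call-site change this implies, in sketch form; the function name and the operator namespace shown in the comments are assumptions, not the exact vllm-ascend symbols:

```python
import torch

def fused_moe_forward_sketch(hidden_states: torch.Tensor,
                             topk_ids: torch.Tensor,
                             comm_method) -> torch.Tensor:
    # Before: pre/post-processing was routed through registered custom ops,
    # roughly torch.ops.<namespace>.moe_comm_pre_process(...) and
    # torch.ops.<namespace>.moe_comm_post_process(...).
    # After: the same logic is a plain method call on the comm instance.
    permuted = comm_method.permute(hidden_states, topk_ids)
    expert_out = permuted  # core expert computation omitted in this sketch
    return comm_method.unpermute(expert_out)
```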
…#2469)

### What this PR does / why we need it?
This method replaces the previous all-gather approach for small numbers of tokens. The key changes include:
- A new `AscendFusedMoE` layer that handles token splitting, local computation, and final aggregation via all-gather.
- Logic in the model runner to dynamically select between the new MC2 method and the existing all-gather method based on the number of input tokens.
- Sharding the MoE communication mask across tensor-parallel ranks.

### Does this PR introduce _any_ user-facing change?
None.

### How was this patch tested?
Test case fixed.

- vLLM version: v0.10.1.1
- vLLM main: vllm-project/vllm@b00e69f

---------

Signed-off-by: Yizhou Liu <[email protected]>
What this PR does / why we need it?
This method replaces the previous all-gather approach for small numbers of tokens.
The key changes include:
- A new `AscendFusedMoE` layer that handles token splitting, local computation, and final aggregation via all-gather.
- Logic in the model runner to dynamically select between the new MC2 method and the existing all-gather method based on the number of input tokens.
- Sharding the MoE communication mask across tensor-parallel ranks.

Does this PR introduce any user-facing change?
None.
How was this patch tested?
Test case fixed.