[bugfix] Use FUSED_MC2 MoE comm path for the op `dispatch_ffn_combine` (#5156)

wangxiyuan merged 2 commits into vllm-project:main

Conversation
Code Review
This pull request refactors the Mixture-of-Experts (MoE) communication path by renaming FUSED_ALLTOALL to FUSED_MC2 and enabling it for the dispatch_ffn_combine op on Ascend A3 devices under specific quantization and parallelism settings. The changes significantly improve code clarity, especially in the select_moe_comm_method function, by simplifying the selection logic. The new FusedMC2CommImpl correctly utilizes the dispatch_ffn_combine custom op. My main feedback concerns a type hint mismatch in the fused_experts method, which could impact maintainability. Overall, this is a good improvement for enabling a more performant MoE path.
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:

If CI fails, you can run linting and testing checks locally according to Contributing and Testing.
Signed-off-by: Chen Chen <0109chenchen@gmail.com>
```python
fused_mc2_enable = quant_type == "w8a8_dynamic" and get_ep_group(
).world_size <= 16 and (not dynamic_eplb) and (not is_mtp_model)
if fused_mc2_enable:
    moe_comm_type = MoECommType.FUSED_MC2
else:
    moe_comm_type = (MoECommType.MC2 if num_tokens
                     <= mc2_tokens_capacity else MoECommType.ALLTOALL)
```
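The selection logic above can be sketched as a standalone function. This is a simplified illustration, not the actual vllm-ascend code: the `ep_world_size` parameter stands in for `get_ep_group().world_size`, and the other arguments stand in for values the real `select_moe_comm_method` derives from the model and quant config.

```python
from enum import Enum, auto


class MoECommType(Enum):
    FUSED_MC2 = auto()
    MC2 = auto()
    ALLTOALL = auto()


def select_moe_comm_method(quant_type: str, ep_world_size: int,
                           dynamic_eplb: bool, is_mtp_model: bool,
                           num_tokens: int,
                           mc2_tokens_capacity: int) -> MoECommType:
    # FUSED_MC2 requires w8a8_dynamic quantization, EP <= 16, and
    # neither dynamic EPLB nor an MTP model.
    fused_mc2_enable = (quant_type == "w8a8_dynamic"
                        and ep_world_size <= 16
                        and not dynamic_eplb
                        and not is_mtp_model)
    if fused_mc2_enable:
        return MoECommType.FUSED_MC2
    # Otherwise fall back to MC2 while the token count fits the
    # preallocated MC2 capacity, else plain all-to-all.
    return (MoECommType.MC2 if num_tokens <= mc2_tokens_capacity
            else MoECommType.ALLTOALL)
```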
We may need to be more careful here.
Signed-off-by: Chen Chen <0109chenchen@gmail.com>
```python
                 else MoECommType.FUSED_ALLTOALL
                 if fused_all2all_enable else MoECommType.ALLTOALL)

fused_mc2_enable = envs_ascend.VLLM_ASCEND_ENABLE_FUSED_MC2 and quant_type == "w8a8_dynamic" and get_ep_group(
).world_size <= 16 and (not dynamic_eplb) and (not is_mtp_model)
```
Change `is_mtp_model` to check the model type instead.
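Beyond the runtime conditions, the fused path is additionally gated by the `VLLM_ASCEND_ENABLE_FUSED_MC2` environment variable shown in the snippet above. A minimal sketch of such a gate follows; the exact parsing rules here are an assumption, not the vllm-ascend `envs_ascend` implementation.

```python
import os


def fused_mc2_env_enabled() -> bool:
    # Assumption: treat "1"/"true" (case-insensitive) as enabled,
    # disabled by default; vllm-ascend's env handling may differ.
    value = os.environ.get("VLLM_ASCEND_ENABLE_FUSED_MC2", "0")
    return value.lower() in ("1", "true")
```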
### What this PR does / why we need it?

- Renames the MoE comm enum value `MoECommType.FUSED_ALLTOALL` to `MoECommType.FUSED_MC2` and updates all call sites.
- Updates `select_moe_comm_method` to optionally select `FUSED_MC2` on Ascend A3 when:
  - `enable_expert_parallel=True`
  - quantization is `w8a8_dynamic`
  - `EP <= 16`
  - `dynamic_eplb` is disabled
  - `is_mtp_model = False`
- Replaces the old "fused all-to-all" comm implementation with `FusedMC2CommImpl`, using `TokenDispatcherWithMC2` / `PrepareAndFinalizeWithMC2` and `dispatch_ffn_combine`.
- vLLM version: v0.12.0
- vLLM main: vllm-project/vllm@ad32e3e

Signed-off-by: Chen Chen <0109chenchen@gmail.com>
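The shape of the new comm implementation can be pictured as a registry entry. This is a hypothetical minimal sketch with placeholder bodies: the base class, registry, and `fused_experts` signature here are illustrative assumptions, while the real `FusedMC2CommImpl` wires up `TokenDispatcherWithMC2`, `PrepareAndFinalizeWithMC2`, and the `dispatch_ffn_combine` custom op inside vllm-ascend.

```python
class MoECommImpl:
    """Hypothetical base class for MoE communication strategies."""

    def fused_experts(self, hidden_states, router_logits):
        raise NotImplementedError


class FusedMC2CommImpl(MoECommImpl):
    """Replaces the old fused all-to-all impl; per the PR description it
    dispatches tokens, runs the expert FFN, and combines the results in
    one fused step via the dispatch_ffn_combine custom op."""

    def fused_experts(self, hidden_states, router_logits):
        # Placeholder for the fused dispatch -> FFN -> combine kernel;
        # the real op runs on-device over the MC2 HCCL buffer.
        return hidden_states


# Registry keyed by comm-type name (strings stand in for the enum).
MOE_COMM_IMPLS = {"FUSED_MC2": FusedMC2CommImpl}
```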
…tch_ffn_c… (#5284)

### What this PR does / why we need it?

- Removes the Expert Parallel (EP) HCCL buffer allocation that was previously introduced by the fused op `dispatch_ffn_combine` (#3532), since the fused op has switched to the MC2 HCCL buffer (#5156).

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: release/v0.13.0
- vLLM main: vllm-project/vllm@ad32e3e

Signed-off-by: Chen Chen <0109chenchen@gmail.com>