[Feature] Use DispatchGmmCombineDecode operator to replace MC2 (Optional) #5040
Conversation
Code Review
This pull request introduces the DispatchGmmCombineDecode operator as a new fused MoE communication method (FUSED_MC2) to replace the existing MC2 implementation for w8a8_dynamic quantization on Ascend hardware. The changes include adding the necessary enum, environment variable, and implementation logic. My review focuses on improving code clarity and fixing a potential bug related to handling shared experts. I've identified a misleading docstring, hardcoded values that should be parameterized, and a complex conditional that could be simplified for better maintainability.
shared_expert_num=1,
shared_expert_rank_num=0,
The parameters shared_expert_num and shared_expert_rank_num are hardcoded. This implementation ignores the shared_experts parameter passed to the function, which likely contains the necessary information for handling shared experts. This will lead to incorrect behavior when shared experts are used. The values should be derived from the function arguments to correctly support shared experts.
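One way to address this, sketched below with illustrative names (not the actual vllm-ascend API), is to derive the shared-expert parameters from the function arguments rather than hardcoding `shared_expert_num=1, shared_expert_rank_num=0`:

```python
# Hypothetical sketch: derive the shared-expert parameters from the
# arguments instead of hardcoding them. Names are illustrative only.
def shared_expert_params(shared_experts, ep_world_size):
    """Return (shared_expert_num, shared_expert_rank_num) for dispatch."""
    if shared_experts is None:
        # No shared experts configured: fall back to the current defaults.
        return 1, 0
    shared_expert_num = len(shared_experts)
    # Ranks reserved for shared experts, bounded by the EP world size.
    shared_expert_rank_num = min(shared_expert_num, ep_world_size)
    return shared_expert_num, shared_expert_rank_num
```

The exact mapping from `shared_experts` to these two values depends on how the dispatch operator interprets them, so this is a starting point, not a definitive fix.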
"""This implementation is for the scenarios listed below:
1. `enable_expert_parallel=True`.
2. `npu_moe_distribute_dispatch` and `npu_moe_distribute_combine` are available.
3. `enable_expert_parallel=False` is not supported.

This implementation uses the FusedMC2 communication method, which is optimized for
Communication and Computation parallelism on Ascend devices.
"""
The docstring for FusedMC2CommImpl appears to be copied from MC2CommImpl and is misleading. It mentions npu_moe_distribute_dispatch and npu_moe_distribute_combine, but this implementation uses the dispatch_gmm_combine_decode operator. The docstring should be updated to reflect the actual implementation and its requirements, such as being specific to w8a8_dynamic quantization, to improve maintainability.
Suggested change:
-"""This implementation is for the scenarios listed below:
-1. `enable_expert_parallel=True`.
-2. `npu_moe_distribute_dispatch` and `npu_moe_distribute_combine` are available.
-3. `enable_expert_parallel=False` is not supported.
-This implementation uses the FusedMC2 communication method, which is optimized for
-Communication and Computation parallelism on Ascend devices.
-"""
+"""This implementation is for the scenarios listed below:
+1. `enable_expert_parallel=True`.
+2. `VLLM_ASCEND_ENABLE_FUSED_MC2` is enabled.
+3. `w8a8_dynamic` quantization is used.
+This implementation uses the `dispatch_gmm_combine_decode` operator, which is a fused
+operator for MoE decoding that combines communication and computation for optimization
+on Ascend devices.
+"""
-moe_comm_type = (
-    MoECommType.MC2 if num_tokens <= mc2_tokens_capacity else
-    MoECommType.FUSED_ALLTOALL if quant_type == "w8a8_dynamic"
-    and get_ep_group().world_size <= 16 else MoECommType.ALLTOALL)
+moe_comm_type = (
+    (MoECommType.FUSED_MC2 if envs_ascend.VLLM_ASCEND_ENABLE_FUSED_MC2 and quant_type == "w8a8_dynamic"
+     else MoECommType.MC2) if num_tokens <= mc2_tokens_capacity
+    else MoECommType.FUSED_ALLTOALL if quant_type == "w8a8_dynamic" and get_ep_group().world_size <= 16
+    else MoECommType.ALLTOALL)
This nested ternary expression is difficult to read and maintain. Consider refactoring it into a more explicit if/else structure to improve clarity.
Suggested change:
-moe_comm_type = (
-    MoECommType.MC2 if num_tokens <= mc2_tokens_capacity else
-    MoECommType.FUSED_ALLTOALL if quant_type == "w8a8_dynamic"
-    and get_ep_group().world_size <= 16 else MoECommType.ALLTOALL)
-moe_comm_type = (
-    (MoECommType.FUSED_MC2 if envs_ascend.VLLM_ASCEND_ENABLE_FUSED_MC2 and quant_type == "w8a8_dynamic"
-     else MoECommType.MC2) if num_tokens <= mc2_tokens_capacity
-    else MoECommType.FUSED_ALLTOALL if quant_type == "w8a8_dynamic" and get_ep_group().world_size <= 16
-    else MoECommType.ALLTOALL)
+if num_tokens <= mc2_tokens_capacity:
+    if envs_ascend.VLLM_ASCEND_ENABLE_FUSED_MC2 and quant_type == "w8a8_dynamic":
+        moe_comm_type = MoECommType.FUSED_MC2
+    else:
+        moe_comm_type = MoECommType.MC2
+elif quant_type == "w8a8_dynamic" and get_ep_group().world_size <= 16:
+    moe_comm_type = MoECommType.FUSED_ALLTOALL
+else:
+    moe_comm_type = MoECommType.ALLTOALL
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:
If CI fails, you can run linting and testing checks locally according to Contributing and Testing.
This pull request has conflicts, please resolve those before we can evaluate the pull request.
    and get_ep_group().world_size <= 16 else MoECommType.ALLTOALL)
if num_tokens <= mc2_tokens_capacity:
    if envs_ascend.VLLM_ASCEND_ENABLE_FUSED_MC2 and quant_type == "w8a8_dynamic":
        moe_comm_type = MoECommType.FUSED_MC2
Remove this env variable.
As a new conclusion, we use the same env variable as the dispatch_fnn_combine operator.
Since the operator currently only supports w8a8_dynamic, it is necessary to disable the usage of fused_mc2 in mtp_proposer.py in case the dtype of mtp is bfloat16 (note that both _dummy_run and propose should be changed):
Option 1 (Recommended):
Recognize the quant_type of the mtp layer (e.g., through the instance class of FusedMoE) to decide the moe_comm_method of this layer.
Option 2:
Refer to #4751 and #4947, using hardcoding to disable fused_op paths in mtp_proposer. But please add a note here and a future plan.
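Option 1 could be sketched roughly as below. This is a minimal, hypothetical illustration — the enum values, the helper name, and the way the layer's quant type is obtained are all assumptions, not the actual vllm-ascend code:

```python
# Hypothetical sketch of Option 1: choose the MoE comm method per layer
# from that layer's quant type, so an MTP layer running in bfloat16 never
# takes the fused path. Enum values and names are illustrative.
from enum import Enum

class MoECommType(Enum):
    MC2 = "mc2"
    FUSED_MC2 = "fused_mc2"

def select_comm_type(layer_quant_type, fused_mc2_enabled):
    """Only w8a8_dynamic layers may use the fused operator."""
    if fused_mc2_enabled and layer_quant_type == "w8a8_dynamic":
        return MoECommType.FUSED_MC2
    return MoECommType.MC2
```

Deciding per layer avoids the global hardcoding of Option 2, at the cost of plumbing the quant type through to the comm-method selection.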
…bineDecode.

This commit adds model-side integration for the previously introduced experimental AscendC fused operator DispatchGmmCombineDecode, used in MoE decoding. The operator implementation itself was added in a prior PR vllm-project#4139. This change only adapts the model execution path to optionally use the fused operator.
When the environment variable VLLM_ASCEND_ENABLE_FUSED_MC2=2 is set, the original MC2 path composed of multiple operators (A8W8 dispatch → GMM → SwiGLU → GMM → combine) is replaced by the single fused operator DispatchGmmCombineDecode. By default, the existing multi-operator MC2 implementation is preserved.

Signed-off-by: wangqiankun <wangqiankun13@huawei.com>
Since FUSED_MC2 must be enabled by an env variable now, my operator will not incur issues by default. I will add the mtp guard later and have added a note and future plan here.

LGTM
…issue 5476] (#5932)

### What this PR does / why we need it?
In [PR 5040](#5040), the `dispatch_gmm_combine_decode` operator was configured with an incorrect global_bs parameter. This PR is to fix the bug.
The global_bs provided as input should have the same meaning as in the `moe_distributed_dispatch` operator, specifically: (the maximum batch size across all cards) * (expert parallel world size). However, the implementation incorrectly used the variable max_num_tokens, which does not account for tensor parallelism. This error likely resulted in an unnecessarily large (overestimated) value.
For more info about this operator, please refer to RFC issue #5476.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Acc test qwen3-235b eplb on a single A3 node (ep16), with dispatch_gmm_combine_decode

| dataset | version | metric | mode | vllm-api-stream-chat |
|-----|-----|-----|-----|-----|
| aime2024 | 604a78 | accuracy | gen | 80.00 |

- vLLM version: v0.13.0
- vLLM main: vllm-project/vllm@11b6af5

Signed-off-by: wangqiankun <wangqiankun13@huawei.com>
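The corrected computation described above can be sketched in one line; the function and parameter names are illustrative, not the actual vllm-ascend code:

```python
# Minimal sketch of the corrected global_bs computation:
# (maximum batch size across all cards) * (expert parallel world size).
# The bug was using max_num_tokens, which ignores tensor parallelism and
# therefore overestimated the value.
def compute_global_bs(max_batch_size_per_card: int, ep_world_size: int) -> int:
    return max_batch_size_per_card * ep_world_size
```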
…m_combine_decode (#5931)

### What this PR does / why we need it?
This PR is cherry-picked from [PR5932](#5932).
In #5040, the dispatch_gmm_combine_decode operator was configured with an incorrect global_bs parameter. This PR is to fix the bug.
The global_bs provided as input should have the same meaning as in the moe_distributed_dispatch operator, specifically: (the maximum batch size across all cards) * (expert parallel world size). However, the implementation incorrectly used the variable max_num_tokens, which does not account for tensor parallelism. This error likely resulted in an unnecessarily large (overestimated) value.
For more info about this operator, please refer to RFC issue #5476.

Signed-off-by: wangqiankun <wangqiankun13@huawei.com>
…l) (vllm-project#5040)

### What this PR does / why we need it?
This PR adds model-side integration for the previously introduced experimental AscendC fused operator DispatchGmmCombineDecode, used in MoE decoding. The operator implementation itself was added in a prior PR [vllm-project#4139](vllm-project#4139). This change only adapts the model execution path to optionally use the fused operator.
When the environment variable VLLM_ASCEND_ENABLE_FUSED_MC2=2 is set, the original MC2 path composed of multiple operators (A8W8 dispatch → GMM → SwiGLU → GMM → combine) might be replaced by the single fused operator DispatchGmmCombineDecode. By default, the existing multi-operator MC2 implementation is preserved.

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?
- vLLM version: v0.12.0
- vLLM main: vllm-project/vllm@ad32e3e

Signed-off-by: wangqiankun <wangqiankun13@huawei.com>
Signed-off-by: zrj026 <zhangrunjiang026@gmail.com>
### What this PR does / why we need it?
This PR adds model-side integration for the previously introduced experimental AscendC fused operator DispatchGmmCombineDecode, used in MoE decoding.
The operator implementation itself was added in a prior PR #4139.
This change only adapts the model execution path to optionally use the fused operator.
When the environment variable VLLM_ASCEND_ENABLE_FUSED_MC2=2 is set, the original MC2 path composed of multiple operators (A8W8 dispatch → GMM → SwiGLU → GMM → combine) might be replaced by the single fused operator DispatchGmmCombineDecode.
By default, the existing multi-operator MC2 implementation is preserved.
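The opt-in gating described above boils down to a single environment-variable check; the helper name below is illustrative, not the actual vllm-ascend code:

```python
# Illustrative sketch of the opt-in gating: the fused operator path is
# taken only when VLLM_ASCEND_ENABLE_FUSED_MC2=2 is set in the environment;
# otherwise the default multi-operator MC2 path is kept.
import os

def use_fused_mc2() -> bool:
    return os.environ.get("VLLM_ASCEND_ENABLE_FUSED_MC2") == "2"
```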
### Does this PR introduce any user-facing change?

### How was this patch tested?