
[Feature] Add token mask for DispatchGmmCombineDecode operator#5171

Merged
wangxiyuan merged 1 commit into vllm-project:main from wangqiankun13:add_mc2_mask
Dec 19, 2025

Conversation

@wangqiankun13 (Contributor) commented Dec 18, 2025

What this PR does / why we need it?

This PR adds an optional input x_active_mask to DispatchGmmCombineDecode; only tokens whose mask entry is True are dispatched and handled.
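
A minimal NumPy sketch of the masking semantics (dispatch_with_mask and the doubling step are illustrative stand-ins, not the operator's real dispatch/GMM/combine implementation):

```python
import numpy as np

def dispatch_with_mask(x, x_active_mask):
    """Process only the tokens whose mask entry is True; inactive rows stay zero."""
    active_idx = np.flatnonzero(x_active_mask)  # indices of active tokens
    active_tokens = x[active_idx]               # padding/inactive tokens are skipped
    processed = active_tokens * 2.0             # stand-in for dispatch + GMM + combine
    out = np.zeros_like(x)                      # inactive token slots are untouched
    out[active_idx] = processed
    return out

x = np.arange(8, dtype=np.float32).reshape(4, 2)  # 4 tokens, hidden size 2
mask = np.array([True, False, True, False])
print(dispatch_with_mask(x, mask))
```

Rows 1 and 3 (mask False) come back as zeros, mirroring how masked-off tokens are excluded from dispatch and computation.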

Does this PR introduce any user-facing change?

How was this patch tested?


@gemini-code-assist bot left a comment

Code Review

This pull request introduces an x_active_mask to the dispatch_gmm_combine_decode operation, which is a valuable optimization to skip computations for inactive (e.g., padding) tokens. The changes are comprehensive, touching the operator definition, tiling logic, kernel implementation, and Python bindings. My review has identified a couple of critical issues in the kernel logic for calculating the number of active tokens that could lead to incorrect behavior, as well as a minor issue with a misleading error message in the tiling logic. Addressing these points will ensure the correctness and robustness of this new feature.

Comment on lines +422 to +423
SumParams params{1, axisBsAlignSize_, axisBS_};
Sum(sumOutTensor, tempTensor, params);
critical

The srcStride parameter of SumParams is expected to be in units of elements, but axisBsAlignSize_ is a byte size. This is likely to cause an incorrect sum calculation for the active tokens. Using the simpler Sum overload that only takes the length should be correct here, as the data is contiguous.

        Sum(sumOutTensor, tempTensor, axisBS_);
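
To see why the unit mismatch matters, here is a hedged NumPy illustration of the bug class (strided_sum is a hypothetical stand-in for the reduction call; it is not the AscendC Sum API):

```python
import numpy as np

# A reduction that expects its stride in ELEMENTS, fed a stride in BYTES.
buf = np.arange(16, dtype=np.float32)
n = 4                                      # axisBS_: number of mask entries to sum
elem_stride = 1                            # contiguous data: stride of 1 element
byte_stride = elem_stride * buf.itemsize   # 4 -- what a byte size like axisBsAlignSize_ holds

def strided_sum(data, stride, count):
    """Sum data[0], data[stride], data[2*stride], ... for `count` terms."""
    return float(data[::stride][:count].sum())

print(strided_sum(buf, elem_stride, n))  # correct: 0+1+2+3 = 6.0
print(strided_sum(buf, byte_stride, n))  # wrong:   0+4+8+12 = 24.0
```

Passing the byte size makes the reduction skip three out of every four elements, so the active-token count comes out wrong even though the data itself is contiguous.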

Comment on lines +413 to +414
SumParams params{1, axisBsAlignSize_, axisBS_};
Sum(sumOutTensor, maskTmpTensor, params);

critical

The srcStride parameter for SumParams is expected to be in units of elements, but axisBsAlignSize_ is a byte size. This will likely lead to incorrect results when calculating the sum of active tokens. A simpler and more correct approach would be to use the Sum overload that takes the length directly, as the data is contiguous.

    Sum(sumOutTensor, maskTmpTensor, axisBS_);

Comment on lines +136 to +138
OPS_ERR_IF(xActiveMaskDim0 != batchSize, OPS_LOG_E(nodeName,
"gmm2WeightScale Dim0 must be batchSize(%u), but current dim is %lu.", batchSize, xActiveMaskDim0),
return ge::GRAPH_FAILED);

high

The error message in this log appears to be a copy-paste error from another check. It refers to gmm2WeightScale Dim0 when it should be referring to xActiveMask Dim0. This could be misleading during debugging.

        OPS_ERR_IF(xActiveMaskDim0 != batchSize, OPS_LOG_E(nodeName,
                    "xActiveMask Dim0 must be batchSize(%u), but current dim is %lu.", batchSize, xActiveMaskDim0),
                    return ge::GRAPH_FAILED);
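
For context, an equivalent check in Python would read like this (check_x_active_mask is hypothetical; the actual validation lives in the C++ tiling code):

```python
import numpy as np

def check_x_active_mask(x_active_mask, batch_size):
    """Validate that the optional mask is 1-D with length batch_size."""
    if x_active_mask is None:
        return  # the mask input is optional
    if x_active_mask.ndim != 1 or x_active_mask.shape[0] != batch_size:
        raise ValueError(
            f"xActiveMask Dim0 must be batchSize ({batch_size}), "
            f"but current shape is {x_active_mask.shape}.")

check_x_active_mask(np.ones(8, dtype=bool), 8)  # passes silently
```

The error text names the input being checked, which is exactly what the suggested fix restores in the C++ log message.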

@github-actions

👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:

  • A PR should do only one thing; smaller PRs enable faster reviews.
  • Every PR should include unit tests and end-to-end tests to ensure it works and is not broken by future PRs.
  • Write the commit message by filling out the PR description to help reviewers and future developers understand it.

If CI fails, you can run the linting and testing checks locally according to Contributing and Testing.

DispatchGmmCombineDecode supports a one-dimensional x_active_mask, with which only tokens masked True are dispatched and handled.

Signed-off-by: wangqiankun <wangqiankun13@huawei.com>
@wangqiankun13 wangqiankun13 changed the title Add mc2 mask [Feature] Add token mask for DispatchGmmCombineDecode operator Dec 19, 2025
@wangxiyuan wangxiyuan merged commit 118b0ed into vllm-project:main Dec 19, 2025
25 checks passed
845473182 pushed a commit to 845473182/vllm-ascend that referenced this pull request Dec 19, 2025
…to eplb_refactor

* 'main' of https://github.com/vllm-project/vllm-ascend: (52 commits)
  [Doc]Add the user_guide doc file regarding fine-grained TP. (vllm-project#5084)
  [pref] qwen3_next add triton ops : fused_sigmoid_gating_delta_rule_update (vllm-project#4818)
  [Feature] Add token mask for DispatchGmmCombineDecode operator (vllm-project#5171)
  [CI] Improve CI (vllm-project#5078)
  [Refactor] remove some metadata variables in attention_v1. (vllm-project#5160)
  Add Qwen3-VL-235B-A22B-Instruct tutorials (vllm-project#5167)
  [Doc] Add a perf tune section (vllm-project#5127)
  [Image] Refactor image build (vllm-project#5175)
  [refactor] refactor weight trans nz and transpose (vllm-project#4878)
  [BugFix]Fix precision issue for LoRA feature (vllm-project#4141)
  【Doc】Deepseekv3.1/R1 doc enhancement (vllm-project#4827)
  support basic long_seq feature st (vllm-project#5140)
  [Bugfix] install trition for test_custom_op (vllm-project#5112)
  [2/N][Pangu][MoE] Remove Pangu Related Code (vllm-project#5130)
  [bugfix] Use FUSED_MC2 MoE comm path for the op `dispatch_ffn_combine` (vllm-project#5156)
  [BugFix] Fix top_p,top_k issue with EAGLE and add top_p,top_k in EAGLE e2e (vllm-project#5131)
  [Doc][P/D] Fix MooncakeConnector's name (vllm-project#5172)
  [Bugfix] Fix in_profile_run in mtp_proposer dummy_run (vllm-project#5165)
  [Doc] Refact benchmark doc (vllm-project#5173)
  [Nightly]  Avoid max_model_len being smaller than the decoder prompt to prevent single-node-accuray-tests from failing (vllm-project#5174)
  ...

Signed-off-by: 白永斌 <baiyongbin3@h-partners.com>
chenaoxuan pushed a commit to chenaoxuan/vllm-ascend that referenced this pull request Dec 20, 2025
…project#5171)

### What this PR does / why we need it?
In this PR, DispatchGmmCombineDecode adds an optional input x_active_mask, with which only tokens masked True are dispatched and handled.


- vLLM version: v0.12.0
- vLLM main:
vllm-project/vllm@ad32e3e

Signed-off-by: wangqiankun <wangqiankun13@huawei.com>
ZRJ026 pushed a commit to ZRJ026/vllm-ascend that referenced this pull request Feb 28, 2026
…project#5171)

### What this PR does / why we need it?
In this PR, DispatchGmmCombineDecode adds an optional input x_active_mask, with which only tokens masked True are dispatched and handled.

- vLLM version: v0.12.0
- vLLM main:
vllm-project/vllm@ad32e3e

Signed-off-by: wangqiankun <wangqiankun13@huawei.com>
Signed-off-by: zrj026 <zhangrunjiang026@gmail.com>
ZRJ026 pushed a commit to ZRJ026/vllm-ascend that referenced this pull request Mar 4, 2026