[BugFix] Fix input parameter bug of dispatch_gmm_combine_decode[RFC: issue 5476]#5932

Merged
wangxiyuan merged 1 commit into vllm-project:main from wangqiankun13:fix-global-bs-bug
Jan 21, 2026

Conversation

@wangqiankun13
Contributor

@wangqiankun13 wangqiankun13 commented Jan 15, 2026

What this PR does / why we need it?

In PR 5040, the `dispatch_gmm_combine_decode` operator was configured with an incorrect `global_bs` parameter. This PR fixes that bug.

The `global_bs` input should have the same meaning as in the `moe_distributed_dispatch` operator, specifically: (the maximum batch size across all cards) * (expert-parallel world size).
However, the implementation incorrectly used the variable `max_num_tokens`, which does not account for tensor parallelism, so the resulting value was likely unnecessarily large (overestimated).

For more information about this operator, please refer to RFC issue #5476.
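The intended computation described above can be sketched as follows. This is a minimal illustration of the idea, not the actual vllm-ascend code: the function and parameter names (`compute_global_bs`, `max_num_tokens`, `tp_size`, `ep_size`) are assumptions.

```python
# Hypothetical sketch of the global_bs computation described above; names
# are illustrative, not the actual vllm-ascend identifiers.
def compute_global_bs(max_num_tokens: int, tp_size: int, ep_size: int) -> int:
    """global_bs = (maximum batch size across all cards) * (EP world size).

    With tensor parallelism, tokens are split across the TP group, so the
    per-card maximum is max_num_tokens divided by tp_size, not the raw
    max_num_tokens the buggy code used.
    """
    max_tokens_per_card = -(-max_num_tokens // tp_size)  # ceiling division
    return max_tokens_per_card * ep_size

# With settings like the test below (tp=4, ep16, 16384 batched tokens):
fixed = compute_global_bs(16384, tp_size=4, ep_size=16)  # 4096 * 16 = 65536
buggy = 16384 * 16                                       # TP ignored: 262144
assert fixed < buggy
```

Using the raw `max_num_tokens` over-allocates by a factor of `tp_size`, which matches the "overestimated" behavior the description reports.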

Does this PR introduce any user-facing change?

No

How was this patch tested?

Accuracy test: Qwen3-235B with EPLB on a single A3 node (ep16),
with `dispatch_gmm_combine_decode` enabled.

nic_name="xxxx"
local_ip="xxx.xxx.xxx.xxx"

export HCCL_IF_IP=$local_ip
export GLOO_SOCKET_IFNAME=$nic_name
export TP_SOCKET_IFNAME=$nic_name
export HCCL_SOCKET_IFNAME=$nic_name

export VLLM_ASCEND_ENABLE_FUSED_MC2=2
echo "VLLM_ASCEND_ENABLE_FUSED_MC2=${VLLM_ASCEND_ENABLE_FUSED_MC2}"

export HCCL_OP_EXPANSION_MODE="AIV"
export HCCL_BUFFSIZE=512
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export OMP_PROC_BIND=false
export OMP_NUM_THREADS=10
export LD_PRELOAD=/usr/lib/aarch64-linux-gnu/libjemalloc.so.2:$LD_PRELOAD
vllm serve /dataset/Qwen3-235B-A22B-Instruct-2507-w8a8-QuaRot/ \
        --served-model-name "qwen" \
        --host 0.0.0.0 \
        --port 8004 \
        --async-scheduling \
        --tensor-parallel-size 4 \
        --data-parallel-size 4 \
        --max-num-seqs 64 \
        --max-model-len 40960 \
        --max-num-batched-tokens 16384 \
        --gpu-memory-utilization 0.9 \
        --enable-expert-parallel \
        --no-enable-prefix-caching \
        --quantization "ascend" \
        --trust-remote-code \
        --speculative_config \
        '{
            "method": "eagle3",
            "model": "/dataset/Qwen3-235B-A22B-Instruct-2507-speculator-eagle3/",
            "num_speculative_tokens": 2
        }' \
        --compilation-config '{"cudagraph_mode": "FULL_DECODE_ONLY"}' \
        2>&1 | tee qwen3_235b_eagle3.log
| dataset | version | metric | mode | vllm-api-stream-chat |
|----- | ----- | ----- | ----- | -----|
| aime2024 | 604a78 | accuracy | gen | 80.00 |
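As a quick sanity check against a server launched with the script above, one can POST a small chat request to the OpenAI-compatible endpoint. The port (8004) and model name ("qwen") match the serve command; the payload below is only an illustration.

```shell
# Hypothetical smoke test for the server launched above; the payload is
# illustrative, and host/port should match your deployment.
payload='{
  "model": "qwen",
  "messages": [{"role": "user", "content": "ping"}],
  "max_tokens": 8
}'
# Validate that the payload is well-formed JSON before sending it.
echo "$payload" | python3 -m json.tool > /dev/null && echo "payload OK"
# Uncomment once the server is up on port 8004:
# curl -s http://localhost:8004/v1/chat/completions \
#   -H "Content-Type: application/json" -d "$payload"
```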

@github-actions
Contributor

👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:

  • A PR should do only one thing; smaller PRs enable faster reviews.
  • Every PR should include unit tests and end-to-end tests to ensure it works and is not broken by future PRs.
  • Write the commit message by filling in the PR description to help reviewers and future developers understand the change.

If CI fails, you can run linting and testing checks locally according to the Contributing and Testing guides.

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request addresses a bug in the dispatch_gmm_combine_decode function by correcting the global_bs input parameter. The change replaces the usage of fused_global_bs with the correctly calculated global_bs, ensuring consistency with other related operator calls. The corresponding removal of the fused_global_bs attribute from TokenDispatcherWithMC2 is a clean and correct modification. The fix appears to be well-targeted and sound.
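The shape of that change can be illustrated with a minimal sketch. Only the class name `TokenDispatcherWithMC2` and the attribute name `fused_global_bs` come from the review comment; everything else here is an assumption for illustration, not the actual vllm-ascend source.

```python
# Illustrative sketch of the fix's shape, not the actual vllm-ascend code:
# the cached fused_global_bs attribute is removed, and the dispatch call
# receives a global_bs computed the same way as for moe_distributed_dispatch.
class TokenDispatcherWithMC2:
    def __init__(self, max_tokens_per_card: int, ep_world_size: int):
        # Before the fix, a separate self.fused_global_bs (derived from the
        # raw max_num_tokens, without the tensor-parallel split) was stored
        # here and passed to dispatch_gmm_combine_decode.
        self.global_bs = max_tokens_per_card * ep_world_size

    def dispatch_gmm_combine_decode_args(self) -> dict:
        # After the fix, the consistently computed global_bs is used for
        # all related dispatch operator calls.
        return {"global_bs": self.global_bs}

dispatcher = TokenDispatcherWithMC2(max_tokens_per_card=4096, ep_world_size=16)
```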

@wangxiyuan wangxiyuan added the ready (read for review) and ready-for-test (start test by label for PR) labels Jan 19, 2026
@wangxiyuan wangxiyuan enabled auto-merge (squash) January 19, 2026 08:13
@wangqiankun13 wangqiankun13 changed the title [BugFix] Fix input parameter bug of dispatch_gmm_combine_decode [BugFix] Fix input parameter bug of dispatch_gmm_combine_decode[RFC: issue 5476] Jan 19, 2026
auto-merge was automatically disabled January 19, 2026 08:50

Head branch was pushed to by a user without write access

@wangqiankun13 wangqiankun13 force-pushed the fix-global-bs-bug branch 2 times, most recently from b5f288c to a0a9ebd Compare January 20, 2026 01:28
Signed-off-by: wangqiankun <wangqiankun13@huawei.com>
@wangxiyuan wangxiyuan merged commit bec8641 into vllm-project:main Jan 21, 2026
20 checks passed
845473182 pushed a commit to 845473182/vllm-ascend that referenced this pull request Jan 21, 2026
…to FIA_rebase

* 'main' of https://github.com/vllm-project/vllm-ascend: (24 commits)
  add dispath_ffn_combine_bf16 (vllm-project#5866)
  [BugFix] Fix input parameter bug of dispatch_gmm_combine_decode[RFC: issue 5476] (vllm-project#5932)
  [1/N][Feat] Xlite Qwen3 MoE Support (vllm-project#5951)
  [Bugfix] Fix setting of `speculative_config.enforce_eager` for dsv32 (vllm-project#5945)
  [bugfix][mm] change get_num_encoder_tokens to get_num_encoder_embeds in recompute_schedule.py (vllm-project#5132)
  [Bugfix] fix pcp qwen full graph FIA bug (vllm-project#6037)
  [Bugfix]Fixed precision issues caused by pooled request pooling (vllm-project#6049)
  【main】【bugfix】Resolved memory deallocation failure in the pooling layer under re-computation workloads. (vllm-project#6045)
  [main][Bugfix] Fixed an problem related to embeddings sharing (vllm-project#5967)
  [Feature]refactor the npugraph_ex config, support online-infer with static kernel (vllm-project#5775)
  [CI][Lint] Show lint diff on failure (vllm-project#5956)
  [CI] Add wait logic for each individual case (vllm-project#6036)
  [CI] Add DeepSeek-V3.2-W8A8 nightly ci test (vllm-project#4633)
  model runner v2 support triton of penalty (vllm-project#5854)
  [Docs][Model] Support Qwen3-VL-Embedding & Qwen3-VL-Reranker (vllm-project#6034)
  [Tests] move qwen3 performance test from nightly to e2e (vllm-project#5980)
  [Bugfix] fix bug of pcp+mtp+async scheduler (vllm-project#5994)
  [Main2Main] Upgrade vllm commit to releases/v0.14.0 (vllm-project#5988)
  [Ops] Add layernorm for qwen3Next (vllm-project#5765)
  [Doc] Add layer_sharding additional config for DeepSeek-V3.2-W8A8 (vllm-project#5921)
  ...
huangfeifei1995 pushed a commit to huangfeifei1995/vllm-ascend that referenced this pull request Jan 21, 2026
huangfeifei1995 added a commit to huangfeifei1995/vllm-ascend that referenced this pull request Jan 21, 2026
wangxiyuan pushed a commit that referenced this pull request Jan 22, 2026
…m_combine_decode (#5931)

This PR is cherry-picked from [PR5932](#5932).

Signed-off-by: wangqiankun <wangqiankun13@huawei.com>
tangtiangu pushed a commit to tangtiangu/jiusi-vllm-ascend that referenced this pull request Feb 24, 2026 (twice)
ZRJ026 pushed a commit to ZRJ026/vllm-ascend that referenced this pull request Feb 28, 2026
maoxx241 pushed a commit to maoxx241/vllm-ascend that referenced this pull request Mar 2, 2026
ZRJ026 pushed a commit to ZRJ026/vllm-ascend that referenced this pull request Mar 4, 2026
LCAIZJ pushed a commit to LCAIZJ/vllm-ascend that referenced this pull request Mar 7, 2026

Labels

module:ops · ready (read for review) · ready-for-test (start test by label for PR)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants