
[v0.13.0][BugFix][Cherry Pick] Fix input parameter bug of dispatch_gmm_combine_decode#5931

Merged
wangxiyuan merged 1 commit into vllm-project:releases/v0.13.0 from wangqiankun13:v0.13.0-fix-global-bs-bug
Jan 22, 2026

Conversation

@wangqiankun13
Contributor

@wangqiankun13 wangqiankun13 commented Jan 15, 2026

What this PR does / why we need it?

This PR is cherry-picked from PR5932.

In #5040, the dispatch_gmm_combine_decode operator was configured with an incorrect global_bs parameter. This PR fixes that bug.

The global_bs input should carry the same meaning as in the moe_distributed_dispatch operator, namely: (the maximum batch size across all cards) * (expert parallel world size).
However, the implementation incorrectly used the variable max_num_tokens, which does not account for tensor parallelism, likely producing an unnecessarily large (overestimated) value.
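To illustrate the distinction, here is a minimal, hypothetical sketch. The names `max_num_tokens`, `tp_size`, and `ep_world_size` are illustrative, not the exact vllm-ascend variables, and it assumes the per-card maximum batch size is `max_num_tokens` divided by the tensor-parallel size:

```python
def compute_global_bs(max_num_tokens: int, tp_size: int, ep_world_size: int) -> int:
    """Corrected calculation (sketch):
    global_bs = (max batch size per card) * (expert-parallel world size).

    With tensor parallelism, tokens are split across tp_size ranks, so the
    per-card maximum is max_num_tokens // tp_size, not max_num_tokens.
    """
    max_bs_per_card = max_num_tokens // tp_size
    return max_bs_per_card * ep_world_size


def compute_global_bs_buggy(max_num_tokens: int, ep_world_size: int) -> int:
    """Buggy calculation described in the PR: ignores tensor parallelism,
    overestimating global_bs by a factor of tp_size."""
    return max_num_tokens * ep_world_size
```

For example, with `max_num_tokens=64`, `tp_size=2`, and `ep_world_size=8`, the corrected value is 256, while the buggy version would pass 512, a factor of `tp_size` too large.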

For more information about this operator, please refer to RFC issue #5476.

Does this PR introduce any user-facing change?

How was this patch tested?


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request correctly fixes a critical bug in the dispatch_gmm_combine_decode operator. The global_bs parameter was previously calculated incorrectly as it did not account for tensor parallelism, which could lead to runtime errors or incorrect results. The change ensures global_bs is calculated consistently with other MoE operators by factoring in the tensor parallel size. The removal of the now-unused fused_global_bs attribute is also a good code cleanup. The changes are correct and improve the robustness of the fused MoE implementation.

Signed-off-by: wangqiankun <wangqiankun13@huawei.com>
@wangqiankun13 wangqiankun13 force-pushed the v0.13.0-fix-global-bs-bug branch from 02f41ca to fa6bc4e Compare January 16, 2026 01:33
@wangqiankun13 wangqiankun13 changed the title [BugFix] Fix input parameter bug of dispatch_gmm_combine_decode [BugFix][Cherry Pick] Fix input parameter bug of dispatch_gmm_combine_decode Jan 20, 2026
@wangqiankun13 wangqiankun13 changed the title [BugFix][Cherry Pick] Fix input parameter bug of dispatch_gmm_combine_decode [v0.13.0][BugFix][Cherry Pick] Fix input parameter bug of dispatch_gmm_combine_decode Jan 21, 2026
@wangxiyuan wangxiyuan added the ready (read for review) and ready-for-test (start test by label for PR) labels Jan 21, 2026
@wangxiyuan wangxiyuan merged commit 1548008 into vllm-project:releases/v0.13.0 Jan 22, 2026
20 checks passed
845473182 pushed a commit to 845473182/vllm-ascend that referenced this pull request Jan 22, 2026
…lm-ascend into FIA_v0.13.0

* 'releases/v0.13.0' of https://github.com/vllm-project/vllm-ascend:
  [Feature][Cherry Pick]Enable DispatchGmmCombineDecode when eagle is moe with w8a8, or not moe (vllm-project#6081)
  [v0.13.0][BugFix][Cherry Pick] Fix input parameter bug of dispatch_gmm_combine_decode (vllm-project#5931)
  [0.13.0][Bugfix] Fix Triton operator usage for multimodal models based on `the mrope_interleaved` parameter (vllm-project#6074)
  [v0.13.0][CI] Upgrade to CANN 8.5.0 (vllm-project#6101)
tangtiangu pushed a commit to tangtiangu/jiusi-vllm-ascend that referenced this pull request Feb 24, 2026
…m_combine_decode (vllm-project#5931)

### What this PR does / why we need it?
This PR is cherry-picked from
[PR5932](vllm-project#5932).

In vllm-project#5040, the
dispatch_gmm_combine_decode operator was configured with an incorrect
global_bs parameter. This PR is to fix the bug.

The global_bs provided as input should have the same meaning as in the
moe_distributed_dispatch operator, specifically: (the maximum batch size
across all cards) * (expert parallel world size).
However, the implementation incorrectly used the variable
max_num_tokens, which does not account for tensor parallelism. This
error likely resulted in an unnecessarily large (overestimated) value.

More info about this operator, please refer to RFC: issue
vllm-project#5476

Signed-off-by: wangqiankun <wangqiankun13@huawei.com>

Labels

ready (read for review), ready-for-test (start test by label for PR)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants