[Feature][Cherry Pick]Enable DispatchGmmCombineDecode when eagle is moe with w8a8, or not moe #6081
Merged
wangxiyuan merged 1 commit into vllm-project:releases/v0.13.0 on Jan 22, 2026
… or not moe Signed-off-by: wangqiankun <wangqiankun13@huawei.com>
845473182 pushed a commit to 845473182/vllm-ascend that referenced this pull request on Jan 22, 2026:
…lm-ascend into FIA_v0.13.0
* 'releases/v0.13.0' of https://github.com/vllm-project/vllm-ascend:
  [Feature][Cherry Pick]Enable DispatchGmmCombineDecode when eagle is moe with w8a8, or not moe (vllm-project#6081)
  [v0.13.0][BugFix][Cherry Pick] Fix input parameter bug of dispatch_gmm_combine_decode (vllm-project#5931)
  [0.13.0][Bugfix] Fix Triton operator usage for multimodal models based on `the mrope_interleaved` parameter (vllm-project#6074)
  [v0.13.0][CI] Upgrade to CANN 8.5.0 (vllm-project#6101)
starmountain1997 pushed a commit to starmountain1997/vllm-ascend that referenced this pull request on Jan 31, 2026:
…oe with w8a8, or not moe (vllm-project#6081)
tangtiangu pushed a commit to tangtiangu/jiusi-vllm-ascend that referenced this pull request on Feb 24, 2026:
…oe with w8a8, or not moe (vllm-project#6081)
What this PR does / why we need it?
This PR is cherry-picked from #5758.
The DispatchGmmCombineDecode operator does not support non-W8A8 scenarios and cannot share a communication domain with the Dispatch/Combine operators.
This becomes a problem when, for instance, the draft model uses a non-W8A8 MoE architecture while the main model uses a W8A8 MoE architecture.
For that reason, #5293 recently added a check that unconditionally disables DispatchGmmCombineDecode whenever the speculative mode is EAGLE or EAGLE-3.
However, that approach was not precise enough: it also disabled the operator for draft models that are W8A8 MoE or not MoE at all.
This PR refines the logic by inspecting the draft model's configuration: DispatchGmmCombineDecode is now disabled only when the draft model uses an MoE architecture and is not W8A8.
For more information about this operator, see the RFC in issue #5476.
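The difference between the earlier check and the refined one can be sketched as two small predicates. This is an illustrative sketch only, not the code in this PR: the function names, the `"eagle"` / `"eagle3"` method strings, and the `"w8a8"` quantization tag are all assumptions made for the example, and the sketch covers only the draft-model side of the decision (the main model still has to meet the operator's own requirements).

```python
from typing import Optional

# Assumed identifiers for the speculative modes named in the description.
EAGLE_METHODS = {"eagle", "eagle3"}


def draft_allows_fused_decode_old(speculative_method: Optional[str]) -> bool:
    """Earlier behaviour: any EAGLE / EAGLE-3 setup disables the operator."""
    return speculative_method not in EAGLE_METHODS


def draft_allows_fused_decode(
    speculative_method: Optional[str],
    draft_is_moe: bool,
    draft_quant: Optional[str],
) -> bool:
    """Refined behaviour: disable only for a non-W8A8 MoE draft model.

    All parameters are hypothetical stand-ins for whatever the real
    configuration objects expose.
    """
    if speculative_method not in EAGLE_METHODS:
        # This check only concerns EAGLE / EAGLE-3 draft models.
        return True
    if not draft_is_moe:
        # A non-MoE draft model does not conflict with the operator.
        return True
    # MoE draft model: compatible only when it is W8A8.
    return draft_quant == "w8a8"
```

Under these assumptions, an EAGLE-3 setup with a dense draft model or a W8A8 MoE draft model keeps the operator enabled, while the earlier check would have disabled it in both cases.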
Does this PR introduce any user-facing change?
How was this patch tested?