[Feature]Enable DispatchGmmCombineDecode when eagle is moe with w8a8 or not moe [RFC: issue 5476] by wangqiankun13 · Pull Request #5758 · vllm-project/vllm-ascend

wangqiankun13 · 2026-01-09T04:56:12Z

What this PR does / why we need it?

Operator DispatchGmmCombineDecode does not support non-W8A8 scenarios and cannot share the same communication domain with Operator Dispatch/Combine.

for instance, when the draft model uses a non-W8A8 MOE architecture while the main model employs a W8A8 MOE architecture.

Therefore days ago, I implemented an interception that unconditionally disables Operator DispatchGmmCombineDecode whenever the speculative mode is EAGLE or EAGLE-3. PR: 5293
However, this approach was not precise enough.
This PR further refines the logic by specifically identifying the draft model's configuration: Operator DispatchGmmCombineDecode will now be disabled only when the draft model uses an MOE architecture and is non-W8A8.

More info about this operator, please refer to RFC: issue #5476

Does this PR introduce any user-facing change?

No

How was this patch tested?

Acc
test qwen3-235b eplb on a single A3 node(ep16),
with dispatch_gmm_combine_decode

nic_name="xxxx"
local_ip="xxx.xxx.xxx.xxx"

export HCCL_IF_IP=$local_ip
export GLOO_SOCKET_IFNAME=$nic_name
export TP_SOCKET_IFNAME=$nic_name
export HCCL_SOCKET_IFNAME=$nic_name

export VLLM_ASCEND_ENABLE_FUSED_MC2=2
echo "VLLM_ASCEND_ENABLE_FUSED_MC2=${VLLM_ASCEND_ENABLE_FUSED_MC2}"

export HCCL_OP_EXPANSION_MODE="AIV"
export HCCL_BUFFSIZE=512
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export OMP_PROC_BIND=false
export OMP_NUM_THREADS=10
export LD_PRELOAD=/usr/lib/aarch64-linux-gnu/libjemalloc.so.2:$LD_PRELOAD
vllm serve /dataset/Qwen3-235B-A22B-Instruct-2507-w8a8-QuaRot/ \
        --served-model-name "qwen" \
        --host 0.0.0.0 \
        --port 8004 \
        --async-scheduling \
        --tensor-parallel-size 4 \
        --data-parallel-size 4 \
        --max-num-seqs 64 \
        --max-model-len 40960 \
        --max-num-batched-tokens 16384 \
        --gpu-memory-utilization 0.9 \
        --enable-expert-parallel \
        --no-enable-prefix-caching \
        --quantization "ascend" \
        --trust-remote-code \
        --speculative_config \
        '{
            "method": "eagle3",
            "model": "/dataset/Qwen3-235B-A22B-Instruct-2507-speculator-eagle3/",
            "num_speculative_tokens": 2
        }' \
        --compilation-config '{"cudagraph_mode": "FULL_DECODE_ONLY"}' \
        2>&1 | tee qwen3_235b_eagle3.log

dataset	version	metric	mode	vllm-api-stream-chat
aime2024	604a78	accuracy	gen	80.00

vLLM version: v0.13.0
vLLM main: vllm-project/vllm@2f4e654

gemini-code-assist

Code Review

This pull request introduces logic to conditionally enable DispatchGmmCombineDecode for speculative decoding, particularly for Eagle models. The change involves refactoring speculative_enable_dispatch_gmm_combine_decode to handle more complex conditions and adding a caching layer.

My main feedback is regarding the implementation of the caching mechanism. It uses a global variable without proper thread-safe access, which can lead to race conditions in a concurrent environment. I've provided a detailed comment on how to address this using a lock to ensure thread safety. Addressing this is critical for the stability of the application.

gemini-code-assist · 2026-01-09T04:58:40Z

+def speculative_enable_dispatch_gmm_combine_decode(
+        vllm_config: VllmConfig) -> bool:
+    global _DRAFT_ATAPT_FUSED_MC2_MODE_2
+    if _DRAFT_ATAPT_FUSED_MC2_MODE_2 is None:
+        _DRAFT_ATAPT_FUSED_MC2_MODE_2 = _draft_adapt_fused_mc2_mode_2(vllm_config)
+    return _DRAFT_ATAPT_FUSED_MC2_MODE_2


The caching implementation for speculative_enable_dispatch_gmm_combine_decode has a couple of critical issues:

Typo: The global variable _DRAFT_ATAPT_FUSED_MC2_MODE_2 seems to have a typo and should likely be _DRAFT_ADAPT_FUSED_MC2_MODE_2.

Race Condition: The use of a global variable for caching is not thread-safe. If multiple threads call this function concurrently when the global is None, it can lead to a race condition where the value is computed multiple times, and the global variable is written to without synchronization.

To fix this, I recommend renaming the variable and using a double-checked locking pattern to ensure thread-safe initialization.

First, update the global variable definition around line 72:

from threading import Lock ... _DRAFT_ADAPT_FUSED_MC2_MODE_2 = None _DRAFT_ADAPT_FUSED_MC2_MODE_2_LOCK = Lock()

Then, update this function as follows:

def speculative_enable_dispatch_gmm_combine_decode( vllm_config: VllmConfig) -> bool: global _DRAFT_ADAPT_FUSED_MC2_MODE_2 # Use double-checked locking for thread-safe initialization. if _DRAFT_ADAPT_FUSED_MC2_MODE_2 is None: with _DRAFT_ADAPT_FUSED_MC2_MODE_2_LOCK: if _DRAFT_ADAPT_FUSED_MC2_MODE_2 is None: _DRAFT_ADAPT_FUSED_MC2_MODE_2 = _draft_adapt_fused_mc2_mode_2(vllm_config) return _DRAFT_ADAPT_FUSED_MC2_MODE_2

This approach ensures correctness and stability in a multi-threaded environment.

github-actions · 2026-01-09T05:09:21Z

👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:‌‌

A PR should do only one thing, smaller PRs enable faster reviews.
Every PR should include unit tests and end-to-end tests ‌to ensure it works and is not broken by other future PRs.
Write the commit message by fulfilling the PR description to help reviewer and future developers understand.

If CI fails, you can run linting and testing checks locally according Contributing and Testing.

github-actions · 2026-01-16T13:02:19Z

This pull request has conflicts, please resolve those before we can evaluate the pull request.

… or not moe Signed-off-by: wangqiankun <wangqiankun13@huawei.com>

…oe with w8a8, or not moe (#6081) ### What this PR does / why we need it? This PR is cherry-picked from #5758. Operator DispatchGmmCombineDecode does not support non-W8A8 scenarios and cannot share the same communication domain with Operator Dispatch/Combine. for instance, when the draft model uses a non-W8A8 MOE architecture while the main model employs a W8A8 MOE architecture. Therefore days ago, I implemented an interception that unconditionally disables Operator DispatchGmmCombineDecode whenever the speculative mode is EAGLE or EAGLE-3. #5293 However, this approach was not precise enough. This PR further refines the logic by specifically identifying the draft model's configuration: Operator DispatchGmmCombineDecode will now be disabled only when the draft model uses an MOE architecture and is non-W8A8. More info about this operator, please refer to RFC: issue #5476 Signed-off-by: wangqiankun <wangqiankun13@huawei.com>

…to qwen3next_rebase * 'main' of https://github.com/vllm-project/vllm-ascend: (51 commits) [Bugfix] Remove `use_aclgraph` in mtp_proposer and use `use_cuda_graph` (vllm-project#6032) [BugFix] fix 3vl dense model load quant weight (vllm-project#6100) [CP&SP] Integrate FIA operator in mla_cp._forward_decode (vllm-project#5641) [CI][Doc] Upgrade wheel building's CANN to 8.5.0 and update the Docs (vllm-project#6145) [CI]Install clang in dokerfile for triton ascend (vllm-project#4409) [Main] Upgrade PTA to 2.9.0 (vllm-project#6112) [Graph][Fusion] Add QKVNormRope and QKVNormRopeWithBias (vllm-project#5721) [P/D][PCP]bugfix pcp force free twice caused logger error (vllm-project#6124) [BugFix]converting pa get_workspace back to capturing (vllm-project#5833) [CI] optimize lint term (vllm-project#5986) [Bugfix] Fix Triton operator usage for multimodal models based on `the mrope_interleaved` parameter (vllm-project#6042) [bugfix][npugraph_ex]fix the model output type issue caused by manually modify FX graph (vllm-project#6015) [BugFix] Support setting tp=1 for the Eagle draft model to take effect (vllm-project#6097) [Misc] Bump mooncake version to v0.3.8.post1 (vllm-project#6110) [Feature]Enable DispatchGmmCombineDecode when eagle is moe with w8a8 or not moe [RFC: issue 5476] (vllm-project#5758) [bugfix] adapt_remote_request_id (vllm-project#6051) [Feature] Add support of new W4A4_LAOS_DYNAMIC quantization method (vllm-project#5143) [Feature] Support DSA-CP for Hybrid scenario (vllm-project#5702) [CI] Upgrade CANN to 8.5.0 (vllm-project#6070) Default enable MLAPO (vllm-project#5952) ...

…or not moe [RFC: issue 5476] (vllm-project#5758) ### What this PR does / why we need it? Operator `DispatchGmmCombineDecode` does not support non-W8A8 scenarios and cannot share the same communication domain with Operator `Dispatch`/`Combine`. > for instance, when the draft model uses a non-W8A8 MOE architecture while the main model employs a W8A8 MOE architecture. Therefore days ago, I implemented an interception that unconditionally disables Operator `DispatchGmmCombineDecode` whenever the speculative mode is `EAGLE` or `EAGLE-3`. [PR: 5293](vllm-project#5293) However, this approach was not precise enough. This PR further refines the logic by specifically identifying the draft model's configuration: Operator `DispatchGmmCombineDecode` will now be disabled only when the draft model uses an MOE architecture and is non-W8A8. More info about this operator, please refer to RFC: issue vllm-project#5476 ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Acc test qwen3-235b eplb on a single A3 node(ep16), with dispatch_gmm_combine_decode ```shell nic_name="xxxx" local_ip="xxx.xxx.xxx.xxx" export HCCL_IF_IP=$local_ip export GLOO_SOCKET_IFNAME=$nic_name export TP_SOCKET_IFNAME=$nic_name export HCCL_SOCKET_IFNAME=$nic_name export VLLM_ASCEND_ENABLE_FUSED_MC2=2 echo "VLLM_ASCEND_ENABLE_FUSED_MC2=${VLLM_ASCEND_ENABLE_FUSED_MC2}" export HCCL_OP_EXPANSION_MODE="AIV" export HCCL_BUFFSIZE=512 export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True export OMP_PROC_BIND=false export OMP_NUM_THREADS=10 export LD_PRELOAD=/usr/lib/aarch64-linux-gnu/libjemalloc.so.2:$LD_PRELOAD vllm serve /dataset/Qwen3-235B-A22B-Instruct-2507-w8a8-QuaRot/ \ --served-model-name "qwen" \ --host 0.0.0.0 \ --port 8004 \ --async-scheduling \ --tensor-parallel-size 4 \ --data-parallel-size 4 \ --max-num-seqs 64 \ --max-model-len 40960 \ --max-num-batched-tokens 16384 \ --gpu-memory-utilization 0.9 \ --enable-expert-parallel \ --no-enable-prefix-caching \ --quantization "ascend" \ --trust-remote-code \ --speculative_config \ '{ "method": "eagle3", "model": "/dataset/Qwen3-235B-A22B-Instruct-2507-speculator-eagle3/", "num_speculative_tokens": 2 }' \ --compilation-config '{"cudagraph_mode": "FULL_DECODE_ONLY"}' \ 2>&1 | tee qwen3_235b_eagle3.log ``` | dataset | version | metric | mode | vllm-api-stream-chat | |----- | ----- | ----- | ----- | -----| | aime2024 | 604a78 | accuracy | gen | 80.00 | - vLLM version: v0.13.0 - vLLM main: vllm-project/vllm@2f4e654 Signed-off-by: wangqiankun <wangqiankun13@huawei.com>

…oe with w8a8, or not moe (vllm-project#6081) ### What this PR does / why we need it? This PR is cherry-picked from vllm-project#5758. Operator DispatchGmmCombineDecode does not support non-W8A8 scenarios and cannot share the same communication domain with Operator Dispatch/Combine. for instance, when the draft model uses a non-W8A8 MOE architecture while the main model employs a W8A8 MOE architecture. Therefore days ago, I implemented an interception that unconditionally disables Operator DispatchGmmCombineDecode whenever the speculative mode is EAGLE or EAGLE-3. vllm-project#5293 However, this approach was not precise enough. This PR further refines the logic by specifically identifying the draft model's configuration: Operator DispatchGmmCombineDecode will now be disabled only when the draft model uses an MOE architecture and is non-W8A8. More info about this operator, please refer to RFC: issue vllm-project#5476 Signed-off-by: wangqiankun <wangqiankun13@huawei.com>

…or not moe [RFC: issue 5476] (vllm-project#5758) ### What this PR does / why we need it? Operator `DispatchGmmCombineDecode` does not support non-W8A8 scenarios and cannot share the same communication domain with Operator `Dispatch`/`Combine`. > for instance, when the draft model uses a non-W8A8 MOE architecture while the main model employs a W8A8 MOE architecture. Therefore days ago, I implemented an interception that unconditionally disables Operator `DispatchGmmCombineDecode` whenever the speculative mode is `EAGLE` or `EAGLE-3`. [PR: 5293](vllm-project#5293) However, this approach was not precise enough. This PR further refines the logic by specifically identifying the draft model's configuration: Operator `DispatchGmmCombineDecode` will now be disabled only when the draft model uses an MOE architecture and is non-W8A8. More info about this operator, please refer to RFC: issue vllm-project#5476 ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Acc test qwen3-235b eplb on a single A3 node(ep16), with dispatch_gmm_combine_decode ```shell nic_name="xxxx" local_ip="xxx.xxx.xxx.xxx" export HCCL_IF_IP=$local_ip export GLOO_SOCKET_IFNAME=$nic_name export TP_SOCKET_IFNAME=$nic_name export HCCL_SOCKET_IFNAME=$nic_name export VLLM_ASCEND_ENABLE_FUSED_MC2=2 echo "VLLM_ASCEND_ENABLE_FUSED_MC2=${VLLM_ASCEND_ENABLE_FUSED_MC2}" export HCCL_OP_EXPANSION_MODE="AIV" export HCCL_BUFFSIZE=512 export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True export OMP_PROC_BIND=false export OMP_NUM_THREADS=10 export LD_PRELOAD=/usr/lib/aarch64-linux-gnu/libjemalloc.so.2:$LD_PRELOAD vllm serve /dataset/Qwen3-235B-A22B-Instruct-2507-w8a8-QuaRot/ \ --served-model-name "qwen" \ --host 0.0.0.0 \ --port 8004 \ --async-scheduling \ --tensor-parallel-size 4 \ --data-parallel-size 4 \ --max-num-seqs 64 \ --max-model-len 40960 \ --max-num-batched-tokens 16384 \ --gpu-memory-utilization 0.9 \ --enable-expert-parallel \ --no-enable-prefix-caching \ --quantization "ascend" \ --trust-remote-code \ --speculative_config \ '{ "method": "eagle3", "model": "/dataset/Qwen3-235B-A22B-Instruct-2507-speculator-eagle3/", "num_speculative_tokens": 2 }' \ --compilation-config '{"cudagraph_mode": "FULL_DECODE_ONLY"}' \ 2>&1 | tee qwen3_235b_eagle3.log ``` | dataset | version | metric | mode | vllm-api-stream-chat | |----- | ----- | ----- | ----- | -----| | aime2024 | 604a78 | accuracy | gen | 80.00 | - vLLM version: v0.13.0 - vLLM main: vllm-project/vllm@2f4e654 Signed-off-by: wangqiankun <wangqiankun13@huawei.com>

…oe with w8a8, or not moe (vllm-project#6081) ### What this PR does / why we need it? This PR is cherry-picked from vllm-project#5758. Operator DispatchGmmCombineDecode does not support non-W8A8 scenarios and cannot share the same communication domain with Operator Dispatch/Combine. for instance, when the draft model uses a non-W8A8 MOE architecture while the main model employs a W8A8 MOE architecture. Therefore days ago, I implemented an interception that unconditionally disables Operator DispatchGmmCombineDecode whenever the speculative mode is EAGLE or EAGLE-3. vllm-project#5293 However, this approach was not precise enough. This PR further refines the logic by specifically identifying the draft model's configuration: Operator DispatchGmmCombineDecode will now be disabled only when the draft model uses an MOE architecture and is non-W8A8. More info about this operator, please refer to RFC: issue vllm-project#5476 Signed-off-by: wangqiankun <wangqiankun13@huawei.com>

…or not moe [RFC: issue 5476] (vllm-project#5758) ### What this PR does / why we need it? Operator `DispatchGmmCombineDecode` does not support non-W8A8 scenarios and cannot share the same communication domain with Operator `Dispatch`/`Combine`. > for instance, when the draft model uses a non-W8A8 MOE architecture while the main model employs a W8A8 MOE architecture. Therefore days ago, I implemented an interception that unconditionally disables Operator `DispatchGmmCombineDecode` whenever the speculative mode is `EAGLE` or `EAGLE-3`. [PR: 5293](vllm-project#5293) However, this approach was not precise enough. This PR further refines the logic by specifically identifying the draft model's configuration: Operator `DispatchGmmCombineDecode` will now be disabled only when the draft model uses an MOE architecture and is non-W8A8. More info about this operator, please refer to RFC: issue vllm-project#5476 ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Acc test qwen3-235b eplb on a single A3 node(ep16), with dispatch_gmm_combine_decode ```shell nic_name="xxxx" local_ip="xxx.xxx.xxx.xxx" export HCCL_IF_IP=$local_ip export GLOO_SOCKET_IFNAME=$nic_name export TP_SOCKET_IFNAME=$nic_name export HCCL_SOCKET_IFNAME=$nic_name export VLLM_ASCEND_ENABLE_FUSED_MC2=2 echo "VLLM_ASCEND_ENABLE_FUSED_MC2=${VLLM_ASCEND_ENABLE_FUSED_MC2}" export HCCL_OP_EXPANSION_MODE="AIV" export HCCL_BUFFSIZE=512 export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True export OMP_PROC_BIND=false export OMP_NUM_THREADS=10 export LD_PRELOAD=/usr/lib/aarch64-linux-gnu/libjemalloc.so.2:$LD_PRELOAD vllm serve /dataset/Qwen3-235B-A22B-Instruct-2507-w8a8-QuaRot/ \ --served-model-name "qwen" \ --host 0.0.0.0 \ --port 8004 \ --async-scheduling \ --tensor-parallel-size 4 \ --data-parallel-size 4 \ --max-num-seqs 64 \ --max-model-len 40960 \ --max-num-batched-tokens 16384 \ --gpu-memory-utilization 0.9 \ --enable-expert-parallel \ --no-enable-prefix-caching \ --quantization "ascend" \ --trust-remote-code \ --speculative_config \ '{ "method": "eagle3", "model": "/dataset/Qwen3-235B-A22B-Instruct-2507-speculator-eagle3/", "num_speculative_tokens": 2 }' \ --compilation-config '{"cudagraph_mode": "FULL_DECODE_ONLY"}' \ 2>&1 | tee qwen3_235b_eagle3.log ``` | dataset | version | metric | mode | vllm-api-stream-chat | |----- | ----- | ----- | ----- | -----| | aime2024 | 604a78 | accuracy | gen | 80.00 | - vLLM version: v0.13.0 - vLLM main: vllm-project/vllm@2f4e654 Signed-off-by: wangqiankun <wangqiankun13@huawei.com> Signed-off-by: zrj026 <zhangrunjiang026@gmail.com>

…or not moe [RFC: issue 5476] (vllm-project#5758) ### What this PR does / why we need it? Operator `DispatchGmmCombineDecode` does not support non-W8A8 scenarios and cannot share the same communication domain with Operator `Dispatch`/`Combine`. > for instance, when the draft model uses a non-W8A8 MOE architecture while the main model employs a W8A8 MOE architecture. Therefore days ago, I implemented an interception that unconditionally disables Operator `DispatchGmmCombineDecode` whenever the speculative mode is `EAGLE` or `EAGLE-3`. [PR: 5293](vllm-project#5293) However, this approach was not precise enough. This PR further refines the logic by specifically identifying the draft model's configuration: Operator `DispatchGmmCombineDecode` will now be disabled only when the draft model uses an MOE architecture and is non-W8A8. More info about this operator, please refer to RFC: issue vllm-project#5476 ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Acc test qwen3-235b eplb on a single A3 node(ep16), with dispatch_gmm_combine_decode ```shell nic_name="xxxx" local_ip="xxx.xxx.xxx.xxx" export HCCL_IF_IP=$local_ip export GLOO_SOCKET_IFNAME=$nic_name export TP_SOCKET_IFNAME=$nic_name export HCCL_SOCKET_IFNAME=$nic_name export VLLM_ASCEND_ENABLE_FUSED_MC2=2 echo "VLLM_ASCEND_ENABLE_FUSED_MC2=${VLLM_ASCEND_ENABLE_FUSED_MC2}" export HCCL_OP_EXPANSION_MODE="AIV" export HCCL_BUFFSIZE=512 export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True export OMP_PROC_BIND=false export OMP_NUM_THREADS=10 export LD_PRELOAD=/usr/lib/aarch64-linux-gnu/libjemalloc.so.2:$LD_PRELOAD vllm serve /dataset/Qwen3-235B-A22B-Instruct-2507-w8a8-QuaRot/ \ --served-model-name "qwen" \ --host 0.0.0.0 \ --port 8004 \ --async-scheduling \ --tensor-parallel-size 4 \ --data-parallel-size 4 \ --max-num-seqs 64 \ --max-model-len 40960 \ --max-num-batched-tokens 16384 \ --gpu-memory-utilization 0.9 \ --enable-expert-parallel \ --no-enable-prefix-caching \ --quantization "ascend" \ --trust-remote-code \ --speculative_config \ '{ "method": "eagle3", "model": "/dataset/Qwen3-235B-A22B-Instruct-2507-speculator-eagle3/", "num_speculative_tokens": 2 }' \ --compilation-config '{"cudagraph_mode": "FULL_DECODE_ONLY"}' \ 2>&1 | tee qwen3_235b_eagle3.log ``` | dataset | version | metric | mode | vllm-api-stream-chat | |----- | ----- | ----- | ----- | -----| | aime2024 | 604a78 | accuracy | gen | 80.00 | - vLLM version: v0.13.0 - vLLM main: vllm-project/vllm@2f4e654 Signed-off-by: wangqiankun <wangqiankun13@huawei.com>

…or not moe [RFC: issue 5476] (vllm-project#5758) ### What this PR does / why we need it? Operator `DispatchGmmCombineDecode` does not support non-W8A8 scenarios and cannot share the same communication domain with Operator `Dispatch`/`Combine`. > for instance, when the draft model uses a non-W8A8 MOE architecture while the main model employs a W8A8 MOE architecture. Therefore days ago, I implemented an interception that unconditionally disables Operator `DispatchGmmCombineDecode` whenever the speculative mode is `EAGLE` or `EAGLE-3`. [PR: 5293](vllm-project#5293) However, this approach was not precise enough. This PR further refines the logic by specifically identifying the draft model's configuration: Operator `DispatchGmmCombineDecode` will now be disabled only when the draft model uses an MOE architecture and is non-W8A8. More info about this operator, please refer to RFC: issue vllm-project#5476 ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Acc test qwen3-235b eplb on a single A3 node(ep16), with dispatch_gmm_combine_decode ```shell nic_name="xxxx" local_ip="xxx.xxx.xxx.xxx" export HCCL_IF_IP=$local_ip export GLOO_SOCKET_IFNAME=$nic_name export TP_SOCKET_IFNAME=$nic_name export HCCL_SOCKET_IFNAME=$nic_name export VLLM_ASCEND_ENABLE_FUSED_MC2=2 echo "VLLM_ASCEND_ENABLE_FUSED_MC2=${VLLM_ASCEND_ENABLE_FUSED_MC2}" export HCCL_OP_EXPANSION_MODE="AIV" export HCCL_BUFFSIZE=512 export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True export OMP_PROC_BIND=false export OMP_NUM_THREADS=10 export LD_PRELOAD=/usr/lib/aarch64-linux-gnu/libjemalloc.so.2:$LD_PRELOAD vllm serve /dataset/Qwen3-235B-A22B-Instruct-2507-w8a8-QuaRot/ \ --served-model-name "qwen" \ --host 0.0.0.0 \ --port 8004 \ --async-scheduling \ --tensor-parallel-size 4 \ --data-parallel-size 4 \ --max-num-seqs 64 \ --max-model-len 40960 \ --max-num-batched-tokens 16384 \ --gpu-memory-utilization 0.9 \ --enable-expert-parallel \ --no-enable-prefix-caching \ --quantization "ascend" \ --trust-remote-code \ --speculative_config \ '{ "method": "eagle3", "model": "/dataset/Qwen3-235B-A22B-Instruct-2507-speculator-eagle3/", "num_speculative_tokens": 2 }' \ --compilation-config '{"cudagraph_mode": "FULL_DECODE_ONLY"}' \ 2>&1 | tee qwen3_235b_eagle3.log ``` | dataset | version | metric | mode | vllm-api-stream-chat | |----- | ----- | ----- | ----- | -----| | aime2024 | 604a78 | accuracy | gen | 80.00 | - vLLM version: v0.13.0 - vLLM main: vllm-project/vllm@2f4e654 Signed-off-by: wangqiankun <wangqiankun13@huawei.com> Signed-off-by: zrj026 <zhangrunjiang026@gmail.com>

…or not moe [RFC: issue 5476] (vllm-project#5758) ### What this PR does / why we need it? Operator `DispatchGmmCombineDecode` does not support non-W8A8 scenarios and cannot share the same communication domain with Operator `Dispatch`/`Combine`. > for instance, when the draft model uses a non-W8A8 MOE architecture while the main model employs a W8A8 MOE architecture. Therefore days ago, I implemented an interception that unconditionally disables Operator `DispatchGmmCombineDecode` whenever the speculative mode is `EAGLE` or `EAGLE-3`. [PR: 5293](vllm-project#5293) However, this approach was not precise enough. This PR further refines the logic by specifically identifying the draft model's configuration: Operator `DispatchGmmCombineDecode` will now be disabled only when the draft model uses an MOE architecture and is non-W8A8. More info about this operator, please refer to RFC: issue vllm-project#5476 ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Acc test qwen3-235b eplb on a single A3 node(ep16), with dispatch_gmm_combine_decode ```shell nic_name="xxxx" local_ip="xxx.xxx.xxx.xxx" export HCCL_IF_IP=$local_ip export GLOO_SOCKET_IFNAME=$nic_name export TP_SOCKET_IFNAME=$nic_name export HCCL_SOCKET_IFNAME=$nic_name export VLLM_ASCEND_ENABLE_FUSED_MC2=2 echo "VLLM_ASCEND_ENABLE_FUSED_MC2=${VLLM_ASCEND_ENABLE_FUSED_MC2}" export HCCL_OP_EXPANSION_MODE="AIV" export HCCL_BUFFSIZE=512 export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True export OMP_PROC_BIND=false export OMP_NUM_THREADS=10 export LD_PRELOAD=/usr/lib/aarch64-linux-gnu/libjemalloc.so.2:$LD_PRELOAD vllm serve /dataset/Qwen3-235B-A22B-Instruct-2507-w8a8-QuaRot/ \ --served-model-name "qwen" \ --host 0.0.0.0 \ --port 8004 \ --async-scheduling \ --tensor-parallel-size 4 \ --data-parallel-size 4 \ --max-num-seqs 64 \ --max-model-len 40960 \ --max-num-batched-tokens 16384 \ --gpu-memory-utilization 0.9 \ --enable-expert-parallel \ --no-enable-prefix-caching \ --quantization "ascend" \ --trust-remote-code \ --speculative_config \ '{ "method": "eagle3", "model": "/dataset/Qwen3-235B-A22B-Instruct-2507-speculator-eagle3/", "num_speculative_tokens": 2 }' \ --compilation-config '{"cudagraph_mode": "FULL_DECODE_ONLY"}' \ 2>&1 | tee qwen3_235b_eagle3.log ``` | dataset | version | metric | mode | vllm-api-stream-chat | |----- | ----- | ----- | ----- | -----| | aime2024 | 604a78 | accuracy | gen | 80.00 | - vLLM version: v0.13.0 - vLLM main: vllm-project/vllm@2f4e654 Signed-off-by: wangqiankun <wangqiankun13@huawei.com>

wangqiankun13 force-pushed the check_eagle branch from 70a047c to c5b1f24 Compare January 9, 2026 04:56

gemini-code-assist bot reviewed Jan 9, 2026

View reviewed changes

github-actions bot added the module:core label Jan 9, 2026

wangqiankun13 force-pushed the check_eagle branch 4 times, most recently from 0037deb to 5b17fbf Compare January 15, 2026 01:54

wangqiankun13 requested a review from wangxiyuan as a code owner January 15, 2026 01:54

wangqiankun13 changed the title ~~[Feature]Enable DispatchGmmCombineDecode when eagle is moe with w8a8 or not moe~~ [Feature]Enable DispatchGmmCombineDecode when eagle is moe with w8a8 or not moe [RFC: issue 5476] Jan 15, 2026

github-actions bot added the merge-conflicts label Jan 16, 2026

wangqiankun13 force-pushed the check_eagle branch from 5b17fbf to 79e57d3 Compare January 19, 2026 02:21

github-actions bot removed the merge-conflicts label Jan 19, 2026

wangqiankun13 force-pushed the check_eagle branch 3 times, most recently from 0be8a20 to d921c9d Compare January 19, 2026 07:17

wangxiyuan added ready read for review ready-for-test start test by label for PR labels Jan 19, 2026

wangqiankun13 force-pushed the check_eagle branch 6 times, most recently from 6d3f04e to 162c14c Compare January 20, 2026 11:10

[Feature]Enable DispatchGmmCombineDecode when eagle is moe with w8a8,…

19fbfc0

… or not moe Signed-off-by: wangqiankun <wangqiankun13@huawei.com>

wangqiankun13 force-pushed the check_eagle branch from 162c14c to 19fbfc0 Compare January 21, 2026 04:00

wangqiankun13 mentioned this pull request Jan 21, 2026

[Feature][Cherry Pick]Enable DispatchGmmCombineDecode when eagle is moe with w8a8, or not moe #6081

Merged

wangxiyuan enabled auto-merge (squash) January 21, 2026 08:53

wangxiyuan approved these changes Jan 22, 2026

View reviewed changes

wangxiyuan merged commit 08d7014 into vllm-project:main Jan 22, 2026
20 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Feature]Enable DispatchGmmCombineDecode when eagle is moe with w8a8 or not moe [RFC: issue 5476]#5758

[Feature]Enable DispatchGmmCombineDecode when eagle is moe with w8a8 or not moe [RFC: issue 5476]#5758
wangxiyuan merged 1 commit intovllm-project:mainfrom
wangqiankun13:check_eagle

wangqiankun13 commented Jan 9, 2026 •

edited

Loading

Uh oh!

gemini-code-assist bot left a comment

Uh oh!

gemini-code-assist bot Jan 9, 2026

Uh oh!

github-actions bot commented Jan 9, 2026

Uh oh!

github-actions bot commented Jan 16, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

wangqiankun13 commented Jan 9, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What this PR does / why we need it?

Does this PR introduce any user-facing change?

How was this patch tested?

Uh oh!

gemini-code-assist bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist bot Jan 9, 2026

Choose a reason for hiding this comment

Uh oh!

github-actions bot commented Jan 9, 2026

Uh oh!

github-actions bot commented Jan 16, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

wangqiankun13 commented Jan 9, 2026 •

edited

Loading