[BugFix] Disable dispatch_gmm_combine_decode operator when mtp drafter model uses non-w8a8 while main model uses w8a8, or drafter model is eagle series #5293
Conversation
Code Review
This pull request adds a guard for the dispatch_gmm_combine_decode operator to ensure it's only used when the MTP model has the correct quantization type, which aligns with the operator's requirements. However, there is a critical issue in how the MTP model is detected, which could lead to a runtime crash. I've added a comment with a suggested fix for this. Additionally, for consistency, you might consider using the newly introduced model_with_mtp variable in place of is_mtp_model throughout the select_moe_comm_method function.
```python
quant_type = getattr(
    vllm_config.model_config.hf_config, 'moe_quantize',
    getattr(vllm_config.model_config.hf_config, 'quantize', None))
model_with_mtp = getattr(vllm_config.speculative_config, 'method', None) == "mtp"
```
This line will raise an AttributeError if vllm_config.speculative_config is None, as getattr would be called on a None object. This is a critical issue that can cause a crash. You should add a check for vllm_config.speculative_config before accessing its attributes.
```diff
- model_with_mtp = getattr(vllm_config.speculative_config, 'method', None) == "mtp"
+ model_with_mtp = vllm_config.speculative_config and getattr(vllm_config.speculative_config, 'method', None) == "mtp"
```
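The suggested pattern can be sketched in isolation. Note that `getattr` with a default does not actually raise on a `None` object; the explicit truthiness check nonetheless makes the `None` case unambiguous. `SimpleNamespace` below is only a stand-in for `vllm_config.speculative_config`:

```python
from types import SimpleNamespace

def detect_mtp(speculative_config):
    """Return True only when a speculative config exists and its method is 'mtp'.

    The leading truthiness check makes the None case explicit: the
    expression short-circuits to a falsy value instead of relying on
    getattr's default argument.
    """
    return bool(speculative_config
                and getattr(speculative_config, 'method', None) == "mtp")

# No speculative decoding configured: safely evaluates to False.
print(detect_mtp(None))                              # False
# MTP drafter configured.
print(detect_mtp(SimpleNamespace(method="mtp")))     # True
# EAGLE-3 drafter configured: not MTP.
print(detect_mtp(SimpleNamespace(method="eagle3")))  # False
```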
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:
If CI fails, you can run linting and testing checks locally according to the Contributing and Testing guides.
This pull request has conflicts; please resolve them before we can evaluate the pull request.
… model uses non-w8a8 while main model uses w8a8, or drafter model is eagle series

Signed-off-by: wangqiankun <wangqiankun13@huawei.com>
… model uses non-w8a8 while main model uses w8a8, or drafter model is eagle series (vllm-project#5293)

### What this PR does / why we need it?
Disable the dispatch_gmm_combine_decode operator when the MTP drafter model uses non-w8a8 while the main model uses w8a8, or when the drafter model is from the EAGLE series. For more information about this operator, please refer to RFC issue vllm-project#5476.

- vLLM version: release/v0.13.0
- vLLM main: vllm-project/vllm@ad32e3e

Signed-off-by: wangqiankun <wangqiankun13@huawei.com>
…or not moe [RFC: issue 5476] (#5758)

### What this PR does / why we need it?
Operator `DispatchGmmCombineDecode` does not support non-W8A8 scenarios and cannot share the same communication domain with operators `Dispatch`/`Combine`:

> for instance, when the draft model uses a non-W8A8 MOE architecture while the main model employs a W8A8 MOE architecture.

Therefore, some days ago I implemented an interception that unconditionally disables operator `DispatchGmmCombineDecode` whenever the speculative mode is `EAGLE` or `EAGLE-3` ([PR: 5293](#5293)). However, that approach was not precise enough. This PR refines the logic by specifically inspecting the draft model's configuration: operator `DispatchGmmCombineDecode` is now disabled only when the draft model uses an MOE architecture and is non-W8A8.

For more information about this operator, please refer to RFC issue #5476.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Accuracy test: qwen3-235b EPLB on a single A3 node (ep16), with dispatch_gmm_combine_decode:

```shell
nic_name="xxxx"
local_ip="xxx.xxx.xxx.xxx"
export HCCL_IF_IP=$local_ip
export GLOO_SOCKET_IFNAME=$nic_name
export TP_SOCKET_IFNAME=$nic_name
export HCCL_SOCKET_IFNAME=$nic_name
export VLLM_ASCEND_ENABLE_FUSED_MC2=2
echo "VLLM_ASCEND_ENABLE_FUSED_MC2=${VLLM_ASCEND_ENABLE_FUSED_MC2}"
export HCCL_OP_EXPANSION_MODE="AIV"
export HCCL_BUFFSIZE=512
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export OMP_PROC_BIND=false
export OMP_NUM_THREADS=10
export LD_PRELOAD=/usr/lib/aarch64-linux-gnu/libjemalloc.so.2:$LD_PRELOAD

vllm serve /dataset/Qwen3-235B-A22B-Instruct-2507-w8a8-QuaRot/ \
  --served-model-name "qwen" \
  --host 0.0.0.0 \
  --port 8004 \
  --async-scheduling \
  --tensor-parallel-size 4 \
  --data-parallel-size 4 \
  --max-num-seqs 64 \
  --max-model-len 40960 \
  --max-num-batched-tokens 16384 \
  --gpu-memory-utilization 0.9 \
  --enable-expert-parallel \
  --no-enable-prefix-caching \
  --quantization "ascend" \
  --trust-remote-code \
  --speculative_config '{ "method": "eagle3", "model": "/dataset/Qwen3-235B-A22B-Instruct-2507-speculator-eagle3/", "num_speculative_tokens": 2 }' \
  --compilation-config '{"cudagraph_mode": "FULL_DECODE_ONLY"}' \
  2>&1 | tee qwen3_235b_eagle3.log
```

| dataset | version | metric | mode | vllm-api-stream-chat |
| ----- | ----- | ----- | ----- | ----- |
| aime2024 | 604a78 | accuracy | gen | 80.00 |

- vLLM version: v0.13.0
- vLLM main: vllm-project/vllm@2f4e654

Signed-off-by: wangqiankun <wangqiankun13@huawei.com>
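The refined condition described above (disable the fused operator only for a non-W8A8 MoE draft model) can be sketched as a small predicate. All names here (`num_experts` as the MoE marker, `moe_quantize`/`quantize` as quantization fields) are illustrative assumptions, not the exact vllm-ascend implementation:

```python
from types import SimpleNamespace

def allow_dispatch_gmm_combine_decode(draft_hf_config):
    """Hypothetical sketch: allow the fused op unless the draft model
    is an MoE model quantized with something other than W8A8."""
    if draft_hf_config is None:  # no draft model configured at all
        return True
    # Treat the presence of an expert count as the MoE marker (assumption).
    is_moe = getattr(draft_hf_config, 'num_experts', None) is not None
    # Prefer a MoE-specific quantization field, fall back to the general one.
    quant = getattr(draft_hf_config, 'moe_quantize',
                    getattr(draft_hf_config, 'quantize', None))
    # Disable only for MoE draft models that are not W8A8.
    return (not is_moe) or quant == "w8a8"

print(allow_dispatch_gmm_combine_decode(None))  # True: no drafter
print(allow_dispatch_gmm_combine_decode(
    SimpleNamespace(num_experts=128, quantize="w8a8")))  # True: W8A8 MoE
print(allow_dispatch_gmm_combine_decode(
    SimpleNamespace(num_experts=128, quantize=None)))    # False: non-W8A8 MoE
print(allow_dispatch_gmm_combine_decode(
    SimpleNamespace(quantize=None)))                     # True: dense drafter
```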
…oe with w8a8, or not moe (#6081)

### What this PR does / why we need it?
This PR is cherry-picked from #5758.

Operator DispatchGmmCombineDecode does not support non-W8A8 scenarios and cannot share the same communication domain with operators Dispatch/Combine, for instance, when the draft model uses a non-W8A8 MOE architecture while the main model employs a W8A8 MOE architecture. An earlier change (#5293) therefore unconditionally disabled operator DispatchGmmCombineDecode whenever the speculative mode is EAGLE or EAGLE-3. However, that approach was not precise enough. This PR refines the logic by specifically inspecting the draft model's configuration: operator DispatchGmmCombineDecode is now disabled only when the draft model uses an MOE architecture and is non-W8A8.

For more information about this operator, please refer to RFC issue #5476.

Signed-off-by: wangqiankun <wangqiankun13@huawei.com>
…w8a8 while main model uses w8a8
What this PR does / why we need it?
Disable the dispatch_gmm_combine_decode operator when the MTP drafter model uses non-w8a8 while the main model uses w8a8, or when the drafter model is from the EAGLE series.
For more information about this operator, please refer to RFC issue #5476.
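As a rough illustration of the condition this PR enforces, a hypothetical helper (not the actual vllm-ascend code) might combine the two disable cases like this:

```python
def should_disable_fused_op(main_quant, drafter_quant, spec_method):
    """Disable dispatch_gmm_combine_decode when the MTP drafter is not
    W8A8 while the main model is W8A8, or the drafter is EAGLE-series.
    Illustrative sketch only; parameter names are assumptions."""
    if spec_method in ("eagle", "eagle3"):
        return True  # EAGLE-series drafters are disabled unconditionally
    if spec_method == "mtp":
        # Quantization mismatch: W8A8 main model, non-W8A8 drafter.
        return main_quant == "w8a8" and drafter_quant != "w8a8"
    return False

print(should_disable_fused_op("w8a8", "w8a8", "mtp"))    # False: both W8A8
print(should_disable_fused_op("w8a8", None, "mtp"))      # True: mismatch
print(should_disable_fused_op("w8a8", "w8a8", "eagle3")) # True: EAGLE series
```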
Does this PR introduce any user-facing change?
How was this patch tested?