Skip to content

[Feature]Enable DispatchGmmCombineDecode when eagle is moe with w8a8 or not moe [RFC: issue 5476]#5758

Merged
wangxiyuan merged 1 commit intovllm-project:mainfrom
wangqiankun13:check_eagle
Jan 22, 2026
Merged

[Feature]Enable DispatchGmmCombineDecode when eagle is moe with w8a8 or not moe [RFC: issue 5476]#5758
wangxiyuan merged 1 commit intovllm-project:mainfrom
wangqiankun13:check_eagle

Conversation

@wangqiankun13
Copy link
Copy Markdown
Contributor

@wangqiankun13 wangqiankun13 commented Jan 9, 2026

What this PR does / why we need it?

Operator DispatchGmmCombineDecode does not support non-W8A8 scenarios and cannot share the same communication domain with Operator Dispatch/Combine.

for instance, when the draft model uses a non-W8A8 MOE architecture while the main model employs a W8A8 MOE architecture.

Therefore days ago, I implemented an interception that unconditionally disables Operator DispatchGmmCombineDecode whenever the speculative mode is EAGLE or EAGLE-3. PR: 5293
However, this approach was not precise enough.
This PR further refines the logic by specifically identifying the draft model's configuration: Operator DispatchGmmCombineDecode will now be disabled only when the draft model uses an MOE architecture and is non-W8A8.

More info about this operator, please refer to RFC: issue #5476

Does this PR introduce any user-facing change?

No

How was this patch tested?

Acc
test qwen3-235b eplb on a single A3 node(ep16),
with dispatch_gmm_combine_decode

nic_name="xxxx"
local_ip="xxx.xxx.xxx.xxx"

export HCCL_IF_IP=$local_ip
export GLOO_SOCKET_IFNAME=$nic_name
export TP_SOCKET_IFNAME=$nic_name
export HCCL_SOCKET_IFNAME=$nic_name

export VLLM_ASCEND_ENABLE_FUSED_MC2=2
echo "VLLM_ASCEND_ENABLE_FUSED_MC2=${VLLM_ASCEND_ENABLE_FUSED_MC2}"

export HCCL_OP_EXPANSION_MODE="AIV"
export HCCL_BUFFSIZE=512
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export OMP_PROC_BIND=false
export OMP_NUM_THREADS=10
export LD_PRELOAD=/usr/lib/aarch64-linux-gnu/libjemalloc.so.2:$LD_PRELOAD
vllm serve /dataset/Qwen3-235B-A22B-Instruct-2507-w8a8-QuaRot/ \
        --served-model-name "qwen" \
        --host 0.0.0.0 \
        --port 8004 \
        --async-scheduling \
        --tensor-parallel-size 4 \
        --data-parallel-size 4 \
        --max-num-seqs 64 \
        --max-model-len 40960 \
        --max-num-batched-tokens 16384 \
        --gpu-memory-utilization 0.9 \
        --enable-expert-parallel \
        --no-enable-prefix-caching \
        --quantization "ascend" \
        --trust-remote-code \
        --speculative_config \
        '{
            "method": "eagle3",
            "model": "/dataset/Qwen3-235B-A22B-Instruct-2507-speculator-eagle3/",
            "num_speculative_tokens": 2
        }' \
        --compilation-config '{"cudagraph_mode": "FULL_DECODE_ONLY"}' \
        2>&1 | tee qwen3_235b_eagle3.log
dataset version metric mode vllm-api-stream-chat
aime2024 604a78 accuracy gen 80.00

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces logic to conditionally enable DispatchGmmCombineDecode for speculative decoding, particularly for Eagle models. The change involves refactoring speculative_enable_dispatch_gmm_combine_decode to handle more complex conditions and adding a caching layer.

My main feedback is regarding the implementation of the caching mechanism. It uses a global variable without proper thread-safe access, which can lead to race conditions in a concurrent environment. I've provided a detailed comment on how to address this using a lock to ensure thread safety. Addressing this is critical for the stability of the application.

Comment thread vllm_ascend/utils.py Outdated
Comment on lines +872 to +877
def speculative_enable_dispatch_gmm_combine_decode(
vllm_config: VllmConfig) -> bool:
global _DRAFT_ATAPT_FUSED_MC2_MODE_2
if _DRAFT_ATAPT_FUSED_MC2_MODE_2 is None:
_DRAFT_ATAPT_FUSED_MC2_MODE_2 = _draft_adapt_fused_mc2_mode_2(vllm_config)
return _DRAFT_ATAPT_FUSED_MC2_MODE_2
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

critical

The caching implementation for speculative_enable_dispatch_gmm_combine_decode has a couple of critical issues:

  1. Typo: The global variable _DRAFT_ATAPT_FUSED_MC2_MODE_2 seems to have a typo and should likely be _DRAFT_ADAPT_FUSED_MC2_MODE_2.

  2. Race Condition: The use of a global variable for caching is not thread-safe. If multiple threads call this function concurrently when the global is None, it can lead to a race condition where the value is computed multiple times, and the global variable is written to without synchronization.

To fix this, I recommend renaming the variable and using a double-checked locking pattern to ensure thread-safe initialization.

First, update the global variable definition around line 72:

from threading import Lock
...
_DRAFT_ADAPT_FUSED_MC2_MODE_2 = None
_DRAFT_ADAPT_FUSED_MC2_MODE_2_LOCK = Lock()

Then, update this function as follows:

def speculative_enable_dispatch_gmm_combine_decode(
        vllm_config: VllmConfig) -> bool:
    global _DRAFT_ADAPT_FUSED_MC2_MODE_2
    # Use double-checked locking for thread-safe initialization.
    if _DRAFT_ADAPT_FUSED_MC2_MODE_2 is None:
        with _DRAFT_ADAPT_FUSED_MC2_MODE_2_LOCK:
            if _DRAFT_ADAPT_FUSED_MC2_MODE_2 is None:
                _DRAFT_ADAPT_FUSED_MC2_MODE_2 = _draft_adapt_fused_mc2_mode_2(vllm_config)
    return _DRAFT_ADAPT_FUSED_MC2_MODE_2

This approach ensures correctness and stability in a multi-threaded environment.

@github-actions
Copy link
Copy Markdown
Contributor

github-actions bot commented Jan 9, 2026

👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:‌‌

  • A PR should do only one thing, smaller PRs enable faster reviews.
  • Every PR should include unit tests and end-to-end tests ‌to ensure it works and is not broken by other future PRs.
  • Write the commit message by fulfilling the PR description to help reviewer and future developers understand.

If CI fails, you can run linting and testing checks locally according Contributing and Testing.

@wangqiankun13 wangqiankun13 force-pushed the check_eagle branch 4 times, most recently from 0037deb to 5b17fbf Compare January 15, 2026 01:54
@wangqiankun13 wangqiankun13 changed the title [Feature]Enable DispatchGmmCombineDecode when eagle is moe with w8a8 or not moe [Feature]Enable DispatchGmmCombineDecode when eagle is moe with w8a8 or not moe [RFC: issue 5476] Jan 15, 2026
@github-actions
Copy link
Copy Markdown
Contributor

This pull request has conflicts, please resolve those before we can evaluate the pull request.

@wangqiankun13 wangqiankun13 force-pushed the check_eagle branch 3 times, most recently from 0be8a20 to d921c9d Compare January 19, 2026 07:17
@wangxiyuan wangxiyuan added ready read for review ready-for-test start test by label for PR labels Jan 19, 2026
@wangqiankun13 wangqiankun13 force-pushed the check_eagle branch 6 times, most recently from 6d3f04e to 162c14c Compare January 20, 2026 11:10
… or not moe

Signed-off-by: wangqiankun <wangqiankun13@huawei.com>
@wangxiyuan wangxiyuan merged commit 08d7014 into vllm-project:main Jan 22, 2026
20 checks passed
wangxiyuan pushed a commit that referenced this pull request Jan 22, 2026
…oe with w8a8, or not moe (#6081)

### What this PR does / why we need it?
This PR is cherry-picked from
#5758.

Operator DispatchGmmCombineDecode does not support non-W8A8 scenarios
and cannot share the same communication domain with Operator
Dispatch/Combine.

for instance, when the draft model uses a non-W8A8 MOE architecture
while the main model employs a W8A8 MOE architecture.

Therefore days ago, I implemented an interception that unconditionally
disables Operator DispatchGmmCombineDecode whenever the speculative mode
is EAGLE or EAGLE-3.
#5293
However, this approach was not precise enough.
This PR further refines the logic by specifically identifying the draft
model's configuration: Operator DispatchGmmCombineDecode will now be
disabled only when the draft model uses an MOE architecture and is
non-W8A8.

More info about this operator, please refer to RFC: issue
#5476


Signed-off-by: wangqiankun <wangqiankun13@huawei.com>
845473182 pushed a commit to 845473182/vllm-ascend that referenced this pull request Jan 22, 2026
…to qwen3next_rebase

* 'main' of https://github.com/vllm-project/vllm-ascend: (51 commits)
  [Bugfix] Remove `use_aclgraph` in mtp_proposer and use `use_cuda_graph` (vllm-project#6032)
  [BugFix] fix 3vl dense model load quant weight (vllm-project#6100)
  [CP&SP] Integrate FIA operator in mla_cp._forward_decode (vllm-project#5641)
  [CI][Doc] Upgrade wheel building's CANN to 8.5.0 and update the Docs (vllm-project#6145)
  [CI]Install clang in dokerfile for triton ascend (vllm-project#4409)
  [Main] Upgrade PTA to 2.9.0 (vllm-project#6112)
  [Graph][Fusion] Add QKVNormRope and QKVNormRopeWithBias (vllm-project#5721)
  [P/D][PCP]bugfix pcp force free twice caused logger error (vllm-project#6124)
  [BugFix]converting pa get_workspace back to capturing (vllm-project#5833)
  [CI] optimize lint term (vllm-project#5986)
  [Bugfix] Fix Triton operator usage for multimodal models based on `the mrope_interleaved` parameter (vllm-project#6042)
  [bugfix][npugraph_ex]fix the model output type issue caused by manually modify FX graph (vllm-project#6015)
  [BugFix] Support setting tp=1 for the Eagle draft model to take effect (vllm-project#6097)
  [Misc] Bump mooncake version to v0.3.8.post1 (vllm-project#6110)
  [Feature]Enable DispatchGmmCombineDecode when eagle is moe with w8a8 or not moe [RFC: issue 5476] (vllm-project#5758)
  [bugfix] adapt_remote_request_id (vllm-project#6051)
  [Feature] Add support of new W4A4_LAOS_DYNAMIC quantization method (vllm-project#5143)
  [Feature] Support DSA-CP for Hybrid scenario (vllm-project#5702)
  [CI] Upgrade CANN to 8.5.0 (vllm-project#6070)
  Default enable MLAPO (vllm-project#5952)
  ...
starmountain1997 pushed a commit to starmountain1997/vllm-ascend that referenced this pull request Jan 31, 2026
…or not moe [RFC: issue 5476] (vllm-project#5758)

### What this PR does / why we need it?
Operator `DispatchGmmCombineDecode` does not support non-W8A8 scenarios
and cannot share the same communication domain with Operator
`Dispatch`/`Combine`.
> for instance, when the draft model uses a non-W8A8 MOE architecture
while the main model employs a W8A8 MOE architecture.

Therefore days ago, I implemented an interception that unconditionally
disables Operator `DispatchGmmCombineDecode` whenever the speculative
mode is `EAGLE` or `EAGLE-3`. [PR:
5293](vllm-project#5293)
However, this approach was not precise enough. 
This PR further refines the logic by specifically identifying the draft
model's configuration: Operator `DispatchGmmCombineDecode` will now be
disabled only when the draft model uses an MOE architecture and is
non-W8A8.

More info about this operator, please refer to RFC: issue
vllm-project#5476

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?

Acc
test qwen3-235b eplb on a single A3 node(ep16),
with dispatch_gmm_combine_decode

```shell
nic_name="xxxx"
local_ip="xxx.xxx.xxx.xxx"

export HCCL_IF_IP=$local_ip
export GLOO_SOCKET_IFNAME=$nic_name
export TP_SOCKET_IFNAME=$nic_name
export HCCL_SOCKET_IFNAME=$nic_name

export VLLM_ASCEND_ENABLE_FUSED_MC2=2
echo "VLLM_ASCEND_ENABLE_FUSED_MC2=${VLLM_ASCEND_ENABLE_FUSED_MC2}"

export HCCL_OP_EXPANSION_MODE="AIV"
export HCCL_BUFFSIZE=512
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export OMP_PROC_BIND=false
export OMP_NUM_THREADS=10
export LD_PRELOAD=/usr/lib/aarch64-linux-gnu/libjemalloc.so.2:$LD_PRELOAD
vllm serve /dataset/Qwen3-235B-A22B-Instruct-2507-w8a8-QuaRot/ \
        --served-model-name "qwen" \
        --host 0.0.0.0 \
        --port 8004 \
        --async-scheduling \
        --tensor-parallel-size 4 \
        --data-parallel-size 4 \
        --max-num-seqs 64 \
        --max-model-len 40960 \
        --max-num-batched-tokens 16384 \
        --gpu-memory-utilization 0.9 \
        --enable-expert-parallel \
        --no-enable-prefix-caching \
        --quantization "ascend" \
        --trust-remote-code \
        --speculative_config \
        '{
            "method": "eagle3",
            "model": "/dataset/Qwen3-235B-A22B-Instruct-2507-speculator-eagle3/",
            "num_speculative_tokens": 2
        }' \
        --compilation-config '{"cudagraph_mode": "FULL_DECODE_ONLY"}' \
        2>&1 | tee qwen3_235b_eagle3.log
```

| dataset | version | metric | mode | vllm-api-stream-chat |
|----- | ----- | ----- | ----- | -----|
| aime2024 | 604a78 | accuracy | gen | 80.00 |

- vLLM version: v0.13.0
- vLLM main:
vllm-project/vllm@2f4e654

Signed-off-by: wangqiankun <wangqiankun13@huawei.com>
starmountain1997 pushed a commit to starmountain1997/vllm-ascend that referenced this pull request Jan 31, 2026
…oe with w8a8, or not moe (vllm-project#6081)

### What this PR does / why we need it?
This PR is cherry-picked from
vllm-project#5758.

Operator DispatchGmmCombineDecode does not support non-W8A8 scenarios
and cannot share the same communication domain with Operator
Dispatch/Combine.

for instance, when the draft model uses a non-W8A8 MOE architecture
while the main model employs a W8A8 MOE architecture.

Therefore days ago, I implemented an interception that unconditionally
disables Operator DispatchGmmCombineDecode whenever the speculative mode
is EAGLE or EAGLE-3.
vllm-project#5293
However, this approach was not precise enough.
This PR further refines the logic by specifically identifying the draft
model's configuration: Operator DispatchGmmCombineDecode will now be
disabled only when the draft model uses an MOE architecture and is
non-W8A8.

More info about this operator, please refer to RFC: issue
vllm-project#5476


Signed-off-by: wangqiankun <wangqiankun13@huawei.com>
starmountain1997 pushed a commit to starmountain1997/vllm-ascend that referenced this pull request Jan 31, 2026
…or not moe [RFC: issue 5476] (vllm-project#5758)

### What this PR does / why we need it?
Operator `DispatchGmmCombineDecode` does not support non-W8A8 scenarios
and cannot share the same communication domain with Operator
`Dispatch`/`Combine`.
> for instance, when the draft model uses a non-W8A8 MOE architecture
while the main model employs a W8A8 MOE architecture.

Therefore days ago, I implemented an interception that unconditionally
disables Operator `DispatchGmmCombineDecode` whenever the speculative
mode is `EAGLE` or `EAGLE-3`. [PR:
5293](vllm-project#5293)
However, this approach was not precise enough. 
This PR further refines the logic by specifically identifying the draft
model's configuration: Operator `DispatchGmmCombineDecode` will now be
disabled only when the draft model uses an MOE architecture and is
non-W8A8.

More info about this operator, please refer to RFC: issue
vllm-project#5476

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?

Acc
test qwen3-235b eplb on a single A3 node(ep16),
with dispatch_gmm_combine_decode

```shell
nic_name="xxxx"
local_ip="xxx.xxx.xxx.xxx"

export HCCL_IF_IP=$local_ip
export GLOO_SOCKET_IFNAME=$nic_name
export TP_SOCKET_IFNAME=$nic_name
export HCCL_SOCKET_IFNAME=$nic_name

export VLLM_ASCEND_ENABLE_FUSED_MC2=2
echo "VLLM_ASCEND_ENABLE_FUSED_MC2=${VLLM_ASCEND_ENABLE_FUSED_MC2}"

export HCCL_OP_EXPANSION_MODE="AIV"
export HCCL_BUFFSIZE=512
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export OMP_PROC_BIND=false
export OMP_NUM_THREADS=10
export LD_PRELOAD=/usr/lib/aarch64-linux-gnu/libjemalloc.so.2:$LD_PRELOAD
vllm serve /dataset/Qwen3-235B-A22B-Instruct-2507-w8a8-QuaRot/ \
        --served-model-name "qwen" \
        --host 0.0.0.0 \
        --port 8004 \
        --async-scheduling \
        --tensor-parallel-size 4 \
        --data-parallel-size 4 \
        --max-num-seqs 64 \
        --max-model-len 40960 \
        --max-num-batched-tokens 16384 \
        --gpu-memory-utilization 0.9 \
        --enable-expert-parallel \
        --no-enable-prefix-caching \
        --quantization "ascend" \
        --trust-remote-code \
        --speculative_config \
        '{
            "method": "eagle3",
            "model": "/dataset/Qwen3-235B-A22B-Instruct-2507-speculator-eagle3/",
            "num_speculative_tokens": 2
        }' \
        --compilation-config '{"cudagraph_mode": "FULL_DECODE_ONLY"}' \
        2>&1 | tee qwen3_235b_eagle3.log
```

| dataset | version | metric | mode | vllm-api-stream-chat |
|----- | ----- | ----- | ----- | -----|
| aime2024 | 604a78 | accuracy | gen | 80.00 |

- vLLM version: v0.13.0
- vLLM main:
vllm-project/vllm@2f4e654

Signed-off-by: wangqiankun <wangqiankun13@huawei.com>
tangtiangu pushed a commit to tangtiangu/jiusi-vllm-ascend that referenced this pull request Feb 24, 2026
…oe with w8a8, or not moe (vllm-project#6081)

### What this PR does / why we need it?
This PR is cherry-picked from
vllm-project#5758.

Operator DispatchGmmCombineDecode does not support non-W8A8 scenarios
and cannot share the same communication domain with Operator
Dispatch/Combine.

for instance, when the draft model uses a non-W8A8 MOE architecture
while the main model employs a W8A8 MOE architecture.

Therefore days ago, I implemented an interception that unconditionally
disables Operator DispatchGmmCombineDecode whenever the speculative mode
is EAGLE or EAGLE-3.
vllm-project#5293
However, this approach was not precise enough.
This PR further refines the logic by specifically identifying the draft
model's configuration: Operator DispatchGmmCombineDecode will now be
disabled only when the draft model uses an MOE architecture and is
non-W8A8.

More info about this operator, please refer to RFC: issue
vllm-project#5476


Signed-off-by: wangqiankun <wangqiankun13@huawei.com>
tangtiangu pushed a commit to tangtiangu/jiusi-vllm-ascend that referenced this pull request Feb 24, 2026
…oe with w8a8, or not moe (vllm-project#6081)

### What this PR does / why we need it?
This PR is cherry-picked from
vllm-project#5758.

Operator DispatchGmmCombineDecode does not support non-W8A8 scenarios
and cannot share the same communication domain with Operator
Dispatch/Combine.

for instance, when the draft model uses a non-W8A8 MOE architecture
while the main model employs a W8A8 MOE architecture.

Therefore days ago, I implemented an interception that unconditionally
disables Operator DispatchGmmCombineDecode whenever the speculative mode
is EAGLE or EAGLE-3.
vllm-project#5293
However, this approach was not precise enough.
This PR further refines the logic by specifically identifying the draft
model's configuration: Operator DispatchGmmCombineDecode will now be
disabled only when the draft model uses an MOE architecture and is
non-W8A8.

More info about this operator, please refer to RFC: issue
vllm-project#5476


Signed-off-by: wangqiankun <wangqiankun13@huawei.com>
ZRJ026 pushed a commit to ZRJ026/vllm-ascend that referenced this pull request Feb 28, 2026
…or not moe [RFC: issue 5476] (vllm-project#5758)

### What this PR does / why we need it?
Operator `DispatchGmmCombineDecode` does not support non-W8A8 scenarios
and cannot share the same communication domain with Operator
`Dispatch`/`Combine`.
> for instance, when the draft model uses a non-W8A8 MOE architecture
while the main model employs a W8A8 MOE architecture.

Therefore days ago, I implemented an interception that unconditionally
disables Operator `DispatchGmmCombineDecode` whenever the speculative
mode is `EAGLE` or `EAGLE-3`. [PR:
5293](vllm-project#5293)
However, this approach was not precise enough.
This PR further refines the logic by specifically identifying the draft
model's configuration: Operator `DispatchGmmCombineDecode` will now be
disabled only when the draft model uses an MOE architecture and is
non-W8A8.

More info about this operator, please refer to RFC: issue
vllm-project#5476

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?

Acc
test qwen3-235b eplb on a single A3 node(ep16),
with dispatch_gmm_combine_decode

```shell
nic_name="xxxx"
local_ip="xxx.xxx.xxx.xxx"

export HCCL_IF_IP=$local_ip
export GLOO_SOCKET_IFNAME=$nic_name
export TP_SOCKET_IFNAME=$nic_name
export HCCL_SOCKET_IFNAME=$nic_name

export VLLM_ASCEND_ENABLE_FUSED_MC2=2
echo "VLLM_ASCEND_ENABLE_FUSED_MC2=${VLLM_ASCEND_ENABLE_FUSED_MC2}"

export HCCL_OP_EXPANSION_MODE="AIV"
export HCCL_BUFFSIZE=512
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export OMP_PROC_BIND=false
export OMP_NUM_THREADS=10
export LD_PRELOAD=/usr/lib/aarch64-linux-gnu/libjemalloc.so.2:$LD_PRELOAD
vllm serve /dataset/Qwen3-235B-A22B-Instruct-2507-w8a8-QuaRot/ \
        --served-model-name "qwen" \
        --host 0.0.0.0 \
        --port 8004 \
        --async-scheduling \
        --tensor-parallel-size 4 \
        --data-parallel-size 4 \
        --max-num-seqs 64 \
        --max-model-len 40960 \
        --max-num-batched-tokens 16384 \
        --gpu-memory-utilization 0.9 \
        --enable-expert-parallel \
        --no-enable-prefix-caching \
        --quantization "ascend" \
        --trust-remote-code \
        --speculative_config \
        '{
            "method": "eagle3",
            "model": "/dataset/Qwen3-235B-A22B-Instruct-2507-speculator-eagle3/",
            "num_speculative_tokens": 2
        }' \
        --compilation-config '{"cudagraph_mode": "FULL_DECODE_ONLY"}' \
        2>&1 | tee qwen3_235b_eagle3.log
```

| dataset | version | metric | mode | vllm-api-stream-chat |
|----- | ----- | ----- | ----- | -----|
| aime2024 | 604a78 | accuracy | gen | 80.00 |

- vLLM version: v0.13.0
- vLLM main:
vllm-project/vllm@2f4e654

Signed-off-by: wangqiankun <wangqiankun13@huawei.com>
Signed-off-by: zrj026 <zhangrunjiang026@gmail.com>
maoxx241 pushed a commit to maoxx241/vllm-ascend that referenced this pull request Mar 2, 2026
…or not moe [RFC: issue 5476] (vllm-project#5758)

### What this PR does / why we need it?
Operator `DispatchGmmCombineDecode` does not support non-W8A8 scenarios
and cannot share the same communication domain with Operator
`Dispatch`/`Combine`.
> for instance, when the draft model uses a non-W8A8 MOE architecture
while the main model employs a W8A8 MOE architecture.

Therefore days ago, I implemented an interception that unconditionally
disables Operator `DispatchGmmCombineDecode` whenever the speculative
mode is `EAGLE` or `EAGLE-3`. [PR:
5293](vllm-project#5293)
However, this approach was not precise enough. 
This PR further refines the logic by specifically identifying the draft
model's configuration: Operator `DispatchGmmCombineDecode` will now be
disabled only when the draft model uses an MOE architecture and is
non-W8A8.

More info about this operator, please refer to RFC: issue
vllm-project#5476

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?

Acc
test qwen3-235b eplb on a single A3 node(ep16),
with dispatch_gmm_combine_decode

```shell
nic_name="xxxx"
local_ip="xxx.xxx.xxx.xxx"

export HCCL_IF_IP=$local_ip
export GLOO_SOCKET_IFNAME=$nic_name
export TP_SOCKET_IFNAME=$nic_name
export HCCL_SOCKET_IFNAME=$nic_name

export VLLM_ASCEND_ENABLE_FUSED_MC2=2
echo "VLLM_ASCEND_ENABLE_FUSED_MC2=${VLLM_ASCEND_ENABLE_FUSED_MC2}"

export HCCL_OP_EXPANSION_MODE="AIV"
export HCCL_BUFFSIZE=512
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export OMP_PROC_BIND=false
export OMP_NUM_THREADS=10
export LD_PRELOAD=/usr/lib/aarch64-linux-gnu/libjemalloc.so.2:$LD_PRELOAD
vllm serve /dataset/Qwen3-235B-A22B-Instruct-2507-w8a8-QuaRot/ \
        --served-model-name "qwen" \
        --host 0.0.0.0 \
        --port 8004 \
        --async-scheduling \
        --tensor-parallel-size 4 \
        --data-parallel-size 4 \
        --max-num-seqs 64 \
        --max-model-len 40960 \
        --max-num-batched-tokens 16384 \
        --gpu-memory-utilization 0.9 \
        --enable-expert-parallel \
        --no-enable-prefix-caching \
        --quantization "ascend" \
        --trust-remote-code \
        --speculative_config \
        '{
            "method": "eagle3",
            "model": "/dataset/Qwen3-235B-A22B-Instruct-2507-speculator-eagle3/",
            "num_speculative_tokens": 2
        }' \
        --compilation-config '{"cudagraph_mode": "FULL_DECODE_ONLY"}' \
        2>&1 | tee qwen3_235b_eagle3.log
```

| dataset | version | metric | mode | vllm-api-stream-chat |
|----- | ----- | ----- | ----- | -----|
| aime2024 | 604a78 | accuracy | gen | 80.00 |

- vLLM version: v0.13.0
- vLLM main:
vllm-project/vllm@2f4e654

Signed-off-by: wangqiankun <wangqiankun13@huawei.com>
ZRJ026 pushed a commit to ZRJ026/vllm-ascend that referenced this pull request Mar 4, 2026
…or not moe [RFC: issue 5476] (vllm-project#5758)

### What this PR does / why we need it?
Operator `DispatchGmmCombineDecode` does not support non-W8A8 scenarios
and cannot share the same communication domain with Operator
`Dispatch`/`Combine`.
> for instance, when the draft model uses a non-W8A8 MOE architecture
while the main model employs a W8A8 MOE architecture.

Therefore days ago, I implemented an interception that unconditionally
disables Operator `DispatchGmmCombineDecode` whenever the speculative
mode is `EAGLE` or `EAGLE-3`. [PR:
5293](vllm-project#5293)
However, this approach was not precise enough.
This PR further refines the logic by specifically identifying the draft
model's configuration: Operator `DispatchGmmCombineDecode` will now be
disabled only when the draft model uses an MOE architecture and is
non-W8A8.

More info about this operator, please refer to RFC: issue
vllm-project#5476

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?

Acc
test qwen3-235b eplb on a single A3 node(ep16),
with dispatch_gmm_combine_decode

```shell
nic_name="xxxx"
local_ip="xxx.xxx.xxx.xxx"

export HCCL_IF_IP=$local_ip
export GLOO_SOCKET_IFNAME=$nic_name
export TP_SOCKET_IFNAME=$nic_name
export HCCL_SOCKET_IFNAME=$nic_name

export VLLM_ASCEND_ENABLE_FUSED_MC2=2
echo "VLLM_ASCEND_ENABLE_FUSED_MC2=${VLLM_ASCEND_ENABLE_FUSED_MC2}"

export HCCL_OP_EXPANSION_MODE="AIV"
export HCCL_BUFFSIZE=512
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export OMP_PROC_BIND=false
export OMP_NUM_THREADS=10
export LD_PRELOAD=/usr/lib/aarch64-linux-gnu/libjemalloc.so.2:$LD_PRELOAD
vllm serve /dataset/Qwen3-235B-A22B-Instruct-2507-w8a8-QuaRot/ \
        --served-model-name "qwen" \
        --host 0.0.0.0 \
        --port 8004 \
        --async-scheduling \
        --tensor-parallel-size 4 \
        --data-parallel-size 4 \
        --max-num-seqs 64 \
        --max-model-len 40960 \
        --max-num-batched-tokens 16384 \
        --gpu-memory-utilization 0.9 \
        --enable-expert-parallel \
        --no-enable-prefix-caching \
        --quantization "ascend" \
        --trust-remote-code \
        --speculative_config \
        '{
            "method": "eagle3",
            "model": "/dataset/Qwen3-235B-A22B-Instruct-2507-speculator-eagle3/",
            "num_speculative_tokens": 2
        }' \
        --compilation-config '{"cudagraph_mode": "FULL_DECODE_ONLY"}' \
        2>&1 | tee qwen3_235b_eagle3.log
```

| dataset | version | metric | mode | vllm-api-stream-chat |
|----- | ----- | ----- | ----- | -----|
| aime2024 | 604a78 | accuracy | gen | 80.00 |

- vLLM version: v0.13.0
- vLLM main:
vllm-project/vllm@2f4e654

Signed-off-by: wangqiankun <wangqiankun13@huawei.com>
Signed-off-by: zrj026 <zhangrunjiang026@gmail.com>
LCAIZJ pushed a commit to LCAIZJ/vllm-ascend that referenced this pull request Mar 7, 2026
…or not moe [RFC: issue 5476] (vllm-project#5758)

### What this PR does / why we need it?
Operator `DispatchGmmCombineDecode` does not support non-W8A8 scenarios
and cannot share the same communication domain with Operator
`Dispatch`/`Combine`.
> for instance, when the draft model uses a non-W8A8 MOE architecture
while the main model employs a W8A8 MOE architecture.

Therefore days ago, I implemented an interception that unconditionally
disables Operator `DispatchGmmCombineDecode` whenever the speculative
mode is `EAGLE` or `EAGLE-3`. [PR:
5293](vllm-project#5293)
However, this approach was not precise enough. 
This PR further refines the logic by specifically identifying the draft
model's configuration: Operator `DispatchGmmCombineDecode` will now be
disabled only when the draft model uses an MOE architecture and is
non-W8A8.

More info about this operator, please refer to RFC: issue
vllm-project#5476

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?

Acc
test qwen3-235b eplb on a single A3 node(ep16),
with dispatch_gmm_combine_decode

```shell
nic_name="xxxx"
local_ip="xxx.xxx.xxx.xxx"

export HCCL_IF_IP=$local_ip
export GLOO_SOCKET_IFNAME=$nic_name
export TP_SOCKET_IFNAME=$nic_name
export HCCL_SOCKET_IFNAME=$nic_name

export VLLM_ASCEND_ENABLE_FUSED_MC2=2
echo "VLLM_ASCEND_ENABLE_FUSED_MC2=${VLLM_ASCEND_ENABLE_FUSED_MC2}"

export HCCL_OP_EXPANSION_MODE="AIV"
export HCCL_BUFFSIZE=512
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export OMP_PROC_BIND=false
export OMP_NUM_THREADS=10
export LD_PRELOAD=/usr/lib/aarch64-linux-gnu/libjemalloc.so.2:$LD_PRELOAD
vllm serve /dataset/Qwen3-235B-A22B-Instruct-2507-w8a8-QuaRot/ \
        --served-model-name "qwen" \
        --host 0.0.0.0 \
        --port 8004 \
        --async-scheduling \
        --tensor-parallel-size 4 \
        --data-parallel-size 4 \
        --max-num-seqs 64 \
        --max-model-len 40960 \
        --max-num-batched-tokens 16384 \
        --gpu-memory-utilization 0.9 \
        --enable-expert-parallel \
        --no-enable-prefix-caching \
        --quantization "ascend" \
        --trust-remote-code \
        --speculative_config \
        '{
            "method": "eagle3",
            "model": "/dataset/Qwen3-235B-A22B-Instruct-2507-speculator-eagle3/",
            "num_speculative_tokens": 2
        }' \
        --compilation-config '{"cudagraph_mode": "FULL_DECODE_ONLY"}' \
        2>&1 | tee qwen3_235b_eagle3.log
```

| dataset | version | metric | mode | vllm-api-stream-chat |
|----- | ----- | ----- | ----- | -----|
| aime2024 | 604a78 | accuracy | gen | 80.00 |

- vLLM version: v0.13.0
- vLLM main:
vllm-project/vllm@2f4e654

Signed-off-by: wangqiankun <wangqiankun13@huawei.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

module:core ready read for review ready-for-test start test by label for PR

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants