
[CustomOp] Register AscendSharedFusedMoE custom op#2980

Merged
wangxiyuan merged 3 commits into vllm-project:main from shen-shanshan:op
Sep 19, 2025

Conversation

@shen-shanshan
Collaborator

@shen-shanshan shen-shanshan commented Sep 17, 2025

What this PR does / why we need it?

Register AscendSharedFusedMoE custom op.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

DeepSeek-V2-Lite is a MoE model with shared experts.

Test:

```bash
vllm serve /root/.cache/modelscope/hub/models/deepseek-ai/DeepSeek-V2-Lite \
--trust-remote-code \
--enforce-eager \
--no-enable-prefix-caching \
--gpu-memory-utilization 0.95

curl -X POST http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "/root/.cache/modelscope/hub/models/deepseek-ai/DeepSeek-V2-Lite",
        "messages": [
            {"role": "user", "content": "介绍一下联通公司?"}
        ],
        "stream": false,
        "max_tokens": 100
    }'
```

Output:

中国联合网络通信集团有限公司(简称“中国联通”)于2009年1月6日在原中国网通和原中国联通的基础上合并组建而成,在国内31个省(自治区、直辖市)和境外多个国家和地区设有分支机构,是中国唯一一家在纽约、香港、上海三地同时上市的电信运营企业,连续多年入选“世界500强企业”。\n\n中国联通主要经营固定通信业务,移动通信业务,国内

@github-actions
Contributor

👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:

  • A PR should do only one thing; smaller PRs enable faster reviews.
  • Every PR should include unit tests and end-to-end tests to ensure it works and is not broken by future PRs.
  • Write the commit message by filling in the PR description to help reviewers and future developers understand.

If CI fails, you can run linting and testing checks locally according to Contributing and Testing.

Contributor

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request refactors the registration of the AscendSharedFusedMoE custom operator by replacing a model-specific monkey-patch with a centralized, global registration. This is a cleaner approach architecturally. However, I've raised one high-severity concern regarding the change in scope from a specific patch to a global one, which could introduce regressions in models that are not expecting this specific implementation of SharedFusedMoE. Apart from this risk, the implementation of the change is straightforward.
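To make the review's concern concrete, here is a minimal, self-contained sketch (not vLLM's actual API; `OP_REGISTRY` and `register_op` are made up for illustration) of how a global registration differs from a model-specific monkey-patch: once the op is registered globally, every lookup of that op name resolves to the Ascend class, not just the lookup from the one patched model.

```python
# Illustrative stand-in for a global custom-op registry (hypothetical names,
# not vLLM's real CustomOp machinery).
OP_REGISTRY: dict = {}

def register_op(name):
    """Globally map an op name to a platform-specific class."""
    def wrap(cls):
        OP_REGISTRY[name] = cls
        return cls
    return wrap

class SharedFusedMoE:
    def forward(self, x):
        return x  # reference implementation

@register_op("SharedFusedMoE")
class AscendSharedFusedMoE(SharedFusedMoE):
    def forward(self, x):
        return x * 2  # platform-specific override

# Global registration: any model that resolves "SharedFusedMoE" from the
# registry now gets the Ascend class -- hence the reviewer's scope concern.
op_cls = OP_REGISTRY["SharedFusedMoE"]
print(op_cls().forward(3))  # -> 6
```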

Comment thread vllm_ascend/utils.py
@shen-shanshan shen-shanshan marked this pull request as draft September 17, 2025 11:06
@wangxiyuan wangxiyuan added the `ready` (read for review) and `ready-for-test` (start test by label for PR) labels Sep 17, 2025
@shen-shanshan shen-shanshan changed the title [CustomOp] Register AscendSharedFusedMoE custom op [CustomOp][WIP] Register AscendSharedFusedMoE custom op Sep 18, 2025
@shen-shanshan shen-shanshan added and removed the `ready` (read for review) and `ready-for-test` (start test by label for PR) labels Sep 18, 2025
@shen-shanshan shen-shanshan marked this pull request as ready for review September 18, 2025 12:03
@shen-shanshan shen-shanshan changed the title [CustomOp][WIP] Register AscendSharedFusedMoE custom op [CustomOp] Register AscendSharedFusedMoE custom op Sep 18, 2025
@wangxiyuan wangxiyuan added and removed the `ready-for-test` (start test by label for PR) label Sep 19, 2025
Comment on lines -443 to +474
```diff
-        fused_out = super().forward(
+        _, fused_out = AscendFusedMoE.forward(
+            self,
             hidden_states=hidden_states,
             router_logits=router_logits,
         )
         return shared_out, fused_out
+
+    def forward_impl(self, hidden_states: torch.Tensor,
+                     router_logits: torch.Tensor):
+        shared_output = torch.empty(1)
+        fused_output = AscendFusedMoE.forward_impl(
+            self,
+            hidden_states=hidden_states,
+            router_logits=router_logits,
+        )
+        return shared_output, fused_output
```
Collaborator


I don't quite get it—why do we need forward_impl? Why don't we just call AscendFusedMoE.forward_impl() directly from AscendSharedFusedMoE.forward?

Collaborator Author

@shen-shanshan shen-shanshan Sep 19, 2025


@yiz-liu

Calling Workflow:

  1. AscendSharedFusedMoE: forward()
  2. AscendFusedMoE: forward()
  3. CustomOp: forward()
  4. CustomOp: forward_oot()
  5. FusedMoE: forward_native()
  6. AscendSharedFusedMoE: forward_impl()
  7. AscendFusedMoE: forward_impl()

In AscendFusedMoE.forward_impl() (step 7), we just return fused_output, but FusedMoE.forward_native() (step 5) expects two results (shared_output and fused_output), which conflicts with line 443 in AscendSharedFusedMoE.

Thus, we need to return a dummy shared_output in forward_impl() of AscendSharedFusedMoE.
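The workflow above can be sketched as a dependency-free mock (class names mirror the PR, but the bodies are illustrative stand-ins and plain lists replace torch tensors). It shows why AscendSharedFusedMoE.forward_impl must return a dummy shared_output: the base forward_native unconditionally unpacks two values.

```python
class FusedMoE:
    def forward_native(self, hidden_states, router_logits):
        # Step 5: the base class unconditionally unpacks two results.
        shared_output, fused_output = self.forward_impl(
            hidden_states, router_logits)
        return shared_output, fused_output

class AscendFusedMoE(FusedMoE):
    def forward_impl(self, hidden_states, router_logits):
        # Step 7: the plain MoE path returns only the fused output.
        return [h + 1.0 for h in hidden_states]  # stand-in for the experts

class AscendSharedFusedMoE(AscendFusedMoE):
    def forward_impl(self, hidden_states, router_logits):
        # Step 6: wrap the single fused result in a (dummy, fused) pair so
        # the two-value unpack in forward_native does not fail; the real
        # shared output is produced in forward().
        shared_output = [0.0]  # dummy, mirrors torch.empty(1) in the PR
        fused_output = AscendFusedMoE.forward_impl(
            self, hidden_states, router_logits)
        return shared_output, fused_output

_, fused = AscendSharedFusedMoE().forward_native([0.0, 0.0], router_logits=None)
print(fused)  # [1.0, 1.0]
```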


shen-shanshan and others added 3 commits September 19, 2025 06:56
Signed-off-by: Shanshan Shen <87969357+shen-shanshan@users.noreply.github.com>
Signed-off-by: Shanshan Shen <87969357+shen-shanshan@users.noreply.github.com>
Signed-off-by: shen-shanshan <467638484@qq.com>
@shen-shanshan
Collaborator Author

Test GLM-4.5

Run:

```bash
export VLLM_USE_V1=1
export VLLM_WORKER_MULTIPROC_METHOD="spawn"
export VLLM_USE_MODELSCOPE=True
export OMP_PROC_BIND=false
export OMP_NUM_THREADS=100
export HCCL_BUFFSIZE=1024
export PYTORCH_NPU_ALLOC_CONF=max_split_size_mb:256

vllm serve /root/.cache/modelscope/hub/models/ZhipuAI/GLM-4___5 \
--host 0.0.0.0 \
--port 8004 \
--tensor-parallel-size 16 \
--max-num-seqs 8 \
--max-model-len 16384 \
--max-num-batched-tokens 2048 \
--trust-remote-code \
--no-enable-prefix-caching \
--gpu-memory-utilization 0.95 \
--additional-config '{"ascend_scheduler_config":{"enabled":true, "enable_chunked_prefill": true}}' \
--enable-expert-parallel \
--enforce-eager
```

Curl:

```bash
curl -X POST http://172.22.0.155:8004/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "/root/.cache/modelscope/hub/models/ZhipuAI/GLM-4___5",
        "messages": [
            {"role": "user", "content": "介绍一下联通公司?"}
        ],
        "stream": false,
        "max_tokens": 100
    }'
```

Output:

\n<think>嗯,用户让我介绍一下联通公司。首先得确定用户需要哪些方面的信息。可能用户只是想了解基本概况,比如成立时间、主要业务,或者也可能对联通的市场地位、技术发展感兴趣。\n\n接下来,我需要回忆联通的基础信息。成立于1994年,原称中国联通,2009年和中国网通合并,改名为中国联通。这点要明确,避免混淆。然后总部在北京,央企属性很重要,说明其背景和稳定性。\n\n业务方面

@wangxiyuan wangxiyuan merged commit 8326f15 into vllm-project:main Sep 19, 2025
19 of 20 checks passed
weijinqian0 pushed a commit to weijinqian0/vllm-ascend that referenced this pull request Sep 22, 2025
### What this PR does / why we need it?
Register `AscendSharedFusedMoE` custom op.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

`DeepSeek-V2-Lite` is a MoE model with shared experts.

Test:

```bash
vllm serve /root/.cache/modelscope/hub/models/deepseek-ai/DeepSeek-V2-Lite \
--trust-remote-code \
--enforce-eager \
--no-enable-prefix-caching \
--gpu-memory-utilization 0.95

curl -X POST http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "/root/.cache/modelscope/hub/models/deepseek-ai/DeepSeek-V2-Lite",
        "messages": [
            {"role": "user", "content": "介绍一下联通公司?"}
        ],
        "stream": false,
        "max_tokens": 100
    }'
```

Output:

```bash
中国联合网络通信集团有限公司(简称“中国联通”)于2009年1月6日在原中国网通和原中国联通的基础上合并组建而成,在国内31个省(自治区、直辖市)和境外多个国家和地区设有分支机构,是中国唯一一家在纽约、香港、上海三地同时上市的电信运营企业,连续多年入选“世界500强企业”。\n\n中国联通主要经营固定通信业务,移动通信业务,国内
```


- vLLM version: v0.10.2
- vLLM main:
vllm-project/vllm@486c559

---------

Signed-off-by: Shanshan Shen <87969357+shen-shanshan@users.noreply.github.com>
Signed-off-by: shen-shanshan <467638484@qq.com>
Mercykid-bash pushed a commit to Mercykid-bash/vllm-ascend that referenced this pull request Sep 22, 2025
Mercykid-bash pushed a commit to Mercykid-bash/vllm-ascend that referenced this pull request Sep 22, 2025
@yiz-liu
Collaborator

yiz-liu commented Sep 23, 2025

@shen-shanshan This PR may have broken #2946 , please contact @whx-sjtu and see if you can work out a solution?

wangxiyuan pushed a commit that referenced this pull request Sep 28, 2025
This PR fixes an accuracy problem of aclgraph on A2. The problem was
introduced by PR #2980, which exposed the `all_reduce` of shared_experts
to torch dynamo. This PR moves all the code into forward_impl
to shield it from torch dynamo.

- vLLM version: v0.10.2
- vLLM main:
vllm-project/vllm@17b4c66

---------

Signed-off-by: whx-sjtu <2952154980@qq.com>
Angazenn pushed a commit to Angazenn/vllm-ascend that referenced this pull request Oct 21, 2025
Angazenn pushed a commit to Angazenn/vllm-ascend that referenced this pull request Oct 21, 2025
luolun pushed a commit to luolun/vllm-ascend that referenced this pull request Nov 19, 2025
luolun pushed a commit to luolun/vllm-ascend that referenced this pull request Nov 19, 2025
hwhaokun pushed a commit to hwhaokun/vllm-ascend that referenced this pull request Nov 19, 2025
hwhaokun pushed a commit to hwhaokun/vllm-ascend that referenced this pull request Nov 19, 2025
NSDie pushed a commit to NSDie/vllm-ascend that referenced this pull request Nov 24, 2025
NSDie pushed a commit to NSDie/vllm-ascend that referenced this pull request Nov 24, 2025
Clorist33 pushed a commit to Clorist33/vllm-ascend that referenced this pull request Dec 9, 2025
Clorist33 pushed a commit to Clorist33/vllm-ascend that referenced this pull request Dec 9, 2025

Labels

module:core, module:ops, ready (read for review), ready-for-test (start test by label for PR)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants