[CustomOp] Register AscendSharedFusedMoE custom op #2980
wangxiyuan merged 3 commits into vllm-project:main from
Conversation
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:
- If CI fails, you can run the linting and testing checks locally according to Contributing and Testing.
Code Review
This pull request refactors the registration of the AscendSharedFusedMoE custom operator by replacing a model-specific monkey-patch with a centralized, global registration. This is a cleaner approach architecturally. However, I've raised one high-severity concern regarding the change in scope from a specific patch to a global one, which could introduce regressions in models that are not expecting this specific implementation of SharedFusedMoE. Apart from this risk, the implementation of the change is straightforward.
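To make the reviewer's point concrete, here is a small, self-contained sketch of the two approaches being contrasted. The registry helper, class names, and lookup below are illustrative stand-ins only, not the actual vLLM or vllm-ascend registration APIs.

```python
# Hypothetical sketch: model-specific monkey-patch vs. global registry lookup.
# Everything here is illustrative; it is not the vLLM CustomOp machinery.

OP_REGISTRY: dict[str, type] = {}


def register_op(name: str):
    def wrap(cls):
        OP_REGISTRY[name] = cls
        return cls
    return wrap


class SharedFusedMoE:
    """Default (reference) implementation."""


# Before: a model-specific monkey-patch rebinds the symbol in one model file
# only (e.g. `deepseek_module.SharedFusedMoE = AscendSharedFusedMoE`), so
# other models that use SharedFusedMoE keep the default implementation.

# After: a global registration means every model that looks up
# "SharedFusedMoE" gets the Ascend implementation; this broader scope is the
# reviewer's regression concern.
@register_op("SharedFusedMoE")
class AscendSharedFusedMoE(SharedFusedMoE):
    """Ascend-specific implementation."""


impl = OP_REGISTRY.get("SharedFusedMoE", SharedFusedMoE)
print(impl.__name__)  # AscendSharedFusedMoE
```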
```python
# AscendSharedFusedMoE.forward() (reviewed hunk)
_, fused_out = AscendFusedMoE.forward(  # was: fused_out = super().forward(
    self,
    hidden_states=hidden_states,
    router_logits=router_logits,
)
return shared_out, fused_out
```

```python
def forward_impl(self, hidden_states: torch.Tensor,
                 router_logits: torch.Tensor):
    shared_output = torch.empty(1)
    fused_output = AscendFusedMoE.forward_impl(
        self,
        hidden_states=hidden_states,
        router_logits=router_logits,
    )
    return shared_output, fused_output
```
I don't quite get it—why do we need forward_impl? Why don't we just call AscendFusedMoE.forward_impl() directly from AscendSharedFusedMoE.forward?
Calling workflow:
1. AscendSharedFusedMoE.forward()
2. AscendFusedMoE.forward()
3. CustomOp.forward()
4. CustomOp.forward_oot()
5. FusedMoE.forward_native()
6. AscendSharedFusedMoE.forward_impl()
7. AscendFusedMoE.forward_impl()

In AscendFusedMoE.forward_impl() (step 7) we return only fused_output, but FusedMoE.forward_native() (step 5) expects two results (shared_output and fused_output), which conflicts with line 443 in AscendSharedFusedMoE.
Thus, we need to return a dummy shared_output in forward_impl() of AscendSharedFusedMoE; see the sketch below.
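A minimal, self-contained sketch of that constraint (the class names are stand-ins, not the real vLLM classes): forward_native() unpacks two values, so forward_impl() has to return a 2-tuple even though the Ascend path only produces the fused output at that point.

```python
import torch


class FusedMoELike:
    """Stand-in for FusedMoE: forward_native() unpacks two return values."""

    def forward_native(self, hidden_states, router_logits):
        shared_output, fused_output = self.forward_impl(
            hidden_states, router_logits)
        return shared_output, fused_output


class AscendSharedFusedMoELike(FusedMoELike):
    """Stand-in for AscendSharedFusedMoE."""

    def forward_impl(self, hidden_states, router_logits):
        # Dummy placeholder so the 2-tuple unpacking above does not fail;
        # the real shared-expert output is produced earlier, in forward().
        shared_output = torch.empty(1)
        fused_output = hidden_states  # stand-in for the routed-expert result
        return shared_output, fused_output


layer = AscendSharedFusedMoELike()
shared, fused = layer.forward_native(torch.randn(2, 4), torch.randn(2, 8))
print(shared.shape, fused.shape)
```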
Signed-off-by: Shanshan Shen <87969357+shen-shanshan@users.noreply.github.com>
Test GLM-4.5

Run:
```bash
export VLLM_USE_V1=1
export VLLM_WORKER_MULTIPROC_METHOD="spawn"
export VLLM_USE_MODELSCOPE=True
export OMP_PROC_BIND=false
export OMP_NUM_THREADS=100
export HCCL_BUFFSIZE=1024
export PYTORCH_NPU_ALLOC_CONF=max_split_size_mb:256

vllm serve /root/.cache/modelscope/hub/models/ZhipuAI/GLM-4___5 \
    --host 0.0.0.0 \
    --port 8004 \
    --tensor-parallel-size 16 \
    --max-num-seqs 8 \
    --max-model-len 16384 \
    --max-num-batched-tokens 2048 \
    --trust-remote-code \
    --no-enable-prefix-caching \
    --gpu-memory-utilization 0.95 \
    --additional-config '{"ascend_scheduler_config":{"enabled":true, "enable_chunked_prefill": true}}' \
    --enable-expert-parallel \
    --enforce-eager
```

Curl:
```bash
curl -X POST http://172.22.0.155:8004/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "/root/.cache/modelscope/hub/models/ZhipuAI/GLM-4___5",
        "messages": [
            {"role": "user", "content": "介绍一下联通公司?"}
        ],
        "stream": false,
        "max_tokens": 100
    }'
```

Output:
```bash
\n<think>嗯,用户让我介绍一下联通公司。首先得确定用户需要哪些方面的信息。可能用户只是想了解基本概况,比如成立时间、主要业务,或者也可能对联通的市场地位、技术发展感兴趣。\n\n接下来,我需要回忆联通的基础信息。成立于1994年,原称中国联通,2009年和中国网通合并,改名为中国联通。这点要明确,避免混淆。然后总部在北京,央企属性很重要,说明其背景和稳定性。\n\n业务方面
```
### What this PR does / why we need it?
Register `AscendSharedFusedMoE` custom op.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
`DeepSeek-V2-Lite` is a MoE model with shared experts.
Test:
```bash
vllm serve /root/.cache/modelscope/hub/models/deepseek-ai/DeepSeek-V2-Lite \
--trust-remote-code \
--enforce-eager \
--no-enable-prefix-caching \
--gpu-memory-utilization 0.95
curl -X POST http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "/root/.cache/modelscope/hub/models/deepseek-ai/DeepSeek-V2-Lite",
"messages": [
{"role": "user", "content": "介绍一下联通公司?"}
],
"stream": false,
"max_tokens": 100
}'
```
Output:
```bash
中国联合网络通信集团有限公司(简称“中国联通”)于2009年1月6日在原中国网通和原中国联通的基础上合并组建而成,在国内31个省(自治区、直辖市)和境外多个国家和地区设有分支机构,是中国唯一一家在纽约、香港、上海三地同时上市的电信运营企业,连续多年入选“世界500强企业”。\n\n中国联通主要经营固定通信业务,移动通信业务,国内
```
- vLLM version: v0.10.2
- vLLM main: vllm-project/vllm@486c559
---------
Signed-off-by: Shanshan Shen <87969357+shen-shanshan@users.noreply.github.com>
Signed-off-by: shen-shanshan <467638484@qq.com>
@shen-shanshan This PR may have broken #2946, please contact @whx-sjtu and see if you can work out a solution?
This PR fixes an accuracy problem with aclgraph on A2. The problem was introduced by PR #2980, which exposed the `all_reduce` of the shared experts to torch dynamo. This PR moves all of that code into forward_impl to shield it from torch dynamo.
- vLLM version: v0.10.2
- vLLM main: vllm-project/vllm@17b4c66
---------
Signed-off-by: whx-sjtu <2952154980@qq.com>
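To illustrate the general mechanism that fix relies on: code executed inside the traced forward is captured by torch dynamo, while code kept behind a boundary dynamo does not trace runs eagerly. The standalone sketch below is not the actual fix; it only approximates that boundary with torch.compiler.disable, and the function bodies are stand-ins.

```python
import torch


@torch.compiler.disable
def shared_expert_path(x: torch.Tensor) -> torch.Tensor:
    # Stand-in for the shared-expert compute + all_reduce that must not be
    # captured in the compiled graph.
    return x * 2.0


def moe_layer(x: torch.Tensor) -> torch.Tensor:
    shared = shared_expert_path(x)  # runs eagerly, outside the captured graph
    fused = torch.relu(x)           # stand-in for the routed-expert path
    return shared + fused


compiled = torch.compile(moe_layer)
print(compiled(torch.randn(4)))
```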