[CustomOp] Register AscendSharedFusedMoE custom op #2980
wangxiyuan merged 3 commits into vllm-project:main from
Conversation
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:
- If CI fails, you can run the linting and testing checks locally according to Contributing and Testing.
Code Review
This pull request refactors the registration of the AscendSharedFusedMoE custom operator by replacing a model-specific monkey-patch with a centralized, global registration. This is a cleaner approach architecturally. However, I've raised one high-severity concern regarding the change in scope from a specific patch to a global one, which could introduce regressions in models that are not expecting this specific implementation of SharedFusedMoE. Apart from this risk, the implementation of the change is straightforward.
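To make the reviewer's point concrete, here is a small, self-contained sketch of the two approaches being contrasted. The registry helper, class names, and lookup below are illustrative stand-ins only, not the actual vLLM or vllm-ascend registration APIs.

```python
# Hypothetical sketch: model-specific monkey-patch vs. global registry lookup.
# Everything here is illustrative; it is not the vLLM CustomOp machinery.

OP_REGISTRY: dict[str, type] = {}


def register_op(name: str):
    def wrap(cls):
        OP_REGISTRY[name] = cls
        return cls
    return wrap


class SharedFusedMoE:
    """Default (reference) implementation."""


# Before: a model-specific monkey-patch rebinds the symbol in one model file
# only (e.g. `deepseek_module.SharedFusedMoE = AscendSharedFusedMoE`), so
# other models that use SharedFusedMoE keep the default implementation.

# After: a global registration means every model that looks up
# "SharedFusedMoE" gets the Ascend implementation; this broader scope is the
# reviewer's regression concern.
@register_op("SharedFusedMoE")
class AscendSharedFusedMoE(SharedFusedMoE):
    """Ascend-specific implementation."""


impl = OP_REGISTRY.get("SharedFusedMoE", SharedFusedMoE)
print(impl.__name__)  # AscendSharedFusedMoE
```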
```python
# AscendSharedFusedMoE.forward() (reviewed hunk)
_, fused_out = AscendFusedMoE.forward(  # was: fused_out = super().forward(
    self,
    hidden_states=hidden_states,
    router_logits=router_logits,
)
return shared_out, fused_out
```

```python
def forward_impl(self, hidden_states: torch.Tensor,
                 router_logits: torch.Tensor):
    shared_output = torch.empty(1)
    fused_output = AscendFusedMoE.forward_impl(
        self,
        hidden_states=hidden_states,
        router_logits=router_logits,
    )
    return shared_output, fused_output
```
I don't quite get it—why do we need forward_impl? Why don't we just call AscendFusedMoE.forward_impl() directly from AscendSharedFusedMoE.forward?
Calling workflow:
1. AscendSharedFusedMoE.forward()
2. AscendFusedMoE.forward()
3. CustomOp.forward()
4. CustomOp.forward_oot()
5. FusedMoE.forward_native()
6. AscendSharedFusedMoE.forward_impl()
7. AscendFusedMoE.forward_impl()

In AscendFusedMoE.forward_impl() (step 7) we return only fused_output, but FusedMoE.forward_native() (step 5) expects two results (shared_output and fused_output), which conflicts with line 443 in AscendSharedFusedMoE.
Thus, we need to return a dummy shared_output in forward_impl() of AscendSharedFusedMoE; see the sketch below.
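A minimal, self-contained sketch of that constraint (the class names are stand-ins, not the real vLLM classes): forward_native() unpacks two values, so forward_impl() has to return a 2-tuple even though the Ascend path only produces the fused output at that point.

```python
import torch


class FusedMoELike:
    """Stand-in for FusedMoE: forward_native() unpacks two return values."""

    def forward_native(self, hidden_states, router_logits):
        shared_output, fused_output = self.forward_impl(
            hidden_states, router_logits)
        return shared_output, fused_output


class AscendSharedFusedMoELike(FusedMoELike):
    """Stand-in for AscendSharedFusedMoE."""

    def forward_impl(self, hidden_states, router_logits):
        # Dummy placeholder so the 2-tuple unpacking above does not fail;
        # the real shared-expert output is produced earlier, in forward().
        shared_output = torch.empty(1)
        fused_output = hidden_states  # stand-in for the routed-expert result
        return shared_output, fused_output


layer = AscendSharedFusedMoELike()
shared, fused = layer.forward_native(torch.randn(2, 4), torch.randn(2, 8))
print(shared.shape, fused.shape)
```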
Signed-off-by: Shanshan Shen <87969357+shen-shanshan@users.noreply.github.com>
Test GLM-4.5

Run:
```bash
export VLLM_USE_V1=1
export VLLM_WORKER_MULTIPROC_METHOD="spawn"
export VLLM_USE_MODELSCOPE=True
export OMP_PROC_BIND=false
export OMP_NUM_THREADS=100
export HCCL_BUFFSIZE=1024
export PYTORCH_NPU_ALLOC_CONF=max_split_size_mb:256

vllm serve /root/.cache/modelscope/hub/models/ZhipuAI/GLM-4___5 \
    --host 0.0.0.0 \
    --port 8004 \
    --tensor-parallel-size 16 \
    --max-num-seqs 8 \
    --max-model-len 16384 \
    --max-num-batched-tokens 2048 \
    --trust-remote-code \
    --no-enable-prefix-caching \
    --gpu-memory-utilization 0.95 \
    --additional-config '{"ascend_scheduler_config":{"enabled":true, "enable_chunked_prefill": true}}' \
    --enable-expert-parallel \
    --enforce-eager
```

Curl:
```bash
curl -X POST http://172.22.0.155:8004/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "/root/.cache/modelscope/hub/models/ZhipuAI/GLM-4___5",
        "messages": [
            {"role": "user", "content": "介绍一下联通公司?"}
        ],
        "stream": false,
        "max_tokens": 100
    }'
```

Output:
```bash
\n<think>嗯,用户让我介绍一下联通公司。首先得确定用户需要哪些方面的信息。可能用户只是想了解基本概况,比如成立时间、主要业务,或者也可能对联通的市场地位、技术发展感兴趣。\n\n接下来,我需要回忆联通的基础信息。成立于1994年,原称中国联通,2009年和中国网通合并,改名为中国联通。这点要明确,避免混淆。然后总部在北京,央企属性很重要,说明其背景和稳定性。\n\n业务方面
```
### What this PR does / why we need it?
Register `AscendSharedFusedMoE` custom op.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
`DeepSeek-V2-Lite` is a MoE model with shared experts.
Test:
```bash
vllm serve /root/.cache/modelscope/hub/models/deepseek-ai/DeepSeek-V2-Lite \
--trust-remote-code \
--enforce-eager \
--no-enable-prefix-caching \
--gpu-memory-utilization 0.95
curl -X POST http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "/root/.cache/modelscope/hub/models/deepseek-ai/DeepSeek-V2-Lite",
"messages": [
{"role": "user", "content": "介绍一下联通公司?"}
],
"stream": false,
"max_tokens": 100
}'
```
Output:
```bash
中国联合网络通信集团有限公司(简称“中国联通”)于2009年1月6日在原中国网通和原中国联通的基础上合并组建而成,在国内31个省(自治区、直辖市)和境外多个国家和地区设有分支机构,是中国唯一一家在纽约、香港、上海三地同时上市的电信运营企业,连续多年入选“世界500强企业”。\n\n中国联通主要经营固定通信业务,移动通信业务,国内
```
- vLLM version: v0.10.2
- vLLM main: vllm-project/vllm@486c559
---------
Signed-off-by: Shanshan Shen <87969357+shen-shanshan@users.noreply.github.com>
Signed-off-by: shen-shanshan <467638484@qq.com>
@shen-shanshan This PR may have broken #2946, please contact @whx-sjtu and see if you can work out a solution?
This PR fixes an accuracy problem with aclgraph on A2. The problem was introduced by PR #2980, which exposed the `all_reduce` of the shared experts to torch dynamo. This PR moves all of that code into forward_impl to shield it from torch dynamo.
- vLLM version: v0.10.2
- vLLM main: vllm-project/vllm@17b4c66
---------
Signed-off-by: whx-sjtu <2952154980@qq.com>
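To illustrate the general mechanism that fix relies on: code executed inside the traced forward is captured by torch dynamo, while code kept behind a boundary dynamo does not trace runs eagerly. The standalone sketch below is not the actual fix; it only approximates that boundary with torch.compiler.disable, and the function bodies are stand-ins.

```python
import torch


@torch.compiler.disable
def shared_expert_path(x: torch.Tensor) -> torch.Tensor:
    # Stand-in for the shared-expert compute + all_reduce that must not be
    # captured in the compiled graph.
    return x * 2.0


def moe_layer(x: torch.Tensor) -> torch.Tensor:
    shared = shared_expert_path(x)  # runs eagerly, outside the captured graph
    fused = torch.relu(x)           # stand-in for the routed-expert path
    return shared + fused


compiled = torch.compile(moe_layer)
print(compiled(torch.randn(4)))
```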