[Kernel] add custom moe ops for prefill #4194

Merged
wangxiyuan merged 4 commits into vllm-project:main from shiro-zzzz:add_moe_normal
Dec 8, 2025
Conversation

@shiro-zzzz
Contributor

@shiro-zzzz shiro-zzzz commented Nov 14, 2025

What this PR does / why we need it?

1. Add implementations of the normal Aclnn operators: MoeCombineNormal, MoeDispatchNormal, NotifyDispatch, and DispatchLayout.

  • MoeCombineNormal: Implements the combine logic within MoE operations.
  • MoeDispatchNormal: Implements the dispatch logic within MoE operations.
  • NotifyDispatch: Exchanges topk_idx information among different ranks to calculate the device memory required for the dispatch stage.
  • DispatchLayout: Used to calculate information related to the device memory layout for the dispatch stage.
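To make the layout bookkeeping concrete, here is a minimal pure-Python sketch of the kind of statistics these ops derive from topk_idx. This is illustrative only: the actual Aclnn ops run on NPU, and NotifyDispatch exchanges these counts across ranks; the even sharding of experts across ranks is an assumption of this sketch.

```python
# Illustrative sketch only: derives dispatch-layout statistics from topk_idx.
# The real NotifyDispatch/DispatchLayout ops compute and exchange such counts
# on NPU to size device memory for the dispatch stage.

def dispatch_layout(topk_idx, num_experts, num_ranks):
    """Count tokens routed to each expert and to each rank.

    topk_idx: list of per-token lists of selected expert ids.
    Experts are assumed to be sharded evenly across ranks.
    """
    experts_per_rank = num_experts // num_ranks
    tokens_per_expert = [0] * num_experts
    tokens_per_rank = [0] * num_ranks
    for token_experts in topk_idx:
        seen_ranks = set()
        for e in token_experts:
            tokens_per_expert[e] += 1
            seen_ranks.add(e // experts_per_rank)
        for r in seen_ranks:
            tokens_per_rank[r] += 1  # each token is sent to a rank at most once
    return tokens_per_expert, tokens_per_rank

# Example: 4 experts on 2 ranks, two tokens with top-2 routing.
per_expert, per_rank = dispatch_layout([[0, 2], [2, 3]], num_experts=4, num_ranks=2)
print(per_expert)  # [1, 0, 2, 1]
print(per_rank)    # [1, 2]
```

In the real pipeline, counts like these size the device buffers that the dispatch stage fills, which is why the layout computation must run before dispatch.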

2. Provide PyTorch interfaces for the normal operators (get_dispatch_layout, dispatch_prefill, and combine_prefill) to be used for MoE communication during the prefill stage in vLLM.

  • get_dispatch_layout: Calculates information related to the device memory layout for the dispatch operator, and is called before dispatch_prefill.
  • dispatch_prefill: Initiates the dispatch operation.
  • combine_prefill: Initiates the combine operation.
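For intuition on what the combine step computes, the sketch below reduces each token's per-expert partial outputs back into a single hidden state using the router's topk weights. This is a hypothetical illustration, not the combine_prefill implementation, which is an NPU op with cross-rank communication.

```python
# Illustrative sketch only: the combine step conceptually reduces the
# per-expert partial outputs of each token into one hidden state,
# weighted by the router's topk weights.

def combine(expert_outputs, topk_weights):
    """Weighted sum of each token's expert outputs.

    expert_outputs: per token, a list of hidden-state vectors
                    (one per selected expert).
    topk_weights:   per token, the matching router weights.
    """
    combined = []
    for outs, weights in zip(expert_outputs, topk_weights):
        dim = len(outs[0])
        acc = [0.0] * dim
        for vec, w in zip(outs, weights):
            for i in range(dim):
                acc[i] += w * vec[i]
        combined.append(acc)
    return combined

# One token, two selected experts, hidden size 2.
out = combine([[[1.0, 2.0], [3.0, 4.0]]], [[0.25, 0.75]])
print(out)  # [[2.5, 3.5]]
```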

Does this PR introduce any user-facing change?

No

How was this patch tested?

The functionality has been validated locally with a Qwen model. Test cases will be added once support for multi-NPU use cases in the CI pipeline is finalized.

@github-actions
Contributor

👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:

  • A PR should do only one thing, smaller PRs enable faster reviews.
  • Every PR should include unit tests and end-to-end tests to ensure it works and is not broken by future PRs.
  • Write the commit message by fleshing out the PR description, to help reviewers and future developers understand the change.

If CI fails, you can run linting and testing checks locally according to Contributing and Testing.

Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces custom Mixture of Experts (MoE) operations for the prefill stage, which appears to be targeted for Ascend hardware. The changes include new kernel implementations, host-side logic, build scripts, and PyTorch bindings. While the core logic for the custom ops is complex and hardware-specific, I've identified several critical issues related to correctness in the PyTorch binding code, as well as high-severity issues in the build scripts and C++ host code that affect robustness and maintainability. Please address the critical correctness bugs and consider the high-severity suggestions to improve the quality of the code.

Comment threads:
  • csrc/torch_binding.cpp (outdated, two threads)
  • csrc/build_aclnn.sh (outdated)
  • csrc/custom_ops/build.sh (outdated, two threads)
  • csrc/moe_combine_normal/op_host/moe_combine_normal_tiling.cpp
@github-actions
Contributor

This pull request has conflicts, please resolve those before we can evaluate the pull request.

shiro-zzzz force-pushed the add_moe_normal branch 3 times, most recently from 9f41ea7 to 366c7b4 on December 3, 2025 at 08:21
shiro-zzzz force-pushed the add_moe_normal branch 6 times, most recently from e8180d5 to c1b05d3 on December 4, 2025 at 11:10
1. Add implementations of the normal Aclnn operators: MoeCombineNormal, MoeDispatchNormal, NotifyDispatch, and DispatchLayout.
2. Provide PyTorch interfaces for the normal operators: get_dispatch_layout, dispatch_prefill, and combine_prefill.

Signed-off-by: shiro-zzzz <zhangdianhao@huawei.com>
@wangxiyuan wangxiyuan merged commit 0617d7d into vllm-project:main Dec 8, 2025
25 checks passed
MengqingCao added a commit that referenced this pull request Dec 8, 2025
@MengqingCao
Collaborator

@shiro-zzzz this PR was reverted by #4806; please fix the issue mentioned in #4806 and resubmit this change. Thanks!

weijinqian0 pushed a commit to weijinqian0/vllm-ascend that referenced this pull request Dec 9, 2025
weijinqian0 pushed a commit to weijinqian0/vllm-ascend that referenced this pull request Dec 9, 2025
Reverts vllm-project#4194 as it broke CI in
https://github.com/vllm-project/vllm-ascend/actions/runs/20030369087/job/57437687382?pr=4791

Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com>
Signed-off-by: weijinqian_v1 <weijinqian@huawei.com>
Clorist33 pushed a commit to Clorist33/vllm-ascend that referenced this pull request Dec 10, 2025
Clorist33 pushed a commit to Clorist33/vllm-ascend that referenced this pull request Dec 10, 2025
Mercykid-bash pushed a commit to Mercykid-bash/vllm-ascend that referenced this pull request Dec 10, 2025
Mercykid-bash pushed a commit to Mercykid-bash/vllm-ascend that referenced this pull request Dec 10, 2025