
add xpu op grouped topk #10

Closed
mayuyuace wants to merge 22 commits into vllm-project:main from mayuyuace:qiming/add_xpu_op_grouped_topk

Conversation

@mayuyuace (Collaborator)

add xpu grouped topk kernel

Copilot AI review requested due to automatic review settings August 8, 2025 07:52


mayuyuace and others added 6 commits August 8, 2025 15:57
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Comment thread on benchmark/benchmark_grouped_topk.py (outdated excerpt):

    start_time = time.perf_counter()

    for _ in range(num_iters):
        topk_weights, topk_indices = grouped_topk(
Collaborator:

Can we do some benchmarking comparing grouped_topk_native, grouped_topk_native with @torch.compile, and grouped_topk?

Collaborator (Author):

grouped_topk_native and grouped_topk_native with @torch.compile have been added.
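For reference, a minimal sketch of the three-way comparison discussed above. This is illustrative only: grouped_topk_native below is a hand-written stand-in (group experts, keep the best groups by max score, then top-k among the survivors), not the repo's implementation, and the real numbers come from benchmark/benchmark_grouped_topk.py.

```python
import time

import torch


def grouped_topk_native(scores: torch.Tensor, num_groups: int,
                        topk_groups: int, topk: int):
    # Stand-in reference: experts are split evenly into groups; keep the
    # best topk_groups groups (by max expert score), mask the rest, then
    # take the overall top-k among the surviving experts.
    num_tokens, num_experts = scores.shape
    group_scores = scores.view(num_tokens, num_groups, -1).max(dim=-1).values
    group_idx = torch.topk(group_scores, k=topk_groups, dim=-1).indices
    group_mask = torch.zeros_like(group_scores)
    group_mask.scatter_(1, group_idx, 1.0)
    score_mask = (group_mask.unsqueeze(-1)
                  .expand(num_tokens, num_groups, num_experts // num_groups)
                  .reshape(num_tokens, num_experts))
    masked = scores.masked_fill(score_mask == 0, float("-inf"))
    return torch.topk(masked, k=topk, dim=-1)


def bench(fn, *args, num_iters=20):
    fn(*args)  # warmup; also triggers compilation for compiled variants
    start = time.perf_counter()
    for _ in range(num_iters):
        fn(*args)
    return (time.perf_counter() - start) / num_iters


scores = torch.randn(64, 32)
t_native = bench(grouped_topk_native, scores, 8, 4, 6)
try:
    t_compiled = bench(torch.compile(grouped_topk_native), scores, 8, 4, 6)
except Exception:  # torch.compile may be unavailable in some environments
    t_compiled = float("nan")
print(f"native: {t_native * 1e6:.1f} us/iter, "
      f"compiled: {t_compiled * 1e6:.1f} us/iter")
```

The XPU grouped_topk kernel would slot into the same harness: pass the kernel as the callable and move the input tensors to the XPU device first.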

@dbyoung18 (Collaborator)

Suggest following upstream's structure: put MoE-related kernels under vllm-xpu-kernels/csrc/xpu/moe.
Reference: https://github.com/vllm-project/vllm/tree/main/csrc/moe

@mayuyuace (Collaborator, Author)

Done.

@dbyoung18 (Collaborator)

@jikunshang Considering later upstreaming, do you think we should follow vLLM and separate the MoE kernels into their own .so/module?
https://github.com/vllm-project/vllm/blob/main/setup.py#L324-L325
https://github.com/vllm-project/vllm/blob/main/vllm/_custom_ops.py#L16-L25

@jikunshang (Collaborator)

Agree, let's move this kernel into _moe.so since it is MoE-related.

@mayuyuace (Collaborator, Author)

@dbyoung18 @jikunshang
I have added _moe_C.abi3.so and the MoE directories under benchmark & tests.

@jikunshang (Collaborator)

I had some discussion with @dbyoung18 about the library name. We noticed that ROCm adds some non-CUDA kernels in _rocm_C (see link); maybe we should add this non-CUDA kernel to an _xpu_C extension. @dbyoung18 also has a similar kernel; he will try adding the _xpu_C extension first.

@mayuyuace (Collaborator, Author)

mayuyuace commented Aug 19, 2025

Since the name affects all ops, not only grouped_topk, maybe changing the .so name should be done in another PR?

@dbyoung18 (Collaborator)

I just refactored the CMake for _moe_C and submitted a related kernel, moe_sum; please take it as a reference.

@mayuyuace (Collaborator, Author)

The PR for moe_sum looks like the same thing I did in this JIRA.

@dbyoung18 (Collaborator)

1. Just noticed your latest modifications. I made the change yesterday in parallel with you; the main _moe_C parts are common between ours.
2. @jikunshang Since grouped_topk is part of MoE, I would prefer to put it under _moe_C, not _xpu_C. What's your opinion?
3. For UT, I think registering all kernels in vllm-xpu-kernels/tests/register_ops.py is enough, as it's designed to replace _ipex_ops.py; for CUDA, MoE-related kernels are also registered in _custom_ops.py. For individual unit tests, it's fine to separate them by category as in vllm/tests/kernels.

I think our main philosophy here is to align with the community as much as possible, to reduce the effort of upstreaming later.
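As a concrete illustration of the register_ops.py approach discussed in point 3 (the namespace, op name, and implementation below are hypothetical stand-ins, not the repo's actual code), a Python-level op registration with torch.library looks roughly like:

```python
import torch

# Hypothetical namespace/op; in the repo, the real kernels come from the
# compiled _moe_C extension, with register_ops.py exposing them to tests.
lib = torch.library.Library("demo_moe", "DEF")
lib.define("moe_sum(Tensor input) -> Tensor")


@torch.library.impl(lib, "moe_sum", "CompositeExplicitAutograd")
def moe_sum_impl(input: torch.Tensor) -> torch.Tensor:
    # Stand-in semantics: sum over the (assumed) expert dimension.
    return input.sum(dim=1)


# Callers (tests, benchmarks) go through the dispatcher-qualified name,
# so swapping in a compiled XPU kernel needs no call-site changes.
out = torch.ops.demo_moe.moe_sum(torch.ones(2, 3, 4))
```

The benefit of routing everything through one registration point is exactly the alignment argument above: call sites stay identical whether the op is backed by a Python reference, a CUDA kernel, or the XPU extension.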

@jikunshang (Collaborator)

vllm-project/vllm#23274 — vLLM added a fused grouped_topk recently. Please take a look at what we missed, thanks!

@mayuyuace (Collaborator, Author)

OK, I will write a new kernel based on vLLM's fused grouped_topk and compare it with what we use now.
I will create a new PR if the vLLM-derived kernel performs better.

@mayuyuace mayuyuace mentioned this pull request Sep 2, 2025
@jikunshang (Collaborator)

Can we close this?

@mayuyuace mayuyuace closed this Sep 12, 2025
@mayuyuace mayuyuace deleted the qiming/add_xpu_op_grouped_topk branch November 12, 2025 07:25