
[XPU] Enable topk_per_row and indexer_quant_cache kernels for DeepSeekV3.2 and GLM5#37888

Open
xwu-intel wants to merge 8 commits into vllm-project:main from xwu-intel:xpu-ops-optimization

Conversation

@xwu-intel

@xwu-intel xwu-intel commented Mar 23, 2026

Continue the PR #37869

Waiting for fp8_mqa_logits and fp8_paged_mqa_logits xpu kernels... due to #37968

Purpose

This PR optimizes XPU operations in vLLM by integrating high-performance kernels from vllm-xpu-kernels. Specifically, it replaces the PyTorch fallback implementations for:

  • top_k_per_row_prefill
  • top_k_per_row_decode
  • indexer_k_quant_and_cache
  • cp_gather_indexer_k_quant_cache

The old PyTorch fallback paths were removed.
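For readers unfamiliar with these ops, the core semantics of top_k_per_row (selecting, for each row, the indices of its k largest values, which the new kernel computes natively on XPU) can be sketched in pure Python. This is an illustrative reference only, not the vLLM implementation, which operates on device tensors:

```python
def top_k_per_row(rows, k):
    """Pure-Python reference sketch: for each row, return the column
    indices of its k largest values, largest first. The real vLLM XPU
    kernel implements this on device tensors with far better performance."""
    out = []
    for row in rows:
        # Sort column indices by their value, descending, and keep the top k.
        idx = sorted(range(len(row)), key=lambda j: row[j], reverse=True)[:k]
        out.append(idx)
    return out
```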

Test Plan

  • Locally verify the DeepSeek V3.2 and GLM-5 reduced models to confirm kernel availability and correct execution on B60.

Test Result


Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing the test command.
  • The test results, such as pasting a before/after results comparison, or e2e results.
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request optimizes several XPU operations by replacing the Python-based fallback implementations with calls to high-performance C++ kernels from vllm-xpu-kernels. This is a valuable performance improvement. The changes also simplify the calling code by removing platform-specific branches. My review has identified a potential critical issue with a custom operator namespace and a typo in a function parameter name. Please see the detailed comments.

Comment thread vllm/_xpu_ops.py Outdated
Comment thread vllm/_xpu_ops.py Outdated
@xwu-intel xwu-intel marked this pull request as ready for review March 23, 2026 13:22
@xwu-intel xwu-intel force-pushed the xpu-ops-optimization branch from 81382a2 to 8a98984 Compare March 23, 2026 13:33
@xwu-intel
Author

@jikunshang @wuxun-zhang please review. This is best merged once the next vllm-xpu-kernels release is out.

Comment thread vllm/model_executor/layers/sparse_attn_indexer.py Outdated
Comment thread vllm/_xpu_ops.py Outdated
Comment thread vllm/_xpu_ops.py Outdated
Comment thread vllm/_xpu_ops.py Outdated

Contributor


@chaojun-zhang can vLLM IR help avoid these duplicated op-dispatch code paths?

Contributor


@ProExpertProg any suggestions here? It seems this custom op doesn't have a native implementation; how should it be handled with vLLM IR?

Contributor


For this case it's simple: there would be a single forward method that calls the registered IR kernel directly, regardless of platform. XPU would register its own IR kernel, which should be the same one as CUDA. Expect no more duplicated code.

Comment thread vllm/model_executor/layers/sparse_attn_indexer.py
@jikunshang jikunshang changed the title Optimize XPU ops using latest vllm-xpu-kernels [XPU] Optimize XPU ops using latest vllm-xpu-kernels Mar 24, 2026
@wuxun-zhang
Contributor

wuxun-zhang commented Mar 25, 2026

@xwu-intel This PR is going to enable the XPU indexer-related kernels for DeepSeek V3.2. I would suggest changing the title to reflect this, something like "enable topk_per_row and indexer_quant_cache kernels for DeepSeekV3.2".

@xwu-intel xwu-intel changed the title [XPU] Optimize XPU ops using latest vllm-xpu-kernels [XPU] Enable topk_per_row and indexer_quant_cache kernels for DeepSeekV3.2 and GLM5 Mar 26, 2026
@mergify mergify bot added deepseek Related to DeepSeek models intel-gpu Related to Intel GPU labels Mar 26, 2026
@mergify
Contributor

mergify bot commented Apr 1, 2026

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @xwu-intel.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Apr 1, 2026
@jikunshang
Collaborator

I think we can restart this as we bump vllm-xpu-kernels to the 0.1.5 release.

Comment thread vllm/model_executor/layers/sparse_attn_indexer.py Outdated
@wuxun-zhang
Contributor

@xwu-intel Please rebase the PR.

Signed-off-by: Xiaochang Wu <xiaochang.wu@intel.com>
Signed-off-by: Wu, Xiaochang <xiaochang.wu@intel.com>
Signed-off-by: Wu, Xiaochang <xiaochang.wu@intel.com>
Signed-off-by: Wu, Xiaochang <xiaochang.wu@intel.com>
Signed-off-by: Wu, Xiaochang <xiaochang.wu@intel.com>
Signed-off-by: Wu, Xiaochang <xiaochang.wu@intel.com>
Signed-off-by: Wu, Xiaochang <xiaochang.wu@intel.com>
Signed-off-by: Wu, Xiaochang <xiaochang.wu@intel.com>
@xwu-intel xwu-intel force-pushed the xpu-ops-optimization branch from f127edb to 08ec43a Compare April 7, 2026 03:00
@mergify mergify bot removed the needs-rebase label Apr 7, 2026
@wuxun-zhang
Contributor

cc @xinyu-intel

Comment thread vllm/_xpu_ops.py Outdated
Signed-off-by: Wu, Xiaochang <xiaochang.wu@intel.com>
@xwu-intel
Author

Waiting for fp8_mqa_logits and fp8_paged_mqa_logits xpu kernels... due to #37968

@mergify
Contributor

mergify bot commented Apr 9, 2026

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @xwu-intel.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork


Labels

deepseek Related to DeepSeek models intel-gpu Related to Intel GPU needs-rebase


5 participants