[XPU] Enable topk_per_row and indexer_quant_cache kernels for DeepSeekV3.2 and GLM5 #37888
xwu-intel wants to merge 8 commits into vllm-project:main from
Conversation
Code Review
This pull request optimizes several XPU operations by replacing the Python-based fallback implementations with calls to high-performance C++ kernels from vllm-xpu-kernels. This is a valuable performance improvement, and the changes also simplify the calling code by removing platform-specific branches. My review identified a potentially critical issue with a custom operator namespace and a typo in a function parameter name; please see the detailed comments.
Force-pushed 81382a2 to 8a98984
@jikunshang @wuxun-zhang please review. Better to merge when the next vllm-xpu-kernels is released.
    self.max_total_seq_len,
    self.topk_indices_buffer,
)
@chaojun-zhang can vLLM IR help avoid these op dispatches with duplicated code?
@ProExpertProg any suggestions here? It seems this custom op doesn't have a native implementation; how should it be handled with vLLM IR?
For this case it's simple: there would be a single forward method that calls the registered IR kernel directly, regardless of platform. XPU will register its own IR kernel, which should be the same one as CUDA. Expect no more duplicated code.
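To make the idea concrete, here is a minimal sketch of the "single forward method + per-platform registered kernel" pattern described above. This is not vLLM's actual IR API; the registry, `register_kernel`, `dispatch`, and the reference implementation are all hypothetical, illustrating only the shape of the dispatch.

```python
# Hypothetical sketch: one shared call site, per-platform kernel registry.
# Names (_KERNELS, register_kernel, dispatch) are illustrative, not vLLM's API.

_KERNELS: dict = {}

def register_kernel(op_name: str, platform: str):
    """Register a platform-specific implementation under (op, platform)."""
    def wrap(fn):
        _KERNELS[(op_name, platform)] = fn
        return fn
    return wrap

def dispatch(op_name: str, platform: str, *args):
    """Single forward call site: no per-platform branches in calling code."""
    return _KERNELS[(op_name, platform)](*args)

# Both platforms register the same op name, so calling code stays shared.
@register_kernel("topk_per_row", "cuda")
@register_kernel("topk_per_row", "xpu")
def _topk_per_row_ref(scores, k):
    # Reference behavior: indices of the k largest values in each row.
    return [sorted(range(len(row)), key=lambda i: -row[i])[:k] for row in scores]
```

With this shape, the forward method calls `dispatch("topk_per_row", platform, ...)` once, and adding a platform means registering a kernel rather than adding a branch.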
@xwu-intel This PR enables XPU indexer-related kernels for DeepSeek V3.2; I would suggest changing the title to reflect this.
This pull request has merge conflicts that must be resolved before it can be merged.
I think we can restart this once we bump to the vllm-xpu-kernels 0.1.5 release.
@xwu-intel Please rebase the PR. |
Signed-off-by: Xiaochang Wu <xiaochang.wu@intel.com>
Signed-off-by: Wu, Xiaochang <xiaochang.wu@intel.com>
Force-pushed f127edb to 08ec43a
cc @xinyu-intel
Waiting for fp8_mqa_logits and fp8_paged_mqa_logits XPU kernels due to #37968.
This pull request has merge conflicts that must be resolved before it can be merged.
Continues PR #37869.
Waiting for fp8_mqa_logits and fp8_paged_mqa_logits XPU kernels due to #37968.
Purpose
This PR optimizes XPU operations in vLLM by integrating high-performance kernels from vllm-xpu-kernels. Specifically, it replaces the PyTorch fallback implementations for:

- topk_per_row
- indexer_quant_cache

The old PyTorch fallback paths were removed.
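As a rough illustration of the second of these, an indexer quantized cache typically stores values in a narrow integer format together with a per-row scale. The sketch below shows symmetric absmax scaling to 8 bits in plain Python; it is a hypothetical illustration of the general technique, not the actual semantics of the indexer_quant_cache kernel.

```python
# Illustrative quantized-cache write/read (hypothetical, not the real kernel):
# store each row as (scale, int8 values), recover floats on read.

def quantize_row(row):
    """Return (scale, quantized ints) using symmetric absmax scaling to int8."""
    amax = max(abs(v) for v in row) or 1.0  # avoid divide-by-zero on all-zero rows
    scale = amax / 127.0
    q = [max(-128, min(127, round(v / scale))) for v in row]
    return scale, q

def dequantize_row(scale, q):
    """Recover approximate float values from the cached representation."""
    return [scale * v for v in q]
```

Doing this scaling and packing in a compiled kernel, rather than elementwise PyTorch ops, is where the performance win over the old fallback path comes from.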
Test Plan
Test Result
Essential Elements of an Effective PR Description Checklist
supported_models.md and examples for a new model.