add mla_preprocess kernel #3226
Conversation
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:
If CI fails, you can run linting and testing checks locally according to Contributing and Testing.
Code Review
This pull request introduces a new mla_preprocess kernel, which involves significant changes across the host-side tiling logic, kernel implementation, and PyTorch bindings. The implementation is complex and adds a substantial amount of new code. My review focuses on critical aspects of correctness, performance, and thread safety. I've identified a critical race condition in the host-side tiling logic that needs to be addressed to prevent data corruption in multi-threaded environments. Additionally, there are opportunities to improve kernel performance by optimizing memory copy operations and to enhance correctness and performance on the host by replacing floating-point calculations with integer arithmetic for tiling parameters.
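To make the host-side suggestions concrete, here is a minimal sketch of computing tiling parameters with pure integer arithmetic and returning the result by value, so no shared mutable state is involved and concurrent callers cannot race. This is not the PR's actual code; the struct, field, and function names (`MlaTiling`, `computeMlaTiling`, `blockDim`, etc.) are hypothetical and only illustrate the reviewer's two points.

```cpp
// Hypothetical sketch of a thread-safe, integer-only tiling computation.
// Names are illustrative and do not correspond to the PR's structures.
#include <cstdint>

struct MlaTiling {
    uint32_t blockDim;       // number of AI cores to launch
    uint32_t tokensPerCore;  // tokens handled by each core
    uint32_t tailTokens;     // remainder handled by the last core
};

// Integer ceiling division avoids the rounding pitfalls of
// static_cast<uint32_t>(std::ceil(a / static_cast<double>(b))).
static inline uint32_t ceilDiv(uint32_t a, uint32_t b) {
    return (a + b - 1) / b;
}

// Returning the struct by value (instead of filling a shared/static buffer)
// keeps the function re-entrant, so concurrent callers cannot corrupt each
// other's tiling data.
MlaTiling computeMlaTiling(uint32_t numTokens, uint32_t maxCoreNum) {
    MlaTiling t{};
    if (numTokens == 0 || maxCoreNum == 0) {
        return t;
    }
    // Split tokens as evenly as possible across at most maxCoreNum cores.
    t.tokensPerCore = ceilDiv(numTokens, maxCoreNum);
    t.blockDim      = ceilDiv(numTokens, t.tokensPerCore);
    t.tailTokens    = numTokens - t.tokensPerCore * (t.blockDim - 1);
    return t;
}
```

For example, 5 tokens on at most 4 cores yields 2 tokens per core on 3 cores with a tail of 1, with no floating-point rounding and no writes to shared state.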
Force-pushed from 1f63240 to c9fc8a9
Hi @Yikun @wangxiyuan 👋
So how should we use this kernel in the modeling file?
Yes, this kernel will be used with MLA and DSA (DeepSeek sparse attention) for better performance; the adaptation of the model implementations will be submitted later.
Force-pushed from 782fbd5 to bc80c53
Force-pushed from 6aa333c to 8767d14
Signed-off-by: Chen Chen <0109chenchen@gmail.com>
### What this PR does / why we need it?
- Adds the `mla_preprocess` custom kernel to provide an optimized pre-processing operator for Multi-head Latent Attention (MLA) on Ascend NPUs.
- Wires the new kernel into the C++ extension pipeline so vLLM can invoke it directly, cutting Python-side tensor shuffling and memory copies that previously bottlenecked MLA compilation paths.

### Does this PR introduce any user-facing change?
- No. The change only introduces a low-level kernel; public APIs and inference behavior remain unchanged.

### How was this patch tested?
- Dedicated Ascend kernels are not covered by our CI yet, so no extra automated tests were added. Future MLA-focused regression runs will cover this path.
- vLLM version: v0.11.0

Signed-off-by: Chen Chen <0109chenchen@gmail.com>
Signed-off-by: luolun <luolun1995@cmbchina.com>
Signed-off-by: hwhaokun <haokun0405@163.com>
Signed-off-by: nsdie <yeyifan@huawei.com>
What this PR does / why we need it?
- Adds the `mla_preprocess` custom kernel to provide an optimized pre-processing operator for Multi-head Latent Attention (MLA) on Ascend NPUs.
- Wires the new kernel into the C++ extension pipeline so vLLM can invoke it directly, cutting Python-side tensor shuffling and memory copies that previously bottlenecked MLA compilation paths (see the registration sketch below).

Does this PR introduce any user-facing change?
No. The change only introduces a low-level kernel; public APIs and inference behavior remain unchanged.

How was this patch tested?
Dedicated Ascend kernels are not covered by our CI yet, so no extra automated tests were added. Future MLA-focused regression runs will cover this path.
vLLM version: v0.11.0rc3
vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0
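For readers unfamiliar with what "wiring the kernel into the C++ extension pipeline" typically involves, the sketch below shows the general pattern for exposing a custom PyTorch operator to Python. The namespace (`ascend_ops`), operator schema, and function signature here are hypothetical illustrations, not the actual interface added by this PR.

```cpp
// Hypothetical registration sketch for a custom op exposed to Python.
// The schema and namespace are illustrative; the real mla_preprocess
// interface is defined by this PR's binding code.
#include <torch/library.h>
#include <ATen/ATen.h>

namespace vllm_ascend {

// Placeholder host entry point that would launch the Ascend kernel
// (compute tiling, allocate workspace, enqueue the NPU kernel).
at::Tensor mla_preprocess(const at::Tensor& hidden_states,
                          const at::Tensor& weights) {
    // ... kernel launch elided in this sketch ...
    return at::empty_like(hidden_states);
}

}  // namespace vllm_ascend

// Declare the op schema once, then attach the NPU implementation.
TORCH_LIBRARY_FRAGMENT(ascend_ops, m) {
    m.def("mla_preprocess(Tensor hidden_states, Tensor weights) -> Tensor");
}

// Ascend NPU tensors dispatch through the PrivateUse1 key.
TORCH_LIBRARY_IMPL(ascend_ops, PrivateUse1, m) {
    m.impl("mla_preprocess", &vllm_ascend::mla_preprocess);
}
```

With a registration like this, Python code could call the operator as `torch.ops.ascend_ops.mla_preprocess(...)`; the names used here are only for illustration.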