
add mla_preprocess kernel#3226

Merged
wangxiyuan merged 1 commit into vllm-project:main from kiscad:main
Oct 11, 2025

Conversation

@kiscad
Contributor

@kiscad kiscad commented Sep 28, 2025

What this PR does / why we need it?

  • Adds the mla_preprocess custom kernel to provide an optimized pre-processing operator for Multi-head Latent Attention (MLA) on Ascend NPUs.
  • Wires the new kernel into the C++ extension pipeline so vLLM can invoke it directly, cutting Python-side tensor shuffling and memory copies that previously bottlenecked MLA compilation paths.
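The memory-copy savings claimed above come from fusing multiple Python-side tensor passes into one operator. A pure-Python toy (not the actual NPU kernel; the RMS-norm-plus-scale step is only an assumed stand-in for the real MLA pre-processing pipeline) illustrates the pattern: the unfused path materializes an intermediate buffer per step, while the fused path does one reduction and a single output pass.

```python
import math

def rms_norm(x, eps=1e-6):
    # Root-mean-square normalization; allocates one intermediate list.
    scale = 1.0 / math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v * scale for v in x]

def unfused_preprocess(x, w):
    # Two passes, one intermediate buffer between the steps.
    normed = rms_norm(x)
    return [v * wi for v, wi in zip(normed, w)]

def fused_preprocess(x, w, eps=1e-6):
    # Single reduction, then one output pass; no intermediate list.
    scale = 1.0 / math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v * scale * wi for v, wi in zip(x, w)]
```

A fused kernel applies the same idea on-device, eliminating the host-side round trips between steps.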

Does this PR introduce any user-facing change?

  • No. The change only introduces a low-level kernel; public APIs and inference behavior remain unchanged.

How was this patch tested?

  • Dedicated Ascend kernels are not covered by CI yet, so no extra automated tests were added; future MLA-focused regression runs will cover this path.

@github-actions
Contributor

👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:

  • A PR should do only one thing; smaller PRs enable faster reviews.
  • Every PR should include unit tests and end-to-end tests to ensure it works and is not broken by future PRs.
  • Write a commit message that fulfills the PR description, to help reviewers and future developers understand.

If CI fails, you can run linting and testing checks locally according to Contributing and Testing.

Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request introduces a new mla_preprocess kernel, which involves significant changes across the host-side tiling logic, kernel implementation, and PyTorch bindings. The implementation is complex and adds a substantial amount of new code. My review focuses on critical aspects of correctness, performance, and thread safety. I've identified a critical race condition in the host-side tiling logic that needs to be addressed to prevent data corruption in multi-threaded environments. Additionally, there are opportunities to improve kernel performance by optimizing memory copy operations and to enhance correctness and performance on the host by replacing floating-point calculations with integer arithmetic for tiling parameters.
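The review's point about replacing floating-point tiling math with integer arithmetic can be sketched in a few lines (hypothetical helper names; the real tiling code is C++ host-side logic). A float-based ceiling division rounds through a double and can silently lose precision for large operands, while the integer form is exact.

```python
import math

def ceil_div_float(total, block):
    # Float version: rounds through a double, which cannot represent
    # every integer above 2**53.
    return math.ceil(total / block)

def ceil_div_int(total, block):
    # Integer-only version: exact for any non-negative total, block > 0.
    return (total + block - 1) // block

# Small sizes agree...
assert ceil_div_float(1000, 128) == ceil_div_int(1000, 128) == 8

# ...but above 2**53 the float path silently rounds.
n = (1 << 53) + 1
assert ceil_div_int(n, 1) == n
assert ceil_div_float(n, 1) != n
```

In C++ the same fix is `(total + block - 1) / block` on unsigned integers, avoiding both the precision hazard and the float-to-int conversion cost in the tiling hot path.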

Comment thread csrc/mla_preprocess/op_host/mla_preprocess.h
Comment thread csrc/mla_preprocess/op_host/mla_preprocess.h
Comment thread csrc/mla_preprocess/op_kernel/mla_preprocess_kernel.cpp
Comment thread csrc/torch_binding.cpp
@wangxiyuan wangxiyuan added the ready read for review label Sep 28, 2025
@kiscad kiscad force-pushed the main branch 2 times, most recently from 1f63240 to c9fc8a9 Compare September 28, 2025 08:51
@Yikun Yikun added dist-test ready-for-test start test by label for PR labels Sep 28, 2025
@kiscad
Contributor Author

kiscad commented Oct 9, 2025

Hi @Yikun @wangxiyuan 👋
This PR is ready for review. Please take a look when you have time.
Let me know if any clarification is needed. Thanks!

@Yikun Yikun added ready read for review ready-for-test start test by label for PR and removed ready read for review ready-for-test start test by label for PR dist-test labels Oct 9, 2025
@Yikun
Member

Yikun commented Oct 9, 2025

@kiscad

  1. Please make sure the 310P image build test passes.
  2. Would you mind adding a Python e2e test (https://github.com/vllm-project/vllm-ascend/tree/main/tests/e2e/singlecard/ops) to make sure the mlapo ops work as expected? (I guess you can move a simple test over from the atb repo.)
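The kind of e2e op test requested here usually compares the optimized kernel's output against a naive golden reference within a tolerance. A minimal pure-Python sketch of that pattern follows; `optimized_op` and `naive_reference` are hypothetical stand-ins, not real vllm-ascend APIs, and a real test would call the registered NPU op on device tensors instead.

```python
def naive_reference(x):
    # Golden reference: straightforward, unoptimized computation.
    return [2.0 * v + 1.0 for v in x]

def optimized_op(x):
    # Stand-in for the kernel under test (e.g. a custom NPU op).
    return [v + v + 1.0 for v in x]

def allclose(a, b, rtol=1e-5, atol=1e-8):
    # Elementwise closeness check, mirroring torch.allclose semantics.
    return len(a) == len(b) and all(
        abs(x - y) <= atol + rtol * abs(y) for x, y in zip(a, b)
    )

def test_op_matches_reference():
    x = [0.1 * i for i in range(16)]
    assert allclose(optimized_op(x), naive_reference(x))
```

In the actual suite this would live under `tests/e2e/singlecard/ops` and be collected by pytest.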

@ApsarasX
Collaborator

So how should we use this kernel in the modeling file?

@kiscad
Contributor Author

kiscad commented Oct 10, 2025

So how should we use this kernel in the modeling file?

This kernel will be used with MLA and DSA (DeepSeek sparse attention) for better performance; the adaptation of the model implementations will be submitted later.

@kiscad kiscad force-pushed the main branch 2 times, most recently from 782fbd5 to bc80c53 Compare October 11, 2025 09:33
@kiscad kiscad force-pushed the main branch 2 times, most recently from 6aa333c to 8767d14 Compare October 11, 2025 14:54
Signed-off-by: Chen Chen <0109chenchen@gmail.com>
@wangxiyuan wangxiyuan merged commit bcc313e into vllm-project:main Oct 11, 2025
16 checks passed
Angazenn pushed a commit to Angazenn/vllm-ascend that referenced this pull request Oct 21, 2025
### What this PR does / why we need it?

- Adds the `mla_preprocess` custom kernel to provide an optimized
pre-processing operator for Multi-head Latent Attention (MLA) on Ascend
NPUs.
- Wires the new kernel into the C++ extension pipeline so vLLM can
invoke it directly, cutting Python-side tensor shuffling and memory
copies that previously bottlenecked MLA compilation paths.

### Does this PR introduce any user-facing change?

- No. The change only introduces a low-level kernel; public APIs and
inference behavior remain unchanged.

### How was this patch tested?

- Dedicated Ascend kernels are not covered by our CI yet, so no extra
automated tests were added. Future MLA-focused regression runs will
cover this path.

- vLLM version: v0.11.0

Signed-off-by: Chen Chen <0109chenchen@gmail.com>
luolun pushed a commit to luolun/vllm-ascend that referenced this pull request Nov 19, 2025
Signed-off-by: luolun <luolun1995@cmbchina.com>
hwhaokun pushed a commit to hwhaokun/vllm-ascend that referenced this pull request Nov 19, 2025
Signed-off-by: hwhaokun <haokun0405@163.com>
NSDie pushed a commit to NSDie/vllm-ascend that referenced this pull request Nov 24, 2025
Signed-off-by: nsdie <yeyifan@huawei.com>
Clorist33 pushed a commit to Clorist33/vllm-ascend that referenced this pull request Dec 9, 2025

Labels

module:tests ready read for review ready-for-test start test by label for PR


4 participants