
[Hybrid] support prefix cache for Qwen3.5/Next with --mamba-cache-mode align #7103

Merged
MengqingCao merged 14 commits into vllm-project:main from Angazenn:mamba_apc
Mar 15, 2026

Conversation

@Angazenn
Collaborator

@Angazenn Angazenn commented Mar 10, 2026

What this PR does / why we need it?

To support prefix caching for Qwen3.5/Next in vLLM-Ascend, this PR mainly follows the design in #30877 and inherits the changes to functions that are overridden in vLLM-Ascend.

Note:

  1. --mamba-cache-mode align combined with PD disaggregation is not yet supported in vLLM v0.17.0 (see https://github.com/vllm-project/vllm/blob/main/vllm/v1/core/sched/scheduler.py#L295).
  2. The current implementation of the hybrid kv cache might result in a very large block_size when scheduling. For example, if we run Qwen3.5-35B-A3B with -tp 2, the block_size is adjusted to 2048, which means that any prefix shorter than 2048 tokens will never be cached. Although this behavior is consistent with vLLM, it still needs improvement in the future.
  3. --mamba-cache-mode align requires copying mamba states during forward steps. vLLM uses a Triton kernel to implement this, but the original version runs into bugs on Ascend hardware, so we patch in a new Triton kernel that avoids them.
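The block_size inflation in note 2 comes from the hybrid kv cache having to align attention pages with the (much larger) mamba state page. A minimal sketch of that arithmetic, with hypothetical names and example sizes (not vLLM's actual code):

```python
import math

def aligned_block_size(attn_block_size: int,
                       attn_page_bytes_per_token: int,
                       mamba_page_bytes: int) -> int:
    """Illustrative only: round the attention block size up until one
    attention page is at least as large as one mamba state page, keeping
    it a multiple of the original block size."""
    tokens_needed = math.ceil(mamba_page_bytes / attn_page_bytes_per_token)
    return math.ceil(tokens_needed / attn_block_size) * attn_block_size
```

With tensor parallelism, the per-rank attention page shrinks while the mamba state page stays comparatively large, so the aligned block size grows; this is how an original block size of 128 can be inflated to 2048.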

Does this PR introduce any user-facing change?

To use the mamba prefix cache, set --enable-prefix-caching and --mamba-cache-mode align. Note that the mamba state copy function (see do_mamba_copy_block) has no torch-native fallback, so it may fail for users who cannot use Triton.
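Conceptually, the patched Triton kernel performs a batched copy of variable-length mamba state blocks between buffers. A pure-Python illustration of those semantics (sequential where the kernel is parallel; names are ours, not vLLM's):

```python
def batch_memcpy(src, dst, src_offsets, dst_offsets, lengths):
    """Illustrative sketch: for every i, copy lengths[i] elements from
    src[src_offsets[i]:] into dst[dst_offsets[i]:]. The real Triton
    kernel does these copies in parallel on device memory."""
    for s, d, n in zip(src_offsets, dst_offsets, lengths):
        dst[d:d + n] = src[s:s + n]
    return dst
```

A torch-native fallback along these lines would be straightforward to add, which is why the lack of one is only a limitation for environments where Triton is unavailable.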

How was this patch tested?

@github-actions
Contributor

👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:

  • A PR should do only one thing; smaller PRs enable faster reviews.
  • Every PR should include unit tests and end-to-end tests to ensure it works and is not broken by future PRs.
  • Write the commit message to complement the PR description, helping reviewers and future developers understand the change.

If CI fails, you can run linting and testing checks locally according to Contributing and Testing.

@gemini-code-assist
Contributor

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces support for prefix caching in Mamba models, specifically targeting Qwen3.5/Next, within the vLLM-Ascend framework. It integrates a new utility for efficient batch memory copy operations, refines the block table management to correctly account for Mamba's state and distributed processing, and enables the necessary preprocessing steps during model execution. These changes allow for improved performance by leveraging prefix caching for Mamba-based architectures.

Highlights

  • Mamba Prefix Caching Support: Enabled prefix caching for Mamba models, specifically Qwen3.5/Next, when using the --mamba-cache-mode align configuration.
  • New Mamba Utility Patch: Introduced a new patch file, patch_mamba_utils.py, which provides Triton-based batch_memcpy functionalities essential for efficient Mamba state management.
  • Dynamic Block Table Management: Refactored BlockTable and NPUInputBatch initialization to dynamically calculate and pass max_num_blocks per request, accommodating Mamba's specific caching requirements and distributed processing.
  • Integrated Mamba Preprocessing: Integrated Mamba preprocessing logic into the model execution flow within model_runner_v1.py, ensuring proper state handling for prefix caching.
  • Removed Mamba Prefix Caching Restriction: Removed the previous NotImplementedError that prevented Mamba prefix caching, indicating full support for this feature.


Changelog
  • vllm_ascend/patch/worker/__init__.py
    • Imported vllm_ascend.patch.worker.patch_mamba_utils.
  • vllm_ascend/patch/worker/patch_mamba_utils.py
    • Added a new file defining batch_memcpy_kernel (Triton JIT) and batch_memcpy functions.
    • Patched vllm.v1.worker.mamba_utils.batch_memcpy_kernel and batch_memcpy with the new implementations.
  • vllm_ascend/worker/block_table.py
    • Imported get_total_cp_world_size.
    • Modified the BlockTableGroup constructor to accept an optional max_num_blocks parameter.
    • Updated the internal calculation of max_num_blocks_per_req to use get_total_cp_world_size and respect the new max_num_blocks parameter if provided.
  • vllm_ascend/worker/model_runner_v1.py
    • Imported mamba_utils and get_total_cp_world_size from cp_utils.
    • Added mamba_state_idx dictionary and _mamba_copy_bufs attribute to the class.
    • Integrated mamba_utils.preprocess_mamba call within the execute_model method when mamba_cache_mode is "align".
    • Reset _mamba_copy_bufs to None in initialize_kv_cache.
    • Modified may_reinitialize_input_batch to calculate max_num_blocks for Mamba models, considering prefix caching and speculative blocks.
    • Passed the newly calculated max_num_blocks to the NPUInputBatch constructor.
    • Removed the NotImplementedError check for Mamba prefix caching in get_kv_cache_spec.
  • vllm_ascend/worker/npu_input_batch.py
    • Added max_num_blocks_per_req as an optional parameter to the NPUInputBatch constructor.
    • Passed max_num_blocks_per_req to the BlockTableGroup constructor.
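The dynamic max_num_blocks calculation described in the changelog above roughly amounts to a ceiling division of the maximum request length by the (possibly inflated) block size, scaled by the context-parallel world size. A hypothetical sketch under those assumptions (not the actual vLLM-Ascend code):

```python
def max_num_blocks_per_req(max_model_len: int, block_size: int,
                           cp_world_size: int = 1) -> int:
    """Illustrative sketch: number of kv cache blocks one request can
    need. With context parallelism, each rank holds roughly
    1/cp_world_size of the blocks."""
    total_blocks = -(-max_model_len // block_size)   # ceiling division
    return -(-total_blocks // cp_world_size)
```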

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces support for prefix caching for Mamba models on Ascend, a valuable feature that extends vLLM-Ascend functionality. A security audit found no high or critical vulnerabilities. On performance, there is a suggestion to optimize the new Triton memory-copy kernel; please refer to the specific comment for details. Additionally, a suggested PR title and summary have been provided to adhere to repository style guidelines.

Comment thread vllm_ascend/patch/worker/patch_mamba_utils.py Outdated
@github-actions
Contributor

This pull request has conflicts, please resolve those before we can evaluate the pull request.

@Angazenn Angazenn force-pushed the mamba_apc branch 3 times, most recently from 10ea397 to 7623c0d Compare March 11, 2026 02:54
@Angazenn Angazenn marked this pull request as ready for review March 12, 2026 09:23
@Angazenn Angazenn changed the title [Draft] support prefix cache for Qwen3.5/Next with --mamba-cache-mode align [Hybrid] support prefix cache for Qwen3.5/Next with --mamba-cache-mode align Mar 12, 2026
Collaborator

@MengqingCao MengqingCao left a comment


Please add an e2e test for prefix caching and single-op tests for the Triton kernels.

@github-actions
Contributor

This pull request has conflicts, please resolve those before we can evaluate the pull request.

Signed-off-by: Angazenn <supperccell@163.com>
@Angazenn Angazenn added ready read for review ready-for-test start test by label for PR labels Mar 13, 2026
@MengqingCao MengqingCao added ready read for review and removed ready read for review labels Mar 13, 2026
Angazenn and others added 6 commits March 13, 2026 17:56
Signed-off-by: Angazenn <supperccell@163.com>
@MengqingCao
Collaborator

Please rebase your code after #7230 is merged.

@MengqingCao MengqingCao merged commit ce5544b into vllm-project:main Mar 15, 2026
38 checks passed
Nagisa125 pushed a commit to starmountain1997/vllm-ascend that referenced this pull request Mar 17, 2026
…de align` (vllm-project#7103)

- vLLM version: v0.16.0
- vLLM main: vllm-project/vllm@4034c3d

Signed-off-by: Angazenn <supperccell@163.com>

Labels

ready read for review ready-for-test start test by label for PR

3 participants