
[Hybrid] support prefix cache for Qwen3.5/Next with --mamba-cache-mode align #7103

Merged
MengqingCao merged 14 commits into vllm-project:main from Angazenn:mamba_apc
Mar 15, 2026

Conversation

@Angazenn
Collaborator

@Angazenn Angazenn commented Mar 10, 2026

What this PR does / why we need it?

To support prefix caching for Qwen3.5/Next in vLLM-Ascend, this PR mainly follows the design in #30877 and inherits the changes to functions that are overridden in vLLM-Ascend.

Note:

  1. --mamba-cache-mode align combined with PD disaggregation is not yet supported in vLLM v0.17.0 (see https://github.com/vllm-project/vllm/blob/main/vllm/v1/core/sched/scheduler.py#L295).
  2. The current implementation of the hybrid kv cache might result in a very large block_size when scheduling. For example, if we run Qwen3.5-35B-A3B with -tp 2, the block_size is adjusted to 2048, which means that any prefix shorter than 2048 tokens will never be cached. Although this behavior is consistent with vLLM, it still needs improvement in the future.
  3. --mamba-cache-mode align requires copying mamba states during forward steps. vLLM uses a Triton kernel to implement this, but the original version runs into bugs on Ascend hardware, so we patch in a new Triton kernel that avoids them.
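The block_size inflation in note 2 comes from the hybrid kv cache having to align attention pages with the (much larger) mamba state page. A minimal sketch of that arithmetic, with hypothetical names and example sizes (not vLLM's actual code):

```python
import math

def aligned_block_size(attn_block_size: int,
                       attn_page_bytes_per_token: int,
                       mamba_page_bytes: int) -> int:
    """Illustrative only: round the attention block size up until one
    attention page is at least as large as one mamba state page, keeping
    it a multiple of the original block size."""
    tokens_needed = math.ceil(mamba_page_bytes / attn_page_bytes_per_token)
    return math.ceil(tokens_needed / attn_block_size) * attn_block_size
```

With tensor parallelism, the per-rank attention page shrinks while the mamba state page stays comparatively large, so the aligned block size grows; this is how an original block size of 128 can be inflated to 2048.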

Does this PR introduce any user-facing change?

To use the mamba prefix cache, set --enable-prefix-caching and --mamba-cache-mode align. Note that the mamba state copy function (see do_mamba_copy_block) has no torch-native fallback, so it may fail for users who cannot use Triton.
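Conceptually, the patched Triton kernel performs a batched copy of variable-length mamba state blocks between buffers. A pure-Python illustration of those semantics (sequential where the kernel is parallel; names are ours, not vLLM's):

```python
def batch_memcpy(src, dst, src_offsets, dst_offsets, lengths):
    """Illustrative sketch: for every i, copy lengths[i] elements from
    src[src_offsets[i]:] into dst[dst_offsets[i]:]. The real Triton
    kernel does these copies in parallel on device memory."""
    for s, d, n in zip(src_offsets, dst_offsets, lengths):
        dst[d:d + n] = src[s:s + n]
    return dst
```

A torch-native fallback along these lines would be straightforward to add, which is why the lack of one is only a limitation for environments where Triton is unavailable.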

How was this patch tested?

@github-actions
Contributor

👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:

  • A PR should do only one thing; smaller PRs enable faster reviews.
  • Every PR should include unit tests and end-to-end tests to ensure it works and is not broken by future PRs.
  • Write the commit message to complement the PR description, helping reviewers and future developers understand the change.

If CI fails, you can run linting and testing checks locally according to Contributing and Testing.

@gemini-code-assist
Contributor

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces support for prefix caching in Mamba models, specifically targeting Qwen3.5/Next, within the vLLM-Ascend framework. It integrates a new utility for efficient batch memory copy operations, refines the block table management to correctly account for Mamba's state and distributed processing, and enables the necessary preprocessing steps during model execution. These changes allow for improved performance by leveraging prefix caching for Mamba-based architectures.

Highlights

  • Mamba Prefix Caching Support: Enabled prefix caching for Mamba models, specifically Qwen3.5/Next, when using the --mamba-cache-mode align configuration.
  • New Mamba Utility Patch: Introduced a new patch file, patch_mamba_utils.py, which provides Triton-based batch_memcpy functionalities essential for efficient Mamba state management.
  • Dynamic Block Table Management: Refactored BlockTable and NPUInputBatch initialization to dynamically calculate and pass max_num_blocks per request, accommodating Mamba's specific caching requirements and distributed processing.
  • Integrated Mamba Preprocessing: Integrated Mamba preprocessing logic into the model execution flow within model_runner_v1.py, ensuring proper state handling for prefix caching.
  • Removed Mamba Prefix Caching Restriction: Removed the previous NotImplementedError that prevented Mamba prefix caching, indicating full support for this feature.


Changelog
  • vllm_ascend/patch/worker/__init__.py
    • Imported vllm_ascend.patch.worker.patch_mamba_utils.
  • vllm_ascend/patch/worker/patch_mamba_utils.py
    • Added a new file defining batch_memcpy_kernel (Triton JIT) and batch_memcpy functions.
    • Patched vllm.v1.worker.mamba_utils.batch_memcpy_kernel and batch_memcpy with the new implementations.
  • vllm_ascend/worker/block_table.py
    • Imported get_total_cp_world_size.
    • Modified the BlockTableGroup constructor to accept an optional max_num_blocks parameter.
    • Updated the internal calculation of max_num_blocks_per_req to use get_total_cp_world_size and respect the new max_num_blocks parameter if provided.
  • vllm_ascend/worker/model_runner_v1.py
    • Imported mamba_utils and get_total_cp_world_size from cp_utils.
    • Added mamba_state_idx dictionary and _mamba_copy_bufs attribute to the class.
    • Integrated mamba_utils.preprocess_mamba call within the execute_model method when mamba_cache_mode is "align".
    • Reset _mamba_copy_bufs to None in initialize_kv_cache.
    • Modified may_reinitialize_input_batch to calculate max_num_blocks for Mamba models, considering prefix caching and speculative blocks.
    • Passed the newly calculated max_num_blocks to the NPUInputBatch constructor.
    • Removed the NotImplementedError check for Mamba prefix caching in get_kv_cache_spec.
  • vllm_ascend/worker/npu_input_batch.py
    • Added max_num_blocks_per_req as an optional parameter to the NPUInputBatch constructor.
    • Passed max_num_blocks_per_req to the BlockTableGroup constructor.
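The dynamic max_num_blocks calculation described in the changelog above roughly amounts to a ceiling division of the maximum request length by the (possibly inflated) block size, scaled by the context-parallel world size. A hypothetical sketch under those assumptions (not the actual vLLM-Ascend code):

```python
def max_num_blocks_per_req(max_model_len: int, block_size: int,
                           cp_world_size: int = 1) -> int:
    """Illustrative sketch: number of kv cache blocks one request can
    need. With context parallelism, each rank holds roughly
    1/cp_world_size of the blocks."""
    total_blocks = -(-max_model_len // block_size)   # ceiling division
    return -(-total_blocks // cp_world_size)
```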

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces support for prefix caching for Mamba models on Ascend, a valuable feature that extends vLLM-Ascend functionality. A security audit found no high or critical vulnerabilities. On performance, there is a suggestion to optimize the new Triton memory-copy kernel; please refer to the specific comment for details. Additionally, a suggested PR title and summary have been provided to adhere to repository style guidelines.

Comment thread vllm_ascend/patch/worker/patch_mamba_utils.py Outdated
@github-actions
Contributor

This pull request has conflicts, please resolve those before we can evaluate the pull request.

@Angazenn Angazenn force-pushed the mamba_apc branch 3 times, most recently from 10ea397 to 7623c0d Compare March 11, 2026 02:54
@Angazenn Angazenn marked this pull request as ready for review March 12, 2026 09:23
@Angazenn Angazenn changed the title [Draft] support prefix cache for Qwen3.5/Next with --mamba-cache-mode align [Hybrid] support prefix cache for Qwen3.5/Next with --mamba-cache-mode align Mar 12, 2026
Collaborator

@MengqingCao MengqingCao left a comment


Please add an e2e test for prefix caching and single-op tests for the Triton kernels.

@github-actions
Contributor

This pull request has conflicts, please resolve those before we can evaluate the pull request.

Signed-off-by: Angazenn <supperccell@163.com>
@Angazenn Angazenn added ready read for review ready-for-test start test by label for PR labels Mar 13, 2026
@MengqingCao MengqingCao added ready read for review and removed ready read for review labels Mar 13, 2026
Angazenn and others added 6 commits March 13, 2026 17:56
Signed-off-by: Angazenn <supperccell@163.com>
@MengqingCao
Collaborator

Please rebase your code after #7230 is merged.

@MengqingCao MengqingCao merged commit ce5544b into vllm-project:main Mar 15, 2026
38 checks passed
Nagisa125 pushed a commit to starmountain1997/vllm-ascend that referenced this pull request Mar 17, 2026
…de align` (vllm-project#7103)

- vLLM version: v0.16.0
- vLLM main: vllm-project/vllm@4034c3d

Signed-off-by: Angazenn <supperccell@163.com>

Labels

ready read for review ready-for-test start test by label for PR

3 participants