[V1][Hybrid] Mamba Prefix Caching with align mode #30877
heheda12345 merged 136 commits into vllm-project:main
Conversation
Signed-off-by: huanghaoyan.hhy <huanghaoyan.hhy@alibaba-inc.com>
Signed-off-by: Chen Zhang <zhangch99@outlook.com>
Hi @peakcrosser7, the pre-commit checks have failed. Please run:

```shell
uv pip install pre-commit
pre-commit install
pre-commit run --all-files
```

Then, commit the changes and push to your branch.
Signed-off-by: huanghaoyan.hhy <huanghaoyan.hhy@alibaba-inc.com>
heheda12345
left a comment
LGTM! Thanks @peakcrosser7 for the great job.
There is still a long way to go for vLLM to reach stable and efficient Mamba support. Though there are some known issues, I'd like to merge this PR first to make it possible for more people to contribute to this work stream. Given the known issues, we keep the prefix caching of linear attention as an experimental feature that needs to be enabled explicitly.
I list some of the problems below. Most of them are not related to prefix caching support directly, but they do block us from moving forward. Help wanted on them!
- Speculative decoding compatibility. There is a correctness issue in the current linear attention implementation, as discussed in #30618. Though this PR includes the code for spec decode + prefix caching, it can only be enabled after #30618 is resolved.
- #31649, detected during the debugging of #30618.
- We need more testing on prefix caching for resumed requests.
```python
mamba_blocks_per_req = (
    max_num_blocks_per_req
    if self.cache_config.enable_prefix_caching
    else 1
) + kv_cache_group.kv_cache_spec.num_speculative_blocks
```
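As a quick sanity check, the expression above can be evaluated with illustrative numbers (the standalone function and the values below are hypothetical, merely mirroring the snippet):

```python
def mamba_blocks_per_req(max_num_blocks_per_req: int,
                         enable_prefix_caching: bool,
                         num_speculative_blocks: int) -> int:
    # Mirrors the snippet: with prefix caching enabled, a request may
    # need up to max_num_blocks_per_req blocks; otherwise one block
    # suffices. Speculative blocks are added either way.
    base = max_num_blocks_per_req if enable_prefix_caching else 1
    return base + num_speculative_blocks

print(mamba_blocks_per_req(8, True, 2))   # -> 10
print(mamba_blocks_per_req(8, False, 2))  # -> 3
```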
I still have trouble squaring this with the code for `max_memory_usage_bytes`, which says that for align mode it is `self.page_size_bytes * (2 + self.num_speculative_blocks)`. Does this imply that we should have `mamba_blocks_per_req = 2 + kv_cache_group.kv_cache_spec.num_speculative_blocks`?
Hi, @tdoublep. Please let me know if I misunderstood anything. Thanks!
Looking at this PR, this is a good format that I should follow (I didn't look at the code change itself). A role-model format!
tdoublep
left a comment
Thanks for the great work! This feature enables prefix caching for a broader set of models.
Let's fix the issues that remain for MTP as a follow-up.
vllm/v1/kv_cache_interface.py (outdated)

```python
# We allocate 1 block for each request now, so max_memory_usage_bytes is
# the same as page_size_bytes.
# Need to update this when supporting prefix caching.
```
This comment is redundant now, I think.
Thanks for pointing that out. We can remove it later.
```python
max_model_len = vllm_config.model_config.max_model_len
return cdiv(max_model_len, self.block_size) * self.page_size_bytes
```
I think this code for "all" mode is actually wrong, but it is not an issue introduced by this PR. Will fix it as a follow-up.
Agreed. It seems "all" mode performs allocation at the granularity of `mamba_block_size`, so we need to fix this later.
This PR appears to fail pre-commit; I have a fix: #32956
…de align` (#7103)

### What this PR does / why we need it?
To support prefix caching for Qwen3.5/Next in vLLM-Ascend, this PR mainly follows the design in [#30877](vllm-project/vllm#30877) and inherits the changes to functions that are overridden in vLLM-Ascend.

Note:
1. `--mamba-cache-mode align` with PD disaggregation is still not supported in vLLM v0.17.0 (see https://github.com/vllm-project/vllm/blob/main/vllm/v1/core/sched/scheduler.py#L295).
2. The current implementation of the hybrid KV cache can result in a very large block_size during scheduling. For example, if we run Qwen3.5-35B-A3B with `-tp 2`, the block_size is adjusted to 2048, which means any prefix shorter than 2048 tokens will never be cached. Although this behavior is consistent with vLLM, it still needs improvement in the future.
3. `--mamba-cache-mode align` requires copying Mamba states during forward steps. vLLM implements this with a Triton kernel, but the original version ran into bugs on Ascend hardware, so we patch in a new Triton kernel to avoid them.

### Does this PR introduce _any_ user-facing change?
To use Mamba prefix caching, set `--enable-prefix-caching` and `--mamba-cache-mode align`. Note that the Mamba state-copy function (see [do_mamba_copy_block](https://github.com/vllm-project/vllm/blob/main/vllm/v1/worker/mamba_utils.py#L132)) does not provide a torch-native version, so it may cause trouble for users who can't use Triton.

- vLLM version: v0.16.0
- vLLM main: vllm-project/vllm@4034c3d

Signed-off-by: Angazenn <supperccell@163.com>
The cleaned-up version of #29272
Purpose
This PR enhances the design of #28176, adopting the same memory layout as FullAttention while adding support for decode caching and speculative decoding.
The core idea of this Mamba Prefix-Caching implementation (referred to as LPC) is to directly cache Mamba states through block-aligned scheduling. This approach enables rapid support for Prefix-caching in Mamba models without modifications to the underlying kernel code. Furthermore, it maintains full compatibility with Speculative-Decoding/MTP/EAGLE.
Currently, this solution supports all Mamba model architectures including GDN, Mamba1, Mamba2, and Short Conv Attention, and has been adapted for relevant Mamba models such as Qwen3-Next-80B-A3B-Instruct and LFM2-700M.
Usage
To enable this feature, start the engine with the `--enable-prefix-caching` and `--mamba-cache-mode align` flags.

Design Details
Block-Aligned Scheduling
Following the design in #28176, requests in the prefill phase are scheduled in multiples of `block_size`. This ensures that the Mamba states can be mapped to a specific block's hash value. The prefix cache stores variable-length chunk states, i.e., the number of tokens (or the incremental length) associated with each cached Mamba state may vary, but it is always a multiple of `block_size`.

Scheduler Logic with Mamba Prefix-Caching Enabled:
- Each scheduled chunk is a multiple of `block_size`, except for the final chunk of the request.
- The unaligned tail of the prompt is deferred to the final chunk, ensuring its size is ≤ `block_size`. This maximizes the length of the prompt that can be cached during the prefill phase.

Block Allocation Design
Prefill Stage
During the prefill stage, requests are scheduled at a block-aligned chunk granularity. For a single scheduling step consisting of `chunk_len` tokens, the system allocates `chunk_len // block_size` blocks: `(chunk_len // block_size) - 1` of them are populated with null-blocks (placeholders).

Note on Speculative Decoding (SPS): In the prefill stage with SPS enabled, the initial execution requires the allocation of `gamma` additional speculative blocks, which are subsequently reused in the following steps.

Decode Stage
Since only a small number of tokens are scheduled per step during decoding, the allocation logic is consistent with FullAttention, where blocks are incrementally allocated one by one.
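The allocation scheme described above can be sketched as follows (a minimal illustration with hypothetical helper names; it assumes the single real block of a prefill chunk is the last slot, and the real allocator in vLLM is considerably more involved):

```python
NULL_BLOCK = -1  # placeholder id for slots that hold no Mamba state

def allocate_chunk_blocks(chunk_len: int, block_size: int,
                          free_blocks: list[int]) -> list[int]:
    """Allocate block slots for one block-aligned prefill chunk.

    Of the chunk_len // block_size slots, all but one are filled with
    null-blocks; only a single real block stores the chunk's Mamba
    state (assumed here to be the last slot).
    """
    num_blocks = chunk_len // block_size
    slots = [NULL_BLOCK] * (num_blocks - 1)
    slots.append(free_blocks.pop())  # one real block for the state
    return slots

def allocate_decode_block(free_blocks: list[int]) -> int:
    """Decode allocates incrementally, one block at a time."""
    return free_blocks.pop()

free = [3, 2, 1, 0]
print(allocate_chunk_blocks(64, 16, free))  # -> [-1, -1, -1, 0]
print(allocate_decode_block(free))          # -> 1
```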
Prefix Caching Logic
Scheduler-side Logic
Similar to the FullAttention prefix-caching logic, only immutable blocks that store Mamba states are cached (excluding the null-blocks), and prefix matching is performed via a reverse hash lookup that requires only a single block to be matched.
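A minimal sketch of the reverse-lookup idea (hypothetical names; not the actual vLLM hashing code, which hashes token-block contents):

```python
def match_cached_prefix(block_hashes: list[str],
                        cached: set[str]) -> int:
    """Reverse hash lookup: probe from the longest candidate prefix.

    Because only immutable, fully-populated state blocks are cached,
    a hit on block i implies all its predecessors belong to the same
    prefix, so a single matched block identifies the whole cached
    prefix. Returns the number of blocks covered by the hit.
    """
    for i in range(len(block_hashes) - 1, -1, -1):
        if block_hashes[i] in cached:
            return i + 1
    return 0

cache = {"h0", "h1"}
print(match_cached_prefix(["h0", "h1", "h2"], cache))  # -> 2
```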
Worker-side Logic
Prefill Phase:
The Preprocess stage is responsible for copying Mamba states before the model forward:
Condition 1: Copy the Mamba state from the previous step to the current step.
Condition 2: Copy the Mamba state from the prefix-cache hit block to the current step.
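The two copy conditions can be illustrated with a toy state cache (hypothetical names and a plain-list cache; the real implementation operates on GPU tensors via a Triton kernel):

```python
def preprocess_copy_states(state_cache: list[list[float]],
                           copy_pairs: list[tuple[int, int]]) -> None:
    """Copy Mamba states between cache slots before the forward pass.

    Each (src, dst) pair covers both conditions: src is either the
    slot written by the previous step of the same request, or a block
    found via a prefix-cache hit.
    """
    for src, dst in copy_pairs:
        state_cache[dst] = list(state_cache[src])  # copy, don't alias

cache = [[0.0, 1.0], [2.0, 3.0], [9.0, 9.0], [9.0, 9.0]]
# Request A continues from slot 0; request B hit the prefix cache at slot 1.
preprocess_copy_states(cache, [(0, 2), (1, 3)])
print(cache[2], cache[3])  # -> [0.0, 1.0] [2.0, 3.0]
```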
Decode Phase:
Without Speculative Decoding: The logic remains consistent with the standard Prefill Phase.
With Speculative Decoding:
The Preprocess stage copies Mamba states when a new block is allocated:
- The number of newly committed tokens is determined by `num_accepted_tokens`.
After receiving the full number of tokens corresponding to the previous block, the Post-process stage copies the Mamba state back to the previous block.
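Under these assumptions, the copy-back accounting might look like this (a hypothetical helper, illustrative only): a block's state is copied back once the accepted tokens carry the request across a block boundary.

```python
def blocks_completed(num_computed_tokens: int,
                     num_accepted_tokens: int,
                     block_size: int) -> int:
    """How many whole blocks become immutable this step.

    With speculative decoding, only accepted tokens advance the state,
    so the Post-process copy-back fires once the accepted tokens fill
    out the previously partial block.
    """
    before = num_computed_tokens // block_size
    after = (num_computed_tokens + num_accepted_tokens) // block_size
    return after - before

# 30 computed tokens plus 4 accepted with block_size 16 crosses one
# block boundary, so one state is copied back to its block.
print(blocks_completed(30, 4, 16))  # -> 1
```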
Test Plan
Test Result
Essential Elements of an Effective PR Description Checklist
`supported_models.md` and `examples` for a new model.