[V1] [Hybrid] Lighter Mamba Prefix Caching for Hybrid Models#28176
peakcrosser7 wants to merge 3 commits into vllm-project:main from
Conversation
Signed-off-by: huanghaoyan.hhy <huanghaoyan.hhy@alibaba-inc.com>
This pull request has merge conflicts that must be resolved before it can be merged.
Code Review
This pull request introduces a lightweight prefix caching mechanism for Mamba-based hybrid models, which is a significant feature addition. The implementation seems well thought out and consistent with the design described. The changes span the scheduler, cache manager, and model runner to support block-aligned caching of Mamba states. I've identified one critical issue regarding the handling of a new environment variable, which could lead to incorrect behavior. Other than that, the changes look good.
```python
"VLLM_USE_LIGHTER_MAMBA_CACHE": lambda: os.getenv(
    "VLLM_USE_LIGHTER_MAMBA_CACHE", False
),
```
The current implementation for parsing the `VLLM_USE_LIGHTER_MAMBA_CACHE` environment variable is incorrect. The lambda `lambda: os.getenv("VLLM_USE_LIGHTER_MAMBA_CACHE", False)` evaluates to a truthy value for any non-empty string, including `"0"`, which is likely not the intended behavior for a boolean flag. This can lead to the feature being unintentionally enabled. To ensure correct boolean parsing, the value should be compared against `"1"`, similar to how other boolean flags are handled in this file.
Suggested change:

```diff
-"VLLM_USE_LIGHTER_MAMBA_CACHE": lambda: os.getenv(
-    "VLLM_USE_LIGHTER_MAMBA_CACHE", False
-),
+"VLLM_USE_LIGHTER_MAMBA_CACHE": lambda: os.getenv(
+    "VLLM_USE_LIGHTER_MAMBA_CACHE", "0"
+) == "1",
```
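A minimal standalone demonstration of the pitfall and the suggested fix (not vLLM code, just the parsing logic in isolation):

```python
import os

# os.getenv returns the raw string from the environment, and any
# non-empty string -- including "0" -- is truthy in Python.
os.environ["VLLM_USE_LIGHTER_MAMBA_CACHE"] = "0"

broken = bool(os.getenv("VLLM_USE_LIGHTER_MAMBA_CACHE", False))
fixed = os.getenv("VLLM_USE_LIGHTER_MAMBA_CACHE", "0") == "1"

print(broken)  # True  -- the flag is "disabled" yet evaluates truthy
print(fixed)   # False -- string comparison parses the flag correctly
```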
💡 Codex Review
Here are some automated review suggestions for this pull request.
```python
"VLLM_USE_LIGHTER_MAMBA_CACHE": lambda: os.getenv(
    "VLLM_USE_LIGHTER_MAMBA_CACHE", False
),
```
Parse `VLLM_USE_LIGHTER_MAMBA_CACHE` as boolean
The new env flag is exposed as `lambda: os.getenv("VLLM_USE_LIGHTER_MAMBA_CACHE", False)`. Unlike the rest of the boolean envs (which run the value through `bool(int(...))`), this returns the raw string. As a consequence, setting `VLLM_USE_LIGHTER_MAMBA_CACHE=0` or `=False` will still be truthy, and the lighter cache path is enabled unintentionally. This can enable an experimental code path for all deployments even when the user explicitly disables it. The getter should coerce the string to a real boolean, e.g. `bool(int(os.getenv(..., "0")))`.
```diff
 def max_memory_usage_bytes(self, vllm_config: VllmConfig) -> int:
-    max_model_len = vllm_config.model_config.max_model_len
-    return cdiv(max_model_len, self.block_size) * self.page_size_bytes
-    # We allocate 1 block for each request now, so max_memory_usage_bytes is
-    # the same as page_size_bytes.
-    # Need to update this when supporting prefix caching.
+    if not envs.VLLM_USE_LIGHTER_MAMBA_CACHE:
+        max_model_len = vllm_config.model_config.max_model_len
+        return cdiv(max_model_len, self.block_size) * self.page_size_bytes
+    else:
+        # NOTE: We allocate 1 block per request by default. With prefix
+        # caching enabled, up to 2 additional blocks are required: one
+        # for reading the matched prefix and one for caching the current
+        # state.
+        return self.page_size_bytes * (3 if self.enable_caching else 1)
```
Include speculative blocks in Mamba memory estimate
In the lighter Mamba branch, `MambaSpec.max_memory_usage_bytes` now returns `page_size_bytes * (3 if self.enable_caching else 1)` regardless of `num_speculative_blocks`. However, allocation paths still reserve `1 + num_speculative_blocks` blocks (plus an extra for caching) when speculative decoding (EAGLE/MTP) is active. With `num_speculative_blocks > 0`, the memory calculation underestimates the number of blocks per request, so the block pool will be sized for at most 3 blocks while execution tries to allocate 4+, causing allocation failures or unexpected preemption. The returned size should include `num_speculative_blocks` in the multiplier.
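A sketch of the fix the comment above suggests. The function name and the flat-argument signature are illustrative, not the actual vLLM `MambaSpec` API:

```python
def mamba_max_memory_usage_bytes(
    page_size_bytes: int,
    num_speculative_blocks: int,
    enable_caching: bool,
) -> int:
    """Per-request Mamba state memory, counting speculative blocks.

    1 runtime block + one per speculative block, plus (when prefix
    caching is on) one block to read the matched prefix and one to
    cache the newly produced state.
    """
    blocks = 1 + num_speculative_blocks
    if enable_caching:
        blocks += 2
    return page_size_bytes * blocks

# With a 4 KiB page, 2 speculative blocks, and caching enabled,
# the pool must be sized for 5 blocks per request, not 3:
print(mamba_max_memory_usage_bytes(4096, 2, True))  # 20480
```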
heheda12345 left a comment:
- If I understand correctly, we will always cache the state at token num_tokens - num_tokens%block_size, but there is no guarantee that tokens at other positions are cached. In this case, can you ensure the system prompt is cached?
- Memory layout: I feel that we don't need such a new kv cache memory design. We can have a `block_id` list with length `num_tokens / block_size + num_spec_decode_tokens`, and always make sure that the state of the previous schedule step is at block `num_computed_tokens / block_size`
For example, if num_computed_tokens=29 and we schedule 1 new token, the kv cache before this step is:
block 0: N/A
block 1: token 29
block 2: N/A
block 3: N/A
-> run main model
block 0: N/A
block 1: token 30
block 2: token 31
block 3: token 32
-> adjust kv cache based on number of accepted tokens
- if no token is accepted:
block 0: N/A
block 1: token 30
block 2: N/A
block 3: N/A
- if token 30 is accepted:
block 0: N/A
block 1: token 31
block 2: N/A
block 3: N/A
- if token 30 & 31 are accepted:
block 0: N/A
block 1: token 31
block 2: token 32
block 3: N/A
- if token 30 & 31 & 32 are accepted:
block 0: token 15
block 1: token 31
block 2: token 33
block 3: N/A
Then, in the next schedule step, the previous state is always at block [num_computed_tokens / block_size]
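The index rule in the proposal above can be sketched as a tiny helper (a toy model; `prev_state_block` is a hypothetical name, not vLLM code, and the example assumes `block_size = 16` to match the token numbers used above):

```python
def prev_state_block(num_computed_tokens: int, block_size: int) -> int:
    # The state produced by the previous schedule step is kept at the
    # block indexed by the number of already-computed tokens.
    return num_computed_tokens // block_size

# Before the step: num_computed_tokens = 29 -> state (token 29) in block 1.
print(prev_state_block(29, 16))  # 1

# If tokens 30, 31 and 32 are all accepted, num_computed_tokens = 32
# -> the next step reads the previous state from block 2.
print(prev_state_block(32, 16))  # 2
```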
- I'm still concerned about whether we should increase the complexity of the scheduler to avoid kernel changes in mamba layers.
Thank you for your attention to this PR. Regarding the first question: there is insufficient memory to store per-token state, so we cannot guarantee caching of system prompts, especially when their token count is less than `block_size`. As for the third concern: we believe modifying the scheduler is significantly less complex than altering attention kernels. Besides FLA, multiple kernel implementations exist for linear attention, making it impractical to update all of them. Moreover, requiring kernels to retain excessive internal token states would degrade performance. This prefix-caching solution has been stably deployed in Alibaba Cloud's Qwen3-Next online serving system for nearly one month, maintaining a consistently healthy cache hit ratio.
Hi @heheda12345 , thank you very much for your detailed review.
Finally, thank you again for your review and feedback on this PR. We truly appreciate your time and insights, and we hope the clarifications above address your concerns.
@minminsun I'm curious about the performance of chunk-granularity caching. Would it be possible to share any cache hit ratio results? Thanks!
…guity

This commit integrates the key optimization from vLLM PR #28176 to improve Qwen3-Next inference performance by ensuring Mamba state indices tensors are explicitly contiguous.

## Changes:

### 1. hybrid_linear_attn_backend.py
- Added `.contiguous()` calls to `mamba_cache_indices` in three critical paths:
  * `_forward_metadata()`: Normal forward pass metadata preparation
  * `_capture_metadata()`: CUDA graph capture path
  * `_replay_metadata()`: CUDA graph replay path

### 2. mamba2_metadata.py
- Added `.contiguous()` calls in two metadata preparation methods:
  * `prepare_decode()`: Decode-only path (used during CUDA graph)
  * `prepare_mixed()`: Mixed prefill/decode path

## Rationale:

The vLLM PR #28176 identified that "state indices tensor must be explicitly contiguous because requests can contain multiple blocks." This optimization ensures better memory layout and improved kernel performance when processing batched requests with Mamba-based hybrid models like Qwen3-Next.

## Benefits:

- Improved memory access patterns for Mamba state lookups
- Better performance for multi-block requests
- Consistent with vLLM's lightweight Mamba prefix caching approach
- No functional changes, purely performance optimization

Reference: vllm-project/vllm#28176
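As a rough illustration of the contiguity issue the commit above addresses, here is a sketch using NumPy in place of torch (the array shape and names are illustrative; torch's `tensor.contiguous()` plays the role of `np.ascontiguousarray`):

```python
import numpy as np

# A column slice of a 2-D array is a strided view, not a contiguous
# buffer -- analogous to state-indices tensors built from slices when
# a request spans multiple blocks.
x = np.arange(12).reshape(3, 4)
col = x[:, 1]
print(col.flags["C_CONTIGUOUS"])  # False: elements are 4 ints apart

# Forcing a contiguous copy gives kernels a densely packed buffer.
col_c = np.ascontiguousarray(col)
print(col_c.flags["C_CONTIGUOUS"])  # True
```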
Though I cannot provide specific cache hit ratio metrics for our production services, the drop relative to block-granularity caching is less than 10%.
Fantastic work. Any plans on upstreaming this?
Copied and rebased from vllm-project#28176 Thanks to @peakcrosser7 and @minminsun Signed-off-by: Jannis Schönleber <joennlae@gmail.com>
I added a rebased version here: #30725
Hi @joennlae! Thanks for your positive feedback and for creating the rebased PR!
We are iterating on #29272
Ah perfect :-)
Closed because of #30877

Purpose
Currently, Automatic Prefix Caching for Mamba-based hybrid models does not support architectures such as GDN. To address this, we propose a lightweight Mamba Prefix Caching design called Lighter-Mamba-Prefix-Cache.
Its core idea is to directly cache Mamba states using a block-aligned scheduling approach, enabling rapid support for Prefix Caching in Mamba models without modifying any kernel code, while maintaining compatibility with SPS, MTP, and Eagle.
This solution has already been validated on Qwen3-Next-80B-A3B-Instruct.
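The block-aligned chunking at the heart of this design (detailed under Design Details below) can be sketched as a toy scheduling rule. The function name and parameters are illustrative, not vLLM's actual scheduler API:

```python
def next_chunk_len(num_remaining: int, token_budget: int, block_size: int) -> int:
    """Length of the next scheduled chunk under block-aligned scheduling.

    Chunks are aligned down to a multiple of block_size so that every
    cached Mamba state lines up with exactly one block hash; only the
    final remainder chunk may be shorter than block_size.
    """
    chunk = min(num_remaining, token_budget)
    if chunk >= block_size:
        chunk -= chunk % block_size  # align down to a block boundary
    return chunk

# Splitting a 100-token prompt with a 64-token budget and block_size 16:
chunks, remaining = [], 100
while remaining:
    n = next_chunk_len(remaining, 64, 16)
    chunks.append(n)
    remaining -= n
print(chunks)  # [64, 32, 4]
```

Every chunk except the last is a multiple of 16, so the states after tokens 64 and 96 map cleanly to block hashes and can be cached.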
Design Details
Block Allocation Design
For each request, the number of blocks allocated per Mamba group is changed from the original fixed `1 + sps` to `2 + sps + N`, where:

- `1 + sps` blocks: Reserved for runtime usage, identical to the original `1 + sps` blocks without prefix caching.

Prefix Matching Logic
Block-Aligned Scheduling
Since requests are hashed at the granularity of `block_size`, Mamba states must be aligned to `block_size` boundaries before caching. This ensures that each Mamba state corresponds to exactly one block hash.

Lighter prefix cache stores variable-length chunk states, i.e., the number of tokens (or the incremental length) associated with each cached Mamba state may vary, but it is always a multiple of `block_size`.

With Mamba Prefix Caching enabled, the scheduler behaves as follows:

- Each scheduled chunk is aligned to a multiple of `block_size`, except for the final chunk of the request.
- The final chunk is the remainder left after splitting off full `block_size`-aligned chunks, ensuring its size is `≤ block_size`. This maximizes the length of the prompt that can be cached during the prefill phase.

Test Plan
TODO
Test Result
TODO
Essential Elements of an Effective PR Description Checklist
- Update `supported_models.md` and `examples` for a new model.