[Bugfix] Fix KV cache sizing and allocation for hybrid Mamba/attention models#37429
swtb3 wants to merge 4 commits into vllm-project:main
Conversation
…n models
Fix KV cache overestimation, memory waste, OOM, and throughput regression
for hybrid Mamba/attention models (e.g. Qwen3.5).
- Fix KV cache block count overestimation by detecting mixed Mamba/attention
  groups and sizing each independently instead of using worst-case uniform
  sizing (see the sketch after this commit message)
- Fix token capacity reporting to only count Mamba groups when
  mamba_cache_mode="all" (prefix caching)
- Add compact Mamba allocation: Mamba layers self-manage a small dedicated
  block pool (O(1) per request) instead of sharing the attention pool,
  eliminating 7x memory waste and OOM on large models
- Fix compact pool exhaustion causing 4x throughput regression by capping
  "none" mode allocation at 1+spec blocks per request (matching kernel usage)
  and making remove_skipped_blocks a no-op for permanent Mamba state
Co-authored-by: Claude <noreply@anthropic.com>
Signed-off-by: swtb3 <135991636+swtb3@users.noreply.github.com>
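For illustration, here is a minimal sketch of the independent per-group sizing idea from the first bullet above. The names used here (`KVCacheGroup`, `size_groups_independently`, `page_size_bytes`, `num_layers`) are assumptions for the sake of the example, not vLLM's actual internals in `kv_cache_utils.py`; treat it as a sketch of the approach rather than the implementation.

```python
from dataclasses import dataclass

# Hypothetical stand-ins for vLLM's internal KV cache group specs;
# names and fields are illustrative assumptions only.
@dataclass
class KVCacheGroup:
    is_mamba: bool        # True for Mamba/SSM layers, False for attention
    page_size_bytes: int  # bytes needed per block for this group
    num_layers: int

def size_groups_independently(groups, available_bytes, max_concurrent_reqs):
    """Size each group by its own needs instead of worst-case uniform sizing.

    Mamba groups get a small fixed pool (state is O(1) per request);
    attention groups split whatever memory remains into blocks.
    """
    mamba = [g for g in groups if g.is_mamba]
    attn = [g for g in groups if not g.is_mamba]

    # Mamba: one block per concurrent request is enough, since the state
    # does not grow with sequence length.
    mamba_bytes = sum(g.page_size_bytes * g.num_layers for g in mamba) * max_concurrent_reqs

    # Attention: everything left over, split into as many blocks as fit.
    attn_bytes_per_block = sum(g.page_size_bytes * g.num_layers for g in attn)
    attn_blocks = (available_bytes - mamba_bytes) // max(attn_bytes_per_block, 1)
    return {"mamba_blocks_per_request": 1, "attention_blocks": attn_blocks}
```

The point is that Mamba groups are charged a constant number of blocks per request while attention groups absorb the remaining budget, instead of every group being sized as if its state scaled with context length.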
Code Review
This pull request introduces a new 'compact Mamba allocation' strategy to significantly improve memory efficiency and token capacity for hybrid Mamba+attention models like Qwen3.5. The changes involve decoupling Mamba layer memory allocation from attention layers, allowing Mamba (which has O(1) state per request) to use fewer blocks. This is achieved by adding new logic to kv_cache_utils.py for determining KV cache configurations and concurrency estimates for mixed models, including separate block pools for Mamba layers in 'none' and 'align' cache modes. The MambaManager in single_type_kv_cache_manager.py is updated to handle this compact allocation, managing its own private block pool and ensuring blocks are allocated and freed correctly without interfering with the shared attention block pool. Extensive new test cases are added to validate the correctness, efficiency, and concurrency behavior of this new allocation scheme, including regression guards for pure attention/Mamba models and specific tests for the 'none' and 'align' Mamba cache modes.
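To make the review's description of the MambaManager change concrete, here is a minimal sketch of a self-managed, per-request block pool. The class and method names (`CompactMambaPool`, `allocate`, `free`) are illustrative assumptions, not vLLM's actual `MambaManager` API in `single_type_kv_cache_manager.py`.

```python
class CompactMambaPool:
    """Tiny dedicated pool: each request holds a constant number of Mamba blocks.

    Sketch of the compact-allocation idea only, not vLLM's MambaManager.
    """

    def __init__(self, num_blocks: int, blocks_per_request: int = 1):
        self.free_blocks = list(range(num_blocks))
        self.blocks_per_request = blocks_per_request
        self.allocated: dict[str, list[int]] = {}

    def allocate(self, request_id: str) -> list[int]:
        if request_id in self.allocated:
            # O(1) state: the allocation never grows with sequence length.
            return self.allocated[request_id]
        if len(self.free_blocks) < self.blocks_per_request:
            raise RuntimeError("compact Mamba pool exhausted")
        blocks = [self.free_blocks.pop() for _ in range(self.blocks_per_request)]
        self.allocated[request_id] = blocks
        return blocks

    def free(self, request_id: str) -> None:
        # Return the request's blocks to the private pool; the shared
        # attention pool is never touched.
        self.free_blocks.extend(self.allocated.pop(request_id, []))

    def remove_skipped_blocks(self, request_id: str, num_computed_tokens: int) -> None:
        # Mamba state is kept for the lifetime of the request, so this is a
        # no-op (mirrors the "no-op for permanent Mamba state" fix in the PR).
        return
```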
Hi @swtb3, the pre-commit checks have failed. Please run:

uv pip install pre-commit>=4.5.1
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.
…ator wiring

Add mamba_num_blocks field to KVCacheConfig and pass it through
kv_cache_coordinator to MambaManager. These were missed when squashing the
compact Mamba allocation commits onto a fresh branch.

Co-authored-by: Claude <noreply@anthropic.com>
Signed-off-by: swtb3 <135991636+swtb3@users.noreply.github.com>
I can still observe the TTFT performance degradation. This is what I see in the startup log (attached as a collapsed details block):
…id models
Fix two bugs causing a 4x throughput regression on long-context hybrid
Mamba/attention models (e.g. Qwen3.5 at 262K context):

1. Concurrency formula used max_model_len as the attention cost, giving C=1
   for long contexts. Mamba state is O(1) per request, so concurrency should
   be independent of sequence length. Replace with a shared-pool cap formula
   that guarantees attention_blocks >= the shared-pool equivalent (a small
   worked example follows this commit message).
2. Mamba page sizes were padded to match attention even in compact mode,
   where Mamba has its own separate tensors. Use real_page_size_bytes for
   Mamba allocation, cost accounting, and tensor reshape (including the model
   runner's stride and block count derivation).
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: swtb3 <135991636+swtb3@users.noreply.github.com>
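As a back-of-the-envelope illustration of item 1 above, the sketch below uses assumed numbers (a 262K context, 16-token blocks, an assumed pool size); nothing here is measured, it only shows why charging each request max_model_len worth of blocks collapses the estimate to C=1 while an O(1)-per-request cap does not.

```python
# Illustrative arithmetic only; all numbers are assumptions, not measurements.
max_model_len = 262_144          # long-context setting, e.g. ~262K tokens
block_size = 16
attention_blocks = 20_000        # assumed size of the shared attention pool

# Old estimate: charge each request max_model_len worth of attention blocks,
# so at long context the derived concurrency collapses to 1.
per_request_blocks_old = max_model_len // block_size                   # 16_384
concurrency_old = max(1, attention_blocks // per_request_blocks_old)   # -> 1

# New idea (per the commit): Mamba state is O(1) per request, so cap the
# per-request cost at a constant number of blocks and let the shared pool,
# not the sequence length, bound concurrency.
per_request_blocks_new = 1 + 0   # 1 + speculative-decode blocks (assume none)
concurrency_new = attention_blocks // per_request_blocks_new           # -> 20_000
print(concurrency_old, concurrency_new)
```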
NickLucche
left a comment
Mamba layers self-manage a dedicated O(1) block pool instead of sharing the attention pool, eliminating 7x memory waste and OOM
IIUC, that'd be overriding a lot of work that went into HMA, building around making a single shared pool work.
Any similar change would at least require an RFC to discuss.
Would you care to elaborate more on your suggested changes in that format?
cc @heheda12345
Thank you @swtb3, I just tested this and I no longer see a decrease in performance. However, the previous commit brought the GPU KV cache size from 100k up to 400k, whereas now I am only getting a 30k increase, which is still decent.
To be honest folks, this has become a bit of a rabbit hole. It would be good to get some assistance from a maintainer who has more knowledge of vLLM's machinery. All I know is that hybrid Mamba models are fantastic for keeping the KV cache small... but vLLM currently ignores that improvement and treats all layers the same.
Could you share the logs/results?
Tested against main @ c63ca2b. The command, the output before and after the PR, the benchmark command, the benchmark results before and after the PR, and the full startup logs before and after the PR are attached as collapsed details blocks.
Thank you for the great PR. I support the idea of separating Mamba and attention cache management. My understanding of your Mamba change: Mamba layers now allocate O(1) blocks (independent of request length), while attention layers allocate O(n) blocks. Is that correct? If so, I'm concerned about the prefix cache hit ratio:

Was this impact taken into account in the initial design? In high-reuse scenarios (e.g., multi-turn chat), does the recomputation overhead outweigh the memory savings? I would greatly appreciate any relevant theoretical analysis or experimental results on the hit ratio. Thanks again for your work.
Hello, thanks for the kudos. Trouble is, this ended up becoming a rabbit hole. At this stage I welcome contributions from other people who have a deeper knowledge of the KV cache in vLLM and its more recent changes.
Thank you for your timely reply. Looking forward to having more discussions in the future.
Summary
- Fix token capacity reporting to only count Mamba groups when mamba_cache_mode="all"
- Cap "none" mode allocation at 1+spec blocks per request and make remove_skipped_blocks a no-op for permanent Mamba state

Supersedes #37124 (closed due to rebase notification issue).
Test plan
AI assistance was used (Claude). This is not duplicating an existing PR — it supersedes #37124 with a clean branch.
🤖 Generated with Claude Code