[Bugfix] Fix KV cache overestimation for hybrid Mamba/attention model…#37124
swtb3 wants to merge 7 commits into vllm-project:main
Conversation
👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs would not trigger full CI run by default. Instead, it would only run fastcheck CI, which runs only a small and essential subset of CI tests to quickly catch errors. You can ask your reviewers to trigger select CI tests on top of fastcheck CI.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

If you have any questions, please reach out to us on Slack at https://slack.vllm.ai. 🚀
Code Review
This pull request effectively addresses the KV cache memory overestimation for hybrid Mamba/attention models like Qwen3.5. The changes correctly introduce specialized logic to handle the different memory requirements of Mamba and attention layers independently, which resolves the reported issue. The modifications to memory allocation, concurrency estimation, and reporting are well-implemented and gated behind a check for mixed-architecture models, minimizing the risk of regressions for other model types. The accompanying tests are thorough and provide good coverage for the new logic, ensuring the fix is robust. Overall, this is a high-quality contribution that significantly improves memory efficiency for this class of models.
Force-pushed from 2264e4a to 2f538a8
This pull request has merge conflicts that must be resolved before it can be merged.
@swtb3 happy to test this after the conflicts are solved, thank you!
vllm-project#37121)

Qwen3.5 mixes 24 GatedDeltaNet layers (O(1) state) with 8 full attention layers (O(n) KV per token). vLLM treated all layers uniformly, causing ~7x memory overestimation (7.57 GiB allocated, ~1 GiB used).

Reporting fixes:
- get_max_concurrency_for_kv_cache_config: sum per-group costs independently instead of multiplying the largest cost by the largest group count
- _report_kv_cache_config: count tokens from attention groups only
- _max_memory_usage_bytes_from_groups: sum actual per-group memory usage instead of calling get_uniform_page_size (which crashes with non-uniform sizes)

Allocation fix:
- New elif branch in get_kv_cache_config_from_groups for mixed Mamba+attention: gives each layer its own tensor at its natural page size
- Skip page size unification in get_kv_cache_groups so Mamba layers keep their small page size instead of being padded to match attention

All changes gated behind _has_mixed_mamba_attention() — no impact on pure-attention, pure-Mamba, or attention+sliding_window models.

Co-authored-by: Claude <noreply@anthropic.com>
Signed-off-by: swtb <135991636+swtb3@users.noreply.github.com>
Signed-off-by: swtb-ryder <sbayly@ryderarchitecture.com>
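The concurrency-estimation part of that commit is easiest to see as arithmetic. Below is a minimal, illustrative sketch in plain Python, not vLLM's actual get_max_concurrency_for_kv_cache_config; the group costs are made-up numbers standing in for 8 attention groups and 24 Mamba groups.

```python
# Hedged sketch of the reporting fix: per-request cost should be the sum of
# each KV cache group's own cost, not "largest group cost x number of groups".

def max_concurrency_old(free_bytes: int, group_costs: list[int]) -> float:
    # Old estimate: every group assumed as expensive as the largest one.
    per_request = max(group_costs) * len(group_costs)
    return free_bytes / per_request

def max_concurrency_new(free_bytes: int, group_costs: list[int]) -> float:
    # New estimate: each group contributes only its own per-request cost.
    per_request = sum(group_costs)
    return free_bytes / per_request

# Hypothetical hybrid model: 8 attention groups (large, grows with tokens)
# and 24 Mamba groups (tiny, constant state per request).
costs = [256 * 2**20] * 8 + [4 * 2**20] * 24
free = 8 * 2**30
print(max_concurrency_old(free, costs))  # ~1.0, heavily underestimated concurrency
print(max_concurrency_new(free, costs))  # ~3.8, cost summed per group
```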
…all" When prefix caching is enabled (mamba_cache_mode="all"), Mamba states are cached per-token and scale with sequence length. Only exclude Mamba groups from token capacity reporting in "none" and "align" modes. Co-authored-by: Claude <noreply@anthropic.com> Signed-off-by: swtb <135991636+swtb3@users.noreply.github.com> Signed-off-by: swtb-ryder <sbayly@ryderarchitecture.com>
…rid models

Hybrid Mamba/attention models (e.g., Qwen3.5) suffered OOM and massive memory waste because Mamba layers shared the attention BlockPool. Each Mamba layer's tensor was sized for N blocks (the full pool) but only used 1 block per request, wasting ~399 MB per layer.

Decouple Mamba from the shared BlockPool by giving MambaManager a self-managed compact block space (0..C-1 where C = max concurrent requests). Freed memory goes to attention, yielding ~47x token capacity improvement on Qwen3.5-4B.

- Add `mamba_num_blocks` field to `KVCacheConfig`
- Implement compact allocation branch in `get_kv_cache_config_from_groups`
- Add compact mode to `MambaManager` with self-managed block lifecycle
- Update cross-worker tensor scaling for separate Mamba/attention pools
- Update concurrency calculation for compact allocation
- Preserve shared-pool behavior for `mamba_cache_mode="all"`
- Add 13 new tests covering allocation, manager, and edge cases

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: swtb3 <135991636+swtb3@users.noreply.github.com>
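The per-layer waste described above comes from sizing every Mamba tensor for the whole block pool instead of for the number of concurrent requests. A back-of-the-envelope sketch, with assumed (not measured) pool size, concurrency, and page size:

```python
# Illustrative numbers only; real values depend on the model, --max-num-seqs,
# and --gpu-memory-utilization, not on anything hard-coded here.
pool_blocks = 8192            # blocks in the shared attention BlockPool (assumed)
max_concurrency = 32          # C: max concurrent requests (assumed)
mamba_page_bytes = 50 * 1024  # hypothetical per-block Mamba state size

shared_pool_tensor = pool_blocks * mamba_page_bytes      # old: sized for N blocks
compact_tensor = max_concurrency * mamba_page_bytes      # new: sized for C blocks

print(f"per-layer Mamba tensor, shared pool: {shared_pool_tensor / 2**20:.0f} MiB")
print(f"per-layer Mamba tensor, compact:     {compact_tensor / 2**20:.2f} MiB")
# The difference, multiplied across all Mamba layers, is what the compact
# allocation hands back to the attention KV cache.
```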
Force-pushed from 2f538a8 to a3ad054
done
Without PR: GPU KV cache size: 101,600 tokens

I am, however, experiencing a dramatic drop in performance:

Before PR
After PR

This is what I am running:

```
NCCL_P2P_LEVEL=SYS \
NCCL_IB_DISABLE=1 \
NCCL_NET_GDR_LEVEL=SYS \
NCCL_MIN_NCHANNELS=4 \
NCCL_ALLOC_P2P_NET_LL_BUFFERS=1 \
VLLM_ENABLE_FLA_PACKED_RECURRENT_DECODE=1 \
VLLM_MEMORY_PROFILER_ESTIMATE_CUDAGRAPHS=1 \
vllm serve \
Qwen/Qwen3.5-27B-FP8 \
--tensor-parallel-size 2 \
--gpu-memory-utilization 0.94 \
--max-model-len 262144 \
--max-num-seqs 32 \
--max-num-batched-tokens 8192 \
--block-size 32 \
--language-model-only \
-O3 \
--enable-auto-tool-choice \
--reasoning-parser qwen3 \
--tool-call-parser qwen3_coder \
--attention-backend TRITON_ATTN \
--enable-prefix-caching \
--speculative-config.method mtp \
--speculative-config.num_speculative_tokens 2 \
--speculative-config.rejection_sample_method probabilistic \
--load-format instanttensor
```

Tested against main 09e4576
Before
After
So TTFT explodes while TPOT/ITL come down; suspiciously, there is a factor of ~7 difference in each case. I wonder if my changes have caused some sequential queuing.
Due to a rebase error, every update on this PR is pinging all vLLM maintainers. Could you please close it and open a new one so that the right subset of reviewers gets notified? Thanks
Purpose
Qwen3.5 mixes 24 GatedDeltaNet layers (O(1) state) with 8 full attention layers (O(n) KV per token). vLLM treated all layers uniformly, causing ~7x memory overestimation (7.57 GiB allocated, ~1 GiB used).
Reporting fixes:
- get_max_concurrency_for_kv_cache_config: sum per-group costs independently instead of multiplying the largest cost by the largest group count
- _report_kv_cache_config: count tokens from attention groups only
- _max_memory_usage_bytes_from_groups: sum actual per-group memory usage instead of calling get_uniform_page_size (which crashes with non-uniform sizes)

Allocation fix:
- New elif branch in get_kv_cache_config_from_groups for mixed Mamba+attention: gives each layer its own tensor at its natural page size
- Skip page size unification in get_kv_cache_groups so Mamba layers keep their small page size instead of being padded to match attention
All changes gated behind _has_mixed_mamba_attention() — no impact on pure-attention, pure-Mamba, or attention+sliding_window models.
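Since everything is gated behind _has_mixed_mamba_attention(), a toy version of that predicate (operating on plain layer-type strings rather than vLLM's KV cache specs, so purely illustrative) shows when the new path turns on:

```python
def has_mixed_mamba_attention(layer_types: list[str]) -> bool:
    # The new allocation/reporting path applies only when both kinds of
    # layers are present in the same model.
    kinds = set(layer_types)
    return "mamba" in kinds and "attention" in kinds

# Qwen3.5-style hybrid: 24 GatedDeltaNet (Mamba-like) + 8 full-attention layers.
assert has_mixed_mamba_attention(["mamba"] * 24 + ["attention"] * 8)
# Pure-attention, pure-Mamba, or attention+sliding-window stacks are untouched.
assert not has_mixed_mamba_attention(["attention"] * 32)
assert not has_mixed_mamba_attention(["mamba"] * 48)
```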
Test plan
pytest tests/v1/core/test_kv_cache_utils.py -v -s — 57/57 passed

Test Result
pass
Notes