[Bugfix] Fix hybrid Mamba KV cache allocation with request-constant pools #41495
lesj0610 wants to merge 10 commits into vllm-project:main
Conversation
Add explicit KV cache memory-model metadata, compact request-constant block pools, and pool-aware config/manager/worker handling for hybrid Mamba and attention models. Mamba cache mode 'all' keeps the legacy token-proportional path. Unsupported request-constant combinations fail closed for prefix caching, offload, connector, and full CUDA graph paths. Co-authored-by: OpenAI Codex <codex@openai.com> Signed-off-by: lesj0610 <lesj0610@users.noreply.github.com>
Signed-off-by: lesj0610 <lesj0610@users.noreply.github.com> (cherry picked from commit 378322e014aeab09467a98e2348c04fd168d9c6b)
Code Review
This pull request introduces a multi-pool KV cache architecture to support mixed memory models, specifically enabling efficient Mamba state management alongside traditional attention mechanisms. It defines TOKEN_PROPORTIONAL and REQUEST_CONSTANT memory models, implements a new CompactBlockPool for fixed-size request states, and updates the KVCacheCoordinator and KVCacheManager to be pool-aware. The changes also include extensive updates to configuration utilities, worker reshape logic, and a comprehensive suite of new tests. Feedback highlights potential issues in vllm/v1/core/kv_cache_utils.py, including a possible division-by-zero error during block normalization and a logic flaw where memory reservation checks might incorrectly fail during CUDA graph profiling if no token-proportional groups are present.
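As a reading aid for the terms above, here is a minimal sketch of the two memory models and a compact pool. The names mirror the review summary (`TOKEN_PROPORTIONAL`, `REQUEST_CONSTANT`, `CompactBlockPool`), but the bodies are illustrative, not the PR's actual implementation in `vllm/v1/core/`.

```python
from dataclasses import dataclass, field
from enum import Enum, auto


class KVCacheMemoryModel(Enum):
    """Illustrative enum; the real PR metadata may be shaped differently."""
    TOKEN_PROPORTIONAL = auto()  # attention KV: grows with the number of tokens
    REQUEST_CONSTANT = auto()    # Mamba/GDN state: fixed size per request


@dataclass
class CompactBlockPool:
    """Toy pool with one fixed-size slot per in-flight request, no token padding."""
    num_slots: int
    _free_slots: list[int] = field(default_factory=list)

    def __post_init__(self) -> None:
        self._free_slots = list(range(self.num_slots))

    def allocate(self) -> int:
        if not self._free_slots:
            raise RuntimeError("no free request-constant slots")
        return self._free_slots.pop()

    def free(self, slot_id: int) -> None:
        self._free_slots.append(slot_id)
```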
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: efccac882e
Co-authored-by: OpenAI Codex <codex@openai.com> Signed-off-by: lesj0610 <lesj0610@users.noreply.github.com>
Keep the existing fail-closed behavior for hybrid specs whose page sizes cannot be aligned by block-size adjustment. Co-authored-by: OpenAI Codex <codex@openai.com> Signed-off-by: lesj0610 <lesj0610@users.noreply.github.com>
Validate request-constant pool capacity with max_num_seqs instead of rejecting full CUDA graph capture outright. Co-authored-by: OpenAI Codex <codex@openai.com> Signed-off-by: lesj0610 <lesj0610@users.noreply.github.com>
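A rough sketch of what the `max_num_seqs` capacity check described in the commit above could look like; the function name and error message are illustrative, not the PR's code.

```python
def check_request_constant_capacity(num_slots: int, max_num_seqs: int) -> None:
    # Illustrative: allow full CUDA graph capture as long as the compact pool
    # can hold one fixed-size state per concurrently scheduled request,
    # instead of rejecting FULL cudagraph mode outright.
    if num_slots < max_num_seqs:
        raise ValueError(
            f"request-constant pool has {num_slots} slots but the scheduler "
            f"may run up to {max_num_seqs} sequences; lower max_num_seqs or "
            "free more GPU memory")
```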
Signed-off-by: lesj0610 <lesj0610@users.noreply.github.com>
Hi maintainers, chiming in as a downstream user evaluating Qwen3.5 migration. This PR addresses a real production concern. Per #37121, vLLM is over-allocating ~7x KV cache memory for hybrid Mamba/attention models like Qwen3.5: the profiler treats Mamba/GDN groups (which have request-constant O(1) state) the same as attention groups (token-proportional O(n) state). On Qwen3.5-4B-AWQ this means losing ~86% of the KV budget to padding. For our planned production stack (…).

The PR is well-scoped: a request-constant vs. token-proportional split on an additive code path, keeping `mamba_cache_mode="all"` on the legacy behavior.

Thanks @lesj0610 for the work!
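To make the O(1) vs O(n) distinction concrete, here is a toy calculation. The attention formula is the usual per-request KV accounting, the Mamba side is a flat per-layer constant, and every number below is a placeholder rather than a Qwen3.5 value.

```python
def attention_kv_bytes(num_tokens: int, num_layers: int, num_kv_heads: int,
                       head_dim: int, dtype_bytes: int = 2) -> int:
    # Token-proportional: K and V tensors per layer, growing with the context.
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes * num_tokens


def mamba_state_bytes(num_layers: int, state_bytes_per_layer: int) -> int:
    # Request-constant: a fixed recurrent state per layer, independent of tokens.
    return num_layers * state_bytes_per_layer


# Placeholder shapes: the attention share grows with context length while the
# Mamba/GDN share stays flat, so sizing both from token counts over-reserves.
for n_tokens in (1_000, 10_000, 100_000):
    attn = attention_kv_bytes(n_tokens, num_layers=12, num_kv_heads=8, head_dim=128)
    mamba = mamba_state_bytes(num_layers=36, state_bytes_per_layer=4 * 2**20)
    print(f"{n_tokens=}: attention={attn / 2**20:.0f} MiB, mamba={mamba / 2**20:.0f} MiB")
```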
Signed-off-by: lesj0610 <lesj0610@users.noreply.github.com>

# Conflicts:
#   vllm/v1/core/sched/scheduler.py
#   vllm/v1/worker/gpu/attn_utils.py
Summary
Fix KV cache sizing for hybrid Mamba/attention models, mainly the Qwen3.5/3.6 GDN path.
Mamba state in `mamba_cache_mode="none"` and `"align"` is per-request, not per-token. The old code handled it like normal attention KV, which wastes attention capacity and makes tensor sizing harder. This change moves request-constant Mamba/GDN groups into their own compact pool.
`mamba_cache_mode="all"` keeps the old shared-pool behavior.

Changes
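As a rough illustration of the mode-to-pool routing described in the summary (a hypothetical helper; the real logic in `vllm/v1/core/kv_cache_utils.py` differs):

```python
def memory_model_for_mamba(mamba_cache_mode: str) -> str:
    # "all" keeps the legacy shared, token-proportional pool; "none" and
    # "align" place the fixed-size Mamba/GDN state in a compact,
    # request-constant pool instead.
    if mamba_cache_mode == "all":
        return "token_proportional"
    if mamba_cache_mode in ("none", "align"):
        return "request_constant"
    raise ValueError(f"unknown mamba_cache_mode: {mamba_cache_mode!r}")
```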
Related PRs
Validation
Commands run on this branch:
```bash
.venv/bin/ruff check \
  vllm/config/compilation.py \
  vllm/v1/core/kv_cache_utils.py \
  vllm/v1/core/block_pool.py \
  vllm/v1/core/kv_cache_manager.py \
  tests/v1/core/test_kv_cache_utils.py \
  tests/v1/core/test_prefix_caching.py

.venv/bin/python -m pytest tests/v1/core/test_kv_cache_utils.py -q

.venv/bin/python -m pytest \
  tests/v1/core/test_kv_cache_utils.py \
  tests/v1/core/test_block_pool.py \
  tests/v1/core/test_prefix_caching.py \
  -q -k 'request_constant or mixed_memory_model or real_mamba_spec or compact_pool or token_proportional_capacity or num_blocks_override or take_events'
```

Result: `ruff` passed. `tests/v1/core/test_kv_cache_utils.py` passed with `75 passed`. The focused pytest command also passed with `13 passed, 131 deselected`.

Other focused validation during branch work:
- Runtime capacity checks were run in eager mode (`enforce_eager=True`).
- Runtime runs loaded `Qwen3_5ForConditionalGeneration` and the Triton/FLA GDN prefill kernel. Qwen3.5-9B and Qwen3.6-27B also passed short English/Korean/Arabic answer checks with thinking disabled.
- Cudagraph smoke was run with Qwen3.5-4B at TP=1, `kv_cache_dtype=auto`, and `cudagraph_mode=FULL`; load and generation completed (a hedged reproduction sketch follows at the end of this description).

AI assistance was used (Codex, Claude, Gemini).
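For anyone wanting to repeat the cudagraph smoke run, a hedged sketch using the offline `LLM` API is below; the HF model id is a placeholder and the exact `compilation_config` spelling may differ across vLLM versions.

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen3.5-4B",                        # placeholder repo id
    tensor_parallel_size=1,
    kv_cache_dtype="auto",
    compilation_config={"cudagraph_mode": "FULL"},  # full CUDA graph capture
)
outputs = llm.generate(["Hello"], SamplingParams(max_tokens=16))
print(outputs[0].outputs[0].text)
```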