[Bugfix] Fix hybrid Mamba KV cache allocation with request-constant pools #41495
lesj0610 wants to merge 10 commits into vllm-project:main
Conversation
Add explicit KV cache memory-model metadata, compact request-constant block pools, and pool-aware config/manager/worker handling for hybrid Mamba and attention models. Mamba cache mode 'all' keeps the legacy token-proportional path. Unsupported request-constant combinations fail closed for prefix caching, offload, connector, and full CUDA graph paths. Co-authored-by: OpenAI Codex <codex@openai.com> Signed-off-by: lesj0610 <lesj0610@users.noreply.github.com>
Signed-off-by: lesj0610 <lesj0610@users.noreply.github.com> (cherry picked from commit 378322e014aeab09467a98e2348c04fd168d9c6b)
Code Review
This pull request introduces a multi-pool KV cache architecture to support mixed memory models, specifically enabling efficient Mamba state management alongside traditional attention mechanisms. It defines TOKEN_PROPORTIONAL and REQUEST_CONSTANT memory models, implements a new CompactBlockPool for fixed-size request states, and updates the KVCacheCoordinator and KVCacheManager to be pool-aware. The changes also include extensive updates to configuration utilities, worker reshape logic, and a comprehensive suite of new tests. Feedback highlights potential issues in vllm/v1/core/kv_cache_utils.py, including a possible division-by-zero error during block normalization and a logic flaw where memory reservation checks might incorrectly fail during CUDA graph profiling if no token-proportional groups are present.
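As a reading aid for the terms above, here is a minimal sketch of the two memory models and a compact pool. The names mirror the review summary (`TOKEN_PROPORTIONAL`, `REQUEST_CONSTANT`, `CompactBlockPool`), but the bodies are illustrative, not the PR's actual implementation in `vllm/v1/core/`.

```python
from dataclasses import dataclass, field
from enum import Enum, auto


class KVCacheMemoryModel(Enum):
    """Illustrative enum; the real PR metadata may be shaped differently."""
    TOKEN_PROPORTIONAL = auto()  # attention KV: grows with the number of tokens
    REQUEST_CONSTANT = auto()    # Mamba/GDN state: fixed size per request


@dataclass
class CompactBlockPool:
    """Toy pool with one fixed-size slot per in-flight request, no token padding."""
    num_slots: int
    _free_slots: list[int] = field(default_factory=list)

    def __post_init__(self) -> None:
        self._free_slots = list(range(self.num_slots))

    def allocate(self) -> int:
        if not self._free_slots:
            raise RuntimeError("no free request-constant slots")
        return self._free_slots.pop()

    def free(self, slot_id: int) -> None:
        self._free_slots.append(slot_id)
```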
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: efccac882e
Co-authored-by: OpenAI Codex <codex@openai.com> Signed-off-by: lesj0610 <lesj0610@users.noreply.github.com>
Keep the existing fail-closed behavior for hybrid specs whose page sizes cannot be aligned by block-size adjustment. Co-authored-by: OpenAI Codex <codex@openai.com> Signed-off-by: lesj0610 <lesj0610@users.noreply.github.com>
Validate request-constant pool capacity with max_num_seqs instead of rejecting full CUDA graph capture outright. Co-authored-by: OpenAI Codex <codex@openai.com> Signed-off-by: lesj0610 <lesj0610@users.noreply.github.com>
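A rough sketch of what the `max_num_seqs` capacity check described in the commit above could look like; the function name and error message are illustrative, not the PR's code.

```python
def check_request_constant_capacity(num_slots: int, max_num_seqs: int) -> None:
    # Illustrative: allow full CUDA graph capture as long as the compact pool
    # can hold one fixed-size state per concurrently scheduled request,
    # instead of rejecting FULL cudagraph mode outright.
    if num_slots < max_num_seqs:
        raise ValueError(
            f"request-constant pool has {num_slots} slots but the scheduler "
            f"may run up to {max_num_seqs} sequences; lower max_num_seqs or "
            "free more GPU memory")
```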
Signed-off-by: lesj0610 <lesj0610@users.noreply.github.com>
Hi maintainers, chiming in as a downstream user evaluating Qwen3.5 migration. This PR addresses a real production concern. Per #37121, vLLM is over-allocating ~7x KV cache memory for hybrid Mamba/attention models like Qwen3.5: the profiler treats Mamba/GDN groups (which have request-constant O(1) state) the same as attention groups (token-proportional O(n) state). On Qwen3.5-4B-AWQ this means losing ~86% of the KV budget to padding. For our planned production stack (…).

The PR is well-scoped: a request-constant vs. token-proportional split on an additive code path, keeping `mamba_cache_mode="all"` on the legacy behavior.

Thanks @lesj0610 for the work!
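To make the O(1) vs O(n) distinction concrete, here is a toy calculation. The attention formula is the usual per-request KV accounting, the Mamba side is a flat per-layer constant, and every number below is a placeholder rather than a Qwen3.5 value.

```python
def attention_kv_bytes(num_tokens: int, num_layers: int, num_kv_heads: int,
                       head_dim: int, dtype_bytes: int = 2) -> int:
    # Token-proportional: K and V tensors per layer, growing with the context.
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes * num_tokens


def mamba_state_bytes(num_layers: int, state_bytes_per_layer: int) -> int:
    # Request-constant: a fixed recurrent state per layer, independent of tokens.
    return num_layers * state_bytes_per_layer


# Placeholder shapes: the attention share grows with context length while the
# Mamba/GDN share stays flat, so sizing both from token counts over-reserves.
for n_tokens in (1_000, 10_000, 100_000):
    attn = attention_kv_bytes(n_tokens, num_layers=12, num_kv_heads=8, head_dim=128)
    mamba = mamba_state_bytes(num_layers=36, state_bytes_per_layer=4 * 2**20)
    print(f"{n_tokens=}: attention={attn / 2**20:.0f} MiB, mamba={mamba / 2**20:.0f} MiB")
```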
Signed-off-by: lesj0610 <lesj0610@users.noreply.github.com>

# Conflicts:
#   vllm/v1/core/sched/scheduler.py
#   vllm/v1/worker/gpu/attn_utils.py
Summary
Fix KV cache sizing for hybrid Mamba/attention models, mainly the Qwen3.5/3.6 GDN path.
Mamba state in `mamba_cache_mode="none"` and `"align"` is per-request, not per-token. The old code handled it like normal attention KV, which wastes attention capacity and makes tensor sizing harder. This change moves request-constant Mamba/GDN groups into their own compact pool.
`mamba_cache_mode="all"` keeps the old shared-pool behavior.

Changes
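As a rough illustration of the mode-to-pool routing described in the summary (a hypothetical helper; the real logic in `vllm/v1/core/kv_cache_utils.py` differs):

```python
def memory_model_for_mamba(mamba_cache_mode: str) -> str:
    # "all" keeps the legacy shared, token-proportional pool; "none" and
    # "align" place the fixed-size Mamba/GDN state in a compact,
    # request-constant pool instead.
    if mamba_cache_mode == "all":
        return "token_proportional"
    if mamba_cache_mode in ("none", "align"):
        return "request_constant"
    raise ValueError(f"unknown mamba_cache_mode: {mamba_cache_mode!r}")
```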
Related PRs
Validation
Commands run on this branch:
```bash
.venv/bin/ruff check \
  vllm/config/compilation.py \
  vllm/v1/core/kv_cache_utils.py \
  vllm/v1/core/block_pool.py \
  vllm/v1/core/kv_cache_manager.py \
  tests/v1/core/test_kv_cache_utils.py \
  tests/v1/core/test_prefix_caching.py

.venv/bin/python -m pytest tests/v1/core/test_kv_cache_utils.py -q

.venv/bin/python -m pytest \
  tests/v1/core/test_kv_cache_utils.py \
  tests/v1/core/test_block_pool.py \
  tests/v1/core/test_prefix_caching.py \
  -q -k 'request_constant or mixed_memory_model or real_mamba_spec or compact_pool or token_proportional_capacity or num_blocks_override or take_events'
```

Result: `ruff` passed. `tests/v1/core/test_kv_cache_utils.py` passed with `75 passed`. The focused pytest command also passed with `13 passed, 131 deselected`.

Other focused validation during branch work:
- Runtime capacity checks were run in eager mode (`enforce_eager=True`).
- Runtime runs loaded `Qwen3_5ForConditionalGeneration` and the Triton/FLA GDN prefill kernel. Qwen3.5-9B and Qwen3.6-27B also passed short English/Korean/Arabic answer checks with thinking disabled.
- Cudagraph smoke was run with Qwen3.5-4B at TP=1, `kv_cache_dtype=auto`, and `cudagraph_mode=FULL`; load and generation completed (a hedged reproduction sketch follows at the end of this description).

AI assistance was used (Codex, Claude, Gemini).
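For anyone wanting to repeat the cudagraph smoke run, a hedged sketch using the offline `LLM` API is below; the HF model id is a placeholder and the exact `compilation_config` spelling may differ across vLLM versions.

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen3.5-4B",                        # placeholder repo id
    tensor_parallel_size=1,
    kv_cache_dtype="auto",
    compilation_config={"cudagraph_mode": "FULL"},  # full CUDA graph capture
)
outputs = llm.generate(["Hello"], SamplingParams(max_tokens=16))
print(outputs[0].outputs[0].text)
```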