[Bugfix] Fix hybrid Mamba KV cache allocation with request-constant pools #41495

Open

lesj0610 wants to merge 10 commits into vllm-project:main from lesj0610:lesj/gdn-kv-mamba-attn-kv-fix-pr

Conversation


lesj0610 (Contributor) commented May 2, 2026

Summary

Fix KV cache sizing for hybrid Mamba/attention models, mainly the Qwen3.5/3.6 GDN path.

In mamba_cache_mode="none" and "align", Mamba state is per-request, not per-token. The old code sized it like ordinary attention KV, which wastes attention capacity and complicates tensor sizing.

This PR separates request-constant Mamba/GDN groups into their own compact pool; mamba_cache_mode="all" keeps the old shared-pool behavior.
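As a rough illustration of the scaling difference (all numbers below are hypothetical, not measurements from this PR):

```python
# Hypothetical sizes, only to illustrate request-constant vs token-proportional scaling.
max_num_seqs = 256
mamba_state_bytes = 4 * 2**20        # assume ~4 MiB of GDN state per request
token_budget = 350_000
attn_kv_bytes_per_token = 96 * 1024  # assume ~96 KiB of attention KV per token

# A request-constant pool is sized by concurrent requests, not context length.
mamba_pool = max_num_seqs * mamba_state_bytes
# A token-proportional pool grows with the total token budget.
attn_pool = token_budget * attn_kv_bytes_per_token

print(f"Mamba/GDN pool: {mamba_pool / 2**30:.2f} GiB (fixed per request)")
print(f"Attention pool: {attn_pool / 2**30:.2f} GiB (scales with tokens)")
```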

Changes

  • Add KV cache memory-model metadata for token-proportional and request-constant groups.
  • Add a compact block pool for request-constant groups (see the sketch after this list). Block id 0 is still reserved as the null block.
  • Generate separate KV pool configs for attention KV and Mamba/GDN state KV.
  • Make scheduler, manager, and worker reshape paths use the right pool/page size.
  • Keep unsupported paths fail-closed: prefix caching, CPU offload, and KV connector.
  • Validate compact-pool capacity for cudagraph capture.
  • Keep cudagraph memory profiling working with a minimal mixed-memory KV config.
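A minimal sketch of the compact pool and the cudagraph capacity check described in this list. The names TOKEN_PROPORTIONAL, REQUEST_CONSTANT, and CompactBlockPool appear in the automated review below, but every signature, helper, and error path here is an assumption, not the PR's actual code:

```python
from enum import Enum, auto


class KVCacheMemoryModel(Enum):
    TOKEN_PROPORTIONAL = auto()  # attention KV: grows with sequence length
    REQUEST_CONSTANT = auto()    # Mamba/GDN state: one fixed-size slot per request


class CompactBlockPool:
    """One fixed-size block per live request; block id 0 stays the null block."""

    def __init__(self, num_blocks: int) -> None:
        # Reserve block 0 as the null block, mirroring the shared pool.
        self._free = list(range(num_blocks - 1, 0, -1))

    def allocate(self) -> int:
        if not self._free:
            raise RuntimeError("request-constant pool exhausted")
        return self._free.pop()

    def free(self, block_id: int) -> None:
        assert block_id != 0, "the null block is never freed"
        self._free.append(block_id)


def validate_capacity_for_cudagraph(num_blocks: int, max_num_seqs: int) -> None:
    # Full cudagraph capture can run up to max_num_seqs concurrent requests,
    # so the compact pool needs one usable block per possible request.
    usable = num_blocks - 1  # minus the reserved null block
    if usable < max_num_seqs:
        raise ValueError(
            f"compact pool has {usable} usable blocks but max_num_seqs={max_num_seqs}"
        )
```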

Related PRs

Validation

Commands run on this branch:

.venv/bin/ruff check \
  vllm/config/compilation.py \
  vllm/v1/core/kv_cache_utils.py \
  vllm/v1/core/block_pool.py \
  vllm/v1/core/kv_cache_manager.py \
  tests/v1/core/test_kv_cache_utils.py \
  tests/v1/core/test_prefix_caching.py

.venv/bin/python -m pytest tests/v1/core/test_kv_cache_utils.py -q

.venv/bin/python -m pytest \
  tests/v1/core/test_kv_cache_utils.py \
  tests/v1/core/test_block_pool.py \
  tests/v1/core/test_prefix_caching.py \
  -q -k 'request_constant or mixed_memory_model or real_mamba_spec or compact_pool or token_proportional_capacity or num_blocks_override or take_events'

Result: ruff passed; tests/v1/core/test_kv_cache_utils.py finished with 75 passed; the focused pytest command finished with 13 passed, 131 deselected.

Other focused validation during branch work:

  • 109 passed for block pool, KV cache invariants, coordinator, prefix-cache gate, config generation, and manager paths.
  • 17 passed for mixed/request-constant KV config tests.
  • CPU offload request-constant reject test passed.
  • Cudagraph profiling override regression test passed.
  • Request-constant cudagraph capacity tests passed.

Runtime capacity checks were run in eager mode (enforce_eager=True).

| Model | TP | Before (GPU KV) | After (GPU KV) | Change |
|---|---|---|---|---|
| Qwen3.5-4B dense GDN | 1 | ~250K tokens | ~352K tokens | 1.4x |
| Qwen3.5-9B dense GDN | 1 | ~36K tokens | ~49K tokens | 1.3x |
| Qwen3.6-27B dense GDN | 2 | ~284K tokens | ~376K tokens | 1.3x |

Runtime runs loaded Qwen3_5ForConditionalGeneration and the Triton/FLA GDN prefill kernel. Qwen3.5-9B and Qwen3.6-27B also passed short English/Korean/Arabic answer checks with thinking disabled.

Cudagraph smoke was run with Qwen3.5-4B TP=1, kv_cache_dtype=auto, and cudagraph_mode=FULL; load and generation completed.

AI assistance was used (Codex, Claude, Gemini).

lesj0610 added 2 commits May 2, 2026 10:16
Add explicit KV cache memory-model metadata, compact request-constant block pools, and pool-aware config/manager/worker handling for hybrid Mamba and attention models.

Mamba cache mode 'all' keeps the legacy token-proportional path. Unsupported request-constant combinations fail closed for prefix caching, offload, connector, and full CUDA graph paths.

Co-authored-by: OpenAI Codex <codex@openai.com>

Signed-off-by: lesj0610 <lesj0610@users.noreply.github.com>
Signed-off-by: lesj0610 <lesj0610@users.noreply.github.com>
(cherry picked from commit 378322e014aeab09467a98e2348c04fd168d9c6b)
mergify bot added the intel-gpu, v1, and bug labels on May 2, 2026

claude bot left a comment

Claude Code Review

This pull request is from a fork — automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.

gemini-code-assist bot (Contributor) left a comment

Code Review

This pull request introduces a multi-pool KV cache architecture to support mixed memory models, specifically enabling efficient Mamba state management alongside traditional attention mechanisms. It defines TOKEN_PROPORTIONAL and REQUEST_CONSTANT memory models, implements a new CompactBlockPool for fixed-size request states, and updates the KVCacheCoordinator and KVCacheManager to be pool-aware. The changes also include extensive updates to configuration utilities, worker reshape logic, and a comprehensive suite of new tests. Feedback highlights potential issues in vllm/v1/core/kv_cache_utils.py, including a possible division-by-zero error during block normalization and a logic flaw where memory reservation checks might incorrectly fail during CUDA graph profiling if no token-proportional groups are present.
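Both risks called out above admit simple guards; this is a hedged sketch with a hypothetical helper name, not the PR's actual fix:

```python
def blocks_for_token_proportional_groups(
    available_bytes: int, page_size_bytes: int
) -> int:
    """Normalize available memory into blocks, tolerating zero such groups."""
    if page_size_bytes <= 0:
        # No token-proportional groups (e.g. during cudagraph profiling of a
        # purely request-constant config): nothing to normalize, and dividing
        # here would raise ZeroDivisionError.
        return 0
    return available_bytes // page_size_bytes
```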

chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: efccac882e

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Co-authored-by: OpenAI Codex <codex@openai.com>

Signed-off-by: lesj0610 <lesj0610@users.noreply.github.com>
lesj0610 changed the title from "[BugFix] Fix KV cache sizing and allocation for hybrid Mamba/attention models" to "[Bugfix] Fix hybrid Mamba KV cache allocation with request-constant pools" on May 2, 2026
lesj0610 and others added 2 commits May 2, 2026 19:23
Keep the existing fail-closed behavior for hybrid specs whose page sizes cannot be aligned by block-size adjustment.

Co-authored-by: OpenAI Codex <codex@openai.com>
Signed-off-by: lesj0610 <lesj0610@users.noreply.github.com>
Validate request-constant pool capacity with max_num_seqs instead of rejecting full CUDA graph capture outright.

Co-authored-by: OpenAI Codex <codex@openai.com>

Signed-off-by: lesj0610 <lesj0610@users.noreply.github.com>
Signed-off-by: lesj0610 <lesj0610@users.noreply.github.com>
mergify bot added the kv-connector label on May 3, 2026

naroam1 commented May 7, 2026

Hi maintainers — chiming in as a downstream user evaluating Qwen3.5 migration.

This PR addresses a real production concern. Per #37121, vLLM is over-allocating ~7x KV cache memory for hybrid Mamba/attention models like Qwen3.5: the profiler treats Mamba/GDN groups (which have request-constant O(1) state) the same as attention groups (token-proportional O(n) state). On Qwen3.5-4B-AWQ this means losing ~86% of the KV budget to padding.

For our planned production stack (Qwen3.5-4B, Qwen3.5-9B-AWQ, Qwen3.5-27B-AWQ), this would substantially affect --gpu-memory-utilization headroom and effective concurrency.

The PR is well-scoped (request-constant vs token-proportional split, additive code path keeping mamba_cache_mode="all" shared-pool behavior unchanged). Could a code-owner — @tdoublep @tomeras91 per CODEOWNERS for vllm/model_executor/layers/mamba — take a look? Happy to validate against our production stack post-merge.

Thanks @lesj0610 for the work!
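A quick consistency check of the reported figures (illustrative arithmetic only, no new measurements):

```python
# ~7x over-allocation (#37121) implies only ~1/7 of the KV budget is useful.
overallocation = 7.0
useful = 1.0 / overallocation  # ~0.14
wasted = 1.0 - useful          # ~0.86, matching the ~86% figure above
print(f"useful: {useful:.0%}, wasted: {wasted:.0%}")
```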

Signed-off-by: lesj0610 <lesj0610@users.noreply.github.com>

# Conflicts:
#	vllm/v1/core/sched/scheduler.py
#	vllm/v1/worker/gpu/attn_utils.py

Labels

bug, intel-gpu, kv-connector, v1
