[Core] Per-group BlockPool for hybrid Mamba/attention models #39031

Closed
arbi-dev wants to merge 1 commit into vllm-project:main from arbi-dev:per-group-blockpool

Conversation

@arbi-dev arbi-dev commented Apr 5, 2026

Summary

Hybrid models like Qwen3.5 (GDN + attention) and Nemotron-3-Nano (Mamba + attention) crash on current main with `NotImplementedError: The page size of the layer is not divisible by the maximum page size` because attention and Mamba/GDN page sizes are not integer multiples of each other.

This PR fixes the crash and improves KV cache token capacity by giving each KV cache group its own BlockPool with its natural page size:

  • Skip page size unification for O(1)+O(n) hybrid groups -- attention keeps its natural page size (e.g., 32KB) instead of being inflated to match Mamba pages (e.g., 1MB)
  • Per-group BlockPool -- O(1) groups (Mamba/GDN in none/align mode) get a small fixed pool (max_seqs blocks), O(n) groups (attention) get the remaining memory
  • Per-layer tensors -- each layer gets its own allocation at its natural page size
  • Prefix caching -- _MultiGroupPoolView handles cache lookups across separate pools
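The pool-split rule described above can be sketched with a toy sizing helper. This is purely illustrative, not the PR's code: the function name `split_pool_sizes`, the group tuples, and all byte figures are assumptions made up for the example.

```python
def split_pool_sizes(avail_bytes, groups, max_seqs):
    """groups: list of (name, page_size_bytes, is_constant) tuples.

    O(1) groups (Mamba/GDN state) need exactly one block per sequence,
    so they get a fixed pool of max_seqs blocks; O(n) groups (attention)
    share whatever memory remains, at their own natural page size.
    """
    sizes = {}
    remaining = avail_bytes
    growing = []
    for name, page, constant in groups:
        if constant:
            sizes[name] = max_seqs             # fixed pool: one block per sequence
            remaining -= max_seqs * page
        else:
            growing.append((name, page))
    if remaining < 0:
        raise ValueError("O(1) pools alone exceed the memory budget")
    per_group = remaining // max(len(growing), 1)
    for name, page in growing:
        sizes[name] = per_group // page        # blocks at this group's page size
    return sizes

# Example budget: 16 GiB of KV memory, 1 MiB Mamba state pages,
# 32 KiB attention pages, 256 max sequences (all numbers invented).
pools = split_pool_sizes(
    avail_bytes=16 * 2**30,
    groups=[("mamba", 2**20, True), ("attn", 32 * 2**10, False)],
    max_seqs=256,
)
```

The point of the split is that O(1) state never grows with context length, so reserving exactly `max_seqs` pages for it frees the rest of the budget for attention blocks.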

Results

On RTX 4090 (24GB):

| Model | Status on main | With this PR | Token capacity |
| --- | --- | --- | --- |
| Qwen3.5-0.8B (18 GDN + 6 attn) | Crashes | Works | 1,094,912 tokens @ 209 tok/s |
| Nemotron-3-Nano-4B (21 Mamba + 4 attn) | Crashes | Works | 100,768 tokens @ 298 tok/s |

Backward compatibility

The per-group path only activates when MambaSpec + non-MambaSpec groups coexist with mamba_cache_mode != "all". These models are unaffected:

  • Pure attention (Llama, Qwen2.5) -- single group, no split
  • MOE (Mixtral) -- single group (experts are in FFN, not attention)
  • Sliding window (Gemma3) -- both groups are O(n), no split
  • Pure Mamba -- single group, no split
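The activation condition above can be expressed as a small predicate. This is a hedged sketch, assuming a simplified spec object: `GroupSpec` is hypothetical, standing in for vLLM's richer per-group KV cache spec classes.

```python
from dataclasses import dataclass

@dataclass
class GroupSpec:
    name: str
    is_mamba: bool          # MambaSpec-like group vs attention/sliding-window group

def per_group_split_active(specs, mamba_cache_mode):
    """Split into per-group pools only when O(1) Mamba-style groups and
    O(n) non-Mamba groups coexist and mamba_cache_mode is not "all"."""
    has_mamba = any(s.is_mamba for s in specs)
    has_other = any(not s.is_mamba for s in specs)
    return has_mamba and has_other and mamba_cache_mode != "all"

hybrid = [GroupSpec("gdn", True), GroupSpec("attn", False)]
pure_attn = [GroupSpec("attn", False)]
```

Pure-attention, pure-Mamba, and sliding-window configurations all fail one of the two coexistence checks, so they keep the existing single-pool path.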

Related PRs

This PR takes a different approach: instead of trying to unify page sizes, it gives each group type its own pool with natural page sizes, eliminating the unification problem entirely.

Test plan

  • 20 new unit tests covering both Qwen3.5 and Nemotron architectures (test_per_group_blockpool.py)
  • Existing test_kv_cache_utils.py passes (48/48)
  • E2E: Qwen3.5-0.8B loads and produces correct output
  • E2E: Nemotron-3-Nano-4B loads and produces correct output
  • Benchmark on larger Qwen3.5-27B with TP

AI assistance was used (Claude). The submitting human has reviewed all changes and run the tests.

Signed-off-by: arbi-dev dmitri.evseev@arbi.city

Contributor

@gemini-code-assist (Bot) left a comment


Code Review

This pull request introduces per-group BlockPool allocation for hybrid models (e.g., Mamba and Attention), allowing O(1) groups to have fixed-size pools while O(n) groups utilize the remaining memory to maximize token capacity. It also adds support for non-divisible page sizes via padding and includes comprehensive tests for Qwen3.5 and Nemotron architectures. However, two critical issues were identified: the block eviction logic incorrectly broadcasts non-unique block IDs across all pools, which will cause unintended cache thrashing, and the block size adjustment logic for non-divisible pages lacks a safety check, potentially leading to out-of-bounds memory access when the forced minimum block size exceeds the padded page capacity.

Comment thread on `vllm/v1/core/kv_cache_manager.py` (outdated), lines +465 to +469:
```python
for pool in self._unique_pools:
    # Only evict block IDs that are valid for this pool
    valid_ids = {bid for bid in block_ids if bid < pool.num_gpu_blocks}
    if valid_ids:
        pool.evict_blocks(valid_ids)
```
Contributor


Severity: high

The current implementation of evict_blocks broadcasts block IDs to all unique pools. Since block IDs are indices starting from 0 within each BlockPool, they are not unique across pools. If an ID (e.g., 5) is present in block_ids, it will be evicted from every pool where the ID is within bounds, even if that specific pool's block was not intended for eviction. This will cause unnecessary prefix cache thrashing and performance degradation in hybrid models with multiple pools. To fix this, block IDs should either be made globally unique across the coordinator, or the eviction interface should be updated to specify the target pool/group.

Author


Fixed in 7055d32. When multiple pools exist, eviction now only targets blocks that are actually cached in each pool (checks block_hash is not None to confirm ownership). Single-pool path unchanged.
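A toy model of the described fix (class and attribute names here are illustrative, not vLLM's actual API): eviction consults each pool's own cache state instead of broadcasting raw IDs that are only meaningful within a single pool.

```python
class Block:
    def __init__(self):
        self.block_hash = None          # set only while the block is prefix-cached

class BlockPool:
    def __init__(self, num_blocks):
        self.blocks = [Block() for _ in range(num_blocks)]

    def owns_cached(self, bid):
        # A block ID targets this pool only if it is in range AND cached here.
        return bid < len(self.blocks) and self.blocks[bid].block_hash is not None

    def evict(self, bid):
        self.blocks[bid].block_hash = None

def evict_blocks(pools, block_ids):
    # Block IDs start at 0 inside every pool, so they are not globally
    # unique; only touch blocks the pool has actually cached.
    for pool in pools:
        for bid in block_ids:
            if pool.owns_cached(bid):
                pool.evict(bid)

pool_a, pool_b = BlockPool(8), BlockPool(4)
pool_a.blocks[1].block_hash = "hash-a1"
pool_b.blocks[1].block_hash = "hash-b1"
evict_blocks([pool_a, pool_b], {5})     # id 5 cached nowhere: a no-op in both pools
```

With the bounds-only check from the original snippet, id 5 would have been forwarded to any pool large enough to contain it, evicting unrelated cached blocks.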

Comment on lines +949 to +951
```python
new_block_size = (max_page_size // per_token // 16) * 16
if new_block_size < 16:
    new_block_size = 16
```
Contributor


Severity: high

The logic for adjusting block_size when page sizes are not divisible can lead to memory corruption or crashes. If max_page_size // per_token is less than 16, new_block_size is forced to 16. However, the resulting required page size (16 * per_token) will exceed max_page_size. Since the tensor is allocated with page_size_padded=max_page_size, the model runner will perform out-of-bounds accesses when indexing tokens beyond what fits in max_page_size. An explicit check should be added to ensure new_block_size * per_token <= max_page_size, or the unification should fail if this constraint cannot be met.

Author


Fixed in 7055d32. Added a safety check: if new_block_size * per_token > max_page_size, we raise NotImplementedError instead of silently overflowing. Note that this code path only runs when per-group split is NOT active (fallback for non-hybrid models with non-divisible pages). When per-group is active, we skip unification entirely and each group keeps its natural page size.

@arbi-dev force-pushed the per-group-blockpool branch from cab0c67 to 7055d32 on April 5, 2026 at 16:54
@vadiklyutiy vadiklyutiy self-requested a review April 6, 2026 09:40
Collaborator

vadiklyutiy commented Apr 6, 2026

@arbi-dev could you provide details about the failure of Qwen3.5-0.8B?

The three commands below

```
vllm serve Qwen/Qwen3.5-0.8B
vllm serve Qwen/Qwen3.5-0.8B --attention-backend FLASH_ATTN
vllm serve Qwen/Qwen3.5-0.8B --attention-backend TRITON_ATTN
```

all work fine on B200 for me.

Author

arbi-dev commented Apr 7, 2026

@vadiklyutiy Thanks for testing — you're right, stock `vllm serve Qwen/Qwen3.5-0.8B` works fine. I was wrong about this being a general issue.

After further investigation, the `NotImplementedError` in `unify_kv_cache_spec_page_size` only triggers with custom `kv_cache_dtype` values whose per-token byte count has prime factors incompatible with the GDN/Mamba layer page sizes (which are powers of 2). All standard dtypes (bf16, fp8, int8) produce power-of-2 page sizes and unify cleanly — no issue for stock users.

We hit this while developing a compressed KV cache backend (TQKV) where the per-token layout includes a 4-byte norm alongside packed quantized bytes, making the page size non-power-of-2.
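The arithmetic behind that failure mode can be shown with made-up layer shapes (all numbers below are illustrative, not the real models'): a standard dtype gives power-of-2 per-token bytes, while a 4-byte norm per token introduces odd prime factors that make the page sizes non-divisible.

```python
def attn_page_size(block_size, num_heads, head_dim, bytes_per_elem,
                   extra_per_token=0):
    per_token = 2 * num_heads * head_dim * bytes_per_elem + extra_per_token  # K + V
    return block_size * per_token

mamba_page = 2**20                                   # 1 MiB state page (power of 2)
bf16_page = attn_page_size(16, 8, 128, 2)            # 16 * 4096 = 65536 bytes
tqkv_page = attn_page_size(16, 8, 128, 1,
                           extra_per_token=4)        # 16 * 2052 = 32832 bytes

assert mamba_page % bf16_page == 0   # power-of-2 page: unifies cleanly
assert mamba_page % tqkv_page != 0   # 32832 = 2**6 * 3**3 * 19: cannot unify
```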

I'm closing this PR since it doesn't address a real problem for existing users. We'll bundle the per-group BlockPool with our TQKV backend PR when it's ready, where the motivation will be clear and testable.

Sorry for the noise, and thanks for the review feedback — it helped us understand the issue properly.

@arbi-dev closed this Apr 7, 2026