[Core] Per-group BlockPool for hybrid Mamba/attention models #39031

Closed
arbi-dev wants to merge 1 commit into vllm-project:main from arbi-dev:per-group-blockpool

Conversation

@arbi-dev arbi-dev commented Apr 5, 2026

Summary

Hybrid models like Qwen3.5 (GDN + attention) and Nemotron-3-Nano (Mamba + attention) crash on current main with `NotImplementedError: The page size of the layer is not divisible by the maximum page size` because attention and Mamba/GDN page sizes are not integer multiples of each other.

This PR fixes the crash and improves KV cache token capacity by giving each KV cache group its own BlockPool with its natural page size:

  • Skip page size unification for O(1)+O(n) hybrid groups -- attention keeps its natural page size (e.g., 32KB) instead of being inflated to match Mamba pages (e.g., 1MB)
  • Per-group BlockPool -- O(1) groups (Mamba/GDN in none/align mode) get a small fixed pool (max_seqs blocks), O(n) groups (attention) get the remaining memory
  • Per-layer tensors -- each layer gets its own allocation at its natural page size
  • Prefix caching -- _MultiGroupPoolView handles cache lookups across separate pools
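The pool-split rule described above can be sketched with a toy sizing helper. This is purely illustrative, not the PR's code: the function name `split_pool_sizes`, the group tuples, and all byte figures are assumptions made up for the example.

```python
def split_pool_sizes(avail_bytes, groups, max_seqs):
    """groups: list of (name, page_size_bytes, is_constant) tuples.

    O(1) groups (Mamba/GDN state) need exactly one block per sequence,
    so they get a fixed pool of max_seqs blocks; O(n) groups (attention)
    share whatever memory remains, at their own natural page size.
    """
    sizes = {}
    remaining = avail_bytes
    growing = []
    for name, page, constant in groups:
        if constant:
            sizes[name] = max_seqs             # fixed pool: one block per sequence
            remaining -= max_seqs * page
        else:
            growing.append((name, page))
    if remaining < 0:
        raise ValueError("O(1) pools alone exceed the memory budget")
    per_group = remaining // max(len(growing), 1)
    for name, page in growing:
        sizes[name] = per_group // page        # blocks at this group's page size
    return sizes

# Example budget: 16 GiB of KV memory, 1 MiB Mamba state pages,
# 32 KiB attention pages, 256 max sequences (all numbers invented).
pools = split_pool_sizes(
    avail_bytes=16 * 2**30,
    groups=[("mamba", 2**20, True), ("attn", 32 * 2**10, False)],
    max_seqs=256,
)
```

The point of the split is that O(1) state never grows with context length, so reserving exactly `max_seqs` pages for it frees the rest of the budget for attention blocks.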

Results

On RTX 4090 (24GB):

| Model | Status on main | With this PR | Token capacity |
| --- | --- | --- | --- |
| Qwen3.5-0.8B (18 GDN + 6 attn) | Crashes | Works | 1,094,912 tokens @ 209 tok/s |
| Nemotron-3-Nano-4B (21 Mamba + 4 attn) | Crashes | Works | 100,768 tokens @ 298 tok/s |

Backward compatibility

The per-group path only activates when MambaSpec + non-MambaSpec groups coexist with mamba_cache_mode != "all". These models are unaffected:

  • Pure attention (Llama, Qwen2.5) -- single group, no split
  • MOE (Mixtral) -- single group (experts are in FFN, not attention)
  • Sliding window (Gemma3) -- both groups are O(n), no split
  • Pure Mamba -- single group, no split
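The activation condition above can be expressed as a small predicate. This is a hedged sketch, assuming a simplified spec object: `GroupSpec` is hypothetical, standing in for vLLM's richer per-group KV cache spec classes.

```python
from dataclasses import dataclass

@dataclass
class GroupSpec:
    name: str
    is_mamba: bool          # MambaSpec-like group vs attention/sliding-window group

def per_group_split_active(specs, mamba_cache_mode):
    """Split into per-group pools only when O(1) Mamba-style groups and
    O(n) non-Mamba groups coexist and mamba_cache_mode is not "all"."""
    has_mamba = any(s.is_mamba for s in specs)
    has_other = any(not s.is_mamba for s in specs)
    return has_mamba and has_other and mamba_cache_mode != "all"

hybrid = [GroupSpec("gdn", True), GroupSpec("attn", False)]
pure_attn = [GroupSpec("attn", False)]
```

Pure-attention, pure-Mamba, and sliding-window configurations all fail one of the two coexistence checks, so they keep the existing single-pool path.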

Related PRs

This PR takes a different approach: instead of trying to unify page sizes, it gives each group type its own pool with natural page sizes, eliminating the unification problem entirely.

Test plan

  • 20 new unit tests covering both Qwen3.5 and Nemotron architectures (test_per_group_blockpool.py)
  • Existing test_kv_cache_utils.py passes (48/48)
  • E2E: Qwen3.5-0.8B loads and produces correct output
  • E2E: Nemotron-3-Nano-4B loads and produces correct output
  • Benchmark on larger Qwen3.5-27B with TP

AI assistance was used (Claude). The submitting human has reviewed all changes and run the tests.

Signed-off-by: arbi-dev dmitri.evseev@arbi.city

Contributor

@gemini-code-assist (Bot) left a comment


Code Review

This pull request introduces per-group BlockPool allocation for hybrid models (e.g., Mamba and Attention), allowing O(1) groups to have fixed-size pools while O(n) groups utilize the remaining memory to maximize token capacity. It also adds support for non-divisible page sizes via padding and includes comprehensive tests for Qwen3.5 and Nemotron architectures. However, two critical issues were identified: the block eviction logic incorrectly broadcasts non-unique block IDs across all pools, which will cause unintended cache thrashing, and the block size adjustment logic for non-divisible pages lacks a safety check, potentially leading to out-of-bounds memory access when the forced minimum block size exceeds the padded page capacity.

Comment thread on `vllm/v1/core/kv_cache_manager.py` (outdated), lines +465 to +469:
```python
for pool in self._unique_pools:
    # Only evict block IDs that are valid for this pool
    valid_ids = {bid for bid in block_ids if bid < pool.num_gpu_blocks}
    if valid_ids:
        pool.evict_blocks(valid_ids)
```
Contributor


Severity: high

The current implementation of evict_blocks broadcasts block IDs to all unique pools. Since block IDs are indices starting from 0 within each BlockPool, they are not unique across pools. If an ID (e.g., 5) is present in block_ids, it will be evicted from every pool where the ID is within bounds, even if that specific pool's block was not intended for eviction. This will cause unnecessary prefix cache thrashing and performance degradation in hybrid models with multiple pools. To fix this, block IDs should either be made globally unique across the coordinator, or the eviction interface should be updated to specify the target pool/group.

Author


Fixed in 7055d32. When multiple pools exist, eviction now only targets blocks that are actually cached in each pool (checks block_hash is not None to confirm ownership). Single-pool path unchanged.
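A toy model of the described fix (class and attribute names here are illustrative, not vLLM's actual API): eviction consults each pool's own cache state instead of broadcasting raw IDs that are only meaningful within a single pool.

```python
class Block:
    def __init__(self):
        self.block_hash = None          # set only while the block is prefix-cached

class BlockPool:
    def __init__(self, num_blocks):
        self.blocks = [Block() for _ in range(num_blocks)]

    def owns_cached(self, bid):
        # A block ID targets this pool only if it is in range AND cached here.
        return bid < len(self.blocks) and self.blocks[bid].block_hash is not None

    def evict(self, bid):
        self.blocks[bid].block_hash = None

def evict_blocks(pools, block_ids):
    # Block IDs start at 0 inside every pool, so they are not globally
    # unique; only touch blocks the pool has actually cached.
    for pool in pools:
        for bid in block_ids:
            if pool.owns_cached(bid):
                pool.evict(bid)

pool_a, pool_b = BlockPool(8), BlockPool(4)
pool_a.blocks[1].block_hash = "hash-a1"
pool_b.blocks[1].block_hash = "hash-b1"
evict_blocks([pool_a, pool_b], {5})     # id 5 cached nowhere: a no-op in both pools
```

With the bounds-only check from the original snippet, id 5 would have been forwarded to any pool large enough to contain it, evicting unrelated cached blocks.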

Comment on lines +949 to +951
```python
new_block_size = (max_page_size // per_token // 16) * 16
if new_block_size < 16:
    new_block_size = 16
```
Contributor


Severity: high

The logic for adjusting block_size when page sizes are not divisible can lead to memory corruption or crashes. If max_page_size // per_token is less than 16, new_block_size is forced to 16. However, the resulting required page size (16 * per_token) will exceed max_page_size. Since the tensor is allocated with page_size_padded=max_page_size, the model runner will perform out-of-bounds accesses when indexing tokens beyond what fits in max_page_size. An explicit check should be added to ensure new_block_size * per_token <= max_page_size, or the unification should fail if this constraint cannot be met.

Author


Fixed in 7055d32. Added a safety check: if new_block_size * per_token > max_page_size, we raise NotImplementedError instead of silently overflowing. Note that this code path only runs when per-group split is NOT active (fallback for non-hybrid models with non-divisible pages). When per-group is active, we skip unification entirely and each group keeps its natural page size.

@arbi-dev force-pushed the per-group-blockpool branch from cab0c67 to 7055d32 on April 5, 2026 at 16:54
@vadiklyutiy vadiklyutiy self-requested a review April 6, 2026 09:40
Collaborator

vadiklyutiy commented Apr 6, 2026

@arbi-dev could you provide details about the failure of Qwen3.5-0.8B?

The three commands below

```
vllm serve Qwen/Qwen3.5-0.8B
vllm serve Qwen/Qwen3.5-0.8B --attention-backend FLASH_ATTN
vllm serve Qwen/Qwen3.5-0.8B --attention-backend TRITON_ATTN
```

all work fine on B200 for me.

Author

arbi-dev commented Apr 7, 2026

@vadiklyutiy Thanks for testing — you're right, stock `vllm serve Qwen/Qwen3.5-0.8B` works fine. I was wrong about this being a general issue.

After further investigation, the `NotImplementedError` in `unify_kv_cache_spec_page_size` only triggers with custom `kv_cache_dtype` values whose per-token byte count has prime factors incompatible with the GDN/Mamba layer page sizes (which are powers of 2). All standard dtypes (bf16, fp8, int8) produce power-of-2 page sizes and unify cleanly — no issue for stock users.

We hit this while developing a compressed KV cache backend (TQKV) where the per-token layout includes a 4-byte norm alongside packed quantized bytes, making the page size non-power-of-2.
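The arithmetic behind that failure mode can be shown with made-up layer shapes (all numbers below are illustrative, not the real models'): a standard dtype gives power-of-2 per-token bytes, while a 4-byte norm per token introduces odd prime factors that make the page sizes non-divisible.

```python
def attn_page_size(block_size, num_heads, head_dim, bytes_per_elem,
                   extra_per_token=0):
    per_token = 2 * num_heads * head_dim * bytes_per_elem + extra_per_token  # K + V
    return block_size * per_token

mamba_page = 2**20                                   # 1 MiB state page (power of 2)
bf16_page = attn_page_size(16, 8, 128, 2)            # 16 * 4096 = 65536 bytes
tqkv_page = attn_page_size(16, 8, 128, 1,
                           extra_per_token=4)        # 16 * 2052 = 32832 bytes

assert mamba_page % bf16_page == 0   # power-of-2 page: unifies cleanly
assert mamba_page % tqkv_page != 0   # 32832 = 2**6 * 3**3 * 19: cannot unify
```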

I'm closing this PR since it doesn't address a real problem for existing users. We'll bundle the per-group BlockPool with our TQKV backend PR when it's ready, where the motivation will be clear and testable.

Sorry for the noise, and thanks for the review feedback — it helped us understand the issue properly.

@arbi-dev closed this Apr 7, 2026