[Core] Per-group BlockPool for hybrid Mamba/attention models #39031
arbi-dev wants to merge 1 commit into vllm-project:main
Conversation
Code Review
This pull request introduces per-group BlockPool allocation for hybrid models (e.g., Mamba and Attention), allowing O(1) groups to have fixed-size pools while O(n) groups utilize the remaining memory to maximize token capacity. It also adds support for non-divisible page sizes via padding and includes comprehensive tests for Qwen3.5 and Nemotron architectures. However, two critical issues were identified: the block eviction logic incorrectly broadcasts non-unique block IDs across all pools, which will cause unintended cache thrashing, and the block size adjustment logic for non-divisible pages lacks a safety check, potentially leading to out-of-bounds memory access when the forced minimum block size exceeds the padded page capacity.
```python
for pool in self._unique_pools:
    # Only evict block IDs that are valid for this pool
    valid_ids = {bid for bid in block_ids if bid < pool.num_gpu_blocks}
    if valid_ids:
        pool.evict_blocks(valid_ids)
```
The current implementation of evict_blocks broadcasts block IDs to all unique pools. Since block IDs are indices starting from 0 within each BlockPool, they are not unique across pools. If an ID (e.g., 5) is present in block_ids, it will be evicted from every pool where the ID is within bounds, even if that specific pool's block was not intended for eviction. This will cause unnecessary prefix cache thrashing and performance degradation in hybrid models with multiple pools. To fix this, block IDs should either be made globally unique across the coordinator, or the eviction interface should be updated to specify the target pool/group.
Fixed in 7055d32. When multiple pools exist, eviction now only targets blocks that are actually cached in each pool (checks block_hash is not None to confirm ownership). Single-pool path unchanged.
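The fix described above can be sketched as follows. This is a minimal illustration with hypothetical `Block`/`BlockPool` classes (not vLLM's actual API): before evicting an ID from a pool, the coordinator confirms the pool actually caches that block (`block_hash is not None`), so an ID intended for one pool no longer thrashes the others.

```python
class Block:
    def __init__(self, block_id):
        self.block_id = block_id
        self.block_hash = None  # set only when the block holds cached content

class BlockPool:
    def __init__(self, num_gpu_blocks):
        self.num_gpu_blocks = num_gpu_blocks
        self.blocks = [Block(i) for i in range(num_gpu_blocks)]

    def evict_blocks(self, ids):
        for i in ids:
            self.blocks[i].block_hash = None

def evict(pools, block_ids):
    for pool in pools:
        # Ownership check: skip IDs this pool has not cached, so a block ID
        # meant for one pool cannot evict an unrelated block in another pool.
        owned = {bid for bid in block_ids
                 if bid < pool.num_gpu_blocks
                 and pool.blocks[bid].block_hash is not None}
        if owned:
            pool.evict_blocks(owned)

attn_pool, mamba_pool = BlockPool(8), BlockPool(8)
attn_pool.blocks[5].block_hash = "h5"   # only the attention pool caches ID 5
evict([attn_pool, mamba_pool], {5})     # mamba_pool's block 5 is left alone
```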
```python
new_block_size = (max_page_size // per_token // 16) * 16
if new_block_size < 16:
    new_block_size = 16
```
The logic for adjusting block_size when page sizes are not divisible can lead to memory corruption or crashes. If max_page_size // per_token is less than 16, new_block_size is forced to 16. However, the resulting required page size (16 * per_token) will exceed max_page_size. Since the tensor is allocated with page_size_padded=max_page_size, the model runner will perform out-of-bounds accesses when indexing tokens beyond what fits in max_page_size. An explicit check should be added to ensure new_block_size * per_token <= max_page_size, or the unification should fail if this constraint cannot be met.
Fixed in 7055d32. Added a safety check: if new_block_size * per_token > max_page_size, we raise NotImplementedError instead of silently overflowing. Note that this code path only runs when per-group split is NOT active (fallback for non-hybrid models with non-divisible pages). When per-group is active, we skip unification entirely and each group keeps its natural page size.
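A sketch of the adjusted rounding logic with the safety check described above (the function name and error message are illustrative, not vLLM's):

```python
def unify_block_size(max_page_size: int, per_token: int) -> int:
    # Largest multiple of 16 tokens whose page fits within max_page_size.
    new_block_size = (max_page_size // per_token // 16) * 16
    if new_block_size < 16:
        new_block_size = 16
    # Safety check: forcing the 16-token minimum must not overflow the
    # padded page, or the model runner would index out of bounds.
    if new_block_size * per_token > max_page_size:
        raise NotImplementedError(
            "cannot unify page sizes: minimum block size exceeds the page")
    return new_block_size

print(unify_block_size(1 << 20, 1024))  # 1 MiB page, 1 KiB/token -> 1024
```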
Hybrid models like Qwen3.5 (GDN + attention) and Nemotron-3-Nano (Mamba + attention) crash on upstream with NotImplementedError because attention and Mamba page sizes are not integer multiples of each other. This PR fixes the crash and dramatically improves KV cache token capacity by giving each KV cache group its own BlockPool with its natural page size:

- Skip page size unification for O(1)+O(n) hybrid groups — attention keeps its natural small page size (e.g., 32KB) instead of being inflated to match Mamba pages (e.g., 1MB)
- Per-group BlockPool — O(1) groups (Mamba/GDN in none/align mode) get a small fixed pool (max_seqs blocks), O(n) groups (attention) get the remaining memory
- Per-layer tensors — each layer gets its own allocation at its natural page size

Results on RTX 4090 (24GB):
- Qwen3.5-0.8B: 1,094,912 token capacity, 209 tok/s
- Nemotron-3-Nano-4B: 100,768 token capacity, 298 tok/s
- Both crash on upstream main (NotImplementedError)

Backward compatible: pure attention, pure Mamba, MOE, and sliding window models are unaffected — the per-group path only activates when MambaSpec + non-MambaSpec groups coexist with mamba_cache_mode != "all".

Test plan:
- 20 unit tests covering both Qwen3.5 and Nemotron architectures
- E2E verified: model load + inference on both models
- Existing test_kv_cache_utils.py tests pass (48/48)

Signed-off-by: arbi-dev <dmitri.evseev@arbi.city>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
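The O(1)/O(n) split described above can be illustrated with a rough sketch. The helper and the numbers here are hypothetical, not vLLM code: the Mamba group gets a fixed pool of one block per in-flight sequence (its state does not grow with sequence length), and the attention group gets whatever memory remains at its own natural page size.

```python
def split_kv_memory(total_bytes, max_seqs, mamba_page, attn_page):
    # O(1) group: one block per in-flight sequence is enough, since
    # Mamba/GDN state is constant-size regardless of sequence length.
    mamba_blocks = max_seqs
    remaining = total_bytes - mamba_blocks * mamba_page
    if remaining <= 0:
        raise ValueError("not enough memory for the fixed Mamba pool")
    # O(n) group: attention KV grows with tokens, so it takes the rest,
    # carved into blocks at attention's natural page size.
    attn_blocks = remaining // attn_page
    return mamba_blocks, attn_blocks

# e.g. 20 GiB of KV memory, 256 sequences, 1 MiB Mamba pages, 32 KiB attn pages
print(split_kv_memory(20 << 30, 256, 1 << 20, 32 << 10))  # (256, 647168)
```

Keeping attention at its small native page size, instead of inflating it to the Mamba page size, is what recovers the large token capacity reported above.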
Force-pushed from cab0c67 to 7055d32
@arbi-dev could you provide details about the failure of Qwen3.5-0.8B? The 3 commands below work fine on B200 for me.
@vadiklyutiy Thanks for testing — you're right: stock Qwen3.5-0.8B works fine on current main.

After further investigation, the NotImplementedError does not reproduce with upstream page sizes; it only shows up with our custom per-token layout.

We hit this while developing a compressed KV cache backend (TQKV) where the per-token layout includes a 4-byte norm alongside packed quantized bytes, making the page size non-power-of-2. I'm closing this PR since it doesn't address a real problem for existing users. We'll bundle the per-group BlockPool with our TQKV backend PR when it's ready, where the motivation will be clear and testable.

Sorry for the noise, and thanks for the review feedback — it helped us understand the issue properly.
Summary
Hybrid models like Qwen3.5 (GDN + attention) and Nemotron-3-Nano (Mamba + attention) crash on current main with `NotImplementedError: The page size of the layer is not divisible by the maximum page size` because attention and Mamba/GDN page sizes are not integer multiples of each other.

This PR fixes the crash and improves KV cache token capacity by giving each KV cache group its own BlockPool with its natural page size:

- Skip page size unification for O(1)+O(n) hybrid groups — attention keeps its natural small page size (e.g., 32KB) instead of being inflated to match Mamba pages (e.g., 1MB)
- Per-group BlockPool — O(1) groups (Mamba/GDN) get a small fixed pool (max_seqs blocks), O(n) groups (attention) get the remaining memory
- Per-layer tensors — each layer gets its own allocation at its natural page size
Results
On RTX 4090 (24GB):
- Qwen3.5-0.8B: 1,094,912 token capacity, 209 tok/s
- Nemotron-3-Nano-4B: 100,768 token capacity, 298 tok/s
- Both crash on upstream main (NotImplementedError)
Backward compatibility
The per-group path only activates when MambaSpec + non-MambaSpec groups coexist with mamba_cache_mode != "all". These models are unaffected:
- Pure attention
- Pure Mamba
- MOE
- Sliding window
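The activation condition above can be sketched as a simple predicate. The class and parameter names here are illustrative marker types, not vLLM's real spec classes: the per-group path turns on only when Mamba and non-Mamba groups coexist and the Mamba cache mode is not "all".

```python
class KVCacheSpec:                 # base marker for a KV cache group's spec
    pass

class MambaSpec(KVCacheSpec):      # constant-size state: O(1) in tokens
    pass

class AttentionSpec(KVCacheSpec):  # KV grows with tokens: O(n)
    pass

def per_group_pools_active(group_specs, mamba_cache_mode):
    """True only for hybrid models outside mamba_cache_mode == 'all'."""
    has_mamba = any(isinstance(s, MambaSpec) for s in group_specs)
    has_other = any(not isinstance(s, MambaSpec) for s in group_specs)
    return has_mamba and has_other and mamba_cache_mode != "all"
```

Under this predicate a pure-attention or pure-Mamba model never takes the new path, which is what keeps the change backward compatible.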
Related PRs
This PR takes a different approach: instead of trying to unify page sizes, it gives each group type its own pool with natural page sizes, eliminating the unification problem entirely.
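For context, the divisibility constraint that hybrid models trip over can be illustrated with made-up page sizes (the function is a simplified stand-in for the upstream check, not vLLM's actual code): every group's page size must evenly divide the largest one, which fails when the sizes are not integer multiples of each other.

```python
def check_divisible(page_sizes):
    max_page = max(page_sizes)
    for ps in page_sizes:
        if max_page % ps != 0:
            raise NotImplementedError(
                "The page size of the layer is not divisible by the "
                "maximum page size")
    return max_page

check_divisible([32768, 1048576])    # 1 MiB is a multiple of 32 KiB: OK
# check_divisible([36864, 1048576])  # 36 KiB does not divide 1 MiB: raises
```

Giving each group its own pool sidesteps this check entirely, since no group's page size ever needs to divide another's.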
Test plan
- 20 unit tests covering both Qwen3.5 and Nemotron architectures
- E2E verified: model load + inference on both models
- Existing test_kv_cache_utils.py tests pass (48/48)
AI assistance was used (Claude). The submitting human has reviewed all changes and run the tests.
Signed-off-by: arbi-dev <dmitri.evseev@arbi.city>