[Bugfix] Exclude O(1) Mamba groups from hybrid KV cache token capacity#40384
jhsmith409 wants to merge 2 commits into vllm-project:main
Conversation
Code Review
This pull request introduces the `token_capacity_kv_cache_groups` utility function to correctly identify KV cache groups that scale with sequence length, specifically handling Mamba and Attention specifications. This function is integrated into the KV cache reporting and scheduler initialization to improve the accuracy of token capacity calculations in hybrid models. I have no feedback to provide.
E2E log delta on a hybrid MoE

Applied this patch on top of our TurboQuant-hybrid test overlay (JartX#10 LCM fallback + #40074 + #39748 + #40092, all on cu130-nightly), model `RedHatAI/Qwen3.6-35B-A3B-NVFP4`. That's a ~2.5× increase in reported capacity on a 1-attention + Mamba-group(s) hybrid, matching the expected correction (the prior log was dividing the shared block pool by every group instead of just the groups that scale with sequence length). No startup-time or runtime regressions observed; completions still succeed (verified with the same 28k-token-prompt stress harness I used for #40074 and #39748). Note: our setup runs with
Hi @jhsmith409, great PR — thanks for the clean upstream version! (Small disclaimer: I'm from Ukraine and my English is still a work in progress, so I'm using AI to help with translation. Hope it reads okay!) Really nice that you extracted the filter into `token_capacity_kv_cache_groups`. Here's my A5000 TP=2 datapoint on Qwen3.6-35B-A3B-FP8:
This model has 1 attention + 1 Mamba group, so the factor on Ampere comes out to roughly 2× — consistent in direction with the 2.52× you measured on 5090 + NVFP4. I've been running the same logic as a runtime monkey-patch (Patch 9) since v5.7, based on ai-jz/vllm#1. Happy to drop it in favor of this PR the moment it lands. If it helps reviewers, I can re-run with #40384 cherry-picked cleanly on current main (my patch removed) to give an independent A5000 datapoint — just let me know.
Update — I just switched my local patch to apply your exact architecture (helper function + clean callsites) instead of my previous inline implementation, as a full mirror of your PR diff: imports of `token_capacity_kv_cache_groups` at both callsites. A5000 TP=2 clean-room numbers:
Identical behavior to my previous inline patch (which was logically equivalent), with per-context deltas ≤0.6% — well within the ~0.8%/step thermal-boost drift we see on workstation A5000s between back-to-back runs. So the design in this PR works cleanly on Ampere SM 8.6 with the FP8 hybrid model too. Happy to add a formal "Tested on A5000 TP=2" line in the PR body if you'd like, just let me know.
Hi @jhsmith409 — quick cross-reference for completeness. (Small disclaimer: I'm from Ukraine and my English is still a work in progress, so I'm using AI to help with translation. Hope it reads okay!)

While auditing this PR's reach across the codebase, I noticed there's a third site with the same bug class that this PR doesn't cover, on the worker side (the active path there is gated on a config flag). PR #37118 (@allgather, Nov 2025) already proposes a fix for that worker site, but with a different formula. Probably worth a maintainer aligning the two PRs so we don't end up with inconsistent treatment of the same bug class. Just flagging so it's visible in this thread.

Also filed #40417 about extending the helper to exclude another spec type that doesn't scale with sequence length.
Thanks for the cross-reference @Sandermage — and no worries about the translation, the writeup is perfectly clear. I use Claude to review what I write as well. I dug through #37118 and #40417 before replying. My read is that the worker site you flagged is best left to #37118, because the two PRs are answering structurally different questions even though they both touch a division over the KV cache groups: the scheduler site asks how many tokens it can actually plan for per request, while the worker site asks how large the index-able block address space must be.
On a single-attention-group hybrid like Qwen3.6-A3B the two formulas collapse to the same number, but on a multi-attention-group hybrid (Nemotron-H-style) they diverge in opposite directions: my helper would under-size the routed-experts buffer (re-introducing the #37118 bug class), and HollowMan6's full-address-space formula would over-bound the scheduler's per-token planning capacity. So I think the right outcome is to land both, scoped to their respective sites, rather than try to share a helper. Flagging this on #37118 as well so HollowMan6 sees the cross-reference.

On #40417 — agreed. Let me know which way you'd prefer to go on the #40417 fold (fold it into this PR or keep it separate).
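To make the divergence concrete, here's a back-of-the-envelope sketch (pool and block sizes invented purely for illustration; variable names are mine, not vLLM's):

```python
# Hypothetical multi-attention-group hybrid: 2 attention groups that scale
# with sequence length + 2 O(1) Mamba groups, all sharing one block pool.
num_blocks, block_size = 10_000, 16
attn_groups, all_groups = 2, 4

# Current main: divides the shared pool by every group.
current = num_blocks // all_groups * block_size      # 40,000 tokens

# This PR's scheduler/report formula: each token occupies a slot in every
# token-scaling group, so divide by their count only.
planning = num_blocks // attn_groups * block_size    # 80,000 tokens

# Full-address-space bound (#37118's worker-site concern): any block in
# the pool may be indexed regardless of which group owns it, no division.
address_space = num_blocks * block_size              # 160,000 tokens

print(current, planning, address_space)              # 40000 80000 160000
```

With a single attention group, `planning` and `address_space` collapse to the same number, which is why a single-attention hybrid like Qwen3.6-A3B can't distinguish the two formulas.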
Ah, thanks for the detailed breakdown @jhsmith409 — you're right and I was wrong to frame these as "same bug, different sites should align formulas". The scheduler's per-token divisor and the routed-experts side buffer are structurally different concerns: O(1) groups should be excluded from the first (per-token planning capacity) but must be included in the second (index-able address space). On a Nemotron-H-style multi-attention-group hybrid the two formulas correctly diverge. Updating my mental model. Landing both PRs scoped to their respective sites is clearly the right outcome — I'll add a note on #37118 making the same correction so the record is clean. On #40417 — great.
Nice work! On 35B-A3B, GPU KV cache size improved from 288,288 tokens to 1,157,376 tokens on 2× 3090. Decode speed also gets a little higher under high concurrency with large context lengths. @mgoin Could you please take a look at this PR when you have a moment?
On hybrid attention + Mamba models (Qwen3-Next, Qwen3.5/3.6 MoE hybrids, RecurrentGemma, Jamba, Zamba2, Nemotron-H, …), `kv_cache_config.kv_cache_groups` contains one attention group plus one (or several) Mamba groups. In the default `mamba_cache_mode='none'` (and `'align'`) the Mamba state is O(1) per request and pre-reserves a fixed number of blocks; only the attention groups scale with sequence length.

Both `_report_kv_cache_config()` and `Scheduler.__init__()` currently compute per-token capacity as `num_blocks // len(all_groups) * block_size`, so each extra Mamba group deflates the reported capacity and the scheduler's `max_num_kv_tokens` (used to size the `routed_experts` buffer for MoE). A Qwen3.6-35B-A3B hybrid run on cu130-nightly with turboquant_k8v4 KV cache and max_model_len=8192 reports "GPU KV cache size: 146,704 tokens" today; with this fix the attention group alone owns the shared block pool, so the reported count matches what the scheduler can actually allocate.

Factors the filter into a small helper (`token_capacity_kv_cache_groups`) and uses it in both sites. Falls back to the current behavior if the filter would produce an empty list, preserving dense-model and Mamba-only paths.

Credit to @Sandermage (ref vllm-project#40124 patch 9, based on ai-jz/vllm#1) for identifying the bug and the filter design.

Co-authored-by: Sandermage <sandermage@users.noreply.github.com>
Co-authored-by: Claude <noreply@anthropic.com>
Signed-off-by: Jim Smith <jhsmith0@me.com>
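For readers skimming the thread, a minimal sketch of the helper this commit message describes. The import path, the location of `mamba_cache_mode` on the config, and the group attribute names are assumptions based on this discussion, not the exact vLLM tree:

```python
# Sketch only — import path and config attribute names are assumptions.
from vllm.v1.kv_cache_interface import AttentionSpec, MambaSpec


def token_capacity_kv_cache_groups(vllm_config, kv_cache_config):
    """Return only the KV cache groups whose block usage scales with
    sequence length: attention groups always, Mamba groups only when
    mamba_cache_mode == 'all' (under 'none'/'align' they are O(1))."""
    mamba_mode = getattr(vllm_config.cache_config, "mamba_cache_mode", "none")
    groups = [
        g for g in kv_cache_config.kv_cache_groups
        if isinstance(g.kv_cache_spec, AttentionSpec)
        or (isinstance(g.kv_cache_spec, MambaSpec) and mamba_mode == "all")
    ]
    # Fall back to the old behavior if the filter would empty the list
    # (e.g. a Mamba-only model under 'none'), preserving those paths.
    return groups or kv_cache_config.kv_cache_groups
```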
Covers the six meaningful shapes of kv_cache_config the helper sees:

- dense (all AttentionSpec groups) → unchanged
- hybrid with mamba_cache_mode='none' or 'align' → Mamba groups dropped
- hybrid with mamba_cache_mode='all' → Mamba kept (scales with seq len)
- Mamba-only model under 'none' → filter would empty, fallback kicks in
- 1 attn + 3 Mamba (Nemotron-H shape) → single-group result
- empty config → empty list (no IndexError)

Signed-off-by: Jim Smith <jhsmith0@me.com>
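A sketch of how such a parametrized test might look. The spec classes and the helper are inlined as stand-ins so the snippet runs standalone; the real test would import the genuine vLLM types and helper instead:

```python
import pytest
from types import SimpleNamespace

# Stand-ins for vllm.v1.kv_cache_interface.AttentionSpec / MambaSpec.
class AttentionSpec: ...
class MambaSpec: ...

def token_capacity_kv_cache_groups(vllm_config, kv_cache_config):
    # Same logic as the helper sketch above, rebound to the stand-ins.
    mode = vllm_config.cache_config.mamba_cache_mode
    groups = [
        g for g in kv_cache_config.kv_cache_groups
        if isinstance(g.kv_cache_spec, AttentionSpec)
        or (isinstance(g.kv_cache_spec, MambaSpec) and mode == "all")
    ]
    return groups or kv_cache_config.kv_cache_groups

def _cfg(*specs):
    return SimpleNamespace(
        kv_cache_groups=[SimpleNamespace(kv_cache_spec=s) for s in specs])

@pytest.mark.parametrize(
    "mode, specs, expected",
    [
        ("none", (AttentionSpec(), AttentionSpec()), 2),       # dense: unchanged
        ("none", (AttentionSpec(), MambaSpec()), 1),           # hybrid: Mamba dropped
        ("align", (AttentionSpec(), MambaSpec()), 1),
        ("all", (AttentionSpec(), MambaSpec()), 2),            # Mamba kept
        ("none", (MambaSpec(),), 1),                           # Mamba-only: fallback
        ("none", (AttentionSpec(),) + (MambaSpec(),) * 3, 1),  # Nemotron-H shape
        ("none", (), 0),                                       # empty config
    ],
)
def test_token_capacity_groups(mode, specs, expected):
    vllm_config = SimpleNamespace(
        cache_config=SimpleNamespace(mamba_cache_mode=mode))
    assert len(token_capacity_kv_cache_groups(vllm_config, _cfg(*specs))) == expected
```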
Summary
On hybrid attention + Mamba models (Qwen3-Next, Qwen3.5/3.6 MoE hybrids, RecurrentGemma, Jamba, Zamba2, Nemotron-H, …), the reported GPU KV cache token capacity and the scheduler's `max_num_kv_tokens` are deflated by the number of Mamba groups, which in the default `mamba_cache_mode='none'` (and `'align'`) pre-reserve a fixed number of blocks and do not scale with sequence length.

Both `_report_kv_cache_config()` (vllm/v1/core/kv_cache_utils.py) and `Scheduler.__init__` (vllm/v1/core/sched/scheduler.py) currently compute per-token capacity as `num_blocks // len(all_groups) * block_size`.

For a typical hybrid with one attention group and N Mamba groups, that's off by a factor of 1 + N: a 2× understatement for the common case, 4× for Nemotron-H-style 1 attn + 3 mamba groups. The `max_num_kv_tokens` number is what sizes the `routed_experts` buffer for MoE and what the scheduler believes is its budget; getting this wrong shows up as (a) misleading boot-time logs and (b) over-conservative scheduling of concurrent requests on the very models (hybrid MoE) where extra concurrency is the whole point.
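A toy calculation of that factor (block-pool numbers invented for illustration):

```python
# 1 attention group + 3 O(1) Mamba groups (the Nemotron-H shape) sharing
# a pool of 10,000 blocks of 16 tokens; numbers are illustrative only.
num_blocks, block_size = 10_000, 16

reported_today = num_blocks // 4 * block_size  # all 4 groups ->  40,000 tokens
reported_fixed = num_blocks // 1 * block_size  # attn only    -> 160,000 tokens

assert reported_fixed == 4 * reported_today    # the 4x understatement above
```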
Fix

Adds `token_capacity_kv_cache_groups(vllm_config, kv_cache_config)` in `kv_cache_utils.py` that returns only the groups that scale with sequence length (attention always, Mamba only when `mamba_cache_mode == 'all'`). Uses it in both `_report_kv_cache_config` and `Scheduler.__init__`.

The helper is exported (no leading underscore) because `scheduler.py` imports it; if the maintainers would rather keep it scheduler-local or inline it, happy to rewrite.
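A hypothetical sketch of the shared computation both callsites end up doing with the helper sketched earlier (the wrapper name and variables are mine; the real code inlines this differently at each site):

```python
def compute_token_capacity(vllm_config, kv_cache_config, block_size):
    # Divide the shared block pool only by the token-scaling groups.
    groups = token_capacity_kv_cache_groups(vllm_config, kv_cache_config)
    if not groups:  # empty config: avoid dividing by zero
        return 0
    return kv_cache_config.num_blocks // len(groups) * block_size
```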
Why this is not duplicating an existing PR

Checked on 2026-04-20:
- `gh pr list --repo vllm-project/vllm --state open --search "max_num_kv_tokens"` → none.
- `gh pr list --repo vllm-project/vllm --state open --search "mamba kv_cache_groups token"` → none.

Test plan + results
No existing test exercises the filter directly; I'll follow up with a small unit test in a separate commit once PR feedback lands (or now, if reviewers prefer). Syntax check (`python -m py_compile`) is clean; `ruff check` / `ruff format` were not available in my local sandbox, but the edits follow the surrounding style.

End-to-end verification on our runtime stack (cu130-nightly + TurboQuant hybrid overlay + RedHatAI/Qwen3.6-35B-A3B-NVFP4, `turboquant_k8v4`, `max_model_len=8192`, `max_num_seqs=8`, `--gpu-memory-utilization=0.85`, torch.compile + cudagraph):

`INFO kv_cache_utils.py:1363] GPU KV cache size: 143,936 tokens`

AI-assist disclosure (per AGENTS.md)
Change was drafted with help from Claude (Anthropic); the human submitter reviewed every line end-to-end and understands the hybrid KV cache group semantics. Original bug identification and filter design credit to @Sandermage — ref his issue #40124 tracking table (patch 9) and the ai-jz/vllm#1 approach he references. `Co-authored-by:` trailers are included.