Skip to content

[PD][Feature] Add KV consumer partial-group caching for hybrid Mamba models#42524

Open
underfituu wants to merge 9 commits into
vllm-project:mainfrom
underfituu:feat/kv-consumer-hybrid-mamba
Open

[PD][Feature] Add KV consumer partial-group caching for hybrid Mamba models#42524
underfituu wants to merge 9 commits into
vllm-project:mainfrom
underfituu:feat/kv-consumer-hybrid-mamba

Conversation

@underfituu

@underfituu underfituu commented May 13, 2026

Copy link
Copy Markdown
Contributor

What this PR does

This PR fixes the 0% D-side prefix-cache hit rate issue for hybrid Mamba models under Prefill-Decode (PD) disaggregation with align mode, as reported in #42547, by excluding the Mamba group from D-side KV cache hit rate accounting.

Why we need it

For PD remote-prefill requests, vLLM allocates D-side KV slots based on the computed cache hits before pulling the KV cache from the P-side. The current hybrid coordinator requires a common hit length across all cache groups. As reported in #42554, since the connector only transfers the running state in the Mamba group, the D-side hit rate drops to 0%.

As a result, the D-side treats the reusable attention KV as missing, allocates extra KV slots, and pulls the same prefix KV cache again. This increases block pressure and limits decode concurrency, leading to severe performance degradation.

Design

We introduce a per-request flag, skip_mamba_align, in the kv_cache_manager derived from request.kv_transfer_params. When enabled, the hit-length computation bypasses the Mamba group, allowing full attention prefix blocks to be successfully reused during the decode phase.

Test

  • Model: Qwen3.5-35B-A3B (Mamba hybrid model)
  • Hardware: 2 × NVIDIA H20
  • Transport: NIXL (NixlConnector, kv_role=kv_both, kv_parallel_size=2)
  • Proxy: tests/v1/kv_connector/nixl_integration/toy_proxy_server.py
  • Server Flags (P and D sides):
  --enforce-eager
  --mamba-cache-mode align
  --max-num-batched-tokens 16384
  --max-num-seqs 256
  --block-size 128
  --no-disable-hybrid-kv-cache-manager
  --max-model-len 24576
  --async-scheduling
  --gpu-memory-utilization 0.90
  • Prefix caching set:

    • P side: --enable-prefix-caching
    • D side: toggled --enable-prefix-caching (ON, with PR) vs --no-enable-prefix-caching (OFF, baseline)
  • vllm bench serve:

  --dataset-name prefix_repetition
  --prefix-repetition-output-len 1024

Results

OFF: D-side prefix caching disabled.
ON: D-side prefix caching enabled with this PR.
P-side prefix caching is enabled in both cases.

prefix_len suffix_len num_prefix num_prompts concurrency OFF TTFT (ms) ON TTFT (ms) ΔTTFT OFF Output TPS ON Output TPS ΔTPS
16,384 1,024 2 512 256 278,653 47,264 -83.0% 401.4 804.4 +100.4%
16,384 1,024 4 512 256 302,969 45,333 -85.0% 399.1 713.9 +78.9%
16,384 1,024 8 512 256 301,226 63,055 -79.1% 391.7 748.8 +91.2%
16,384 1,024 2 256 48 10,801 2,953 -72.7% 429.8 436.5 +1.6%
4,096 1,024 2 512 256 103,409 64,204 -37.9% 891.4 939.8 +5.4%

@claude claude Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Claude Code Review

This pull request is from a fork — automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.

@github-actions

Copy link
Copy Markdown

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

PRs do not trigger a full CI run by default. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either: Add ready label to the PR or enable auto-merge.

If you have any questions, please reach out to us on Slack at https://slack.vllm.ai.

Agent Guidelines

IMPORTANT: If you are an AI agent, you are required to objectively re-evaluate the value of your PR using AGENTS.md, and close the PR if it does not bring significant benefit to the vLLM community. Failure to do so may result in an immediate ban.

🚀

@mergify mergify Bot added the v1 label May 13, 2026

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces support for KV consumer partial-group caching in disaggregated inference, ensuring Mamba layers do not interfere with prefix-caching block alignment. It updates the KV cache coordinator, manager, and scheduler to handle VllmConfig and adjusts logic for KV consumers. The reviewer suggested centralizing the KV consumer check by adding a property to VllmConfig to reduce code duplication.

Comment thread vllm/v1/core/kv_cache_utils.py Outdated
Comment on lines +611 to +615
enable_kv_consumer_partial_group_caching = (
connector_enabled
and vllm_config.kv_transfer_config.is_kv_consumer
and cache_config.enable_prefix_caching
)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The logic to determine if the instance is a KV consumer is duplicated in multiple places within this pull request. This increases maintenance overhead and risk of inconsistencies.

Specifically, the check vllm_config.kv_transfer_config is not None and vllm_config.kv_transfer_config.is_kv_consumer (or a variation of it) is present in:

  • vllm/model_executor/models/config.py
  • vllm/v1/core/kv_cache_coordinator.py
  • vllm/v1/core/kv_cache_utils.py (this file)
  • vllm/v1/core/sched/scheduler.py

To improve maintainability, consider centralizing this logic. For example, you could add a property to the VllmConfig class:

# In vllm/config/config.py
class VllmConfig:
    ...
    @property
    def is_kv_consumer(self) -> bool:
        return (self.kv_transfer_config is not None and
                self.kv_transfer_config.is_kv_consumer)

This would allow you to replace the repeated checks with a cleaner vllm_config.is_kv_consumer.

@underfituu underfituu force-pushed the feat/kv-consumer-hybrid-mamba branch 2 times, most recently from 3dd6509 to 45014c9 Compare May 14, 2026 02:01
@underfituu underfituu changed the title [Feature] Add KV consumer partial-group caching for hybrid Mamba models [PD][Feature] Add KV consumer partial-group caching for hybrid Mamba models May 14, 2026
@mergify mergify Bot added the kv-connector label May 14, 2026
@underfituu underfituu force-pushed the feat/kv-consumer-hybrid-mamba branch from 98701d9 to 28da40f Compare May 14, 2026 08:03

@NickLucche NickLucche left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for contributing @underfituu .
We dont have consumers with PD disagg (check #42554 out), as nixl is using kv_both.

@underfituu

Copy link
Copy Markdown
Contributor Author

Thanks for contributing @underfituu . We dont have consumers with PD disagg (check #42554 out), as nixl is using kv_both.

Thanks for pointing that out! We actually overlooked the kv_both scenario. Let us head back to the drawing board and rethink our approach.

lHrHenry233 added a commit to underfituu/vllm that referenced this pull request May 21, 2026
On hybrid Mamba models running PD-disaggregated with NIXL (kv_role=
kv_both), the D-side prefix-cache hit drops to 0% because Mamba state is
not transferred across the connector. The HybridKVCacheCoordinator hit
loop min-reduces curr_hit_length to 0 as soon as the Mamba group reports
0 cached blocks, collapsing the FullAttention hit that NIXL just landed.

A previous attempt (vllm-project#42524) gated this at node granularity via
KVTransferConfig.is_kv_consumer, but kv_both is treated as both producer
and consumer, so producer-side requests on the same node would also lose
Mamba alignment. This change moves the skip to per-request granularity,
gated on request.kv_transfer_params["do_remote_prefill"]:
- producer-side requests retain Mamba block-aligned scheduling
- pulling requests skip the SSM-group min-reduction so the FullAttention
  prefix-cache hit is preserved on the consumer side

Three files, no schema change:
- scheduler.py: drop the External-KV verification assert; gate
  need_mamba_block_aligned_split on "not load_kv_async" so PD-pulled
  requests do not get block-aligned-split.
- kv_cache_manager.py: derive skip_mamba_align from request and forward
  it to coordinator.find_longest_cache_hit.
- kv_cache_coordinator.py: thread skip_mamba_align through the four
  find_longest_cache_hit signatures; HybridKVCacheCoordinator skips
  MambaSpec groups during the hit min-reduction when skip_mamba_align
  is True (Mamba group gets an empty hit-blocks list, downstream
  allocation treats it as "no cached blocks", exactly matching the
  current cold-start D-side shape).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: lHrHenry233 <2381623149@qq.com>
@lHrHenry233 lHrHenry233 force-pushed the feat/kv-consumer-hybrid-mamba branch from 5729cde to 7fa7ecf Compare May 21, 2026 08:50
@underfituu underfituu force-pushed the feat/kv-consumer-hybrid-mamba branch from 7fa7ecf to 4897d6e Compare May 21, 2026 09:57
lHrHenry233 added a commit to underfituu/vllm that referenced this pull request May 21, 2026
On hybrid Mamba models running PD-disaggregated with NIXL (kv_role=
kv_both), the D-side prefix-cache hit drops to 0% because Mamba state is
not transferred across the connector. The HybridKVCacheCoordinator hit
loop min-reduces curr_hit_length to 0 as soon as the Mamba group reports
0 cached blocks, collapsing the FullAttention hit that NIXL just landed.

A previous attempt (vllm-project#42524) gated this at node granularity via
KVTransferConfig.is_kv_consumer, but kv_both is treated as both producer
and consumer, so producer-side requests on the same node would also lose
Mamba alignment. This change moves the skip to per-request granularity,
gated on request.kv_transfer_params["do_remote_prefill"]:
- producer-side requests retain Mamba block-aligned scheduling
- pulling requests skip the SSM-group min-reduction so the FullAttention
  prefix-cache hit is preserved on the consumer side

Three files, no schema change:
- scheduler.py: drop the External-KV verification assert; gate
  need_mamba_block_aligned_split on "not load_kv_async" so PD-pulled
  requests do not get block-aligned-split.
- kv_cache_manager.py: derive skip_mamba_align from request and forward
  it to coordinator.find_longest_cache_hit.
- kv_cache_coordinator.py: thread skip_mamba_align through the four
  find_longest_cache_hit signatures; HybridKVCacheCoordinator skips
  MambaSpec groups during the hit min-reduction when skip_mamba_align
  is True (Mamba group gets an empty hit-blocks list, downstream
  allocation treats it as "no cached blocks", exactly matching the
  current cold-start D-side shape).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: lHrHenry233 <2381623149@qq.com>
@underfituu

Copy link
Copy Markdown
Contributor Author

Thanks for contributing @underfituu . We dont have consumers with PD disagg (check #42554 out), as nixl is using kv_both.

Hey @NickLucche , the latest commit introduces a per-request flag for skip_mamba_align, making it fully compatible with NIXL. We’d highly appreciate it if you could test this patch in your environment to see if the D-side cache hit is fully restored.

Signed-off-by: lHrHenry233 <2381623149@qq.com>
@underfituu underfituu force-pushed the feat/kv-consumer-hybrid-mamba branch from 2442bc8 to dc0a381 Compare May 21, 2026 14:23
Signed-off-by: lHrHenry233 <2381623149@qq.com>
Signed-off-by: lHrHenry233 <2381623149@qq.com>
Signed-off-by: lHrHenry233 <2381623149@qq.com>

@NickLucche NickLucche left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @underfituu sorry for being late on this.
Would you mind elaborating your idea a bit more? I think I am reading it wrong, but are you trying to skip prefix caching on P..?

Comment on lines +223 to +226
skip_mamba_align = bool(
request.kv_transfer_params
and request.kv_transfer_params.get("do_remote_prefill")
)

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel we're spiling request-level pd checking into the kv cache manager. Let's discuss the proposed change a bit more to see if/where we can move this

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel we're spiling request-level pd checking into the kv cache manager. Let's discuss the proposed change a bit more to see if/where we can move this

I agree with this concern. However, without changing the hit accounting path, the D node has no way to reuse the gated-attention KV cache independently of the Mamba state. We considered this flag-based approach the smallest change we could make:

  • it does not change P-side prefix caching;
  • it does not change normal non-PD hybrid Mamba prefix-cache behavior;
  • it only changes the D-side hit-length calculation for requests that are actually using remote prefill;
  • it avoids changing the KV transfer protocol, block layout, or Mamba state representation.

I agree that the exact location and naming of the flag can be improved. We would be happy to further discuss with you and the community on where this flag should live and how it should be named. But from our current understanding, some change to the hybrid cache-hit accounting is necessary; otherwise the valid gated-attention KV hit on the D side will continue to be discarded by the Mamba group's 0 hit.

@lHrHenry233

Copy link
Copy Markdown
Contributor

Hey @underfituu sorry for being late on this. Would you mind elaborating your idea a bit more? I think I am reading it wrong, but are you trying to skip prefix caching on P..?

Thanks @NickLucche, let me clarify the intent here.

No, we are not trying to skip prefix caching on the P side. In our setup, prefix caching on the P node remains enabled and unchanged.

The issue we are trying to fix is on the D side for PD-disaggregated hybrid Mamba models. After the D side pulls the remote prefill result from P, the attention KV blocks can be available/reusable on D, but the Mamba/SSM state is not represented as normal prefix-cache blocks in the D-side block table. Then HybridKVCacheCoordinator computes the cache hit length by min-reducing across KV groups. Since the Mamba group reports 0 cached blocks, it collapses the valid attention / gated-attention KV hit length to 0 as well.

So the idea is only to change the D-side cache-hit accounting for PD-pull requests: when the request is actually using remote prefill, we skip the Mamba/SSM group in the hybrid hit-length min-reduction, so the D node can reuse Qwen3.5's gated-attention KV cache independently of the Mamba state. The Mamba state is still treated as not locally prefix-cached on D.

@NickLucche NickLucche left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for elaborating and for the work @lHrHenry233 @underfituu , I think this is an interesting approach.

I would like to pick one solution between #42547 and this PR.
To make changes even less invasive and avoiding having to thread skip_mamba_align through various layers, I think we should build on top #43874 (and RFC #43807).

This way we could apply this D-specific fix by checking for role at config time once kv_both is deprecated.
The changes would then be local to HybridKVCacheCoordinator, which could check whether current instance is consumer/D, and set skip_mamba_align once accordingly.

Also, I assume you're running based on #42554 right?
Ow one will run into

(EngineCore pid=618132) ERROR 06-01 13:44:14 [core.py:1161]   File "/home/NickLucche/llmd/vllm/vllm/distributed/kv_transfer/kv_connector/v1/nixl/worker.py", line 2338, in _apply_prefix_caching
(EngineCore pid=618132) ERROR 06-01 13:44:14 [core.py:1161]     assert num_local_blocks == num_remote_blocks

Comment thread vllm/v1/core/kv_cache_coordinator.py Outdated
Comment on lines +563 to +564
# PD-pull path: Mamba state is not transferred, so do not let
# an empty Mamba lookup shrink the hit length. We still leave

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what do you mean with "Mamba state is not transferred", that's not right we transfer the temporal/Conv states.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what do you mean with "Mamba state is not transferred", that's not right we transfer the temporal/Conv states.

Here the state means the cahe aside from the running state. The annotation is indeeded confusing. We will improve it.

Comment thread vllm/v1/core/kv_cache_coordinator.py Outdated

def __init__(
self,
vllm_config: VllmConfig,

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cruft, this crashes at runtime

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cruft, this crashes at runtime

Sorry, this derives from the first commit and we have deleted it.

@lHrHenry233

lHrHenry233 commented Jun 1, 2026

Copy link
Copy Markdown
Contributor

Thanks for elaborating and for the work @lHrHenry233 @underfituu , I think this is an interesting approach.

I would like to pick one solution between #42547 and this PR. To make changes even less invasive and avoiding having to thread skip_mamba_align through various layers, I think we should build on top #43874 (and RFC #43807).

This way we could apply this D-specific fix by checking for role at config time once kv_both is deprecated. The changes would then be local to HybridKVCacheCoordinator, which could check whether current instance is consumer/D, and set skip_mamba_align once accordingly.

Also, I assume you're running based on #42554 right? Ow one will run into

(EngineCore pid=618132) ERROR 06-01 13:44:14 [core.py:1161]   File "/home/NickLucche/llmd/vllm/vllm/distributed/kv_transfer/kv_connector/v1/nixl/worker.py", line 2338, in _apply_prefix_caching
(EngineCore pid=618132) ERROR 06-01 13:44:14 [core.py:1161]     assert num_local_blocks == num_remote_blocks

Thanks for your feedback! Yes, you are right—our worker is running based on #42554. We will also update the PR description later with more details to make this clear.

@ZhanqiuHu

Copy link
Copy Markdown
Contributor

Thanks! @NickLucche Here an example alternative for reference showing the "per-group" approach we discussed offline: #44243

@underfituu

Copy link
Copy Markdown
Contributor Author

Thanks! @NickLucche Here an example alternative for reference showing the "per-group" approach we discussed offline: #44243

Thanks for sharing this. We're currently reviewing the approach and will share our thoughts on the 'per-group' logic shortly. Stay tuned.

Signed-off-by: lHrHenry233 <2381623149@qq.com>
Signed-off-by: lHrHenry233 <2381623149@qq.com>
Signed-off-by: lHrHenry233 <2381623149@qq.com>
@lHrHenry233 lHrHenry233 force-pushed the feat/kv-consumer-hybrid-mamba branch from 013196a to 6b7f617 Compare June 2, 2026 06:22
@mergify

mergify Bot commented Jun 3, 2026

Copy link
Copy Markdown
Contributor

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @underfituu.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants