[PD][Nixl] Mamba prefix caching mode support by NickLucche · Pull Request #42554 · vllm-project/vllm

NickLucche · 2026-05-13T17:47:31Z

This PR lets PD Mamba setups use prefix caching (the "all" and "align" modes, as well as the upcoming #37898).
It only adds the logic to handle the result of a prefix cache hit, so it is agnostic to the actual caching implementation.
Without this PR, running with prefix caching enabled trips this assertion:

vllm/vllm/distributed/kv_transfer/kv_connector/v1/nixl/worker.py

Lines 2346 to 2347 in 256dbca

    
    if _is_ssm_spec(self._group_spec_types[i]):
        assert num_local_blocks == num_remote_blocks

because prefix caching in Mamba inserts placeholder blocks (in align mode) to signal block-aligned checkpoints, e.g. [0, 0, 14].
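To illustrate why a strict equality check on block counts breaks here, the following is a minimal sketch, not the connector's actual code. It assumes placeholders are represented by block ID 0 (as in the `[0, 0, 14]` example) and that they only appear as leading entries; `trim_ssm_placeholders` and the remote list `[7]` are hypothetical and used purely for illustration.

```python
# Assumption: placeholder entries use block ID 0, as in the [0, 0, 14] example.
PLACEHOLDER_BLOCK_ID = 0


def trim_ssm_placeholders(block_ids: list[int]) -> list[int]:
    """Drop leading placeholder entries, keeping only blocks with real state.

    Hypothetical helper: the real connector may represent or detect
    placeholders differently.
    """
    first_real = 0
    while (
        first_real < len(block_ids)
        and block_ids[first_real] == PLACEHOLDER_BLOCK_ID
    ):
        first_real += 1
    return block_ids[first_real:]


# A list with two placeholders and one real checkpoint block.
local_blocks = [0, 0, 14]
# Hypothetical remote list holding a single real block.
remote_blocks = [7]

# The raw lengths differ (3 vs 1), so a plain equality assert would fail.
# After trimming, both sides describe one real block.
assert len(trim_ssm_placeholders(local_blocks)) == len(
    trim_ssm_placeholders(remote_blocks)
)
```

Comparing the trimmed lists rather than the raw ones keeps the sanity check while tolerating placeholder entries.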

Unfortunately, the underlying prefix caching logic (both "all" and "align") needs a fix to correctly register cache hits with PD; see #42547.
I am pushing that fix in a separate PR because it is far more general and impactful, and I would welcome any discussion there (this would be a nice use case for stacked PRs, by the way).

Another known limitation is that this PR does not yet support heterogeneous block_size together with prefix caching, so the following check stays in place:

vllm/vllm/distributed/kv_transfer/kv_connector/v1/nixl/worker.py

Lines 1458 to 1464 in 256dbca

    
    raise RuntimeError(
        "Prefix caching with heterogeneous physical_blocks_per_logical "
        "is not supported for Mamba hybrid models. "
        f"Local: {self._physical_blocks_per_logical_kv_block}, "
        f"Remote: {remote_physical_per_logical}. "
        "Disable prefix caching with --no-enable-prefix-caching."
    )

I will follow up with a separate PR to address that case. I feel that agreeing on the fix in #42547 is actually the most important part here.

Test with

--enable-prefix-caching --mamba-cache-mode align --no-disable-hybrid-kv-cache-manager

# D
VLLM_NIXL_SIDE_CHANNEL_PORT=$(just port 5558) VLLM_SSM_CONV_STATE_LAYOUT=DS vllm serve nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-FP8 --port $(just port 8200) --enforce-eager --tensor-parallel-size 1 --gpu-memory-utilization 0.9 --trust-remote-code --max-model-len 131072 --block-size 128 --enable-prefix-caching --no-disable-hybrid-kv-cache-manager --mamba-cache-mode align --kv-transfer-config '{"kv_connector":"NixlConnector","kv_role":"kv_both"}'

# P
VLLM_NIXL_SIDE_CHANNEL_PORT=$(just port 5557) vllm serve nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-FP8 --port $(just port 8100)  --gpu-memory-utilization 0.9 --trust-remote-code --enforce-eager --max-model-len 131072 --block-size 128 --enable-prefix-caching --no-disable-hybrid-kv-cache-manager --mamba-cache-mode align --kv-transfer-config '{"kv_connector":"NixlConnector","kv_role":"kv_both"}'

# proxy
python vllm/tests/v1/kv_connector/nixl_integration/toy_proxy_server.py --port $(just port 8192) --prefiller-port $(just port 8100) --decoder-port $(just port 8200)

# After #42547
# P
(APIServer pid=2669040) INFO 05-13 17:41:28 [loggers.py:271] Engine 000: Avg prompt throughput: 1177.3 tokens/s, Avg generation throughput: 0.2 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.2%, Prefix cache hit rate: 26.4%, External prefix cache hit rate: 0.0%

# D
(APIServer pid=2668541) INFO 05-13 17:41:25 [loggers.py:271] Engine 000: Avg prompt throughput: 0.2 tokens/s, Avg generation throughput: 2.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 26.4%, External prefix cache hit rate: 100.0%

or run the unit test with

pytest -v -s tests/v1/kv_connector/unit/test_nixl_connector_hma.py::test_apply_prefix_caching_ssm_prefix_cache_hit

claude

Claude Code Review

This pull request is from a fork — automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.

gemini-code-assist

Code Review

This pull request introduces prefix caching support for SSM and Full Attention (FA) groups in the NIXL KV connector. Key changes include updating the _apply_prefix_caching logic to handle SSM placeholder trimming and FA partial prefix hits, along with scheduler adjustments to support external KV connectors and skip Mamba block alignment during asynchronous KV loading. Review feedback identified critical issues in the slicing logic and assertions within _apply_prefix_caching that would cause failures during full prefix cache hits (when num_local_blocks is zero).
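As a hedged illustration of the edge case mentioned in the review, consider selecting the tail of a remote block list when only the last `num_local_blocks` blocks still need to be fetched. This is a sketch under that assumption and may not match the exact issue or the code in `_apply_prefix_caching`; `select_remote_tail` is a hypothetical name. The pitfall is a general Python one: slicing with `[-0:]` returns the whole list rather than an empty one.

```python
def select_remote_tail(
    remote_block_ids: list[int], num_local_blocks: int
) -> list[int]:
    """Return the last `num_local_blocks` remote blocks.

    Hypothetical helper illustrating a partial prefix hit: the local side
    already holds the prefix and only needs the remaining tail.
    """
    if not 0 <= num_local_blocks <= len(remote_block_ids):
        raise ValueError(
            f"num_local_blocks={num_local_blocks} out of range for "
            f"{len(remote_block_ids)} remote blocks"
        )
    # Index from the front so that num_local_blocks == 0 yields [].
    # The tempting remote_block_ids[-num_local_blocks:] would return the
    # entire list in that case, because -0 == 0.
    start = len(remote_block_ids) - num_local_blocks
    return remote_block_ids[start:]


remote = [10, 11, 12, 13]
partial_hit = select_remote_tail(remote, 2)  # two blocks left to fetch
full_hit = select_remote_tail(remote, 0)     # nothing left to fetch
naive_full_hit = remote[-0:]                 # the pitfall: whole list
```

Computing the start index explicitly avoids a special case for a full prefix cache hit.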

markmc · 2026-05-14T10:49:49Z

xref #42620 which proposes adding KVConnectorBase_V1.supports_mamba_external_kv()

markmc · 2026-05-14T10:52:27Z

#42524 is also in this territory


NickLucche requested review from ApostaC, WoosukKwon, alexm-redhat, heheda12345, njhill, orozery, robertgshaw2-redhat, xuechendi and ywang96 as code owners May 13, 2026 17:47

claude Bot reviewed May 13, 2026


gemini-code-assist Bot reviewed May 13, 2026


Comment thread vllm/distributed/kv_transfer/kv_connector/v1/nixl/worker.py

Comment thread vllm/distributed/kv_transfer/kv_connector/v1/nixl/worker.py

mergify Bot added the v1 and kv-connector labels May 13, 2026

This was referenced May 13, 2026

[PD][Core] Fix Mamba prefix cache with PD #42547

Closed

[Roadmap]: PD Disaggregation with NixlConnector Roadmap #33702

Open

This was referenced May 14, 2026

Allow LMCacheConnectorV1 to support hybrid KV loads #42620

Open

[PD][Feature] Add KV consumer partial-group caching for hybrid Mamba models #42524

Open

ZhanqiuHu approved these changes May 14, 2026


ZhanqiuHu mentioned this pull request May 14, 2026

[Tracking Issue]: NIXL P/D Disaggregation for Hybrid Models #40017

Open

19 tasks

arpera mentioned this pull request May 15, 2026

Skip Mamba splitting during async KV load #41635

Closed

4 tasks

Etelis mentioned this pull request May 18, 2026

[Bug]: [kv_offload+HMA] Fails on chat subsequent request #41515

Open

1 task

NickLucche added the ready label (ONLY add when PR is ready to merge/full CI is needed) May 19, 2026

NickLucche added 4 commits May 20, 2026 15:13

prefix caching for matching block_size

6e39b53

Signed-off-by: NickLucche <nlucches@redhat.com>

partial hit for FA

683598e

Signed-off-by: NickLucche <nlucches@redhat.com>

test

0871fba

Signed-off-by: NickLucche <nlucches@redhat.com>

comment

68dc38b

Signed-off-by: NickLucche <nlucches@redhat.com>

NickLucche force-pushed the mamba-prefix-caching-pd branch from 91956ba to 68dc38b Compare May 20, 2026 13:13

hoobnn mentioned this pull request May 26, 2026

[V1][Mamba] Opt-in granular prefill to fix align-mode prefix-cache misses on incremental requests (#43587) #43628

Closed

Merge branch 'main' into mamba-prefix-caching-pd

f4ede21

njhill mentioned this pull request Jun 1, 2026

[Bugfix] NIXL PD: Don't transfer spec-decode lookahead blocks #44151

Draft

Merge branch 'main' into mamba-prefix-caching-pd

eb34471

lHrHenry233 mentioned this pull request Jun 4, 2026

[PD][Feature] Add KV consumer partial-group caching for hybrid Mamba models vllm-project/vllm-ascend#10009

Open

mgoin approved these changes Jun 4, 2026


vllm-bot merged commit 68f5e56 into vllm-project:main Jun 4, 2026
68 of 70 checks passed

JisoLya pushed a commit to JisoLya/vllm that referenced this pull request Jun 5, 2026

[PD][Nixl] Mamba prefix caching mode support (vllm-project#42554)

db5e78c

Signed-off-by: NickLucche <nlucches@redhat.com> Signed-off-by: JisoLya <523420504@qq.com>

knight0528 pushed a commit to knight0528/vllm that referenced this pull request Jun 8, 2026

[PD][Nixl] Mamba prefix caching mode support (vllm-project#42554)

5be1ade

Signed-off-by: NickLucche <nlucches@redhat.com>

waqahmed-amd-fi pushed a commit to waqahmed-amd-fi/vllm that referenced this pull request Jun 10, 2026

[PD][Nixl] Mamba prefix caching mode support (vllm-project#42554)

6a949d7

Signed-off-by: NickLucche <nlucches@redhat.com> Signed-off-by: Waqar Ahmed <waqar.ahmed@amd.com>

terafin mentioned this pull request Jun 11, 2026

[Bug]: EngineDeadError after first L1 sleep/wake cycle with --kv-offloading-backend native + --enable-sleep-mode (v0.22.1) #45268

Open

vllm-bot merged 6 commits into vllm-project:main from NickLucche:mamba-prefix-caching-pd
