[PD][Nixl] Mamba prefix caching mode support #42554
Merged
Code Review
This pull request introduces prefix caching support for SSM and Full Attention (FA) groups in the NIXL KV connector. Key changes include updating the _apply_prefix_caching logic to handle SSM placeholder trimming and FA partial prefix hits, along with scheduler adjustments to support external KV connectors and skip Mamba block alignment during asynchronous KV loading. Review feedback identified critical issues in the slicing logic and assertions within _apply_prefix_caching that would cause failures during full prefix cache hits (when num_local_blocks is zero).
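The review does not quote the failing lines, so the following is only a sketch of one common way a "zero blocks left" case breaks tail slicing in Python, not a reproduction of the actual bug. The helper name `tail_blocks` and its arguments are hypothetical; only `num_local_blocks` appears in the review text.

```python
def tail_blocks(remote_block_ids: list[int], num_local_blocks: int) -> list[int]:
    """Return the last `num_local_blocks` entries of `remote_block_ids`.

    Hypothetical helper: illustrates why a full prefix cache hit
    (num_local_blocks == 0) needs care when slicing from the end.
    """
    if num_local_blocks < 0:
        raise ValueError("num_local_blocks must be non-negative")
    # Slicing from an explicit start index is safe for zero:
    # start == len(...) yields an empty list.
    start = max(len(remote_block_ids) - num_local_blocks, 0)
    return remote_block_ids[start:]


remote = [3, 7, 9]

# Pitfall: -0 == 0, so this slice is remote[0:], i.e. the whole list,
# not the empty list one would expect for "zero blocks left".
naive_full_hit = remote[-0:]

safe_full_hit = tail_blocks(remote, 0)
partial_hit = tail_blocks(remote, 2)
```

An assertion such as `len(blocks) > 0` placed after this kind of slice would also fail on a full hit, which is consistent with the review asking for both the slicing and the assertions to be revisited.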
Member
xref #42620 which proposes adding
Member
#42524 is also in this territory
ZhanqiuHu approved these changes on May 14, 2026
Signed-off-by: NickLucche <nlucches@redhat.com>
mgoin approved these changes on Jun 4, 2026
This PR lets PD Mamba setups use prefix caching (the "all" and "align" modes, as well as the upcoming #37898).
It merely adds the logic to handle the result of a prefix cache hit, so it is agnostic to the actual caching implementation.
Without this PR, running with prefix caching enabled hits this assertion:
vllm/vllm/distributed/kv_transfer/kv_connector/v1/nixl/worker.py
Lines 2346 to 2347 in 256dbca
This happens because prefix caching in Mamba inserts placeholder blocks (align mode) to signal block-aligned checkpoints, e.g. [0, 0, 14].
Unfortunately, the underlying prefix caching logic (both all and align modes) needs a fix to correctly register cache hits with PD: #42547.
I am pushing the fix in a separate PR as it is far more general and impactful and would therefore welcome any discussion there (this would be a nice use-case for stacked PRs btw).
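To make the placeholder layout mentioned above concrete, here is a minimal sketch of trimming leading placeholders from a block list such as [0, 0, 14]. It is not the PR's `_apply_prefix_caching` code. It assumes that block ID 0 marks a placeholder and that placeholders only appear at the front of the list; `PLACEHOLDER_BLOCK_ID` and `trim_placeholders` are hypothetical names.

```python
# Assumption: block ID 0 is the placeholder, as suggested by [0, 0, 14].
PLACEHOLDER_BLOCK_ID = 0


def trim_placeholders(block_ids: list[int]) -> list[int]:
    """Drop leading placeholder entries and keep the real block IDs.

    Hypothetical helper; only handles placeholders at the front of the list.
    """
    first_real = 0
    while (
        first_real < len(block_ids)
        and block_ids[first_real] == PLACEHOLDER_BLOCK_ID
    ):
        first_real += 1
    return block_ids[first_real:]


aligned = trim_placeholders([0, 0, 14])   # one real checkpoint block remains
only_placeholders = trim_placeholders([0, 0])  # nothing real to transfer
no_placeholders = trim_placeholders([5, 6])    # list is returned unchanged
```

A block list made up entirely of placeholders trims to an empty list, so any downstream code has to accept an empty result rather than assert that at least one block is present.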
Another known limitation: heterogeneous block_size combined with prefix caching is not supported yet, so the following assertion stays in place:
vllm/vllm/distributed/kv_transfer/kv_connector/v1/nixl/worker.py
Lines 1458 to 1464 in 256dbca
Will follow up with a separate PR to address that case. I feel agreeing on fix #42547 is actually the most important part here.
Test with
--enable-prefix-caching --mamba-cache-mode align --no-disable-hybrid-kv-cache-manager
or run unit test with