Skip to content

[PD][Core] Fix Mamba prefix cache hit rate in PD disaggregation#44243

Merged
NickLucche merged 9 commits into
vllm-project:mainfrom
ZhanqiuHu:zhanqiu/fix-pd-mamba-prefix-cache
Jun 11, 2026
Merged

[PD][Core] Fix Mamba prefix cache hit rate in PD disaggregation#44243
NickLucche merged 9 commits into
vllm-project:mainfrom
ZhanqiuHu:zhanqiu/fix-pd-mamba-prefix-cache

Conversation

@ZhanqiuHu

@ZhanqiuHu ZhanqiuHu commented Jun 1, 2026

Copy link
Copy Markdown
Contributor

Co-authored with @underfituu.

Fix the bug described in #42524; overwrite find_longest_cache_hit to bypass truncation of full attention groups.

Note on the expected behavior: the last state of Mamba will always be transfer, but full attention will only transfer the prefix cache miss part.

@NickLucche NickLucche left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is quite interesting

Comment thread vllm/v1/core/sched/scheduler.py Outdated
Comment on lines +650 to +654
):
new_computed_blocks, per_group_hits = (
self._get_computed_blocks_per_group(request)
)
num_new_local_computed_tokens = min(per_group_hits)

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

dumb q: what's the main issue with evaluating per-group for all hybrid models, regardless of mamba?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

mm sw probably needs right2left alignment

@ZhanqiuHu ZhanqiuHu Jun 3, 2026

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah i was thinking how we can generalize this, let me think more :)
I think num_new_local_computed_tokens = min(per_group_hits) is actually wrong, but technically num_new_local_computed_tokens is not really used later with kv connector.
sw definitely needs different handling

@mergify

mergify Bot commented Jun 4, 2026

Copy link
Copy Markdown
Contributor

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @ZhanqiuHu.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify Bot added the needs-rebase label Jun 4, 2026
ZhanqiuHu added 3 commits June 4, 2026 14:39
Per-group cache evaluation so FA local hits are preserved even when
Mamba has no local state on the D-side. Worker-side prefix caching
now handles SSM groups correctly (end-trim instead of strict equality).

Fixes 0% D-side prefix cache hit rate for Mamba hybrid models in PD.

Signed-off-by: Zhanqiu Hu <zhu@redhat.com>
Signed-off-by: Zhanqiu Hu <zhu@redhat.com>
Signed-off-by: Zhanqiu Hu <zhu@redhat.com>
@ZhanqiuHu ZhanqiuHu force-pushed the zhanqiu/fix-pd-mamba-prefix-cache branch from 88da91d to a45c78a Compare June 4, 2026 18:43
@mergify mergify Bot removed the needs-rebase label Jun 4, 2026
ZhanqiuHu added 3 commits June 4, 2026 16:23
Signed-off-by: Zhanqiu Hu <zhu@redhat.com>
Signed-off-by: Zhanqiu Hu <zhu@redhat.com>
Signed-off-by: Zhanqiu Hu <zhu@redhat.com>

@claude claude Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Claude Code Review

This pull request is from a fork — automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.

@NickLucche NickLucche left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM as per offline discussion

@NickLucche NickLucche added the ready ONLY add when PR is ready to merge/full CI is needed label Jun 8, 2026
@NickLucche NickLucche enabled auto-merge (squash) June 8, 2026 16:21
@mergify

mergify Bot commented Jun 8, 2026

Copy link
Copy Markdown
Contributor

Hi @ZhanqiuHu, the pre-commit checks have failed. Please run:

uv pip install pre-commit>=4.5.1
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

)
if (
self.connector is not None
and self.has_mamba_layers

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the non-intrusive changes! Abstracting this into a separate function looks much more elegant. We recently noticed that DeepSeek-V4 also supports partial hits, so would it make sense to generalize this logic to accommodate it?

@NickLucche NickLucche merged commit 55911db into vllm-project:main Jun 11, 2026
75 checks passed
Saddss pushed a commit to Saddss/vllm that referenced this pull request Jun 14, 2026
…-project#44243)

Co-authored-by: lHrHenry233 <2381623149@qq.com>
Co-authored-by: underfituu <hzhucong@163.com>
Signed-off-by: Zhanqiu Hu <zhu@redhat.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ci/build kv-connector ready ONLY add when PR is ready to merge/full CI is needed v1

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants