[PD][Core] Fix Mamba prefix cache hit rate in PD disaggregation #44243
e40d378 to 33cd846
NickLucche left a comment
I think this is quite interesting
):
    new_computed_blocks, per_group_hits = (
        self._get_computed_blocks_per_group(request)
    )
    num_new_local_computed_tokens = min(per_group_hits)
dumb q: what's the main issue with evaluating per-group for all hybrid models, regardless of mamba?
mm sw probably needs right2left alignment
yeah i was thinking how we can generalize this, let me think more :)
I think num_new_local_computed_tokens = min(per_group_hits) is actually wrong, but technically num_new_local_computed_tokens is not really used later with kv connector.
sw definitely needs different handling
Per-group cache evaluation so FA local hits are preserved even when Mamba has no local state on the D-side. Worker-side prefix caching now handles SSM groups correctly (end-trim instead of strict equality). Fixes 0% D-side prefix cache hit rate for Mamba hybrid models in PD. Signed-off-by: Zhanqiu Hu <zhu@redhat.com>
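As a rough illustration of the per-group idea in the commit message above (not the PR's actual code): the sketch below contrasts collapsing every cache group to a single minimum hit length with keeping each group's own hit length. `GroupHit`, `collapsed_hit`, `per_group_hit`, the group labels, and the token counts are all hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GroupHit:
    """Hypothetical record of the local prefix-cache hit for one KV cache group."""
    name: str        # illustrative label, e.g. "full_attention" or "mamba"
    hit_tokens: int  # prompt-prefix tokens already cached locally for this group


def collapsed_hit(groups: list[GroupHit]) -> int:
    """One hit length shared by all groups: the minimum across groups."""
    return min(g.hit_tokens for g in groups)


def per_group_hit(groups: list[GroupHit]) -> dict[str, int]:
    """Keep each group's own hit length instead of collapsing to the minimum."""
    return {g.name: g.hit_tokens for g in groups}


# Assumed scenario: full attention has a local hit, Mamba has no local state.
groups = [GroupHit("full_attention", 96), GroupHit("mamba", 0)]

print(collapsed_hit(groups))   # the Mamba group drags the shared hit to 0
print(per_group_hit(groups))   # the full-attention hit of 96 tokens survives
```

Under these assumptions, the collapsed value is what would produce a 0% hit rate whenever the Mamba group has nothing cached locally, while the per-group view keeps the full-attention hit available.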
88da91d to a45c78a
NickLucche left a comment
LGTM as per offline discussion
Hi @ZhanqiuHu, the pre-commit checks have failed. Please run:
uv pip install pre-commit>=4.5.1
pre-commit install
pre-commit run --all-files
Then, commit the changes and push to your branch. For future commits,
)
if (
    self.connector is not None
    and self.has_mamba_layers
Thanks for the non-intrusive changes! Abstracting this into a separate function looks much more elegant. We recently noticed that DeepSeek-V4 also supports partial hits, so would it make sense to generalize this logic to accommodate it?
…-project#44243) Co-authored-by: lHrHenry233 <2381623149@qq.com> Co-authored-by: underfituu <hzhucong@163.com> Signed-off-by: Zhanqiu Hu <zhu@redhat.com>
Co-authored with @underfituu.
Fix the bug described in #42524; override find_longest_cache_hit to bypass truncation of full attention groups.
Note on the expected behavior: the last state of Mamba will always be transferred, but full attention will only transfer the part that missed the prefix cache.
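The expected behavior in the note above can be sketched roughly as follows. This is only an illustration under assumed names: `plan_transfer`, its parameters, the returned keys, and the block size of 16 are hypothetical and are not taken from the PR.

```python
def plan_transfer(
    num_prompt_tokens: int,
    fa_local_hit_tokens: int,
    block_size: int = 16,  # assumed block size, for illustration only
) -> dict:
    """Hypothetical helper: decide what the decode side still needs to receive."""
    if not 0 <= fa_local_hit_tokens <= num_prompt_tokens:
        raise ValueError("local hit must lie within the prompt length")

    # Full attention: skip blocks fully covered by the local prefix-cache hit.
    # A partially covered block is still transferred (floor division).
    first_missing_block = fa_local_hit_tokens // block_size
    total_blocks = -(-num_prompt_tokens // block_size)  # ceiling division

    return {
        "full_attention_blocks": list(range(first_missing_block, total_blocks)),
        # Mamba: the last state is always transferred, regardless of local hits.
        "mamba_transfers_last_state": True,
    }


partial = plan_transfer(num_prompt_tokens=64, fa_local_hit_tokens=32)
full = plan_transfer(num_prompt_tokens=64, fa_local_hit_tokens=64)
print(partial)
print(full)
```

With a 32-token local hit on a 64-token prompt, only the last two full-attention blocks are listed for transfer; with a full local hit, no full-attention blocks are listed, yet the Mamba last state is still marked for transfer.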