[HiCache] Add CP support for HiCache#20977
Conversation
Signed-off-by: Shangming Cai <csmthu@gmail.com>
|
Warning You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again! |
…#20977) Signed-off-by: Shangming Cai <csmthu@gmail.com>
|
Hi ! |
Yeah, you are right, I thought it was only a tag issue, so I basically vibe this, but it turns out we need to handle the control plane coordinate as well. |
| self.enable_cp = self.attn_cp_size > 1 | ||
| if self.enable_pp or self.enable_cp: | ||
| self.mha_suffix = ( | ||
| f"{self.local_rank}_{self.pp_rank}_{self.attn_cp_rank}" | ||
| ) | ||
| self.mla_suffix = f"{self.pp_rank}_{self.attn_cp_rank}" |
There was a problem hiding this comment.
Should we consider separating checking for enable_pp and enable_cp, or if only enable_pp is used, then attn_cp_rank should be the default value (_0)?
There was a problem hiding this comment.
That depends on our design. If we want to support hetero setups, then maybe we should let the suffix always be f"{self.local_rank}_{self.pp_rank}_{self.attn_cp_rank}" ). They can all be 0 for most cases. But we might also need to put the pp size and cp size in the suffix.
|
/tag-and-rerun-ci |
Signed-off-by: Shangming Cai <csmthu@gmail.com>
…md v2 Add [PPPrefillDiag] and [PPPrefillProblem] to _PPPrefillDebugFilter so they are silenced by default (visible with SGLANG_DEBUG_HICACHE_VERBOSE=1). Update PR_PLAN.md with: - Remove PR 1 (CP support) — already merged as sgl-project#20977 - Further split PR 3 into 3a-3e sub-PRs - Add dead code / ENV-gated code inventory - Add debug log tiering strategy with minimal ENV combos for each problem type
Signed-off-by: Shangming Cai <csmthu@gmail.com>
|
Hi ! Why is the information about CP entered in the kv cache key? This means that each CP rank must receive the same cache when it is retrieved from the cache. In other words, CP heterogeneity is automatically maintained and, in general, knowledge about CP in keys is not required. I maybe missing something, but so far in my PR I have removed information about CP from the keys, review these changes, please |
@vladnosiv Current CP impl makes each CP rank allocate the full KVCache length, and use allgather to fetch from peers. But we are thinking of supporting a ring-based solution without a full local copy, so I assume it for MHA/GQA now, since MHA/GQA are more memory-intensive than MLA. Anyway, I think it can be changed and optimized at that time, when we are supporting models like Qwen3.5 or others. |
Signed-off-by: Shangming Cai <csmthu@gmail.com>

Motivation
This PR is mostly for Qwen3 CP + Hicache. MLA models + Hicache will reuse cp 0's data for all ranks, so we don't need to distinguish the key.
CC: @whybeyoung
Modifications
Accuracy Tests
Benchmarking and Profiling
Checklist
Review Process
/tag-run-ci-label,/rerun-failed-ci,/tag-and-rerun-ci