
[HiCache] Add CP support for HiCache #20977

Merged
ShangmingCai merged 4 commits into main from support_cp_hicache
Apr 10, 2026

Conversation

@ShangmingCai (Collaborator) commented Mar 20, 2026

Motivation

This PR mainly adds context-parallel (CP) support for Qwen3 + HiCache. MLA models + HiCache reuse CP rank 0's data for all ranks, so we don't need to distinguish the key.
CC: @whybeyoung

Modifications

Accuracy Tests

Benchmarking and Profiling

Checklist

Review Process

  1. Ping Merge Oncalls to start the PR flow. See the PR Merge Process.
  2. Get approvals from CODEOWNERS and other reviewers.
  3. Trigger CI tests with comments or contact authorized users to do so.
    • /tag-run-ci-label, /rerun-failed-ci, /tag-and-rerun-ci
  4. After green CI and required approvals, ask Merge Oncalls to merge.

Signed-off-by: Shangming Cai <csmthu@gmail.com>

@github-actions github-actions Bot added the hicache Hierarchical Caching for SGLang label Mar 20, 2026
whybeyoung added a commit to whybeyoung/sglang that referenced this pull request Mar 20, 2026
@vladnosiv (Contributor)

Hi!
Don't you think that, for full support, you also need to synchronize the cache state, as is done in #20460?

@ShangmingCai (Collaborator, Author)


Yeah, you are right. I thought it was only a tagging issue, so I kept this PR minimal, but it turns out we need to handle control-plane coordination as well.

Comment on lines +409 to +414
self.enable_cp = self.attn_cp_size > 1
if self.enable_pp or self.enable_cp:
self.mha_suffix = (
f"{self.local_rank}_{self.pp_rank}_{self.attn_cp_rank}"
)
self.mla_suffix = f"{self.pp_rank}_{self.attn_cp_rank}"
(Collaborator)

Should we consider checking enable_pp and enable_cp separately? If only enable_pp is used, attn_cp_rank would then fall back to the default value (_0).

(Collaborator, Author)

That depends on our design. If we want to support heterogeneous setups, then maybe we should let the suffix always be f"{self.local_rank}_{self.pp_rank}_{self.attn_cp_rank}". They can all be 0 in most cases. But we might also need to put the pp size and cp size in the suffix.
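The suffix scheme under discussion can be sketched as follows. This is a minimal illustration only: the ParallelState fields and the inclusion of pp_size/attn_cp_size in the key are assumptions raised in the comment above, not the merged implementation.

```python
from dataclasses import dataclass


@dataclass
class ParallelState:
    # Hypothetical container for the ranks/sizes mentioned in the thread.
    local_rank: int = 0
    pp_rank: int = 0
    attn_cp_rank: int = 0
    pp_size: int = 1
    attn_cp_size: int = 1


def cache_key_suffix(state: ParallelState) -> str:
    """Build an unambiguous per-rank suffix for HiCache keys.

    Always including every rank (all 0 in the common single-node case)
    avoids ambiguity; also including the sizes would let heterogeneous
    deployments share one store without key collisions.
    """
    return (
        f"{state.local_rank}_{state.pp_rank}_{state.attn_cp_rank}"
        f"_{state.pp_size}_{state.attn_cp_size}"
    )


# A setup with no PP and no CP yields the all-default suffix:
print(cache_key_suffix(ParallelState()))  # 0_0_0_1_1
```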

@ShangmingCai (Collaborator, Author)

/tag-and-rerun-ci

@github-actions github-actions Bot added the run-ci label Apr 2, 2026
@ShangmingCai (Collaborator, Author) commented Apr 10, 2026

CI has passed. Let us make it more robust after the HiCache refactor.


@ShangmingCai ShangmingCai merged commit 1c76f32 into main Apr 10, 2026
299 of 358 checks passed
@ShangmingCai ShangmingCai deleted the support_cp_hicache branch April 10, 2026 09:52
Fridge003 pushed a commit that referenced this pull request Apr 11, 2026
Signed-off-by: Shangming Cai <csmthu@gmail.com>
whybeyoung added a commit to whybeyoung/sglang that referenced this pull request Apr 12, 2026
…md v2

Add [PPPrefillDiag] and [PPPrefillProblem] to _PPPrefillDebugFilter
so they are silenced by default (visible with SGLANG_DEBUG_HICACHE_VERBOSE=1).

Update PR_PLAN.md with:
- Remove PR 1 (CP support) — already merged as sgl-project#20977
- Further split PR 3 into 3a-3e sub-PRs
- Add dead code / ENV-gated code inventory
- Add debug log tiering strategy with minimal ENV combos for each problem type
pyc96 pushed a commit to pyc96/sglang that referenced this pull request Apr 14, 2026
Signed-off-by: Shangming Cai <csmthu@gmail.com>
@vladnosiv (Contributor)

Hi!
I've been testing in the context of my PR (#20460), and I have questions about these changes.

Why is the CP information included in the KV cache key?
CP uses a strategy of splitting Q and replicating KV. That is, before each attention layer, the KV caches on all CP ranks are guaranteed to be identical, and after the all-gather following the layer, this guarantee holds again.

This means that each CP rank must receive the same cache when it is retrieved from the cache, and when writing, it is enough for one rank to write to the cache.

In other words, consistency across CP ranks is automatically maintained and, in general, knowledge about CP in the keys is not required.

I may be missing something, but so far in my PR I have removed the CP information from the keys; please review these changes.
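The replication argument above can be sketched with a toy store: with identical KV on every CP rank, one rank's write suffices and all ranks read the same key. The store API here is purely hypothetical, not sglang's actual interface.

```python
class KVStore:
    """Toy key-value store standing in for a hierarchical cache backend."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value

    def get(self, key):
        return self._data.get(key)


def write_kv(store, key, kv_block, cp_rank):
    # KV is replicated across CP ranks, so a single rank writing is
    # sufficient; the other ranks skip the redundant write.
    if cp_rank == 0:
        store.put(key, kv_block)


def read_kv(store, key):
    # Every CP rank reads the same entry; no cp_rank in the key.
    return store.get(key)


store = KVStore()
for rank in range(4):  # 4 CP ranks holding identical KV
    write_kv(store, "prefix_hash", b"kv-bytes", rank)
print(read_kv(store, "prefix_hash"))  # b'kv-bytes'
```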

@ShangmingCai (Collaborator, Author)


@vladnosiv The current CP implementation makes each CP rank allocate the full KV cache length and use all-gather to fetch from peers. But we are considering a ring-based solution without a full local copy, so I keep the CP rank in the key for MHA/GQA for now, since MHA/GQA are more memory-intensive than MLA. Anyway, I think this can be changed and optimized later, when we support models like Qwen3.5 or others.
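The trade-off described above can be sketched as follows: under a ring/sharded CP design each rank holds a different KV shard, so the rank must appear in the cache key, while under replication a single shared key is enough. The shard layout and key scheme here are illustrative assumptions, not sglang code.

```python
def shard_kv(kv, cp_size):
    """Split a sequence of KV blocks evenly across CP ranks."""
    per = len(kv) // cp_size
    return [kv[r * per:(r + 1) * per] for r in range(cp_size)]


def keys_for(prefix_hash, cp_size, sharded):
    if sharded:
        # Each rank stores a different shard: the rank must be in the key.
        return [f"{prefix_hash}_{r}" for r in range(cp_size)]
    # Replicated KV: one shared key is enough.
    return [prefix_hash]


kv = list(range(8))
print(shard_kv(kv, 4))                 # [[0, 1], [2, 3], [4, 5], [6, 7]]
print(keys_for("h", 4, sharded=True))  # ['h_0', 'h_1', 'h_2', 'h_3']
print(keys_for("h", 4, sharded=False)) # ['h']
```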

yhyang201 pushed a commit to yhyang201/sglang that referenced this pull request Apr 22, 2026
Signed-off-by: Shangming Cai <csmthu@gmail.com>

Labels

hicache Hierarchical Caching for SGLang run-ci


3 participants