[HiCache] Add PP Support with suffix pp rank #15175
stmatengss merged 9 commits into sgl-project:main
Conversation
Summary of Changes

Hello @stmatengss, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request addresses a critical issue in the HiCache Storage Backend where KVCache keys could collide when using Pipeline Parallelism (PP) with multiple ranks. It resolves this by appending PP rank information to the cache keys.
Code Review
This pull request adds support for Pipeline Parallelism (PP) in HiCache by introducing pp_rank and pp_size to differentiate keys in the cache storage. The changes primarily involve plumbing these new parameters through various components. My review identified a bug in mooncake_store.py where attributes for key suffixes (mha_suffix and mla_suffix) are not initialized when storage_config is None, which would lead to a runtime error. I've provided a code suggestion to fix this bug and refactor the related logic for better clarity and to eliminate redundancy.
```python
if storage_config is not None:
    self.is_mla_backend = storage_config.is_mla_model
    self.local_rank = storage_config.tp_rank
    self.pp_rank = storage_config.pp_rank
    self.pp_size = storage_config.pp_size
    self.enable_pp = self.pp_size > 1
    if self.enable_pp:
        self.mha_suffix = f"{self.local_rank}_{self.pp_rank}"
        self.mla_suffix = f"{self.pp_rank}"
    else:
        self.mha_suffix = f"{self.local_rank}"
        self.mla_suffix = ""
else:
    self.is_mla_backend = False
    self.local_rank = 0
    self.pp_rank = 0
    self.pp_size = 1
    self.enable_pp = False
    self.suffix = f"{self.local_rank}"
```
There's a bug in the else block for when storage_config is None. It initializes self.suffix, which is an unused variable, but fails to initialize self.mha_suffix and self.mla_suffix. This will lead to an AttributeError in other methods like _get_mha_buffer_meta that rely on these attributes.
I suggest refactoring this block to correctly initialize the suffixes and reduce code duplication. This makes the logic clearer and fixes the bug.
Suggested change:

```python
if storage_config is not None:
    self.is_mla_backend = storage_config.is_mla_model
    self.local_rank = storage_config.tp_rank
    self.pp_rank = storage_config.pp_rank
    self.pp_size = storage_config.pp_size
else:
    self.is_mla_backend = False
    self.local_rank = 0
    self.pp_rank = 0
    self.pp_size = 1
self.enable_pp = self.pp_size > 1
if self.enable_pp:
    self.mha_suffix = f"{self.local_rank}_{self.pp_rank}"
    self.mla_suffix = f"{self.pp_rank}"
else:
    self.mha_suffix = f"{self.local_rank}"
    self.mla_suffix = ""
```
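To make the fix concrete, here is a standalone sketch of the suggested `__init__` logic, using a stand-in config class (the `FakeStorageConfig` and `SuffixInit` names are hypothetical, for illustration only); it shows that both branches now expose `mha_suffix` and `mla_suffix`, so the `AttributeError` cannot occur:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class FakeStorageConfig:
    # Stand-in for HiCacheStorageConfig; field names follow the diff above.
    is_mla_model: bool
    tp_rank: int
    pp_rank: int
    pp_size: int

class SuffixInit:
    """Standalone mirror of the suggested refactor (not the real MooncakeStore)."""

    def __init__(self, storage_config: Optional[FakeStorageConfig]):
        if storage_config is not None:
            self.is_mla_backend = storage_config.is_mla_model
            self.local_rank = storage_config.tp_rank
            self.pp_rank = storage_config.pp_rank
            self.pp_size = storage_config.pp_size
        else:
            self.is_mla_backend = False
            self.local_rank = 0
            self.pp_rank = 0
            self.pp_size = 1
        # Suffixes are computed on both paths, fixing the reported bug.
        self.enable_pp = self.pp_size > 1
        if self.enable_pp:
            self.mha_suffix = f"{self.local_rank}_{self.pp_rank}"
            self.mla_suffix = f"{self.pp_rank}"
        else:
            self.mha_suffix = f"{self.local_rank}"
            self.mla_suffix = ""

# Both paths now expose the suffix attributes:
assert SuffixInit(None).mha_suffix == "0"
assert SuffixInit(FakeStorageConfig(False, 3, 1, 2)).mha_suffix == "3_1"
```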
Pull request overview
This pull request adds Pipeline Parallelism (PP) support to HiCache by introducing PP rank suffixes to distinguish KV cache keys across different PP ranks. When PP=2 and TP=8, PP ranks 0 and 1 previously shared the same hash keys, causing cache collisions. This change resolves the issue by appending PP rank information to cache keys.
Key Changes:
- Added `pp_rank` and `pp_size` fields to configuration dataclasses (`HiCacheStorageConfig` and `CacheInitParams`)
- Implemented suffix logic in `MooncakeStore` to differentiate cache keys based on PP rank
- Threaded PP rank/size parameters through the cache initialization pipeline
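As a minimal sketch of the first bullet, the extended config might look like the following; the field names follow the PR's diff, but the class name and defaults here are assumptions for illustration:

```python
from dataclasses import dataclass

@dataclass
class HiCacheStorageConfigSketch:
    # Illustrative sketch of the extended config dataclass.
    tp_rank: int = 0
    is_mla_model: bool = False
    pp_rank: int = 0  # new in this PR
    pp_size: int = 1  # new in this PR

cfg = HiCacheStorageConfigSketch(tp_rank=5, pp_rank=1, pp_size=2)
assert cfg.pp_size > 1  # PP enabled, so a PP rank suffix will be applied
```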
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
| python/sglang/srt/mem_cache/hicache_storage.py | Added pp_rank and pp_size fields to HiCacheStorageConfig dataclass |
| python/sglang/srt/mem_cache/cache_init_params.py | Added pp_rank and pp_size fields with defaults to CacheInitParams |
| python/sglang/srt/managers/scheduler.py | Passed pp_rank and pp_size to CacheInitParams during initialization |
| python/sglang/srt/managers/cache_controller.py | Added pp_rank and pp_size parameters and passed them to storage config generation |
| python/sglang/srt/mem_cache/hiradix_cache.py | Passed pp_rank and pp_size to cache controller and metrics collector |
| python/sglang/srt/mem_cache/storage/mooncake_store/mooncake_store.py | Implemented suffix logic using PP rank to create distinct cache keys for MHA and MLA models |
```python
self.pp_rank = 0
self.pp_size = 1
self.enable_pp = False
self.suffix = f"{self.local_rank}"
```
In the else block, the code sets self.suffix but this variable is never used. Based on the if block above, this should set both self.mha_suffix and self.mla_suffix to maintain consistency with the enabled PP case. The mha_suffix should be set to the local_rank and mla_suffix should be an empty string to match the non-PP behavior.
Suggested change:

```diff
- self.suffix = f"{self.local_rank}"
+ self.mha_suffix = f"{self.local_rank}"
+ self.mla_suffix = ""
```
```python
self.enable_pp = self.pp_size > 1
if self.enable_pp:
    self.mha_suffix = f"{self.local_rank}_{self.pp_rank}"
    self.mla_suffix = f"{self.pp_rank}"
else:
    self.mha_suffix = f"{self.local_rank}"
    self.mla_suffix = ""
Consider adding a comment explaining the suffix logic for PP support. For example, explain that when PP is enabled, different PP ranks need distinct keys to avoid cache collisions since they process different model layers. This would help future maintainers understand why MHA uses both tp_rank and pp_rank while MLA only uses pp_rank in the suffix.
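One way to document that logic is a small commented helper; this is an illustrative sketch (the function name `make_key_suffixes` is hypothetical, and the rationale in the docstring is inferred from the review discussion, not stated in the PR):

```python
def make_key_suffixes(tp_rank: int, pp_rank: int, pp_size: int):
    """Build per-rank cache-key suffixes (illustrative helper, not the real API).

    MHA KV caches are sharded across TP ranks, so each (tp_rank, pp_rank)
    pair holds distinct data and both must appear in the key. MLA KV data
    is presumably shared across TP ranks, so only pp_rank is needed; PP
    ranks still hold different layers and must not share keys.
    """
    if pp_size > 1:
        return f"{tp_rank}_{pp_rank}", f"{pp_rank}"
    # Without PP, keep the legacy key format for backward compatibility.
    return f"{tp_rank}", ""

assert make_key_suffixes(3, 1, 2) == ("3_1", "1")  # PP enabled
assert make_key_suffixes(3, 0, 1) == ("3", "")     # PP disabled, legacy keys
```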
/tag-run-ci-label

Nice Work

/rerun-failed-ci

@ShangmingCai PTAL

/rerun-failed-ci

/rerun-failed-ci
Fails with PP=2, TP=8, using the DeepSeek 3.2 model.
Will verify with DeepSeek 3.2 and add more tests.

It's another bug. See #15805
/rerun-failed-ci

/rerun-failed-ci

lint should be fixed

/rerun-failed-ci

/rerun-failed-ci

/rerun-failed-ci

/rerun-failed-ci

/rerun-failed-ci

/rerun-failed-ci
Maybe need to merge main again or this test won't pass.

```
======================================================================
ERROR: test_bs_1_speed (__main__.TestMiMoV2Flash)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/public_sglang_ci/runner-l6b-gpu-01234567/_work/sglang/sglang/python/sglang/srt/utils/common.py", line 2506, in retry
    return fn()
  File "/public_sglang_ci/runner-l6b-gpu-01234567/_work/sglang/sglang/python/sglang/test/test_utils.py", line 1720, in <lambda>
    lambda: super(CustomTestCase, self)._callTestMethod(method),
AssertionError: 2.7142857142857144 not greater than 3.2
```
/rerun-failed-ci

/rerun-failed-ci
Co-authored-by: Xuchun Shang <xuchun.shang@gmail.com>
Co-authored-by: ybyang <10629930+whybeyoung@users.noreply.github.com>
Co-authored-by: Shangming Cai <csmthu@gmail.com>
Motivation
When PP=2 and TP=8, PP rank 0 and PP rank 1 use the same HiCache Storage Backend, but their KVCache entries produce identical hash keys, causing collisions. To resolve this, a PP rank suffix is appended to the key.
CC. @whybeyoung @XucSh
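The collision and the fix can be sketched in a few lines; the `cache_key` helper below is hypothetical, standing in for however the backend actually builds its keys:

```python
import hashlib

def cache_key(token_hash: str, suffix: str = "") -> str:
    # Hypothetical key builder: without a suffix, PP ranks 0 and 1
    # produce identical keys for the same token prefix.
    return f"{token_hash}_{suffix}" if suffix else token_hash

prefix_hash = hashlib.sha256(b"prompt tokens").hexdigest()[:16]

# Before: both PP ranks map the same prefix to the same key (collision).
assert cache_key(prefix_hash) == cache_key(prefix_hash)

# After: appending the pp_rank suffix makes the keys distinct per rank.
key_pp0 = cache_key(prefix_hash, suffix="0")
key_pp1 = cache_key(prefix_hash, suffix="1")
assert key_pp0 != key_pp1
```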
Modifications
Accuracy Tests
Benchmarking and Profiling
Checklist