
[HiCache] Add PP Support with suffix pp rank#15175

Merged
stmatengss merged 9 commits into sgl-project:main from kvcache-ai:add_hicache_pp_support
Jan 3, 2026

Conversation

@stmatengss
Collaborator

Motivation

When running with PP=2 and TP=8, PP rank 0 and PP rank 1 share the same KVCache in the HiCache storage backend, but they generate identical hash keys. To resolve this, append a PP-rank suffix to the key.
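The collision and its fix can be sketched as follows. This is a hypothetical `cache_key` helper written only to illustrate the idea; the real key derivation lives in `MooncakeStore` and differs in detail:

```python
def cache_key(token_hash: str, tp_rank: int, pp_rank: int, pp_size: int) -> str:
    """Derive a storage key that is unique per pipeline-parallel rank."""
    if pp_size > 1:
        # Without the pp_rank suffix, PP rank 0 and PP rank 1 would
        # produce the same key for the same token hash and collide.
        return f"{token_hash}_{tp_rank}_{pp_rank}"
    return f"{token_hash}_{tp_rank}"

key_pp0 = cache_key("abc123", tp_rank=0, pp_rank=0, pp_size=2)
key_pp1 = cache_key("abc123", tp_rank=0, pp_rank=1, pp_size=2)
assert key_pp0 != key_pp1  # the two PP ranks no longer collide
```

With PP disabled (`pp_size=1`) the key keeps its original, suffix-free form, so single-stage deployments are unaffected.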

CC. @whybeyoung @XucSh

Modifications

Accuracy Tests

Benchmarking and Profiling

Checklist

@github-actions github-actions bot added the hicache Hierarchical Caching for SGLang label Dec 15, 2025
@stmatengss stmatengss requested a review from Copilot December 15, 2025 09:07
@gemini-code-assist
Contributor

Summary of Changes

Hello @stmatengss, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request addresses a critical issue in the HiCache Storage Backend where KVCache keys could collide when using Pipeline Parallelism (PP) with multiple ranks. By introducing pp_rank and pp_size parameters throughout the cache management system, the changes ensure that KVCache keys are uniquely suffixed with the pipeline parallelism rank. This prevents data corruption and allows for correct operation of HiCache in a distributed PP environment, particularly within the Mooncake Store implementation.

Highlights

  • Pipeline Parallelism (PP) Support: Introduced pp_rank and pp_size parameters across various cache management components, including CacheController, CacheInitParams, HiCacheStorageConfig, and HiRadixCache, to enable and manage pipeline parallelism.
  • Unique KVCache Keys for PP: Implemented a mechanism within the MooncakeStore to append the pp_rank as a suffix to KVCache keys. This resolves key collision issues that occur when different pipeline parallelism ranks share the same KVCache, ensuring data integrity in distributed environments.
  • Configuration and Metrics Updates: Updated the cache initialization parameters and storage configuration to properly propagate pp_rank and pp_size. Additionally, these new parameters are now included in the labels for StorageMetricsCollector to facilitate better monitoring of pipeline parallelism.
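The configuration plumbing described above can be sketched with a minimal dataclass. The field set here is illustrative only, not the actual sglang definition of `HiCacheStorageConfig`; the point is that `pp_rank`/`pp_size` default to the non-PP case so existing callers keep working:

```python
from dataclasses import dataclass


@dataclass
class HiCacheStorageConfig:
    tp_rank: int
    tp_size: int
    is_mla_model: bool
    pp_rank: int = 0  # new: pipeline-parallel rank of this worker
    pp_size: int = 1  # new: total number of pipeline stages

# A PP=2 deployment passes its rank explicitly; defaults cover PP-off.
cfg = HiCacheStorageConfig(tp_rank=3, tp_size=8, is_mla_model=False,
                           pp_rank=1, pp_size=2)
assert (cfg.pp_rank, cfg.pp_size) == (1, 2)
```

Defaulting the new fields is what lets the change thread through CacheController, CacheInitParams, and HiRadixCache without touching every existing call site.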

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request adds support for Pipeline Parallelism (PP) in HiCache by introducing pp_rank and pp_size to differentiate keys in the cache storage. The changes primarily involve plumbing these new parameters through various components. My review identified a bug in mooncake_store.py where attributes for key suffixes (mha_suffix and mla_suffix) are not initialized when storage_config is None, which would lead to a runtime error. I've provided a code suggestion to fix this bug and refactor the related logic for better clarity and to eliminate redundancy.

Comment on lines 248 to 267
if storage_config is not None:
    self.is_mla_backend = storage_config.is_mla_model
    self.local_rank = storage_config.tp_rank
    self.pp_rank = storage_config.pp_rank
    self.pp_size = storage_config.pp_size
    self.enable_pp = self.pp_size > 1
    if self.enable_pp:
        self.mha_suffix = f"{self.local_rank}_{self.pp_rank}"
        self.mla_suffix = f"{self.pp_rank}"
    else:
        self.mha_suffix = f"{self.local_rank}"
        self.mla_suffix = ""
else:
    self.is_mla_backend = False
    self.local_rank = 0
    self.pp_rank = 0
    self.pp_size = 1
    self.enable_pp = False
    self.suffix = f"{self.local_rank}"

Contributor


Severity: high

There's a bug in the else block for when storage_config is None. It initializes self.suffix, which is an unused variable, but fails to initialize self.mha_suffix and self.mla_suffix. This will lead to an AttributeError in other methods like _get_mha_buffer_meta that rely on these attributes.

I suggest refactoring this block to correctly initialize the suffixes and reduce code duplication. This makes the logic clearer and fixes the bug.

Suggested change
Before:

if storage_config is not None:
    self.is_mla_backend = storage_config.is_mla_model
    self.local_rank = storage_config.tp_rank
    self.pp_rank = storage_config.pp_rank
    self.pp_size = storage_config.pp_size
    self.enable_pp = self.pp_size > 1
    if self.enable_pp:
        self.mha_suffix = f"{self.local_rank}_{self.pp_rank}"
        self.mla_suffix = f"{self.pp_rank}"
    else:
        self.mha_suffix = f"{self.local_rank}"
        self.mla_suffix = ""
else:
    self.is_mla_backend = False
    self.local_rank = 0
    self.pp_rank = 0
    self.pp_size = 1
    self.enable_pp = False
    self.suffix = f"{self.local_rank}"

After:

if storage_config is not None:
    self.is_mla_backend = storage_config.is_mla_model
    self.local_rank = storage_config.tp_rank
    self.pp_rank = storage_config.pp_rank
    self.pp_size = storage_config.pp_size
else:
    self.is_mla_backend = False
    self.local_rank = 0
    self.pp_rank = 0
    self.pp_size = 1
self.enable_pp = self.pp_size > 1
if self.enable_pp:
    self.mha_suffix = f"{self.local_rank}_{self.pp_rank}"
    self.mla_suffix = f"{self.pp_rank}"
else:
    self.mha_suffix = f"{self.local_rank}"
    self.mla_suffix = ""

Contributor

Copilot AI left a comment


Pull request overview

This pull request adds Pipeline Parallelism (PP) support to HiCache by introducing PP rank suffixes to distinguish KV cache keys across different PP ranks. When PP=2 and TP=8, PP ranks 0 and 1 previously shared the same hash keys, causing cache collisions. This change resolves the issue by appending PP rank information to cache keys.

Key Changes:

  • Added pp_rank and pp_size fields to configuration dataclasses (HiCacheStorageConfig and CacheInitParams)
  • Implemented suffix logic in MooncakeStore to differentiate cache keys based on PP rank
  • Threaded PP rank/size parameters through the cache initialization pipeline

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
python/sglang/srt/mem_cache/hicache_storage.py Added pp_rank and pp_size fields to HiCacheStorageConfig dataclass
python/sglang/srt/mem_cache/cache_init_params.py Added pp_rank and pp_size fields with defaults to CacheInitParams
python/sglang/srt/managers/scheduler.py Passed pp_rank and pp_size to CacheInitParams during initialization
python/sglang/srt/managers/cache_controller.py Added pp_rank and pp_size parameters and passed them to storage config generation
python/sglang/srt/mem_cache/hiradix_cache.py Passed pp_rank and pp_size to cache controller and metrics collector
python/sglang/srt/mem_cache/storage/mooncake_store/mooncake_store.py Implemented suffix logic using PP rank to create distinct cache keys for MHA and MLA models


    self.pp_rank = 0
    self.pp_size = 1
    self.enable_pp = False
    self.suffix = f"{self.local_rank}"

Copilot AI Dec 15, 2025


In the else block, the code sets self.suffix but this variable is never used. Based on the if block above, this should set both self.mha_suffix and self.mla_suffix to maintain consistency with the enabled PP case. The mha_suffix should be set to the local_rank and mla_suffix should be an empty string to match the non-PP behavior.

Suggested change
Before:

self.suffix = f"{self.local_rank}"

After:

self.mha_suffix = f"{self.local_rank}"
self.mla_suffix = ""

Copilot uses AI. Check for mistakes.
Comment on lines +253 to +259
self.enable_pp = self.pp_size > 1
if self.enable_pp:
self.mha_suffix = f"{self.local_rank}_{self.pp_rank}"
self.mla_suffix = f"{self.pp_rank}"
else:
self.mha_suffix = f"{self.local_rank}"
self.mla_suffix = ""

Copilot AI Dec 15, 2025


Consider adding a comment explaining the suffix logic for PP support. For example, explain that when PP is enabled, different PP ranks need distinct keys to avoid cache collisions since they process different model layers. This would help future maintainers understand why MHA uses both tp_rank and pp_rank while MLA only uses pp_rank in the suffix.
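Such a comment might look like the following standalone sketch of the diff's suffix logic. The code mirrors the PR; the rationale in the comments is an assumption about why MHA and MLA keys differ (each TP rank presumably holds a distinct KV-head shard under MHA, while MLA caches are shared across TP ranks), not a statement from the authors:

```python
def make_suffixes(local_rank: int, pp_rank: int, pp_size: int) -> tuple[str, str]:
    """Build (mha_suffix, mla_suffix) for HiCache storage keys."""
    enable_pp = pp_size > 1
    if enable_pp:
        # With PP enabled, each PP rank holds different layers, so keys
        # must carry pp_rank to avoid collisions between stages. MHA keys
        # also carry the TP rank, since each TP rank stores its own shard.
        mha_suffix = f"{local_rank}_{pp_rank}"
        mla_suffix = f"{pp_rank}"
    else:
        # Non-PP case: only the TP rank disambiguates MHA keys; MLA keys
        # need no suffix at all.
        mha_suffix = f"{local_rank}"
        mla_suffix = ""
    return mha_suffix, mla_suffix
```

For example, `make_suffixes(3, 1, 2)` yields `("3_1", "1")`, while `make_suffixes(3, 0, 1)` yields `("3", "")`, matching the non-PP behavior before this change.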

Signed-off-by: Xuchun Shang <xuchun.shang@gmail.com>
@stmatengss
Collaborator Author

/tag-run-ci-label

@whybeyoung
Collaborator

whybeyoung commented Dec 16, 2025

Nice Work

Collaborator

@huangtingwei9988 huangtingwei9988 left a comment


LGTM

@stmatengss
Collaborator Author

/rerun-failed-ci

@stmatengss
Collaborator Author

@ShangmingCai PTAL

@stmatengss
Collaborator Author

/rerun-failed-ci

1 similar comment
@stmatengss
Collaborator Author

/rerun-failed-ci

@whybeyoung
Collaborator

whybeyoung commented Dec 24, 2025

Fails in PP=2, TP=8 using the DeepSeek 3.2 model.

@stmatengss
Collaborator Author

Fails in PP=2, TP=8 using the DeepSeek 3.2 model.

Will verify with DeepSeek 3.2 and add more tests.

@XucSh
Collaborator

XucSh commented Dec 25, 2025

same

It's another bug. See #15805

@stmatengss
Collaborator Author

/rerun-failed-ci

@stmatengss
Collaborator Author

/rerun-failed-ci

@ShangmingCai
Collaborator

Lint should be fixed.

Signed-off-by: Xuchun Shang <xuchun.shang@gmail.com>
@XucSh
Collaborator

XucSh commented Dec 30, 2025

/rerun-failed-ci

@whybeyoung whybeyoung enabled auto-merge (squash) December 30, 2025 05:24
@whybeyoung whybeyoung disabled auto-merge December 30, 2025 07:07
@stmatengss
Collaborator Author

/rerun-failed-ci

1 similar comment
@ShangmingCai
Collaborator

/rerun-failed-ci

@stmatengss
Collaborator Author

/rerun-failed-ci

@XucSh
Collaborator

XucSh commented Dec 31, 2025

/rerun-failed-ci

2 similar comments
@stmatengss
Collaborator Author

/rerun-failed-ci

@stmatengss
Collaborator Author

/rerun-failed-ci

@ShangmingCai
Collaborator

Maybe need to merge main again, or this test won't pass.

======================================================================
ERROR: test_bs_1_speed (__main__.TestMiMoV2Flash)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/public_sglang_ci/runner-l6b-gpu-01234567/_work/sglang/sglang/python/sglang/srt/utils/common.py", line 2506, in retry
    return fn()
  File "/public_sglang_ci/runner-l6b-gpu-01234567/_work/sglang/sglang/python/sglang/test/test_utils.py", line 1720, in <lambda>
    lambda: super(CustomTestCase, self)._callTestMethod(method),
AssertionError: 2.7142857142857144 not greater than 3.2

@ShangmingCai
Collaborator

/rerun-failed-ci

1 similar comment
@stmatengss
Collaborator Author

/rerun-failed-ci

@stmatengss stmatengss merged commit bdde949 into sgl-project:main Jan 3, 2026
223 of 234 checks passed
YChange01 pushed a commit to YChange01/sglang that referenced this pull request Jan 13, 2026
Co-authored-by: Xuchun Shang <xuchun.shang@gmail.com>
Co-authored-by: ybyang <10629930+whybeyoung@users.noreply.github.com>
Co-authored-by: Shangming Cai <csmthu@gmail.com>
@stmatengss stmatengss mentioned this pull request Apr 2, 2026
5 tasks

Labels

hicache Hierarchical Caching for SGLang high priority run-ci

7 participants