[HiCache] Add PP Support with suffix pp rank #15175
stmatengss merged 9 commits into sgl-project:main
Conversation
Summary of Changes

Hello @stmatengss, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request addresses a critical issue in the HiCache Storage Backend where KVCache keys could collide when using Pipeline Parallelism (PP) with multiple ranks. It resolves this by appending PP rank information to the cache keys.
Code Review
This pull request adds support for Pipeline Parallelism (PP) in HiCache by introducing pp_rank and pp_size to differentiate keys in the cache storage. The changes primarily involve plumbing these new parameters through various components. My review identified a bug in mooncake_store.py where attributes for key suffixes (mha_suffix and mla_suffix) are not initialized when storage_config is None, which would lead to a runtime error. I've provided a code suggestion to fix this bug and refactor the related logic for better clarity and to eliminate redundancy.
```python
if storage_config is not None:
    self.is_mla_backend = storage_config.is_mla_model
    self.local_rank = storage_config.tp_rank
    self.pp_rank = storage_config.pp_rank
    self.pp_size = storage_config.pp_size
    self.enable_pp = self.pp_size > 1
    if self.enable_pp:
        self.mha_suffix = f"{self.local_rank}_{self.pp_rank}"
        self.mla_suffix = f"{self.pp_rank}"
    else:
        self.mha_suffix = f"{self.local_rank}"
        self.mla_suffix = ""
else:
    self.is_mla_backend = False
    self.local_rank = 0
    self.pp_rank = 0
    self.pp_size = 1
    self.enable_pp = False
    self.suffix = f"{self.local_rank}"
```
There's a bug in the else block for when storage_config is None. It initializes self.suffix, which is an unused variable, but fails to initialize self.mha_suffix and self.mla_suffix. This will lead to an AttributeError in other methods like _get_mha_buffer_meta that rely on these attributes.
I suggest refactoring this block to correctly initialize the suffixes and reduce code duplication. This makes the logic clearer and fixes the bug.
Suggested change:

```python
if storage_config is not None:
    self.is_mla_backend = storage_config.is_mla_model
    self.local_rank = storage_config.tp_rank
    self.pp_rank = storage_config.pp_rank
    self.pp_size = storage_config.pp_size
else:
    self.is_mla_backend = False
    self.local_rank = 0
    self.pp_rank = 0
    self.pp_size = 1
self.enable_pp = self.pp_size > 1
if self.enable_pp:
    self.mha_suffix = f"{self.local_rank}_{self.pp_rank}"
    self.mla_suffix = f"{self.pp_rank}"
else:
    self.mha_suffix = f"{self.local_rank}"
    self.mla_suffix = ""
```
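To make the fix concrete, here is a standalone sketch of the suggested `__init__` logic, using a stand-in config class (the `FakeStorageConfig` and `SuffixInit` names are hypothetical, for illustration only); it shows that both branches now expose `mha_suffix` and `mla_suffix`, so the `AttributeError` cannot occur:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class FakeStorageConfig:
    # Stand-in for HiCacheStorageConfig; field names follow the diff above.
    is_mla_model: bool
    tp_rank: int
    pp_rank: int
    pp_size: int

class SuffixInit:
    """Standalone mirror of the suggested refactor (not the real MooncakeStore)."""

    def __init__(self, storage_config: Optional[FakeStorageConfig]):
        if storage_config is not None:
            self.is_mla_backend = storage_config.is_mla_model
            self.local_rank = storage_config.tp_rank
            self.pp_rank = storage_config.pp_rank
            self.pp_size = storage_config.pp_size
        else:
            self.is_mla_backend = False
            self.local_rank = 0
            self.pp_rank = 0
            self.pp_size = 1
        # Suffixes are computed on both paths, fixing the reported bug.
        self.enable_pp = self.pp_size > 1
        if self.enable_pp:
            self.mha_suffix = f"{self.local_rank}_{self.pp_rank}"
            self.mla_suffix = f"{self.pp_rank}"
        else:
            self.mha_suffix = f"{self.local_rank}"
            self.mla_suffix = ""

# Both paths now expose the suffix attributes:
assert SuffixInit(None).mha_suffix == "0"
assert SuffixInit(FakeStorageConfig(False, 3, 1, 2)).mha_suffix == "3_1"
```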
Pull request overview
This pull request adds Pipeline Parallelism (PP) support to HiCache by introducing PP rank suffixes to distinguish KV cache keys across different PP ranks. When PP=2 and TP=8, PP ranks 0 and 1 previously shared the same hash keys, causing cache collisions. This change resolves the issue by appending PP rank information to cache keys.
Key Changes:
- Added `pp_rank` and `pp_size` fields to configuration dataclasses (`HiCacheStorageConfig` and `CacheInitParams`)
- Implemented suffix logic in `MooncakeStore` to differentiate cache keys based on PP rank
- Threaded PP rank/size parameters through the cache initialization pipeline
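As a minimal sketch of the first bullet, the extended config might look like the following; the field names follow the PR's diff, but the class name and defaults here are assumptions for illustration:

```python
from dataclasses import dataclass

@dataclass
class HiCacheStorageConfigSketch:
    # Illustrative sketch of the extended config dataclass.
    tp_rank: int = 0
    is_mla_model: bool = False
    pp_rank: int = 0  # new in this PR
    pp_size: int = 1  # new in this PR

cfg = HiCacheStorageConfigSketch(tp_rank=5, pp_rank=1, pp_size=2)
assert cfg.pp_size > 1  # PP enabled, so a PP rank suffix will be applied
```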
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
| python/sglang/srt/mem_cache/hicache_storage.py | Added pp_rank and pp_size fields to HiCacheStorageConfig dataclass |
| python/sglang/srt/mem_cache/cache_init_params.py | Added pp_rank and pp_size fields with defaults to CacheInitParams |
| python/sglang/srt/managers/scheduler.py | Passed pp_rank and pp_size to CacheInitParams during initialization |
| python/sglang/srt/managers/cache_controller.py | Added pp_rank and pp_size parameters and passed them to storage config generation |
| python/sglang/srt/mem_cache/hiradix_cache.py | Passed pp_rank and pp_size to cache controller and metrics collector |
| python/sglang/srt/mem_cache/storage/mooncake_store/mooncake_store.py | Implemented suffix logic using PP rank to create distinct cache keys for MHA and MLA models |
```python
self.pp_rank = 0
self.pp_size = 1
self.enable_pp = False
self.suffix = f"{self.local_rank}"
```
In the else block, the code sets self.suffix but this variable is never used. Based on the if block above, this should set both self.mha_suffix and self.mla_suffix to maintain consistency with the enabled PP case. The mha_suffix should be set to the local_rank and mla_suffix should be an empty string to match the non-PP behavior.
Suggested change:

```diff
- self.suffix = f"{self.local_rank}"
+ self.mha_suffix = f"{self.local_rank}"
+ self.mla_suffix = ""
```
```python
self.enable_pp = self.pp_size > 1
if self.enable_pp:
    self.mha_suffix = f"{self.local_rank}_{self.pp_rank}"
    self.mla_suffix = f"{self.pp_rank}"
else:
    self.mha_suffix = f"{self.local_rank}"
    self.mla_suffix = ""
Consider adding a comment explaining the suffix logic for PP support. For example, explain that when PP is enabled, different PP ranks need distinct keys to avoid cache collisions since they process different model layers. This would help future maintainers understand why MHA uses both tp_rank and pp_rank while MLA only uses pp_rank in the suffix.
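One way to document that logic is a small commented helper; this is an illustrative sketch (the function name `make_key_suffixes` is hypothetical, and the rationale in the docstring is inferred from the review discussion, not stated in the PR):

```python
def make_key_suffixes(tp_rank: int, pp_rank: int, pp_size: int):
    """Build per-rank cache-key suffixes (illustrative helper, not the real API).

    MHA KV caches are sharded across TP ranks, so each (tp_rank, pp_rank)
    pair holds distinct data and both must appear in the key. MLA KV data
    is presumably shared across TP ranks, so only pp_rank is needed; PP
    ranks still hold different layers and must not share keys.
    """
    if pp_size > 1:
        return f"{tp_rank}_{pp_rank}", f"{pp_rank}"
    # Without PP, keep the legacy key format for backward compatibility.
    return f"{tp_rank}", ""

assert make_key_suffixes(3, 1, 2) == ("3_1", "1")  # PP enabled
assert make_key_suffixes(3, 0, 1) == ("3", "")     # PP disabled, legacy keys
```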
/tag-run-ci-label

Nice Work

/rerun-failed-ci

@ShangmingCai PTAL

/rerun-failed-ci

/rerun-failed-ci
Fails with PP=2, TP=8, using the DeepSeek 3.2 model.
Will verify with DeepSeek 3.2 and add more tests.

It's another bug. See #15805
/rerun-failed-ci

/rerun-failed-ci

lint should be fixed

/rerun-failed-ci

/rerun-failed-ci

/rerun-failed-ci

/rerun-failed-ci

/rerun-failed-ci

/rerun-failed-ci
Maybe need to merge main again or this test won't pass.

```
======================================================================
ERROR: test_bs_1_speed (__main__.TestMiMoV2Flash)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/public_sglang_ci/runner-l6b-gpu-01234567/_work/sglang/sglang/python/sglang/srt/utils/common.py", line 2506, in retry
    return fn()
  File "/public_sglang_ci/runner-l6b-gpu-01234567/_work/sglang/sglang/python/sglang/test/test_utils.py", line 1720, in <lambda>
    lambda: super(CustomTestCase, self)._callTestMethod(method),
AssertionError: 2.7142857142857144 not greater than 3.2
```
/rerun-failed-ci

/rerun-failed-ci
Co-authored-by: Xuchun Shang <xuchun.shang@gmail.com>
Co-authored-by: ybyang <10629930+whybeyoung@users.noreply.github.com>
Co-authored-by: Shangming Cai <csmthu@gmail.com>
Motivation
When PP=2 and TP=8, PP rank 0 and PP rank 1 use the same HiCache Storage Backend, but their KVCache entries produce identical hash keys, causing collisions. To resolve this, a PP rank suffix is appended to the key.
CC. @whybeyoung @XucSh
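The collision and the fix can be sketched in a few lines; the `cache_key` helper below is hypothetical, standing in for however the backend actually builds its keys:

```python
import hashlib

def cache_key(token_hash: str, suffix: str = "") -> str:
    # Hypothetical key builder: without a suffix, PP ranks 0 and 1
    # produce identical keys for the same token prefix.
    return f"{token_hash}_{suffix}" if suffix else token_hash

prefix_hash = hashlib.sha256(b"prompt tokens").hexdigest()[:16]

# Before: both PP ranks map the same prefix to the same key (collision).
assert cache_key(prefix_hash) == cache_key(prefix_hash)

# After: appending the pp_rank suffix makes the keys distinct per rank.
key_pp0 = cache_key(prefix_hash, suffix="0")
key_pp1 = cache_key(prefix_hash, suffix="1")
assert key_pp0 != key_pp1
```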
Modifications
Accuracy Tests
Benchmarking and Profiling
Checklist