[HiCache][HA 1/N] Support HiCache storage runtime attach/detach #15892
Conversation
Summary of Changes (Gemini Code Assist): This pull request delivers a crucial enhancement to SGLang's HiCache by enabling dynamic management of its L3 storage backend. Previously, changing storage configurations necessitated a full server restart, impacting availability. With this change, operators can attach, detach, or switch storage backends on the fly, facilitating dynamic scaling, improving fault tolerance through quick failover, and simplifying hot upgrades. The implementation ensures operational safety by enforcing a strict idle-state requirement before any storage modification, and it exposes these capabilities via new HTTP administration endpoints.
Code Review
This pull request introduces a valuable feature for runtime management of HiCache storage, allowing operators to attach and detach storage backends without server restarts. The implementation is robust, with careful attention to thread safety, error handling, and state consistency, particularly in the cache_controller.py. The addition of comprehensive documentation and an end-to-end test is also commendable. I have one suggestion to improve maintainability by refactoring some duplicated code.
self.tp_world_size = torch.distributed.get_world_size(group=self.tp_group)
if self.tp_world_size > 1:
    group_ranks = torch.distributed.get_process_group_ranks(self.tp_group)
    self.prefetch_tp_group = torch.distributed.new_group(
        group_ranks, backend="gloo"
    )

self.page_get_func = self._generic_page_get
self.page_set_func = self._generic_page_set
if (self.storage_backend_type in ["hf3fs", "mooncake", "eic"]) or (
    self.storage_backend_type == "dynamic"
    and bool(self.storage_config.extra_config.get("interface_v1", 0))
):
    self.page_get_func = self._page_get_zero_copy
    self.page_set_func = self._page_set_zero_copy
There's some code duplication between this new attach_storage_backend method and the existing __init__ method.
Specifically:
- The logic for creating prefetch_tp_group (lines 501-506).
- The logic for selecting page_get_func and page_set_func (lines 508-515).
To improve maintainability and reduce redundancy, consider extracting these blocks into private helper methods. For example:
def _create_prefetch_tp_group(self):
    self.tp_world_size = torch.distributed.get_world_size(group=self.tp_group)
    if self.tp_world_size > 1:
        group_ranks = torch.distributed.get_process_group_ranks(self.tp_group)
        self.prefetch_tp_group = torch.distributed.new_group(
            group_ranks, backend="gloo"
        )
    else:
        self.prefetch_tp_group = None

def _select_page_transfer_funcs(self):
    self.page_get_func = self._generic_page_get
    self.page_set_func = self._generic_page_set
    if (self.storage_backend_type in ["hf3fs", "mooncake", "eic"]) or (
        self.storage_backend_type == "dynamic"
        and bool(self.storage_config.extra_config.get("interface_v1", 0))
    ):
        self.page_get_func = self._page_get_zero_copy
        self.page_set_func = self._page_set_zero_copy

Then you can call these helpers from both __init__ and attach_storage_backend.
TODO:

@xiezhq-hermann Hi, sorry to bother you, could you help review this PR? Thanks!
Thank you @alphabetc1 for the PR, I quite like this feature, and I am wondering whether it would be possible to refactor the existing storage backend initialization to use the same attach and detach interfaces as well. For example, if the user specifies a storage backend, it implicitly attaches that backend, and when the process shuts down it automatically detaches it. While the current PR does not change the existing execution path, it leaves duplicated code and potential maintenance issues in the long run. Let me know your thoughts, and thanks again : )
Thanks for the review and suggestion!
It is a very useful PR. It supports model updates and fault tolerance.
If the CI still fails, merge main and run it again.
/rerun-failed-ci |
# }'
@app.api_route("/hicache/storage-backend", methods=["PUT"])
@auth_level(AuthLevel.ADMIN_OPTIONAL)
async def attach_hicache_storage_backend(obj: AttachHiCacheStorageReqInput):
switched to a more RESTful API, cc @slin1237 @stmatengss
Complies with router standard. LGTM.
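For reference, a hypothetical invocation of the new RESTful endpoints from the command line. The base URL and the JSON field names below are assumptions for illustration, not the PR's exact schema; see AttachHiCacheStorageReqInput in http_server.py for the actual request body. This requires a running SGLang server.

```shell
# Assumed server address; JSON fields are illustrative placeholders.
BASE=http://localhost:30000

# Attach (or switch to) a storage backend
curl -X PUT "$BASE/hicache/storage-backend" \
  -H "Content-Type: application/json" \
  -d '{"backend": "mooncake", "extra_config_json": "{}"}'

# Query the currently attached backend
curl "$BASE/hicache/storage-backend"

# Detach the current backend
curl -X DELETE "$BASE/hicache/storage-backend"
```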
@app.api_route("/clear_hicache_storage_backend", methods=["GET", "POST"])
@auth_level(AuthLevel.ADMIN_OPTIONAL)
async def clear_hicache_storage_backend_deprecated():
Motivation
Previously, the HiCache storage backend could only be configured at process startup. Changing the backend meant restarting the whole server, which hurts availability and makes operations clumsy.
This PR adds runtime attach/detach so operators can enable, disable, or switch the L3 storage backend without restarting. This is especially useful for:
Dynamic enable/switch of HiCache storage
Turn HiCache storage on or off, or switch backends, on the fly based on load, cost, or debugging needs.
Fault tolerance (failover)
Production backends can go bad (timeouts, misconfig, partial outages). With runtime detach, you can quickly stop sending traffic to a broken backend and avoid repeated IO failures impacting the serving path. Then you can attach a healthy backend to restore service, improving resilience and reducing MTTR.
Hot upgrade (switchover)
Backends often need upgrades or migrations (new cluster, protocol changes, config updates). With runtime attach/detach, you can perform a controlled switchover when traffic is low, without restarting the server, making “hot” transitions safer and easier to operate.
Modifications
The control path is:
- HTTP server (python/sglang/srt/entrypoints/http_server.py): new endpoints PUT /hicache/storage-backend, DELETE /hicache/storage-backend, GET /hicache/storage-backend
- Tokenizer manager (python/sglang/srt/managers/tokenizer_communicator_mixin.py): forwards the request to the scheduler via a _Communicator
- Scheduler (python/sglang/srt/managers/scheduler.py): calls tree_cache.attach_storage_backend(...) / detach_storage_backend(...)
- HiRadixCache (python/sglang/srt/mem_cache/hiradix_cache.py): parses storage_backend_extra_config_json (supports both backend config and prefetch knobs) and calls cache_controller.attach_storage_backend(...) / detach_storage_backend(...)
- HiCacheController (python/sglang/srt/managers/cache_controller.py): creates the backend via StorageBackendFactory

On the Scheduler side, add a strict idle-state check:
_is_idle_for_hicache_storage_op()

Conditions:
- _is_no_request() is True
- waiting_queue is empty
- grammar_queue is empty (if enabled)

HiCacheController adds runtime operations:
- attach_storage_backend(...): create the backend, register host buffers, and start prefetch/backup threads
- detach_storage_backend(): stop prefetch/backup threads and release the backend (best-effort close)

New/exposed HTTP APIs
- PUT /hicache/storage-backend
- DELETE /hicache/storage-backend
- GET /hicache/storage-backend

Flow Diagram
Attach
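In lieu of the attach diagram, here is a minimal sketch of the idle gate that an attach request must pass. SchedulerState and its fields are illustrative stand-ins, not the actual sglang Scheduler attributes; only the three conditions come from the PR description.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class SchedulerState:
    """Hypothetical stand-in for the Scheduler's queues."""

    running_requests: List[str] = field(default_factory=list)
    waiting_queue: List[str] = field(default_factory=list)
    grammar_queue: Optional[List[str]] = None  # None when grammar is disabled

    def _is_no_request(self) -> bool:
        return not self.running_requests

    def is_idle_for_hicache_storage_op(self) -> bool:
        # All three conditions from the PR description must hold:
        # no in-flight requests, empty waiting queue, and (if enabled)
        # an empty grammar queue.
        if not self._is_no_request():
            return False
        if self.waiting_queue:
            return False
        if self.grammar_queue:  # non-empty grammar queue blocks the op
            return False
        return True
```

A storage attach/detach arriving while any queue is non-empty is rejected, so operators should let in-flight traffic drain before switching backends.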
Detach
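Similarly, in lieu of the detach diagram, a minimal sketch of the controller-side lifecycle: starting worker threads on attach, and stopping them plus a best-effort close on detach. ToyBackend and ToyController are hypothetical names, not the PR's actual classes.

```python
import threading


class ToyBackend:
    """Stand-in for a backend created by StorageBackendFactory."""

    def __init__(self, name: str):
        self.name = name
        self.closed = False

    def close(self):
        self.closed = True


class ToyController:
    """Illustrative attach/detach lifecycle, loosely mirroring the PR."""

    def __init__(self):
        self.backend = None
        self._stop = threading.Event()
        self._workers = []

    def attach_storage_backend(self, name: str) -> bool:
        if self.backend is not None:
            return False  # must detach the current backend first
        self.backend = ToyBackend(name)
        self._stop.clear()
        for role in ("prefetch", "backup"):
            t = threading.Thread(target=self._worker_loop, name=role, daemon=True)
            t.start()
            self._workers.append(t)
        return True

    def detach_storage_backend(self) -> bool:
        if self.backend is None:
            return False
        self._stop.set()  # ask prefetch/backup threads to exit
        for t in self._workers:
            t.join(timeout=1.0)
        self._workers.clear()
        try:
            self.backend.close()  # best-effort close, as described above
        except Exception:
            pass
        self.backend = None
        return True

    def _worker_loop(self):
        # Stand-in for the real prefetch/backup loops.
        while not self._stop.wait(timeout=0.01):
            pass
```

Re-attaching after a detach clears the stop flag and spawns fresh workers, so a switchover is just detach followed by attach.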
Accuracy Tests
Benchmarking and Profiling
python3 -m pytest test/srt/hicache/test_hicache_storage_runtime_attach_detach.py -v

Checklist