[HiCache][HA 1/N] Support HiCache storage runtime attach/detach#15892

Merged
xiezhq-hermann merged 26 commits into sgl-project:main from alphabetc1:feat/hicache_store_runtime_attach_detach
Jan 27, 2026

Conversation


@alphabetc1 alphabetc1 commented Dec 26, 2025

Motivation

Previously, the HiCache storage backend could only be configured at process startup. Changing the backend meant restarting the whole server, which hurts availability and makes operations clumsy.

This PR adds runtime attach/detach so operators can enable, disable, or switch the L3 storage backend without restarting. This is especially useful for:

  • Dynamic enable/switch of HiCache storage
    Turn HiCache storage on or off, or switch backends, on the fly based on load, cost, or debugging needs.

  • Fault tolerance (failover)
    Production backends can go bad (timeouts, misconfig, partial outages). With runtime detach, you can quickly stop sending traffic to a broken backend and avoid repeated IO failures impacting the serving path. Then you can attach a healthy backend to restore service, improving resilience and reducing MTTR.

  • Hot upgrade (switchover)
    Backends often need upgrades or migrations (new cluster, protocol changes, config updates). With runtime attach/detach, you can perform a controlled switchover when traffic is low, without restarting the server, making “hot” transitions safer and easier to operate.

Modifications

The control path is:

  1. HTTP Server (python/sglang/srt/entrypoints/http_server.py)
    • Exposes PUT /hicache/storage-backend, DELETE /hicache/storage-backend, GET /hicache/storage-backend
  2. TokenizerManager (python/sglang/srt/managers/tokenizer_communicator_mixin.py)
    • Sends the request to the Scheduler via _Communicator
  3. Scheduler (python/sglang/srt/managers/scheduler.py)
    • Performs a strict idle check
    • Calls tree_cache.attach_storage_backend(...) / detach_storage_backend(...)
  4. HiRadixCache (python/sglang/srt/mem_cache/hiradix_cache.py)
    • Parses storage_backend_extra_config_json (supports both backend config and prefetch knobs)
    • Calls cache_controller.attach_storage_backend(...) / detach_storage_backend(...)
  5. HiCacheController (python/sglang/srt/managers/cache_controller.py)
    • Creates/destroys the storage backend instance (via StorageBackendFactory)
    • Starts/stops backend background threads at runtime (prefetch/backup)

On the Scheduler side, this PR adds a strict idle-state check, _is_idle_for_hicache_storage_op(), which requires all of the following:

  • _is_no_request() is True
  • waiting_queue is empty
  • grammar_queue is empty (if enabled)
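
The three conditions above amount to a single predicate. A minimal sketch follows; the class and attribute names here are illustrative assumptions, not the actual Scheduler fields:

```python
# Hedged sketch of the strict idle check; attribute names are assumptions.
class SchedulerSketch:
    def __init__(self):
        self.running_reqs = []    # in-flight requests (illustrative stand-in)
        self.waiting_queue = []   # requests waiting to be scheduled
        self.grammar_queue = []   # requests waiting on grammar compilation

    def _is_no_request(self) -> bool:
        return not self.running_reqs

    def _is_idle_for_hicache_storage_op(self) -> bool:
        # Attach/detach is only allowed when all three conditions hold.
        return (
            self._is_no_request()
            and not self.waiting_queue
            and not self.grammar_queue
        )
```

If any queue is non-empty, the operation is rejected and the client gets a failure response instead of racing in-flight requests.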

HiCacheController adds runtime operations:

  • attach_storage_backend(...): create the backend, register host buffers, and start prefetch/backup threads
  • detach_storage_backend(): stop prefetch/backup threads and release the backend (best-effort close)
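
As a rough sketch of how the controller sequences these two operations (class name, fields, and thread bodies are illustrative assumptions; the real code goes through StorageBackendFactory and registers host buffers):

```python
import threading


class ControllerSketch:
    """Illustrative attach/detach sequencing; not the actual HiCacheController."""

    def __init__(self):
        self.storage_backend = None
        self._stop = threading.Event()
        self._threads = []

    def attach_storage_backend(self, backend_name, extra_config=None):
        if self.storage_backend is not None:
            self.detach_storage_backend()  # switching = detach old, then attach new
        # Real code: StorageBackendFactory.create_backend(...) + host buffer registration.
        self.storage_backend = {"name": backend_name, "config": extra_config or {}}
        self._stop.clear()
        for name in ("prefetch", "backup"):
            # Stand-in workers: they just block until the stop event is set.
            t = threading.Thread(target=self._stop.wait, name=name, daemon=True)
            t.start()
            self._threads.append(t)
        return True, f"attached {backend_name}"

    def detach_storage_backend(self):
        # Stop background threads first so no IO races the backend teardown.
        self._stop.set()
        for t in self._threads:
            t.join(timeout=5)
        self._threads.clear()
        self.storage_backend = None  # best-effort close in the real code
        return True, "detached"
```

The key ordering property: threads are started only after the backend exists, and stopped before it is released.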

New/exposed HTTP APIs

  • Attach: PUT /hicache/storage-backend
  • Detach: DELETE /hicache/storage-backend
  • Status: GET /hicache/storage-backend

Flow Diagram

Attach

sequenceDiagram
  participant C as Client
  participant H as HTTP Server
  participant A as Auth Middleware
  participant T as TokenizerManager
  participant Q as _Communicator/ZMQ
  participant S as Scheduler
  participant R as HiRadixCache
  participant CC as HiCacheController
  participant SB as StorageBackend

  C->>H: PUT /hicache/storage-backend (json body)
  H->>A: auth check (ADMIN_OPTIONAL)
  A-->>H: allow/deny
  H->>H: admin_api_key configured?
  H->>T: attach_hicache_storage(...)
  T->>Q: send AttachHiCacheStorageReqInput
  Q->>S: dispatch by type
  S->>S: _is_idle_for_hicache_storage_op?
  S->>R: tree_cache.attach_storage_backend(...)
  R->>CC: cache_controller.attach_storage_backend(...)
  CC->>SB: StorageBackendFactory.create_backend(...)
  SB-->>CC: backend instance
  CC-->>R: threads started + flags set
  R-->>S: ok,msg
  S-->>Q: AttachHiCacheStorageReqOutput (per-rank)
  Q-->>T: merge results
  T-->>H: success + message
  H-->>C: 200/400

Detach

sequenceDiagram
  participant C as Client
  participant H as HTTP Server
  participant A as Auth Middleware
  participant T as TokenizerManager
  participant Q as _Communicator/ZMQ
  participant S as Scheduler
  participant R as HiRadixCache
  participant CC as HiCacheController
  participant SB as StorageBackend

  C->>H: DELETE /hicache/storage-backend
  H->>A: auth check (ADMIN_OPTIONAL)
  A-->>H: allow/deny
  H->>H: admin_api_key configured?
  H->>T: detach_hicache_storage()
  T->>Q: send DetachHiCacheStorageReqInput
  Q->>S: dispatch by type
  S->>S: _is_idle_for_hicache_storage_op?
  S->>R: tree_cache.detach_storage_backend()
  R->>CC: cache_controller.detach_storage_backend()
  CC->>SB: stop threads + close backend
  SB-->>CC: cleanup done
  CC-->>R: flags reset
  R-->>S: ok,msg
  S-->>Q: DetachHiCacheStorageReqOutput (per-rank)
  Q-->>T: merge results
  T-->>H: success + message
  H-->>C: 200/400

Accuracy Tests

Benchmarking and Profiling

  • UT
    python3 -m pytest test/srt/hicache/test_hicache_storage_runtime_attach_detach.py -v
  • E2E manual flow
# launch sglang with hierarchical-cache/admin-api-key enabled
export SGLANG_HICACHE_FILE_BACKEND_STORAGE_DIR=/root/code/tmp/sglang_hicache_file_test

python -m sglang.launch_server \
  --model-path /root/models/Meta-Llama-3.1-8B-Instruct \
  --host 0.0.0.0 --port 30000 \
  --enable-hierarchical-cache \
  --mem-fraction-static 0.3 \
  --page-size 64 \
  --hicache-ratio 2 \
  --admin-api-key 123 \
  --served-model-name test

# attach/update hicache storage
curl -s -X PUT http://127.0.0.1:30000/hicache/storage-backend \
  -H 'Authorization: Bearer 123' \
  -H 'Content-Type: application/json' \
  -d '{
    "hicache_storage_backend": "file"
  }'
 
# or, attach hicache storage with prefetch_policy/extra_config_json/write_policy
curl -s -X PUT http://127.0.0.1:30000/hicache/storage-backend \
  -H 'Authorization: Bearer 123' \
  -H 'Content-Type: application/json' \
  -d '{
    "hicache_storage_backend": "file",
    "hicache_storage_backend_extra_config_json": "{}",
    "hicache_storage_prefetch_policy": "wait_complete",
    "hicache_write_policy": "write_back"
  }'

# detach hicache storage
curl -s -X DELETE http://127.0.0.1:30000/hicache/storage-backend -H 'Authorization: Bearer 123'

# check current status
curl -s -X GET http://127.0.0.1:30000/hicache/storage-backend -H 'Authorization: Bearer 123'

Checklist

@github-actions github-actions bot added documentation Improvements or additions to documentation hicache Hierarchical Caching for SGLang labels Dec 26, 2025
@gemini-code-assist
Contributor

Summary of Changes

Hello @alphabetc1, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request delivers a crucial enhancement to SGLang's HiCache by enabling dynamic management of its L3 storage backend. Previously, changing storage configurations necessitated a full server restart, impacting availability. With this change, operators can now attach, detach, or switch storage backends on the fly, facilitating dynamic scaling, improving fault tolerance through quick failover, and simplifying hot upgrades. The implementation ensures operational safety by enforcing a strict idle-state requirement before any storage modification, and it exposes these capabilities via new, intuitive HTTP administration endpoints.

Highlights

  • Runtime HiCache Storage Management: Introduces the ability to attach, detach, and switch HiCache L3 storage backends at runtime without requiring a server restart, significantly improving operational flexibility and availability.
  • New HTTP Admin APIs: Adds new HTTP endpoints: POST /attach_hicache_storage_backend, POST /detach_hicache_storage_backend, and GET /hicache_storage_backend for managing and querying the storage backend status.
  • Strict Idle-State Check: Implements a strict idle-state check in the Scheduler to ensure that attach/detach operations only occur when no requests are running or queued, preventing consistency issues.
  • Dynamic Thread Management: The HiCacheController now includes dedicated mechanisms to start and stop storage-related background threads (prefetch/backup) dynamically during attach and detach operations.
  • Comprehensive Testing and Documentation: Includes a new E2E smoke test to validate the runtime attach/detach functionality and dedicated documentation explaining the feature, its architecture, and usage.



@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a valuable feature for runtime management of HiCache storage, allowing operators to attach and detach storage backends without server restarts. The implementation is robust, with careful attention to thread safety, error handling, and state consistency, particularly in the cache_controller.py. The addition of comprehensive documentation and an end-to-end test is also commendable. I have one suggestion to improve maintainability by refactoring some duplicated code.

Comment on lines +501 to +515
self.tp_world_size = torch.distributed.get_world_size(group=self.tp_group)
if self.tp_world_size > 1:
    group_ranks = torch.distributed.get_process_group_ranks(self.tp_group)
    self.prefetch_tp_group = torch.distributed.new_group(
        group_ranks, backend="gloo"
    )

self.page_get_func = self._generic_page_get
self.page_set_func = self._generic_page_set
if (self.storage_backend_type in ["hf3fs", "mooncake", "eic"]) or (
    self.storage_backend_type == "dynamic"
    and bool(self.storage_config.extra_config.get("interface_v1", 0))
):
    self.page_get_func = self._page_get_zero_copy
    self.page_set_func = self._page_set_zero_copy
Contributor

medium

There's some code duplication between this new attach_storage_backend method and the existing __init__ method.
Specifically:

  • The logic for creating prefetch_tp_group (lines 501-506).
  • The logic for selecting page_get_func and page_set_func (lines 508-515).

To improve maintainability and reduce redundancy, consider extracting these blocks into private helper methods. For example:

def _create_prefetch_tp_group(self):
    self.tp_world_size = torch.distributed.get_world_size(group=self.tp_group)
    if self.tp_world_size > 1:
        group_ranks = torch.distributed.get_process_group_ranks(self.tp_group)
        self.prefetch_tp_group = torch.distributed.new_group(
            group_ranks, backend="gloo"
        )
    else:
        self.prefetch_tp_group = None

def _select_page_transfer_funcs(self):
    self.page_get_func = self._generic_page_get
    self.page_set_func = self._generic_page_set
    if (self.storage_backend_type in ["hf3fs", "mooncake", "eic"]) or (
        self.storage_backend_type == "dynamic"
        and bool(self.storage_config.extra_config.get("interface_v1", 0))
    ):
        self.page_get_func = self._page_get_zero_copy
        self.page_set_func = self._page_set_zero_copy

Then you can call these helpers from both __init__ and attach_storage_backend.

@alphabetc1
Collaborator Author

TODO:
For endpoints that modify internal state, we may need an additional layer of authorization, for example as in #15908.

@alphabetc1
Collaborator Author

@xiezhq-hermann Hi, sorry to bother you — could you help review this PR? thanks

@xiezhq-hermann
Collaborator

Thank you @alphabetc1 for the PR, I quite like this feature, and I am wondering whether it would be possible to refactor the existing storage backend initialization to use the same attach and detach interfaces as well. For example, if the user specifies a storage backend, it implicitly attaches that backend, and when the process shuts down it automatically detaches it. While the current PR does not change the existing execution path, there is duplication and there are potential maintenance issues in the long run. Let me know your thoughts, and thanks again : )

@alphabetc1
Collaborator Author

> Thank you @alphabetc1 for the PR, I quite like this feature, and I am wondering whether it would be possible to refactor the existing storage backend initialization to use the same attach and detach interfaces as well. For example, if the user specifies a storage backend, it implicitly attaches that backend, and when the process shuts down it automatically detaches it. While the current PR does not change the existing execution path, there is duplication and there are potential maintenance issues in the long run. Let me know your thoughts, and thanks again : )

Thanks for the review and suggestion!
I totally agree, that’s also how I was thinking about it. I’ll update this PR to refactor the existing storage backend init to use the same attach/detach interfaces.

@alphabetc1 alphabetc1 force-pushed the feat/hicache_store_runtime_attach_detach branch from 1295769 to fab4275 Compare December 30, 2025 02:53
@alphabetc1
Collaborator Author

cc @xiezhq-hermann

@stmatengss
Collaborator

This is a very useful PR. It can support model updates and fault tolerance.

@stmatengss
Collaborator

If the CI still fails, merge main and rerun it.

@alphabetc1
Collaborator Author

/rerun-failed-ci

1 similar comment
@alphabetc1
Collaborator Author

alphabetc1 commented Jan 16, 2026

/rerun-failed-ci

@alphabetc1
Collaborator Author

alphabetc1 commented Jan 17, 2026

/rerun-failed-ci 4

@alphabetc1
Collaborator Author

alphabetc1 commented Jan 18, 2026

/rerun-failed-ci 1

# }'
@app.api_route("/hicache/storage-backend", methods=["PUT"])
@auth_level(AuthLevel.ADMIN_OPTIONAL)
async def attach_hicache_storage_backend(obj: AttachHiCacheStorageReqInput):
Collaborator Author

switched to a more RESTful API, cc @slin1237 @stmatengss

Collaborator

Complies with router standard. LGTM.

@alphabetc1
Collaborator Author

alphabetc1 commented Jan 20, 2026

/rerun-failed-ci

@alphabetc1
Collaborator Author

alphabetc1 commented Jan 20, 2026

/rerun-failed-ci 1

@alphabetc1 alphabetc1 changed the title [HiCache]: Support HiCache storage runtime attach/detach [HiCache][HA 1/N] Support HiCache storage runtime attach/detach Jan 22, 2026

@app.api_route("/clear_hicache_storage_backend", methods=["GET", "POST"])
@auth_level(AuthLevel.ADMIN_OPTIONAL)
async def clear_hicache_storage_backend_deprecated():
Collaborator

nice!!!

@xiezhq-hermann xiezhq-hermann merged commit fd3b179 into sgl-project:main Jan 27, 2026
405 of 424 checks passed
@alphabetc1 alphabetc1 deleted the feat/hicache_store_runtime_attach_detach branch February 6, 2026 08:09

Labels

documentation (Improvements or additions to documentation), hicache (Hierarchical Caching for SGLang), high priority, run-ci


5 participants