sgl-project · xiezhq-hermann · Jan 27, 2026 · Dec 26, 2025 · Dec 26, 2025 · Dec 27, 2025
diff --git a/docs/advanced_features/hicache.rst b/docs/advanced_features/hicache.rst
@@ -6,3 +6,4 @@ Hierarchical KV Caching (HiCache)
 
    hicache_best_practices.md
    hicache_design.md
+   hicache_storage_runtime_attach_detach.md
diff --git a/docs/advanced_features/hicache_best_practices.md b/docs/advanced_features/hicache_best_practices.md
@@ -19,6 +19,10 @@ SGLang HiCache extends the traditional RadixAttention with a three-tier hierarch
 --hicache-storage-backend             # Optional storage backend (e.g., hf3fs, mooncake, etc.)
 ```
 
+Notes:
+
+- Besides configuring `--hicache-storage-backend` at startup, SGLang also supports **runtime attach/detach** of the HiCache storage backend (no restart required) via HTTP admin endpoints. See [Runtime Attach/Detach HiCache Storage Backend](hicache_storage_runtime_attach_detach.md).
+
 ## Key Configurations with Storage Backends Enabled
 
 ### Memory Layout Optimization

diff --git a/docs/advanced_features/hicache_storage_runtime_attach_detach.md b/docs/advanced_features/hicache_storage_runtime_attach_detach.md
@@ -0,0 +1,132 @@
+# Runtime Attach/Detach HiCache Storage Backend (No Restart)
+
+This document explains how to **dynamically attach/detach the HiCache L3 storage backend at runtime** (e.g., `mooncake` / `hf3fs` / `nixl` / `file` / `aibrix` / `eic`) while **SGLang is already running and serving traffic**, without restarting the process.
+
+For safety and consistency, the current implementation **requires** that these operations run only when the service is **idle**:
+
+- **No running requests**
+- **No waiting/queued requests**
+
+If the idle condition is not met, the API will fail fast (HTTP 400) and **will not modify** the current service state.
+
+---
+
+## 1. Background and implementation overview
+
+### 1.1 Architecture / control path
+
+The control path is:
+
+1. **HTTP Server** (`python/sglang/srt/entrypoints/http_server.py`)
+   - Exposes `PUT /hicache/storage-backend`, `DELETE /hicache/storage-backend`, `GET /hicache/storage-backend`
+2. **TokenizerManager** (`python/sglang/srt/managers/tokenizer_communicator_mixin.py`)
+   - Sends the request to the Scheduler via `_Communicator`
+3. **Scheduler** (`python/sglang/srt/managers/scheduler.py`)
+   - Performs a **strict idle check**
+   - Calls `tree_cache.attach_storage_backend(...)` / `detach_storage_backend(...)`
+4. **HiRadixCache** (`python/sglang/srt/mem_cache/hiradix_cache.py`)
+   - Parses `hicache_storage_backend_extra_config_json` (supports both backend config and prefetch knobs)
+   - Calls `cache_controller.attach_storage_backend(...)` / `detach_storage_backend(...)`
+5. **HiCacheController** (`python/sglang/srt/managers/cache_controller.py`)
+   - Creates/destroys the storage backend instance (via `StorageBackendFactory`)
+   - Starts/stops backend background threads at runtime (prefetch/backup)
+
+---
+
+## 2. Idle-state requirement (strict)
+
+The Scheduler uses a dedicated check, `_is_idle_for_hicache_storage_op()`, which is stricter than `_is_no_request()` alone. All of the following must hold:
+
+- `_is_no_request()` is true (covers running/overlap/pp/disagg and other active states)
+- `waiting_queue` is empty
+- `grammar_queue` is empty (if the grammar backend is enabled)
+
+If the condition is not met, attach/detach returns an error like:
+
+- `Reject attach: scheduler is not idle. #queue-req=... #running-req=...`
+
+> Tip: before switching, drain upstream traffic and wait for the server to become idle, then call attach/detach.
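The idle rule above can be sketched as a small predicate. This is a minimal illustration under stated assumptions, not the Scheduler's actual code: `FakeScheduler`, its field layout, and the success message are hypothetical, and only the rejection message format is taken from the doc.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class FakeScheduler:
    """Hypothetical stand-in; the real Scheduler tracks far more state."""

    running: List[str] = field(default_factory=list)
    waiting_queue: List[str] = field(default_factory=list)
    # None models "grammar backend disabled".
    grammar_queue: Optional[List[str]] = None

    def is_idle_for_storage_op(self) -> bool:
        # Any running or queued request blocks the operation.
        if self.running or self.waiting_queue:
            return False
        # The grammar queue only matters when it exists and is non-empty.
        return not self.grammar_queue

    def attach(self, backend: str) -> Tuple[bool, str]:
        if not self.is_idle_for_storage_op():
            # Fail fast and leave the current state untouched.
            return False, (
                "Reject attach: scheduler is not idle. "
                f"#queue-req={len(self.waiting_queue)} "
                f"#running-req={len(self.running)}"
            )
        return True, f"attached {backend}"
```

The point of the sketch is the ordering: the check runs before any state changes, so a rejected call has no side effects.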
+
+### 2.1 DP (data parallel) semantics
+
+When `dp_size > 1`, the tokenizer dispatches the request to **all DP scheduler instances** and aggregates their responses:
+
+- The final `success` is **true only if all DP ranks return success**
+- The final `message` concatenates messages from all DP ranks
+
+This is intended to prevent “silent partial success”, but it also means you may see:
+
+- Overall **failure** even though **some ranks already succeeded**
+
+Currently there is **no automatic partial rollback** across DP ranks (see TODO in code). Operationally:
+
+- Prefer to keep backend config identical across ranks
+- If attach fails, immediately call detach (best-effort/idempotent), fix config, then retry attach
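The all-ranks-must-succeed aggregation described above can be sketched as follows. `RankResult`, `aggregate`, and the `" | "` separator are hypothetical names chosen for illustration; the real per-rank output type and message joining are not shown in this diff.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass
class RankResult:
    """Hypothetical per-rank reply; the real output type is not shown here."""

    success: bool
    message: str


def aggregate(results: Iterable[RankResult]) -> Tuple[bool, str]:
    results = list(results)
    # Overall success requires every DP rank to succeed.
    success = all(r.success for r in results)
    # Concatenate non-empty messages; the separator is an assumption.
    message = " | ".join(r.message for r in results if r.message)
    return success, message
```

Note that a `False` result here says nothing about which ranks already attached, which is why the doc recommends a follow-up detach before retrying.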
+
+---
+
+## 3. How to use (HTTP Admin API)
+
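+The examples below assume your SGLang HTTP server is at `http://127.0.0.1:30000`. The attach, detach, and status endpoints require the server to be started with `--admin-api-key`; without it, they return HTTP 400 with an error message.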
+
+### 3.1 Query current storage backend status
+
+```bash
+curl -s http://127.0.0.1:30000/hicache/storage-backend
+```
+
+Example response (abridged; the endpoint also returns `hicache_storage_prefetch_policy` and `hicache_write_policy`):
+
+```json
+{
+  "hicache_storage_backend": "mooncake",
+  "hicache_storage_backend_extra_config": "{\"master_server_address\":\"127.0.0.1:50051\", ...}"
+}
+```
+
+### 3.2 Attach (enable) a storage backend
+
+Minimal request (backend name only):
+
+```bash
+curl -s -X PUT http://127.0.0.1:30000/hicache/storage-backend \
+  -H 'Content-Type: application/json' \
+  -d '{
+    "hicache_storage_backend": "mooncake"
+  }'
+```
+
+Request with backend and prefetch configuration:
+
+```bash
+curl -s -X PUT http://127.0.0.1:30000/hicache/storage-backend \
+  -H 'Content-Type: application/json' \
+  -d '{
+    "hicache_storage_backend": "mooncake",
+    "hicache_storage_backend_extra_config_json": "{\"master_server_address\":\"127.0.0.1:50051\",\"protocol\":\"tcp\",\"global_segment_size\":\"4gb\",\"prefetch_threshold\":256}",
+    "hicache_storage_prefetch_policy": "timeout"
+  }'
+```
+
+Notes:
+
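+- The request body also accepts `hicache_write_policy` (e.g., `write_through`).
+- `hicache_storage_backend_extra_config_json` is itself a JSON-encoded string and can include both: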
+  - **Backend configuration** (e.g., Mooncake master/metadata/protocol, etc.)
+  - **Prefetch configuration** (`prefetch_threshold`, `prefetch_timeout_base`, `prefetch_timeout_per_ki_token`, `hicache_storage_pass_prefix_keys`)
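Because `hicache_storage_backend_extra_config_json` is a JSON string nested inside a JSON body, hand-escaping it is easy to get wrong. The sketch below builds the attach request with the standard library only. `build_attach_request` is a hypothetical helper, and the `Authorization: Bearer` header is an assumption about how the admin key is presented; check your deployment's auth setup.

```python
import json
import urllib.request
from typing import Any, Dict, Optional


def build_attach_request(
    base_url: str,
    backend: str,
    extra_config: Optional[Dict[str, Any]] = None,
    prefetch_policy: Optional[str] = None,
    admin_api_key: Optional[str] = None,
) -> urllib.request.Request:
    payload: Dict[str, Any] = {"hicache_storage_backend": backend}
    if extra_config is not None:
        # The field holds a JSON *string*, so serialize the dict first.
        payload["hicache_storage_backend_extra_config_json"] = json.dumps(extra_config)
    if prefetch_policy is not None:
        payload["hicache_storage_prefetch_policy"] = prefetch_policy
    headers = {"Content-Type": "application/json"}
    if admin_api_key is not None:
        # Assumption: bearer-token auth; not confirmed by this diff.
        headers["Authorization"] = f"Bearer {admin_api_key}"
    return urllib.request.Request(
        f"{base_url}/hicache/storage-backend",
        data=json.dumps(payload).encode("utf-8"),
        headers=headers,
        method="PUT",
    )
```

Passing the result to `urllib.request.urlopen` sends it; a non-2xx reply such as the HTTP 400 for a non-idle scheduler surfaces as `urllib.error.HTTPError`.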
+
+### 3.3 Detach (disable) the storage backend
+
+```bash
+curl -s -X DELETE http://127.0.0.1:30000/hicache/storage-backend
+```
+
+Notes:
+
+- Detach only makes SGLang **stop using** the L3 storage backend and stops prefetch/backup threads
+- It **does not automatically delete** data stored in Mooncake/HF3FS (or other remote backends)
+
+---
+
+## 4. Behavior and caveats
+
+- **No restart required**: attach/detach switches in-process at runtime
+- **Must be idle**: otherwise the request is rejected to avoid consistency issues
+- **Host KV layout constraints still apply**: for example, Mooncake still requires layouts like `page_first/page_first_direct/page_head`; if the server's HiCache host-memory layout does not satisfy the backend requirements, attach will fail with an error
+- **Observability**:
+  - After attach, `server_args.hicache_storage_backend*` is updated on both the tokenizer and scheduler sides
+  - If metrics are enabled, attach will create a storage metrics collector in `HiRadixCache` on demand
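diff --git a/python/sglang/srt/entrypoints/http_server.py b/python/sglang/srt/entrypoints/http_server.py
@@ -93,6 +93,7 @@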
 from sglang.srt.function_call.function_call_parser import FunctionCallParser
 from sglang.srt.managers.io_struct import (
     AbortReq,
+    AttachHiCacheStorageReqInput,
     CheckWeightsReqInput,
     CloseSessionReqInput,
     ConfigureLoggingReq,
@@ -693,6 +694,22 @@ async def flush_cache():
 
 @app.api_route("/clear_hicache_storage_backend", methods=["GET", "POST"])
 @auth_level(AuthLevel.ADMIN_OPTIONAL)
+async def clear_hicache_storage_backend_deprecated():
+    """Deprecated: use POST /hicache/storage-backend/clear."""
+    ret = await _global_state.tokenizer_manager.clear_hicache_storage()
+    return Response(
+        content=(
+            "Deprecated endpoint. Use POST /hicache/storage-backend/clear.\n"
+            "Hierarchical cache storage backend cleared.\n"
+        ),
+        status_code=200 if ret.success else HTTPStatus.BAD_REQUEST,
+    )
+
+
+# example usage:
+# curl -s -X POST http://127.0.0.1:30000/hicache/storage-backend/clear
+@app.api_route("/hicache/storage-backend/clear", methods=["POST"])
+@auth_level(AuthLevel.ADMIN_OPTIONAL)
 async def clear_hicache_storage_backend():
     """Clear the hierarchical cache storage backend."""
     ret = await _global_state.tokenizer_manager.clear_hicache_storage()
@@ -702,6 +719,89 @@ async def clear_hicache_storage_backend():
     )
 
 
+# example usage:
+# curl -s -X PUT http://127.0.0.1:30000/hicache/storage-backend \
+#   -H 'Content-Type: application/json' \
+#   -d '{
+#     "hicache_storage_backend": "file",
+#     "hicache_storage_backend_extra_config_json": "{}",
+#     "hicache_storage_prefetch_policy": "timeout",
+#     "hicache_write_policy": "write_through"
+#   }'
+@app.api_route("/hicache/storage-backend", methods=["PUT"])
+@auth_level(AuthLevel.ADMIN_OPTIONAL)
+async def attach_hicache_storage_backend(obj: AttachHiCacheStorageReqInput):
+    """Attach (enable) HiCache storage backend at runtime.
+
+    Only allowed when there are NO running / queued requests.
+    """
+    if not _global_state.tokenizer_manager.server_args.admin_api_key:
+        return _admin_api_key_missing_response()
+
+    ret = await _global_state.tokenizer_manager.attach_hicache_storage(
+        hicache_storage_backend=obj.hicache_storage_backend,
+        hicache_storage_backend_extra_config_json=obj.hicache_storage_backend_extra_config_json,
+        hicache_storage_prefetch_policy=obj.hicache_storage_prefetch_policy,
+        hicache_write_policy=obj.hicache_write_policy,
+    )
+    msg = getattr(ret, "message", "")
+    return Response(
+        content=(
+            (
+                "HiCache storage backend attached.\n"
+                if ret.success
+                else "Failed to attach HiCache storage backend.\n"
+            )
+            + (msg + "\n" if msg else "")
+        ),
+        status_code=200 if ret.success else HTTPStatus.BAD_REQUEST,
+    )
+
+
+# example usage:
+# curl -s -X DELETE http://127.0.0.1:30000/hicache/storage-backend
+@app.api_route("/hicache/storage-backend", methods=["DELETE"])
+@auth_level(AuthLevel.ADMIN_OPTIONAL)
+async def detach_hicache_storage_backend():
+    """Detach (disable) HiCache storage backend at runtime.
+
+    Only allowed when there are NO running / queued requests.
+    """
+    if not _global_state.tokenizer_manager.server_args.admin_api_key:
+        return _admin_api_key_missing_response()
+
+    ret = await _global_state.tokenizer_manager.detach_hicache_storage()
+    msg = getattr(ret, "message", "")
+    return Response(
+        content=(
+            (
+                "HiCache storage backend detached.\n"
+                if ret.success
+                else "Failed to detach HiCache storage backend.\n"
+            )
+            + (msg + "\n" if msg else "")
+        ),
+        status_code=200 if ret.success else HTTPStatus.BAD_REQUEST,
+    )
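The attach and detach handlers above build their response text and status code the same way. One possible way to share that logic is sketched below; `format_storage_op_result` is a hypothetical helper, not part of this change, and it returns plain values so it can be checked without a running server.

```python
from http import HTTPStatus
from typing import Tuple


def format_storage_op_result(
    action: str, success: bool, message: str = ""
) -> Tuple[str, int]:
    # Map the verb to its past tense for the success headline.
    past = {"attach": "attached", "detach": "detached"}[action]
    headline = (
        f"HiCache storage backend {past}.\n"
        if success
        else f"Failed to {action} HiCache storage backend.\n"
    )
    # Append the scheduler's message only when there is one.
    body = headline + (message + "\n" if message else "")
    status = 200 if success else int(HTTPStatus.BAD_REQUEST)
    return body, status
```

Each handler could then wrap the returned pair in a `Response`, keeping the wording of both endpoints in one place.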
+
+
+# example usage:
+# curl -s http://127.0.0.1:30000/hicache/storage-backend
+@app.get("/hicache/storage-backend")
+@auth_level(AuthLevel.ADMIN_OPTIONAL)
+async def hicache_storage_backend_status():
+    """Get current HiCache storage backend status (tokenizer-side view)."""
+    if not _global_state.tokenizer_manager.server_args.admin_api_key:
+        return _admin_api_key_missing_response()
+
+    return {
+        "hicache_storage_backend": _global_state.tokenizer_manager.server_args.hicache_storage_backend,
+        "hicache_storage_backend_extra_config": _global_state.tokenizer_manager.server_args.hicache_storage_backend_extra_config,
+        "hicache_storage_prefetch_policy": _global_state.tokenizer_manager.server_args.hicache_storage_prefetch_policy,
+        "hicache_write_policy": _global_state.tokenizer_manager.server_args.hicache_write_policy,
+    }
+
+
 @app.api_route("/start_profile", methods=["GET", "POST"])
 @auth_level(AuthLevel.ADMIN_OPTIONAL)
 async def start_profile_async(obj: Optional[ProfileReqInput] = None):
@@ -1489,6 +1589,27 @@ def _create_error_response(e):
     )
 
 
+# FIXME: In theory we should configure ADMIN_FORCE for some entrypoints, but doing so
+# would currently cause all endpoints to go through add_api_key_middleware
+# (even when neither api-key nor admin-api-key is configured).
+#
+# For now, we simulate ADMIN_FORCE by explicitly checking the admin API key parameter.
+# Once the auth wiring is refactored so ADMIN_FORCE only affects the intended
+# admin endpoints, we should switch this logic to use ADMIN_FORCE directly.
+def _admin_api_key_missing_response(
+    status_code: HTTPStatus = HTTPStatus.BAD_REQUEST,
+) -> ORJSONResponse:
+    return ORJSONResponse(
+        content={
+            "error": (
+                "This endpoint requires admin API key, but this server was started "
+                "without one (admin-api-key). Restart with --admin-api-key to enable."
+            )
+        },
+        status_code=status_code,
+    )
+
+
 # Minimal 32x32 black PNG (base64, GLM4v requires at least 32x32 sized image)
 MINIMUM_PNG_PICTURE_BASE64 = "iVBORw0KGgoAAAANSUhEUgAAACAAAAAgCAYAAABzenr0AAAACXBIWXMAAA7EAAAOxAGVKw4bAAAAbUlEQVRYhe3VsQ2AMAxE0Y/lIgNQULD/OqyCMgCihCKSG4yRuKuiNH6JLsoEbMACOGBcua9HOR7Y6w6swBwMy0qLTpkeI77qdEBpBFAHBBDAGH8WrwJKI4AAegUCfAKgEgpQDvh3CR3oQCuav58qlAw73kKCSgAAAABJRU5ErkJggg=="