Merged
26 commits
2b56246
[HiCache]: support runtime attach/detach hicache storage
alphabetc1 Dec 26, 2025
72e3929
add ut
alphabetc1 Dec 26, 2025
1b51810
support hicache_storage_prefetch_policy
alphabetc1 Dec 27, 2025
003f7b2
fix
alphabetc1 Dec 27, 2025
775c998
refactor the existing storage backend init to use the same attach/det…
alphabetc1 Dec 29, 2025
fab4275
fix ci
alphabetc1 Dec 30, 2025
9fac448
fix
alphabetc1 Dec 30, 2025
e878adf
support update hicache_write_policy
alphabetc1 Jan 4, 2026
6033659
support config switch
alphabetc1 Jan 4, 2026
5a130de
Merge remote-tracking branch 'origin/main' into feat/hicache_store_ru…
alphabetc1 Jan 4, 2026
59a479a
fix mtr
alphabetc1 Jan 6, 2026
2934b8a
Merge branch 'main' into feat/hicache_store_runtime_attach_detach
alphabetc1 Jan 6, 2026
908fa97
Merge remote-tracking branch 'origin/main' into feat/hicache_store_ru…
alphabetc1 Jan 6, 2026
86da98a
Merge branch 'main' into feat/hicache_store_runtime_attach_detach
alphabetc1 Jan 9, 2026
bb7e8d7
Merge branch 'main' into feat/hicache_store_runtime_attach_detach
alphabetc1 Jan 14, 2026
5d384fb
add security
alphabetc1 Jan 14, 2026
c23477c
Merge branch 'main' into feat/hicache_store_runtime_attach_detach
alphabetc1 Jan 15, 2026
b8fe011
Merge branch 'main' into feat/hicache_store_runtime_attach_detach
alphabetc1 Jan 16, 2026
105e7d5
mock ADMIN_FORCE
alphabetc1 Jan 17, 2026
0ef30a8
Merge branch 'main' into feat/hicache_store_runtime_attach_detach
alphabetc1 Jan 17, 2026
4e6b48b
make API more RESTful
alphabetc1 Jan 19, 2026
a6f0610
Merge branch 'main' into feat/hicache_store_runtime_attach_detach
alphabetc1 Jan 19, 2026
b25a6c7
Merge branch 'main' into feat/hicache_store_runtime_attach_detach
alphabetc1 Jan 20, 2026
ad084ba
Merge branch 'main' into feat/hicache_store_runtime_attach_detach
alphabetc1 Jan 22, 2026
fc912a5
fix rebase
alphabetc1 Jan 23, 2026
6a1cf31
refactor drain_storage_control_queues
alphabetc1 Jan 24, 2026
1 change: 1 addition & 0 deletions docs/advanced_features/hicache.rst
@@ -6,3 +6,4 @@ Hierarchical KV Caching (HiCache)

hicache_best_practices.md
hicache_design.md
hicache_storage_runtime_attach_detach.md
4 changes: 4 additions & 0 deletions docs/advanced_features/hicache_best_practices.md
@@ -19,6 +19,10 @@ SGLang HiCache extends the traditional RadixAttention with a three-tier hierarch
--hicache-storage-backend # Optional storage backend (e.g., hf3fs, mooncake, etc.)
```

Notes:

- Besides configuring `--hicache-storage-backend` at startup, SGLang also supports **runtime attach/detach** of the HiCache storage backend (no restart required) via HTTP admin endpoints. See [Runtime Attach/Detach HiCache Storage Backend](hicache_storage_runtime_attach_detach.md).

## Key Configurations with Storage Backends Enabled

### Memory Layout Optimization
132 changes: 132 additions & 0 deletions docs/advanced_features/hicache_storage_runtime_attach_detach.md
@@ -0,0 +1,132 @@
# Runtime Attach/Detach HiCache Storage Backend (No Restart)

This document explains how to **dynamically attach/detach the HiCache L3 storage backend at runtime** (e.g., `mooncake` / `hf3fs` / `nixl` / `file` / `aibrix` / `eic`) while **SGLang is already running and serving traffic**, without restarting the process.

For safety and consistency, the current implementation **strictly requires** these operations to happen only when the service is **idle**:

- **No running requests**
- **No waiting/queued requests**

If the idle condition is not met, the API will fail fast (HTTP 400) and **will not modify** the current service state.

---

## 1. Background and implementation overview

### 1.1 Architecture / control path

The control path is:

1. **HTTP Server** (`python/sglang/srt/entrypoints/http_server.py`)
   - Exposes `PUT /hicache/storage-backend`, `DELETE /hicache/storage-backend`, `GET /hicache/storage-backend`
2. **TokenizerManager** (`python/sglang/srt/managers/tokenizer_communicator_mixin.py`)
   - Sends the request to the Scheduler via `_Communicator`
3. **Scheduler** (`python/sglang/srt/managers/scheduler.py`)
   - Performs a **strict idle check**
   - Calls `tree_cache.attach_storage_backend(...)` / `detach_storage_backend(...)`
4. **HiRadixCache** (`python/sglang/srt/mem_cache/hiradix_cache.py`)
   - Parses `hicache_storage_backend_extra_config_json` (supports both backend config and prefetch knobs)
   - Calls `cache_controller.attach_storage_backend(...)` / `detach_storage_backend(...)`
5. **HiCacheController** (`python/sglang/srt/managers/cache_controller.py`)
   - Creates/destroys the storage backend instance (via `StorageBackendFactory`)
   - Starts/stops backend background threads at runtime (prefetch/backup)

---

## 2. Idle-state requirement (strict)

The Scheduler applies a dedicated idle check, `_is_idle_for_hicache_storage_op()`, which is stricter than `_is_no_request()` alone. All of the following must hold:

- `_is_no_request()` is true (covers running/overlap/pp/disagg and other active states)
- `waiting_queue` is empty
- `grammar_queue` is empty (if the grammar backend is enabled)

If the condition is not met, attach/detach returns an error like:

- `Reject attach: scheduler is not idle. #queue-req=... #running-req=...`

> Tip: before switching, drain upstream traffic and wait for the server to become idle, then call attach/detach.
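
The three conditions above can be modeled in a few lines. This is a minimal sketch, not the Scheduler's code: `SchedulerSnapshot` and `check_idle` are hypothetical names, `running_reqs` stands in for everything `_is_no_request()` covers, and counting the grammar queue under `#queue-req` is an assumption made for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class SchedulerSnapshot:
    """Hypothetical, simplified view of scheduler state (not SGLang's real class)."""

    running_reqs: int = 0  # stands in for what _is_no_request() covers
    waiting_queue: List[str] = field(default_factory=list)
    grammar_queue: Optional[List[str]] = None  # None: no grammar backend enabled


def check_idle(op: str, state: SchedulerSnapshot) -> Tuple[bool, str]:
    """Return (ok, message); ok is True only if nothing is running or queued."""
    grammar_pending = len(state.grammar_queue) if state.grammar_queue is not None else 0
    # Assumption: grammar-queue entries are reported together with waiting requests.
    queued = len(state.waiting_queue) + grammar_pending
    if state.running_reqs == 0 and queued == 0:
        return True, ""
    return False, (
        f"Reject {op}: scheduler is not idle. "
        f"#queue-req={queued} #running-req={state.running_reqs}"
    )
```

The check is all-or-nothing by design: a single queued request is enough to reject the operation, which matches the fail-fast behavior described above.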

### 2.1 DP (data parallel) semantics

When `dp_size > 1`, the tokenizer dispatches the request to **all DP scheduler instances** and aggregates their responses:

- The final `success` is **true only if all DP ranks return success**
- The final `message` concatenates messages from all DP ranks

This is intended to prevent “silent partial success”, but it also means you may see an overall **failure** even though **some ranks already succeeded**.

Currently there is **no automatic partial rollback** across DP ranks (see TODO in code). Operationally:

- Prefer to keep backend config identical across ranks
- If attach fails, immediately call detach (best-effort/idempotent), fix config, then retry attach
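
The aggregation rule and the recommended recovery step can be sketched as follows. Everything here is illustrative: `RankResult`, `aggregate`, and `attach_with_cleanup` are hypothetical names, the `"; "` separator and `dp<i>:` prefix are assumptions (the text only says messages are concatenated), and the attach/detach callables stand in for whatever client you use.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class RankResult:
    success: bool
    message: str = ""


def aggregate(results: Iterable[RankResult]) -> RankResult:
    """Succeed only if every DP rank succeeded; join the non-empty messages."""
    ranks: List[RankResult] = list(results)
    # An empty result list is treated as failure rather than vacuous success.
    ok = bool(ranks) and all(r.success for r in ranks)
    message = "; ".join(
        f"dp{i}: {r.message}" for i, r in enumerate(ranks) if r.message
    )
    return RankResult(ok, message)


def attach_with_cleanup(
    attach: Callable[[], Iterable[RankResult]],
    detach: Callable[[], object],
) -> RankResult:
    """Attach on all ranks; on any failure, detach so no rank stays attached."""
    result = aggregate(attach())
    if not result.success:
        detach()  # best-effort cleanup, since there is no automatic rollback
    return result
```

After a failed `attach_with_cleanup`, fix the configuration and call it again; the cleanup detach is what keeps ranks from diverging between attempts.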

---

## 3. How to use (HTTP Admin API)

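The examples below assume your SGLang HTTP server is at `http://127.0.0.1:30000`. The attach, detach, and status endpoints also require the server to have been started with `--admin-api-key`; without it they return HTTP 400 with an error explaining that the key is missing.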

### 3.1 Query current storage backend status

```bash
curl -s http://127.0.0.1:30000/hicache/storage-backend
```

Example response:

```json
{
  "hicache_storage_backend": "mooncake",
  "hicache_storage_backend_extra_config": "{\"master_server_address\":\"127.0.0.1:50051\", ...}",
  "hicache_storage_prefetch_policy": "timeout",
  "hicache_write_policy": "write_through"
}
```
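
A client usually wants the extra config as a dictionary rather than a JSON string. The sketch below shows one way to normalize the status payload. `summarize_status` is a hypothetical helper, and two things are assumptions: that `hicache_storage_backend` is `null` when nothing is attached, and that the extra config arrives as a JSON-encoded string as in the example response.

```python
import json
from typing import Any, Dict


def summarize_status(payload: Dict[str, Any]) -> Dict[str, Any]:
    """Normalize a status payload into attached/backend/extra_config fields."""
    backend = payload.get("hicache_storage_backend")
    raw = payload.get("hicache_storage_backend_extra_config")
    if isinstance(raw, str) and raw:
        extra = json.loads(raw)  # the config is a JSON document inside a string
    elif isinstance(raw, dict):
        extra = raw  # tolerate a server that returns an object directly
    else:
        extra = {}
    return {
        # Assumption: a missing or null backend means nothing is attached.
        "attached": backend is not None,
        "backend": backend,
        "extra_config": extra,
    }
```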

### 3.2 Attach (enable) a storage backend

A minimal request names only the backend:

```bash
curl -s -X PUT http://127.0.0.1:30000/hicache/storage-backend \
  -H 'Content-Type: application/json' \
  -d '{
    "hicache_storage_backend": "mooncake"
  }'
```

To pass backend configuration and a prefetch policy in the same request:

```bash
curl -s -X PUT http://127.0.0.1:30000/hicache/storage-backend \
  -H 'Content-Type: application/json' \
  -d '{
    "hicache_storage_backend": "mooncake",
    "hicache_storage_backend_extra_config_json": "{\"master_server_address\":\"127.0.0.1:50051\",\"protocol\":\"tcp\",\"global_segment_size\":\"4gb\",\"prefetch_threshold\":256}",
    "hicache_storage_prefetch_policy": "timeout"
  }'
```

Notes:

- `hicache_storage_backend_extra_config_json` can include both:
  - **Backend configuration** (e.g., Mooncake master/metadata/protocol, etc.)
  - **Prefetch configuration** (`prefetch_threshold`, `prefetch_timeout_base`, `prefetch_timeout_per_ki_token`, `hicache_storage_pass_prefix_keys`)
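
Because `hicache_storage_backend_extra_config_json` is a JSON document embedded in a JSON string, hand-escaping it is error-prone. The sketch below builds the same PUT request with Python's standard library and lets `json.dumps` handle both encoding layers. `build_attach_request` is a hypothetical helper; it only constructs the request, and it does not add any authentication header, which your deployment will need to supply for the admin key.

```python
import json
import urllib.request
from typing import Any, Dict, Optional


def build_attach_request(
    base_url: str,
    backend: str,
    extra_config: Optional[Dict[str, Any]] = None,
    prefetch_policy: Optional[str] = None,
    write_policy: Optional[str] = None,
) -> urllib.request.Request:
    """Build (but do not send) a PUT /hicache/storage-backend request."""
    body: Dict[str, Any] = {"hicache_storage_backend": backend}
    if extra_config is not None:
        # Inner encoding: the config dict becomes a JSON *string* field.
        body["hicache_storage_backend_extra_config_json"] = json.dumps(extra_config)
    if prefetch_policy is not None:
        body["hicache_storage_prefetch_policy"] = prefetch_policy
    if write_policy is not None:
        body["hicache_write_policy"] = write_policy
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/hicache/storage-backend",
        data=json.dumps(body).encode("utf-8"),  # outer encoding: the request body
        headers={"Content-Type": "application/json"},
        method="PUT",
    )
```

Sending is then a matter of passing the result to `urllib.request.urlopen`; keep in mind that `urlopen` raises `urllib.error.HTTPError` on the HTTP 400 returned when the scheduler is not idle.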

### 3.3 Detach (disable) the storage backend

```bash
curl -s -X DELETE http://127.0.0.1:30000/hicache/storage-backend
```

Notes:

- Detach only makes SGLang **stop using** the L3 storage backend and stops prefetch/backup threads
- It **does not automatically delete** data stored in Mooncake/HF3FS (or other remote backends)

---

## 4. Behavior and caveats

- **No restart required**: attach/detach switches in-process at runtime
- **Must be idle**: otherwise the request is rejected to avoid consistency issues
- **Host KV layout constraints still apply**: for example, Mooncake still requires layouts like `page_first/page_first_direct/page_head`; if the server's HiCache host-memory layout does not satisfy the backend requirements, attach will fail with an error
- **Observability**:
  - After attach, `server_args.hicache_storage_backend*` is updated on both the tokenizer and scheduler sides
  - If metrics are enabled, attach will create a storage metrics collector in `HiRadixCache` on demand
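
Since a non-idle server rejects the call instead of waiting, automation typically wraps attach/detach in a bounded retry. The following is a minimal sketch under stated assumptions: `call_when_idle` is a hypothetical helper, `op` is any callable you provide that performs the HTTP call and returns `(status_code, response_text)`, and the "not idle" substring match relies on the error message format shown in section 2. Other HTTP 400 failures (for example a bad backend config) are deliberately not retried.

```python
import time
from typing import Callable, Tuple

Reply = Tuple[int, str]  # (HTTP status code, response text)


def call_when_idle(
    op: Callable[[], Reply],
    attempts: int = 5,
    delay_s: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Reply:
    """Call op, retrying only while the server reports it is not idle."""
    if attempts < 1:
        raise ValueError("attempts must be >= 1")
    reply = op()
    for _ in range(attempts - 1):
        status, text = reply
        if not (status == 400 and "not idle" in text):
            break  # success, or a failure that waiting will not fix
        sleep(delay_s)
        reply = op()
    return reply
```

Draining upstream traffic first remains the primary mechanism; the retry only absorbs the short window while in-flight requests finish.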
121 changes: 121 additions & 0 deletions python/sglang/srt/entrypoints/http_server.py
@@ -93,6 +93,7 @@
from sglang.srt.function_call.function_call_parser import FunctionCallParser
from sglang.srt.managers.io_struct import (
    AbortReq,
    AttachHiCacheStorageReqInput,
    CheckWeightsReqInput,
    CloseSessionReqInput,
    ConfigureLoggingReq,
@@ -693,6 +694,22 @@ async def flush_cache():

@app.api_route("/clear_hicache_storage_backend", methods=["GET", "POST"])
@auth_level(AuthLevel.ADMIN_OPTIONAL)
async def clear_hicache_storage_backend_deprecated():
[Review comment, Collaborator]: nice!!!

    """Deprecated: use POST /hicache/storage-backend/clear."""
    ret = await _global_state.tokenizer_manager.clear_hicache_storage()
    return Response(
        content=(
            "Deprecated endpoint. Use POST /hicache/storage-backend/clear.\n"
            "Hierarchical cache storage backend cleared.\n"
        ),
        status_code=200 if ret.success else HTTPStatus.BAD_REQUEST,
    )


# example usage:
# curl -s -X POST http://127.0.0.1:30000/hicache/storage-backend/clear
@app.api_route("/hicache/storage-backend/clear", methods=["POST"])
@auth_level(AuthLevel.ADMIN_OPTIONAL)
async def clear_hicache_storage_backend():
    """Clear the hierarchical cache storage backend."""
    ret = await _global_state.tokenizer_manager.clear_hicache_storage()
@@ -702,6 +719,89 @@ async def clear_hicache_storage_backend():
    )


# example usage:
# curl -s -X PUT http://127.0.0.1:30000/hicache/storage-backend \
#   -H 'Content-Type: application/json' \
#   -d '{
#     "hicache_storage_backend": "file",
#     "hicache_storage_backend_extra_config_json": "{}",
#     "hicache_storage_prefetch_policy": "timeout",
#     "hicache_write_policy": "write_through"
#   }'
@app.api_route("/hicache/storage-backend", methods=["PUT"])
@auth_level(AuthLevel.ADMIN_OPTIONAL)
async def attach_hicache_storage_backend(obj: AttachHiCacheStorageReqInput):
[Review comment, Collaborator Author]: switched to a more RESTful API, cc @slin1237 @stmatengss
[Review comment, Collaborator]: Complies with router standard. LGTM.

    """Attach (enable) HiCache storage backend at runtime.

    Only allowed when there are NO running / queued requests.
    """
    if not _global_state.tokenizer_manager.server_args.admin_api_key:
        return _admin_api_key_missing_response()

    ret = await _global_state.tokenizer_manager.attach_hicache_storage(
        hicache_storage_backend=obj.hicache_storage_backend,
        hicache_storage_backend_extra_config_json=obj.hicache_storage_backend_extra_config_json,
        hicache_storage_prefetch_policy=obj.hicache_storage_prefetch_policy,
        hicache_write_policy=obj.hicache_write_policy,
    )
    msg = getattr(ret, "message", "")
    return Response(
        content=(
            (
                "HiCache storage backend attached.\n"
                if ret.success
                else "Failed to attach HiCache storage backend.\n"
            )
            + (msg + "\n" if msg else "")
        ),
        status_code=200 if ret.success else HTTPStatus.BAD_REQUEST,
    )


# example usage:
# curl -s -X DELETE http://127.0.0.1:30000/hicache/storage-backend
@app.api_route("/hicache/storage-backend", methods=["DELETE"])
@auth_level(AuthLevel.ADMIN_OPTIONAL)
async def detach_hicache_storage_backend():
    """Detach (disable) HiCache storage backend at runtime.

    Only allowed when there are NO running / queued requests.
    """
    if not _global_state.tokenizer_manager.server_args.admin_api_key:
        return _admin_api_key_missing_response()

    ret = await _global_state.tokenizer_manager.detach_hicache_storage()
    msg = getattr(ret, "message", "")
    return Response(
        content=(
            (
                "HiCache storage backend detached.\n"
                if ret.success
                else "Failed to detach HiCache storage backend.\n"
            )
            + (msg + "\n" if msg else "")
        ),
        status_code=200 if ret.success else HTTPStatus.BAD_REQUEST,
    )


# example usage:
# curl -s http://127.0.0.1:30000/hicache/storage-backend
@app.get("/hicache/storage-backend")
@auth_level(AuthLevel.ADMIN_OPTIONAL)
async def hicache_storage_backend_status():
    """Get current HiCache storage backend status (tokenizer-side view)."""
    if not _global_state.tokenizer_manager.server_args.admin_api_key:
        return _admin_api_key_missing_response()

    return {
        "hicache_storage_backend": _global_state.tokenizer_manager.server_args.hicache_storage_backend,
        "hicache_storage_backend_extra_config": _global_state.tokenizer_manager.server_args.hicache_storage_backend_extra_config,
        "hicache_storage_prefetch_policy": _global_state.tokenizer_manager.server_args.hicache_storage_prefetch_policy,
        "hicache_write_policy": _global_state.tokenizer_manager.server_args.hicache_write_policy,
    }
[Review comment, Collaborator]: Consider some other status, such as hicache_write_policy
[Review comment, Collaborator Author]: done



@app.api_route("/start_profile", methods=["GET", "POST"])
@auth_level(AuthLevel.ADMIN_OPTIONAL)
async def start_profile_async(obj: Optional[ProfileReqInput] = None):
@@ -1489,6 +1589,27 @@ def _create_error_response(e):
)


# FIXME: In theory we should configure ADMIN_FORCE for some entrypoints, but doing so
# would currently cause all endpoints to go through add_api_key_middleware
# (even when neither api-key nor admin-api-key is configured).
#
# For now, we simulate ADMIN_FORCE by explicitly checking the admin API key parameter.
# Once the auth wiring is refactored so ADMIN_FORCE only affects the intended
# admin endpoints, we should switch this logic to use ADMIN_FORCE directly.
def _admin_api_key_missing_response(
    status_code: HTTPStatus = HTTPStatus.BAD_REQUEST,
) -> ORJSONResponse:
    return ORJSONResponse(
        content={
            "error": (
                "This endpoint requires admin API key, but this server was started "
                "without one (admin-api-key). Restart with --admin-api-key to enable."
            )
        },
        status_code=status_code,
    )
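
The three new handlers each repeat the same `admin_api_key` check. Until the `ADMIN_FORCE` wiring mentioned in the FIXME is refactored, one possible way to centralize that check is a small decorator. This is a sketch only, not part of the PR: `require_admin_api_key`, `get_key`, and `on_missing` are hypothetical names, and the example is framework-free so it can run on its own.

```python
import asyncio
import functools
from typing import Any, Awaitable, Callable, Optional


def require_admin_api_key(
    get_key: Callable[[], Optional[str]],
    on_missing: Callable[[], Any],
) -> Callable[[Callable[..., Awaitable[Any]]], Callable[..., Awaitable[Any]]]:
    """Wrap an async handler so it short-circuits when no admin key is configured."""

    def decorator(handler: Callable[..., Awaitable[Any]]) -> Callable[..., Awaitable[Any]]:
        @functools.wraps(handler)  # keep the handler's name, docstring, and signature
        async def wrapper(*args: Any, **kwargs: Any) -> Any:
            if not get_key():
                return on_missing()  # e.g. the "admin key missing" error response
            return await handler(*args, **kwargs)

        return wrapper

    return decorator


if __name__ == "__main__":
    # Hypothetical stand-ins for server_args.admin_api_key and the error response.
    config = {"admin_api_key": None}

    @require_admin_api_key(lambda: config["admin_api_key"], lambda: {"error": "admin key missing"})
    async def detach() -> str:
        return "detached"

    print(asyncio.run(detach()))  # handler is skipped while no key is configured
```

Reading the key through a callable rather than capturing its value keeps the check tied to the current server arguments at request time.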


# Minimal 32x32 black PNG (base64, GLM4v requires at least 32x32 sized image)
MINIMUM_PNG_PICTURE_BASE64 = "iVBORw0KGgoAAAANSUhEUgAAACAAAAAgCAYAAABzenr0AAAACXBIWXMAAA7EAAAOxAGVKw4bAAAAbUlEQVRYhe3VsQ2AMAxE0Y/lIgNQULD/OqyCMgCihCKSG4yRuKuiNH6JLsoEbMACOGBcua9HOR7Y6w6swBwMy0qLTpkeI77qdEBpBFAHBBDAGH8WrwJKI4AAegUCfAKgEgpQDvh3CR3oQCuav58qlAw73kKCSgAAAABJRU5ErkJggg=="
