[HiCache][RFC] SLA-oriented high availability for HiCache storage by alphabetc1 · Pull Request #17521 · sgl-project/sglang

alphabetc1 · 2026-01-21T19:03:00Z

Background & Goals

HiCache Storage currently handles runtime config updates and storage failures poorly. It lacks the high availability (HA) features we want, such as seamless switchover and automatic failover.

This patch lays out a step‑by‑step plan to make HiCache Storage more HA‑friendly. The roadmap mainly focuses on:

- Switchover: planned switches (upgrades, downgrades, maintenance)
- Failover: unplanned switches (storage or network failures)

To keep the risk manageable, this will be split into multiple smaller, independent issues.

Switchover (Planned Switch)

For planned operations (upgrades, downgrades, maintenance), we want to be able to switch storage backends smoothly, with as little impact on upstream services as possible.

Support dynamic updates of the storage backend: [HiCache][HA 1/N] Support HiCache storage runtime attach/detach #15892
- Status: GET /hicache/storage-backend
- Attach / switch: PUT /hicache/storage-backend
- Detach: DELETE /hicache/storage-backend
- By default, be conservative:
  - Only allow updates when there are no in‑flight requests.
  - If there is ongoing traffic, reject the operation and return an error.
Support forced updates of the storage backend: [HiCache][RFC] SLA-oriented high availability for HiCache storage #17521
- Add a force flag so that we can quickly detach a faulty backend and switch away from it even under load (with in‑flight requests) or while the storage is failing, minimizing impact on upstream traffic.
API security: feat: add --admin-api-key for finer-grained endpoint auth #15908
- Use an admin-api-key for authentication.
- Dynamic storage updates are only allowed if:
  - admin-api-key is explicitly configured in sglang, and
  - the client includes this key in the API call.
- This is to avoid silent or accidental changes in production.
Partial rollback:
- For multi‑step migrations / switches (e.g., TP / DP / PP phases), if something goes wrong halfway or results don’t match expectations, we should be able to roll back part of the changes instead of doing a full reset.
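
The gating rules above (admin key required, reject under load by default, force overrides the in-flight check) can be sketched as a small in-memory model. This is only an illustration under stated assumptions: `StorageBackendGate`, `UpdateRejected`, and every field name are hypothetical and are not taken from the SGLang code base, and the real endpoints may validate requests differently.

```python
import hmac
from dataclasses import dataclass
from typing import Optional


class UpdateRejected(Exception):
    """Raised when an attach/detach request is refused."""


@dataclass
class StorageBackendGate:
    # Hypothetical stand-in for server-side state; not SGLang's real classes.
    admin_api_key: Optional[str] = None  # None means the key was never configured
    backend: Optional[str] = None        # name of the attached backend, if any
    in_flight: int = 0                   # number of in-flight requests

    def _authorize(self, api_key: Optional[str]) -> None:
        # Updates are allowed only if a key is configured AND the caller sends it.
        if self.admin_api_key is None:
            raise UpdateRejected("admin-api-key is not configured")
        if api_key is None or not hmac.compare_digest(
            api_key.encode(), self.admin_api_key.encode()
        ):
            raise UpdateRejected("missing or invalid admin-api-key")

    def _check_idle(self, force: bool) -> None:
        # Conservative default: refuse while traffic is in flight unless forced.
        if self.in_flight > 0 and not force:
            raise UpdateRejected(f"{self.in_flight} request(s) in flight")

    def status(self) -> dict:
        return {"backend": self.backend, "in_flight": self.in_flight}

    def attach(self, backend: str, api_key: Optional[str] = None, force: bool = False) -> None:
        self._authorize(api_key)
        self._check_idle(force)
        self.backend = backend

    def detach(self, api_key: Optional[str] = None, force: bool = False) -> None:
        self._authorize(api_key)
        self._check_idle(force)
        self.backend = None


# Usage: a detach under load is rejected unless force is set.
gate = StorageBackendGate(admin_api_key="secret", backend="backend-a", in_flight=2)
try:
    gate.detach(api_key="secret")
except UpdateRejected as exc:
    print("rejected:", exc)
gate.detach(api_key="secret", force=True)
print(gate.status())
```

Keeping the authorization check ahead of the in-flight check means an unauthenticated caller learns nothing about current load, which seems in line with the goal of avoiding silent or accidental changes in production.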

Automatic Failover (Unplanned Failure Handling)

When the underlying storage has problems (e.g., node down, network issues), we want the system to detect it automatically, cut off the bad backend, and recover by switching to a healthy one with minimal impact on callers.

Automatic failure handling: [HiCache][RFC] SLA-oriented high availability for HiCache storage #17521
- Failure detection:
  - Define how we detect issues: timeouts, specific error codes, and thresholds for consecutive failures.
  - Expose metrics and hook them into alerts, so operators can see failures instead of guessing.
- Circuit breaking & recovery:
  - When a backend is considered unhealthy, automatically circuit break it (logically detach it from HiCache Storage).
  - Try to reconnect to:
    - other healthy instances in the same cluster, or
    - configured standby nodes (if provided).
  - Expose the current backend status (e.g., healthy / circuit_broken / recovering) so it’s observable from the outside.
Multi-environment support: [HiCache][RFC] SLA-oriented high availability for HiCache storage #17521
- Provide a non‑intrusive way to plug in different storage backends.
- Try to keep a unified management layer for different backends, so logic doesn’t have to be reimplemented per environment.
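
The detection and recovery flow described above can be sketched as a small state machine. This is a minimal sketch, assuming a consecutive-failure threshold and an ordered list of standby backends; `FailoverMonitor`, `probe`, and the threshold value are hypothetical names and numbers chosen for illustration, and only the three status strings come from the proposal.

```python
from dataclasses import dataclass
from typing import Callable, List

# Status names taken from the proposal's example states.
HEALTHY, CIRCUIT_BROKEN, RECOVERING = "healthy", "circuit_broken", "recovering"


@dataclass
class FailoverMonitor:
    # All names and defaults here are illustrative assumptions.
    active: str                    # currently attached backend
    standbys: List[str]            # configured standby backends, in priority order
    probe: Callable[[str], bool]   # returns True if a backend passes a health check
    failure_threshold: int = 3     # consecutive failures before circuit breaking
    state: str = HEALTHY
    consecutive_failures: int = 0

    def record(self, ok: bool) -> str:
        """Feed the outcome of one storage operation (False = timeout or error)."""
        if ok:
            self.consecutive_failures = 0
            return self.state
        self.consecutive_failures += 1
        if self.state == HEALTHY and self.consecutive_failures >= self.failure_threshold:
            self.state = CIRCUIT_BROKEN  # logically detach the active backend
        return self.state

    def try_recover(self) -> str:
        """Probe standbys in order and switch to the first healthy one."""
        if self.state == HEALTHY:
            return self.state
        self.state = RECOVERING
        for candidate in list(self.standbys):
            if self.probe(candidate):
                self.standbys.remove(candidate)
                self.active = candidate
                self.consecutive_failures = 0
                self.state = HEALTHY
                return self.state
        self.state = CIRCUIT_BROKEN  # nothing healthy yet; stay detached
        return self.state


# Usage: three consecutive failures trip the breaker, then a standby takes over.
monitor = FailoverMonitor(
    active="primary",
    standbys=["standby-1", "standby-2"],
    probe=lambda name: name == "standby-2",
)
for _ in range(3):
    monitor.record(False)
print(monitor.state)
print(monitor.try_recover(), monitor.active)
```

Exposing `state` directly mirrors the goal of making the backend status observable from the outside; a real implementation would presumably also emit metrics on each transition.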

Architecture / Flow Diagram

TODO

gemini-code-assist · 2026-01-21T19:03:04Z

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

dongyibo · 2026-02-04T07:01:37Z

Hello, I'm running DeepSeek v3.2 on an H800 with HiCache (host memory only, no L3 cache), and I've enabled pp+tp+dp+dp_attn. After running for a while, I get a CUDA illegal memory error. Are there still compatibility issues with HiCache on large-scale clusters? Could you please check?
#18166

alphabetc1 · 2026-02-04T14:22:02Z

> Hello, I'm running DeepSeek v3.2 on an H800 with HiCache (host memory only, no L3 cache), and I've enabled pp+tp+dp+dp_attn. After running for a while, I get a CUDA illegal memory error. Are there still compatibility issues with HiCache on large-scale clusters? Could you please check? #18166

This doesn't seem related to this PR; we can discuss it later in your issue.

alphabetc1 and others added 25 commits December 30, 2025 10:50

[HiCache]: support runtime attach/detach hicache storage

2b56246

add ut

72e3929

support hicache_storage_prefetch_policy

1b51810

fix

003f7b2

refactor the existing storage backend init to use the same attach/detach interfaces

775c998

fix ci

fab4275

fix

9fac448

support update hicache_write_policy

e878adf

support config switch

6033659

Merge remote-tracking branch 'origin/main' into feat/hicache_store_runtime_attach_detach

5a130de

fix mtr

59a479a

Merge branch 'main' into feat/hicache_store_runtime_attach_detach

2934b8a

Merge remote-tracking branch 'origin/main' into feat/hicache_store_runtime_attach_detach

908fa97

Merge branch 'main' into feat/hicache_store_runtime_attach_detach

86da98a

Merge branch 'main' into feat/hicache_store_runtime_attach_detach

bb7e8d7

add security

5d384fb

Merge branch 'main' into feat/hicache_store_runtime_attach_detach

c23477c

Merge branch 'main' into feat/hicache_store_runtime_attach_detach

b8fe011

mock ADMIN_FORCE

105e7d5

Merge branch 'main' into feat/hicache_store_runtime_attach_detach

0ef30a8

make API more RESTful

4e6b48b

Merge branch 'main' into feat/hicache_store_runtime_attach_detach

a6f0610

Merge branch 'main' into feat/hicache_store_runtime_attach_detach

b25a6c7

[HiCache] support force attach/detach of HiCache storage

2d994a5

[HiCache] storage fault tolerance

50adb6b

alphabetc1 requested review from Ying1123, hanming-lu, hnyls2002, merrymercy and xiezhq-hermann as code owners January 21, 2026 19:03

alphabetc1 requested review from ByronHsu, CatherineSue, JustinTong0323, ShangmingCai, ispobock, slin1237 and yizhang2077 as code owners January 21, 2026 19:03

github-actions bot added the documentation and hicache labels Jan 21, 2026

alphabetc1 changed the title ~~[HiCache][DO NOT MERGE] HiCache Storage High Availability~~ [HiCache][RFC] HiCache Storage High Availability Jan 22, 2026

alphabetc1 changed the title ~~[HiCache][RFC] HiCache Storage High Availability~~ [HiCache][RFC] SLA-oriented high availability for HiCache storage Jan 29, 2026

Merge branch 'main' into feat/hicache_store_ha

888e8d5

xiezhq-hermann self-assigned this Feb 4, 2026

alphabetc1 marked this pull request as draft February 4, 2026 14:37

alphabetc1 marked this pull request as ready for review February 9, 2026 16:44

alphabetc1 marked this pull request as draft March 13, 2026 19:20

[HiCache][RFC] SLA-oriented high availability for HiCache storage #17521
alphabetc1 wants to merge 26 commits into sgl-project:main from alphabetc1:feat/hicache_store_ha

3 participants
