
[HiCache][RFC] SLA-oriented high availability for HiCache storage #17521

Draft
alphabetc1 wants to merge 26 commits into sgl-project:main from alphabetc1:feat/hicache_store_ha


alphabetc1 (Collaborator) commented Jan 21, 2026

Background & Goals

HiCache Storage currently handles runtime config updates and failures poorly. It does not support the kind of high availability (HA) we want, such as seamless switchover and automatic failover.

This patch lays out a step‑by‑step plan to make HiCache Storage more HA‑friendly. The roadmap mainly focuses on:

  • Switchover: planned switches (upgrades, downgrades, maintenance)
  • Failover: unplanned switches (storage or network failures)

To keep the risk manageable, this will be split into multiple smaller, independent issues.


Switchover (Planned Switch)

For planned operations (upgrades, downgrades, maintenance), we want to be able to switch storage backends smoothly, with as little impact on upstream services as possible.

  • Support dynamic updates of the storage backend: [HiCache][HA 1/N] Support HiCache storage runtime attach/detach #15892

    • Status: GET /hicache/storage-backend
    • Attach / switch: PUT /hicache/storage-backend
    • Detach: DELETE /hicache/storage-backend
    • By default, be conservative:
      • Only allow updates when there are no in‑flight requests.
      • If there is ongoing traffic, reject the operation and return an error.
  • Support forced updates of the storage backend: [HiCache][RFC] SLA-oriented high availability for HiCache storage #17521

    • Add a force flag so that even under load (with in‑flight requests) and when the storage is failing, we can quickly detach the faulty backend and switch away from it, minimizing impact on upstream traffic.
  • API security: feat: add --admin-api-key for finer-grained endpoint auth #15908

    • Use an admin-api-key for authentication.
    • Dynamic storage updates are only allowed if:
      • admin-api-key is explicitly configured in sglang, and
      • the client includes this key in the API call.
    • This is to avoid silent or accidental changes in production.
  • Partial rollback:

    • For multi‑step migrations / switches (e.g., TP / DP / PP phases), if something goes wrong halfway or results don’t match expectations, we should be able to roll back part of the changes instead of doing a full reset.
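The default-conservative rule, the force flag, and the admin-api-key check described above can be sketched as a small policy object. This is only an illustration of the intended behavior, not the code in this PR: `StorageBackendController`, `UpdateRejected`, and every field and method name below are hypothetical, and the HTTP layer (GET / PUT / DELETE on /hicache/storage-backend) is deliberately left out.

```python
import hmac
from dataclasses import dataclass
from typing import Optional


class UpdateRejected(Exception):
    """Raised when a storage-backend update is not allowed."""


@dataclass
class StorageBackendController:
    # Hypothetical names for illustration; not SGLang's actual classes.
    admin_api_key: Optional[str] = None  # None: key was never configured
    backend: Optional[str] = None        # currently attached backend, if any
    inflight_requests: int = 0           # storage operations still in progress

    def _authorize(self, presented_key: Optional[str]) -> None:
        # Updates are allowed only if a key is configured AND the caller sends it.
        if self.admin_api_key is None:
            raise UpdateRejected("admin-api-key is not configured")
        if presented_key is None or not hmac.compare_digest(
            self.admin_api_key.encode(), presented_key.encode()
        ):
            raise UpdateRejected("invalid admin-api-key")

    def _check_idle(self, force: bool) -> None:
        # Conservative default: reject while traffic is in flight, unless forced.
        if self.inflight_requests > 0 and not force:
            raise UpdateRejected(
                f"{self.inflight_requests} in-flight requests; retry later or use force"
            )

    def status(self) -> dict:
        # Read-only view, corresponding to the status endpoint.
        return {"backend": self.backend, "inflight_requests": self.inflight_requests}

    def attach(self, name: str, presented_key: Optional[str], force: bool = False) -> None:
        # Attach or switch, corresponding to the PUT endpoint.
        self._authorize(presented_key)
        self._check_idle(force)
        self.backend = name

    def detach(self, presented_key: Optional[str], force: bool = False) -> None:
        # Detach, corresponding to the DELETE endpoint.
        self._authorize(presented_key)
        self._check_idle(force)
        self.backend = None
```

In this sketch the authorization check runs before the idle check, so an unauthenticated caller learns nothing about current traffic. Whether a forced detach should also cancel or drain the in-flight requests is left open here.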

Automatic Failover (Unplanned Failure Handling)

When the underlying storage has problems (e.g., node down, network issues), we want the system to detect it automatically, cut off the bad backend, and recover by switching to a healthy one with minimal impact on callers.

  • Automatic failure handling: [HiCache][RFC] SLA-oriented high availability for HiCache storage #17521

    • Failure detection:
      • Define how we detect issues: timeouts, specific error codes, and thresholds for consecutive failures.
      • Expose metrics and hook them into alerts, so operators can see failures instead of guessing.
    • Circuit breaking & recovery:
      • When a backend is considered unhealthy, automatically circuit break it (logically detach it from HiCache Storage).
      • Try to reconnect to:
        • other healthy instances in the same cluster, or
        • configured standby nodes (if provided).
      • Expose the current backend status (e.g., healthy / circuit_broken / recovering) so it’s observable from the outside.
  • Multi-environment support: [HiCache][RFC] SLA-oriented high availability for HiCache storage #17521

    • Provide a non‑intrusive way to plug in different storage backends.
    • Try to keep a unified management layer for different backends, so logic doesn’t have to be reimplemented per environment.
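The detection and circuit-breaking flow above can be sketched as a small state machine. This is a sketch under assumptions, not the design in this PR: `FailoverManager`, `BackendStatus`, the `probe` callback, and the default threshold of 3 are all hypothetical, and detection is reduced to counting consecutive failures (timeouts and error codes would feed `record_failure`).

```python
from enum import Enum
from typing import Callable, List, Optional


class BackendStatus(str, Enum):
    HEALTHY = "healthy"
    CIRCUIT_BROKEN = "circuit_broken"
    RECOVERING = "recovering"


class FailoverManager:
    # Hypothetical names for illustration; not SGLang's actual classes.
    def __init__(
        self,
        primary: str,
        standbys: List[str],
        probe: Callable[[str], bool],  # returns True if the backend answers
        failure_threshold: int = 3,    # assumed value, not from the RFC
    ) -> None:
        self.active: Optional[str] = primary
        self.standbys = list(standbys)
        self.probe = probe
        self.failure_threshold = failure_threshold
        self.consecutive_failures = 0
        self.status = BackendStatus.HEALTHY

    def record_success(self) -> None:
        # Any success resets the consecutive-failure counter.
        self.consecutive_failures = 0

    def record_failure(self) -> None:
        # Timeouts and error codes would both be reported through this method.
        self.consecutive_failures += 1
        if (
            self.status is BackendStatus.HEALTHY
            and self.consecutive_failures >= self.failure_threshold
        ):
            # Circuit break: logically detach the unhealthy backend.
            self.status = BackendStatus.CIRCUIT_BROKEN
            self.active = None

    def try_recover(self) -> bool:
        # Try each configured standby in order and switch to the first healthy one.
        if self.status is BackendStatus.HEALTHY:
            return True
        self.status = BackendStatus.RECOVERING
        for candidate in self.standbys:
            if self.probe(candidate):
                self.standbys.remove(candidate)
                self.active = candidate
                self.consecutive_failures = 0
                self.status = BackendStatus.HEALTHY
                return True
        self.status = BackendStatus.CIRCUIT_BROKEN
        return False
```

The `status` field is what an external status endpoint or metric could expose. A real implementation would also need locking, backoff between recovery attempts, and a decision on whether the original primary can be re-attached later.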

Architecture / Flow Diagram

TODO


@github-actions github-actions bot added the documentation (Improvements or additions to documentation) and hicache (Hierarchical Caching for SGLang) labels Jan 21, 2026
@alphabetc1 alphabetc1 changed the title [HiCache][DO NOT MERGE] HiCache Storage High Availability [HiCache][RFC] HiCache Storage High Availability Jan 22, 2026
@alphabetc1 alphabetc1 changed the title [HiCache][RFC] HiCache Storage High Availability [HiCache][RFC] SLA-oriented high availability for HiCache storage Jan 29, 2026
dongyibo commented Feb 4, 2026

Hello, I'm running DeepSeek v3.2 on an H800 with HiCache (host memory only, no L3 cache), and I've enabled pp+tp+dp+dp_attn. After running for a while, I get a CUDA illegal memory error. Are there still compatibility issues with HiCache on large-scale clusters? Could you please check?
#18166

alphabetc1 (Collaborator, Author) commented

Hello, I'm running DeepSeek v3.2 on an H800 with HiCache (host memory only, no L3 cache), and I've enabled pp+tp+dp+dp_attn. After running for a while, I get a CUDA illegal memory error. Are there still compatibility issues with HiCache on large-scale clusters? Could you please check? #18166

This doesn't seem related to this PR; we can discuss it later in your issue.

@xiezhq-hermann xiezhq-hermann self-assigned this Feb 4, 2026
@alphabetc1 alphabetc1 marked this pull request as draft February 4, 2026 14:37
@alphabetc1 alphabetc1 marked this pull request as ready for review February 9, 2026 16:44
@alphabetc1 alphabetc1 marked this pull request as draft March 13, 2026 19:20