[HiCache][RFC] SLA-oriented high availability for HiCache storage #17521
alphabetc1 wants to merge 26 commits into sgl-project:main
Hello, I'm running DeepSeek v3.2 on an H800 with HiCache (host memory only, no L3 cache), and I've enabled pp+tp+dp+dp_attn. After running for a while, I get a CUDA illegal memory error. Are there still compatibility issues with HiCache on large-scale clusters? Could you please check?
This seems unrelated to this PR; we can discuss it later in your issue.
Background & Goals
HiCache Storage currently handles runtime config updates and failures poorly. It does not yet support the kind of high availability (HA) we want, such as seamless switchover and automatic failover.
This patch lays out a step-by-step plan to make HiCache Storage more HA-friendly. The roadmap mainly focuses on two areas: planned switchover and automatic failover.
To keep the risk manageable, this will be split into multiple smaller, independent issues.
Switchover (Planned Switch)
For planned operations (upgrades, downgrades, maintenance), we want to be able to switch storage backends smoothly, with as little impact on upstream services as possible.
- Support dynamic updates of the storage backend: [HiCache][HA 1/N] Support HiCache storage runtime attach/detach #15892
  - GET /hicache/storage-backend
  - PUT /hicache/storage-backend
  - DELETE /hicache/storage-backend
- Support forced updates of the storage backend: [HiCache][RFC] SLA-oriented high availability for HiCache storage #17521
  - force flag so that even under load (with in-flight requests) and when the storage is failing, we can quickly detach the faulty backend and switch away from it, minimizing impact on upstream traffic.
- API security: feat: add --admin-api-key for finer-grained endpoint auth #15908
  - … admin-api-key for authentication.
  - … admin-api-key is explicitly configured in sglang, and …
- Partial rollback:
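As a rough illustration of how an operator script might address the endpoints listed above, the sketch below builds (but does not send) requests against /hicache/storage-backend. Only the path, the three HTTP methods, the existence of a force flag, and the use of an admin API key come from the list; the Bearer header scheme, the JSON body, the field names `backend` and `force`, and the host/port are assumptions made for illustration and may not match the actual API.

```python
import json
import urllib.request

ENDPOINT = "/hicache/storage-backend"  # path taken from the list above


def build_request(method, base_url, admin_api_key, payload=None):
    """Build (but do not send) a request to the storage-backend endpoint."""
    if method not in ("GET", "PUT", "DELETE"):
        raise ValueError(f"unsupported method: {method}")
    # Assumption: the admin key is passed as a Bearer token.
    headers = {"Authorization": f"Bearer {admin_api_key}"}
    data = None
    if payload is not None:
        data = json.dumps(payload).encode("utf-8")
        headers["Content-Type"] = "application/json"
    return urllib.request.Request(
        base_url.rstrip("/") + ENDPOINT,
        data=data,
        headers=headers,
        method=method,
    )


# Hypothetical payload: the field names are illustrative only.
attach = build_request(
    "PUT",
    "http://localhost:30000",
    "my-admin-key",
    {"backend": "example-backend", "force": True},
)
detach = build_request("DELETE", "http://localhost:30000", "my-admin-key")

# To actually send a request one would call, for example:
#   urllib.request.urlopen(attach, timeout=10)
```

Keeping request construction separate from sending makes the switchover script easy to dry-run before it touches a live server.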
Automatic Failover (Unplanned Failure Handling)
When the underlying storage has problems (e.g., node down, network issues), we want the system to detect it automatically, cut off the bad backend, and recover by switching to a healthy one with minimal impact on callers.
- Automatic failure handling: [HiCache][RFC] SLA-oriented high availability for HiCache storage #17521
  - … (healthy/circuit_broken/recovering) so it's observable from the outside.
- Multi-environment support: [HiCache][RFC] SLA-oriented high availability for HiCache storage #17521
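One way to picture the healthy/circuit_broken/recovering states is as a small circuit-breaker state machine. The sketch below is not the PR's implementation: the class name `StorageHealth`, the thresholds, the cooldown, and the transition rules are all assumptions chosen to illustrate the idea; only the three state names come from the text above.

```python
import time


class StorageHealth:
    """Hypothetical breaker: healthy -> circuit_broken -> recovering -> healthy."""

    def __init__(self, failure_threshold=3, cooldown_s=30.0,
                 probes_needed=2, clock=time.monotonic):
        self.failure_threshold = failure_threshold  # consecutive failures before tripping
        self.cooldown_s = cooldown_s                # wait before probing the backend again
        self.probes_needed = probes_needed          # successes required to close the breaker
        self.clock = clock                          # injectable for testing
        self.state = "healthy"
        self._failures = 0
        self._successes = 0
        self._opened_at = 0.0

    def allow_request(self):
        """Return True if the storage backend may be used right now."""
        if self.state == "circuit_broken":
            if self.clock() - self._opened_at < self.cooldown_s:
                return False
            # Cooldown elapsed: let probe traffic through.
            self.state = "recovering"
            self._successes = 0
        return True

    def record_success(self):
        if self.state == "recovering":
            self._successes += 1
            if self._successes >= self.probes_needed:
                self.state = "healthy"
                self._failures = 0
        elif self.state == "healthy":
            self._failures = 0  # only consecutive failures count

    def record_failure(self):
        if self.state == "recovering":
            self._trip()  # a failed probe reopens the breaker
        elif self.state == "healthy":
            self._failures += 1
            if self._failures >= self.failure_threshold:
                self._trip()

    def _trip(self):
        self.state = "circuit_broken"
        self._opened_at = self.clock()
        self._failures = 0
```

In such a design the `state` attribute is what would be exposed to external monitoring, and callers would check `allow_request()` before touching the storage backend, falling back to host memory only when it returns False.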
Architecture / Flow Diagram
TODO