
[HiCache] Support DeepSeek V3.2 L3 offloading#18637

Closed
vladnosiv wants to merge 11 commits into sgl-project:main from vladnosiv:dsv32-l3-hicache

Conversation

@vladnosiv
Contributor

@vladnosiv vladnosiv commented Feb 11, 2026

Motivation

Adds HiCache L3 offloading support for DeepSeek V3.2, following the L2 cache support in #17415.
Relates to #17085

Modifications

Added support for offloading the indexer keys, with integration into the file storage and MoonCake storage backends.

Accuracy Tests

Launch command with MoonCake (server + storage client on localhost)

export SGLANG_JIT_DEEPGEMM_FAST_WARMUP=1

export SELF_IP=<...>
export MOONCAKE_MASTER=$SELF_IP:50051
export MOONCAKE_LOCAL_HOSTNAME=$SELF_IP
export MOONCAKE_GLOBAL_SEGMENT_SIZE=0
export MOONCAKE_TE_META_DATA_SERVER=http://$SELF_IP:8080/metadata
export MOONCAKE_PROTOCOL=rdma
export MC_TE_METRIC=1

python3 -m sglang.launch_server \
      --model-path /home/devuser/.cache/huggingface/DeepSeek-V3.2 \
      --trust-remote-code \
      --port 8000 \
      --host 0.0.0.0 \
      --context-length 65536 \
      --chunked-prefill-size 65536 \
      --tp-size 8 \
      --chat-template examples/chat_template/tool_chat_template_deepseekv32.jinja \
      --page-size 64 \
      --mem-fraction-static 0.8 \
      --enable-hierarchical-cache \
      --hicache-ratio 2.0 \
      --hicache-storage-backend mooncake \
      --hicache-io-backend direct \
      --hicache-mem-layout page_first_direct \
      --hicache-write-policy write_through \
      --hicache-storage-prefetch-policy wait_complete &> sglang.out

gsm8k test command

python benchmark/gsm8k/bench_sglang.py --port 8000 --num-questions 500 --num-shots 48 --parallel 100

First time

Accuracy: 0.964
Invalid: 0.000
Latency: 104.313 s
Output throughput: 458.706 token/s

The second run after restarting the server (the cache is saved in the MoonCake Store):

Accuracy: 0.962
Invalid: 0.000
Latency: 81.281 s
Output throughput: 588.795 token/s

The logs show cache hits immediately after launch:

[Screenshot 2026-02-11 at 21:05:35: server logs showing cache hits]

Signed-off-by: Vladislav Nosivskoy <vladnosiv@gmail.com>
@github-actions github-actions Bot added the hicache Hierarchical Caching for SGLang label Feb 11, 2026
@stmatengss
Collaborator

/tag-and-rerun-ci

@stmatengss
Collaborator

stmatengss commented Feb 25, 2026

After merging #16137, the Mooncake store class becomes independent from hicache storage; you should resolve any conflicts. @vladnosiv

@huangtingwei9988 huangtingwei9988 self-assigned this Feb 25, 2026
# Conflicts:
#	python/sglang/srt/mem_cache/storage/mooncake_store/mooncake_store.py
Signed-off-by: Vladislav Nosivskoy <vladnosiv@gmail.com>
@vladnosiv
Contributor Author

Accuracy test on the current commit:

Accuracy: 0.960
Invalid: 0.000
Latency: 85.842 s
Output throughput: 555.406 token/s

@stmatengss
Collaborator

/rerun-failed-ci

@vladnosiv
Contributor Author

@hzh0425 and I discussed this and agreed it is better to wait for the HiCache refactoring before continuing with this PR. I'll convert it to a draft for now.

@vladnosiv vladnosiv marked this pull request as draft February 27, 2026 11:59
@llc-kc
Contributor

llc-kc commented Mar 7, 2026

Hi, why do we need to store the indexer cache independently rather than packing it with the KV cache into a single tensor?

@vladnosiv
Contributor Author

vladnosiv commented Mar 7, 2026

> Hi, why do we need to store the indexer cache independently rather than packing it with the KV cache into a single tensor?

Hi!
Right now this PR stores them separately because they are already independent throughout the SGLang code: in the L2 cache and in P/D disaggregation, the KV cache and indexer cache are handled independently.

After #19912 is merged, adding a page_first layout for the indexer, this PR should become much simpler.

In addition, I see no obstacle (though I may be wrong) to a follow-up PR with an alternative implementation of NSATokenToKVPool that merges the KV and indexer buffers into one tensor. That would allow storing one object in the MoonCake Store instead of two, but it sounds more like an experimental optimization behind a separate flag (it would also make a fused H2D/D2H transfer kernel possible for such a merged object).

For initial support of the MoonCake Store, having two objects per page doesn't sound so bad, considering that only large 64-token pages are allowed for now.
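
For intuition, here is a hypothetical sketch of the "two objects per page" scheme: one MoonCake key per 64-token page for the KV data and one for the indexer data, both derived from a chained page hash. All names here (`page_keys`, the `_kv`/`_idx` suffixes) are illustrative assumptions, not SGLang's or MoonCake's actual API.

```python
# Illustrative sketch only: two storage objects per page, keyed by a chained
# page hash. Names and key format are assumptions, not real SGLang APIs.
import hashlib

PAGE_SIZE = 64  # tokens per page; only large 64-token pages are allowed here


def page_keys(token_ids: list[int], prefix_hash: str = "") -> list[tuple[str, str]]:
    """Return (kv_key, indexer_key) pairs, one pair per full page.

    Each page hash chains over the previous page's hash, so a key hit in
    the store implies the entire token prefix up to that page matches.
    """
    keys = []
    h = prefix_hash
    full = len(token_ids) - len(token_ids) % PAGE_SIZE  # drop the partial tail page
    for start in range(0, full, PAGE_SIZE):
        page = token_ids[start:start + PAGE_SIZE]
        h = hashlib.sha256((h + ",".join(map(str, page))).encode()).hexdigest()
        # One MoonCake object for the KV page, one for its indexer page.
        keys.append((f"{h}_kv", f"{h}_idx"))
    return keys


pairs = page_keys(list(range(130)))  # 130 tokens -> 2 full pages -> 2 key pairs
```

A merged single-tensor layout would collapse each pair into one key, which is the experimental optimization mentioned above.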

@huangtingwei9988
Collaborator

> Hi, why do we need to store the indexer cache independently rather than packing it with the KV cache into a single tensor?

Merging the indexer cache and key-value cache into a single tensor increases complexity.

Firstly, the indexer cache is organized at the page level, while the key-value cache is organized at the token level. This makes offset calculations in the I/O kernel more cumbersome during copying.

Secondly, we plan to implement a unified design for hybrid LLMs (e.g., DSA and Mamba), and will not design a separate system for DSA. The current approach is to use a full key-value cache plus an extra cache for each key.
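
To illustrate the addressing mismatch described above, a minimal sketch (the byte sizes and function names are made-up assumptions, not DSA's real layout): the KV buffer is strided per token, while the indexer buffer is strided per page, so a copy kernel spanning both needs two different offset formulas.

```python
# Illustrative offset arithmetic only; sizes are assumptions, not DSA's layout.
PAGE_SIZE = 64             # tokens per page
KV_BYTES_PER_TOKEN = 656   # hypothetical per-token KV entry size
IDX_BYTES_PER_PAGE = 8192  # hypothetical per-page indexer block size


def kv_offset(page_id: int, token_in_page: int) -> int:
    # Token-level addressing: page base plus a per-token stride.
    return (page_id * PAGE_SIZE + token_in_page) * KV_BYTES_PER_TOKEN


def indexer_offset(page_id: int) -> int:
    # Page-level addressing: one contiguous block per page, no token term.
    return page_id * IDX_BYTES_PER_PAGE
```

Fusing both buffers into one tensor would force a single kernel to interleave these two stride patterns, which is the added complexity being weighed here.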

@llc-kc
Contributor

llc-kc commented Mar 7, 2026

Ok, thank you for your reply @vladnosiv @huangtingwei9988

Signed-off-by: Vladislav Nosivskoy <vladnosiv@gmail.com>
Signed-off-by: Vladislav Nosivskoy <vladnosiv@gmail.com>
vladnosiv and others added 2 commits March 10, 2026 11:18
Signed-off-by: Vladislav Nosivskoy <vladnosiv@gmail.com>
@vladnosiv
Contributor Author

After merging the indexer page_first layout, I tested accuracy on a TP8 DP8 EP8 setup:

First run with cold cache:

Accuracy: 0.960
Invalid: 0.000
Latency: 94.320 s
Output throughput: 505.261 token/s

Second run (after server restart) with hot mooncake cache:

Accuracy: 0.962
Invalid: 0.000
Latency: 84.455 s
Output throughput: 566.290 token/s

@vladnosiv vladnosiv marked this pull request as ready for review March 16, 2026 09:24
@vladnosiv
Contributor Author

I took some of the changes from the refactoring branch and adapted NSA+MoonCake to them. I'll clean up the code in more detail and test it after the refactoring PR is merged.

@llc-kc
Contributor

llc-kc commented Mar 18, 2026

@vladnosiv When I use your branch vladnosiv:dsv32-l3-hicache, it works for V3.2. Thank you for your outstanding work.
When I test kimi-k2.5, there seems to be a bug:
TypeError: HiCacheController.prefetch() got an unexpected keyword argument 'extra_pools'

@vladnosiv
Contributor Author

@llc-kc Hi!
I assume you're running the branch at its latest commit. After that commit, the branch should not be run as-is: it is preparatory work for resolving the refactoring merge conflicts and brings many changes. Also, I took only part of the files needed for integration from the refactoring PR, so the branch is broken at the moment.

@vladnosiv
Contributor Author

Commits cherry-picked to #21259

@vladnosiv vladnosiv closed this Mar 25, 2026

Labels

hicache Hierarchical Caching for SGLang high priority run-ci

7 participants