
[HiCache] Support DeepSeek V3.2 L3 offloading#18637

Closed
vladnosiv wants to merge 11 commits into sgl-project:main from vladnosiv:dsv32-l3-hicache

Conversation

@vladnosiv
Contributor

@vladnosiv vladnosiv commented Feb 11, 2026

Motivation

Adds HiCache L3 offloading support for DeepSeek V3.2, following the L2 cache support in #17415.
Relates to #17085

Modifications

Added support for offloading the indexer keys, with integration into the file storage and MoonCake storage backends.

Accuracy Tests

Launch command with MoonCake (server + storage client on localhost)

export SGLANG_JIT_DEEPGEMM_FAST_WARMUP=1

export SELF_IP=<...>
export MOONCAKE_MASTER=$SELF_IP:50051
export MOONCAKE_LOCAL_HOSTNAME=$SELF_IP
export MOONCAKE_GLOBAL_SEGMENT_SIZE=0
export MOONCAKE_TE_META_DATA_SERVER=http://$SELF_IP:8080/metadata
export MOONCAKE_PROTOCOL=rdma
export MC_TE_METRIC=1

python3 -m sglang.launch_server \
      --model-path /home/devuser/.cache/huggingface/DeepSeek-V3.2 \
      --trust-remote-code \
      --port 8000 \
      --host 0.0.0.0 \
      --context-length 65536 \
      --chunked-prefill-size 65536 \
      --tp-size 8 \
      --chat-template examples/chat_template/tool_chat_template_deepseekv32.jinja \
      --page-size 64 \
      --mem-fraction-static 0.8 \
      --enable-hierarchical-cache \
      --hicache-ratio 2.0 \
      --hicache-storage-backend mooncake \
      --hicache-io-backend direct \
      --hicache-mem-layout page_first_direct \
      --hicache-write-policy write_through \
      --hicache-storage-prefetch-policy wait_complete &> sglang.out

gsm8k test command

python benchmark/gsm8k/bench_sglang.py --port 8000 --num-questions 500 --num-shots 48 --parallel 100

First time

Accuracy: 0.964
Invalid: 0.000
Latency: 104.313 s
Output throughput: 458.706 token/s

The second run after restarting the server (the cache is saved in the MoonCake Store):

Accuracy: 0.962
Invalid: 0.000
Latency: 81.281 s
Output throughput: 588.795 token/s

The logs show cache hits immediately after launch:

[Screenshot 2026-02-11 at 21:05:35: server logs showing cache hits]

Signed-off-by: Vladislav Nosivskoy <vladnosiv@gmail.com>
@github-actions github-actions Bot added the hicache Hierarchical Caching for SGLang label Feb 11, 2026
@stmatengss
Collaborator

/tag-and-rerun-ci

@stmatengss
Collaborator

stmatengss commented Feb 25, 2026

After merging #16137, the Mooncake store class becomes independent from hicache storage; you should resolve any conflicts. @vladnosiv

@huangtingwei9988 huangtingwei9988 self-assigned this Feb 25, 2026
# Conflicts:
#	python/sglang/srt/mem_cache/storage/mooncake_store/mooncake_store.py
Signed-off-by: Vladislav Nosivskoy <vladnosiv@gmail.com>
@vladnosiv
Contributor Author

Accuracy test on the current commit:

Accuracy: 0.960
Invalid: 0.000
Latency: 85.842 s
Output throughput: 555.406 token/s

@stmatengss
Collaborator

/rerun-failed-ci

@vladnosiv
Contributor Author

@hzh0425 and I discussed this and agreed it is better to wait for the HiCache refactoring before continuing with this PR. I'll convert it to a draft for now.

@vladnosiv vladnosiv marked this pull request as draft February 27, 2026 11:59
@llc-kc
Contributor

llc-kc commented Mar 7, 2026

Hi, why do we need to store the indexer cache independently rather than packing it with the KV cache into a single tensor?

@vladnosiv
Contributor Author

vladnosiv commented Mar 7, 2026

> Hi, why do we need to store the indexer cache independently rather than packing it with the KV cache into a single tensor?

Hi!
Right now this PR stores them separately because they are already independent throughout the SGLang code: in the L2 cache and in P/D disaggregation, the KV cache and indexer cache are handled independently.

After #19912 is merged, adding a page_first layout for the indexer, this PR should become much simpler.

In addition, I see no obstacle (though I may be wrong) to a follow-up PR with an alternative implementation of NSATokenToKVPool that merges the KV and indexer buffers into one tensor. That would allow storing one object in the MoonCake Store instead of two, but it sounds more like an experimental optimization behind a separate flag (it would also make a fused H2D/D2H transfer kernel possible for such a merged object).

For initial support of the MoonCake Store, having two objects per page doesn't sound so bad, considering that only large 64-token pages are allowed for now.
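
For intuition, here is a hypothetical sketch of the "two objects per page" scheme: one MoonCake key per 64-token page for the KV data and one for the indexer data, both derived from a chained page hash. All names here (`page_keys`, the `_kv`/`_idx` suffixes) are illustrative assumptions, not SGLang's or MoonCake's actual API.

```python
# Illustrative sketch only: two storage objects per page, keyed by a chained
# page hash. Names and key format are assumptions, not real SGLang APIs.
import hashlib

PAGE_SIZE = 64  # tokens per page; only large 64-token pages are allowed here


def page_keys(token_ids: list[int], prefix_hash: str = "") -> list[tuple[str, str]]:
    """Return (kv_key, indexer_key) pairs, one pair per full page.

    Each page hash chains over the previous page's hash, so a key hit in
    the store implies the entire token prefix up to that page matches.
    """
    keys = []
    h = prefix_hash
    full = len(token_ids) - len(token_ids) % PAGE_SIZE  # drop the partial tail page
    for start in range(0, full, PAGE_SIZE):
        page = token_ids[start:start + PAGE_SIZE]
        h = hashlib.sha256((h + ",".join(map(str, page))).encode()).hexdigest()
        # One MoonCake object for the KV page, one for its indexer page.
        keys.append((f"{h}_kv", f"{h}_idx"))
    return keys


pairs = page_keys(list(range(130)))  # 130 tokens -> 2 full pages -> 2 key pairs
```

A merged single-tensor layout would collapse each pair into one key, which is the experimental optimization mentioned above.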

@huangtingwei9988
Collaborator

> Hi, why do we need to store the indexer cache independently rather than packing it with the KV cache into a single tensor?

Merging the indexer cache and key-value cache into a single tensor increases complexity.

Firstly, the indexer cache is organized at the page level, while the key-value cache is organized at the token level. This makes offset calculations in the I/O kernel more cumbersome during copying.

Secondly, we plan to implement a unified design for hybrid LLMs (e.g., DSA and Mamba), and will not design a separate system for DSA. The current approach is to use a full key-value cache plus an extra cache for each key.
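
To illustrate the addressing mismatch described above, a minimal sketch (the byte sizes and function names are made-up assumptions, not DSA's real layout): the KV buffer is strided per token, while the indexer buffer is strided per page, so a copy kernel spanning both needs two different offset formulas.

```python
# Illustrative offset arithmetic only; sizes are assumptions, not DSA's layout.
PAGE_SIZE = 64             # tokens per page
KV_BYTES_PER_TOKEN = 656   # hypothetical per-token KV entry size
IDX_BYTES_PER_PAGE = 8192  # hypothetical per-page indexer block size


def kv_offset(page_id: int, token_in_page: int) -> int:
    # Token-level addressing: page base plus a per-token stride.
    return (page_id * PAGE_SIZE + token_in_page) * KV_BYTES_PER_TOKEN


def indexer_offset(page_id: int) -> int:
    # Page-level addressing: one contiguous block per page, no token term.
    return page_id * IDX_BYTES_PER_PAGE
```

Fusing both buffers into one tensor would force a single kernel to interleave these two stride patterns, which is the added complexity being weighed here.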

@llc-kc
Contributor

llc-kc commented Mar 7, 2026

Ok, thank you for your reply @vladnosiv @huangtingwei9988

Signed-off-by: Vladislav Nosivskoy <vladnosiv@gmail.com>
Signed-off-by: Vladislav Nosivskoy <vladnosiv@gmail.com>
vladnosiv and others added 2 commits March 10, 2026 11:18
Signed-off-by: Vladislav Nosivskoy <vladnosiv@gmail.com>
@vladnosiv
Contributor Author

After merging the indexer page_first layout, I tested accuracy on a TP8 DP8 EP8 setup:

First run with cold cache:

Accuracy: 0.960
Invalid: 0.000
Latency: 94.320 s
Output throughput: 505.261 token/s

Second run (after server restart) with hot mooncake cache:

Accuracy: 0.962
Invalid: 0.000
Latency: 84.455 s
Output throughput: 566.290 token/s

@vladnosiv vladnosiv marked this pull request as ready for review March 16, 2026 09:24
@vladnosiv
Contributor Author

I took some of the changes from the refactoring branch and adapted NSA+MoonCake to them. I'll clean up the code in more detail and test it after the refactoring PR is merged.

@llc-kc
Contributor

llc-kc commented Mar 18, 2026

@vladnosiv When I use your branch vladnosiv:dsv32-l3-hicache, it works for V3.2. Thank you for your outstanding work.
When I test kimi-k2.5, there seems to be a bug:
TypeError: HiCacheController.prefetch() got an unexpected keyword argument 'extra_pools'

@vladnosiv
Contributor Author

@llc-kc Hi!
I assume you're running the branch at its latest commit. After that commit, the branch should not be run as-is: it is preparatory work for resolving the refactoring merge conflicts and brings many changes. Also, I took only part of the files needed for integration from the refactoring PR, so the branch is broken at the moment.

@vladnosiv
Contributor Author

Commits cherry-picked to #21259

@vladnosiv vladnosiv closed this Mar 25, 2026

Labels

hicache Hierarchical Caching for SGLang high priority run-ci

7 participants