[HiCache] Support DeepSeek V3.2 L3 offloading#18637
Signed-off-by: Vladislav Nosivskoy <vladnosiv@gmail.com>
|
/tag-and-rerun-ci |
|
After merging #16137, the Mooncake store class becomes independent from hicache storage; you should resolve any conflicts. @vladnosiv |
# Conflicts: # python/sglang/srt/mem_cache/storage/mooncake_store/mooncake_store.py
|
Accuracy test on the current commit |
|
/rerun-failed-ci |
|
After discussing with @hzh0425, we agreed it is better to wait for the HiCache refactoring and then continue with this PR. I'll convert it to a draft for now. |
|
Hi, why do we need to store the indexer cache independently instead of packing it with the kv cache into a single tensor?
Hi! Once PR #19912, which adds the page_first layout for the indexer, is merged, my PR should become much simpler. In addition, I see no problem (though I may be wrong) with then making a separate PR with another implementation of NSATokenToKVPool that merges the kv and indexer buffers into one tensor. This would give 1 object in the MoonCake Store instead of two, but it sounds more like an experimental optimization under a separate flag (since it would also make a fused H2D/D2H transfer kernel possible for such a merged object). For initial support of the MoonCake Store, having 2 objects per page doesn't sound so bad, considering that only large pages with 64 tokens are allowed now. |
Merging the indexer cache and key-value cache into a single tensor increases complexity. Firstly, the indexer cache is organized at the page level, while the key-value cache is organized at the token level. This makes offset calculations in the I/O kernel more cumbersome during copying. Secondly, we plan to implement a unified design for hybrid LLMs (e.g., DSA and Mamba), and will not design a separate system for DSA. The current approach is to use a full key-value cache plus an extra cache for each key. |
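To illustrate the offset argument in this thread, here is a minimal sketch that compares byte offsets for separate buffers with offsets for a merged per-page layout. It is not the NSATokenToKVPool layout: the `Layout` class, its field names, and the byte sizes are hypothetical, and the merged layout (one page of kv entries followed by that page's indexer block) is only one possible choice. Only the 64-token page size comes from the discussion.

```python
from dataclasses import dataclass

PAGE_SIZE = 64  # tokens per page, as mentioned in the thread


@dataclass(frozen=True)
class Layout:
    """Hypothetical sizes; not taken from the actual implementation."""

    kv_bytes_per_token: int
    indexer_bytes_per_page: int

    # Separate buffers: each buffer has a single, uniform stride.
    def kv_offset_separate(self, token_idx: int) -> int:
        # Token-level addressing in a standalone kv buffer.
        return token_idx * self.kv_bytes_per_token

    def indexer_offset_separate(self, token_idx: int) -> int:
        # Page-level addressing in a standalone indexer buffer.
        return (token_idx // PAGE_SIZE) * self.indexer_bytes_per_page

    # Merged buffer (assumed layout): [kv for 64 tokens][indexer block] per page.
    @property
    def merged_page_stride(self) -> int:
        return PAGE_SIZE * self.kv_bytes_per_token + self.indexer_bytes_per_page

    def kv_offset_merged(self, token_idx: int) -> int:
        # A token offset now needs both the page index and the slot in the page.
        page, slot = divmod(token_idx, PAGE_SIZE)
        return page * self.merged_page_stride + slot * self.kv_bytes_per_token

    def indexer_offset_merged(self, token_idx: int) -> int:
        # The indexer block sits after the page's kv entries.
        page = token_idx // PAGE_SIZE
        return page * self.merged_page_stride + PAGE_SIZE * self.kv_bytes_per_token
```

With separate buffers each offset is a single multiply, while the merged layout mixes a token-level stride with a page-level stride in every lookup, which is the extra bookkeeping an I/O kernel would have to carry.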
|
Ok, thank you for your reply @vladnosiv @huangtingwei9988 |
|
After the indexer page_first layout was merged, I tested accuracy on a TP8 DP8 EP8 setup: First run with cold cache: Second run (after server restart) with hot mooncake cache: |
|
I took some of the changes from the refactoring branch and adapted NSA+MoonCake to them. I'll clean up the code in more detail and test it after the refactoring PR is merged. |
|
@vladnosiv When I use your branch vladnosiv:dsv32-l3-hicache, it works for V3.2. Thank you for your outstanding work. |
|
@llc-kc Hi! |
|
Commits cherry-picked to #21259 |
Motivation
Adds HiCache L3 offloading support for DeepSeek V3.2, following the L2 cache support in #17415
Relates to #17085
Modifications
Added support for the indexer keys and integrated them with file storage and MoonCake storage
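As a rough illustration of the "2 objects per page" idea from the conversation, the sketch below stores a kv object and an indexer object under two keys derived from one page hash. This is not the PR's code: the key suffixes, the `DictPageStore` class, and the rule that a page counts as a hit only when both objects exist are assumptions, and a plain dict stands in for the file or MoonCake backend.

```python
from typing import Dict, List, Optional, Tuple

KV_SUFFIX = "kv"            # hypothetical key suffix
INDEXER_SUFFIX = "indexer"  # hypothetical key suffix


def object_keys(page_hash: str) -> Tuple[str, str]:
    """Derive the two storage keys for one page (naming is an assumption)."""
    return f"{page_hash}_{KV_SUFFIX}", f"{page_hash}_{INDEXER_SUFFIX}"


class DictPageStore:
    """In-memory stand-in for an L3 backend; not a real storage API."""

    def __init__(self) -> None:
        self._objects: Dict[str, bytes] = {}

    def put_page(self, page_hash: str, kv: bytes, indexer: bytes) -> None:
        kv_key, idx_key = object_keys(page_hash)
        self._objects[kv_key] = kv
        self._objects[idx_key] = indexer

    def get_page(self, page_hash: str) -> Optional[Tuple[bytes, bytes]]:
        # Assumption: a page is usable only if both objects are present.
        kv_key, idx_key = object_keys(page_hash)
        if kv_key in self._objects and idx_key in self._objects:
            return self._objects[kv_key], self._objects[idx_key]
        return None

    def prefix_hit_length(self, page_hashes: List[str]) -> int:
        # Count consecutive pages from the start that are fully present.
        hits = 0
        for page_hash in page_hashes:
            if self.get_page(page_hash) is None:
                break
            hits += 1
        return hits
```

Treating a page with only one of its two objects as a miss keeps the kv data and the indexer data consistent, at the cost of two lookups per page.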
Accuracy Tests
Launch command with MoonCake (server + storage client on localhost)
gsm8k test command
First time
The second run after restarting the server (the cache is saved in the MoonCake Store):
The logs show cache hits immediately after launch: