[Draft] Incremental KV Cache Transfer for PD Disaggregation #22234
Open
yudian0504 wants to merge 3 commits into sgl-project:main from
ishandhanani added a commit that referenced this pull request, Apr 7, 2026
- Remove unnecessary running_batch guard in decode idle memory check: running requests' pages are already in the tree via process_prebuilt -> cache_unfinished_req.
- Move start_send_idx update after send() to prevent cursor advance on skipped/failed sends.
- Add TP sync guards in _update_handshake_waiters and pop_transferred to prevent gloo hangs from queue size divergence across TP ranks.
- Add diagnostics to Bug #14 KV cache full assertion in _pre_alloc.
Cursor fix and TP sync guards inspired by #22234 (yudian0504). Validated on sa-b200 Qwen 32B 3P1D at concurrency 128 and 256.
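The second bullet in the commit message above (advance `start_send_idx` only after `send()`) can be illustrated with a minimal sketch. This is not the code from the commit or from this PR: `TransferCursor`, `send_new_pages`, and the boolean-returning `send` callback are hypothetical names chosen for illustration, and only the `start_send_idx` field name comes from the commit message.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class TransferCursor:
    """Hypothetical per-request transfer state (not the real sglang class)."""
    start_send_idx: int = 0  # index of the first page that has not been sent yet


def send_new_pages(
    cursor: TransferCursor,
    page_indices: List[int],
    send: Callable[[List[int]], bool],
) -> bool:
    """Send pages past the cursor; advance the cursor only if the send succeeded."""
    end_idx = len(page_indices)
    pending = page_indices[cursor.start_send_idx:end_idx]
    if not pending:
        return False  # nothing new to send, cursor stays where it is
    if not send(pending):
        return False  # skipped or failed send: the same pages are retried later
    # Update the cursor only after send() has reported success.
    cursor.start_send_idx = end_idx
    return True


if __name__ == "__main__":
    cursor = TransferCursor()
    pages = [10, 11, 12]
    send_new_pages(cursor, pages, lambda p: False)  # simulated failure
    print(cursor.start_send_idx)
    send_new_pages(cursor, pages, lambda p: True)   # simulated success
    print(cursor.start_send_idx)
```

If the cursor were advanced before calling `send`, a skipped or failed send would leave those pages permanently marked as transferred, which is the failure mode the commit message describes.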
ShangmingCai
reviewed
Apr 9, 2026
Reproduce - Hybrid-Attn (8*H20) Qwen3.5-35B-A3B
Prefill
CUDA_VISIBLE_DEVICES=0,1,2,3 python -m sglang.launch_server \
--model-path /home/admin/Qwen3.5-35B-A3B/ \
--host 0.0.0.0 \
--port 8188 \
--trust-remote-code \
--tp-size 4 \
--page-size 64 \
--max-running-requests 32 \
--mem-fraction-static 0.9 \
--chunked-prefill-size 8192 \
--mamba-scheduler-strategy extra_buffer \
--disaggregation-mode prefill \
--disaggregation-transfer-backend mooncake \
--disaggregation-ib-device mlx5_bond_0,mlx5_bond_1,mlx5_bond_2,mlx5_bond_3
Note: keep --page-size 64 to prevent transmission fragmentation after the decode node has been running for a long time.
Decode
CUDA_VISIBLE_DEVICES=4,5,6,7 python -m sglang.launch_server \
--model-path /home/admin/Qwen3.5-35B-A3B/ \
--host 0.0.0.0 \
--port 8189 \
--trust-remote-code \
--tp-size 4 \
--page-size 64 \
--max-running-requests 128 \
--mem-fraction-static 0.9 \
--chunked-prefill-size 8192 \
--speculative-algorithm EAGLE \
--speculative-num-steps 3 \
--speculative-eagle-topk 1 \
--speculative-num-draft-tokens 4 \
--mamba-scheduler-strategy extra_buffer \
--disaggregation-mode decode \
--disaggregation-enable-decode-radix-cache \
--disaggregation-transfer-backend mooncake \
--disaggregation-ib-device mlx5_bond_0,mlx5_bond_1,mlx5_bond_2,mlx5_bond_3
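The note about `--page-size 64` can be made concrete with a rough sketch. It assumes, without confirmation from this PR, that "transmission fragmentation" means a request's KV slots end up scattered over many non-contiguous regions of the pool, and that each contiguous run is sent as one transfer. The helper names `count_contiguous_runs` and `worst_case_runs` are hypothetical.

```python
import math
from typing import Iterable


def count_contiguous_runs(slot_indices: Iterable[int]) -> int:
    """Count runs of consecutive slot indices (assumed: one transfer per run)."""
    runs = 0
    prev = None
    for idx in slot_indices:
        if prev is None or idx != prev + 1:
            runs += 1  # a gap starts a new run
        prev = idx
    return runs


def worst_case_runs(num_tokens: int, page_size: int) -> int:
    """Upper bound on runs when slots are allocated in whole pages."""
    return math.ceil(num_tokens / page_size)


if __name__ == "__main__":
    # 8192 tokens matches --chunked-prefill-size in the commands above.
    print(worst_case_runs(8192, 1))
    print(worst_case_runs(8192, 64))
```

Under these assumptions, page-granular allocation caps the number of separate transfers per chunk at the number of pages, no matter how fragmented the pool becomes over a long-running decode process.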
The code from the internal repository is being cherry-picked incrementally...
Reproduce (8*H20) M2.5
Prefill
CUDA_VISIBLE_DEVICES=0,1,2,3 python -m sglang.launch_server \
--model-path /home/admin/MiniMax-M2.5/ \
--host 0.0.0.0 \
--port 8188 \
--trust-remote-code \
--tp-size 4 \
--page-size 64 \
--max-running-requests 32 \
--mem-fraction-static 0.9 \
--chunked-prefill-size 8192 \
--attention-backend fa3 \
--disaggregation-mode prefill \
--disaggregation-transfer-backend mooncake \
--disaggregation-ib-device mlx5_bond_0,mlx5_bond_1,mlx5_bond_2,mlx5_bond_3
Note: keep --page-size 64 to prevent transmission fragmentation after the decode node has been running for a long time.
Decode
Router
Benchmark
main
pr