[Draft] Incremental KV Cache Transfer for PD Disaggregation #22234

Open
yudian0504 wants to merge 3 commits into sgl-project:main from antgroup:pd_incremental_transfer

yudian0504 commented Apr 7, 2026

The code is being cherry-picked from the internal repository incrementally.

Reproduce (8×H20), MiniMax-M2.5

Prefill

CUDA_VISIBLE_DEVICES=0,1,2,3 python -m sglang.launch_server \
--model-path /home/admin/MiniMax-M2.5/ \
--host 0.0.0.0 \
--port 8188 \
--trust-remote-code \
--tp-size 4 \
--page-size 64 \
--max-running-requests 32 \
--mem-fraction-static 0.9 \
--chunked-prefill-size 8192 \
--attention-backend fa3 \
--disaggregation-mode prefill \
--disaggregation-transfer-backend mooncake \
--disaggregation-ib-device mlx5_bond_0,mlx5_bond_1,mlx5_bond_2,mlx5_bond_3

Note: keep --page-size 64 to prevent transfer fragmentation after the decode node has been running for a long time.
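
The fragmentation concern can be illustrated with a small sketch. It is not SGLang's allocator or Mooncake's transfer logic; `contiguous_runs` and `paged_slots` are hypothetical helpers, and the assumption is that each contiguous run of KV slots maps to one transfer. Under that assumption, a fragmented pool with single-token granularity can need one transfer per token, while 64-token pages guarantee runs of at least 64 slots.

```python
from typing import List, Tuple


def contiguous_runs(slots: List[int]) -> List[Tuple[int, int]]:
    """Merge slot indices into (start, length) runs of consecutive slots."""
    runs: List[Tuple[int, int]] = []
    for s in sorted(slots):
        if runs and runs[-1][0] + runs[-1][1] == s:
            # Slot extends the current run by one.
            runs[-1] = (runs[-1][0], runs[-1][1] + 1)
        else:
            # Gap found: start a new run.
            runs.append((s, 1))
    return runs


def paged_slots(page_ids: List[int], page_size: int) -> List[int]:
    """Expand page ids into the slot indices they cover (hypothetical layout)."""
    return [p * page_size + i for p in page_ids for i in range(page_size)]


# Worst case without paging: 128 tokens land on every other free slot.
token_level = list(range(0, 256, 2))
# The same 128 tokens placed in two non-adjacent 64-token pages.
paged = paged_slots([3, 9], page_size=64)

print(len(contiguous_runs(token_level)))  # 128 runs
print(len(contiguous_runs(paged)))        # 2 runs
```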

Decode

CUDA_VISIBLE_DEVICES=4,5,6,7 python -m sglang.launch_server \
--model-path /home/admin/MiniMax-M2.5/ \
--host 0.0.0.0 \
--port 8189 \
--trust-remote-code \
--tp-size 4 \
--page-size 64 \
--max-running-requests 128 \
--mem-fraction-static 0.9 \
--chunked-prefill-size 8192 \
--attention-backend fa3 \
--disaggregation-mode decode \
--disaggregation-transfer-backend mooncake \
--disaggregation-enable-decode-radix-cache \
--disaggregation-ib-device mlx5_bond_0,mlx5_bond_1,mlx5_bond_2,mlx5_bond_3

Router

python -m sglang_router.launch_router \
--pd-disaggregation \
--prefill http://127.0.0.1:8188/ \
--decode http://127.0.0.1:8189/ \
--host 0.0.0.0 \
--port 8001
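
Before running the benchmark, a single request through the router is a quick way to check that the prefill and decode servers are wired up. The sketch below assumes the router forwards SGLang's native `/generate` endpoint on the port used above; `build_generate_request` and `send_request` are illustrative helper names, not part of this PR.

```python
import json
import urllib.request

ROUTER_URL = "http://127.0.0.1:8001"  # router host/port from the launch command above


def build_generate_request(prompt: str, max_new_tokens: int = 32,
                           base_url: str = ROUTER_URL) -> urllib.request.Request:
    """Build a POST request for the /generate endpoint (assumed to be exposed by the router)."""
    payload = {
        "text": prompt,
        "sampling_params": {"max_new_tokens": max_new_tokens, "temperature": 0},
    }
    return urllib.request.Request(
        f"{base_url}/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_request(req: urllib.request.Request, timeout: float = 60.0) -> dict:
    """Send the request and return the parsed JSON response."""
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read())
```

With all three processes running, `send_request(build_generate_request("The capital of France is"))` should return a JSON object with the generated text.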

Benchmark

python -m sglang.bench_serving \
--backend sglang \
--host 127.0.0.1 \
--port 8001 \
--dataset-name generated-shared-prefix \
--gsp-num-groups 80 \
--gsp-prompts-per-group 8 \
--gsp-system-prompt-len 7000 \
--gsp-question-len 1000 \
--gsp-output-len 1000 \
--gsp-range-ratio 0.5 \
--num-prompts 1024 \
--max-concurrency 128
Results on main:
============ Serving Benchmark Result ============
Backend:                                 sglang    
Traffic request rate:                    inf       
Max request concurrency:                 128       
Successful requests:                     640       
Benchmark duration (s):                  274.12    
Total input tokens:                      4343739   
Total input text tokens:                 4343739   
Total generated tokens:                  478516    
Total generated tokens (retokenized):    513015    
Request throughput (req/s):              2.33      
Input token throughput (tok/s):          15846.12  
Output token throughput (tok/s):         1745.64   
Peak output token throughput (tok/s):    1951.00   
Peak concurrent requests:                135       
Total token throughput (tok/s):          17591.77  
Concurrency:                             119.15    
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   51034.87  
Median E2E Latency (ms):                 51106.88  
P90 E2E Latency (ms):                    61808.48  
P99 E2E Latency (ms):                    76013.41  
---------------Time to First Token----------------
Mean TTFT (ms):                          19837.62  
Median TTFT (ms):                        20315.57  
P99 TTFT (ms):                           38496.58  
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          41.80     
Median TPOT (ms):                        42.67     
P90 TPOT (ms):                           44.00     
P99 TPOT (ms):                           44.47     
---------------Inter-Token Latency----------------
Mean ITL (ms):                           41.77     
Median ITL (ms):                         42.16     
P90 ITL (ms):                            46.13     
P95 ITL (ms):                            48.55     
P99 ITL (ms):                            65.06     
Max ITL (ms):                            427.93    
==================================================
Results with this PR:
============ Serving Benchmark Result ============
Backend:                                 sglang    
Traffic request rate:                    inf       
Max request concurrency:                 128       
Successful requests:                     640       
Benchmark duration (s):                  233.71    
Total input tokens:                      4341981   
Total input text tokens:                 4341981   
Total generated tokens:                  478516    
Total generated tokens (retokenized):    528318    
Request throughput (req/s):              2.74      
Input token throughput (tok/s):          18578.34  
Output token throughput (tok/s):         2047.46   
Peak output token throughput (tok/s):    2336.00   
Peak concurrent requests:                137       
Total token throughput (tok/s):          20625.80  
Concurrency:                             120.45    
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   43984.84  
Median E2E Latency (ms):                 43068.02  
P90 E2E Latency (ms):                    56656.78  
P99 E2E Latency (ms):                    68406.12  
---------------Time to First Token----------------
Mean TTFT (ms):                          4564.99   
Median TTFT (ms):                        3127.30   
P99 TTFT (ms):                           21727.59  
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          52.83     
Median TPOT (ms):                        55.06     
P90 TPOT (ms):                           56.87     
P99 TPOT (ms):                           57.54     
---------------Inter-Token Latency----------------
Mean ITL (ms):                           52.76     
Median ITL (ms):                         55.25     
P90 ITL (ms):                            62.87     
P95 ITL (ms):                            74.22     
P99 ITL (ms):                            89.70     
Max ITL (ms):                            479.11    
==================================================
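
To make the two runs easier to compare, the relative change of a few headline metrics can be computed directly from the tables above. This is a small helper sketch (`pct_change` is an illustrative name); the numbers are copied from the two result blocks. It shows mean TTFT dropping by roughly 77% and total token throughput rising by about 17%, while mean TPOT increases by about 26%.

```python
def pct_change(before: float, after: float) -> float:
    """Relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100.0


# (main, PR) values copied from the two benchmark result tables.
metrics = {
    "Mean TTFT (ms)": (19837.62, 4564.99),
    "Median TTFT (ms)": (20315.57, 3127.30),
    "Mean E2E latency (ms)": (51034.87, 43984.84),
    "Total token throughput (tok/s)": (17591.77, 20625.80),
    "Mean TPOT (ms)": (41.80, 52.83),
}

for name, (main, pr) in metrics.items():
    print(f"{name:32s} {main:>10.2f} -> {pr:>10.2f}  ({pct_change(main, pr):+.1f}%)")
```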

ishandhanani added a commit that referenced this pull request Apr 7, 2026
- Remove unnecessary running_batch guard in decode idle memory check:
  running requests' pages are already in the tree via process_prebuilt
  -> cache_unfinished_req.
- Move start_send_idx update after send() to prevent cursor advance
  on skipped/failed sends.
- Add TP sync guards in _update_handshake_waiters and pop_transferred
  to prevent gloo hangs from queue size divergence across TP ranks.
- Add diagnostics to Bug #14 KV cache full assertion in _pre_alloc.

Cursor fix and TP sync guards inspired by #22234 (yudian0504).
Validated on sa-b200 Qwen 32B 3P1D at concurrency 128 and 256.
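
The cursor fix in the commit message above can be sketched as follows. This is a minimal illustration of the idea, not the code from either PR: `ChunkCursor` and its `send` callback are hypothetical, and only the name `start_send_idx` comes from the commit message. The cursor advances only after a send reports success, so a skipped or failed send leaves the same range eligible for the next attempt.

```python
from typing import Callable, List


class ChunkCursor:
    """Tracks how much of a request's KV index list has been sent so far."""

    def __init__(self, send: Callable[[List[int]], bool]):
        self._send = send          # hypothetical transport; returns True on success
        self.start_send_idx = 0    # first index that has not been sent yet

    def send_new(self, kv_indices: List[int]) -> bool:
        end = len(kv_indices)
        if end <= self.start_send_idx:
            return False  # nothing new to send; cursor stays where it is
        ok = self._send(kv_indices[self.start_send_idx:end])
        if ok:
            # Advance only after a successful send, never before it.
            self.start_send_idx = end
        return ok
```

If the cursor were updated before `send()`, a failed send would silently drop that range from every later incremental transfer.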

yudian0504 commented Apr 9, 2026

Reproduce, hybrid attention (8×H20), Qwen3.5-35B-A3B

Prefill

CUDA_VISIBLE_DEVICES=0,1,2,3 python -m sglang.launch_server \
--model-path /home/admin/Qwen3.5-35B-A3B/ \
--host 0.0.0.0 \
--port 8188 \
--trust-remote-code \
--tp-size 4 \
--page-size 64 \
--max-running-requests 32 \
--mem-fraction-static 0.9 \
--chunked-prefill-size 8192 \
--mamba-scheduler-strategy extra_buffer \
--disaggregation-mode prefill \
--disaggregation-transfer-backend mooncake \
--disaggregation-ib-device mlx5_bond_0,mlx5_bond_1,mlx5_bond_2,mlx5_bond_3

Note: keep --page-size 64 to prevent transfer fragmentation after the decode node has been running for a long time.

Decode

CUDA_VISIBLE_DEVICES=4,5,6,7 python -m sglang.launch_server \
--model-path /home/admin/Qwen3.5-35B-A3B/ \
--host 0.0.0.0 \
--port 8189 \
--trust-remote-code \
--tp-size 4 \
--page-size 64 \
--max-running-requests 128 \
--mem-fraction-static 0.9 \
--chunked-prefill-size 8192 \
--speculative-algorithm EAGLE \
--speculative-num-steps 3 \
--speculative-eagle-topk 1 \
--speculative-num-draft-tokens 4 \
--mamba-scheduler-strategy extra_buffer \
--disaggregation-mode decode \
--disaggregation-enable-decode-radix-cache \
--disaggregation-transfer-backend mooncake \
--disaggregation-ib-device mlx5_bond_0,mlx5_bond_1,mlx5_bond_2,mlx5_bond_3
