[Draft] Incremental KV Cache Transfer for PD Disaggregation by yudian0504 · Pull Request #22234 · sgl-project/sglang

yudian0504 · 2026-04-07T03:55:33Z

The code from the internal repository is being cherry-picked incrementally...

Reproduce (8*H20) M2.5

Prefill

CUDA_VISIBLE_DEVICES=0,1,2,3 python -m sglang.launch_server \
--model-path /home/admin/MiniMax-M2.5/ \
--host 0.0.0.0 \
--port 8188 \
--trust-remote-code \
--tp-size 4 \
--page-size 64 \
--max-running-requests 32 \
--mem-fraction-static 0.9 \
--chunked-prefill-size 8192 \
--attention-backend fa3 \
--disaggregation-mode prefill \
--disaggregation-transfer-backend mooncake \
--disaggregation-ib-device mlx5_bond_0,mlx5_bond_1,mlx5_bond_2,mlx5_bond_3

Note: keep --page-size 64. It prevents KV transfer fragmentation after the decode node has been running for a long time.
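As a rough illustration of the fragmentation concern, the sketch below counts how many contiguous segments a request's KV slots split into when they are drawn from a fragmented pool. The allocator model, the helper names `count_segments` and `allocate`, and the pool size are assumptions made for this example; they are not SGLang's allocator or Mooncake's transfer logic.

```python
import random

def count_segments(slots):
    """Count maximal runs of consecutive slot indices.

    In this toy model each run corresponds to one contiguous transfer.
    """
    slots = sorted(slots)
    if not slots:
        return 0
    return 1 + sum(1 for a, b in zip(slots, slots[1:]) if b != a + 1)

def allocate(num_tokens, page_size, pool_pages, rng):
    """Pick random free pages (a fragmented pool) and expand them to token slots."""
    num_pages = -(-num_tokens // page_size)  # ceiling division
    pages = rng.sample(range(pool_pages), num_pages)
    return [p * page_size + i for p in pages for i in range(page_size)]

rng = random.Random(0)
total_slots = 1 << 16  # hypothetical pool of 65536 token slots
for page_size in (1, 64):
    slots = allocate(8000, page_size, total_slots // page_size, rng)
    print(f"page_size={page_size}: {count_segments(slots)} contiguous segments")
```

With 64-token pages an 8000-token request can never split into more than 125 segments in this model, whereas with single-token pages the count can approach the token count.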

Decode

CUDA_VISIBLE_DEVICES=4,5,6,7 python -m sglang.launch_server \
--model-path /home/admin/MiniMax-M2.5/ \
--host 0.0.0.0 \
--port 8189 \
--trust-remote-code \
--tp-size 4 \
--page-size 64 \
--max-running-requests 128 \
--mem-fraction-static 0.9 \
--chunked-prefill-size 8192 \
--attention-backend fa3 \
--disaggregation-mode decode \
--disaggregation-transfer-backend mooncake \
--disaggregation-enable-decode-radix-cache \
--disaggregation-ib-device mlx5_bond_0,mlx5_bond_1,mlx5_bond_2,mlx5_bond_3

Router

python -m sglang_router.launch_router \
--pd-disaggregation \
--prefill http://127.0.0.1:8188/ \
--decode http://127.0.0.1:8189/ \
--host 0.0.0.0 \
--port 8001

Benchmark

python -m sglang.bench_serving \
--backend sglang \
--host 127.0.0.1 \
--port 8001 \
--dataset-name generated-shared-prefix \
--gsp-num-groups 80 \
--gsp-prompts-per-group 8 \
--gsp-system-prompt-len 7000 \
--gsp-question-len 1000 \
--gsp-output-len 1000 \
--gsp-range-ratio 0.5 \
--num-prompts 1024 \
--max-concurrency 128

main

============ Serving Benchmark Result ============
Backend:                                 sglang    
Traffic request rate:                    inf       
Max request concurrency:                 128       
Successful requests:                     640       
Benchmark duration (s):                  274.12    
Total input tokens:                      4343739   
Total input text tokens:                 4343739   
Total generated tokens:                  478516    
Total generated tokens (retokenized):    513015    
Request throughput (req/s):              2.33      
Input token throughput (tok/s):          15846.12  
Output token throughput (tok/s):         1745.64   
Peak output token throughput (tok/s):    1951.00   
Peak concurrent requests:                135       
Total token throughput (tok/s):          17591.77  
Concurrency:                             119.15    
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   51034.87  
Median E2E Latency (ms):                 51106.88  
P90 E2E Latency (ms):                    61808.48  
P99 E2E Latency (ms):                    76013.41  
---------------Time to First Token----------------
Mean TTFT (ms):                          19837.62  
Median TTFT (ms):                        20315.57  
P99 TTFT (ms):                           38496.58  
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          41.80     
Median TPOT (ms):                        42.67     
P90 TPOT (ms):                           44.00     
P99 TPOT (ms):                           44.47     
---------------Inter-Token Latency----------------
Mean ITL (ms):                           41.77     
Median ITL (ms):                         42.16     
P90 ITL (ms):                            46.13     
P95 ITL (ms):                            48.55     
P99 ITL (ms):                            65.06     
Max ITL (ms):                            427.93    
==================================================

PR

============ Serving Benchmark Result ============
Backend:                                 sglang    
Traffic request rate:                    inf       
Max request concurrency:                 128       
Successful requests:                     640       
Benchmark duration (s):                  233.71    
Total input tokens:                      4341981   
Total input text tokens:                 4341981   
Total generated tokens:                  478516    
Total generated tokens (retokenized):    528318    
Request throughput (req/s):              2.74      
Input token throughput (tok/s):          18578.34  
Output token throughput (tok/s):         2047.46   
Peak output token throughput (tok/s):    2336.00   
Peak concurrent requests:                137       
Total token throughput (tok/s):          20625.80  
Concurrency:                             120.45    
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   43984.84  
Median E2E Latency (ms):                 43068.02  
P90 E2E Latency (ms):                    56656.78  
P99 E2E Latency (ms):                    68406.12  
---------------Time to First Token----------------
Mean TTFT (ms):                          4564.99   
Median TTFT (ms):                        3127.30   
P99 TTFT (ms):                           21727.59  
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          52.83     
Median TPOT (ms):                        55.06     
P90 TPOT (ms):                           56.87     
P99 TPOT (ms):                           57.54     
---------------Inter-Token Latency----------------
Mean ITL (ms):                           52.76     
Median ITL (ms):                         55.25     
P90 ITL (ms):                            62.87     
P95 ITL (ms):                            74.22     
P99 ITL (ms):                            89.70     
Max ITL (ms):                            479.11    
==================================================
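To make the two reports easier to compare, the following sketch computes the relative change of a few mean metrics. The values are copied from the reports above; the dictionary keys and the `pct_change` helper are names chosen for this example.

```python
# Mean values copied from the two benchmark reports above.
main = {"req/s": 2.33, "ttft_ms": 19837.62, "tpot_ms": 41.80, "e2e_ms": 51034.87}
pr = {"req/s": 2.74, "ttft_ms": 4564.99, "tpot_ms": 52.83, "e2e_ms": 43984.84}

def pct_change(before, after):
    """Relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100.0

changes = {name: pct_change(main[name], pr[name]) for name in main}
for name, delta in changes.items():
    print(f"{name}: {delta:+.1f}%")
```

On these numbers, request throughput rises by roughly 17.6%, mean TTFT falls by roughly 77%, and mean E2E latency falls by roughly 13.8%, while mean TPOT rises by roughly 26.4%.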


- Remove unnecessary running_batch guard in decode idle memory check: running requests' pages are already in the tree via process_prebuilt -> cache_unfinished_req.
- Move start_send_idx update after send() to prevent cursor advance on skipped/failed sends.
- Add TP sync guards in _update_handshake_waiters and pop_transferred to prevent gloo hangs from queue size divergence across TP ranks.
- Add diagnostics to Bug #14 KV cache full assertion in _pre_alloc.

Cursor fix and TP sync guards inspired by #22234 (yudian0504). Validated on sa-b200 Qwen 32B 3P1D at concurrency 128 and 256.
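The cursor fix mentioned in this commit message can be sketched as follows. This is a minimal toy model, not the SGLang implementation: only the name `start_send_idx` and the idea of a `send()` call come from the message, while the `IncrementalSender` class, its `flush` method, and the boolean return convention are assumptions for illustration.

```python
from typing import Callable, List

class IncrementalSender:
    """Toy model of an incremental transfer cursor (hypothetical, not SGLang code)."""

    def __init__(self, send: Callable[[List[int]], bool]):
        self.send = send         # assumed to return True only if the chunk was transferred
        self.start_send_idx = 0  # index of the first page not yet transferred

    def flush(self, page_indices: List[int]) -> bool:
        """Try to send the pages from start_send_idx to the end of page_indices."""
        pending = page_indices[self.start_send_idx:]
        if not pending:
            return False  # skipped send: nothing new, cursor untouched
        if not self.send(pending):
            return False  # failed send: cursor untouched, pages are retried later
        # Advance the cursor only after send() has succeeded.
        self.start_send_idx = len(page_indices)
        return True

# Demo: the first send fails, so the second flush retries the same pages.
outcomes = iter([False, True])
transferred: List[int] = []

def flaky_send(pages: List[int]) -> bool:
    ok = next(outcomes)
    if ok:
        transferred.extend(pages)
    return ok

sender = IncrementalSender(flaky_send)
sender.flush([10, 11, 12])      # fails, cursor stays at 0
sender.flush([10, 11, 12, 13])  # succeeds, all four pages are sent
print(transferred, sender.start_send_idx)
```

Updating the cursor before the send would have dropped pages 10 to 12 after the failed attempt; updating it afterwards keeps them pending until a send succeeds.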


yudian0504 · 2026-04-09T13:08:09Z

Reproduce - Hybrid-Attn (8*H20) Qwen3.5-35B-A3B

Prefill

CUDA_VISIBLE_DEVICES=0,1,2,3 python -m sglang.launch_server \
--model-path /home/admin/Qwen3.5-35B-A3B/ \
--host 0.0.0.0 \
--port 8188 \
--trust-remote-code \
--tp-size 4 \
--page-size 64 \
--max-running-requests 32 \
--mem-fraction-static 0.9 \
--chunked-prefill-size 8192 \
--mamba-scheduler-strategy extra_buffer \
--disaggregation-mode prefill \
--disaggregation-transfer-backend mooncake \
--disaggregation-ib-device mlx5_bond_0,mlx5_bond_1,mlx5_bond_2,mlx5_bond_3

Note: keep --page-size 64. It prevents KV transfer fragmentation after the decode node has been running for a long time.

Decode

CUDA_VISIBLE_DEVICES=4,5,6,7 python -m sglang.launch_server \
--model-path /home/admin/Qwen3.5-35B-A3B/ \
--host 0.0.0.0 \
--port 8189 \
--trust-remote-code \
--tp-size 4 \
--page-size 64 \
--max-running-requests 128 \
--mem-fraction-static 0.9 \
--chunked-prefill-size 8192 \
--speculative-algorithm EAGLE \
--speculative-num-steps 3 \
--speculative-eagle-topk 1 \
--speculative-num-draft-tokens 4 \
--mamba-scheduler-strategy extra_buffer \
--disaggregation-mode decode \
--disaggregation-enable-decode-radix-cache \
--disaggregation-transfer-backend mooncake \
--disaggregation-ib-device mlx5_bond_0,mlx5_bond_1,mlx5_bond_2,mlx5_bond_3

Incremental KV Cache Transfer for PD Disaggregation

0e6e0f4

yudian0504 requested review from ByronHsu, ShangmingCai, Ying1123, hnyls2002, merrymercy and xiezhq-hermann as code owners April 7, 2026 03:55

ShangmingCai mentioned this pull request Apr 7, 2026

[P/D disagg] - support decode side radix cache #19746

Open

6 tasks

ShangmingCai reviewed Apr 9, 2026

python/sglang/srt/server_args.py (resolved)

python/sglang/srt/disaggregation/mooncake/conn.py (outdated, resolved)

python/sglang/srt/disaggregation/base/conn.py (outdated, resolved)

yudian0504 and others added 2 commits April 9, 2026 20:41

Merge branch 'main' into pd_incremental_transfer

aeb2f08

add switch & support hybrid-attn model

dee7b1b

yudian0504 requested review from fzyzcjy, hanming-lu, hzh0425, ispobock, sufeng-buaa and yizhang2077 as code owners April 9, 2026 13:02

yudian0504 wants to merge 3 commits into sgl-project:main from antgroup:pd_incremental_transfer

2 participants
