
Use bootstrap server to sync prefill dp rank #14726

Closed
changhuaixin wants to merge 2 commits into sgl-project:main from openanolis:changhuaixin/poc_for_get_prefill_dp_rank

Conversation

@changhuaixin
Contributor

@changhuaixin changhuaixin commented Dec 9, 2025

Motivation

As described in feature requests #10174 and #13052, use the bootstrap server to store the prefill dp rank, allowing the prefill server to use load-balancing methods.

Modifications

The bootstrap process is modified as follows:
[Diagram: the modified bootstrap process]

The case where the DPC assigns the prefill dp rank is now handled, enabling us to support load-balancing methods other than follow-bootstrap-room.

The prefill server now always registers its dp rank with the bootstrap server. The decode server checks whether data_parallel_rank is assigned in GenerateReqInput.
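The register/lookup flow can be sketched with a hypothetical in-memory model of the bootstrap-server table. This is illustrative only, not the actual sglang implementation: the class and method names are invented, and the TTL-based cleanup mirrors the expired-entry cleanup this PR adds.

```python
# Hypothetical sketch of the bootstrap-server table that maps a
# bootstrap_room to the prefill dp rank registered for it. Names
# (PrefillDpRankTable, register, lookup, cleanup_expired) are
# illustrative; the real PR talks to the bootstrap server over HTTP.
import time
from typing import Optional


class PrefillDpRankTable:
    """Maps bootstrap_room -> (prefill dp rank, registration time)."""

    def __init__(self, ttl_s: float = 300.0):
        self._table: dict[int, tuple[int, float]] = {}
        self._ttl_s = ttl_s

    def register(self, bootstrap_room: int, dp_rank: int) -> None:
        # Called by the prefill server once its dp rank is chosen.
        self._table[bootstrap_room] = (dp_rank, time.monotonic())

    def lookup(self, bootstrap_room: int) -> Optional[int]:
        # Called by the decode server when GenerateReqInput carries no
        # data_parallel_rank; returns None if nothing is registered.
        entry = self._table.get(bootstrap_room)
        return entry[0] if entry else None

    def cleanup_expired(self) -> int:
        # Drop entries older than the TTL, mirroring the PR's
        # expired-entry cleanup; returns how many entries were dropped.
        now = time.monotonic()
        stale = [k for k, (_, t) in self._table.items() if now - t > self._ttl_s]
        for k in stale:
            del self._table[k]
        return len(stale)
```

On the decode side, an explicitly assigned data_parallel_rank in the request would take precedence; the table lookup is only the fallback.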

Accuracy Tests

None

Evaluating communication overhead

It takes less than 2 ms on average to put and get bootstrap info from the bootstrap server.

[2025-12-31 18:01:40 DP2 TP2 EP2] Prefill DP rank registration stats: 0.000962s
[2025-12-31 18:01:40 DP1 TP1 EP1] Prefill DP rank registration stats: 0.001020s
[2025-12-31 18:01:40 DP1 TP1 EP1] Prefill DP rank registration stats: 0.001160s
[2025-12-31 18:01:42 DP3 TP3 EP3] Prefill DP rank registration stats: 0.001908s
[2025-12-31 18:01:42 DP0 TP0 EP0] Prefill DP rank registration stats: 0.001397s
[2025-12-31 18:01:42 DP2 TP2 EP2] Prefill DP rank registration stats: 0.001447s
[2025-12-31 18:01:44 DP3 TP3 EP3] Prefill DP rank registration stats: 0.003155s
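As a sanity check, the mean of the durations logged above can be computed directly; a short sketch:

```python
# Durations (seconds) copied from the registration log lines above.
durations = [0.000962, 0.001020, 0.001160, 0.001908, 0.001397, 0.001447, 0.003155]

mean_s = sum(durations) / len(durations)
print(f"mean registration time: {mean_s * 1000:.3f} ms")  # well under 2 ms
```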

Benchmarking and Profiling

I used Qwen3-235B to verify its effectiveness. Reducing TTFT is our primary goal, and ISL/OSL is set to 3500/50. In this case the maximum throughput of the prefill server is around 3.2 req/s, so request rates of 1/2/2.6/3.2/4/inf were evaluated.

| request rate | original mean TTFT (ms) | optimized mean TTFT (ms) | optimized ratio |
|---|---|---|---|
| 1 | 1860.30 | 1432.17 | 23.01% |
| 2 | 3896.73 | 2669.39 | 31.49% |
| 2.6 | 5976.88 | 3854.59 | 35.50% |
| 3.2 | 10446.72 | 4816.37 | 53.89% |
| 4 | 15599.93 | 12293.58 | 21.19% |
| inf | 37264.81 | 37184.89 | 0.21% |
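The "optimized ratio" column is the relative TTFT reduction, (original − optimized) / original. A small sketch (the `improvement` helper is illustrative) recomputes it from the table:

```python
# Recompute the optimized ratio from the (original, optimized) mean TTFT
# pairs in milliseconds, as reported in the table above.
rows = {
    "1": (1860.30, 1432.17),
    "2": (3896.73, 2669.39),
    "2.6": (5976.88, 3854.59),
    "3.2": (10446.72, 4816.37),
    "4": (15599.93, 12293.58),
    "inf": (37264.81, 37184.89),
}


def improvement(original: float, optimized: float) -> float:
    """Relative TTFT reduction: (original - optimized) / original."""
    return (original - optimized) / original


for rate, (orig_ttft, opt_ttft) in rows.items():
    print(f"request rate {rate}: {improvement(orig_ttft, opt_ttft):.2%}")
```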

Throughput also increased from 3.09 to 3.13 req/s (about 1.3%). In other cases the gain may be larger.

Without this patch and request rate 3.2

============ Serving Benchmark Result ============
Backend:                                 sglang
Traffic request rate:                    3.2
Max request concurrency:                 200
Successful requests:                     200
Benchmark duration (s):                  64.74
Total input tokens:                      700000
Total input text tokens:                 700000
Total input vision tokens:               0
Total generated tokens:                  1000
Total generated tokens (retokenized):    1000
Request throughput (req/s):              3.09
Input token throughput (tok/s):          10813.14
Output token throughput (tok/s):         15.45
Peak output token throughput (tok/s):    135.00
Peak concurrent requests:                50
Total token throughput (tok/s):          10828.58
Concurrency:                             32.30
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   10453.78
Median E2E Latency (ms):                 10587.19
---------------Time to First Token----------------
Mean TTFT (ms):                          10446.72
Median TTFT (ms):                        10587.14
P99 TTFT (ms):                           14396.50
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          1.77
Median TPOT (ms):                        0.01
P99 TPOT (ms):                           32.92
---------------Inter-Token Latency----------------
Mean ITL (ms):                           1.76
Median ITL (ms):                         0.01
P95 ITL (ms):                            21.20
P99 ITL (ms):                            42.23
Max ITL (ms):                            50.29
==================================================

With this patch

============ Serving Benchmark Result ============
Backend:                                 sglang
Traffic request rate:                    3.2
Max request concurrency:                 200
Successful requests:                     200
Benchmark duration (s):                  63.84
Total input tokens:                      700000
Total input text tokens:                 700000
Total input vision tokens:               0
Total generated tokens:                  1000
Total generated tokens (retokenized):    1000
Request throughput (req/s):              3.13
Input token throughput (tok/s):          10965.42
Output token throughput (tok/s):         15.66
Peak output token throughput (tok/s):    60.00
Peak concurrent requests:                32
Total token throughput (tok/s):          10981.08
Concurrency:                             15.10
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   4819.28
Median E2E Latency (ms):                 5100.56
---------------Time to First Token----------------
Mean TTFT (ms):                          4816.37
Median TTFT (ms):                        5100.47
P99 TTFT (ms):                           7569.13
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          0.73
Median TPOT (ms):                        0.02
P99 TPOT (ms):                           23.07
---------------Inter-Token Latency----------------
Mean ITL (ms):                           0.73
Median ITL (ms):                         0.01
P95 ITL (ms):                            0.10
P99 ITL (ms):                            23.56
Max ITL (ms):                            43.52
==================================================

bench_serving and launch commands

bench_serving:

```shell
python -m sglang.bench_serving --backend sglang --model /root/workspace/models/Qwen3-235B-A22B-Instruct-2507-FP8 --pd-separated --host localhost --port 8000 --dataset-name random --dataset-path /root/workspace/models/ShareGPT_V3_unfiltered_cleaned_split.json --random-input-len 3500 --random-output-len 5 --random-range-ratio 1 --request-rate 1 --num-prompts 200 --max-concurrency 200
```

prefill server:

```shell
python3 -m sglang.launch_server --model-path /root/workspace/models/Qwen3-235B-A22B-Instruct-2507-FP8 --port 30001 --base-gpu-id 0 --disaggregation-mode prefill --disable-radix-cache --disaggregation-bootstrap-port 8991 --host=172.26.228.72 --mem-fraction-static 0.75 --tp-size 4 --ep-size 4 --enable-dp-attention --dp-size 4 --moe-a2a-backend deepep --cuda-graph-max-bs 128 --chunked-prefill-size 160000 --load-balance-method round_robin
```

decode server:

```shell
python3 -m sglang.launch_server --model-path /root/workspace/models/Qwen3-235B-A22B-Instruct-2507-FP8 --port 31001 --base-gpu-id 4 --disaggregation-mode decode --disable-radix-cache --host=172.26.228.72 --mem-fraction-static 0.75 --tp-size 4 --ep-size 4 --enable-dp-attention --dp-size 4 --moe-a2a-backend deepep --attention-backend flashinfer --cuda-graph-max-bs 128 --load-balance-method shortest_queue --prefill-round-robin-balance --decode-log-interval 1
```

router:

```shell
/sgl-router --pd-disaggregation --prefill http://172.26.228.72:30001 8991 --decode http://172.26.228.72:31001 --policy cache_aware --port 8000
```

Checklist

@github-actions github-actions bot added the documentation Improvements or additions to documentation label Dec 9, 2025
@gemini-code-assist
Contributor

Summary of Changes

Hello @changhuaixin, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request implements a new system for synchronizing prefill data parallel (DP) ranks through a central bootstrap server. The primary goal is to facilitate more robust load balancing across prefill servers by providing a centralized, dynamic registry for DP ranks. This change enhances the distributed architecture by allowing prefill instances to register their presence and retrieve necessary routing information, moving towards a more flexible and scalable setup for handling requests.

Highlights

  • New Environment Variable: Introduced SGLANG_SYNC_PREFILL_DP_RANK to control whether prefill data parallel (DP) ranks are synchronized via the bootstrap server.
  • Bootstrap Server Enhancements: The bootstrap server now supports registering and retrieving prefill DP rank information for specific bootstrap_room IDs. It also includes a cleanup mechanism to remove expired entries from the prefill_dp_rank_table.
  • Prefill DP Rank Synchronization Logic: Modified the KVCacheManager to conditionally register its prefill DP rank with the bootstrap server and to fetch the appropriate prefill DP rank for incoming requests, enabling better load balancing.
  • API Updates for Disaggregation Backends: The init methods across various disaggregation backends (common, fake, mooncake, nixl) have been updated to accept an optional prefill_dp_rank argument, integrating the new synchronization mechanism.
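Given the highlights above, the decode-side decision could look roughly like the sketch below. Only the `SGLANG_SYNC_PREFILL_DP_RANK` variable name comes from the PR; the `resolve_prefill_dp_rank` helper and the accepted truthy values are assumptions for illustration.

```python
# Hypothetical decode-side selection of the prefill dp rank, gated by the
# SGLANG_SYNC_PREFILL_DP_RANK environment variable introduced in this PR.
import os
from typing import Callable, Optional


def resolve_prefill_dp_rank(
    req_dp_rank: Optional[int],
    fetch_from_bootstrap: Callable[[], Optional[int]],
) -> Optional[int]:
    """Prefer an explicitly assigned data_parallel_rank from the request;
    otherwise, when syncing is enabled, fall back to the dp rank that the
    prefill server registered with the bootstrap server."""
    if req_dp_rank is not None:
        return req_dp_rank
    # Accepted truthy values here are an assumption, not the real parsing.
    if os.environ.get("SGLANG_SYNC_PREFILL_DP_RANK", "0") in ("1", "true"):
        return fetch_from_bootstrap()
    return None
```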


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

The pull request introduces functionality to synchronize prefill data parallel ranks through a bootstrap server, which is a significant enhancement for load balancing in disaggregated setups. The changes involve adding a new environment variable, modifying bootstrap server communication to register and retrieve DP ranks, and updating the KV sender/receiver logic to utilize this new synchronization. The documentation has been updated to reflect the new environment variable. Overall, the changes are well-structured and include appropriate error handling and logging.

@hnyls2002
Collaborator

Good to see we finally have this!

@changhuaixin changhuaixin force-pushed the changhuaixin/poc_for_get_prefill_dp_rank branch from 3dd78c8 to 4cb5a1e Compare December 15, 2025 02:50
@changhuaixin changhuaixin force-pushed the changhuaixin/poc_for_get_prefill_dp_rank branch from 4cb5a1e to 3123300 Compare December 16, 2025 11:37
@ShangmingCai
Collaborator

The result with --num-prompts 100 --max-concurrency 100 looks good. Can you run another set with --num-prompts 1000 and --max-concurrency from 100 to 200, 300, and 400, to test how max-concurrency affects the performance gain?

@changhuaixin
Contributor Author

> The result with --num-prompts 100 --max-concurrency 100 looks good. Can you run another set with --num-prompts 1000 and --max-concurrency from 100 to 200, 300, and 400, to test how max-concurrency affects the performance gain?

I have updated the results with --num-prompts 200 --max-concurrency 200 at request rates 1/2/2.6/3.2/4/inf. This feature shows a significant TTFT optimization, especially when the request rate approaches the prefill servers' maximum throughput.

Collaborator

@ShangmingCai ShangmingCai left a comment


Otherwise LGTM, just some minor suggestions; we should be able to wrap it up after another round of review.

@ShangmingCai
Collaborator

/tag-and-rerun-ci

@changhuaixin changhuaixin force-pushed the changhuaixin/poc_for_get_prefill_dp_rank branch from 0f6ec93 to c774935 Compare December 31, 2025 08:56
@changhuaixin changhuaixin force-pushed the changhuaixin/poc_for_get_prefill_dp_rank branch from c774935 to 08edc50 Compare December 31, 2025 15:19
@chivalryq

Hi guys, if I understand it right, this PR changes how the decode instance gets the dp-rank for a request, but one question remains: how can an external router specify different dp-ranks for P/D instances? I think #16059 is meant to solve that. What's your opinion on this problem?

Signed-off-by: Chang Huaixin (OpenAnolis) <changhuaixin@linux.alibaba.com>
Signed-off-by: Chang Huaixin (OpenAnolis) <changhuaixin@linux.alibaba.com>

Labels

deepseek, documentation (Improvements or additions to documentation), npu, run-ci

4 participants