
Use bootstrap server to sync prefill dp rank #14726

Closed
changhuaixin wants to merge 2 commits into sgl-project:main from openanolis:changhuaixin/poc_for_get_prefill_dp_rank

Conversation

@changhuaixin
Contributor

@changhuaixin changhuaixin commented Dec 9, 2025

Motivation

As described in feature requests #10174 and #13052, use the bootstrap server to store the prefill dp rank, allowing the prefill server to use load-balancing methods.

Modifications

The bootstrap process is modified as follows:
[Diagram: the modified bootstrap process]

The case where the DPC assigns the prefill dp rank is now handled, enabling us to support load-balancing methods other than follow-bootstrap-room.

The prefill server now always registers its dp rank with the bootstrap server. The decode server checks whether data_parallel_rank is assigned in GenerateReqInput.
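The register/lookup flow can be sketched with a hypothetical in-memory model of the bootstrap-server table. This is illustrative only, not the actual sglang implementation: the class and method names are invented, and the TTL-based cleanup mirrors the expired-entry cleanup this PR adds.

```python
# Hypothetical sketch of the bootstrap-server table that maps a
# bootstrap_room to the prefill dp rank registered for it. Names
# (PrefillDpRankTable, register, lookup, cleanup_expired) are
# illustrative; the real PR talks to the bootstrap server over HTTP.
import time
from typing import Optional


class PrefillDpRankTable:
    """Maps bootstrap_room -> (prefill dp rank, registration time)."""

    def __init__(self, ttl_s: float = 300.0):
        self._table: dict[int, tuple[int, float]] = {}
        self._ttl_s = ttl_s

    def register(self, bootstrap_room: int, dp_rank: int) -> None:
        # Called by the prefill server once its dp rank is chosen.
        self._table[bootstrap_room] = (dp_rank, time.monotonic())

    def lookup(self, bootstrap_room: int) -> Optional[int]:
        # Called by the decode server when GenerateReqInput carries no
        # data_parallel_rank; returns None if nothing is registered.
        entry = self._table.get(bootstrap_room)
        return entry[0] if entry else None

    def cleanup_expired(self) -> int:
        # Drop entries older than the TTL, mirroring the PR's
        # expired-entry cleanup; returns how many entries were dropped.
        now = time.monotonic()
        stale = [k for k, (_, t) in self._table.items() if now - t > self._ttl_s]
        for k in stale:
            del self._table[k]
        return len(stale)
```

On the decode side, an explicitly assigned data_parallel_rank in the request would take precedence; the table lookup is only the fallback.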

Accuracy Tests

None

Evaluating communication overhead

It takes less than 2 ms on average to put and get bootstrap info from the bootstrap server.

[2025-12-31 18:01:40 DP2 TP2 EP2] Prefill DP rank registration stats: 0.000962s
[2025-12-31 18:01:40 DP1 TP1 EP1] Prefill DP rank registration stats: 0.001020s
[2025-12-31 18:01:40 DP1 TP1 EP1] Prefill DP rank registration stats: 0.001160s
[2025-12-31 18:01:42 DP3 TP3 EP3] Prefill DP rank registration stats: 0.001908s
[2025-12-31 18:01:42 DP0 TP0 EP0] Prefill DP rank registration stats: 0.001397s
[2025-12-31 18:01:42 DP2 TP2 EP2] Prefill DP rank registration stats: 0.001447s
[2025-12-31 18:01:44 DP3 TP3 EP3] Prefill DP rank registration stats: 0.003155s
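As a sanity check, the mean of the durations logged above can be computed directly; a short sketch:

```python
# Durations (seconds) copied from the registration log lines above.
durations = [0.000962, 0.001020, 0.001160, 0.001908, 0.001397, 0.001447, 0.003155]

mean_s = sum(durations) / len(durations)
print(f"mean registration time: {mean_s * 1000:.3f} ms")  # well under 2 ms
```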

Benchmarking and Profiling

I used Qwen3-235B to verify its effectiveness. Reducing TTFT is our primary goal, and ISL/OSL is set to 3500/50. In this case the maximum throughput of the prefill server is around 3.2 req/s, so request rates of 1/2/2.6/3.2/4/inf were evaluated.

| request rate | original mean TTFT (ms) | optimized mean TTFT (ms) | optimized ratio |
|---|---|---|---|
| 1 | 1860.30 | 1432.17 | 23.01% |
| 2 | 3896.73 | 2669.39 | 31.49% |
| 2.6 | 5976.88 | 3854.59 | 35.50% |
| 3.2 | 10446.72 | 4816.37 | 53.89% |
| 4 | 15599.93 | 12293.58 | 21.19% |
| inf | 37264.81 | 37184.89 | 0.21% |
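The "optimized ratio" column is the relative TTFT reduction, (original − optimized) / original. A small sketch (the `improvement` helper is illustrative) recomputes it from the table:

```python
# Recompute the optimized ratio from the (original, optimized) mean TTFT
# pairs in milliseconds, as reported in the table above.
rows = {
    "1": (1860.30, 1432.17),
    "2": (3896.73, 2669.39),
    "2.6": (5976.88, 3854.59),
    "3.2": (10446.72, 4816.37),
    "4": (15599.93, 12293.58),
    "inf": (37264.81, 37184.89),
}


def improvement(original: float, optimized: float) -> float:
    """Relative TTFT reduction: (original - optimized) / original."""
    return (original - optimized) / original


for rate, (orig_ttft, opt_ttft) in rows.items():
    print(f"request rate {rate}: {improvement(orig_ttft, opt_ttft):.2%}")
```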

Throughput also increased from 3.09 to 3.13 req/s (about 1.3%). In other cases the gain may be larger.

Without this patch and request rate 3.2

============ Serving Benchmark Result ============
Backend:                                 sglang
Traffic request rate:                    3.2
Max request concurrency:                 200
Successful requests:                     200
Benchmark duration (s):                  64.74
Total input tokens:                      700000
Total input text tokens:                 700000
Total input vision tokens:               0
Total generated tokens:                  1000
Total generated tokens (retokenized):    1000
Request throughput (req/s):              3.09
Input token throughput (tok/s):          10813.14
Output token throughput (tok/s):         15.45
Peak output token throughput (tok/s):    135.00
Peak concurrent requests:                50
Total token throughput (tok/s):          10828.58
Concurrency:                             32.30
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   10453.78
Median E2E Latency (ms):                 10587.19
---------------Time to First Token----------------
Mean TTFT (ms):                          10446.72
Median TTFT (ms):                        10587.14
P99 TTFT (ms):                           14396.50
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          1.77
Median TPOT (ms):                        0.01
P99 TPOT (ms):                           32.92
---------------Inter-Token Latency----------------
Mean ITL (ms):                           1.76
Median ITL (ms):                         0.01
P95 ITL (ms):                            21.20
P99 ITL (ms):                            42.23
Max ITL (ms):                            50.29
==================================================

With this patch

============ Serving Benchmark Result ============
Backend:                                 sglang
Traffic request rate:                    3.2
Max request concurrency:                 200
Successful requests:                     200
Benchmark duration (s):                  63.84
Total input tokens:                      700000
Total input text tokens:                 700000
Total input vision tokens:               0
Total generated tokens:                  1000
Total generated tokens (retokenized):    1000
Request throughput (req/s):              3.13
Input token throughput (tok/s):          10965.42
Output token throughput (tok/s):         15.66
Peak output token throughput (tok/s):    60.00
Peak concurrent requests:                32
Total token throughput (tok/s):          10981.08
Concurrency:                             15.10
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   4819.28
Median E2E Latency (ms):                 5100.56
---------------Time to First Token----------------
Mean TTFT (ms):                          4816.37
Median TTFT (ms):                        5100.47
P99 TTFT (ms):                           7569.13
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          0.73
Median TPOT (ms):                        0.02
P99 TPOT (ms):                           23.07
---------------Inter-Token Latency----------------
Mean ITL (ms):                           0.73
Median ITL (ms):                         0.01
P95 ITL (ms):                            0.10
P99 ITL (ms):                            23.56
Max ITL (ms):                            43.52
==================================================

bench_serving and launch commands

bench_serving:

```shell
python -m sglang.bench_serving --backend sglang --model /root/workspace/models/Qwen3-235B-A22B-Instruct-2507-FP8 --pd-separated --host localhost --port 8000 --dataset-name random --dataset-path /root/workspace/models/ShareGPT_V3_unfiltered_cleaned_split.json --random-input-len 3500 --random-output-len 5 --random-range-ratio 1 --request-rate 1 --num-prompts 200 --max-concurrency 200
```

prefill server:

```shell
python3 -m sglang.launch_server --model-path /root/workspace/models/Qwen3-235B-A22B-Instruct-2507-FP8 --port 30001 --base-gpu-id 0 --disaggregation-mode prefill --disable-radix-cache --disaggregation-bootstrap-port 8991 --host=172.26.228.72 --mem-fraction-static 0.75 --tp-size 4 --ep-size 4 --enable-dp-attention --dp-size 4 --moe-a2a-backend deepep --cuda-graph-max-bs 128 --chunked-prefill-size 160000 --load-balance-method round_robin
```

decode server:

```shell
python3 -m sglang.launch_server --model-path /root/workspace/models/Qwen3-235B-A22B-Instruct-2507-FP8 --port 31001 --base-gpu-id 4 --disaggregation-mode decode --disable-radix-cache --host=172.26.228.72 --mem-fraction-static 0.75 --tp-size 4 --ep-size 4 --enable-dp-attention --dp-size 4 --moe-a2a-backend deepep --attention-backend flashinfer --cuda-graph-max-bs 128 --load-balance-method shortest_queue --prefill-round-robin-balance --decode-log-interval 1
```

router:

```shell
/sgl-router --pd-disaggregation --prefill http://172.26.228.72:30001 8991 --decode http://172.26.228.72:31001 --policy cache_aware --port 8000
```

Checklist

@github-actions github-actions bot added the documentation Improvements or additions to documentation label Dec 9, 2025
@gemini-code-assist
Contributor

Summary of Changes

Hello @changhuaixin, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request implements a new system for synchronizing prefill data parallel (DP) ranks through a central bootstrap server. The primary goal is to facilitate more robust load balancing across prefill servers by providing a centralized, dynamic registry for DP ranks. This change enhances the distributed architecture by allowing prefill instances to register their presence and retrieve necessary routing information, moving towards a more flexible and scalable setup for handling requests.

Highlights

  • New Environment Variable: Introduced SGLANG_SYNC_PREFILL_DP_RANK to control whether prefill data parallel (DP) ranks are synchronized via the bootstrap server.
  • Bootstrap Server Enhancements: The bootstrap server now supports registering and retrieving prefill DP rank information for specific bootstrap_room IDs. It also includes a cleanup mechanism to remove expired entries from the prefill_dp_rank_table.
  • Prefill DP Rank Synchronization Logic: Modified the KVCacheManager to conditionally register its prefill DP rank with the bootstrap server and to fetch the appropriate prefill DP rank for incoming requests, enabling better load balancing.
  • API Updates for Disaggregation Backends: The init methods across various disaggregation backends (common, fake, mooncake, nixl) have been updated to accept an optional prefill_dp_rank argument, integrating the new synchronization mechanism.
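Given the highlights above, the decode-side decision could look roughly like the sketch below. Only the `SGLANG_SYNC_PREFILL_DP_RANK` variable name comes from the PR; the `resolve_prefill_dp_rank` helper and the accepted truthy values are assumptions for illustration.

```python
# Hypothetical decode-side selection of the prefill dp rank, gated by the
# SGLANG_SYNC_PREFILL_DP_RANK environment variable introduced in this PR.
import os
from typing import Callable, Optional


def resolve_prefill_dp_rank(
    req_dp_rank: Optional[int],
    fetch_from_bootstrap: Callable[[], Optional[int]],
) -> Optional[int]:
    """Prefer an explicitly assigned data_parallel_rank from the request;
    otherwise, when syncing is enabled, fall back to the dp rank that the
    prefill server registered with the bootstrap server."""
    if req_dp_rank is not None:
        return req_dp_rank
    # Accepted truthy values here are an assumption, not the real parsing.
    if os.environ.get("SGLANG_SYNC_PREFILL_DP_RANK", "0") in ("1", "true"):
        return fetch_from_bootstrap()
    return None
```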


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

The pull request introduces functionality to synchronize prefill data parallel ranks through a bootstrap server, which is a significant enhancement for load balancing in disaggregated setups. The changes involve adding a new environment variable, modifying bootstrap server communication to register and retrieve DP ranks, and updating the KV sender/receiver logic to utilize this new synchronization. The documentation has been updated to reflect the new environment variable. Overall, the changes are well-structured and include appropriate error handling and logging.

@hnyls2002
Collaborator

Good to see we finally have this!

@changhuaixin changhuaixin force-pushed the changhuaixin/poc_for_get_prefill_dp_rank branch from 3dd78c8 to 4cb5a1e Compare December 15, 2025 02:50
@changhuaixin changhuaixin force-pushed the changhuaixin/poc_for_get_prefill_dp_rank branch from 4cb5a1e to 3123300 Compare December 16, 2025 11:37
@ShangmingCai
Collaborator

The result with --num-prompts 100 --max-concurrency 100 looks good. Can you run another set with --num-prompts 1000 and --max-concurrency from 100 to 200, 300, and 400, to test how max-concurrency affects the performance gain?

@changhuaixin
Contributor Author

> The result with --num-prompts 100 --max-concurrency 100 looks good. Can you run another set with --num-prompts 1000 and --max-concurrency from 100 to 200, 300, and 400, to test how max-concurrency affects the performance gain?

I have updated the results with --num-prompts 200 --max-concurrency 200 at request rates 1/2/2.6/3.2/4/inf. This feature shows a significant TTFT optimization, especially when the request rate approaches the prefill servers' maximum throughput.

Collaborator

@ShangmingCai ShangmingCai left a comment


Otherwise LGTM, just some minor suggestions; we should be able to wrap it up after another round of review.

@ShangmingCai
Collaborator

/tag-and-rerun-ci

@changhuaixin changhuaixin force-pushed the changhuaixin/poc_for_get_prefill_dp_rank branch from 0f6ec93 to c774935 Compare December 31, 2025 08:56
@changhuaixin changhuaixin force-pushed the changhuaixin/poc_for_get_prefill_dp_rank branch from c774935 to 08edc50 Compare December 31, 2025 15:19
@chivalryq

Hi guys, if I understand it right, this PR changes how the decode instance gets the dp-rank for a request, but one question remains: how can an external router specify different dp-ranks for P/D instances? I think #16059 is meant to solve that. What's your opinion on this problem?

Signed-off-by: Chang Huaixin (OpenAnolis) <changhuaixin@linux.alibaba.com>
Signed-off-by: Chang Huaixin (OpenAnolis) <changhuaixin@linux.alibaba.com>

Labels

deepseek, documentation (Improvements or additions to documentation), npu, run-ci

4 participants