Use bootstrap server to sync prefill dp rank #14726
changhuaixin wants to merge 2 commits into sgl-project:main
Conversation
Summary of Changes (Gemini Code Assist): This pull request implements a new system for synchronizing prefill data parallel (DP) ranks through a central bootstrap server. The primary goal is to enable more robust load balancing across prefill servers by providing a centralized, dynamic registry for DP ranks. Prefill instances register their presence and retrieve the necessary routing information, moving toward a more flexible and scalable setup for handling requests.
Code Review
The pull request introduces functionality to synchronize prefill data parallel ranks through a bootstrap server, a significant enhancement for load balancing in disaggregated setups. The changes add a new environment variable, modify bootstrap-server communication to register and retrieve DP ranks, and update the KV sender/receiver logic to use this new synchronization. The documentation has been updated to reflect the new environment variable. Overall, the changes are well structured and include appropriate error handling and logging.
Good to see we finally have this!
Result of
I have updated results with
ShangmingCai left a comment
Others LGTM, just some minor suggestions. We should be able to wrap it up after another round of review.
/tag-and-rerun-ci
Hi guys, if I understand it right, this PR changes how the decode instance gets the dp-rank for a request, but one question remains: how can an external router specify different dp-ranks for P/D instances? I think #16059 is meant to solve that. What's your opinion on this problem?
Signed-off-by: Chang Huaixin (OpenAnolis) <changhuaixin@linux.alibaba.com>
Motivation
As described in Feature #10174 and #13052, the bootstrap server now stores the prefill dp rank, allowing prefill servers to use load-balancing methods.
Modifications
The bootstrap process is modified as follows:

The case where the DPC assigns the prefill dp rank is now handled, enabling us to support load-balancing methods other than follow-bootstrap-room.
The prefill server now always registers its dp rank with the bootstrap server. The decode server checks whether data_parallel_rank is assigned in GenerateReqInput.
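The register/lookup flow described above can be sketched as follows. This is a minimal in-memory stand-in for the bootstrap server; the names `register_dp_rank` and `get_dp_rank` are illustrative, not SGLang's actual API, and the real registry is reached over HTTP rather than in-process:

```python
# Sketch of the bootstrap-server DP-rank registry (illustrative only;
# the real SGLang bootstrap server exposes this over HTTP).

class BootstrapRegistry:
    """Central store mapping a bootstrap room to the prefill dp rank."""

    def __init__(self):
        self._dp_ranks = {}

    def register_dp_rank(self, bootstrap_room, dp_rank):
        # Prefill side: always publish the dp rank this request was assigned.
        self._dp_ranks[bootstrap_room] = dp_rank

    def get_dp_rank(self, bootstrap_room):
        # Decode side: look up the prefill dp rank. None means no
        # rank has been registered for this room yet.
        return self._dp_ranks.get(bootstrap_room)


registry = BootstrapRegistry()
registry.register_dp_rank(bootstrap_room=42, dp_rank=3)
print(registry.get_dp_rank(42))  # -> 3
print(registry.get_dp_rank(99))  # -> None (not registered)
```

With this registry in the middle, the prefill scheduler is free to pick any dp rank per request (round-robin, least-loaded, etc.) instead of being forced to follow the bootstrap room.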
Accuracy Tests
None
Evaluating communication overhead
On average it costs less than 2 ms to put and get bootstrap info from the bootstrap server.
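A figure like this can be sanity-checked with a small timing harness. The sketch below runs against an in-process dict so it is standalone; the helper names are hypothetical, and a real measurement would issue put/get calls against the bootstrap server's HTTP endpoint instead:

```python
import time

# Hypothetical stand-in for the bootstrap server's key-value store.
store = {}

def put_bootstrap_info(room, info):
    store[room] = info

def get_bootstrap_info(room):
    return store.get(room)

N = 1000
start = time.perf_counter()
for i in range(N):
    put_bootstrap_info(i, {"dp_rank": i % 8})
    get_bootstrap_info(i)
elapsed_ms = (time.perf_counter() - start) * 1000 / N

print(f"avg put+get: {elapsed_ms:.4f} ms")
assert get_bootstrap_info(7) == {"dp_rank": 7}
```

Against a real HTTP bootstrap server the per-pair latency is dominated by the network round trips, which is where the sub-2 ms observation comes from.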
Benchmarking and Profiling
I used Qwen3-235B to verify its effectiveness. Reducing TTFT is our primary goal, and ISL/OSL is set to 3500/50. In this case the maximum throughput of the prefill server is around 3.2 req/s, so request rates of 1/2/2.6/3.2/4/inf are evaluated.
Throughput also increased from 3.09 to 3.13 req/s (about 1.3%). In other cases the gain might be larger.
Without this patch and request rate 3.2
With this patch
bench_serving and commands
Checklist