[PD-Disagg] Support query dp rank from bootstrap server. #19168
Signed-off-by: Chang Huaixin (OpenAnolis) <changhuaixin@linux.alibaba.com>
Summary of Changes (Gemini Code Assist): This pull request enhances the PD-Disaggregation feature by introducing dynamic data parallel (DP) rank resolution. It allows decode instances to communicate with the bootstrap server to register and query DP ranks, moving beyond static assignments. The bootstrap server itself gains new API endpoints and a cleanup mechanism to maintain the integrity of registered entries. These changes improve the robustness and flexibility of distributed KV cache management, particularly in scenarios involving complex load-balancing strategies.
Code Review
This pull request adds support for querying the data parallelism (DP) rank from the bootstrap server, which is a key feature for enabling more flexible load balancing strategies like round_robin in a prefill/decode disaggregated setup. The changes include adding new API endpoints to the bootstrap server for registering and querying DP ranks, along with a cleanup mechanism for stale entries. The decode scheduler is updated to handle requests that need to query the DP rank by queueing them and processing them in batches. The changes are well-structured, but there are a couple of issues regarding a new dp_size parameter that is introduced but not correctly handled in the class hierarchy and is unused in the constructor.
/tag-and-rerun-ci
…#19168) Signed-off-by: Chang Huaixin (OpenAnolis) <changhuaixin@linux.alibaba.com> Co-authored-by: Chang Huaixin (OpenAnolis) <changhuaixin@linux.alibaba.com>
The prefill handler was missing the data_parallel_rank parameter in its async_generate call, preventing DP rank-aware routing from working in disaggregated mode. The decode handler already passes this correctly. Extract dp_rank from the routing info (set by the KV router in prefill_router.rs) and forward it to SGLang's engine so the prefill scheduler directs work to the correct DP rank. This works in conjunction with sgl-project/sglang#19168, which adds per-request DP rank resolution on the SGLang side -- the decode worker can now resolve the prefill DP rank via the bootstrap server rather than relying on bootstrap_room % dp_size.
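The fix described above can be sketched as follows. Only `async_generate` and its `data_parallel_rank` parameter are named in the text; the handler name, request shape, and `routing_info` dict are illustrative assumptions.

```python
# Hypothetical prefill handler: forward the DP rank chosen by the
# KV router into the engine call so the prefill scheduler routes
# the request to the correct DP rank.
async def handle_prefill(engine, request, routing_info):
    dp_rank = routing_info.get("dp_rank")  # set upstream by the KV router
    async for out in engine.async_generate(
        prompt=request["prompt"],
        sampling_params=request.get("sampling_params", {}),
        # Previously omitted here, which broke DP rank-aware routing
        # in disaggregated mode; the decode handler already passed it.
        data_parallel_rank=dp_rank,
    ):
        yield out
```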

Motivation
Based on PR #14726 by @changhuaixin. Allows prefill servers to use load-balance methods other than `follow_bootstrap_room` (e.g., `round_robin`) in PD-disaggregation mode, which significantly reduces TTFT under load (see benchmarks in #14726). When prefill uses non-deterministic load balancing, the decode server cannot infer which prefill DP rank processed a request. This PR adds per-request dp_rank synchronization through the bootstrap server.
Modifications
- Bootstrap server: added `POST /register_dp_rank` and `POST /query_dp_ranks` endpoints with TTL-based cleanup. Parallel info now includes a `follow_bootstrap_room` flag so decode knows the prefill's LB strategy.
- Decode scheduler: requests whose dp_rank cannot be derived via `follow_bootstrap_room` are held in a `pending_reqs` queue and batch-resolved each scheduler loop.
- Added `ensure_parallel_info()` on `CommonKVManager` to decouple parallel-info fetching from receiver creation, shared by both `DecodePreallocQueue` and `CommonKVReceiver`.
- Dropped the `target_dp_group` alias; unified to `prefill_dp_rank`.

Design difference from #14726

dp_rank is resolved before receiver creation, not inside `init()`. This means:

- No `should_notify_dp_rank` flag.
- `init()` signature unchanged — zero changes to mooncake/nixl/fake backends.
- The `pop_preallocated` loop has no dp_rank awareness.
- When prefill uses `follow_bootstrap_room`, dp_rank is computed locally (`bootstrap_room % dp_size`) with no HTTP overhead — the original PR always queries the bootstrap server.
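The local-versus-remote resolution above can be sketched in a few lines. The modulo expression comes from the text; the function signature and the injected query callable are illustrative assumptions.

```python
def resolve_prefill_dp_rank(
    bootstrap_room: int,
    dp_size: int,
    follow_bootstrap_room: bool,
    query_bootstrap_server,  # callable: (bootstrap_room) -> int
) -> int:
    """Sketch of decode-side dp_rank resolution.

    With follow_bootstrap_room, placement is deterministic and the
    rank is a pure local computation with no network round trip.
    Otherwise (e.g. round_robin), only the bootstrap server knows
    which prefill DP rank took the request, so we fall back to a
    query (represented here by an injected callable).
    """
    if follow_bootstrap_room:
        return bootstrap_room % dp_size
    return query_bootstrap_server(bootstrap_room)
```

Keeping the deterministic path local is what avoids the per-request HTTP overhead that the original #14726 design always paid.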