[PD-Disagg] Support query dp rank from bootstrap server. #19168

Merged
hnyls2002 merged 16 commits into main from lsyin/support-query-dp-rank
Feb 23, 2026

@hnyls2002 (Collaborator) commented Feb 23, 2026

Motivation

Based on PR #14726 by @changhuaixin. This PR allows prefill servers to use load-balancing methods other than follow_bootstrap_room (e.g., round_robin) in PD-disaggregation mode, which significantly reduces TTFT under load (see benchmarks in #14726).

When prefill uses non-deterministic load balancing, the decode server cannot infer which prefill DP rank processed a request. This PR adds per-request dp_rank synchronization through the bootstrap server.
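
To make the problem concrete, here is a minimal sketch of why the decode side can no longer infer the rank. It is illustrative only: `follow_bootstrap_room_rank` and `RoundRobinBalancer` are hypothetical names, not SGLang APIs, and the room IDs are made up.

```python
from itertools import count
from typing import List


def follow_bootstrap_room_rank(bootstrap_room: int, dp_size: int) -> int:
    """Deterministic mapping that decode can recompute without any lookup."""
    return bootstrap_room % dp_size


class RoundRobinBalancer:
    """Hypothetical round-robin balancer: the rank depends on arrival order."""

    def __init__(self, dp_size: int) -> None:
        self.dp_size = dp_size
        self._counter = count()

    def pick(self, bootstrap_room: int) -> int:
        # bootstrap_room is ignored here; only the arrival order matters.
        return next(self._counter) % self.dp_size


dp_size = 4
rooms: List[int] = [10, 7, 7001, 42]  # made-up bootstrap_room values

balancer = RoundRobinBalancer(dp_size)
actual = [balancer.pick(room) for room in rooms]
guessed = [follow_bootstrap_room_rank(room, dp_size) for room in rooms]

print("prefill picked:", actual)   # [0, 1, 2, 3]
print("decode guessed:", guessed)  # [2, 3, 1, 2]
```

Because the two lists disagree, decode needs some other channel to learn the rank for each request, which is what the bootstrap-server lookup provides.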

Modifications

  • Bootstrap server: New POST /register_dp_rank and POST /query_dp_ranks endpoints with TTL-based cleanup. Parallel info now includes follow_bootstrap_room flag so decode knows the prefill's LB strategy.
  • Prefill sender: Registers dp_rank per request when not using follow_bootstrap_room.
  • Decode side: Resolves dp_rank before creating the KV receiver. Requests whose dp_rank is not yet available are held in a pending_reqs queue and batch-resolved each scheduler loop.
  • Connection-level init: Extracted ensure_parallel_info() on CommonKVManager to decouple parallel info fetching from receiver creation, shared by both DecodePreallocQueue and CommonKVReceiver.
  • Naming cleanup: Removed redundant target_dp_group alias; unified to prefill_dp_rank.
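
The register/query/cleanup behaviour described above can be pictured with a small in-memory table. This is a sketch under stated assumptions, not the code in this PR: `DpRankRegistry`, its method names, and the 120-second TTL are hypothetical, and the real endpoints speak HTTP rather than being called directly.

```python
import time
from typing import Dict, Iterable, Optional, Tuple


class DpRankRegistry:
    """Hypothetical table: bootstrap_room -> (dp_rank, registration time)."""

    def __init__(self, ttl_seconds: float) -> None:
        self.ttl_seconds = ttl_seconds
        self._entries: Dict[int, Tuple[int, float]] = {}

    def register(self, bootstrap_room: int, dp_rank: int,
                 now: Optional[float] = None) -> None:
        # Roughly what a register_dp_rank handler would store per request.
        now = time.monotonic() if now is None else now
        self._entries[bootstrap_room] = (dp_rank, now)

    def query(self, bootstrap_rooms: Iterable[int]) -> Dict[int, int]:
        # Batch lookup; rooms that are not registered yet are omitted.
        return {
            room: self._entries[room][0]
            for room in bootstrap_rooms
            if room in self._entries
        }

    def cleanup(self, now: Optional[float] = None) -> int:
        # Drop entries older than the TTL and report how many were removed.
        now = time.monotonic() if now is None else now
        expired = [
            room for room, (_, registered_at) in self._entries.items()
            if now - registered_at > self.ttl_seconds
        ]
        for room in expired:
            del self._entries[room]
        return len(expired)


registry = DpRankRegistry(ttl_seconds=120.0)  # TTL value chosen for illustration
registry.register(1001, dp_rank=3, now=0.0)
registry.register(1002, dp_rank=1, now=100.0)

found = registry.query([1001, 1002, 1003])  # 1003 was never registered
removed = registry.cleanup(now=130.0)       # only room 1001 is older than 120 s
remaining = registry.query([1001, 1002])
```

Passing `now` explicitly keeps the example deterministic; a server would more likely run the cleanup from a periodic background task.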

Design difference from #14726

dp_rank is resolved before receiver creation, not inside init(). This means:

  • No half-initialized receiver state or should_notify_dp_rank flag
  • init() signature unchanged — zero changes to mooncake/nixl/fake backends
  • pop_preallocated loop has no dp_rank awareness
  • When using follow_bootstrap_room, dp_rank is computed locally (bootstrap_room % dp_size) with no HTTP overhead — the original PR always queries the bootstrap server
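
A rough sketch of the resolve-before-create flow, assuming a simplified decode queue. `PendingResolver`, `query_fn`, and the `ready` list are hypothetical stand-ins; in this sketch a request in non-follow mode always waits for the next batch resolve, which may differ from the actual scheduler.

```python
from typing import Callable, Dict, List, Tuple

QueryFn = Callable[[List[int]], Dict[int, int]]


class PendingResolver:
    """Hypothetical decode-side helper: resolve dp_rank first, build receivers later."""

    def __init__(self, dp_size: int, follow_bootstrap_room: bool,
                 query_fn: QueryFn) -> None:
        self.dp_size = dp_size
        self.follow_bootstrap_room = follow_bootstrap_room
        self.query_fn = query_fn  # stands in for a batched bootstrap-server query
        self.pending_reqs: List[int] = []
        self.ready: List[Tuple[int, int]] = []  # (bootstrap_room, dp_rank)

    def add(self, bootstrap_room: int) -> None:
        if self.follow_bootstrap_room:
            # Computed locally, so no lookup is needed.
            self.ready.append((bootstrap_room, bootstrap_room % self.dp_size))
        else:
            self.pending_reqs.append(bootstrap_room)

    def resolve_pending(self) -> None:
        # Intended to run once per scheduler loop: one batched lookup for all rooms.
        if not self.pending_reqs:
            return
        known = self.query_fn(self.pending_reqs)
        still_pending: List[int] = []
        for room in self.pending_reqs:
            if room in known:
                self.ready.append((room, known[room]))
            else:
                still_pending.append(room)
        self.pending_reqs = still_pending


def never_called(rooms: List[int]) -> Dict[int, int]:
    raise AssertionError("no lookup expected in follow_bootstrap_room mode")


local = PendingResolver(dp_size=4, follow_bootstrap_room=True, query_fn=never_called)
local.add(10)

table: Dict[int, int] = {}  # fake server-side state
remote = PendingResolver(
    dp_size=4,
    follow_bootstrap_room=False,
    query_fn=lambda rooms: {r: table[r] for r in rooms if r in table},
)
remote.add(11)
remote.add(12)
table[12] = 3  # prefill registers room 12 first
remote.resolve_pending()
after_first_loop = (list(remote.ready), list(remote.pending_reqs))
table[11] = 0
remote.resolve_pending()
```

Only entries in `ready` would go on to receiver creation, so a receiver never exists without a known dp_rank.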

@github-actions github-actions bot added the documentation Improvements or additions to documentation label Feb 23, 2026
@gemini-code-assist (Contributor) commented

Summary of Changes

Hello @hnyls2002, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the PD-Disaggregation feature by introducing dynamic data parallel (DP) rank resolution capabilities. It allows decode instances to actively communicate with the bootstrap server to register and query DP ranks, moving beyond static assignments. The bootstrap server itself has been upgraded with new API endpoints and a crucial cleanup mechanism to maintain the integrity of registered entries. These changes aim to improve the robustness and flexibility of distributed KV cache management, particularly in scenarios involving complex load balancing strategies.

Highlights

  • Dynamic DP Rank Resolution: Implemented a mechanism for decode instances to query and register their data parallel (DP) ranks with the bootstrap server, allowing for more flexible load balancing beyond simple bootstrap_room modulo.
  • Bootstrap Server Enhancements: The bootstrap server now supports new API endpoints (/register_dp_rank, /query_dp_ranks) to facilitate DP rank management and includes a periodic cleanup task for expired entries.
  • Configuration and Documentation: Introduced a new environment variable (SGLANG_DISAGGREGATION_BOOTSTRAP_ENTRY_CLEANUP_INTERVAL) to control the cleanup frequency, and updated documentation.
  • Code Refactoring: Streamlined the KVSender and KVReceiver classes by making register_to_bootstrap a public abstract method and adjusting internal logic for parallel information retrieval.

Changelog
  • docs/advanced_features/pd_disaggregation.md
    • Added documentation for the new SGLANG_DISAGGREGATION_BOOTSTRAP_ENTRY_CLEANUP_INTERVAL environment variable.
  • python/sglang/srt/disaggregation/base/conn.py
    • Added an abstract method register_to_bootstrap to BaseKVSender.
  • python/sglang/srt/disaggregation/common/conn.py
    • Imported time and envs modules.
    • Added server_args to KVSender initialization.
    • Renamed _register_to_bootstrap to register_to_bootstrap and made it a public method.
    • Included load_balance_method in the payload when registering to the bootstrap server.
    • Added follow_bootstrap_room_table to KVSender to store load balancing information.
    • Introduced _register_prefill_dp_rank method in KVReceiver to register the prefill DP rank with the bootstrap server.
    • Modified _get_prefill_parallel_info_from_server to return the follow_bootstrap_room status.
    • Refactored prefill_dp_rank assignment logic and introduced _setup_bootstrap_infos in KVReceiver.
    • Updated the bootstrap_key generation to use prefill_dp_rank instead of target_dp_group.
    • Modified calls to _get_bootstrap_info_from_server to use prefill_dp_rank.
    • Updated CommonKVBootstrapServer constructor to accept dp_size.
    • Added follow_bootstrap_room, room_to_dp_rank, and entry_cleanup_interval attributes to CommonKVBootstrapServer.
    • Added new routes /register_dp_rank and /query_dp_ranks to the bootstrap server.
    • Implemented _handle_register_dp_rank and _handle_query_dp_ranks handlers for the new routes.
    • Included follow_bootstrap_room in the response from the /route endpoint.
    • Implemented _cleanup_expired_entries as an asynchronous task for periodic cleanup of room_to_dp_rank entries.
    • Added a static method query_prefill_dp_ranks to CommonKVReceiver for batch querying DP ranks.
  • python/sglang/srt/disaggregation/decode.py
    • Added a pending_reqs list to DecodeScheduler to hold requests awaiting DP rank resolution.
    • Introduced _resolve_dp_rank to determine the DP rank for a request.
    • Created _create_receiver_and_enqueue to encapsulate receiver creation and request enqueuing.
    • Modified the add method to use the new DP rank resolution and queuing logic.
    • Added _resolve_pending_reqs to batch-resolve DP ranks for pending requests.
    • Called _resolve_pending_reqs in pop_preallocated to process pending requests.
  • python/sglang/srt/environ.py
    • Added SGLANG_DISAGGREGATION_BOOTSTRAP_ENTRY_CLEANUP_INTERVAL environment variable with a default value of 120 seconds.
  • python/sglang/srt/managers/disagg_service.py
    • Passed the dp_size argument to the kv_bootstrap_server_class constructor.
  • python/sglang/srt/server_args.py
    • Removed the backward compatibility warning for the round_robin load balancing method in PD prefill mode.
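
For orientation, reading the new cleanup-interval variable with its stated default of 120 seconds might look like the sketch below. Only the variable name and the default come from the changelog above; `get_cleanup_interval`, the integer parsing, and the rejection of non-positive values are assumptions, and SGLang's own environ.py uses its own mechanism.

```python
import os
from typing import Mapping

ENV_NAME = "SGLANG_DISAGGREGATION_BOOTSTRAP_ENTRY_CLEANUP_INTERVAL"
DEFAULT_CLEANUP_INTERVAL_SECONDS = 120  # default stated in the changelog


def get_cleanup_interval(environ: Mapping[str, str] = os.environ) -> int:
    """Hypothetical helper: return the cleanup interval in seconds."""
    raw = environ.get(ENV_NAME)
    if raw is None or raw.strip() == "":
        return DEFAULT_CLEANUP_INTERVAL_SECONDS
    value = int(raw)  # assumption: the value is a whole number of seconds
    if value <= 0:
        raise ValueError(f"{ENV_NAME} must be positive, got {value}")
    return value
```

For example, `get_cleanup_interval({ENV_NAME: "30"})` returns 30, and an empty mapping falls back to the default.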
Activity
  • The pull request is currently marked as 'WIP' (Work In Progress), indicating that the author is still actively developing and refining the changes.

@gemini-code-assist bot (Contributor) left a comment

Code Review

This pull request adds support for querying the data parallelism (DP) rank from the bootstrap server, which is a key feature for enabling more flexible load balancing strategies like round_robin in a prefill/decode disaggregated setup. The changes include adding new API endpoints to the bootstrap server for registering and querying DP ranks, along with a cleanup mechanism for stale entries. The decode scheduler is updated to handle requests that need to query the DP rank by queueing them and processing them in batches. The changes are well-structured, but there are a couple of issues regarding a new dp_size parameter that is introduced but not correctly handled in the class hierarchy and is unused in the constructor.

@hnyls2002 (Collaborator, Author) commented

/tag-and-rerun-ci

@ShangmingCai (Collaborator) left a comment

CI looks good. The failed AMD tests are unrelated to this change.

@hnyls2002 hnyls2002 merged commit 2274bfe into main Feb 23, 2026
238 of 258 checks passed
@hnyls2002 hnyls2002 deleted the lsyin/support-query-dp-rank branch February 23, 2026 18:59
zhuxinjie-nz pushed a commit to zhuxinjie-nz/sglang that referenced this pull request Feb 24, 2026
…#19168)

Signed-off-by: Chang Huaixin (OpenAnolis) <changhuaixin@linux.alibaba.com>
Co-authored-by: Chang Huaixin (OpenAnolis) <changhuaixin@linux.alibaba.com>
xiaobaicxy added a commit to xiaobaicxy/sglang that referenced this pull request Feb 24, 2026
…o xverse_moe

* 'xverse_moe' of https://github.com/xiaobaicxy/sglang: (275 commits)
  fix: add missing blank line after docstring in serving_transcription.py (sgl-project#19206)
  Whisper model support & `/v1/audio/transcriptions` endpoint & benchmark (sgl-project#16983)
  fix: patch docker image fixes (sgl-project#19100)
  [PD-Disagg] Unify prefill info data transition flow, all with `PrefillServerInfo` (sgl-project#19195)
  [CI] Tiny enhance the dp attention load blance benchmark (sgl-project#19194)
  add new ci user (sgl-project#19133)
  [CI] fix the teardown output of disaggregation test (sgl-project#19193)
  [PD-Disagg] Support query dp rank from bootstrap server. (sgl-project#19168)
  [Kernel Slimming] Migrate AWQ marlin repack kernel to JIT (sgl-project#18949)
  [Diffusion] Match rotary_embedding module name style (sgl-project#19179)
  [Refactor] Split rotary_embedding.py into a modular package (sgl-project#19144)
  [NPU] bump sgl-kernel-npu to 2026.02.01.post2 (sgl-project#19178)
  Use single mma warp group for short q_len in FA to optimize decoding performance (sgl-project#18985)
  Reorganize topk logic to clean up code and expose logical experts (sgl-project#16945)
  [ROCm] Use unreg path for custom all-reduce during CUDA graph capture (sgl-project#19162)
  [diffusion] feat: detect Flux2 custom VAE path from component_paths (sgl-project#19170)
  [AMD] ENV flags tuning and cleanup (sgl-project#19176)
  Fix bench_one_batch_server by moving the print statements (sgl-project#19175)
  Update rocm7.2 Dockerfile to install amdsmi for QuickReduce Initialization (sgl-project#19091)
  Revert "Refactor graph input buffers (sgl-project#18991)" (sgl-project#19173)
  ...
ishandhanani added a commit to ai-dynamo/dynamo that referenced this pull request Mar 1, 2026
The prefill handler was missing the data_parallel_rank parameter in its
async_generate call, preventing DP rank-aware routing from working in
disaggregated mode. The decode handler already passes this correctly.

Extract dp_rank from the routing info (set by the KV router in
prefill_router.rs) and forward it to SGLang's engine so the prefill
scheduler directs work to the correct DP rank.

This works in conjunction with sgl-project/sglang#19168, which adds
per-request DP rank resolution on the SGLang side -- the decode worker
can now resolve the prefill DP rank via the bootstrap server rather
than relying on bootstrap_room % dp_size.
magicYang1573 pushed a commit to magicYang1573/sglang that referenced this pull request Mar 9, 2026
…#19168)

Signed-off-by: Chang Huaixin (OpenAnolis) <changhuaixin@linux.alibaba.com>
Co-authored-by: Chang Huaixin (OpenAnolis) <changhuaixin@linux.alibaba.com>
Wangzheee pushed a commit to Wangzheee/sglang that referenced this pull request Mar 21, 2026
…#19168)

Signed-off-by: Chang Huaixin (OpenAnolis) <changhuaixin@linux.alibaba.com>
Co-authored-by: Chang Huaixin (OpenAnolis) <changhuaixin@linux.alibaba.com>
JustinTong0323 pushed a commit to JustinTong0323/sglang that referenced this pull request Apr 7, 2026
…#19168)

Signed-off-by: Chang Huaixin (OpenAnolis) <changhuaixin@linux.alibaba.com>
Co-authored-by: Chang Huaixin (OpenAnolis) <changhuaixin@linux.alibaba.com>
Labels

documentation (Improvements or additions to documentation), high priority, run-ci

3 participants