Fix PD bootstrap failure handling #24772
Conversation
Co-Authored-By: Cheng Wan <chwan@rice.edu>
Force-pushed from 43a6b02 to 827389c
/tag-and-rerun-ci
Ported from 48135b2 Co-Authored-By: Cheng Wan <chwan@rice.edu>
```python
# In PD-prefill mode the cross-engine contract is `bootstrap_room`:
# the decode-side KV receiver locates the prefill DP rank via
# `bootstrap_room % prefill_dp_size`. Honoring an externally-set
# `routed_dp_rank` here breaks that contract whenever the two
# diverge (e.g., dynamo's KV router picks a rank for load-balance
# reasons that has no relation to `bootstrap_room`). Fall through
# to follow_bootstrap_room dispatch to keep prefill ↔ decode aligned.
if (
    self.server_args.disaggregation_mode == "prefill"
    and req.bootstrap_room is not None
):
    return False
```
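For context, the comment above describes the rank-lookup contract the decode side relies on. A minimal illustrative sketch (the function name and parameters here are assumptions for demonstration, not the actual SGLang API):

```python
# Illustrative sketch of the follow_bootstrap_room contract: both the
# prefill and decode engines must derive the SAME DP rank from the shared
# bootstrap_room, so the mapping is a plain modulo over the prefill DP
# world size. Names are hypothetical stand-ins.
def locate_prefill_dp_rank(bootstrap_room: int, prefill_dp_size: int) -> int:
    # Any externally chosen routed_dp_rank that differs from this value
    # would break prefill/decode alignment.
    return bootstrap_room % prefill_dp_size
```

Because the mapping is deterministic on both sides, no extra coordination message is needed to agree on the rank.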
Not sure about this — if `routed_dp_rank` is not None, shouldn't we respect it?
If it is only assigned when the strategy is `follow_bootstrap_room`, then maybe we should add that to the condition as well, something like:
```diff
 # In PD-prefill mode the cross-engine contract is `bootstrap_room`:
 # the decode-side KV receiver locates the prefill DP rank via
 # `bootstrap_room % prefill_dp_size`. Honoring an externally-set
 # `routed_dp_rank` here breaks that contract whenever the two
 # diverge (e.g., dynamo's KV router picks a rank for load-balance
 # reasons that has no relation to `bootstrap_room`). Fall through
 # to follow_bootstrap_room dispatch to keep prefill ↔ decode aligned.
 if (
     self.server_args.disaggregation_mode == "prefill"
     and req.bootstrap_room is not None
+    and self.load_balance_method == "follow_bootstrap_room"
 ):
     return False
```
I am not sure about this; we should ping liangsheng on it.
Thanks for the review! You're right — this is already handled better on main via #23882, so I've reverted it.
ShangmingCai left a comment
The common backend modification looks good.
This reverts commit 702e0c5.
/rerun-stage stage-c-test-8-gpu-h20

✅ Triggered

/rerun-test test/registered/disaggregation/test_disaggregation_basic.py

✅

No need to run full CI; we only need these two, which should be enough.
```
* main: (87 commits)
  [Fix] Disable FlashInfer allreduce fusion under deterministic inference (sgl-project#24629)
  fix: STANDALONE spec-decode hidden-size mismatch crash (sgl-project#24217)
  Followup fix for Custom AR V2 in non NVL scenarios (sgl-project#24742)
  Fix reduce_scatterv producer contract for SUM_LEN (sgl-project#24785)
  [NPU] Documentation update for communications quantization feature (sgl-project#24668)
  [Session R3] Add routed_experts_start_len for absolute routing slice control (sgl-project#24851)
  [Model] Add MiniCPM-V 4.6 support (sgl-project#24855)
  Support Intern-S2-Preview (sgl-project#24875)
  [PD] Unify dsv4 dispatch with swa (sgl-project#24888)
  Optimize MHC pipeline: DeepGemm, fused norm, fused hc_head (sgl-project#24775)
  Fix PD bootstrap failure handling (sgl-project#24772)
  [Spec] Cleanup idle stub and shape-check patterns (sgl-project#24881)
  [Bug] Add dsv4 state_type branch to mooncake disaggregation (sgl-project#24878)
  [Spec V1] Split draft-extend phase from `EagleDraftInput` into new `EagleDraftExtendInput` (sgl-project#24859)
  [Gemma4] Optimize Gemm4 with fused Q/K/V RMSNorm + per-expert FP8 ckpt loader (sgl-project#24696)
  [spec decoding] support kimi-k2.5-eagle3-mla (sgl-project#24826)
  [SPEC V2] fix: skip stale state updates in spec-v2 overlap (sgl-project#23456)
  [RL] Call torch.cuda.empty_cache() for `in-place` pause mode to avoid OOM (sgl-project#24854)
  [diffusion] CI: add cache-dit CI tests (sgl-project#19213)
  [Utils] Make request dump robust to unpicklable server_args and large meta_info (sgl-project#24767)
  ...

# Conflicts:
#	python/sglang/srt/utils/common.py
```
Summary
- Set `self.bootstrap_infos = None` on bootstrap info fetch failure so downstream code hits the None-check instead of `AttributeError`
- Skip `update_status(WaitingForInput)` when `_setup_bootstrap_infos` already marked the request as `Failed`

Ported from 48135b2
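The guard pattern described in the summary can be sketched as follows (a simplified illustration with hypothetical class and method names, not the actual SGLang code):

```python
# Sketch of the failure-handling pattern: on a failed bootstrap-info fetch,
# store None so downstream callers hit an explicit None-check rather than an
# AttributeError, and skip the WaitingForInput transition once the request
# is already marked Failed. All names here are illustrative stand-ins.
class KVSender:
    def __init__(self):
        self.bootstrap_infos = None
        self.failed = False

    def setup_bootstrap_infos(self, fetch):
        try:
            self.bootstrap_infos = fetch()
        except Exception:
            # Leave a well-defined None instead of an unset attribute.
            self.bootstrap_infos = None
            self.failed = True

    def dispatch(self):
        if self.bootstrap_infos is None:
            # Request was already marked Failed; do not move it back
            # to WaitingForInput.
            return "failed"
        return "waiting_for_input"
```

The key design choice is that failure leaves the object in a fully initialized state, so every downstream code path can branch on `bootstrap_infos is None` instead of wrapping attribute access in try/except.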
Test plan