Fix PD bootstrap failure handling #24772

Merged

yhyang201 merged 3 commits into main from fix/pd-bootstrap-failure-handling on May 10, 2026

Conversation

@yhyang201 (Collaborator) commented May 9, 2026

Summary

  • Set self.bootstrap_infos = None on bootstrap info fetch failure so downstream code hits the None-check instead of AttributeError
  • Skip update_status(WaitingForInput) when _setup_bootstrap_infos already marked the request as Failed

Ported from 48135b2
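
A minimal sketch of the resulting control flow follows. The PrefillHandler/Req/Status scaffolding and the fetch stub are hypothetical stand-ins; only bootstrap_infos, _setup_bootstrap_infos, and update_status are named in this PR.

# Hedged sketch, not the actual sglang code.
from enum import Enum, auto


class Status(Enum):
    WAITING_FOR_INPUT = auto()
    FAILED = auto()


class Req:
    def __init__(self):
        self.status = None

    def update_status(self, status):
        self.status = status


class PrefillHandler:
    def _fetch_bootstrap_infos(self, req):
        # Stand-in for the real fetch; fails when the prefill server is down.
        raise ConnectionError("prefill server unreachable")

    def _setup_bootstrap_infos(self, req):
        try:
            self.bootstrap_infos = self._fetch_bootstrap_infos(req)
        except Exception:
            # Fix 1: leave the attribute defined so downstream code hits
            # the None-check instead of raising AttributeError.
            self.bootstrap_infos = None
            req.update_status(Status.FAILED)

    def handle(self, req):
        self._setup_bootstrap_infos(req)
        if self.bootstrap_infos is None:
            # Fix 2: _setup_bootstrap_infos already marked the request
            # Failed; do not overwrite that with WaitingForInput.
            return
        req.update_status(Status.WAITING_FOR_INPUT)


req = Req()
PrefillHandler().handle(req)
assert req.status is Status.FAILED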

Test plan

  • PD disaggregation with bootstrap failure (e.g. prefill server down)

@gemini-code-assist (Contributor) commented

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

@yhyang201 force-pushed the fix/pd-bootstrap-failure-handling branch from 43a6b02 to 827389c on May 9, 2026 05:43
@yhyang201 (Collaborator, Author) commented:

/tag-and-rerun-ci

@github-actions bot added the run-ci label on May 9, 2026
Ported from 48135b2

Co-Authored-By: Cheng Wan <chwan@rice.edu>
Comment on lines +565 to +576
# In PD-prefill mode the cross-engine contract is `bootstrap_room`:
# the decode-side KV receiver locates the prefill DP rank via
# `bootstrap_room % prefill_dp_size`. Honoring an externally-set
# `routed_dp_rank` here breaks that contract whenever the two
# diverge (e.g., dynamo's KV router picks a rank for load-balance
# reasons that has no relation to `bootstrap_room`). Fall through
# to follow_bootstrap_room dispatch to keep prefill ↔ decode aligned.
if (
    self.server_args.disaggregation_mode == "prefill"
    and req.bootstrap_room is not None
):
    return False
A collaborator commented:

Not sure about this. If routed_dp_rank is not None, shouldn't we respect it?

If routed_dp_rank is only assigned when the strategy is follow_bootstrap_room, then maybe we should add that to the condition as well, something like:

Suggested change
# In PD-prefill mode the cross-engine contract is `bootstrap_room`:
# the decode-side KV receiver locates the prefill DP rank via
# `bootstrap_room % prefill_dp_size`. Honoring an externally-set
# `routed_dp_rank` here breaks that contract whenever the two
# diverge (e.g., dynamo's KV router picks a rank for load-balance
# reasons that has no relation to `bootstrap_room`). Fall through
# to follow_bootstrap_room dispatch to keep prefill ↔ decode aligned.
if (
    self.server_args.disaggregation_mode == "prefill"
    and req.bootstrap_room is not None
):
    return False
# In PD-prefill mode the cross-engine contract is `bootstrap_room`:
# the decode-side KV receiver locates the prefill DP rank via
# `bootstrap_room % prefill_dp_size`. Honoring an externally-set
# `routed_dp_rank` here breaks that contract whenever the two
# diverge (e.g., dynamo's KV router picks a rank for load-balance
# reasons that has no relation to `bootstrap_room`). Fall through
# to follow_bootstrap_room dispatch to keep prefill ↔ decode aligned.
if (
    self.server_args.disaggregation_mode == "prefill"
    and req.bootstrap_room is not None
    and self.load_balance_method == "follow_bootstrap_room"
):
    return False

I am not sure about this; we should ping liangsheng on it.
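
For reference, a minimal sketch of the bootstrap_room % prefill_dp_size pairing the comment describes (the helper name is hypothetical):

def select_prefill_dp_rank(bootstrap_room: int, prefill_dp_size: int) -> int:
    # The decode-side KV receiver computes this mapping to locate the
    # prefill DP rank, so prefill-side dispatch must honor the same
    # mapping or the prefill/decode pairing breaks.
    return bootstrap_room % prefill_dp_size

# Example: with 4 prefill DP ranks, bootstrap_room 10 pairs with rank 2
# on both sides.
assert select_prefill_dp_rank(10, 4) == 2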

@yhyang201 (Collaborator, Author) replied:

Thanks for the review! You're right — this is already handled better on main via #23882, so I've reverted it.

@ShangmingCai (Collaborator) left a comment:

The common backend modification looks good.

@ShangmingCai (Collaborator) left a comment:

LGTM

@ShangmingCai (Collaborator) commented:

/rerun-stage stage-c-test-8-gpu-h20

github-actions bot commented May 9, 2026

✅ Triggered stage-c-test-8-gpu-h20 to run independently (skipping dependencies). View workflow run

@ShangmingCai (Collaborator) commented:

/rerun-test test/registered/disaggregation/test_disaggregation_basic.py

github-actions bot commented May 9, 2026

2-gpu-h100 (1 test): View workflow run

cd test/ && python3 registered/disaggregation/test_disaggregation_basic.py

@ShangmingCai (Collaborator) commented:

No need to run the full CI; these two tests should be enough.

@yhyang201 merged commit bd0aa22 into main on May 10, 2026
154 of 160 checks passed
@yhyang201 deleted the fix/pd-bootstrap-failure-handling branch on May 10, 2026 11:02
ltcs11 added a commit to ltcs11/sglang that referenced this pull request May 11, 2026
* main: (87 commits)
  [Fix] Disable FlashInfer allreduce fusion under deterministic inference (sgl-project#24629)
  fix: STANDALONE spec-decode hidden-size mismatch crash (sgl-project#24217)
  Followup fix for Custom AR V2 in non NVL scenarios (sgl-project#24742)
  Fix reduce_scatterv producer contract for SUM_LEN (sgl-project#24785)
  [NPU]Documentation update for communications quantization feature (sgl-project#24668)
  [Session R3] Add routed_experts_start_len for absolute routing slice control (sgl-project#24851)
  [Model] Add MiniCPM-V 4.6 support (sgl-project#24855)
  Support Intern-S2-Preview (sgl-project#24875)
  [PD] Unify dsv4 dispatch with swa (sgl-project#24888)
  Optimize MHC pipeline: DeepGemm, fused norm, fused hc_head (sgl-project#24775)
  Fix PD bootstrap failure handling (sgl-project#24772)
  [Spec] Cleanup idle stub and shape-check patterns (sgl-project#24881)
  [Bug] Add dsv4 state_type branch to mooncake disaggregation (sgl-project#24878)
  [Spec V1] Split draft-extend phase from `EagleDraftInput` into new `EagleDraftExtendInput` (sgl-project#24859)
  [Gemma4] Optimize Gemm4 with fused Q/K/V RMSNorm + per-expert FP8 ckpt loader (sgl-project#24696)
  [spec decoding] support kimi-k2.5-eagle3-mla (sgl-project#24826)
  [SPEC V2] fix: skip stale state updates in spec-v2 overlap (sgl-project#23456)
  [RL] Call torch.cuda.empty_cache() for `in-place` pause mode to avoid OOM (sgl-project#24854)
  [diffusion] CI: add cache-dit CI tests (sgl-project#19213)
  [Utils] Make request dump robust to unpicklable server_args and large meta_info (sgl-project#24767)
  ...

# Conflicts:
#	python/sglang/srt/utils/common.py
