[PD-Disagg] Fix bootstrap server race condition when prefill workers not yet registered#19288
Merged
…not yet registered

The bootstrap server starts accepting HTTP requests before any prefill worker has registered via PUT. If a decode worker queries server info during this window, PrefillServerInfo is constructed with None fields, causing a TypeError crash.

Fix: track the registered worker count on the bootstrap server and return 503 until all workers have registered; add a client-side retry in ensure_parallel_info.

Co-authored-by: Cursor <cursoragent@cursor.com>
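The client-side half of the fix can be sketched roughly as follows. This is an illustration only, not the code in `ensure_parallel_info`: the helper name `fetch_server_info_with_retry`, the `fetch` callable returning a `(status, body)` pair, and the injectable `sleep` are all assumptions made for the example. Only the retry budget (20 attempts, 1s apart) and the 503 status come from this PR.

```python
import time
from typing import Callable, Optional, Tuple


def fetch_server_info_with_retry(
    fetch: Callable[[], Tuple[int, Optional[dict]]],  # hypothetical HTTP GET wrapper
    retries: int = 20,
    interval_s: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[dict]:
    """Poll until the bootstrap server stops answering 503, or give up."""
    for attempt in range(retries):
        status, body = fetch()
        if status == 200:
            return body
        if status != 503:
            # Assumption: anything other than "not ready yet" is a hard failure.
            raise RuntimeError(f"unexpected bootstrap status {status}")
        if attempt < retries - 1:
            # Wait before the next attempt; no sleep after the final one.
            sleep(interval_s)
    return None
```

Returning `None` after the budget is exhausted leaves the caller free to decide whether to fail the request or keep waiting.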
Only need at least one prefill worker registered for metadata to be available. Specific worker queries have their own error handling. Co-authored-by: Cursor <cursoragent@cursor.com>
/rerun-stage stage-c-test-8-gpu-h20
/rerun-stage stage-b-test-large-2-gpu
Revert _is_ready() to check full worker count (dp * tp * pp). Also guard specific worker queries with _is_ready() and handle KeyError to prevent server crash on missing entries. Co-authored-by: Cursor <cursoragent@cursor.com>
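A minimal sketch of the server-side gate described in this commit, under stated assumptions: the class `BootstrapRegistry`, its constructor arguments, the `handle_get` method, the placeholder metadata payload, and the 404 returned for an unknown worker are hypothetical and chosen for illustration. The real server learns the parallel sizes from the PUT requests rather than from a constructor. Only `_is_ready()`, `prefill_port_table`, the `dp * tp * pp` count, the `(-1,-1,-1)` query, and the 503 response are taken from this PR.

```python
from typing import Dict, Tuple

Rank = Tuple[int, int, int]  # (dp_rank, tp_rank, pp_rank)


class BootstrapRegistry:
    def __init__(self, dp_size: int, tp_size: int, pp_size: int) -> None:
        # Assumption: sizes are known up front to keep the sketch short.
        self.expected_workers = dp_size * tp_size * pp_size
        self.prefill_port_table: Dict[Rank, dict] = {}

    def register(self, rank: Rank, info: dict) -> None:
        # Stands in for the PUT handler.
        self.prefill_port_table[rank] = info

    def _is_ready(self) -> bool:
        # Ready only once every expected worker has registered.
        return len(self.prefill_port_table) >= self.expected_workers

    def handle_get(self, rank: Rank) -> Tuple[int, dict]:
        # Both query kinds are gated on readiness.
        if not self._is_ready():
            return 503, {"error": "prefill workers not yet registered"}
        if rank == (-1, -1, -1):
            # Placeholder metadata; the real payload differs.
            return 200, {"registered_workers": len(self.prefill_port_table)}
        try:
            return 200, self.prefill_port_table[rank]
        except KeyError:
            # A missing entry yields an error response instead of a crash.
            return 404, {"error": "unknown worker"}
```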
PD-Disaggregation tests passed.
Summary

Bootstrap server
- `_is_ready()` check: track registered worker count, return 503 until all `dp_size * tp_size * pp_size` workers have registered
- Guard both the `(-1,-1,-1)` query and the specific worker query with `_is_ready()`
- Handle `KeyError` on `prefill_port_table` lookup instead of crashing the server
- Remove the `None` check after dict lookup (dead code)

Decode client
- Retry in `ensure_parallel_info` (20 retries, 1s interval) so decode waits for prefill registration instead of failing immediately

Error Log

From CI failure:

Root Cause

`CommonKVBootstrapServer` starts its HTTP server before any prefill worker has registered. Fields like `attn_tp_size` are `None` until a PUT arrives. A decode GET during this window crashes with `TypeError` in `PrefillServerInfo.__post_init__`.

This race always existed but was silently masked: previously `_handle_route_get` returned a plain dict (serializing `None` as JSON `null`), and the decode-side `int(None)` was swallowed by `except Exception`. After #19195 introduced `PrefillServerInfo` with `__post_init__`, the crash moved server-side and became visible.

Repro Patch

Delay `register_to_bootstrap()` to widen the race window:

Test plan
- `test_disaggregation_basic.py` passes
- `test_disaggregation_dp_attention.py` passes
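
To make the root cause above concrete, here is a small standalone sketch of why the failure went from silent to visible. It is not the project's code: `ServerInfoSketch` and `old_style_parse` are hypothetical stand-ins for `PrefillServerInfo` and the earlier decode-side parsing, and the assumption is that `__post_init__` coerces the field with `int()`.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ServerInfoSketch:
    # Stand-in for PrefillServerInfo; only one field is modelled.
    attn_tp_size: Optional[int]

    def __post_init__(self) -> None:
        # int(None) raises TypeError, so an unregistered server crashes here.
        self.attn_tp_size = int(self.attn_tp_size)


def old_style_parse(payload: dict) -> Optional[int]:
    # Stand-in for the earlier behaviour: a broad except hides the failure.
    try:
        return int(payload["attn_tp_size"])
    except Exception:
        return None
```

With a payload whose `attn_tp_size` is `None`, `old_style_parse` quietly returns `None`, while constructing `ServerInfoSketch` raises `TypeError`.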