[PD-Disagg] Fix bootstrap server race condition when prefill workers not yet registered#19288

Merged
hnyls2002 merged 5 commits into main from lsyin/fix-boostrap-data-race
Feb 25, 2026

@hnyls2002 hnyls2002 commented Feb 25, 2026

Summary

Bootstrap server

  • Add _is_ready() check: track registered worker count, return 503 until all dp_size * tp_size * pp_size workers have registered
  • Guard both metadata query (-1,-1,-1) and specific worker query with _is_ready()
  • Catch KeyError on prefill_port_table lookup instead of crashing the server
  • Remove unreachable None check after dict lookup (dead code)
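The server-side changes listed above can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: the class `BootstrapState`, its `register` and `handle_get` methods, the flat tuple-keyed table, and the 404 status for a missing entry are all assumptions made for the example. Only the 503-until-ready behaviour, the `(-1,-1,-1)` metadata query, the `dp_size * tp_size * pp_size` count, and the `KeyError` handling come from the summary.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple

Rank = Tuple[int, int, int]  # (dp_rank, tp_rank, pp_rank), hypothetical key layout


@dataclass
class BootstrapState:
    """Hypothetical stand-in for the bootstrap server's registration state."""

    dp_size: Optional[int] = None
    tp_size: Optional[int] = None
    pp_size: Optional[int] = None
    # Flat stand-in for prefill_port_table; the real table layout may differ.
    prefill_port_table: Dict[Rank, int] = field(default_factory=dict)

    def register(self, dp_size: int, tp_size: int, pp_size: int, rank: Rank, port: int) -> None:
        """Handle one worker's PUT: record the parallel sizes and the worker's port."""
        self.dp_size, self.tp_size, self.pp_size = dp_size, tp_size, pp_size
        self.prefill_port_table[rank] = port

    def is_ready(self) -> bool:
        """True once dp_size * tp_size * pp_size workers have registered."""
        if None in (self.dp_size, self.tp_size, self.pp_size):
            return False
        expected = self.dp_size * self.tp_size * self.pp_size
        return len(self.prefill_port_table) >= expected

    def handle_get(self, rank: Rank):
        """Return (status, body). Both query kinds are gated on is_ready()."""
        if not self.is_ready():
            return 503, "prefill workers not yet registered"
        if rank == (-1, -1, -1):  # metadata query
            return 200, {
                "dp_size": self.dp_size,
                "tp_size": self.tp_size,
                "pp_size": self.pp_size,
            }
        try:
            return 200, {"port": self.prefill_port_table[rank]}
        except KeyError:
            # Assumed status code; the point is to answer instead of crashing.
            return 404, f"no prefill worker registered for {rank}"
```

With `dp_size=1, tp_size=2, pp_size=1`, every GET in this sketch returns 503 until the second worker has registered.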

Decode client

  • Add retry logic in ensure_parallel_info (20 retries, 1s interval) so decode waits for prefill registration instead of failing immediately
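The decode-side retry could look something like the sketch below. It is not the body of `ensure_parallel_info`: the helper name `fetch_with_retry`, the `fetch` callable (assumed to return `None` when the server answers 503), and the injectable `sleep` parameter are hypothetical. Only the 20 retries and the 1 s interval come from the summary.

```python
import time
from typing import Callable, Optional


def fetch_with_retry(
    fetch: Callable[[], Optional[dict]],
    retries: int = 20,
    interval_s: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[dict]:
    """Call fetch() until it returns a result or the retries are used up.

    fetch is assumed to return None while the bootstrap server is not ready.
    """
    for attempt in range(retries):
        info = fetch()
        if info is not None:
            return info
        if attempt < retries - 1:  # no need to wait after the final attempt
            sleep(interval_s)
    return None
```

Passing `sleep` as a parameter keeps the loop testable without real delays; a caller that gets `None` back can then fail with a clear "prefill not registered" error.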

Error Log

From CI failure:

File ".../sglang/srt/disaggregation/common/conn.py", line 675, in _handle_route_get
    info = PrefillServerInfo(
File ".../sglang/srt/disaggregation/common/conn.py", line 56, in __post_init__
    self.attn_tp_size = int(self.attn_tp_size)
TypeError: int() argument must be a string, a bytes-like object or a real number, not 'NoneType'

Root Cause

CommonKVBootstrapServer starts its HTTP server before any prefill worker has registered. Fields like attn_tp_size are None until a PUT arrives. A decode GET during this window crashes with TypeError in PrefillServerInfo.__post_init__.
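The failure mode can be reproduced in isolation. The dataclass below is a hypothetical miniature of `PrefillServerInfo`; only the `attn_tp_size` field and the `int(...)` coercion in `__post_init__` are taken from the traceback, and `build_info` is a helper invented for the example.

```python
from dataclasses import dataclass
from typing import Optional, Union


@dataclass
class ServerInfoSketch:
    """Hypothetical miniature of PrefillServerInfo with a single field."""

    attn_tp_size: Optional[Union[int, str]]

    def __post_init__(self):
        # Same coercion as in the traceback; raises TypeError when the value is None.
        self.attn_tp_size = int(self.attn_tp_size)


def build_info(attn_tp_size):
    """Build the info object, returning the error text instead of raising."""
    try:
        return ServerInfoSketch(attn_tp_size)
    except TypeError as exc:
        return f"TypeError: {exc}"
```

Calling `build_info(None)` mirrors a GET that arrives before any PUT, while `build_info("4")` mirrors the normal post-registration case.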

This race always existed but was silently masked: previously _handle_route_get returned a plain dict (serializing None as JSON null), and the decode-side int(None) was swallowed by except Exception. After #19195 introduced PrefillServerInfo with __post_init__, the crash moved server-side and became visible.
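To illustrate how the older code path could hide the problem, here is a rough sketch. Both function names are hypothetical and the real handlers carried more fields; the sketch only shows `None` surviving JSON serialization as `null` and a broad `except Exception` absorbing the resulting `int(None)` failure on the client.

```python
import json
from typing import Optional


def old_server_response(attn_tp_size: Optional[int]) -> str:
    """Pre-#19195 style: return a plain dict, so None is serialized as JSON null."""
    return json.dumps({"attn_tp_size": attn_tp_size})


def old_client_parse(body: str) -> Optional[int]:
    """Client side: int(None) raises TypeError, but the broad except hides it."""
    try:
        return int(json.loads(body)["attn_tp_size"])
    except Exception:
        return None
```

In this sketch the server never raises, and the client quietly gets `None` back, so the race leaves no visible error.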

Repro Patch

Delay register_to_bootstrap() to widen the race window:

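import threading

def register_to_bootstrap(self):
    # Repro only: defer the real registration by 10 s to widen the race window.
    timer = threading.Timer(10, self._do_register_to_bootstrap)
    timer.daemon = True  # do not keep the process alive for the timer
    timer.start()

def _do_register_to_bootstrap(self):
    ...  # original body of register_to_bootstrap()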
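A self-contained toy version of the same trick shows why the delay widens the window. `ToyWorker` and the `registry` set are hypothetical stand-ins for the prefill worker and the bootstrap server's table, and the delay is shortened to 0.2 s so the example finishes quickly.

```python
import threading


class ToyWorker:
    """Hypothetical worker whose registration is deferred by a timer."""

    def __init__(self, registry: set, delay_s: float):
        self.registry = registry
        self.delay_s = delay_s
        self.timer = None

    def register_to_bootstrap(self):
        # Returns immediately; the real registration happens delay_s later.
        self.timer = threading.Timer(self.delay_s, self._do_register_to_bootstrap)
        self.timer.daemon = True
        self.timer.start()

    def _do_register_to_bootstrap(self):
        self.registry.add("prefill-0")


registry = set()
worker = ToyWorker(registry, delay_s=0.2)
worker.register_to_bootstrap()
seen_before = set(registry)  # what a decode GET would see inside the window
worker.timer.join()          # wait for the deferred registration to run
seen_after = set(registry)
```

Any query that lands between `register_to_bootstrap()` returning and the timer firing observes an empty table, which is the state the fix answers with 503.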

Test plan

  • test_disaggregation_basic.py passes
  • test_disaggregation_dp_attention.py passes

…not yet registered

The bootstrap server starts accepting HTTP requests before any prefill
worker has registered via PUT. If a decode worker queries server info
during this window, PrefillServerInfo is constructed with None fields,
causing a TypeError crash.

Fix: track the registered worker count on the bootstrap server and return
503 until all workers have registered; add client-side retry in
ensure_parallel_info.

Co-authored-by: Cursor <cursoragent@cursor.com>

@ShangmingCai ShangmingCai left a comment

LGTM

Metadata is available once at least one prefill worker has registered.
Specific worker queries have their own error handling.

Co-authored-by: Cursor <cursoragent@cursor.com>

hnyls2002 commented Feb 25, 2026

/rerun-stage stage-c-test-8-gpu-h20

hnyls2002 commented Feb 25, 2026

/rerun-stage stage-b-test-large-2-gpu

Revert _is_ready() to check full worker count (dp * tp * pp). Also
guard specific worker queries with _is_ready() and handle KeyError
to prevent server crash on missing entries.

Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
@sgl-project sgl-project deleted a comment from github-actions bot Feb 25, 2026
@github-actions
Contributor

✅ Triggered stage-c-test-8-gpu-h20 to run independently (skipping dependencies).

@github-actions
Contributor

✅ Triggered stage-b-test-large-2-gpu to run independently (skipping dependencies).

@hnyls2002
Collaborator Author

PD-Disaggregation tests passed.

@hnyls2002 hnyls2002 merged commit ab0f608 into main Feb 25, 2026
137 of 147 checks passed
@hnyls2002 hnyls2002 deleted the lsyin/fix-boostrap-data-race branch February 25, 2026 04:22
magicYang1573 pushed a commit to magicYang1573/sglang that referenced this pull request Mar 9, 2026
…not yet registered (sgl-project#19288)

Co-authored-by: Cursor <cursoragent@cursor.com>
lawrence-harmonic added a commit to lawrence-harmonic/sglang that referenced this pull request Mar 10, 2026
Wangzheee pushed a commit to Wangzheee/sglang that referenced this pull request Mar 21, 2026
…not yet registered (sgl-project#19288)

Co-authored-by: Cursor <cursoragent@cursor.com>
JustinTong0323 pushed a commit to JustinTong0323/sglang that referenced this pull request Apr 7, 2026
…not yet registered (sgl-project#19288)

Co-authored-by: Cursor <cursoragent@cursor.com>

2 participants