@@ -922,13 +922,25 @@ async def register_worker_with_bootstrap(self):
         )
         while True:
             try:
-                async with httpx.AsyncClient() as client:
+                # timeout=30s: the bootstrap server runs inside the DP0 worker
+                # process and can stall the uvicorn event loop for several
+                # seconds while DP0 is holding the GIL (e.g. EFA MR
+                # registration of multi-GB KV caches across many NICs). The
+                # httpx default of 5s is not enough headroom and causes
+                # non-DP0 workers to hit ReadTimeout even though the server
+                # eventually responds 200 OK.
+                async with httpx.AsyncClient(timeout=30.0) as client:
                     response = await client.post(url, json=payload.model_dump())
                     response.raise_for_status()
                     logger.debug("Successfully registered with bootstrap server at %s", url)
                     break
-            except httpx.ConnectError:
-                # Bootstrap server not ready, wait for a while and retry.
+            except (
+                httpx.ConnectError,
+                httpx.TimeoutException,
+                httpx.RemoteProtocolError,
+            ):
Comment on lines +937 to +941

Contributor


high

Adding httpx.TimeoutException and httpx.RemoteProtocolError addresses the specific ReadTimeout issue, but consider catching the broader httpx.TransportError instead of just httpx.ConnectError. Transient network issues or a heavily loaded bootstrap server can also surface as httpx.ReadError or httpx.WriteError (both subclasses of httpx.NetworkError, which derives from TransportError). Those currently fall through to the except Exception block, potentially logging an empty error message and causing the same startup deadlock this PR aims to fix. Note that TimeoutException and RemoteProtocolError are themselves subclasses of TransportError, so listing them next to it is redundant; the suggestion keeps them only to make the intent explicit.

Suggested change
-            except (
-                httpx.ConnectError,
-                httpx.TimeoutException,
-                httpx.RemoteProtocolError,
-            ):
+            except (
+                httpx.TransportError,
+                httpx.TimeoutException,
+                httpx.RemoteProtocolError,
+            ):
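
To illustrate the difference between the two `except` tuples, here is a minimal stdlib-only sketch. It is not the PR's code: the exception classes are stand-ins that mirror the relevant slice of the httpx hierarchy (so the snippet runs without httpx installed), and `register_with_retry`, `make_flaky_post`, `run`, `NARROW`, and `BROAD` are hypothetical names introduced only for this example.

```python
import asyncio


# Stand-in exception classes (hypothetical, stdlib only). They mirror the
# part of the httpx hierarchy discussed above; they are NOT httpx itself.
class TransportError(Exception): ...
class TimeoutException(TransportError): ...
class NetworkError(TransportError): ...
class ConnectError(NetworkError): ...
class ReadError(NetworkError): ...
class ProtocolError(TransportError): ...
class RemoteProtocolError(ProtocolError): ...


async def register_with_retry(post, retry_on, delay=0.0):
    """Call `post` until it succeeds and return the number of attempts.

    Exceptions matching `retry_on` trigger a sleep of `delay` seconds and
    another attempt; any other exception propagates to the caller.
    """
    attempts = 0
    while True:
        attempts += 1
        try:
            await post()
            return attempts
        except retry_on:
            await asyncio.sleep(delay)


def make_flaky_post(errors):
    """Return a coroutine function that raises each error once, then succeeds."""
    pending = list(errors)

    async def post():
        if pending:
            raise pending.pop(0)

    return post


# The tuple from the PR versus the broader base class from the review.
NARROW = (ConnectError, TimeoutException, RemoteProtocolError)
BROAD = (TransportError,)


def run(retry_on, errors):
    """Drive one registration attempt sequence to completion."""
    return asyncio.run(register_with_retry(make_flaky_post(errors), retry_on))
```

Under these assumptions, `run(BROAD, [ReadError()])` retries once and then succeeds, whereas `run(NARROW, [ReadError()])` lets the `ReadError` escape the retry loop, which is the fall-through case the comment above describes.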

+                # Bootstrap server not ready / event loop momentarily starved
+                # / connection dropped mid-flight. Wait for a while and retry.
                 await asyncio.sleep(1)
             except Exception as e:
                 err_msg = (