fix: respect user stage_init_timeout during diffusion stage initialization (#2518)#2548
fix: respect user stage_init_timeout during diffusion stage initialization (#2518)#2548passionworkeer wants to merge 1 commit intovllm-project:mainfrom
Conversation
|
Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits. |
There was a problem hiding this comment.
Pull request overview
This PR fixes diffusion stage startup timeouts by ensuring the user-provided stage_init_timeout is propagated through diffusion stage initialization and used for the diffusion subprocess READY handshake, instead of relying on a hardcoded 600s timeout.
Changes:
- Thread
stage_init_timeoutthrough_initialize_stages→initialize_diffusion_stage→StageDiffusionClient→complete_diffusion_handshake. - Update diffusion handshake polling to use the provided timeout value (seconds) rather than
_HANDSHAKE_POLL_TIMEOUT_S. - Extend diffusion stage init utilities/docstrings to include the new timeout parameter.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
vllm_omni/engine/async_omni_engine.py |
Passes stage_init_timeout into diffusion stage initialization from orchestrator stage init. |
vllm_omni/engine/stage_init_utils.py |
Adds stage_init_timeout parameter to diffusion stage init helper and forwards it to the client. |
vllm_omni/diffusion/stage_diffusion_client.py |
Accepts stage_init_timeout and forwards it to diffusion handshake completion. |
vllm_omni/diffusion/stage_diffusion_proc.py |
Adds timeout_s argument to diffusion handshake helpers and uses it for ZMQ poll timeout. |
Comments suppressed due to low confidence (1)
vllm_omni/diffusion/stage_diffusion_proc.py:599
- The new configurable handshake timeout isn’t reflected in the TimeoutError message. Include the effective
timeout_s(and possibly the handshake address) in the raised TimeoutError so users can confirm whichstage_init_timeoutvalue was applied when diagnosing slow diffusion stage startups.
timeout_ms = timeout_s * 1000
while True:
events = dict(poller.poll(timeout=timeout_ms))
if not events:
raise TimeoutError("Timed out waiting for READY from StageDiffusionProc")
if handshake_socket in events:
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| poller.register(proc.sentinel, zmq.POLLIN) | ||
|
|
||
| timeout_ms = _HANDSHAKE_POLL_TIMEOUT_S * 1000 | ||
| timeout_ms = timeout_s * 1000 | ||
| while True: | ||
| events = dict(poller.poll(timeout=timeout_ms)) |
There was a problem hiding this comment.
_HANDSHAKE_POLL_TIMEOUT_S is no longer used after switching to the timeout_s parameter, so this leaves an unused module-level constant (likely to fail lint/type checks). Remove _HANDSHAKE_POLL_TIMEOUT_S or use it as the default when timeout_s isn’t provided.
| stage_cfg, | ||
| metadata, | ||
| batch_size=self.diffusion_batch_size, | ||
| stage_init_timeout=stage_init_timeout, | ||
| ) |
There was a problem hiding this comment.
There’s no regression test ensuring stage_init_timeout is forwarded into diffusion stage initialization (and ultimately controls the diffusion handshake timeout). Consider adding a unit test (similar to test_initialize_stages_restores_device_visibility_after_diffusion_init) that monkeypatches initialize_diffusion_stage/StageDiffusionClient/complete_diffusion_handshake and asserts the expected timeout value is passed through, to prevent reintroducing a hardcoded default.
Description
When initializing the diffusion stage, the user's \stage_init_timeout\ parameter was being ignored. The diffusion stage handshake was using a hardcoded 600-second timeout instead of respecting the user-configured \stage_init_timeout.
Root Cause
The \stage_init_timeout\ parameter was passed to _launch_llm_stage\ for LLM stages but was not passed to \initialize_diffusion_stage\ for diffusion stages. Additionally, \complete_diffusion_handshake\ used a hardcoded _HANDSHAKE_POLL_TIMEOUT_S = 600.
Fix
Pass \stage_init_timeout\ through the diffusion stage initialization chain:
Files Changed
Fixes #2518