
fix: respect user stage_init_timeout during diffusion stage initialization (#2518)#2548

Closed
passionworkeer wants to merge 1 commit into vllm-project:main from passionworkeer:fix-stage-init-timeout-2518

Conversation

@passionworkeer

Description

When initializing the diffusion stage, the user's `stage_init_timeout` parameter was being ignored. The diffusion stage handshake used a hardcoded 600-second timeout instead of respecting the user-configured `stage_init_timeout`.

Root Cause

The `stage_init_timeout` parameter was passed to `_launch_llm_stage` for LLM stages but was not passed to `initialize_diffusion_stage` for diffusion stages. Additionally, `complete_diffusion_handshake` used a hardcoded `_HANDSHAKE_POLL_TIMEOUT_S = 600`.

Fix

Pass `stage_init_timeout` through the diffusion stage initialization chain:

  1. `_initialize_stages` → `initialize_diffusion_stage` (new `stage_init_timeout` parameter)
  2. `initialize_diffusion_stage` → `StageDiffusionClient.__init__` (new `stage_init_timeout` parameter)
  3. `StageDiffusionClient.__init__` → `complete_diffusion_handshake` (new `timeout_s` parameter)
  4. `complete_diffusion_handshake` → `_perform_diffusion_handshake` (uses `timeout_s` instead of the hardcoded value)

Files Changed

  • `vllm_omni/engine/async_omni_engine.py` - pass `stage_init_timeout` to `initialize_diffusion_stage`
  • `vllm_omni/engine/stage_init_utils.py` - add `stage_init_timeout` parameter to `initialize_diffusion_stage`
  • `vllm_omni/diffusion/stage_diffusion_client.py` - add `stage_init_timeout` to `StageDiffusionClient.__init__` and pass to `complete_diffusion_handshake`
  • `vllm_omni/diffusion/stage_diffusion_proc.py` - add `timeout_s` parameter to `complete_diffusion_handshake` and `_perform_diffusion_handshake`

Fixes #2518

Copilot AI review requested due to automatic review settings April 7, 2026 09:13
@chatgpt-codex-connector

Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits.
Credits must be used to enable repository wide code reviews.

Contributor

Copilot AI left a comment


Pull request overview

This PR fixes diffusion stage startup timeouts by ensuring the user-provided stage_init_timeout is propagated through diffusion stage initialization and used for the diffusion subprocess READY handshake, instead of relying on a hardcoded 600s timeout.

Changes:

  • Thread `stage_init_timeout` through `_initialize_stages` → `initialize_diffusion_stage` → `StageDiffusionClient` → `complete_diffusion_handshake`.
  • Update diffusion handshake polling to use the provided timeout value (seconds) rather than _HANDSHAKE_POLL_TIMEOUT_S.
  • Extend diffusion stage init utilities/docstrings to include the new timeout parameter.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| `vllm_omni/engine/async_omni_engine.py` | Passes `stage_init_timeout` into diffusion stage initialization from orchestrator stage init. |
| `vllm_omni/engine/stage_init_utils.py` | Adds a `stage_init_timeout` parameter to the diffusion stage init helper and forwards it to the client. |
| `vllm_omni/diffusion/stage_diffusion_client.py` | Accepts `stage_init_timeout` and forwards it to diffusion handshake completion. |
| `vllm_omni/diffusion/stage_diffusion_proc.py` | Adds a `timeout_s` argument to the diffusion handshake helpers and uses it for the ZMQ poll timeout. |
Comments suppressed due to low confidence (1)

vllm_omni/diffusion/stage_diffusion_proc.py:599

  • The new configurable handshake timeout isn't reflected in the `TimeoutError` message. Include the effective `timeout_s` (and possibly the handshake address) in the raised `TimeoutError` so users can confirm which `stage_init_timeout` value was applied when diagnosing slow diffusion stage startups.
```python
        timeout_ms = timeout_s * 1000
        while True:
            events = dict(poller.poll(timeout=timeout_ms))
            if not events:
                raise TimeoutError("Timed out waiting for READY from StageDiffusionProc")
            if handshake_socket in events:
```
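The reviewer's suggestion could look like the sketch below. The `handshake_addr` parameter is a hypothetical name introduced here for illustration; the actual variable holding the handshake address in `stage_diffusion_proc.py` may differ.

```python
# Hedged sketch of the review suggestion: surface the effective timeout
# (and address) in the error so users can confirm their configured
# stage_init_timeout was actually applied.

def raise_handshake_timeout(timeout_s: float, handshake_addr: str) -> None:
    # `handshake_addr` is a hypothetical parameter for this sketch.
    raise TimeoutError(
        f"Timed out waiting for READY from StageDiffusionProc after "
        f"{timeout_s}s (handshake address: {handshake_addr})"
    )
```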


Comment on lines 592 to 596
```diff
 poller.register(proc.sentinel, zmq.POLLIN)

-timeout_ms = _HANDSHAKE_POLL_TIMEOUT_S * 1000
+timeout_ms = timeout_s * 1000
 while True:
     events = dict(poller.poll(timeout=timeout_ms))
```

Copilot AI Apr 7, 2026


`_HANDSHAKE_POLL_TIMEOUT_S` is no longer used after switching to the `timeout_s` parameter, so this leaves an unused module-level constant (likely to fail lint/type checks). Remove `_HANDSHAKE_POLL_TIMEOUT_S` or use it as the default when `timeout_s` isn't provided.
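The second option the comment mentions could be sketched as below. The signature is an assumption for illustration, not the actual vllm-omni API; the function is reduced to just the timeout-resolution logic.

```python
# Sketch: keep the module constant as a fallback default, so callers
# that do not configure stage_init_timeout retain the historical 600s
# behavior while the constant stays in use.
from typing import Optional

_HANDSHAKE_POLL_TIMEOUT_S = 600

def resolve_handshake_timeout_ms(timeout_s: Optional[float] = None) -> int:
    # Fall back to the historical default when no timeout was given.
    if timeout_s is None:
        timeout_s = _HANDSHAKE_POLL_TIMEOUT_S
    # zmq.Poller.poll() takes its timeout in milliseconds.
    return int(timeout_s * 1000)
```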

Comment on lines 540 to 544
```python
    stage_cfg,
    metadata,
    batch_size=self.diffusion_batch_size,
    stage_init_timeout=stage_init_timeout,
)
```

Copilot AI Apr 7, 2026


There's no regression test ensuring `stage_init_timeout` is forwarded into diffusion stage initialization (and ultimately controls the diffusion handshake timeout). Consider adding a unit test (similar to `test_initialize_stages_restores_device_visibility_after_diffusion_init`) that monkeypatches `initialize_diffusion_stage`/`StageDiffusionClient`/`complete_diffusion_handshake` and asserts the expected timeout value is passed through, to prevent reintroducing a hardcoded default.

@hsliuustc0106
Collaborator

I have merged #2519, which solves #2518. Thanks for your contribution; please check other issues if you are interested in contributing to vllm-omni :)



Development

Successfully merging this pull request may close these issues.

[Bug]: Diffusion stage initialization fails with TimeoutError and user stage_init_timeout is ignored
