
[Bugfix] Restore user config/runtime stage init timeout#2519

Merged
hsliuustc0106 merged 20 commits into vllm-project:main from yuanheng-zhao:fix/stage-init-timeout
Apr 11, 2026

Conversation

@yuanheng-zhao
Contributor

@yuanheng-zhao yuanheng-zhao commented Apr 6, 2026


Purpose

Resolves #2518

This PR restores the pass-through behavior of stage_init_timeout that existed before the StageDiffusionClient refactor. Specifically:

  1. Removed the hardcoded _HANDSHAKE_POLL_TIMEOUT_S.
  2. For the diffusion stage, the arg was not passed into initialize_diffusion_stage in AsyncOmniEngine._initialize_stages. Fixed.
  3. For the LLM stage, stage_init_timeout was used in acquire_device_locks but not in complete_stage_handshake. Fixed.

Defaults for the stage init timeout are kept as:

  • serve.py: --stage-init-timeout default=300 (online serving CLI)
  • async_omni_engine.py: AsyncOmniEngine.__init__(stage_init_timeout: int = 300) (engine API)
  • omni_base.py: kwargs.pop("stage_init_timeout", 300)
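The pattern behind those defaults can be sketched as follows. This is a minimal, hypothetical stand-in for the real serve.py / omni_base.py / async_omni_engine.py code (function names simplified, not the actual signatures): the default lives at the entrypoints, and lower-level helpers take the value as a required argument so they never silently fall back to a hardcoded constant.

```python
DEFAULT_STAGE_INIT_TIMEOUT_S = 300  # single source of truth at the entrypoint


def omni_base_init(**kwargs):
    # mirrors omni_base.py: kwargs.pop("stage_init_timeout", 300)
    timeout = kwargs.pop("stage_init_timeout", DEFAULT_STAGE_INIT_TIMEOUT_S)
    return async_engine_init(stage_init_timeout=timeout)


def async_engine_init(stage_init_timeout: int = DEFAULT_STAGE_INIT_TIMEOUT_S):
    # engine API keeps the same default for direct callers
    return complete_handshake(stage_init_timeout)


def complete_handshake(timeout_s: int) -> int:
    # no default here: callers must pass the user-configured value through
    return timeout_s


print(omni_base_init())                         # 300
print(omni_base_init(stage_init_timeout=1200))  # 1200
```

With this shape, changing the CLI default changes it everywhere, and a forgotten pass-through becomes a TypeError instead of a silently wrong 600s constant.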

Test Plan

DiT offline example test (text-to-image) and an AR end-to-end example, both run with stage_init_timeout applied

DiT:

python examples/offline_inference/text_to_image/text_to_image.py \
	  --model stepfun-ai/NextStep-1.1 \
	  --prompt "A baby panda wearing an Iron Man mask, holding a board with 'NextStep-1' written on it" \
	  --height 512 \
	  --width 512 \
	  --num-inference-steps 28 \
	  --guidance-scale 7.5 \
	  --guidance-scale-2 1.0 \
	  --cfg-schedule constant \
	  --seed 42 \
	  --output output_nextstep_layerwise.png \
	  --enable-layerwise-offload \
	  --stage-init-timeout 1200

This is a test case from #2339.

Test Result

DiT

Before,

  File "~/repos/vllm-project/vllm-omni/vllm_omni/entrypoints/omni_base.py", line 95, in __init__
    self.engine = AsyncOmniEngine(
                  ^^^^^^^^^^^^^^^^
  File "~/repos/vllm-project/vllm-omni/vllm_omni/engine/async_omni_engine.py", line 285, in __init__
    raise TimeoutError(f"Orchestrator did not become ready within {startup_timeout}s") from e
TimeoutError: Orchestrator did not become ready within 600s

After,
TimeoutError resolved.

AR

Before,

ERROR 04-05 06:50:43 [async_omni_engine.py:620]   File "~/repos/vllm-project/vllm-omni/vllm_omni/engine/async_omni_engine.py", line 370, in _launch_llm_stage
ERROR 04-05 06:50:43 [async_omni_engine.py:620]     complete_stage_handshake(proc, handshake_address, addresses, vllm_config)
ERROR 04-05 06:50:43 [async_omni_engine.py:620]   File "~/repos/vllm-project/vllm-omni/vllm_omni/engine/stage_engine_core_proc.py", line 154, in complete_stage_handshake
ERROR 04-05 06:50:43 [async_omni_engine.py:620]     _perform_handshake(proc, handshake_address, addresses, vllm_config)
ERROR 04-05 06:50:43 [async_omni_engine.py:620]   File "~/repos/vllm-project/vllm-omni/vllm_omni/engine/stage_engine_core_proc.py", line 182, in _perform_handshake
ERROR 04-05 06:50:43 [async_omni_engine.py:620]     identity, msg = _recv(poller, handshake_socket, proc, "READY")
ERROR 04-05 06:50:43 [async_omni_engine.py:620]                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-05 06:50:43 [async_omni_engine.py:620]   File "~/repos/vllm-project/vllm-omni/vllm_omni/engine/stage_engine_core_proc.py", line 201, in _recv
ERROR 04-05 06:50:43 [async_omni_engine.py:620]     raise TimeoutError(f"Timed out waiting for {expected} from StageEngineCoreProc")
ERROR 04-05 06:50:43 [async_omni_engine.py:620] TimeoutError: Timed out waiting for READY from StageEngineCoreProc

After, TimeoutError resolved.

For nightly test failures, see #2519 (comment)


Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan. Please provide the test scripts & test commands. Please state the reasons if your codes don't require additional test scripts. For test file guidelines, please check the test style doc
  • The test results. Please paste the results comparison before and after, or the e2e results.
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model. Please run mkdocs serve to sync the documentation editions to ./docs.
  • (Optional) Release notes update. If your change is user-facing, please update the release notes draft.


Signed-off-by: yuanheng <jonathan.zhaoyh@gmail.com>
@yuanheng-zhao yuanheng-zhao changed the title [Bugfix] Restore user config and runtime stage init timeout [Bugfix] Restore user config/runtime stage init timeout Apr 6, 2026
Signed-off-by: yuanheng <jonathan.zhaoyh@gmail.com>
@yuanheng-zhao yuanheng-zhao marked this pull request as ready for review April 6, 2026 10:17
@hsliuustc0106
Collaborator

@wuhang2014 @chickeyton PTAL

Collaborator

@hsliuustc0106 hsliuustc0106 left a comment


Bugfix for stage_init_timeout propagation. Clear purpose and good error messages. Missing regression test.

help="Enable logging of diffusion pipeline stats.",
)
parser.add_argument(
"--init-timeout",
Collaborator


Two timeout args: --init-timeout and --stage-init-timeout. Should this be a single arg? init_timeout is not used elsewhere in the engine.

Contributor Author


This is from the previous orchestrator/stage design: init_timeout is the orchestrator timeout (effectively an overall timeout), while stage_init_timeout is the timeout for any single stage. I just kept it as is.

Contributor Author


It might require further optimization/cleanup, though. In test cases that need a longer init time, users currently have to set both init_timeout and stage_init_timeout.
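To illustrate why both knobs currently have to be raised together, here is a toy sketch of the two-level budget (hypothetical helper with simulated stage durations; the real engine blocks on process startup instead of adding numbers):

```python
def init_all_stages(stage_times_s, init_timeout_s, stage_init_timeout_s):
    """Toy model: init_timeout bounds the whole loop, stage_init_timeout
    bounds each individual stage. If init_timeout < sum of stage times,
    the overall deadline fires even though every stage is individually
    within budget - which is why users must raise both knobs together."""
    elapsed = 0.0
    for i, t in enumerate(stage_times_s):
        if t > stage_init_timeout_s:
            raise TimeoutError(f"stage {i} exceeded stage_init_timeout ({stage_init_timeout_s}s)")
        elapsed += t  # real code would block here while the stage initializes
        if elapsed > init_timeout_s:
            raise TimeoutError(f"orchestrator exceeded init_timeout ({init_timeout_s}s) at stage {i}")
    return "ok"


# two 100s stages fit a 300s overall budget with a 150s per-stage budget
print(init_all_stages([100, 100], 300, 150))  # ok
```

Two 200s stages with the same budgets would pass the per-stage check but trip the 300s overall deadline, exactly the "set both" situation described above.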

Contributor


Two timeout args: --init-timeout and --stage-init-timeout. Should this be a single arg? init_timeout is not used elsewhere in the engine.

--init-timeout and --stage-init-timeout are not orthogonal, one is enough in my opinion.

Contributor Author


--init-timeout and --stage-init-timeout are not orthogonal, one is enough in my opinion.

Agree.

Shall we discuss and improve this design in follow-up commits, to keep the bugfix in this PR separate from further improvement/cleanup of the args? (This might relate to @lishunyang12's incoming config refactor as well.)

(Both args existed before and after the previous stage client refactor, but --stage-init-timeout stopped working after that refactor.)

Collaborator


cc @lishunyang12 will your refactor cover this --init-timeout and --stage-init-timeout?

Collaborator


@hsliuustc0106 No — #2383 is scoped to the deploy YAML schema (pipeline.yaml + deploy/<model>.yaml) and how CLI args merge into it. --init-timeout and --stage-init-timeout get popped at OmniBase.__init__ / AsyncOmniEngine.__init__ as runtime knobs and never reach the stage-config factory, so they flow through unchanged.

Agree they should be unified though. Happy to take it as a follow-up after #2383 merges — pick one canonical name (stage_init_timeout reads more accurate; it's the per-stage budget) and deprecate the other with a DeprecationWarning. Want me to file an issue?
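If the follow-up does go the deprecation route, one possible shape is an argparse alias that warns and maps onto the canonical flag. This is a hypothetical sketch, not code from this PR; the real serve.py parser will differ:

```python
# Hypothetical deprecation shim: --stage-init-timeout stays canonical,
# --init-timeout becomes a warned alias.
import argparse
import warnings


def parse_timeout_args(argv):
    parser = argparse.ArgumentParser()
    parser.add_argument("--stage-init-timeout", type=int, default=300)
    parser.add_argument("--init-timeout", type=int, default=None,
                        help="DEPRECATED: use --stage-init-timeout")
    args = parser.parse_args(argv)
    if args.init_timeout is not None:
        warnings.warn("--init-timeout is deprecated; use --stage-init-timeout",
                      DeprecationWarning, stacklevel=2)
        # the deprecated flag only wins if the canonical one was left at default
        if args.stage_init_timeout == 300:
            args.stage_init_timeout = args.init_timeout
    return args
```

Using `default=None` for the deprecated flag makes "user actually passed it" detectable, so the warning only fires when the old flag is really in use.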

Contributor

@wuhang2014 wuhang2014 left a comment


I'm wondering how the timeout mechanism works in the LLM stage. Do they have unified handling logic?

def complete_diffusion_handshake(
proc: BaseProcess,
handshake_address: str,
handshake_timeout: int = 600,
Contributor


It's better to have no default value for this param.

Contributor Author


If a runtime stage init timeout is not provided, what default value should we use? Neither None nor 0 seems suitable here.

Contributor


If a runtime stage init timeout is not provided, what default value shall we use? Neither None nor 0 seems to be suitable here..

The default value should live in the CLI args for online serving.

Contributor Author

@yuanheng-zhao yuanheng-zhao Apr 7, 2026


Done. Default values for these functions have been removed. Defaults are now kept consistent at the higher-level entrypoints:

  • serve.py: --stage-init-timeout default=300 (online serving CLI)
  • async_omni_engine.py: AsyncOmniEngine.__init__(stage_init_timeout: int = 300) (engine API)
  • omni_base.py: kwargs.pop("stage_init_timeout", 300) (kept intact)

@yuanheng-zhao
Contributor Author

I'm wondering how does the timeout mechanism work in llm stage? Do they have a unified handling logic?

Hey @wuhang2014 , for llm stage the timeout passing flow is

AsyncOmniEngine._launch_llm_stage
-> passed into acquire_device_locks
-> not passed to complete_stage_handshake
    -> _perform_handshake
        -> _HANDSHAKE_POLL_TIMEOUT_S * 1000

where _launch_llm_stage does have both the stage init timeout and the init timeout, but doesn't pass either to _perform_handshake
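The fix is to thread the user-supplied value through that whole chain. A simplified stand-in (queue.Queue plays the role of the real ZMQ handshake socket, and the names are trimmed versions of the ones above, not the real signatures):

```python
# Sketch of the fixed flow: stage_init_timeout reaches _recv instead of a
# module-level _HANDSHAKE_POLL_TIMEOUT_S constant.
import queue


def launch_llm_stage(ready_q, stage_init_timeout_s):
    # before the fix, complete_stage_handshake did not receive the timeout
    return complete_stage_handshake(ready_q, stage_init_timeout_s)


def complete_stage_handshake(ready_q, timeout_s):
    return _recv(ready_q, expected="READY", timeout_s=timeout_s)


def _recv(ready_q, expected, timeout_s):
    try:
        msg = ready_q.get(timeout=timeout_s)
    except queue.Empty:
        raise TimeoutError(f"Timed out waiting for {expected} from StageEngineCoreProc")
    return msg


q = queue.Queue()
q.put("READY")
print(launch_llm_stage(q, stage_init_timeout_s=5))  # READY
```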

@wuhang2014
Contributor


Could the diffusion side use the same logic with llm?

@yuanheng-zhao
Contributor Author


Could the diffusion side use the same logic with llm?

We could, but that would require refactoring the current stage clients again, which might be too heavy just to fix a timeout error.

FYI, the current flow of diffusion stage timeout pass-through (compared to that of the llm stage):

in vllm_omni/engine/stage_init_utils.py,

initialize_diffusion_stage
-> StageDiffusionClient.__init__ without timeout arg passing in
    -> complete_diffusion_handshake
        -> _perform_diffusion_handshake - timeout_ms = _HANDSHAKE_POLL_TIMEOUT_S * 1000 (hardcoded)
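One detail the hardcoded constant used to hide: zmq.Poller.poll() takes milliseconds while the CLI arg is in seconds, so the pass-through also has to carry the unit conversion. A tiny sketch (poller_poll is a stand-in for the real poller, not the actual vllm-omni API):

```python
def poll_timeout_ms(stage_init_timeout_s: int) -> int:
    # zmq.Poller.poll() expects milliseconds; the CLI arg is in seconds
    return stage_init_timeout_s * 1000


def perform_diffusion_handshake(poller_poll, stage_init_timeout_s: int):
    # poller_poll stands in for zmq.Poller.poll; returns [] on timeout
    events = poller_poll(poll_timeout_ms(stage_init_timeout_s))
    if not events:
        raise TimeoutError("Timed out waiting for READY from diffusion stage")
    return events
```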

Signed-off-by: yuanheng <jonathan.zhaoyh@gmail.com>
@yuanheng-zhao yuanheng-zhao requested a review from wuhang2014 April 7, 2026 07:26
Signed-off-by: yuanheng <jonathan.zhaoyh@gmail.com>
@yuanheng-zhao
Contributor Author

cc @chickeyton

Signed-off-by: yuanheng <jonathan.zhaoyh@gmail.com>
model: str,
od_config: OmniDiffusionConfig,
metadata: StageMetadata,
handshake_timeout: int,
Contributor


Should it be named stage_init_timeout as well? There may be some other lengthy initialization process in the future.

Contributor Author


Sure, updated

Signed-off-by: yuanheng <jonathan.zhaoyh@gmail.com>
@hsliuustc0106 hsliuustc0106 added the ready label to trigger buildkite CI label Apr 10, 2026
@yenuo26
Collaborator

yenuo26 commented Apr 10, 2026

Can you add or remove the stage_init_timeout configuration in both of the following fixtures in tests/conftest.py? I think your change will cause the startup time for most test cases to actually become shorter.

@pytest.fixture(scope="module")
def omni_runner(request, model_prefix):
    with _omni_server_lock:
        model, stage_config_path = request.param
        model = model_prefix + model
        with OmniRunner(model, seed=42, stage_configs_path=stage_config_path, stage_init_timeout=300) as runner:
            print("OmniRunner started successfully")
            yield runner
            print("OmniRunner stopping...")

        print("OmniRunner stopped")
@pytest.fixture(scope="module")
def omni_server(request: pytest.FixtureRequest, run_level: str, model_prefix: str) -> Generator[OmniServer, Any, None]:
    """Start vLLM-Omni server as a subprocess with actual model weights.
    Uses session scope so the server starts only once for the entire test session.
    Multi-stage initialization can take 10-20+ minutes.
    """
    with _omni_server_lock:
        params: OmniServerParams = request.param
        model = model_prefix + params.model
        port = params.port
        stage_config_path = params.stage_config_path
        if run_level == "advanced_model" and stage_config_path is not None:
            with open(stage_config_path, encoding="utf-8") as f:
                cfg = yaml.safe_load(f) or {}
            stage_ids = [stage["stage_id"] for stage in cfg.get("stage_args", []) if "stage_id" in stage]
            stage_config_path = modify_stage_config(
                stage_config_path,
                deletes={"stage_args": {stage_id: ["engine_args.load_format"] for stage_id in stage_ids}},
            )

        server_args = params.server_args or []
        if params.use_omni:
            server_args = ["--stage-init-timeout", "120", *server_args]
        if stage_config_path is not None:
            server_args += ["--stage-configs-path", stage_config_path]

@princepride
Collaborator

We need @hsliuustc0106 to approve it.

@hsliuustc0106 hsliuustc0106 added the nightly-test label to trigger buildkite nightly test CI label Apr 10, 2026
@yuanheng-zhao
Contributor Author


@yenuo26 I think this PR only restores the functionality of stage-init-timeout; it won't make the startup time for test cases shorter.

@yenuo26
Collaborator

yenuo26 commented Apr 10, 2026


Since we have passed in the corresponding stage_init_timeout here, once you fix this issue and the timeout value takes effect, would the timeout period change from 600 to a lower value based on the passed-in argument? So, should we update these places accordingly?

@yuanheng-zhao
Contributor Author

yuanheng-zhao commented Apr 10, 2026

Since we have passed in the corresponding stage_init_timeout here, once you fix this issue and the timeout value takes effect, would the timeout period change from 600 to a lower value based on the passed-in argument? So, should we update these places accordingly?


^ Yes, that's correct. The TimeoutError triggers earlier for tests that exercise the timeout/shutdown/cleanup mechanism, since we set stage-init-timeout to a lower value. However, for a regular test (where the stages initialize successfully within the timeout), the launch time of the engine/stage client should stay the same.

@yenuo26
Collaborator

yenuo26 commented Apr 10, 2026


Therefore, can you also remove --stage-init-timeout from the above code in this PR? Otherwise, it will most likely cause the test cases in CI to time out.

Signed-off-by: yuanheng <jonathan.zhaoyh@gmail.com>
@yuanheng-zhao
Contributor Author

yuanheng-zhao commented Apr 10, 2026

Therefore, can you also remove --stage-init-timeout from the above code in this PR? Otherwise, it will most likely cause the test cases in CI to time out.

Got it. Removed the extra stage init timeout in tests/conftest.py; it now applies the default value of 300 seconds.


Wait a sec, let me update the other test conftests as well.

@yuanheng-zhao
Contributor Author

yuanheng-zhao commented Apr 10, 2026

Regression tests for timeout passing on both paths (the diffusion stage client path and the _launch_llm_stage call flow) have been added @hsliuustc0106

pytest -s tests/engine/test_async_omni_engine_stage_init.py
# ...
# ... 4 passed, 16 warnings in 0.64s ...
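For reference, the general shape of such a regression test looks roughly like this. This is a hypothetical sketch using dependency injection rather than the real module paths or signatures; the actual tests in tests/engine/test_async_omni_engine_stage_init.py will differ:

```python
# Sketch: assert the user-supplied timeout reaches the handshake unchanged.
from unittest import mock


def initialize_diffusion_stage(stage_init_timeout, handshake):
    # `handshake` injected so the test can observe the call;
    # stands in for the real initialize_diffusion_stage -> handshake chain
    return handshake(stage_init_timeout=stage_init_timeout)


def test_timeout_propagates():
    handshake = mock.Mock(return_value="READY")
    assert initialize_diffusion_stage(1200, handshake) == "READY"
    # the regression being guarded: the value must arrive, not a constant
    handshake.assert_called_once_with(stage_init_timeout=1200)
```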

Signed-off-by: yuanheng <jonathan.zhaoyh@gmail.com>
@yuanheng-zhao
Contributor Author

In this commit 8cbb78b, I made the following changes:

  1. Let OmniServerParams.stage_init_timeout in the root tests/conftest.py control the stage init timeout for any tests that use OmniServerParams to set up args
  2. For specific tests not using omni server params, added a module-level (unit-test-level) variable STAGE_INIT_TIMEOUT_S at the top to control the stage init timeout, since models of different sizes may have different expected init times

PTAL cc @yenuo26 , @congw729

Signed-off-by: yuanheng <jonathan.zhaoyh@gmail.com>
Collaborator

@lishunyang12 lishunyang12 left a comment


If we clean this up post-merge, I'd suggest matching upstream's vocabulary instead of collapsing — e.g. rename to --stage-handshake-timeout and --engine-ready-timeout (or re-export VLLM_ENGINE_READY_TIMEOUT_S as the env var). Keeps the two-knob model upstream validated and makes the relationship obvious. Out of scope for this bugfix — happy to take as a follow-up after #2519 and #2383 land.

Signed-off-by: yuanheng <jonathan.zhaoyh@gmail.com>
Collaborator

@lishunyang12 lishunyang12 left a comment


Correction to my earlier review. I claimed VLLM_ENGINE_READY_TIMEOUT_S is "not in vllm-omni's stage init path" / a "silent dead letter." That was wrong — it is in the path, via inheritance.

The corrected evidence chain

vllm_omni/engine/stage_engine_core_client.py:362:

class StageEngineCoreClient(StageEngineCoreClientBase, AsyncMPClient):

MRO: StageEngineCoreClient → StageEngineCoreClientBase → AsyncMPClient → MPClient. StageEngineCoreClientBase.__init__ (line 131) calls super().__init__(), which lands in MPClient.__init__ (vllm/v1/engine/core_client.py:458). That method unconditionally runs the poll loop at lines 537–549:

# vllm/v1/engine/core_client.py:537-549
while identities:
    if not sync_input_socket.poll(timeout=VLLM_ENGINE_READY_TIMEOUT_S * 1000):
        raise TimeoutError("Timed out waiting for engines to send initial message on input socket.")

So VLLM_ENGINE_READY_TIMEOUT_S IS consulted during LLM stage init, via _attach_llm_stage → make_async_mp_client → super().__init__() → MPClient.__init__.

What it actually bounds (which matters more than whether it's in the path)

It is not equivalent to --stage-init-timeout. They run sequentially and bound very different phases.

What VLLM_ENGINE_READY_TIMEOUT_S bounds is the wait for a single empty b"" frame that the subprocess sends as a ZMQ DEALER↔ROUTER identity registration. From vllm/v1/engine/core.py:1233-1237, inside process_input_sockets:

for input_socket in input_sockets:
    # Send initial message to each input socket - this is required
    # before the front-end ROUTER socket can send input messages
    # back to us.
    input_socket.send(b"")
    poller.register(input_socket, zmq.POLLIN)

The registration frame has no payload. It exists purely to satisfy ZMQ's pattern requirement: a ROUTER cannot route messages back to a DEALER until the DEALER has first sent something so the ROUTER can learn its identity.

The subprocess pushes this frame onto the wire as soon as process_input_sockets runs, which happens after model load completes but before the handshake context manager exits with READY. So by the time vllm-omni's _attach_llm_stage reaches MPClient.__init__ and polls the input socket, the frame has already been sitting in the kernel buffer for some time. poll() returns near-instantly with the buffered frame. VLLM_ENGINE_READY_TIMEOUT_S=600 gives 600 seconds of headroom on a phase that takes microseconds.

So setting VLLM_ENGINE_READY_TIMEOUT_S=1200 to fix a slow model load is meaningless — that env var doesn't bound model loading. Only --stage-init-timeout does. The 600s default upstream is "longer than any plausible failure mode for the ZMQ identity exchange," not "long enough for a 20-minute model load."
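The "poll returns near-instantly on a buffered frame" behavior can be demonstrated with plain stdlib sockets (a one-byte message stands in for the ZMQ identity frame, which is framed differently on the wire; this is an illustration, not the ZMQ protocol itself):

```python
import select
import socket
import time

a, b = socket.socketpair()
b.sendall(b"x")  # the "registration frame" is already on the wire before we poll

start = time.monotonic()
# 600s timeout, mirroring VLLM_ENGINE_READY_TIMEOUT_S; it never comes close to firing
readable, _, _ = select.select([a], [], [], 600)
elapsed = time.monotonic() - start

assert a in readable and a.recv(1) == b"x"
print(f"poll returned in {elapsed * 1000:.3f} ms")  # far below the 600s budget
a.close()
b.close()
```

The timeout here only bounds the pathological case where the peer never sends anything at all, which is exactly the "broken environment, not slow load" failure mode described above.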

Diffusion vs LLM asymmetry

Diffusion stages bypass both inherited vllm timers entirely:

  • vllm_omni/diffusion/stage_diffusion_client.py:37: class StageDiffusionClient: (plain class, no MPClient ancestor)
  • vllm_omni/diffusion/stage_diffusion_proc.py:45: class StageDiffusionProc: (plain class, no EngineCoreProc ancestor)

So VLLM_ENGINE_READY_TIMEOUT_S and HANDSHAKE_TIMEOUT_MINS are only in the LLM path. The asymmetry exists but is operationally invisible because both inherited timers bound near-instant phases.

Implication for the --init-timeout vs --stage-init-timeout discussion

| Timer | What it actually bounds | Slow phase? | LLM | Diffusion |
| --- | --- | --- | --- | --- |
| `--stage-init-timeout` | Model loading inside handshake context | YES | | |
| `--init-timeout` | Outer wrapper around `_initialize_stages` | no (must be ≥ `stage_init_timeout`) | | |
| `HANDSHAKE_TIMEOUT_MINS=5` (vllm hardcoded) | HELLO→INIT round trip | no (ms) | ✅ inherited via `EngineCoreProc.__init__` | |
| `VLLM_ENGINE_READY_TIMEOUT_S` (vllm env var) | Drain `b""` frame from ZMQ socket buffer | no (µs) | ✅ inherited via `MPClient.__init__` | |

There's really one knob that matters for slow model loads: --stage-init-timeout. The other three either bound near-instant phases or are outer wrappers that just need to be wide enough not to fire spuriously. --init-timeout is the only real footgun (because users have to keep it ≥ stage_init_timeout manually, which is the case @yuanheng-zhao described upthread). The inherited vllm timers fire on broken-environment failure modes (subprocess crash, ZMQ misconfig, network partition), not slow loads, so doubling them doesn't help and they don't need to be exposed to users.

This actually strengthens @wuhang2014's "collapse to one knob" position: keep --stage-init-timeout as canonical, drop or deprecate --init-timeout. The inherited vllm timers can stay as they are.

Still outside the scope of this bugfix — the unification belongs in a follow-up after #2519 and #2383 land. But hopefully the corrected timer matrix is useful when someone picks it up.

Signed-off-by: Yuanheng Zhao <jonathan.zhaoyh@gmail.com>
@yuanheng-zhao
Contributor Author

yuanheng-zhao commented Apr 11, 2026

Nightly Test Failures:

Diffusion · Other · Function Test with L4

[2026-04-10T15:14:42Z] torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.05 GiB of which 20.12 MiB is free. Including non-PyTorch memory, this process has 22.02 GiB memory in use. Of the allocated memory 21.80 GiB is allocated by PyTorch, and 10.58 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
...
[2026-04-10T15:29:25Z] ERROR tests/e2e/online_serving/test_flux_2_dev_expansion.py::test_flux_2_dev[parallel_cfg_2] - RuntimeError: Server processes exited with code 1 before becoming ready.

#2010 introduced tests/e2e/online_serving/test_flux_2_dev_expansion.py, but it OOMs on L4; not related to this PR.

Omni Perf Test

FAILED tests/dfx/perf/scripts/run_benchmark.py::test_performance_benchmark[benchmark_params3-omni_server1] - AssertionError: Request failures exist
assert 99 == 100

Tested on main branch 78bef62

pytest -s tests/dfx/perf/scripts/run_benchmark.py::test_performance_benchmark[benchmark_params3-omni_server1]

But I cannot reproduce the error. I suspect this is related to concurrency, since the timeout arg changes in this PR have no effect when the stage and engine initialize successfully.

Omni · Function Test with H100

FAILED tests/e2e/online_serving/test_qwen3_omni_expansion.py::test_video_to_text_audio_001[default] - AssertionError: The output does not contain any of the keywords.
assert False
 +  where False = any(<generator object assert_omni_response.<locals>.<genexpr> at 0x73aaac54fd80>)
FAILED tests/e2e/online_serving/test_qwen3_omni_expansion.py::test_text_audio_to_text_audio_001[async_chunk] - AssertionError: The audio content is not same as the text
assert (0.7265306421035564 is not None and 0.7265306421035564 > 0.9)
 +  where 0.7265306421035564 = OmniResponse(text_content='The audio contains a repeating sequence of the word "test" spoken in a robotic or synthesized voice. The phrase "test test test test test test test" is repeated multiple times with consistent timing and pitch. There are no other discernible sounds, music, or background noise in the recording.', audio_data=None, audio_content=' The audio contains a repeating sequence of the word test, spoken in a robotic or synthesized voice. The phrase test, test, test, test, test, test, test, test, test, test, test, test, test, test, test, test, test, test, test, test, test, test, test, test, test, test, test, test, test, test, test, test, test, test, test, test, test, test, test, test, test, test, test,', audio_format=None, audio_bytes=b'RIFFr\xa3\x0e\x0...
   

Tested and failed on main branch 78bef62

pytest -s tests/e2e/online_serving/test_qwen3_omni_expansion.py::test_video_to_text_audio_001[default] tests/e2e/online_serving/test_qwen3_omni_expansion.py::test_text_audio_to_text_audio_001[async_chunk]

FAILED tests/e2e/online_serving/test_qwen3_omni_expansion.py::test_video_to_text_audio_001[default] - AssertionError: The request failed.
FAILED tests/e2e/online_serving/test_qwen3_omni_expansion.py::test_text_audio_to_text_audio_001[async_chunk] - AssertionError: The request failed.
====================================================== 2 failed, 16 warnings in 444.62s (0:07:24)

Omni · Function Test with L4

[2026-04-10T15:08:53Z] (APIServer pid=180) ERROR 04-10 15:08:53 [async_omni_engine.py:841]   File "/workdir/vllm_omni/engine/stage_engine_core_proc.py", line 182, in _perform_handshake
[2026-04-10T15:08:53Z] (APIServer pid=180) ERROR 04-10 15:08:53 [async_omni_engine.py:841]     identity, msg = _recv(poller, handshake_socket, proc, "READY", handshake_timeout)
[2026-04-10T15:08:53Z] (APIServer pid=180) ERROR 04-10 15:08:53 [async_omni_engine.py:841]                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[2026-04-10T15:08:53Z] (APIServer pid=180) ERROR 04-10 15:08:53 [async_omni_engine.py:841]   File "/workdir/vllm_omni/engine/stage_engine_core_proc.py", line 202, in _recv
[2026-04-10T15:08:53Z] (APIServer pid=180) ERROR 04-10 15:08:53 [async_omni_engine.py:841]     raise TimeoutError(
[2026-04-10T15:08:53Z] (APIServer pid=180) ERROR 04-10 15:08:53 [async_omni_engine.py:841] TimeoutError: Timed out waiting for READY from StageEngineCoreProc after 300s. This typically indicates model loading or initialization is taking too long. Consider increasing `stage_init_timeout` for large models.
...
[2026-04-10T15:22:20Z] ERROR tests/e2e/online_serving/test_dynin_omni_expansion.py::test_send_i2i_request_001[omni_server0] - RuntimeError: Server processes exited with code 1 before becoming ready.
[2026-04-10T15:22:20Z] ERROR tests/e2e/online_serving/test_dynin_omni_expansion.py::test_send_t2i_request_001[omni_server0] - RuntimeError: Server processes exited with code 1 before becoming ready.

This is a stage initialization TimeoutError, which is different from what #2649 reported. Let me try again with an increased stage init timeout.
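The traceback above shows the failing path: the engine polls the handshake socket for a `READY` message from `StageEngineCoreProc` and raises `TimeoutError` once `stage_init_timeout` elapses. A minimal sketch of that receive-with-deadline pattern (hypothetical names, not the actual vllm-omni code; `queue.Queue` stands in for the ZMQ handshake socket):

```python
import queue

def recv_handshake(channel: "queue.Queue[str]", expected: str, timeout_s: float) -> str:
    """Wait up to timeout_s for the expected handshake message (e.g. "READY")."""
    try:
        msg = channel.get(timeout=timeout_s)  # blocks until a message or the deadline
    except queue.Empty:
        # Mirror the actionable error seen in the log: tell the user which
        # knob to turn instead of failing with a bare timeout.
        raise TimeoutError(
            f"Timed out waiting for {expected} from StageEngineCoreProc after "
            f"{timeout_s:.0f}s. Consider increasing `stage_init_timeout` for large models."
        ) from None
    if msg != expected:
        raise RuntimeError(f"Unexpected handshake message: {msg!r}")
    return msg

# Happy path: the stage process has already posted READY.
ch: "queue.Queue[str]" = queue.Queue()
ch.put("READY")
print(recv_handshake(ch, "READY", timeout_s=1.0))

# Timeout path: nothing arrives within the deadline.
try:
    recv_handshake(queue.Queue(), "READY", timeout_s=0.1)
except TimeoutError:
    print("timeout raised")
```

The point of the fix in this PR is that the `timeout_s` here comes from the user-supplied `stage_init_timeout` rather than a hardcoded poll constant.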

Signed-off-by: Yuanheng Zhao <jonathan.zhaoyh@gmail.com>
@yuanheng-zhao
Contributor Author

yuanheng-zhao commented Apr 11, 2026

Latest nightly tests:

Omni · Function Test with L4

Caused by CUDA error: out of memory

[2026-04-11T04:10:18Z] ERROR tests/e2e/online_serving/test_dynin_omni_expansion.py::test_send_i2i_request_001[omni_server0] - RuntimeError: Server processes exited with code 1 before becoming ready.
[2026-04-11T04:10:18Z] ERROR tests/e2e/online_serving/test_dynin_omni_expansion.py::test_send_t2i_request_001[omni_server0] - RuntimeError: Server processes exited with code 1 before becoming ready.

Diffusion · Other · Function Test with L4

[2026-04-11T04:18:30Z] ERROR tests/e2e/online_serving/test_flux_2_dev_expansion.py::test_flux_2_dev[parallel_cfg_2] - RuntimeError: Server processes exited with code 1 before becoming ready.

which are not related to this PR's changes. All other nightly tests passed.


@lishunyang12 lishunyang12 enabled auto-merge (squash) April 11, 2026 07:47
@hsliuustc0106
Collaborator

#2519 (comment) @nuclearwu, can you help fix the first nightly CI breakdown (from #2010)?

@hsliuustc0106 hsliuustc0106 disabled auto-merge April 11, 2026 16:48
@hsliuustc0106 hsliuustc0106 merged commit c20cac8 into vllm-project:main Apr 11, 2026
7 of 8 checks passed
daixinning pushed a commit to daixinning/vllm-omni that referenced this pull request Apr 13, 2026
…#2519)

Signed-off-by: yuanheng <jonathan.zhaoyh@gmail.com>
Signed-off-by: Yuanheng Zhao <jonathan.zhaoyh@gmail.com>
Co-authored-by: Hongsheng Liu <liuhongsheng4@huawei.com>
Co-authored-by: SYLAR <125541396+lishunyang12@users.noreply.github.com>

Labels

nightly-test: label to trigger buildkite nightly test CI
ready: label to trigger buildkite CI

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug]: Diffusion stage initialization fails with TimeoutError and user stage_init_timeout is ignored

8 participants