[Bugfix] Restore user config/runtime stage init timeout #2519
hsliuustc0106 merged 20 commits into vllm-project:main
Conversation
Signed-off-by: yuanheng <jonathan.zhaoyh@gmail.com>
@wuhang2014 @chickeyton PTAL
hsliuustc0106
left a comment
Bugfix for stage_init_timeout propagation. Clear purpose and good error messages. Missing regression test.
```python
        help="Enable logging of diffusion pipeline stats.",
    )
    parser.add_argument(
        "--init-timeout",
```
Two timeout args: --init-timeout and --stage-init-timeout. Should this be a single arg? init_timeout is not used elsewhere in the engine.
This is from the previous orchestrator and stage design: `init_timeout` is the orchestrator timeout (which can be considered the overall timeout), while `stage_init_timeout` is the timeout for any single stage. I just kept it as is.
It might require further optimization/cleanup, though. For now, in test cases that require longer init time, users have to set both `init_timeout` and `stage_init_timeout`.
> Two timeout args: `--init-timeout` and `--stage-init-timeout`. Should this be a single arg? `init_timeout` is not used elsewhere in the engine.

`--init-timeout` and `--stage-init-timeout` are not orthogonal; one is enough in my opinion.
> > Two timeout args: `--init-timeout` and `--stage-init-timeout`. Should this be a single arg? `init_timeout` is not used elsewhere in the engine.
>
> `--init-timeout` and `--stage-init-timeout` are not orthogonal; one is enough in my opinion.

Agree.
Shall we discuss and improve this design in follow-up commits, so that we keep the bugfix in the current PR separate from further improvement/cleanup of the args? (This might relate to @lishunyang12's incoming config refactor as well.)
(Both args existed before the previous stage client refactor, but `--stage-init-timeout` stopped working after that refactor.)
cc @lishunyang12, will your refactor cover `--init-timeout` and `--stage-init-timeout`?
@hsliuustc0106 No — #2383 is scoped to the deploy YAML schema (pipeline.yaml + deploy/<model>.yaml) and how CLI args merge into it. --init-timeout and --stage-init-timeout get popped at OmniBase.__init__ / AsyncOmniEngine.__init__ as runtime knobs and never reach the stage-config factory, so they flow through unchanged.
Agree they should be unified though. Happy to take it as a follow-up after #2383 merges — pick one canonical name (stage_init_timeout reads more accurate; it's the per-stage budget) and deprecate the other with a DeprecationWarning. Want me to file an issue?
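If that follow-up happens, the deprecation shim could look roughly like this minimal `argparse` sketch. All names and defaults here are illustrative, not the actual vllm-omni CLI:

```python
import argparse
import warnings


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical sketch: --stage-init-timeout is canonical; --init-timeout
    # survives only as a deprecated alias for a release cycle.
    parser = argparse.ArgumentParser()
    parser.add_argument("--stage-init-timeout", type=int, default=300,
                        help="Per-stage initialization budget in seconds.")
    parser.add_argument("--init-timeout", type=int, default=None,
                        help="Deprecated alias for --stage-init-timeout.")
    return parser


def resolve_timeout(args: argparse.Namespace) -> int:
    # Honor the deprecated alias if given, but warn loudly.
    if args.init_timeout is not None:
        warnings.warn("--init-timeout is deprecated; "
                      "use --stage-init-timeout instead.",
                      DeprecationWarning, stacklevel=2)
        return args.init_timeout
    return args.stage_init_timeout
```

With this shim, `--init-timeout 1200` keeps working but emits a `DeprecationWarning`, while new scripts only ever set `--stage-init-timeout`.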
wuhang2014
left a comment
I'm wondering how the timeout mechanism works in the LLM stage. Do they share a unified handling logic?
```python
def complete_diffusion_handshake(
    proc: BaseProcess,
    handshake_address: str,
    handshake_timeout: int = 600,
```
It's better to have no default value for this param.
If a runtime stage init timeout is not provided, what default value shall we use? Neither `None` nor `0` seems suitable here.
> If a runtime stage init timeout is not provided, what default value shall we use? Neither `None` nor `0` seems suitable here.

The default value should live in the CLI args for online serving.
Done. Default values for these functions have been removed. Defaults are now kept consistent at the higher-level entrypoints:

- `serve.py`: `--stage-init-timeout` default=300 (online serving CLI)
- `async_omni_engine.py`: `AsyncOmniEngine.__init__(stage_init_timeout: int = 300)` (engine API)
- `omni_base.py`: `kwargs.pop("stage_init_timeout", 300)` (kept intact)
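A minimal sketch of that layering, with illustrative names standing in for the real `serve.py` / `AsyncOmniEngine` / `OmniBase` pieces: the inner helper takes the timeout as a required parameter, and the entrypoint supplies the shared 300 s default.

```python
DEFAULT_STAGE_INIT_TIMEOUT = 300  # assumed shared default, mirroring the PR


def complete_stage_handshake(handshake_timeout: int) -> int:
    # Inner helper: no default value, callers must pass the resolved timeout.
    return handshake_timeout


class OmniBaseSketch:
    """Hypothetical engine-level entrypoint: pops the knob with the default."""

    def __init__(self, **kwargs):
        self.stage_init_timeout = kwargs.pop(
            "stage_init_timeout", DEFAULT_STAGE_INIT_TIMEOUT)
```

The benefit of this shape is that there is exactly one place per entrypoint where the default can drift, instead of one per helper function.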
Hey @wuhang2014, for the LLM stage, the timeout passing flow is as follows:
Could the diffusion side use the same logic as the LLM stage?
We could, but to approach this we would need to refactor the current stage clients again, which might be too heavy just to fix a timeout error. FYI, the current flow of diffusion stage timeout passing (as compared to that of the LLM stage):
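As a rough mental model of what the restored pass-through should achieve (this is not the actual vllm-omni code): the configured `stage_init_timeout` must reach the handshake's poll loop, rather than a hardcoded module-level constant.

```python
import time


def wait_for_ready(poll_once, stage_init_timeout: float) -> str:
    # Hypothetical handshake wait loop: the *configured* timeout bounds the
    # whole READY wait, instead of a fixed _HANDSHAKE_POLL_TIMEOUT_S constant.
    deadline = time.monotonic() + stage_init_timeout
    while time.monotonic() < deadline:
        msg = poll_once()  # returns a message string or None
        if msg == "READY":
            return msg
        time.sleep(0.01)
    raise TimeoutError(
        f"Timed out waiting for READY after {stage_init_timeout}s. "
        "Consider increasing `stage_init_timeout` for large models.")
```

The bug this PR fixes is essentially the case where the value passed in here was silently ignored in favor of a constant.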
cc @chickeyton
```python
    model: str,
    od_config: OmniDiffusionConfig,
    metadata: StageMetadata,
    handshake_timeout: int,
```
Should it be named `stage_init_timeout` as well? There may be other lengthy initialization processes in the future.
Can you both add or remove the `stage_init_timeout` configuration in the following code in `tests/conftest.py`? I think your change will cause the startup time for most test cases to actually become shorter.

We need @hsliuustc0106 to approve it.
@yenuo26 I think this PR only restores the functionality of `stage-init-timeout`; it won't make startup time for test cases shorter.
Since we have passed in the corresponding `stage_init_timeout` here, once you fix this issue and the timeout value takes effect, would the timeout period change from 600 to a lower value based on the passed-in argument? If so, should we update these places accordingly?
^ Yes, that's correct. The TimeoutError triggers earlier for tests that exercise the timeout/shutdown/cleanup mechanism, since we set `stage-init-timeout` to a lower value. However, for a regular test (where the stage initializes successfully within the timeout), the launch time of the engine/stage client should stay the same.
Therefore, can you also remove `--stage-init-timeout` from the above code in this PR? Otherwise, it will most likely cause the test cases in CI to time out.
Got it. Removed the extra stage init timeout.

Wait a sec, let me update the other test conftests as well.
Regression tests of timeout passing added for both paths (the diffusion stage client and the LLM stage client):

```
pytest -s tests/engine/test_async_omni_engine_stage_init.py
# ...
# 4 passed, 16 warnings in 0.64s
```
In this commit 8cbb78b, I made the following changes:
If we clean this up post-merge, I'd suggest matching upstream's vocabulary instead of collapsing — e.g. rename to --stage-handshake-timeout and --engine-ready-timeout (or re-export VLLM_ENGINE_READY_TIMEOUT_S as the env var). Keeps the two-knob model upstream validated and makes the relationship obvious. Out of scope for this bugfix — happy to take as a follow-up after #2519 and #2383 land.
lishunyang12
left a comment
Correction to my earlier review. I claimed VLLM_ENGINE_READY_TIMEOUT_S is "not in vllm-omni's stage init path" / a "silent dead letter." That was wrong — it is in the path, via inheritance.
The corrected evidence chain
`vllm_omni/engine/stage_engine_core_client.py:362`:

```python
class StageEngineCoreClient(StageEngineCoreClientBase, AsyncMPClient):
```

MRO: `StageEngineCoreClient` → `StageEngineCoreClientBase` → `AsyncMPClient` → `MPClient`. `StageEngineCoreClientBase.__init__` (line 131) calls `super().__init__()`, which lands in `MPClient.__init__` (`vllm/v1/engine/core_client.py:458`). That method unconditionally runs the poll loop at lines 537–549:

```python
# vllm/v1/engine/core_client.py:537-549
while identities:
    if not sync_input_socket.poll(timeout=VLLM_ENGINE_READY_TIMEOUT_S * 1000):
        raise TimeoutError("Timed out waiting for engines to send "
                           "initial message on input socket.")
```

So `VLLM_ENGINE_READY_TIMEOUT_S` IS consulted during LLM stage init, via `_attach_llm_stage` → `make_async_mp_client` → `super().__init__()` → `MPClient.__init__`.
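The inheritance claim is easy to verify with a toy hierarchy that mirrors the shape of that MRO (the class names below are stand-ins, not the real vllm/vllm-omni classes):

```python
class MPClientLike:
    def __init__(self):
        # Stands in for MPClient.__init__ and its readiness poll loop.
        self.ready_poll_ran = True


class AsyncMPClientLike(MPClientLike):
    pass


class StageClientBaseLike:
    def __init__(self):
        # Cooperative call: continues along the *instance's* MRO,
        # not straight to object, so it reaches MPClientLike.__init__.
        super().__init__()


class StageClientLike(StageClientBaseLike, AsyncMPClientLike):
    pass


names = [c.__name__ for c in StageClientLike.__mro__]
assert names == ["StageClientLike", "StageClientBaseLike",
                 "AsyncMPClientLike", "MPClientLike", "object"]
assert StageClientLike().ready_poll_ran  # the "poll loop" did run
```

This is exactly why the base class's `super().__init__()` unconditionally triggers the inherited poll loop, even though `StageClientBaseLike` does not itself inherit from the client class.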
What it actually bounds (which matters more than whether it's in the path)
It is not equivalent to --stage-init-timeout. They run sequentially and bound very different phases.
What `VLLM_ENGINE_READY_TIMEOUT_S` bounds is the wait for a single empty `b""` frame that the subprocess sends as a ZMQ DEALER↔ROUTER identity registration. From `vllm/v1/engine/core.py:1233-1237`, inside `process_input_sockets`:

```python
for input_socket in input_sockets:
    # Send initial message to each input socket - this is required
    # before the front-end ROUTER socket can send input messages
    # back to us.
    input_socket.send(b"")
    poller.register(input_socket, zmq.POLLIN)
```

The registration frame has no payload. It exists purely to satisfy ZMQ's pattern requirement: a ROUTER cannot route messages back to a DEALER until the DEALER has first sent something, so the ROUTER can learn its identity.
The subprocess pushes this frame onto the wire as soon as `process_input_sockets` runs — which happens after model load completes but before the handshake context manager exits with READY. So by the time vllm-omni's `_attach_llm_stage` reaches `MPClient.__init__` and polls the input socket, the frame has already been sitting in the kernel buffer for some time. `poll()` returns near-instantly with the buffered frame. `VLLM_ENGINE_READY_TIMEOUT_S=600` is 600 seconds of headroom on a phase that takes microseconds.
So setting VLLM_ENGINE_READY_TIMEOUT_S=1200 to fix a slow model load is meaningless — that env var doesn't bound model loading. Only --stage-init-timeout does. The 600s default upstream is "longer than any plausible failure mode for the ZMQ identity exchange," not "long enough for a 20-minute model load."
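A stdlib sketch of why that poll is effectively free: once the peer's frame is already in the kernel buffer, even a 600 s budget is consumed in microseconds. A one-byte payload stands in here for ZMQ's empty registration frame, since a raw socket cannot carry a zero-length message:

```python
import selectors
import socket
import time

a, b = socket.socketpair()
b.sendall(b"\x00")  # stand-in for the ZMQ identity-registration frame

sel = selectors.DefaultSelector()
sel.register(a, selectors.EVENT_READ)

start = time.monotonic()
events = sel.select(timeout=600)  # generous budget, like VLLM_ENGINE_READY_TIMEOUT_S
elapsed = time.monotonic() - start

assert events          # the frame was already buffered before we polled
assert elapsed < 1.0   # the 600 s budget is barely touched
a.close()
b.close()
```

The timeout only ever fires when the peer never sent anything at all, i.e. a broken-environment failure, not a slow model load.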
Diffusion vs LLM asymmetry
Diffusion stages bypass both inherited vllm timers entirely:
- `vllm_omni/diffusion/stage_diffusion_client.py:37` — `class StageDiffusionClient:` (plain class, no `MPClient` ancestor)
- `vllm_omni/diffusion/stage_diffusion_proc.py:45` — `class StageDiffusionProc:` (plain class, no `EngineCoreProc` ancestor)
So VLLM_ENGINE_READY_TIMEOUT_S and HANDSHAKE_TIMEOUT_MINS are only in the LLM path. The asymmetry exists but is operationally invisible because both inherited timers bound near-instant phases.
Implication for the --init-timeout vs --stage-init-timeout discussion
| Timer | What it actually bounds | Slow phase? | LLM | Diffusion |
|---|---|---|---|---|
| `--stage-init-timeout` | Model loading inside handshake context | YES | ✅ | ✅ |
| `--init-timeout` | Outer wrapper around `_initialize_stages` | no — must be ≥ `stage_init_timeout` | ✅ | ✅ |
| `HANDSHAKE_TIMEOUT_MINS=5` (vllm hardcoded) | HELLO→INIT round trip | no — ms | ✅ inherited via `EngineCoreProc.__init__` | ❌ |
| `VLLM_ENGINE_READY_TIMEOUT_S` (vllm env var) | Drain `b""` frame from ZMQ socket buffer | no — µs | ✅ inherited via `MPClient.__init__` | ❌ |
There's really one knob that matters for slow model loads — --stage-init-timeout. The other three either bound near-instant phases or are outer wrappers that just need to be wide enough not to fire spuriously. --init-timeout is the only real footgun (because users have to keep it ≥ stage_init_timeout manually, which is the case @yuanheng-zhao described upthread). The inherited vllm timers fire on broken-environment failure modes (subprocess crash, ZMQ misconfig, network partition), not slow loads, so doubling them doesn't help and they don't need to be exposed to users.
This actually strengthens @wuhang2014's "collapse to one knob" position: keep --stage-init-timeout as canonical, drop or deprecate --init-timeout. The inherited vllm timers can stay as they are.
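Until the knobs are unified, the footgun could be guarded with a small validation at engine init. This is a hypothetical sketch with illustrative names, not existing vllm-omni code:

```python
import warnings


def resolve_init_budgets(init_timeout: int,
                         stage_init_timeout: int) -> tuple[int, int]:
    # Hypothetical guard: the outer budget must be at least the per-stage
    # budget, otherwise the outer timer fires spuriously during a legal
    # long stage init. Widen it instead of letting users hit that.
    if init_timeout < stage_init_timeout:
        warnings.warn(
            f"init_timeout={init_timeout} < stage_init_timeout="
            f"{stage_init_timeout}; raising init_timeout to match.",
            stacklevel=2)
        init_timeout = stage_init_timeout
    return init_timeout, stage_init_timeout
```

With this in place, a user who only bumps `--stage-init-timeout 1200` no longer has to remember to bump `--init-timeout` too.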
Still outside the scope of this bugfix — the unification belongs in a follow-up after #2519 and #2383 land. But hopefully the corrected timer matrix is useful when someone picks it up.
Signed-off-by: Yuanheng Zhao <jonathan.zhaoyh@gmail.com>
Nightly Tests Failures

Diffusion · Other · Function Test with L4:

```
[2026-04-10T15:14:42Z] torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.05 GiB of which 20.12 MiB is free. Including non-PyTorch memory, this process has 22.02 GiB memory in use. Of the allocated memory 21.80 GiB is allocated by PyTorch, and 10.58 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
...
[2026-04-10T15:29:25Z] ERROR tests/e2e/online_serving/test_flux_2_dev_expansion.py::test_flux_2_dev[parallel_cfg_2] - RuntimeError: Server processes exited with code 1 before becoming ready.
```

#2010 introduces Omni Perf Test:

```
FAILED tests/dfx/perf/scripts/run_benchmark.py::test_performance_benchmark[benchmark_params3-omni_server1] - AssertionError: Request failures exist
assert 99 == 100
```

Tested on `pytest -s tests/dfx/perf/scripts/run_benchmark.py::test_performance_benchmark[benchmark_params3-omni_server1]` but cannot reproduce the error. I suspect this is related to concurrency, as the timeout arg changes in this PR have no effect when the stage and engine initialize successfully.

Omni · Function Test with H100:

```
FAILED tests/e2e/online_serving/test_qwen3_omni_expansion.py::test_video_to_text_audio_001[default] - AssertionError: The output does not contain any of the keywords.
assert False
 +  where False = any(<generator object assert_omni_response.<locals>.<genexpr> at 0x73aaac54fd80>)
FAILED tests/e2e/online_serving/test_qwen3_omni_expansion.py::test_text_audio_to_text_audio_001[async_chunk] - AssertionError: The audio content is not same as the text
assert (0.7265306421035564 is not None and 0.7265306421035564 > 0.9)
 +  where 0.7265306421035564 = OmniResponse(text_content='The audio contains a repeating sequence of the word "test" spoken in a robotic or synthesized voice. The phrase "test test test test test test test" is repeated multiple times with consistent timing and pitch. There are no other discernible sounds, music, or background noise in the recording.', audio_data=None, audio_content=' The audio contains a repeating sequence of the word test, spoken in a robotic or synthesized voice. The phrase test, test, test, ...', audio_format=None, audio_bytes=b'RIFFr\xa3\x0e\x0...
```

Tested locally and failed on `pytest -s tests/e2e/online_serving/test_qwen3_omni_expansion.py::test_video_to_text_audio_001[default] tests/e2e/online_serving/test_qwen3_omni_expansion.py::test_text_audio_to_text_audio_001[async_chunk]`:

```
FAILED tests/e2e/online_serving/test_qwen3_omni_expansion.py::test_video_to_text_audio_001[default] - AssertionError: The request failed.
FAILED tests/e2e/online_serving/test_qwen3_omni_expansion.py::test_text_audio_to_text_audio_001[async_chunk] - AssertionError: The request failed.
====================================================== 2 failed, 16 warnings in 444.62s (0:07:24)
```

Omni · Function Test with L4:

```
[2026-04-10T15:08:53Z] (APIServer pid=180) ERROR 04-10 15:08:53 [async_omni_engine.py:841]   File "/workdir/vllm_omni/engine/stage_engine_core_proc.py", line 182, in _perform_handshake
[2026-04-10T15:08:53Z] (APIServer pid=180) ERROR 04-10 15:08:53 [async_omni_engine.py:841]     identity, msg = _recv(poller, handshake_socket, proc, "READY", handshake_timeout)
[2026-04-10T15:08:53Z] (APIServer pid=180) ERROR 04-10 15:08:53 [async_omni_engine.py:841]     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[2026-04-10T15:08:53Z] (APIServer pid=180) ERROR 04-10 15:08:53 [async_omni_engine.py:841]   File "/workdir/vllm_omni/engine/stage_engine_core_proc.py", line 202, in _recv
[2026-04-10T15:08:53Z] (APIServer pid=180) ERROR 04-10 15:08:53 [async_omni_engine.py:841]     raise TimeoutError(
[2026-04-10T15:08:53Z] (APIServer pid=180) ERROR 04-10 15:08:53 [async_omni_engine.py:841] TimeoutError: Timed out waiting for READY from StageEngineCoreProc after 300s. This typically indicates model loading or initialization is taking too long. Consider increasing `stage_init_timeout` for large models.
...
[2026-04-10T15:22:20Z] ERROR tests/e2e/online_serving/test_dynin_omni_expansion.py::test_send_i2i_request_001[omni_server0] - RuntimeError: Server processes exited with code 1 before becoming ready.
[2026-04-10T15:22:20Z] ERROR tests/e2e/online_serving/test_dynin_omni_expansion.py::test_send_t2i_request_001[omni_server0] - RuntimeError: Server processes exited with code 1 before becoming ready.
```

This is a stage initialization timeout.
#2519 (comment) @nuclearwu, can you help fix the first nightly CI breakdown? It's from #2010.
Signed-off-by: yuanheng <jonathan.zhaoyh@gmail.com>
Signed-off-by: Yuanheng Zhao <jonathan.zhaoyh@gmail.com>
Co-authored-by: Hongsheng Liu <liuhongsheng4@huawei.com>
Co-authored-by: SYLAR <125541396+lishunyang12@users.noreply.github.com>

PLEASE FILL IN THE PR DESCRIPTION HERE ENSURING ALL CHECKLIST ITEMS (AT THE BOTTOM) HAVE BEEN CONSIDERED.
Purpose
Resolves #2518
This PR restores the pass-through behavior of `stage_init_timeout` from before the refactoring of `StageDiffusionClient`. Specifically:

- `_HANDSHAKE_POLL_TIMEOUT_S` in `initialize_diffusion_stage` in `AsyncOmniEngine._initialize_stages`. Fixed.
- `stage_init_timeout` is used in `acquire_device_locks` but not in `complete_stage_handshake`. Fixed.

Defaults of stage init timeout are kept as:

- `--stage-init-timeout` default=300 (online serving CLI)
- `AsyncOmniEngine.__init__(stage_init_timeout: int = 300)` (engine API)
- `kwargs.pop("stage_init_timeout", 300)`

Test Plan
DiT offline example test (text-to-image) and AR end-to-end example, applying `stage_init_timeout`.

DiT:

```shell
python examples/offline_inference/text_to_image/text_to_image.py \
    --model stepfun-ai/NextStep-1.1 \
    --prompt "A baby panda wearing an Iron Man mask, holding a board with 'NextStep-1' written on it" \
    --height 512 \
    --width 512 \
    --num-inference-steps 28 \
    --guidance-scale 7.5 \
    --guidance-scale-2 1.0 \
    --cfg-schedule constant \
    --seed 42 \
    --output output_nextstep_layerwise.png \
    --enable-layerwise-offload \
    --stage-init-timeout 1200
```

^^ This is a test case from #2339.
Test Result

DiT

Before,

After, `TimeoutError` resolved.

AR

Before,

After, `TimeoutError` resolved.

For nightly test failures, check #2519 (comment).
Essential Elements of an Effective PR Description Checklist
Update `supported_models.md` and `examples` for a new model. Please run `mkdocs serve` to sync the documentation editions to `./docs`.