[Bugfix] Restore user config/runtime stage init timeout #2519
hsliuustc0106 merged 20 commits into vllm-project:main
Conversation
Signed-off-by: yuanheng <jonathan.zhaoyh@gmail.com>
@wuhang2014 @chickeyton PTAL
hsliuustc0106
left a comment
Bugfix for stage_init_timeout propagation. Clear purpose and good error messages. Missing regression test.
```python
        help="Enable logging of diffusion pipeline stats.",
    )
    parser.add_argument(
        "--init-timeout",
```
Two timeout args: --init-timeout and --stage-init-timeout. Should this be a single arg? init_timeout is not used elsewhere in the engine.
This is from the previous orchestrator and stage design: `init_timeout` is the orchestrator timeout (which can be considered the overall timeout), while `stage_init_timeout` is the timeout for any single stage. I just kept it as is.
It might require further optimization/cleanup, though. For now, in test cases that require longer init time, users have to set both `init_timeout` and `stage_init_timeout`.
> Two timeout args: `--init-timeout` and `--stage-init-timeout`. Should this be a single arg? `init_timeout` is not used elsewhere in the engine.

`--init-timeout` and `--stage-init-timeout` are not orthogonal; one is enough in my opinion.
> > Two timeout args: `--init-timeout` and `--stage-init-timeout`. Should this be a single arg? `init_timeout` is not used elsewhere in the engine.
>
> `--init-timeout` and `--stage-init-timeout` are not orthogonal; one is enough in my opinion.

Agree.
Shall we discuss and improve this design in follow-up commits, so that we keep the bugfix in the current PR separate from further improvement/cleanup of the args? (This might relate to @lishunyang12's incoming config refactor as well.)
(Both args existed before the previous stage client refactor, but `--stage-init-timeout` stopped working after that refactor.)
cc @lishunyang12, will your refactor cover `--init-timeout` and `--stage-init-timeout`?
@hsliuustc0106 No — #2383 is scoped to the deploy YAML schema (pipeline.yaml + deploy/<model>.yaml) and how CLI args merge into it. --init-timeout and --stage-init-timeout get popped at OmniBase.__init__ / AsyncOmniEngine.__init__ as runtime knobs and never reach the stage-config factory, so they flow through unchanged.
Agree they should be unified though. Happy to take it as a follow-up after #2383 merges — pick one canonical name (stage_init_timeout reads more accurate; it's the per-stage budget) and deprecate the other with a DeprecationWarning. Want me to file an issue?
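If that follow-up happens, the deprecation shim could look roughly like this minimal `argparse` sketch. All names and defaults here are illustrative, not the actual vllm-omni CLI:

```python
import argparse
import warnings


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical sketch: --stage-init-timeout is canonical; --init-timeout
    # survives only as a deprecated alias for a release cycle.
    parser = argparse.ArgumentParser()
    parser.add_argument("--stage-init-timeout", type=int, default=300,
                        help="Per-stage initialization budget in seconds.")
    parser.add_argument("--init-timeout", type=int, default=None,
                        help="Deprecated alias for --stage-init-timeout.")
    return parser


def resolve_timeout(args: argparse.Namespace) -> int:
    # Honor the deprecated alias if given, but warn loudly.
    if args.init_timeout is not None:
        warnings.warn("--init-timeout is deprecated; "
                      "use --stage-init-timeout instead.",
                      DeprecationWarning, stacklevel=2)
        return args.init_timeout
    return args.stage_init_timeout
```

With this shim, `--init-timeout 1200` keeps working but emits a `DeprecationWarning`, while new scripts only ever set `--stage-init-timeout`.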
wuhang2014
left a comment
I'm wondering how the timeout mechanism works in the LLM stage. Do they share a unified handling logic?
```python
def complete_diffusion_handshake(
    proc: BaseProcess,
    handshake_address: str,
    handshake_timeout: int = 600,
```
It's better to have no default value for this param.
If a runtime stage init timeout is not provided, what default value shall we use? Neither `None` nor `0` seems suitable here.
> If a runtime stage init timeout is not provided, what default value shall we use? Neither `None` nor `0` seems suitable here.

The default value should live in the CLI args for online serving.
Done. Default values for these functions have been removed. Defaults are now kept consistent at the higher-level entrypoints:

- `serve.py`: `--stage-init-timeout` default=300 (online serving CLI)
- `async_omni_engine.py`: `AsyncOmniEngine.__init__(stage_init_timeout: int = 300)` (engine API)
- `omni_base.py`: `kwargs.pop("stage_init_timeout", 300)` (kept intact)
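A minimal sketch of that layering, with illustrative names standing in for the real `serve.py` / `AsyncOmniEngine` / `OmniBase` pieces: the inner helper takes the timeout as a required parameter, and the entrypoint supplies the shared 300 s default.

```python
DEFAULT_STAGE_INIT_TIMEOUT = 300  # assumed shared default, mirroring the PR


def complete_stage_handshake(handshake_timeout: int) -> int:
    # Inner helper: no default value, callers must pass the resolved timeout.
    return handshake_timeout


class OmniBaseSketch:
    """Hypothetical engine-level entrypoint: pops the knob with the default."""

    def __init__(self, **kwargs):
        self.stage_init_timeout = kwargs.pop(
            "stage_init_timeout", DEFAULT_STAGE_INIT_TIMEOUT)
```

The benefit of this shape is that there is exactly one place per entrypoint where the default can drift, instead of one per helper function.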
Hey @wuhang2014, for the LLM stage, the timeout passing flow is as follows:
Could the diffusion side use the same logic as the LLM stage?
We could, but to approach this we would need to refactor the current stage clients again, which might be too heavy just to fix a timeout error. FYI, the current flow of diffusion stage timeout passing (as compared to that of the LLM stage):
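As a rough mental model of what the restored pass-through should achieve (this is not the actual vllm-omni code): the configured `stage_init_timeout` must reach the handshake's poll loop, rather than a hardcoded module-level constant.

```python
import time


def wait_for_ready(poll_once, stage_init_timeout: float) -> str:
    # Hypothetical handshake wait loop: the *configured* timeout bounds the
    # whole READY wait, instead of a fixed _HANDSHAKE_POLL_TIMEOUT_S constant.
    deadline = time.monotonic() + stage_init_timeout
    while time.monotonic() < deadline:
        msg = poll_once()  # returns a message string or None
        if msg == "READY":
            return msg
        time.sleep(0.01)
    raise TimeoutError(
        f"Timed out waiting for READY after {stage_init_timeout}s. "
        "Consider increasing `stage_init_timeout` for large models.")
```

The bug this PR fixes is essentially the case where the value passed in here was silently ignored in favor of a constant.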
cc @chickeyton
```python
    model: str,
    od_config: OmniDiffusionConfig,
    metadata: StageMetadata,
    handshake_timeout: int,
```
Should it be named `stage_init_timeout` as well? There may be other lengthy initialization processes in the future.
Can you both add or remove the `stage_init_timeout` configuration in the following code in `tests/conftest.py`? I think your change will cause the startup time for most test cases to actually become shorter.

We need @hsliuustc0106 to approve it.
@yenuo26 I think this PR only restores the functionality of `stage-init-timeout`; it won't make startup time for test cases shorter.
Since we have passed in the corresponding `stage_init_timeout` here, once you fix this issue and the timeout value takes effect, would the timeout period change from 600 to a lower value based on the passed-in argument? If so, should we update these places accordingly?
^ Yes, that's correct. The TimeoutError triggers earlier for tests that exercise the timeout/shutdown/cleanup mechanism, since we set `stage-init-timeout` to a lower value. However, for a regular test (where the stage initializes successfully within the timeout), the launch time of the engine/stage client should stay the same.
Therefore, can you also remove `--stage-init-timeout` from the above code in this PR? Otherwise, it will most likely cause the test cases in CI to time out.
Got it. Removed the extra stage init timeout.

Wait a sec, let me update the other test conftests as well.
Regression tests of timeout passing added for both paths (the diffusion stage client and the LLM stage client):

```
pytest -s tests/engine/test_async_omni_engine_stage_init.py
# ...
# 4 passed, 16 warnings in 0.64s
```
In this commit 8cbb78b, I made the following changes:
If we clean this up post-merge, I'd suggest matching upstream's vocabulary instead of collapsing — e.g. rename to --stage-handshake-timeout and --engine-ready-timeout (or re-export VLLM_ENGINE_READY_TIMEOUT_S as the env var). Keeps the two-knob model upstream validated and makes the relationship obvious. Out of scope for this bugfix — happy to take as a follow-up after #2519 and #2383 land.
lishunyang12
left a comment
Correction to my earlier review. I claimed VLLM_ENGINE_READY_TIMEOUT_S is "not in vllm-omni's stage init path" / a "silent dead letter." That was wrong — it is in the path, via inheritance.
The corrected evidence chain
`vllm_omni/engine/stage_engine_core_client.py:362`:

```python
class StageEngineCoreClient(StageEngineCoreClientBase, AsyncMPClient):
```

MRO: `StageEngineCoreClient` → `StageEngineCoreClientBase` → `AsyncMPClient` → `MPClient`. `StageEngineCoreClientBase.__init__` (line 131) calls `super().__init__()`, which lands in `MPClient.__init__` (`vllm/v1/engine/core_client.py:458`). That method unconditionally runs the poll loop at lines 537–549:

```python
# vllm/v1/engine/core_client.py:537-549
while identities:
    if not sync_input_socket.poll(timeout=VLLM_ENGINE_READY_TIMEOUT_S * 1000):
        raise TimeoutError("Timed out waiting for engines to send "
                           "initial message on input socket.")
```

So `VLLM_ENGINE_READY_TIMEOUT_S` IS consulted during LLM stage init, via `_attach_llm_stage` → `make_async_mp_client` → `super().__init__()` → `MPClient.__init__`.
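The inheritance claim is easy to verify with a toy hierarchy that mirrors the shape of that MRO (the class names below are stand-ins, not the real vllm/vllm-omni classes):

```python
class MPClientLike:
    def __init__(self):
        # Stands in for MPClient.__init__ and its readiness poll loop.
        self.ready_poll_ran = True


class AsyncMPClientLike(MPClientLike):
    pass


class StageClientBaseLike:
    def __init__(self):
        # Cooperative call: continues along the *instance's* MRO,
        # not straight to object, so it reaches MPClientLike.__init__.
        super().__init__()


class StageClientLike(StageClientBaseLike, AsyncMPClientLike):
    pass


names = [c.__name__ for c in StageClientLike.__mro__]
assert names == ["StageClientLike", "StageClientBaseLike",
                 "AsyncMPClientLike", "MPClientLike", "object"]
assert StageClientLike().ready_poll_ran  # the "poll loop" did run
```

This is exactly why the base class's `super().__init__()` unconditionally triggers the inherited poll loop, even though `StageClientBaseLike` does not itself inherit from the client class.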
What it actually bounds (which matters more than whether it's in the path)
It is not equivalent to --stage-init-timeout. They run sequentially and bound very different phases.
What `VLLM_ENGINE_READY_TIMEOUT_S` bounds is the wait for a single empty `b""` frame that the subprocess sends as a ZMQ DEALER↔ROUTER identity registration. From `vllm/v1/engine/core.py:1233-1237`, inside `process_input_sockets`:

```python
for input_socket in input_sockets:
    # Send initial message to each input socket - this is required
    # before the front-end ROUTER socket can send input messages
    # back to us.
    input_socket.send(b"")
    poller.register(input_socket, zmq.POLLIN)
```

The registration frame has no payload. It exists purely to satisfy ZMQ's pattern requirement: a ROUTER cannot route messages back to a DEALER until the DEALER has first sent something, so the ROUTER can learn its identity.
The subprocess pushes this frame onto the wire as soon as `process_input_sockets` runs — which happens after model load completes but before the handshake context manager exits with READY. So by the time vllm-omni's `_attach_llm_stage` reaches `MPClient.__init__` and polls the input socket, the frame has already been sitting in the kernel buffer for some time. `poll()` returns near-instantly with the buffered frame. `VLLM_ENGINE_READY_TIMEOUT_S=600` is 600 seconds of headroom on a phase that takes microseconds.
So setting VLLM_ENGINE_READY_TIMEOUT_S=1200 to fix a slow model load is meaningless — that env var doesn't bound model loading. Only --stage-init-timeout does. The 600s default upstream is "longer than any plausible failure mode for the ZMQ identity exchange," not "long enough for a 20-minute model load."
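A stdlib sketch of why that poll is effectively free: once the peer's frame is already in the kernel buffer, even a 600 s budget is consumed in microseconds. A one-byte payload stands in here for ZMQ's empty registration frame, since a raw socket cannot carry a zero-length message:

```python
import selectors
import socket
import time

a, b = socket.socketpair()
b.sendall(b"\x00")  # stand-in for the ZMQ identity-registration frame

sel = selectors.DefaultSelector()
sel.register(a, selectors.EVENT_READ)

start = time.monotonic()
events = sel.select(timeout=600)  # generous budget, like VLLM_ENGINE_READY_TIMEOUT_S
elapsed = time.monotonic() - start

assert events          # the frame was already buffered before we polled
assert elapsed < 1.0   # the 600 s budget is barely touched
a.close()
b.close()
```

The timeout only ever fires when the peer never sent anything at all, i.e. a broken-environment failure, not a slow model load.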
Diffusion vs LLM asymmetry
Diffusion stages bypass both inherited vllm timers entirely:
- `vllm_omni/diffusion/stage_diffusion_client.py:37` — `class StageDiffusionClient:` (plain class, no `MPClient` ancestor)
- `vllm_omni/diffusion/stage_diffusion_proc.py:45` — `class StageDiffusionProc:` (plain class, no `EngineCoreProc` ancestor)
So VLLM_ENGINE_READY_TIMEOUT_S and HANDSHAKE_TIMEOUT_MINS are only in the LLM path. The asymmetry exists but is operationally invisible because both inherited timers bound near-instant phases.
Implication for the --init-timeout vs --stage-init-timeout discussion
| Timer | What it actually bounds | Slow phase? | LLM | Diffusion |
|---|---|---|---|---|
| `--stage-init-timeout` | Model loading inside handshake context | YES | ✅ | ✅ |
| `--init-timeout` | Outer wrapper around `_initialize_stages` | no — must be ≥ `stage_init_timeout` | ✅ | ✅ |
| `HANDSHAKE_TIMEOUT_MINS=5` (vllm hardcoded) | HELLO→INIT round trip | no — ms | ✅ inherited via `EngineCoreProc.__init__` | ❌ |
| `VLLM_ENGINE_READY_TIMEOUT_S` (vllm env var) | Drain `b""` frame from ZMQ socket buffer | no — µs | ✅ inherited via `MPClient.__init__` | ❌ |
There's really one knob that matters for slow model loads — --stage-init-timeout. The other three either bound near-instant phases or are outer wrappers that just need to be wide enough not to fire spuriously. --init-timeout is the only real footgun (because users have to keep it ≥ stage_init_timeout manually, which is the case @yuanheng-zhao described upthread). The inherited vllm timers fire on broken-environment failure modes (subprocess crash, ZMQ misconfig, network partition), not slow loads, so doubling them doesn't help and they don't need to be exposed to users.
This actually strengthens @wuhang2014's "collapse to one knob" position: keep --stage-init-timeout as canonical, drop or deprecate --init-timeout. The inherited vllm timers can stay as they are.
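Until the knobs are unified, the footgun could be guarded with a small validation at engine init. This is a hypothetical sketch with illustrative names, not existing vllm-omni code:

```python
import warnings


def resolve_init_budgets(init_timeout: int,
                         stage_init_timeout: int) -> tuple[int, int]:
    # Hypothetical guard: the outer budget must be at least the per-stage
    # budget, otherwise the outer timer fires spuriously during a legal
    # long stage init. Widen it instead of letting users hit that.
    if init_timeout < stage_init_timeout:
        warnings.warn(
            f"init_timeout={init_timeout} < stage_init_timeout="
            f"{stage_init_timeout}; raising init_timeout to match.",
            stacklevel=2)
        init_timeout = stage_init_timeout
    return init_timeout, stage_init_timeout
```

With this in place, a user who only bumps `--stage-init-timeout 1200` no longer has to remember to bump `--init-timeout` too.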
Still outside the scope of this bugfix — the unification belongs in a follow-up after #2519 and #2383 land. But hopefully the corrected timer matrix is useful when someone picks it up.
Signed-off-by: Yuanheng Zhao <jonathan.zhaoyh@gmail.com>
Nightly Tests Failures

Diffusion · Other · Function Test with L4:

```
[2026-04-10T15:14:42Z] torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.05 GiB of which 20.12 MiB is free. Including non-PyTorch memory, this process has 22.02 GiB memory in use. Of the allocated memory 21.80 GiB is allocated by PyTorch, and 10.58 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
...
[2026-04-10T15:29:25Z] ERROR tests/e2e/online_serving/test_flux_2_dev_expansion.py::test_flux_2_dev[parallel_cfg_2] - RuntimeError: Server processes exited with code 1 before becoming ready.
```

#2010 introduces Omni Perf Test:

```
FAILED tests/dfx/perf/scripts/run_benchmark.py::test_performance_benchmark[benchmark_params3-omni_server1] - AssertionError: Request failures exist
assert 99 == 100
```

Tested on `pytest -s tests/dfx/perf/scripts/run_benchmark.py::test_performance_benchmark[benchmark_params3-omni_server1]` but cannot reproduce the error. I suspect this is related to concurrency, as the timeout arg changes in this PR have no effect when the stage and engine initialize successfully.

Omni · Function Test with H100:

```
FAILED tests/e2e/online_serving/test_qwen3_omni_expansion.py::test_video_to_text_audio_001[default] - AssertionError: The output does not contain any of the keywords.
assert False
 +  where False = any(<generator object assert_omni_response.<locals>.<genexpr> at 0x73aaac54fd80>)
FAILED tests/e2e/online_serving/test_qwen3_omni_expansion.py::test_text_audio_to_text_audio_001[async_chunk] - AssertionError: The audio content is not same as the text
assert (0.7265306421035564 is not None and 0.7265306421035564 > 0.9)
 +  where 0.7265306421035564 = OmniResponse(text_content='The audio contains a repeating sequence of the word "test" spoken in a robotic or synthesized voice. The phrase "test test test test test test test" is repeated multiple times with consistent timing and pitch. There are no other discernible sounds, music, or background noise in the recording.', audio_data=None, audio_content=' The audio contains a repeating sequence of the word test, spoken in a robotic or synthesized voice. The phrase test, test, test, ...', audio_format=None, audio_bytes=b'RIFFr\xa3\x0e\x0...
```

Tested locally and failed on `pytest -s tests/e2e/online_serving/test_qwen3_omni_expansion.py::test_video_to_text_audio_001[default] tests/e2e/online_serving/test_qwen3_omni_expansion.py::test_text_audio_to_text_audio_001[async_chunk]`:

```
FAILED tests/e2e/online_serving/test_qwen3_omni_expansion.py::test_video_to_text_audio_001[default] - AssertionError: The request failed.
FAILED tests/e2e/online_serving/test_qwen3_omni_expansion.py::test_text_audio_to_text_audio_001[async_chunk] - AssertionError: The request failed.
====================================================== 2 failed, 16 warnings in 444.62s (0:07:24)
```

Omni · Function Test with L4:

```
[2026-04-10T15:08:53Z] (APIServer pid=180) ERROR 04-10 15:08:53 [async_omni_engine.py:841]   File "/workdir/vllm_omni/engine/stage_engine_core_proc.py", line 182, in _perform_handshake
[2026-04-10T15:08:53Z] (APIServer pid=180) ERROR 04-10 15:08:53 [async_omni_engine.py:841]     identity, msg = _recv(poller, handshake_socket, proc, "READY", handshake_timeout)
[2026-04-10T15:08:53Z] (APIServer pid=180) ERROR 04-10 15:08:53 [async_omni_engine.py:841]     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[2026-04-10T15:08:53Z] (APIServer pid=180) ERROR 04-10 15:08:53 [async_omni_engine.py:841]   File "/workdir/vllm_omni/engine/stage_engine_core_proc.py", line 202, in _recv
[2026-04-10T15:08:53Z] (APIServer pid=180) ERROR 04-10 15:08:53 [async_omni_engine.py:841]     raise TimeoutError(
[2026-04-10T15:08:53Z] (APIServer pid=180) ERROR 04-10 15:08:53 [async_omni_engine.py:841] TimeoutError: Timed out waiting for READY from StageEngineCoreProc after 300s. This typically indicates model loading or initialization is taking too long. Consider increasing `stage_init_timeout` for large models.
...
[2026-04-10T15:22:20Z] ERROR tests/e2e/online_serving/test_dynin_omni_expansion.py::test_send_i2i_request_001[omni_server0] - RuntimeError: Server processes exited with code 1 before becoming ready.
[2026-04-10T15:22:20Z] ERROR tests/e2e/online_serving/test_dynin_omni_expansion.py::test_send_t2i_request_001[omni_server0] - RuntimeError: Server processes exited with code 1 before becoming ready.
```

This is a stage initialization timeout.
#2519 (comment) @nuclearwu, can you help fix the first nightly CI breakdown? It's from #2010.
Signed-off-by: yuanheng <jonathan.zhaoyh@gmail.com>
Signed-off-by: Yuanheng Zhao <jonathan.zhaoyh@gmail.com>
Co-authored-by: Hongsheng Liu <liuhongsheng4@huawei.com>
Co-authored-by: SYLAR <125541396+lishunyang12@users.noreply.github.com>

PLEASE FILL IN THE PR DESCRIPTION HERE ENSURING ALL CHECKLIST ITEMS (AT THE BOTTOM) HAVE BEEN CONSIDERED.
Purpose
Resolves #2518
This PR restores the pass-through behavior of `stage_init_timeout` from before the refactoring of `StageDiffusionClient`. Specifically:

- `_HANDSHAKE_POLL_TIMEOUT_S` in `initialize_diffusion_stage` in `AsyncOmniEngine._initialize_stages`. Fixed.
- `stage_init_timeout` is used in `acquire_device_locks` but not in `complete_stage_handshake`. Fixed.

Defaults of stage init timeout are kept as:

- `--stage-init-timeout` default=300 (online serving CLI)
- `AsyncOmniEngine.__init__(stage_init_timeout: int = 300)` (engine API)
- `kwargs.pop("stage_init_timeout", 300)`

Test Plan
DiT offline example test (text-to-image) and AR end-to-end example, applying `stage_init_timeout`.

DiT:

```shell
python examples/offline_inference/text_to_image/text_to_image.py \
    --model stepfun-ai/NextStep-1.1 \
    --prompt "A baby panda wearing an Iron Man mask, holding a board with 'NextStep-1' written on it" \
    --height 512 \
    --width 512 \
    --num-inference-steps 28 \
    --guidance-scale 7.5 \
    --guidance-scale-2 1.0 \
    --cfg-schedule constant \
    --seed 42 \
    --output output_nextstep_layerwise.png \
    --enable-layerwise-offload \
    --stage-init-timeout 1200
```

^^ This is a test case from #2339.
Test Result

DiT

Before,

After, `TimeoutError` resolved.

AR

Before,

After, `TimeoutError` resolved.

For nightly test failures, check #2519 (comment).
Essential Elements of an Effective PR Description Checklist
Update `supported_models.md` and `examples` for a new model. Please run `mkdocs serve` to sync the documentation editions to `./docs`.