[Bugfix] Release stage launch lock before handshake by fake0fan · Pull Request #2717 · vllm-project/vllm-omni

fake0fan · 2026-04-13T03:50:38Z

Summary

This PR fixes a stage startup serialization issue in AsyncOmniEngine.

Previously, _launch_llm_stage() held llm_stage_launch_lock while waiting for
complete_stage_handshake() to finish. Since the handshake may block on vLLM
worker startup and model initialization, later stages could not acquire the
launch lock in time, causing multiple stages to start sequentially instead of
in parallel.

This change narrows the lock scope so that it only protects stage-specific
device environment setup and process spawning. The expensive handshake now runs
outside the global launch lock, which allows other stages to continue launching
in parallel.

Root Cause

complete_stage_handshake() was executed inside the llm_stage_launch_lock
critical section.

That meant:

stage-specific device visibility was prepared correctly
the child process was spawned correctly
but the global launch lock remained held during the full HELLO/READY handshake

Because that handshake can take a long time, the next stage had to wait even
though the previous stage had already completed the part that truly required
serialization.

Changes

Moved complete_stage_handshake() out of the llm_stage_launch_lock scope in
vllm_omni/engine/async_omni_engine.py
Kept stage device/file locks alive until handshake completion, so resource
protection semantics are preserved
Added a regression test in
tests/engine/test_async_omni_engine_stage_init.py to verify that a second
LLM stage can reach spawn_stage_core() while the first stage is still
blocked in handshake

Why This Is Safe

The stage-specific environment variables only need to be protected through the
process spawn step, so the child process inherits the correct device visibility.

After StageEngineCoreProc has already been spawned, waiting for the handshake
does not need to hold the global launch lock anymore. Keeping the per-stage
device locks until the handshake completes still prevents premature resource
release.

Testing

Static validation

python3 -m py_compile vllm_omni/engine/async_omni_engine.py tests/engine/test_async_omni_engine_stage_init.py

Added regression coverage

test_launch_llm_stage_releases_launch_lock_before_complete_stage_handshake

Expected Impact

Multiple LLM stages can launch in parallel instead of being serialized by the
handshake wait
Reduces stage initialization latency for multi-stage startup flows
Adds regression protection for future refactors around stage launch locking

@yinpeiqi @amy-why-3459 @Gaohan123 @hsliuustc0106 @wuhang2014 @chickeyton

Allow LLM stages to finish their slow startup handshake without holding the global launch lock so multiple stages can boot in parallel. Add a regression test to prevent handshake waits from serializing stage startup again. Made-with: Cursor Signed-off-by: Chenguang ZHENG <645327136@qq.com>

Signed-off-by: Chenguang ZHENG <645327136@qq.com>

yinpeiqi · 2026-04-13T05:34:39Z

Do we need an end-to-end test case for detect the stage init time? Overall this pr LGTM.

amy-why-3459 · 2026-04-13T06:39:47Z

LGTM

hsliuustc0106 · 2026-04-13T09:25:38Z

Blocker scan:

Category	Result
Correctness	PASS
Reliability/Safety	PASS
Breaking Changes	PASS
Test Coverage	PASS
Documentation	PASS
Security	PASS

OVERALL: NO BLOCKERS

VERDICT: COMMENT

Excellent bugfix. Root cause analysis is clear, safety rationale is solid, regression test is comprehensive. The lock scope narrowing is correct - stage-specific device visibility only needs protection through process spawn, not through the entire handshake.

This is a clean fix for a real serialization issue. Gates pass.

Signed-off-by: Chenguang ZHENG <645327136@qq.com>

fake0fan requested a review from hsliuustc0106 as a code owner April 13, 2026 03:50

fake0fan force-pushed the fix_init branch from 24c9b25 to bc3e707 Compare April 13, 2026 03:51

fake0fan added 2 commits April 13, 2026 11:51

Merge branch 'main' into fix_init

68355f7

fix pre-commit

300d548

Signed-off-by: Chenguang ZHENG <645327136@qq.com>

fake0fan added 2 commits April 13, 2026 14:27

Merge branch 'main' into fix_init

5546ada

Merge branch 'main' into fix_init

c678051

Merge branch 'main' into fix_init

9beacfa

hsliuustc0106 added the ready label to trigger buildkite CI label Apr 13, 2026

hsliuustc0106 approved these changes Apr 13, 2026

View reviewed changes

hsliuustc0106 merged commit 155583f into vllm-project:main Apr 13, 2026
5 of 8 checks passed

Celeste-jq pushed a commit to IsleOfDawnlight/vllm-omni-voxcpm that referenced this pull request Apr 14, 2026

[Bugfix] Release stage launch lock before handshake (vllm-project#2717)

502f61e

Signed-off-by: Chenguang ZHENG <645327136@qq.com>

lengrongfu pushed a commit to lengrongfu/vllm-omni that referenced this pull request May 1, 2026

[Bugfix] Release stage launch lock before handshake (vllm-project#2717)

f5760a5

Signed-off-by: Chenguang ZHENG <645327136@qq.com>

fake0fan deleted the fix_init branch May 10, 2026 13:19

clodaghwalsh17 pushed a commit to clodaghwalsh17/nm-vllm-omni-ent that referenced this pull request May 12, 2026

[Bugfix] Release stage launch lock before handshake (vllm-project#2717)

5e7ace9

Signed-off-by: Chenguang ZHENG <645327136@qq.com>

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Bugfix] Release stage launch lock before handshake#2717

[Bugfix] Release stage launch lock before handshake#2717
hsliuustc0106 merged 6 commits into
vllm-project:mainfrom
fake0fan:fix_init

fake0fan commented Apr 13, 2026 •

edited

Loading

Uh oh!

yinpeiqi commented Apr 13, 2026

Uh oh!

amy-why-3459 commented Apr 13, 2026

Uh oh!

hsliuustc0106 commented Apr 13, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Conversation

fake0fan commented Apr 13, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Root Cause

Changes

Why This Is Safe

Testing

Static validation

Added regression coverage

Expected Impact

Uh oh!

yinpeiqi commented Apr 13, 2026

Uh oh!

amy-why-3459 commented Apr 13, 2026

Uh oh!

hsliuustc0106 commented Apr 13, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

fake0fan commented Apr 13, 2026 •

edited

Loading