Skip to content

[Bugfix] Release stage launch lock before handshake#2717

Merged
hsliuustc0106 merged 6 commits into
vllm-project:mainfrom
fake0fan:fix_init
Apr 13, 2026
Merged

[Bugfix] Release stage launch lock before handshake#2717
hsliuustc0106 merged 6 commits into
vllm-project:mainfrom
fake0fan:fix_init

Conversation

@fake0fan
Copy link
Copy Markdown
Contributor

@fake0fan fake0fan commented Apr 13, 2026

Summary

This PR fixes a stage startup serialization issue in AsyncOmniEngine.

Previously, _launch_llm_stage() held llm_stage_launch_lock while waiting for
complete_stage_handshake() to finish. Since the handshake may block on vLLM
worker startup and model initialization, later stages could not acquire the
launch lock in time, causing multiple stages to start sequentially instead of
in parallel.

This change narrows the lock scope so that it only protects stage-specific
device environment setup and process spawning. The expensive handshake now runs
outside the global launch lock, which allows other stages to continue launching
in parallel.

Root Cause

complete_stage_handshake() was executed inside the llm_stage_launch_lock
critical section.

That meant:

  • stage-specific device visibility was prepared correctly
  • the child process was spawned correctly
  • but the global launch lock remained held during the full HELLO/READY handshake

Because that handshake can take a long time, the next stage had to wait even
though the previous stage had already completed the part that truly required
serialization.

Changes

  • Moved complete_stage_handshake() out of the llm_stage_launch_lock scope in
    vllm_omni/engine/async_omni_engine.py
  • Kept stage device/file locks alive until handshake completion, so resource
    protection semantics are preserved
  • Added a regression test in
    tests/engine/test_async_omni_engine_stage_init.py to verify that a second
    LLM stage can reach spawn_stage_core() while the first stage is still
    blocked in handshake

Why This Is Safe

The stage-specific environment variables only need to be protected through the
process spawn step, so the child process inherits the correct device visibility.

After StageEngineCoreProc has already been spawned, waiting for the handshake
does not need to hold the global launch lock anymore. Keeping the per-stage
device locks until the handshake completes still prevents premature resource
release.

Testing

Static validation

  • python3 -m py_compile vllm_omni/engine/async_omni_engine.py tests/engine/test_async_omni_engine_stage_init.py

Added regression coverage

  • test_launch_llm_stage_releases_launch_lock_before_complete_stage_handshake

Expected Impact

  • Multiple LLM stages can launch in parallel instead of being serialized by the
    handshake wait
  • Reduces stage initialization latency for multi-stage startup flows
  • Adds regression protection for future refactors around stage launch locking

@yinpeiqi @amy-why-3459 @Gaohan123 @hsliuustc0106 @wuhang2014 @chickeyton

@fake0fan fake0fan requested a review from hsliuustc0106 as a code owner April 13, 2026 03:50
Allow LLM stages to finish their slow startup handshake without holding the global launch lock so multiple stages can boot in parallel. Add a regression test to prevent handshake waits from serializing stage startup again.

Made-with: Cursor
Signed-off-by: Chenguang ZHENG <645327136@qq.com>
Signed-off-by: Chenguang ZHENG <645327136@qq.com>
@yinpeiqi
Copy link
Copy Markdown
Contributor

Do we need an end-to-end test case for detect the stage init time? Overall this pr LGTM.

@amy-why-3459
Copy link
Copy Markdown
Contributor

LGTM

@hsliuustc0106
Copy link
Copy Markdown
Collaborator

Blocker scan:

Category Result
Correctness PASS
Reliability/Safety PASS
Breaking Changes PASS
Test Coverage PASS
Documentation PASS
Security PASS

OVERALL: NO BLOCKERS

VERDICT: COMMENT

Excellent bugfix. Root cause analysis is clear, safety rationale is solid, regression test is comprehensive. The lock scope narrowing is correct - stage-specific device visibility only needs protection through process spawn, not through the entire handshake.

This is a clean fix for a real serialization issue. Gates pass.

@hsliuustc0106 hsliuustc0106 added the ready label to trigger buildkite CI label Apr 13, 2026
@hsliuustc0106 hsliuustc0106 merged commit 155583f into vllm-project:main Apr 13, 2026
5 of 8 checks passed
Celeste-jq pushed a commit to IsleOfDawnlight/vllm-omni-voxcpm that referenced this pull request Apr 14, 2026
Signed-off-by: Chenguang ZHENG <645327136@qq.com>
lengrongfu pushed a commit to lengrongfu/vllm-omni that referenced this pull request May 1, 2026
Signed-off-by: Chenguang ZHENG <645327136@qq.com>
@fake0fan fake0fan deleted the fix_init branch May 10, 2026 13:19
clodaghwalsh17 pushed a commit to clodaghwalsh17/nm-vllm-omni-ent that referenced this pull request May 12, 2026
Signed-off-by: Chenguang ZHENG <645327136@qq.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ready label to trigger buildkite CI

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants