[Bugfix] Release stage launch lock before handshake #2717
Merged
Allow LLM stages to finish their slow startup handshake without holding the global launch lock so multiple stages can boot in parallel. Add a regression test to prevent handshake waits from serializing stage startup again.
Made-with: Cursor
Signed-off-by: Chenguang ZHENG <645327136@qq.com>
Signed-off-by: Chenguang ZHENG <645327136@qq.com>
Contributor
Do we need an end-to-end test case to detect the stage init time? Overall, this PR LGTM.
Contributor
LGTM
Collaborator
Blocker scan:
OVERALL: NO BLOCKERS
VERDICT: COMMENT
Excellent bugfix. Root cause analysis is clear, safety rationale is solid, regression test is comprehensive. The lock scope narrowing is correct: stage-specific device visibility only needs protection through process spawn, not through the entire handshake. This is a clean fix for a real serialization issue. Gates pass.
hsliuustc0106 approved these changes on Apr 13, 2026
Celeste-jq pushed a commit to IsleOfDawnlight/vllm-omni-voxcpm that referenced this pull request on Apr 14, 2026
Signed-off-by: Chenguang ZHENG <645327136@qq.com>
lengrongfu pushed a commit to lengrongfu/vllm-omni that referenced this pull request on May 1, 2026
Signed-off-by: Chenguang ZHENG <645327136@qq.com>
clodaghwalsh17 pushed a commit to clodaghwalsh17/nm-vllm-omni-ent that referenced this pull request on May 12, 2026
Signed-off-by: Chenguang ZHENG <645327136@qq.com>
Summary
This PR fixes a stage startup serialization issue in AsyncOmniEngine.
Previously, _launch_llm_stage() held llm_stage_launch_lock while waiting for
complete_stage_handshake() to finish. Since the handshake may block on vLLM
worker startup and model initialization, later stages could not acquire the
launch lock in time, causing multiple stages to start sequentially instead of
in parallel.
This change narrows the lock scope so that it only protects stage-specific
device environment setup and process spawning. The expensive handshake now runs
outside the global launch lock, which allows other stages to continue launching
in parallel.
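The narrowed lock scope can be sketched as follows. This is a minimal illustration, not the code from vllm_omni/engine/async_omni_engine.py: launch_lock stands in for llm_stage_launch_lock, it is assumed here to be a threading.Lock, and launch_stage, spawn, handshake, and demo are hypothetical names chosen for the example.

```python
import threading

# Hypothetical stand-in for llm_stage_launch_lock (assumed to be a threading.Lock).
launch_lock = threading.Lock()


def launch_stage(stage_id, spawn, handshake):
    """Launch one stage: spawn under the global lock, handshake outside it."""
    with launch_lock:
        # Only stage-specific environment setup and process spawning
        # need to be serialized across stages.
        proc = spawn(stage_id)
    # The slow handshake runs after the lock is released, so another
    # stage can acquire launch_lock and spawn in the meantime.
    handshake(proc)
    return proc


def demo():
    """Record whether the global lock is held at each step of one launch."""
    observed = {}

    def spawn(stage_id):
        observed["locked_during_spawn"] = launch_lock.locked()
        return f"proc-{stage_id}"

    def handshake(proc):
        observed["locked_during_handshake"] = launch_lock.locked()

    launch_stage(0, spawn, handshake)
    return observed
```

Calling demo() reports that the lock is held during the spawn step and free during the handshake step, which is the property this PR restores.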
Root Cause
complete_stage_handshake() was executed inside the llm_stage_launch_lock critical section.
Because that handshake can take a long time, the next stage had to wait even
though the previous stage had already completed the part that truly required
serialization.
Changes
- Move complete_stage_handshake() out of the llm_stage_launch_lock scope in vllm_omni/engine/async_omni_engine.py
- Keep the per-stage device locks held until the handshake completes, so the existing protection semantics are preserved
- Add a regression test in tests/engine/test_async_omni_engine_stage_init.py to verify that a second LLM stage can reach spawn_stage_core() while the first stage is still blocked in handshake
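The idea behind the regression test can be sketched with plain threads and events. This is a hedged illustration rather than the test added in this PR: launch_narrow, launch_wide, and second_stage_spawns_during_first_handshake are hypothetical helpers, and the spawn and handshake callbacks stand in for spawn_stage_core() and complete_stage_handshake().

```python
import threading


def launch_narrow(lock, stage_id, spawn, handshake):
    """Fixed shape: the lock covers only the spawn."""
    with lock:
        proc = spawn(stage_id)
    handshake(proc)


def launch_wide(lock, stage_id, spawn, handshake):
    """Regressed shape: the lock also covers the handshake."""
    with lock:
        proc = spawn(stage_id)
        handshake(proc)


def second_stage_spawns_during_first_handshake(launch, timeout=0.5):
    """Return True if stage 1 spawns while stage 0 is still in its handshake."""
    lock = threading.Lock()
    first_in_handshake = threading.Event()
    second_spawned = threading.Event()
    result = {}

    def spawn(stage_id):
        if stage_id == 1:
            second_spawned.set()
        return stage_id

    def handshake(proc):
        if proc == 0:
            first_in_handshake.set()
            # Block like a slow handshake until stage 1 spawns or we time out.
            result["overlapped"] = second_spawned.wait(timeout)

    first = threading.Thread(target=launch, args=(lock, 0, spawn, handshake))
    second = threading.Thread(target=launch, args=(lock, 1, spawn, handshake))
    first.start()
    first_in_handshake.wait(5)  # make sure stage 0 is inside its handshake
    second.start()
    first.join(5)
    second.join(5)
    return result["overlapped"]
```

With launch_narrow the helper returns True. With launch_wide the second stage stays blocked on the lock until the first handshake times out, so the helper returns False, which is the failure a test of this shape is meant to catch.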
Why This Is Safe
The stage-specific environment variables only need to be protected through the
process spawn step, so the child process inherits the correct device visibility.
After StageEngineCoreProc has already been spawned, waiting for the handshake
does not need to hold the global launch lock anymore. Keeping the per-stage
device locks until the handshake completes still prevents premature resource
release.
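The inheritance argument can be illustrated with a small standalone sketch. Everything here is an assumption made for the example: the variable name CUDA_VISIBLE_DEVICES, the helper spawn_with_devices, restoring the parent environment after the spawn, and the use of subprocess in place of StageEngineCoreProc. It only shows that a child process captures its environment at spawn time, so nothing after the spawn depends on the parent's environment staying fixed.

```python
import os
import subprocess
import sys
import threading

launch_lock = threading.Lock()  # hypothetical global launch lock
ENV_VAR = "CUDA_VISIBLE_DEVICES"  # assumed device-visibility variable


def spawn_with_devices(devices):
    """Spawn a child that sees `devices`; mutate os.environ only under the lock."""
    with launch_lock:
        previous = os.environ.get(ENV_VAR)
        os.environ[ENV_VAR] = devices
        try:
            # The child copies the parent's environment at spawn time.
            child = subprocess.Popen(
                [sys.executable, "-c", f"import os; print(os.environ['{ENV_VAR}'])"],
                stdout=subprocess.PIPE,
                text=True,
            )
        finally:
            # Restore the parent's environment before releasing the lock.
            if previous is None:
                os.environ.pop(ENV_VAR, None)
            else:
                os.environ[ENV_VAR] = previous
    # Waiting on the child (the "handshake" in this sketch) needs no global
    # lock: the child's environment was fixed when it was spawned.
    out, _ = child.communicate()
    return out.strip()
```

Each call returns the device string the child actually observed, even though the parent's environment has already been restored by the time the child's output is read.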
Testing
Static validation
python3 -m py_compile vllm_omni/engine/async_omni_engine.py tests/engine/test_async_omni_engine_stage_init.py
Added regression coverage
test_launch_llm_stage_releases_launch_lock_before_complete_stage_handshake
Expected Impact
handshake wait
@yinpeiqi @amy-why-3459 @Gaohan123 @hsliuustc0106 @wuhang2014 @chickeyton