Skip to content

[Bugfix] Fix coord_socket assertion in DPEngineCoreProc for offline DP mode#35916

Merged
zhuohan123 merged 1 commit intovllm-project:mainfrom
jaewonlee-fb:fix-dp-coord-socket-offline
Mar 4, 2026
Merged

[Bugfix] Fix coord_socket assertion in DPEngineCoreProc for offline DP mode#35916
zhuohan123 merged 1 commit intovllm-project:mainfrom
jaewonlee-fb:fix-dp-coord-socket-offline

Conversation

@jaewonlee-fb
Copy link
Copy Markdown
Contributor

@jaewonlee-fb jaewonlee-fb commented Mar 3, 2026

Purpose

Fix assertion failure assert coord_socket is not None in process_output_sockets() when running with data parallelism in offline SPMD mode (e.g., standalone benchmarks with DP > 1).

The bug: DPEngineCoreProc.resume_scheduler() unconditionally sends coordinator messages (client_index=-1) without checking whether a coordinator exists. In offline mode, no coordinator is started, so coord_socket is None, causing an assertion crash.

Fix 1: Guard resume_scheduler() with has_coordinator check — this is the root cause fix. The method should not attempt to wake other DP engines via the coordinator when there is no coordinator.

Fix 2: Replace the hard assert in process_output_sockets() with graceful error logging as a safety net. If a coordinator message somehow reaches the output processing without a coordinator socket, log an error and drop the message instead of crashing.

Test Plan

Ran standalone MoE model benchmarks on 8x GPUs (DP=8, TP=1, EP=8):

python benchmarks/benchmark_latency.py \
    --model <moe-model> \
    --tensor-parallel-size 1 \
    --data-parallel-size 8 \
    --trust-remote-code

Test Result

All DP offline benchmarks pass (decode and prefill). Previously all crashed with AssertionError on coord_socket.

cc @njhill @houseroad @zhuohan123 @hao-aaron

@jaewonlee-fb jaewonlee-fb requested a review from njhill as a code owner March 3, 2026 21:24
@mergify mergify Bot added v1 bug Something isn't working labels Mar 3, 2026
Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request addresses a crash that occurs when running in data parallelism offline SPMD mode. The root cause was an unconditional attempt to send a message to a coordinator, even when one wasn't present, leading to an assertion failure. The fix is implemented in two parts: first, by adding a guard in DPEngineCoreProc.resume_scheduler to only send coordinator messages if a coordinator exists, which corrects the root cause. Second, as a defensive measure, the hard assertion in process_output_sockets is replaced with error logging to prevent crashes if such a message is ever sent erroneously. The changes are logical, well-implemented, and directly solve the described problem. I have no further suggestions.

Copy link
Copy Markdown
Collaborator

@houseroad houseroad left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good.

@houseroad houseroad added the ready ONLY add when PR is ready to merge/full CI is needed label Mar 3, 2026
Comment thread vllm/v1/engine/core.py Outdated
…P mode

Fix assertion failure `assert coord_socket is not None` in
`process_output_sockets()` when running with data parallelism in offline
SPMD mode (e.g., standalone benchmarks with DP > 1).

The bug: `DPEngineCoreProc.resume_scheduler()` unconditionally sends
coordinator messages (client_index=-1) without checking whether a
coordinator exists. In offline mode, no coordinator is started, so
coord_socket is None, causing an assertion crash.

Guard resume_scheduler() with has_coordinator check to only send
coordinator messages when a coordinator actually exists.

Signed-off-by: Jaewon Lee <jaewon@meta.com>
@jaewonlee-fb jaewonlee-fb force-pushed the fix-dp-coord-socket-offline branch from 8a60487 to 83bc7fa Compare March 3, 2026 21:54
@jaewonlee-fb
Copy link
Copy Markdown
Contributor Author

@njhill would you mind confirming the fix? Thanks!

Copy link
Copy Markdown
Member

@njhill njhill left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @jaewonlee-fb

@zhuohan123 zhuohan123 enabled auto-merge (squash) March 3, 2026 22:26
@zhuohan123 zhuohan123 merged commit f22ff29 into vllm-project:main Mar 4, 2026
49 checks passed
Copilot AI pushed a commit to machov/vllm that referenced this pull request Mar 10, 2026
avinashsingh77 pushed a commit to avinashsingh77/vllm that referenced this pull request Mar 12, 2026
wendyliu235 pushed a commit to wendyliu235/vllm-public that referenced this pull request Mar 18, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working ready ONLY add when PR is ready to merge/full CI is needed v1

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants