[Bugfix] Fix coord_socket assertion in DPEngineCoreProc for offline DP mode #35916
Code Review
This pull request addresses a crash that occurs when running in data parallelism offline SPMD mode. The root cause was an unconditional attempt to send a message to a coordinator, even when one wasn't present, leading to an assertion failure. The fix is implemented in two parts: first, by adding a guard in DPEngineCoreProc.resume_scheduler to only send coordinator messages if a coordinator exists, which corrects the root cause. Second, as a defensive measure, the hard assertion in process_output_sockets is replaced with error logging to prevent crashes if such a message is ever sent erroneously. The changes are logical, well-implemented, and directly solve the described problem. I have no further suggestions.
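The two-part fix summarized above can be sketched roughly as follows. This is a minimal, self-contained illustration and not the actual vLLM code: `EngineSketch`, `FakeSocket`, `process_output`, `output_queue`, and the `b"wake"` payload are hypothetical stand-ins, and only the names `has_coordinator`, `coord_socket`, `resume_scheduler`, and the `client_index=-1` convention come from the PR description.

```python
import logging
from typing import List, Optional, Tuple

logger = logging.getLogger("dp_sketch")

# The PR describes coordinator-bound messages as using client_index=-1.
COORDINATOR_INDEX = -1


class FakeSocket:
    """Hypothetical stand-in for a real socket; it only records payloads."""

    def __init__(self) -> None:
        self.sent: List[bytes] = []

    def send(self, payload: bytes) -> None:
        self.sent.append(payload)


class EngineSketch:
    """Hypothetical engine used for illustration; not DPEngineCoreProc."""

    def __init__(self, coord_socket: Optional[FakeSocket]) -> None:
        self.coord_socket = coord_socket
        # Assumption: offline mode is modeled as "no coordinator socket".
        self.has_coordinator = coord_socket is not None
        self.output_queue: List[Tuple[int, bytes]] = []
        self.dropped = 0

    def resume_scheduler(self) -> None:
        # Fix 1 (root cause): only queue a coordinator message
        # when a coordinator actually exists.
        if self.has_coordinator:
            self.output_queue.append((COORDINATOR_INDEX, b"wake"))

    def process_output(self, client_index: int, payload: bytes) -> bool:
        """Return True if the message was handled, False if it was dropped."""
        if client_index == COORDINATOR_INDEX:
            # Fix 2 (safety net): log and drop instead of asserting.
            if self.coord_socket is None:
                logger.error("Coordinator message received without a coordinator socket; dropping it.")
                self.dropped += 1
                return False
            self.coord_socket.send(payload)
            return True
        # Delivery to regular clients is omitted from this sketch.
        return True
```

With no coordinator, `resume_scheduler()` queues nothing, and a stray coordinator message passed to `process_output()` is logged and dropped rather than raising `AssertionError`.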
…P mode Fix assertion failure `assert coord_socket is not None` in `process_output_sockets()` when running with data parallelism in offline SPMD mode (e.g., standalone benchmarks with DP > 1). The bug: `DPEngineCoreProc.resume_scheduler()` unconditionally sends coordinator messages (client_index=-1) without checking whether a coordinator exists. In offline mode, no coordinator is started, so coord_socket is None, causing an assertion crash. Guard resume_scheduler() with has_coordinator check to only send coordinator messages when a coordinator actually exists. Signed-off-by: Jaewon Lee <jaewon@meta.com>
8a60487 to 83bc7fa
@njhill would you mind confirming the fix? Thanks!
…P mode (vllm-project#35916) Signed-off-by: Jaewon Lee <jaewon@meta.com>
Purpose
Fix assertion failure `assert coord_socket is not None` in `process_output_sockets()` when running with data parallelism in offline SPMD mode (e.g., standalone benchmarks with DP > 1).
The bug: `DPEngineCoreProc.resume_scheduler()` unconditionally sends coordinator messages (client_index=-1) without checking whether a coordinator exists. In offline mode, no coordinator is started, so `coord_socket` is `None`, causing an assertion crash.
Fix 1: Guard `resume_scheduler()` with `has_coordinator` check. This is the root cause fix: the method should not attempt to wake other DP engines via the coordinator when there is no coordinator.
Fix 2: Replace the hard `assert` in `process_output_sockets()` with graceful error logging as a safety net. If a coordinator message somehow reaches the output processing without a coordinator socket, log an error and drop the message instead of crashing.
Test Plan
Ran standalone MoE model benchmarks on 8x GPUs (DP=8, TP=1, EP=8):

    python benchmarks/benchmark_latency.py \
        --model <moe-model> \
        --tensor-parallel-size 1 \
        --data-parallel-size 8 \
        --trust-remote-code

Test Result
All DP offline benchmarks pass (decode and prefill). Previously all crashed with AssertionError on coord_socket.
cc @njhill @houseroad @zhuohan123 @hao-aaron