[Bugfix] Fix DP wave race condition re-arming engine while paused #36608
AjAnubolu wants to merge 1 commit into vllm-project:main
Code Review
This pull request addresses a race condition in the data parallel engine by preventing the engine from starting a new wave if the scheduler is paused. The change in vllm/v1/engine/core.py adds a check for the scheduler's paused state before setting self.engines_running to True. This correctly gates the start of a new DP wave and prevents potential hangs during collective RPC calls when the system is in a paused state, for example during scaling operations. The fix is targeted and looks correct.
Note: Security Review did not run due to the size of the PR.
Check is_scheduler_paused() before re-arming engines_running on START_DP_WAVE to prevent NCCL collective timeout. Signed-off-by: AjAnubolu <anuboluajay@gmail.com>
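The gate described in the review summary and commit message can be sketched in isolation. This is a minimal toy model, not the code in vllm/v1/engine/core.py: the `ToyEngineCore` class, its `paused` attribute, the `handle` method, and the `Msg` enum are hypothetical names introduced for illustration. Only `is_scheduler_paused()`, `engines_running`, and `START_DP_WAVE` come from the PR text, and where `is_scheduler_paused()` actually lives is an assumption here.

```python
from enum import Enum, auto


class Msg(Enum):
    """Hypothetical message type; only the START_DP_WAVE name is from the PR."""
    START_DP_WAVE = auto()


class ToyEngineCore:
    """Toy stand-in for an engine core (assumed shape, not vLLM's class)."""

    def __init__(self) -> None:
        self.paused = False           # hypothetical pause flag
        self.engines_running = False  # name taken from the PR description

    def is_scheduler_paused(self) -> bool:
        # Assumption: the real check lives elsewhere; it is inlined here
        # so the sketch stays self-contained.
        return self.paused

    def handle(self, msg: Msg) -> None:
        if msg is Msg.START_DP_WAVE:
            # The gate: never re-arm engines_running while paused.
            if not self.engines_running and not self.is_scheduler_paused():
                self.engines_running = True


core = ToyEngineCore()
core.paused = True
core.handle(Msg.START_DP_WAVE)
ignored_while_paused = core.engines_running  # stays False under the gate

core.paused = False
core.handle(Msg.START_DP_WAVE)
armed_after_resume = core.engines_running    # becomes True once unpaused
```

Without the `is_scheduler_paused()` check, the first `handle` call would set `engines_running` to True while the engine is paused, which is the re-arming behavior the PR aims to prevent.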
Force-pushed from ca14af5 to 981367b
Hi @AjAnubolu, thanks for your contribution, I appreciate the fix! I just wanted to check something: I am concerned about a potential deadlock in the case where the pause requests to each engine are delayed. Has this been tested? EDIT: it appears that this test fails.
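One possible reading of this concern, sketched as a toy model: if pause requests reach the engines at different times, a per-engine pause gate could leave the ranks disagreeing about whether a wave is running. Everything below is hypothetical (`ToyRank`, `on_start_dp_wave`, `ranks_agree`), and the assumption that all ranks must agree on the wave state is made for illustration, not confirmed by this thread.

```python
from dataclasses import dataclass


@dataclass
class ToyRank:
    """Hypothetical data-parallel rank with a local pause flag."""
    paused: bool = False
    engines_running: bool = False

    def on_start_dp_wave(self) -> None:
        # Same style of gate as in the PR: skip re-arming while paused.
        if not self.paused:
            self.engines_running = True


def ranks_agree(ranks: list[ToyRank]) -> bool:
    """True when every rank holds the same engines_running value."""
    return len({r.engines_running for r in ranks}) == 1


rank0, rank1 = ToyRank(), ToyRank()

# The pause request has reached rank 0 but is still in flight for rank 1.
rank0.paused = True

# The wave-start notification reaches both ranks in the meantime.
for rank in (rank0, rank1):
    rank.on_start_dp_wave()

# rank0 ignored the wave while rank1 armed itself, so their views diverge.
diverged = not ranks_agree([rank0, rank1])
```

In this toy model the ranks end in different states. Whether that can happen in the real engine, and whether it would lead to a deadlock, depends on how pause requests and wave notifications are ordered there, which the sketch does not capture.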
Gate START_DP_WAVE on scheduler pause state to prevent race with collective_rpc.
Closes #36594