[Bugfix] Fix DP wave race condition re-arming engine while paused #36608

Open
AjAnubolu wants to merge 1 commit into vllm-project:main from AjAnubolu:fix/dp-wave-pause-race-36594
@AjAnubolu
Contributor

Gate START_DP_WAVE on scheduler pause state to prevent race with collective_rpc.

Closes #36594

@AjAnubolu AjAnubolu requested a review from njhill as a code owner March 10, 2026 08:17
@mergify mergify bot added v1 bug Something isn't working labels Mar 10, 2026
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request addresses a race condition in the data parallel engine by preventing the engine from starting a new wave if the scheduler is paused. The change in vllm/v1/engine/core.py adds a check for the scheduler's paused state before setting self.engines_running to True. This correctly gates the start of a new DP wave and prevents potential hangs during collective RPC calls when the system is in a paused state, for example during scaling operations. The fix is targeted and looks correct.
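The gating described above can be illustrated with a minimal sketch. This is not the code from vllm/v1/engine/core.py: `ToyEngine`, `scheduler_paused`, and `on_start_dp_wave` are hypothetical names used only for illustration, while `engines_running` and `is_scheduler_paused()` are borrowed from the PR description.

```python
from dataclasses import dataclass


@dataclass
class ToyEngine:
    """Hypothetical stand-in for a DP engine core; not vLLM's actual class."""

    scheduler_paused: bool = False
    engines_running: bool = False

    def is_scheduler_paused(self) -> bool:
        return self.scheduler_paused

    def on_start_dp_wave(self) -> bool:
        """Handle a START_DP_WAVE notification; return True if re-armed."""
        if self.is_scheduler_paused():
            # Paused: leave engines_running untouched so the engine does not
            # begin stepping while a collective_rpc may be in flight.
            return False
        self.engines_running = True
        return True


paused = ToyEngine(scheduler_paused=True)
active = ToyEngine()
print(paused.on_start_dp_wave(), paused.engines_running)
print(active.on_start_dp_wave(), active.engines_running)
```

In this sketch a paused engine simply ignores the notification; whether the real engine needs to remember the request and re-arm on resume is outside what the sketch models.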

Note: Security Review did not run due to the size of the PR.

Check is_scheduler_paused() before re-arming engines_running on
START_DP_WAVE to prevent NCCL collective timeout.

Signed-off-by: AjAnubolu <anuboluajay@gmail.com>
@AjAnubolu AjAnubolu force-pushed the fix/dp-wave-pause-race-36594 branch from ca14af5 to 981367b Compare March 13, 2026 03:09
@hao-aaron
Contributor

hao-aaron commented Mar 13, 2026

Hi @AjAnubolu, thanks for your contribution, appreciate the fix! I just wanted to check something: I am concerned about a potential deadlock when the pause requests reach the engines at different times:

Engine 0 (idle)          Coordinator            Engine 1 (idle)
     |                       |                       |
     |                       |                receives pause and returns
     |                       |                       |
 recv req/wave boundary----->|---- START_DP_WAVE --> | engines_running = False, never steps
     |                       |                       |
 Enter all reduce, deadlock  |                       |
     |                       |                       |
recv pause req, add          |                       |
to queue but never processed |                       |
     |                       |                       |

Has this been tested?
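The staggered-pause scenario in the diagram can be reproduced in miniature with a toy model. This is a sketch under loud assumptions: a two-party `threading.Barrier` stands in for the DP all-reduce, the function and engine names are hypothetical, and nothing here calls vLLM or NCCL. A short timeout replaces what would be an indefinite block (or a collective timeout) in a real deployment.

```python
import threading


def simulate_staggered_pause(timeout_s: float = 0.2) -> dict:
    """Toy model: engine1 is already paused, engine0 has not seen its pause yet."""
    all_reduce = threading.Barrier(2)  # stand-in for a 2-rank collective
    outcome = {}

    def engine(name: str, paused: bool) -> None:
        # With the gate in place, a paused engine does not re-arm on
        # START_DP_WAVE, so it never steps into the collective.
        engines_running = not paused
        if not engines_running:
            outcome[name] = "skipped step"
            return
        try:
            all_reduce.wait(timeout=timeout_s)
            outcome[name] = "all-reduce completed"
        except threading.BrokenBarrierError:
            # The peer never arrived; a real collective would hang here.
            outcome[name] = "all-reduce timed out"

    threads = [
        threading.Thread(target=engine, args=("engine0", False)),
        threading.Thread(target=engine, args=("engine1", True)),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return outcome


print(simulate_staggered_pause())
```

In this model engine0 waits alone at the barrier because engine1 skipped its step, which mirrors the "Enter all reduce, deadlock" row of the diagram.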

EDIT: It appears that this test fails:

async def test_dp_pause_keep_race_staggered_engines():
