
fix hang with pause and collectives#38761

Draft
hao-aaron wants to merge 2 commits into vllm-project:main from hao-aaron:hang-fix

Conversation

@hao-aaron
Contributor

@hao-aaron hao-aaron commented Apr 1, 2026

Purpose

The problem can be described as follows:

  1. Send pause; pause returns, and engines_running = False.
  2. Queued requests -> add_request_async() is called by DPAsyncMPClient, which sends "FIRST_REQ" to the DPCoordinator.
  3. Asynchronously, a weight sync request is sent to all engines.
  4. Some engines get start_dp_wave and go into the dummy batch, while others get collective weight sync requests: deadlock!
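The hang comes from engines disagreeing on which collective to enter next. A minimal sketch (hypothetical names, not vLLM's actual API) of why that disagreement is fatal:

```python
# Each engine blocks on whichever collective signal it sees first. Since all
# ranks must participate in the SAME collective for it to complete, a split
# decision hangs every rank.

def next_blocking_op(got_start_dp_wave: bool, got_weight_sync: bool) -> str:
    """Return the collective this engine will enter first."""
    if got_start_dp_wave:
        return "dummy_batch_forward"     # joins the DP wave with a dummy batch
    if got_weight_sync:
        return "weight_sync_collective"  # joins the weight sync collective
    return "idle"

def is_deadlocked(engine_ops: list[str]) -> bool:
    """Deadlock if the non-idle engines are split across different collectives."""
    blocking = {op for op in engine_ops if op != "idle"}
    return len(blocking) > 1

# Engine 0 saw start_dp_wave first; engine 1 saw the weight sync first.
ops = [next_blocking_op(True, False), next_blocking_op(False, True)]
print(is_deadlocked(ops))  # mismatched collectives -> hang
```

This is only a model of the race, but it captures the invariant the fix needs to preserve: all ranks must make the same step/no-step decision.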

So we need to respect the pause state when sending "FIRST_REQ" from the frontend. If we are paused, we shouldn't send a start_dp_wave.

The next obvious step is to query the engines: if any is in a paused state, send the request but don't broadcast a start wave. But this breaks when the engines are not in a consistent state:

  1. Send a pause request to two engines, engine 0 and engine 1. Engine 0 returns immediately; engine 1 hasn't processed the pause yet.
  2. Concurrently, add_request_async() is called by DPAsyncMPClient. We query the pause state, discover that one engine is paused, and therefore don't broadcast a start wave. Then, say we send the request to engine 1.
  3. Engine 1 receives the new request before the pause and starts processing it -> deadlock, since engine 0 never got the start_dp_wave signal to step into the dummy batch.
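The core issue is that the frontend's view of pause state is a stale snapshot. A small sketch (all names hypothetical) of the race above:

```python
# The frontend decides whether to broadcast start_dp_wave from a snapshot of
# pause state, but the engine the request is routed to may be in a different
# state by the time the request arrives.

class Engine:
    def __init__(self):
        self.paused = False

engines = [Engine(), Engine()]

# Step 1: the pause lands on engine 0 immediately; engine 1 hasn't seen it yet.
engines[0].paused = True

# Step 2: the frontend snapshots state, sees "some engine is paused", and
# decides NOT to broadcast start_dp_wave ...
snapshot_any_paused = any(e.paused for e in engines)
send_start_dp_wave = not snapshot_any_paused

# Step 3: ... but the request is routed to engine 1, which is still running
# and will enter the model forward, while engine 0 never joins the wave.
target = engines[1]
will_step_without_wave = (not target.paused) and (not send_start_dp_wave)
print(will_step_without_wave)  # engine 1 steps alone -> deadlock
```

No frontend-side bookkeeping can close this gap, because the authoritative pause state lives in each engine.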

We could also keep a paused-state variable on the frontend side, but that would suffer from the same race as above.

The point is, even if an engine core is paused, it must still respect DP wave start requests, because other engines may already be in the model call and need it to participate.

The deadlock in both of the above cases is the result of sending start_dp_wave when we are not supposed to be stepping, or not sending it when we are. The frontend cannot know whether the engine should actually step; that information is only known by the engine the request is eventually routed to. Thus, we need to handle the sending of "FIRST_REQ" on the engine core side rather than the frontend side.

Suppose we are an engine and we receive a request while engines_running = False. There are two possibilities:

  1. Scheduler is not paused: send start_dp_wave to all. Even if other ranks are paused, they will step with us to avoid deadlock.
  2. Scheduler is paused: add the request, but do not send start_dp_wave.
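The two cases above can be sketched as a small decision in the engine core. This is an illustrative model with hypothetical names, not the actual vLLM implementation:

```python
# Engine-core-side "FIRST_REQ" handling: the engine that receives the request
# decides whether to kick off a DP wave, since only it knows its own
# authoritative pause state at the moment the request arrives.

class EngineCoreSketch:
    def __init__(self):
        self.engines_running = False
        self.scheduler_paused = False
        self.wave_started = False

    def notify_start_dp_wave(self):
        # Stands in for notifying the DPCoordinator to broadcast start_dp_wave.
        self.wave_started = True

    def add_request(self, request):
        # ... enqueue the request in the scheduler ...
        if not self.engines_running and not self.scheduler_paused:
            # Case 1: idle and unpaused -> this request starts a new wave.
            self.notify_start_dp_wave()
            self.engines_running = True
        # Case 2: paused -> request is queued, no wave is broadcast.

paused_core = EngineCoreSketch()
paused_core.scheduler_paused = True
paused_core.add_request("req-0")
print(paused_core.wave_started)  # no wave while paused

idle_core = EngineCoreSketch()
idle_core.add_request("req-1")
print(idle_core.wave_started)  # idle and unpaused: wave starts
```

Note that, per the invariant above, a paused engine would still join a wave started by another rank; this sketch only covers the decision to initiate one.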

Test Plan

Test Result


Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

Signed-off-by: ahao-anyscale <ahao@anyscale.com>
@mergify mergify bot added the v1 label Apr 1, 2026
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request simplifies the request wave coordination logic by removing the explicit "FIRST_REQ" notification from the engine core client. Instead, the engine core now handles notifying the coordinator when an idle, unpaused engine receives a request, ensuring consistent state transitions. I have no feedback to provide as there were no review comments to evaluate.

Signed-off-by: ahao-anyscale <ahao@anyscale.com>