
fix hang with pause and collectives#38761

Draft
hao-aaron wants to merge 2 commits into vllm-project:main from hao-aaron:hang-fix

Conversation

@hao-aaron
Contributor

@hao-aaron hao-aaron commented Apr 1, 2026

Purpose

The problem can be described as follows:

  1. Send pause; pause returns, and engines_running = False.
  2. Queued requests -> add_request_async() is called by DPAsyncMPClient, which sends "FIRST_REQ" to the DPCoordinator.
  3. Asynchronously, a weight sync request is sent to all engines.
  4. Some engines get start_dp_wave and go into the dummy batch, while others get collective weight sync requests: deadlock!
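The hang comes from engines disagreeing on which collective to enter next. A minimal sketch (hypothetical names, not vLLM's actual API) of why that disagreement is fatal:

```python
# Each engine blocks on whichever collective signal it sees first. Since all
# ranks must participate in the SAME collective for it to complete, a split
# decision hangs every rank.

def next_blocking_op(got_start_dp_wave: bool, got_weight_sync: bool) -> str:
    """Return the collective this engine will enter first."""
    if got_start_dp_wave:
        return "dummy_batch_forward"     # joins the DP wave with a dummy batch
    if got_weight_sync:
        return "weight_sync_collective"  # joins the weight sync collective
    return "idle"

def is_deadlocked(engine_ops: list[str]) -> bool:
    """Deadlock if the non-idle engines are split across different collectives."""
    blocking = {op for op in engine_ops if op != "idle"}
    return len(blocking) > 1

# Engine 0 saw start_dp_wave first; engine 1 saw the weight sync first.
ops = [next_blocking_op(True, False), next_blocking_op(False, True)]
print(is_deadlocked(ops))  # mismatched collectives -> hang
```

This is only a model of the race, but it captures the invariant the fix needs to preserve: all ranks must make the same step/no-step decision.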

So we need to respect the pause state when sending "FIRST_REQ" from the frontend. If we are paused, we shouldn't send a start_dp_wave.

The next obvious step is to query the engines: if any is in a paused state, send the request but don't broadcast a start wave. But this breaks when the engines are not in a consistent state:

  1. Send a pause request to two engines, engine 0 and engine 1. Engine 0 returns immediately; engine 1 hasn't processed the pause yet.
  2. Concurrently, add_request_async() is called by DPAsyncMPClient. We query the pause state, discover that one engine is paused, and therefore don't broadcast a start wave. Then, say we send the request to engine 1.
  3. Engine 1 receives the new request before the pause and starts processing it -> deadlock, since engine 0 never got the start_dp_wave signal to step into the dummy batch.
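The core issue is that the frontend's view of pause state is a stale snapshot. A small sketch (all names hypothetical) of the race above:

```python
# The frontend decides whether to broadcast start_dp_wave from a snapshot of
# pause state, but the engine the request is routed to may be in a different
# state by the time the request arrives.

class Engine:
    def __init__(self):
        self.paused = False

engines = [Engine(), Engine()]

# Step 1: the pause lands on engine 0 immediately; engine 1 hasn't seen it yet.
engines[0].paused = True

# Step 2: the frontend snapshots state, sees "some engine is paused", and
# decides NOT to broadcast start_dp_wave ...
snapshot_any_paused = any(e.paused for e in engines)
send_start_dp_wave = not snapshot_any_paused

# Step 3: ... but the request is routed to engine 1, which is still running
# and will enter the model forward, while engine 0 never joins the wave.
target = engines[1]
will_step_without_wave = (not target.paused) and (not send_start_dp_wave)
print(will_step_without_wave)  # engine 1 steps alone -> deadlock
```

No frontend-side bookkeeping can close this gap, because the authoritative pause state lives in each engine.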

We could also keep a paused-state variable on the frontend side, but that would suffer from the same race as above.

The point is, even if an engine core is paused, it must still respect DP wave start requests, because other engines may already be in the model call and need it to participate.

The deadlock in both of the above cases is the result of sending start_dp_wave when we are not supposed to be stepping, or not sending it when we are. The frontend cannot know whether the engine should actually step; that information is only known by the engine the request is eventually routed to. Thus, we need to handle the sending of "FIRST_REQ" on the engine core side rather than the frontend side.

Suppose we are an engine and we receive a request while engines_running = False. There are two possibilities:

  1. Scheduler is not paused: send start_dp_wave to all. Even if other ranks are paused, they will step with us to avoid deadlock.
  2. Scheduler is paused: add the request, but do not send start_dp_wave.
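The two cases above can be sketched as a small decision in the engine core. This is an illustrative model with hypothetical names, not the actual vLLM implementation:

```python
# Engine-core-side "FIRST_REQ" handling: the engine that receives the request
# decides whether to kick off a DP wave, since only it knows its own
# authoritative pause state at the moment the request arrives.

class EngineCoreSketch:
    def __init__(self):
        self.engines_running = False
        self.scheduler_paused = False
        self.wave_started = False

    def notify_start_dp_wave(self):
        # Stands in for notifying the DPCoordinator to broadcast start_dp_wave.
        self.wave_started = True

    def add_request(self, request):
        # ... enqueue the request in the scheduler ...
        if not self.engines_running and not self.scheduler_paused:
            # Case 1: idle and unpaused -> this request starts a new wave.
            self.notify_start_dp_wave()
            self.engines_running = True
        # Case 2: paused -> request is queued, no wave is broadcast.

paused_core = EngineCoreSketch()
paused_core.scheduler_paused = True
paused_core.add_request("req-0")
print(paused_core.wave_started)  # no wave while paused

idle_core = EngineCoreSketch()
idle_core.add_request("req-1")
print(idle_core.wave_started)  # idle and unpaused: wave starts
```

Note that, per the invariant above, a paused engine would still join a wave started by another rank; this sketch only covers the decision to initiate one.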

Test Plan

Test Result


Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

Signed-off-by: ahao-anyscale <ahao@anyscale.com>
@mergify mergify bot added the v1 label Apr 1, 2026
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request simplifies the request wave coordination logic by removing the explicit "FIRST_REQ" notification from the engine core client. Instead, the engine core now handles notifying the coordinator when an idle, unpaused engine receives a request, ensuring consistent state transitions. I have no feedback to provide as there were no review comments to evaluate.

Signed-off-by: ahao-anyscale <ahao@anyscale.com>