[bug] Fix deadlock with pause resume and collective_rpc #37024

Merged
robertgshaw2-redhat merged 2 commits into vllm-project:main from hao-aaron:dpep-hang
Mar 19, 2026

@hao-aaron hao-aaron commented Mar 14, 2026

Purpose

closes #36594

Test Plan

pending large scale training run

Test Result



x
Signed-off-by: hao-aaron <ahao@anyscale.com>
@mergify mergify bot added the v1 and bug labels Mar 14, 2026

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

The code change in vllm/v1/engine/core.py adds the line self.engines_running = True to the add_request method, ensuring that the engine's running state is explicitly set to true whenever a new request is added.

x
Signed-off-by: hao-aaron <ahao@anyscale.com>
@hao-aaron hao-aaron changed the title from "[bug] fix hang dpep pause" to "[bug] Fix deadlock with pause resume and collective_rpc" Mar 17, 2026
@hao-aaron hao-aaron marked this pull request as ready for review March 17, 2026 00:09
@hao-aaron hao-aaron requested a review from njhill as a code owner March 17, 2026 00:09

S1ro1 commented Mar 18, 2026

We have validated this internally on a large scale run (36 inference nodes) with P/D and it has fixed the issue mentioned above.

@robertgshaw2-redhat robertgshaw2-redhat enabled auto-merge (squash) March 18, 2026 22:30
@robertgshaw2-redhat robertgshaw2-redhat added the ready label Mar 18, 2026
@robertgshaw2-redhat robertgshaw2-redhat merged commit 6accb21 into vllm-project:main Mar 19, 2026
46 of 47 checks passed
fxdawnn pushed a commit to fxdawnn/vllm that referenced this pull request Mar 19, 2026
junjzhang added a commit to junjzhang/vllm that referenced this pull request Mar 24, 2026
PR vllm-project#37024 fixed the `START_DP_WAVE` / pause race in `add_request()` by
checking `scheduler.pause_state` before setting `engines_running = True`.
However, the same unguarded pattern exists in
`DPEngineCoreProc._handle_client_request()`.

When `pause_generation()` + `collective_rpc()` is used for online weight
update, a late `START_DP_WAVE` from the DP coordinator can re-arm the
engine loop via `_handle_client_request` while the engine is paused.
The re-armed engine enters the dummy-batch ALLREDUCE while the peer
engine is in `collective_rpc`, causing a one-sided collective deadlock.

Add the same `PauseState.UNPAUSED` guard that vllm-project#37024 added to
`add_request()`.

Fixes: vllm-project#36594
Signed-off-by: Junjie ZHANG <51326516+junjzhang@users.noreply.github.com>
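
The guard described in the commit message above can be illustrated with a small toy model. This is only a sketch under stated assumptions, not vLLM's actual code: `ToyEngineCore`, `ToyScheduler`, `handle_start_dp_wave`, `pause`, `resume`, and the `PauseState.PAUSED` member are hypothetical names introduced for illustration; only `PauseState.UNPAUSED`, `scheduler.pause_state`, `engines_running`, and `add_request()` come from the message itself.

```python
from enum import Enum, auto


class PauseState(Enum):
    UNPAUSED = auto()
    PAUSED = auto()  # hypothetical member; only UNPAUSED is named in the commit message


class ToyScheduler:
    """Hypothetical stand-in that only tracks the pause state."""

    def __init__(self):
        self.pause_state = PauseState.UNPAUSED


class ToyEngineCore:
    """Toy model of one data-parallel engine; not the real DPEngineCoreProc."""

    def __init__(self):
        self.scheduler = ToyScheduler()
        self.engines_running = False
        self.pending = []

    def add_request(self, request_id):
        self.pending.append(request_id)
        # Guard: arm the engine loop only when generation is not paused.
        if self.scheduler.pause_state == PauseState.UNPAUSED:
            self.engines_running = True

    def handle_start_dp_wave(self):
        # Same guard on the coordinator-driven path, so a late wave
        # notification cannot re-arm a paused engine.
        if self.scheduler.pause_state == PauseState.UNPAUSED:
            self.engines_running = True

    def pause(self):
        # Assumption for this toy: pausing also stops the loop.
        self.scheduler.pause_state = PauseState.PAUSED
        self.engines_running = False

    def resume(self):
        # Assumption for this toy: resuming re-arms only if work is queued.
        self.scheduler.pause_state = PauseState.UNPAUSED
        if self.pending:
            self.engines_running = True


if __name__ == "__main__":
    engine = ToyEngineCore()
    engine.pause()
    engine.handle_start_dp_wave()  # late wave arrives while paused
    engine.add_request("req-0")    # request arrives while paused
    print("running while paused:", engine.engines_running)
    engine.resume()
    print("running after resume:", engine.engines_running)
```

In this toy, a paused engine stays out of its loop no matter which path tries to arm it, which is the property the commit message relies on to keep a paused rank from entering a collective that its peer never joins.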
junjzhang added a commit to junjzhang/vllm that referenced this pull request Mar 24, 2026
junjzhang added a commit to junjzhang/vllm that referenced this pull request Mar 25, 2026
junjzhang added a commit to junjzhang/vllm that referenced this pull request Mar 25, 2026
junjzhang added a commit to junjzhang/vllm that referenced this pull request Mar 25, 2026
junjzhang added a commit to junjzhang/vllm that referenced this pull request Mar 25, 2026
junjzhang added a commit to junjzhang/vllm that referenced this pull request Mar 25, 2026
junjzhang added a commit to junjzhang/vllm that referenced this pull request Mar 26, 2026
SouthWest7 pushed a commit to SouthWest7/vllm that referenced this pull request Mar 27, 2026
khairulkabir1661 pushed a commit to khairulkabir1661/vllm that referenced this pull request Mar 27, 2026
junjzhang added a commit to junjzhang/vllm that referenced this pull request Mar 27, 2026
junjzhang added a commit to junjzhang/vllm that referenced this pull request Mar 27, 2026
Monishver11 pushed a commit to Monishver11/vllm that referenced this pull request Mar 27, 2026
…#37024)

Signed-off-by: hao-aaron <ahao@anyscale.com>
Signed-off-by: Monishver Chandrasekaran <monishverchandrasekaran@gmail.com>
JiantaoXu pushed a commit to JiantaoXu/vllm that referenced this pull request Mar 28, 2026
junjzhang added a commit to junjzhang/vllm that referenced this pull request Mar 28, 2026
junjzhang added a commit to junjzhang/vllm that referenced this pull request Mar 28, 2026
junjzhang added a commit to junjzhang/vllm that referenced this pull request Mar 28, 2026
junjzhang added a commit to junjzhang/vllm that referenced this pull request Mar 29, 2026
junjzhang added a commit to junjzhang/vllm that referenced this pull request Mar 30, 2026
vrdn-23 pushed a commit to vrdn-23/vllm that referenced this pull request Mar 30, 2026
…#37024)

Signed-off-by: hao-aaron <ahao@anyscale.com>
Signed-off-by: Vinay Damodaran <vrdn@hey.com>
junjzhang added a commit to junjzhang/vllm that referenced this pull request Mar 31, 2026
EricccYang pushed a commit to EricccYang/vllm that referenced this pull request Apr 1, 2026
…#37024)

Signed-off-by: hao-aaron <ahao@anyscale.com>
Signed-off-by: EricccYang <yangyang4991@gmail.com>
Labels

bug, ready, v1

3 participants