[bug] Fix deadlock with pause resume and collective_rpc by hao-aaron · Pull Request #37024 · vllm-project/vllm

hao-aaron · 2026-03-14T00:29:01Z

Purpose

closes #36594

Test Plan

pending large scale training run

Test Result

Signed-off-by: hao-aaron <ahao@anyscale.com>

gemini-code-assist

Code Review

The code change in `vllm/v1/engine/core.py` adds the line `self.engines_running = True` to the `add_request` method, ensuring that the engine's running state is explicitly set to true whenever a new request is added.

S1ro1 · 2026-03-18T22:27:57Z

We have validated this internally on a large scale run (36 inference nodes) with P/D and it has fixed the issue mentioned above.

…#37024) Signed-off-by: hao-aaron <ahao@anyscale.com>

PR vllm-project#37024 fixed the `START_DP_WAVE` / pause race in `add_request()` by checking `scheduler.pause_state` before setting `engines_running = True`. However, the same unguarded pattern exists in `DPEngineCoreProc._handle_client_request()`.

When `pause_generation()` + `collective_rpc()` is used for online weight update, a late `START_DP_WAVE` from the DP coordinator can re-arm the engine loop via `_handle_client_request` while the engine is paused. The re-armed engine enters the dummy-batch ALLREDUCE while the peer engine is in `collective_rpc`, causing a one-sided collective deadlock.

Add the same `PauseState.UNPAUSED` guard that vllm-project#37024 added to `add_request()`.

Fixes: vllm-project#36594

Signed-off-by: Junjie ZHANG <51326516+junjzhang@users.noreply.github.com>
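The guard described in the commit message above can be illustrated with a minimal, self-contained sketch. This is not vLLM's actual code: `ToyScheduler`, `ToyEngine`, `handle_start_dp_wave`, and the `PAUSED` enum member are hypothetical stand-ins, and only the names `PauseState.UNPAUSED`, `scheduler.pause_state`, and `engines_running` come from the text. It only shows the idea that a late wave-start signal should not re-arm the loop while the engine is paused.

```python
from enum import Enum, auto


class PauseState(Enum):
    UNPAUSED = auto()
    PAUSED = auto()  # hypothetical member; the source only names UNPAUSED


class ToyScheduler:
    """Hypothetical stand-in that only tracks the pause state."""

    def __init__(self) -> None:
        self.pause_state = PauseState.UNPAUSED


class ToyEngine:
    """Hypothetical stand-in for an engine core; not vLLM's real class."""

    def __init__(self) -> None:
        self.scheduler = ToyScheduler()
        self.engines_running = False

    def handle_start_dp_wave(self) -> None:
        # Guard: only re-arm the busy loop when generation is not paused.
        # Without this check, a late signal would set engines_running
        # while a peer is busy in another collective operation.
        if self.scheduler.pause_state == PauseState.UNPAUSED:
            self.engines_running = True


engine = ToyEngine()

# A late wave-start signal arrives while the engine is paused: ignored.
engine.scheduler.pause_state = PauseState.PAUSED
engine.handle_start_dp_wave()
paused_result = engine.engines_running

# After resuming, the same signal re-arms the loop as usual.
engine.scheduler.pause_state = PauseState.UNPAUSED
engine.handle_start_dp_wave()
resumed_result = engine.engines_running
```

In this toy model the flag stays unset for the signal received during the pause and is set once the engine is unpaused, which mirrors the behaviour the guard is meant to enforce.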

…#37024) Signed-off-by: hao-aaron <ahao@anyscale.com> Signed-off-by: Monishver Chandrasekaran <monishverchandrasekaran@gmail.com>

x

2913b0b

Signed-off-by: hao-aaron <ahao@anyscale.com>

mergify bot added the v1 and bug labels Mar 14, 2026

gemini-code-assist bot reviewed Mar 14, 2026

x

8c37e97

Signed-off-by: hao-aaron <ahao@anyscale.com>

hao-aaron changed the title ~~[bug] fix hang dpep pause~~ [bug] Fix deadlock with pause resume and collective_rpc Mar 17, 2026

hao-aaron marked this pull request as ready for review March 17, 2026 00:09

hao-aaron requested a review from njhill as a code owner March 17, 2026 00:09

robertgshaw2-redhat approved these changes Mar 18, 2026

robertgshaw2-redhat enabled auto-merge (squash) March 18, 2026 22:30

robertgshaw2-redhat added the ready label Mar 18, 2026

robertgshaw2-redhat merged commit 6accb21 into vllm-project:main Mar 19, 2026
46 of 47 checks passed

fxdawnn pushed a commit to fxdawnn/vllm that referenced this pull request Mar 19, 2026

[bug] Fix deadlock with pause resume and collective_rpc (vllm-project…

99adfa2

…#37024) Signed-off-by: hao-aaron <ahao@anyscale.com>

junjzhang mentioned this pull request Mar 24, 2026

[bug] Fix remaining START_DP_WAVE pause race in _handle_client_request #38009

Open

SouthWest7 pushed a commit to SouthWest7/vllm that referenced this pull request Mar 27, 2026

[bug] Fix deadlock with pause resume and collective_rpc (vllm-project…

b71995e

…#37024) Signed-off-by: hao-aaron <ahao@anyscale.com>

khairulkabir1661 pushed a commit to khairulkabir1661/vllm that referenced this pull request Mar 27, 2026

[bug] Fix deadlock with pause resume and collective_rpc (vllm-project…

6ce502d

…#37024) Signed-off-by: hao-aaron <ahao@anyscale.com>

JiantaoXu pushed a commit to JiantaoXu/vllm that referenced this pull request Mar 28, 2026

[bug] Fix deadlock with pause resume and collective_rpc (vllm-project…

355961a

…#37024) Signed-off-by: hao-aaron <ahao@anyscale.com>

vrdn-23 pushed a commit to vrdn-23/vllm that referenced this pull request Mar 30, 2026

[bug] Fix deadlock with pause resume and collective_rpc (vllm-project…

a884759

…#37024) Signed-off-by: hao-aaron <ahao@anyscale.com> Signed-off-by: Vinay Damodaran <vrdn@hey.com>

EricccYang pushed a commit to EricccYang/vllm that referenced this pull request Apr 1, 2026

[bug] Fix deadlock with pause resume and collective_rpc (vllm-project…

c5fb954

…#37024) Signed-off-by: hao-aaron <ahao@anyscale.com> Signed-off-by: EricccYang <yangyang4991@gmail.com>

robertgshaw2-redhat merged 2 commits into vllm-project:main from hao-aaron:dpep-hang
