[bug] Fix deadlock with pause resume and collective_rpc #37024
Merged
robertgshaw2-redhat merged 2 commits into vllm-project:main on Mar 19, 2026
We have validated this internally on a large-scale run (36 inference nodes) with P/D and it has fixed the issue mentioned above.
robertgshaw2-redhat
approved these changes
Mar 18, 2026
6accb21
into
vllm-project:main
46 of 47 checks passed
fxdawnn
pushed a commit
to fxdawnn/vllm
that referenced
this pull request
Mar 19, 2026
…#37024) Signed-off-by: hao-aaron <ahao@anyscale.com>
junjzhang
added a commit
to junjzhang/vllm
that referenced
this pull request
Mar 24, 2026
PR vllm-project#37024 fixed the `START_DP_WAVE` / pause race in `add_request()` by checking `scheduler.pause_state` before setting `engines_running = True`. However, the same unguarded pattern exists in `DPEngineCoreProc._handle_client_request()`.

When `pause_generation()` + `collective_rpc()` is used for online weight update, a late `START_DP_WAVE` from the DP coordinator can re-arm the engine loop via `_handle_client_request` while the engine is paused. The re-armed engine enters the dummy-batch ALLREDUCE while the peer engine is in `collective_rpc`, causing a one-sided collective deadlock.

Add the same `PauseState.UNPAUSED` guard that vllm-project#37024 added to `add_request()`.

Fixes: vllm-project#36594
Signed-off-by: Junjie ZHANG <51326516+junjzhang@users.noreply.github.com>
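To make the guard described in the commit message concrete, here is a minimal toy sketch of the pattern. It is not the vLLM code: `ToyScheduler`, `ToyEngine`, `on_start_dp_wave`, and the `PAUSED` enum member are hypothetical stand-ins, and only the names `PauseState.UNPAUSED`, `pause_state`, and `engines_running` come from the message above. The idea it illustrates is that a late `START_DP_WAVE` should not re-arm the engine loop while generation is paused.

```python
from dataclasses import dataclass, field
from enum import Enum, auto


class PauseState(Enum):
    UNPAUSED = auto()  # name taken from the commit message
    PAUSED = auto()    # hypothetical: the real enum may have other members


@dataclass
class ToyScheduler:
    # Stand-in for the scheduler object that exposes `pause_state`.
    pause_state: PauseState = PauseState.UNPAUSED


@dataclass
class ToyEngine:
    # Stand-in for an engine process taking part in data-parallel waves.
    scheduler: ToyScheduler = field(default_factory=ToyScheduler)
    engines_running: bool = False

    def on_start_dp_wave(self) -> None:
        """Handle a (possibly late) START_DP_WAVE notification."""
        # Guard: only re-arm the engine loop when generation is not paused.
        # Without this check, a paused engine could start running again and
        # enter a collective that its peers are not participating in.
        if self.scheduler.pause_state == PauseState.UNPAUSED:
            self.engines_running = True


# A late START_DP_WAVE arrives while the engine is paused: it is ignored.
engine = ToyEngine()
engine.scheduler.pause_state = PauseState.PAUSED
engine.on_start_dp_wave()
paused_result = engine.engines_running

# After resuming, the same notification re-arms the loop as usual.
engine.scheduler.pause_state = PauseState.UNPAUSED
engine.on_start_dp_wave()
resumed_result = engine.engines_running
```

In this sketch the paused engine simply drops the notification; how the real engine catches up on a wave after it is resumed is outside what the commit message describes.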
SouthWest7
pushed a commit
to SouthWest7/vllm
that referenced
this pull request
Mar 27, 2026
…#37024) Signed-off-by: hao-aaron <ahao@anyscale.com>
khairulkabir1661
pushed a commit
to khairulkabir1661/vllm
that referenced
this pull request
Mar 27, 2026
…#37024) Signed-off-by: hao-aaron <ahao@anyscale.com>
Monishver11
pushed a commit
to Monishver11/vllm
that referenced
this pull request
Mar 27, 2026
…#37024) Signed-off-by: hao-aaron <ahao@anyscale.com> Signed-off-by: Monishver Chandrasekaran <monishverchandrasekaran@gmail.com>
JiantaoXu
pushed a commit
to JiantaoXu/vllm
that referenced
this pull request
Mar 28, 2026
…#37024) Signed-off-by: hao-aaron <ahao@anyscale.com>
vrdn-23
pushed a commit
to vrdn-23/vllm
that referenced
this pull request
Mar 30, 2026
…#37024) Signed-off-by: hao-aaron <ahao@anyscale.com> Signed-off-by: Vinay Damodaran <vrdn@hey.com>
EricccYang
pushed a commit
to EricccYang/vllm
that referenced
this pull request
Apr 1, 2026
…#37024) Signed-off-by: hao-aaron <ahao@anyscale.com> Signed-off-by: EricccYang <yangyang4991@gmail.com>
Purpose
closes #36594
Test Plan
pending large scale training run
Test Result