
[bug] Fix remaining START_DP_WAVE pause race in _handle_client_request#38009

Open
junjzhang wants to merge 1 commit into vllm-project:main from junjzhang:fix/dp-wave-pause-race

Conversation

@junjzhang

Purpose

Closes #36594 (remaining race)

PR #37024 fixed the START_DP_WAVE / pause race in add_request() by checking scheduler.pause_state before setting engines_running = True. However, the same unguarded pattern exists in DPEngineCoreProc._handle_client_request().

When pause_generation() + collective_rpc() is used for online weight update, a late START_DP_WAVE from the DP coordinator can re-arm the engine loop via _handle_client_request while the engine is paused. The re-armed engine enters the dummy-batch ALLREDUCE while the peer engine is in collective_rpc, causing a one-sided collective deadlock (NCCL timeout after 600s).

Race timeline

Engine 0 (serving)       Coordinator            Engine 1 (serving)
     |                       |                       |
 new req ----FIRST_REQ ----->|--- START_DP_WAVE ---->| (queued in zmq)
     |                       |                       |
 pause_scheduler ------------|---------------------->| pause_scheduler
 (returns: idle fast path)   |                       | (returns: idle fast path)
     |                       |                       |
 collective_rpc -------------|                 _handle_client_request
     |                       |                   START_DP_WAVE arrives
     |                       |                   engines_running = True ← BUG
     |                       |                   → dummy ALLREDUCE
     |                       |                       |
 collective_rpc ALLREDUCE ←--X-- rank mismatch --→ dummy ALLREDUCE
     |                  NCCL TIMEOUT (600s)          |
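The interleaving above can be reproduced in miniature with a stub that mirrors the unguarded handler. `PauseState`, `pause_state`, and `engines_running` follow the names used in this PR; the classes below are illustrative stand-ins, not the actual vLLM source:

```python
from enum import Enum, auto


class PauseState(Enum):
    UNPAUSED = auto()
    PAUSED = auto()


class UnguardedEngineStub:
    """Illustrative stand-in for DPEngineCoreProc *before* the fix."""

    def __init__(self):
        self.pause_state = PauseState.UNPAUSED
        self.engines_running = False

    def pause_scheduler(self):
        # pause_generation() path: scheduler is marked paused, but any
        # START_DP_WAVE already queued in zmq is NOT drained.
        self.pause_state = PauseState.PAUSED

    def _handle_client_request(self, request_type):
        if request_type == "START_DP_WAVE":
            # Unguarded: re-arms the engine loop even while paused.
            self.engines_running = True


# Timeline from the diagram: the wave-start is queued before the pause
# but delivered after it, on Engine 1's side.
engine1 = UnguardedEngineStub()
engine1.pause_scheduler()                        # pause_generation()
engine1._handle_client_request("START_DP_WAVE")  # stale message arrives
assert engine1.engines_running  # BUG: re-armed while paused -> dummy ALLREDUCE
```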

Fix

Add the same PauseState.UNPAUSED guard to _handle_client_request that #37024 added to add_request.
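A minimal sketch of the guard, assuming the handler shape described above. Names are taken from the PR text (`PauseState.UNPAUSED`, `engines_running`, `_handle_client_request`); the surrounding stub is illustrative, not the actual vLLM implementation:

```python
from enum import Enum, auto


class PauseState(Enum):
    UNPAUSED = auto()
    PAUSED = auto()


class GuardedEngineStub:
    """Illustrative stand-in for DPEngineCoreProc *with* the fix applied."""

    def __init__(self):
        self.pause_state = PauseState.UNPAUSED
        self.engines_running = False

    def _handle_client_request(self, request_type):
        if request_type == "START_DP_WAVE":
            # Fix: ignore a stale wave-start while paused, mirroring the
            # guard that #37024 added to add_request(). A legitimate wave
            # will be restarted after the engine is unpaused.
            if self.pause_state == PauseState.UNPAUSED:
                self.engines_running = True


engine = GuardedEngineStub()
engine.pause_state = PauseState.PAUSED
engine._handle_client_request("START_DP_WAVE")
assert not engine.engines_running  # stale wave ignored while paused
```

When unpaused, the same message still arms the loop as before, so the guard only suppresses the pathological post-pause delivery.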

Test Plan

Reproduced consistently with DPEP (DP=2, TP=8, EP) + online weight sync on MoE model. The race is timing-dependent — enabling --enable-return-routed-experts (which adds a small per-step GPU buffer write) widens the race window enough to hit it on ~100% of runs. Without the fix: 3/3 runs deadlocked within 5 minutes of first weight sync. With the fix: pending verification.

Test Result

Pending large-scale training run with fix applied.

@junjzhang junjzhang requested a review from njhill as a code owner March 24, 2026 13:35
@mergify mergify bot added v1 bug Something isn't working labels Mar 24, 2026
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

The pull request effectively addresses the identified race condition in _handle_client_request by adding a check for PauseState.UNPAUSED before re-arming the engine loop. This change directly prevents the one-sided collective deadlock described in the issue, ensuring that the engine does not inadvertently start processing while it is supposed to be paused. The implementation aligns with the fix previously applied to add_request for a similar issue.

@github-actions

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

If you have any questions, please reach out to us on Slack at https://slack.vllm.ai.

🚀

Member

@njhill njhill left a comment


Thanks @junjzhang

@njhill njhill added the ready ONLY add when PR is ready to merge/full CI is needed label Mar 24, 2026
@njhill njhill enabled auto-merge (squash) March 24, 2026 16:25
auto-merge was automatically disabled March 24, 2026 23:59

Head branch was pushed to by a user without write access

@junjzhang junjzhang force-pushed the fix/dp-wave-pause-race branch 3 times, most recently from a72aa56 to a8c0ee0 Compare March 25, 2026 03:22
@junjzhang
Author

CI failure in test_abort_final_step is unrelated to this PR — that test uses single-GPU non-DP setup, while this PR only modifies DPEngineCoreProc._handle_client_request. The test appears flaky (monkeypatch race with fork). Retrying CI.

@junjzhang junjzhang force-pushed the fix/dp-wave-pause-race branch 2 times, most recently from 0ebcfe8 to 1830fd0 Compare March 25, 2026 05:44
@junjzhang
Author

junjzhang commented Mar 25, 2026

@njhill CI is failing again on the same engine-1-gpu check — same test as before:

FAILED v1/engine/test_abort_final_step.py::test_abort_during_final_step[False]
AssertionError: Expected at least 1 captured finish status, got 0. File content: ['INIT:WORKER', 'INIT:SCHEDULER']

This test is unrelated to the change in this PR — it's a single-GPU abort test that exercises EngineCoreProc, while this PR only modifies DPEngineCoreProc._handle_client_request. The test appears to be flaky due to inter-process timing (it relies on a 100ms sleep for the engine core process to complete abort processing and flush the status file).

This has failed consistently across multiple CI runs, so re-running alone won't help. Would it make sense to skip or mark this test as flaky for now, or would you prefer a different approach?

@junjzhang junjzhang force-pushed the fix/dp-wave-pause-race branch 2 times, most recently from fe9ddb6 to 4125ab0 Compare March 26, 2026 11:28
@markmc
Member

markmc commented Mar 26, 2026

@njhill CI is failing again on the same engine-1-gpu check — same test as before:

FAILED v1/engine/test_abort_final_step.py::test_abort_during_final_step[False]
AssertionError: Expected at least 1 captured finish status, got 0. File content: ['INIT:WORKER', 'INIT:SCHEDULER']

This test is unrelated to the change in this PR — it's a single-GPU abort test that exercises EngineCoreProc, while this PR only modifies DPEngineCoreProc._handle_client_request. The test appears to be flaky due

Filed as #38221

@junjzhang junjzhang force-pushed the fix/dp-wave-pause-race branch from 4125ab0 to 17ea3d1 Compare March 27, 2026 03:34
@junjzhang
Author

@njhill CI is failing again on the same engine-1-gpu check — same test as before:

FAILED v1/engine/test_abort_final_step.py::test_abort_during_final_step[False]
AssertionError: Expected at least 1 captured finish status, got 0. File content: ['INIT:WORKER', 'INIT:SCHEDULER']

This test is unrelated to the change in this PR — it's a single-GPU abort test that exercises EngineCoreProc, while this PR only modifies DPEngineCoreProc._handle_client_request. The test appears to be flaky due

Filed as #38221

Got it, will retrigger CI once this test is fixed.

@junjzhang junjzhang force-pushed the fix/dp-wave-pause-race branch 6 times, most recently from a9af16f to b907d9f Compare March 30, 2026 00:19
PR vllm-project#37024 fixed the `START_DP_WAVE` / pause race in `add_request()` by
checking `scheduler.pause_state` before setting `engines_running = True`.
However, the same unguarded pattern exists in
`DPEngineCoreProc._handle_client_request()`.

When `pause_generation()` + `collective_rpc()` is used for online weight
update, a late `START_DP_WAVE` from the DP coordinator can re-arm the
engine loop via `_handle_client_request` while the engine is paused.
The re-armed engine enters the dummy-batch ALLREDUCE while the peer
engine is in `collective_rpc`, causing a one-sided collective deadlock.

Add the same `PauseState.UNPAUSED` guard that vllm-project#37024 added to
`add_request()`.

Fixes: vllm-project#36594
Signed-off-by: Junjie ZHANG <51326516+junjzhang@users.noreply.github.com>
@junjzhang junjzhang force-pushed the fix/dp-wave-pause-race branch from b907d9f to 65c0bec Compare March 31, 2026 08:16
@hao-aaron
Contributor

hao-aaron commented Apr 1, 2026

I think this PR may be running into the same issue mentioned in #36608 (comment). However, the original problems mentioned in your PR are a concern. Could you try running the solution in #38761 and see if that fixes the issues? Feel free to message in the #ext-vllm-sig-rl slack channel.

@junjzhang
Author

I think this PR may be running into the same issue mentioned in #36608 (comment). However, the original problems mentioned in your PR are a concern. Could you try running the solution in #38761 and see if that fixes the issues? Feel free to message in the #ext-vllm-sig-rl slack channel.

Would love to test it.

@junjzhang
Author

Hi @hao-aaron, I tested PR #38761 on our setup (DPEP DP=2, TP=8, EP, MoE 120B + online weight sync). Unfortunately it did not fix the race — we hit the same NCCL watchdog timeout (_ALLGATHER_BASE, 600s) as before.

The likely reason: #38761 guards the add_request() path (the source of start_wave notifications), but the _handle_client_request(START_DP_WAVE) path — where the coordinator's broadcast is consumed by peer engines — remains unguarded. Since pause_generation() does not drain in-flight zmq messages, a START_DP_WAVE that was legitimately broadcast before the pause can still arrive after it and re-arm engines_running on one engine while the other is already in collective_rpc. This is the one-sided collective mismatch described in our PR #38009.

In contrast, when we tested with only our #38009 fix (PauseState guard in _handle_client_request) and without #38761, DP=2 weight sync completed successfully with no NCCL issues.

I think both fixes are complementary: #38761 reduces stale START_DP_WAVE production, #38009 prevents damage when a stale message arrives. Happy to discuss further in #ext-vllm-sig-rl.
