
[bug] Fix remaining START_DP_WAVE pause race in _handle_client_request#38009

Open
junjzhang wants to merge 1 commit into vllm-project:main from junjzhang:fix/dp-wave-pause-race

Conversation

@junjzhang

Purpose

Closes #36594 (remaining race)

PR #37024 fixed the START_DP_WAVE / pause race in add_request() by checking scheduler.pause_state before setting engines_running = True. However, the same unguarded pattern exists in DPEngineCoreProc._handle_client_request().

When pause_generation() + collective_rpc() is used for online weight update, a late START_DP_WAVE from the DP coordinator can re-arm the engine loop via _handle_client_request while the engine is paused. The re-armed engine enters the dummy-batch ALLREDUCE while the peer engine is in collective_rpc, causing a one-sided collective deadlock (NCCL timeout after 600s).

Race timeline

Engine 0 (serving)       Coordinator            Engine 1 (serving)
     |                       |                       |
 new req ----FIRST_REQ ----->|--- START_DP_WAVE ---->| (queued in zmq)
     |                       |                       |
 pause_scheduler ------------|---------------------->| pause_scheduler
 (returns: idle fast path)   |                       | (returns: idle fast path)
     |                       |                       |
 collective_rpc -------------|                 _handle_client_request
     |                       |                   START_DP_WAVE arrives
     |                       |                   engines_running = True ← BUG
     |                       |                   → dummy ALLREDUCE
     |                       |                       |
 collective_rpc ALLREDUCE ←--X-- rank mismatch --→ dummy ALLREDUCE
     |                  NCCL TIMEOUT (600s)          |
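The interleaving above can be reproduced in miniature with a stub that mirrors the unguarded handler. `PauseState`, `pause_state`, and `engines_running` follow the names used in this PR; the classes below are illustrative stand-ins, not the actual vLLM source:

```python
from enum import Enum, auto


class PauseState(Enum):
    UNPAUSED = auto()
    PAUSED = auto()


class UnguardedEngineStub:
    """Illustrative stand-in for DPEngineCoreProc *before* the fix."""

    def __init__(self):
        self.pause_state = PauseState.UNPAUSED
        self.engines_running = False

    def pause_scheduler(self):
        # pause_generation() path: scheduler is marked paused, but any
        # START_DP_WAVE already queued in zmq is NOT drained.
        self.pause_state = PauseState.PAUSED

    def _handle_client_request(self, request_type):
        if request_type == "START_DP_WAVE":
            # Unguarded: re-arms the engine loop even while paused.
            self.engines_running = True


# Timeline from the diagram: the wave-start is queued before the pause
# but delivered after it, on Engine 1's side.
engine1 = UnguardedEngineStub()
engine1.pause_scheduler()                        # pause_generation()
engine1._handle_client_request("START_DP_WAVE")  # stale message arrives
assert engine1.engines_running  # BUG: re-armed while paused -> dummy ALLREDUCE
```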

Fix

Add the same PauseState.UNPAUSED guard to _handle_client_request that #37024 added to add_request.
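A minimal sketch of the guard, assuming the handler shape described above. Names are taken from the PR text (`PauseState.UNPAUSED`, `engines_running`, `_handle_client_request`); the surrounding stub is illustrative, not the actual vLLM implementation:

```python
from enum import Enum, auto


class PauseState(Enum):
    UNPAUSED = auto()
    PAUSED = auto()


class GuardedEngineStub:
    """Illustrative stand-in for DPEngineCoreProc *with* the fix applied."""

    def __init__(self):
        self.pause_state = PauseState.UNPAUSED
        self.engines_running = False

    def _handle_client_request(self, request_type):
        if request_type == "START_DP_WAVE":
            # Fix: ignore a stale wave-start while paused, mirroring the
            # guard that #37024 added to add_request(). A legitimate wave
            # will be restarted after the engine is unpaused.
            if self.pause_state == PauseState.UNPAUSED:
                self.engines_running = True


engine = GuardedEngineStub()
engine.pause_state = PauseState.PAUSED
engine._handle_client_request("START_DP_WAVE")
assert not engine.engines_running  # stale wave ignored while paused
```

When unpaused, the same message still arms the loop as before, so the guard only suppresses the pathological post-pause delivery.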

Test Plan

Reproduced consistently with DPEP (DP=2, TP=8, EP) + online weight sync on MoE model. The race is timing-dependent — enabling --enable-return-routed-experts (which adds a small per-step GPU buffer write) widens the race window enough to hit it on ~100% of runs. Without the fix: 3/3 runs deadlocked within 5 minutes of first weight sync. With the fix: pending verification.

Test Result

Pending large-scale training run with fix applied.

@junjzhang junjzhang requested a review from njhill as a code owner March 24, 2026 13:35
@mergify mergify bot added v1 bug Something isn't working labels Mar 24, 2026
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

The pull request effectively addresses the identified race condition in _handle_client_request by adding a check for PauseState.UNPAUSED before re-arming the engine loop. This change directly prevents the one-sided collective deadlock described in the issue, ensuring that the engine does not inadvertently start processing while it is supposed to be paused. The implementation aligns with the fix previously applied to add_request for a similar issue.

@github-actions

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

If you have any questions, please reach out to us on Slack at https://slack.vllm.ai.

🚀

Member

@njhill njhill left a comment


Thanks @junjzhang

@njhill njhill added the ready ONLY add when PR is ready to merge/full CI is needed label Mar 24, 2026
@njhill njhill enabled auto-merge (squash) March 24, 2026 16:25
auto-merge was automatically disabled March 24, 2026 23:59

Head branch was pushed to by a user without write access

@junjzhang junjzhang force-pushed the fix/dp-wave-pause-race branch 3 times, most recently from a72aa56 to a8c0ee0 Compare March 25, 2026 03:22
@junjzhang
Author

CI failure in test_abort_final_step is unrelated to this PR — that test uses single-GPU non-DP setup, while this PR only modifies DPEngineCoreProc._handle_client_request. The test appears flaky (monkeypatch race with fork). Retrying CI.

@junjzhang junjzhang force-pushed the fix/dp-wave-pause-race branch 2 times, most recently from 0ebcfe8 to 1830fd0 Compare March 25, 2026 05:44
@junjzhang
Author

junjzhang commented Mar 25, 2026

@njhill CI is failing again on the same engine-1-gpu check — same test as before:

FAILED v1/engine/test_abort_final_step.py::test_abort_during_final_step[False]
AssertionError: Expected at least 1 captured finish status, got 0. File content: ['INIT:WORKER', 'INIT:SCHEDULER']

This test is unrelated to the change in this PR — it's a single-GPU abort test that exercises EngineCoreProc, while this PR only modifies DPEngineCoreProc._handle_client_request. The test appears to be flaky due to inter-process timing (it relies on a 100ms sleep for the engine core process to complete abort processing and flush the status file).

This has failed consistently across multiple CI runs, so re-running alone won't help. Would it make sense to skip or mark this test as flaky for now, or would you prefer a different approach?

@junjzhang junjzhang force-pushed the fix/dp-wave-pause-race branch 2 times, most recently from fe9ddb6 to 4125ab0 Compare March 26, 2026 11:28
@markmc
Member

markmc commented Mar 26, 2026

@njhill CI is failing again on the same engine-1-gpu check — same test as before:

FAILED v1/engine/test_abort_final_step.py::test_abort_during_final_step[False]
AssertionError: Expected at least 1 captured finish status, got 0. File content: ['INIT:WORKER', 'INIT:SCHEDULER']

This test is unrelated to the change in this PR — it's a single-GPU abort test that exercises EngineCoreProc, while this PR only modifies DPEngineCoreProc._handle_client_request. The test appears to be flaky due

Filed as #38221

@junjzhang junjzhang force-pushed the fix/dp-wave-pause-race branch from 4125ab0 to 17ea3d1 Compare March 27, 2026 03:34
@junjzhang
Author

@njhill CI is failing again on the same engine-1-gpu check — same test as before:

FAILED v1/engine/test_abort_final_step.py::test_abort_during_final_step[False]
AssertionError: Expected at least 1 captured finish status, got 0. File content: ['INIT:WORKER', 'INIT:SCHEDULER']

This test is unrelated to the change in this PR — it's a single-GPU abort test that exercises EngineCoreProc, while this PR only modifies DPEngineCoreProc._handle_client_request. The test appears to be flaky due

Filed as #38221

Got it, will retrigger CI once this test is fixed.

@junjzhang junjzhang force-pushed the fix/dp-wave-pause-race branch 6 times, most recently from a9af16f to b907d9f Compare March 30, 2026 00:19
PR vllm-project#37024 fixed the `START_DP_WAVE` / pause race in `add_request()` by
checking `scheduler.pause_state` before setting `engines_running = True`.
However, the same unguarded pattern exists in
`DPEngineCoreProc._handle_client_request()`.

When `pause_generation()` + `collective_rpc()` is used for online weight
update, a late `START_DP_WAVE` from the DP coordinator can re-arm the
engine loop via `_handle_client_request` while the engine is paused.
The re-armed engine enters the dummy-batch ALLREDUCE while the peer
engine is in `collective_rpc`, causing a one-sided collective deadlock.

Add the same `PauseState.UNPAUSED` guard that vllm-project#37024 added to
`add_request()`.

Fixes: vllm-project#36594
Signed-off-by: Junjie ZHANG <51326516+junjzhang@users.noreply.github.com>
@junjzhang junjzhang force-pushed the fix/dp-wave-pause-race branch from b907d9f to 65c0bec Compare March 31, 2026 08:16
@hao-aaron
Contributor

hao-aaron commented Apr 1, 2026

I think this PR may be running into the same issue mentioned in #36608 (comment). However, the original problems mentioned in your PR are a concern. Could you try running the solution in #38761 and see if that fixes the issues? Feel free to message in the #ext-vllm-sig-rl slack channel.

@junjzhang
Author

I think this PR may be running into the same issue mentioned in #36608 (comment). However, the original problems mentioned in your PR are a concern. Could you try running the solution in #38761 and see if that fixes the issues? Feel free to message in the #ext-vllm-sig-rl slack channel.

Would love to test it.

@junjzhang
Author

Hi @hao-aaron, I tested PR #38761 on our setup (DPEP DP=2, TP=8, EP, MoE 120B + online weight sync). Unfortunately it did not fix the race — we hit the same NCCL watchdog timeout (_ALLGATHER_BASE, 600s) as before.

The likely reason: #38761 guards the add_request() path (the source of start_wave notifications), but the _handle_client_request(START_DP_WAVE) path — where the coordinator's broadcast is consumed by peer engines — remains unguarded. Since pause_generation() does not drain in-flight zmq messages, a START_DP_WAVE that was legitimately broadcast before the pause can still arrive after it and re-arm engines_running on one engine while the other is already in collective_rpc. This is the one-sided collective mismatch described in our PR #38009.

In contrast, when we tested with only our #38009 fix (PauseState guard in _handle_client_request) and without #38761, DP=2 weight sync completed successfully with no NCCL issues.

I think both fixes are complementary: #38761 reduces stale START_DP_WAVE production, #38009 prevents damage when a stale message arrives. Happy to discuss further in #ext-vllm-sig-rl.
