[bug] Fix remaining START_DP_WAVE pause race in _handle_client_request#38009
junjzhang wants to merge 1 commit into vllm-project:main
Conversation
Code Review
The pull request effectively addresses the identified race condition in _handle_client_request by adding a check for PauseState.UNPAUSED before re-arming the engine loop. This change directly prevents the one-sided collective deadlock described in the issue, ensuring that the engine does not inadvertently start processing while it is supposed to be paused. The implementation aligns with the fix previously applied to add_request for a similar issue.
Head branch was pushed to by a user without write access
Force-pushed from a72aa56 to a8c0ee0
CI failure in
Force-pushed from 0ebcfe8 to 1830fd0
@njhill CI is failing again on the same test. The test is unrelated to the change in this PR — it's a single-GPU abort test. It has failed consistently across multiple CI runs, so re-running alone won't help. Would it make sense to skip or mark this test as flaky for now, or would you prefer a different approach?
Force-pushed from fe9ddb6 to 4125ab0
Filed as #38221
Force-pushed from 4125ab0 to 17ea3d1
Force-pushed from a9af16f to b907d9f
PR vllm-project#37024 fixed the `START_DP_WAVE` / pause race in `add_request()` by checking `scheduler.pause_state` before setting `engines_running = True`. However, the same unguarded pattern exists in `DPEngineCoreProc._handle_client_request()`. When `pause_generation()` + `collective_rpc()` is used for an online weight update, a late `START_DP_WAVE` from the DP coordinator can re-arm the engine loop via `_handle_client_request` while the engine is paused. The re-armed engine enters the dummy-batch ALLREDUCE while the peer engine is in `collective_rpc`, causing a one-sided collective deadlock. Add the same `PauseState.UNPAUSED` guard that vllm-project#37024 added to `add_request()`.

Fixes: vllm-project#36594

Signed-off-by: Junjie ZHANG <51326516+junjzhang@users.noreply.github.com>
Force-pushed from b907d9f to 65c0bec
I think this PR may be running into the same issue mentioned in #36608 (comment). However, the original problems mentioned in your PR are a concern. Could you try running the solution in #38761 and see if that fixes the issues? Feel free to message in the #ext-vllm-sig-rl Slack channel.
Would love to test it.
Hi @hao-aaron, I tested PR #38761 on our setup (DPEP DP=2, TP=8, EP, MoE 120B + online weight sync). Unfortunately it did not fix the race — we hit the same NCCL watchdog timeout.

The possible reason: #38761 reduces stale `START_DP_WAVE` messages at the source, but one already in flight can still reach `_handle_client_request` and re-arm the paused engine.

In contrast, when we tested with only our #38009 fix (the PauseState guard in `_handle_client_request`), the deadlock did not reproduce.

I think both fixes are complementary: #38761 cuts down on stale `START_DP_WAVE` deliveries, while #38009 makes a paused engine ignore any that still arrive.
Purpose
Closes #36594 (remaining race)
PR #37024 fixed the `START_DP_WAVE` / pause race in `add_request()` by checking `scheduler.pause_state` before setting `engines_running = True`. However, the same unguarded pattern exists in `DPEngineCoreProc._handle_client_request()`.

When `pause_generation()` + `collective_rpc()` is used for an online weight update, a late `START_DP_WAVE` from the DP coordinator can re-arm the engine loop via `_handle_client_request` while the engine is paused. The re-armed engine enters the dummy-batch ALLREDUCE while the peer engine is in `collective_rpc`, causing a one-sided collective deadlock (NCCL timeout after 600s).

Race timeline
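The PR's original race-timeline diagram is not preserved in this thread. As a stand-in, here is a toy illustration (plain Python with a `threading.Barrier`, not vLLM code) of why the resulting collective hangs: one participant enters the barrier, which stands in for the dummy-batch ALLREDUCE, while its peer is busy in `collective_rpc` and never joins.

```python
# Toy model of the one-sided collective: a Barrier for 2 participants
# stands in for the dummy-batch ALLREDUCE. Only the re-armed engine
# shows up; the peer is "inside collective_rpc" and never calls wait().
import threading

barrier = threading.Barrier(2)

def rearmed_engine() -> bool:
    """Enter the collective; report whether the peer ever joined."""
    try:
        barrier.wait(timeout=0.5)  # real NCCL would block until the 600s watchdog fires
        return True
    except threading.BrokenBarrierError:
        return False  # waited alone until the timeout: one-sided collective

result = {}
t = threading.Thread(target=lambda: result.update(joined=rearmed_engine()))
t.start()
t.join()  # the peer never calls barrier.wait(), so the wait times out
print(result["joined"])  # False
```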
Fix
Add the same `PauseState.UNPAUSED` guard to `_handle_client_request` that #37024 added to `add_request`.

Test Plan
Reproduced consistently with DPEP (DP=2, TP=8, EP) + online weight sync on an MoE model. The race is timing-dependent — enabling `--enable-return-routed-experts` (which adds a small per-step GPU buffer write) widens the race window enough to hit it on ~100% of runs. Without the fix: 3/3 runs deadlocked within 5 minutes of the first weight sync. With the fix: pending verification.

Test Result
Pending large-scale training run with fix applied.
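For reference, a minimal sketch of the guard pattern the Fix section describes, assuming a simplified stand-in for `DPEngineCoreProc` (the `PauseState` states and the `START_DP_WAVE` handling are from the PR text; the class layout and request representation here are illustrative, not the actual vLLM source):

```python
# Sketch of the guard described in this PR. PauseState and the
# START_DP_WAVE message come from the PR text; everything else
# (class layout, request representation) is a simplified assumption.
from enum import Enum, auto

class PauseState(Enum):
    UNPAUSED = auto()
    PAUSED = auto()

class DPEngineCoreProcSketch:
    """Stand-in for DPEngineCoreProc with only the relevant state."""

    def __init__(self) -> None:
        self.engines_running = False
        self.pause_state = PauseState.UNPAUSED  # mirrors scheduler.pause_state

    def _handle_client_request(self, request_type: str) -> None:
        if request_type == "START_DP_WAVE":
            # The fix: a late START_DP_WAVE must not re-arm the engine
            # loop while generation is paused, otherwise the engine walks
            # into the dummy-batch ALLREDUCE alone and deadlocks.
            if self.pause_state == PauseState.UNPAUSED:
                self.engines_running = True

engine = DPEngineCoreProcSketch()
engine.pause_state = PauseState.PAUSED      # pause_generation() in effect
engine._handle_client_request("START_DP_WAVE")
print(engine.engines_running)               # False: the stale wave is ignored
```

Once the pause is lifted, the same message re-arms the loop as before, so normal wave starts are unaffected.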