[Bugfix] Prevent Silent Stage Dropouts: fix coordinator reconnect bug, close/update race, and heartbeat stall #1899
Signed-off-by: pikaxinge <2392811793@qq.com>
…d reconnect locking Signed-off-by: pikaxinge <2392811793@qq.com>
@hsliuustc0106 @Gaohan123 Hi! Could you please take a look when you have a chance? I have addressed all review comments, resolved the threads, and CI is now fully green. Thank you!
I think this PR improved the stability of the current coordinator part.
May I know how long those tests cost?
@congw729 These are lightweight CPU tests. I measured |
Good to know. Thanks! |
Do you need to run the L4 tests before merging? I can add a label to trigger the L4 tests.
@congw729 Not strictly required for this PR, since the changes are control-plane logic and the added tests are CPU/core_model. If you prefer extra safety, I’m happy to run L4 as well. |
I don't think this PR requires the L4 tests, since omni_coordinator and omni_coord_client_for_stage aren't used in the current program flow yet; we will finish all the tests during the "integration" task.
Hi @congw729 and @linyueqian, sorry for the extra ping. |
…, close/update race, and heartbeat stall (vllm-project#1899) Signed-off-by: pikaxinge <2392811793@qq.com> Co-authored-by: Alicia <115451386+congw729@users.noreply.github.com>
…, close/update race, and heartbeat stall (vllm-project#1899) Signed-off-by: pikaxinge <2392811793@qq.com> Co-authored-by: Alicia <115451386+congw729@users.noreply.github.com> Signed-off-by: bob-021206 <binyan_github@163.com>
Why This PR Matters
This fixes a control-plane reliability bug that can silently break stage liveness reporting:
These issues can cause stages to be misclassified (UP/DOWN) and make debugging cluster health difficult under network instability.
What Changed
… _reconnect() is now actually called).
… max_retries) and explicit retry logging.
update_info() is rejected while shutdown is in progress.
… close() to continue cleanup when final update send fails with RuntimeError/ZMQError.
Tests Added/Updated
… RuntimeError
update_info() is rejected while closing
_recv_event helper now uses timeout to avoid hanging tests
Validation
Local commands executed:
python3 -m py_compile vllm_omni/distributed/omni_coordinator/omni_coord_client_for_stage.py tests/distributed/omni_coordinator/test_omni_coord_client_for_stage.py ✅
Risk
OmniCoordClientForStage.
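For readers who want a concrete picture of the behaviors listed under What Changed, here is a minimal sketch. It is not the code from this PR: the class name SketchStageClient, the injected send and reconnect callables, and the payload shape are all hypothetical, and OSError stands in for ZMQError so the sketch runs without pyzmq. It only illustrates three ideas: retries are bounded by max_retries and logged, update_info() is rejected once shutdown has started, and close() finishes cleanup even when the final update cannot be sent.

```python
import logging
import threading

logger = logging.getLogger("stage_client_sketch")


class SketchStageClient:
    """Hypothetical stand-in; not the real OmniCoordClientForStage."""

    def __init__(self, send, reconnect, max_retries=3):
        self._send = send              # callable(dict) -> None, may raise
        self._reconnect = reconnect    # callable() -> None, rebuilds the link
        self._max_retries = max_retries
        self._lock = threading.Lock()  # serializes send, reconnect and close
        self._closing = False
        self.closed = False

    def _send_with_retry(self, payload):
        # One initial attempt plus at most max_retries reconnect attempts.
        for attempt in range(self._max_retries + 1):
            try:
                self._send(payload)
                return
            except (RuntimeError, OSError) as exc:
                if attempt == self._max_retries:
                    raise  # retries exhausted; surface the failure
                logger.warning("send failed (%s); reconnect %d/%d",
                               exc, attempt + 1, self._max_retries)
                self._reconnect()

    def update_info(self, status):
        with self._lock:
            # Reject updates once shutdown has begun, so a late update
            # cannot race with the final message sent by close().
            if self._closing:
                raise RuntimeError("client is closing; update rejected")
            self._send_with_retry({"status": status})

    def close(self):
        with self._lock:
            if self._closing:
                return  # already closed or closing
            self._closing = True
            try:
                self._send({"status": "DOWN"})  # best-effort final update
            except (RuntimeError, OSError) as exc:
                logger.warning("final update failed during close: %s", exc)
            finally:
                self.closed = True  # cleanup proceeds regardless
```

Holding one lock across update_info() and close() is the simplest way to make the "closing" check and the send atomic with respect to each other; the real client may structure its locking differently.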