Skip to content

[Bugfix] Prevent Silent Stage Dropouts: fix coordinator reconnect bug, close/update race, and heartbeat stall#1899

Merged
hsliuustc0106 merged 4 commits intovllm-project:mainfrom
pikaxinge:fix/coord-client-reconnect-heartbeat-race
Apr 7, 2026
Merged

[Bugfix] Prevent Silent Stage Dropouts: fix coordinator reconnect bug, close/update race, and heartbeat stall#1899
hsliuustc0106 merged 4 commits intovllm-project:mainfrom
pikaxinge:fix/coord-client-reconnect-heartbeat-race

Conversation

@pikaxinge
Copy link
Copy Markdown
Contributor

@pikaxinge pikaxinge commented Mar 14, 2026

Why This PR Matters

This fixes a control-plane reliability bug that can silently break stage liveness reporting:

  • reconnect path was not actually invoked on send failures
  • close/update race could emit stale status ordering
  • heartbeat loop could stop permanently after a transient failure

These issues can cause stages to be misclassified (UP/DOWN) and make debugging cluster health difficult under network instability.

What Changed

  • Fixed reconnect invocation in send-failure path (_reconnect() is now actually called).
  • Hardened reconnect behavior with bounded retries (max_retries) and explicit retry logging.
  • Moved event snapshotting under send lock to avoid stale payload emission races.
  • Added closing-state guard so update_info() is rejected while shutdown is in progress.
  • Improved heartbeat resiliency: transient send failures now retry on next interval instead of terminating heartbeat loop.
  • Hardened close() to continue cleanup when final update send fails with RuntimeError/ZMQError.

Tests Added/Updated

  • reconnect is called and send is retried after first failure
  • reconnect failure propagates error and does not silently pass
  • close path still cleans resources when final send raises RuntimeError
  • reconnect respects retry limit (no unbounded loop)
  • heartbeat loop retries after transient send failure
  • update_info() is rejected while closing
  • _recv_event helper now uses timeout to avoid hanging tests

Validation

Local commands executed:

  • python3 -m py_compile vllm_omni/distributed/omni_coordinator/omni_coord_client_for_stage.py tests/distributed/omni_coordinator/test_omni_coord_client_for_stage.py
  • custom targeted behavior scripts for:
    • bounded reconnect retries ✅
    • heartbeat retry after transient failure ✅
    • close cleanup on runtime exception ✅

Risk

  • Main behavior changes are localized to OmniCoordClientForStage.
  • No API surface changes.
  • Added focused tests around new/modified failure paths.

Signed-off-by: pikaxinge <2392811793@qq.com>
@pikaxinge pikaxinge force-pushed the fix/coord-client-reconnect-heartbeat-race branch from 81b2397 to dfdd197 Compare March 14, 2026 17:51
@pikaxinge pikaxinge changed the title Prevent Silent Stage Dropouts: fix coordinator reconnect bug, close/update race, and heartbeat stall [Bugfix] Prevent Silent Stage Dropouts: fix coordinator reconnect bug, close/update race, and heartbeat stall Mar 15, 2026
@pikaxinge
Copy link
Copy Markdown
Contributor Author

Opened tracking issue for detailed root-cause analysis and follow-up hardening scope: #1902\n\nThis PR (#1899) addresses the critical reconnect/race/heartbeat failure paths, and #1902 tracks the remaining reliability work (notably lifecycle-event handling under zmq.Again/backpressure).

Comment thread tests/distributed/omni_coordinator/test_omni_coord_client_for_stage.py Outdated
Comment thread tests/distributed/omni_coordinator/test_omni_coord_client_for_stage.py Outdated
…d reconnect locking

Signed-off-by: pikaxinge <2392811793@qq.com>
Copy link
Copy Markdown
Contributor

@NumberWan NumberWan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@pikaxinge
Copy link
Copy Markdown
Contributor Author

@hsliuustc0106 @Gaohan123 Hi! Could you please take a look when you have a chance? I have addressed all review comments, resolved the threads, and CI is now fully green. Thank you!

@NumberWan
Copy link
Copy Markdown
Contributor

I think this PR improved the stability of the current coordinator part.
@hsliuustc0106 @Gaohan123 PTAL

@linyueqian
Copy link
Copy Markdown
Collaborator

@amy-why-3459 @tzhouam @congw729 PTAL

@congw729
Copy link
Copy Markdown
Collaborator

congw729 commented Apr 2, 2026

May I know how long those tests cost?

@pikaxinge
Copy link
Copy Markdown
Contributor Author

@congw729 These are lightweight CPU tests. I measured tests/distributed/omni_coordinator/test_omni_coord_client_for_stage.py (9 cases) across 3 runs: 0.305s / 0.315s / 0.351s (median 0.315s). So the added test cost is minimal.

@congw729
Copy link
Copy Markdown
Collaborator

congw729 commented Apr 2, 2026

@congw729 These are lightweight CPU tests. I measured tests/distributed/omni_coordinator/test_omni_coord_client_for_stage.py (9 cases) across 3 runs: 0.305s / 0.315s / 0.351s (median 0.315s). So the added test cost is minimal.

Good to know. Thanks!

@congw729
Copy link
Copy Markdown
Collaborator

congw729 commented Apr 2, 2026

Do you need to run the L4 tests before merging? I can add a label to trigger the L4 tests.

@pikaxinge
Copy link
Copy Markdown
Contributor Author

@congw729 Not strictly required for this PR, since the changes are control-plane logic and the added tests are CPU/core_model. If you prefer extra safety, I’m happy to run L4 as well.

@NumberWan
Copy link
Copy Markdown
Contributor

Do you need to run the L4 tests before merging? I can add a label to trigger the L4 tests.

I think this PR not required the L4 tests since the omni_coordinator & omni_coord_client_for_stage isn't applied in the current program flow yet, we will finish all the test during the "integration" task

@pikaxinge
Copy link
Copy Markdown
Contributor Author

Hi @congw729 and @linyueqian, sorry for the extra ping.
Quick update on #1899: CI is fully green, all review comments have been addressed, and NumberWan also confirmed L4 is not required for this PR.
If it looks good from your side, could one of you please help with an approval to unblock merge?
If not, no problem at all — I’m happy to make any additional updates you suggest.

@linyueqian linyueqian added the ready label to trigger buildkite CI label Apr 4, 2026
@hsliuustc0106 hsliuustc0106 merged commit c9dbc09 into vllm-project:main Apr 7, 2026
8 checks passed
vraiti pushed a commit to vraiti/vllm-omni that referenced this pull request Apr 9, 2026
…, close/update race, and heartbeat stall (vllm-project#1899)

Signed-off-by: pikaxinge <2392811793@qq.com>
Co-authored-by: Alicia <115451386+congw729@users.noreply.github.com>
bob-021206 pushed a commit to jasonlee-1024/vllm-omni that referenced this pull request Apr 21, 2026
…, close/update race, and heartbeat stall (vllm-project#1899)

Signed-off-by: pikaxinge <2392811793@qq.com>
Co-authored-by: Alicia <115451386+congw729@users.noreply.github.com>
Signed-off-by: bob-021206 <binyan_github@163.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ready label to trigger buildkite CI

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants