[Bugfix] Fix elastic EP scale-up after scale-down #37357

Closed
tzulingk wants to merge 1 commit into vllm-project:main from tzulingk:fix/elastic-ep-scale-down-zombie-actors

@tzulingk tzulingk commented Mar 17, 2026

Overview:

Without this fix, elastic EP scale-up after a prior scale-down always fails — either timing out after 600s or hanging indefinitely. Scale-up from a cold start works, but once any scale-down has occurred, subsequent scale-up is broken.

Two bugs combine to cause this:

Bug 1 (utils.py): CoreEngineActorManager.scale_down_elastic_ep() removed actor handles from tracking lists and deleted placement groups but never called ray.kill(actor). Removed actors blocked forever on input_queue.get(block=True), keeping their ZMQ DEALER sockets alive with stale identities. On the next scale-up, new engines with reused identities triggered ROUTER_HANDOVER ping-pong with the zombies — the client's poll() never received the ready b"" and timed out after 600s.

Bug 2 (core_client.py): _eep_wait_for_setup_switch_complete() was called as a bare coroutine (wait_future = self._eep_wait_for_setup_switch_complete()). The coroutine was not scheduled onto the event loop until explicitly awaited, so it missed events that fired between creation and the first await. Wrapping it in asyncio.ensure_future() schedules it immediately, so no events are dropped during the intervening asyncio.gather(*reconfig_futures).

Details:

vllm/v1/engine/utils.py — add ray.kill(actor) before placement group removal:

  • scale_down_elastic_ep() now captures the popped actor handle and calls ray.kill(actor) before ray.util.remove_placement_group(pg), matching what shutdown() already does
  • Without this, removed actors held ZMQ DEALER sockets indefinitely, causing the 600s timeout on subsequent scale-up
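
The zombie behaviour can be reproduced in miniature without Ray. The sketch below is an analogy only, not vLLM code: a thread stands in for an engine actor, and `Worker`, `scale_down_untracked`, and `scale_down_with_stop` are hypothetical names. The explicit `stop()` plays the role that `ray.kill(actor)` plays in the real fix.

```python
import queue
import threading


class Worker:
    """Stand-in for an engine actor: blocks on its input queue until told to stop."""

    _STOP = object()  # sentinel playing the role of ray.kill(actor)

    def __init__(self) -> None:
        self.input_queue = queue.Queue()
        # daemon=True so a leaked worker cannot keep the interpreter alive
        self.thread = threading.Thread(target=self._run, daemon=True)
        self.thread.start()

    def _run(self) -> None:
        while True:
            item = self.input_queue.get(block=True)  # blocks forever if nothing arrives
            if item is Worker._STOP:
                return

    def stop(self) -> None:
        self.input_queue.put(Worker._STOP)
        self.thread.join(timeout=5)

    def is_alive(self) -> bool:
        return self.thread.is_alive()


def scale_down_untracked(workers: list) -> Worker:
    """Buggy pattern: forget the handle but never terminate the worker."""
    return workers.pop()


def scale_down_with_stop(workers: list) -> Worker:
    """Fixed pattern: terminate the worker before releasing its resources."""
    worker = workers.pop()
    worker.stop()
    return worker
```

Dropping a handle from the tracking list does nothing to the worker itself; it keeps running (and, in the real system, keeps its sockets open) until something terminates it explicitly.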

vllm/v1/engine/core_client.py — wrap wait coroutine in asyncio.ensure_future():

  • Both scale-up (_scale_up_elastic_ep) and scale-down (_scale_down_elastic_ep) paths changed from wait_future = self._eep_wait_for_setup_switch_complete() to wait_future = asyncio.ensure_future(...)
  • Ensures the poller is scheduled immediately and does not miss ready signals that arrive before the first explicit await
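
The difference between a bare coroutine and a scheduled one can be shown with a small standalone example. This is a sketch under assumptions, not the code in core_client.py: `ReadySignal`, `reconfigure`, and `scale` are hypothetical names, and the signal is deliberately edge-triggered so that a waiter which has not started yet misses it.

```python
import asyncio


class ReadySignal:
    """Edge-triggered notifier: only waiters registered before notify() see it."""

    def __init__(self) -> None:
        self._waiters = []

    def notify(self) -> None:
        for fut in self._waiters:
            if not fut.done():
                fut.set_result("ready")
        self._waiters.clear()

    async def wait(self) -> str:
        fut = asyncio.get_running_loop().create_future()
        self._waiters.append(fut)  # registration happens only once the coroutine runs
        return await fut


async def reconfigure(signal: ReadySignal) -> None:
    """Stand-in for the reconfiguration step that fires the ready signal."""
    await asyncio.sleep(0)  # yield once, letting already-scheduled tasks start
    signal.notify()


async def scale(schedule_eagerly: bool) -> str:
    signal = ReadySignal()
    if schedule_eagerly:
        waiter = asyncio.ensure_future(signal.wait())  # scheduled right away
    else:
        waiter = signal.wait()  # bare coroutine: nothing runs until it is awaited
    await reconfigure(signal)
    try:
        return await asyncio.wait_for(waiter, timeout=0.2)
    except asyncio.TimeoutError:
        return "timeout"
```

With `schedule_eagerly=True` the waiter registers itself during the first yield inside `reconfigure` and receives the signal. With `schedule_eagerly=False` the waiter only starts after the signal has already fired, so the final wait times out.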

Where should the reviewer start?

  • vllm/v1/engine/utils.py:784-793 — the ray.kill(actor) fix; compare against shutdown() which already does this correctly
  • vllm/v1/engine/core_client.py:1585 and :1678 — the asyncio.ensure_future() wrapping in both scale paths

Test plan:

  • Deployed on AKS (4×A100-SXM4-80GB) with data_parallel_size=2, then issued the following sequence of POST /engine/scale_elastic_ep calls:
    • dp=2 → 3 → 4 → 3 → 2 → 4 → 2
  • After each step: verified GPU memory via nvidia-smi, confirmed Ray actor process count via ps aux, and validated inference with a live request
  • All 6 transitions returned {"status":"ok"} with correct GPU activation/release at each step
  • After each scale-down, confirmed no zombie Ray actor processes remained via ray list actors
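
To make explicit which steps in that sequence exercise the regression, the transitions can be classified with a small helper. This is an illustrative sketch; `classify_transitions` is a hypothetical name and is not part of the test tooling used here.

```python
def classify_transitions(dp_sizes):
    """Return (old, new, direction, up_after_down) for each consecutive pair."""
    steps = []
    seen_scale_down = False
    for old, new in zip(dp_sizes, dp_sizes[1:]):
        if new > old:
            direction = "up"
        elif new < old:
            direction = "down"
        else:
            direction = "none"
        # A scale-up that follows any earlier scale-down is the case this PR targets.
        steps.append((old, new, direction, direction == "up" and seen_scale_down))
        if direction == "down":
            seen_scale_down = True
    return steps


steps = classify_transitions([2, 3, 4, 3, 2, 4, 2])
```

For the sequence above, only the fifth transition (dp=2 → 4) is a scale-up after a prior scale-down.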

@tzulingk tzulingk requested a review from njhill as a code owner March 17, 2026 23:43
@mergify mergify Bot added the v1 and bug labels Mar 17, 2026

@gemini-code-assist gemini-code-assist Bot left a comment

Contributor

Code Review

This pull request addresses a critical bug where Ray actors were not being terminated during elastic scale-down operations. This led to zombie processes that retained network connections and caused subsequent scale-up attempts to fail with timeouts. The fix introduces a ray.kill(actor) call, which correctly terminates the actor process before its placement group is removed. This change is well-reasoned, directly solves the described issue, and aligns with the existing shutdown logic within the class, making it a necessary and correct improvement.

…EALER connections that block subsequent scale-up

Signed-off-by: Tzu-Ling <tzulingk@nvidia.com>
@tzulingk tzulingk force-pushed the fix/elastic-ep-scale-down-zombie-actors branch from 8f82cb8 to 86b234d March 19, 2026 05:51
@tzulingk tzulingk changed the title from "[Bugfix] Kill Ray actors on elastic EP scale-down to prevent zombie DEALER connections that block subsequent scale-up" to "[Bugfix] Fix elastic EP scale-up after scale-down" Mar 19, 2026
@tlrmchlsmth

Member

Is this already handled by #37131?

@tzulingk

Author

Closing this PR — the problem it addresses has been resolved from a different angle by upstream changes already merged into main.

What this PR was fixing

Scale-up after scale-down would always hang (600s timeout). The root causes:

  1. Zombie actors: scale_down_elastic_ep() never called ray.kill(actor), leaving actors blocked on input_queue.get(block=True) indefinitely.
  2. ZMQ identity conflict: zombie actors kept their DEALER connections alive with stale identities. New actors with reused identities triggered endless ROUTER_HANDOVER ping-pong, so the client's poll() never received the ready b"" signal.
  3. Asyncio race: _eep_wait_for_setup_switch_complete() was instantiated as a bare coroutine instead of asyncio.ensure_future(), missing events fired before the first await.

What upstream already fixed

#37131 (merged Mar 20) redesigned the entire scale-up/down lifecycle: scale-down is now driven by a state machine in elastic_state.py where workers transition through ACTIVE → SWITCHING → COMPLETE, with teardown (including PyNCCL cleanup) handled as part of the state transition. This eliminates the gap where zombie actors could exist — not by explicitly killing them, but by restructuring ownership so the state machine never lets them become zombies in the first place.
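
The idea of tying teardown to a state transition can be sketched as follows. This is an illustration of the pattern only and is not taken from elastic_state.py: apart from the three state names mentioned above, every name here (`WorkerState`, `ScaleDownWorker`, `advance`, the `teardown` hook) is hypothetical.

```python
from enum import Enum, auto


class WorkerState(Enum):
    ACTIVE = auto()
    SWITCHING = auto()
    COMPLETE = auto()


# Only forward transitions are legal in this sketch.
_NEXT_STATE = {
    WorkerState.ACTIVE: WorkerState.SWITCHING,
    WorkerState.SWITCHING: WorkerState.COMPLETE,
}


class ScaleDownWorker:
    def __init__(self, teardown):
        self.state = WorkerState.ACTIVE
        self._teardown = teardown  # hypothetical cleanup hook

    def advance(self):
        nxt = _NEXT_STATE.get(self.state)
        if nxt is None:
            raise RuntimeError(f"no transition out of {self.state.name}")
        if nxt is WorkerState.COMPLETE:
            # Cleanup runs as part of the transition, so a worker cannot
            # reach COMPLETE while still holding its resources.
            self._teardown()
        self.state = nxt
        return self.state
```

Because a worker can only reach `COMPLETE` by passing through the cleanup call, there is no code path in this sketch that drops a worker while it is still holding resources.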

Additionally, #36330 and #37452 addressed related coordinator port races.

Verification

Tested vllm main HEAD 2488a82f8 (includes #37131, does not include this PR) integrated with Dynamo's elastic EP implementation. Full 6-step scale sequence on AKS (4×A100-SXM4-80GB):

| Step | Transition | Result |
|---|---|---|
| Baseline | dp=2 | |
| 1 | dp=2 → dp=3 | |
| 2 | dp=3 → dp=4 | |
| 3 | dp=4 → dp=3 (scale down) | |
| 4 | dp=3 → dp=2 (scale down) | |
| 5 | dp=2 → dp=4 (scale up after down) | ✅ no hang |
| 6 | dp=4 → dp=2 | |

The fixes in this PR are still correct (the ray.kill call and ensure_future wrapping are valid improvements), but the failure mode they target no longer manifests on main. Closing as superseded by #37131.

@tzulingk tzulingk closed this Mar 24, 2026