[Bugfix] Fix elastic EP scale-up after scale-down by tzulingk · Pull Request #37357 · vllm-project/vllm

tzulingk · 2026-03-17T23:43:20Z

Overview:

Without this fix, elastic EP scale-up after a prior scale-down always fails — either timing out after 600s or hanging indefinitely. Scale-up from a cold start works, but once any scale-down has occurred, subsequent scale-up is broken.

Two bugs combine to cause this:

Bug 1 (utils.py): CoreEngineActorManager.scale_down_elastic_ep() removed actor handles from its tracking lists and deleted their placement groups but never called ray.kill(actor). The removed actors blocked forever on input_queue.get(block=True), keeping their ZMQ DEALER sockets alive with stale identities. On the next scale-up, new engines that reused those identities triggered a ROUTER_HANDOVER ping-pong with the zombies, so the client's poll() never received the ready b"" signal and timed out after 600s.

Bug 2 (core_client.py): _eep_wait_for_setup_switch_complete() was called as a bare coroutine (wait_future = self._eep_wait_for_setup_switch_complete()). The coroutine was not scheduled onto the event loop until explicitly awaited, so it missed events that fired between creation and the first await. Wrapping it in asyncio.ensure_future() schedules it immediately, so no events are dropped during the intervening asyncio.gather(*reconfig_futures).
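The scheduling difference behind Bug 2 can be reproduced with the standard library alone. The sketch below is not the vLLM code: `waiter` stands in for _eep_wait_for_setup_switch_complete(), `reconfig` stands in for one of the reconfig futures, and every name in it is hypothetical. It only records the order in which things run.

```python
import asyncio


async def waiter(log):
    # Hypothetical stand-in for the wait coroutine.
    log.append("waiter started")
    await asyncio.sleep(0)
    return "ready"


async def reconfig(log):
    # Hypothetical stand-in for one reconfiguration future.
    await asyncio.sleep(0)
    log.append("reconfig done")


async def bare(log):
    # Calling the coroutine function only creates a coroutine object;
    # its body does not start until it is awaited below.
    wait_future = waiter(log)
    await asyncio.gather(reconfig(log))
    log.append("gather returned")
    return await wait_future


async def scheduled(log):
    # ensure_future wraps the coroutine in a Task, so it starts running
    # as soon as this function yields to the event loop inside gather.
    wait_future = asyncio.ensure_future(waiter(log))
    await asyncio.gather(reconfig(log))
    log.append("gather returned")
    return await wait_future


def run(fn):
    log = []
    result = asyncio.run(fn(log))
    return result, log
```

Calling `run(bare)` shows the waiter starting only after gather has returned, while `run(scheduled)` shows it starting before. Both variants still return "ready" here because this toy waiter does not depend on external events.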

Details:

vllm/v1/engine/utils.py — add ray.kill(actor) before placement group removal:

- scale_down_elastic_ep() now captures the popped actor handle and calls ray.kill(actor) before ray.util.remove_placement_group(pg), matching what shutdown() already does.
- Without this, removed actors held ZMQ DEALER sockets indefinitely, causing the 600s timeout on subsequent scale-up.
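The teardown order described above can be sketched as follows. This is an illustration, not the actual body of scale_down_elastic_ep(): the function name `scale_down`, its parameters, and the plain lists of actors and placement groups are assumptions made for the example. Only ray.kill() and ray.util.remove_placement_group() come from the PR description. The Ray module is passed in as an argument so the ordering can be exercised without a running cluster.

```python
from typing import Any, List


def scale_down(
    actors: List[Any],
    placement_groups: List[Any],
    new_size: int,
    ray_module: Any,
) -> None:
    """Hypothetical helper: shrink the tracked actors down to new_size.

    ray_module is expected to expose kill() and util.remove_placement_group(),
    as the real `ray` package does.
    """
    if new_size < 0:
        raise ValueError("new_size must be non-negative")
    while len(actors) > new_size:
        # Keep the popped handle instead of discarding it.
        actor = actors.pop()
        pg = placement_groups.pop()
        # Terminate the actor process first so it cannot linger as a zombie...
        ray_module.kill(actor)
        # ...then release the resources it was occupying.
        ray_module.util.remove_placement_group(pg)
```

Against a live cluster this would be called as `scale_down(actors, pgs, new_size, ray)` after `import ray`. The key point is that the popped handle is kept long enough to be killed before its placement group is released.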

vllm/v1/engine/core_client.py — wrap wait coroutine in asyncio.ensure_future():

- Both the scale-up (_scale_up_elastic_ep) and scale-down (_scale_down_elastic_ep) paths changed from wait_future = self._eep_wait_for_setup_switch_complete() to wait_future = asyncio.ensure_future(...).
- This ensures the poller is scheduled immediately and does not miss ready signals that arrive before the first explicit await.

Where should the reviewer start?

- vllm/v1/engine/utils.py:784-793 for the ray.kill(actor) fix; compare against shutdown(), which already does this correctly
- vllm/v1/engine/core_client.py:1585 and :1678 for the asyncio.ensure_future() wrapping in both scale paths

Test plan:

Deployed on AKS (4×A100-SXM4-80GB) with data_parallel_size=2, then issued the following sequence of POST /engine/scale_elastic_ep calls:
- dp=2 → 3 → 4 → 3 → 2 → 4 → 2
- After each step: verified GPU memory via nvidia-smi, confirmed the Ray actor process count via ps aux, and validated inference with a live request
- All 6 transitions returned {"status":"ok"} with correct GPU activation/release at each step
- After each scale-down: confirmed via ray list actors that no zombie Ray actor processes remained

gemini-code-assist

Code Review

This pull request addresses a critical bug where Ray actors were not being terminated during elastic scale-down operations. This led to zombie processes that retained network connections and caused subsequent scale-up attempts to fail with timeouts. The fix introduces a ray.kill(actor) call, which correctly terminates the actor process before its placement group is removed. This change is well-reasoned, directly solves the described issue, and aligns with the existing shutdown logic within the class, making it a necessary and correct improvement.


tlrmchlsmth · 2026-03-20T20:44:07Z

Is this already handled by #37131?

tzulingk · 2026-03-24T03:01:19Z

Closing this PR — the problem it addresses has been resolved from a different angle by upstream changes already merged into main.

What this PR was fixing

Scale-up after scale-down would always hang (600s timeout). The root cause:

- Zombie actors: scale_down_elastic_ep() never called ray.kill(actor), leaving actors blocked on input_queue.get(block=True) indefinitely.
- ZMQ identity conflict: zombie actors kept their DEALER connections alive with stale identities. New actors with reused identities triggered endless ROUTER_HANDOVER ping-pong, so the client's poll() never received the ready b"" signal.
- Asyncio race: _eep_wait_for_setup_switch_complete() was instantiated as a bare coroutine instead of being wrapped in asyncio.ensure_future(), so it missed events fired before the first await.

What upstream already fixed

#37131 (merged Mar 20) redesigned the entire scale-up/down lifecycle: scale-down is now driven by a state machine in elastic_state.py where workers transition through ACTIVE → SWITCHING → COMPLETE, with teardown (including PyNCCL cleanup) handled as part of the state transition. This eliminates the gap where zombie actors could exist — not by explicitly killing them, but by restructuring ownership so the state machine never lets them become zombies in the first place.

Additionally #36330 and #37452 addressed related coordinator port races.

Verification

Tested vllm main HEAD 2488a82f8 (includes #37131, does not include this PR) integrated with Dynamo's elastic EP implementation. Full 6-step scale sequence on AKS (4×A100-SXM4-80GB):

| Step | Transition | Result |
|---|---|---|
| Baseline | dp=2 | ✅ |
| 1 | dp=2 → dp=3 | ✅ |
| 2 | dp=3 → dp=4 | ✅ |
| 3 | dp=4 → dp=3 (scale down) | ✅ |
| 4 | dp=3 → dp=2 (scale down) | ✅ |
| 5 | dp=2 → dp=4 (scale up after down) | ✅ no hang |
| 6 | dp=4 → dp=2 | ✅ |

The fixes in this PR are still correct (the ray.kill call and ensure_future wrapping are valid improvements), but the failure mode they target no longer manifests on main. Closing as superseded by #37131.

tzulingk requested a review from njhill as a code owner March 17, 2026 23:43

mergify Bot added the v1 and bug labels Mar 17, 2026

gemini-code-assist Bot reviewed Mar 17, 2026


[Bugfix] Kill Ray actors on elastic EP scale-down to prevent zombie D…

86b234d

…EALER connections that block subsequent scale-up Signed-off-by: Tzu-Ling <tzulingk@nvidia.com>

tzulingk force-pushed the fix/elastic-ep-scale-down-zombie-actors branch from 8f82cb8 to 86b234d Compare March 19, 2026 05:51

tzulingk changed the title ~~[Bugfix] Kill Ray actors on elastic EP scale-down to prevent zombie DEALER connections that block subsequent scale-up~~ [Bugfix] Fix elastic EP scale-up after scale-down Mar 19, 2026

tzulingk closed this Mar 24, 2026

tzulingk wants to merge 1 commit into vllm-project:main from tzulingk:fix/elastic-ep-scale-down-zombie-actors
