[Bugfix] Fix elastic EP scale-up after scale-down#37357
Conversation
There was a problem hiding this comment.
Code Review
This pull request addresses a critical bug where Ray actors were not being terminated during elastic scale-down operations. This led to zombie processes that retained network connections and caused subsequent scale-up attempts to fail with timeouts. The fix introduces a ray.kill(actor) call, which correctly terminates the actor process before its placement group is removed. This change is well-reasoned, directly solves the described issue, and aligns with the existing shutdown logic within the class, making it a necessary and correct improvement.
…EALER connections that block subsequent scale-up Signed-off-by: Tzu-Ling <tzulingk@nvidia.com>
8f82cb8 to
86b234d
Compare
|
Is this already handled by #37131? |
|
Closing this PR — the problem it addresses has been resolved from a different angle by upstream changes already merged into main. What this PR was fixingScale-up after scale-down would always hang (600s timeout). The root cause:
What upstream already fixed#37131 (merged Mar 20) redesigned the entire scale-up/down lifecycle: scale-down is now driven by a state machine in Additionally #36330 and #37452 addressed related coordinator port races. VerificationTested
The fixes in this PR are still correct (the |
Overview:
Without this fix, elastic EP scale-up after a prior scale-down always fails — either timing out after 600s or hanging indefinitely. Scale-up from a cold start works, but once any scale-down has occurred, subsequent scale-up is broken.
Two bugs combine to cause this:
Bug 1 (
utils.py):CoreEngineActorManager.scale_down_elastic_ep()removed actor handles from tracking lists and deleted placement groups but never calledray.kill(actor). Removed actors blocked forever oninput_queue.get(block=True), keeping their ZMQ DEALER sockets alive with stale identities. On the next scale-up, new engines with reused identities triggeredROUTER_HANDOVERping-pong with the zombies — the client'spoll()never received the readyb""and timed out after 600s.Bug 2 (
core_client.py):_eep_wait_for_setup_switch_complete()was called as a bare coroutine (wait_future = self._eep_wait_for_setup_switch_complete()). The coroutine was not scheduled onto the event loop until explicitly awaited, so it missed events that fired between creation and the firstawait. Wrapping it inasyncio.ensure_future()schedules it immediately, so no events are dropped during the interveningasyncio.gather(*reconfig_futures).Details:
vllm/v1/engine/utils.py— addray.kill(actor)before placement group removal:scale_down_elastic_ep()now captures the popped actor handle and callsray.kill(actor)beforeray.util.remove_placement_group(pg), matching whatshutdown()already doesvllm/v1/engine/core_client.py— wrap wait coroutine inasyncio.ensure_future():_scale_up_elastic_ep) and scale-down (_scale_down_elastic_ep) paths changed fromwait_future = self._eep_wait_for_setup_switch_complete()towait_future = asyncio.ensure_future(...)awaitWhere should the reviewer start?
vllm/v1/engine/utils.py:784-793— theray.kill(actor)fix; compare againstshutdown()which already does this correctlyvllm/v1/engine/core_client.py:1585andL1678— theasyncio.ensure_future()wrapping in both scale pathsTest plan:
data_parallel_size=2, then issued the following sequence ofPOST /engine/scale_elastic_epcalls:dp=2 → 3 → 4 → 3 → 2 → 4 → 2nvidia-smi, confirmed Ray actor process count viaps aux, and validated inference with a live request{"status":"ok"}with correct GPU activation/release at each stepray list actors