Motivation.
The scale_elastic_ep() API is tightly coupled with Ray and conflicts with k8s orchestrators (AIBrix, llm-d, Dynamo, etc.). We should allow scaling decisions to be made externally (i.e., by an orchestrator), letting orchestrators manage pod scaling at the operator level and implement their own scaling policies.
Proposed Change.
Overview
This RFC builds on top of #20323 and is relevant for phase 2 of #27774
Currently, elastic EP scaling forces the use of DPLBAsyncMPClient, which uses the CoreEngineActorManager to create new Ray actors holding EngineCore instances on scale-up/scale-down. However, there should be an alternate path where elastic EP is managed externally, allowing the orchestrator to dynamically create and tear down pods that each correspond to a DP rank. A path already exists for DP engines/ranks to be managed by the orchestrator, via DPAsyncMPClient. Our goal is to extend DPAsyncMPClient to support externally managed elastic EP while ensuring internally managed elastic EP continues to work well.
Figure: Proposed architecture for managing elastic EP externally. The PUSH/PULL side-channel socket lets the orchestrator receive notifications from engines and is needed only in Milestone 2.
Milestone 1: Compatibility with initial EEP support
In this milestone, we aim to support the existing EEP functionality in the main branch introduced via #20775 (corresponding to Milestone 1 of #20323). We will expose a handle_eep_event event-handler endpoint on the frontend EngineClient that the orchestrator can call to notify a DP engine of an elastic EP scaling event. This handler is only available when the user explicitly sets the environment variable VLLM_EEP_EXTERNAL, which tells vLLM to use DPAsyncMPClient with externally managed scale-up/scale-down of elastic EP engines.
In Milestone 1, communication is primarily one-directional: the orchestrator calls the handle_eep_event API on the engines. In Milestone 2, the asynchronous nature of elastic EP state transitions will require bidirectional communication so engines can notify the orchestrator when state transitions complete.
For Milestone 1, the new API signature will be:
def handle_eep_event(payload: EEPEventPayload)
The payload argument itself will specify the event's purpose, containing an EventType field. For this milestone, payloads fall into two categories:
- SCALING_REQUEST: This payload instructs an engine to actively reconfigure. It contains the necessary parameters for scaling up or down (e.g., new world size) and will trigger a call to reinitialize_distributed within the engine.
  - When using DPAsyncMPClient, the orchestrator performs scale-up/scale-down by sending this SCALING_REQUEST payload to the handle_eep_event endpoint of each DP engine.
  - This call is synchronous. Each engine blocks on its reinitialize_distributed call, which allows the orchestrator to know precisely when each rank has completed its reconfiguration.
- NOTIFICATION: This payload informs an engine that an event has already occurred in other engines.
  - For Milestone 1, this is the scale-up marker that would otherwise be sent to the coordinator by the client to synchronize upon all engines completing reconfiguration when managing elastic EP internally via DPLBAsyncMPClient and Ray. Instead, the orchestrator should handle this when using VLLM_EEP_EXTERNAL=1.
  - After the orchestrator confirms all engines have finished reconfiguring (by waiting for all their synchronous handle_eep_event calls to return), it sends a NOTIFICATION payload (e.g., a "scale marker notification") exclusively to the DP rank 0 engine.
  - This notification tells the DP rank 0 engine that all other engines are ready, allowing it to signal the SCALE_ELASTIC_EP marker to the coordinator, completing the scaling process.
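To make the payload shape concrete, here is a minimal sketch of how the two categories could be modeled. EEPEventType, EEPEventPayload, their field names, and the handler body are hypothetical illustrations for this RFC, not existing vLLM classes:

```python
# Hypothetical sketch for this RFC -- EEPEventType and EEPEventPayload are not
# existing vLLM classes; field names are illustrative only.
import enum
from dataclasses import dataclass, field
from typing import Any, Dict, Optional


class EEPEventType(enum.Enum):
    SCALING_REQUEST = "scaling_request"  # ask an engine to reconfigure
    NOTIFICATION = "notification"        # inform an engine of an event elsewhere


@dataclass
class EEPEventPayload:
    event_type: EEPEventType
    # For SCALING_REQUEST: parameters consumed by reinitialize_distributed,
    # e.g. the new data-parallel world size.
    new_data_parallel_size: Optional[int] = None
    # For NOTIFICATION: free-form details, e.g. the scale marker sent to the
    # DP rank 0 engine once all ranks have reconfigured.
    details: Dict[str, Any] = field(default_factory=dict)


def handle_eep_event(payload: EEPEventPayload) -> None:
    """Endpoint called by the orchestrator on each DP engine."""
    if payload.event_type is EEPEventType.SCALING_REQUEST:
        # Blocks until this rank's reinitialize_distributed() completes.
        ...
    elif payload.event_type is EEPEventType.NOTIFICATION:
        # On DP rank 0: forward the SCALE_ELASTIC_EP marker to the coordinator.
        ...
```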
Most importantly, we don't actually need to change the logic in DPLBAsyncMPClient to support this new API aside from removing the mandate for a Ray backend. We could potentially modify DPLBAsyncMPClient to use this new API internally to unify the scaling logic if desired.
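To illustrate the end-to-end Milestone 1 flow, below is a hedged orchestrator-side sketch. The call_engine helper and the payload types are assumptions carried over from the sketch above, standing in for whatever RPC/HTTP mechanism the orchestrator actually uses to reach each engine's handle_eep_event endpoint:

```python
# Hedged orchestrator-side sequence for a scale-up in Milestone 1, reusing the
# hypothetical EEPEventType / EEPEventPayload from the sketch above.
from typing import Any, Callable, List


def scale_up(engines: List[Any],
             new_dp_size: int,
             call_engine: Callable[[Any, str, EEPEventPayload], None]) -> None:
    # 1. Ask every DP engine (including newly created pods) to reconfigure.
    #    Each call blocks until that rank's reinitialize_distributed returns.
    request = EEPEventPayload(
        event_type=EEPEventType.SCALING_REQUEST,
        new_data_parallel_size=new_dp_size,
    )
    for engine in engines:
        call_engine(engine, "handle_eep_event", request)

    # 2. All ranks have finished reconfiguring; notify only the DP rank 0
    #    engine so it can signal the SCALE_ELASTIC_EP marker to the coordinator.
    notification = EEPEventPayload(
        event_type=EEPEventType.NOTIFICATION,
        details={"kind": "scale_marker"},
    )
    call_engine(engines[0], "handle_eep_event", notification)
```

In practice the SCALING_REQUEST calls would likely be issued concurrently and awaited together, since each one blocks for the full duration of reinitialize_distributed.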
Milestone 2: Compatibility with #26278
#26278 introduces asynchronous elastic EP scaling that breaks the scaling process into smaller stages, allowing requests to be served while elastic EP scaling is in progress. We will need to extend the Milestone 1 work to support additional NOTIFICATION payload types that signal state-machine transitions, along with bidirectional communication (perhaps via a side-channel PUSH/PULL socket with the orchestrator, similar to how fault reporting is done in #28296 - maybe we can unify reporting to the orchestrator under some kind of metrics API?). Furthermore, the ParallelConfig class needs to be extended to manage the overlapping stateless process-group addresses/ports that are currently managed internally by DPLBAsyncMPClient.
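As a rough illustration of the Milestone 2 side channel, the sketch below shows how an engine might push state-transition notifications to the orchestrator over a ZMQ PUSH/PULL pair. The socket addresses, message schema, and function names are assumptions for this RFC, not an existing vLLM or orchestrator interface:

```python
# Rough sketch of the Milestone 2 side channel: each engine holds a PUSH
# socket connected to a PULL socket on the orchestrator and reports when a
# stage of the asynchronous elastic EP state machine completes.
import zmq


def report_transition(orchestrator_addr: str, dp_rank: int, new_state: str) -> None:
    """Engine side: notify the orchestrator of a completed state transition."""
    ctx = zmq.Context.instance()
    sock = ctx.socket(zmq.PUSH)
    sock.connect(orchestrator_addr)  # e.g. "tcp://orchestrator:5557" (assumed)
    sock.send_json({"dp_rank": dp_rank, "state": new_state})


def wait_for_state(bind_addr: str, expected_state: str, num_ranks: int) -> None:
    """Orchestrator side: block until all ranks report the expected state."""
    ctx = zmq.Context.instance()
    sock = ctx.socket(zmq.PULL)
    sock.bind(bind_addr)
    seen = set()
    while len(seen) < num_ranks:
        msg = sock.recv_json()
        if msg["state"] == expected_state:
            seen.add(msg["dp_rank"])
```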
Feedback Period.
No response
CC List.
@libertyeagle @ruisearch42 @tzulingk @pavanimajety @benchislett @xinli-sw
Any Other Things.
No response
Before submitting a new issue...
- Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.
