Motivation.
The scale_elastic_ep() API is tightly coupled with Ray and conflicts with k8s orchestrators (AIBrix, llm-d, Dynamo, etc.). We should allow scaling decisions to be made externally (i.e., by an orchestrator), letting orchestrators manage pod scaling at the operator level and implement their own scaling policies.
Proposed Change.
Overview
This RFC builds on top of #20323 and is relevant for phase 2 of #27774
Currently, elastic EP scaling forces the use of DPLBAsyncMPClient, which uses the CoreEngineActorManager to create new Ray actors holding EngineCore instances on scale-up/scale-down. However, there should be an alternate path where elastic EP is managed externally, allowing the orchestrator to dynamically create and tear down pods that each correspond to a DP rank. A path already exists for DP engines/ranks to be managed by the orchestrator, via DPAsyncMPClient. Our goal is to extend DPAsyncMPClient to support externally managed elastic EP while ensuring internally managed elastic EP continues to work well.
Figure: Proposed architecture for managing elastic EP externally. The PUSH/PULL side-channel socket lets the orchestrator receive notifications from engines and is needed only in Milestone 2.
Milestone 1: Compatibility with initial EEP support
In this milestone, we aim to support the existing EEP functionality in the main branch introduced via #20775 (corresponding to Milestone 1 of #20323). We will expose a handle_eep_event event-handler endpoint on the frontend EngineClient that the orchestrator can call to notify a DP engine of an elastic EP scaling event. This handler is only available when the user explicitly sets the environment variable VLLM_EEP_EXTERNAL, which tells vLLM to use DPAsyncMPClient with externally managed scale-up/scale-down of elastic EP engines.
In Milestone 1, communication is primarily one-directional: the orchestrator calls the handle_eep_event API on the engines. In Milestone 2, the asynchronous nature of elastic EP state transitions will require bidirectional communication so engines can notify the orchestrator when state transitions complete.
For Milestone 1, the new API signature will be:
def handle_eep_event(payload: EEPEventPayload)
The payload argument itself will specify the event's purpose, containing an EventType field. For this milestone, payloads fall into two categories:
- SCALING_REQUEST: This payload instructs an engine to actively reconfigure. It contains the necessary parameters for scaling up or down (e.g., new world size) and will trigger a call to reinitialize_distributed within the engine.
  - When using DPAsyncMPClient, the orchestrator performs scale-up/scale-down by sending this SCALING_REQUEST payload to the handle_eep_event endpoint of each DP engine.
  - This call is synchronous. Each engine blocks on its reinitialize_distributed call, which allows the orchestrator to know precisely when each rank has completed its reconfiguration.
- NOTIFICATION: This payload informs an engine that an event has already occurred in other engines.
  - For Milestone 1, this is the scale-up marker that would otherwise be sent to the coordinator by the client to synchronize upon all engines completing reconfiguration when managing elastic EP internally via DPLBAsyncMPClient and Ray. Instead, the orchestrator should handle this when using VLLM_EEP_EXTERNAL=1.
  - After the orchestrator confirms all engines have finished reconfiguring (by waiting for all their synchronous handle_eep_event calls to return), it sends a NOTIFICATION payload (e.g., a "scale marker notification") exclusively to the DP rank 0 engine.
  - This notification tells the DP rank 0 engine that all other engines are ready, allowing it to signal the SCALE_ELASTIC_EP marker to the coordinator, completing the scaling process.
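To make the payload shape concrete, here is a minimal sketch of how the two categories could be modeled. EEPEventType, EEPEventPayload, their field names, and the handler body are hypothetical illustrations for this RFC, not existing vLLM classes:

```python
# Hypothetical sketch for this RFC -- EEPEventType and EEPEventPayload are not
# existing vLLM classes; field names are illustrative only.
import enum
from dataclasses import dataclass, field
from typing import Any, Dict, Optional


class EEPEventType(enum.Enum):
    SCALING_REQUEST = "scaling_request"  # ask an engine to reconfigure
    NOTIFICATION = "notification"        # inform an engine of an event elsewhere


@dataclass
class EEPEventPayload:
    event_type: EEPEventType
    # For SCALING_REQUEST: parameters consumed by reinitialize_distributed,
    # e.g. the new data-parallel world size.
    new_data_parallel_size: Optional[int] = None
    # For NOTIFICATION: free-form details, e.g. the scale marker sent to the
    # DP rank 0 engine once all ranks have reconfigured.
    details: Dict[str, Any] = field(default_factory=dict)


def handle_eep_event(payload: EEPEventPayload) -> None:
    """Endpoint called by the orchestrator on each DP engine."""
    if payload.event_type is EEPEventType.SCALING_REQUEST:
        # Blocks until this rank's reinitialize_distributed() completes.
        ...
    elif payload.event_type is EEPEventType.NOTIFICATION:
        # On DP rank 0: forward the SCALE_ELASTIC_EP marker to the coordinator.
        ...
```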
Most importantly, we don't actually need to change the logic in DPLBAsyncMPClient to support this new API aside from removing the mandate for a Ray backend. We could potentially modify DPLBAsyncMPClient to use this new API internally to unify the scaling logic if desired.
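To illustrate the end-to-end Milestone 1 flow, below is a hedged orchestrator-side sketch. The call_engine helper and the payload types are assumptions carried over from the sketch above, standing in for whatever RPC/HTTP mechanism the orchestrator actually uses to reach each engine's handle_eep_event endpoint:

```python
# Hedged orchestrator-side sequence for a scale-up in Milestone 1, reusing the
# hypothetical EEPEventType / EEPEventPayload from the sketch above.
from typing import Any, Callable, List


def scale_up(engines: List[Any],
             new_dp_size: int,
             call_engine: Callable[[Any, str, EEPEventPayload], None]) -> None:
    # 1. Ask every DP engine (including newly created pods) to reconfigure.
    #    Each call blocks until that rank's reinitialize_distributed returns.
    request = EEPEventPayload(
        event_type=EEPEventType.SCALING_REQUEST,
        new_data_parallel_size=new_dp_size,
    )
    for engine in engines:
        call_engine(engine, "handle_eep_event", request)

    # 2. All ranks have finished reconfiguring; notify only the DP rank 0
    #    engine so it can signal the SCALE_ELASTIC_EP marker to the coordinator.
    notification = EEPEventPayload(
        event_type=EEPEventType.NOTIFICATION,
        details={"kind": "scale_marker"},
    )
    call_engine(engines[0], "handle_eep_event", notification)
```

In practice the SCALING_REQUEST calls would likely be issued concurrently and awaited together, since each one blocks for the full duration of reinitialize_distributed.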
Milestone 2: Compatibility with #26278
#26278 introduces asynchronous elastic EP scaling that breaks the scaling process into smaller stages, allowing requests to be served while elastic EP scaling is in progress. We will need to extend the Milestone 1 work to support additional NOTIFICATION payload types that signal state-machine transitions, along with bidirectional communication (perhaps via a side-channel PUSH/PULL socket with the orchestrator, similar to how fault reporting is done in #28296 - maybe we can unify reporting to the orchestrator under some kind of metrics API?). Furthermore, the ParallelConfig class needs to be extended to manage the overlapping stateless process-group addresses/ports that are currently managed internally by DPLBAsyncMPClient.
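As a rough illustration of the Milestone 2 side channel, the sketch below shows how an engine might push state-transition notifications to the orchestrator over a ZMQ PUSH/PULL pair. The socket addresses, message schema, and function names are assumptions for this RFC, not an existing vLLM or orchestrator interface:

```python
# Rough sketch of the Milestone 2 side channel: each engine holds a PUSH
# socket connected to a PULL socket on the orchestrator and reports when a
# stage of the asynchronous elastic EP state machine completes.
import zmq


def report_transition(orchestrator_addr: str, dp_rank: int, new_state: str) -> None:
    """Engine side: notify the orchestrator of a completed state transition."""
    ctx = zmq.Context.instance()
    sock = ctx.socket(zmq.PUSH)
    sock.connect(orchestrator_addr)  # e.g. "tcp://orchestrator:5557" (assumed)
    sock.send_json({"dp_rank": dp_rank, "state": new_state})


def wait_for_state(bind_addr: str, expected_state: str, num_ranks: int) -> None:
    """Orchestrator side: block until all ranks report the expected state."""
    ctx = zmq.Context.instance()
    sock = ctx.socket(zmq.PULL)
    sock.bind(bind_addr)
    seen = set()
    while len(seen) < num_ranks:
        msg = sock.recv_json()
        if msg["state"] == expected_state:
            seen.add(msg["dp_rank"])
```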
Feedback Period.
No response
CC List.
@libertyeagle @ruisearch42 @tzulingk @pavanimajety @benchislett @xinli-sw
Any Other Things.
No response
Before submitting a new issue...
- Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.
