
[RFC]: Externally managed elastic EP (decouple from Ray backend) #28243

@galletas1712

Description

Motivation.

The scale_elastic_ep() API is tightly coupled with Ray and conflicts with k8s orchestrators (AIBrix, llm-d, Dynamo, etc.). We should allow scaling decisions to be made externally (i.e., by an orchestrator), so that orchestrators can manage pod scaling at the operator level and implement their own scaling policies.


Proposed Change.

Overview

This RFC builds on top of #20323 and is relevant to phase 2 of #27774.

Currently, elastic EP scaling forces the use of DPLBAsyncMPClient, which uses the CoreEngineActorManager to create new Ray actors holding EngineCore instances on scale-up/scale-down. However, there should be an alternate path where elastic EP is managed externally, allowing the orchestrator to dynamically create and tear down pods that each correspond to a DP rank. A path already exists for DP engines/ranks to be managed by the orchestrator, via DPAsyncMPClient. Our goal is to extend DPAsyncMPClient to support externally managed elastic EP while keeping internally managed elastic EP working as before.

Proposed architecture

(Figure) Proposed architecture for managing elastic EP externally. The PUSH/PULL side-channel socket is for the orchestrator to receive notifications from engines and is needed only in Milestone 2.


Milestone 1: Compatibility with initial EEP support

In this milestone, we aim to support the existing EEP functionality on the main branch from #20775 (corresponding to Milestone 1 of #20323). We will expose a handle_eep_event endpoint API on the frontend EngineClient that can be used to notify the DP engines of an elastic EP scaling event. This event handler can only be used if the user explicitly sets an environment variable, VLLM_EEP_EXTERNAL, which tells vLLM to use DPAsyncMPClient with externally managed scale-up/scale-down of elastic EP engines.
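
As a rough illustration (the module path of the clients and the exact frontend wiring are assumptions, not part of this RFC), client selection could look like:

    # Illustrative only: selecting the DP client based on VLLM_EEP_EXTERNAL.
    # The exact plumbing in the frontend may differ from what is shown here.
    import os

    def choose_dp_client_cls():
        # Assumed location of the clients as of current main; adjust as needed.
        from vllm.v1.engine.core_client import DPAsyncMPClient, DPLBAsyncMPClient

        if os.environ.get("VLLM_EEP_EXTERNAL") == "1":
            # Externally managed elastic EP: the orchestrator owns
            # scale-up/scale-down of the per-DP-rank pods; no Ray actor manager.
            return DPAsyncMPClient
        # Internally managed elastic EP (existing path, Ray backend required).
        return DPLBAsyncMPClient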

In Milestone 1, communication is primarily one-directional: the orchestrator calls the handle_eep_event API on the engines. In Milestone 2, the asynchronous nature of elastic EP state transitions will require bidirectional communication so engines can notify the orchestrator when state transitions complete.

For Milestone 1, the new API signature will be:

def handle_eep_event(payload: EEPEventPayload)

The payload argument itself will specify the event's purpose via an EventType field. For this milestone, payloads fall into two categories (a sketch of the payload shape follows this list):

  • SCALING_REQUEST: This payload instructs an engine to actively reconfigure. It contains the necessary parameters for scaling up or down (e.g., new world size) and will trigger a call to reinitialize_distributed within the engine.

    • When using DPAsyncMPClient, the orchestrator performs scale-up/scale-down by sending this SCALING_REQUEST payload to the handle_eep_event endpoint of each DP engine.
    • This call is synchronous. Each engine blocks on its reinitialize_distributed call, which allows the orchestrator to know precisely when each rank has completed its reconfiguration.
  • NOTIFICATION: This payload informs an engine that an event has already occurred in other engines.

    • For Milestone 1, this is the scale-up marker that, when elastic EP is managed internally via DPLBAsyncMPClient and Ray, the client would send to the coordinator to synchronize once all engines have completed reconfiguration.
      • Instead, the orchestrator should handle this when VLLM_EEP_EXTERNAL=1 is set.
    • After the orchestrator confirms all engines have finished reconfiguring (by waiting for all their synchronous handle_eep_event calls to return), it sends a NOTIFICATION payload (e.g., a "scale marker notification") exclusively to the DP rank 0 engine.
      • This notification tells the DP rank 0 engine that all other engines are ready, allowing it to signal the SCALE_ELASTIC_EP marker to the coordinator, completing the scaling process.
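
A minimal sketch of what the payload could look like. Only the presence of an EventType field is specified above, so the concrete field and enum names below are illustrative:

    # Hypothetical shape of the Milestone 1 payload; field names are illustrative.
    import enum
    from dataclasses import dataclass
    from typing import Optional

    class EEPEventType(enum.Enum):
        SCALING_REQUEST = "scaling_request"   # engine must reconfigure itself
        NOTIFICATION = "notification"         # something happened in other engines

    @dataclass
    class EEPEventPayload:
        event_type: EEPEventType
        # For SCALING_REQUEST: parameters needed by reinitialize_distributed,
        # e.g. the new data-parallel world size.
        new_dp_size: Optional[int] = None
        # For NOTIFICATION: e.g. the scale-up marker delivered to DP rank 0 once
        # the orchestrator has seen every engine's SCALING_REQUEST call return.
        notification_kind: Optional[str] = None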

Most importantly, we don't need to change the logic in DPLBAsyncMPClient to support this new API, aside from removing the requirement for a Ray backend. If desired, we could later modify DPLBAsyncMPClient to use this new API internally to unify the scaling logic.
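
Reusing the payload sketch above, the orchestrator-side flow for a Milestone 1 scale-up could look roughly like the following. post_eep_event is a hypothetical stand-in for however the orchestrator reaches each engine's handle_eep_event endpoint (HTTP, RPC, ...):

    import asyncio

    async def scale_up(engines, new_dp_size, post_eep_event):
        # 1. Send SCALING_REQUEST to every DP engine. Each handle_eep_event call
        #    blocks on reinitialize_distributed, so its return means that rank
        #    has finished reconfiguring.
        await asyncio.gather(*(
            post_eep_event(engine, EEPEventPayload(
                event_type=EEPEventType.SCALING_REQUEST,
                new_dp_size=new_dp_size))
            for engine in engines))

        # 2. All ranks have reconfigured; notify DP rank 0 so it can signal the
        #    SCALE_ELASTIC_EP marker to the coordinator and complete the scaling.
        await post_eep_event(engines[0], EEPEventPayload(
            event_type=EEPEventType.NOTIFICATION,
            notification_kind="scale_marker"))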


Milestone 2: Compatibility with #26278

#26278 introduces asynchronous elastic EP scaling that breaks the scaling process into smaller stages, allowing requests to be served while elastic EP scaling is in progress. We will need to extend the work from Milestone 1 to support additional NOTIFICATION payload types that signal state-machine transitions, along with bidirectional communication, perhaps via a side-channel PUSH/PULL socket with the orchestrator, similar to how fault reporting is done in #28296 (we may be able to unify reporting to the orchestrator under some kind of metrics API). Furthermore, the ParallelConfig class needs to be extended to support managing the overlapping stateless process group addresses/ports that are currently managed internally in DPLBAsyncMPClient.
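
A rough sketch of what the side channel could look like, assuming a ZMQ PUSH/PULL pair (the socket type, message schema, and handle_stage_completion are assumptions, not something this RFC fixes yet):

    import zmq

    def notify_orchestrator(push_socket: zmq.Socket, dp_rank: int, stage: str) -> None:
        # Engine side: report completion of one asynchronous EEP stage.
        push_socket.send_json({"dp_rank": dp_rank,
                               "event": "eep_stage_done",
                               "stage": stage})

    def orchestrator_loop(pull_addr: str) -> None:
        # Orchestrator side: collect per-rank completion events and drive the
        # next state-machine transition once every rank has reported.
        ctx = zmq.Context.instance()
        pull = ctx.socket(zmq.PULL)
        pull.bind(pull_addr)
        while True:
            msg = pull.recv_json()
            # handle_stage_completion is a placeholder for the orchestrator's
            # own scaling state machine.
            handle_stage_completion(msg["dp_rank"], msg["stage"])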


Feedback Period.

No response

CC List.

@libertyeagle @ruisearch42 @tzulingk @pavanimajety @benchislett @xinli-sw

Any Other Things.

No response
