
[DisaggEverything] DisaggregatedRequestManager aka Coordinator [1/N] #26178

Draft
NickLucche wants to merge 5 commits into vllm-project:main from NickLucche:disaggev-coordinator

NickLucche (Collaborator) commented Oct 3, 2025

Second step in implementing the "Disaggregated Everything" proposal #22817.
Follows from #24261 (although it is not a strict prerequisite).
This PR focuses on the following component:

image

Which would
image

Note

Since the name Coordinator is quite overloaded, I have renamed the component presented in the original chart to DisaggregatedRequestManager, to give it a clearer identity that is harder to confuse with other components. The change is purely a matter of opinion, and I am very much open to better naming suggestions.

Overview

In very concrete terms, this PR introduces the following:

  • A DisaggregatedRequestManager interface
  • A DisaggregatedRequestManagerFactory for building managers and for registering custom out-of-tree (OOT) managers
  • A DisaggregatedServerMixin class that adds the disaggregation coordination capability and is intended to be plugged into an API endpoint (eg the tokens in<>out endpoint from "[DisaggEverything] Tokens in<>out /generate endpoint" #24261, served on /v1/inference/generate).
  • PrefillLocalDecodeRemoteManager, a specialization of DisaggregatedRequestManager implementing a concrete coordination behavior.
  • Tests to showcase the functionality
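
To make the relationship between the interface and the factory concrete, here is a minimal sketch. It is not the code in this PR: the `handle` method, the `register`/`create` names and the toy `EchoManager` are all assumptions made for illustration; only the two class names come from the description above.

```python
import asyncio
from abc import ABC, abstractmethod
from typing import Callable, Dict


class DisaggregatedRequestManager(ABC):
    """Illustrative interface only; the method name is an assumption."""

    @abstractmethod
    async def handle(self, request: dict) -> dict:
        """Coordinate one request and return the final response."""


class DisaggregatedRequestManagerFactory:
    """Illustrative registry mapping a name to a manager constructor."""

    _registry: Dict[str, Callable[[], DisaggregatedRequestManager]] = {}

    @classmethod
    def register(cls, name: str,
                 ctor: Callable[[], DisaggregatedRequestManager]) -> None:
        # Out-of-tree code calls this to make its manager selectable by name.
        cls._registry[name] = ctor

    @classmethod
    def create(cls, name: str) -> DisaggregatedRequestManager:
        if name not in cls._registry:
            raise ValueError(f"unknown manager {name!r}")
        return cls._registry[name]()


class EchoManager(DisaggregatedRequestManager):
    """Toy out-of-tree manager: tags the request and returns it."""

    async def handle(self, request: dict) -> dict:
        return {**request, "handled_by": "echo"}


# A hypothetical out-of-tree package would register itself at import time.
DisaggregatedRequestManagerFactory.register("echo", EchoManager)

manager = DisaggregatedRequestManagerFactory.create("echo")
response = asyncio.run(manager.handle({"prompt": "hi"}))
```

Keeping the registry keyed by a plain string is one way to let a startup config select a manager by name without importing it directly.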

What this PR does not do:

  • It does NOT plug the DisaggregatedRequestManager capabilities into any API endpoint. No changes at all are expected in vLLM's behavior. This is laying the foundation to allow it, once we figure out the right implementation.

Design

A DisaggregatedRequestManager subclass is a particular implementation of a disaggregated protocol: eg it defines whether and which requests are executed locally, and whether and which are sent for remote execution instead.
One concrete example, the PrefillLocalDecodeRemoteManager, expects a completion request from the LB/IGW, executes the prefill phase of the request locally (on the P instance), then sends the request to a remote D instance for decoding.
Note that a deferred decode selection logic could optionally be injected at this point, reaching back to the EndpointPicker (EPP) to get a remote address for D.
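
The prefill-local, decode-remote flow described above can be sketched roughly as follows. This is an illustration under assumptions, not the PrefillLocalDecodeRemoteManager from this PR: the injected callables, the `PrefillResult` payload and the `decode-N:8000` addresses are hypothetical stand-ins for the local engine, the remote D instance and the EPP.

```python
import asyncio
from dataclasses import dataclass
from typing import Awaitable, Callable, Optional


@dataclass
class PrefillResult:
    request_id: str
    payload: str  # hypothetical stand-in for whatever prefill hands to decode


PrefillFn = Callable[[str, str], Awaitable[PrefillResult]]
DecodeFn = Callable[[str, PrefillResult], Awaitable[str]]
PickFn = Callable[[str], Awaitable[str]]


class PrefillLocalDecodeRemoteSketch:
    def __init__(self, prefill_local: PrefillFn, decode_remote: DecodeFn,
                 default_decoder: str,
                 pick_decoder: Optional[PickFn] = None) -> None:
        self._prefill_local = prefill_local
        self._decode_remote = decode_remote
        self._default_decoder = default_decoder
        self._pick_decoder = pick_decoder

    async def handle(self, request_id: str, prompt: str) -> str:
        # 1. Prefill runs on this (P) instance.
        prefill = await self._prefill_local(request_id, prompt)
        # 2. Deferred selection: ask the picker only once prefill is done.
        if self._pick_decoder is not None:
            address = await self._pick_decoder(request_id)
        else:
            address = self._default_decoder
        # 3. Hand the request over to the remote D instance.
        return await self._decode_remote(address, prefill)


# Stubs standing in for the engine, the remote instance and the EPP.
async def fake_prefill(request_id: str, prompt: str) -> PrefillResult:
    return PrefillResult(request_id, payload=prompt.upper())


async def fake_decode(address: str, prefill: PrefillResult) -> str:
    return f"{prefill.payload} (decoded on {address})"


async def fake_picker(request_id: str) -> str:
    return "decode-1:8000"


eager = PrefillLocalDecodeRemoteSketch(
    fake_prefill, fake_decode, "decode-0:8000")
deferred = PrefillLocalDecodeRemoteSketch(
    fake_prefill, fake_decode, "decode-0:8000", pick_decoder=fake_picker)

eager_out = asyncio.run(eager.handle("req-1", "hello"))
deferred_out = asyncio.run(deferred.handle("req-1", "hello"))
```

In this sketch the only difference between the eager and deferred variants is when the D address becomes known: before the request arrives, or after prefill completes.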

A manager can be placed independently on both P and D, either selected by some startup config or added dynamically at runtime (eg in response, TODO).
Each vLLM instance can have multiple policies active at once: this enables the LB to dynamically switch coordination policy in response to changes in deployment or traffic conditions (eg Eager vs Deferred Decode).

The DisaggregatedServerMixin class is introduced to handle multiple managers, maintain shareable state and decide dispatching priority (priority is currently fixed at startup; allowing it to change dynamically is still to be done).
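
One possible shape for priority-based dispatch across several active managers is sketched below. It is a simplified illustration, not the DisaggregatedServerMixin in this PR: the `accepts` method, the `policy` request field and the rule that a lower number means higher priority are all assumptions.

```python
from typing import Dict, List, Optional, Protocol, Tuple


class Manager(Protocol):
    def accepts(self, request: dict) -> bool: ...


class DisaggregatedServerMixinSketch:
    def __init__(self) -> None:
        # (priority, name, manager), kept sorted; lower number wins.
        self._managers: List[Tuple[int, str, Manager]] = []
        # State that every manager on this instance may read or update.
        self.shared_state: Dict[str, object] = {}

    def add_manager(self, name: str, manager: Manager, priority: int) -> None:
        # Priority is fixed when the manager is added (ie at startup).
        self._managers.append((priority, name, manager))
        self._managers.sort(key=lambda entry: entry[0])

    def select_manager(self, request: dict) -> Optional[str]:
        # The LB may pin a policy per request via a hypothetical field.
        wanted = request.get("policy")
        for _, name, manager in self._managers:
            if wanted is not None and name != wanted:
                continue
            if manager.accepts(request):
                return name
        return None


class AcceptAll:
    def accepts(self, request: dict) -> bool:
        return True


class AcceptLongPrompts:
    def accepts(self, request: dict) -> bool:
        return len(request.get("prompt", "")) > 8


server = DisaggregatedServerMixinSketch()
server.add_manager("deferred_decode", AcceptAll(), priority=10)
server.add_manager("eager_decode", AcceptLongPrompts(), priority=0)

long_choice = server.select_manager({"prompt": "a long prompt here"})
short_choice = server.select_manager({"prompt": "short"})
pinned_choice = server.select_manager(
    {"prompt": "a long prompt here", "policy": "deferred_decode"})
```

Here the request with a long prompt goes to the higher-priority manager, the short one falls through to the catch-all, and the pinned request bypasses priority entirely.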

The manager is also responsible for providing a single abstraction over the connection from the LB's point of view: when the connection is dropped client-side, it should enable cleanup routines to run on both the local and the remote instance.
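
A minimal sketch of that cleanup behavior, assuming the client-side disconnect reaches the manager as an asyncio task cancellation. The `coordinate` function and the four injected callables are hypothetical names, not part of this PR.

```python
import asyncio
from typing import Awaitable, Callable, List, Tuple

Step = Callable[[str], Awaitable[object]]


async def coordinate(request_id: str, run_local: Step, run_remote: Step,
                     abort_local: Step, abort_remote: Step) -> object:
    try:
        await run_local(request_id)
        return await run_remote(request_id)
    except asyncio.CancelledError:
        # Assumption: a dropped client connection cancels this task.
        # Clean up both sides, tolerating failures in either abort call.
        await asyncio.gather(abort_local(request_id),
                             abort_remote(request_id),
                             return_exceptions=True)
        raise  # keep the cancellation visible to the caller


async def demo() -> List[Tuple[str, str]]:
    aborted: List[Tuple[str, str]] = []

    async def run_local(rid: str) -> None:
        await asyncio.sleep(0)

    async def run_remote(rid: str) -> str:
        await asyncio.sleep(10)  # simulates a long remote decode
        return "done"

    async def abort_local(rid: str) -> None:
        aborted.append(("local", rid))

    async def abort_remote(rid: str) -> None:
        aborted.append(("remote", rid))

    task = asyncio.create_task(
        coordinate("req-1", run_local, run_remote, abort_local, abort_remote))
    await asyncio.sleep(0.01)  # let the request reach the remote step
    task.cancel()              # simulate the client dropping the connection
    try:
        await task
    except asyncio.CancelledError:
        pass
    return aborted


cleanup_calls = asyncio.run(demo())
```

Re-raising the cancellation after cleanup keeps the single-connection abstraction honest: the caller still sees the request as cancelled, while both instances have been told to release their resources.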

PrefillLocalDecodeRemote streaming example:
image

Future work

I will iterate on this interface based on feedback, then move on to implement more core features like streaming, and finally plug the Mixin into the /generate endpoint.

Signed-off-by: NickLucche <nlucches@redhat.com>

@mergify mergify bot added the v1 label Oct 3, 2025