[DisaggEverything] DisaggregatedRequestManager aka Coordinator [1/N] #26178
Draft
NickLucche wants to merge 5 commits into vllm-project:main
Second step in implementing the "Disaggregated Everything" proposal #22817.
Follows from #24261 (although it is not a strict prerequisite).
This PR focuses on the following component:
Which would

Note
As I feel the name `Coordinator` is quite overloaded, I have renamed the component presented in the original chart to `DisaggregatedRequestManager` to try and have a clearer identity that would be harder to confuse. The change is totally opinionated and I am very much open to better naming suggestions.

Overview
In very concrete terms, this PR introduces the following:
- `DisaggregatedRequestManager` interface.
- `DisaggregatedRequestManagerFactory`, a factory builder that is also used for registering custom OOT (out-of-tree) managers.
- `DisaggregatedServerMixin` class, intended to be plugged into some API endpoint (eg serving tokens on `/v1/inference/generate`, from "[DisaggEverything] Tokens in<>out /generate endpoint" #24261), meant to add the disaggregation coordination capability.
- `PrefillLocalDecodeRemoteManager`, a specialization of `DisaggregatedRequestManager` implementing a concrete coordination behavior.

What this PR does not do:
- Plug `DisaggregatedRequestManager` capabilities into any API endpoint. No changes at all are expected in vLLM's behavior. This lays the foundation to allow it, once we figure out the right implementation.

Design
A `DisaggregatedRequestManager` subclass is a particular implementation of a disaggregated protocol: eg it defines which requests (if any) are executed locally and which are sent for remote execution instead.

One concrete example, the `PrefillLocalDecodeRemoteManager`, expects a completion request from the LB/IGW, executes the prefill phase of the request locally (P instance), then sends the request to a remote D instance for decoding.

Optionally, a deferred decode selection logic could be injected at this point, reaching back to the EndpointPicker (EPP) to get a remote address for D.
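To make the flow above concrete, here is a minimal sketch of what a prefill-local/decode-remote manager could look like. It is not the code in this PR: the `Request` shape, the `handle` method, the constructor arguments and the `fake_*` stubs are all assumptions made up for illustration, and the "KV reference" is just a string standing in for whatever the KV transfer layer would hand over.

```python
import abc
import asyncio
from dataclasses import dataclass
from typing import Awaitable, Callable, Optional


@dataclass
class Request:
    """Hypothetical request shape, for illustration only."""
    request_id: str
    prompt_token_ids: list[int]
    max_tokens: int = 3


class DisaggregatedRequestManager(abc.ABC):
    """Hypothetical interface: decides what runs locally and what runs remotely."""

    @abc.abstractmethod
    async def handle(self, request: Request) -> list[int]:
        """Run the request through the protocol and return output token ids."""


class PrefillLocalDecodeRemoteManager(DisaggregatedRequestManager):
    """Prefill on this (P) instance, then hand decoding to a remote D."""

    def __init__(
        self,
        prefill_local: Callable[[Request], Awaitable[str]],
        decode_remote: Callable[[str, Request, str], Awaitable[list[int]]],
        default_decode_addr: str,
        pick_decode_addr: Optional[Callable[[Request], Awaitable[str]]] = None,
    ) -> None:
        self._prefill_local = prefill_local
        self._decode_remote = decode_remote
        self._default_decode_addr = default_decode_addr
        # Optional hook for deferred decode selection (assumed to query the EPP).
        self._pick_decode_addr = pick_decode_addr

    async def handle(self, request: Request) -> list[int]:
        # 1. Prefill locally; the result is an opaque reference to the KV cache.
        kv_ref = await self._prefill_local(request)
        # 2. Eager: use the address known up front. Deferred: ask the picker now.
        if self._pick_decode_addr is not None:
            addr = await self._pick_decode_addr(request)
        else:
            addr = self._default_decode_addr
        # 3. Send the request to the remote D instance for decoding.
        return await self._decode_remote(addr, request, kv_ref)


# --- Stubs standing in for the engine, the remote D and the EPP ---------------
decode_calls: list[tuple[str, str]] = []


async def fake_prefill(request: Request) -> str:
    return f"kv:{request.request_id}"


async def fake_decode(addr: str, request: Request, kv_ref: str) -> list[int]:
    decode_calls.append((addr, kv_ref))
    start = request.prompt_token_ids[-1] + 1
    return [start + i for i in range(request.max_tokens)]


async def fake_epp(request: Request) -> str:
    return "decode-1:8000"


if __name__ == "__main__":
    eager = PrefillLocalDecodeRemoteManager(fake_prefill, fake_decode, "decode-0:8000")
    print(asyncio.run(eager.handle(Request("r1", [1, 2, 3]))))
```

Passing the decode-address picker as an optional callable keeps the eager and deferred variants in one class here; whether the real implementation does that or uses separate managers is left open by the description.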
A manager can be placed independently on both P and D, either set through some startup config or dynamically added at runtime (eg in response , TODO).
Each vLLM instance can have multiple policies active at once: this enables the LB to dynamically switch coordination policy in response to changes in deployment or traffic conditions (eg Eager vs Deferred Decode).
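As a rough illustration of several policies being active on one instance, the sketch below registers two managers by name through a factory and picks one per request. Everything here is hypothetical: the `register`/`create` methods, the policy names, and in particular the idea that the LB communicates its choice through a `policy` field on the request, which the description does not specify.

```python
from typing import Callable, Optional


class DisaggregatedRequestManager:
    """Hypothetical stand-in for the interface introduced by this PR."""

    def describe(self) -> str:
        raise NotImplementedError


class EagerDecodeManager(DisaggregatedRequestManager):
    def describe(self) -> str:
        return "decode target chosen before prefill"


class DeferredDecodeManager(DisaggregatedRequestManager):
    def describe(self) -> str:
        return "decode target chosen after prefill"


class DisaggregatedRequestManagerFactory:
    """Hypothetical registry; out-of-tree managers would register here too."""

    _registry: dict[str, Callable[[], DisaggregatedRequestManager]] = {}

    @classmethod
    def register(cls, name: str, builder: Callable[[], DisaggregatedRequestManager]) -> None:
        if name in cls._registry:
            raise ValueError(f"manager {name!r} is already registered")
        cls._registry[name] = builder

    @classmethod
    def create(cls, name: str) -> DisaggregatedRequestManager:
        if name not in cls._registry:
            raise ValueError(f"unknown manager {name!r}")
        return cls._registry[name]()


DisaggregatedRequestManagerFactory.register("eager_decode", EagerDecodeManager)
DisaggregatedRequestManagerFactory.register("deferred_decode", DeferredDecodeManager)

# Both policies are instantiated on the same instance at startup.
active_managers = {
    name: DisaggregatedRequestManagerFactory.create(name)
    for name in ("eager_decode", "deferred_decode")
}


def select_manager(
    request: dict,
    managers: dict[str, DisaggregatedRequestManager],
    default: str = "eager_decode",
) -> DisaggregatedRequestManager:
    """Pick the manager named by the (assumed) per-request `policy` hint."""
    policy: Optional[str] = request.get("policy")
    return managers[policy if policy in managers else default]


if __name__ == "__main__":
    chosen = select_manager({"policy": "deferred_decode"}, active_managers)
    print(type(chosen).__name__, "->", chosen.describe())
```

Because both managers stay instantiated, switching policy is only a change in the per-request hint and needs no restart of the instance.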
The `DisaggregatedServerMixin` class is introduced to handle multiple managers, maintain shareable state and decide dispatching priority (currently fixed at startup; changing priority dynamically is not yet supported).

The manager is also responsible for providing a single abstraction over the connection from the LB's point of view: when the connection is dropped client-side, it should enable cleanup routines to run both locally and remotely.
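The two responsibilities described above (priority-ordered dispatch and cleanup on client disconnect) could be sketched as follows. This is an assumption-laden toy, not the PR's mixin: `init_disagg`, `dispatch`, `accepts`, `handle` and `cleanup` are hypothetical names, and a client disconnect is modelled as cancellation of the asyncio task that serves the request.

```python
import asyncio


class RecordingManager:
    """Hypothetical manager stub that records which cleanups ran."""

    def __init__(self, name: str, accepted_kind: str) -> None:
        self.name = name
        self.accepted_kind = accepted_kind
        self.cleaned: list[tuple[str, str]] = []

    def accepts(self, request: dict) -> bool:
        return request.get("kind") == self.accepted_kind

    async def handle(self, request: dict) -> str:
        await asyncio.sleep(request.get("work_s", 0))  # stands in for P + remote D
        return f"{self.name}:{request['request_id']}"

    async def cleanup(self, request_id: str) -> None:
        # One abstraction over the connection: release both sides.
        self.cleaned.append(("local", request_id))
        self.cleaned.append(("remote", request_id))


class DisaggregatedServerMixin:
    """Hypothetical mixin: holds managers in a priority order fixed at startup."""

    def init_disagg(self, managers_by_priority: list) -> None:
        self._managers = list(managers_by_priority)

    async def dispatch(self, request: dict):
        for manager in self._managers:  # first accepting manager wins
            if manager.accepts(request):
                try:
                    return await manager.handle(request)
                except asyncio.CancelledError:
                    # Client went away: clean up locally and remotely, then
                    # re-raise so the task is still reported as cancelled.
                    await manager.cleanup(request["request_id"])
                    raise
        return None  # no manager applies; caller falls back to the normal path


class FakeServer(DisaggregatedServerMixin):
    pass


async def simulate_disconnect(server: FakeServer, request: dict) -> str:
    task = asyncio.create_task(server.dispatch(request))
    await asyncio.sleep(0.01)  # let the handler start
    task.cancel()              # models the client dropping the connection
    try:
        await task
    except asyncio.CancelledError:
        return "cancelled"
    return "completed"


if __name__ == "__main__":
    pd = RecordingManager("pd", "completion")
    server = FakeServer()
    server.init_disagg([pd])
    req = {"request_id": "c", "kind": "completion", "work_s": 5}
    print(asyncio.run(simulate_disconnect(server, req)), pd.cleaned)
```

Re-raising `CancelledError` after cleanup matters: swallowing it would make the serving task look as if it had completed normally.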
PrefillLocalDecodeRemote streaming example:

Future work
I will iterate on this interface based on feedback, then move on to implement more core features like streaming, and finally plug the Mixin into the /generate endpoint.