[Core] Encoder separation for Encode-Prefill-Decode Disaggregation #25233
Merged

ywang96 merged 78 commits into vllm-project:main on Nov 12, 2025
Signed-off-by: n00909098 <nguyen.kha.long@huawei.com>
Signed-off-by: knlnguyen1802 <knlnguyen1802@gmail.com>
Signed-off-by: herotai214 <herotai214@gmail.com>
ywang96 (Member) approved these changes on Nov 11, 2025:
Per offline discussion, since we're not going to include this in the 0.11.1 release, let's update this PR and get it in if all CI tests pass!
Very much appreciate everyone's efforts on pushing this PR to the finishing line 🚀 🚀 🚀
Contributor: We really appreciate the great work in this PR. We'd really like to have it merged into the main branch.
Collaborator: Nice! @fake0fan
Do we have any plan to refine the Whisper implementation to benefit from EPD disaggregation?
Disaggregated Encoder
A disaggregated encoder runs the vision-encoder stage of a multimodal LLM in a process separate from the prefill/decode stage. Deploying the two stages as independent vLLM instances lets each stage be scaled and provisioned independently.
We received a series of feedback on #21740 (many thanks to @LastZhabka for the excellent work on the initial version), and based on this feedback we updated our overall design.
Design doc: https://docs.google.com/document/d/1aed8KtC6XkXtdoV87pWT0a8OJlZ-CpnuLLzmR8l9BAE
@ywang96 @NickLucche @DarkLight1337 @WoosukKwon
Closes #20799
1. New Encoder Cache Connector
Disaggregated encoding in v1 is enabled by an explicit EC connector abstraction. `ECConnectorBase` defines a clear split of responsibilities between the scheduler side and the worker side, and provides the lifecycle hooks required to check, load, save, and track the encoder cache (EC) across processes. `ECConnector` is the interface for retrieving EC caches produced by the encoder.

Scheduler role (lives in the scheduler process) – checks cache existence and schedules loads:

- `check_caches_exist(request) -> list[bool]`: probe whether an EC entry exists for each multimodal datum in the request.
- `update_state_after_alloc(request, index)`: record EC loading intent once encoder cache space is allocated for a given mm datum.
- `build_connector_meta(scheduler_output) -> ECConnectorMetadata`: build the per-step metadata consumed by workers; also resets the connector's internal state for the next step.
- `request_finished(request) -> (bool, Optional[dict])`: optionally indicate that an async save/send is in progress and attach metadata.
- `update_connector_output(connector_output)`: ingest worker-side completion info.

Worker role (lives in each worker process) – loads the embeddings into memory:

- `bind_connector_metadata(meta)` / `clear_connector_metadata()`: attach/detach the per-step metadata from the scheduler before/after the forward pass.
- `start_load_caches(**kwargs)`: load the required EC into the local encoder cache before embeddings are gathered.
- `save_caches(**kwargs)`: persist/transfer EC produced on the encoder.
- `wait_for_save()`: block until outstanding saves complete (if any).
- `get_finished(finished_req_ids) -> (set[str] | None, set[str] | None)`: report completion of async sends/receives to help scheduler bookkeeping.

2. Change highlights (Scheduler and GPUModelRunner)
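Before the concrete call sites below, the hook sequence from section 1 can be sketched with a standalone toy connector. This is illustrative only: the method names mirror the interface above, but the class, the in-memory store, and all bodies are hypothetical stand-ins, not the vLLM implementation.

```python
# Toy illustration of the EC connector hook sequence (NOT the vLLM code).
# A single in-memory dict stands in for the external EC store.
from typing import Any, Optional

EXTERNAL_STORE: dict[str, list[float]] = {"img-abc123": [0.1, 0.2, 0.3]}

class ToyECConnector:
    """Mirrors the scheduler-side and worker-side hooks described above."""

    def __init__(self) -> None:
        self._pending_loads: list[str] = []              # scheduler-side intent
        self._metadata: Optional[dict[str, Any]] = None  # worker-side, per step

    # ---- scheduler side ----
    def check_caches_exist(self, mm_hashes: list[str]) -> list[bool]:
        return [h in EXTERNAL_STORE for h in mm_hashes]

    def update_state_after_alloc(self, mm_hash: str) -> None:
        self._pending_loads.append(mm_hash)  # EC space allocated; plan a load

    def build_connector_meta(self) -> dict[str, Any]:
        meta = {"load": list(self._pending_loads)}
        self._pending_loads.clear()          # reset state for the next step
        return meta

    # ---- worker side ----
    def bind_connector_metadata(self, meta: dict[str, Any]) -> None:
        self._metadata = meta

    def start_load_caches(self, encoder_cache: dict[str, list[float]]) -> None:
        for h in self._metadata["load"]:
            encoder_cache[h] = EXTERNAL_STORE[h]  # populate before gathering

    def clear_connector_metadata(self) -> None:
        self._metadata = None

# One scheduling step, end to end:
conn = ToyECConnector()
assert conn.check_caches_exist(["img-abc123"]) == [True]
conn.update_state_after_alloc("img-abc123")
meta = conn.build_connector_meta()      # shipped inside SchedulerOutput

encoder_cache: dict[str, list[float]] = {}
conn.bind_connector_metadata(meta)
conn.start_load_caches(encoder_cache)
conn.clear_connector_metadata()
print(encoder_cache["img-abc123"])      # [0.1, 0.2, 0.3]
```

The design point the sketch captures: the scheduler only records *intent* and ships metadata; tensors move exclusively on the worker side.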
Scheduler (`vllm/v1/core/sched/scheduler.py`):

- A scheduler-side connector is created when `ec_transfer_config` is present, via `ECConnectorFactory.create_connector(..., role=SCHEDULER)`.
- Once encoder cache space is allocated for an mm datum, the scheduler calls `ec_connector.update_state_after_alloc(...)` so the next `build_connector_meta(...)` includes the mm hashes to load.
- The scheduler attaches `ec_connector_metadata` to `SchedulerOutput` for the workers, and ingests worker completion via `ec_connector.update_connector_output(...)`.

Worker (`vllm/v1/worker/gpu_model_runner.py`):

- In `execute_model`, when `get_ec_transfer().is_producer` is True, the runner enters `with maybe_get_ec_connector_output(..., encoder_cache=self.encoder_cache):` before running the multimodal encoder.
- After the encoder output is stored in `encoder_cache[mm_hash]`, for each `mm_hash` the runner calls `maybe_save_ec_to_connector(self.encoder_cache, mm_hash)`, which invokes `ECConnectorBase.save_caches(encoder_cache=..., mm_hash=...)`.
- `wait_for_save()` is invoked (if enabled) to ensure the persisted EC is durable/visible to consumers; `get_finished(...)` is queried to surface completion status back to the scheduler.
- On the consumer side, the worker receives `ec_connector_metadata` listing the `mm_hash` items that need loads.
- It calls `start_load_caches(encoder_cache=self.encoder_cache)` prior to `_gather_mm_embeddings`, allowing the connector to populate `encoder_cache[mm_hash]` from the external store.
- `_gather_mm_embeddings` then reads the loaded tensors from `encoder_cache` and returns them as multimodal embeddings for the subsequent decoder input embedding construction.
- Completion info is reported in `ECConnectorOutput` so the scheduler can free resources when safe.

3. Usage Example
The current reference pathway is `ECSharedStorageConnector`. The ready-to-run scripts below show the workflows:
1 Encoder + 1 PD:
- `examples/online_serving/disaggregated_encoder/shared_storage_connector/disagg_1e1pd_example.sh`
- `examples/online_serving/disaggregated_encoder/shared_storage_connector/disagg_encoder_example.sh`

1 Encoder + 1 Prefill + 1 Decode:

- `examples/online_serving/disaggregated_encoder/shared_storage_connector/disagg_1e1p1d_example.sh`

3.1 Minimal ECTransfer CLI config
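As a rough illustration of what such a config could look like, here is a hypothetical pair of launch commands modeled on vLLM's existing `--kv-transfer-config` JSON style. The flag name `--ec-transfer-config`, the JSON keys, and the model name are assumptions, not confirmed by this PR; the shipped `disagg_*_example.sh` scripts are the authoritative reference.

```shell
# HYPOTHETICAL sketch -- flag and key names are assumed by analogy with
# --kv-transfer-config; consult the disagg_*_example.sh scripts for the
# real invocation.

# Encoder instance (EC producer): dumps encoder caches to shared storage.
vllm serve Qwen/Qwen2.5-VL-7B-Instruct --port 8100 \
  --ec-transfer-config \
  '{"ec_connector": "ECSharedStorageConnector",
    "ec_role": "ec_producer",
    "ec_connector_extra_config": {"shared_storage_path": "/tmp/ec_cache"}}'

# Prefill/decode instance (EC consumer): loads encoder caches on demand.
vllm serve Qwen/Qwen2.5-VL-7B-Instruct --port 8200 \
  --ec-transfer-config \
  '{"ec_connector": "ECSharedStorageConnector",
    "ec_role": "ec_consumer",
    "ec_connector_extra_config": {"shared_storage_path": "/tmp/ec_cache"}}'
```

Both instances must point at the same `shared_storage_path` for the hand-off to work.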
The encoder and PD instances share/transfer the encoder cache (EC) via EC transfer. The current reference implementation is `ECSharedStorageConnector` (dump/load via a shared directory).

Notes:

- If `shared_storage_path` is not explicitly set, `/tmp` is used by default.
- The encoder side writes each `mm_hash` to `/<shared_storage_path>/<mm_hash>/encoder_cache.safetensors`, and the PD side loads it to GPU on demand.
- Connectors are created through `ECConnectorFactory` (with `ECSharedStorageConnector` registered by default).

4. Development
(A figure in the original PR description illustrates the disaggregated encoder flow.)

Disaggregated encoding is implemented by running two parts:

- an encoder instance, and
- a PD instance, either combined (E->PD, see `disagg_1e1pd_example.sh` / `disagg_encoder_example.sh`) or disaggregated with Decode (E->P->D, see `disagg_1e1p1d_example.sh`).

A connector transfers encoder-cache (EC) embeddings from the encoder instance to the PD instance.
All related code is under `vllm/distributed/ec_transfer`.

For the PD disaggregation part, the Prefill instance receives the cache exactly as in the disaggregated encoder flow above. The Prefill instance executes one step (prefill -> 1 token output) and then transfers the KV cache to the Decode instance for the remaining execution. The KV transfer happens purely after the Prefill instance's execution.
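The shared-storage hand-off described above can be sketched end to end. This is a toy stand-in: the function names and the JSON file format are hypothetical (the real connector writes safetensors), but the directory layout (`<shared_storage_path>/<mm_hash>/...`) and the content-hash keying follow the notes in section 3.1.

```python
# Toy sketch of the shared-storage EC hand-off: the encoder side writes each
# mm_hash under <shared_storage_path>/<mm_hash>/, and the PD side loads it on
# demand. JSON stands in for the real safetensors format; all names here are
# illustrative, not the vLLM API.
import hashlib
import json
import tempfile
from pathlib import Path

shared_storage_path = Path(tempfile.mkdtemp())  # stands in for the /tmp default

def mm_hash_of(image_bytes: bytes) -> str:
    # A content hash identifies the same multimodal datum across instances.
    return hashlib.sha256(image_bytes).hexdigest()[:16]

def encoder_save(image_bytes: bytes, embedding: list[float]) -> str:
    """Encoder (producer) side: dump the EC entry to shared storage."""
    h = mm_hash_of(image_bytes)
    entry_dir = shared_storage_path / h
    entry_dir.mkdir(parents=True, exist_ok=True)
    (entry_dir / "encoder_cache.json").write_text(json.dumps(embedding))
    return h

def pd_load(h: str) -> list[float]:
    """PD (consumer) side: load the entry on demand, skipping re-encoding."""
    return json.loads(
        (shared_storage_path / h / "encoder_cache.json").read_text())

image = b"fake image bytes"
h = encoder_save(image, [0.5, -1.0, 2.0])  # runs on the encoder instance
loaded = pd_load(h)                        # runs on the prefill/decode instance
print(h, loaded)
```

Because the key is a content hash, a PD instance that sees the same image again can reuse the stored entry without contacting the encoder.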
`docs/features/disagg_prefill.md` gives a brief overview of disaggregated prefill (v0). We created the example setup with the NixlConnector from `vllm/distributed/kv_transfer/kv_connector/v1/nixl_connector.py` in vLLM v1 and referred to `tests/v1/kv_connector/nixl_integration/toy_proxy_server.py` to facilitate the KV transfer between P and D.

5. Next-Step Plan
5.1 Broaden EC Connector Types
5.2 Multi-Hardware Platform Support
5.3 Comprehensive Performance Evaluation