[Core] Encoder separation for Encode-Prefill-Decode Disaggregation #25233
Merged

ywang96 merged 78 commits into vllm-project:main on Nov 12, 2025
Signed-off-by: n00909098 <nguyen.kha.long@huawei.com>
Signed-off-by: knlnguyen1802 <knlnguyen1802@gmail.com>
Signed-off-by: herotai214 <herotai214@gmail.com>
ywang96 (Member) approved these changes on Nov 11, 2025:
Per offline discussion, since we're not going to include this in the 0.11.1 release, let's update this PR and get it in if all CI tests pass!
Very much appreciate everyone's efforts on pushing this PR to the finishing line 🚀 🚀 🚀
Contributor: We really appreciate the great work in this PR. We'd really like to have it merged into the main branch.
Collaborator: Nice! @fake0fan
Do we have any plan to refine the Whisper implementation to benefit from EPD disaggregation?
Disaggregated Encoder
A disaggregated encoder runs the vision-encoder stage of a multimodal LLM in a process separate from the prefill/decode stage. Deploying the two stages as independent vLLM instances lets each stage be scaled and provisioned independently.
We received a series of feedback on #21740 (many thanks to @LastZhabka for the excellent work on the initial version), and based on this feedback we updated our overall design.
Design doc: https://docs.google.com/document/d/1aed8KtC6XkXtdoV87pWT0a8OJlZ-CpnuLLzmR8l9BAE
@ywang96 @NickLucche @DarkLight1337 @WoosukKwon
Closes #20799
1. New Encoder Cache Connector
Disaggregated encoding in v1 is enabled by an explicit EC connector abstraction. `ECConnectorBase` defines a clear split of responsibilities between the scheduler side and the worker side, and provides the lifecycle hooks required to check, load, save, and track the encoder cache (EC) across processes. `ECConnector` is the interface for retrieving EC caches produced by the encoder.

Scheduler role (lives in the scheduler process) – checks cache existence and schedules loads:

- `check_caches_exist(request) -> list[bool]`: probe whether an EC entry exists for each multimodal datum in the request.
- `update_state_after_alloc(request, index)`: record EC loading intent once encoder cache space is allocated for a given mm datum.
- `build_connector_meta(scheduler_output) -> ECConnectorMetadata`: build the per-step metadata consumed by workers; also resets the connector's internal state for the next step.
- `request_finished(request) -> (bool, Optional[dict])`: optionally indicate that an async save/send is in progress and attach metadata.
- `update_connector_output(connector_output)`: ingest worker-side completion info.

Worker role (lives in each worker process) – loads the embeddings into memory:

- `bind_connector_metadata(meta)` / `clear_connector_metadata()`: attach/detach the per-step metadata from the scheduler before/after the forward pass.
- `start_load_caches(**kwargs)`: load the required EC into the local encoder cache before embeddings are gathered.
- `save_caches(**kwargs)`: persist/transfer EC produced on the encoder.
- `wait_for_save()`: block until outstanding saves complete (if any).
- `get_finished(finished_req_ids) -> (set[str] | None, set[str] | None)`: report completion of async sends/receives to help scheduler bookkeeping.

2. Change highlights (Scheduler and GPUModelRunner)
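Before the concrete call sites below, the hook sequence from section 1 can be sketched with a standalone toy connector. This is illustrative only: the method names mirror the interface above, but the class, the in-memory store, and all bodies are hypothetical stand-ins, not the vLLM implementation.

```python
# Toy illustration of the EC connector hook sequence (NOT the vLLM code).
# A single in-memory dict stands in for the external EC store.
from typing import Any, Optional

EXTERNAL_STORE: dict[str, list[float]] = {"img-abc123": [0.1, 0.2, 0.3]}

class ToyECConnector:
    """Mirrors the scheduler-side and worker-side hooks described above."""

    def __init__(self) -> None:
        self._pending_loads: list[str] = []              # scheduler-side intent
        self._metadata: Optional[dict[str, Any]] = None  # worker-side, per step

    # ---- scheduler side ----
    def check_caches_exist(self, mm_hashes: list[str]) -> list[bool]:
        return [h in EXTERNAL_STORE for h in mm_hashes]

    def update_state_after_alloc(self, mm_hash: str) -> None:
        self._pending_loads.append(mm_hash)  # EC space allocated; plan a load

    def build_connector_meta(self) -> dict[str, Any]:
        meta = {"load": list(self._pending_loads)}
        self._pending_loads.clear()          # reset state for the next step
        return meta

    # ---- worker side ----
    def bind_connector_metadata(self, meta: dict[str, Any]) -> None:
        self._metadata = meta

    def start_load_caches(self, encoder_cache: dict[str, list[float]]) -> None:
        for h in self._metadata["load"]:
            encoder_cache[h] = EXTERNAL_STORE[h]  # populate before gathering

    def clear_connector_metadata(self) -> None:
        self._metadata = None

# One scheduling step, end to end:
conn = ToyECConnector()
assert conn.check_caches_exist(["img-abc123"]) == [True]
conn.update_state_after_alloc("img-abc123")
meta = conn.build_connector_meta()      # shipped inside SchedulerOutput

encoder_cache: dict[str, list[float]] = {}
conn.bind_connector_metadata(meta)
conn.start_load_caches(encoder_cache)
conn.clear_connector_metadata()
print(encoder_cache["img-abc123"])      # [0.1, 0.2, 0.3]
```

The design point the sketch captures: the scheduler only records *intent* and ships metadata; tensors move exclusively on the worker side.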
Scheduler (`vllm/v1/core/sched/scheduler.py`):

- A scheduler-side connector is created when `ec_transfer_config` is present, via `ECConnectorFactory.create_connector(..., role=SCHEDULER)`.
- Once encoder cache space is allocated for an mm datum, the scheduler calls `ec_connector.update_state_after_alloc(...)` so the next `build_connector_meta(...)` includes the mm hashes to load.
- The scheduler attaches `ec_connector_metadata` to `SchedulerOutput` for the workers, and ingests worker completion via `ec_connector.update_connector_output(...)`.

Worker (`vllm/v1/worker/gpu_model_runner.py`):

- In `execute_model`, when `get_ec_transfer().is_producer` is True, the runner enters `with maybe_get_ec_connector_output(..., encoder_cache=self.encoder_cache):` before running the multimodal encoder.
- After the encoder output is stored in `encoder_cache[mm_hash]`, for each `mm_hash` the runner calls `maybe_save_ec_to_connector(self.encoder_cache, mm_hash)`, which invokes `ECConnectorBase.save_caches(encoder_cache=..., mm_hash=...)`.
- `wait_for_save()` is invoked (if enabled) to ensure the persisted EC is durable/visible to consumers; `get_finished(...)` is queried to surface completion status back to the scheduler.
- On the consumer side, the worker receives `ec_connector_metadata` listing the `mm_hash` items that need loads.
- It calls `start_load_caches(encoder_cache=self.encoder_cache)` prior to `_gather_mm_embeddings`, allowing the connector to populate `encoder_cache[mm_hash]` from the external store.
- `_gather_mm_embeddings` then reads the loaded tensors from `encoder_cache` and returns them as multimodal embeddings for the subsequent decoder input embedding construction.
- Completion info is reported in `ECConnectorOutput` so the scheduler can free resources when safe.

3. Usage Example
The current reference pathway is `ECSharedStorageConnector`. The ready-to-run scripts below show the workflows:
1 Encoder + 1 PD:
- `examples/online_serving/disaggregated_encoder/shared_storage_connector/disagg_1e1pd_example.sh`
- `examples/online_serving/disaggregated_encoder/shared_storage_connector/disagg_encoder_example.sh`

1 Encoder + 1 Prefill + 1 Decode:

- `examples/online_serving/disaggregated_encoder/shared_storage_connector/disagg_1e1p1d_example.sh`

3.1 Minimal ECTransfer CLI config
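As a rough illustration of what such a config could look like, here is a hypothetical pair of launch commands modeled on vLLM's existing `--kv-transfer-config` JSON style. The flag name `--ec-transfer-config`, the JSON keys, and the model name are assumptions, not confirmed by this PR; the shipped `disagg_*_example.sh` scripts are the authoritative reference.

```shell
# HYPOTHETICAL sketch -- flag and key names are assumed by analogy with
# --kv-transfer-config; consult the disagg_*_example.sh scripts for the
# real invocation.

# Encoder instance (EC producer): dumps encoder caches to shared storage.
vllm serve Qwen/Qwen2.5-VL-7B-Instruct --port 8100 \
  --ec-transfer-config \
  '{"ec_connector": "ECSharedStorageConnector",
    "ec_role": "ec_producer",
    "ec_connector_extra_config": {"shared_storage_path": "/tmp/ec_cache"}}'

# Prefill/decode instance (EC consumer): loads encoder caches on demand.
vllm serve Qwen/Qwen2.5-VL-7B-Instruct --port 8200 \
  --ec-transfer-config \
  '{"ec_connector": "ECSharedStorageConnector",
    "ec_role": "ec_consumer",
    "ec_connector_extra_config": {"shared_storage_path": "/tmp/ec_cache"}}'
```

Both instances must point at the same `shared_storage_path` for the hand-off to work.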
The encoder and PD instances share/transfer the encoder cache (EC) via EC transfer. The current reference implementation is `ECSharedStorageConnector` (dump/load via a shared directory).

Notes:

- If `shared_storage_path` is not explicitly set, `/tmp` is used by default.
- The encoder side writes each `mm_hash` to `/<shared_storage_path>/<mm_hash>/encoder_cache.safetensors`, and the PD side loads it to GPU on demand.
- Connectors are created through `ECConnectorFactory` (with `ECSharedStorageConnector` registered by default).

4. Development
(A figure in the original PR description illustrates the disaggregated encoder flow.)

Disaggregated encoding is implemented by running two parts:

- an encoder instance, and
- a PD instance, either combined (E->PD, see `disagg_1e1pd_example.sh` / `disagg_encoder_example.sh`) or disaggregated with Decode (E->P->D, see `disagg_1e1p1d_example.sh`).

A connector transfers encoder-cache (EC) embeddings from the encoder instance to the PD instance.
All related code is under `vllm/distributed/ec_transfer`.

For the PD disaggregation part, the Prefill instance receives the cache exactly as in the disaggregated encoder flow above. The Prefill instance executes one step (prefill -> 1 token output) and then transfers the KV cache to the Decode instance for the remaining execution. The KV transfer happens purely after the Prefill instance's execution.
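The shared-storage hand-off described above can be sketched end to end. This is a toy stand-in: the function names and the JSON file format are hypothetical (the real connector writes safetensors), but the directory layout (`<shared_storage_path>/<mm_hash>/...`) and the content-hash keying follow the notes in section 3.1.

```python
# Toy sketch of the shared-storage EC hand-off: the encoder side writes each
# mm_hash under <shared_storage_path>/<mm_hash>/, and the PD side loads it on
# demand. JSON stands in for the real safetensors format; all names here are
# illustrative, not the vLLM API.
import hashlib
import json
import tempfile
from pathlib import Path

shared_storage_path = Path(tempfile.mkdtemp())  # stands in for the /tmp default

def mm_hash_of(image_bytes: bytes) -> str:
    # A content hash identifies the same multimodal datum across instances.
    return hashlib.sha256(image_bytes).hexdigest()[:16]

def encoder_save(image_bytes: bytes, embedding: list[float]) -> str:
    """Encoder (producer) side: dump the EC entry to shared storage."""
    h = mm_hash_of(image_bytes)
    entry_dir = shared_storage_path / h
    entry_dir.mkdir(parents=True, exist_ok=True)
    (entry_dir / "encoder_cache.json").write_text(json.dumps(embedding))
    return h

def pd_load(h: str) -> list[float]:
    """PD (consumer) side: load the entry on demand, skipping re-encoding."""
    return json.loads(
        (shared_storage_path / h / "encoder_cache.json").read_text())

image = b"fake image bytes"
h = encoder_save(image, [0.5, -1.0, 2.0])  # runs on the encoder instance
loaded = pd_load(h)                        # runs on the prefill/decode instance
print(h, loaded)
```

Because the key is a content hash, a PD instance that sees the same image again can reuse the stored entry without contacting the encoder.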
`docs/features/disagg_prefill.md` gives a brief overview of disaggregated prefill (v0). We created the example setup with the NixlConnector from `vllm/distributed/kv_transfer/kv_connector/v1/nixl_connector.py` in vLLM v1 and referred to `tests/v1/kv_connector/nixl_integration/toy_proxy_server.py` to facilitate the KV transfer between P and D.

5. Next-Step Plan
5.1 Broaden EC Connector Types
5.2 Multi-Hardware Platform Support
5.3 Comprehensive Performance Evaluation