[POC] Encoder Disaggregation #4047
Draft
+2,161
−42
This PR is not ready to be merged or fully reviewed yet. However, since the essential building blocks are already implemented and pass my naive single-request test, I am opening a draft PR here for anyone who may be interested.
Motivation
InternVL3.5 proposed Decoupled Vision Deployment (DvD), also known as Encode-Prefill-Decode Disaggregation (EPD) in many other papers. We use the term EPD in the following descriptions. This paradigm has the potential to improve Time-to-First-Token (TTFT) and throughput.

Design
In this section, we explain the design logic of EPD in lmdeploy.
Sequence Diagram
```mermaid
sequenceDiagram
    participant Client
    participant API Server (Proxy)
    participant AsyncEngine
    participant Encoder Engine
    participant LLM Engine
    participant P2P Connection (ZMQ)

    %% Phase 1: Encoder Processing
    Client->>+API Server (Proxy): POST /v1/chat/completions (prompt, image_data)
    API Server (Proxy)->>+AsyncEngine: generate(prompt, image_data)
    AsyncEngine->>+Encoder Engine: step(ADD_MESSAGE, image_data)
    Encoder Engine->>Encoder Engine: Process image, generate feature embeddings
    Note right of Encoder Engine: Features stored in its local cache blocks
    Encoder Engine-->>-AsyncEngine: InferOutput (encoder_result)
    Note over AsyncEngine, Encoder Engine: encoder_result contains feature block_ids

    %% Phase 2: Feature Cache Migration
    AsyncEngine->>+LLM Engine: step(ADD_MESSAGE, prompt, encoder_result)
    LLM Engine->>LLM Engine: _on_add_message() -> _add_message()
    Note right of LLM Engine: Sequence created, status = WAITING_EP_MIGRATION
    LLM Engine->>LLM Engine: _async_loop_ep_migration() -> schedule_ep_migration()
    Note right of LLM Engine: Sequence status -> RUNNING_EP_MIGRATION
    LLM Engine->>LLM Engine: executor.migrate(encoder_blocks -> llm_blocks)
    Note right of LLM Engine: Physical copy of feature embeddings
    LLM Engine->>+P2P Connection (ZMQ): zmq_send(ack)
    P2P Connection (ZMQ)->>+Encoder Engine: Receive ACK
    Encoder Engine->>Encoder Engine: Free its copy of the feature cache blocks
    deactivate P2P Connection (ZMQ)
    deactivate Encoder Engine
    Note right of LLM Engine: After migration, sequence status -> WAITING

    %% Phase 3: LLM Prefill & Decode
    AsyncEngine->>+LLM Engine: step(empty request to trigger prefill)
    LLM Engine->>LLM Engine: _async_loop_main() -> schedule(is_prefill=True)
    Note right of LLM Engine: Sequence status -> RUNNING
    LLM Engine->>LLM Engine: executor.forward(prompt + migrated_features)
    Note right of LLM Engine: GPU performs prefill, generates 1st token
    LLM Engine-->>-AsyncEngine: InferOutput (1st token)
    AsyncEngine-->>API Server (Proxy): Stream 1st token
    API Server (Proxy)-->>Client: Stream 1st token

    loop Decode Loop
        AsyncEngine->>+LLM Engine: step(empty request to trigger decode)
        LLM Engine->>LLM Engine: _async_loop_main() -> schedule(is_prefill=False)
        LLM Engine->>LLM Engine: executor.forward(last_token)
        LLM Engine-->>-AsyncEngine: InferOutput (next_token)
        AsyncEngine-->>API Server (Proxy): Stream next_token
        API Server (Proxy)-->>Client: Stream next_token
    end

    %% Phase 4: Finish
    Note over LLM Engine, AsyncEngine: Generation finishes (EOS/max_tokens)
    LLM Engine-->>AsyncEngine: InferOutput (finish=True)
    AsyncEngine->>+LLM Engine: step(END_SESSION)
    LLM Engine->>LLM Engine: _on_end_session() -> scheduler.end_session()
    Note right of LLM Engine: Frees all resources for the session
    deactivate LLM Engine
    AsyncEngine-->>API Server (Proxy): Close stream
    API Server (Proxy)-->>Client: Close connection
    deactivate API Server (Proxy)
    deactivate AsyncEngine
```
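As the diagram shows, the AsyncEngine acts as the orchestrator: it first runs the encoder engine, then hands the prompt plus the encoder output to the LLM engine. A simplified sketch of that orchestration follows (the class, `RequestType` constant, and method signatures are illustrative, not lmdeploy's actual API):

```python
from enum import Enum, auto


class RequestType(Enum):
    ADD_MESSAGE = auto()
    END_SESSION = auto()


class AsyncEngineSketch:
    """Hand-wavy orchestration of one EPD request (Phases 1-3 of the diagram)."""

    def __init__(self, encoder_engine, llm_engine):
        self.encoder_engine = encoder_engine
        self.llm_engine = llm_engine

    async def generate(self, prompt, image_data):
        # Phase 1: vision encoding on the dedicated E instance; encoder_result
        # records which encoder-side cache blocks hold the feature embeddings.
        encoder_result = await self.encoder_engine.step(
            RequestType.ADD_MESSAGE, image_data=image_data)

        # Phases 2-3: the LLM engine migrates the feature blocks into its own
        # cache, then prefills and decodes as usual, streaming tokens back.
        async for output in self.llm_engine.step(
                RequestType.ADD_MESSAGE, prompt=prompt,
                encoder_result=encoder_result):
            yield output
```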
PD distserve attaches a `migration_request` to the P instance response and routes it to the D instance. Similarly, we propose a new attribute, `encoder_result`, attached to the E instance response and routed to the PD instance.
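For concreteness, here is a minimal sketch of what such an `encoder_result` payload might carry. The field names below are illustrative assumptions, not the PR's actual schema:

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class EncoderResult:
    """Illustrative payload attached to an E-instance response.

    Plays the same role for E -> PD that migration_request plays for
    P -> D: it tells the receiving instance where the vision features
    live so their cache blocks can be migrated.
    """
    session_id: int                      # session the features belong to (hypothetical field)
    block_ids: List[int] = field(default_factory=list)   # encoder-side cache blocks holding the features
    num_feature_tokens: int = 0          # embedding slots the features occupy in the prompt
    remote_engine_addr: str = ''         # encoder endpoint to ACK so it can free its blocks
```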
State Diagram

```mermaid
stateDiagram-v2
    direction LR
    [*] --> WAITING_EPD_MIGRATION: Request with encoder_result
    [*] --> WAITING_MIGRATION: Request with migration_request
    [*] --> WAITING: Standard Request

    state "Encoder-Prefill-Decode Path" as EPD_Path {
        WAITING_EPD_MIGRATION --> RUNNING_EPD_MIGRATION: Scheduler._schedule_epd_migration()
        RUNNING_EPD_MIGRATION --> EPD_MIGRATION_LOCKED: Engine locks after migration
        EPD_MIGRATION_LOCKED --> WAITING: Engine unlocks, ready for prefill
    }

    state "Prefill-Decode Path" as PD_Path {
        WAITING_MIGRATION --> RUNNING_MIGRATION: Scheduler._schedule_migration()
        RUNNING_MIGRATION --> MIGRATION_LOCKED: Engine locks after migration
        MIGRATION_LOCKED --> MIGRATION_DONE: Engine unlocks
        MIGRATION_DONE --> RUNNING: Scheduler.collect_migration_done()
    }

    state "Standard Inference Path" as Standard_Path {
        WAITING --> RUNNING: Scheduler._schedule_prefill()
        RUNNING --> LOCKED: Engine locks for forward pass
        LOCKED --> RUNNING: Engine unlocks after forward pass
        RUNNING --> WAITING: Evicted during decode scheduling
    }

    RUNNING --> ENDED: Generation finished (EOS/max_tokens)
    RUNNING --> STOPPED: User cancelled
    RUNNING --> ABORTED: Error (e.g., OOM)
    STOPPED --> ENDED
    ABORTED --> ENDED
    ENDED --> [*]
```

To migrate features from the E instance to the PD instance, we add the relevant scheduling logic inside the PyTorch engine. Specifically, we treat the E -> PD scheduling and migration as an extension of the current PD disaggregation, adding extra states such as
`WAITING_EPD_MIGRATION`, `RUNNING_EPD_MIGRATION`, and `EPD_MIGRATION_LOCKED`.
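As a sketch of how these states could extend the scheduler: the enum values and `_schedule_epd_migration` follow the state diagram above, while the scheduler and block-manager helpers are simplified assumptions rather than lmdeploy's actual interfaces:

```python
from enum import Enum, auto


class MessageStatus(Enum):
    """Simplified sequence states; the last three are the new EPD additions."""
    WAITING = auto()
    RUNNING = auto()
    WAITING_MIGRATION = auto()       # existing PD-disaggregation states
    RUNNING_MIGRATION = auto()
    MIGRATION_LOCKED = auto()
    MIGRATION_DONE = auto()
    WAITING_EPD_MIGRATION = auto()   # new: waiting for encoder feature migration
    RUNNING_EPD_MIGRATION = auto()   # new: feature blocks being copied E -> PD
    EPD_MIGRATION_LOCKED = auto()    # new: locked until the engine unlocks for prefill


def _schedule_epd_migration(scheduler):
    """Promote sequences that arrived with an encoder_result into migration."""
    scheduled = []
    for seq in scheduler.get_sequences(MessageStatus.WAITING_EPD_MIGRATION):
        # Reserve LLM-side cache blocks to receive the encoder features;
        # if blocks run out, retry on a later scheduling step.
        if not scheduler.block_manager.can_allocate(seq):
            break
        scheduler.block_manager.allocate(seq)
        seq.status = MessageStatus.RUNNING_EPD_MIGRATION
        scheduled.append(seq)
    return scheduled
```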
Modifications

The modifications are threefold:

- Proxy
  - New engine role 'Encoder'.
  - Proxy routing.
  - P2P connections/initializations.
- Encoder
  - A separate engine for the encoder.
  - A multimodal cache engine (credit to @FirwoodLin).
- LLM engine
  - Accept results from the encoder side.
  - Schedule multimodal cache migration (see the sketch after this list).
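Putting the migration scheduling together with the ACK from the sequence diagram, the LLM-engine side could look roughly like the following. `executor.migrate`, the scheduler helpers, and the ZMQ message format are assumed shapes for illustration, not the PR's actual code:

```python
import asyncio

import zmq
import zmq.asyncio


async def _async_loop_epd_migration(engine):
    """Sketch of the LLM-engine loop for Phase 2 of the sequence diagram.

    Copies encoder feature blocks into the local cache, marks the sequence
    ready for prefill, then ACKs over ZMQ so the encoder instance can free
    its copy of the feature cache blocks.
    """
    ctx = zmq.asyncio.Context.instance()
    while True:
        # WAITING_EPD_MIGRATION -> RUNNING_EPD_MIGRATION (see enum sketch above)
        for seq in engine.scheduler.schedule_epd_migration():
            res = seq.encoder_result
            # Physical copy: encoder cache blocks -> LLM cache blocks.
            await engine.executor.migrate(src_blocks=res.block_ids,
                                          dst_blocks=seq.logical_blocks)
            seq.status = MessageStatus.WAITING  # MessageStatus from the sketch above

            # ACK so the encoder side frees its feature blocks.
            sock = ctx.socket(zmq.PUSH)
            sock.connect(res.remote_engine_addr)
            await sock.send_json({'session_id': res.session_id,
                                  'block_ids': res.block_ids})
            sock.close()
        await asyncio.sleep(0)  # yield to the other engine loops
```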
Performance
TODO
Tasks
- Multimodal engine
- Minimal modifications to the LLM engine
- Proxy routing logic
- Multi-batch
- Metrics implementation
- Performance test and optimizations
Related