[POC] Encoder Disaggregation #4047
Draft
+2,161
−42
This PR is not ready to be merged or fully reviewed yet. However, since the essential building blocks are already implemented and pass my naive single-request test, I am opening a draft PR here for anyone who may be interested.
Motivation
InternVL3.5 proposed Decoupled Vision Deployment (DvD), also known as Encode-Prefill-Decode Disaggregation (EPD) in many other papers. We use the term EPD in the following descriptions. This paradigm has the potential to improve Time-to-First-Token (TTFT) and throughput.

Design
In this section, we explain the design logic of EPD in lmdeploy.
Sequence Diagram
```mermaid
sequenceDiagram
    participant Client
    participant API Server (Proxy)
    participant AsyncEngine
    participant Encoder Engine
    participant LLM Engine
    participant P2P Connection (ZMQ)

    %% Phase 1: Encoder Processing
    Client->>+API Server (Proxy): POST /v1/chat/completions (prompt, image_data)
    API Server (Proxy)->>+AsyncEngine: generate(prompt, image_data)
    AsyncEngine->>+Encoder Engine: step(ADD_MESSAGE, image_data)
    Encoder Engine->>Encoder Engine: Process image, generate feature embeddings
    Note right of Encoder Engine: Features stored in its local cache blocks
    Encoder Engine-->>-AsyncEngine: InferOutput (encoder_result)
    Note over AsyncEngine, Encoder Engine: encoder_result contains feature block_ids

    %% Phase 2: Feature Cache Migration
    AsyncEngine->>+LLM Engine: step(ADD_MESSAGE, prompt, encoder_result)
    LLM Engine->>LLM Engine: _on_add_message() -> _add_message()
    Note right of LLM Engine: Sequence created, status = WAITING_EP_MIGRATION
    LLM Engine->>LLM Engine: _async_loop_ep_migration() -> schedule_ep_migration()
    Note right of LLM Engine: Sequence status -> RUNNING_EP_MIGRATION
    LLM Engine->>LLM Engine: executor.migrate(encoder_blocks -> llm_blocks)
    Note right of LLM Engine: Physical copy of feature embeddings
    LLM Engine->>+P2P Connection (ZMQ): zmq_send(ack)
    P2P Connection (ZMQ)->>+Encoder Engine: Receive ACK
    Encoder Engine->>Encoder Engine: Free its copy of the feature cache blocks
    deactivate P2P Connection (ZMQ)
    deactivate Encoder Engine
    Note right of LLM Engine: After migration, sequence status -> WAITING

    %% Phase 3: LLM Prefill & Decode
    AsyncEngine->>+LLM Engine: step(empty request to trigger prefill)
    LLM Engine->>LLM Engine: _async_loop_main() -> schedule(is_prefill=True)
    Note right of LLM Engine: Sequence status -> RUNNING
    LLM Engine->>LLM Engine: executor.forward(prompt + migrated_features)
    Note right of LLM Engine: GPU performs prefill, generates 1st token
    LLM Engine-->>-AsyncEngine: InferOutput (1st token)
    AsyncEngine-->>API Server (Proxy): Stream 1st token
    API Server (Proxy)-->>Client: Stream 1st token

    loop Decode Loop
        AsyncEngine->>+LLM Engine: step(empty request to trigger decode)
        LLM Engine->>LLM Engine: _async_loop_main() -> schedule(is_prefill=False)
        LLM Engine->>LLM Engine: executor.forward(last_token)
        LLM Engine-->>-AsyncEngine: InferOutput (next_token)
        AsyncEngine-->>API Server (Proxy): Stream next_token
        API Server (Proxy)-->>Client: Stream next_token
    end

    %% Phase 4: Finish
    Note over LLM Engine, AsyncEngine: Generation finishes (EOS/max_tokens)
    LLM Engine-->>AsyncEngine: InferOutput (finish=True)
    AsyncEngine->>+LLM Engine: step(END_SESSION)
    LLM Engine->>LLM Engine: _on_end_session() -> scheduler.end_session()
    Note right of LLM Engine: Frees all resources for the session
    deactivate LLM Engine
    AsyncEngine-->>API Server (Proxy): Close stream
    API Server (Proxy)-->>Client: Close connection
    deactivate API Server (Proxy)
    deactivate AsyncEngine
```
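As the diagram shows, the AsyncEngine acts as the orchestrator: it first runs the encoder engine, then hands the prompt plus the encoder output to the LLM engine. A simplified sketch of that orchestration follows (the class, `RequestType` constant, and method signatures are illustrative, not lmdeploy's actual API):

```python
from enum import Enum, auto


class RequestType(Enum):
    ADD_MESSAGE = auto()
    END_SESSION = auto()


class AsyncEngineSketch:
    """Hand-wavy orchestration of one EPD request (Phases 1-3 of the diagram)."""

    def __init__(self, encoder_engine, llm_engine):
        self.encoder_engine = encoder_engine
        self.llm_engine = llm_engine

    async def generate(self, prompt, image_data):
        # Phase 1: vision encoding on the dedicated E instance; encoder_result
        # records which encoder-side cache blocks hold the feature embeddings.
        encoder_result = await self.encoder_engine.step(
            RequestType.ADD_MESSAGE, image_data=image_data)

        # Phases 2-3: the LLM engine migrates the feature blocks into its own
        # cache, then prefills and decodes as usual, streaming tokens back.
        async for output in self.llm_engine.step(
                RequestType.ADD_MESSAGE, prompt=prompt,
                encoder_result=encoder_result):
            yield output
```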
PD distserve attaches a `migration_request` to the P instance response and routes it to the D instance. Similarly, we propose a new attribute, `encoder_result`, attached to the E instance response and routed to the PD instance.
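For concreteness, here is a minimal sketch of what such an `encoder_result` payload might carry. The field names below are illustrative assumptions, not the PR's actual schema:

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class EncoderResult:
    """Illustrative payload attached to an E-instance response.

    Plays the same role for E -> PD that migration_request plays for
    P -> D: it tells the receiving instance where the vision features
    live so their cache blocks can be migrated.
    """
    session_id: int                      # session the features belong to (hypothetical field)
    block_ids: List[int] = field(default_factory=list)   # encoder-side cache blocks holding the features
    num_feature_tokens: int = 0          # embedding slots the features occupy in the prompt
    remote_engine_addr: str = ''         # encoder endpoint to ACK so it can free its blocks
```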
State Diagram

```mermaid
stateDiagram-v2
    direction LR
    [*] --> WAITING_EPD_MIGRATION: Request with encoder_result
    [*] --> WAITING_MIGRATION: Request with migration_request
    [*] --> WAITING: Standard Request

    state "Encoder-Prefill-Decode Path" as EPD_Path {
        WAITING_EPD_MIGRATION --> RUNNING_EPD_MIGRATION: Scheduler._schedule_epd_migration()
        RUNNING_EPD_MIGRATION --> EPD_MIGRATION_LOCKED: Engine locks after migration
        EPD_MIGRATION_LOCKED --> WAITING: Engine unlocks, ready for prefill
    }

    state "Prefill-Decode Path" as PD_Path {
        WAITING_MIGRATION --> RUNNING_MIGRATION: Scheduler._schedule_migration()
        RUNNING_MIGRATION --> MIGRATION_LOCKED: Engine locks after migration
        MIGRATION_LOCKED --> MIGRATION_DONE: Engine unlocks
        MIGRATION_DONE --> RUNNING: Scheduler.collect_migration_done()
    }

    state "Standard Inference Path" as Standard_Path {
        WAITING --> RUNNING: Scheduler._schedule_prefill()
        RUNNING --> LOCKED: Engine locks for forward pass
        LOCKED --> RUNNING: Engine unlocks after forward pass
        RUNNING --> WAITING: Evicted during decode scheduling
    }

    RUNNING --> ENDED: Generation finished (EOS/max_tokens)
    RUNNING --> STOPPED: User cancelled
    RUNNING --> ABORTED: Error (e.g., OOM)
    STOPPED --> ENDED
    ABORTED --> ENDED
    ENDED --> [*]
```

To migrate features from the E instance to the PD instance, we add the relevant scheduling logic inside the PyTorch engine. Specifically, we treat the E -> PD scheduling and migration as an extension of the current PD disaggregation, adding extra states such as
`WAITING_EPD_MIGRATION`, `RUNNING_EPD_MIGRATION`, and `EPD_MIGRATION_LOCKED`.
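As a sketch of how these states could extend the scheduler: the enum values and `_schedule_epd_migration` follow the state diagram above, while the scheduler and block-manager helpers are simplified assumptions rather than lmdeploy's actual interfaces:

```python
from enum import Enum, auto


class MessageStatus(Enum):
    """Simplified sequence states; the last three are the new EPD additions."""
    WAITING = auto()
    RUNNING = auto()
    WAITING_MIGRATION = auto()       # existing PD-disaggregation states
    RUNNING_MIGRATION = auto()
    MIGRATION_LOCKED = auto()
    MIGRATION_DONE = auto()
    WAITING_EPD_MIGRATION = auto()   # new: waiting for encoder feature migration
    RUNNING_EPD_MIGRATION = auto()   # new: feature blocks being copied E -> PD
    EPD_MIGRATION_LOCKED = auto()    # new: locked until the engine unlocks for prefill


def _schedule_epd_migration(scheduler):
    """Promote sequences that arrived with an encoder_result into migration."""
    scheduled = []
    for seq in scheduler.get_sequences(MessageStatus.WAITING_EPD_MIGRATION):
        # Reserve LLM-side cache blocks to receive the encoder features;
        # if blocks run out, retry on a later scheduling step.
        if not scheduler.block_manager.can_allocate(seq):
            break
        scheduler.block_manager.allocate(seq)
        seq.status = MessageStatus.RUNNING_EPD_MIGRATION
        scheduled.append(seq)
    return scheduled
```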
Modifications

The modifications are threefold:

- Proxy
  - New engine role 'Encoder'.
  - Proxy routing.
  - P2P connections/initializations.
- Encoder
  - A separate engine for the encoder.
  - A multimodal cache engine (credit to @FirwoodLin).
- LLM engine
  - Accept results from the encoder side.
  - Schedule multimodal cache migration (see the sketch after this list).
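Putting the migration scheduling together with the ACK from the sequence diagram, the LLM-engine side could look roughly like the following. `executor.migrate`, the scheduler helpers, and the ZMQ message format are assumed shapes for illustration, not the PR's actual code:

```python
import asyncio

import zmq
import zmq.asyncio


async def _async_loop_epd_migration(engine):
    """Sketch of the LLM-engine loop for Phase 2 of the sequence diagram.

    Copies encoder feature blocks into the local cache, marks the sequence
    ready for prefill, then ACKs over ZMQ so the encoder instance can free
    its copy of the feature cache blocks.
    """
    ctx = zmq.asyncio.Context.instance()
    while True:
        # WAITING_EPD_MIGRATION -> RUNNING_EPD_MIGRATION (see enum sketch above)
        for seq in engine.scheduler.schedule_epd_migration():
            res = seq.encoder_result
            # Physical copy: encoder cache blocks -> LLM cache blocks.
            await engine.executor.migrate(src_blocks=res.block_ids,
                                          dst_blocks=seq.logical_blocks)
            seq.status = MessageStatus.WAITING  # MessageStatus from the sketch above

            # ACK so the encoder side frees its feature blocks.
            sock = ctx.socket(zmq.PUSH)
            sock.connect(res.remote_engine_addr)
            await sock.send_json({'session_id': res.session_id,
                                  'block_ids': res.block_ids})
            sock.close()
        await asyncio.sleep(0)  # yield to the other engine loops
```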
Performance
TODO
Tasks
- Multimodal engine
- Minimal modifications to the LLM engine
- Proxy routing logic
- Multi-batch
- Metrics implementation
- Performance test and optimizations
Related