diff --git a/docs/disagg_pd.md b/docs/disagg_pd.md
index 383b9c193b..ad459bcb03 100644
--- a/docs/disagg_pd.md
+++ b/docs/disagg_pd.md
@@ -1,8 +1,8 @@
-# Disaggregated Prefill/Decode Inference Serving in llm-d
+# Disaggregated Prefill/Decode Inference Serving in LLM-D
 
 ## Overview
 
-This document describes the architecture and request lifecycle for enabling **disaggregated prefill and decode (P/D)** inference execution in the llm-d router. The architecture aims to improve flexibility, scalability, and performance by enabling separation of prefill and decode stages onto different workers.
+This document describes the architecture and request lifecycle for enabling **disaggregated prefill and decode (P/D)** inference execution in the LLM-D router. The architecture aims to improve flexibility, scalability, and performance by enabling separation of prefill and decode stages onto different workers.
 
 This evolved version removes the requirement for sidecars on the **prefill node**, simplifying deployment while maintaining orchestration from the **decode node**.
 
@@ -25,7 +25,7 @@ This evolved version removes the requirement for sidecars on the **prefill node*
 | **Decode Worker** | Handles decode stage and contains the sidecar for coordination |
 | **Sidecar (Decode)** | Orchestrates communication with prefill worker and manages lifecycle |
 | **Envoy Proxy** | Accepts OpenAI-style requests and forwards them to EPP |
-| **EPP** | End Point Picker, makes scheduling decisions |
+| **EPP** | Endpoint Picker, makes scheduling decisions |
 
 ---
 
@@ -37,7 +37,7 @@ This evolved version removes the requirement for sidecars on the **prefill node*
 2. **EPP Scheduling Decision**
    - EPP evaluates:
      - Prompt length
-     - KV cache hit probability
+     - KV-cache hit probability
      - System and pod load
    - Selects either:
      - **Single node** path (decode handles all)
@@ -47,10 +47,10 @@ This evolved version removes the requirement for sidecars on the **prefill node*
 3. **Execution**
    - Request lands on Decode Worker (as selected by EPP)
    - Decode sidecar coordinates:
-     - If `prefill_worker_id == nil`, runs both stages locally by passing request to local vllm
-     - If split:
-       - Sends prefill job to Prefill Worker with a special header `do_remote_decode=true`
-       - Upon receiving response from Prefill Worker runs decode stage
+     - If `x-prefiller-host-port` header doesn't exist, runs both stages locally by passing request to local vLLM
+     - If `x-prefiller-host-port` header exists:
+       - Sends the prefill job to the selected Prefill Worker with a special request field `do_remote_decode=true`
+       - Upon receiving the response from the Prefill Worker runs the decode stage
 
 4. **Response Flow**
    - Response flows from decode sidecar → Envoy → EPP → User
@@ -59,11 +59,34 @@ This evolved version removes the requirement for sidecars on the **prefill node*
 
 ## Architectural Details
 
+
+```mermaid
+sequenceDiagram
+    participant C as Client
+    participant I as Inference Gateway
+    participant DS as Decode Worker Sidecar
+    participant D as Decode Worker(vLLM)
+    participant P as Prefill Worker(vLLM)
+
+
+    C->>I: Inference Request
+    I->>DS: Request is sent to the Decode Worker Sidecar with the selected Prefill worker set in a header.
+    DS->>P: Remote Prefill with prompt(max_tokens=1)
+    P-->>P: Run prefill
+    P->>DS: Remote kv parameters
+    DS->> D: Request is sent to the Decode Worker (vLLM) with remote_prefill true, prefill ID and memory block IDs
+    D-->>P: Read kv-cache
+    D-->>D: Schedule decode into queue & run decode
+    D->>DS: Inference Response
+    DS->>I: Inference Response
+    I->>C: Inference Response
+```
+
 ### Sidecar Responsibilities (Decode Only)
 
 - Receives EPP metadata (decode pod, optional prefill pod)
 - Sends request to prefill
-- Waits and validates result
+- Waits for the result and validates it
 - Launches local decode job
 - Sends final response
 
@@ -73,33 +96,21 @@ This evolved version removes the requirement for sidecars on the **prefill node*
 
 ## Worker Selection Logic
 
-- **Decode Worker**:
-  - Prefer longest prefix match / KV cache utilization (depends on available scorers)
-
-- **Prefill Worker**:
-  - High prefix-cache hit rate
-  - Low load
+- **Decode/Prefill Worker**:
+  - Prefer longest prefix match/kv-cache utilization (depends on available scorers) and low load
 
 > **Skip prefill worker** when:
-> - Prefix match/kv cache hit is high
+> - Prefix match/kv-cache hit is high
 > - Prompt is very short
 
 ---
 
-## vLLM and LMCache Integration
-
-- **vLLM changes** (or wrapper APIs):
-  - `save()`, `load()` APIs
-  - `done_sending`, `done_receiving`
-  - Connector API supporting async transfer
-
----
 
 ## Drawbacks & Limitations
 
-- Slight increase in TTFT for split P/D
+- Slight increase in TTFT for disaggregated P/D
 - Possibility of stranded memory on prefill crash
-- Need for timeout and retry logic
+- The need for timeout and retry logic
 
 ---
 
@@ -115,17 +126,17 @@ This evolved version removes the requirement for sidecars on the **prefill node*
 ## Future Considerations
 
 - Cache coordinate
-- Pre allocation of kv blocks in decode node , push cache from prefill to decode worker during calculation
+- Pre-allocation of kv blocks in the decode node, push cache from the prefill to the decode worker during calculation
 
 ---
 
 ## Integrating External Prefill/Decode Workloads
 
-The llm-d inference scheduler supports integration with external disaggregated prefill/decode (P/D) workloads other inference frameworks that follow the same P/D separation pattern but use **different Kubernetes Pod labeling conventions**.
+The LLM-D inference scheduler supports integration with external disaggregated prefill/decode (P/D) workloads other inference frameworks that follow the same P/D separation pattern but use **different Kubernetes Pod labeling conventions**.
 
 ### Labeling Convention Flexibility
 
-By default, llm-d uses the label key `llm-d.ai/role` with values:
+By default, LLM-D uses the label key `llm-d.ai/role` with values:
 
 - `"prefill"` → prefill-only pods
 - `"decode"` or `"both"` → decode-capable pods
@@ -159,7 +170,8 @@ plugins:
       validValues: ["decode"]
   - type: prefix-cache-scorer
     parameters:
-      hashBlockSize: 5
+      autoTune: false
+      blockSize: 5
       maxPrefixBlocksToMatch: 256
       lruCapacityPerServer: 31250
   - type: max-score-picker
@@ -175,13 +187,11 @@ schedulingProfiles:
       - pluginRef: "prefill-pods"
       - pluginRef: "max-score-picker"
       - pluginRef: "prefix-cache-scorer"
-        weight: 2
   - name: decode
     plugins:
       - pluginRef: "decode-pods"
       - pluginRef: "max-score-picker"
       - pluginRef: "prefix-cache-scorer"
-        weight: 2
 ```
 
 ---
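
Review note: the reworded Execution step routes on whether the `x-prefiller-host-port` header is present. The sketch below illustrates that decision only and is not the sidecar's actual implementation. The header name, the `do_remote_decode` field, and `max_tokens=1` for the prefill call come from the changed document; the function name `plan_request` and the `"local-vllm"` target label are hypothetical, and the KV-transfer parameters that the decode request would carry are left out.

```python
from typing import Mapping

# Header name as used in the changed document.
PREFILLER_HEADER = "x-prefiller-host-port"

# Hypothetical label standing in for the vLLM instance next to the sidecar.
LOCAL_VLLM = "local-vllm"


def plan_request(headers: Mapping[str, str], body: dict) -> list[tuple[str, dict]]:
    """Return the ordered (target, payload) calls a decode sidecar might make."""
    # HTTP header names are case-insensitive, so normalize before the lookup.
    normalized = {name.lower(): value for name, value in headers.items()}
    prefiller = normalized.get(PREFILLER_HEADER)

    if not prefiller:
        # No prefill worker was selected: local vLLM runs both stages.
        return [(LOCAL_VLLM, dict(body))]

    # Disaggregated path: remote prefill first, then the local decode stage.
    prefill_payload = {**body, "do_remote_decode": True, "max_tokens": 1}
    return [(prefiller, prefill_payload), (LOCAL_VLLM, dict(body))]
```

Returning a plan rather than performing the HTTP calls keeps the branch easy to test in isolation; the timeout and retry handling listed under Drawbacks & Limitations would wrap the execution of each planned call.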