llm-d · github-actions · Jan 19, 2026 · Jan 19, 2026 · Jan 19, 2026 · Jan 19, 2026
diff --git a/docs/disagg_pd.md b/docs/disagg_pd.md
@@ -1,8 +1,8 @@
-# Disaggregated Prefill/Decode Inference Serving in llm-d
+# Disaggregated Prefill/Decode Inference Serving in LLM-D
 
 ## Overview
 
-This document describes the architecture and request lifecycle for enabling **disaggregated prefill and decode (P/D)** inference execution in the llm-d router. The architecture aims to improve flexibility, scalability, and performance by enabling separation of prefill and decode stages onto different workers.
+This document describes the architecture and request lifecycle for enabling **disaggregated prefill and decode (P/D)** inference execution in the LLM-D router. The architecture aims to improve flexibility, scalability, and performance by enabling separation of prefill and decode stages onto different workers.
 
 This evolved version removes the requirement for sidecars on the **prefill node**, simplifying deployment while maintaining orchestration from the **decode node**.
 
@@ -25,7 +25,7 @@ This evolved version removes the requirement for sidecars on the **prefill node*
 | **Decode Worker**    | Handles decode stage and contains the sidecar for coordination       |
 | **Sidecar (Decode)** | Orchestrates communication with prefill worker and manages lifecycle |
 | **Envoy Proxy**      | Accepts OpenAI-style requests and forwards them to EPP               |
-| **EPP**              | End Point Picker, makes scheduling decisions                     |
+| **EPP**              | Endpoint Picker, makes scheduling decisions                     |
 
 ---
 
@@ -37,7 +37,7 @@ This evolved version removes the requirement for sidecars on the **prefill node*
 2. **EPP Scheduling Decision**
    - EPP evaluates:
      - Prompt length
-     - KV cache hit probability
+     - KV-cache hit probability
      - System and pod load
    - Selects either:
      - **Single node** path (decode handles all)
@@ -47,10 +47,10 @@ This evolved version removes the requirement for sidecars on the **prefill node*
 3. **Execution**
    - Request lands on Decode Worker (as selected by EPP)
    - Decode sidecar coordinates:
-     - If `prefill_worker_id == nil`, runs both stages locally by passing request to local vllm
-     - If split:
-       - Sends prefill job to Prefill Worker with a special header `do_remote_decode=true`
-       - Upon receiving response from Prefill Worker runs decode stage
+     - If `x-prefiller-host-port` header doesn't exist, runs both stages locally by passing request to local vLLM
+     - If `x-prefiller-host-port` header exists:
+       - Sends the prefill job to the selected Prefill Worker with a special request field `do_remote_decode=true`
+       - Upon receiving the response from the Prefill Worker runs the decode stage
 
 4. **Response Flow**
    - Response flows from decode sidecar → Envoy → EPP → User
@@ -59,11 +59,34 @@ This evolved version removes the requirement for sidecars on the **prefill node*
 
 ## Architectural Details
 
+
+```mermaid
+sequenceDiagram
+  participant C as Client
+  participant I as Inference Gateway
+  participant DS as Decode Worker Sidecar
+  participant D as Decode Worker(vLLM)
+  participant P as Prefill Worker(vLLM)
+
+
+  C->>I: Inference Request
+  I->>DS: Request is sent to the Decode Worker Sidecar <br/> with the selected Prefill worker set in a header.
+  DS->>P: Remote Prefill with prompt(max_tokens=1)
+  P-->>P: Run prefill
+  P->>DS: Remote kv parameters
+  DS->> D: Request is sent to the Decode Worker (vLLM) with remote_prefill true, <br/>prefill ID and memory block IDs 
+        D-->>P: Read kv-cache
+        D-->>D: Schedule decode into queue & run decode
+  D->>DS: Inference Response
+  DS->>I: Inference Response
+  I->>C: Inference Response
+```
+
 ### Sidecar Responsibilities (Decode Only)
 
 - Receives EPP metadata (decode pod, optional prefill pod)
 - Sends request to prefill
-- Waits and validates result
+- Waits for the result and validates it
 - Launches local decode job
 - Sends final response
 
@@ -73,33 +96,21 @@ This evolved version removes the requirement for sidecars on the **prefill node*
 
 ## Worker Selection Logic
 
-- **Decode Worker**:
-  - Prefer longest prefix match / KV cache utilization (depends on available scorers)
-
-- **Prefill Worker**:
-  - High prefix-cache hit rate
-  - Low load
+- **Decode/Prefill Worker**:
+  - Prefer longest prefix match/kv-cache utilization (depends on available scorers) and low load
 
 > **Skip prefill worker** when:
-> - Prefix match/kv cache hit is high
+> - Prefix match/kv-cache hit is high
 > - Prompt is very short
 
 ---
 
-## vLLM and LMCache Integration
-
-- **vLLM changes** (or wrapper APIs):
-  - `save()`, `load()` APIs
-  - `done_sending`, `done_receiving`
-  - Connector API supporting async transfer
-
----
 
 ## Drawbacks & Limitations
 
-- Slight increase in TTFT for split P/D
+- Slight increase in TTFT for disaggregated P/D 
 - Possibility of stranded memory on prefill crash
-- Need for timeout and retry logic
+- The need for timeout and retry logic
 
 ---
 
@@ -115,17 +126,17 @@ This evolved version removes the requirement for sidecars on the **prefill node*
 ## Future Considerations
 
 - Cache coordinate
-- Pre allocation of kv blocks in decode node , push cache from prefill to decode worker during calculation
+- Pre-allocation of kv blocks in the decode node, push cache from the prefill to the decode worker during calculation
 
 ---
 
 ## Integrating External Prefill/Decode Workloads
 
-The llm-d inference scheduler supports integration with external disaggregated prefill/decode (P/D) workloads other inference frameworks that follow the same P/D separation pattern but use **different Kubernetes Pod labeling conventions**.
+The LLM-D inference scheduler supports integration with external disaggregated prefill/decode (P/D) workloads other inference frameworks that follow the same P/D separation pattern but use **different Kubernetes Pod labeling conventions**.
 
 ### Labeling Convention Flexibility
 
-By default, llm-d uses the label key `llm-d.ai/role` with values:
+By default, LLM-D uses the label key `llm-d.ai/role` with values:
 - `"prefill"` → prefill-only pods
 - `"decode"` or `"both"` → decode-capable pods  
 
@@ -159,7 +170,8 @@ plugins:
       validValues: ["decode"]
   - type: prefix-cache-scorer
     parameters:
-      hashBlockSize: 5
+      autoTune: false
+      blockSize: 5
       maxPrefixBlocksToMatch: 256
       lruCapacityPerServer: 31250
   - type: max-score-picker
@@ -175,13 +187,11 @@ schedulingProfiles:
       - pluginRef: "prefill-pods"
       - pluginRef: "max-score-picker"
       - pluginRef: "prefix-cache-scorer"
-        weight: 2
   - name: decode
     plugins:
       - pluginRef: "decode-pods"
       - pluginRef: "max-score-picker"
       - pluginRef: "prefix-cache-scorer"
-        weight: 2
 ```
 
 ---