diff --git a/docs/disagg_pd.md b/docs/disagg_pd.md
index 383b9c193b..ad459bcb03 100644
--- a/docs/disagg_pd.md
+++ b/docs/disagg_pd.md
@@ -1,8 +1,8 @@
-# Disaggregated Prefill/Decode Inference Serving in llm-d
+# Disaggregated Prefill/Decode Inference Serving in LLM-D
## Overview
-This document describes the architecture and request lifecycle for enabling **disaggregated prefill and decode (P/D)** inference execution in the llm-d router. The architecture aims to improve flexibility, scalability, and performance by enabling separation of prefill and decode stages onto different workers.
+This document describes the architecture and request lifecycle for enabling **disaggregated prefill and decode (P/D)** inference execution in the LLM-D router. The architecture aims to improve flexibility, scalability, and performance by enabling separation of prefill and decode stages onto different workers.
This evolved version removes the requirement for sidecars on the **prefill node**, simplifying deployment while maintaining orchestration from the **decode node**.
@@ -25,7 +25,7 @@ This evolved version removes the requirement for sidecars on the **prefill node*
| **Decode Worker** | Handles decode stage and contains the sidecar for coordination |
| **Sidecar (Decode)** | Orchestrates communication with prefill worker and manages lifecycle |
| **Envoy Proxy** | Accepts OpenAI-style requests and forwards them to EPP |
-| **EPP** | End Point Picker, makes scheduling decisions |
+| **EPP** | Endpoint Picker, makes scheduling decisions |
---
@@ -37,7 +37,7 @@ This evolved version removes the requirement for sidecars on the **prefill node*
2. **EPP Scheduling Decision**
- EPP evaluates:
- Prompt length
- - KV cache hit probability
+ - KV-cache hit probability
- System and pod load
- Selects either:
- **Single node** path (decode handles all)
@@ -47,10 +47,10 @@ This evolved version removes the requirement for sidecars on the **prefill node*
3. **Execution**
- Request lands on Decode Worker (as selected by EPP)
- Decode sidecar coordinates:
- - If `prefill_worker_id == nil`, runs both stages locally by passing request to local vllm
- - If split:
- - Sends prefill job to Prefill Worker with a special header `do_remote_decode=true`
- - Upon receiving response from Prefill Worker runs decode stage
+ - If `x-prefiller-host-port` header doesn't exist, runs both stages locally by passing request to local vLLM
+ - If `x-prefiller-host-port` header exists:
+ - Sends the prefill job to the selected Prefill Worker with a special request field `do_remote_decode=true`
+ - Upon receiving the response from the Prefill Worker runs the decode stage
4. **Response Flow**
- Response flows from decode sidecar → Envoy → EPP → User
@@ -59,11 +59,34 @@ This evolved version removes the requirement for sidecars on the **prefill node*
## Architectural Details
+
+```mermaid
+sequenceDiagram
+ participant C as Client
+ participant I as Inference Gateway
+ participant DS as Decode Worker Sidecar
+ participant D as Decode Worker(vLLM)
+ participant P as Prefill Worker(vLLM)
+
+
+ C->>I: Inference Request
+ I->>DS: Request is sent to the Decode Worker Sidecar
with the selected Prefill worker set in a header.
+ DS->>P: Remote Prefill with prompt(max_tokens=1)
+ P-->>P: Run prefill
+ P->>DS: Remote kv parameters
+ DS->> D: Request is sent to the Decode Worker (vLLM) with remote_prefill true,
prefill ID and memory block IDs
+ D-->>P: Read kv-cache
+ D-->>D: Schedule decode into queue & run decode
+ D->>DS: Inference Response
+ DS->>I: Inference Response
+ I->>C: Inference Response
+```
+
### Sidecar Responsibilities (Decode Only)
- Receives EPP metadata (decode pod, optional prefill pod)
- Sends request to prefill
-- Waits and validates result
+- Waits for the result and validates it
- Launches local decode job
- Sends final response
@@ -73,33 +96,21 @@ This evolved version removes the requirement for sidecars on the **prefill node*
## Worker Selection Logic
-- **Decode Worker**:
- - Prefer longest prefix match / KV cache utilization (depends on available scorers)
-
-- **Prefill Worker**:
- - High prefix-cache hit rate
- - Low load
+- **Decode/Prefill Worker**:
+ - Prefer longest prefix match/kv-cache utilization (depends on available scorers) and low load
> **Skip prefill worker** when:
-> - Prefix match/kv cache hit is high
+> - Prefix match/kv-cache hit is high
> - Prompt is very short
---
-## vLLM and LMCache Integration
-
-- **vLLM changes** (or wrapper APIs):
- - `save()`, `load()` APIs
- - `done_sending`, `done_receiving`
- - Connector API supporting async transfer
-
----
## Drawbacks & Limitations
-- Slight increase in TTFT for split P/D
+- Slight increase in TTFT for disaggregated P/D
- Possibility of stranded memory on prefill crash
-- Need for timeout and retry logic
+- The need for timeout and retry logic
---
@@ -115,17 +126,17 @@ This evolved version removes the requirement for sidecars on the **prefill node*
## Future Considerations
- Cache coordinate
-- Pre allocation of kv blocks in decode node , push cache from prefill to decode worker during calculation
+- Pre-allocation of kv blocks in the decode node, push cache from the prefill to the decode worker during calculation
---
## Integrating External Prefill/Decode Workloads
-The llm-d inference scheduler supports integration with external disaggregated prefill/decode (P/D) workloads other inference frameworks that follow the same P/D separation pattern but use **different Kubernetes Pod labeling conventions**.
+The LLM-D inference scheduler supports integration with external disaggregated prefill/decode (P/D) workloads other inference frameworks that follow the same P/D separation pattern but use **different Kubernetes Pod labeling conventions**.
### Labeling Convention Flexibility
-By default, llm-d uses the label key `llm-d.ai/role` with values:
+By default, LLM-D uses the label key `llm-d.ai/role` with values:
- `"prefill"` → prefill-only pods
- `"decode"` or `"both"` → decode-capable pods
@@ -159,7 +170,8 @@ plugins:
validValues: ["decode"]
- type: prefix-cache-scorer
parameters:
- hashBlockSize: 5
+ autoTune: false
+ blockSize: 5
maxPrefixBlocksToMatch: 256
lruCapacityPerServer: 31250
- type: max-score-picker
@@ -175,13 +187,11 @@ schedulingProfiles:
- pluginRef: "prefill-pods"
- pluginRef: "max-score-picker"
- pluginRef: "prefix-cache-scorer"
- weight: 2
- name: decode
plugins:
- pluginRef: "decode-pods"
- pluginRef: "max-score-picker"
- pluginRef: "prefix-cache-scorer"
- weight: 2
```
---