Skip to content
Merged
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
74 changes: 42 additions & 32 deletions docs/disagg_pd.md
Original file line number Diff line number Diff line change
@@ -1,8 +1,8 @@
# Disaggregated Prefill/Decode Inference Serving in llm-d
# Disaggregated Prefill/Decode Inference Serving in LLM-D

## Overview

This document describes the architecture and request lifecycle for enabling **disaggregated prefill and decode (P/D)** inference execution in the llm-d router. The architecture aims to improve flexibility, scalability, and performance by enabling separation of prefill and decode stages onto different workers.
This document describes the architecture and request lifecycle for enabling **disaggregated prefill and decode (P/D)** inference execution in the LLM-D router. The architecture aims to improve flexibility, scalability, and performance by enabling separation of prefill and decode stages onto different workers.

This evolved version removes the requirement for sidecars on the **prefill node**, simplifying deployment while maintaining orchestration from the **decode node**.

Expand All @@ -25,7 +25,7 @@ This evolved version removes the requirement for sidecars on the **prefill node*
| **Decode Worker** | Handles decode stage and contains the sidecar for coordination |
| **Sidecar (Decode)** | Orchestrates communication with prefill worker and manages lifecycle |
| **Envoy Proxy** | Accepts OpenAI-style requests and forwards them to EPP |
| **EPP** | End Point Picker, makes scheduling decisions |
| **EPP** | Endpoint Picker, makes scheduling decisions |

---

Expand All @@ -37,7 +37,7 @@ This evolved version removes the requirement for sidecars on the **prefill node*
2. **EPP Scheduling Decision**
- EPP evaluates:
- Prompt length
- KV cache hit probability
- KV-cache hit probability
- System and pod load
- Selects either:
- **Single node** path (decode handles all)
Expand All @@ -47,10 +47,10 @@ This evolved version removes the requirement for sidecars on the **prefill node*
3. **Execution**
- Request lands on Decode Worker (as selected by EPP)
- Decode sidecar coordinates:
- If `prefill_worker_id == nil`, runs both stages locally by passing request to local vllm
- If split:
- Sends prefill job to Prefill Worker with a special header `do_remote_decode=true`
- Upon receiving response from Prefill Worker runs decode stage
- If `x-prefiller-host-port` header doesn't exist, runs both stages locally by passing request to local vLLM
- If `x-prefiller-host-port` header exists:
- Sends the prefill job to the selected Prefill Worker with a special request field `do_remote_decode=true`
- Upon receiving the response from the Prefill Worker runs the decode stage

4. **Response Flow**
- Response flows from decode sidecar → Envoy → EPP → User
Expand All @@ -59,11 +59,34 @@ This evolved version removes the requirement for sidecars on the **prefill node*

## Architectural Details


```mermaid
sequenceDiagram
participant C as Client
participant I as Inference Gateway
participant DS as Decode Worker Sidecar
participant D as Decode Worker(vLLM)
participant P as Prefill Worker(vLLM)


C->>I: Inference Request
I->>DS: Request is sent to the Decode Worker Sidecar <br/> with the selected Prefill worker set in a header.
DS->>P: Remote Prefill with prompt(max_tokens=1)
P-->>P: Run prefill
P->>DS: Remote kv parameters
DS->> D: Request is sent to the Decode Worker (vLLM) with remote_prefill true, <br/>prefill ID and memory block IDs
D-->>P: Read kv-cache
D-->>D: Schedule decode into queue & run decode
D->>DS: Inference Response
DS->>I: Inference Response
I->>C: Inference Response
```

### Sidecar Responsibilities (Decode Only)

- Receives EPP metadata (decode pod, optional prefill pod)
- Sends request to prefill
- Waits and validates result
- Waits for the result and validates it
- Launches local decode job
- Sends final response

Expand All @@ -73,33 +96,21 @@ This evolved version removes the requirement for sidecars on the **prefill node*

## Worker Selection Logic

- **Decode Worker**:
- Prefer longest prefix match / KV cache utilization (depends on available scorers)

- **Prefill Worker**:
- High prefix-cache hit rate
- Low load
- **Decode/Prefill Worker**:
- Prefer longest prefix match/kv-cache utilization (depends on available scorers) and low load

> **Skip prefill worker** when:
> - Prefix match/kv cache hit is high
> - Prefix match/kv-cache hit is high
> - Prompt is very short

---

## vLLM and LMCache Integration

- **vLLM changes** (or wrapper APIs):
- `save()`, `load()` APIs
- `done_sending`, `done_receiving`
- Connector API supporting async transfer

---

## Drawbacks & Limitations

- Slight increase in TTFT for split P/D
- Slight increase in TTFT for disaggregated P/D
- Possibility of stranded memory on prefill crash
- Need for timeout and retry logic
- The need for timeout and retry logic

---

Expand All @@ -115,17 +126,17 @@ This evolved version removes the requirement for sidecars on the **prefill node*
## Future Considerations

- Cache coordinate
- Pre allocation of kv blocks in decode node , push cache from prefill to decode worker during calculation
- Pre-allocation of kv blocks in the decode node, push cache from the prefill to the decode worker during calculation

---

## Integrating External Prefill/Decode Workloads

The llm-d inference scheduler supports integration with external disaggregated prefill/decode (P/D) workloads other inference frameworks that follow the same P/D separation pattern but use **different Kubernetes Pod labeling conventions**.
The LLM-D inference scheduler supports integration with external disaggregated prefill/decode (P/D) workloads other inference frameworks that follow the same P/D separation pattern but use **different Kubernetes Pod labeling conventions**.

### Labeling Convention Flexibility

By default, llm-d uses the label key `llm-d.ai/role` with values:
By default, LLM-D uses the label key `llm-d.ai/role` with values:
- `"prefill"` → prefill-only pods
- `"decode"` or `"both"` → decode-capable pods

Expand Down Expand Up @@ -159,7 +170,8 @@ plugins:
validValues: ["decode"]
- type: prefix-cache-scorer
parameters:
hashBlockSize: 5
autoTune: false
blockSize: 5
maxPrefixBlocksToMatch: 256
lruCapacityPerServer: 31250
- type: max-score-picker
Expand All @@ -175,13 +187,11 @@ schedulingProfiles:
- pluginRef: "prefill-pods"
- pluginRef: "max-score-picker"
- pluginRef: "prefix-cache-scorer"
weight: 2
- name: decode
plugins:
- pluginRef: "decode-pods"
- pluginRef: "max-score-picker"
- pluginRef: "prefix-cache-scorer"
weight: 2
```

---
Expand Down