From c7406384516c5d447da459fad6ecb0d5d23c2907 Mon Sep 17 00:00:00 2001
From: David Breitgand
Date: Thu, 25 Dec 2025 20:09:17 +0200
Subject: [PATCH] Updates the architecture description with reference to BBR and support for multiple GenAI models and LoRAs to remove confusion about llm-d only supporting one model per cluster

---
 docs/architecture.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/docs/architecture.md b/docs/architecture.md
index a6187939fb..0b08be6ff2 100644
--- a/docs/architecture.md
+++ b/docs/architecture.md
@@ -10,8 +10,7 @@ The design enables:
-- Support for **multiple base models** within a shared cluster [Not supported in
-Phase1]
+- Support for **multiple base models** within a shared cluster (see [serving multiple gen AI models and LoRAs](https://gateway-api-inference-extension.sigs.k8s.io/guides/serve-multiple-genai-models/))
 - Efficient routing based on **KV cache locality**, **session affinity**, **load**, and **model metadata**
 - Disaggregated **Prefill/Decode (P/D)** execution
@@ -39,6 +38,7 @@ The inference scheduler is built on top of:
 - **Envoy** as a programmable data plane
 - **EPP (External Processing Plugin)** using **GIE**
+- **BBR (External Processing Plugin)** using **GIE**
 ---