From 94ddb679204bace030c2d6af4f9696e78d46296f Mon Sep 17 00:00:00 2001 From: Indrajit Bhosale Date: Thu, 20 Nov 2025 11:23:41 -0800 Subject: [PATCH 01/12] 2/3 Done Signed-off-by: Indrajit Bhosale --- .../trtllm/multimodal_trtllm_guide.md | 374 ++++++++++++++++++ docs/backends/vllm/multimodal_vllm_guide.md | 204 ++++++++++ 2 files changed, 578 insertions(+) create mode 100644 docs/backends/trtllm/multimodal_trtllm_guide.md create mode 100644 docs/backends/vllm/multimodal_vllm_guide.md diff --git a/docs/backends/trtllm/multimodal_trtllm_guide.md b/docs/backends/trtllm/multimodal_trtllm_guide.md new file mode 100644 index 00000000000..ed584962cea --- /dev/null +++ b/docs/backends/trtllm/multimodal_trtllm_guide.md @@ -0,0 +1,374 @@ + + +# TRT-LLM Multimodal Guide + +This document provides a comprehensive guide for multimodal inference using TensorRT-LLM backend in Dynamo. + +## Multimodal Support Matrix + +| Modality | Input Format | Aggregated | Disaggregated | Notes | +|----------|--------------|------------|---------------|-------| +| **Image** | HTTP/HTTPS URL | Yes | Yes | Full support for all image models | +| **Image** | Pre-computed Embeddings (.pt, .pth, .bin) | Yes | Yes | Direct embedding files | + +## Architecture Comparison + +TRT-LLM multimodal supports three deployment patterns: + +``` +SIMPLE AGGREGATED (agg.sh): + Client → Frontend (Rust) → Worker [image load, encode, P+D] → Response + • 2 components • --modality multimodal • Easiest setup + +DISAGGREGATED P->D (disagg_multimodal.sh): + Client → Frontend → Prefill [image load, encode] → Decode → Response + • 3 components • --disaggregation-mode prefill/decode • Multi-GPU, KV transfer + +EPD DISAGGREGATED - WIP (epd_disagg.sh): + Client → Frontend → Encode [MultimodalEncoder] → Prefill [via params] → Decode → Response + • 4 components • --disaggregation-mode encode/prefill/decode • WIP PR #3818 +``` + +## Input Format Details + +### Supported URL Formats + +| Format | Example | Description | Support | 
+|--------|---------|-------------|---------| +| **HTTP/HTTPS** | `http://example.com/image.jpg` | Remote media files | ✅ | +| **Pre-computed Embeddings** | `/path/to/embedding.pt` | Local embedding files (.pt, .pth, .bin) | ✅ | + +## Simple Aggregated Mode (PD) + +In aggregated mode, all processing (image loading, encoding, prefill, decode) happens within a single worker. + +### Architecture + +``` +HTTP Frontend (Rust) + ↓ +TRT-LLM Worker (Python - ModelInput.Tokens) + ↓ downloads media, encodes, prefill + decode +Response +``` + +### Components + +| Component | Flag | ModelInput | Registered | Purpose | +|-----------|------|-----------|------------|---------| +| Worker | `--modality multimodal` | Tokens | Yes | Complete inference pipeline | + +### Launch Script + +Example: `examples/backends/trtllm/launch/agg.sh` + +## Disaggregated Mode (P->D) + +In disaggregated mode, prefill and decode are handled by separate workers. The prefill worker handles image loading and encoding internally. + +### Architecture + +``` +HTTP Frontend (Rust) + ↓ +Prefill Worker (Python - ModelInput.Tokens) + ↓ downloads media, encodes, prefill, KV cache transfer +Decode Worker (Python - ModelInput.Tokens) + ↓ decode only, token generation +Response +``` + +### Components + +| Component | Flag | ModelInput | Registered | Purpose | +|-----------|------|-----------|------------|---------| +| Prefill Worker | `--disaggregation-mode prefill` | Tokens | Yes | Image processing + Prefill | +| Decode Worker | `--disaggregation-mode decode` | Tokens | Yes | Decode only | + +### Launch Script + +Example: `examples/backends/trtllm/launch/disagg_multimodal.sh` + +## EPD Disaggregated Mode (E->P->D) - WIP + +**Status:** Work In Progress (WIP PR #3818) - Full EPD flow with MultimodalEncoder + +In EPD mode, encoding, prefill, and decode are handled by separate workers. The encode worker uses TensorRT-LLM's `MultimodalEncoder` to process images and transfer embeddings via disaggregated parameters. 
+ +### Architecture + +``` +HTTP Frontend (Rust) + ↓ +Encode Worker (Python - NOT registered, uses MultimodalEncoder) + ↓ downloads image, encodes with vision model, transfers via disaggregated_params +Prefill Worker (Python - ModelInput.Tokens) + ↓ receives embeddings via disaggregated_params, prefill only, KV cache transfer +Decode Worker (Python - ModelInput.Tokens) + ↓ decode only, token generation +Response +``` + +**Note (WIP):** The encode worker uses `MultimodalEncoder` from TensorRT-LLM to actually encode images, not just load pre-computed embeddings. This is a significant change from the legacy NIXL-based embedding transfer. + +### Components + +| Component | Flag | ModelInput | Registered | Purpose | +|-----------|------|-----------|------------|---------| +| Encode Worker | `--disaggregation-mode encode` | N/A | No | Image encoding with MultimodalEncoder | +| Prefill Worker | `--disaggregation-mode prefill --encode-endpoint` | Tokens | Yes | Prefill only | +| Decode Worker | `--disaggregation-mode decode` | Tokens | Yes | Decode only | + +### Launch Script + +Example: `examples/backends/trtllm/launch/epd_disagg.sh` + +**Note (WIP):** The default model in the WIP PR is `llava-hf/llava-v1.6-mistral-7b-hf`. + +## ModelInput Types and Registration + +### Understanding ModelInput + +TRT-LLM workers register with Dynamo using: + +| ModelInput Type | Preprocessing | Use Case | +|-----------------|---------------|----------| +| `ModelInput.Tokens` | Rust SDK tokenizes text (bypassed for multimodal) | All TRT-LLM workers | + +### Component Registration Pattern + +```python +# TRT-LLM Worker - Register with Tokens +await register_llm( + ModelInput.Tokens, # Rust does minimal preprocessing + model_type, # ModelType.Chat or ModelType.Prefill + generate_endpoint, + model_name, + ... 
+) +``` + +## Inter-Component Communication + +### NATS-Based Messaging + +TRT-LLM components communicate using NATS messaging: + +| Transfer Stage | NATS Message | NIXL Transfer | +|----------------|--------------|---------------| +| **Frontend → Prefill** | Request with image URL or embedding path | No | +| **Encode → Prefill (Precomputed Embeddings)** | NIXL metadata (pre-computed embeddings) | Yes (Embeddings tensor) | +| **Encode → Prefill (Image URL) (WIP)** | Disaggregated params with multimodal handles | No (Handles via params) | +| **Prefill → Decode** | Disaggregated params | Yes/No (KV cache - UCX or NIXL) | + + +## **NIXL USE** + +| Use Case | Script | NIXL Used? | Data Transfer | +|----------|--------|------------|---------------| +| Simple Aggregated | `examples/backends/trtllm/launch/agg.sh` | ❌ No | All in one worker | +| P->D Disaggregated | `examples/backends/trtllm/launch/disagg_multimodal.sh` | ⚙️ Optional | Prefill → Decode (KV cache via UCX or NIXL) | +| E->P->D Disaggregated (Precomputed Embeddings) | `examples/backends/trtllm/launch/epd_disagg.sh` | ✅ Yes | Encoder → Prefill (pre-computed embeddings via NIXL) | +| E->P->D Disaggregated (WIP) | `examples/backends/trtllm/launch/url_epd_disagg.sh` | ❌ No | Encoder → Prefill (multimodal handles via disaggregated_params)
Prefill → Decode (KV cache via UCX/NIXL) |

**Note:** NIXL for KV cache transfer is currently in beta and is supported only on the AMD64 (x86_64) architecture.

## **GAPS and Known Limitations**

### 1. No Base64 Data URL Support

**Current State:**
- TRT-LLM does NOT support base64-encoded `data:image/...` URLs
- Use HTTP/HTTPS URLs or pre-computed embedding files instead

### 2. E->P->D Mode is WIP

**Current State (WIP PR #3818):**
- EPD mode (E->P->D) is under active development
- Uses `MultimodalEncoder` from TensorRT-LLM for actual image encoding (not just pre-computed embeddings)
- Embeddings transferred via `disaggregated_params` (includes `multimodal_embedding_handles` and `multimodal_hashes`)
- Encode worker does not register with frontend; accessed via `--encode-endpoint`

### 3. NIXL KV Cache Transfer Beta

**Current State:**
- NIXL-based KV cache transfer between prefill and decode workers is in beta
- Supported only on AMD64 (x86_64); UCX remains the alternative KV cache transport

## Pre-computed Embeddings (Legacy)

TRT-LLM supports providing pre-computed embeddings, bypassing image-to-embedding processing. This is the **Embeddings URL** approach for EPD mode.

### Supported File Types

- `.pt` - PyTorch tensor files
- `.pth` - PyTorch checkpoint files
- `.bin` - Binary tensor files

### Embedding File Formats

TRT-LLM supports two formats for embedding files:

**1. Simple Tensor Format**
- Direct tensor saved as `.pt` file
- Example: `llava_next_mm_embed_seashore.pt`
- Contains only the embedding tensor

```python
import torch

# Example: Simple tensor format
embedding_tensor = torch.rand(1, 576, 4096)  # [batch, seq_len, hidden_dim]
torch.save(embedding_tensor, "embedding.pt")
```

**2. Dictionary Format with Auxiliary Data**
- Dictionary containing multiple keys
- Used by models like Llama-4 that require additional metadata
- Must contain `mm_embeddings` key with the main tensor
- Can include auxiliary data like special tokens, offsets, etc.
+ +```python +# Example: Dictionary format (Llama-4 style) +embedding_dict = { + "mm_embeddings": torch.rand(1, 576, 4096), + "special_tokens": [128256, 128257], + "image_token_offsets": [[0, 576]], + # ... other model-specific metadata +} +torch.save(embedding_dict, "llama4_embedding.pt") +``` + +**How They're Used:** +- **Simple tensors**: Loaded directly and passed to `mm_embeddings` parameter +- **Dictionary format**: `mm_embeddings` key extracted as main tensor, other keys preserved as auxiliary data and transferred separately + +### Security Considerations + +For EPD mode with local embedding files: + +- `--allowed-local-media-path` - Specify secure directory for embedding files (default: `/tmp`) +- `--max-file-size-mb` - Limit max file size to prevent DoS attacks (default: `50MB`) + +## Full EPD with Image URLs (WIP) + +**Status:** Work In Progress (PR #3818) + +The WIP full EPD flow allows sending image URLs directly to the encode worker, which uses `MultimodalEncoder` to encode them. + +### How It Works (WIP) + +1. **Client** sends image URL in request +2. **Frontend** routes to **Prefill Worker** +3. **Prefill Worker** calls **Encode Worker** with image URL +4. **Encode Worker**: + - Downloads image using `default_multimodal_input_loader` + - Encodes with `MultimodalEncoder.generate()` + - Returns `ep_disaggregated_params` containing: + - `multimodal_embedding_handles` - GPU memory handles for embeddings + - `multimodal_hashes` - Hashes for embedding verification + - `processed_prompt` - Prompt with `` placeholders + - `prompt_token_ids` - Pre-tokenized prompt +5. **Prefill Worker** receives embeddings via disaggregated params, performs prefill +6. 
**Decode Worker** continues generation + +## Key Files + +| File | Description | +|------|-------------| +| `components/src/dynamo/trtllm/main.py` | Worker initialization and setup | +| `components/src/dynamo/trtllm/utils/trtllm_utils.py` | Command-line argument parsing | +| `components/src/dynamo/trtllm/multimodal_processor.py` | Multimodal request processing | +| `components/src/dynamo/trtllm/request_handlers/handlers.py` | Request handler factory | +| `components/src/dynamo/trtllm/request_handlers/handler_base.py` | Base handler and disaggregation modes | + +## **GAPS and Known Limitations** + +### 1. All Processing Happens in Python Workers + +**Current State:** +- TRT-LLM multimodal workers register with `ModelInput.Tokens` +- However, **all multimodal preprocessing happens in Python workers**, not in Rust frontend +- Rust frontend only validates URLs and tokenizes text-only prompts +- Python workers handle: + - Image downloading + - Image decoding (pixel-level) + - Vision encoding + - Multimodal prompt processing (adding `` tokens) + - Tokenization of multimodal prompts + +**Why This Is a Gap:** +- No reuse of Rust preprocessing/postprocessing logic for multimodal requests +- Inconsistent with text-only flows where Rust handles tokenization +- Limits optimization opportunities in the frontend + +### 2. 
TRT-LLM Requires Text Prompts, Not Tokens (Current) + +**Current State:** +- TRT-LLM's `MultimodalEncoder` and `LLM.generate_async()` expect **text prompts**, not pre-tokenized input +- This differs from vLLM which can accept `TokensPrompt` directly +- Forces Python workers to handle tokenization, even though workers register as `ModelInput.Tokens` + +**Ideal State:** +- TRT-LLM should accept **pre-tokenized input** (token IDs) +- Rust frontend could tokenize multimodal prompts (with `` placeholders) +- Python workers would only handle vision encoding + +**In Progress:** +- TRT-LLM team is working on accepting tokens instead of text prompts +- This would enable Rust preprocessing/postprocessing reuse for multimodal requests +- Would align TRT-LLM with vLLM's architecture where workers truly consume tokens + +### 3. Multimodal Processor Uses `ModelInput.Text` Semantics + +**Current State:** +- `MultimodalRequestProcessor` in TRT-LLM workers expects OpenAI format messages with raw text +- Workers effectively operate as `ModelInput.Text` despite registering as `ModelInput.Tokens` +- This is a workaround until TRT-LLM accepts tokenized input + +**Impact:** +- Architectural inconsistency between registration and actual behavior +- Cannot leverage Rust SDK's tokenization capabilities +- Additional complexity in Python worker code + +### 4. 
No Audio/Video Support in Dynamo TRT-LLM Backend + +**Current State:** +- TensorRT-LLM engine natively supports audio and video modalities +- Dynamo's TRT-LLM backend does **not yet** expose these capabilities +- Only image modality is currently supported: `--modality multimodal` (images only) + +**Why:** +- Dynamo backend implementation has not been extended to handle audio/video +- `MultimodalRequestProcessor` only extracts `image_url` from messages +- No handlers for `audio_url` or `video_url` content types + +**What's Missing:** +- Audio content type processing (`"type": "audio_url"`) +- Video content type processing (`"type": "video_url"`) +- Integration with TensorRT-LLM's audio/video input loaders +- Model-specific audio/video preprocessing + +**In Progress:** +- Backend extension to support audio and video is planned +- Will follow similar patterns to image support once implemented + +## Supported Models + +Multimodal models listed in [TensorRT-LLM supported models](https://github.com/NVIDIA/TensorRT-LLM/blob/main/tensorrt_llm/inputs/utils.py#L221) are supported by Dynamo. + +Common examples: +- Llama 4 Vision models (Maverick, Scout) +- Qwen2-VL models +- Other vision-language models with TRT-LLM support + diff --git a/docs/backends/vllm/multimodal_vllm_guide.md b/docs/backends/vllm/multimodal_vllm_guide.md new file mode 100644 index 00000000000..0b7a3bb7c17 --- /dev/null +++ b/docs/backends/vllm/multimodal_vllm_guide.md @@ -0,0 +1,204 @@ + + +# vLLM Multimodal Guide + +This document provides a comprehensive guide for multimodal inference using vLLM backend in Dynamo. 
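vLLM accepts inline base64 images as data URLs in addition to remote URLs. A minimal sketch of constructing one (the image bytes here are a placeholder; in practice you would read a real JPEG or PNG file):

```python
import base64

# Placeholder bytes standing in for a real JPEG payload
image_bytes = b"\xff\xd8\xff\xe0fake-jpeg-data"

# Encode and wrap in a data URL: data:image/jpeg;base64,<encoded bytes>
encoded = base64.b64encode(image_bytes).decode("ascii")
data_url = f"data:image/jpeg;base64,{encoded}"

# The resulting string goes wherever an http(s) image URL would, e.g. in an
# OpenAI-format message part:
# {"type": "image_url", "image_url": {"url": data_url}}
```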
+ +## Multimodal Support Matrix + +| Modality | Input Format | Aggregated | Disaggregated | Notes | +|----------|--------------|------------|---------------|-------| +| **Image** | HTTP/HTTPS URL | Yes | Yes | Full support for all image models | +| **Image** | Data URL (Base64) | Yes | Yes | Inline base64-encoded images | +| **Video** | HTTP/HTTPS URL | Yes | Yes | Frame extraction and processing | +| **Audio** | HTTP/HTTPS URL | Yes | Yes | Experimental - requires audio dependencies | + +## Architecture Comparison + +vLLM multimodal supports three deployment patterns: + +``` +SIMPLE AGGREGATED (agg_multimodal.sh): + Client → Frontend (Rust) → Worker [image load, encode, P+D] → Response + • 2 components • --connector none • Easiest setup + +EPD AGGREGATED (agg_multimodal_epd.sh): + Client → Frontend → Processor → Encoder [NIXL] → PD Worker → Response + • 4 components • --multimodal-processor • Custom templates, NIXL + +DISAGGREGATED (disagg_multimodal_qwen.sh): + Client → Frontend → Processor → Encoder [NIXL] → Prefill [NIXL] → Decode → Response + • 5 components • Separate P/D workers • Multi-node, max optimization +``` + +## Input Format Details + +### Supported URL Formats + +| Format | Example | Description | Support | +|--------|---------|-------------|---------| +| **HTTP/HTTPS** | `http://example.com/image.jpg` | Remote media files | ✅ | +| **Data URL** | `data:image/jpeg;base64,/9j/4AAQ...` | Base64-encoded inline data | ✅ | + + +## Aggregated Mode (PD) + +In aggregated mode, encoding, prefill, and decode happen within the same pipeline. 
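The processor at the head of this pipeline receives the raw OpenAI-format request and pulls the media URL out of the message content before tokenizing. A hedged sketch of that extraction step (illustrative only, not the actual `processor_handler.py` implementation):

```python
from typing import Optional

def extract_image_url(messages: list[dict]) -> Optional[str]:
    """Return the first image_url found in OpenAI-format chat messages."""
    for message in messages:
        content = message.get("content")
        # Multimodal content is a list of typed parts; a plain string is text-only.
        if not isinstance(content, list):
            continue
        for part in content:
            if part.get("type") == "image_url":
                return part["image_url"]["url"]
    return None

messages = [
    {"role": "user", "content": [
        {"type": "text", "text": "Describe this image"},
        {"type": "image_url", "image_url": {"url": "http://example.com/image.jpg"}},
    ]}
]
```

Text-only requests simply yield no URL and skip the encode step.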
+ +### Architecture + +``` +HTTP Frontend (Rust) + ↓ +Processor (Python - ModelInput.Text) + ↓ tokenizes, extracts media URL +Encode Worker (Python - not registered) + ↓ downloads media, generates embeddings, NIXL transfer +PD Worker (Python - ModelInput.Tokens) + ↓ prefill + decode +Response +``` + +### Components + +| Component | Flag | ModelInput | Registered | Purpose | +|-----------|------|-----------|------------|---------| +| Processor | `--multimodal-processor` | Text | Yes | HTTP entry, tokenization | +| Encode Worker | `--multimodal-encode-worker` | N/A | No | Media encoding | +| PD Worker | `--multimodal-worker` | Tokens | Yes | Prefill + Decode | + +## Disaggregated Mode (E->P->D) + +In disaggregated mode, encoding, prefill, and decode are handled by separate workers. + +### Architecture + +``` +HTTP Frontend (Rust) + ↓ +Processor (Python - ModelInput.Text) + ↓ tokenizes, extracts media URL +Encode Worker (Python - not registered) + ↓ downloads media, generates embeddings, NIXL transfer +Prefill Worker (Python - ModelInput.Tokens) + ↓ prefill only, KV cache NIXL transfer +Decode Worker (Python - ModelInput.Tokens) + ↓ decode only, token generation +Response +``` + +### Components + +| Component | Flag | ModelInput | Registered | Purpose | +|-----------|------|-----------|------------|---------| +| Processor | `--multimodal-processor` | Text | Yes | HTTP entry, tokenization | +| Encode Worker | `--multimodal-encode-worker` | N/A | No | Media encoding | +| Prefill Worker | `--multimodal-worker --is-prefill-worker` | Tokens | Yes | Prefill only | +| Decode Worker | `--multimodal-decode-worker` | Tokens | Yes | Decode only | + +## Traditional Disagg (EP->D) + +Llama 4 models don't support pre-computed embeddings, so they use a combined Encode+Prefill worker. 
+ +### Architecture + +``` +HTTP Frontend (Rust) + ↓ +Processor (Python - ModelInput.Text) + ↓ tokenizes, extracts media URL +Encode+Prefill Worker (Python - ModelInput.Tokens) + ↓ downloads media, encodes inline, prefill, KV cache NIXL transfer +Decode Worker (Python - ModelInput.Tokens) + ↓ decode only, token generation +Response +``` + +### Components + +| Component | Flag | ModelInput | Registered | Purpose | +|-----------|------|-----------|------------|---------| +| Processor | `--multimodal-processor` | Text | Yes | HTTP entry, tokenization | +| Encode+Prefill | `--multimodal-encode-prefill-worker --is-prefill-worker` | Tokens | Yes | Encode + Prefill | +| Decode Worker | `--multimodal-decode-worker` | Tokens | Yes | Decode only | + +### Launch Script + +Example: `examples/backends/vllm/launch/disagg_multimodal_llama.sh` + +## ModelInput Types and Registration + +### Understanding ModelInput + +Dynamo's Rust SDK supports two input types that determine how the HTTP frontend preprocesses requests: + +| ModelInput Type | Preprocessing | Use Case | +|-----------------|---------------|----------| +| `ModelInput.Text` | None (raw text passed through) | Components that tokenize themselves | +| `ModelInput.Tokens` | Rust SDK would tokenize (but bypassed in multimodal) | Components expecting pre-tokenized input | + +### Component Registration Pattern + +```python +# Processor - Entry point from HTTP frontend +await register_llm( + ModelInput.Text, # Frontend sends raw text + ModelType.Chat, + generate_endpoint, + model_name, + ... +) + +# Workers - Internal components +await register_llm( + ModelInput.Tokens, # Expect pre-tokenized input + ModelType.Chat, # or ModelType.Prefill for prefill workers + generate_endpoint, + model_name, + ... +) +``` + +## **NIXL USE** + +| Use Case | Script | NIXL Used? 
| Data Transfer | +|----------|--------|------------|---------------| +| Simple Aggregated | `examples/backends/vllm/launch/agg_multimodal.sh` | ❌ No | All in one worker | +| E->PD Aggregated | `examples/backends/vllm/launch/agg_multimodal_epd.sh` | ✅ Yes | Encoder → PD (embeddings) | +| E->P->D Disaggregated | `examples/backends/vllm/launch/disagg_multimodal_epd.sh` | ✅ Yes | Encoder → Prefill (embeddings)
Prefill → Decode (KV cache) | +| EP->D Disaggregated (Llama 4) | `examples/backends/vllm/launch/disagg_multimodal_llama.sh` | ✅ Yes | Prefill → Decode (KV cache) | + + +## **GAPS and Known Limitations** + +### 1. Token-Based P->D Disaggregation Not Supported + +**Current State:** +- All disaggregated multimodal flows require the **Processor** component (which uses `ModelInput.Text`) +- No support for pure token-based P->D disaggregation without multimodal processor + +### Key Files + +| File | Description | +|------|-------------| +| `components/src/dynamo/vllm/main.py` | Worker initialization and setup | +| `components/src/dynamo/vllm/args.py` | Command-line argument parsing | +| `components/src/dynamo/vllm/multimodal_handlers/processor_handler.py` | Processor implementation | +| `components/src/dynamo/vllm/multimodal_handlers/encode_worker_handler.py` | Encode worker implementation | +| `components/src/dynamo/vllm/multimodal_handlers/worker_handler.py` | PD/Prefill/Decode worker implementation | + From cd8ed731d9d975171575a464d175b6d2651b1c76 Mon Sep 17 00:00:00 2001 From: Indrajit Bhosale Date: Fri, 21 Nov 2025 11:14:44 -0800 Subject: [PATCH 02/12] 3/3 Done Signed-off-by: Indrajit Bhosale --- .../sglang/multimodal_sglang_guide.md | 498 ++++++++++++++++++ 1 file changed, 498 insertions(+) create mode 100644 docs/backends/sglang/multimodal_sglang_guide.md diff --git a/docs/backends/sglang/multimodal_sglang_guide.md b/docs/backends/sglang/multimodal_sglang_guide.md new file mode 100644 index 00000000000..4ab46d0ffaf --- /dev/null +++ b/docs/backends/sglang/multimodal_sglang_guide.md @@ -0,0 +1,498 @@ + + +# SGLang Multimodal Guide + +This document provides a comprehensive guide for multimodal inference using SGLang backend in Dynamo. 
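Clients talk to the Rust frontend using standard OpenAI-format chat requests. A hedged example payload (the model name and field layout are illustrative) using an HTTP image URL, the only image input format SGLang currently accepts:

```python
import json

# Illustrative request body; the model name is a placeholder.
request_body = {
    "model": "Qwen/Qwen2-VL-7B-Instruct",
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this image"},
                # SGLang accepts only http(s) URLs here; base64 data URLs
                # are not supported.
                {"type": "image_url",
                 "image_url": {"url": "http://example.com/image.jpg"}},
            ],
        }
    ],
    "max_tokens": 128,
}

# e.g. POST this JSON to the frontend's /v1/chat/completions endpoint
payload = json.dumps(request_body)
```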
+ +## Multimodal Support Matrix + +| Modality | Input Format | Aggregated | Disaggregated | Notes | +|----------|--------------|------------|---------------|-------| +| **Image** | HTTP/HTTPS URL | ✅ Yes | ✅ Yes | Vision encoder generates embeddings | +| **Image** | Data URL (Base64) | ❌ No | ❌ No | Not supported | +| **Video** | HTTP/HTTPS URL | ❌ No | ❌ No | Protocol accepts, but encode worker doesn't process | +| **Audio** | HTTP/HTTPS URL | ❌ No | ❌ No | Not implemented | + +## Architecture Comparison + +SGLang multimodal supports two deployment patterns: + +``` +AGGREGATED (E->PD): + Client → Frontend (Rust) → Processor → Encoder [NIXL] → PD Worker → Response + • 3 components • Vision encoder in Python • NIXL embeddings transfer + +DISAGGREGATED (E->P->D): + Client → Frontend → Processor → Encoder [NIXL] → Prefill [bootstrap] → Decode → Response + • 4 components • Vision encoder + KV sharing • Bootstrap coordination +``` + +## Aggregated Mode (E->PD) + +In aggregated mode, encoding happens in a separate worker, but prefill and decode share the same engine. + +### Architecture + +``` +HTTP Frontend (Rust) + ↓ +Processor (Python - ModelInput.Text - REGISTERED) + ↓ tokenizes with chat template, extracts image URL +Encode Worker (Python - NOT registered) + ↓ downloads image, runs vision encoder, generates embeddings, NIXL transfer +PD Worker (Python - NOT registered) + ↓ receives embeddings via NIXL, prefill + decode +Response → Processor → Frontend +``` + +### Components + +| Component | Flag | ModelInput | Registered | Has SGLang Engine? 
| Purpose | +|-----------|------|-----------|------------|-------------------|---------| +| Processor | `--multimodal-processor` | Text | ✅ Yes | ❌ No | HTTP entry, OpenAI→SGLang conversion | +| Encode Worker | `--multimodal-encode-worker` | N/A | ❌ No | ❌ No | Vision encoder, embeddings generation | +| PD Worker | `--multimodal-worker` | N/A | ❌ No | ✅ Yes | Prefill + Decode with embeddings | + +### Key Characteristics + +- **Vision Encoder in Python**: Encode worker loads vision model (AutoModel) and image processor (AutoImageProcessor) +- **Token Expansion**: Single `<|image_pad|>` token replaced with N tokens based on embedding shape +- **NIXL Transfer**: Embeddings transferred from Encoder → PD Worker using NIXL +- **No Rust Processing**: All tokenization and image handling happens in Python + +## Disaggregated Mode (E->P->D) + +In disaggregated mode, encoding, prefill, and decode are handled by separate workers using SGLang's bootstrap coordination. + +### Architecture + +``` +HTTP Frontend (Rust) + ↓ +Processor (Python - ModelInput.Text - REGISTERED) + ↓ tokenizes with chat template, extracts image URL +Encode Worker (Python - NOT registered) + ↓ downloads image, runs vision encoder, generates embeddings, NIXL transfer +Prefill Worker (Python - NOT registered) + ↓ receives embeddings via NIXL, prefill only, returns bootstrap info +Decode Worker (Python - NOT registered) + ↓ uses bootstrap info, decode only, token generation +Response → Processor → Frontend +``` + +### Components + +| Component | Flag | ModelInput | Registered | Has SGLang Engine? 
| Purpose | +|-----------|------|-----------|------------|-------------------|---------| +| Processor | `--multimodal-processor` | Text | ✅ Yes | ❌ No | HTTP entry, OpenAI→SGLang conversion | +| Encode Worker | `--multimodal-encode-worker` | N/A | ❌ No | ❌ No | Vision encoder, embeddings generation | +| Decode Worker | `--multimodal-worker --serving-mode=decode` | N/A | ❌ No | ✅ Yes | **Entry point for disaggregation**, calls Prefill | +| Prefill Worker | `--multimodal-worker --serving-mode=prefill` | N/A | ❌ No | ✅ Yes | Called by Decode, bootstrap coordination | + +### Bootstrap Coordination + +SGLang disaggregation uses a bootstrap mechanism for P->D coordination: + +**Request Flow (Important):** +``` +Client → Frontend → Processor → Encode → DECODE Worker → Prefill Worker + ↑ + Entry point for disaggregation! +``` + +**Bootstrap Process:** +1. **Decode Worker** receives request from Encode Worker +2. **Decode Worker** calls Prefill Worker via NATS to request bootstrap info +3. **Prefill Worker** generates `{host, port, room}` and returns immediately +4. **Both workers** connect to same "room" using bootstrap coordinates +5. 
**SGLang internally** transfers KV cache state via bootstrap connection (not NIXL) + +**Key Difference from vLLM:** +- vLLM: Frontend → Prefill → Decode (Prefill is entry point) +- SGLang: Frontend → Processor → Encode → **Decode → Prefill** (Decode is entry point) + +## ModelInput Types and Registration + +**Only the Processor registers with Dynamo Rust.** + +### Registration Pattern + +```python +# ONLY Processor registers with Dynamo Rust +await register_llm_with_readiness_gate( + None, # No engine for processor + generate_endpoint, + server_args, + dynamo_args, + input_type=ModelInput.Text, # Receives raw OpenAI format + readiness_gate=ready_event, +) + +# Workers do NOT register - they are internal components +# They communicate via NATS clients created in main.py +``` + +### Component Initialization + +```python +# Encode Worker - connects to downstream PD worker +pd_worker_client = ( + await runtime.namespace(dynamo_args.namespace) + .component("backend") + .endpoint("generate") + .client() +) + +# PD Worker (Decode mode) - connects to upstream Prefill worker +prefill_client = ( + await runtime.namespace(dynamo_args.namespace) + .component("prefill") + .endpoint("generate") + .client() +) +``` + +## Inter-Component Communication + +### Control Flow (NATS) + +All component-to-component communication happens via NATS: + +**Aggregated Mode (E→PD):** +``` +Processor → Encode Worker → PD Worker + (NATS) (NATS + NIXL embeddings) +``` + +**Disaggregated Mode (E→P→D):** +``` +Processor → Encode Worker → DECODE Worker → Prefill Worker + (NATS) (NATS) (NATS) + ↓ + Decode requests bootstrap + ↓ + Prefill returns {host, port, room} + ↓ + Both connect via bootstrap + ↓ + SGLang internal KV cache transfer +``` + +**Detailed Message Flow:** + +``` +Processor → Encode Worker: + - NATS round_robin with SglangMultimodalRequest + - Contains: tokenized input_ids, image URL, sampling params + +Encode Worker → Decode/PD Worker: + - NATS round_robin to "backend" component + - 
Contains: expanded token_ids, NIXL metadata, embeddings shape + - NIXL transfer: embeddings tensor + +Decode Worker → Prefill Worker (disagg only): + - NATS call to "prefill" component + - Decode requests bootstrap coordinates + - Prefill returns: {bootstrap_host, bootstrap_port, bootstrap_room} + +Prefill ↔ Decode (via bootstrap): + - SGLang internal connection (not NATS) + - KV cache state shared via bootstrap mechanism +``` + +### Data Transfer (NIXL) + +NIXL is used only for embedding transfer: + +``` +Encode Worker: + descriptor = connect.Descriptor(precomputed_embeddings) + with connector.create_readable(descriptor) as readable: + request.serialized_request = readable.metadata() + # Send request with NIXL metadata + await pd_worker_client.round_robin(request) + await readable.wait_for_completion() + +PD Worker: + embeddings = torch.empty(request.embeddings_shape, dtype=torch.float16) + descriptor = connect.Descriptor(embeddings) + read_op = await connector.begin_read(request.serialized_request, descriptor) + await read_op.wait_for_completion() +``` + +## Vision Encoding Details + +### Encode Worker Components + +The encode worker loads and runs the vision model in Python: + +```python +# Vision components loaded in encode worker +self.image_processor = AutoImageProcessor.from_pretrained( + model_path, trust_remote_code=True +) +self.vision_model = AutoModel.from_pretrained( + model_path, + device_map="auto", + torch_dtype=torch.float16, + trust_remote_code=True +) +``` + +### Token Expansion Process + +1. Processor inserts single image token (e.g., `<|image_pad|>`) +2. Encode worker generates embeddings: `shape = (batch, num_patches, hidden_dim)` +3. Encode worker replaces single token with `num_patches` tokens +4. 
Downstream worker receives expanded token sequence + +Example: +```python +# Before: ["Hello", "<|image_pad|>", "world"] +# After: ["Hello", "<|image_pad|>", "<|image_pad|>", ...(576 tokens), "world"] +``` + +## Chat Template Processing + +SGLang uses its own chat template system: + +```python +from sglang.srt.parser.conversation import chat_templates + +conv = chat_templates["qwen2-vl"].copy() +conv.append_message(conv.roles[0], f"{conv.image_token} Describe this image") +processed = tokenizer(text=conv.get_prompt(), return_tensors="pt") +``` + +Supported templates: `qwen2-vl`, `llama-3`, `vicuna`, etc. + +## NIXL USE + +| Use Case | NIXL Used? | Data Transfer | Notes | +|----------|------------|---------------|-------| +| E→PD Aggregated | ✅ Yes | Encoder → PD (embeddings) | Vision encoder separate | +| E→P→D Disaggregated | ✅ Yes | Encoder → Prefill (embeddings) | KV cache via SGLang bootstrap | + +**Key Difference:** SGLang P→D uses bootstrap mechanism, not NIXL for KV cache like vLLM. + +## GAPS and Known Limitations + +### 1. No Base64 (Data URL) Support + +**Current State:** +- Only HTTP/HTTPS URLs supported for images +- Data URLs (`data:image/jpeg;base64,...`) are **not supported** +- vLLM and TRT-LLM support data URLs, SGLang does not + +**Impact:** +- Cannot send embedded images in requests +- Requires external image hosting for all images + +### 2. No Pre-computed Embeddings Support + +**Current State:** +- No support for pre-computed embeddings (`.pt`, `.pth`, `.bin` files) +- Vision encoder must run for every request +- Cannot bypass encoding like TRT-LLM legacy flow + +**Impact:** +- Higher latency for repeated images +- Cannot optimize by pre-computing embeddings offline + +### 3. 
Only Processor Registers with Rust + +**Current State:** +- Only the Processor component registers with Dynamo Rust using `ModelInput.Text` +- All workers are internal and do not register +- Different from vLLM/TRT-LLM where workers also register + +**Implications:** +- Frontend always routes to Processor (cannot route directly to workers) +- No token-based entry point (no `ModelInput.Tokens` registration for workers) +- More complex multi-component setup required for all multimodal requests + +### 4. All Processing Happens in Python Workers + +**Current State:** +- No Rust-based image decoding or preprocessing +- No Rust tokenization (all tokenization in Python Processor) +- Frontend only handles HTTP routing + +**Impact:** +- Cannot leverage Rust performance for preprocessing +- All multimodal logic in Python components +- Similar limitation to TRT-LLM + +### 5. No Video/Audio Model Support + +**Current State:** +- **Video models are NOT supported** - Encode worker only implements image loading and processing +- **Audio models are NOT supported** - No audio encoder implementation +- Only **image modality** is production-ready +- Protocol accepts `video_url` and Processor can forward it, but Encode Worker **only processes `image_url`** + +**Why:** +```python +# encode_worker_handler.py only checks for image_url +if not request.multimodal_input.image_url: + raise ValueError("image_url is required for the encode worker.") +``` + +**Impact:** +- Cannot run video models like `LLaVA-NeXT-Video-7B-hf` +- Cannot run audio models like `Qwen2-Audio-7B-Instruct` +- Use **vLLM backend** for video/audio support (has full implementation) + +**Workaround:** +- For video models: Use vLLM (`examples/multimodal/launch/video_agg.sh`) +- For audio models: Use vLLM (`examples/multimodal/launch/audio_agg.sh`) +- Or implement custom video/audio encode worker for SGLang + +### 6. 
Bootstrap Coordination and Routing Complexity + +**Current State:** +- Disaggregated mode requires bootstrap coordination between P and D workers +- Uses host/port/room mechanism from SGLang +- **Decode Worker is the entry point** (not Prefill like vLLM) +- Request path: `Encode → Decode → Prefill` (Decode calls Prefill) + +**Architectural Pattern:** +``` +Encode Worker → pd_worker_client → DECODE Worker + ↓ + prefill_client → PREFILL Worker +``` + +**Impact:** +- More complex P→D coordination than vLLM +- Requires network connectivity between P and D workers +- Different debugging model than vLLM + +**Routing Implications:** + +**Cannot Route Directly to Prefill:** +- Prefill Worker does NOT register with Dynamo +- Frontend cannot route requests to Prefill directly +- All disaggregated requests MUST go through Decode Worker first +- Decode Worker initiates bootstrap coordination with Prefill + +**Load Balancing Constraints:** +- Cannot distribute load directly to Prefill workers +- Must load balance at Decode Worker level +- Decode Worker becomes bottleneck for prefill requests +- Different from vLLM where frontend can route to prefill workers directly + +**Multi-Instance Limitations:** +- If you scale Prefill workers, Decode must discover them +- Cannot use frontend routing to select specific Prefill worker +- Decode Worker uses `prefill_client.generate()` (round-robin to any prefill) +- Less control over prefill worker selection compared to vLLM + +### 7. 
Manual Token Expansion in Encode Worker

**Current State:**
- Encode worker **manually** expands image tokens from 1 → N based on embedding shape
- Token expansion happens in Python code, not handled by SGLang engine
- Hard-coded logic specific to model architecture

**Code Location:**
```python
# encode_worker_handler.py:144-157
# Find single image token in sequence
image_token_id_index = request.request.token_ids.index(self.image_token_id)

# Get number of patches from embedding shape
num_image_tokens = precomputed_embeddings.shape[1]  # e.g., 576 patches

# Replace 1 token with N tokens
request.request.token_ids = (
    request.request.token_ids[:image_token_id_index]
    + [self.image_token_id] * num_image_tokens
    + request.request.token_ids[image_token_id_index + 1:]
)
```

**Why This Is Error-Prone:**

1. **Model-Specific Logic Required:**
   - Different models have different patch sizes and embedding dimensions
   - Number of tokens depends on: image resolution, patch size, pooling strategy
   - Must update code for each new model architecture
   - Example: a LLaVA-style encoder emits 576 patch tokens, while other architectures produce different counts

2. **Assumes Shape[1] Is the Patch Count:**
   - Hard-coded assumption: `num_image_tokens = precomputed_embeddings.shape[1]`
   - Works for: `(batch, patches, hidden_dim)` format
   - Breaks for: different embedding formats (e.g., pooled, multi-scale)
   - No validation that shape[1] is actually the patch dimension

3. **Single Image Token Assumption:**
   - Assumes exactly one image token in sequence
   - Fails for: multiple images, video frames, complex layouts
   - `token_ids.index()` raises `ValueError` if the token is absent, and silently expands only the first occurrence when several are present

4. **No Dynamic Resolution Support:**
   - Fixed expansion based on embedding shape
   - Cannot handle dynamic image resolutions without code changes
   - Models with resolution-dependent patch counts need special handling

5. 
**Tight Coupling with Chat Template:** + - Must know exact image token string from chat template + - Hard-coded token extraction logic (lines 72-87) + - Different templates may use different token formats + +**Impact:** +- **Maintenance burden**: Must update encode worker for each new model +- **Error-prone**: Easy to miscalculate token counts for new architectures +- **No abstraction**: Token expansion logic embedded in handler, not engine +- **Limited flexibility**: Cannot easily support models with variable patch counts +- **Debugging difficulty**: Token count mismatches hard to diagnose + +**Comparison with vLLM:** +- vLLM handles token expansion **internally in the engine** +- vLLM workers just pass image data, engine figures out tokens +- More robust and less prone to manual errors + +**Workaround:** +- Carefully study each new model's architecture +- Test token expansion with known inputs +- Add extensive logging for token count validation + + +## Supported Models + +SGLang multimodal **only supports image-based vision-language models**: + +### ✅ Supported (Images Only) +- **Qwen2-VL** / **Qwen2.5-VL** (primary support) +- Models with `AutoImageProcessor` and vision tower +- Models compatible with SGLang's image embedding format + + +## Key Files + +| File | Description | +|------|-------------| +| `components/src/dynamo/sglang/main.py` | Component initialization, only Processor registers | +| `components/src/dynamo/sglang/request_handlers/multimodal/processor_handler.py` | Processor implementation, OpenAI→SGLang | +| `components/src/dynamo/sglang/request_handlers/multimodal/encode_worker_handler.py` | Vision encoder, embeddings generation | +| `components/src/dynamo/sglang/request_handlers/multimodal/worker_handler.py` | PD/Prefill/Decode workers, NIXL read | +| `components/src/dynamo/sglang/multimodal_utils/multimodal_chat_processor.py` | Chat template processing | +| `components/src/dynamo/sglang/protocol.py` | Request/response data structures | +| 
`components/src/dynamo/sglang/register.py` | Registration logic (only called for Processor) | + From 63ae834e71ad404c3d1b2f9aecce3eaca2f006bd Mon Sep 17 00:00:00 2001 From: krishung5 Date: Thu, 4 Dec 2025 14:18:18 -0800 Subject: [PATCH 03/12] Address comments --- .../sglang/multimodal_sglang_guide.md | 4 +- .../trtllm/multimodal_trtllm_guide.md | 18 ++++----- docs/backends/vllm/multimodal_vllm_guide.md | 39 ++++++++++++------- docs/hidden_toctree.rst | 4 ++ 4 files changed, 41 insertions(+), 24 deletions(-) diff --git a/docs/backends/sglang/multimodal_sglang_guide.md b/docs/backends/sglang/multimodal_sglang_guide.md index 4ab46d0ffaf..fdb9f9c9fb9 100644 --- a/docs/backends/sglang/multimodal_sglang_guide.md +++ b/docs/backends/sglang/multimodal_sglang_guide.md @@ -359,8 +359,8 @@ if not request.multimodal_input.image_url: - Use **vLLM backend** for video/audio support (has full implementation) **Workaround:** -- For video models: Use vLLM (`examples/multimodal/launch/video_agg.sh`) -- For audio models: Use vLLM (`examples/multimodal/launch/audio_agg.sh`) +- For video models: Use vLLM (`[examples/multimodal/launch/video_agg.sh](../../../examples/multimodal/launch/video_agg.sh)`) +- For audio models: Use vLLM (`[examples/multimodal/launch/audio_agg.sh](../../../examples/multimodal/launch/audio_agg.sh)`) - Or implement custom video/audio encode worker for SGLang ### 6. 
Bootstrap Coordination and Routing Complexity diff --git a/docs/backends/trtllm/multimodal_trtllm_guide.md b/docs/backends/trtllm/multimodal_trtllm_guide.md index ed584962cea..62a3b82fff2 100644 --- a/docs/backends/trtllm/multimodal_trtllm_guide.md +++ b/docs/backends/trtllm/multimodal_trtllm_guide.md @@ -33,15 +33,15 @@ TRT-LLM multimodal supports three deployment patterns: ``` SIMPLE AGGREGATED (agg.sh): Client → Frontend (Rust) → Worker [image load, encode, P+D] → Response - • 2 components • --modality multimodal • Easiest setup + • 2 components • worker flag `--modality multimodal` • Easiest setup DISAGGREGATED P->D (disagg_multimodal.sh): Client → Frontend → Prefill [image load, encode] → Decode → Response - • 3 components • --disaggregation-mode prefill/decode • Multi-GPU, KV transfer + • 3 components • worker flag `--disaggregation-mode prefill/decode` • Multi-GPU, KV transfer EPD DISAGGREGATED - WIP (epd_disagg.sh): Client → Frontend → Encode [MultimodalEncoder] → Prefill [via params] → Decode → Response - • 4 components • --disaggregation-mode encode/prefill/decode • WIP PR #3818 + • 4 components • worker flag `--disaggregation-mode encode/prefill/decode` • WIP PR #3818 ``` ## Input Format Details @@ -75,7 +75,7 @@ Response ### Launch Script -Example: `examples/backends/trtllm/launch/agg.sh` +Example: `[examples/backends/trtllm/launch/agg.sh](../../../examples/backends/trtllm/launch/agg.sh)` ## Disaggregated Mode (P->D) @@ -102,7 +102,7 @@ Response ### Launch Script -Example: `examples/backends/trtllm/launch/disagg_multimodal.sh` +Example: `[examples/backends/trtllm/launch/disagg_multimodal.sh](../../../examples/backends/trtllm/launch/disagg_multimodal.sh)` ## EPD Disaggregated Mode (E->P->D) - WIP @@ -136,7 +136,7 @@ Response ### Launch Script -Example: `examples/backends/trtllm/launch/epd_disagg.sh` +Example: `[examples/backends/trtllm/launch/epd_disagg.sh](../../../examples/backends/trtllm/launch/epd_disagg.sh)` **Note (WIP):** The default model in the 
WIP PR is `llava-hf/llava-v1.6-mistral-7b-hf`. @@ -181,9 +181,9 @@ TRT-LLM components communicate using NATS messaging: | Use Case | Script | NIXL Used? | Data Transfer | |----------|--------|------------|---------------| -| Simple Aggregated | `examples/backends/trtllm/launch/agg.sh` | ❌ No | All in one worker | -| P->D Disaggregated | `examples/backends/trtllm/launch/disagg_multimodal.sh` | ⚙️ Optional | Prefill → Decode (KV cache via UCX or NIXL) | -| E->P->D Disaggregated (Precomputed Embeddings) | `examples/backends/trtllm/launch/epd_disagg.sh` | ✅ Yes | Encoder → Prefill (pre-computed embeddings via NIXL) | +| Simple Aggregated | `[examples/backends/trtllm/launch/agg.sh](../../../examples/backends/trtllm/launch/agg.sh)` | ❌ No | All in one worker | +| P->D Disaggregated | `[examples/backends/trtllm/launch/disagg_multimodal.sh](../../../examples/backends/trtllm/launch/disagg_multimodal.sh)` | ⚙️ Optional | Prefill → Decode (KV cache via UCX or NIXL) | +| E->P->D Disaggregated (Precomputed Embeddings) | `[examples/backends/trtllm/launch/epd_disagg.sh](../../../examples/backends/trtllm/launch/epd_disagg.sh)` | ✅ Yes | Encoder → Prefill (pre-computed embeddings via NIXL) | | E->P->D Disaggregated (WIP) | `examples/backends/trtllm/launch/url_epd_disagg.sh` | ❌ No | Encoder → Prefill (multimodal handles via disaggregated_params)
Prefill → Decode (KV cache via UCX/NIXL) | **Note:** NIXL for KV cache transfer is currently beta and only supported on AMD64 (x86_64) architecture. diff --git a/docs/backends/vllm/multimodal_vllm_guide.md b/docs/backends/vllm/multimodal_vllm_guide.md index 0b7a3bb7c17..cdccb85876d 100644 --- a/docs/backends/vllm/multimodal_vllm_guide.md +++ b/docs/backends/vllm/multimodal_vllm_guide.md @@ -33,15 +33,15 @@ This document provides a comprehensive guide for multimodal inference using vLLM vLLM multimodal supports three deployment patterns: ``` -SIMPLE AGGREGATED (agg_multimodal.sh): - Client → Frontend (Rust) → Worker [image load, encode, P+D] → Response +SIMPLE AGGREGATED ([examples/backends/vllm/launch/agg_multimodal.sh](../../../examples/backends/vllm/launch/agg_multimodal.sh)): + Client → Frontend (Rust processor) → Worker [image load, encode, P+D] → Response • 2 components • --connector none • Easiest setup -EPD AGGREGATED (agg_multimodal_epd.sh): +EPD AGGREGATED ([examples/backends/vllm/launch/agg_multimodal_epd.sh](../../../examples/backends/vllm/launch/agg_multimodal_epd.sh)): Client → Frontend → Processor → Encoder [NIXL] → PD Worker → Response • 4 components • --multimodal-processor • Custom templates, NIXL -DISAGGREGATED (disagg_multimodal_qwen.sh): +DISAGGREGATED ([examples/backends/vllm/launch/disagg_multimodal_epd.sh](../../../examples/backends/vllm/launch/disagg_multimodal_epd.sh)): Client → Frontend → Processor → Encoder [NIXL] → Prefill [NIXL] → Decode → Response • 5 components • Separate P/D workers • Multi-node, max optimization ``` @@ -55,10 +55,23 @@ DISAGGREGATED (disagg_multimodal_qwen.sh): | **HTTP/HTTPS** | `http://example.com/image.jpg` | Remote media files | ✅ | | **Data URL** | `data:image/jpeg;base64,/9j/4AAQ...` | Base64-encoded inline data | ✅ | +## Simple Aggregated Mode (PD) -## Aggregated Mode (PD) +In simple aggregated mode, encoding, prefill, and decode happen within the same worker. 
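Once the simple aggregated worker is up, clients talk to the frontend with the standard OpenAI chat-completions shape. A minimal sketch of the request body — the model name, image URL, and frontend address below are placeholders to adjust to your launch configuration:

```python
import json

def build_image_chat_request(model: str, image_url: str, prompt: str) -> dict:
    """Build an OpenAI-style chat-completions payload with one image part."""
    return {
        "model": model,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    {"type": "image_url", "image_url": {"url": image_url}},
                ],
            }
        ],
        "max_tokens": 64,
    }

payload = build_image_chat_request(
    "Qwen/Qwen2.5-VL-7B-Instruct",   # placeholder: whatever model the worker serves
    "http://example.com/image.jpg",  # HTTP/HTTPS URL, or a data: URL
    "Describe this image",
)
# POST this JSON to the frontend, e.g. http://localhost:8000/v1/chat/completions
print(json.dumps(payload, indent=2))
```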
-In aggregated mode, encoding, prefill, and decode happen within the same pipeline.
+### Architecture
+
+```
+HTTP Frontend with Rust processor
+  ↓
+Worker (Python - ModelInput.Tokens)
+  ↓ encode + prefill + decode
+Response
+```
+
+## EPD Aggregated Mode (E->PD)
+
+In EPD aggregated mode, encoding happens in a separate worker, while prefill and decode happen within the same worker.

### Architecture

@@ -82,9 +95,9 @@ Response
| Encode Worker | `--multimodal-encode-worker` | N/A | No | Media encoding |
| PD Worker | `--multimodal-worker` | Tokens | Yes | Prefill + Decode |

-## Disaggregated Mode (E->P->D)
+## EPD Disaggregated Mode (E->P->D)

-In disaggregated mode, encoding, prefill, and decode are handled by separate workers.
+In EPD disaggregated mode, encoding, prefill, and decode are handled by separate workers.

### Architecture

@@ -139,7 +152,7 @@ Response

### Launch Script

-Example: `examples/backends/vllm/launch/disagg_multimodal_llama.sh`
+Example: `[examples/backends/vllm/launch/disagg_multimodal_llama.sh](../../../examples/backends/vllm/launch/disagg_multimodal_llama.sh)`

## ModelInput Types and Registration

@@ -178,10 +191,10 @@ await register_llm(

| Use Case | Script | NIXL Used? | Data Transfer |
|----------|--------|------------|---------------|
-| Simple Aggregated | `examples/backends/vllm/launch/agg_multimodal.sh` | ❌ No | All in one worker |
-| E->PD Aggregated | `examples/backends/vllm/launch/agg_multimodal_epd.sh` | ✅ Yes | Encoder → PD (embeddings) |
-| E->P->D Disaggregated | `examples/backends/vllm/launch/disagg_multimodal_epd.sh` | ✅ Yes | Encoder → Prefill (embeddings)
Prefill → Decode (KV cache) | -| EP->D Disaggregated (Llama 4) | `examples/backends/vllm/launch/disagg_multimodal_llama.sh` | ✅ Yes | Prefill → Decode (KV cache) | +| Simple Aggregated | `[examples/backends/vllm/launch/agg_multimodal.sh](../../../examples/backends/vllm/launch/agg_multimodal.sh)` | ❌ No | All in one worker | +| E->PD Aggregated | `[examples/backends/vllm/launch/agg_multimodal_epd.sh](../../../examples/backends/vllm/launch/agg_multimodal_epd.sh)` | ✅ Yes | Encoder → PD (embeddings) | +| E->P->D Disaggregated | `[examples/backends/vllm/launch/disagg_multimodal_epd.sh](../../../examples/backends/vllm/launch/disagg_multimodal_epd.sh)` | ✅ Yes | Encoder → Prefill (embeddings)
Prefill → Decode (KV cache) | +| EP->D Disaggregated (Llama 4) | `[examples/backends/vllm/launch/disagg_multimodal_llama.sh](../../../examples/backends/vllm/launch/disagg_multimodal_llama.sh)` | ✅ Yes | Prefill → Decode (KV cache) | ## **GAPS and Known Limitations** diff --git a/docs/hidden_toctree.rst b/docs/hidden_toctree.rst index 669ae0339cb..5cf74c62300 100644 --- a/docs/hidden_toctree.rst +++ b/docs/hidden_toctree.rst @@ -51,6 +51,7 @@ backends/trtllm/kv-cache-transfer.md backends/trtllm/multimodal_support.md backends/trtllm/multimodal_epd.md + backends/trtllm/multimodal_trtllm_guide.md backends/trtllm/gemma3_sliding_window_attention.md backends/trtllm/gpt-oss.md backends/trtllm/prometheus.md @@ -61,6 +62,7 @@ backends/sglang/expert-distribution-eplb.md backends/sglang/gpt-oss.md backends/sglang/multimodal_epd.md + backends/sglang/multimodal_sglang_guide.md backends/sglang/sgl-hicache-example.md backends/sglang/sglang-disaggregation.md backends/sglang/prometheus.md @@ -73,8 +75,10 @@ backends/vllm/deepseek-r1.md backends/vllm/gpt-oss.md + backends/vllm/LMCache_Integration.md backends/vllm/multi-node.md backends/vllm/multimodal.md + backends/vllm/multimodal_vllm_guide.md backends/vllm/prometheus.md benchmarks/kv-router-ab-testing.md From 96fd82d998d15521719e5b5981b36adb771d3585 Mon Sep 17 00:00:00 2001 From: krishung5 Date: Thu, 4 Dec 2025 14:55:45 -0800 Subject: [PATCH 04/12] Fix syntax for renderring --- docs/backends/sglang/multimodal_sglang_guide.md | 4 ++-- docs/backends/trtllm/multimodal_trtllm_guide.md | 12 ++++++------ docs/backends/vllm/multimodal_vllm_guide.md | 10 +++++----- 3 files changed, 13 insertions(+), 13 deletions(-) diff --git a/docs/backends/sglang/multimodal_sglang_guide.md b/docs/backends/sglang/multimodal_sglang_guide.md index fdb9f9c9fb9..d55b47a4b82 100644 --- a/docs/backends/sglang/multimodal_sglang_guide.md +++ b/docs/backends/sglang/multimodal_sglang_guide.md @@ -359,8 +359,8 @@ if not request.multimodal_input.image_url: - Use 
**vLLM backend** for video/audio support (has full implementation) **Workaround:** -- For video models: Use vLLM (`[examples/multimodal/launch/video_agg.sh](../../../examples/multimodal/launch/video_agg.sh)`) -- For audio models: Use vLLM (`[examples/multimodal/launch/audio_agg.sh](../../../examples/multimodal/launch/audio_agg.sh)`) +- For video models: Use vLLM ([`examples/multimodal/launch/video_agg.sh`](../../../examples/multimodal/launch/video_agg.sh)) +- For audio models: Use vLLM ([`examples/multimodal/launch/audio_agg.sh`](../../../examples/multimodal/launch/audio_agg.sh)) - Or implement custom video/audio encode worker for SGLang ### 6. Bootstrap Coordination and Routing Complexity diff --git a/docs/backends/trtllm/multimodal_trtllm_guide.md b/docs/backends/trtllm/multimodal_trtllm_guide.md index 62a3b82fff2..618752491b9 100644 --- a/docs/backends/trtllm/multimodal_trtllm_guide.md +++ b/docs/backends/trtllm/multimodal_trtllm_guide.md @@ -75,7 +75,7 @@ Response ### Launch Script -Example: `[examples/backends/trtllm/launch/agg.sh](../../../examples/backends/trtllm/launch/agg.sh)` +Example: [`examples/backends/trtllm/launch/agg.sh`](../../../examples/backends/trtllm/launch/agg.sh) ## Disaggregated Mode (P->D) @@ -102,7 +102,7 @@ Response ### Launch Script -Example: `[examples/backends/trtllm/launch/disagg_multimodal.sh](../../../examples/backends/trtllm/launch/disagg_multimodal.sh)` +Example: [`examples/backends/trtllm/launch/disagg_multimodal.sh`](../../../examples/backends/trtllm/launch/disagg_multimodal.sh) ## EPD Disaggregated Mode (E->P->D) - WIP @@ -136,7 +136,7 @@ Response ### Launch Script -Example: `[examples/backends/trtllm/launch/epd_disagg.sh](../../../examples/backends/trtllm/launch/epd_disagg.sh)` +Example: [`examples/backends/trtllm/launch/epd_disagg.sh`](../../../examples/backends/trtllm/launch/epd_disagg.sh) **Note (WIP):** The default model in the WIP PR is `llava-hf/llava-v1.6-mistral-7b-hf`. 
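The patch-token counts these guides quote (e.g., 576) follow from simple ViT geometry. As an illustration — assuming a fixed-resolution square image encoder such as the CLIP ViT-L/14 tower used by the LLaVA family; any-resolution variants like `llava-v1.6` tile the image and produce more tokens:

```python
def vit_patch_token_count(image_size: int, patch_size: int) -> int:
    """Number of patch tokens a square, fixed-resolution ViT encoder emits."""
    if image_size % patch_size != 0:
        raise ValueError("image size must be a multiple of the patch size")
    patches_per_side = image_size // patch_size
    return patches_per_side ** 2

# ViT-L/14 at 336x336 input: (336 / 14)^2 = 24^2 patches
print(vit_patch_token_count(336, 14))  # → 576
```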
@@ -181,9 +181,9 @@ TRT-LLM components communicate using NATS messaging: | Use Case | Script | NIXL Used? | Data Transfer | |----------|--------|------------|---------------| -| Simple Aggregated | `[examples/backends/trtllm/launch/agg.sh](../../../examples/backends/trtllm/launch/agg.sh)` | ❌ No | All in one worker | -| P->D Disaggregated | `[examples/backends/trtllm/launch/disagg_multimodal.sh](../../../examples/backends/trtllm/launch/disagg_multimodal.sh)` | ⚙️ Optional | Prefill → Decode (KV cache via UCX or NIXL) | -| E->P->D Disaggregated (Precomputed Embeddings) | `[examples/backends/trtllm/launch/epd_disagg.sh](../../../examples/backends/trtllm/launch/epd_disagg.sh)` | ✅ Yes | Encoder → Prefill (pre-computed embeddings via NIXL) | +| Simple Aggregated | [`examples/backends/trtllm/launch/agg.sh`](../../../examples/backends/trtllm/launch/agg.sh) | ❌ No | All in one worker | +| P->D Disaggregated | [`examples/backends/trtllm/launch/disagg_multimodal.sh`](../../../examples/backends/trtllm/launch/disagg_multimodal.sh) | ⚙️ Optional | Prefill → Decode (KV cache via UCX or NIXL) | +| E->P->D Disaggregated (Precomputed Embeddings) | [`examples/backends/trtllm/launch/epd_disagg.sh`](../../../examples/backends/trtllm/launch/epd_disagg.sh) | ✅ Yes | Encoder → Prefill (pre-computed embeddings via NIXL) | | E->P->D Disaggregated (WIP) | `examples/backends/trtllm/launch/url_epd_disagg.sh` | ❌ No | Encoder → Prefill (multimodal handles via disaggregated_params)
Prefill → Decode (KV cache via UCX/NIXL) | **Note:** NIXL for KV cache transfer is currently beta and only supported on AMD64 (x86_64) architecture. diff --git a/docs/backends/vllm/multimodal_vllm_guide.md b/docs/backends/vllm/multimodal_vllm_guide.md index cdccb85876d..de562d77cdb 100644 --- a/docs/backends/vllm/multimodal_vllm_guide.md +++ b/docs/backends/vllm/multimodal_vllm_guide.md @@ -152,7 +152,7 @@ Response ### Launch Script -Example: `[examples/backends/vllm/launch/disagg_multimodal_llama.sh](../../../examples/backends/vllm/launch/disagg_multimodal_llama.sh)` +Example: [`examples/backends/vllm/launch/disagg_multimodal_llama.sh`](../../../examples/backends/vllm/launch/disagg_multimodal_llama.sh) ## ModelInput Types and Registration @@ -191,10 +191,10 @@ await register_llm( | Use Case | Script | NIXL Used? | Data Transfer | |----------|--------|------------|---------------| -| Simple Aggregated | `[examples/backends/vllm/launch/agg_multimodal.sh](../../../examples/backends/vllm/launch/agg_multimodal.sh)` | ❌ No | All in one worker | -| E->PD Aggregated | `[examples/backends/vllm/launch/agg_multimodal_epd.sh](../../../examples/backends/vllm/launch/agg_multimodal_epd.sh)` | ✅ Yes | Encoder → PD (embeddings) | -| E->P->D Disaggregated | `[examples/backends/vllm/launch/disagg_multimodal_epd.sh](../../../examples/backends/vllm/launch/disagg_multimodal_epd.sh)` | ✅ Yes | Encoder → Prefill (embeddings)
Prefill → Decode (KV cache) | -| EP->D Disaggregated (Llama 4) | `[examples/backends/vllm/launch/disagg_multimodal_llama.sh](../../../examples/backends/vllm/launch/disagg_multimodal_llama.sh)` | ✅ Yes | Prefill → Decode (KV cache) | +| Simple Aggregated | [`examples/backends/vllm/launch/agg_multimodal.sh`](../../../examples/backends/vllm/launch/agg_multimodal.sh) | ❌ No | All in one worker | +| E->PD Aggregated | [`examples/backends/vllm/launch/agg_multimodal_epd.sh`](../../../examples/backends/vllm/launch/agg_multimodal_epd.sh) | ✅ Yes | Encoder → PD (embeddings) | +| E->P->D Disaggregated | [`examples/backends/vllm/launch/disagg_multimodal_epd.sh`](../../../examples/backends/vllm/launch/disagg_multimodal_epd.sh) | ✅ Yes | Encoder → Prefill (embeddings)
Prefill → Decode (KV cache) | +| EP->D Disaggregated (Llama 4) | [`examples/backends/vllm/launch/disagg_multimodal_llama.sh`](../../../examples/backends/vllm/launch/disagg_multimodal_llama.sh) | ✅ Yes | Prefill → Decode (KV cache) | ## **GAPS and Known Limitations** From 3aa5f29b5e628e98f18530549d64443e05687240 Mon Sep 17 00:00:00 2001 From: krishung5 Date: Thu, 4 Dec 2025 21:47:26 -0800 Subject: [PATCH 05/12] Coderabbit comments --- .../backends/sglang/multimodal_sglang_guide.md | 18 +++++++++--------- .../backends/trtllm/multimodal_trtllm_guide.md | 18 ++++++++++-------- docs/backends/vllm/multimodal_vllm_guide.md | 10 +++++----- 3 files changed, 24 insertions(+), 22 deletions(-) diff --git a/docs/backends/sglang/multimodal_sglang_guide.md b/docs/backends/sglang/multimodal_sglang_guide.md index d55b47a4b82..0b8c0034e83 100644 --- a/docs/backends/sglang/multimodal_sglang_guide.md +++ b/docs/backends/sglang/multimodal_sglang_guide.md @@ -32,7 +32,7 @@ This document provides a comprehensive guide for multimodal inference using SGLa SGLang multimodal supports two deployment patterns: -``` +```text AGGREGATED (E->PD): Client → Frontend (Rust) → Processor → Encoder [NIXL] → PD Worker → Response • 3 components • Vision encoder in Python • NIXL embeddings transfer @@ -48,7 +48,7 @@ In aggregated mode, encoding happens in a separate worker, but prefill and decod ### Architecture -``` +```text HTTP Frontend (Rust) ↓ Processor (Python - ModelInput.Text - REGISTERED) @@ -81,7 +81,7 @@ In disaggregated mode, encoding, prefill, and decode are handled by separate wor ### Architecture -``` +```text HTTP Frontend (Rust) ↓ Processor (Python - ModelInput.Text - REGISTERED) @@ -109,7 +109,7 @@ Response → Processor → Frontend SGLang disaggregation uses a bootstrap mechanism for P->D coordination: **Request Flow (Important):** -``` +```text Client → Frontend → Processor → Encode → DECODE Worker → Prefill Worker ↑ Entry point for disaggregation! 
@@ -174,13 +174,13 @@ prefill_client = ( All component-to-component communication happens via NATS: **Aggregated Mode (E→PD):** -``` +```text Processor → Encode Worker → PD Worker (NATS) (NATS + NIXL embeddings) ``` **Disaggregated Mode (E→P→D):** -``` +```text Processor → Encode Worker → DECODE Worker → Prefill Worker (NATS) (NATS) (NATS) ↓ @@ -195,7 +195,7 @@ Processor → Encode Worker → DECODE Worker → Prefill Worker **Detailed Message Flow:** -``` +```text Processor → Encode Worker: - NATS round_robin with SglangMultimodalRequest - Contains: tokenized input_ids, image URL, sampling params @@ -219,7 +219,7 @@ Prefill ↔ Decode (via bootstrap): NIXL is used only for embedding transfer: -``` +```python Encode Worker: descriptor = connect.Descriptor(precomputed_embeddings) with connector.create_readable(descriptor) as readable: @@ -372,7 +372,7 @@ if not request.multimodal_input.image_url: - Request path: `Encode → Decode → Prefill` (Decode calls Prefill) **Architectural Pattern:** -``` +```text Encode Worker → pd_worker_client → DECODE Worker ↓ prefill_client → PREFILL Worker diff --git a/docs/backends/trtllm/multimodal_trtllm_guide.md b/docs/backends/trtllm/multimodal_trtllm_guide.md index 618752491b9..ae5416fd022 100644 --- a/docs/backends/trtllm/multimodal_trtllm_guide.md +++ b/docs/backends/trtllm/multimodal_trtllm_guide.md @@ -30,7 +30,7 @@ This document provides a comprehensive guide for multimodal inference using Tens TRT-LLM multimodal supports three deployment patterns: -``` +```text SIMPLE AGGREGATED (agg.sh): Client → Frontend (Rust) → Worker [image load, encode, P+D] → Response • 2 components • worker flag `--modality multimodal` • Easiest setup @@ -59,7 +59,7 @@ In aggregated mode, all processing (image loading, encoding, prefill, decode) ha ### Architecture -``` +```text HTTP Frontend (Rust) ↓ TRT-LLM Worker (Python - ModelInput.Tokens) @@ -83,7 +83,7 @@ In disaggregated mode, prefill and decode are handled by separate workers. 
The p ### Architecture -``` +```text HTTP Frontend (Rust) ↓ Prefill Worker (Python - ModelInput.Tokens) @@ -112,7 +112,7 @@ In EPD mode, encoding, prefill, and decode are handled by separate workers. The ### Architecture -``` +```text HTTP Frontend (Rust) ↓ Encode Worker (Python - NOT registered, uses MultimodalEncoder) @@ -172,7 +172,7 @@ TRT-LLM components communicate using NATS messaging: | Transfer Stage | NATS Message | NIXL Transfer | |----------------|--------------|---------------| | **Frontend → Prefill** | Request with image URL or embedding path | No | -| **Encode → Prefill (Precomputed Embeddings)** | NIXL metadata (pre-computed embeddings) | Yes (Embeddings tensor) | +| **Encode → Prefill (pre-computed embeddings)** | NIXL metadata (pre-computed embeddings) | Yes (Embeddings tensor) | | **Encode → Prefill (Image URL) (WIP)** | Disaggregated params with multimodal handles | No (Handles via params) | | **Prefill → Decode** | Disaggregated params | Yes/No (KV cache - UCX or NIXL) | @@ -183,7 +183,7 @@ TRT-LLM components communicate using NATS messaging: |----------|--------|------------|---------------| | Simple Aggregated | [`examples/backends/trtllm/launch/agg.sh`](../../../examples/backends/trtllm/launch/agg.sh) | ❌ No | All in one worker | | P->D Disaggregated | [`examples/backends/trtllm/launch/disagg_multimodal.sh`](../../../examples/backends/trtllm/launch/disagg_multimodal.sh) | ⚙️ Optional | Prefill → Decode (KV cache via UCX or NIXL) | -| E->P->D Disaggregated (Precomputed Embeddings) | [`examples/backends/trtllm/launch/epd_disagg.sh`](../../../examples/backends/trtllm/launch/epd_disagg.sh) | ✅ Yes | Encoder → Prefill (pre-computed embeddings via NIXL) | +| E->P->D Disaggregated (pre-computed embeddings) | [`examples/backends/trtllm/launch/epd_disagg.sh`](../../../examples/backends/trtllm/launch/epd_disagg.sh) | ✅ Yes | Encoder → Prefill (pre-computed embeddings via NIXL) | | E->P->D Disaggregated (WIP) | 
`examples/backends/trtllm/launch/url_epd_disagg.sh` | ❌ No | Encoder → Prefill (multimodal handles via disaggregated_params)
Prefill → Decode (KV cache via UCX/NIXL) | **Note:** NIXL for KV cache transfer is currently beta and only supported on AMD64 (x86_64) architecture. @@ -221,7 +221,8 @@ TRT-LLM supports providing pre-computed embeddings, bypassing image-to-embedding TRT-LLM supports two formats for embedding files: -**1. Simple Tensor Format** +#### 1. Simple Tensor Format + - Direct tensor saved as `.pt` file - Example: `llava_next_mm_embed_seashore.pt` - Contains only the embedding tensor @@ -232,7 +233,8 @@ embedding_tensor = torch.rand(1, 576, 4096) # [batch, seq_len, hidden_dim] torch.save(embedding_tensor, "embedding.pt") ``` -**2. Dictionary Format with Auxiliary Data** +#### 2. Dictionary Format with Auxiliary Data + - Dictionary containing multiple keys - Used by models like Llama-4 that require additional metadata - Must contain `mm_embeddings` key with the main tensor diff --git a/docs/backends/vllm/multimodal_vllm_guide.md b/docs/backends/vllm/multimodal_vllm_guide.md index de562d77cdb..025d46bc1c8 100644 --- a/docs/backends/vllm/multimodal_vllm_guide.md +++ b/docs/backends/vllm/multimodal_vllm_guide.md @@ -32,7 +32,7 @@ This document provides a comprehensive guide for multimodal inference using vLLM vLLM multimodal supports three deployment patterns: -``` +```text SIMPLE AGGREGATED ([examples/backends/vllm/launch/agg_multimodal.sh](../../../examples/backends/vllm/launch/agg_multimodal.sh)): Client → Frontend (Rust processor) → Worker [image load, encode, P+D] → Response • 2 components • --connector none • Easiest setup @@ -61,7 +61,7 @@ In simple aggregated mode, encoding, prefill, and decode happen within the same ### Architecture -``` +```text HTTP Frontend with Rust processor ↓ Worker (Python - ModelInput.Tokens) @@ -75,7 +75,7 @@ In EPD aggregated mode, encoding happens in a separate worker and prefill and de ### Architecture -``` +```text HTTP Frontend (Rust) ↓ Processor (Python - ModelInput.Text) @@ -101,7 +101,7 @@ In EPD disaggregated mode, encoding, prefill, 
and decode are handled by separate ### Architecture -``` +```text HTTP Frontend (Rust) ↓ Processor (Python - ModelInput.Text) @@ -130,7 +130,7 @@ Llama 4 models don't support pre-computed embeddings, so they use a combined Enc ### Architecture -``` +```text HTTP Frontend (Rust) ↓ Processor (Python - ModelInput.Text) From 79feec637c703d4bb013f5f38b9707cd15cfe952 Mon Sep 17 00:00:00 2001 From: krishung5 Date: Thu, 4 Dec 2025 21:57:58 -0800 Subject: [PATCH 06/12] Revise some parts --- docs/backends/sglang/multimodal_sglang_guide.md | 2 +- docs/backends/trtllm/multimodal_trtllm_guide.md | 2 +- 2 files changed, 2 insertions(+), 2 deletions(-) diff --git a/docs/backends/sglang/multimodal_sglang_guide.md b/docs/backends/sglang/multimodal_sglang_guide.md index 0b8c0034e83..556179b9035 100644 --- a/docs/backends/sglang/multimodal_sglang_guide.md +++ b/docs/backends/sglang/multimodal_sglang_guide.md @@ -39,7 +39,7 @@ AGGREGATED (E->PD): DISAGGREGATED (E->P->D): Client → Frontend → Processor → Encoder [NIXL] → Prefill [bootstrap] → Decode → Response - • 4 components • Vision encoder + KV sharing • Bootstrap coordination + • 4 components • Vision encoder in Python • KV cache transfer via bootstrap mechanism ``` ## Aggregated Mode (E->PD) diff --git a/docs/backends/trtllm/multimodal_trtllm_guide.md b/docs/backends/trtllm/multimodal_trtllm_guide.md index ae5416fd022..6ab1a45ddd3 100644 --- a/docs/backends/trtllm/multimodal_trtllm_guide.md +++ b/docs/backends/trtllm/multimodal_trtllm_guide.md @@ -174,7 +174,7 @@ TRT-LLM components communicate using NATS messaging: | **Frontend → Prefill** | Request with image URL or embedding path | No | | **Encode → Prefill (pre-computed embeddings)** | NIXL metadata (pre-computed embeddings) | Yes (Embeddings tensor) | | **Encode → Prefill (Image URL) (WIP)** | Disaggregated params with multimodal handles | No (Handles via params) | -| **Prefill → Decode** | Disaggregated params | Yes/No (KV cache - UCX or NIXL) | +| **Prefill → Decode** | 
Disaggregated params | Configurable (KV cache: NIXL default, UCX optional) | ## **NIXL USE** From 3ec390091bbadf64bbe93dc2ee513436b00dfca5 Mon Sep 17 00:00:00 2001 From: krishung5 Date: Fri, 5 Dec 2025 22:18:29 -0800 Subject: [PATCH 07/12] Address comments --- .../sglang/multimodal_sglang_guide.md | 193 +------------ docs/backends/trtllm/multimodal_epd.md | 139 ---------- docs/backends/trtllm/multimodal_support.md | 95 +++++-- .../trtllm/multimodal_trtllm_guide.md | 254 +++++------------- docs/backends/vllm/multimodal_vllm_guide.md | 20 +- docs/hidden_toctree.rst | 1 - 6 files changed, 167 insertions(+), 535 deletions(-) delete mode 100644 docs/backends/trtllm/multimodal_epd.md diff --git a/docs/backends/sglang/multimodal_sglang_guide.md b/docs/backends/sglang/multimodal_sglang_guide.md index 556179b9035..f850fea4e95 100644 --- a/docs/backends/sglang/multimodal_sglang_guide.md +++ b/docs/backends/sglang/multimodal_sglang_guide.md @@ -17,7 +17,7 @@ limitations under the License. # SGLang Multimodal Guide -This document provides a comprehensive guide for multimodal inference using SGLang backend in Dynamo. +This document provides a comprehensive guide for multimodal inference using SGLang backend in Dynamo. For more details on the multimodal examples, see [Multimodal Examples Documentation](./multimodal_epd.md). 
## Multimodal Support Matrix @@ -25,7 +25,7 @@ This document provides a comprehensive guide for multimodal inference using SGLa |----------|--------------|------------|---------------|-------| | **Image** | HTTP/HTTPS URL | ✅ Yes | ✅ Yes | Vision encoder generates embeddings | | **Image** | Data URL (Base64) | ❌ No | ❌ No | Not supported | -| **Video** | HTTP/HTTPS URL | ❌ No | ❌ No | Protocol accepts, but encode worker doesn't process | +| **Video** | HTTP/HTTPS URL | ❌ No | ❌ No | Not implemented | | **Audio** | HTTP/HTTPS URL | ❌ No | ❌ No | Not implemented | ## Architecture Comparison @@ -290,189 +290,14 @@ Supported templates: `qwen2-vl`, `llama-3`, `vicuna`, etc. **Key Difference:** SGLang P→D uses bootstrap mechanism, not NIXL for KV cache like vLLM. -## GAPS and Known Limitations - -### 1. No Base64 (Data URL) Support - -**Current State:** -- Only HTTP/HTTPS URLs supported for images -- Data URLs (`data:image/jpeg;base64,...`) are **not supported** -- vLLM and TRT-LLM support data URLs, SGLang does not - -**Impact:** -- Cannot send embedded images in requests -- Requires external image hosting for all images - -### 2. No Pre-computed Embeddings Support - -**Current State:** -- No support for pre-computed embeddings (`.pt`, `.pth`, `.bin` files) -- Vision encoder must run for every request -- Cannot bypass encoding like TRT-LLM legacy flow - -**Impact:** -- Higher latency for repeated images -- Cannot optimize by pre-computing embeddings offline - -### 3. Only Processor Registers with Rust - -**Current State:** -- Only the Processor component registers with Dynamo Rust using `ModelInput.Text` -- All workers are internal and do not register -- Different from vLLM/TRT-LLM where workers also register - -**Implications:** -- Frontend always routes to Processor (cannot route directly to workers) -- No token-based entry point (no `ModelInput.Tokens` registration for workers) -- More complex multi-component setup required for all multimodal requests - -### 4. 
All Processing Happens in Python Workers - -**Current State:** -- No Rust-based image decoding or preprocessing -- No Rust tokenization (all tokenization in Python Processor) -- Frontend only handles HTTP routing - -**Impact:** -- Cannot leverage Rust performance for preprocessing -- All multimodal logic in Python components -- Similar limitation to TRT-LLM - -### 5. No Video/Audio Model Support - -**Current State:** -- **Video models are NOT supported** - Encode worker only implements image loading and processing -- **Audio models are NOT supported** - No audio encoder implementation -- Only **image modality** is production-ready -- Protocol accepts `video_url` and Processor can forward it, but Encode Worker **only processes `image_url`** - -**Why:** -```python -# encode_worker_handler.py only checks for image_url -if not request.multimodal_input.image_url: - raise ValueError("image_url is required for the encode worker.") -``` - -**Impact:** -- Cannot run video models like `LLaVA-NeXT-Video-7B-hf` -- Cannot run audio models like `Qwen2-Audio-7B-Instruct` -- Use **vLLM backend** for video/audio support (has full implementation) - -**Workaround:** -- For video models: Use vLLM ([`examples/multimodal/launch/video_agg.sh`](../../../examples/multimodal/launch/video_agg.sh)) -- For audio models: Use vLLM ([`examples/multimodal/launch/audio_agg.sh`](../../../examples/multimodal/launch/audio_agg.sh)) -- Or implement custom video/audio encode worker for SGLang - -### 6. 
Bootstrap Coordination and Routing Complexity - -**Current State:** -- Disaggregated mode requires bootstrap coordination between P and D workers -- Uses host/port/room mechanism from SGLang -- **Decode Worker is the entry point** (not Prefill like vLLM) -- Request path: `Encode → Decode → Prefill` (Decode calls Prefill) - -**Architectural Pattern:** -```text -Encode Worker → pd_worker_client → DECODE Worker - ↓ - prefill_client → PREFILL Worker -``` - -**Impact:** -- More complex P→D coordination than vLLM -- Requires network connectivity between P and D workers -- Different debugging model than vLLM - -**Routing Implications:** - -**Cannot Route Directly to Prefill:** -- Prefill Worker does NOT register with Dynamo -- Frontend cannot route requests to Prefill directly -- All disaggregated requests MUST go through Decode Worker first -- Decode Worker initiates bootstrap coordination with Prefill - -**Load Balancing Constraints:** -- Cannot distribute load directly to Prefill workers -- Must load balance at Decode Worker level -- Decode Worker becomes bottleneck for prefill requests -- Different from vLLM where frontend can route to prefill workers directly - -**Multi-Instance Limitations:** -- If you scale Prefill workers, Decode must discover them -- Cannot use frontend routing to select specific Prefill worker -- Decode Worker uses `prefill_client.generate()` (round-robin to any prefill) -- Less control over prefill worker selection compared to vLLM - -### 7. 
Manual Token Expansion in Encode Worker - -**Current State:** -- Encode worker **manually** expands image tokens from 1 → N based on embedding shape -- Token expansion happens in Python code, not handled by SGLang engine -- Hard-coded logic specific to model architecture - -**Code Location:** -```python -# encode_worker_handler.py:144-157 -# Find single image token in sequence -image_token_id_index = request.request.token_ids.index(self.image_token_id) - -# Get number of patches from embedding shape -num_image_tokens = precomputed_embeddings.shape[1] # e.g., 576 patches - -# Replace 1 token with N tokens -request.request.token_ids = ( - request.request.token_ids[:image_token_id_index] - + [self.image_token_id] * num_image_tokens - + request.request.token_ids[image_token_id_index + 1:] -) -``` - -**Why This Is Error-Prone:** - -1. **Model-Specific Logic Required:** - - Different models have different patch sizes and embedding dimensions - - Number of tokens depends on: image resolution, patch size, pooling strategy - - Must update code for each new model architecture - - Example: Qwen2-VL uses 576 patches, but other models may use different counts - -2. **Assumes Shape[1] is Patch Count:** - - Hard-coded assumption: `num_image_tokens = precomputed_embeddings.shape[1]` - - Works for: `(batch, patches, hidden_dim)` format - - Breaks for: Different embedding formats (e.g., pooled, multi-scale) - - No validation that shape[1] is actually the patch dimension - -3. **Single Image Token Assumption:** - - Assumes exactly one image token in sequence - - Fails for: Multiple images, video frames, complex layouts - - `token_ids.index()` throws error if token not found or multiple tokens - -4. **No Dynamic Resolution Support:** - - Fixed expansion based on embedding shape - - Cannot handle dynamic image resolutions without code changes - - Models with resolution-dependent patch counts need special handling - -5. 
**Tight Coupling with Chat Template:** - - Must know exact image token string from chat template - - Hard-coded token extraction logic (lines 72-87) - - Different templates may use different token formats - -**Impact:** -- **Maintenance burden**: Must update encode worker for each new model -- **Error-prone**: Easy to miscalculate token counts for new architectures -- **No abstraction**: Token expansion logic embedded in handler, not engine -- **Limited flexibility**: Cannot easily support models with variable patch counts -- **Debugging difficulty**: Token count mismatches hard to diagnose - -**Comparison with vLLM:** -- vLLM handles token expansion **internally in the engine** -- vLLM workers just pass image data, engine figures out tokens -- More robust and less prone to manual errors - -**Workaround:** -- Carefully study each new model's architecture -- Test token expansion with known inputs -- Add extensive logging for token count validation +## Known Limitations +- **No Data URL support** - Only HTTP/HTTPS URLs supported; `data:image/...` base64 URLs not supported +- **No pre-computed embeddings** - Cannot use `.pt`, `.pth`, `.bin` embedding files; vision encoder runs for every request +- **No video support** - No video encoder implementation +- **No audio support** - No audio encoder implementation +- **Only Processor registers with Dynamo** - Workers are internal components, frontend routes to Processor only +- **Disaggregated routing** - Decode Worker is the entry point (calls Prefill), cannot route directly to Prefill workers ## Supported Models diff --git a/docs/backends/trtllm/multimodal_epd.md b/docs/backends/trtllm/multimodal_epd.md deleted file mode 100644 index f70dc76d739..00000000000 --- a/docs/backends/trtllm/multimodal_epd.md +++ /dev/null @@ -1,139 +0,0 @@ -# Encode-Prefill-Decode (EPD) Flow with NIXL - -For high-performance multimodal inference with large embeddings, Dynamo supports a specialized **Encode-Prefill-Decode (EPD)** flow using 
**NIXL (RDMA)** for zero-copy tensor transfer. - -## Enabling the Feature - -This is an experimental feature that requires using a specific TensorRT-LLM commit. -To enable it build the dynamo container with the `--tensorrtllm-commit` flag, followed by the commit hash: - -```bash -./container/build.sh --framework trtllm --tensorrtllm-git-url https://github.com/NVIDIA/TensorRT-LLM.git --tensorrtllm-commit v1.2.0rc3 -``` - -## Key Features - -- **High Performance**: Zero-copy RDMA transfer for embeddings -- **Dynamic Shape Allocation**: Automatically handles variable embedding shapes per image -- **Multi-Format Support**: Works with tensor files (`.pt`) and dictionary-based embeddings -- **Hybrid Transfer**: Large tensors via NIXL, small metadata via JSON - -## How to use - -```bash -cd $DYNAMO_HOME/examples/backends/trtllm - -# Launch 3-worker EPD flow with NIXL. -./launch/epd_disagg.sh -``` - -## Pre-requsites - -This script is specifically designed to work on 8 node H200 and `Llama-4-Maverick-17B-128E-Instruct` model with assumption that you already have a model specific embedding file ready. - -## Configuration - -The EPD flow uses a dedicated **Encode Worker** that runs separately from the Prefill and Decode workers. The `ENCODE_ENDPOINT` environment variable specifies how the Prefill worker communicates with the Encode worker: - -```bash -export ENCODE_ENDPOINT="dyn://dynamo.tensorrt_llm_encode.generate" -``` - -This endpoint follows Dynamo's standard format: `dyn://namespace.component.endpoint` where the Encode worker registers itself as `dynamo.tensorrt_llm_encode.generate`. - -For local embedding file access, use the `--allowed-local-media-path "$ALLOWED_LOCAL_MEDIA_PATH"` parameter to specify the secure directory path where embedding files can be loaded from (default: `/tmp`). This prevents path traversal attacks while allowing flexible file access within the designated directory. 
- -```bash -export ALLOWED_LOCAL_MEDIA_PATH="/tmp" -``` - -For tensor file size protection, use the `--max-file-size-mb "$MAX_FILE_SIZE_MB"` parameter to limit the maximum size of downloadable embedding files/Image URLs (default: `50MB`). This prevents Denial of Service (DoS) attacks from maliciously large files while accommodating typical embedding file sizes. - -```bash -export MAX_FILE_SIZE_MB=50 -``` - -## Architecture Overview - -The EPD flow implements a **3-worker architecture** for high-performance multimodal inference: - -- **Encode Worker**: Loads and processes multimodal embeddings -- **Prefill Worker**: Handles initial context processing and KV-cache generation -- **Decode Worker**: Performs streaming token generation - -## Request Flow Diagram - -```mermaid -sequenceDiagram - participant Client - participant Frontend - participant PrefillWorker as "Prefill Worker
(PrefillHandler)" - participant EncodeWorker as "Encode Worker
(EncodeHandler)" - participant DecodeWorker as "Decode Worker
(DecodeHandler)" - participant NIXL as "NIXL
(RDMA Transfer)" - - Note over Client,NIXL: Unified Frontend: Context processing followed by streaming generation - - Client->>Frontend: POST /v1/chat/completions
(multimodal request) - Frontend->>PrefillWorker: Route to prefill worker - - Note over PrefillWorker: Check for multimodal content - PrefillWorker->>EncodeWorker: Send request
(contains embedding paths) - - Note over EncodeWorker: Load embeddings from file/url
- EncodeWorker->>NIXL: Create readable operation
- EncodeWorker->>PrefillWorker: Send metadata + NIXL info
(JSON: shape, dtype, aux_data) - - Note over PrefillWorker: Allocate tensor with dynamic shape - PrefillWorker->>NIXL: Begin read operation - NIXL-->>PrefillWorker: Zero-copy transfer complete
- - Note over PrefillWorker: Reconstruct embeddings
(mm_embeddings + special_tokens + offsets) - Note over PrefillWorker: Process full context
(text + multimodal embeddings) - Note over PrefillWorker: Generate KV-cache
(max_tokens=1 in prefill mode) - - PrefillWorker->>Frontend: Return prefill response
(disaggregated_params) - - Frontend->>DecodeWorker: Route to decode worker
with disaggregated_params - - Note over DecodeWorker: Continue generation
(streaming tokens) - DecodeWorker->>Frontend: Stream response chunk 1 - Frontend->>Client: Response chunk 1 - DecodeWorker->>Frontend: Stream response chunk 2 - Frontend->>Client: Response chunk 2 - DecodeWorker->>Frontend: ... (continue streaming) - Frontend->>Client: ... (continue streaming) - DecodeWorker->>Frontend: Final response + [DONE] - Frontend->>Client: Final response + [DONE] -``` - -## How the System Works - -1. **Request Processing**: Multimodal requests containing embedding file paths or URLs are routed by the frontend to prefill workers -2. **Multimodal Loading**: EncodeWorker loads large embedding files and extracts auxiliary metadata -3. **NIXL Transfer**: Main tensors transferred via zero-copy RDMA, small metadata via JSON for efficiency -4. **Dynamic Allocation**: Consumer workers allocate tensors with exact shapes received from EncodeWorker -5. **Reconstruction**: Original embedding format (dictionary or tensor) is reconstructed for model processing - -## Example Request - -The request format is identical to regular multimodal requests: - -```bash -curl localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{ - "model": "meta-llama/Llama-4-Maverick-17B-128E-Instruct", - "messages": [ - { - "role": "user", - "content": [ - {"type": "text", "text": "Describe the image"}, - { - "type": "image_url", - "image_url": {"url": "/path/to/embeddings.pt"} - } - ] - } - ], - "max_tokens": 160 -}' -``` diff --git a/docs/backends/trtllm/multimodal_support.md b/docs/backends/trtllm/multimodal_support.md index 7f90874be73..876bdb21a05 100644 --- a/docs/backends/trtllm/multimodal_support.md +++ b/docs/backends/trtllm/multimodal_support.md @@ -92,23 +92,50 @@ In general, disaggregated serving can run on a single node, provided the model f To deploy `Llama-4-Maverick-17B-128E-Instruct` in disaggregated mode, you will need to follow the multi-node setup instructions, which can be found [here](./multinode/multinode-multimodal-example.md). 
-## Using Pre-computed Embeddings (Experimental)
+## Pre-computed Embeddings with EPD Flow

-Dynamo with TensorRT-LLM supports providing pre-computed embeddings directly in an inference request. This bypasses the need for the model to process an image and generate embeddings itself, which is useful for performance optimization or when working with custom, pre-generated embeddings.
+For high-performance multimodal inference, Dynamo supports pre-computed embeddings with an **Encode-Prefill-Decode (EPD)** flow using **NIXL (RDMA)** for zero-copy tensor transfer.

-### How to Use
+### Enabling the Feature

-Once the container is built, you can send requests with paths to local embedding files.
+This is an experimental feature that requires using a specific TensorRT-LLM commit.
+To enable it, build the Dynamo container with the `--tensorrtllm-commit` flag:

-- **Format:** Provide the embedding as part of the `messages` array, using the `image_url` content type.
-- **URL:** The `url` field should contain the absolute or relative path to your embedding file on the local filesystem.
-- **File Types:** Supported embedding file extensions are `.pt`, `.pth`, and `.bin`. Dynamo will automatically detect these extensions.
+```bash
+./container/build.sh --framework trtllm --tensorrtllm-git-url https://github.com/NVIDIA/TensorRT-LLM.git --tensorrtllm-commit v1.2.0rc3
+```

-When a request with a supported embedding file is received, Dynamo will load the tensor from the file and pass it directly to the model for inference, skipping the image-to-embedding pipeline. 
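The EPD flow consumes an embedding file prepared offline. A minimal sketch of producing one with `torch.save` (the shape here is illustrative only — real embedding shapes are model-specific):

```python
import torch

# Illustrative shape only: [batch, seq_len, hidden_dim] varies per model.
embedding_tensor = torch.rand(1, 576, 4096)

# Save under the directory allowed by ALLOWED_LOCAL_MEDIA_PATH (default: /tmp).
torch.save(embedding_tensor, "/tmp/embedding.pt")
```

The resulting path is what you pass in the request's `image_url` field.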
+### Supported File Types -### Example Request +- `.pt` - PyTorch tensor files +- `.pth` - PyTorch checkpoint files +- `.bin` - Binary tensor files + +### How to Launch + +```bash +cd $DYNAMO_HOME/examples/backends/trtllm + +# Launch 3-worker EPD flow with NIXL +./launch/epd_disagg.sh +``` + +> **Note:** This script is designed for 8-node H200 with `Llama-4-Scout-17B-16E-Instruct` model and assumes you have a model-specific embedding file ready. + +### Configuration + +```bash +# Encode endpoint for Prefill → Encode communication +export ENCODE_ENDPOINT="dyn://dynamo.tensorrt_llm_encode.generate" + +# Security: Allowed directory for embedding files (default: /tmp) +export ALLOWED_LOCAL_MEDIA_PATH="/tmp" -Here is an example of how to send a request with a pre-computed embedding file. +# Security: Max file size to prevent DoS attacks (default: 50MB) +export MAX_FILE_SIZE_MB=50 +``` + +### Example Request ```bash curl localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{ @@ -117,27 +144,47 @@ curl localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d ' { "role": "user", "content": [ - { - "type": "text", - "text": "Describe the content represented by the embeddings" - }, - { - "type": "image_url", - "image_url": { - "url": "/path/to/your/embedding.pt" - } - } + {"type": "text", "text": "Describe the image"}, + {"type": "image_url", "image_url": {"url": "/path/to/embedding.pt"}} ] } ], - "stream": false, "max_tokens": 160 }' ``` -## Encode-Prefill-Decode (EPD) Flow with NIXL -Dynamo with the TensorRT-LLM backend supports multimodal models in Encode -> Decode -> Prefill fashion, enabling you to process embeddings seperately in a seperate worker. For detailed setup instructions, example requests, and best practices, see the [Multimodal EPD Support Guide](./multimodal_epd.md). 
+### Architecture + +The EPD flow implements a **3-worker architecture**: + +- **Encode Worker**: Loads pre-computed embeddings, transfers via NIXL +- **Prefill Worker**: Receives embeddings, handles context processing and KV-cache generation +- **Decode Worker**: Performs streaming token generation + +### Request Flow + +```mermaid +sequenceDiagram + participant Client + participant Frontend + participant PrefillWorker as "Prefill Worker" + participant EncodeWorker as "Encode Worker" + participant DecodeWorker as "Decode Worker" + participant NIXL as "NIXL (RDMA)" + + Client->>Frontend: POST /v1/chat/completions + Frontend->>PrefillWorker: Route to prefill worker + PrefillWorker->>EncodeWorker: Send request (embedding paths) + EncodeWorker->>NIXL: Create readable operation + EncodeWorker->>PrefillWorker: Send metadata + NIXL info + PrefillWorker->>NIXL: Begin read operation + NIXL-->>PrefillWorker: Zero-copy transfer complete + PrefillWorker->>Frontend: Return prefill response + Frontend->>DecodeWorker: Route to decode worker + DecodeWorker->>Frontend: Stream response chunks + Frontend->>Client: Stream response +``` ## Supported Multimodal Models -Multimodel models listed [here](https://github.com/NVIDIA/TensorRT-LLM/blob/main/tensorrt_llm/inputs/utils.py#L221) are supported by dynamo. \ No newline at end of file +Multimodal models listed in [TensorRT-LLM supported models](https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/models/supported-models.md) are supported by Dynamo. diff --git a/docs/backends/trtllm/multimodal_trtllm_guide.md b/docs/backends/trtllm/multimodal_trtllm_guide.md index 6ab1a45ddd3..8f5997923d8 100644 --- a/docs/backends/trtllm/multimodal_trtllm_guide.md +++ b/docs/backends/trtllm/multimodal_trtllm_guide.md @@ -17,7 +17,7 @@ limitations under the License. # TRT-LLM Multimodal Guide -This document provides a comprehensive guide for multimodal inference using TensorRT-LLM backend in Dynamo. 
+This document provides a comprehensive guide for multimodal inference using TensorRT-LLM backend in Dynamo. For more details on the multimodal examples, see [Multimodal Examples Documentation](./multimodal_support.md). ## Multimodal Support Matrix @@ -25,6 +25,8 @@ This document provides a comprehensive guide for multimodal inference using Tens |----------|--------------|------------|---------------|-------| | **Image** | HTTP/HTTPS URL | Yes | Yes | Full support for all image models | | **Image** | Pre-computed Embeddings (.pt, .pth, .bin) | Yes | Yes | Direct embedding files | +| **Video** | HTTP/HTTPS URL | ❌ No | ❌ No | Not implemented | +| **Audio** | HTTP/HTTPS URL | ❌ No | ❌ No | Not implemented | ## Architecture Comparison @@ -39,9 +41,9 @@ DISAGGREGATED P->D (disagg_multimodal.sh): Client → Frontend → Prefill [image load, encode] → Decode → Response • 3 components • worker flag `--disaggregation-mode prefill/decode` • Multi-GPU, KV transfer -EPD DISAGGREGATED - WIP (epd_disagg.sh): +EPD DISAGGREGATED - WIP: Client → Frontend → Encode [MultimodalEncoder] → Prefill [via params] → Decode → Response - • 4 components • worker flag `--disaggregation-mode encode/prefill/decode` • WIP PR #3818 + • 4 components • worker flag `--disaggregation-mode encode/prefill/decode` • WIP PR #4668 ``` ## Input Format Details @@ -104,9 +106,68 @@ Response Example: [`examples/backends/trtllm/launch/disagg_multimodal.sh`](../../../examples/backends/trtllm/launch/disagg_multimodal.sh) +## Pre-computed Embeddings + +TRT-LLM supports providing pre-computed embeddings, bypassing image-to-embedding processing. + +### Supported File Types + +- `.pt` - PyTorch tensor files +- `.pth` - PyTorch checkpoint files +- `.bin` - Binary tensor files + +### Embedding File Formats + +TRT-LLM supports two formats for embedding files: + +#### 1. 
Simple Tensor Format + +- Direct tensor saved as `.pt` file +- Example: `llava_next_mm_embed_seashore.pt` +- Contains only the embedding tensor + +```python +# Example: Simple tensor format +embedding_tensor = torch.rand(1, 576, 4096) # [batch, seq_len, hidden_dim] +torch.save(embedding_tensor, "embedding.pt") +``` + +#### 2. Dictionary Format with Auxiliary Data + +- Dictionary containing multiple keys +- Used by models like Llama-4 that require additional metadata +- Must contain `mm_embeddings` key with the main tensor +- Can include auxiliary data like special tokens, offsets, etc. + +```python +# Example: Dictionary format (Llama-4 style) +embedding_dict = { + "mm_embeddings": torch.rand(1, 576, 4096), + "special_tokens": [128256, 128257], + "image_token_offsets": [[0, 576]], + # ... other model-specific metadata +} +torch.save(embedding_dict, "llama4_embedding.pt") +``` + +**How They're Used:** +- **Simple tensors**: Loaded directly and passed to `mm_embeddings` parameter +- **Dictionary format**: `mm_embeddings` key extracted as main tensor, other keys preserved as auxiliary data and transferred separately + +### Launch Script + +Example: [`examples/backends/trtllm/launch/epd_disagg.sh`](../../../examples/backends/trtllm/launch/epd_disagg.sh) + +### Security Considerations + +For EPD mode with local embedding files: + +- `--allowed-local-media-path` - Specify secure directory for embedding files (default: `/tmp`) +- `--max-file-size-mb` - Limit max file size to prevent DoS attacks (default: `50MB`) + ## EPD Disaggregated Mode (E->P->D) - WIP -**Status:** Work In Progress (WIP PR #3818) - Full EPD flow with MultimodalEncoder +**Status:** Work In Progress (WIP PR #4668) - Full EPD flow with MultimodalEncoder In EPD mode, encoding, prefill, and decode are handled by separate workers. The encode worker uses TensorRT-LLM's `MultimodalEncoder` to process images and transfer embeddings via disaggregated parameters. 
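The two embedding file formats described above (simple tensor vs. dictionary with auxiliary data) can be told apart at load time. A hedged sketch of a loader — the helper name is ours for illustration, not a Dynamo API; only the `mm_embeddings` key convention comes from the formats above:

```python
import torch

def load_embedding_file(path):
    """Hypothetical helper: load either embedding file format.

    Returns (embeddings, aux_data); aux_data is empty for the
    simple tensor format.
    """
    obj = torch.load(path, map_location="cpu")
    if isinstance(obj, dict):
        # Dictionary format (e.g. Llama-4 style): main tensor under
        # "mm_embeddings"; every other key is auxiliary metadata.
        embeddings = obj["mm_embeddings"]
        aux_data = {k: v for k, v in obj.items() if k != "mm_embeddings"}
    else:
        # Simple tensor format: the file is the embedding tensor itself.
        embeddings, aux_data = obj, {}
    return embeddings, aux_data
```

This mirrors how the two formats are consumed: the main tensor goes to `mm_embeddings`, while dictionary extras travel separately as auxiliary data.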
@@ -134,11 +195,6 @@ Response | Prefill Worker | `--disaggregation-mode prefill --encode-endpoint` | Tokens | Yes | Prefill only | | Decode Worker | `--disaggregation-mode decode` | Tokens | Yes | Decode only | -### Launch Script - -Example: [`examples/backends/trtllm/launch/epd_disagg.sh`](../../../examples/backends/trtllm/launch/epd_disagg.sh) - -**Note (WIP):** The default model in the WIP PR is `llava-hf/llava-v1.6-mistral-7b-hf`. ## ModelInput Types and Registration @@ -165,11 +221,7 @@ await register_llm( ## Inter-Component Communication -### NATS-Based Messaging - -TRT-LLM components communicate using NATS messaging: - -| Transfer Stage | NATS Message | NIXL Transfer | +| Transfer Stage | Message | NIXL Transfer | |----------------|--------------|---------------| | **Frontend → Prefill** | Request with image URL or embedding path | No | | **Encode → Prefill (pre-computed embeddings)** | NIXL metadata (pre-computed embeddings) | Yes (Embeddings tensor) | @@ -184,105 +236,10 @@ TRT-LLM components communicate using NATS messaging: | Simple Aggregated | [`examples/backends/trtllm/launch/agg.sh`](../../../examples/backends/trtllm/launch/agg.sh) | ❌ No | All in one worker | | P->D Disaggregated | [`examples/backends/trtllm/launch/disagg_multimodal.sh`](../../../examples/backends/trtllm/launch/disagg_multimodal.sh) | ⚙️ Optional | Prefill → Decode (KV cache via UCX or NIXL) | | E->P->D Disaggregated (pre-computed embeddings) | [`examples/backends/trtllm/launch/epd_disagg.sh`](../../../examples/backends/trtllm/launch/epd_disagg.sh) | ✅ Yes | Encoder → Prefill (pre-computed embeddings via NIXL) | -| E->P->D Disaggregated (WIP) | `examples/backends/trtllm/launch/url_epd_disagg.sh` | ❌ No | Encoder → Prefill (multimodal handles via disaggregated_params)
Prefill → Decode (KV cache via UCX/NIXL) | +| E->P->D Disaggregated (WIP) | X | ❌ No | Encoder → Prefill (multimodal handles via disaggregated_params)
Prefill → Decode (KV cache via UCX/NIXL) | **Note:** NIXL for KV cache transfer is currently beta and only supported on AMD64 (x86_64) architecture. -## **GAPS and Known Limitations** - -### 1. No Base64 Data URL Support - -**Current State:** -- TRT-LLM does NOT support base64-encoded `data:image/...` URLs -- Use HTTP/HTTPS URLs or pre-computed embedding files instead - -### 2. E->P->D Mode is WIP - -**Current State (WIP PR #3818):** -- EPD mode (E->P->D) is under active development -- Uses `MultimodalEncoder` from TensorRT-LLM for actual image encoding (not just pre-computed embeddings) -- Embeddings transferred via `disaggregated_params` (includes `multimodal_embedding_handles` and `multimodal_hashes`) -- Encode worker does not register with frontend; accessed via `--encode-endpoint` - - -### 3. NIXL KV Cache Transfer Beta - -## Pre-computed Embeddings (Legacy) - -TRT-LLM supports providing pre-computed embeddings, bypassing image-to-embedding processing. This is the **Embeddings URL** approach for EPD mode. - -### Supported File Types - -- `.pt` - PyTorch tensor files -- `.pth` - PyTorch checkpoint files -- `.bin` - Binary tensor files - -### Embedding File Formats - -TRT-LLM supports two formats for embedding files: - -#### 1. Simple Tensor Format - -- Direct tensor saved as `.pt` file -- Example: `llava_next_mm_embed_seashore.pt` -- Contains only the embedding tensor - -```python -# Example: Simple tensor format -embedding_tensor = torch.rand(1, 576, 4096) # [batch, seq_len, hidden_dim] -torch.save(embedding_tensor, "embedding.pt") -``` - -#### 2. Dictionary Format with Auxiliary Data - -- Dictionary containing multiple keys -- Used by models like Llama-4 that require additional metadata -- Must contain `mm_embeddings` key with the main tensor -- Can include auxiliary data like special tokens, offsets, etc. 
-
-```python
-# Example: Dictionary format (Llama-4 style)
-embedding_dict = {
-    "mm_embeddings": torch.rand(1, 576, 4096),
-    "special_tokens": [128256, 128257],
-    "image_token_offsets": [[0, 576]],
-    # ... other model-specific metadata
-}
-torch.save(embedding_dict, "llama4_embedding.pt")
-```
-
-**How They're Used:**
-- **Simple tensors**: Loaded directly and passed to `mm_embeddings` parameter
-- **Dictionary format**: `mm_embeddings` key extracted as main tensor, other keys preserved as auxiliary data and transferred separately
-
-### Security Considerations
-
-For EPD mode with local embedding files:
-
-- `--allowed-local-media-path` - Specify secure directory for embedding files (default: `/tmp`)
-- `--max-file-size-mb` - Limit max file size to prevent DoS attacks (default: `50MB`)
-
-## Full EPD with Image URLs (WIP)
-
-**Status:** Work In Progress (PR #3818)
-
-The WIP full EPD flow allows sending image URLs directly to the encode worker, which uses `MultimodalEncoder` to encode them.
-
-### How It Works (WIP)
-
-1. **Client** sends image URL in request
-2. **Frontend** routes to **Prefill Worker**
-3. **Prefill Worker** calls **Encode Worker** with image URL
-4. **Encode Worker**:
-   - Downloads image using `default_multimodal_input_loader`
-   - Encodes with `MultimodalEncoder.generate()`
-   - Returns `ep_disaggregated_params` containing:
-     - `multimodal_embedding_handles` - GPU memory handles for embeddings
-     - `multimodal_hashes` - Hashes for embedding verification
-     - `processed_prompt` - Prompt with `` placeholders
-     - `prompt_token_ids` - Pre-tokenized prompt
-5. **Prefill Worker** receives embeddings via disaggregated params, performs prefill
-6. **Decode Worker** continues generation
 
 ## Key Files
 
@@ -294,76 +251,13 @@ The WIP full EPD flow allows sending image URLs directly to the encode worker, w
 | `components/src/dynamo/trtllm/request_handlers/handlers.py` | Request handler factory |
 | `components/src/dynamo/trtllm/request_handlers/handler_base.py` | Base handler and disaggregation modes |
 
-## **GAPS and Known Limitations**
-
-### 1. All Processing Happens in Python Workers
-
-**Current State:**
-- TRT-LLM multimodal workers register with `ModelInput.Tokens`
-- However, **all multimodal preprocessing happens in Python workers**, not in Rust frontend
-- Rust frontend only validates URLs and tokenizes text-only prompts
-- Python workers handle:
-  - Image downloading
-  - Image decoding (pixel-level)
-  - Vision encoding
-  - Multimodal prompt processing (adding `` tokens)
-  - Tokenization of multimodal prompts
-
-**Why This Is a Gap:**
-- No reuse of Rust preprocessing/postprocessing logic for multimodal requests
-- Inconsistent with text-only flows where Rust handles tokenization
-- Limits optimization opportunities in the frontend
-
-### 2. TRT-LLM Requires Text Prompts, Not Tokens (Current)
-
-**Current State:**
-- TRT-LLM's `MultimodalEncoder` and `LLM.generate_async()` expect **text prompts**, not pre-tokenized input
-- This differs from vLLM which can accept `TokensPrompt` directly
-- Forces Python workers to handle tokenization, even though workers register as `ModelInput.Tokens`
-
-**Ideal State:**
-- TRT-LLM should accept **pre-tokenized input** (token IDs)
-- Rust frontend could tokenize multimodal prompts (with `` placeholders)
-- Python workers would only handle vision encoding
-
-**In Progress:**
-- TRT-LLM team is working on accepting tokens instead of text prompts
-- This would enable Rust preprocessing/postprocessing reuse for multimodal requests
-- Would align TRT-LLM with vLLM's architecture where workers truly consume tokens
-
-### 3. Multimodal Processor Uses `ModelInput.Text` Semantics
-
-**Current State:**
-- `MultimodalRequestProcessor` in TRT-LLM workers expects OpenAI format messages with raw text
-- Workers effectively operate as `ModelInput.Text` despite registering as `ModelInput.Tokens`
-- This is a workaround until TRT-LLM accepts tokenized input
-
-**Impact:**
-- Architectural inconsistency between registration and actual behavior
-- Cannot leverage Rust SDK's tokenization capabilities
-- Additional complexity in Python worker code
-
-### 4. No Audio/Video Support in Dynamo TRT-LLM Backend
-
-**Current State:**
-- TensorRT-LLM engine natively supports audio and video modalities
-- Dynamo's TRT-LLM backend does **not yet** expose these capabilities
-- Only image modality is currently supported: `--modality multimodal` (images only)
-
-**Why:**
-- Dynamo backend implementation has not been extended to handle audio/video
-- `MultimodalRequestProcessor` only extracts `image_url` from messages
-- No handlers for `audio_url` or `video_url` content types
-
-**What's Missing:**
-- Audio content type processing (`"type": "audio_url"`)
-- Video content type processing (`"type": "video_url"`)
-- Integration with TensorRT-LLM's audio/video input loaders
-- Model-specific audio/video preprocessing
-
-**In Progress:**
-- Backend extension to support audio and video is planned
-- Will follow similar patterns to image support once implemented
+## Known Limitations
+
+- **No Data URL support** - Only HTTP/HTTPS URLs supported; `data:image/...` base64 URLs not supported
+- **No video support** - No video encoder implementation
+- **No audio support** - No audio encoder implementation
+- **No Rust preprocessing** - All preprocessing happens in Python workers
+- **E->P->D mode is WIP** - Full EPD with image URLs under development
 
 ## Supported Models
 
diff --git a/docs/backends/vllm/multimodal_vllm_guide.md b/docs/backends/vllm/multimodal_vllm_guide.md
index 025d46bc1c8..235c5e23f94 100644
--- a/docs/backends/vllm/multimodal_vllm_guide.md
+++ b/docs/backends/vllm/multimodal_vllm_guide.md
@@ -17,7 +17,7 @@ limitations under the License.
 
 # vLLM Multimodal Guide
 
-This document provides a comprehensive guide for multimodal inference using vLLM backend in Dynamo.
+This document provides a comprehensive guide for multimodal inference using vLLM backend in Dynamo. For more details on the multimodal examples, see [Multimodal Examples Documentation](./multimodal.md).
 
 ## Multimodal Support Matrix
 
@@ -197,15 +197,21 @@ await register_llm(
 | EP->D Disaggregated (Llama 4) | [`examples/backends/vllm/launch/disagg_multimodal_llama.sh`](../../../examples/backends/vllm/launch/disagg_multimodal_llama.sh) | ✅ Yes | Prefill → Decode (KV cache) |
 
-## **GAPS and Known Limitations**
+## Known Limitations
 
-### 1. Token-Based P->D Disaggregation Not Supported
+- **Disaggregated flows require Python Processor** - All multimodal disaggregation requires the Python Processor component (`ModelInput.Text`).
 
-**Current State:**
-- All disaggregated multimodal flows require the **Processor** component (which uses `ModelInput.Text`)
-- No support for pure token-based P->D disaggregation without multimodal processor
+## Supported Models
 
-### Key Files
+The following models have been tested with Dynamo's vLLM multimodal backend:
+
+- **Qwen2.5-VL** - `Qwen/Qwen2.5-VL-7B-Instruct`
+- **LLaVA 1.5** - `llava-hf/llava-1.5-7b-hf`
+- **Llama 4 Maverick** - `meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8
+- **LLa Next Video** - `llava-hf/LLaVA-NeXT-Video-7B-hf`
+- **Qwen2-Audio** - `Qwen/Qwen2-Audio-7B-Instruct`
+
+## Key Files
 
 | File | Description |
 |------|-------------|
diff --git a/docs/hidden_toctree.rst b/docs/hidden_toctree.rst
index 5cf74c62300..c1452e968af 100644
--- a/docs/hidden_toctree.rst
+++ b/docs/hidden_toctree.rst
@@ -50,7 +50,6 @@
     backends/trtllm/llama4_plus_eagle.md
     backends/trtllm/kv-cache-transfer.md
     backends/trtllm/multimodal_support.md
-    backends/trtllm/multimodal_epd.md
     backends/trtllm/multimodal_trtllm_guide.md
     backends/trtllm/gemma3_sliding_window_attention.md
     backends/trtllm/gpt-oss.md

From a87145940aa98853e4d9af185eccf782851f7c9c Mon Sep 17 00:00:00 2001
From: krishung5
Date: Fri, 5 Dec 2025 22:29:18 -0800
Subject: [PATCH 08/12] Add vLLM links

---
 docs/backends/vllm/multimodal_vllm_guide.md | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/docs/backends/vllm/multimodal_vllm_guide.md b/docs/backends/vllm/multimodal_vllm_guide.md
index 235c5e23f94..7bc9be53ecf 100644
--- a/docs/backends/vllm/multimodal_vllm_guide.md
+++ b/docs/backends/vllm/multimodal_vllm_guide.md
@@ -211,6 +211,8 @@ The following models have been tested with Dynamo's vLLM multimodal backend:
 - **LLa Next Video** - `llava-hf/LLaVA-NeXT-Video-7B-hf`
 - **Qwen2-Audio** - `Qwen/Qwen2-Audio-7B-Instruct`
 
+For a complete list of multimodal models supported by vLLM, see [vLLM Supported Multimodal Models](https://docs.vllm.ai/en/latest/models/supported_models/#list-of-multimodal-language-models). Models listed there should work with Simple Aggregated Mode but may not be explicitly tested.
+
 ## Key Files
 
 | File | Description |

From b2870e275d756cb4096b997cc91cfd2e90bcf646 Mon Sep 17 00:00:00 2001
From: krishung5
Date: Fri, 5 Dec 2025 22:38:55 -0800
Subject: [PATCH 09/12] Replace error-prone part

---
 docs/backends/sglang/multimodal_sglang_guide.md | 1 +
 1 file changed, 1 insertion(+)

diff --git a/docs/backends/sglang/multimodal_sglang_guide.md b/docs/backends/sglang/multimodal_sglang_guide.md
index f850fea4e95..e1ecc03e6bb 100644
--- a/docs/backends/sglang/multimodal_sglang_guide.md
+++ b/docs/backends/sglang/multimodal_sglang_guide.md
@@ -298,6 +298,7 @@ Supported templates: `qwen2-vl`, `llama-3`, `vicuna`, etc.
 - **No audio support** - No audio encoder implementation
 - **Only Processor registers with Dynamo** - Workers are internal components, frontend routes to Processor only
 - **Disaggregated routing** - Decode Worker is the entry point (calls Prefill), cannot route directly to Prefill workers
+- **Limited model generalization** - Token expansion logic is model-specific; adding new models may require implementation updates
 
 ## Supported Models
 

From 13a858c5c2c85d2a9eeec5f9690675435ba4ac7a Mon Sep 17 00:00:00 2001
From: krishung5
Date: Fri, 5 Dec 2025 22:46:45 -0800
Subject: [PATCH 10/12] Add missing backtick

---
 docs/backends/vllm/multimodal_vllm_guide.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/backends/vllm/multimodal_vllm_guide.md b/docs/backends/vllm/multimodal_vllm_guide.md
index 7bc9be53ecf..7d5317f8688 100644
--- a/docs/backends/vllm/multimodal_vllm_guide.md
+++ b/docs/backends/vllm/multimodal_vllm_guide.md
@@ -207,7 +207,7 @@ The following models have been tested with Dynamo's vLLM multimodal backend:
 
 - **Qwen2.5-VL** - `Qwen/Qwen2.5-VL-7B-Instruct`
 - **LLaVA 1.5** - `llava-hf/llava-1.5-7b-hf`
-- **Llama 4 Maverick** - `meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8
+- **Llama 4 Maverick** - `meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8`
 - **LLa Next Video** - `llava-hf/LLaVA-NeXT-Video-7B-hf`
 - **Qwen2-Audio** - `Qwen/Qwen2-Audio-7B-Instruct`

From dcbe92b6d237563f3b96ae25383f2d54dc455e87 Mon Sep 17 00:00:00 2001
From: krishung5
Date: Fri, 5 Dec 2025 22:49:28 -0800
Subject: [PATCH 11/12] Update trtllm supported models link

---
 docs/backends/trtllm/multimodal_trtllm_guide.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/backends/trtllm/multimodal_trtllm_guide.md b/docs/backends/trtllm/multimodal_trtllm_guide.md
index 8f5997923d8..3bd3b9b0feb 100644
--- a/docs/backends/trtllm/multimodal_trtllm_guide.md
+++ b/docs/backends/trtllm/multimodal_trtllm_guide.md
@@ -261,7 +261,7 @@ await register_llm(
 ## Supported Models
 
-Multimodal models listed in [TensorRT-LLM supported models](https://github.com/NVIDIA/TensorRT-LLM/blob/main/tensorrt_llm/inputs/utils.py#L221) are supported by Dynamo.
+Multimodal models listed in [TensorRT-LLM supported models](https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/models/supported-models.md) are supported by Dynamo.
 
 Common examples:
 
 - Llama 4 Vision models (Maverick, Scout)

From 73ed7aee6a664dcf4d39f1bc002cb48ee1dc9bbb Mon Sep 17 00:00:00 2001
From: krishung5
Date: Mon, 8 Dec 2025 23:08:47 -0800
Subject: [PATCH 12/12] Add qwen3-vl and fix typo

---
 docs/backends/vllm/multimodal_vllm_guide.md | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/docs/backends/vllm/multimodal_vllm_guide.md b/docs/backends/vllm/multimodal_vllm_guide.md
index 7d5317f8688..a68647d31f0 100644
--- a/docs/backends/vllm/multimodal_vllm_guide.md
+++ b/docs/backends/vllm/multimodal_vllm_guide.md
@@ -206,9 +206,10 @@ await register_llm(
 The following models have been tested with Dynamo's vLLM multimodal backend:
 
 - **Qwen2.5-VL** - `Qwen/Qwen2.5-VL-7B-Instruct`
+- **Qwen3-VL** - `Qwen/Qwen3-VL-30B-A3B-Instruct-FP8`
 - **LLaVA 1.5** - `llava-hf/llava-1.5-7b-hf`
 - **Llama 4 Maverick** - `meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8`
-- **LLa Next Video** - `llava-hf/LLaVA-NeXT-Video-7B-hf`
+- **LLaVA Next Video** - `llava-hf/LLaVA-NeXT-Video-7B-hf`
 - **Qwen2-Audio** - `Qwen/Qwen2-Audio-7B-Instruct`
 
 For a complete list of multimodal models supported by vLLM, see [vLLM Supported Multimodal Models](https://docs.vllm.ai/en/latest/models/supported_models/#list-of-multimodal-language-models). Models listed there should work with Simple Aggregated Mode but may not be explicitly tested.
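
---

Editor's note: the TRT-LLM guide edited in this series describes two pre-computed-embedding file layouts (a plain tensor, and a Llama-4-style dictionary keyed by `mm_embeddings`). A minimal sketch of producing and distinguishing both layouts follows; the shapes, token IDs, and file names are illustrative placeholders, not values taken from any model.

```python
import os
import tempfile

import torch

tmpdir = tempfile.mkdtemp()

# Layout 1: the whole .pt file is a single embedding tensor.
# Illustrative shape: (num_images, num_patches, hidden_dim).
simple = torch.rand(1, 576, 4096)
torch.save(simple, os.path.join(tmpdir, "image_embedding.pt"))

# Layout 2 (Llama-4 style): "mm_embeddings" carries the main tensor;
# the remaining keys are auxiliary metadata transferred separately.
embedding_dict = {
    "mm_embeddings": torch.rand(1, 576, 4096),
    "special_tokens": [128256, 128257],  # illustrative token IDs
    "image_token_offsets": [[0, 576]],
}
dict_path = os.path.join(tmpdir, "llama4_embedding.pt")
torch.save(embedding_dict, dict_path)

# A loader can tell the two layouts apart by type: a dict means the
# dictionary format, anything else is the plain-tensor format.
loaded = torch.load(dict_path, weights_only=True)
if isinstance(loaded, dict):
    main = loaded["mm_embeddings"]
    aux = {k: v for k, v in loaded.items() if k != "mm_embeddings"}
else:
    main, aux = loaded, {}
```

In a deployment, such files would live under the directory passed via `--allowed-local-media-path`, subject to the `--max-file-size-mb` limit described in the guide.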
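The WIP EPD flow passes a bundle of fields (`multimodal_embedding_handles`, `multimodal_hashes`, `processed_prompt`, `prompt_token_ids`) from the encode worker to the prefill worker. As a reading aid only, a hypothetical dataclass mirroring those field names — this is not the real TRT-LLM `ep_disaggregated_params` type, just an illustration of the payload shape the guide describes:

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class EpDisaggregatedParams:
    """Illustrative container for the encode→prefill handoff.

    Field names come from the guide text; types are guesses for
    illustration, not the actual TRT-LLM implementation.
    """

    multimodal_embedding_handles: List[bytes] = field(default_factory=list)  # GPU memory handles
    multimodal_hashes: List[str] = field(default_factory=list)  # embedding integrity hashes
    processed_prompt: str = ""  # prompt with image placeholders inserted
    prompt_token_ids: List[int] = field(default_factory=list)  # pre-tokenized prompt


# The encode worker would populate this and the prefill worker would
# consume it instead of re-encoding the image.
params = EpDisaggregatedParams(
    multimodal_hashes=["sha256:ab12"],  # placeholder hash
    processed_prompt="Describe the image:",
    prompt_token_ids=[1, 2, 3],  # placeholder token IDs
)
```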
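Both guides touched by this series accept OpenAI-style chat requests whose content mixes a text part with an `image_url` part, and the TRT-LLM limitations above note that only HTTP/HTTPS URLs are accepted (no base64 `data:` URLs). A sketch of building such a request body — the model name and URL are placeholders, and the helper function is ours, not part of Dynamo:

```python
def build_multimodal_request(model: str, prompt: str, image_url: str) -> dict:
    """Build an OpenAI-style chat-completion body with one image part.

    Rejects non-HTTP(S) URLs up front, matching the "No Data URL
    support" limitation called out in the TRT-LLM guide.
    """
    if not image_url.startswith(("http://", "https://")):
        raise ValueError("only HTTP/HTTPS image URLs are supported")
    return {
        "model": model,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    {"type": "image_url", "image_url": {"url": image_url}},
                ],
            }
        ],
        "max_tokens": 64,
    }


# Placeholder model name and image URL for illustration.
payload = build_multimodal_request(
    "Qwen/Qwen2.5-VL-7B-Instruct",
    "Describe this image.",
    "http://example.com/image.jpg",
)
```

The resulting dict would be POSTed as JSON to the frontend's chat-completions endpoint in the usual OpenAI-compatible way.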