`README.md` (1 addition, 1 deletion)
@@ -192,7 +192,7 @@ It is recommended to use [NGC PyTorch Container](https://catalog.ngc.nvidia.com/
> [!Note]
> Ensure that you select a PyTorch container image version that matches the version of TensorRT-LLM you are using.
-> For example, if you are using `tensorrt-llm==1.0.0rc4`, use the PyTorch container image version `25.05`.
+> For example, if you are using `tensorrt-llm==1.0.0rc6`, use the PyTorch container image version `25.06`.
> To find the correct PyTorch container version for your desired `tensorrt-llm` release, visit the [TensorRT-LLM Dockerfile.multi](https://github.com/NVIDIA/TensorRT-LLM/blob/main/docker/Dockerfile.multi) on GitHub. Switch to the branch that matches your `tensorrt-llm` version, and look for the `BASE_TAG` line to identify the recommended PyTorch container tag.
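As a quick sketch of that lookup (the `25.06` tag mirrors the example above, and the image path follows NGC's standard `nvcr.io/nvidia/pytorch:<tag>-py3` naming, so verify it against the catalog):

```bash
# Check the installed TensorRT-LLM release, then pull the matching NGC PyTorch
# container tag (found via the BASE_TAG line in TensorRT-LLM's docker/Dockerfile.multi).
pip show tensorrt-llm | grep Version          # e.g. Version: 1.0.0rc6
docker pull nvcr.io/nvidia/pytorch:25.06-py3  # tag that matches that release
```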
- MTP is only available within the container built with the experimental TensorRT-LLM commit. Please add `--use-default-experimental-tensorrtllm-commit` to the arguments of the `build.sh` script (see the build sketch after this list).
- There is a noticeable latency for the first two inference requests. Please send warm-up requests before starting the benchmark.
- MTP performance may vary depending on the acceptance rate of predicted tokens, which depends on the dataset or queries used while benchmarking. Additionally, `ignore_eos` should generally be omitted or set to `false` when using MTP to avoid speculating garbage outputs and getting unrealistic acceptance rates.
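A minimal build sketch for the experimental container; the script location and `--framework` value are assumptions, while the `--use-default-experimental-tensorrtllm-commit` flag comes from the note above:

```bash
# Build the Dynamo TRT-LLM container against the experimental TensorRT-LLM
# commit required for MTP (script path and --framework value are assumed).
./container/build.sh --framework tensorrtllm --use-default-experimental-tensorrtllm-commit
```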
@@ -244,140 +240,7 @@ To benchmark your deployment with GenAI-Perf, see this utility script, configuri
## Multimodal support
TRTLLM supports multimodal models with dynamo. You can provide multimodal inputs in the following ways:
- By sending image URLs
- By providing paths to pre-computed embedding files
Please note that you should provide **either image URLs or embedding file paths** in a single request.
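For illustration, an image-URL request against the OpenAI-compatible endpoint might look like the following sketch; the frontend address, model name, and image URL are placeholders to adapt to your deployment:

```bash
# Image-URL request sketch; endpoint, model name, and image URL are placeholders.
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen2-VL-7B-Instruct",
    "messages": [
      {
        "role": "user",
        "content": [
          {"type": "text", "text": "Describe this image."},
          {"type": "image_url", "image_url": {"url": "https://example.com/image.jpg"}}
        ]
      }
    ],
    "max_tokens": 160
  }'
```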
### Aggregated
Here are quick steps to launch Llama-4 Maverick BF16 in aggregated mode. Once the deployment is up, a multimodal request (such as the image-URL example above) returns a response similar to the following:
{"id":"unknown-id","choices":[{"index":0,"message":{"content":"The image depicts a serene landscape featuring a large rock formation, likely El Capitan in Yosemite National Park, California. The scene is characterized by a winding road that curves from the bottom-right corner towards the center-left of the image, with a few rocks and trees lining its edge.\n\n**Key Features:**\n\n* **Rock Formation:** A prominent, tall, and flat-topped rock formation dominates the center of the image.\n* **Road:** A paved road winds its way through the landscape, curving from the bottom-right corner towards the center-left.\n* **Trees and Rocks:** Trees are visible on both sides of the road, with rocks scattered along the left side.\n* **Sky:** The sky above is blue, dotted with white clouds.\n* **Atmosphere:** The overall atmosphere of the","refusal":null,"tool_calls":null,"role":"assistant","function_call":null,"audio":null},"finish_reason":"stop","logprobs":null}],"created":1753322607,"model":"meta-llama/Llama-4-Maverick-17B-128E-Instruct","service_tier":null,"system_fingerprint":null,"object":"chat.completion","usage":null}
300
-
```
### Disaggregated
Here are quick steps to launch in disaggregated mode.
The following is an example of launching a model in disaggregated mode. While this example uses `Qwen/Qwen2-VL-7B-Instruct`, you can adapt it for other models by modifying the environment variables for the model path and engine configurations.
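A rough sketch of that adaptation is below; the variable names and script path are assumptions, so check the launch scripts shipped with this backend for the exact names they honor:

```bash
# Variable names and script path are assumptions; adjust to the launch scripts
# you are actually using.
export MODEL_PATH="Qwen/Qwen2-VL-7B-Instruct"
export SERVED_MODEL_NAME="Qwen/Qwen2-VL-7B-Instruct"
export PREFILL_ENGINE_ARGS="<path-to-prefill-engine-config>.yaml"
export DECODE_ENGINE_ARGS="<path-to-decode-engine-config>.yaml"
./launch/disagg.sh
```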
For a large model like `meta-llama/Llama-4-Maverick-17B-128E-Instruct`, a multi-node setup is required for disaggregated serving, while aggregated serving can run on a single node. This is because, in a disaggregated configuration, the model does not fit on a single node's GPUs. For instance, running this model in disaggregated mode requires a setup of 2 nodes with 8xH200 GPUs or 4 nodes with 4xGB200 GPUs.
In general, disaggregated serving can run on a single node, provided the model fits on the GPU. The multi-node requirement in this example is specific to the size and configuration of the `meta-llama/Llama-4-Maverick-17B-128E-Instruct` model.
To deploy `Llama-4-Maverick-17B-128E-Instruct` in disaggregated mode, you will need to follow the multi-node setup instructions, which can be found [here](multinode/multinode-multimodal-example.md).
### Using Pre-computed Embeddings (Experimental)
Dynamo with TensorRT-LLM supports providing pre-computed embeddings directly in an inference request. This bypasses the need for the model to process an image and generate embeddings itself, which is useful for performance optimization or when working with custom, pre-generated embeddings.
#### Enabling the Feature
This is an experimental feature that requires using a specific TensorRT-LLM commit.
To enable it, build the Dynamo container with the `--tensorrtllm-commit` flag, followed by the commit hash:
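A build sketch is shown below; the script path and `--framework` value are assumptions, and `<commit-hash>` is a placeholder for the commit required by this feature:

```bash
# <commit-hash> is a placeholder; substitute the commit required by this feature.
./container/build.sh --framework tensorrtllm --tensorrtllm-commit <commit-hash>
```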
Once the container is built, you can send requests with paths to local embedding files.
**Format:** Provide the embedding as part of the `messages` array, using the `image_url` content type.
**URL:** The `url` field should contain the absolute or relative path to your embedding file on the local filesystem.
**File Types:** Supported embedding file extensions are `.pt`, `.pth`, and `.bin`. Dynamo will automatically detect these extensions.
When a request with a supported embedding file is received, Dynamo will load the tensor from the file and pass it directly to the model for inference, skipping the image-to-embedding pipeline.
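As a minimal sketch of producing such a file (the tensor shape and dtype are placeholders and must match what your model expects):

```bash
# Save a pre-computed embedding tensor to a path that can later be referenced
# in a request. Shape and dtype below are placeholders.
python3 -c "import torch; torch.save(torch.randn(1, 576, 4096, dtype=torch.float16), '/path/to/your/embedding.pt')"
```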
#### Example Request
Here is an example of how to send a request with a pre-computed embedding file. The endpoint URL and model name below are placeholders; adjust them to match your deployment.

```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "<your-served-model-name>",
    "messages": [
      {
        "role": "user",
        "content": [
          {
            "type": "text",
            "text": "Describe the content represented by the embeddings"
          },
          {
            "type": "image_url",
            "image_url": {
              "url": "/path/to/your/embedding.pt"
            }
          }
        ]
      }
    ],
    "stream": false,
    "max_tokens": 160
  }'
```
### Supported Multimodal Models
Multimodal models listed [here](https://github.com/NVIDIA/TensorRT-LLM/blob/main/tensorrt_llm/inputs/utils.py#L221) are supported by Dynamo.
Dynamo with the TensorRT-LLM backend supports multimodal models, enabling you to process both text and images (or pre-computed embeddings) in a single request. For detailed setup instructions, example requests, and best practices, see the [Multimodal Support Guide](./multimodal_support.md).
`components/backends/trtllm/deploy/README.md` (0 additions, 8 deletions)
@@ -219,14 +219,6 @@ Send a test request to verify your deployment. See the [client section](../../..
The deployment templates support various TensorRT-LLM models and configurations. You can customize model-specific arguments in the worker configuration sections of the YAML files.
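As an illustrative sketch only, model-specific arguments are typically passed on the worker command line configured in those YAML files; the flag names below are assumptions, so check the templates in this directory for the exact schema:

```bash
# Worker command as it might appear inside a deployment YAML; flag names are
# assumptions, so mirror what the shipped templates actually use.
python3 -m dynamo.trtllm \
  --model-path deepseek-ai/DeepSeek-R1 \
  --served-model-name deepseek-ai/DeepSeek-R1 \
  --extra-engine-args <path-to-engine-config>.yaml
```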
### Multi-Token Prediction (MTP) Support
For models supporting Multi-Token Prediction (such as DeepSeek R1), special configuration is available. Note that MTP requires the experimental TensorRT-LLM commit:
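A sketch of such a configuration is below; the option names are assumptions drawn from TensorRT-LLM's LLM API config, so verify them against the engine config files referenced by the deployment templates:

```bash
# Option names are assumptions; an engine-args YAML along these lines enables
# MTP-style speculative decoding for models that support it.
cat > mtp_config.yaml <<'EOF'
speculative_config:
  decoding_type: MTP
  num_nextn_predict_layers: 3
EOF
```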
`components/backends/trtllm/gemma3_sliding_window_attention.md` (3 additions, 6 deletions)
@@ -20,12 +20,9 @@ limitations under the License.
This guide demonstrates how to deploy google/gemma-3-1b-it with Variable Sliding Window Attention (VSWA) using Dynamo. Since google/gemma-3-1b-it is a small model, each aggregated, decode, or prefill worker only requires one H100 GPU or one GB200 GPU.
VSWA is a mechanism in which a model’s layers alternate between multiple sliding window sizes. An example of this is Gemma 3, which incorporates both global attention layers and sliding window layers.
## Notes
* To run Gemma 3 with VSWA and KV Routing with KV block reuse, ensure that the container is built using commit ID `c9eebcb4541d961ab390f0bd0a22e2c89f1bcc78` from TensorRT-LLM.
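A build sketch pinned to that commit; the script path and flags are assumptions based on the build options mentioned elsewhere in these docs, while the commit ID comes from the note above:

```bash
# Script path and flags are assumptions; the commit ID comes from the note above.
./container/build.sh --framework tensorrtllm --tensorrtllm-commit c9eebcb4541d961ab390f0bd0a22e2c89f1bcc78
```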