
Commit 091ba01

tanmayv25 authored and nnshah1 committed
fix: Update tensorrt_llm to 1.0.0rc6 (#2606)
Signed-off-by: nnshah1 <[email protected]>
1 parent a98cd6e commit 091ba01

9 files changed: 20 additions, 168 deletions


README.md

Lines changed: 1 addition & 1 deletion
@@ -192,7 +192,7 @@ It is recommended to use [NGC PyTorch Container](https://catalog.ngc.nvidia.com/

 > [!Note]
 > Ensure that you select a PyTorch container image version that matches the version of TensorRT-LLM you are using.
-> For example, if you are using `tensorrt-llm==1.0.0rc4`, use the PyTorch container image version `25.05`.
+> For example, if you are using `tensorrt-llm==1.0.0rc6`, use the PyTorch container image version `25.06`.
 > To find the correct PyTorch container version for your desired `tensorrt-llm` release, visit the [TensorRT-LLM Dockerfile.multi](https://github.com/NVIDIA/TensorRT-LLM/blob/main/docker/Dockerfile.multi) on GitHub. Switch to the branch that matches your `tensorrt-llm` version, and look for the `BASE_TAG` line to identify the recommended PyTorch container tag.

 > [!Important]
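A quick way to follow that note is to read the `BASE_TAG` straight from the `Dockerfile.multi` of the matching TensorRT-LLM release. A minimal sketch; the `v1.0.0rc6` ref name is an assumption, so substitute whichever branch or tag matches your `tensorrt-llm` version:

```bash
# Sketch: print the recommended PyTorch base tag for a given TensorRT-LLM release.
# "v1.0.0rc6" is an assumed ref name -- replace it with the branch/tag you actually target.
curl -s https://raw.githubusercontent.com/NVIDIA/TensorRT-LLM/v1.0.0rc6/docker/Dockerfile.multi \
  | grep -m1 "BASE_TAG"
```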

components/backends/trtllm/README.md

Lines changed: 1 addition & 138 deletions
@@ -169,10 +169,6 @@ export MODEL_PATH="nvidia/DeepSeek-R1-FP4"
 ```

 Notes:
-- MTP is only available within the container built with the experimental TensorRT-LLM commit. Please add --use-default-experimental-tensorrtllm-commit to the arguments of the build.sh script.
-
-Example: `./container/build.sh --framework trtllm --use-default-experimental-tensorrtllm-commit`
-
 - There is a noticeable latency for the first two inference requests. Please send warm-up requests before starting the benchmark.
 - MTP performance may vary depending on the acceptance rate of predicted tokens, which is dependent on the dataset or queries used while benchmarking. Additionally, `ignore_eos` should generally be omitted or set to `false` when using MTP to avoid speculating garbage outputs and getting unrealistic acceptance rates.
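As an aside on the warm-up note retained above, a couple of throwaway requests before benchmarking could look like the sketch below; the endpoint and model name are taken from the surrounding examples and may differ in your deployment:

```bash
# Sketch: send two short warm-up requests so first-request latency does not skew the benchmark.
for i in 1 2; do
  curl -s localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
          "model": "nvidia/DeepSeek-R1-FP4",
          "messages": [{"role": "user", "content": "warm-up"}],
          "max_tokens": 8,
          "stream": false
        }' > /dev/null
done
```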

@@ -244,140 +240,7 @@ To benchmark your deployment with GenAI-Perf, see this utility script, configuri

 ## Multimodal support

-TRTLLM supports multimodal models with dynamo. You can provide multimodal inputs in the following ways:
-
-- By sending image URLs
-- By providing paths to pre-computed embedding files
-
-Please note that you should provide **either image URLs or embedding file paths** in a single request.
-
-### Aggregated
-
-Here are quick steps to launch Llama-4 Maverick BF16 in aggregated mode
-```bash
-cd $DYNAMO_HOME/components/backends/trtllm
-
-export AGG_ENGINE_ARGS=./engine_configs/multinode/agg.yaml
-export SERVED_MODEL_NAME="meta-llama/Llama-4-Maverick-17B-128E-Instruct"
-export MODEL_PATH="meta-llama/Llama-4-Maverick-17B-128E-Instruct"
-./launch/agg.sh
-```
-### Example Requests
-
-#### With Image URL
-
-Below is an example of an image being sent to `Llama-4-Maverick-17B-128E-Instruct` model
-
-Request :
-```bash
-curl localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{
-  "model": "meta-llama/Llama-4-Maverick-17B-128E-Instruct",
-  "messages": [
-    {
-      "role": "user",
-      "content": [
-        {
-          "type": "text",
-          "text": "Describe the image"
-        },
-        {
-          "type": "image_url",
-          "image_url": {
-            "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/inpaint.png"
-          }
-        }
-      ]
-    }
-  ],
-  "stream": false,
-  "max_tokens": 160
-}'
-```
-Response :
-
-```
-{"id":"unknown-id","choices":[{"index":0,"message":{"content":"The image depicts a serene landscape featuring a large rock formation, likely El Capitan in Yosemite National Park, California. The scene is characterized by a winding road that curves from the bottom-right corner towards the center-left of the image, with a few rocks and trees lining its edge.\n\n**Key Features:**\n\n* **Rock Formation:** A prominent, tall, and flat-topped rock formation dominates the center of the image.\n* **Road:** A paved road winds its way through the landscape, curving from the bottom-right corner towards the center-left.\n* **Trees and Rocks:** Trees are visible on both sides of the road, with rocks scattered along the left side.\n* **Sky:** The sky above is blue, dotted with white clouds.\n* **Atmosphere:** The overall atmosphere of the","refusal":null,"tool_calls":null,"role":"assistant","function_call":null,"audio":null},"finish_reason":"stop","logprobs":null}],"created":1753322607,"model":"meta-llama/Llama-4-Maverick-17B-128E-Instruct","service_tier":null,"system_fingerprint":null,"object":"chat.completion","usage":null}
-```
-
-### Disaggregated
-
-Here are quick steps to launch in disaggregated mode.
-
-The following is an example of launching a model in disaggregated mode. While this example uses `Qwen/Qwen2-VL-7B-Instruct`, you can adapt it for other models by modifying the environment variables for the model path and engine configurations.
-```bash
-cd $DYNAMO_HOME/components/backends/trtllm
-
-export MODEL_PATH=${MODEL_PATH:-"Qwen/Qwen2-VL-7B-Instruct"}
-export SERVED_MODEL_NAME=${SERVED_MODEL_NAME:-"Qwen/Qwen2-VL-7B-Instruct"}
-export DISAGGREGATION_STRATEGY=${DISAGGREGATION_STRATEGY:-"decode_first"}
-export PREFILL_ENGINE_ARGS=${PREFILL_ENGINE_ARGS:-"engine_configs/multimodal/prefill.yaml"}
-export DECODE_ENGINE_ARGS=${DECODE_ENGINE_ARGS:-"engine_configs/multimodal/decode.yaml"}
-export MODALITY=${MODALITY:-"multimodal"}
-
-./launch/disagg.sh
-```
-
-For a large model like `meta-llama/Llama-4-Maverick-17B-128E-Instruct`, a multi-node setup is required for disaggregated serving, while aggregated serving can run on a single node. This is because the model with a disaggregated configuration is too large to fit on a single node's GPUs. For instance, running this model in disaggregated mode requires a setup of 2 nodes with 8xH200 GPUs or 4 nodes with 4xGB200 GPUs.
-
-In general, disaggregated serving can run on a single node, provided the model fits on the GPU. The multi-node requirement in this example is specific to the size and configuration of the `meta-llama/Llama-4-Maverick-17B-128E-Instruct` model.
-
-To deploy `Llama-4-Maverick-17B-128E-Instruct` in disaggregated mode, you will need to follow the multi-node setup instructions, which can be found [here](multinode/multinode-multimodal-example.md).
-
-### Using Pre-computed Embeddings (Experimental)
-
-Dynamo with TensorRT-LLM supports providing pre-computed embeddings directly in an inference request. This bypasses the need for the model to process an image and generate embeddings itself, which is useful for performance optimization or when working with custom, pre-generated embeddings.
-
-#### Enabling the Feature
-
-This is an experimental feature that requires using a specific TensorRT-LLM commit.
-To enable it build the dynamo container with the `--tensorrtllm-commit` flag, followed by the commit hash:
-
-```bash
-./container/build.sh --framework trtllm --tensorrtllm-commit b4065d8ca64a64eee9fdc64b39cb66d73d4be47c
-```
-
-#### How to Use
-
-Once the container is built, you can send requests with paths to local embedding files.
-
-- **Format:** Provide the embedding as part of the `messages` array, using the `image_url` content type.
-- **URL:** The `url` field should contain the absolute or relative path to your embedding file on the local filesystem.
-- **File Types:** Supported embedding file extensions are `.pt`, `.pth`, and `.bin`. Dynamo will automatically detect these extensions.
-
-When a request with a supported embedding file is received, Dynamo will load the tensor from the file and pass it directly to the model for inference, skipping the image-to-embedding pipeline.
-
-#### Example Request
-
-Here is an example of how to send a request with a pre-computed embedding file.
-
-```bash
-curl localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{
-  "model": "meta-llama/Llama-4-Maverick-17B-128E-Instruct",
-  "messages": [
-    {
-      "role": "user",
-      "content": [
-        {
-          "type": "text",
-          "text": "Describe the content represented by the embeddings"
-        },
-        {
-          "type": "image_url",
-          "image_url": {
-            "url": "/path/to/your/embedding.pt"
-          }
-        }
-      ]
-    }
-  ],
-  "stream": false,
-  "max_tokens": 160
-}'
-```
-
-### Supported Multimodal Models
-
-Multimodel models listed [here](https://github.com/NVIDIA/TensorRT-LLM/blob/main/tensorrt_llm/inputs/utils.py#L221) are supported by dynamo.
+Dynamo with the TensorRT-LLM backend supports multimodal models, enabling you to process both text and images (or pre-computed embeddings) in a single request. For detailed setup instructions, example requests, and best practices, see the [Multimodal Support Guide](./multimodal_support.md).

 ## Performance Sweep

components/backends/trtllm/deploy/README.md

Lines changed: 0 additions & 8 deletions
@@ -219,14 +219,6 @@ Send a test request to verify your deployment. See the [client section](../../..

 The deployment templates support various TensorRT-LLM models and configurations. You can customize model-specific arguments in the worker configuration sections of the YAML files.

-### Multi-Token Prediction (MTP) Support
-
-For models supporting Multi-Token Prediction (such as DeepSeek R1), special configuration is available. Note that MTP requires the experimental TensorRT-LLM commit:
-
-```bash
-./container/build.sh --framework tensorrtllm --use-default-experimental-tensorrtllm-commit
-```
-
 ## Monitoring and Health

 - **Frontend health endpoint**: `http://<frontend-service>:8000/health`
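For completeness, the frontend health endpoint listed above can be probed directly once the deployment is up; a sketch, with the placeholder host replaced by your actual frontend service:

```bash
# Sketch: basic readiness probe against the frontend health endpoint.
curl -s http://<frontend-service>:8000/health
```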

components/backends/trtllm/gemma3_sliding_window_attention.md

Lines changed: 3 additions & 6 deletions
@@ -20,12 +20,9 @@ limitations under the License.
 This guide demonstrates how to deploy google/gemma-3-1b-it with Variable Sliding Window Attention (VSWA) using Dynamo. Since google/gemma-3-1b-it is a small model, each aggregated, decode, or prefill worker only requires one H100 GPU or one GB200 GPU.
 VSWA is a mechanism in which a model’s layers alternate between multiple sliding window sizes. An example of this is Gemma 3, which incorporates both global attention layers and sliding window layers.

-## Notes
-* To run Gemma 3 with VSWA and KV Routing with KV block reuse, ensure that the container is built using commit ID `c9eebcb4541d961ab390f0bd0a22e2c89f1bcc78` from Tensorrt-LLM.
-```bash
-./container/build.sh --framework trtllm --tensorrtllm-commit c9eebcb4541d961ab390f0bd0a22e2c89f1bcc78
-```
-* The 1.0.0rc4 release version of TensorRT-LLM can also run Gemma 3 with VSWA, but KV block reuse cannot be turned on in that version.
+> [!Note]
+> - Ensure that required services such as `nats` and `etcd` are running before starting.
+> - Request access to `google/gemma-3-1b-it` on Hugging Face and set your `HF_TOKEN` environment variable for authentication.

 ### Aggregated Serving
 ```bash

components/backends/trtllm/multimodal_support.md

Whitespace-only changes.

container/Dockerfile.trtllm

Lines changed: 9 additions & 9 deletions
@@ -2,10 +2,10 @@
 # SPDX-License-Identifier: Apache-2.0

 ARG BASE_IMAGE="nvcr.io/nvidia/pytorch"
-ARG BASE_IMAGE_TAG="25.05-py3"
+ARG BASE_IMAGE_TAG="25.06-py3"
 ARG RELEASE_BUILD
 ARG RUNTIME_IMAGE="nvcr.io/nvidia/cuda"
-ARG RUNTIME_IMAGE_TAG="12.9.0-runtime-ubuntu24.04"
+ARG RUNTIME_IMAGE_TAG="12.9.1-runtime-ubuntu24.04"

 # Define general architecture ARGs for supporting both x86 and aarch64 builds.
 # ARCH: Used for package suffixes (e.g., amd64, arm64)

@@ -140,7 +140,7 @@ COPY --from=trtllm_wheel . /trtllm_wheel/
 # Note: TensorRT needs to be uninstalled before installing the TRTLLM wheel
 # because there might be mismatched versions of TensorRT between the NGC PyTorch
 # and the TRTLLM wheel.
-# Locking triton version to 3.3.1 as 3.4.0 breaks tensorrt-llm 1.0.0rc4
+# Locking triton version to 3.3.1 as 3.4.0 breaks tensorrt-llm 1.0.0rc6
 RUN [ -f /etc/pip/constraint.txt ] && : > /etc/pip/constraint.txt || true && \
     pip uninstall -y tensorrt && \
     if [ "$HAS_TRTLLM_CONTEXT" = "1" ]; then \

@@ -425,15 +425,15 @@ COPY --from=build /usr/local/cuda/lib64/libcudart.so* /usr/local/cuda/lib64/
 COPY --from=build /usr/local/cuda/nvvm /usr/local/cuda/nvvm

 # Copy pytorch installation from NGC PyTorch
-ARG TORCH_VER=2.8.0a0+5228986c39.nv25.5
-ARG TORCHVISION_VER=0.22.0a0
+ARG TORCH_VER=2.8.0a0+5228986c39.nv25.6
+ARG TORCHVISION_VER=0.22.0a0+95f10a4e
 ARG SETUPTOOLS_VER=78.1.1
 ARG PYTORCH_TRITON_VER=3.3.0+git96316ce52.nvinternal
 ARG JINJA2_VER=3.1.6
-ARG NETWORKX_VER=3.4.2
+ARG NETWORKX_VER=3.5
 ARG SYMPY_VER=1.14.0
 ARG PACKAGING_VER=23.2
-ARG FLASH_ATTN_VER=2.7.3
+ARG FLASH_ATTN_VER=2.7.4.post1
 ARG MPMATH_VER=1.3.0
 COPY --from=build /usr/local/lib/lib* /usr/local/lib/
 COPY --from=build /usr/local/lib/python3.12/dist-packages/torch /usr/local/lib/python3.12/dist-packages/torch

@@ -474,8 +474,8 @@ COPY --from=dev /workspace/target/release/metrics /usr/local/bin/metrics
 # NOTE: If a package (tensorrt_llm) exists on both --index-url and --extra-index-url,
 # uv will prioritize the --extra-index-url, unless --index-strategy unsafe-best-match
 # is also specified. So set the configurable index as a --extra-index-url for prioritization.
-# NOTE: locking triton version to 3.3.1 as 3.4.0 breaks tensorrt-llm 1.0.0rc4
-# NOTE: locking cuda-python version to <13 to avoid breaks with tensorrt-llm 1.0.0rc4. This
+# NOTE: locking triton version to 3.3.1 as 3.4.0 breaks tensorrt-llm 1.0.0rc6
+# NOTE: locking cuda-python version to <13 to avoid breaks with tensorrt-llm 1.0.0rc6. This
 # can be removed after https://github.com/NVIDIA/TensorRT-LLM/pull/6703 is merged
 # we upgrade to a published pip wheel containing this change.
 RUN uv pip install "cuda-python>=12,<13" && \
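Taken together, the pins these comments describe amount to roughly the following install command, shown only as a sketch for reproducing the constraint set outside the Dockerfile (an extra NVIDIA package index may also be required depending on your environment):

```bash
# Sketch: the constraint set implied by the NOTEs above --
# triton held at 3.3.1, cuda-python below 13, tensorrt-llm at 1.0.0rc6.
uv pip install "cuda-python>=12,<13" "triton==3.3.1" "tensorrt-llm==1.0.0rc6"
```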

container/build.sh

Lines changed: 3 additions & 3 deletions
@@ -59,7 +59,7 @@ BUILD_CONTEXT=$(dirname "$(readlink -f "$SOURCE_DIR")")

 # Base Images
 TRTLLM_BASE_IMAGE=nvcr.io/nvidia/pytorch
-TRTLLM_BASE_IMAGE_TAG=25.05-py3
+TRTLLM_BASE_IMAGE_TAG=25.06-py3

 # Important Note: Because of ABI compatibility issues between TensorRT-LLM and NGC PyTorch,
 # we need to build the TensorRT-LLM wheel from source.

@@ -89,7 +89,7 @@ TENSORRTLLM_PIP_WHEEL_DIR="/tmp/trtllm_wheel/"
 # TensorRT-LLM commit to use for building the trtllm wheel if not provided.
 # Important Note: This commit is not used in our CI pipeline. See the CI
 # variables to learn how to run a pipeline with a specific commit.
-DEFAULT_EXPERIMENTAL_TRTLLM_COMMIT="69e9f6d48944b2ae0124ff57aa59340aa4dfae15"
+DEFAULT_EXPERIMENTAL_TRTLLM_COMMIT="a16ba6445c61ed70e7aadfe787d6f316bb422652"
 TRTLLM_COMMIT=""
 TRTLLM_USE_NIXL_KVCACHE_EXPERIMENTAL="0"
 TRTLLM_GIT_URL=""

@@ -98,7 +98,7 @@ TRTLLM_GIT_URL=""
 TENSORRTLLM_INDEX_URL="https://pypi.python.org/simple"
 # TODO: Remove the version specification from here and use the ai-dynamo[trtllm] package.
 # Need to update the Dockerfile.trtllm to use the ai-dynamo[trtllm] package.
-DEFAULT_TENSORRTLLM_PIP_WHEEL="tensorrt-llm==1.0.0rc4"
+DEFAULT_TENSORRTLLM_PIP_WHEEL="tensorrt-llm==1.0.0rc6"
 TENSORRTLLM_PIP_WHEEL=""
docs/support_matrix.md

Lines changed: 1 addition & 1 deletion
@@ -67,7 +67,7 @@ If you are using a **GPU**, the following GPU models and architectures are suppo
 | **Build Dependency** | **Version** |
 | :------------------- | :------------------------------------------------------------------------------- |
 | **Base Container** | [25.03](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/cuda-dl-base/tags) |
-| **TensorRT-LLM** | 1.0.0rc4 |
+| **TensorRT-LLM** | 1.0.0rc6 |
 | **NIXL** | 0.4.1 |

 > [!Important]

pyproject.toml

Lines changed: 2 additions & 2 deletions
@@ -48,8 +48,8 @@ Repository = "https://github.com/ai-dynamo/dynamo.git"
 [project.optional-dependencies]
 trtllm =[
     "uvloop",
-    "tensorrt-llm==1.0.0rc4",
-    "triton==3.3.1", # locking triton as version 3.4.0 breaks tensorrt-llm 1.0.0rc4
+    "tensorrt-llm==1.0.0rc6",
+    "triton==3.3.1", # locking triton as version 3.4.0 breaks tensorrt-llm 1.0.0rc6
 ]

 vllm = [
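The `trtllm` extra updated here is the same pin set exposed through the Python package; per the TODO in `container/build.sh`, the intended install path is the `ai-dynamo[trtllm]` extra. A sketch, noting that whether the published wheel already carries the rc6 pin depends on the release you install:

```bash
# Sketch: install dynamo with the TensorRT-LLM extra, which pulls in tensorrt-llm
# and the triton pin declared in pyproject.toml.
uv pip install "ai-dynamo[trtllm]"
```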
