* [08/05] 🌟 TensorRT-LLM delivers Day-0 support for OpenAI's latest open-weights models: GPT-OSS-120B [➡️ link](https://huggingface.co/openai/gpt-oss-120b) and GPT-OSS-20B [➡️ link](https://huggingface.co/openai/gpt-oss-20b)
* [07/15] 🌟 TensorRT-LLM delivers Day-0 support for LG AI Research's latest model, EXAONE 4.0 [➡️ link](https://huggingface.co/LGAI-EXAONE/EXAONE-4.0-32B)
* [06/17] Join NVIDIA and DeepInfra for a developer meetup on June 26 ✨ [➡️ link](https://events.nvidia.com/scaletheunscalablenextgenai)
* [05/22] Blackwell Breaks the 1,000 TPS/User Barrier With Meta's Llama 4 Maverick
**docs/source/blogs/tech_blog/blog9_Deploying_GPT_OSS_on_TRTLLM.md** (14 additions, 15 deletions)
````diff
@@ -19,11 +19,11 @@ We have a forthcoming guide for achieving great performance on H100; however, th
 
 In this section, we introduce several ways to install TensorRT-LLM.
 
-### NGC Docker Image of dev branch
+### NGC Docker Image
 
-Day-0 support for gpt-oss is provided via the NGC container image `nvcr.io/nvidia/tensorrt-llm/release:gpt-oss-dev`. This image was built on top of the pre-day-0 **dev branch**. This container is multi-platform and will run on both x64 and arm64 architectures.
+Visit the [NGC TensorRT-LLM Release page](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/tensorrt-llm/containers/release) to find the most up-to-date NGC container image to use. You can also check the latest [release notes](https://github.com/NVIDIA/TensorRT-LLM/releases) to keep track of the support status of the latest releases.
 
-Run the following docker command to start the TensorRT-LLM container in interactive mode:
+Run the following Docker command to start the TensorRT-LLM container in interactive mode (change the image tag to match the latest release):
 
 ```bash
 docker run --rm --ipc=host -it \
@@ -33,7 +33,7 @@ docker run --rm --ipc=host -it \
     -p 8000:8000 \
     -e TRTLLM_ENABLE_PDL=1 \
     -v ~/.cache:/root/.cache:rw \
-    nvcr.io/nvidia/tensorrt-llm/release:gpt-oss-dev \
+    nvcr.io/nvidia/tensorrt-llm/release:1.1.0rc0 \
    /bin/bash
 ```
````
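For orientation, here is the command as it reads after this change, assembled from the visible diff lines. The flags on the lines the diff does not show (between `docker run` and `-p 8000:8000`) are an assumption here, labeled in the comments; take the authoritative list from the updated document.

```bash
# Assembled from the diff fragments above; not copied verbatim from the doc.
# The elided flags are assumed to include GPU access (`--gpus all` is typical
# when the NVIDIA Container Toolkit is installed); verify against the doc.
docker run --rm --ipc=host -it \
    --gpus all \
    -p 8000:8000 \
    -e TRTLLM_ENABLE_PDL=1 \
    -v ~/.cache:/root/.cache:rw \
    nvcr.io/nvidia/tensorrt-llm/release:1.1.0rc0 \
    /bin/bash
```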
```diff
@@ -53,9 +53,9 @@ Additionally, the container mounts your user `.cache` directory to save the down
 Support for gpt-oss has been [merged](https://github.com/NVIDIA/TensorRT-LLM/pull/6645) into the **main branch** of TensorRT-LLM. As we continue to optimize gpt-oss performance, you can build TensorRT-LLM from source to get the latest features and support. Please refer to the [doc](https://nvidia.github.io/TensorRT-LLM/latest/installation/build-from-source-linux.html) if you want to build from source yourself.
 
-### Regular Release of TensorRT-LLM
+### TensorRT-LLM Python Wheel Install
 
-Since gpt-oss has been supported on the main branch, you can get TensorRT-LLM out of the box through its regular release in the future. Please check the latest [release notes](https://github.com/NVIDIA/TensorRT-LLM/releases) to keep track of the support status. The release is provided as [NGC Container Image](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/tensorrt-llm/containers/release/tags) or [pip Python wheel](https://pypi.org/project/tensorrt-llm/#history). You can find instructions on pip install [here](https://nvidia.github.io/TensorRT-LLM/installation/linux.html).
+Regular releases of TensorRT-LLM are also provided as [Python wheels](https://pypi.org/project/tensorrt-llm/#history). You can find instructions on the pip install [here](https://nvidia.github.io/TensorRT-LLM/installation/linux.html).
 
 ## Performance Benchmarking and Model Serving
```
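As a companion to the wheel pointer above, here is a minimal sketch of the pip route, assuming a Linux machine with a supported CUDA setup and Python already in place; the linked installation guide is authoritative and may require extra prerequisites or package indexes.

```bash
# Minimal sketch of installing TensorRT-LLM from the Python wheel.
# Prerequisites (CUDA toolkit, MPI, supported Python version) are covered in the
# linked install guide and are assumed to be present here.
pip3 install --upgrade pip
pip3 install tensorrt-llm
python3 -c "import tensorrt_llm; print(tensorrt_llm.__version__)"
```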
````diff
@@ -210,7 +210,10 @@ We can use `trtllm-serve` to serve the model by translating the benchmark command to the following:
 
+Note: You can also point to a local path containing the model weights instead of the HF repo (e.g., `${local_model_path}`).
+
 ```bash
 trtllm-serve \
-    gpt-oss-120b \ # Or ${local_model_path}
+    openai/gpt-oss-120b \
     --host 0.0.0.0 \
     --port 8000 \
     --backend pytorch \
````
```diff
@@ -228,7 +231,8 @@ For max-throughput configuration, run:
-            "content": "What is NVIDIA's advantage for inference?"
+            "content": "What is NVIDIAs advantage for inference?"
         }
     ],
     "max_tokens": 1024,
```
```diff
@@ -348,12 +352,7 @@ others according to your needs.
 
 ## (H200/H100 Only) Using OpenAI Triton Kernels for MoE
 
-OpenAI ships a set of Triton kernels optimized for its MoE models. TensorRT-LLM can leverage these kernels for Hopper-based GPUs like NVIDIA's H200 for optimal performance. The `TRTLLM` MoE backend is not supported on Hopper, and `CUTLASS` backend support is still ongoing. Please enable the `TRITON` backend with the steps below if you are running on Hopper GPUs.
-
-### Installing OpenAI Triton
-
-The `nvcr.io/nvidia/tensorrt-llm/release:gpt-oss-dev` image has Triton prepared already (`echo $TRITON_ROOT` reveals the path). In other situations, you will need to build and install a specific version of Triton. Please follow the instructions in this [link](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/models/core/gpt_oss#using-openai-triton-kernels-for-moe).
-
+OpenAI ships a set of Triton kernels optimized for its MoE models. TensorRT-LLM can leverage these kernels for Hopper-based GPUs like NVIDIA's H200 for optimal performance. The `TRTLLM` MoE backend is not supported on Hopper, and `CUTLASS` backend support is still ongoing. Please follow the instructions in this [link](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/models/core/gpt_oss#using-openai-triton-kernels-for-moe) to install and enable the `TRITON` MoE kernels on Hopper GPUs.
```
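For readers who want to see what enabling the `TRITON` MoE path looks like in practice, a hypothetical sketch follows. The YAML key and the extra-options flag are assumptions based on the gpt_oss example README linked above, not confirmed by this diff; treat that README as the source of truth.

```bash
# Hypothetical sketch of selecting the Triton MoE backend on Hopper GPUs.
# The `moe_config.backend` key and the `--extra_llm_api_options` flag are
# assumptions; confirm them against the gpt_oss example README before use.
cat > extra_options.yaml <<'EOF'
moe_config:
  backend: TRITON
EOF

trtllm-serve \
    openai/gpt-oss-120b \
    --host 0.0.0.0 \
    --port 8000 \
    --backend pytorch \
    --extra_llm_api_options extra_options.yaml
```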