README.md (4 changes: 2 additions & 2 deletions)
@@ -18,10 +18,9 @@ TensorRT-LLM
<div align="left">

## Tech Blogs
-* [08/06] Running a High Performance GPT-OSS-120B Inference Server with TensorRT-LLM
+* [08/05] Running a High-Performance GPT-OSS-120B Inference Server with TensorRT-LLM
✨ [➡️ link](./docs/source/blogs/tech_blog/blog9_Deploying_GPT_OSS_on_TRTLLM.md)


* [08/01] Scaling Expert Parallelism in TensorRT-LLM (Part 2: Performance Status and Optimization)
✨ [➡️ link](./docs/source/blogs/tech_blog/blog8_Scaling_Expert_Parallelism_in_TensorRT-LLM_part2.md)

@@ -44,6 +43,7 @@ TensorRT-LLM
✨ [➡️ link](./docs/source/blogs/tech_blog/blog1_Pushing_Latency_Boundaries_Optimizing_DeepSeek-R1_Performance_on_NVIDIA_B200_GPUs.md)

## Latest News
+* [08/05] 🌟 TensorRT-LLM delivers Day-0 support for OpenAI's latest open-weights models: GPT-OSS-120B [➡️ link](https://huggingface.co/openai/gpt-oss-120b) and GPT-OSS-20B [➡️ link](https://huggingface.co/openai/gpt-oss-20b)
* [07/15] 🌟 TensorRT-LLM delivers Day-0 support for LG AI Research's latest model, EXAONE 4.0 [➡️ link](https://huggingface.co/LGAI-EXAONE/EXAONE-4.0-32B)
* [06/17] Join NVIDIA and DeepInfra for a developer meetup on June 26 ✨ [➡️ link](https://events.nvidia.com/scaletheunscalablenextgenai)
* [05/22] Blackwell Breaks the 1,000 TPS/User Barrier With Meta’s Llama 4 Maverick
docs/source/blogs/tech_blog/blog9_Deploying_GPT_OSS_on_TRTLLM.md (29 changes: 14 additions & 15 deletions)
@@ -19,11 +19,11 @@ We have a forthcoming guide for achieving great performance on H100; however, th

In this section, we introduce several ways to install TensorRT-LLM.

-### NGC Docker Image of dev branch
+### NGC Docker Image

-Day-0 support for gpt-oss is provided via the NGC container image `nvcr.io/nvidia/tensorrt-llm/release:gpt-oss-dev`. This image was built on top of the pre-day-0 **dev branch**. This container is multi-platform and will run on both x64 and arm64 architectures.
+Visit the [NGC TensorRT-LLM Release page](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/tensorrt-llm/containers/release) to find the most up-to-date NGC container image to use. You can also check the latest [release notes](https://github.com/NVIDIA/TensorRT-LLM/releases) to keep track of the support status of recent releases.
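
If you want to pull the image ahead of time, a plain `docker pull` of the tag used later in this guide works (the tag below is the one shown in the run command that follows; substitute whatever tag the NGC release page currently lists):

```bash
# Pre-fetch the TensorRT-LLM release image; swap the tag for the latest
# release listed on the NGC release page.
docker pull nvcr.io/nvidia/tensorrt-llm/release:1.1.0rc0
```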

-Run the following docker command to start the TensorRT-LLM container in interactive mode:
+Run the following Docker command to start the TensorRT-LLM container in interactive mode (change the image tag to match the latest release):

```bash
docker run --rm --ipc=host -it \
@@ -33,7 +33,7 @@ docker run --rm --ipc=host -it \
-p 8000:8000 \
-e TRTLLM_ENABLE_PDL=1 \
-v ~/.cache:/root/.cache:rw \
-nvcr.io/nvidia/tensorrt-llm/release:gpt-oss-dev \
+nvcr.io/nvidia/tensorrt-llm/release:1.1.0rc0 \
/bin/bash
```
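
Once inside the container, a quick sanity check confirms that the TensorRT-LLM wheel and the serving entry point used later in this guide are available (a minimal check; the reported version depends on the image tag you pulled):

```bash
# Print the installed TensorRT-LLM version
python3 -c "import tensorrt_llm; print(tensorrt_llm.__version__)"

# The OpenAI-compatible server entry point should be on PATH
trtllm-serve --help
```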

@@ -53,9 +53,9 @@ Additionally, the container mounts your user `.cache` directory to save the down
Support for gpt-oss has been [merged](https://github.com/NVIDIA/TensorRT-LLM/pull/6645) into the **main branch** of TensorRT-LLM. As we continue to optimize gpt-oss performance, you can build TensorRT-LLM from source to get the latest features and support. Please refer to the [doc](https://nvidia.github.io/TensorRT-LLM/latest/installation/build-from-source-linux.html) if you want to build from source yourself.
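
As a rough sketch of the containerized flow described in the linked doc (the Make targets and their options can change between releases, so treat this as an outline rather than the authoritative procedure):

```bash
# Clone the repository with submodules and LFS artifacts
git clone https://github.com/NVIDIA/TensorRT-LLM.git
cd TensorRT-LLM
git submodule update --init --recursive
git lfs pull

# One-step build of a release image containing the freshly built wheel;
# see the build-from-source doc for wheel-only builds and per-GPU options.
make -C docker release_build
```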


-### Regular Release of TensorRT-LLM
+### TensorRT-LLM Python Wheel Install

-Since gpt-oss has been supported on the main branch, you can get TensorRT-LLM out of the box through its regular release in the future. Please check the latest [release notes](https://github.com/NVIDIA/TensorRT-LLM/releases) to keep track of the support status. The release is provided as [NGC Container Image](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/tensorrt-llm/containers/release/tags) or [pip Python wheel](https://pypi.org/project/tensorrt-llm/#history). You can find instructions on pip install [here](https://nvidia.github.io/TensorRT-LLM/installation/linux.html).
+Regular releases of TensorRT-LLM are also provided as [Python wheels](https://pypi.org/project/tensorrt-llm/#history). You can find pip installation instructions [here](https://nvidia.github.io/TensorRT-LLM/installation/linux.html).
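
For example (a minimal install on a machine that already meets the prerequisites described in the linked instructions, such as a supported CUDA toolkit and Python version):

```bash
# Install the latest released TensorRT-LLM wheel from PyPI
pip3 install --upgrade pip setuptools
pip3 install tensorrt_llm
```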


## Performance Benchmarking and Model Serving
@@ -210,7 +210,10 @@ We can use `trtllm-serve` to serve the model by translating the benchmark command

```bash
trtllm-serve \
-gpt-oss-120b \ # Or ${local_model_path}
+Note: You can also point to a local path containing the model weights instead of the HF repo (e.g., `${local_model_path}`).
+
+trtllm-serve \
+openai/gpt-oss-120b \
--host 0.0.0.0 \
--port 8000 \
--backend pytorch \
@@ -228,7 +231,8 @@ For max-throughput configuration, run:

```bash
trtllm-serve \
-gpt-oss-120b \ # Or ${local_model_path}
+trtllm-serve \
+openai/gpt-oss-120b \
--host 0.0.0.0 \
--port 8000 \
--backend pytorch \
@@ -262,7 +266,7 @@ curl localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '
"messages": [
{
"role": "user",
"content": "What is NVIDIA's advantage for inference?"
"content": "What is NVIDIAs advantage for inference?"
}
],
"max_tokens": 1024,
@@ -348,12 +352,7 @@ others according to your needs.

## (H200/H100 Only) Using OpenAI Triton Kernels for MoE

-OpenAI ships a set of Triton kernels optimized for its MoE models. TensorRT-LLM can leverage these kernels for Hopper-based GPUs like NVIDIA's H200 for optimal performance. `TRTLLM` MoE backend is not supported on Hopper, and `CUTLASS` backend support is still ongoing. Please enable `TRITON` backend with the steps below if you are running on Hopper GPUs.
-
-### Installing OpenAI Triton
-
-The `nvcr.io/nvidia/tensorrt-llm/release:gpt-oss-dev` has prepared Triton already (`echo $TRITON_ROOT` could reveal the path). In other situations, you will need to build and install a specific version of Triton. Please follow the instructions in this [link](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/models/core/gpt_oss#using-openai-triton-kernels-for-moe).
-
+OpenAI ships a set of Triton kernels optimized for its MoE models. TensorRT-LLM can leverage these kernels on Hopper-based GPUs like NVIDIA's H200 for optimal performance. The `TRTLLM` MoE backend is not supported on Hopper, and `CUTLASS` backend support is still in progress. Please follow the instructions in this [link](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/models/core/gpt_oss#using-openai-triton-kernels-for-moe) to install and enable the `TRITON` MoE kernels on Hopper GPUs.

### Selecting Triton as the MoE backend

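As a rough sketch of the selection step (the YAML keys and file name below are assumptions; verify them against the gpt-oss example linked above), the MoE backend can be chosen through the extra LLM API options file passed to `trtllm-serve`:

```bash
# Assumed sketch: select the Triton MoE kernels through the extra LLM API
# options file (key names may differ between releases; check the linked
# gpt-oss example for the authoritative settings).
cat > moe_triton.yaml <<'EOF'
moe_config:
  backend: TRITON
EOF

trtllm-serve \
    openai/gpt-oss-120b \
    --host 0.0.0.0 \
    --port 8000 \
    --backend pytorch \
    --extra_llm_api_options moe_triton.yaml
```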