# Run benchmarking with `trtllm-serve`

This step-by-step tutorial covers the following topics for running online serving benchmarks with `trtllm-serve`:
* Using `extra_llm_api_options`
* Multimodal Serving and Benchmarking

## Table of Contents
- [Run benchmarking with `trtllm-serve`](#run-benchmarking-with-trtllm-serve)
- [Table of Contents](#table-of-contents)
- [Methodology Introduction](#methodology-introduction)
- [Preparation](#preparation)
- [Launch the NGC container](#launch-the-ngc-container)
- [Start the trtllm-serve service](#start-the-trtllm-serve-service)
- [Benchmark using `tensorrt_llm.serve.scripts.benchmark_serving`](#benchmark-using-tensorrt_llmservescriptsbenchmark_serving)
- [Key Metrics](#key-metrics)
- [About `extra_llm_api_options`](#about-extra_llm_api_options)
- [`kv_cache_config`](#kv_cache_config)
- [`cuda_graph_config`](#cuda_graph_config)
- [`moe_config`](#moe_config)
- [`attention_backend`](#attention_backend)
- [Multimodal Serving and Benchmarking](#multimodal-serving-and-benchmarking)
- [Setting up Multimodal Serving](#setting-up-multimodal-serving)
- [Multimodal Benchmarking](#multimodal-benchmarking)


## Methodology Introduction

The overall performance benchmarking workflow involves two steps:
1. Launch the OpenAI-compatible service with `trtllm-serve`.
2. Run the benchmark with [benchmark_serving.py](https://github.com/NVIDIA/TensorRT-LLM/blob/main/tensorrt_llm/serve/scripts/benchmark_serving.py).

## Preparation

### Launch the NGC container

TensorRT LLM distributes the pre-built container on [NGC Catalog](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/tensorrt-llm/containers/release/tags).

```bash
# Replace x.y.z with a release tag from the NGC Catalog page linked above.
docker run --rm -it --ipc host -p 8000:8000 --gpus all --ulimit memlock=-1 --ulimit stack=67108864 \
    nvcr.io/nvidia/tensorrt-llm/release:x.y.z
```


### Start the trtllm-serve service
> [!WARNING]
> The commands and configurations presented in this document are for illustrative purposes only.
> They serve as examples and may not deliver the optimal performance for your specific use case.
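
For reference, a launch script might look like the following minimal sketch; the model placeholder, port, and `extra_llm_api_options` values are illustrative assumptions rather than tuned settings.

```bash
# serve.sh -- a minimal, illustrative sketch (values are assumptions, not tuned settings).
cat > extra_llm_api_options.yaml <<EOF
kv_cache_config:
  free_gpu_memory_fraction: 0.9
cuda_graph_config:
  enable_padding: true
EOF

trtllm-serve <model_name_or_hf_repo> \
    --host 0.0.0.0 \
    --port 8000 \
    --extra_llm_api_options extra_llm_api_options.yaml
```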

When the service is ready, the log shows output similar to the following:

```
INFO: Application startup complete.
INFO: Uvicorn running on http://localhost:8000 (Press CTRL+C to quit)
```
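
Before running the benchmark, a quick way to confirm the server responds is a single chat completion request; the `model` field below is a placeholder and must match the model passed to `trtllm-serve`.

```bash
# Hypothetical smoke test against the local OpenAI-compatible endpoint.
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "<model_name_or_hf_repo>",
        "messages": [{"role": "user", "content": "Hello!"}],
        "max_tokens": 32
      }'
```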

## Benchmark using `tensorrt_llm.serve.scripts.benchmark_serving`

Similar to starting `trtllm-serve`, create a script named `bench.sh` that executes the benchmark using the following code.

```bash
concurrency_list="1 2 4 8 16 32 64 128 256"
```
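
A minimal end-to-end sketch of such a script, assuming the random dataset and vLLM-style flag names used by `benchmark_serving`; the model name, sequence lengths, and prompt counts are illustrative assumptions, so run `python -m tensorrt_llm.serve.scripts.benchmark_serving --help` to confirm the exact flags.

```bash
# bench.sh -- illustrative sketch; flag names and values are assumptions, verify with --help.
concurrency_list="1 2 4 8 16 32 64 128 256"
model=<model_name_or_hf_repo>   # must match the model served by trtllm-serve

for concurrency in ${concurrency_list}; do
    python -m tensorrt_llm.serve.scripts.benchmark_serving \
        --model ${model} \
        --dataset-name random \
        --random-input-len 1024 \
        --random-output-len 1024 \
        --num-prompts $((concurrency * 10)) \
        --max-concurrency ${concurrency} \
        --host localhost \
        --port 8000
done
```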

`docs/source/developer-guide/perf-benchmarking.md`

# TensorRT LLM Benchmarking

```{important}
This benchmarking suite is a work in progress.
Expect breaking API changes.
```

TensorRT LLM provides the `trtllm-bench` CLI, a packaged benchmarking utility that aims to make it easier for users to reproduce our officially published [performance overview](./perf-overview.md#throughput-measurements). `trtllm-bench` provides the following (a sample invocation is sketched after the list):

- A streamlined way to build tuned engines for benchmarking across a variety of models and platforms.
- An entirely Python workflow for benchmarking.
- Ability to benchmark various flows and features within TensorRT LLM.
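
For example, a throughput run over a prepared dataset (see [Preparing a Dataset](#preparing-a-dataset)) might look like the sketch below; the model name and dataset path are illustrative assumptions rather than a tuned configuration.

```bash
# Illustrative sketch only -- the model name and dataset path are placeholders.
# The dataset file is assumed to have been generated as described in "Preparing a Dataset".
trtllm-bench --model meta-llama/Llama-3.1-8B-Instruct \
    throughput \
    --dataset dataset.jsonl \
    --backend pytorch
```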

`trtllm-bench` executes all benchmarks using in-flight batching; for more information, see the [in-flight batching section](../features/attention.md#inflight-batching), which describes the concept in further detail.
TensorRT LLM also provides an OpenAI-compatible API via the `trtllm-serve` command, which starts an OpenAI-compatible server that supports the following endpoints (a quick reachability check follows the list):
- `/v1/models`
- `/v1/completions`
- `/v1/chat/completions`
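
Before benchmarking a running server, the simplest check is to list the served models (assuming the default port 8000):

```bash
# Confirms the OpenAI-compatible server is reachable.
curl http://localhost:8000/v1/models
```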

The following guidance focuses on benchmarks using the `trtllm-bench` CLI. To benchmark the OpenAI-compatible `trtllm-serve`, refer to the [run benchmarking with `trtllm-serve`](../commands/trtllm-serve/run-benchmark-with-trtllm-serve.md) section.

## Table of Contents
- [TensorRT LLM Benchmarking](#tensorrt-llm-benchmarking)
- [Table of Contents](#table-of-contents)
- [Before Benchmarking](#before-benchmarking)
- [Persistence mode](#persistence-mode)
- [GPU Clock Management](#gpu-clock-management)
- [Set power limits](#set-power-limits)
- [Boost settings](#boost-settings)
- [Throughput Benchmarking](#throughput-benchmarking)
- [Limitations and Caveats](#limitations-and-caveats)
- [Validated Networks for Benchmarking](#validated-networks-for-benchmarking)
- [Supported Quantization Modes](#supported-quantization-modes)
- [Preparing a Dataset](#preparing-a-dataset)
- [Running with the PyTorch Workflow](#running-with-the-pytorch-workflow)
- [Benchmarking with LoRA Adapters in PyTorch workflow](#benchmarking-with-lora-adapters-in-pytorch-workflow)
- [Running multi-modal models in the PyTorch Workflow](#running-multi-modal-models-in-the-pytorch-workflow)
- [Quantization in the PyTorch Flow](#quantization-in-the-pytorch-flow)
- [Online Serving Benchmarking](#online-serving-benchmarking)


For the Llama-3.1 models, TensorRT LLM provides the following checkpoints:
- [`nvidia/Llama-3.1-70B-Instruct-FP8`](https://huggingface.co/nvidia/Llama-3.1-70B-Instruct-FP8)
- [`nvidia/Llama-3.1-405B-Instruct-FP8`](https://huggingface.co/nvidia/Llama-3.1-405B-Instruct-FP8)

To understand more about how to quantize your own checkpoints, refer to ModelOpt [documentation](https://nvidia.github.io/Model-Optimizer/deployment/3_unified_hf.html).

`trtllm-bench` utilizes the `hf_quant_config.json` file present in the pre-quantized checkpoints above. The configuration
file is present in checkpoints quantized with [Model Optimizer](https://github.com/NVIDIA/Model-Optimizer)
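
For example, a minimal YAML snippet that selects the FP8 KV-cache dtype looks like this (a sketch; any surrounding options are omitted):

```yaml
kv_cache_config:
  dtype: fp8
```
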
```{tip}
The two valid values for `kv_cache_config.dtype` are `auto` and `fp8`.
```

## Online Serving Benchmarking

TensorRT LLM provides an OpenAI-compatible API via the `trtllm-serve` command and the `tensorrt_llm.serve.scripts.benchmark_serving` package for benchmarking the online server. Alternatively, [AIPerf](https://github.com/ai-dynamo/aiperf) is a comprehensive benchmarking tool that can also measure the performance of the OpenAI-compatible server launched by `trtllm-serve`.

To benchmark the OpenAI-compatible `trtllm-serve`, please refer to the [run benchmarking with `trtllm-serve`](../commands/trtllm-serve/run-benchmark-with-trtllm-serve.md) section.