# Run benchmarking with `trtllm-serve`

TensorRT LLM provides the OpenAI-compatible API via the `trtllm-serve` command.
A complete reference for the API is available in the [OpenAI API Reference](https://platform.openai.com/docs/api-reference).

This step-by-step tutorial covers the following topics for running online serving benchmarking with Llama 3.1 70B and, for multimodal models, Qwen2.5-VL-7B:
* Using `extra_llm_api_options`
* Multimodal Serving and Benchmarking

## Table of Contents
- [Run benchmarking with `trtllm-serve`](#run-benchmarking-with-trtllm-serve)
- [Table of Contents](#table-of-contents)
- [Methodology Introduction](#methodology-introduction)
- [Preparation](#preparation)
- [Launch the NGC container](#launch-the-ngc-container)
- [Start the trtllm-serve service](#start-the-trtllm-serve-service)
- [Benchmark using `tensorrt_llm.serve.scripts.benchmark_serving`](#benchmark-using-tensorrt_llmservescriptsbenchmark_serving)
- [Key Metrics](#key-metrics)
- [About `extra_llm_api_options`](#about-extra_llm_api_options)
- [`kv_cache_config`](#kv_cache_config)
- [`cuda_graph_config`](#cuda_graph_config)
- [`moe_config`](#moe_config)
- [`attention_backend`](#attention_backend)
- [Multimodal Serving and Benchmarking](#multimodal-serving-and-benchmarking)
- [Setting up Multimodal Serving](#setting-up-multimodal-serving)
- [Multimodal Benchmarking](#multimodal-benchmarking)


## Methodology Introduction

The overall performance benchmarking involves two steps:
1. Launching the OpenAI-compatible service with `trtllm-serve`
2. Running the benchmark with [benchmark_serving.py](https://github.com/NVIDIA/TensorRT-LLM/blob/main/tensorrt_llm/serve/scripts/benchmark_serving.py)

## Preparation

### Launch the NGC container

TensorRT LLM distributes the pre-built container on [NGC Catalog](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/tensorrt-llm/containers/release/tags).

```bash
# Pick a <version> tag from the NGC Catalog page linked above; the ulimit values here
# follow the common TensorRT LLM container invocation and may need adjusting.
docker run --rm -it --ipc host -p 8000:8000 --gpus all --ulimit memlock=-1 --ulimit stack=67108864 \
    nvcr.io/nvidia/tensorrt-llm/release:<version>
```
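
Before launching the service, it is worth confirming inside the container that all GPUs passed through by `--gpus all` are visible:

```bash
# Run inside the container: every GPU you intend to benchmark should be listed
nvidia-smi
```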


### Start the trtllm-serve service
> [!WARNING]
> The commands and configurations presented in this document are for illustrative purposes only.
> They serve as examples and may not deliver the optimal performance for your specific use case.
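
As a minimal sketch, the service can be launched by pointing `trtllm-serve` at the model and an optional `--extra_llm_api_options` YAML file; the model name, port, and options file name below are illustrative, not a tuned configuration:

```bash
# Illustrative launch command; adjust model, port, and options file for your setup
trtllm-serve meta-llama/Llama-3.1-70B-Instruct \
    --host 0.0.0.0 \
    --port 8000 \
    --extra_llm_api_options ./extra_llm_api_options.yaml
```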
The service is ready when the log shows output similar to the following:

```
INFO: Application startup complete.
INFO: Uvicorn running on http://localhost:8000 (Press CTRL+C to quit)
```
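
Once the server is up, a quick sanity check against the OpenAI-compatible endpoints helps confirm everything works before benchmarking; the model name below is an assumption and must match what `/v1/models` reports:

```bash
# List the served model(s)
curl -s http://localhost:8000/v1/models

# Send one chat completion request
curl -s http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
          "model": "meta-llama/Llama-3.1-70B-Instruct",
          "messages": [{"role": "user", "content": "Hello!"}],
          "max_tokens": 32
        }'
```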

## Benchmark using `tensorrt_llm.serve.scripts.benchmark_serving`

Similar to starting `trtllm-serve`, create a script to execute the benchmark using the following code and name it `bench.sh`.

```bash
concurrency_list="1 2 4 8 16 32 64 128 256"
...
python -m tensorrt_llm.serve.scripts.benchmark_serving \
    ...
```
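
As a rough sketch of what such a script can look like, the loop below sweeps the concurrency list and calls the benchmark module once per concurrency. The model name, sequence lengths, and flag names are assumptions based on the common `benchmark_serving` interface; check `python -m tensorrt_llm.serve.scripts.benchmark_serving --help` for the exact options in your version.

```bash
#!/usr/bin/env bash
# Illustrative sketch only; verify flag names with --help before use.
model="meta-llama/Llama-3.1-70B-Instruct"      # assumed model name
concurrency_list="1 2 4 8 16 32 64 128 256"
isl=1024                                       # assumed input sequence length
osl=1024                                       # assumed output sequence length

for concurrency in ${concurrency_list}; do
    num_prompts=$((concurrency * 10))          # scale request count with concurrency
    python -m tensorrt_llm.serve.scripts.benchmark_serving \
        --model ${model} \
        --dataset-name random \
        --random-input-len ${isl} \
        --random-output-len ${osl} \
        --num-prompts ${num_prompts} \
        --max-concurrency ${concurrency} \
        --host localhost \
        --port 8000
done
```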
Below is some example TensorRT-LLM serving benchmark output. Your actual results may vary.
```
============ Serving Benchmark Result ============
Successful requests:                     1
Benchmark duration (s):                  0.83
Total input tokens:                      128
Total generated tokens:                  128
Request throughput (req/s):              1.20
Output token throughput (tok/s):         153.92
Total Token throughput (tok/s):          307.85
User throughput (tok/s):                 154.15
Mean Request AR:                         0.9845
Median Request AR:                       0.9845
---------------Time to First Token----------------
Mean TTFT (ms):                          84.03
Median TTFT (ms):                        84.03
P99 TTFT (ms):                           84.03
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          5.88
Median TPOT (ms):                        5.88
P99 TPOT (ms):                           5.88
---------------Inter-token Latency----------------
Mean ITL (ms):                           5.83
Median ITL (ms):                         5.88
P99 ITL (ms):                            6.14
==================================================
```
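
As a sanity check on these numbers: total token throughput is (input tokens + output tokens) / benchmark duration, here (128 + 128) / 0.83 s ≈ 308 tok/s, which matches the reported 307.85 tok/s up to rounding of the duration.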

docs/source/developer-guide/perf-benchmarking.md (39 changes: 31 additions & 8 deletions)

# TensorRT LLM Benchmarking

```{important}
This benchmarking suite is a work in progress.
Expect breaking API changes.
```

TensorRT LLM provides the `trtllm-bench` CLI, a packaged benchmarking utility that aims to make it
easier for users to reproduce our officially published [performance overview](./perf-overview.md#throughput-measurements). `trtllm-bench` provides the following:

- A streamlined way to build tuned engines for benchmarking across a variety of models and platforms.
- An entirely Python workflow for benchmarking.
- Ability to benchmark various flows and features within TensorRT LLM.

`trtllm-bench` executes all benchmarks using in-flight batching; for more information, see the [in-flight batching section](../features/attention.md#inflight-batching), which describes the concept in further detail.
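
For orientation, a typical throughput run has the following shape; the model name and dataset path are placeholders, and the sections below cover dataset preparation and tuning in detail:

```bash
# Placeholder model and dataset path; see "Preparing a Dataset" below
trtllm-bench --model meta-llama/Llama-3.1-8B-Instruct \
    throughput \
    --dataset /tmp/synthetic_128_128.txt \
    --backend pytorch
```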
TensorRT LLM also provides the OpenAI-compatible API via the `trtllm-serve` command, which starts an OpenAI-compatible server that supports the following endpoints:
- `/v1/models`
- `/v1/completions`
- `/v1/chat/completions`

The following guidance mostly focuses on benchmarks using the `trtllm-bench` CLI. To benchmark the OpenAI-compatible `trtllm-serve`, refer to the [run benchmarking with `trtllm-serve`](../commands/trtllm-serve/run-benchmark-with-trtllm-serve.md) section.

## Table of Contents
- [TensorRT LLM Benchmarking](#tensorrt-llm-benchmarking)
- [Table of Contents](#table-of-contents)
- [Before Benchmarking](#before-benchmarking)
- [Persistence mode](#persistence-mode)
- [GPU Clock Management](#gpu-clock-management)
- [Set power limits](#set-power-limits)
- [Boost settings](#boost-settings)
- [Throughput Benchmarking](#throughput-benchmarking)
- [Limitations and Caveats](#limitations-and-caveats)
- [Validated Networks for Benchmarking](#validated-networks-for-benchmarking)
- [Supported Quantization Modes](#supported-quantization-modes)
- [Preparing a Dataset](#preparing-a-dataset)
- [Running with the PyTorch Workflow](#running-with-the-pytorch-workflow)
- [Benchmarking with LoRA Adapters in PyTorch workflow](#benchmarking-with-lora-adapters-in-pytorch-workflow)
- [Running multi-modal models in the PyTorch Workflow](#running-multi-modal-models-in-the-pytorch-workflow)
- [Quantization in the PyTorch Flow](#quantization-in-the-pytorch-flow)
- [Online Serving Benchmarking](#online-serving-benchmarking)

## Before Benchmarking

```{tip}
The two valid values for `kv_cache_config.dtype` are `auto` and `fp8`.
```
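
As a minimal sketch, an extra-options YAML file that selects the FP8 KV cache could look like this; the file name is illustrative, and the file is then passed via `--extra_llm_api_options`:

```yaml
# extra_llm_api_options.yaml (illustrative)
kv_cache_config:
  dtype: fp8
```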

## Online Serving Benchmarking

TensorRT LLM provides the OpenAI-compatible API via the `trtllm-serve` command and the `tensorrt_llm.serve.scripts.benchmark_serving` package to benchmark the online server. Alternatively, [AIPerf](https://github.com/ai-dynamo/aiperf) is a comprehensive benchmarking tool that can also measure the performance of the OpenAI-compatible server launched by `trtllm-serve`.

To benchmark the OpenAI-compatible `trtllm-serve`, please refer to the [run benchmarking with `trtllm-serve`](../commands/trtllm-serve/run-benchmark-with-trtllm-serve.md) section.