# Run benchmarking with `trtllm-serve`

This step-by-step tutorial covers the following topics for running online serving benchmarks with `trtllm-serve`:
* Using `extra_llm_api_options`
* Multimodal Serving and Benchmarking

## Table of Contents
- [Run benchmarking with `trtllm-serve`](#run-benchmarking-with-trtllm-serve)
- [Table of Contents](#table-of-contents)
- [Methodology Introduction](#methodology-introduction)
- [Preparation](#preparation)
- [Launch the NGC container](#launch-the-ngc-container)
- [Start the trtllm-serve service](#start-the-trtllm-serve-service)
- [Benchmark using `tensorrt_llm.serve.scripts.benchmark_serving`](#benchmark-using-tensorrt_llmservescriptsbenchmark_serving)
- [Key Metrics](#key-metrics)
- [About `extra_llm_api_options`](#about-extra_llm_api_options)
- [`kv_cache_config`](#kv_cache_config)
- [`cuda_graph_config`](#cuda_graph_config)
- [`moe_config`](#moe_config)
- [`attention_backend`](#attention_backend)
- [Multimodal Serving and Benchmarking](#multimodal-serving-and-benchmarking)
- [Setting up Multimodal Serving](#setting-up-multimodal-serving)
- [Multimodal Benchmarking](#multimodal-benchmarking)


## Methodology Introduction

The overall performance benchmarking workflow involves two steps:
1. Launch the OpenAI-compatible service with `trtllm-serve`.
2. Run the benchmark with [benchmark_serving.py](https://github.com/NVIDIA/TensorRT-LLM/blob/main/tensorrt_llm/serve/scripts/benchmark_serving.py).

## Preparation

### Launch the NGC container

TensorRT LLM distributes the pre-built container on [NGC Catalog](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/tensorrt-llm/containers/release/tags).

```bash
# Replace x.y.z with a release tag from the NGC Catalog page linked above.
docker run --rm -it --ipc host -p 8000:8000 --gpus all --ulimit memlock=-1 --ulimit stack=67108864 \
    nvcr.io/nvidia/tensorrt-llm/release:x.y.z
```


### Start the trtllm-serve service
> [!WARNING]
> The commands and configurations presented in this document are for illustrative purposes only.
> They serve as examples and may not deliver the optimal performance for your specific use case.
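
For reference, a launch script might look like the following minimal sketch; the model placeholder, port, and `extra_llm_api_options` values are illustrative assumptions rather than tuned settings.

```bash
# serve.sh -- a minimal, illustrative sketch (values are assumptions, not tuned settings).
cat > extra_llm_api_options.yaml <<EOF
kv_cache_config:
  free_gpu_memory_fraction: 0.9
cuda_graph_config:
  enable_padding: true
EOF

trtllm-serve <model_name_or_hf_repo> \
    --host 0.0.0.0 \
    --port 8000 \
    --extra_llm_api_options extra_llm_api_options.yaml
```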

When the service is ready, the log shows output similar to the following:

```
INFO: Application startup complete.
INFO: Uvicorn running on http://localhost:8000 (Press CTRL+C to quit)
```
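
Before running the benchmark, a quick way to confirm the server responds is a single chat completion request; the `model` field below is a placeholder and must match the model passed to `trtllm-serve`.

```bash
# Hypothetical smoke test against the local OpenAI-compatible endpoint.
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "<model_name_or_hf_repo>",
        "messages": [{"role": "user", "content": "Hello!"}],
        "max_tokens": 32
      }'
```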

## Benchmark using `tensorrt_llm.serve.scripts.benchmark_serving`

Similar to starting `trtllm-serve`, create a script named `bench.sh` that executes the benchmark using the following code.

```bash
concurrency_list="1 2 4 8 16 32 64 128 256"
```
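
A minimal end-to-end sketch of such a script, assuming the random dataset and vLLM-style flag names used by `benchmark_serving`; the model name, sequence lengths, and prompt counts are illustrative assumptions, so run `python -m tensorrt_llm.serve.scripts.benchmark_serving --help` to confirm the exact flags.

```bash
# bench.sh -- illustrative sketch; flag names and values are assumptions, verify with --help.
concurrency_list="1 2 4 8 16 32 64 128 256"
model=<model_name_or_hf_repo>   # must match the model served by trtllm-serve

for concurrency in ${concurrency_list}; do
    python -m tensorrt_llm.serve.scripts.benchmark_serving \
        --model ${model} \
        --dataset-name random \
        --random-input-len 1024 \
        --random-output-len 1024 \
        --num-prompts $((concurrency * 10)) \
        --max-concurrency ${concurrency} \
        --host localhost \
        --port 8000
done
```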

`docs/source/developer-guide/perf-benchmarking.md`

# TensorRT LLM Benchmarking

```{important}
This benchmarking suite is a work in progress.
Expect breaking API changes.
```

TensorRT LLM provides the `trtllm-bench` CLI, a packaged benchmarking utility that aims to make it easier for users to reproduce our officially published [performance overview](./perf-overview.md#throughput-measurements). `trtllm-bench` provides the following (a sample invocation is sketched after the list):

- A streamlined way to build tuned engines for benchmarking across a variety of models and platforms.
- An entirely Python workflow for benchmarking.
- Ability to benchmark various flows and features within TensorRT LLM.
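
For example, a throughput run over a prepared dataset (see [Preparing a Dataset](#preparing-a-dataset)) might look like the sketch below; the model name and dataset path are illustrative assumptions rather than a tuned configuration.

```bash
# Illustrative sketch only -- the model name and dataset path are placeholders.
# The dataset file is assumed to have been generated as described in "Preparing a Dataset".
trtllm-bench --model meta-llama/Llama-3.1-8B-Instruct \
    throughput \
    --dataset dataset.jsonl \
    --backend pytorch
```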

`trtllm-bench` executes all benchmarks using in-flight batching; for more information, see the [in-flight batching section](../features/attention.md#inflight-batching), which describes the concept in further detail.
TensorRT LLM also provides an OpenAI-compatible API via the `trtllm-serve` command, which starts an OpenAI-compatible server that supports the following endpoints (a quick reachability check follows the list):
- `/v1/models`
- `/v1/completions`
- `/v1/chat/completions`
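
Before benchmarking a running server, the simplest check is to list the served models (assuming the default port 8000):

```bash
# Confirms the OpenAI-compatible server is reachable.
curl http://localhost:8000/v1/models
```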

The following guidance focuses on benchmarks using the `trtllm-bench` CLI. To benchmark the OpenAI-compatible `trtllm-serve`, refer to the [run benchmarking with `trtllm-serve`](../commands/trtllm-serve/run-benchmark-with-trtllm-serve.md) section.

## Table of Contents
- [TensorRT LLM Benchmarking](#tensorrt-llm-benchmarking)
- [Table of Contents](#table-of-contents)
- [Before Benchmarking](#before-benchmarking)
- [Persistence mode](#persistence-mode)
- [GPU Clock Management](#gpu-clock-management)
- [Set power limits](#set-power-limits)
- [Boost settings](#boost-settings)
- [Throughput Benchmarking](#throughput-benchmarking)
- [Limitations and Caveats](#limitations-and-caveats)
- [Validated Networks for Benchmarking](#validated-networks-for-benchmarking)
- [Supported Quantization Modes](#supported-quantization-modes)
- [Preparing a Dataset](#preparing-a-dataset)
- [Running with the PyTorch Workflow](#running-with-the-pytorch-workflow)
- [Benchmarking with LoRA Adapters in PyTorch workflow](#benchmarking-with-lora-adapters-in-pytorch-workflow)
- [Running multi-modal models in the PyTorch Workflow](#running-multi-modal-models-in-the-pytorch-workflow)
- [Quantization in the PyTorch Flow](#quantization-in-the-pytorch-flow)
- [Online Serving Benchmarking](#online-serving-benchmarking)


For the Llama-3.1 models, TensorRT LLM provides the following checkpoints:
- [`nvidia/Llama-3.1-70B-Instruct-FP8`](https://huggingface.co/nvidia/Llama-3.1-70B-Instruct-FP8)
- [`nvidia/Llama-3.1-405B-Instruct-FP8`](https://huggingface.co/nvidia/Llama-3.1-405B-Instruct-FP8)

To understand more about how to quantize your own checkpoints, refer to ModelOpt [documentation](https://nvidia.github.io/Model-Optimizer/deployment/3_unified_hf.html).

`trtllm-bench` utilizes the `hf_quant_config.json` file present in the pre-quantized checkpoints above. The configuration
file is present in checkpoints quantized with [Model Optimizer](https://github.com/NVIDIA/Model-Optimizer)
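
For example, a minimal YAML snippet that selects the FP8 KV-cache dtype looks like this (a sketch; any surrounding options are omitted):

```yaml
kv_cache_config:
  dtype: fp8
```
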
```{tip}
The two valid values for `kv_cache_config.dtype` are `auto` and `fp8`.
```

## Online Serving Benchmarking

TensorRT LLM provides an OpenAI-compatible API via the `trtllm-serve` command and the `tensorrt_llm.serve.scripts.benchmark_serving` package for benchmarking the online server. Alternatively, [AIPerf](https://github.com/ai-dynamo/aiperf) is a comprehensive benchmarking tool that can also measure the performance of the OpenAI-compatible server launched by `trtllm-serve`.

To benchmark the OpenAI-compatible `trtllm-serve`, please refer to the [run benchmarking with `trtllm-serve`](../commands/trtllm-serve/run-benchmark-with-trtllm-serve.md) section.