# Run benchmarking with `trtllm-serve`

TensorRT LLM provides the OpenAI-compatible API via the `trtllm-serve` command.
A complete reference for the API is available in the [OpenAI API Reference](https://platform.openai.com/docs/api-reference).

This step-by-step tutorial covers the following topics for running online serving benchmarking with Llama 3.1 70B and, for multimodal models, Qwen2.5-VL-7B:
* Using `extra_llm_api_options`
* Multimodal Serving and Benchmarking

## Table of Contents
- [Run benchmarking with `trtllm-serve`](#run-benchmarking-with-trtllm-serve)
- [Table of Contents](#table-of-contents)
- [Methodology Introduction](#methodology-introduction)
- [Preparation](#preparation)
- [Launch the NGC container](#launch-the-ngc-container)
- [Start the trtllm-serve service](#start-the-trtllm-serve-service)
- [Benchmark using `tensorrt_llm.serve.scripts.benchmark_serving`](#benchmark-using-tensorrt_llmservescriptsbenchmark_serving)
- [Key Metrics](#key-metrics)
- [About `extra_llm_api_options`](#about-extra_llm_api_options)
- [`kv_cache_config`](#kv_cache_config)
- [`cuda_graph_config`](#cuda_graph_config)
- [`moe_config`](#moe_config)
- [`attention_backend`](#attention_backend)
- [Multimodal Serving and Benchmarking](#multimodal-serving-and-benchmarking)
- [Setting up Multimodal Serving](#setting-up-multimodal-serving)
- [Multimodal Benchmarking](#multimodal-benchmarking)


## Methodology Introduction

The overall performance benchmarking involves two steps:
1. Launching the OpenAI-compatible service with `trtllm-serve`
2. Running the benchmark with [benchmark_serving.py](https://github.com/NVIDIA/TensorRT-LLM/blob/main/tensorrt_llm/serve/scripts/benchmark_serving.py)

## Preparation

### Launch the NGC container

TensorRT LLM distributes the pre-built container on [NGC Catalog](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/tensorrt-llm/containers/release/tags).

```bash
# Pick a <version> tag from the NGC Catalog page linked above; the ulimit values here
# follow the common TensorRT LLM container invocation and may need adjusting.
docker run --rm -it --ipc host -p 8000:8000 --gpus all --ulimit memlock=-1 --ulimit stack=67108864 \
    nvcr.io/nvidia/tensorrt-llm/release:<version>
```
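
Before launching the service, it is worth confirming inside the container that all GPUs passed through by `--gpus all` are visible:

```bash
# Run inside the container: every GPU you intend to benchmark should be listed
nvidia-smi
```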


### Start the trtllm-serve service
> [!WARNING]
> The commands and configurations presented in this document are for illustrative purposes only.
> They serve as examples and may not deliver the optimal performance for your specific use case.
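
As a minimal sketch, the service can be launched by pointing `trtllm-serve` at the model and an optional `--extra_llm_api_options` YAML file; the model name, port, and options file name below are illustrative, not a tuned configuration:

```bash
# Illustrative launch command; adjust model, port, and options file for your setup
trtllm-serve meta-llama/Llama-3.1-70B-Instruct \
    --host 0.0.0.0 \
    --port 8000 \
    --extra_llm_api_options ./extra_llm_api_options.yaml
```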
The service is ready when the log shows output similar to the following:

```
INFO: Application startup complete.
INFO: Uvicorn running on http://localhost:8000 (Press CTRL+C to quit)
```
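
Once the server is up, a quick sanity check against the OpenAI-compatible endpoints helps confirm everything works before benchmarking; the model name below is an assumption and must match what `/v1/models` reports:

```bash
# List the served model(s)
curl -s http://localhost:8000/v1/models

# Send one chat completion request
curl -s http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
          "model": "meta-llama/Llama-3.1-70B-Instruct",
          "messages": [{"role": "user", "content": "Hello!"}],
          "max_tokens": 32
        }'
```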

## Benchmark using `tensorrt_llm.serve.scripts.benchmark_serving`

Similar to starting `trtllm-serve`, create a script to execute the benchmark using the following code and name it `bench.sh`.

```bash
concurrency_list="1 2 4 8 16 32 64 128 256"
...
python -m tensorrt_llm.serve.scripts.benchmark_serving \
    ...
```
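
As a rough sketch of what such a script can look like, the loop below sweeps the concurrency list and calls the benchmark module once per concurrency. The model name, sequence lengths, and flag names are assumptions based on the common `benchmark_serving` interface; check `python -m tensorrt_llm.serve.scripts.benchmark_serving --help` for the exact options in your version.

```bash
#!/usr/bin/env bash
# Illustrative sketch only; verify flag names with --help before use.
model="meta-llama/Llama-3.1-70B-Instruct"      # assumed model name
concurrency_list="1 2 4 8 16 32 64 128 256"
isl=1024                                       # assumed input sequence length
osl=1024                                       # assumed output sequence length

for concurrency in ${concurrency_list}; do
    num_prompts=$((concurrency * 10))          # scale request count with concurrency
    python -m tensorrt_llm.serve.scripts.benchmark_serving \
        --model ${model} \
        --dataset-name random \
        --random-input-len ${isl} \
        --random-output-len ${osl} \
        --num-prompts ${num_prompts} \
        --max-concurrency ${concurrency} \
        --host localhost \
        --port 8000
done
```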
Below is some example TensorRT-LLM serving benchmark output. Your actual results may vary.
```
============ Serving Benchmark Result ============
Successful requests:                     1
Benchmark duration (s):                  0.83
Total input tokens:                      128
Total generated tokens:                  128
Request throughput (req/s):              1.20
Output token throughput (tok/s):         153.92
Total Token throughput (tok/s):          307.85
User throughput (tok/s):                 154.15
Mean Request AR:                         0.9845
Median Request AR:                       0.9845
---------------Time to First Token----------------
Mean TTFT (ms):                          84.03
Median TTFT (ms):                        84.03
P99 TTFT (ms):                           84.03
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          5.88
Median TPOT (ms):                        5.88
P99 TPOT (ms):                           5.88
---------------Inter-token Latency----------------
Mean ITL (ms):                           5.83
Median ITL (ms):                         5.88
P99 ITL (ms):                            6.14
==================================================
```
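
As a sanity check on these numbers: total token throughput is (input tokens + output tokens) / benchmark duration, here (128 + 128) / 0.83 s ≈ 308 tok/s, which matches the reported 307.85 tok/s up to rounding of the duration.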

docs/source/developer-guide/perf-benchmarking.md (39 changes: 31 additions & 8 deletions)

# TensorRT LLM Benchmarking

```{important}
This benchmarking suite is a work in progress.
Expect breaking API changes.
```

TensorRT LLM provides the `trtllm-bench` CLI, a packaged benchmarking utility that aims to make it
easier for users to reproduce our officially published [performance overview](./perf-overview.md#throughput-measurements). `trtllm-bench` provides the following:

- A streamlined way to build tuned engines for benchmarking across a variety of models and platforms.
- An entirely Python workflow for benchmarking.
- Ability to benchmark various flows and features within TensorRT LLM.

`trtllm-bench` executes all benchmarks using in-flight batching; for more information, see the [in-flight batching section](../features/attention.md#inflight-batching), which describes the concept in further detail.
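
For orientation, a typical throughput run has the following shape; the model name and dataset path are placeholders, and the sections below cover dataset preparation and tuning in detail:

```bash
# Placeholder model and dataset path; see "Preparing a Dataset" below
trtllm-bench --model meta-llama/Llama-3.1-8B-Instruct \
    throughput \
    --dataset /tmp/synthetic_128_128.txt \
    --backend pytorch
```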
TensorRT LLM also provides the OpenAI-compatible API via the `trtllm-serve` command, which starts an OpenAI-compatible server that supports the following endpoints:
- `/v1/models`
- `/v1/completions`
- `/v1/chat/completions`

The following guidance mostly focuses on benchmarks using the `trtllm-bench` CLI. To benchmark the OpenAI-compatible `trtllm-serve`, refer to the [run benchmarking with `trtllm-serve`](../commands/trtllm-serve/run-benchmark-with-trtllm-serve.md) section.

## Table of Contents
- [TensorRT LLM Benchmarking](#tensorrt-llm-benchmarking)
- [Table of Contents](#table-of-contents)
- [Before Benchmarking](#before-benchmarking)
- [Persistence mode](#persistence-mode)
- [GPU Clock Management](#gpu-clock-management)
- [Set power limits](#set-power-limits)
- [Boost settings](#boost-settings)
- [Throughput Benchmarking](#throughput-benchmarking)
- [Limitations and Caveats](#limitations-and-caveats)
- [Validated Networks for Benchmarking](#validated-networks-for-benchmarking)
- [Supported Quantization Modes](#supported-quantization-modes)
- [Preparing a Dataset](#preparing-a-dataset)
- [Running with the PyTorch Workflow](#running-with-the-pytorch-workflow)
- [Benchmarking with LoRA Adapters in PyTorch workflow](#benchmarking-with-lora-adapters-in-pytorch-workflow)
- [Running multi-modal models in the PyTorch Workflow](#running-multi-modal-models-in-the-pytorch-workflow)
- [Quantization in the PyTorch Flow](#quantization-in-the-pytorch-flow)
- [Online Serving Benchmarking](#online-serving-benchmarking)

## Before Benchmarking

```{tip}
The two valid values for `kv_cache_config.dtype` are `auto` and `fp8`.
```
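
As a minimal sketch, an extra-options YAML file that selects the FP8 KV cache could look like this; the file name is illustrative, and the file is then passed via `--extra_llm_api_options`:

```yaml
# extra_llm_api_options.yaml (illustrative)
kv_cache_config:
  dtype: fp8
```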

## Online Serving Benchmarking

TensorRT LLM provides the OpenAI-compatible API via the `trtllm-serve` command and the `tensorrt_llm.serve.scripts.benchmark_serving` package to benchmark the online server. Alternatively, [AIPerf](https://github.com/ai-dynamo/aiperf) is a comprehensive benchmarking tool that can also measure the performance of the OpenAI-compatible server launched by `trtllm-serve`.

To benchmark the OpenAI-compatible `trtllm-serve`, please refer to the [run benchmarking with `trtllm-serve`](../commands/trtllm-serve/run-benchmark-with-trtllm-serve.md) section.