Merged
66 commits
d067884
[Benchmark] Convenience script for multiple parameter combinations
DarkLight1337 Oct 17, 2025
e9f546d
Put run number first for easier conversion to CSV
DarkLight1337 Oct 17, 2025
211aa98
Output CSV and separate results by timestamp
DarkLight1337 Oct 17, 2025
cc687a1
Clean up
DarkLight1337 Oct 17, 2025
c809ff9
Comment
DarkLight1337 Oct 17, 2025
481e1e2
Clean
DarkLight1337 Oct 17, 2025
ba3e8dc
Comment
DarkLight1337 Oct 17, 2025
978c626
Support benchmark overrides
DarkLight1337 Oct 17, 2025
34f35ea
Fix a wrong link
DarkLight1337 Oct 17, 2025
e9c13a9
Convert to str
DarkLight1337 Oct 17, 2025
af893ab
Add resume functionality
DarkLight1337 Oct 17, 2025
9483e6c
Avoid restarting the server
DarkLight1337 Oct 18, 2025
729fb43
Add SLA tuning
DarkLight1337 Oct 18, 2025
85cfc56
Fix name
DarkLight1337 Oct 18, 2025
76b165d
Fix name
DarkLight1337 Oct 18, 2025
7695590
Fix name
DarkLight1337 Oct 18, 2025
95bef21
Fix name
DarkLight1337 Oct 18, 2025
dcfa092
Fix name
DarkLight1337 Oct 18, 2025
1016665
Fix request rate not set
DarkLight1337 Oct 18, 2025
a293bce
Use binary search
DarkLight1337 Oct 18, 2025
c824d02
Simplify
DarkLight1337 Oct 18, 2025
0871752
Fix convergence
DarkLight1337 Oct 18, 2025
42ab905
Multiple runs per iter for better reliability
DarkLight1337 Oct 18, 2025
dd9a578
Fix name
DarkLight1337 Oct 18, 2025
d2bb55a
Avoid redundant print
DarkLight1337 Oct 18, 2025
69b5f58
Be less noisy
DarkLight1337 Oct 18, 2025
bbcf90a
Improve search
DarkLight1337 Oct 18, 2025
2604782
Parametrize variable
DarkLight1337 Oct 18, 2025
06a9004
Log
DarkLight1337 Oct 18, 2025
ead78bf
Round up
DarkLight1337 Oct 18, 2025
82cde34
Simplify
DarkLight1337 Oct 18, 2025
db63f8d
Prettify JSON
DarkLight1337 Oct 18, 2025
a41728d
Fix boolean arg
DarkLight1337 Oct 18, 2025
d19b370
Handle host as well
DarkLight1337 Oct 18, 2025
1abac31
Log
DarkLight1337 Oct 18, 2025
685cd35
Fix edge case
DarkLight1337 Oct 18, 2025
47a2d22
Address comment
DarkLight1337 Oct 18, 2025
ded92c8
Fix init estimate
DarkLight1337 Oct 18, 2025
87c313a
Optimize search
DarkLight1337 Oct 18, 2025
a7cdfd8
Fix
DarkLight1337 Oct 18, 2025
2058c3c
Merge branch 'main' into bench-serve-multi
DarkLight1337 Oct 18, 2025
fac6773
Handle infinity server
DarkLight1337 Oct 18, 2025
bb9c807
Share SLA results
DarkLight1337 Oct 18, 2025
5471c22
Improve detection
DarkLight1337 Oct 18, 2025
21a60fa
Fix off by one
DarkLight1337 Oct 18, 2025
d7c0576
Sanitize
DarkLight1337 Oct 18, 2025
6058f11
Add plotter
DarkLight1337 Oct 18, 2025
bf116f4
Add capture output
DarkLight1337 Oct 18, 2025
0af3463
Fix
DarkLight1337 Oct 18, 2025
91f548f
Fix
DarkLight1337 Oct 18, 2025
50410f9
Add doc
DarkLight1337 Oct 18, 2025
5022d1b
Example command
DarkLight1337 Oct 18, 2025
50dd600
Reword
DarkLight1337 Oct 18, 2025
383f3a0
Fix pre-commit
DarkLight1337 Oct 18, 2025
79e7a85
Reword
DarkLight1337 Oct 18, 2025
f65c15c
add links
DarkLight1337 Oct 19, 2025
6ce907f
Fix `after_bench` not used
DarkLight1337 Oct 19, 2025
27c27da
Fix
DarkLight1337 Oct 19, 2025
c9adffa
Mention `--dry-run`
DarkLight1337 Oct 19, 2025
600ce3c
Add tip
DarkLight1337 Oct 19, 2025
bd80c80
Refactor
DarkLight1337 Oct 19, 2025
66c4f68
Typo
DarkLight1337 Oct 19, 2025
2cb649c
Reword
DarkLight1337 Oct 19, 2025
abead3c
Increase inf
DarkLight1337 Oct 19, 2025
2910133
Fix
DarkLight1337 Oct 19, 2025
e3e32b2
Update serve_multi.py
DarkLight1337 Oct 19, 2025
148 changes: 145 additions & 3 deletions docs/contributing/benchmarks.md
@@ -6,7 +6,8 @@ toc_depth: 4

vLLM provides comprehensive benchmarking tools for performance testing and evaluation:

- **[Benchmark CLI](#benchmark-cli)**: `vllm bench` CLI tools and specialized benchmark scripts for interactive performance testing
- **[Batch Scripts](#batch-scripts)**: Run `vllm bench` against multiple configurations conveniently
- **[Performance benchmarks](#performance-benchmarks)**: Automated CI benchmarks for development
- **[Nightly benchmarks](#nightly-benchmarks)**: Comparative benchmarks against alternatives

@@ -29,7 +30,7 @@ th {
| Dataset | Online | Offline | Data Path |
|---------|--------|---------|-----------|
| ShareGPT | ✅ | ✅ | `wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json` |
| ShareGPT4V (Image) | ✅ | ✅ | `wget https://huggingface.co/datasets/Lin-Chen/ShareGPT4V/resolve/main/sharegpt4v_instruct_gpt4-vision_cap100k.json`<br>Note that the images need to be downloaded separately. For example, to download COCO's 2017 Train images:<br>`wget http://images.cocodataset.org/zips/train2017.zip` |
| ShareGPT4Video (Video) | ✅ | ✅ | `git clone https://huggingface.co/datasets/ShareGPT4Video/ShareGPT4Video` |
| BurstGPT | ✅ | ✅ | `wget https://github.com/HPMLL/BurstGPT/releases/download/v1.1/BurstGPT_without_fails_2.csv` |
| Sonnet (deprecated) | ✅ | ✅ | Local file: `benchmarks/sonnet.txt` |
@@ -714,7 +715,7 @@ Generate synthetic image inputs alongside random text prompts to stress-test vis

Notes:

- Works only with online benchmark via the OpenAI backend (`--backend openai-chat`) and endpoint `/v1/chat/completions`.
- Video sampling is not yet implemented.

Start the server (example):
@@ -924,6 +925,147 @@ throughput numbers correctly is also adjusted.

</details>

## Batch Scripts

### Batch Serving Script

[`vllm/benchmarks/serve_multi.py`](../../vllm/benchmarks/serve_multi.py) automatically starts `vllm serve` and runs `vllm bench serve` over multiple configurations.

#### Batch Mode

The basic purpose of this script is to evaluate vLLM under different settings. Follow these steps to run the script:

1. Construct the base command for `vllm serve`, and pass it to the `--serve-cmd` option.
2. Construct the base command for `vllm bench serve`, and pass it to the `--bench-cmd` option.
3. (Optional) If you would like to vary the settings of `vllm serve`, create a new JSON file and populate it with the parameter combinations you want to test. Pass the file path to `--serve-params`.

- Example: Tuning `--max-num-seqs` and `--max-num-batched-tokens`:

```json
[
{
"max_num_seqs": 32,
"max_num_batched_tokens": 1024
},
{
"max_num_seqs": 64,
"max_num_batched_tokens": 1024
},
{
"max_num_seqs": 64,
"max_num_batched_tokens": 2048
},
{
"max_num_seqs": 128,
"max_num_batched_tokens": 2048
},
{
"max_num_seqs": 128,
"max_num_batched_tokens": 4096
},
{
"max_num_seqs": 256,
"max_num_batched_tokens": 4096
}
]
```

4. (Optional) If you would like to vary the settings of `vllm bench serve`, create a new JSON file and populate it with the parameter combinations you want to test. Pass the file path to `--bench-params`.

- Example: Using different input/output lengths for random dataset:

```json
[
{
"random_input_len": 128,
"random_output_len": 32
},
{
"random_input_len": 256,
"random_output_len": 64
},
{
"random_input_len": 512,
"random_output_len": 128
}
]
```

5. Determine where you want to save the results, and pass that to `--output-dir`.
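
Rather than writing these parameter files by hand, a full grid can also be generated programmatically. Below is a minimal sketch (the value lists are illustrative, and the output file name is an assumption, not something the script requires):

```python
import itertools
import json

# Per-parameter value lists to expand into the list-of-dicts format
# accepted by --serve-params / --bench-params (values are illustrative).
grid = {
    "max_num_seqs": [32, 64, 128, 256],
    "max_num_batched_tokens": [1024, 2048, 4096],
}

# Full Cartesian product; prune combinations you don't want to test.
combos = [dict(zip(grid, values)) for values in itertools.product(*grid.values())]

# Save this output to e.g. benchmarks/serve_hparams.json (assumed name).
print(json.dumps(combos, indent=4))
```

Note that this expands the full product (12 combinations here), whereas the hand-written example above lists a curated subset.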

Example command:

```bash
python vllm/benchmarks/serve_multi.py \
--serve-cmd 'vllm serve meta-llama/Llama-2-7b-chat-hf' \
--bench-cmd 'vllm bench serve --model meta-llama/Llama-2-7b-chat-hf --backend vllm --endpoint /v1/completions --dataset-name sharegpt --dataset-path benchmarks/ShareGPT_V3_unfiltered_cleaned_split.json' \
--serve-params benchmarks/serve_hparams.json \
--bench-params benchmarks/bench_hparams.json \
-o benchmarks/results
```

!!! important
If both `--serve-params` and `--bench-params` are passed, the script will iterate over the Cartesian product between them.
You can use `--dry-run` to preview the commands to be run.

The server is started only once per `--serve-params` combination and kept running across all `--bench-params` combinations.
Between benchmark runs, the script calls the `/reset_prefix_cache` and `/reset_mm_cache` endpoints so that each run starts from a clean slate.
If you are using a custom `--serve-cmd`, you can override the commands used for resetting the state by setting `--after-bench-cmd`.
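
Putting batch mode together, the sweep behaves roughly like the following loop. This is an illustrative sketch with hypothetical stand-in helpers, not the script's actual implementation:

```python
calls = []

# Hypothetical stand-ins for the real server/benchmark subprocess handling.
def start_server(overrides): calls.append(("start", overrides))
def stop_server(): calls.append(("stop",))
def run_bench(overrides, run_idx): calls.append(("bench", overrides, run_idx))
def reset_caches():  # POST /reset_prefix_cache and /reset_mm_cache
    calls.append(("reset",))

def sweep(serve_params, bench_params, num_runs=3):
    # One server per --serve-params entry, reused across all --bench-params.
    for serve_overrides in serve_params:
        start_server(serve_overrides)
        try:
            for bench_overrides in bench_params:
                for run_idx in range(num_runs):
                    run_bench(bench_overrides, run_idx)
                    reset_caches()  # clean slate for the next run
        finally:
            stop_server()

sweep([{"max_num_seqs": 32}], [{"random_input_len": 128}], num_runs=2)
print([c[0] for c in calls])  # ['start', 'bench', 'reset', 'bench', 'reset', 'stop']
```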

!!! note
By default, each parameter combination is run 3 times to make the results more reliable. You can adjust the number of runs by setting `--num-runs`.

!!! tip
You can use the `--resume` option to continue the parameter sweep if one of the runs failed.

#### SLA Mode

By passing SLA constraints via `--sla-params`, you can run this script in SLA mode, where it adjusts either the request rate or the maximum concurrency (selected via `--sla-variable`) until the SLA constraints are satisfied.

For example, to ensure E2E latency within different target values for 99% of requests:

```json
[
{
"p99_e2el_ms": "<=200"
},
{
"p99_e2el_ms": "<=500"
},
{
"p99_e2el_ms": "<=1000"
},
{
"p99_e2el_ms": "<=2000"
}
]
```
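
Each constraint is written as a comparison operator followed by a numeric threshold. Assuming that syntax, checking a measured metric against a constraint string might look like the following sketch (the script's actual parsing may differ):

```python
import operator

# Assumed operator set; the actual script may support a different one.
# "<=" and ">=" must be checked before "<" and ">" (dicts preserve order).
_OPS = {"<=": operator.le, ">=": operator.ge, "<": operator.lt, ">": operator.gt}

def satisfies(measured: float, constraint: str) -> bool:
    """Return True if `measured` meets a constraint like '<=200'."""
    for symbol, op in _OPS.items():
        if constraint.startswith(symbol):
            return op(measured, float(constraint[len(symbol):]))
    raise ValueError(f"Unrecognized constraint: {constraint!r}")

print(satisfies(180.0, "<=200"))  # True: p99 E2E latency of 180 ms meets <=200 ms
print(satisfies(512.0, "<=500"))  # False
```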

Example command:

```bash
python vllm/benchmarks/serve_multi.py \
--serve-cmd 'vllm serve meta-llama/Llama-2-7b-chat-hf' \
--bench-cmd 'vllm bench serve --model meta-llama/Llama-2-7b-chat-hf --backend vllm --endpoint /v1/completions --dataset-name sharegpt --dataset-path benchmarks/ShareGPT_V3_unfiltered_cleaned_split.json' \
--serve-params benchmarks/serve_hparams.json \
--bench-params benchmarks/bench_hparams.json \
--sla-params benchmarks/sla_hparams.json \
--sla-variable max_concurrency \
-o benchmarks/results
```

The algorithm for adjusting the SLA variable is as follows:

1. Run the benchmark with infinite QPS, and use the corresponding metrics to determine the initial value of the variable.
- For example, the initial request rate is set to the concurrency under infinite QPS.
2. If the SLA is still satisfied, keep doubling the value until the SLA is no longer satisfied. This gives a relatively narrow window that contains the point where the SLA is barely satisfied.
3. Apply binary search over the window to find the maximum value that still satisfies the SLA.
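
Assuming SLA attainment is monotonic in the tuned variable, steps 2–3 above amount to exponential growth followed by binary search. A simplified sketch, where `meets_sla` stands in for running an actual benchmark and edge-case handling in the real script differs:

```python
def find_max_value(initial: int, meets_sla) -> int:
    """Largest value that still satisfies the SLA (simplified sketch)."""
    if not meets_sla(initial):
        return 0  # simplified; the real script handles this case differently
    # Step 2: keep doubling until the SLA breaks, giving a window (lo, hi).
    lo, hi = initial, initial * 2
    while meets_sla(hi):
        lo, hi = hi, hi * 2
    # Step 3: binary search inside the window for the boundary value.
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if meets_sla(mid):
            lo = mid
        else:
            hi = mid
    return lo

# Toy example: pretend the SLA holds up to a concurrency of 48.
print(find_max_value(8, lambda v: v <= 48))  # 48
```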

!!! important
SLA tuning is applied over each combination of `--serve-params`, `--bench-params`, and `--sla-params`.

For a given combination of `--serve-params` and `--bench-params`, we share the benchmark results across `--sla-params` to avoid rerunning benchmarks with the same SLA variable value.

## Performance Benchmarks

The performance benchmarks are used for development to confirm whether new changes improve performance under various workloads. They are triggered on every commit with both the `perf-benchmarks` and `ready` labels, and when a PR is merged into vLLM.