43 commits
a948c09
add benchmark
Jan 12, 2026
4406746
Merge branch 'vllm-project:main' into benchmark
yenuo26 Jan 14, 2026
b87e2bd
add online benchmark
Jan 14, 2026
9a9dae1
Merge branch 'main' into benchmark
yenuo26 Jan 20, 2026
7ce6ced
Merge branch 'main' into benchmark
yenuo26 Jan 21, 2026
a64df91
modify print and video generate
Jan 21, 2026
89f5626
Merge branch 'benchmark' of https://github.com/yenuo26/vllm-omni into…
Jan 21, 2026
6d4b25e
Fix the bug where random-output-len does not take effect.
Jan 21, 2026
372780a
Standardize the number of decimal places printed.
Jan 22, 2026
5f19b41
Merge branch 'main' into benchmark
yenuo26 Jan 22, 2026
03f30ac
add doc
Jan 22, 2026
6d03b5b
Merge branch 'benchmark' of https://github.com/yenuo26/vllm-omni into…
Jan 22, 2026
3aa7dae
Merge branch 'main' into benchmark
yenuo26 Jan 22, 2026
951ec2e
Merge branch 'vllm-project:main' into benchmark
yenuo26 Jan 23, 2026
0d8a0c8
add modalities instructions
Jan 23, 2026
9d65680
Merge branch 'benchmark' of https://github.com/yenuo26/vllm-omni into…
Jan 23, 2026
a9edb55
Merge branch 'main' into benchmark
yenuo26 Jan 23, 2026
5c7c493
Merge branch 'main' into benchmark
yenuo26 Jan 26, 2026
4bc3df0
Merge branch 'main' into benchmark
yenuo26 Jan 26, 2026
e7b53c2
fix expansion test for modify_stage_config
Jan 26, 2026
628d7ec
Merge branch 'benchmark' of https://github.com/yenuo26/vllm-omni into…
Jan 26, 2026
d433091
add rtf
Jan 27, 2026
f695a4a
Merge branch 'main' into benchmark
yenuo26 Jan 27, 2026
52ccd94
add rtf Description
Jan 27, 2026
6842457
Merge branch 'benchmark' of https://github.com/yenuo26/vllm-omni into…
Jan 27, 2026
d1dd09a
Merge branch 'main' into benchmark
yenuo26 Jan 27, 2026
4096d41
retry CI
Jan 27, 2026
7cd4ea7
Merge branch 'benchmark' of https://github.com/yenuo26/vllm-omni into…
Jan 27, 2026
9ccf8b0
Merge branch 'main' into benchmark
hsliuustc0106 Jan 27, 2026
1c96be4
Merge branch 'main' into benchmark
hsliuustc0106 Jan 27, 2026
f99ecdb
add gpu info
Jan 28, 2026
bc58a41
add gpu info
Jan 28, 2026
b28dd09
fix ai
Jan 28, 2026
a13400c
fix ai
Jan 28, 2026
1ece960
Merge branch 'main' into benchmark
hsliuustc0106 Jan 28, 2026
a0bc709
fix copilot
Jan 28, 2026
fd164e4
Merge branch 'benchmark' of https://github.com/yenuo26/vllm-omni into…
Jan 28, 2026
65ed4b7
Merge branch 'main' into benchmark
yenuo26 Jan 28, 2026
b407b0e
Merge branch 'main' into benchmark
hsliuustc0106 Jan 28, 2026
03a303d
Merge branch 'main' into benchmark
yenuo26 Jan 29, 2026
64f0ddd
Merge branch 'main' into benchmark
yenuo26 Jan 29, 2026
21281d5
fix audio stream
Jan 29, 2026
8e46ac5
Merge branch 'main' into benchmark
hsliuustc0106 Jan 30, 2026
18 changes: 18 additions & 0 deletions docs/cli/README.md
@@ -22,3 +22,21 @@ If you have custom stage configs file, launch the server with command below
```bash
vllm serve Qwen/Qwen2.5-Omni-7B --omni --stage-configs-path /path/to/stage_configs_file
```


## bench

Run benchmark tests for online serving throughput.
Example command:

```bash
vllm bench serve --omni \
--model Qwen/Qwen2.5-Omni-7B \
--host server-host \
--port server-port \
--random-input-len 32 \
--random-output-len 4 \
--num-prompts 5
```

See [vllm bench serve](./bench/serve.md) for the full reference of all available arguments.
359 changes: 359 additions & 0 deletions docs/cli/bench/serve.md
@@ -0,0 +1,359 @@
# vLLM-Omni Benchmark CLI Guide
The `vllm bench serve --omni` command launches the vLLM-Omni benchmark to evaluate the performance of multimodal models.

## Notes
Currently, only the `openai-chat-omni` backend is supported.

## Basic Parameter Description
You can use `vllm bench serve --omni --help=all` to get descriptions of all parameters. The commonly used parameters are described below:
- `--omni`
Enable Omni (multimodal) mode, supporting multimodal inputs and outputs such as images, videos, and audio.

- `--backend`
Specify the backend adapter as openai-chat-omni, using OpenAI Chat compatible API behavior as the protocol. Currently only openai-chat-omni is supported.

- `--model`
The model identifier; use one of the models supported by vLLM-Omni.

- `--endpoint`
The API endpoint exposed externally, to which clients send their requests.

- `--dataset-name`
The name of the dataset to use; `random-mm` generates random multimodal inputs (images, videos, audio).

- `--num-prompts`
The total number of requests to send, an integer.

- `--max-concurrency`
Maximum number of concurrent requests. This can be used to help simulate an environment where a higher-level component is enforcing a maximum number of concurrent requests. While the `--request-rate` argument controls the rate at which requests are initiated, this argument controls how many are actually allowed to execute at a time. When the two are used in combination, the actual request rate may be lower than specified with `--request-rate` if the server is not processing requests fast enough to keep up.

- `--request-rate`
Number of requests per second. If this is `inf`, all requests are sent at time 0. Otherwise, a Poisson process or gamma distribution is used to synthesize the request arrival times.

- `--ignore-eos`
Set the `ignore_eos` flag when sending the benchmark request.

- `--metric-percentiles`
Comma-separated list of percentiles for selected metrics. To report the 25th, 50th, and 75th percentiles, use "25,50,75". The default value is "99". Use `--percentile-metrics` to select metrics.

- `--percentile-metrics`
Comma-separated list of metrics for which to report percentiles. Allowed metric names are `ttft`, `tpot`, `itl`, `e2el`, `audio_ttfp`, and `audio_rtf`.

- `--save-result`
Save the benchmark results to a JSON file.

- `--save-detailed`
When saving the results, whether to include per-request information such as response, error, ttfs, tpots, etc.

- `--result-dir`
Directory in which to save the benchmark JSON results. If not specified, results are saved in the current directory.

- `--result-filename`
Filename for the benchmark JSON results. If not specified, results are saved in `{label}-{args.request_rate}qps-{base_model_id}-{current_dt}.json`.

- `--random-prefix-len`
Number of fixed prefix tokens before the random context in a request.
The total input length is the sum of random-prefix-len and a random
context length sampled from [input_len * (1 - range_ratio),
input_len * (1 + range_ratio)]. Only the random and random-mm modes
support this parameter.

- `--random-input-len`
Number of input tokens per request. Only the random and random-mm modes support this parameter.

- `--random-output-len`
Number of output tokens per request. Only the random and random-mm modes support this parameter.

- `--random-range-ratio`
Range ratio for sampling input/output length,
used only for random sampling. Must be in the range [0, 1) to define
a symmetric sampling range
[length * (1 - range_ratio), length * (1 + range_ratio)].
Only the random and random-mm modes support this parameter.

- `--random-mm-base-items-per-request`
Base number of multimodal items per request for random-mm.
Actual per-request count is sampled around this base using
--random-mm-num-mm-items-range-ratio.
Only the random-mm mode supports this parameter.

- `--random-mm-limit-mm-per-prompt`
Per-modality hard caps for items attached per request, e.g.
'{"image": 3, "video": 1, "audio": 1}'. The sampled per-request item
count is clamped to the sum of these limits. When a modality
reaches its cap, its buckets are excluded and probabilities are
renormalized.
Only the random-mm mode supports this parameter.

- `--random-mm-num-mm-items-range-ratio`
Range ratio r in [0, 1] for sampling items per request.
We sample uniformly from the closed integer range
[floor(n*(1-r)), ceil(n*(1+r))]
where n is the base items per request.
r=0 keeps it fixed; r=1 allows 0 items. The maximum is clamped
to the sum of per-modality limits from
--random-mm-limit-mm-per-prompt.
An error is raised if the computed min exceeds the max.
Only the random-mm mode supports this parameter.

- `--random-mm-bucket-config`
The bucket config is a dictionary mapping a multimodal item
sampling configuration to a probability.
Three modalities are currently allowed: audio, images, and videos.
A bucket key is a tuple of (height, width, num_frames).
The value is the probability of sampling that specific item.
Example:
`--random-mm-bucket-config "{(256, 256, 1): 0.5, (720, 1280, 16): 0.4, (0, 1, 5): 0.1}"`
First item: images with resolution 256x256 w.p. 0.5.
Second item: videos with resolution 720x1280 and 16 frames w.p. 0.4.
Third item: audio with 1s duration and 5 channels w.p. 0.1.
Note: If the probabilities do not sum to 1, they are normalized.
Only the random-mm mode supports this parameter.
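
As a rough illustration of the two sampling ranges described above, the Python sketch below computes the bounds for `--random-range-ratio` and `--random-mm-num-mm-items-range-ratio`. The function names are hypothetical, and the rounding of the length bounds is an assumption; this is not the benchmark's actual implementation.

```python
import math
import random


def length_range(length: int, range_ratio: float) -> tuple[int, int]:
    """Bounds [length * (1 - r), length * (1 + r)] for token lengths.

    Rounding down the lower bound and up the upper bound is an assumption.
    """
    if not 0.0 <= range_ratio < 1.0:
        raise ValueError("range_ratio must be in [0, 1)")
    low = math.floor(length * (1 - range_ratio))
    high = math.ceil(length * (1 + range_ratio))
    return low, high


def item_count_range(
    base_items: int, range_ratio: float, limits: dict[str, int]
) -> tuple[int, int]:
    """Bounds [floor(n * (1 - r)), ceil(n * (1 + r))] for items per request.

    The maximum is clamped to the sum of the per-modality limits, and an
    error is raised if the minimum then exceeds the maximum.
    """
    if not 0.0 <= range_ratio <= 1.0:
        raise ValueError("range_ratio must be in [0, 1]")
    low = math.floor(base_items * (1 - range_ratio))
    high = math.ceil(base_items * (1 + range_ratio))
    high = min(high, sum(limits.values()))
    if low > high:
        raise ValueError(f"computed min {low} exceeds max {high}")
    return low, high


def sample_uniform(bounds: tuple[int, int], rng: random.Random) -> int:
    """Draw one integer uniformly from the closed range given by bounds."""
    low, high = bounds
    return rng.randint(low, high)


if __name__ == "__main__":
    limits = {"image": 3, "video": 1, "audio": 1}
    print(length_range(32, 0.0))             # fixed length when r = 0
    print(item_count_range(2, 1.0, limits))  # r = 1 allows 0 items
    print(sample_uniform(item_count_range(2, 0.5, limits), random.Random(0)))
```

With `r = 0` both ranges collapse to a single value, which is why the examples in this guide that pass `--random-range-ratio 0.0` produce fixed-size requests.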

## Usage Examples

### Online Benchmark
<details class="admonition abstract" markdown="1">
<summary>Show more</summary>

First start serving your model:

```bash
vllm serve Qwen/Qwen2.5-Omni-7B --omni
```

Then run the benchmark with the ShareGPT dataset:

```bash
# download dataset
# wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json
vllm bench serve \
--omni \
--port 43845 \
--model /home/models/Qwen/Qwen3-Omni-30B-A3B-Instruct \
--endpoint /v1/chat/completions \
--backend openai-chat-omni \
--num-prompts 2 \
--dataset-name sharegpt \
--dataset-path ShareGPT_V3_unfiltered_cleaned_split.json \
--percentile-metrics ttft,tpot,itl,e2el
```
If successful, you will see the following output:
```text
============ Serving Benchmark Result ============
Successful requests: 2
Failed requests: 0
Benchmark duration (s): 81.63
Request throughput (req/s): 0.02
Peak concurrent requests: 2.00
----------------End-to-end Latency----------------
Mean E2EL (ms): 56966.13
Median E2EL (ms): 56966.13
P99 E2EL (ms): 81016.80
================== Text Result ===================
Total input tokens: 36
Total generated tokens: 5926
Output token throughput (tok/s): 72.60
Peak output token throughput (tok/s): 103.00
Peak concurrent requests: 2.00
Total Token throughput (tok/s): 73.04
---------------Time to First Token----------------
Mean TTFT (ms): 124.76
Median TTFT (ms): 124.76
P99 TTFT (ms): 156.10
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 481.30
Median TPOT (ms): 481.30
P99 TPOT (ms): 947.55
---------------Inter-token Latency----------------
Mean ITL (ms): 25.11
Median ITL (ms): 0.33
P99 ITL (ms): 25.17
================== Audio Result ==================
Total audio duration generated(s): 3.95
Total audio frames generated: 94890
Audio throughput(audio duration/s): 0.05
==================================================
```
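
The throughput figures in the output above appear to be totals divided by the benchmark duration. The short Python check below applies that reading to the sample numbers; the relationship is inferred from the printed figures rather than taken from the benchmark's source, and the function name is hypothetical.

```python
def throughput_summary(
    duration_s: float,
    num_requests: int,
    input_tokens: int,
    output_tokens: int,
    audio_duration_s: float,
) -> dict[str, float]:
    """Divide each total by the benchmark duration (assumed relationship)."""
    return {
        "request_throughput": num_requests / duration_s,
        "output_token_throughput": output_tokens / duration_s,
        "total_token_throughput": (input_tokens + output_tokens) / duration_s,
        "audio_throughput": audio_duration_s / duration_s,
    }


# Totals copied from the sample output above.
summary = throughput_summary(
    duration_s=81.63,
    num_requests=2,
    input_tokens=36,
    output_tokens=5926,
    audio_duration_s=3.95,
)

if __name__ == "__main__":
    for name, value in summary.items():
        print(f"{name}: {value:.2f}")
```

Rounded to two decimal places, the results match the request, output token, total token, and audio throughput lines in the sample output.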

Or run the benchmark with the `random` dataset:

```bash
vllm bench serve \
--omni \
--port 43845 \
--endpoint /v1/chat/completions \
--backend openai-chat-omni \
--model /home/models/Qwen/Qwen3-Omni-30B-A3B-Instruct \
--dataset-name random \
--num-prompts 2 \
--random-prefix-len 5 \
--random-input-len 10 \
--random-output-len 100 \
--percentile-metrics ttft,tpot,itl,e2el,audio_ttfp,audio_rtf \
--ignore-eos
```

If successful, you will see the following output:

```text
============ Serving Benchmark Result ============
Successful requests: 2
Failed requests: 0
Benchmark duration (s): 24.35
Request throughput (req/s): 0.08
Peak concurrent requests: 2.00
----------------End-to-end Latency----------------
Mean E2EL (ms): 22576.23
Median E2EL (ms): 22576.23
P99 E2EL (ms): 24205.72
================== Text Result ===================
Total input tokens: 30
Total generated tokens: 8973
Output token throughput (tok/s): 368.52
Peak output token throughput (tok/s): 81.00
Peak concurrent requests: 2.00
Total Token throughput (tok/s): 369.76
---------------Time to First Token----------------
Mean TTFT (ms): 125.16
Median TTFT (ms): 125.16
P99 TTFT (ms): 155.88
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 5.01
Median TPOT (ms): 5.01
P99 TPOT (ms): 5.42
---------------Inter-token Latency----------------
Mean ITL (ms): 34.15
Median ITL (ms): 0.01
P99 ITL (ms): 376.19
================== Audio Result ==================
Total audio duration generated(s): 3.95
Total audio frames generated: 94890
Audio throughput(audio duration/s): 0.16
---------------Time to First Packet---------------
Mean AUDIO_TTFP (ms): 11756.89
Median AUDIO_TTFP (ms): 11756.89
P99 AUDIO_TTFP (ms): 20854.25
-----------------Real Time Factor-----------------
Mean AUDIO_RTF: 3.75
Median AUDIO_RTF: 3.75
P99 AUDIO_RTF: 7.39
==================================================
```
Note:
RTF is calculated as (audio generation time - first packet latency) / audio duration.
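
The RTF formula can be written as a small Python function. This is a minimal sketch of the formula as stated here; the function name, argument names, and the input values in the example are hypothetical and not taken from a real run.

```python
def audio_rtf(
    generation_time_s: float,
    first_packet_latency_s: float,
    audio_duration_s: float,
) -> float:
    """(audio generation time - first packet latency) / audio duration."""
    if audio_duration_s <= 0:
        raise ValueError("audio_duration_s must be positive")
    return (generation_time_s - first_packet_latency_s) / audio_duration_s


if __name__ == "__main__":
    # Hypothetical request: 10 s to generate, first packet after 2 s,
    # 4 s of audio produced.
    print(audio_rtf(10.0, 2.0, 4.0))
```

An RTF above 1 means the audio took longer to generate than it takes to play back.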

</details>

### Multi-Modal Benchmark

<details class="admonition abstract" markdown="1">
<summary>Show more</summary>

Benchmark the performance of multi-modal requests in vLLM-Omni.

Generate synthetic image, video, and audio inputs alongside random text prompts to stress-test multimodal models without external datasets.

Notes:

- Works only with the online benchmark via the OpenAI backend (`--backend openai-chat-omni`) and the endpoint `/v1/chat/completions`.

Start the server (example):

```bash
vllm serve Qwen/Qwen2.5-Omni-7B --omni
```

It is recommended to use the `--ignore-eos` flag to simulate real responses. You can set the output size with `--random-output-len`.

Then run the benchmarking script:
```bash
vllm bench serve \
--omni \
--dataset-name random-mm \
--port 40849 \
--model /home/models/Qwen/Qwen3-Omni-30B-A3B-Instruct \
--endpoint /v1/chat/completions \
--backend openai-chat-omni \
--request-rate 1 \
--num-prompts 1 \
--random-input-len 10 \
--random-range-ratio 0.0 \
--random-mm-base-items-per-request 2 \
--random-mm-num-mm-items-range-ratio 0 \
--random-mm-limit-mm-per-prompt '{"image":1,"video":1, "audio": 1}' \
--random-mm-bucket-config '{"(32, 32, 1)": 0.5, "(0, 1, 1)": 0.1, "(32, 32, 2)":0.4}' \
--ignore-eos \
--percentile-metrics ttft,tpot,itl \
--random-output-len 2 \
--extra_body '{"modalities": ["text"]}'
```

If successful, you will see the following output:

```text
============ Serving Benchmark Result ============
Successful requests: 1
Failed requests: 0
Request rate configured (RPS): 1.00
Benchmark duration (s): 1.21
Request throughput (req/s): 0.83
Peak concurrent requests: 1.00
================== Text Result ===================
Total input tokens: 10
Total generated tokens: 3
Output token throughput (tok/s): 2.49
Peak output token throughput (tok/s): 3.00
Peak concurrent requests: 1.00
Total Token throughput (tok/s): 10.77
---------------Time to First Token----------------
Mean TTFT (ms): 179.74
Median TTFT (ms): 179.74
P99 TTFT (ms): 179.74
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 12.76
Median TPOT (ms): 12.76
P99 TPOT (ms): 12.76
---------------Inter-token Latency----------------
Mean ITL (ms): 12.76
Median ITL (ms): 12.76
P99 ITL (ms): 25.24
================== Audio Result ==================
Total audio duration generated(s): 0.00
Total audio frames generated: 0
Audio throughput(audio duration/s): 0.00
==================================================
```

Behavioral notes:

- If the requested base item count cannot be satisfied under the provided per-prompt limits, the tool raises an error rather than silently clamping.

How sampling works:

- Determine per-request item count k by sampling uniformly from the integer range defined by `--random-mm-base-items-per-request` and `--random-mm-num-mm-items-range-ratio`, then clamp k to at most the sum of per-modality limits.
- For each of the k items, sample a bucket (H, W, T) according to the normalized probabilities in `--random-mm-bucket-config`, while tracking how many items of each modality have been added.
- If a modality (e.g., image) reaches its limit from `--random-mm-limit-mm-per-prompt`, all buckets of that modality are excluded and the remaining bucket probabilities are renormalized before continuing.
This should be seen as an edge case, and it can be avoided by setting `--random-mm-limit-mm-per-prompt` to a large number. Note that this might result in errors due to the engine config `--limit-mm-per-prompt`.
- The resulting request contains synthetic image data in `multi_modal_data` (OpenAI Chat format). When `random-mm` is used with the OpenAI Chat backend, prompts remain text and MM content is attached via `multi_modal_data`.
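
The sampling steps above can be sketched in Python as follows. This is an illustrative approximation rather than the benchmark's own code: all function names are hypothetical, and the rule used to infer a modality from a bucket (height 0 means audio, one frame means image, more frames means video) is an assumption based on the example bucket configs in this guide.

```python
import ast
import json
import random

Bucket = tuple[int, int, int]  # (height, width, num_frames)


def parse_bucket_config(raw: str) -> dict[Bucket, float]:
    """Parse a JSON object whose keys are strings such as "(32, 32, 1)"."""
    parsed = json.loads(raw)
    return {tuple(ast.literal_eval(key)): float(p) for key, p in parsed.items()}


def modality_of(bucket: Bucket) -> str:
    """Assumed mapping from bucket shape to modality (not confirmed)."""
    height, _width, num_frames = bucket
    if height == 0:
        return "audio"
    return "image" if num_frames == 1 else "video"


def sample_items(
    k: int,
    buckets: dict[Bucket, float],
    limits: dict[str, int],
    rng: random.Random,
) -> list[Bucket]:
    """Sample up to k buckets while respecting per-modality limits."""
    counts = {modality: 0 for modality in limits}
    items: list[Bucket] = []
    for _ in range(k):
        # Exclude buckets whose modality has already reached its cap.
        allowed = {
            bucket: prob
            for bucket, prob in buckets.items()
            if prob > 0
            and counts.get(modality_of(bucket), 0) < limits.get(modality_of(bucket), 0)
        }
        if not allowed:
            break  # every modality is at its limit
        choices = list(allowed)
        # random.choices treats weights as relative, which has the same
        # effect as renormalizing the remaining probabilities.
        weights = [allowed[bucket] for bucket in choices]
        bucket = rng.choices(choices, weights=weights, k=1)[0]
        counts[modality_of(bucket)] = counts.get(modality_of(bucket), 0) + 1
        items.append(bucket)
    return items


if __name__ == "__main__":
    config = parse_bucket_config(
        '{"(32, 32, 1)": 0.5, "(0, 1, 1)": 0.1, "(32, 32, 2)": 0.4}'
    )
    limits = {"image": 1, "video": 1, "audio": 1}
    print(sample_items(2, config, limits, random.Random(0)))
```

With a limit of one item per modality, two sampled items always belong to two different modalities, because the first item's buckets are excluded before the second draw.
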
</details>