@@ -25,7 +25,6 @@ docker run --rm \
-v /root/.cache:/root/.cache \
-p 8000:8000 \
-e VLLM_USE_MODELSCOPE=True \
-e PYTORCH_NPU_ALLOC_CONF=max_split_size_mb:256 \
-it $IMAGE \
/bin/bash
```
@@ -38,158 +37,203 @@ pip config set global.index-url https://mirrors.tuna.tsinghua.edu.cn/pypi/web/si
pip install -r benchmarks/requirements-bench.txt
```

## 3. (Optional) Prepare model weights
For faster running speed, we recommend downloading the model in advance:
## 3. Run basic benchmarks
This section describes how to run performance tests with the benchmark suite built into vLLM.

### 3.1 Dataset
vLLM supports a variety of [datasets](https://github.com/vllm-project/vllm/blob/main/vllm/benchmarks/datasets.py).

<style>
th {
min-width: 0 !important;
}
</style>

| Dataset | Online | Offline | Data Path |
|---------|--------|---------|-----------|
| ShareGPT | ✅ | ✅ | `wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json` |
| ShareGPT4V (Image) | ✅ | ✅ | `wget https://huggingface.co/datasets/Lin-Chen/ShareGPT4V/resolve/main/sharegpt4v_instruct_gpt4-vision_cap100k.json`<br>Note that the images need to be downloaded separately. For example, to download COCO's 2017 Train images:<br>`wget http://images.cocodataset.org/zips/train2017.zip` |
| ShareGPT4Video (Video) | ✅ | ✅ | `git clone https://huggingface.co/datasets/ShareGPT4Video/ShareGPT4Video` |
| BurstGPT | ✅ | ✅ | `wget https://github.com/HPMLL/BurstGPT/releases/download/v1.1/BurstGPT_without_fails_2.csv` |
| Sonnet (deprecated) | ✅ | ✅ | Local file: `benchmarks/sonnet.txt` |
| Random | ✅ | ✅ | `synthetic` |
| RandomMultiModal (Image/Video) | 🟡 | 🚧 | `synthetic` |
| RandomForReranking | ✅ | ✅ | `synthetic` |
| Prefix Repetition | ✅ | ✅ | `synthetic` |
| HuggingFace-VisionArena | ✅ | ✅ | `lmarena-ai/VisionArena-Chat` |
| HuggingFace-MMVU | ✅ | ✅ | `yale-nlp/MMVU` |
| HuggingFace-InstructCoder | ✅ | ✅ | `likaixin/InstructCoder` |
| HuggingFace-AIMO | ✅ | ✅ | `AI-MO/aimo-validation-aime`, `AI-MO/NuminaMath-1.5`, `AI-MO/NuminaMath-CoT` |
| HuggingFace-Other | ✅ | ✅ | `lmms-lab/LLaVA-OneVision-Data`, `Aeala/ShareGPT_Vicuna_unfiltered` |
| HuggingFace-MTBench | ✅ | ✅ | `philschmid/mt-bench` |
| HuggingFace-Blazedit | ✅ | ✅ | `vdaita/edit_5k_char`, `vdaita/edit_10k_char` |
| Spec Bench | ✅ | ✅ | `wget https://raw.githubusercontent.com/hemingkx/Spec-Bench/refs/heads/main/data/spec_bench/question.jsonl` |
| Custom | ✅ | ✅ | Local file: `data.jsonl` |

:::{note}
The Hugging Face datasets listed above are referenced by their Hugging Face IDs.
For these datasets, set `dataset-name` to `hf`.
When `dataset-path` points to a local copy, set `hf-name` to the dataset's Hugging Face ID, for example:

```bash
modelscope download --model LLM-Research/Meta-Llama-3.1-8B-Instruct
--dataset-path /datasets/VisionArena-Chat/ --hf-name lmarena-ai/VisionArena-Chat
```

You can also replace all model paths in the [json](https://github.com/vllm-project/vllm-ascend/tree/main/benchmarks/tests) files with your local paths:
:::

### 3.2 Run basic benchmark

#### 3.2.1 Online serving

First, start serving your model:

```bash
[
{
"test_name": "latency_llama8B_tp1",
"parameters": {
"model": "your local model path",
"tensor_parallel_size": 1,
"load_format": "dummy",
"num_iters_warmup": 5,
"num_iters": 15
}
}
]
VLLM_USE_MODELSCOPE=True vllm serve Qwen/Qwen3-8B
```
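
Loading the model can take some time, so the benchmark client may fail if it is started before the server accepts requests. The sketch below shows one way to wait for readiness. `wait_until` is a hypothetical helper, not part of vLLM, and the commented usage line assumes the server listens on port 8000 (as mapped in the `docker run` command) and answers on the `/health` route.

```shell
# wait_until <max_attempts> <delay_seconds> <command> [args...]
# Hypothetical helper: re-run a command until it succeeds or the
# attempt budget is used up. Returns 0 on success, 1 otherwise.
wait_until() {
  max_attempts=$1
  delay=$2
  shift 2
  attempt=1
  while [ "$attempt" -le "$max_attempts" ]; do
    # Run the probe command quietly; stop as soon as it succeeds.
    if "$@" >/dev/null 2>&1; then
      return 0
    fi
    attempt=$((attempt + 1))
    sleep "$delay"
  done
  return 1
}

# Assumed usage (port and route are assumptions, adjust to your setup):
# wait_until 60 5 curl -sf http://localhost:8000/health && echo "server is ready"
```

Passing the probe as a command keeps the helper independent of the endpoint, so the same loop can be reused with a different port or route.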

## 4. Run benchmark script
Run benchmark script:
Then run the benchmarking script:

```bash
bash benchmarks/scripts/run-performance-benchmarks.sh
# download dataset
# wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json
export VLLM_USE_MODELSCOPE=True
vllm bench serve \
--backend vllm \
--model Qwen/Qwen3-8B \
--endpoint /v1/completions \
--dataset-name sharegpt \
--dataset-path <your data path>/ShareGPT_V3_unfiltered_cleaned_split.json \
--num-prompts 10
```

After about 10 mins, the output is shown below:
If successful, you will see the following output:

```bash
online serving:
qps 1:
```shell
============ Serving Benchmark Result ============
Successful requests: 200
Benchmark duration (s): 212.77
Total input tokens: 42659
Total generated tokens: 43545
Request throughput (req/s): 0.94
Output token throughput (tok/s): 204.66
Total Token throughput (tok/s): 405.16
Successful requests: 10
Failed requests: 0
Benchmark duration (s): 19.92
Total input tokens: 1374
Total generated tokens: 2663
Request throughput (req/s): 0.50
Output token throughput (tok/s): 133.67
Peak output token throughput (tok/s): 312.00
Peak concurrent requests: 10.00
Total Token throughput (tok/s): 202.64
---------------Time to First Token----------------
Mean TTFT (ms): 104.14
Median TTFT (ms): 102.22
P99 TTFT (ms): 153.82
Mean TTFT (ms): 127.10
Median TTFT (ms): 136.29
P99 TTFT (ms): 137.83
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 38.78
Median TPOT (ms): 38.70
P99 TPOT (ms): 48.03
Mean TPOT (ms): 25.85
Median TPOT (ms): 25.78
P99 TPOT (ms): 26.64
---------------Inter-token Latency----------------
Mean ITL (ms): 38.46
Median ITL (ms): 36.96
P99 ITL (ms): 75.03
Mean ITL (ms): 25.78
Median ITL (ms): 25.74
P99 ITL (ms): 28.85
==================================================
```
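
The throughput rows in this report can be cross-checked against the request count, token counts, and duration printed above them. The following is a minimal sketch of that arithmetic, using the displayed values from the 10-prompt run; `rate` is a hypothetical helper introduced only for illustration.

```shell
# rate <numerator> <denominator>: print the quotient with two decimals.
rate() {
  awk -v n="$1" -v d="$2" 'BEGIN { printf "%.2f\n", n / d }'
}

# Values copied from the 10-prompt report above.
requests=10
duration=19.92
input_tokens=1374
output_tokens=2663

echo "Request throughput (req/s):      $(rate "$requests" "$duration")"
echo "Output token throughput (tok/s): $(rate "$output_tokens" "$duration")"
echo "Total token throughput (tok/s):  $(rate "$((input_tokens + output_tokens))" "$duration")"
```

Recomputed this way, the values are 0.50, 133.68, and 202.66, close to the reported 0.50, 133.67, and 202.64. The small gap is presumably because the report computes from the unrounded duration.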

qps 4:
============ Serving Benchmark Result ============
Successful requests: 200
Benchmark duration (s): 72.55
Total input tokens: 42659
Total generated tokens: 43545
Request throughput (req/s): 2.76
Output token throughput (tok/s): 600.24
Total Token throughput (tok/s): 1188.27
---------------Time to First Token----------------
Mean TTFT (ms): 115.62
Median TTFT (ms): 109.39
P99 TTFT (ms): 169.03
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 51.48
Median TPOT (ms): 52.40
P99 TPOT (ms): 69.41
---------------Inter-token Latency----------------
Mean ITL (ms): 50.47
Median ITL (ms): 43.95
P99 ITL (ms): 130.29
==================================================
#### 3.2.2 Offline Throughput Benchmark

qps 16:
============ Serving Benchmark Result ============
Successful requests: 200
Benchmark duration (s): 47.82
Total input tokens: 42659
Total generated tokens: 43545
Request throughput (req/s): 4.18
Output token throughput (tok/s): 910.62
Total Token throughput (tok/s): 1802.70
---------------Time to First Token----------------
Mean TTFT (ms): 128.50
Median TTFT (ms): 128.36
P99 TTFT (ms): 187.87
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 83.60
Median TPOT (ms): 77.85
P99 TPOT (ms): 165.90
---------------Inter-token Latency----------------
Mean ITL (ms): 65.72
Median ITL (ms): 54.84
P99 ITL (ms): 289.63
==================================================
```bash
export VLLM_USE_MODELSCOPE=True
vllm bench throughput \
--model Qwen/Qwen3-8B \
--dataset-name random \
--input-len 128 \
--output-len 128
```

If successful, you will see the following output:

```shell
Processed prompts: 100%|█| 10/10 [00:03<00:00, 2.74it/s, est. speed input: 351.02 toks/s, output: 351.02 t
Throughput: 2.73 requests/s, 699.93 total tokens/s, 349.97 output tokens/s
Total num prompt tokens: 1280
Total num output tokens: 1280
```
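
When comparing several runs, it can help to pull the numbers out of the `Throughput:` summary line automatically. The sketch below assumes the line keeps the format shown above; `parse_throughput` is a hypothetical helper, not a vLLM command.

```shell
# parse_throughput: read benchmark output on stdin and print
# "<requests/s> <total tokens/s> <output tokens/s>" from the summary line.
parse_throughput() {
  awk '/^Throughput:/ { print $2, $4, $7 }'
}

# Example with the summary line shown above; in practice, pipe the real
# benchmark output into the function instead of using echo.
echo "Throughput: 2.73 requests/s, 699.93 total tokens/s, 349.97 output tokens/s" | parse_throughput
```

Lines that do not start with `Throughput:` are ignored, so the whole benchmark log can be piped in unchanged.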

qps inf:
#### 3.2.4 Multi-Modal Benchmark

```shell
export VLLM_USE_MODELSCOPE=True
vllm serve Qwen/Qwen2.5-VL-7B-Instruct \
--dtype bfloat16 \
--limit-mm-per-prompt '{"image": 1}' \
--allowed-local-media-path /path/to/sharegpt4v/images
```

```shell
export HF_ENDPOINT="https://hf-mirror.com"
vllm bench serve --model Qwen/Qwen2.5-VL-7B-Instruct \
--backend "openai-chat" \
--dataset-name hf \
--hf-split train \
--endpoint "/v1/chat/completions" \
--dataset-path "lmarena-ai/vision-arena-bench-v0.1" \
--num-prompts 10 \
--no-stream
```

```shell
============ Serving Benchmark Result ============
Successful requests: 200
Benchmark duration (s): 41.26
Total input tokens: 42659
Total generated tokens: 43545
Request throughput (req/s): 4.85
Output token throughput (tok/s): 1055.44
Total Token throughput (tok/s): 2089.40
Successful requests: 10
Failed requests: 0
Benchmark duration (s): 4.89
Total input tokens: 7191
Total generated tokens: 951
Request throughput (req/s): 2.05
Output token throughput (tok/s): 194.63
Peak output token throughput (tok/s): 290.00
Peak concurrent requests: 10.00
Total Token throughput (tok/s): 1666.35
---------------Time to First Token----------------
Mean TTFT (ms): 3394.37
Median TTFT (ms): 3359.93
P99 TTFT (ms): 3540.93
Mean TTFT (ms): 722.22
Median TTFT (ms): 589.81
P99 TTFT (ms): 1377.02
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 66.28
Median TPOT (ms): 64.19
P99 TPOT (ms): 97.66
Mean TPOT (ms): 44.13
Median TPOT (ms): 34.58
P99 TPOT (ms): 124.72
---------------Inter-token Latency----------------
Mean ITL (ms): 56.62
Median ITL (ms): 55.69
P99 ITL (ms): 82.90
Mean ITL (ms): 33.14
Median ITL (ms): 28.01
P99 ITL (ms): 182.28
==================================================
```

offline:
latency:
Avg latency: 4.944929537673791 seconds
10% percentile latency: 4.894104263186454 seconds
25% percentile latency: 4.909652255475521 seconds
50% percentile latency: 4.932477846741676 seconds
75% percentile latency: 4.9608619548380375 seconds
90% percentile latency: 5.035418218374252 seconds
99% percentile latency: 5.052476694583893 seconds

throughput:
Throughput: 4.64 requests/s, 2000.51 total tokens/s, 1010.54 output tokens/s
Total num prompt tokens: 42659
Total num output tokens: 43545
#### 3.2.5 Embedding Benchmark

```shell
vllm serve Qwen/Qwen3-Embedding-8B --trust-remote-code
```

The result json files are generated into the path `benchmark/results`.
These files contain detailed benchmarking results for further analysis.
```shell
# download dataset
# wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json
export VLLM_USE_MODELSCOPE=True
vllm bench serve \
--model Qwen/Qwen3-Embedding-8B \
--backend openai-embeddings \
--endpoint /v1/embeddings \
--dataset-name sharegpt \
--num-prompts 10 \
--dataset-path <your dataset path>/datasets/ShareGPT_V3_unfiltered_cleaned_split.json
```

```bash
.
|-- latency_llama8B_tp1.json
|-- serving_llama8B_tp1_qps_1.json
|-- serving_llama8B_tp1_qps_16.json
|-- serving_llama8B_tp1_qps_4.json
|-- serving_llama8B_tp1_qps_inf.json
`-- throughput_llama8B_tp1.json
```shell
============ Serving Benchmark Result ============
Successful requests: 10
Failed requests: 0
Benchmark duration (s): 0.18
Total input tokens: 1372
Request throughput (req/s): 56.32
Total Token throughput (tok/s): 7726.76
----------------End-to-end Latency----------------
Mean E2EL (ms): 154.06
Median E2EL (ms): 165.57
P99 E2EL (ms): 166.66
==================================================
```