diff --git a/docs/source/developer_guide/performance_and_debug/performance_benchmark.md b/docs/source/developer_guide/performance_and_debug/performance_benchmark.md
index d54f619d5f8..9a421393568 100644
--- a/docs/source/developer_guide/performance_and_debug/performance_benchmark.md
+++ b/docs/source/developer_guide/performance_and_debug/performance_benchmark.md
@@ -25,7 +25,6 @@ docker run --rm \
 -v /root/.cache:/root/.cache \
 -p 8000:8000 \
 -e VLLM_USE_MODELSCOPE=True \
--e PYTORCH_NPU_ALLOC_CONF=max_split_size_mb:256 \
 -it $IMAGE \
 /bin/bash
 ```
@@ -38,158 +37,203 @@ pip config set global.index-url https://mirrors.tuna.tsinghua.edu.cn/pypi/web/si
 pip install -r benchmarks/requirements-bench.txt
 ```

-## 3. (Optional) Prepare model weights
-For faster running speed, we recommend downloading the model in advance:
+## 3. Run basic benchmarks
+This section describes how to run performance tests with the benchmark suite built into vLLM.
+
+### 3.1 Dataset
+vLLM supports a variety of [datasets](https://github.com/vllm-project/vllm/blob/main/vllm/benchmarks/datasets.py).
+
+
+
+| Dataset | Online | Offline | Data Path |
+|---------|--------|---------|-----------|
+| ShareGPT | ✅ | ✅ | `wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json` |
+| ShareGPT4V (Image) | ✅ | ✅ | `wget https://huggingface.co/datasets/Lin-Chen/ShareGPT4V/resolve/main/sharegpt4v_instruct_gpt4-vision_cap100k.json`<br>Note that the images need to be downloaded separately. For example, to download COCO's 2017 Train images:<br>`wget http://images.cocodataset.org/zips/train2017.zip` |
+| ShareGPT4Video (Video) | ✅ | ✅ | `git clone https://huggingface.co/datasets/ShareGPT4Video/ShareGPT4Video` |
+| BurstGPT | ✅ | ✅ | `wget https://github.com/HPMLL/BurstGPT/releases/download/v1.1/BurstGPT_without_fails_2.csv` |
+| Sonnet (deprecated) | ✅ | ✅ | Local file: `benchmarks/sonnet.txt` |
+| Random | ✅ | ✅ | `synthetic` |
+| RandomMultiModal (Image/Video) | 🟡 | 🚧 | `synthetic` |
+| RandomForReranking | ✅ | ✅ | `synthetic` |
+| Prefix Repetition | ✅ | ✅ | `synthetic` |
+| HuggingFace-VisionArena | ✅ | ✅ | `lmarena-ai/VisionArena-Chat` |
+| HuggingFace-MMVU | ✅ | ✅ | `yale-nlp/MMVU` |
+| HuggingFace-InstructCoder | ✅ | ✅ | `likaixin/InstructCoder` |
+| HuggingFace-AIMO | ✅ | ✅ | `AI-MO/aimo-validation-aime`, `AI-MO/NuminaMath-1.5`, `AI-MO/NuminaMath-CoT` |
+| HuggingFace-Other | ✅ | ✅ | `lmms-lab/LLaVA-OneVision-Data`, `Aeala/ShareGPT_Vicuna_unfiltered` |
+| HuggingFace-MTBench | ✅ | ✅ | `philschmid/mt-bench` |
+| HuggingFace-Blazedit | ✅ | ✅ | `vdaita/edit_5k_char`, `vdaita/edit_10k_char` |
+| Spec Bench | ✅ | ✅ | `wget https://raw.githubusercontent.com/hemingkx/Spec-Bench/refs/heads/main/data/spec_bench/question.jsonl` |
+| Custom | ✅ | ✅ | Local file: `data.jsonl` |
+
+:::{note}
+The `HuggingFace-*` entries above give Hugging Face dataset IDs as their data path.
+For these datasets, `dataset-name` should be set to `hf`.
+When `dataset-path` points to a local copy, set `hf-name` to the dataset's Hugging Face ID, for example:

 ```bash
-modelscope download --model LLM-Research/Meta-Llama-3.1-8B-Instruct
+--dataset-path /datasets/VisionArena-Chat/ --hf-name lmarena-ai/VisionArena-Chat
 ```

-You can also replace all model paths in the [json](https://github.com/vllm-project/vllm-ascend/tree/main/benchmarks/tests) files with your local paths:
+:::
+
+### 3.2 Run basic benchmark
+
+#### 3.2.1 Online serving
+
+First start serving your model:

 ```bash
-[
-  {
-    "test_name": "latency_llama8B_tp1",
-    "parameters": {
-      "model": "your local model path",
-      "tensor_parallel_size": 1,
-      "load_format": "dummy",
-      "num_iters_warmup": 5,
-      "num_iters": 15
-    }
-  }
-]
+VLLM_USE_MODELSCOPE=True vllm serve Qwen/Qwen3-8B
 ```

-## 4. Run benchmark script
-Run benchmark script:
+Then run the benchmarking script:

 ```bash
-bash benchmarks/scripts/run-performance-benchmarks.sh
+# download dataset
+# wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json
+export VLLM_USE_MODELSCOPE=True
+vllm bench serve \
+  --backend vllm \
+  --model Qwen/Qwen3-8B \
+  --endpoint /v1/completions \
+  --dataset-name sharegpt \
+  --dataset-path /ShareGPT_V3_unfiltered_cleaned_split.json \
+  --num-prompts 10
 ```

-After about 10 mins, the output is shown below:
+If successful, you will see the following output:

-```bash
-online serving:
-qps 1:
+```shell
 ============ Serving Benchmark Result ============
-Successful requests: 200
-Benchmark duration (s): 212.77
-Total input tokens: 42659
-Total generated tokens: 43545
-Request throughput (req/s): 0.94
-Output token throughput (tok/s): 204.66
-Total Token throughput (tok/s): 405.16
+Successful requests: 10
+Failed requests: 0
+Benchmark duration (s): 19.92
+Total input tokens: 1374
+Total generated tokens: 2663
+Request throughput (req/s): 0.50
+Output token throughput (tok/s): 133.67
+Peak output token throughput (tok/s): 312.00
+Peak concurrent requests: 10.00
+Total Token throughput (tok/s): 202.64
 ---------------Time to First Token----------------
-Mean TTFT (ms): 104.14
-Median TTFT (ms): 102.22
-P99 TTFT (ms): 153.82
+Mean TTFT (ms): 127.10
+Median TTFT (ms): 136.29
+P99 TTFT (ms): 137.83
 -----Time per Output Token (excl. 1st token)------
-Mean TPOT (ms): 38.78
-Median TPOT (ms): 38.70
-P99 TPOT (ms): 48.03
+Mean TPOT (ms): 25.85
+Median TPOT (ms): 25.78
+P99 TPOT (ms): 26.64
 ---------------Inter-token Latency----------------
-Mean ITL (ms): 38.46
-Median ITL (ms): 36.96
-P99 ITL (ms): 75.03
+Mean ITL (ms): 25.78
+Median ITL (ms): 25.74
+P99 ITL (ms): 28.85
 ==================================================
+```

-qps 4:
-============ Serving Benchmark Result ============
-Successful requests: 200
-Benchmark duration (s): 72.55
-Total input tokens: 42659
-Total generated tokens: 43545
-Request throughput (req/s): 2.76
-Output token throughput (tok/s): 600.24
-Total Token throughput (tok/s): 1188.27
----------------Time to First Token----------------
-Mean TTFT (ms): 115.62
-Median TTFT (ms): 109.39
-P99 TTFT (ms): 169.03
------Time per Output Token (excl. 1st token)------
-Mean TPOT (ms): 51.48
-Median TPOT (ms): 52.40
-P99 TPOT (ms): 69.41
----------------Inter-token Latency----------------
-Mean ITL (ms): 50.47
-Median ITL (ms): 43.95
-P99 ITL (ms): 130.29
-==================================================
+#### 3.2.2 Offline Throughput Benchmark

-qps 16:
-============ Serving Benchmark Result ============
-Successful requests: 200
-Benchmark duration (s): 47.82
-Total input tokens: 42659
-Total generated tokens: 43545
-Request throughput (req/s): 4.18
-Output token throughput (tok/s): 910.62
-Total Token throughput (tok/s): 1802.70
----------------Time to First Token----------------
-Mean TTFT (ms): 128.50
-Median TTFT (ms): 128.36
-P99 TTFT (ms): 187.87
------Time per Output Token (excl. 1st token)------
-Mean TPOT (ms): 83.60
-Median TPOT (ms): 77.85
-P99 TPOT (ms): 165.90
----------------Inter-token Latency----------------
-Mean ITL (ms): 65.72
-Median ITL (ms): 54.84
-P99 ITL (ms): 289.63
-==================================================
+```bash
+export VLLM_USE_MODELSCOPE=True
+vllm bench throughput \
+  --model Qwen/Qwen3-8B \
+  --dataset-name random \
+  --input-len 128 \
+  --output-len 128
+```
+
+If successful, you will see the following output:
+
+```shell
+Processed prompts: 100%|█| 10/10 [00:03<00:00, 2.74it/s, est. speed input: 351.02 toks/s, output: 351.02 t
+Throughput: 2.73 requests/s, 699.93 total tokens/s, 349.97 output tokens/s
+Total num prompt tokens: 1280
+Total num output tokens: 1280
+```

-qps inf:
+#### 3.2.3 Multi-Modal Benchmark
+
+```shell
+export VLLM_USE_MODELSCOPE=True
+vllm serve Qwen/Qwen2.5-VL-7B-Instruct \
+  --dtype bfloat16 \
+  --limit-mm-per-prompt '{"image": 1}' \
+  --allowed-local-media-path /path/to/sharegpt4v/images
+```
+
+```shell
+export HF_ENDPOINT="https://hf-mirror.com"
+vllm bench serve --model Qwen/Qwen2.5-VL-7B-Instruct \
+--backend "openai-chat" \
+--dataset-name hf \
+--hf-split train \
+--endpoint "/v1/chat/completions" \
+--dataset-path "lmarena-ai/vision-arena-bench-v0.1" \
+--num-prompts 10 \
+--no-stream
+```
+
+```shell
 ============ Serving Benchmark Result ============
-Successful requests: 200
-Benchmark duration (s): 41.26
-Total input tokens: 42659
-Total generated tokens: 43545
-Request throughput (req/s): 4.85
-Output token throughput (tok/s): 1055.44
-Total Token throughput (tok/s): 2089.40
+Successful requests: 10
+Failed requests: 0
+Benchmark duration (s): 4.89
+Total input tokens: 7191
+Total generated tokens: 951
+Request throughput (req/s): 2.05
+Output token throughput (tok/s): 194.63
+Peak output token throughput (tok/s): 290.00
+Peak concurrent requests: 10.00
+Total Token throughput (tok/s): 1666.35
 ---------------Time to First Token----------------
-Mean TTFT (ms): 3394.37
-Median TTFT (ms): 3359.93
-P99 TTFT (ms): 3540.93
+Mean TTFT (ms): 722.22
+Median TTFT (ms): 589.81
+P99 TTFT (ms): 1377.02
 -----Time per Output Token (excl. 1st token)------
-Mean TPOT (ms): 66.28
-Median TPOT (ms): 64.19
-P99 TPOT (ms): 97.66
+Mean TPOT (ms): 44.13
+Median TPOT (ms): 34.58
+P99 TPOT (ms): 124.72
 ---------------Inter-token Latency----------------
-Mean ITL (ms): 56.62
-Median ITL (ms): 55.69
-P99 ITL (ms): 82.90
+Mean ITL (ms): 33.14
+Median ITL (ms): 28.01
+P99 ITL (ms): 182.28
 ==================================================
+```

-offline:
-latency:
-Avg latency: 4.944929537673791 seconds
-10% percentile latency: 4.894104263186454 seconds
-25% percentile latency: 4.909652255475521 seconds
-50% percentile latency: 4.932477846741676 seconds
-75% percentile latency: 4.9608619548380375 seconds
-90% percentile latency: 5.035418218374252 seconds
-99% percentile latency: 5.052476694583893 seconds
-
-throughput:
-Throughput: 4.64 requests/s, 2000.51 total tokens/s, 1010.54 output tokens/s
-Total num prompt tokens: 42659
-Total num output tokens: 43545
+#### 3.2.4 Embedding Benchmark
+
+```shell
+vllm serve Qwen/Qwen3-Embedding-8B --trust-remote-code
 ```

-The result json files are generated into the path `benchmark/results`.
-These files contain detailed benchmarking results for further analysis.
+```shell
+# download dataset
+# wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json
+export VLLM_USE_MODELSCOPE=true
+vllm bench serve \
+  --model Qwen/Qwen3-Embedding-8B \
+  --backend openai-embeddings \
+  --endpoint /v1/embeddings \
+  --dataset-name sharegpt \
+  --num-prompts 10 \
+  --dataset-path /datasets/ShareGPT_V3_unfiltered_cleaned_split.json
+```

-```bash
-.
-|-- latency_llama8B_tp1.json
-|-- serving_llama8B_tp1_qps_1.json
-|-- serving_llama8B_tp1_qps_16.json
-|-- serving_llama8B_tp1_qps_4.json
-|-- serving_llama8B_tp1_qps_inf.json
-`-- throughput_llama8B_tp1.json
+```shell
+============ Serving Benchmark Result ============
+Successful requests: 10
+Failed requests: 0
+Benchmark duration (s): 0.18
+Total input tokens: 1372
+Request throughput (req/s): 56.32
+Total Token throughput (tok/s): 7726.76
+----------------End-to-end Latency----------------
+Mean E2EL (ms): 154.06
+Median E2EL (ms): 165.57
+P99 E2EL (ms): 166.66
+==================================================
 ```
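
The serving result blocks in this change use a plain `label: value` layout, so they can be turned into numbers for comparing runs. The following is a minimal sketch in Python, assuming the report text has been captured into a string. `parse_benchmark_report` and `SAMPLE_REPORT` are hypothetical names introduced for illustration and are not part of vLLM; the sample values are copied (abridged) from the online serving output in this diff.

```python
import re

# Abridged copy of the online serving output shown in the diff.
SAMPLE_REPORT = """\
============ Serving Benchmark Result ============
Successful requests: 10
Failed requests: 0
Benchmark duration (s): 19.92
Total input tokens: 1374
Total generated tokens: 2663
Request throughput (req/s): 0.50
Output token throughput (tok/s): 133.67
---------------Time to First Token----------------
Mean TTFT (ms): 127.10
Median TTFT (ms): 136.29
P99 TTFT (ms): 137.83
==================================================
"""

# A metric line is "<label>: <number>". The label may itself contain
# parentheses and slashes, e.g. "Request throughput (req/s)".
_METRIC_LINE = re.compile(r"^(?P<label>.+?):\s+(?P<value>-?\d+(?:\.\d+)?)\s*$")


def parse_benchmark_report(text: str) -> dict[str, float]:
    """Hypothetical helper: map each metric label to its numeric value."""
    metrics: dict[str, float] = {}
    for line in text.splitlines():
        match = _METRIC_LINE.match(line.strip())
        if match:  # banner and separator lines do not match and are skipped
            metrics[match.group("label")] = float(match.group("value"))
    return metrics


if __name__ == "__main__":
    metrics = parse_benchmark_report(SAMPLE_REPORT)
    duration = metrics["Benchmark duration (s)"]
    # Recompute output throughput from the raw counts as a sanity check.
    recomputed = metrics["Total generated tokens"] / duration
    print(f"reported:   {metrics['Output token throughput (tok/s)']:.2f} tok/s")
    print(f"recomputed: {recomputed:.2f} tok/s")
```

The reported and recomputed throughput should agree closely; small differences are expected because the printed duration is rounded to two decimals. Lines whose value is not a single number, such as the `Processed prompts:` progress line from the offline benchmark, are ignored by this sketch.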