
Commit d056b9e

docs: draft tutorials for advanced features (#312)
1 parent a58de00 commit d056b9e

File tree

8 files changed: +452 -2 lines changed


README.md

Lines changed: 33 additions & 0 deletions
@@ -40,6 +40,39 @@ Features

</br>

## Tutorials & Advanced Features

### Getting Started
- **[Basic Tutorial](docs/tutorial.md)** - Learn the fundamentals with Dynamo and vLLM examples

### Advanced Benchmarking Features
| Feature | Description | Use Cases |
|---------|-------------|-----------|
| **[Request Cancellation](docs/tutorials/request-cancellation.md)** | Test timeout behavior and service resilience | SLA validation, cancellation modeling |
| **[Trace Benchmarking](docs/tutorials/trace-benchmarking.md)** | Deterministic workload replay with custom datasets | Regression testing, A/B testing |
| **[Fixed Schedule](docs/tutorials/fixed-schedule.md)** | Precise timestamp-based request execution | Traffic replay, temporal analysis, burst testing |
| **[Time-based Benchmarking](docs/tutorials/time-based-benchmarking.md)** | Duration-based testing with grace period control | Stability testing, sustained performance |

### Quick Navigation
```bash
# Basic profiling
aiperf profile --model Qwen/Qwen3-0.6B --url localhost:8000 --endpoint-type chat

# Request timeout testing
aiperf profile --request-timeout-seconds 30.0 [other options...]

# Trace-based benchmarking
aiperf profile --input-file trace.jsonl --custom-dataset-type single_turn [other options...]

# Fixed schedule execution
aiperf profile --input-file schedule.jsonl --fixed-schedule --fixed-schedule-auto-offset [other options...]

# Time-based benchmarking
aiperf profile --benchmark-duration 300.0 --benchmark-grace-period 30.0 [other options...]
```

</br>

## Supported APIs

- OpenAI chat completions
docs/cli_options.md

Lines changed: 6 additions & 0 deletions
@@ -63,12 +63,18 @@ Use these options to profile with AIPerf.
```
```
╭─ Load Generator ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ BENCHMARK-DURATION --benchmark-duration The duration in seconds for benchmarking. │
│ BENCHMARK-GRACE-PERIOD --benchmark-grace-period The grace period in seconds to wait for responses after benchmark duration ends. Only applies when │
│ --benchmark-duration is set. Responses received within this period are included in metrics. [default: 30.0] │
│ CONCURRENCY --concurrency The concurrency value to benchmark. │
│ REQUEST-RATE --request-rate Sets the request rate for the load generated by AIPerf. Unit: requests/second │
│ REQUEST-RATE-MODE --request-rate-mode Sets the request rate mode for the load generated by AIPerf. Valid values: constant, poisson. constant: Generate │
│ requests at a fixed rate. poisson: Generate requests using a poisson distribution. [default: poisson] │
│ REQUEST-COUNT --request-count --num-requests The number of requests to use for measurement. [default: 10] │
│ WARMUP-REQUEST-COUNT --warmup-request-count --num-warmup-requests The number of warmup requests to send before benchmarking. [default: 0] │
│ REQUEST-CANCELLATION-RATE --request-cancellation-rate The percentage of requests to cancel. [default: 0.0] │
│ REQUEST-CANCELLATION-DELAY --request-cancellation-delay The delay in seconds before cancelling requests. This is used when --request-cancellation-rate is greater than 0. │
│ [default: 0.0] │
╰───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
```
```

docs/tutorial.md

Lines changed: 1 addition & 1 deletion
@@ -129,4 +129,4 @@ aiperf profile \
    --request-count 64 \
    --url localhost:8000
```
<!-- /aiperf-run-vllm-default-openai-endpoint-server -->

docs/tutorials/fixed-schedule.md

Lines changed: 130 additions & 0 deletions
@@ -0,0 +1,130 @@
<!--
SPDX-FileCopyrightText: Copyright (c) 2024-2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
SPDX-License-Identifier: Apache-2.0
-->

# Fixed Schedule Benchmarking

Fixed schedule benchmarking provides precise timing control by executing requests at specific timestamps.
This mode is ideal for simulating exact traffic patterns, testing temporal performance characteristics,
and reproducing time-sensitive scenarios.

## Overview

Fixed schedule mode enables:

- **Precise Timing**: Execute requests at exact millisecond intervals
- **Traffic Simulation**: Replicate real-world traffic patterns
- **Performance Analysis**: Identify how response times vary with request timing
- **Load Testing**: Test system behavior under controlled temporal stress patterns

## Fixed Schedule File Format

Fixed schedule files use JSONL format with timestamp-based entries:

```jsonl
{"timestamp": 0, "input_length": 100, "output_length": 200, "hash_ids": [1001]}
{"timestamp": 500, "input_length": 200, "output_length": 400, "hash_ids": [1002]}
{"timestamp": 1000, "input_length": 550, "output_length": 500, "hash_ids": [1003, 1005]}
```

**Field Descriptions:**
- `timestamp`: Milliseconds from the schedule start at which the request should be sent
- `input_length`: Number of tokens in the input prompt
- `input_text`: Exact text to send in the request (may be provided instead of `input_length`)
- `output_length`: Maximum number of tokens in the response (optional)
- `hash_ids`: Hash block identifiers to simulate text reuse with 512-token blocks (optional)

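As an illustration of the fields above, a schedule can mix entries that specify exact prompt text with entries that use synthetic token lengths (a sketch; the prompt text and values here are invented):

```jsonl
{"timestamp": 0, "input_text": "Summarize the latest benchmark results in two sentences.", "output_length": 150}
{"timestamp": 250, "input_length": 120, "output_length": 60, "hash_ids": [2001]}
```
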
## Basic Fixed Schedule Execution

### Setting Up the Server

```bash
# Start vLLM server for fixed schedule testing
docker pull vllm/vllm-openai:latest
docker run --gpus all -p 8000:8000 vllm/vllm-openai:latest \
    --model Qwen/Qwen3-0.6B \
    --host 0.0.0.0 --port 8000 &
```

```bash
# Wait for server to be ready
timeout 900 bash -c 'while [ "$(curl -s -o /dev/null -w "%{http_code}" localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d "{\"model\":\"Qwen/Qwen3-0.6B\",\"messages\":[{\"role\":\"user\",\"content\":\"test\"}],\"max_tokens\":1}")" != "200" ]; do sleep 2; done' || { echo "vLLM not ready after 15min"; exit 1; }
```

### Running Basic Fixed Schedule

<!-- aiperf-run-vllm-default-openai-endpoint-server -->
```bash
# Create a fixed schedule with precise timing
cat > precise_schedule.jsonl << 'EOF'
{"timestamp": 0, "input_length": 100, "hash_ids": [3001]}
{"timestamp": 500, "input_length": 200, "hash_ids": [3002]}
{"timestamp": 750, "input_length": 150, "hash_ids": [3003]}
{"timestamp": 1000, "input_length": 300, "hash_ids": [3004]}
{"timestamp": 1250, "input_length": 180, "hash_ids": [3005]}
{"timestamp": 2000, "input_length": 400, "hash_ids": [3006]}
{"timestamp": 2500, "input_length": 250, "hash_ids": [3007]}
{"timestamp": 3000, "input_length": 350, "hash_ids": [3008]}
{"timestamp": 4000, "input_length": 500, "hash_ids": [3009]}
{"timestamp": 5000, "input_length": 600, "hash_ids": [3010, 3050]}
EOF

# Run basic fixed schedule benchmarking
aiperf profile \
    --model Qwen/Qwen3-0.6B \
    --endpoint-type chat \
    --endpoint /v1/chat/completions \
    --streaming \
    --url localhost:8000 \
    --input-file precise_schedule.jsonl \
    --custom-dataset-type mooncake_trace \
    --fixed-schedule-auto-offset
```
<!-- /aiperf-run-vllm-default-openai-endpoint-server -->

**Key Parameters:**
- `--fixed-schedule-auto-offset`: Automatically adjusts timestamps to start from 0

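This matters when a schedule is exported with absolute timestamps. For example, entries such as the following (values invented for illustration) would presumably be replayed at 0 ms, 250 ms, and 1000 ms after the run starts, without the file having to be rewritten:

```jsonl
{"timestamp": 1755000000000, "input_length": 100, "hash_ids": [4001]}
{"timestamp": 1755000000250, "input_length": 200, "hash_ids": [4002]}
{"timestamp": 1755000001000, "input_length": 150, "hash_ids": [4003]}
```
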
## Advanced Schedule Patterns

### Time Window Execution

Execute only a portion of the schedule using start and end offsets:

<!-- aiperf-run-vllm-default-openai-endpoint-server -->
```bash
# Execute the schedule within the 2s to 4s window
aiperf profile \
    --model Qwen/Qwen3-0.6B \
    --endpoint-type chat \
    --endpoint /v1/chat/completions \
    --streaming \
    --url localhost:8000 \
    --input-file precise_schedule.jsonl \
    --custom-dataset-type mooncake_trace \
    --fixed-schedule-start-offset 2000 \
    --fixed-schedule-end-offset 4000
```
<!-- /aiperf-run-vllm-default-openai-endpoint-server -->

**Windowing Parameters:**
- `--fixed-schedule-start-offset 2000`: Start execution at the 2000 ms timestamp
- `--fixed-schedule-end-offset 4000`: End execution at the 4000 ms timestamp

## Use Cases

> [!IMPORTANT]
> **When to Use Fixed Schedule Benchmarking:**
> - **Traffic Replay**: Reproduce exact timing patterns from production logs
> - **Temporal Analysis**: Study how performance varies with request timing
> - **Peak Load Testing**: Test system behavior during known high-traffic periods
> - **SLA Validation**: Verify performance under specific timing constraints
> - **Capacity Planning**: Model future load based on projected growth patterns
> - **Regression Testing**: Ensure temporal performance characteristics remain stable

## Related Tutorials

- [Trace Benchmarking](trace-benchmarking.md) - For deterministic request patterns
- [Time-based Benchmarking](time-based-benchmarking.md) - For duration-based testing
- [Request Cancellation](request-cancellation.md) - For timeout testing
docs/tutorials/request-cancellation.md

Lines changed: 100 additions & 0 deletions
@@ -0,0 +1,100 @@
<!--
SPDX-FileCopyrightText: Copyright (c) 2024-2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
SPDX-License-Identifier: Apache-2.0
-->

# Request Cancellation Testing

AIPerf supports request timeout and cancellation scenarios, which are important for measuring the impact of user cancellations on performance.

## Setting Up the Server

```bash
# Start vLLM server
docker pull vllm/vllm-openai:latest
docker run --gpus all -p 8000:8000 vllm/vllm-openai:latest \
    --model Qwen/Qwen3-0.6B \
    --host 0.0.0.0 --port 8000 &
```

```bash
# Wait for server to be ready
timeout 900 bash -c 'while [ "$(curl -s -o /dev/null -w "%{http_code}" localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d "{\"model\":\"Qwen/Qwen3-0.6B\",\"messages\":[{\"role\":\"user\",\"content\":\"test\"}],\"max_tokens\":1}")" != "200" ]; do sleep 2; done' || { echo "vLLM not ready after 15min"; exit 1; }
```

## Basic Request Cancellation

Test with a small percentage of cancelled requests:

<!-- aiperf-run-vllm-default-openai-endpoint-server -->
```bash
# Profile with 10% request cancellation
aiperf profile \
    --model Qwen/Qwen3-0.6B \
    --endpoint-type chat \
    --endpoint /v1/chat/completions \
    --streaming \
    --url localhost:8000 \
    --request-cancellation-rate 10 \
    --request-cancellation-delay 0.5 \
    --synthetic-input-tokens-mean 800 \
    --synthetic-input-tokens-stddev 80 \
    --output-tokens-mean 400 \
    --output-tokens-stddev 40 \
    --concurrency 8 \
    --request-count 50 \
    --warmup-request-count 5
```
<!-- /aiperf-run-vllm-default-openai-endpoint-server -->

**Parameters Explained:**
- `--request-cancellation-rate 10`: Cancel 10% of requests (value between 0.0 and 100.0); with `--request-count 50`, roughly 5 requests are cancelled
- `--request-cancellation-delay 0.5`: Wait 0.5 seconds before cancelling selected requests

### High Cancellation Rate Testing

Test service resilience under frequent cancellations:

<!-- aiperf-run-vllm-default-openai-endpoint-server -->
```bash
# Profile with 50% request cancellation
aiperf profile \
    --model Qwen/Qwen3-0.6B \
    --endpoint-type chat \
    --endpoint /v1/chat/completions \
    --streaming \
    --url localhost:8000 \
    --request-cancellation-rate 50 \
    --request-cancellation-delay 1.0 \
    --synthetic-input-tokens-mean 1200 \
    --output-tokens-mean 600 \
    --concurrency 10 \
    --request-count 40
```
<!-- /aiperf-run-vllm-default-openai-endpoint-server -->

### Immediate Cancellation Testing

Test rapid cancellation scenarios:

<!-- aiperf-run-vllm-default-openai-endpoint-server -->
```bash
# Profile with immediate cancellation (0 delay)
aiperf profile \
    --model Qwen/Qwen3-0.6B \
    --endpoint-type chat \
    --endpoint /v1/chat/completions \
    --streaming \
    --url localhost:8000 \
    --request-cancellation-rate 30 \
    --request-cancellation-delay 0.0 \
    --synthetic-input-tokens-mean 500 \
    --output-tokens-mean 100 \
    --concurrency 15 \
    --request-count 60
```
<!-- /aiperf-run-vllm-default-openai-endpoint-server -->

**Expected Results:**
- Shows how quickly the server handles connection terminations
- Useful for testing resource cleanup and connection pooling
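
### Request Timeout Testing

The README's quick navigation pairs this tutorial with the `--request-timeout-seconds` flag for timeout behavior. A minimal sketch, assuming the flag enforces a per-request client-side timeout (the remaining options mirror the runs above):

```bash
# Profile with a 30-second per-request timeout (sketch)
aiperf profile \
    --model Qwen/Qwen3-0.6B \
    --endpoint-type chat \
    --endpoint /v1/chat/completions \
    --streaming \
    --url localhost:8000 \
    --request-timeout-seconds 30.0 \
    --synthetic-input-tokens-mean 800 \
    --output-tokens-mean 400 \
    --concurrency 8 \
    --request-count 50
```

Requests that exceed the timeout would be abandoned by the client, which complements the cancellation-rate runs above for SLA validation.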
docs/tutorials/time-based-benchmarking.md

Lines changed: 83 additions & 0 deletions
@@ -0,0 +1,83 @@
<!--
SPDX-FileCopyrightText: Copyright (c) 2024-2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
SPDX-License-Identifier: Apache-2.0
-->

# Time-based Benchmarking

Time-based benchmarking allows you to run benchmarks for a specific duration rather than a fixed number of requests.
This approach is ideal for measuring sustained performance and testing service stability over time.

## Overview

Time-based benchmarking provides several advantages:

- **Consistent Measurement Window**: Compare performance across different configurations using the same time duration
- **Real-world Simulation**: Mirror production scenarios where load is sustained over time
- **Resource Utilization**: Identify memory leaks, connection pooling issues, and resource exhaustion patterns
- **SLA Validation**: Establish and verify performance guarantees over specific time periods
- **Grace Period Control**: Handle in-flight requests gracefully or force immediate completion as needed

## Core Parameters

### Benchmark Duration
- `--benchmark-duration SECONDS`: Total time to run the benchmark
- Requests are sent continuously until the duration expires

### Grace Period
- `--benchmark-grace-period SECONDS`: Time to wait for in-flight requests after the duration expires
- Default: 30 seconds
- Set to 0 for immediate completion when the duration ends

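For example, a run that stops collecting as soon as the duration elapses might look like this (a sketch; the duration and remaining flags are illustrative):

```bash
# 2-minute benchmark that ends immediately at the deadline (sketch)
aiperf profile \
    --model Qwen/Qwen3-0.6B \
    --endpoint-type chat \
    --endpoint /v1/chat/completions \
    --streaming \
    --url localhost:8000 \
    --benchmark-duration 120.0 \
    --benchmark-grace-period 0 \
    --concurrency 5
```

With the grace period set to 0, responses still in flight at the deadline are not counted in metrics (per the `--benchmark-grace-period` help text), so the measurement window equals the configured duration exactly.
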
## Basic Time-based Testing

### Setting Up the Server

```bash
# Start vLLM server for time-based benchmarking
docker pull vllm/vllm-openai:latest
docker run --gpus all -p 8000:8000 vllm/vllm-openai:latest \
    --model Qwen/Qwen3-0.6B \
    --host 0.0.0.0 --port 8000 &
```

```bash
# Wait for server to be ready
timeout 900 bash -c 'while [ "$(curl -s -o /dev/null -w "%{http_code}" localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d "{\"model\":\"Qwen/Qwen3-0.6B\",\"messages\":[{\"role\":\"user\",\"content\":\"test\"}],\"max_tokens\":1}")" != "200" ]; do sleep 2; done' || { echo "vLLM not ready after 15min"; exit 1; }
```

### Duration Testing

Run brief performance checks to quickly validate service health:

<!-- aiperf-run-vllm-default-openai-endpoint-server -->
```bash
# Run 30-second benchmark with concurrency
aiperf profile \
    --model Qwen/Qwen3-0.6B \
    --endpoint-type chat \
    --endpoint /v1/chat/completions \
    --streaming \
    --url localhost:8000 \
    --benchmark-duration 30.0 \
    --benchmark-grace-period 15.0 \
    --synthetic-input-tokens-mean 200 \
    --synthetic-input-tokens-stddev 50 \
    --output-tokens-mean 100 \
    --output-tokens-stddev 20 \
    --concurrency 5 \
    --warmup-request-count 3 \
    --random-seed 33333
```
<!-- /aiperf-run-vllm-default-openai-endpoint-server -->

## Use Cases

> [!TIP]
> **When to Use Time-based Benchmarking:**
> - **SLA Validation**: Verify performance meets requirements over time
> - **Capacity Planning**: Determine sustainable load levels
> - **Stability Testing**: Identify performance degradation over time
> - **Resource Planning**: Understand resource consumption patterns
> - **Production Readiness**: Validate service stability before deployment
