
Commit d056b9e

docs: draft tutorials for advanced features (#312)
1 parent a58de00 commit d056b9e

File tree

8 files changed: +452 -2 lines changed


README.md

Lines changed: 33 additions & 0 deletions
@@ -40,6 +40,39 @@ Features

</br>

## Tutorials & Advanced Features

### Getting Started
- **[Basic Tutorial](docs/tutorial.md)** - Learn the fundamentals with Dynamo and vLLM examples

### Advanced Benchmarking Features
| Feature | Description | Use Cases |
|---------|-------------|-----------|
| **[Request Cancellation](docs/tutorials/request-cancellation.md)** | Test timeout behavior and service resilience | SLA validation, cancellation modeling |
| **[Trace Benchmarking](docs/tutorials/trace-benchmarking.md)** | Deterministic workload replay with custom datasets | Regression testing, A/B testing |
| **[Fixed Schedule](docs/tutorials/fixed-schedule.md)** | Precise timestamp-based request execution | Traffic replay, temporal analysis, burst testing |
| **[Time-based Benchmarking](docs/tutorials/time-based-benchmarking.md)** | Duration-based testing with grace period control | Stability testing, sustained performance |

### Quick Navigation
```bash
# Basic profiling
aiperf profile --model Qwen/Qwen3-0.6B --url localhost:8000 --endpoint-type chat

# Request timeout testing
aiperf profile --request-timeout-seconds 30.0 [other options...]

# Trace-based benchmarking
aiperf profile --input-file trace.jsonl --custom-dataset-type single_turn [other options...]

# Fixed schedule execution
aiperf profile --input-file schedule.jsonl --fixed-schedule --fixed-schedule-auto-offset [other options...]

# Time-based benchmarking
aiperf profile --benchmark-duration 300.0 --benchmark-grace-period 30.0 [other options...]
```

</br>

## Supported APIs

- OpenAI chat completions
docs/cli_options.md

Lines changed: 6 additions & 0 deletions
@@ -63,12 +63,18 @@ Use these options to profile with AIPerf.
```
```
╭─ Load Generator ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ BENCHMARK-DURATION --benchmark-duration The duration in seconds for benchmarking. │
│ BENCHMARK-GRACE-PERIOD --benchmark-grace-period The grace period in seconds to wait for responses after benchmark duration ends. Only applies when │
│ --benchmark-duration is set. Responses received within this period are included in metrics. [default: 30.0] │
│ CONCURRENCY --concurrency The concurrency value to benchmark. │
│ REQUEST-RATE --request-rate Sets the request rate for the load generated by AIPerf. Unit: requests/second │
│ REQUEST-RATE-MODE --request-rate-mode Sets the request rate mode for the load generated by AIPerf. Valid values: constant, poisson. constant: Generate │
│ requests at a fixed rate. poisson: Generate requests using a poisson distribution. [default: poisson] │
│ REQUEST-COUNT --request-count --num-requests The number of requests to use for measurement. [default: 10] │
│ WARMUP-REQUEST-COUNT --warmup-request-count --num-warmup-requests The number of warmup requests to send before benchmarking. [default: 0] │
│ REQUEST-CANCELLATION-RATE --request-cancellation-rate The percentage of requests to cancel. [default: 0.0] │
│ REQUEST-CANCELLATION-DELAY --request-cancellation-delay The delay in seconds before cancelling requests. This is used when --request-cancellation-rate is greater than 0. │
│ [default: 0.0] │
╰───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
```
```

docs/tutorial.md

Lines changed: 1 addition & 1 deletion
@@ -129,4 +129,4 @@ aiperf profile \
    --request-count 64 \
    --url localhost:8000
```
<!-- /aiperf-run-vllm-default-openai-endpoint-server -->

docs/tutorials/fixed-schedule.md

Lines changed: 130 additions & 0 deletions
@@ -0,0 +1,130 @@
<!--
SPDX-FileCopyrightText: Copyright (c) 2024-2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
SPDX-License-Identifier: Apache-2.0
-->

# Fixed Schedule Benchmarking

Fixed schedule benchmarking provides precise timing control by executing requests at specific timestamps.
This mode is ideal for simulating exact traffic patterns, testing temporal performance characteristics,
and reproducing time-sensitive scenarios.

## Overview

Fixed schedule mode enables:

- **Precise Timing**: Execute requests at exact millisecond intervals
- **Traffic Simulation**: Replicate real-world traffic patterns
- **Performance Analysis**: Identify how response times vary with request timing
- **Load Testing**: Test system behavior under controlled temporal stress patterns

## Fixed Schedule File Format

Fixed schedule files use JSONL format with timestamp-based entries:

```jsonl
{"timestamp": 0, "input_length": 100, "output_length": 200, "hash_ids": [1001]}
{"timestamp": 500, "input_length": 200, "output_length": 400, "hash_ids": [1002]}
{"timestamp": 1000, "input_length": 550, "output_length": 500, "hash_ids": [1003, 1005]}
```

**Field Descriptions:**
- `timestamp`: Milliseconds from the schedule start at which the request should be sent
- `input_length`: Number of tokens in the input prompt
- `input_text`: Exact text to send in the request (may be provided instead of `input_length`)
- `output_length`: Maximum number of tokens in the response (optional)
- `hash_ids`: Hash block identifiers to simulate text reuse with 512-token blocks (optional)

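As an illustration of the fields above, a schedule can mix entries that specify exact prompt text with entries that use synthetic token lengths (a sketch; the prompt text and values here are invented):

```jsonl
{"timestamp": 0, "input_text": "Summarize the latest benchmark results in two sentences.", "output_length": 150}
{"timestamp": 250, "input_length": 120, "output_length": 60, "hash_ids": [2001]}
```
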
## Basic Fixed Schedule Execution

### Setting Up the Server

```bash
# Start vLLM server for fixed schedule testing
docker pull vllm/vllm-openai:latest
docker run --gpus all -p 8000:8000 vllm/vllm-openai:latest \
    --model Qwen/Qwen3-0.6B \
    --host 0.0.0.0 --port 8000 &
```

```bash
# Wait for server to be ready
timeout 900 bash -c 'while [ "$(curl -s -o /dev/null -w "%{http_code}" localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d "{\"model\":\"Qwen/Qwen3-0.6B\",\"messages\":[{\"role\":\"user\",\"content\":\"test\"}],\"max_tokens\":1}")" != "200" ]; do sleep 2; done' || { echo "vLLM not ready after 15min"; exit 1; }
```

### Running Basic Fixed Schedule

<!-- aiperf-run-vllm-default-openai-endpoint-server -->
```bash
# Create a fixed schedule with precise timing
cat > precise_schedule.jsonl << 'EOF'
{"timestamp": 0, "input_length": 100, "hash_ids": [3001]}
{"timestamp": 500, "input_length": 200, "hash_ids": [3002]}
{"timestamp": 750, "input_length": 150, "hash_ids": [3003]}
{"timestamp": 1000, "input_length": 300, "hash_ids": [3004]}
{"timestamp": 1250, "input_length": 180, "hash_ids": [3005]}
{"timestamp": 2000, "input_length": 400, "hash_ids": [3006]}
{"timestamp": 2500, "input_length": 250, "hash_ids": [3007]}
{"timestamp": 3000, "input_length": 350, "hash_ids": [3008]}
{"timestamp": 4000, "input_length": 500, "hash_ids": [3009]}
{"timestamp": 5000, "input_length": 600, "hash_ids": [3010, 3050]}
EOF

# Run basic fixed schedule benchmarking
aiperf profile \
    --model Qwen/Qwen3-0.6B \
    --endpoint-type chat \
    --endpoint /v1/chat/completions \
    --streaming \
    --url localhost:8000 \
    --input-file precise_schedule.jsonl \
    --custom-dataset-type mooncake_trace \
    --fixed-schedule-auto-offset
```
<!-- /aiperf-run-vllm-default-openai-endpoint-server -->

**Key Parameters:**
- `--fixed-schedule-auto-offset`: Automatically adjusts timestamps to start from 0

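This matters when a schedule is exported with absolute timestamps. For example, entries such as the following (values invented for illustration) would presumably be replayed at 0 ms, 250 ms, and 1000 ms after the run starts, without the file having to be rewritten:

```jsonl
{"timestamp": 1755000000000, "input_length": 100, "hash_ids": [4001]}
{"timestamp": 1755000000250, "input_length": 200, "hash_ids": [4002]}
{"timestamp": 1755000001000, "input_length": 150, "hash_ids": [4003]}
```
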
## Advanced Schedule Patterns

### Time Window Execution

Execute only a portion of the schedule using start and end offsets:

<!-- aiperf-run-vllm-default-openai-endpoint-server -->
```bash
# Execute the schedule within the 2s to 4s window
aiperf profile \
    --model Qwen/Qwen3-0.6B \
    --endpoint-type chat \
    --endpoint /v1/chat/completions \
    --streaming \
    --url localhost:8000 \
    --input-file precise_schedule.jsonl \
    --custom-dataset-type mooncake_trace \
    --fixed-schedule-start-offset 2000 \
    --fixed-schedule-end-offset 4000
```
<!-- /aiperf-run-vllm-default-openai-endpoint-server -->

**Windowing Parameters:**
- `--fixed-schedule-start-offset 2000`: Start execution at the 2000 ms timestamp
- `--fixed-schedule-end-offset 4000`: End execution at the 4000 ms timestamp

## Use Cases

> [!IMPORTANT]
> **When to Use Fixed Schedule Benchmarking:**
> - **Traffic Replay**: Reproduce exact timing patterns from production logs
> - **Temporal Analysis**: Study how performance varies with request timing
> - **Peak Load Testing**: Test system behavior during known high-traffic periods
> - **SLA Validation**: Verify performance under specific timing constraints
> - **Capacity Planning**: Model future load based on projected growth patterns
> - **Regression Testing**: Ensure temporal performance characteristics remain stable

## Related Tutorials

- [Trace Benchmarking](trace-benchmarking.md) - For deterministic request patterns
- [Time-based Benchmarking](time-based-benchmarking.md) - For duration-based testing
- [Request Cancellation](request-cancellation.md) - For timeout testing
docs/tutorials/request-cancellation.md

Lines changed: 100 additions & 0 deletions
@@ -0,0 +1,100 @@
<!--
SPDX-FileCopyrightText: Copyright (c) 2024-2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
SPDX-License-Identifier: Apache-2.0
-->

# Request Cancellation Testing

AIPerf supports request timeout and cancellation scenarios, which are important for measuring the impact of user cancellations on performance.

## Setting Up the Server

```bash
# Start vLLM server
docker pull vllm/vllm-openai:latest
docker run --gpus all -p 8000:8000 vllm/vllm-openai:latest \
    --model Qwen/Qwen3-0.6B \
    --host 0.0.0.0 --port 8000 &
```

```bash
# Wait for server to be ready
timeout 900 bash -c 'while [ "$(curl -s -o /dev/null -w "%{http_code}" localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d "{\"model\":\"Qwen/Qwen3-0.6B\",\"messages\":[{\"role\":\"user\",\"content\":\"test\"}],\"max_tokens\":1}")" != "200" ]; do sleep 2; done' || { echo "vLLM not ready after 15min"; exit 1; }
```

## Basic Request Cancellation

Test with a small percentage of cancelled requests:

<!-- aiperf-run-vllm-default-openai-endpoint-server -->
```bash
# Profile with 10% request cancellation
aiperf profile \
    --model Qwen/Qwen3-0.6B \
    --endpoint-type chat \
    --endpoint /v1/chat/completions \
    --streaming \
    --url localhost:8000 \
    --request-cancellation-rate 10 \
    --request-cancellation-delay 0.5 \
    --synthetic-input-tokens-mean 800 \
    --synthetic-input-tokens-stddev 80 \
    --output-tokens-mean 400 \
    --output-tokens-stddev 40 \
    --concurrency 8 \
    --request-count 50 \
    --warmup-request-count 5
```
<!-- /aiperf-run-vllm-default-openai-endpoint-server -->

**Parameters Explained:**
- `--request-cancellation-rate 10`: Cancel 10% of requests (value between 0.0 and 100.0); with `--request-count 50`, roughly 5 requests are cancelled
- `--request-cancellation-delay 0.5`: Wait 0.5 seconds before cancelling selected requests

### High Cancellation Rate Testing

Test service resilience under frequent cancellations:

<!-- aiperf-run-vllm-default-openai-endpoint-server -->
```bash
# Profile with 50% request cancellation
aiperf profile \
    --model Qwen/Qwen3-0.6B \
    --endpoint-type chat \
    --endpoint /v1/chat/completions \
    --streaming \
    --url localhost:8000 \
    --request-cancellation-rate 50 \
    --request-cancellation-delay 1.0 \
    --synthetic-input-tokens-mean 1200 \
    --output-tokens-mean 600 \
    --concurrency 10 \
    --request-count 40
```
<!-- /aiperf-run-vllm-default-openai-endpoint-server -->

### Immediate Cancellation Testing

Test rapid cancellation scenarios:

<!-- aiperf-run-vllm-default-openai-endpoint-server -->
```bash
# Profile with immediate cancellation (0 delay)
aiperf profile \
    --model Qwen/Qwen3-0.6B \
    --endpoint-type chat \
    --endpoint /v1/chat/completions \
    --streaming \
    --url localhost:8000 \
    --request-cancellation-rate 30 \
    --request-cancellation-delay 0.0 \
    --synthetic-input-tokens-mean 500 \
    --output-tokens-mean 100 \
    --concurrency 15 \
    --request-count 60
```
<!-- /aiperf-run-vllm-default-openai-endpoint-server -->

**Expected Results:**
- Shows how quickly the server handles connection terminations
- Useful for testing resource cleanup and connection pooling
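
### Request Timeout Testing

The README's quick navigation pairs this tutorial with the `--request-timeout-seconds` flag for timeout behavior. A minimal sketch, assuming the flag enforces a per-request client-side timeout (the remaining options mirror the runs above):

```bash
# Profile with a 30-second per-request timeout (sketch)
aiperf profile \
    --model Qwen/Qwen3-0.6B \
    --endpoint-type chat \
    --endpoint /v1/chat/completions \
    --streaming \
    --url localhost:8000 \
    --request-timeout-seconds 30.0 \
    --synthetic-input-tokens-mean 800 \
    --output-tokens-mean 400 \
    --concurrency 8 \
    --request-count 50
```

Requests that exceed the timeout would be abandoned by the client, which complements the cancellation-rate runs above for SLA validation.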
docs/tutorials/time-based-benchmarking.md

Lines changed: 83 additions & 0 deletions
@@ -0,0 +1,83 @@
<!--
SPDX-FileCopyrightText: Copyright (c) 2024-2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
SPDX-License-Identifier: Apache-2.0
-->

# Time-based Benchmarking

Time-based benchmarking allows you to run benchmarks for a specific duration rather than a fixed number of requests.
This approach is ideal for measuring sustained performance and testing service stability over time.

## Overview

Time-based benchmarking provides several advantages:

- **Consistent Measurement Window**: Compare performance across different configurations using the same time duration
- **Real-world Simulation**: Mirror production scenarios where load is sustained over time
- **Resource Utilization**: Identify memory leaks, connection pooling issues, and resource exhaustion patterns
- **SLA Validation**: Establish and verify performance guarantees over specific time periods
- **Grace Period Control**: Handle in-flight requests gracefully or force immediate completion as needed

## Core Parameters

### Benchmark Duration
- `--benchmark-duration SECONDS`: Total time to run the benchmark
- Requests are sent continuously until the duration expires

### Grace Period
- `--benchmark-grace-period SECONDS`: Time to wait for in-flight requests after the duration expires
- Default: 30 seconds
- Set to 0 for immediate completion when the duration ends

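For example, a run that stops collecting as soon as the duration elapses might look like this (a sketch; the duration and remaining flags are illustrative):

```bash
# 2-minute benchmark that ends immediately at the deadline (sketch)
aiperf profile \
    --model Qwen/Qwen3-0.6B \
    --endpoint-type chat \
    --endpoint /v1/chat/completions \
    --streaming \
    --url localhost:8000 \
    --benchmark-duration 120.0 \
    --benchmark-grace-period 0 \
    --concurrency 5
```

With the grace period set to 0, responses still in flight at the deadline are not counted in metrics (per the `--benchmark-grace-period` help text), so the measurement window equals the configured duration exactly.
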
## Basic Time-based Testing

### Setting Up the Server

```bash
# Start vLLM server for time-based benchmarking
docker pull vllm/vllm-openai:latest
docker run --gpus all -p 8000:8000 vllm/vllm-openai:latest \
    --model Qwen/Qwen3-0.6B \
    --host 0.0.0.0 --port 8000 &
```

```bash
# Wait for server to be ready
timeout 900 bash -c 'while [ "$(curl -s -o /dev/null -w "%{http_code}" localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d "{\"model\":\"Qwen/Qwen3-0.6B\",\"messages\":[{\"role\":\"user\",\"content\":\"test\"}],\"max_tokens\":1}")" != "200" ]; do sleep 2; done' || { echo "vLLM not ready after 15min"; exit 1; }
```

### Duration Testing

Run brief performance checks to quickly validate service health:

<!-- aiperf-run-vllm-default-openai-endpoint-server -->
```bash
# Run 30-second benchmark with concurrency
aiperf profile \
    --model Qwen/Qwen3-0.6B \
    --endpoint-type chat \
    --endpoint /v1/chat/completions \
    --streaming \
    --url localhost:8000 \
    --benchmark-duration 30.0 \
    --benchmark-grace-period 15.0 \
    --synthetic-input-tokens-mean 200 \
    --synthetic-input-tokens-stddev 50 \
    --output-tokens-mean 100 \
    --output-tokens-stddev 20 \
    --concurrency 5 \
    --warmup-request-count 3 \
    --random-seed 33333
```
<!-- /aiperf-run-vllm-default-openai-endpoint-server -->

## Use Cases

> [!TIP]
> **When to Use Time-based Benchmarking:**
> - **SLA Validation**: Verify performance meets requirements over time
> - **Capacity Planning**: Determine sustainable load levels
> - **Stability Testing**: Identify performance degradation over time
> - **Resource Planning**: Understand resource consumption patterns
> - **Production Readiness**: Validate service stability before deployment
