268 changes: 122 additions & 146 deletions docs/contributing/profiling.md
@@ -1,216 +1,192 @@
# Profiling vLLM-Omni

> **Warning:** Profiling incurs significant overhead. Use only for development and debugging, never in production.
> **Warning:** Profiling is for development and debugging only. It adds significant overhead and should not be enabled in production.

vLLM-Omni uses the PyTorch Profiler to analyze performance across both **multi-stage omni-modality models** and **diffusion models**.
vLLM-Omni supports two profiler backends through `profiler_config`:

### 1. Configure Profiling in the Stage YAML
- `torch`: detailed CPU/CUDA traces written to `torch_profiler_dir`
- `cuda`: low-overhead CUDA range control for NVIDIA Nsight Systems (`nsys`)

Enable profiling by adding `profiler_config` under `engine_args` for the stage(s) you want to profile in your stage config YAML:
## 1. Configure Profiling

Use the same `profiler_config` shape everywhere:

```yaml
profiler_config:
  profiler: torch
  torch_profiler_dir: ./perf
```

Supported fields:

| Field | Description |
|---|---|
| `profiler` | Profiler backend. Supported values: `torch`, `cuda`. |
| `torch_profiler_dir` | Output directory for torch traces. Required when `profiler: torch`. |
| `delay_iterations` | Number of worker iterations to skip before profiling starts. |
| `max_iterations` | Maximum number of worker iterations to capture before auto-stop. |
| `warmup_iterations` | Torch-profiler warmup iterations. |
| `active_iterations` | Torch-profiler active iterations. |
| `wait_iterations` | Torch-profiler wait iterations before warmup. |
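
The field rules in the table above can be checked before launching a run. The following is a minimal sketch, not part of vLLM-Omni: `check_profiler_config` is a hypothetical helper, and the requirement that the iteration fields be non-negative integers is an assumption rather than something the table states.

```python
from typing import Any, List, Mapping

# Backend names taken from the "Supported fields" table.
SUPPORTED_PROFILERS = {"torch", "cuda"}

# Iteration-count fields listed in the table.
ITERATION_FIELDS = (
    "delay_iterations",
    "max_iterations",
    "warmup_iterations",
    "active_iterations",
    "wait_iterations",
)


def check_profiler_config(config: Mapping[str, Any]) -> List[str]:
    """Hypothetical helper: return a list of problems (empty means OK)."""
    problems = []
    profiler = config.get("profiler")
    if profiler not in SUPPORTED_PROFILERS:
        problems.append(
            f"profiler must be one of {sorted(SUPPORTED_PROFILERS)}, got {profiler!r}"
        )
    # The table marks torch_profiler_dir as required for the torch backend.
    if profiler == "torch" and not config.get("torch_profiler_dir"):
        problems.append("torch_profiler_dir is required when profiler is 'torch'")
    # Assumption: iteration counts are non-negative integers.
    for field in ITERATION_FIELDS:
        value = config.get(field)
        if value is None:
            continue
        if isinstance(value, bool) or not isinstance(value, int) or value < 0:
            problems.append(f"{field} must be a non-negative integer, got {value!r}")
    return problems


print(check_profiler_config({"profiler": "torch", "torch_profiler_dir": "./perf"}))
print(check_profiler_config({"profiler": "torch"}))
```

The first call reports no problems; the second reports the missing `torch_profiler_dir`.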

For multi-stage omni pipelines, put `profiler_config` under the target stage's `engine_args`.

```yaml
stage_args:
  - stage_id: 0
    stage_type: llm
    engine_args:
      # ... other engine args ...
      profiler_config:
        profiler: torch
        torch_profiler_dir: ./perf
```
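
Because `profiler_config` lives under each stage's `engine_args`, the set of stages that can be profiled can be read from the stage config itself. The sketch below assumes the YAML has already been parsed into a plain dictionary with the `stage_args` / `stage_id` / `engine_args` layout shown above; `profiled_stage_ids` is a hypothetical helper, not a vLLM-Omni API.

```python
from typing import Any, List, Mapping


def profiled_stage_ids(stage_config: Mapping[str, Any]) -> List[int]:
    """Hypothetical helper: IDs of stages whose engine_args set a profiler."""
    stage_ids = []
    for stage in stage_config.get("stage_args", []):
        engine_args = stage.get("engine_args") or {}
        profiler_config = engine_args.get("profiler_config") or {}
        if profiler_config.get("profiler"):
            stage_ids.append(stage["stage_id"])
    return stage_ids


# Illustrative config: stage 0 is profiled, stage 1 is not.
example_config = {
    "stage_args": [
        {
            "stage_id": 0,
            "stage_type": "llm",
            "engine_args": {
                "profiler_config": {
                    "profiler": "torch",
                    "torch_profiler_dir": "./perf",
                }
            },
        },
        {"stage_id": 1, "stage_type": "llm", "engine_args": {}},
    ]
}

print(profiled_stage_ids(example_config))
```

The resulting list is a natural value for the `stages` argument when only some stages are configured for profiling.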

| Field | Description |
|---|---|
| `profiler` | Profiler backend to use. Currently supports `torch`. |
| `torch_profiler_dir` | Directory where trace files are saved. Created automatically if it doesn't exist. |

> **Tip:** Only enable `profiler_config` on stages you actually need to profile. Stages without it will not start a profiler, keeping overhead minimal.

### 2. Profiling Omni-Modality Models
For single-stage diffusion usage, pass `profiler_config` directly to `Omni(...)` or `vllm serve`.

**Selective Stage Profiling**
## 2. Profiling Omni Pipelines

It is highly recommended to profile specific stages to prevent producing overly large trace files:
It is usually best to profile only the stages you need.

```python
# Profile all stages
omni_llm.start_profile()
# Profile all stages.
omni.start_profile()

# Only profile Stage 1
omni_llm.start_profile(stages=[1])

# Stage 0 (Thinker) and Stage 2 (Audio Decoder) for qwen omni
omni_llm.start_profile(stages=[0, 2])
# Profile selected stages only.
omni.start_profile(stages=[0, 2])
...
omni.stop_profile(stages=[0, 2])
```

> **Important:** Always pass the same `stages` list to both `start_profile()` and `stop_profile()`. If you omit `stages` from `stop_profile()`, it defaults to stopping all stages — including ones that were never started — which will produce errors.

**Python Usage**: Wrap your generation logic with `start_profile()` and `stop_profile()`.
Always stop the same stage set that you started. If only some stages have `profiler_config`, pass an explicit `stages=[...]` list instead of relying on the default "all stages" behavior.

```python
profiler_stages = [0] # Only profile the stages you need
Examples:

# 1. Start profiling
omni.start_profile(stages=profiler_stages)
1. [Qwen2.5-Omni end2end](https://github.com/vllm-project/vllm-omni/blob/main/examples/offline_inference/qwen2_5_omni/end2end.py)
2. [Qwen3-Omni end2end](https://github.com/vllm-project/vllm-omni/blob/main/examples/offline_inference/qwen3_omni/end2end.py)

# Initialize generator
omni_generator = omni.generate(prompts, sampling_params_list, py_generator=args.py_generator)
## 3. Profiling Single-Stage Diffusion

total_requests = len(prompts)
processed_count = 0
Single-stage diffusion models use the same `start_profile()` / `stop_profile()` controls, but you must provide `profiler_config` explicitly.

# Main Processing Loop
for stage_outputs in omni_generator:
### PyTorch profiler

# ... [Output processing logic for text/audio would go here] ...
```python
from vllm_omni import Omni

omni = Omni(
    model="Wan-AI/Wan2.2-I2V-A14B-Diffusers",
    profiler_config={
        "profiler": "torch",
        "torch_profiler_dir": "./perf",
    },
)

omni.start_profile()
...
omni.stop_profile()
```

# Update count to track when to stop profiling
processed_count += len(stage_outputs.request_output)
### Nsight Systems (`nsys`)

# 2. Check if all requests are done to stop the profiler safely
if profiler_enabled and processed_count >= total_requests:
print(f"[Info] Processed {processed_count}/{total_requests}. Stopping profiler inside active loop...")
For Nsight Systems, use `profiler: cuda` and wrap the process with `nsys profile`.

# Stop the profiler while workers are still active
# Pass the same stages list used in start_profile()
omni_llm.stop_profile(stages=profiler_stages)
```bash
nsys profile \
--trace-fork-before-exec=true \
--cuda-graph-trace=node \
--capture-range=cudaProfilerApi \
--capture-range-end=repeat \
-o diffusion_trace \
python image_to_video.py ...
```

# Wait for traces to flush to disk
print("[Info] Waiting 30s for workers to write trace files to disk...")
time.sleep(30)
print("[Info] Trace export wait time finished.")
The Python process being profiled must create the diffusion engine with:

omni_llm.close()
```python
profiler_config={"profiler": "cuda"}
```

Then call `start_profile()` before the requests you want to capture and `stop_profile()` after them. The diffusion worker processes open and close the CUDA capture range themselves, so `nsys` sees the actual GPU work instead of only the parent process.

**CLI Usage** (using `end2end.py`):
```bash
# Profile only Stage 0 (Thinker)
python end2end.py --output-wav output_audio \
--query-type text --enable-profiler --profiler-stages 0
Examples:

# Profile Stage 0 and Stage 2
python end2end.py --output-wav output_audio \
--query-type text --enable-profiler --profiler-stages 0 2
1. [Image edit example](https://github.com/vllm-project/vllm-omni/blob/main/examples/offline_inference/image_to_image/image_edit.py)
2. [Image to video example](https://github.com/vllm-project/vllm-omni/tree/main/examples/offline_inference/image_to_video)

# Profile all stages (omit --profiler-stages)
python end2end.py --output-wav output_audio \
--query-type text --enable-profiler
```
## 4. Profiling Online Serving

**Examples**:
When any stage has `profiler_config.profiler` set, the server exposes:

1. **Qwen2.5-Omni**: [https://github.com/vllm-project/vllm-omni/blob/main/examples/offline_inference/qwen2_5_omni/end2end.py](https://github.com/vllm-project/vllm-omni/blob/main/examples/offline_inference/qwen2_5_omni/end2end.py)
- `POST /start_profile`
- `POST /stop_profile`

2. **Qwen3-Omni**: [https://github.com/vllm-project/vllm-omni/blob/main/examples/offline_inference/qwen3_omni/end2end.py](https://github.com/vllm-project/vllm-omni/blob/main/examples/offline_inference/qwen3_omni/end2end.py)
### Start the server

### 3. Profiling diffusion models
Multi-stage omni serving:

Diffusion profiling is End-to-End, capturing encoding, denoising loops, and decoding. Standalone diffusion scripts use `--profiler-dir` to enable profiling.

**CLI Usage:**
```bash
python image_to_video.py \
--model Wan-AI/Wan2.2-I2V-A14B-Diffusers \
--image qwen-bear.png \
--prompt "A cat playing with yarn, smooth motion" \
--profiler-dir \
\
# Minimize Spatial Dimensions (Optional but helpful):
# Drastically reduces memory usage so the profiler doesn't
# crash due to overhead, though for accurate performance
# tuning you often want target resolutions.
--height 48 \
--width 64 \
\
# Minimize Temporal Dimension (Frames):
# Video models process 3D tensors (Time, Height, Width).
# Reducing frames to the absolute minimum (2) keeps the
# tensor size small, ensuring the trace file doesn't become
# multi-gigabytes in size.
--num-frames 2 \
\
# Minimize Iteration Loop (Steps):
# This is the most critical setting for profiling.
# Diffusion models run the same loop X times.
# Profiling 2 steps gives you the exact same performance
# data as 50 steps, but saves minutes of runtime and
# prevents the trace viewer from freezing.
--num-inference-steps 2 \
\
--guidance-scale 5.0 \
--guidance-scale-high 6.0 \
--boundary-ratio 0.875 \
--flow-shift 12.0 \
--fps 16 \
--output i2v_output.mp4
vllm serve Qwen/Qwen2.5-Omni-7B \
--omni \
--stage-configs-path qwen2_5_omni.yaml \
--port 8091
```

> **Note:** For diffusion stages within a multi-stage omni pipeline, use `profiler_config` in the stage YAML instead (see Section 1).

**Examples**:

1. **Qwen image edit**: [https://github.com/vllm-project/vllm-omni/blob/main/examples/offline_inference/image_to_image/image_edit.py](https://github.com/vllm-project/vllm-omni/blob/main/examples/offline_inference/image_to_image/image_edit.py)

2. **Wan-AI/Wan2.2-I2V-A14B-Diffusers**: [https://github.com/vllm-project/vllm-omni/tree/main/examples/offline_inference/image_to_video](https://github.com/vllm-project/vllm-omni/tree/main/examples/offline_inference/image_to_video)

### 4. Profiling Online Serving

When `profiler_config` is set in the stage YAML, the server automatically exposes `/start_profile` and `/stop_profile` HTTP endpoints.
Single-stage diffusion serving with torch profiler:

**1. Start the server** with a stage YAML that has `profiler_config` enabled:
```bash
vllm serve Qwen/Qwen2.5-Omni-7B \
--omni \
--stage-configs-path qwen2_5_omni.yaml \
--port 8091
vllm serve Wan-AI/Wan2.2-I2V-A14B-Diffusers \
--omni \
--port 8091 \
--profiler-config '{"profiler": "torch", "torch_profiler_dir": "./vllm_profile"}'
```

Or for one stage diffusion models:
Single-stage diffusion serving with Nsight Systems:

```bash
vllm serve Wan-AI/Wan2.2-I2V-A14B-Diffusers --omni --port 8091 --profiler-config '{"profiler": "torch", "torch_profiler_dir": "./vllm_profile"}'
nsys profile \
--trace-fork-before-exec=true \
--cuda-graph-trace=node \
--capture-range=cudaProfilerApi \
--capture-range-end=repeat \
-o serving_trace \
vllm serve Wan-AI/Wan2.2-I2V-A14B-Diffusers \
--omni \
--port 8091 \
--profiler-config '{"profiler": "cuda"}'
```

**2. Start profiling** by sending a POST request:
### Control capture

```bash
# Profile all stages that have profiler_config set
# Start profiling on all profiled stages.
curl -X POST http://localhost:8091/start_profile

# Profile specific stages only
# Start profiling on selected stages.
curl -X POST http://localhost:8091/start_profile \
-H "Content-Type: application/json" \
-d '{"stages": [0]}'
```
-H "Content-Type: application/json" \
-d '{"stages": [0]}'

**3. Send your inference requests** as normal while the profiler is running.

**4. Stop profiling** and collect traces:
```bash
# Stop all stages
# Stop profiling.
curl -X POST http://localhost:8091/stop_profile

# Stop specific stages (must match the stages you started)
curl -X POST http://localhost:8091/stop_profile \
-H "Content-Type: application/json" \
-d '{"stages": [0]}'
```

Trace files are written to the `torch_profiler_dir` specified in your stage YAML.
For mixed-stage pipelines, use explicit `stages` and pass the same stage list to both endpoints.
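
The same endpoints can be driven from Python when capture needs to be scripted around a benchmark. The sketch below uses only the standard library and mirrors the `curl` calls: a bare `POST` for all profiled stages, or a JSON body with a `stages` list. The helper names `build_profile_request` and `send_profile_request` are hypothetical, and treating any returned status as success is an assumption about the server.

```python
import json
import urllib.request
from typing import Optional, Sequence


def build_profile_request(
    base_url: str, action: str, stages: Optional[Sequence[int]] = None
) -> urllib.request.Request:
    """Hypothetical helper: build a POST to /start_profile or /stop_profile."""
    if action not in ("start", "stop"):
        raise ValueError(f"action must be 'start' or 'stop', got {action!r}")
    url = f"{base_url.rstrip('/')}/{action}_profile"
    if stages is None:
        # No body: applies to all stages that have a profiler configured.
        return urllib.request.Request(url, method="POST")
    body = json.dumps({"stages": list(stages)}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_profile_request(request: urllib.request.Request, timeout: float = 30.0) -> int:
    """Send the request and return the HTTP status code."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status


# Build (but do not send) a matching start/stop pair for stage 0.
stages = [0]
start_request = build_profile_request("http://localhost:8091", "start", stages)
stop_request = build_profile_request("http://localhost:8091", "stop", stages)
print(start_request.full_url, stop_request.full_url)
```

Building both requests from the same `stages` variable keeps the start and stop stage lists identical. Call `send_profile_request(start_request)` before the inference requests and `send_profile_request(stop_request)` after them.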

## 5. Analyze Results

> **Important:** Always stop the same stages you started. Stopping a stage that was never started will produce errors.
Torch profiler output:

### 5. Analyzing Traces
- Chrome/Perfetto traces under `torch_profiler_dir`
- Optional aggregated CUDA-time tables under the same directory

Output files are saved to the `torch_profiler_dir` specified in your stage YAML config.
CUDA profiler / Nsight Systems output:

**Output**
**Chrome Trace** (`.json.gz`): Visual timeline of kernels and stages. Open in Perfetto UI.
- `.nsys-rep` report files written by `nsys -o ...`

**Viewing Tools:**
Recommended viewers:

- [Perfetto](https://ui.perfetto.dev/) (recommended)
- `chrome://tracing` (Chrome only)
- [Perfetto](https://ui.perfetto.dev/) for torch traces
- `nsys stats <report>.nsys-rep` for CLI summaries
- Nsight Systems GUI for CUDA kernel timelines
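
For a quick look at a torch trace without opening a viewer, the Chrome trace file can be summarized directly. This is a rough sketch, assuming the trace is gzip-compressed JSON in the Chrome Trace Event format, where complete events have `"ph": "X"` and a `dur` field in microseconds. `top_events_by_duration` is a hypothetical helper, and its totals ignore nesting, so parent and child events are both counted.

```python
import gzip
import json
from collections import Counter
from typing import List, Tuple


def top_events_by_duration(trace_path: str, limit: int = 10) -> List[Tuple[str, float]]:
    """Hypothetical helper: total 'dur' per event name, largest first."""
    with gzip.open(trace_path, "rt", encoding="utf-8") as handle:
        trace = json.load(handle)
    # The format allows either {"traceEvents": [...]} or a bare event list.
    events = trace["traceEvents"] if isinstance(trace, dict) else trace
    totals = Counter()
    for event in events:
        # Only complete events ("X") carry a duration.
        if event.get("ph") == "X" and "dur" in event:
            totals[event.get("name", "<unnamed>")] += event["dur"]
    return totals.most_common(limit)
```

Calling `top_events_by_duration(path)` on a `.json.gz` file under `torch_profiler_dir` returns `(name, total_microseconds)` pairs.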

**Note**: vLLM-Omni reuses the PyTorch Profiler infrastructure from vLLM. See the official vLLM profiler documentation: [vLLM Profiling Guide](https://docs.vllm.ai/en/stable/contributing/profiling/)
vLLM-Omni reuses the vLLM profiling infrastructure where possible. For the upstream reference, see the [vLLM profiling guide](https://docs.vllm.ai/en/stable/contributing/profiling/).