vllm-project · hsliuustc0106 · Apr 10, 2026 · Jan 30, 2026 · Jan 30, 2026 · Feb 3, 2026
@@ -1,216 +1,192 @@
 # Profiling vLLM-Omni
 
-> **Warning:** Profiling incurs significant overhead. Use only for development and debugging, never in production.
+> **Warning:** Profiling is for development and debugging only. It adds significant overhead and should not be enabled in production.
 
-vLLM-Omni uses the PyTorch Profiler to analyze performance across both **multi-stage omni-modality models** and **diffusion models**.
+vLLM-Omni supports two profiler backends through `profiler_config`:
 
-### 1. Configure Profiling in the Stage YAML
+- `torch`: detailed CPU/CUDA traces written to `torch_profiler_dir`
+- `cuda`: low-overhead CUDA range control for NVIDIA Nsight Systems (`nsys`)
 
-Enable profiling by adding `profiler_config` under `engine_args` for the stage(s) you want to profile in your stage config YAML:
+## 1. Configure Profiling
+
+Every entry point (stage YAML, `Omni(...)`, and `vllm serve`) accepts the same `profiler_config` structure:
+
+```yaml
+profiler_config:
+  profiler: torch
+  torch_profiler_dir: ./perf
+```
+
+Supported fields:
+
+| Field | Description |
+|---|---|
+| `profiler` | Profiler backend. Supported values: `torch`, `cuda`. |
+| `torch_profiler_dir` | Output directory for torch traces. Required when `profiler: torch`. |
+| `delay_iterations` | Number of worker iterations to skip before profiling starts. |
+| `max_iterations` | Maximum number of worker iterations to capture before auto-stop. |
+| `warmup_iterations` | Torch-profiler warmup iterations. |
+| `active_iterations` | Torch-profiler active iterations. |
+| `wait_iterations` | Torch-profiler wait iterations before warmup. |
+
+For multi-stage omni pipelines, put `profiler_config` under the target stage's `engine_args`.
 
 ```yaml
 stage_args:
   - stage_id: 0
     stage_type: llm
     engine_args:
-      # ... other engine args ...
       profiler_config:
         profiler: torch
         torch_profiler_dir: ./perf
 ```
 
-| Field | Description |
-|---|---|
-| `profiler` | Profiler backend to use. Currently supports `torch`. |
-| `torch_profiler_dir` | Directory where trace files are saved. Created automatically if it doesn't exist. |
-
-> **Tip:** Only enable `profiler_config` on stages you actually need to profile. Stages without it will not start a profiler, keeping overhead minimal.
-
-### 2. Profiling Omni-Modality Models
+For single-stage diffusion usage, pass `profiler_config` directly to `Omni(...)` or `vllm serve`.
 
-**Selective Stage Profiling**
+## 2. Profiling Omni Pipelines
 
-It is highly recommended to profile specific stages to prevent producing overly large trace files:
+Profile only the stages you need; tracing every stage can produce very large trace files.
 
 ```python
-# Profile all stages
-omni_llm.start_profile()
+# Profile all stages.
+omni.start_profile()
 
-# Only profile Stage 1
-omni_llm.start_profile(stages=[1])
-
-# Stage 0 (Thinker) and Stage 2 (Audio Decoder) for qwen omni
-omni_llm.start_profile(stages=[0, 2])
+# Profile selected stages only.
+omni.start_profile(stages=[0, 2])
+...
+omni.stop_profile(stages=[0, 2])
 ```
 
-> **Important:** Always pass the same `stages` list to both `start_profile()` and `stop_profile()`. If you omit `stages` from `stop_profile()`, it defaults to stopping all stages — including ones that were never started — which will produce errors.
-
-**Python Usage**: Wrap your generation logic with `start_profile()` and `stop_profile()`.
+Always stop the same stage set that you started. If only some stages have `profiler_config`, pass an explicit `stages=[...]` list to both `start_profile()` and `stop_profile()` rather than relying on the default, which targets all stages.
 
-```python
-profiler_stages = [0]  # Only profile the stages you need
+Examples:
 
-# 1. Start profiling
-omni.start_profile(stages=profiler_stages)
+1. [Qwen2.5-Omni end2end](https://github.com/vllm-project/vllm-omni/blob/main/examples/offline_inference/qwen2_5_omni/end2end.py)
+2. [Qwen3-Omni end2end](https://github.com/vllm-project/vllm-omni/blob/main/examples/offline_inference/qwen3_omni/end2end.py)
 
-# Initialize generator
-omni_generator = omni.generate(prompts, sampling_params_list, py_generator=args.py_generator)
+## 3. Profiling Single-Stage Diffusion
 
-total_requests = len(prompts)
-processed_count = 0
+Single-stage diffusion models use the same `start_profile()` / `stop_profile()` controls, but you must provide `profiler_config` explicitly.
 
-# Main Processing Loop
-for stage_outputs in omni_generator:
+### PyTorch profiler
 
-    # ... [Output processing logic for text/audio would go here] ...
+```python
+from vllm_omni import Omni
+
+omni = Omni(
+    model="Wan-AI/Wan2.2-I2V-A14B-Diffusers",
+    profiler_config={
+        "profiler": "torch",
+        "torch_profiler_dir": "./perf",
+    },
+)
+
+omni.start_profile()
+...
+omni.stop_profile()
+```
 
-    # Update count to track when to stop profiling
-    processed_count += len(stage_outputs.request_output)
+### Nsight Systems (`nsys`)
 
-    # 2. Check if all requests are done to stop the profiler safely
-    if profiler_enabled and processed_count >= total_requests:
-        print(f"[Info] Processed {processed_count}/{total_requests}. Stopping profiler inside active loop...")
+For Nsight Systems, use `profiler: cuda` and wrap the process with `nsys profile`.
 
-        # Stop the profiler while workers are still active
-        # Pass the same stages list used in start_profile()
-        omni_llm.stop_profile(stages=profiler_stages)
+```bash
+nsys profile \
+  --trace-fork-before-exec=true \
+  --cuda-graph-trace=node \
+  --capture-range=cudaProfilerApi \
+  --capture-range-end=repeat \
+  -o diffusion_trace \
+  python image_to_video.py ...
+```
 
-        # Wait for traces to flush to disk
-        print("[Info] Waiting 30s for workers to write trace files to disk...")
-        time.sleep(30)
-        print("[Info] Trace export wait time finished.")
+The Python process being profiled must create the diffusion engine with:
 
-omni_llm.close()
+```python
+profiler_config={"profiler": "cuda"}
 ```
 
+Then call `start_profile()` before the requests you want to capture and `stop_profile()` after them. The diffusion worker processes open and close the CUDA capture range themselves, so `nsys` sees the actual GPU work instead of only the parent process.
 
-**CLI Usage** (using `end2end.py`):
-```bash
-# Profile only Stage 0 (Thinker)
-python end2end.py --output-wav output_audio \
-    --query-type text --enable-profiler --profiler-stages 0
+Examples:
 
-# Profile Stage 0 and Stage 2
-python end2end.py --output-wav output_audio \
-    --query-type text --enable-profiler --profiler-stages 0 2
+1. [Image edit example](https://github.com/vllm-project/vllm-omni/blob/main/examples/offline_inference/image_to_image/image_edit.py)
+2. [Image to video example](https://github.com/vllm-project/vllm-omni/tree/main/examples/offline_inference/image_to_video)
 
-# Profile all stages (omit --profiler-stages)
-python end2end.py --output-wav output_audio \
-    --query-type text --enable-profiler
-```
+## 4. Profiling Online Serving
 
-**Examples**:
+When any stage has `profiler_config.profiler` set, the server exposes:
 
-1. **Qwen2.5-Omni**:  [https://github.com/vllm-project/vllm-omni/blob/main/examples/offline_inference/qwen2_5_omni/end2end.py](https://github.com/vllm-project/vllm-omni/blob/main/examples/offline_inference/qwen2_5_omni/end2end.py)
+- `POST /start_profile`
+- `POST /stop_profile`
 
-2. **Qwen3-Omni**:   [https://github.com/vllm-project/vllm-omni/blob/main/examples/offline_inference/qwen3_omni/end2end.py](https://github.com/vllm-project/vllm-omni/blob/main/examples/offline_inference/qwen3_omni/end2end.py)
+### Start the server
 
-### 3. Profiling diffusion models
+Multi-stage omni serving:
 
-Diffusion profiling is End-to-End, capturing encoding, denoising loops, and decoding. Standalone diffusion scripts use `--profiler-dir` to enable profiling.
-
-**CLI Usage:**
 ```bash
-python image_to_video.py \
-    --model Wan-AI/Wan2.2-I2V-A14B-Diffusers \
-    --image qwen-bear.png \
-    --prompt "A cat playing with yarn, smooth motion" \
-    --profiler-dir \
-    \
-    # Minimize Spatial Dimensions (Optional but helpful):
-    #    Drastically reduces memory usage so the profiler doesn't
-    #    crash due to overhead, though for accurate performance
-    #    tuning you often want target resolutions.
-    --height 48 \
-    --width 64 \
-    \
-    # Minimize Temporal Dimension (Frames):
-    #    Video models process 3D tensors (Time, Height, Width).
-    #    Reducing frames to the absolute minimum (2) keeps the
-    #    tensor size small, ensuring the trace file doesn't become
-    #    multi-gigabytes in size.
-    --num-frames 2 \
-    \
-    # Minimize Iteration Loop (Steps):
-    #    This is the most critical setting for profiling.
-    #    Diffusion models run the same loop X times.
-    #    Profiling 2 steps gives you the exact same performance
-    #    data as 50 steps, but saves minutes of runtime and
-    #    prevents the trace viewer from freezing.
-    --num-inference-steps 2 \
-    \
-    --guidance-scale 5.0 \
-    --guidance-scale-high 6.0 \
-    --boundary-ratio 0.875 \
-    --flow-shift 12.0 \
-    --fps 16 \
-    --output i2v_output.mp4
+vllm serve Qwen/Qwen2.5-Omni-7B \
+  --omni \
+  --stage-configs-path qwen2_5_omni.yaml \
+  --port 8091
 ```
 
-> **Note:** For diffusion stages within a multi-stage omni pipeline, use `profiler_config` in the stage YAML instead (see Section 1).
-
-**Examples**:
-
-1. **Qwen image edit**:  [https://github.com/vllm-project/vllm-omni/blob/main/examples/offline_inference/image_to_image/image_edit.py](https://github.com/vllm-project/vllm-omni/blob/main/examples/offline_inference/image_to_image/image_edit.py)
-
-2. **Wan-AI/Wan2.2-I2V-A14B-Diffusers**:   [https://github.com/vllm-project/vllm-omni/tree/main/examples/offline_inference/image_to_video](https://github.com/vllm-project/vllm-omni/tree/main/examples/offline_inference/image_to_video)
-
-### 4. Profiling Online Serving
-
-When `profiler_config` is set in the stage YAML, the server automatically exposes `/start_profile` and `/stop_profile` HTTP endpoints.
+Single-stage diffusion serving with torch profiler:
 
-**1. Start the server** with a stage YAML that has `profiler_config` enabled:
 ```bash
-vllm serve Qwen/Qwen2.5-Omni-7B \
-    --omni \
-    --stage-configs-path qwen2_5_omni.yaml \
-    --port 8091
+vllm serve Wan-AI/Wan2.2-I2V-A14B-Diffusers \
+  --omni \
+  --port 8091 \
+  --profiler-config '{"profiler": "torch", "torch_profiler_dir": "./vllm_profile"}'
 ```
 
-Or for one stage diffusion models:
+Single-stage diffusion serving with Nsight Systems:
 
 ```bash
-vllm serve Wan-AI/Wan2.2-I2V-A14B-Diffusers --omni --port 8091 --profiler-config '{"profiler": "torch", "torch_profiler_dir": "./vllm_profile"}'
+nsys profile \
+  --trace-fork-before-exec=true \
+  --cuda-graph-trace=node \
+  --capture-range=cudaProfilerApi \
+  --capture-range-end=repeat \
+  -o serving_trace \
+  vllm serve Wan-AI/Wan2.2-I2V-A14B-Diffusers \
+    --omni \
+    --port 8091 \
+    --profiler-config '{"profiler": "cuda"}'
 ```
 
-**2. Start profiling** by sending a POST request:
+### Control capture
+
 ```bash
-# Profile all stages that have profiler_config set
+# Start profiling on all profiled stages.
 curl -X POST http://localhost:8091/start_profile
 
-# Profile specific stages only
+# Start profiling on selected stages.
 curl -X POST http://localhost:8091/start_profile \
-    -H "Content-Type: application/json" \
-    -d '{"stages": [0]}'
-```
+  -H "Content-Type: application/json" \
+  -d '{"stages": [0]}'
 
-**3. Send your inference requests** as normal while the profiler is running.
-
-**4. Stop profiling** and collect traces:
-```bash
-# Stop all stages
+# Stop profiling on all profiled stages.
 curl -X POST http://localhost:8091/stop_profile
-
-# Stop specific stages (must match the stages you started)
-curl -X POST http://localhost:8091/stop_profile \
-    -H "Content-Type: application/json" \
-    -d '{"stages": [0]}'
 ```
 
-Trace files are written to the `torch_profiler_dir` specified in your stage YAML.
+For pipelines where only some stages have `profiler_config`, send an explicit `stages` list and pass the same list to both endpoints.
+
+## 5. Analyze Results
 
-> **Important:** Always stop the same stages you started. Stopping a stage that was never started will produce errors.
+Torch profiler output:
 
-### 5. Analyzing Traces
+- Chrome/Perfetto traces under `torch_profiler_dir`
+- Optional aggregated CUDA-time tables under the same directory
 
-Output files are saved to the `torch_profiler_dir` specified in your stage YAML config.
+CUDA profiler / Nsight Systems output:
 
-**Output**
-**Chrome Trace** (`.json.gz`): Visual timeline of kernels and stages. Open in Perfetto UI.
+- `.nsys-rep` report files written by `nsys -o ...`
 
-**Viewing Tools:**
+Recommended viewers:
 
-- [Perfetto](https://ui.perfetto.dev/) (recommended)
-- `chrome://tracing` (Chrome only)
+- [Perfetto](https://ui.perfetto.dev/) for torch traces
+- `nsys stats <report>.nsys-rep` for CLI summaries
+- Nsight Systems GUI for CUDA kernel timelines
 
-**Note**: vLLM-Omni reuses the PyTorch Profiler infrastructure from vLLM. See the official vLLM profiler documentation:  [vLLM Profiling Guide](https://docs.vllm.ai/en/stable/contributing/profiling/)
+vLLM-Omni reuses the vLLM profiling infrastructure where possible. For the upstream reference, see the [vLLM profiling guide](https://docs.vllm.ai/en/stable/contributing/profiling/).
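
The online-serving flow in section 4 of the added text (start, send requests, stop with the same stage list) could be scripted roughly as follows. This is a minimal sketch using only the Python standard library: the `/start_profile` and `/stop_profile` endpoints and the `{"stages": [...]}` body come from the curl examples in the diff, while the helper names `build_profile_request`, `send`, and `profiled` are hypothetical and not part of vLLM-Omni. The response format is not described in the source, so the sketch only returns the HTTP status code.

```python
import json
from contextlib import contextmanager
from urllib import request


def build_profile_request(base_url, action, stages=None):
    """Build the POST request for /start_profile or /stop_profile.

    Hypothetical helper: only the endpoint paths and the JSON body shape
    are taken from the documentation.
    """
    if action not in ("start", "stop"):
        raise ValueError(f"unknown action: {action!r}")
    url = f"{base_url.rstrip('/')}/{action}_profile"
    if stages is None:
        # No body, mirroring `curl -X POST .../start_profile`.
        return request.Request(url, method="POST")
    body = json.dumps({"stages": list(stages)}).encode("utf-8")
    return request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(req, timeout=30.0):
    """Send a prepared request and return the HTTP status code."""
    with request.urlopen(req, timeout=timeout) as resp:
        return resp.status


@contextmanager
def profiled(base_url, stages=None, sender=send):
    """Start profiling on entry and stop the same stage set on exit."""
    sender(build_profile_request(base_url, "start", stages))
    try:
        yield
    finally:
        # Reuse the exact stage list so start and stop always match,
        # even if the body of the `with` block raises.
        sender(build_profile_request(base_url, "stop", stages))
```

A caller would wrap its inference requests in `with profiled("http://localhost:8091", stages=[0]):`. Using a context manager keeps the start and stop stage lists identical by construction, which is the pairing rule the document asks for.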