vllm-project · lishunyang12 · Feb 10, 2026 · Feb 10, 2026 · Feb 10, 2026 · Feb 10, 2026
@@ -2,148 +2,256 @@
 
 > **Warning:** Profiling incurs significant overhead. Use only for development and debugging, never in production.
 
-vLLM-Omni uses the PyTorch Profiler to analyze performance across both **multi-stage omni-modality models** and **diffusion models**.
+vLLM-Omni provides a profiling module (`vllm_omni/profiler/`) aligned with upstream vLLM 0.16.0 semantics. It captures **performance traces** (TensorBoard/Chrome traces) using `tensorboard_trace_handler` and supports delay/max iteration control.
 
-### 1. Set the Output Directory
-Before running any script, set this environment variable. The system detects this and automatically saves traces here.
+## Quick Start
 
-```bash
-export VLLM_TORCH_PROFILER_DIR=./profiles
+```python
+from vllm_omni import Omni
+from vllm_omni.profiler import ProfilerConfig
+
+# Configure profiler at initialization
+omni = Omni(
+    model="Tongyi-MAI/Z-Image-Turbo",
+    profiler_config=ProfilerConfig(
+        profiler="torch",
+        torch_profiler_dir="./profiles",
+    )
+)
+
+# Profile your workload
+omni.start_profile()
+outputs = omni.generate({"prompt": "a cat"}, sampling_params)
+omni.stop_profile()
+
+# Trace files are written to ./profiles/ by each worker
 ```
 
-### 2. Profiling Omni-Modality Models
+## Command Line Usage
 
-It is best to limit profiling to one iteration to keep trace files manageable.
+All offline inference examples support profiling via CLI arguments:
 
 ```bash
-export VLLM_PROFILER_MAX_ITERS=1
+# Enable profiling
+python text_to_image.py --model MODEL --profile-dir ./profiles
 ```
 
-**Selective Stage Profiling**
-The profiler is default to function across all stages. But It is highly recommended to profile specific stages by passing the stages list, preventing from producing too large trace files:
+## ProfilerConfig
+
+```python
+from vllm_omni.profiler import ProfilerConfig
+
+ProfilerConfig(
+    profiler="torch",                          # Required: "torch" or "cuda"
+    torch_profiler_dir="./profiles",           # Required when profiler="torch"
+    torch_profiler_with_stack=True,            # Enable stack tracing
+    torch_profiler_with_flops=False,           # Enable FLOPS counting
+    torch_profiler_use_gzip=True,              # Save traces in gzip format
+    torch_profiler_dump_cuda_time_total=True,  # Dump CUDA time stats on stop
+    torch_profiler_record_shapes=False,        # Record tensor shapes
+    torch_profiler_with_memory=False,          # Enable memory profiling
+    delay_iterations=0,                        # Skip N iterations before starting
+    max_iterations=0,                          # Stop after N iterations (0=unlimited)
+)
+```
+
+### Serialization
+
+`ProfilerConfig` supports `to_dict()` / `from_dict()` for cross-process RPC serialization.
+
+## Output Files
+
+| File | Format | How to View |
+|------|--------|-------------|
+| `*.trace.json.gz` | TensorBoard trace | TensorBoard, chrome://tracing, or ui.perfetto.dev |
+| `profiler_out_*.txt` | CUDA time stats | Any text editor |
+
+---
+
+## Profiling Omni-Modality Models
+
+### Selective Stage Profiling
+
+Profile specific stages to keep trace files manageable:
+
 ```python
 # Profile all stages
-omni_llm.start_profile()
+omni.start_profile()
 
 # Only profile Stage 1
-omni_llm.start_profile(stages=[1])
+omni.start_profile(stages=[1])
+
+# Stage 0 (Thinker) and Stage 2 (Audio Decoder) for Qwen Omni
+omni.start_profile(stages=[0, 2])
 ```
 
-```python
-# Stage 0 (Thinker) and Stage 2 (Audio Decoder) for qwen omni
-omni_llm.start_profile(stages=[0, 2])
+### Examples
+
+- **Qwen2.5-Omni**: [examples/offline_inference/qwen2_5_omni/end2end.py](https://github.com/vllm-project/vllm-omni/blob/main/examples/offline_inference/qwen2_5_omni/end2end.py)
+- **Qwen3-Omni**: [examples/offline_inference/qwen3_omni/end2end.py](https://github.com/vllm-project/vllm-omni/blob/main/examples/offline_inference/qwen3_omni/end2end.py)
+
+---
+
+## Profiling Diffusion Models
+
+Diffusion profiling is end-to-end, capturing encoding, denoising loops, and decoding.
+
+### Minimizing Trace Size
+
+For profiling, minimize dimensions to keep trace files manageable:
+
+```bash
+python image_to_video.py \
+    --model Wan-AI/Wan2.2-I2V-A14B-Diffusers \
+    --image input.png \
+    --prompt "A cat playing with yarn" \
+    --profile-dir ./profiles \
+    \
+    # Minimize dimensions for profiling:
+    --height 48 \
+    --width 64 \
+    --num_frames 2 \
+    --num_inference_steps 2
 ```
 
-**Python Usage**: Wrap your generation logic with `start_profile()` and `stop_profile()`.
+### Examples
 
-```python
-from vllm_omni import omni_llm
+- **Image Edit**: [examples/offline_inference/image_to_image/image_edit.py](https://github.com/vllm-project/vllm-omni/blob/main/examples/offline_inference/image_to_image/image_edit.py)
+- **Image to Video**: [examples/offline_inference/image_to_video/](https://github.com/vllm-project/vllm-omni/tree/main/examples/offline_inference/image_to_video)
+- **Text to Image**: [examples/offline_inference/text_to_image/text_to_image.py](https://github.com/vllm-project/vllm-omni/blob/main/examples/offline_inference/text_to_image/text_to_image.py)
 
-profiler_enabled = bool(os.getenv("VLLM_TORCH_PROFILER_DIR"))
+---
 
-# 1. Start profiling if enabled
-if profiler_enabled:
-    omni_llm.start_profile(stages=[0])
+## Viewing Traces
 
-# Initialize generator
-omni_generator = omni_llm.generate(prompts, sampling_params_list, py_generator=args.py_generator)
+### Performance Traces (`.trace.json.gz`)
 
-total_requests = len(prompts)
-processed_count = 0
+- [TensorBoard](https://www.tensorflow.org/tensorboard) (recommended)
+- [Perfetto UI](https://ui.perfetto.dev/)
+- `chrome://tracing` (Chrome only)
 
-# Main Processing Loop
-for stage_outputs in omni_generator:
+---
 
-    # ... [Output processing logic for text/audio would go here] ...
+## API Reference
 
-    # Update count to track when to stop profiling
-    processed_count += len(stage_outputs.request_output)
+### ProfilerConfig
 
-    # 2. Check if all requests are done to stop the profiler safely
-    if profiler_enabled and processed_count >= total_requests:
-        print(f"[Info] Processed {processed_count}/{total_requests}. Stopping profiler inside active loop...")
+```python
+@dataclass
+class ProfilerConfig:
+    profiler: Literal["torch", "cuda"] | None = None
+    torch_profiler_dir: str = ""
+    torch_profiler_with_stack: bool = True
+    torch_profiler_with_flops: bool = False
+    torch_profiler_use_gzip: bool = True
+    torch_profiler_dump_cuda_time_total: bool = True
+    torch_profiler_record_shapes: bool = False
+    torch_profiler_with_memory: bool = False
+    delay_iterations: int = 0
+    max_iterations: int = 0
+```
+
+### TorchProfiler
+
+```python
+class TorchProfiler:
+    def __init__(self, config: ProfilerConfig, worker_name: str = "", local_rank: int = 0): ...
+    def start(self) -> None: ...
+    def stop(self) -> None: ...
+    def step(self) -> None: ...
+    def shutdown(self) -> None: ...
+    @property
+    def is_running(self) -> bool: ...
+```
 
-        # Stop the profiler while workers are still active
-        omni_llm.stop_profile()
+### Omni Methods
 
-        # Wait for traces to flush to disk
-        print("[Info] Waiting 30s for workers to write trace files to disk...")
-        time.sleep(30)
-        print("[Info] Trace export wait time finished.")
+```python
+# Start profiling for specified stages (None = all)
+omni.start_profile(stages: list[int] | None = None) -> None
 
-omni_llm.close()
+# Stop profiling for specified stages (None = all)
+omni.stop_profile(stages: list[int] | None = None) -> None
 ```
 
+---
 
-**Examples**:
+## Best Practices
 
-1. **Qwen2.5-Omni**:  [https://github.com/vllm-project/vllm-omni/blob/main/examples/offline_inference/qwen2_5_omni/end2end.py](https://github.com/vllm-project/vllm-omni/blob/main/examples/offline_inference/qwen2_5_omni/end2end.py)
+1. **Profile specific stages**: Use `omni.start_profile(stages=[0])` to reduce overhead and file size
 
-2. **Qwen3-Omni**:   [https://github.com/vllm-project/vllm-omni/blob/main/examples/offline_inference/qwen3_omni/end2end.py](https://github.com/vllm-project/vllm-omni/blob/main/examples/offline_inference/qwen3_omni/end2end.py)
+2. **Minimize dimensions for diffusion**: Use small height/width/frames/steps when profiling
 
+3. **Compare before/after**: Profile before and after optimizations to measure impact
 
-### 3. Profiling diffusion models
+4. **Use during development only**: Disable profiling in production for performance
 
-Diffusion profiling is End-to-End, capturing encoding, denoising loops, and decoding.
+---
 
-**CLI Usage:**
-```python
+## Troubleshooting
 
-python image_to_video.py \
-    --model Wan-AI/Wan2.2-I2V-A14B-Diffusers \
-    --image qwen-bear.png \
-    --prompt "A cat playing with yarn, smooth motion" \
-    \
-    # Minimize Spatial Dimensions (Optional but helpful):
-    #    Drastically reduces memory usage so the profiler doesn't
-    #    crash due to overhead, though for accurate performance
-    #    tuning you often want target resolutions.
-    --height 48 \
-    --width 64 \
-    \
-    # Minimize Temporal Dimension (Frames):
-    #    Video models process 3D tensors (Time, Height, Width).
-    #    Reducing frames to the absolute minimum (2) keeps the
-    #    tensor size small, ensuring the trace file doesn't become
-    #    multi-gigabytes in size.
-    --num_frames 2 \
-    \
-    # Minimize Iteration Loop (Steps):
-    #    This is the most critical setting for profiling.
-    #    Diffusion models run the same loop X times.
-    #    Profiling 2 steps gives you the exact same performance
-    #    data as 50 steps, but saves minutes of runtime and
-    #    prevents the trace viewer from freezing.
-    --num_inference_steps 2 \
-    \
-    --guidance_scale 5.0 \
-    --guidance_scale_high 6.0 \
-    --boundary_ratio 0.875 \
-    --flow_shift 12.0 \
-    --fps 16 \
-    --output i2v_output.mp4
+| Issue | Cause | Solution |
+|-------|-------|----------|
+| Import error | Missing module | Check `vllm_omni/profiler/__init__.py` |
+| OOM during profiling | Profiler overhead | Reduce model dimensions |
+| Huge trace files | Too many steps/frames | Reduce `num_inference_steps`, `num_frames` |
+
+---
+
+## Online Serving Profiling
 
+When running the vLLM-Omni API server, profiling can be enabled via CLI
+and controlled via HTTP endpoints at runtime.
+
+### Starting the Server with Profiling Enabled
+
+```bash
+python -m vllm_omni.entrypoints.openai.api_server \
+    --model Qwen/Qwen2.5-Omni-7B \
+    --profiler-config profiler=torch,torch_profiler_dir=./profiles
 ```
 
-**Examples**:
+### HTTP Endpoints
 
-1. **Qwen image edit**:  [https://github.com/vllm-project/vllm-omni/blob/main/examples/offline_inference/image_to_image/image_edit.py](https://github.com/vllm-project/vllm-omni/blob/main/examples/offline_inference/image_to_image/image_edit.py)
+| Method | Endpoint | Body | Description |
+|--------|----------|------|-------------|
+| POST | `/start_profile` | `{"stages": [0, 1, 2]}` (optional) | Start profiling |
+| POST | `/stop_profile` | `{"stages": [0, 1, 2]}` (optional) | Stop profiling |
 
-2. **Wan-AI/Wan2.2-I2V-A14B-Diffusers**:   [https://github.com/vllm-project/vllm-omni/tree/main/examples/offline_inference/image_to_video](https://github.com/vllm-project/vllm-omni/tree/main/examples/offline_inference/image_to_video)
+If `stages` is omitted or null, all stages are profiled.
 
-> **Note:**
-As of now, asynchronous (online) profiling is not fully supported in vLLM-Omni. While start_profile() and stop_profile() methods exist, they are only reliable in offline inference scripts (e.g., the provided end2end.py examples). Do not use them in server-mode or streaming scenarios—traces may be incomplete or fail to flush.
+### Stage IDs for Qwen Omni Models
 
-### 4. Analyzing Omni Traces
+| Stage | Qwen2.5-Omni | Qwen3-Omni |
+|-------|-------------|------------|
+| 0 | Thinker (understanding) | Thinker (MoE understanding) |
+| 1 | Talker (text → RVQ codes) | Talker (code predictor) |
+| 2 | Code2Wav (codes → audio) | Code2Wav (codes → audio) |
 
-Output files are saved to your configured ```VLLM_TORCH_PROFILER_DIR```.
+### Examples
 
-**Output**
-**Chrome Trace** (```.json.gz```): Visual timeline of kernels and stages. Open in Perfetto UI.
+```bash
+# Profile all stages (default)
+curl -X POST http://localhost:8000/start_profile
 
-**Viewing Tools:**
+# Profile only the Thinker stage
+curl -X POST http://localhost:8000/start_profile \
+    -H "Content-Type: application/json" \
+    -d '{"stages": [0]}'
+
+# Profile Thinker and Talker stages
+curl -X POST http://localhost:8000/start_profile \
+    -H "Content-Type: application/json" \
+    -d '{"stages": [0, 1]}'
+
+# Stop profiling (traces written to torch_profiler_dir)
+curl -X POST http://localhost:8000/stop_profile
+```
 
-- [Perfetto](https://ui.perfetto.dev/)(recommended)
-- ```chrome://tracing```(Chrome only)
+### Tips
 
-**Note**: vLLM-Omni reuses the PyTorch Profiler infrastructure from vLLM. See the official vLLM profiler documentation:  [vLLM Profiling Guide](https://docs.vllm.ai/en/stable/contributing/profiling/)
+1. **Profile one stage at a time** for smaller, more focused traces
+2. **Profile the Thinker** (stage 0) to analyze LLM bottlenecks
+3. **Profile the Talker** (stage 1) to analyze codec generation
+4. **Profile Code2Wav** (stage 2) to analyze audio synthesis
+5. Trace files are named per-stage (e.g., `stage-0_*.trace.json.gz`)