142 changes: 105 additions & 37 deletions docs/contributing/profiling.md
Original file line number Diff line number Diff line change
Expand Up @@ -4,48 +4,54 @@

vLLM-Omni uses the PyTorch Profiler to analyze performance across both **multi-stage omni-modality models** and **diffusion models**.

### 1. Set the Output Directory
Before running any script, set this environment variable. The system detects this and automatically saves traces here.

```bash
export VLLM_TORCH_PROFILER_DIR=./profiles
### 1. Configure Profiling in the Stage YAML

Enable profiling by adding `profiler_config` under `engine_args` for the stage(s) you want to profile in your stage config YAML:

```yaml
stage_args:
  - stage_id: 0
    stage_type: llm
    engine_args:
      # ... other engine args ...
      profiler_config:
        profiler: torch
        torch_profiler_dir: ./perf
```

### 2. Profiling Omni-Modality Models
| Field | Description |
|---|---|
| `profiler` | Profiler backend to use. Currently supports `torch`. |
| `torch_profiler_dir` | Directory where trace files are saved. Created automatically if it doesn't exist. |

It is best to limit profiling to one iteration to keep trace files manageable.
> **Tip:** Only enable `profiler_config` on stages you actually need to profile. Stages without it will not start a profiler, keeping overhead minimal.

```bash
export VLLM_PROFILER_MAX_ITERS=1
```
### 2. Profiling Omni-Modality Models

**Selective Stage Profiling**
The profiler is default to function across all stages. But It is highly recommended to profile specific stages by passing the stages list, preventing from producing too large trace files:

It is highly recommended to profile specific stages to prevent producing overly large trace files:

```python
# Profile all stages
omni.start_profile()
omni_llm.start_profile()

# Only profile Stage 1
omni.start_profile(stages=[1])
```
omni_llm.start_profile(stages=[1])

```python
# Stage 0 (Thinker) and Stage 2 (Audio Decoder) for qwen omni
omni.start_profile(stages=[0, 2])
omni_llm.start_profile(stages=[0, 2])
```

> **Important:** Always pass the same `stages` list to both `start_profile()` and `stop_profile()`. If you omit `stages` from `stop_profile()`, it defaults to stopping all stages — including ones that were never started — which will produce errors.
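
One way to keep the two calls in sync is to wrap them in a context manager, so the `stages` list is written once and reused on exit. The sketch below is illustrative only: `profiled` and `FakeOmni` are hypothetical names that are not part of vLLM-Omni, and `FakeOmni` exists purely so the snippet runs without loading a model. It assumes only the `start_profile(stages=...)` and `stop_profile(stages=...)` keyword arguments shown above.

```python
from contextlib import contextmanager


@contextmanager
def profiled(omni, stages=None):
    """Start profiling on entry and stop the same stages on exit."""
    omni.start_profile(stages=stages)
    try:
        yield omni
    finally:
        # Reuse the exact list that was passed to start_profile()
        omni.stop_profile(stages=stages)


class FakeOmni:
    """Hypothetical stand-in that only records the calls it receives."""

    def __init__(self):
        self.calls = []

    def start_profile(self, stages=None):
        self.calls.append(("start", stages))

    def stop_profile(self, stages=None):
        self.calls.append(("stop", stages))


fake = FakeOmni()
with profiled(fake, stages=[0, 2]):
    pass  # generation would run here
print(fake.calls)  # [('start', [0, 2]), ('stop', [0, 2])]
```

The usage example in this guide stops the profiler inside the generation loop while the workers are still active, so with a real model the end of the `with` block would need to fall at that same point.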

**Python Usage**: Wrap your generation logic with `start_profile()` and `stop_profile()`.

```python
from vllm_omni.entrypoints.omni import Omni
profiler_stages = [0] # Only profile the stages you need

omni = Omni(model="Qwen/Qwen3-Omni-30B-A3B-Instruct")

profiler_enabled = bool(os.getenv("VLLM_TORCH_PROFILER_DIR"))

# 1. Start profiling if enabled
if profiler_enabled:
omni.start_profile(stages=[0])
# 1. Start profiling
omni.start_profile(stages=profiler_stages)

# Initialize generator
omni_generator = omni.generate(prompts, sampling_params_list, py_generator=args.py_generator)
Expand All @@ -59,42 +65,57 @@ for stage_outputs in omni_generator:
# ... [Output processing logic for text/audio would go here] ...

# Update count to track when to stop profiling
processed_count += 1
processed_count += len(stage_outputs.request_output)

# 2. Check if all requests are done to stop the profiler safely
if profiler_enabled and processed_count >= total_requests:
print(f"[Info] Processed {processed_count}/{total_requests}. Stopping profiler inside active loop...")

# Stop the profiler while workers are still active
omni.stop_profile()
# Pass the same stages list used in start_profile()
omni_llm.stop_profile(stages=profiler_stages)

# Wait for traces to flush to disk
print("[Info] Waiting 30s for workers to write trace files to disk...")
time.sleep(30)
print("[Info] Trace export wait time finished.")

omni.close()
omni_llm.close()
```


**CLI Usage** (using `end2end.py`):
```bash
# Profile only Stage 0 (Thinker)
python end2end.py --output-wav output_audio \
--query-type text --enable-profiler --profiler-stages 0

# Profile Stage 0 and Stage 2
python end2end.py --output-wav output_audio \
--query-type text --enable-profiler --profiler-stages 0 2

# Profile all stages (omit --profiler-stages)
python end2end.py --output-wav output_audio \
--query-type text --enable-profiler
```
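
The three invocations above map onto `argparse` as follows. This is a minimal sketch that copies the two flag definitions added to `end2end.py` in this PR; `build_parser` is a hypothetical helper name used only for illustration.

```python
import argparse


def build_parser():
    """Parser with the two profiling flags, as defined in end2end.py."""
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--enable-profiler",
        action="store_true",
        default=False,
        help="Enables profiling when set.",
    )
    parser.add_argument(
        "--profiler-stages",
        type=int,
        nargs="*",
        default=None,
        help="List of stage IDs to profile. If not set, profiles all stages.",
    )
    return parser


parser = build_parser()

# Explicit stage list
print(parser.parse_args(["--enable-profiler", "--profiler-stages", "0", "2"]).profiler_stages)  # [0, 2]

# Flag omitted: the default None is kept, which means "all stages"
print(parser.parse_args(["--enable-profiler"]).profiler_stages)  # None

# Flag given with no values: nargs="*" yields an empty list, not None
print(parser.parse_args(["--enable-profiler", "--profiler-stages"]).profiler_stages)  # []
```

Because `nargs="*"` accepts zero values, a bare `--profiler-stages` produces `[]` rather than `None`. How `start_profile()` treats an empty list is not stated in this guide, so it is safest to either pass stage IDs or omit the flag entirely.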

**Examples**:

1. **Qwen2.5-Omni**: [https://github.com/vllm-project/vllm-omni/blob/main/examples/offline_inference/qwen2_5_omni/end2end.py](https://github.com/vllm-project/vllm-omni/blob/main/examples/offline_inference/qwen2_5_omni/end2end.py)

2. **Qwen3-Omni**: [https://github.com/vllm-project/vllm-omni/blob/main/examples/offline_inference/qwen3_omni/end2end.py](https://github.com/vllm-project/vllm-omni/blob/main/examples/offline_inference/qwen3_omni/end2end.py)


### 3. Profiling Diffusion Models

Diffusion profiling is End-to-End, capturing encoding, denoising loops, and decoding.
Diffusion profiling is end-to-end, capturing encoding, denoising loops, and decoding. Standalone diffusion scripts use `--profiler-dir` to enable profiling.
Collaborator


It's better to unify the argument to enable profiling for omni and diffusion models.

Collaborator Author


Yes, we are moving toward unified usage. The only difference is in the examples; the configuration method is consistent, e.g. setting the profiler config in the YAML config. (Currently, only diffusion models can pass it via the CLI; omni models depend on the stage CLI refactor.)


**CLI Usage:**
```python

```bash
python image_to_video.py \
--model Wan-AI/Wan2.2-I2V-A14B-Diffusers \
--image qwen-bear.png \
--prompt "A cat playing with yarn, smooth motion" \
--profiler-dir \
\
# Minimize Spatial Dimensions (Optional but helpful):
# Drastically reduces memory usage so the profiler doesn't
Expand Down Expand Up @@ -124,25 +145,72 @@ python image_to_video.py \
--flow-shift 12.0 \
--fps 16 \
--output i2v_output.mp4

```

> **Note:** For diffusion stages within a multi-stage omni pipeline, use `profiler_config` in the stage YAML instead (see Section 1).

**Examples**:

1. **Qwen image edit**: [https://github.com/vllm-project/vllm-omni/blob/main/examples/offline_inference/image_to_image/image_edit.py](https://github.com/vllm-project/vllm-omni/blob/main/examples/offline_inference/image_to_image/image_edit.py)

2. **Wan-AI/Wan2.2-I2V-A14B-Diffusers**: [https://github.com/vllm-project/vllm-omni/tree/main/examples/offline_inference/image_to_video](https://github.com/vllm-project/vllm-omni/tree/main/examples/offline_inference/image_to_video)

### 4. Analyzing Omni Traces
### 4. Profiling Online Serving

When `profiler_config` is set in the stage YAML, the server automatically exposes `/start_profile` and `/stop_profile` HTTP endpoints.

**1. Start the server** with a stage YAML that has `profiler_config` enabled:
```bash
vllm serve Qwen/Qwen2.5-Omni-7B \
--omni \
--stage-configs-path qwen2_5_omni.yaml \
--port 8091
```

Or, for single-stage diffusion models:

```bash
vllm serve Wan-AI/Wan2.2-I2V-A14B-Diffusers --omni --port 8091 --profiler-config '{"profiler": "torch", "torch_profiler_dir": "./vllm_profile"}'
```

**2. Start profiling** by sending a POST request:
```bash
# Profile all stages that have profiler_config set
curl -X POST http://localhost:8091/start_profile

# Profile specific stages only
curl -X POST http://localhost:8091/start_profile \
-H "Content-Type: application/json" \
-d '{"stages": [0]}'
```

**3. Send your inference requests** as normal while the profiler is running.

**4. Stop profiling** and collect traces:
```bash
# Stop all stages
curl -X POST http://localhost:8091/stop_profile

# Stop specific stages (must match the stages you started)
curl -X POST http://localhost:8091/stop_profile \
-H "Content-Type: application/json" \
-d '{"stages": [0]}'
```

Trace files are written to the `torch_profiler_dir` specified in your stage YAML.

> **Important:** Always stop the same stages you started. Stopping a stage that was never started will produce errors.
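
The same start/stop sequence can be driven from Python instead of `curl`. The following is a sketch using only the standard library, assuming the endpoint paths and the `{"stages": [...]}` JSON body shown in the `curl` commands above. The names `build_profile_request`, `send`, and `profile_window` are hypothetical helpers, not part of vLLM-Omni.

```python
import json
import urllib.request


def build_profile_request(base_url, action, stages=None):
    """Build a POST request for /start_profile or /stop_profile."""
    if action not in ("start_profile", "stop_profile"):
        raise ValueError(f"unknown action: {action}")
    url = f"{base_url.rstrip('/')}/{action}"
    if stages is None:
        # No body: mirrors the curl call that omits -d
        return urllib.request.Request(url, method="POST")
    body = json.dumps({"stages": list(stages)}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={"Content-Type": "application/json"},
    )


def send(request, timeout=30):
    """Send the request and return the HTTP status code."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status


def profile_window(base_url, stages, run_requests, sender=send):
    """Start profiling, run the workload, then stop the same stages."""
    sender(build_profile_request(base_url, "start_profile", stages))
    try:
        return run_requests()
    finally:
        # Same stages list as the start call
        sender(build_profile_request(base_url, "stop_profile", stages))
```

A call such as `profile_window("http://localhost:8091", [0], my_workload)` would then bracket `my_workload()` (a placeholder for your own inference requests) with the two POST requests. The `sender` parameter is there so the sequence can be exercised without a running server.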

### 5. Analyzing Traces

Output files are saved to your configured ```VLLM_TORCH_PROFILER_DIR```.
Output files are saved to the `torch_profiler_dir` specified in your stage YAML config.

**Output**
**Chrome Trace** (```.json.gz```): Visual timeline of kernels and stages. Open in Perfetto UI.
**Chrome Trace** (`.json.gz`): Visual timeline of kernels and stages. Open in Perfetto UI.

**Viewing Tools:**

- [Perfetto](https://ui.perfetto.dev/)(recommended)
- ```chrome://tracing```(Chrome only)
- [Perfetto](https://ui.perfetto.dev/) (recommended)
- `chrome://tracing` (Chrome only)

**Note**: vLLM-Omni reuses the PyTorch Profiler infrastructure from vLLM. See the official vLLM profiler documentation: [vLLM Profiling Guide](https://docs.vllm.ai/en/stable/contributing/profiling/)
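
For a quick look at a trace without opening a viewer, the `.json.gz` file can be summarized with the standard library. This is a sketch under stated assumptions: the file follows the Chrome Trace Event format, where complete events have `"ph": "X"` and a `"dur"` field in microseconds. `top_events` is a hypothetical helper name, and the exact event names in a vLLM-Omni trace are not specified here.

```python
import gzip
import json
from collections import defaultdict


def top_events(trace_path, limit=10):
    """Return the `limit` event names with the largest total duration."""
    with gzip.open(trace_path, "rt", encoding="utf-8") as handle:
        trace = json.load(handle)
    # Chrome traces are either {"traceEvents": [...]} or a bare list of events
    events = trace["traceEvents"] if isinstance(trace, dict) else trace
    totals = defaultdict(float)
    for event in events:
        # "X" marks a complete event, which carries a "dur" field
        if event.get("ph") == "X" and "dur" in event:
            totals[event.get("name", "<unnamed>")] += event["dur"]
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)[:limit]
```

Calling `top_events("path/to/trace.json.gz")` returns `(name, total_duration)` pairs sorted from largest to smallest, which can help decide which region of the timeline to inspect in Perfetto.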
19 changes: 16 additions & 3 deletions examples/offline_inference/qwen3_omni/end2end.py
Expand Up @@ -353,9 +353,9 @@ def main(args):
for i, prompt in enumerate(prompts):
prompt["modalities"] = output_modalities

profiler_enabled = bool(os.getenv("VLLM_TORCH_PROFILER_DIR"))
profiler_enabled = args.enable_profiler
if profiler_enabled:
omni.start_profile(stages=[0])
omni.start_profile(stages=args.profiler_stages)
omni_generator = omni.generate(prompts, sampling_params_list, py_generator=args.py_generator)
# Determine output directory: prefer --output-dir; fallback to --output-wav
output_dir = args.output_dir if getattr(args, "output_dir", None) else args.output_wav
Expand Down Expand Up @@ -405,7 +405,7 @@ def main(args):
if profiler_enabled and processed_count >= total_requests:
print(f"[Info] Processed {processed_count}/{total_requests}. Stopping profiler inside active loop...")
# Stop the profiler while workers are still alive
omni.stop_profile()
omni.stop_profile(stages=args.profiler_stages)

print("[Info] Waiting 30s for workers to write trace files to disk...")
time.sleep(30)
Expand Down Expand Up @@ -532,6 +532,19 @@ def parse_args():
action="store_true",
help="Enable diffusion pipeline profiler to display stage durations.",
)
parser.add_argument(
"--enable-profiler",
action="store_true",
default=False,
help="Enables profiling when set.",
)
parser.add_argument(
"--profiler-stages",
type=int,
nargs="*",
default=None,
help="List of stage IDs to profile. If not set, profiles all stages.",
)

return parser.parse_args()

39 changes: 39 additions & 0 deletions tests/conftest.py
Original file line number Diff line number Diff line change
Expand Up @@ -1907,6 +1907,33 @@ def generate_multimodal(
)
return self.generate(omni_inputs, sampling_params_list)

def start_profile(
self,
profile_prefix: str | None = None,
stages: list[int] | None = None,
) -> list[Any]:
"""Start profiling specified stages.

Args:
profile_prefix: Optional prefix for the trace file names.
stages: List of stage IDs to profile. If None, profiles all stages.

Returns:
List of results from each stage.
"""
return self.omni.start_profile(profile_prefix=profile_prefix, stages=stages)

def stop_profile(self, stages: list[int] | None = None) -> list[Any]:
"""Stop profiling specified stages.

Args:
stages: List of stage IDs to stop profiling. If None, stops all stages.

Returns:
List of results from each stage.
"""
return self.omni.stop_profile(stages=stages)

def _cleanup_process(self):
try:
keywords = ["enginecore"]
Expand Down Expand Up @@ -2020,6 +2047,18 @@ def send_request(self, request_config: dict[str, Any] | None = None) -> OmniResp
assert_omni_response(response, request_config, run_level="L2")
return response

def start_profile(
self,
profile_prefix: str | None = None,
stages: list[int] | None = None,
) -> list[Any]:
"""Start profiling specified stages."""
return self.runner.start_profile(profile_prefix=profile_prefix, stages=stages)

def stop_profile(self, stages: list[int] | None = None) -> list[Any]:
"""Stop profiling specified stages."""
return self.runner.stop_profile(stages=stages)


@pytest.fixture
def omni_runner_handler(omni_runner):
11 changes: 5 additions & 6 deletions tests/diffusion/test_multiproc_engine_concurrency.py
Original file line number Diff line number Diff line change
Expand Up @@ -274,20 +274,19 @@ def test_serial_collective_rpc_single_rank(self):
assert result.error == "result_for_Y"

def test_serial_collective_rpc_all_ranks(self):
"""``collective_rpc`` without *unique_reply_rank* collects
``num_gpus`` responses.
"""``collective_rpc`` without *unique_reply_rank* returns a single
response from rank 0 (only rank 0 has a result_mq).
"""
engine, _, _, res_q = _make_engine(num_gpus=2)

# Pre-populate two results (simulating two workers replying)
# Pre-populate one result (only rank 0 replies via result_mq)
res_q.put(_tagged_output("rank0"))
res_q.put(_tagged_output("rank1"))

results = engine.collective_rpc("ping", args=("multi",))

assert len(results) == 2
# Only 1 response expected since only rank 0 has result_mq
assert len(results) == 1
assert results[0].error == "rank0"
assert results[1].error == "rank1"

def test_serial_add_req_then_collective_rpc(self):
engine, _, req_q, res_q = _make_engine()