vllm-project · gcanlin · Mar 24, 2026 · Mar 23, 2026 · Mar 23, 2026 · Mar 23, 2026
@@ -4,48 +4,54 @@
 
 vLLM-Omni uses the PyTorch Profiler to analyze performance across both **multi-stage omni-modality models** and **diffusion models**.
 
-### 1. Set the Output Directory
-Before running any script, set this environment variable. The system detects this and automatically saves traces here.
-
-```bash
-export VLLM_TORCH_PROFILER_DIR=./profiles
+### 1. Configure Profiling in the Stage YAML
+
+Enable profiling by adding `profiler_config` under `engine_args` for the stage(s) you want to profile in your stage config YAML:
+
+```yaml
+stage_args:
+  - stage_id: 0
+    stage_type: llm
+    engine_args:
+      # ... other engine args ...
+      profiler_config:
+        profiler: torch
+        torch_profiler_dir: ./perf
 ```
 
-### 2. Profiling Omni-Modality Models
+| Field | Description |
+|---|---|
+| `profiler` | Profiler backend to use. Currently supports `torch`. |
+| `torch_profiler_dir` | Directory where trace files are saved. Created automatically if it doesn't exist. |
 
-It is best to limit profiling to one iteration to keep trace files manageable.
+> **Tip:** Only enable `profiler_config` on stages you actually need to profile. Stages without it will not start a profiler, keeping overhead minimal.
 
-```bash
-export VLLM_PROFILER_MAX_ITERS=1
-```
+### 2. Profiling Omni-Modality Models
 
 **Selective Stage Profiling**
-The profiler is default to function across all stages. But It is highly recommended to profile specific stages by passing the stages list, preventing from producing too large trace files:
+
+Profile only the stages you need; profiling every stage can produce very large trace files:
+
 ```python
 # Profile all stages
-omni.start_profile()
+omni_llm.start_profile()
 
 # Only profile Stage 1
-omni.start_profile(stages=[1])
-```
+omni_llm.start_profile(stages=[1])
 
-```python
 # Stage 0 (Thinker) and Stage 2 (Audio Decoder) for qwen omni
-omni.start_profile(stages=[0, 2])
+omni_llm.start_profile(stages=[0, 2])
 ```
 
+> **Important:** Always pass the same `stages` list to both `start_profile()` and `stop_profile()`. If you omit `stages` from `stop_profile()`, it defaults to stopping all stages, including ones that were never started, which produces errors.
+
 **Python Usage**: Wrap your generation logic with `start_profile()` and `stop_profile()`.
 
 ```python
-from vllm_omni.entrypoints.omni import Omni
+profiler_enabled = True  # Set to False to skip profiling; checked again in the loop below
+profiler_stages = [0]  # Only profile the stages you need
 
-omni = Omni(model="Qwen/Qwen3-Omni-30B-A3B-Instruct")
-
-profiler_enabled = bool(os.getenv("VLLM_TORCH_PROFILER_DIR"))
-
-# 1. Start profiling if enabled
-if profiler_enabled:
-    omni.start_profile(stages=[0])
+# 1. Start profiling
+omni.start_profile(stages=profiler_stages)
 
 # Initialize generator
 omni_generator = omni.generate(prompts, sampling_params_list, py_generator=args.py_generator)
@@ -59,42 +65,57 @@ for stage_outputs in omni_generator:
     # ... [Output processing logic for text/audio would go here] ...
 
     # Update count to track when to stop profiling
-    processed_count += 1
+    processed_count += len(stage_outputs.request_output)
 
     # 2. Check if all requests are done to stop the profiler safely
     if profiler_enabled and processed_count >= total_requests:
         print(f"[Info] Processed {processed_count}/{total_requests}. Stopping profiler inside active loop...")
 
         # Stop the profiler while workers are still active
-        omni.stop_profile()
+        # Pass the same stages list used in start_profile()
+        omni.stop_profile(stages=profiler_stages)
 
         # Wait for traces to flush to disk
         print("[Info] Waiting 30s for workers to write trace files to disk...")
         time.sleep(30)
         print("[Info] Trace export wait time finished.")
 
 omni.close()
 ```
 
 
+**CLI Usage** (using `end2end.py`):
+```bash
+# Profile only Stage 0 (Thinker)
+python end2end.py --output-wav output_audio \
+    --query-type text --enable-profiler --profiler-stages 0
+
+# Profile Stage 0 and Stage 2
+python end2end.py --output-wav output_audio \
+    --query-type text --enable-profiler --profiler-stages 0 2
+
+# Profile all stages (omit --profiler-stages)
+python end2end.py --output-wav output_audio \
+    --query-type text --enable-profiler
+```
+
 **Examples**:
 
 1. **Qwen2.5-Omni**:  [https://github.com/vllm-project/vllm-omni/blob/main/examples/offline_inference/qwen2_5_omni/end2end.py](https://github.com/vllm-project/vllm-omni/blob/main/examples/offline_inference/qwen2_5_omni/end2end.py)
 
 2. **Qwen3-Omni**:   [https://github.com/vllm-project/vllm-omni/blob/main/examples/offline_inference/qwen3_omni/end2end.py](https://github.com/vllm-project/vllm-omni/blob/main/examples/offline_inference/qwen3_omni/end2end.py)
 
-
 ### 3. Profiling diffusion models
 
-Diffusion profiling is End-to-End, capturing encoding, denoising loops, and decoding.
+Diffusion profiling is end-to-end, capturing encoding, denoising loops, and decoding. Standalone diffusion scripts use `--profiler-dir` to enable profiling.
 
 **CLI Usage:**
-```python
-
+```bash
 python image_to_video.py \
     --model Wan-AI/Wan2.2-I2V-A14B-Diffusers \
     --image qwen-bear.png \
     --prompt "A cat playing with yarn, smooth motion" \
+    --profiler-dir \
     \
     # Minimize Spatial Dimensions (Optional but helpful):
     #    Drastically reduces memory usage so the profiler doesn't
@@ -124,25 +145,72 @@ python image_to_video.py \
     --flow-shift 12.0 \
     --fps 16 \
     --output i2v_output.mp4
-
 ```
 
+> **Note:** For diffusion stages within a multi-stage omni pipeline, use `profiler_config` in the stage YAML instead (see Section 1).
+
 **Examples**:
 
 1. **Qwen image edit**:  [https://github.com/vllm-project/vllm-omni/blob/main/examples/offline_inference/image_to_image/image_edit.py](https://github.com/vllm-project/vllm-omni/blob/main/examples/offline_inference/image_to_image/image_edit.py)
 
 2. **Wan-AI/Wan2.2-I2V-A14B-Diffusers**:   [https://github.com/vllm-project/vllm-omni/tree/main/examples/offline_inference/image_to_video](https://github.com/vllm-project/vllm-omni/tree/main/examples/offline_inference/image_to_video)
 
-### 4. Analyzing Omni Traces
+### 4. Profiling Online Serving
+
+When `profiler_config` is set in the stage YAML, the server automatically exposes `/start_profile` and `/stop_profile` HTTP endpoints.
+
+**1. Start the server** with a stage YAML that has `profiler_config` enabled:
+```bash
+vllm serve Qwen/Qwen2.5-Omni-7B \
+    --omni \
+    --stage-configs-path qwen2_5_omni.yaml \
+    --port 8091
+```
+
+Or, for single-stage diffusion models:
+
+```bash
+vllm serve Wan-AI/Wan2.2-I2V-A14B-Diffusers --omni --port 8091 --profiler-config '{"profiler": "torch", "torch_profiler_dir": "./vllm_profile"}'
+```
+
+**2. Start profiling** by sending a POST request:
+```bash
+# Profile all stages that have profiler_config set
+curl -X POST http://localhost:8091/start_profile
+
+# Profile specific stages only
+curl -X POST http://localhost:8091/start_profile \
+    -H "Content-Type: application/json" \
+    -d '{"stages": [0]}'
+```
+
+**3. Send your inference requests** as normal while the profiler is running.
+
+**4. Stop profiling** and collect traces:
+```bash
+# Stop all stages
+curl -X POST http://localhost:8091/stop_profile
+
+# Stop specific stages (must match the stages you started)
+curl -X POST http://localhost:8091/stop_profile \
+    -H "Content-Type: application/json" \
+    -d '{"stages": [0]}'
+```
+
+Trace files are written to the `torch_profiler_dir` specified in your stage YAML (or in `--profiler-config` for single-stage diffusion models).
+
+> **Important:** Always stop the same stages you started. Stopping a stage that was never started will produce errors.
+
+### 5. Analyzing Traces
 
-Output files are saved to your configured ```VLLM_TORCH_PROFILER_DIR```.
+Output files are saved to the `torch_profiler_dir` specified in your stage YAML config.
 
 **Output**
-**Chrome Trace** (```.json.gz```): Visual timeline of kernels and stages. Open in Perfetto UI.
+**Chrome Trace** (`.json.gz`): Visual timeline of kernels and stages. Open in Perfetto UI.
 
 **Viewing Tools:**
 
-- [Perfetto](https://ui.perfetto.dev/)(recommended)
-- ```chrome://tracing```(Chrome only)
+- [Perfetto](https://ui.perfetto.dev/) (recommended)
+- `chrome://tracing` (Chrome only)
 
 **Note**: vLLM-Omni reuses the PyTorch Profiler infrastructure from vLLM. See the official vLLM profiler documentation:  [vLLM Profiling Guide](https://docs.vllm.ai/en/stable/contributing/profiling/)
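
The `curl` calls in the online-serving section can also be issued from Python. The following is a minimal sketch using only the standard library; the helper names `build_profile_request` and `send` are hypothetical, and the endpoint paths and the `{"stages": [...]}` body are taken from the documentation above. It assumes the server accepts a POST without a body when all stages should be profiled, as the `curl` examples suggest.

```python
import json
import urllib.request


def build_profile_request(base_url, action, stages=None):
    """Build (but do not send) a POST request for /start_profile or /stop_profile.

    `action` is "start" or "stop". `stages` is an optional list of stage IDs;
    None means no body is sent, mirroring the bare curl call in the docs.
    """
    if action not in ("start", "stop"):
        raise ValueError(f"unknown action: {action!r}")
    url = f"{base_url.rstrip('/')}/{action}_profile"
    if stages is None:
        return urllib.request.Request(url, method="POST")
    body = json.dumps({"stages": list(stages)}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request, timeout=30.0):
    """Send a prepared request and return the HTTP status code."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status
```

A typical session would call `send(build_profile_request("http://localhost:8091", "start", [0]))`, run the inference requests, and then send the matching `"stop"` request with the same stage list.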
@@ -353,9 +353,9 @@ def main(args):
         for i, prompt in enumerate(prompts):
             prompt["modalities"] = output_modalities
 
-    profiler_enabled = bool(os.getenv("VLLM_TORCH_PROFILER_DIR"))
+    profiler_enabled = args.enable_profiler
     if profiler_enabled:
-        omni.start_profile(stages=[0])
+        omni.start_profile(stages=args.profiler_stages)
     omni_generator = omni.generate(prompts, sampling_params_list, py_generator=args.py_generator)
     # Determine output directory: prefer --output-dir; fallback to --output-wav
     output_dir = args.output_dir if getattr(args, "output_dir", None) else args.output_wav
@@ -405,7 +405,7 @@ def main(args):
         if profiler_enabled and processed_count >= total_requests:
             print(f"[Info] Processed {processed_count}/{total_requests}. Stopping profiler inside active loop...")
             # Stop the profiler while workers are still alive
-            omni.stop_profile()
+            omni.stop_profile(stages=args.profiler_stages)
 
             print("[Info] Waiting 30s for workers to write trace files to disk...")
             time.sleep(30)
@@ -532,6 +532,19 @@ def parse_args():
         action="store_true",
         help="Enable diffusion pipeline profiler to display stage durations.",
     )
+    parser.add_argument(
+        "--enable-profiler",
+        action="store_true",
+        default=False,
+        help="Enable profiling of the stages selected by --profiler-stages.",
+    )
+    parser.add_argument(
+        "--profiler-stages",
+        type=int,
+        nargs="*",
+        default=None,
+        help="List of stage IDs to profile. If not set, profiles all stages.",
+    )
 
     return parser.parse_args()
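
One detail of the two new flags may be worth checking: with `nargs="*"` and `default=None`, argparse yields `None` when `--profiler-stages` is omitted but an empty list when the flag is given without values. The sketch below reproduces only these two arguments in isolation. The `resolve_stages` helper is hypothetical, and treating an empty list as "all stages" is an assumption, since the diff does not show how an empty list is handled downstream.

```python
import argparse


def build_parser():
    """Recreate just the two profiler flags added in this hunk."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--enable-profiler", action="store_true", default=False)
    parser.add_argument("--profiler-stages", type=int, nargs="*", default=None)
    return parser


def resolve_stages(args):
    """Map an empty --profiler-stages list to None (assumed to mean all stages)."""
    return args.profiler_stages or None


if __name__ == "__main__":
    parser = build_parser()
    for argv in ([], ["--profiler-stages", "0", "2"], ["--profiler-stages"]):
        args = parser.parse_args(argv)
        print(argv, "->", args.profiler_stages, "->", resolve_stages(args))
```

Normalizing before calling `start_profile()` and `stop_profile()` keeps both calls on the same value, whichever form the user typed.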
 

@@ -1907,6 +1907,33 @@ def generate_multimodal(
         )
         return self.generate(omni_inputs, sampling_params_list)
 
+    def start_profile(
+        self,
+        profile_prefix: str | None = None,
+        stages: list[int] | None = None,
+    ) -> list[Any]:
+        """Start profiling specified stages.
+
+        Args:
+            profile_prefix: Optional prefix for the trace file names.
+            stages: List of stage IDs to profile. If None, profiles all stages.
+
+        Returns:
+            List of results from each stage.
+        """
+        return self.omni.start_profile(profile_prefix=profile_prefix, stages=stages)
+
+    def stop_profile(self, stages: list[int] | None = None) -> list[Any]:
+        """Stop profiling specified stages.
+
+        Args:
+            stages: List of stage IDs to stop profiling. If None, stops all stages.
+
+        Returns:
+            List of results from each stage.
+        """
+        return self.omni.stop_profile(stages=stages)
+
     def _cleanup_process(self):
         try:
             keywords = ["enginecore"]
@@ -2020,6 +2047,18 @@ def send_request(self, request_config: dict[str, Any] | None = None) -> OmniResp
         assert_omni_response(response, request_config, run_level="L2")
         return response
 
+    def start_profile(
+        self,
+        profile_prefix: str | None = None,
+        stages: list[int] | None = None,
+    ) -> list[Any]:
+        """Start profiling specified stages."""
+        return self.runner.start_profile(profile_prefix=profile_prefix, stages=stages)
+
+    def stop_profile(self, stages: list[int] | None = None) -> list[Any]:
+        """Stop profiling specified stages."""
+        return self.runner.stop_profile(stages=stages)
+
 
 @pytest.fixture
 def omni_runner_handler(omni_runner):
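
Since the wrappers above expose `start_profile()` and `stop_profile()` with matching `stages` parameters, tests could pair them through a small context manager so that the stop call always receives the same stage list as the start call. This is a sketch only: `profiled` and `RecordingRunner` are hypothetical names, and the stand-in runner merely records calls rather than touching a real engine.

```python
from contextlib import contextmanager


@contextmanager
def profiled(runner, stages=None, profile_prefix=None):
    """Start profiling on entry and stop the same stages on exit."""
    # Copy once so start and stop see identical stage IDs even if the
    # caller mutates its list while the block is running.
    stage_ids = None if stages is None else list(stages)
    runner.start_profile(profile_prefix=profile_prefix, stages=stage_ids)
    try:
        yield runner
    finally:
        runner.stop_profile(stages=stage_ids)


class RecordingRunner:
    """Hypothetical stand-in that records calls instead of profiling."""

    def __init__(self):
        self.calls = []

    def start_profile(self, profile_prefix=None, stages=None):
        self.calls.append(("start", stages))
        return []

    def stop_profile(self, stages=None):
        self.calls.append(("stop", stages))
        return []
```

The `finally` clause stops the profiler even when the body raises. It does not wait for trace files to be flushed, so the 30-second wait shown in the documentation would still be needed after the block exits.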

@@ -274,20 +274,19 @@ def test_serial_collective_rpc_single_rank(self):
         assert result.error == "result_for_Y"
 
     def test_serial_collective_rpc_all_ranks(self):
-        """``collective_rpc`` without *unique_reply_rank* collects
-        ``num_gpus`` responses.
+        """``collective_rpc`` without *unique_reply_rank* returns a single
+        response from rank 0 (only rank 0 has a result_mq).
         """
         engine, _, _, res_q = _make_engine(num_gpus=2)
 
-        # Pre-populate two results (simulating two workers replying)
+        # Pre-populate one result (only rank 0 replies via result_mq)
         res_q.put(_tagged_output("rank0"))
-        res_q.put(_tagged_output("rank1"))
 
         results = engine.collective_rpc("ping", args=("multi",))
 
-        assert len(results) == 2
+        # Only 1 response expected since only rank 0 has result_mq
+        assert len(results) == 1
         assert results[0].error == "rank0"
-        assert results[1].error == "rank1"
 
     def test_serial_add_req_then_collective_rpc(self):
         engine, _, req_q, res_q = _make_engine()