[Refactor] Unify torch profiler for omni and diffusion models by gcanlin · Pull Request #2099 · vllm-project/vllm-omni

gcanlin · 2026-03-23T11:24:41Z


Co-authored-by: lishunyang <lishunyang12@163.com>

Purpose

Because the change has been almost entirely refactored, I opened a new PR as a follow-up to #1261.

Close #2088.

Test Plan

Test Result

(APIServer pid=1678194) INFO 03-24 05:23:25 [api_router.py:23] Starting profiler...
(APIServer pid=1678194) INFO 03-24 05:23:25 [api_router.py:25] Profiler started.
(APIServer pid=1678194) INFO:     127.0.0.1:34904 - "POST /start_profile HTTP/1.1" 200 OK
(APIServer pid=1678194) INFO:     127.0.0.1:34914 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=1678194) INFO 03-24 05:23:55 [api_router.py:31] Stopping profiler...
(Worker_TP0 pid=1678621) [2026-03-24 05:23:55] [WARNING] [1678621] profiler.py: Incorrect schedule: Stop profiler while current state is RECORD which may result in incomplete parsed data.
(Worker_TP1 pid=1678622) [2026-03-24 05:23:55] [WARNING] [1678622] profiler.py: Incorrect schedule: Stop profiler while current state is RECORD which may result in incomplete parsed data.
(Worker_TP1 pid=1678622) [2026-03-24 05:23:56] [ERROR] [1678622] profiler.py: The profiling data cannot be parsed during the daemon process, it is recommended that you use an offline parsing interface to parse the collected data.
(Worker_TP1 pid=1678622) For example:
(Worker_TP0 pid=1678621) [2026-03-24 05:23:56] [ERROR] [1678621] profiler.py: The profiling data cannot be parsed during the daemon process, it is recommended that you use an offline parsing interface to parse the collected data.
(Worker_TP1 pid=1678622) from torch_npu.profiler.profiler import analyse
(Worker_TP0 pid=1678621) For example:
(Worker_TP1 pid=1678622) analyse("profiling_data_path")
(Worker_TP0 pid=1678621) from torch_npu.profiler.profiler import analyse
(Worker_TP0 pid=1678621) analyse("profiling_data_path")
(Worker_TP0 pid=1678621) INFO 03-24 05:23:56 [profiler.py:89] NPU profiler stopped. Use offline parsing to analyze: from torch_npu.profiler.profiler import analyse; analyse('/root/vllm-workspace/vllm-omni/perf/npu_rank0')
(Worker_TP0 pid=1678621) INFO 03-24 05:23:56 [wrapper.py:66] Profiler stopped successfully.
(Worker pid=1678625) [2026-03-24 05:23:56] [WARNING] [1678625] profiler.py: Incorrect schedule: Stop profiler while current state is RECORD which may result in incomplete parsed data.
(Worker pid=1678625) [2026-03-24 05:23:56] [ERROR] [1678625] profiler.py: The profiling data cannot be parsed during the daemon process, it is recommended that you use an offline parsing interface to parse the collected data.
(Worker pid=1678625) For example:
(Worker pid=1678625) from torch_npu.profiler.profiler import analyse
(Worker pid=1678625) analyse("profiling_data_path")
(Worker pid=1678625) INFO 03-24 05:23:56 [profiler.py:89] NPU profiler stopped. Use offline parsing to analyze: from torch_npu.profiler.profiler import analyse; analyse('/root/vllm-workspace/vllm-omni/vllm_profile/npu_rank0')
(Worker pid=1678625) INFO 03-24 05:23:56 [wrapper.py:66] Profiler stopped successfully.
(Worker pid=1679259) [2026-03-24 05:23:56] [WARNING] [1679259] profiler.py: Incorrect schedule: Stop profiler while current state is RECORD which may result in incomplete parsed data.
(Worker pid=1679259) [2026-03-24 05:23:56] [ERROR] [1679259] profiler.py: The profiling data cannot be parsed during the daemon process, it is recommended that you use an offline parsing interface to parse the collected data.
(Worker pid=1679259) For example:
(Worker pid=1679259) from torch_npu.profiler.profiler import analyse
(Worker pid=1679259) analyse("profiling_data_path")
(Worker pid=1679259) INFO 03-24 05:23:56 [profiler.py:89] NPU profiler stopped. Use offline parsing to analyze: from torch_npu.profiler.profiler import analyse; analyse('/root/vllm-workspace/vllm-omni/vllm_profile/npu_rank0')
(Worker pid=1679259) INFO 03-24 05:23:56 [wrapper.py:66] Profiler stopped successfully.
(APIServer pid=1678194) INFO 03-24 05:23:56 [api_router.py:33] Profiler stopped.
(APIServer pid=1678194) INFO:     127.0.0.1:47542 - "POST /stop_profile HTTP/1.1" 200 OK


Signed-off-by: gcanlin <canlinguosdu@gmail.com>

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 1ac7576f29


chatgpt-codex-connector · 2026-03-23T11:33:22Z

+            from vllm_omni.profiler import OmniTorchProfilerWrapper
+
+            if isinstance(self.profiler, OmniTorchProfilerWrapper):
+                filename = profile_prefix or f"stage_llm_{int(time.time())}"


Include stage id in default profile trace filename

When start_profile() is called without a profile_prefix (the default path for profiling all stages), each LLM stage worker uses the same second-level default name (stage_llm_<timestamp>). Because trace export appends only _rank{local_rank} afterward, two stages running on the same local rank can write the same output path and overwrite each other’s trace, so multi-stage profiling silently loses data. This is especially likely because collective RPC starts all stages nearly simultaneously; include a stage-unique component (e.g., stage_id or PID) in the default filename.
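A minimal sketch of the suggested fix, assuming a hypothetical helper `default_trace_name` and assuming a `stage_id` is available to the worker (neither name comes from the PR diff). It falls back to the PID when no stage id is known, so two stages started in the same second still get distinct names:

```python
import os
import time
from typing import Optional


def default_trace_name(stage_id: Optional[int] = None,
                       profile_prefix: Optional[str] = None) -> str:
    """Hypothetical helper: build a default trace name that is unique per stage."""
    if profile_prefix:
        # An explicit prefix from the caller always wins.
        return profile_prefix
    # Assumption: either the stage id or the process id distinguishes stages
    # that share the same local rank.
    unique = f"stage{stage_id}" if stage_id is not None else f"pid{os.getpid()}"
    return f"stage_llm_{unique}_{int(time.time())}"
```

The `_rank{local_rank}` suffix described above would still be appended by the export step; this sketch only changes the part before it.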


Signed-off-by: gcanlin <canlinguosdu@gmail.com>

hsliuustc0106 · 2026-03-23T12:43:14Z

Please attach your test example and results.

hsliuustc0106

Review blocked until the PR is mergeable and required checks pass.

DCO: SUCCESS
pre-commit: SUCCESS
mergeability: CONFLICTING

This is also a substantial change set (>1000 LOC / >10 files). After the conflicts are resolved, please include concrete L3 test commands/results in the PR description so the full review can focus on behavior instead of missing validation evidence.

hsliuustc0106 · 2026-03-23T14:48:48Z

🔍 Code Review

Great refactoring! The unified profiler interface is much cleaner. However, I found some blocking issues that need to be addressed:

🔴 BLOCKING ISSUES

1. Breaking Changes Without Migration Guide

Issue: This PR removes public classes `ProfilerBase` and `TorchProfiler`, and changes API signatures.

Removed Classes:

`ProfilerBase` (deleted)
`TorchProfiler` (deleted)

API Signature Changes:
```python
# Old API
def start_profile(trace_path_template: str) -> str:
    ...

# New API
def start_profile(profile_prefix: str | None = None, stages: list[int] | None = None) -> list[Any]:
    ...
```

Impact: Users who depend on the old profiler classes or API will experience breaking changes.

Required Fix:

Add a `DEPRECATED` notice period for the old classes (if possible)
Add a migration guide in the documentation showing:
- How to migrate from the old API to the new API
- Code examples before/after
- Timeline for deprecation (if applicable)
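One possible shape for such a deprecation period, as a sketch only: a thin shim that accepts the old argument, warns, and forwards to the new signature. The name `start_profile_legacy`, the stand-in body of `start_profile`, and the mapping of `trace_path_template` onto `profile_prefix` are all assumptions for illustration, not code from this PR:

```python
import warnings
from typing import Any, List, Optional


def start_profile(profile_prefix: Optional[str] = None,
                  stages: Optional[List[int]] = None) -> List[Any]:
    """Stand-in for the new API; the real one starts profilers on the stages."""
    return [{"profile_prefix": profile_prefix, "stages": stages}]


def start_profile_legacy(trace_path_template: str) -> str:
    """Hypothetical shim: accept the old argument and forward to the new API."""
    warnings.warn(
        "trace_path_template is deprecated; pass profile_prefix instead",
        DeprecationWarning,
        stacklevel=2,
    )
    # Assumption: the old template can be reused directly as the new prefix.
    start_profile(profile_prefix=trace_path_template)
    return trace_path_template
```

A migration guide could then show the old call next to `start_profile(profile_prefix=..., stages=[...])` and state when the shim will be removed.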

Location:

Deleted: `vllm_omni/diffusion/profiler/base.py`, `vllm_omni/diffusion/profiler/torch_profiler.py`
Changed: `vllm_omni/entrypoints/omni_base.py`, diffusion engine files

2. Missing CI Test Results

Issue: The PR description mentions testing Omni, Diffusion, and multi-card scenarios, but doesn't include CI test results or links to test logs.

Suggestion: Add links to CI test runs or paste relevant test output to demonstrate:

All tests passing
No performance regressions
Multi-card scenarios working correctly

Location: PR description

🟡 POTENTIAL ISSUES (Non-Blocking)

3. Latent Cache Memory Management in Diffusion

Code (from diff):
```python
# In diffusion_worker.py
if hasattr(self, 'latent_cache'):
    latent = self.latent_cache.pop()  # Pop but never cleared?
    return decode(latent)
```

Concern: The `latent_cache` appears to grow without bounds and is never explicitly cleared.

Suggestion: Add memory management:
```python
def __del__(self):
    if hasattr(self, 'latent_cache'):
        self.latent_cache.clear()  # Clear on cleanup
```

Or add a maximum size limit with LRU eviction policy.
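A sketch of the size-limited alternative, assuming the cache can be keyed (for example by request id). `BoundedLatentCache` is a hypothetical class built on `collections.OrderedDict`, not something present in the diffusion workers:

```python
from collections import OrderedDict


class BoundedLatentCache:
    """Hypothetical LRU cache that holds at most max_size latents."""

    def __init__(self, max_size: int = 4):
        self.max_size = max_size
        self._items = OrderedDict()

    def put(self, key, latent):
        if key in self._items:
            # Re-inserting an existing key marks it as most recently used.
            self._items.move_to_end(key)
        self._items[key] = latent
        while len(self._items) > self.max_size:
            # Evict the least recently used entry (front of the ordering).
            self._items.popitem(last=False)

    def pop(self, key, default=None):
        return self._items.pop(key, default)

    def __len__(self):
        return len(self._items)
```

With a bound in place, memory use is capped at `max_size` latents even if entries are never popped.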

Location: Diffusion worker files (check all worker implementations)

✅ STRENGTHS

Clean Architecture: Unified profiler interface across omni and diffusion models
Good Test Coverage: Added comprehensive unit tests for profiler methods
Excellent Documentation: Updated docs with CLI examples, online serving usage, and troubleshooting tips
Consistent API: New API is more flexible with optional `profile_prefix` and `stages` parameters

📋 VERDICT: REQUEST_CHANGES

Cannot approve until blocking issues are resolved.

Required Actions:

✅ Add migration guide for breaking changes
✅ Provide CI test results or links to test logs
⚠️ Consider adding explicit cleanup for latent cache in diffusion workers (recommended but not blocking)

Next Steps:

Fix the blocking issues above
Re-run tests to verify no regressions
Update documentation with migration guide
Request review from other maintainers

SamitHuang · 2026-03-23T15:50:37Z

 ### 3. Profiling diffusion models

-Diffusion profiling is End-to-End, capturing encoding, denoising loops, and decoding.
+Diffusion profiling is End-to-End, capturing encoding, denoising loops, and decoding. Standalone diffusion scripts use `--profiler-dir` to enable profiling.


It's better to unify the argument to enable profiling for omni and diffusion models.

Yes, we are converging on unified usage. The difference is only in the examples; the way of configuring it is consistent, e.g. setting the profiler config in the YAML config. (Currently only diffusion models can pass it via the CLI; omni models depend on the stage CLI refactor.)

SamitHuang · 2026-03-23T15:53:10Z

-                # Determine which workers we expect responses from
-                num_responses = 1 if unique_reply_rank is not None else self.od_config.num_gpus
+                # Only rank 0 has a result_mq, so we always expect exactly 1 response
+                num_responses = 1


Is this general for batched diffusion requests?

Yes, this change is general and works correctly for batched diffusion requests.

For batched requests, all workers execute the RPC method in parallel (via `exec_all_ranks`), but only rank 0 sends back the result. The old code `1 if unique_reply_rank is not None else self.od_config.num_gpus` was actually buggy: it would wait for `num_gpus` responses when `unique_reply_rank` was None, but only rank 0 would ever respond, causing a potential deadlock or timeout.

Without this change, the profiler hangs when multiple cards are used.
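The failure mode can be reproduced in miniature with a plain `queue.Queue` standing in for the result queue. This is an illustrative sketch, not the engine code; `collect_responses` is a hypothetical helper:

```python
import queue


def collect_responses(result_mq, num_responses, timeout=0.1):
    """Read num_responses items; raises queue.Empty if one never arrives."""
    return [result_mq.get(timeout=timeout) for _ in range(num_responses)]


result_mq = queue.Queue()
result_mq.put("rank0: ok")  # only rank 0 ever replies

# Expecting exactly one response returns as soon as rank 0 answers.
replies = collect_responses(result_mq, num_responses=1)
```

Calling `collect_responses(result_mq, num_responses=2)` on the same single-reply queue raises `queue.Empty` after the timeout; with a blocking read and no timeout it would wait forever, which matches the hang described above.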

Signed-off-by: Samit <285365963@qq.com>

Signed-off-by: gcanlin <canlinguosdu@gmail.com>

gcanlin · 2026-03-24T05:15:28Z

I will complete the examples and tests in a follow-up PR to avoid conflicts with other PRs.

gcanlin · 2026-03-24T05:31:09Z

@hsliuustc0106 I have added the test log. This PR is ready to merge now. The example is long, so I want to commit it under examples/.

Signed-off-by: gcanlin <canlinguosdu@gmail.com>

Gaohan123

LGTM. Please resolve the remaining TODOs in a follow-up PR.

…roject#2099) Signed-off-by: gcanlin <canlinguosdu@gmail.com> Signed-off-by: Samit <285365963@qq.com> Co-authored-by: Samit <285365963@qq.com> Co-authored-by: lishunyang lishunyang12@163.com

gcanlin added 2 commits March 23, 2026 06:49

[Refactor] Unify torch profiler for omni and diffusion models

8de49d7

Signed-off-by: gcanlin <canlinguosdu@gmail.com>

[Refactor] Add profile in entrypoint and engine

1ac7576

Signed-off-by: gcanlin <canlinguosdu@gmail.com>

gcanlin requested a review from hsliuustc0106 as a code owner March 23, 2026 11:24

fix none

1349932

Signed-off-by: gcanlin <canlinguosdu@gmail.com>

chatgpt-codex-connector Bot reviewed Mar 23, 2026


gcanlin mentioned this pull request Mar 23, 2026

[Refactor] Unify torch profiler for omni and diffusion models #1261

Closed

7 tasks

fix lint

617ea13

Signed-off-by: gcanlin <canlinguosdu@gmail.com>

gcanlin added this to the v0.18.0 milestone Mar 23, 2026

gcanlin added the "high priority" label (high priority issue, needs to be done asap) Mar 23, 2026

gcanlin added 2 commits March 23, 2026 12:06

add examples

c6d5892

Signed-off-by: gcanlin <canlinguosdu@gmail.com>

fix trace name

2726ffb

Signed-off-by: gcanlin <canlinguosdu@gmail.com>

hsliuustc0106 reviewed Mar 23, 2026


SamitHuang requested a review from ZJY0516 March 23, 2026 15:04

SamitHuang reviewed Mar 23, 2026


SamitHuang and others added 3 commits March 24, 2026 00:01

fix log for omni models

265a566

Signed-off-by: Samit <285365963@qq.com>

Merge branch 'main' into profiler-refacotr

2c8dcbd

fix log

ae18a47

Signed-off-by: gcanlin <canlinguosdu@gmail.com>

gcanlin added the "ready" label (label to trigger buildkite CI) Mar 24, 2026

gcanlin added 2 commits March 24, 2026 06:01

fix trace name

576761c

Signed-off-by: gcanlin <canlinguosdu@gmail.com>

fix ut

54048ce

Signed-off-by: gcanlin <canlinguosdu@gmail.com>

gcanlin enabled auto-merge (squash) March 24, 2026 06:31

gcanlin requested review from SamitHuang and hsliuustc0106 March 24, 2026 06:40

Gaohan123 reviewed Mar 24, 2026


Comment thread vllm_omni/entrypoints/omni_base.py

Gaohan123 approved these changes Mar 24, 2026


gcanlin merged commit 7217557 into vllm-project:main Mar 24, 2026
8 checks passed

Gaohan123 mentioned this pull request Mar 26, 2026

[Bug][NPU]: When I use an offline script for profile analysis, I am unable to generate a trace file. #1484

Closed

1 task

Copilot AI mentioned this pull request Apr 29, 2026

Cherry-pick latest 4 JianyuLi01 commits from release/v0.17.0rc1 onto release/v0.19.0rc1 JianyuLi01/vllm-omni#2

Merged
