[Refactor] Unify torch profiler for omni and diffusion models #2099

Merged
gcanlin merged 11 commits into vllm-project:main from gcanlin:profiler-refacotr
Mar 24, 2026

Conversation

@gcanlin
Collaborator

@gcanlin gcanlin commented Mar 23, 2026

Co-authored-by: lishunyang lishunyang12@163.com

Purpose

This work has been almost entirely refactored, so I am opening a new PR as a follow-up to #1261.

Close #2088.

Test Plan

  • Omni
    • Offline
    • Online
  • Diffusion
    • Offline
    • Online
  • Multi-cards

Test Result

(APIServer pid=1678194) INFO 03-24 05:23:25 [api_router.py:23] Starting profiler...
(APIServer pid=1678194) INFO 03-24 05:23:25 [api_router.py:25] Profiler started.
(APIServer pid=1678194) INFO:     127.0.0.1:34904 - "POST /start_profile HTTP/1.1" 200 OK
(APIServer pid=1678194) INFO:     127.0.0.1:34914 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=1678194) INFO 03-24 05:23:55 [api_router.py:31] Stopping profiler...
(Worker_TP0 pid=1678621) [2026-03-24 05:23:55] [WARNING] [1678621] profiler.py: Incorrect schedule: Stop profiler while current state is RECORD which may result in incomplete parsed data.
(Worker_TP1 pid=1678622) [2026-03-24 05:23:55] [WARNING] [1678622] profiler.py: Incorrect schedule: Stop profiler while current state is RECORD which may result in incomplete parsed data.
(Worker_TP1 pid=1678622) [2026-03-24 05:23:56] [ERROR] [1678622] profiler.py: The profiling data cannot be parsed during the daemon process, it is recommended that you use an offline parsing interface to parse the collected data.
(Worker_TP1 pid=1678622) For example:
(Worker_TP0 pid=1678621) [2026-03-24 05:23:56] [ERROR] [1678621] profiler.py: The profiling data cannot be parsed during the daemon process, it is recommended that you use an offline parsing interface to parse the collected data.
(Worker_TP1 pid=1678622) from torch_npu.profiler.profiler import analyse
(Worker_TP0 pid=1678621) For example:
(Worker_TP1 pid=1678622) analyse("profiling_data_path")
(Worker_TP0 pid=1678621) from torch_npu.profiler.profiler import analyse
(Worker_TP0 pid=1678621) analyse("profiling_data_path")
(Worker_TP0 pid=1678621) INFO 03-24 05:23:56 [profiler.py:89] NPU profiler stopped. Use offline parsing to analyze: from torch_npu.profiler.profiler import analyse; analyse('/root/vllm-workspace/vllm-omni/perf/npu_rank0')
(Worker_TP0 pid=1678621) INFO 03-24 05:23:56 [wrapper.py:66] Profiler stopped successfully.
(Worker pid=1678625) [2026-03-24 05:23:56] [WARNING] [1678625] profiler.py: Incorrect schedule: Stop profiler while current state is RECORD which may result in incomplete parsed data.
(Worker pid=1678625) [2026-03-24 05:23:56] [ERROR] [1678625] profiler.py: The profiling data cannot be parsed during the daemon process, it is recommended that you use an offline parsing interface to parse the collected data.
(Worker pid=1678625) For example:
(Worker pid=1678625) from torch_npu.profiler.profiler import analyse
(Worker pid=1678625) analyse("profiling_data_path")
(Worker pid=1678625) INFO 03-24 05:23:56 [profiler.py:89] NPU profiler stopped. Use offline parsing to analyze: from torch_npu.profiler.profiler import analyse; analyse('/root/vllm-workspace/vllm-omni/vllm_profile/npu_rank0')
(Worker pid=1678625) INFO 03-24 05:23:56 [wrapper.py:66] Profiler stopped successfully.
(Worker pid=1679259) [2026-03-24 05:23:56] [WARNING] [1679259] profiler.py: Incorrect schedule: Stop profiler while current state is RECORD which may result in incomplete parsed data.
(Worker pid=1679259) [2026-03-24 05:23:56] [ERROR] [1679259] profiler.py: The profiling data cannot be parsed during the daemon process, it is recommended that you use an offline parsing interface to parse the collected data.
(Worker pid=1679259) For example:
(Worker pid=1679259) from torch_npu.profiler.profiler import analyse
(Worker pid=1679259) analyse("profiling_data_path")
(Worker pid=1679259) INFO 03-24 05:23:56 [profiler.py:89] NPU profiler stopped. Use offline parsing to analyze: from torch_npu.profiler.profiler import analyse; analyse('/root/vllm-workspace/vllm-omni/vllm_profile/npu_rank0')
(Worker pid=1679259) INFO 03-24 05:23:56 [wrapper.py:66] Profiler stopped successfully.
(APIServer pid=1678194) INFO 03-24 05:23:56 [api_router.py:33] Profiler stopped.
(APIServer pid=1678194) INFO:     127.0.0.1:47542 - "POST /stop_profile HTTP/1.1" 200 OK
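
The start/stop sequence in the log above can be driven by a small client. This is a minimal sketch, not the project's own test script: the `/start_profile` and `/stop_profile` paths come from the log, while the host, port, empty request body, and the helper names `build_profile_request` and `profile_window` are assumptions made for illustration.

```python
import urllib.request
from typing import Callable

BASE_URL = "http://127.0.0.1:8000"  # assumed host/port; adjust to your server


def build_profile_request(action: str, base_url: str = BASE_URL) -> urllib.request.Request:
    """Build a POST request for /start_profile or /stop_profile."""
    if action not in ("start", "stop"):
        raise ValueError(f"unknown action: {action!r}")
    # An empty body is an assumption; the log only shows the method and path.
    return urllib.request.Request(f"{base_url}/{action}_profile", data=b"", method="POST")


def profile_window(send_workload: Callable[[], None], base_url: str = BASE_URL) -> None:
    """Start profiling, run the workload, and always stop profiling afterwards."""
    urllib.request.urlopen(build_profile_request("start", base_url)).close()
    try:
        send_workload()  # e.g. one request to /v1/chat/completions
    finally:
        urllib.request.urlopen(build_profile_request("stop", base_url)).close()
```

Stopping in a `finally` block keeps the profiler from staying in the recording state if the workload request fails.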

Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan. Please provide the test scripts & test commands. Please state the reasons if your codes don't require additional test scripts. For test file guidelines, please check the test style doc
  • The test results. Please paste the results comparison before and after, or the e2e results.
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model. Please run mkdocs serve to sync the documentation editions to ./docs.
  • (Optional) Release notes update. If your change is user-facing, please update the release notes draft.

BEFORE SUBMITTING, PLEASE READ https://github.com/vllm-project/vllm-omni/blob/main/CONTRIBUTING.md (anything written below this line will be removed by GitHub Actions)

gcanlin added 2 commits March 23, 2026 06:49
Signed-off-by: gcanlin <canlinguosdu@gmail.com>
Signed-off-by: gcanlin <canlinguosdu@gmail.com>
@gcanlin gcanlin requested a review from hsliuustc0106 as a code owner March 23, 2026 11:24
Signed-off-by: gcanlin <canlinguosdu@gmail.com>

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 1ac7576f29

Comment thread vllm_omni/worker/base.py Outdated
from vllm_omni.profiler import OmniTorchProfilerWrapper

if isinstance(self.profiler, OmniTorchProfilerWrapper):
    filename = profile_prefix or f"stage_llm_{int(time.time())}"
P1 Badge Include stage id in default profile trace filename

When start_profile() is called without a profile_prefix (the default path for profiling all stages), each LLM stage worker uses the same second-level default name (stage_llm_<timestamp>). Because trace export appends only _rank{local_rank} afterward, two stages running on the same local rank can write the same output path and overwrite each other’s trace, so multi-stage profiling silently loses data. This is especially likely because collective RPC starts all stages nearly simultaneously; include a stage-unique component (e.g., stage_id or PID) in the default filename.
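
One way to make the default name stage-unique is sketched below. It is an illustration only: the helper `default_trace_name` and its `stage_id` argument are hypothetical, and the exact naming scheme adopted in the PR may differ.

```python
import os
import time
from typing import Optional


def default_trace_name(stage_id: int, profile_prefix: Optional[str] = None) -> str:
    """Return a trace filename stem; the default is unique per stage and process."""
    if profile_prefix:
        # A caller-supplied prefix is used unchanged, as in the original snippet.
        return profile_prefix
    # stage_id and PID distinguish stages that start within the same second.
    return f"stage_{stage_id}_llm_{int(time.time())}_pid{os.getpid()}"
```

With this scheme, two stages on the same local rank produce different stems even when their timestamps match, so the later `_rank{local_rank}` suffix no longer leads to a shared output path.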

Signed-off-by: gcanlin <canlinguosdu@gmail.com>
@gcanlin gcanlin added this to the v0.18.0 milestone Mar 23, 2026
@gcanlin gcanlin added the high priority high priority issue, needs to be done asap label Mar 23, 2026
gcanlin added 2 commits March 23, 2026 12:06
Signed-off-by: gcanlin <canlinguosdu@gmail.com>
Signed-off-by: gcanlin <canlinguosdu@gmail.com>
@hsliuustc0106
Collaborator

Please attach your test example and results.

Collaborator

@hsliuustc0106 hsliuustc0106 left a comment

Review blocked until the PR is mergeable and required checks pass.

  • DCO: SUCCESS
  • pre-commit: SUCCESS
  • mergeability: CONFLICTING

This is also a substantial change set (>1000 LOC / >10 files). After the conflicts are resolved, please include concrete L3 test commands/results in the PR description so the full review can focus on behavior instead of missing validation evidence.

@hsliuustc0106
Collaborator

🔍 Code Review

Great refactoring! The unified profiler interface is much cleaner. However, I found some blocking issues that need to be addressed:

🔴 BLOCKING ISSUES

1. Breaking Changes Without Migration Guide

Issue: This PR removes public classes `ProfilerBase` and `TorchProfiler`, and changes API signatures.

Removed Classes:

  • `ProfilerBase` (deleted)
  • `TorchProfiler` (deleted)

API Signature Changes:
```python
from typing import Any

# Old API
def start_profile(trace_path_template: str) -> str:
    ...

# New API
def start_profile(profile_prefix: str | None = None, stages: list[int] | None = None) -> list[Any]:
    ...
```

Impact: Users who depend on the old profiler classes or API will experience breaking changes.

Required Fix:

  1. Add a `DEPRECATED` notice period for the old classes (if possible)
  2. Add a migration guide in the documentation showing:
    • How to migrate from the old API to the new API
    • Code examples before/after
    • Timeline for deprecation (if applicable)

Location:

  • Deleted: `vllm_omni/diffusion/profiler/base.py`, `vllm_omni/diffusion/profiler/torch_profiler.py`
  • Changed: `vllm_omni/entrypoints/omni_base.py`, diffusion engine files
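
A migration guide could include a thin compatibility shim along these lines. This is a sketch under stated assumptions, not code from the PR: `make_legacy_start_profile` is a hypothetical helper, and it assumes the old `trace_path_template` can be passed through as the new `profile_prefix`. The return type also changes from `str` to a list, which callers would still need to handle.

```python
import warnings
from typing import Any, Callable, List


def make_legacy_start_profile(
    new_start_profile: Callable[..., List[Any]],
) -> Callable[[str], List[Any]]:
    """Wrap a new-style start_profile so old-style callers keep working."""

    def legacy_start_profile(trace_path_template: str) -> List[Any]:
        warnings.warn(
            "start_profile(trace_path_template) is deprecated; "
            "use start_profile(profile_prefix=..., stages=...)",
            DeprecationWarning,
            stacklevel=2,
        )
        # Assumption: the old template can be reused as the new prefix.
        return new_start_profile(profile_prefix=trace_path_template, stages=None)

    return legacy_start_profile
```

Passing `stages=None` keeps the old behavior of profiling everything, since the new signature treats a missing stage list as the default.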

2. Missing CI Test Results

Issue: The PR description mentions testing Omni, Diffusion, and multi-card scenarios, but doesn't include CI test results or links to test logs.

Suggestion: Add links to CI test runs or paste relevant test output to demonstrate:

  • All tests passing
  • No performance regressions
  • Multi-card scenarios working correctly

Location: PR description


🟡 POTENTIAL ISSUES (Non-Blocking)

3. Latent Cache Memory Management in Diffusion

Code (from diff):
```python
# In diffusion_worker.py
if hasattr(self, 'latent_cache'):
    latent = self.latent_cache.pop()  # Pop but never cleared?
    return decode(latent)
```

Concern: The `latent_cache` appears to grow without bounds and is never explicitly cleared.

Suggestion: Add memory management:
```python
def __del__(self):
    if hasattr(self, 'latent_cache'):
        self.latent_cache.clear()  # Clear on cleanup
```

Or add a maximum size limit with LRU eviction policy.
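
A size-limited cache with LRU eviction might look like the sketch below. The class name `BoundedLatentCache`, its key/value interface, and the default `max_size` are illustrative assumptions; the actual `latent_cache` in the diffusion worker may be a different structure.

```python
from collections import OrderedDict
from typing import Any, Hashable


class BoundedLatentCache:
    """A small LRU cache: the least recently used entry is evicted when full."""

    def __init__(self, max_size: int = 8) -> None:
        if max_size < 1:
            raise ValueError("max_size must be at least 1")
        self.max_size = max_size
        self._items: "OrderedDict[Hashable, Any]" = OrderedDict()

    def put(self, key: Hashable, value: Any) -> None:
        self._items[key] = value
        self._items.move_to_end(key)  # mark as most recently used
        while len(self._items) > self.max_size:
            self._items.popitem(last=False)  # evict least recently used

    def get(self, key: Hashable, default: Any = None) -> Any:
        if key not in self._items:
            return default
        self._items.move_to_end(key)  # a read also counts as a use
        return self._items[key]

    def clear(self) -> None:
        self._items.clear()

    def __len__(self) -> int:
        return len(self._items)
```

Compared with clearing in `__del__`, a hard size limit bounds memory while the worker is still alive, not only at teardown.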

Location: Diffusion worker files (check all worker implementations)


STRENGTHS

  1. Clean Architecture: Unified profiler interface across omni and diffusion models
  2. Good Test Coverage: Added comprehensive unit tests for profiler methods
  3. Excellent Documentation: Updated docs with CLI examples, online serving usage, and troubleshooting tips
  4. Consistent API: New API is more flexible with optional `profile_prefix` and `stages` parameters

📋 VERDICT: REQUEST_CHANGES

Cannot approve until blocking issues are resolved.

Required Actions:

  1. ✅ Add migration guide for breaking changes
  2. ✅ Provide CI test results or links to test logs
  3. ⚠️ Consider adding explicit cleanup for latent cache in diffusion workers (recommended but not blocking)

Next Steps:

  • Fix the blocking issues above
  • Re-run tests to verify no regressions
  • Update documentation with migration guide
  • Request review from other maintainers

@SamitHuang SamitHuang requested a review from ZJY0516 March 23, 2026 15:04
### 3. Profiling diffusion models

Diffusion profiling is End-to-End, capturing encoding, denoising loops, and decoding.
Diffusion profiling is End-to-End, capturing encoding, denoising loops, and decoding. Standalone diffusion scripts use `--profiler-dir` to enable profiling.
Collaborator

It's better to unify the argument to enable profiling for omni and diffusion models.

Collaborator Author

Yes, we are moving toward unified usage. The only difference is in the examples; the configuration method is consistent, e.g. setting the profiler config in the YAML config. (Currently, only diffusion can pass it via the CLI; the omni model depends on the stage CLI refactor.)

# Determine which workers we expect responses from
num_responses = 1 if unique_reply_rank is not None else self.od_config.num_gpus
# Only rank 0 has a result_mq, so we always expect exactly 1 response
num_responses = 1
Collaborator

Is this general for batched diffusion requests?

Collaborator Author

Yes, this change is general and works correctly for batched diffusion requests.

For batched requests, all workers execute the RPC method in parallel (via `exec_all_ranks`), but only rank 0 sends back the result. The old code, `1 if unique_reply_rank is not None else self.od_config.num_gpus`, was actually buggy: it waited for `num_gpus` responses when `unique_reply_rank` was `None`, but only rank 0 ever responds, which could cause a deadlock or timeout.

Collaborator Author

Without this change, the profiler hangs when multiple cards are used.
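
The hang described in this thread can be reproduced with a toy model. The sketch below is not the engine's code: `run_rpc` is a hypothetical function, threads and `queue.Queue` stand in for worker processes and the result message queue, and a short timeout replaces the indefinite wait.

```python
import queue
import threading
from typing import List


def run_rpc(num_workers: int, expected_responses: int, timeout: float = 0.2) -> List[str]:
    """Run a toy RPC on all ranks; only rank 0 replies. Return collected replies."""
    result_q: "queue.Queue[str]" = queue.Queue()

    def worker(rank: int) -> None:
        # Every rank does the work, but only rank 0 owns a result queue.
        if rank == 0:
            result_q.put(f"result from rank {rank}")

    threads = [threading.Thread(target=worker, args=(r,)) for r in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

    replies = []
    for _ in range(expected_responses):
        # Raises queue.Empty if a reply never arrives (the "stuck" case).
        replies.append(result_q.get(timeout=timeout))
    return replies
```

Calling `run_rpc(4, 1)` returns the single reply from rank 0, while `run_rpc(4, 4)` raises `queue.Empty` on the second read because the other three replies never arrive.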

Comment thread vllm_omni/worker/base.py Outdated
SamitHuang and others added 3 commits March 24, 2026 00:01
Signed-off-by: Samit <285365963@qq.com>
Signed-off-by: gcanlin <canlinguosdu@gmail.com>
@gcanlin gcanlin added the ready label to trigger buildkite CI label Mar 24, 2026
@gcanlin
Collaborator Author

gcanlin commented Mar 24, 2026

I will complete the examples and tests in a follow-up PR to avoid conflicts with other PRs.

@gcanlin
Collaborator Author

gcanlin commented Mar 24, 2026

@hsliuustc0106 I added the test log. This PR is ready to merge now. The example is long, so I want to commit it to examples/.

gcanlin added 2 commits March 24, 2026 06:01
Signed-off-by: gcanlin <canlinguosdu@gmail.com>
Signed-off-by: gcanlin <canlinguosdu@gmail.com>
@gcanlin gcanlin enabled auto-merge (squash) March 24, 2026 06:31
Comment thread vllm_omni/entrypoints/omni_base.py
Collaborator

@Gaohan123 Gaohan123 left a comment

LGTM. Please resolve the remaining TODOs in a follow-up PR.

@gcanlin gcanlin merged commit 7217557 into vllm-project:main Mar 24, 2026
8 checks passed
zhangj1an pushed a commit to zhangj1an/vllm-omni that referenced this pull request Mar 26, 2026
…roject#2099)

Signed-off-by: gcanlin <canlinguosdu@gmail.com>
Signed-off-by: Samit <285365963@qq.com>
Co-authored-by: Samit <285365963@qq.com>
Co-authored-by: lishunyang lishunyang12@163.com
lengrongfu pushed a commit to lengrongfu/vllm-omni that referenced this pull request May 1, 2026
…roject#2099)

Signed-off-by: gcanlin <canlinguosdu@gmail.com>
Signed-off-by: Samit <285365963@qq.com>
Co-authored-by: Samit <285365963@qq.com>
Co-authored-by: lishunyang lishunyang12@163.com
clodaghwalsh17 pushed a commit to clodaghwalsh17/nm-vllm-omni-ent that referenced this pull request May 12, 2026
…roject#2099)

Signed-off-by: gcanlin <canlinguosdu@gmail.com>
Signed-off-by: Samit <285365963@qq.com>
Co-authored-by: Samit <285365963@qq.com>
Co-authored-by: lishunyang lishunyang12@163.com
Labels

high priority high priority issue, needs to be done asap ready label to trigger buildkite CI

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[RFC]: Unified Torch Profiler Interface for vLLM-Omni

4 participants