[Refactor] Unify torch profiler for omni and diffusion models#1261
[Refactor] Unify torch profiler for omni and diffusion models#1261gcanlin wants to merge 39 commits into
Conversation
Signed-off-by: gcanlin <canlinguosdu@gmail.com>
|
look forward to this! |
After discussing with @lishunyang12, #1123 would be the first choice, which has included this PR. I submitted it as the minimal change to refactor env to CLI. If #1123 was blocked by some concerns, we could consider merge this PR to avoid blocking upgrading to v0.16.0. @lishunyang12 Any plan to update #1123? |
I don't think so. #1123 is not a high priority feature. Let's do this first |
I will remove memory profiler and priorize on aligning with upstream first. @ZJY0516 @gcanlin Will do it by today. Edit: I close the PR because of too much conflict with the existing related opened ones. |
lishunyang12
left a comment
There was a problem hiding this comment.
Good direction moving from env vars to config. A few issues to sort out before this is mergeable.
Signed-off-by: gcanlin <canlinguosdu@gmail.com>
Signed-off-by: gcanlin <canlinguosdu@gmail.com>
Signed-off-by: gcanlin <canlinguosdu@gmail.com>
Signed-off-by: gcanlin <canlinguosdu@gmail.com>
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 89447840e6
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| if config.profiler == "torch": | ||
| TorchProfiler.set_config(config) |
There was a problem hiding this comment.
Validate non-torch profiler selections for diffusion
configure_profiler now silently ignores any profiler_config.profiler value other than torch, even though this module added get_profiler_class() that explicitly errors for unsupported backends (for example cuda). Because CurrentProfiler remains TorchProfiler, unsupported selections are neither rejected nor switched correctly, which can lead to confusing behavior when users request a non-torch profiler.
Useful? React with 👍 / 👎.
Signed-off-by: gcanlin <canlinguosdu@gmail.com>
Signed-off-by: gcanlin <canlinguosdu@gmail.com>
|
this is quite important to user experience |
Signed-off-by: gcanlin <canlinguosdu@gmail.com>
Signed-off-by: gcanlin <canlinguosdu@gmail.com>
|
Ready for integration ? @gcanlin |
Almost :) I will make it ready today. |
Signed-off-by: gcanlin <canlinguosdu@gmail.com>
Signed-off-by: gcanlin <canlinguosdu@gmail.com>
|
@lishunyang12 @hsliuustc0106 ready to review. |
lishunyang12
left a comment
There was a problem hiding this comment.
Left a couple comments — two bugs that will crash at runtime.
| prompt["modalities"] = output_modalities | ||
|
|
||
| profiler_enabled = bool(os.getenv("VLLM_TORCH_PROFILER_DIR")) | ||
| profiler_enabled = args.enable_profiler is not None |
There was a problem hiding this comment.
args.enable_profiler does not exist — the argparse argument is --profiler-dir, so the attribute is args.profiler_dir. This will crash with AttributeError.
Also --profiler-dir below uses action="store_true" (boolean), but the other examples (text_to_image.py, qwen3_omni/end2end.py) use type=str so it’s an actual directory path. Should be consistent.
| profiler_enabled = args.enable_profiler is not None | |
| profiler_enabled = args.profiler_dir is not None |
There was a problem hiding this comment.
For omni model, will unify to enable_profiler. Because profiler_dir will need to be defined in yaml config.
| def __init__(self, *args, **kwargs): | ||
| super().__init__(*args, **kwargs) | ||
|
|
||
| # Replace vLLM's profiler with platform-specific profiler |
There was a problem hiding this comment.
Missing None check — when profiling is not configured, profiler_config is None and this crashes every worker with AttributeError. The NPU version in platforms/npu/worker/base.py correctly does if profiler_config and profiler_config.profiler == "torch".
| # Replace vLLM's profiler with platform-specific profiler | |
| if profiler_config and profiler_config.profiler == "torch": |
There was a problem hiding this comment.
Good catch. Have been fixed in the new PR.
|
Rebase please. |
|
Important. Needs to be rebased @gcanlin |
|
Based on the latest architecture, I have to rewrite most of code for this feature. |
|
Because this PR has been almost refactored, I open a new PR #2099. Will close this. |
PLEASE FILL IN THE PR DESCRIPTION HERE ENSURING ALL CHECKLIST ITEMS (AT THE BOTTOM) HAVE BEEN CONSIDERED.
Purpose
Remove torch profiler env and refactor it to the way to use profiler config.
Test Plan
Test Result
Essential Elements of an Effective PR Description Checklist
supported_models.mdandexamplesfor a new model.BEFORE SUBMITTING, PLEASE READ https://github.com/vllm-project/vllm-omni/blob/main/CONTRIBUTING.md (anything written below this line will be removed by GitHub Actions)