[API] Add APIs for online profiling of diffusion models#1451
[API] Add APIs for online profiling of diffusion models#1451NickLucche wants to merge 4 commits into
Conversation
Signed-off-by: NickLucche <nlucches@redhat.com>
Signed-off-by: NickLucche <nlucches@redhat.com>
Signed-off-by: NickLucche <nlucches@redhat.com>
|
cc @ZJY0516 |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 6cf2902f75
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| which stops profiling on all workers and collects trace paths. | ||
|
|
||
| """ | ||
| self.engine.stop_profile() |
There was a problem hiding this comment.
Return profiler result dict from AsyncOmniDiffusion.stop_profile
AsyncOmniDiffusion.stop_profile() invokes DiffusionEngine.stop_profile() but does not return its result, so callers always receive None. In the new async stage path, handle_profiler_task_async does result_data = stage_engine.stop_profile() or {} and forwards that to the orchestrator, which means profiling artifacts are always dropped from the response even when traces were successfully written. This breaks the new result-collection flow added in this commit.
Useful? React with 👍 / 👎.
| await engine_client.start_profile() | ||
| logger.info("Profiler started.") | ||
| return Response(status_code=200) |
There was a problem hiding this comment.
Propagate profiling failures instead of always returning 200
The new /start_profile and /stop_profile APIs always return 200 OK once these calls complete, but worker-side profiling errors are caught and only logged in stage handlers, so failures (e.g., invalid/unwritable profiler output path) are silently reported as success to clients. This can invalidate benchmark/profiling runs because automation has no reliable signal that profiling did not actually start/stop correctly.
Useful? React with 👍 / 👎.
|
@vllm-omni-reviewer |
🤖 VLLM-Omni PR ReviewCode Review: Add APIs for Online Profiling of Diffusion Models1. OverviewThis PR adds API endpoints (
Overall Assessment: Positive. The implementation follows existing patterns and provides useful tooling for performance optimization. 2. Code QualityStrengths
IssuesBug: Missing return value propagation ( def stop_profile(self) -> None:
"""Stop profiling and return trace file paths.
...
"""
self.engine.stop_profile()The docstring says "return trace file paths" but the method returns result_data = stage_engine.stop_profile() or {}This will always be Inconsistent profiling state handling ( 3. Architecture & DesignStrengths
SuggestionsConsider adding profiling state tracking: The endpoints don't track whether profiling is already active. Multiple calls to Missing conditional endpoint registration: The PR description shows 404 responses when profiling isn't enabled, but the diff shows unconditional endpoint registration. There may be missing context or the 404 comes from a different mechanism. 4. Security & SafetyConcerns
5. Testing & DocumentationDocumentation
Testing Gaps
6. Specific Suggestions
|
|
closing as @gcanlin PR got merged |
Enable torch profiler for online use-cases, including diffusion-only models, aligning closer with vLLM.
This change will provide useful tooling as set ourselves up for optimizing online serving performance for diffusion models.
Test with
Server:
Client:
While attempting to profile a server without profiling enabled will result in:
cc @DarkLight1337