[Profiler] Add Nsight Systems support for serving #1098
Conversation
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: b23aa54006
```python
if task_type == OmniStageTaskType.PROFILER_START:
    # Signal nsys to begin capturing (no-op if not under nsys)
    try:
        torch.cuda.profiler.start()
        logger.info("[Stage-%s] CUDA profiler started (nsys capture region open)", stage_id)
```
Start CUDA profiler inside diffusion worker processes
This torch.cuda.profiler.start() call runs only in the stage worker process. For diffusion, actual GPU kernels execute in subprocesses spawned by the diffusion executor (e.g., MultiprocDiffusionExecutor → WorkerProc), and those workers never call cudaProfilerStart. With --capture-range=cudaProfilerApi, nsys opens capture ranges per process, so the child processes doing the CUDA work stay closed and the nsys report ends up empty for diffusion workloads. Consider invoking torch.cuda.profiler.start()/stop() in DiffusionWorker.start_profile/stop_profile (or via the RPC path) so the capture range opens in the GPU worker processes.
Useful? React with 👍 / 👎.
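A minimal sketch of what this suggestion describes (the class and method names here are assumptions for illustration, not the repo's exact API): open and close the capture range inside the process that actually launches the CUDA kernels.

```python
# Hedged sketch of the reviewer's suggestion: call cudaProfilerStart/Stop
# (via torch.cuda.profiler) inside the GPU worker process itself, so that
# `nsys profile --capture-range=cudaProfilerApi` opens the capture range
# in the process doing the work. Class/method names are illustrative.
try:
    import torch
    _HAS_CUDA = torch.cuda.is_available()
except ImportError:  # let the sketch run without torch installed
    torch = None
    _HAS_CUDA = False


class DiffusionWorkerSketch:
    def __init__(self) -> None:
        self.profiling = False

    def start_profile(self) -> None:
        # Under nsys with --capture-range=cudaProfilerApi this opens the
        # capture range in *this* process; it is a no-op otherwise.
        if _HAS_CUDA:
            torch.cuda.profiler.start()
        self.profiling = True

    def stop_profile(self) -> None:
        if _HAS_CUDA:
            torch.cuda.profiler.stop()
        self.profiling = False
```

The key point is that these calls run in the subprocess spawned by the diffusion executor, not only in the stage worker process.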
@lishunyang12 @ZJY0516 PTAL if free, thanks!

Please provide an e2e example.

I would recommend splitting this PR into two: one for online serving profiling, and another for the nsys integration.
```python
        launched with ``--capture-range=cudaProfilerApi``) records GPU
        activity from within this worker process.
        """
        if torch.cuda.is_available():
```
Having to check whether CUDA is available every single time adds a lot of noise to the code.
@gcanlin Do you have any suggestions?
Any suggestions here? I am not sure; alternatively, I can revert these and keep them as the previous version.
For omni models, since we reuse vLLM's profiler for now, I don't think we need to add anything to support the CUDA profiler; it is already supported in gpu_worker.py. If we also need it for diffusion, we can refer to vLLM's wrapper implementation. Before that, maybe we should consider unifying the profiler between omni models and diffusion models. cc @lishunyang12
```python
class Worker(WorkerBase):
    ...
    # Torch/CUDA profiler. Enabled and configured through profiler_config.
    self.profiler: Any | None = None
    profiler_config = vllm_config.profiler_config
    if profiler_config.profiler == "torch":
        worker_name = f"{vllm_config.instance_id}-rank-{self.rank}"
        self.profiler = TorchProfilerWrapper(
            profiler_config,
            worker_name=worker_name,
            local_rank=self.local_rank,
            activities=["CPU", "CUDA"],
        )
    elif profiler_config.profiler == "cuda":
        self.profiler = CudaProfilerWrapper(profiler_config)
    else:
        self.profiler = None
```
BTW, #1136 is implementing online profiling :)
I see, I'll remove the online profiling part.
Can you provide test results? A trace graph attached to the description would be good.
I'd love to, but screenshot uploads are blocked by our company's network policy...
```python
        logger.info("Diffusion worker %s: profiler stopped", self.rank)
        return None


    def execute_model(self, req: OmniDiffusionRequest, od_config: OmniDiffusionConfig) -> DiffusionOutput:
```
stop_profile() always returns None, which means DiffusionEngine.stop_profile() never gets any trace paths from workers. The elaborate aggregation logic in the engine becomes dead code.
TorchProfilerWrapper.stop() returns a dict with trace file paths — please return that result instead of discarding it:

```python
def stop_profile(self) -> dict | None:
    if self.profiler is not None:
        return self.profiler.stop()
    return None
```

```python
        """
        if self.profiler is not None:
            self.profiler.start()
            logger.info("Diffusion worker %s: profiler started", self.rank)
```
The trace_path_template parameter is accepted but never used — vLLM's wrappers get their paths from profiler_config at init time. This is confusing for callers. Consider removing it entirely or at minimum documenting that it's ignored.
```python
            worker_name = f"diffusion-rank-{self.rank}"
            self.profiler = TorchProfilerWrapper(
                profiler_config,
                worker_name=worker_name,
```
Missing activities parameter. vLLM's gpu_worker.py explicitly passes activities=["CPU", "CUDA"]:

```python
self.profiler = TorchProfilerWrapper(
    profiler_config,
    worker_name=worker_name,
    local_rank=self.local_rank,
    activities=["CPU", "CUDA"],  # <-- add this
)
```

Without it, the torch profiler may not capture CUDA kernels, which defeats the purpose of nsys integration.
```python
        profiler_context = (
            self.profiler.annotate_context_manager("diffusion_forward") if self.profiler is not None else nullcontext()
        )
        with profiler_context:
```
Good use of annotate_context_manager and step() — this follows vLLM's pattern and gives clean trace segmentation per forward pass.
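For readers unfamiliar with the pattern, here is a self-contained illustration of the `annotate_context_manager`-or-`nullcontext` idiom. The `FakeProfiler` is a hypothetical stand-in; vLLM's real wrappers emit NVTX/torch.profiler ranges instead of recording names in a list.

```python
from contextlib import contextmanager, nullcontext

# Hypothetical stand-in for a profiler wrapper; the real wrappers open
# NVTX/torch.profiler ranges rather than appending to a list.
class FakeProfiler:
    def __init__(self):
        self.ranges = []

    @contextmanager
    def annotate_context_manager(self, name):
        self.ranges.append(("enter", name))  # range opens
        try:
            yield
        finally:
            self.ranges.append(("exit", name))  # range closes, even on error

def run_forward(profiler):
    # Same shape as the quoted diff: annotate when a profiler exists,
    # otherwise fall back to a no-op context manager.
    ctx = (
        profiler.annotate_context_manager("diffusion_forward")
        if profiler is not None
        else nullcontext()
    )
    with ctx:
        return "output"
```

Because `nullcontext()` is a valid context manager, the forward path needs no `if profiler` branching inside the `with` body.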
```python
                output_files["traces"].append(trace_path)
            elif isinstance(trace_path, list):
                output_files["traces"].extend(trace_path)
        successful_traces = len(output_files["traces"])
```
Since workers always return None right now, the entire for rank, res in enumerate(results) loop body is effectively dead code (the if res is None: continue skips everything). This will become useful after fixing the worker's stop_profile() to return the wrapper's result.
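A minimal sketch of the engine-side aggregation that becomes live once workers return their trace paths (the function name and the `trace_path` key are illustrative, modeled on the quoted snippet rather than the repo's exact API):

```python
# Hedged sketch: collect trace file paths returned by each worker's
# stop_profile(). Names are assumptions based on the quoted diff.
def aggregate_traces(results):
    output_files = {"traces": []}
    for rank, res in enumerate(results):
        if res is None:
            continue  # worker had no profiler or returned nothing
        trace_path = res.get("trace_path") if isinstance(res, dict) else res
        if isinstance(trace_path, str):
            output_files["traces"].append(trace_path)
        elif isinstance(trace_path, list):
            output_files["traces"].extend(trace_path)
    return output_files
```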
```python
        trace_filename = f"stage_{stage_id}_diffusion_{int(time.time())}"
        stage_engine.start_profile(trace_filename=trace_filename)
        logger.info("[Stage-%s] Diffusion Torch profiler started", stage_id)
        profile_dir = os.environ.get("VLLM_TORCH_PROFILER_DIR")
```
nit: The comment # Sync call is safe here was left behind, but now this function is named handle_profiler_task_async. The comment is stale/misleading.
@ahengljh Hey, aligning the diffusion profiler with vLLM's CudaProfilerWrapper and TorchProfilerWrapper is the right approach — makes nsys and torch profiling work consistently across LLM and diffusion workers. Any blockers on getting this merged?

resolve conflicts
Signed-off-by: Jinheng Li <ahengljh@gmail.com>
Learn from #2382, add test files

@Gaohan123 what's left for merging? Just test results?
Signed-off-by: Jinheng Li <ahengljh@gmail.com>
@gcanlin PTAL. Can you help me test whether it also works on NPU?
```python
            return None


        self.profiler.stop()
        if isinstance(self.profiler, OmniTorchProfilerWrapper):
```
Why do we need to return the result?
It seems that vLLM doesn't return anything.
```python
        if profiler_type == "torch":
            return create_omni_profiler(
                profiler_config=profiler_config,
                worker_name=f"diffusion-rank-{self.rank}",
```

Suggested change:

```diff
-                worker_name=f"diffusion-rank-{self.rank}",
+                worker_name=f"diffusion_rank_{self.rank}",
```
```python
        try:
            return CudaProfilerWrapper(profiler_config)
        except Exception as exc:
            logger.warning(
                "Failed to initialize CUDA profiler on diffusion worker %s: %s",
                self.rank,
                exc,
            )
            return None
        if profiler_type is not None:
            logger.warning("Unknown profiler backend %r on diffusion worker %s", profiler_type, self.rank)
        return None
```

Suggested change:

```diff
-        try:
-            return CudaProfilerWrapper(profiler_config)
-        except Exception as exc:
-            logger.warning(
-                "Failed to initialize CUDA profiler on diffusion worker %s: %s",
-                self.rank,
-                exc,
-            )
-            return None
+        return CudaProfilerWrapper(profiler_config)
         if profiler_type is not None:
             logger.warning("Unknown profiler backend %r on diffusion worker %s", profiler_type, self.rank)
```
Signed-off-by: Canlin Guo <961750412@qq.com>
Signed-off-by: Jinheng Li <ahengljh@gmail.com>
@ahengljh Could you help fix the UT: https://buildkite.com/vllm/vllm-omni/builds/6142/steps/canvas?sid=019d7002-91f2-48bf-8fad-7f68a8847c8c&tab=output? It seems my code change broke the original UT.
I am taking a look |
Signed-off-by: Jinheng Li <ahengljh@gmail.com>
@ahengljh Could you run profiling on one model and show the trace?
Tested the nsys approach from this PR on TTS models (Qwen3-TTS + Fish Speech) and can confirm it works well. Here are the trace results @gcanlin asked for.

**Test Setup**

**Trace Results**

Fish Speech (s2-pro) — 7.7 MB trace, 2 profiled iterations:

Qwen3-TTS (0.6B-Base) — 271 MB trace, 1 profiled iteration:

**Extending to omni workers**

Re: @gcanlin's earlier suggestion about unifying the profiler —

```python
def _create_profiler(self) -> WorkerProfiler | None:
    profiler_config = self.vllm_config.profiler_config
    profiler_type = getattr(profiler_config, "profiler", None)
    if profiler_type == "torch":
        return create_omni_profiler(...)
    if profiler_type == "cuda":
        return CudaProfilerWrapper(profiler_config)
    return None
```

This would give all omni workers (TTS, audio, future models) nsys support for free. Happy to open a follow-up PR for this once this one lands.

**Notes**
Signed-off-by: Jinheng Li <ahengljh@gmail.com> Signed-off-by: Canlin Guo <961750412@qq.com> Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com> Co-authored-by: Canlin Guo <961750412@qq.com>
Summary
Related to #677
Follow vLLM's profiler pattern for diffusion workers — use `CudaProfilerWrapper` and `TorchProfilerWrapper` from vLLM instead of a custom implementation.

**How It Works**

Diffusion workers now use the same profiler infrastructure as vLLM's LLM workers:

- `VLLM_TORCH_CUDA_PROFILE=1` → uses `CudaProfilerWrapper` for nsys integration
- `VLLM_TORCH_PROFILER_DIR=./profiles` → uses `TorchProfilerWrapper` for detailed traces

Nsys usage:

```shell
export VLLM_TORCH_CUDA_PROFILE=1
nsys profile \
  --capture-range=cudaProfilerApi \
  --capture-range-end=repeat \
  --trace-fork-before-exec=true \
  --cuda-graph-trace=node \
  -o diffusion_trace \
  python image_to_video.py --model Wan-AI/Wan2.2-I2V-A14B-Diffusers ...
```

**Files Changed**

- `vllm_omni/diffusion/worker/diffusion_worker.py` (`profiler_config`)
- `docs/contributing/profiling.md` (`VLLM_TORCH_CUDA_PROFILE=1`)

**Test Results**