[Profiler] Add Nsight Systems support for serving #1098
Conversation
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: b23aa54006
```python
if task_type == OmniStageTaskType.PROFILER_START:
    # Signal nsys to begin capturing (no-op if not under nsys)
    try:
        torch.cuda.profiler.start()
        logger.info("[Stage-%s] CUDA profiler started (nsys capture region open)", stage_id)
```
Start CUDA profiler inside diffusion worker processes
This torch.cuda.profiler.start() call runs only in the stage worker process. For diffusion, actual GPU kernels execute in subprocesses spawned by the diffusion executor (e.g., MultiprocDiffusionExecutor → WorkerProc), and those workers never call cudaProfilerStart. With --capture-range=cudaProfilerApi, nsys opens capture ranges per process, so the child processes doing the CUDA work stay closed and the nsys report ends up empty for diffusion workloads. Consider invoking torch.cuda.profiler.start()/stop() in DiffusionWorker.start_profile/stop_profile (or via the RPC path) so the capture range opens in the GPU worker processes.
Useful? React with 👍 / 👎.
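A minimal sketch of what this suggestion describes (the class and method names here are assumptions for illustration, not the repo's exact API): open and close the capture range inside the process that actually launches the CUDA kernels.

```python
# Hedged sketch of the reviewer's suggestion: call cudaProfilerStart/Stop
# (via torch.cuda.profiler) inside the GPU worker process itself, so that
# `nsys profile --capture-range=cudaProfilerApi` opens the capture range
# in the process doing the work. Class/method names are illustrative.
try:
    import torch
    _HAS_CUDA = torch.cuda.is_available()
except ImportError:  # let the sketch run without torch installed
    torch = None
    _HAS_CUDA = False


class DiffusionWorkerSketch:
    def __init__(self) -> None:
        self.profiling = False

    def start_profile(self) -> None:
        # Under nsys with --capture-range=cudaProfilerApi this opens the
        # capture range in *this* process; it is a no-op otherwise.
        if _HAS_CUDA:
            torch.cuda.profiler.start()
        self.profiling = True

    def stop_profile(self) -> None:
        if _HAS_CUDA:
            torch.cuda.profiler.stop()
        self.profiling = False
```

The key point is that these calls run in the subprocess spawned by the diffusion executor, not only in the stage worker process.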
@lishunyang12 @ZJY0516 PTAL if free, thanks!

Please provide an e2e example.

I would recommend splitting this PR into two: one for online serving profiling, and another for the nsys integration.
```python
        launched with ``--capture-range=cudaProfilerApi``) records GPU
        activity from within this worker process.
        """
        if torch.cuda.is_available():
```
Having to check whether CUDA is available every single time adds a lot of noise to the code.
@gcanlin Do you have any suggestions?
Any suggestions here? I am not sure; alternatively, I can revert these and keep them as the previous version.
For omni models, since we reuse vLLM's profiler for now, I don't think we need to add anything to support the CUDA profiler; it is already supported in gpu_worker.py. If we also need it for diffusion, we can refer to vLLM's wrapper implementation. Before that, maybe we should consider unifying the profiler between omni models and diffusion models. cc @lishunyang12
```python
class Worker(WorkerBase):
    ...
    # Torch/CUDA profiler. Enabled and configured through profiler_config.
    self.profiler: Any | None = None
    profiler_config = vllm_config.profiler_config
    if profiler_config.profiler == "torch":
        worker_name = f"{vllm_config.instance_id}-rank-{self.rank}"
        self.profiler = TorchProfilerWrapper(
            profiler_config,
            worker_name=worker_name,
            local_rank=self.local_rank,
            activities=["CPU", "CUDA"],
        )
    elif profiler_config.profiler == "cuda":
        self.profiler = CudaProfilerWrapper(profiler_config)
    else:
        self.profiler = None
```
BTW, #1136 is implementing online profiling :)
I see, I'll remove the online profiling part.
Can you provide test results? A trace graph attached to the description would be good.
I'd love to, but screenshot uploads are blocked by our company's network policy...
```python
        logger.info("Diffusion worker %s: profiler stopped", self.rank)
        return None


    def execute_model(self, req: OmniDiffusionRequest, od_config: OmniDiffusionConfig) -> DiffusionOutput:
```
stop_profile() always returns None, which means DiffusionEngine.stop_profile() never gets any trace paths from workers. The elaborate aggregation logic in the engine becomes dead code.
TorchProfilerWrapper.stop() returns a dict with trace file paths — please return that result instead of discarding it:

```python
def stop_profile(self) -> dict | None:
    if self.profiler is not None:
        return self.profiler.stop()
    return None
```

```python
        """
        if self.profiler is not None:
            self.profiler.start()
            logger.info("Diffusion worker %s: profiler started", self.rank)
```
The trace_path_template parameter is accepted but never used — vLLM's wrappers get their paths from profiler_config at init time. This is confusing for callers. Consider removing it entirely or at minimum documenting that it's ignored.
```python
            worker_name = f"diffusion-rank-{self.rank}"
            self.profiler = TorchProfilerWrapper(
                profiler_config,
                worker_name=worker_name,
```
Missing activities parameter. vLLM's gpu_worker.py explicitly passes activities=["CPU", "CUDA"]:

```python
self.profiler = TorchProfilerWrapper(
    profiler_config,
    worker_name=worker_name,
    local_rank=self.local_rank,
    activities=["CPU", "CUDA"],  # <-- add this
)
```

Without it, the torch profiler may not capture CUDA kernels, which defeats the purpose of nsys integration.
```python
        profiler_context = (
            self.profiler.annotate_context_manager("diffusion_forward") if self.profiler is not None else nullcontext()
        )
        with profiler_context:
```
Good use of annotate_context_manager and step() — this follows vLLM's pattern and gives clean trace segmentation per forward pass.
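For readers unfamiliar with the pattern, here is a self-contained illustration of the `annotate_context_manager`-or-`nullcontext` idiom. The `FakeProfiler` is a hypothetical stand-in; vLLM's real wrappers emit NVTX/torch.profiler ranges instead of recording names in a list.

```python
from contextlib import contextmanager, nullcontext

# Hypothetical stand-in for a profiler wrapper; the real wrappers open
# NVTX/torch.profiler ranges rather than appending to a list.
class FakeProfiler:
    def __init__(self):
        self.ranges = []

    @contextmanager
    def annotate_context_manager(self, name):
        self.ranges.append(("enter", name))  # range opens
        try:
            yield
        finally:
            self.ranges.append(("exit", name))  # range closes, even on error

def run_forward(profiler):
    # Same shape as the quoted diff: annotate when a profiler exists,
    # otherwise fall back to a no-op context manager.
    ctx = (
        profiler.annotate_context_manager("diffusion_forward")
        if profiler is not None
        else nullcontext()
    )
    with ctx:
        return "output"
```

Because `nullcontext()` is a valid context manager, the forward path needs no `if profiler` branching inside the `with` body.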
```python
                output_files["traces"].append(trace_path)
            elif isinstance(trace_path, list):
                output_files["traces"].extend(trace_path)
        successful_traces = len(output_files["traces"])
```
Since workers always return None right now, the entire for rank, res in enumerate(results) loop body is effectively dead code (the if res is None: continue skips everything). This will become useful after fixing the worker's stop_profile() to return the wrapper's result.
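A minimal sketch of the engine-side aggregation that becomes live once workers return their trace paths (the function name and the `trace_path` key are illustrative, modeled on the quoted snippet rather than the repo's exact API):

```python
# Hedged sketch: collect trace file paths returned by each worker's
# stop_profile(). Names are assumptions based on the quoted diff.
def aggregate_traces(results):
    output_files = {"traces": []}
    for rank, res in enumerate(results):
        if res is None:
            continue  # worker had no profiler or returned nothing
        trace_path = res.get("trace_path") if isinstance(res, dict) else res
        if isinstance(trace_path, str):
            output_files["traces"].append(trace_path)
        elif isinstance(trace_path, list):
            output_files["traces"].extend(trace_path)
    return output_files
```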
```python
        trace_filename = f"stage_{stage_id}_diffusion_{int(time.time())}"
        stage_engine.start_profile(trace_filename=trace_filename)
        logger.info("[Stage-%s] Diffusion Torch profiler started", stage_id)
        profile_dir = os.environ.get("VLLM_TORCH_PROFILER_DIR")
```
nit: The comment # Sync call is safe here was left behind, but now this function is named handle_profiler_task_async. The comment is stale/misleading.
@ahengljh Hey, aligning the diffusion profiler with vLLM's CudaProfilerWrapper and TorchProfilerWrapper is the right approach — makes nsys and torch profiling work consistently across LLM and diffusion workers. Any blockers on getting this merged?

resolve conflicts
Signed-off-by: Jinheng Li <ahengljh@gmail.com>
Learn from #2382, add test files

@Gaohan123 what's left for merging? Just test results?
Signed-off-by: Jinheng Li <ahengljh@gmail.com>
@gcanlin PTAL. Can you help me test whether it also works on NPU?
```python
            return None


        self.profiler.stop()
        if isinstance(self.profiler, OmniTorchProfilerWrapper):
```
Why do we need to return the result?
It seems that vLLM doesn't return anything.
```python
        if profiler_type == "torch":
            return create_omni_profiler(
                profiler_config=profiler_config,
                worker_name=f"diffusion-rank-{self.rank}",
```

Suggested change:

```diff
-                worker_name=f"diffusion-rank-{self.rank}",
+                worker_name=f"diffusion_rank_{self.rank}",
```
```python
        try:
            return CudaProfilerWrapper(profiler_config)
        except Exception as exc:
            logger.warning(
                "Failed to initialize CUDA profiler on diffusion worker %s: %s",
                self.rank,
                exc,
            )
            return None
        if profiler_type is not None:
            logger.warning("Unknown profiler backend %r on diffusion worker %s", profiler_type, self.rank)
        return None
```

Suggested change:

```diff
-        try:
-            return CudaProfilerWrapper(profiler_config)
-        except Exception as exc:
-            logger.warning(
-                "Failed to initialize CUDA profiler on diffusion worker %s: %s",
-                self.rank,
-                exc,
-            )
-            return None
+        return CudaProfilerWrapper(profiler_config)
         if profiler_type is not None:
             logger.warning("Unknown profiler backend %r on diffusion worker %s", profiler_type, self.rank)
```
Signed-off-by: Canlin Guo <961750412@qq.com>
Signed-off-by: Jinheng Li <ahengljh@gmail.com>
@ahengljh Could you help fix the UT: https://buildkite.com/vllm/vllm-omni/builds/6142/steps/canvas?sid=019d7002-91f2-48bf-8fad-7f68a8847c8c&tab=output? It seems my code change broke the original UT.
I am taking a look |
Signed-off-by: Jinheng Li <ahengljh@gmail.com>
@ahengljh Could you run profiling on one model and show the trace?
Tested the nsys approach from this PR on TTS models (Qwen3-TTS + Fish Speech) and can confirm it works well. Here are the trace results @gcanlin asked for.

**Test Setup**

**Trace Results**

Fish Speech (s2-pro) — 7.7 MB trace, 2 profiled iterations:

Qwen3-TTS (0.6B-Base) — 271 MB trace, 1 profiled iteration:

**Extending to omni workers**

Re: @gcanlin's earlier suggestion about unifying the profiler —

```python
def _create_profiler(self) -> WorkerProfiler | None:
    profiler_config = self.vllm_config.profiler_config
    profiler_type = getattr(profiler_config, "profiler", None)
    if profiler_type == "torch":
        return create_omni_profiler(...)
    if profiler_type == "cuda":
        return CudaProfilerWrapper(profiler_config)
    return None
```

This would give all omni workers (TTS, audio, future models) nsys support for free. Happy to open a follow-up PR for this once this one lands.

**Notes**
Signed-off-by: Jinheng Li <ahengljh@gmail.com> Signed-off-by: Canlin Guo <961750412@qq.com> Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com> Co-authored-by: Canlin Guo <961750412@qq.com>
Summary
Related to #677
Follow vLLM's profiler pattern for diffusion workers — use `CudaProfilerWrapper` and `TorchProfilerWrapper` from vLLM instead of a custom implementation.

**How It Works**

Diffusion workers now use the same profiler infrastructure as vLLM's LLM workers:

- `VLLM_TORCH_CUDA_PROFILE=1` → uses `CudaProfilerWrapper` for nsys integration
- `VLLM_TORCH_PROFILER_DIR=./profiles` → uses `TorchProfilerWrapper` for detailed traces

Nsys usage:

```shell
export VLLM_TORCH_CUDA_PROFILE=1
nsys profile \
  --capture-range=cudaProfilerApi \
  --capture-range-end=repeat \
  --trace-fork-before-exec=true \
  --cuda-graph-trace=node \
  -o diffusion_trace \
  python image_to_video.py --model Wan-AI/Wan2.2-I2V-A14B-Diffusers ...
```

**Files Changed**

- `vllm_omni/diffusion/worker/diffusion_worker.py` (`profiler_config`)
- `docs/contributing/profiling.md` (`VLLM_TORCH_CUDA_PROFILE=1`)

**Test Results**