[Profiler] Add Nsight Systems support for serving #1098

Merged
hsliuustc0106 merged 15 commits into vllm-project:main from ahengljh:enable_nsight_profiling on Apr 10, 2026

Conversation

@ahengljh
Contributor

@ahengljh ahengljh commented Jan 30, 2026

Summary

Related to #677

Follow vLLM's profiler pattern for diffusion workers: use CudaProfilerWrapper and TorchProfilerWrapper from vLLM instead of a custom implementation.

How It Works

Diffusion workers now use the same profiler infrastructure as vLLM's LLM workers:

  • VLLM_TORCH_CUDA_PROFILE=1 → uses CudaProfilerWrapper for nsys integration
  • VLLM_TORCH_PROFILER_DIR=./profiles → uses TorchProfilerWrapper for detailed traces
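
A minimal sketch of this selection, assuming the wrapper classes are importable from vLLM (the import path is assumed, and the real code keys off profiler_config rather than reading the environment directly):

import os

from vllm.profiler import CudaProfilerWrapper, TorchProfilerWrapper  # import path assumed

def select_profiler(profiler_config, worker_name, local_rank):
    # Hypothetical dispatch mirroring the two bullets above.
    if os.environ.get("VLLM_TORCH_CUDA_PROFILE") == "1":
        # nsys integration: emits cudaProfilerStart()/cudaProfilerStop().
        return CudaProfilerWrapper(profiler_config)
    if os.environ.get("VLLM_TORCH_PROFILER_DIR"):
        # Detailed torch traces written to the configured directory.
        return TorchProfilerWrapper(
            profiler_config,
            worker_name=worker_name,
            local_rank=local_rank,
            activities=["CPU", "CUDA"],
        )
    return None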

Nsys usage:

export VLLM_TORCH_CUDA_PROFILE=1

nsys profile \
  --capture-range=cudaProfilerApi \
  --capture-range-end=repeat \
  --trace-fork-before-exec=true \
  --cuda-graph-trace=node \
  -o diffusion_trace \
  python image_to_video.py --model Wan-AI/Wan2.2-I2V-A14B-Diffusers ...
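
For context on these flags: --capture-range=cudaProfilerApi makes nsys start capturing only when the process calls cudaProfilerStart(); --capture-range-end=repeat stops the range at cudaProfilerStop() and re-arms for subsequent start/stop pairs; --trace-fork-before-exec=true follows forked worker subprocesses; and --cuda-graph-trace=node records per-kernel activity inside CUDA graphs.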

Files Changed

File | Change
vllm_omni/diffusion/worker/diffusion_worker.py | Use vLLM's profiler wrappers based on profiler_config
docs/contributing/profiling.md | Updated nsys usage with VLLM_TORCH_CUDA_PROFILE=1

Test Results

[two screenshots of profiler trace results attached]

@ahengljh ahengljh changed the title from "[Profiler] Add Nsight Systems support for online serving" to "[Profiler] Add Nsight Systems support for serving" on Jan 30, 2026

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: b23aa54006


Comment thread vllm_omni/entrypoints/omni_stage.py Outdated
Comment on lines +737 to +741
if task_type == OmniStageTaskType.PROFILER_START:
    # Signal nsys to begin capturing (no-op if not under nsys)
    try:
        torch.cuda.profiler.start()
        logger.info("[Stage-%s] CUDA profiler started (nsys capture region open)", stage_id)


P2: Start CUDA profiler inside diffusion worker processes

This torch.cuda.profiler.start() call runs only in the stage worker process. For diffusion, actual GPU kernels execute in subprocesses spawned by the diffusion executor (e.g., MultiprocDiffusionExecutorWorkerProc), and those workers never call cudaProfilerStart. With --capture-range=cudaProfilerApi, nsys opens capture ranges per process, so the child processes doing the CUDA work stay closed and the nsys report ends up empty for diffusion workloads. Consider invoking torch.cuda.profiler.start()/stop() in DiffusionWorker.start_profile/stop_profile (or via the RPC path) so the capture range opens in the GPU worker processes.
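
A minimal sketch of the suggested fix, with hypothetical method bodies (the PR's actual DiffusionWorker wiring differs):

import torch

class DiffusionWorker:
    def start_profile(self) -> None:
        # cudaProfilerStart() is a no-op unless the process runs under
        # `nsys profile --capture-range=cudaProfilerApi`; calling it here
        # opens the capture range inside the GPU worker process itself.
        torch.cuda.profiler.start()

    def stop_profile(self) -> None:
        # Closes the capture range; with --capture-range-end=repeat,
        # a later start_profile() can open a fresh range.
        torch.cuda.profiler.stop()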


@ahengljh ahengljh force-pushed the enable_nsight_profiling branch from b23aa54 to 8c2baf3 on January 30, 2026 06:00
@ahengljh ahengljh force-pushed the enable_nsight_profiling branch 2 times, most recently from 84ebfcb to 6408db4 on January 30, 2026 07:58
@david6666666 david6666666 linked an issue Jan 30, 2026 that may be closed by this pull request
@david6666666
Collaborator

@lishunyang12 @ZJY0516 PTAL if free, thx

Comment thread docs/contributing/profiling.md Outdated
Comment thread vllm_omni/diffusion/profiler/cuda_profiler.py Outdated
@hsliuustc0106
Collaborator

provide an e2e example please

@ZJY0516
Member

ZJY0516 commented Jan 31, 2026

I would recommend splitting this PR into two: one for online serving profiling, and another for the nsys integration.

        launched with ``--capture-range=cudaProfilerApi``) records GPU
        activity from within this worker process.
        """
        if torch.cuda.is_available():
Member


Having to check whether it's CUDA every single time adds a lot of noise to the code.

@gcanlin Do you have any suggestion?

Contributor Author


Any suggestions here? I'm not sure; or I can revert these and keep them as in the previous version.

Collaborator

@gcanlin gcanlin Feb 3, 2026


For omni models, since we reuse vLLM's profiler for now, I don't think we need to add anything to support the CUDA profiler; it is already supported in gpu_worker.py. If we also need it for diffusion, we can refer to vLLM's wrapper implementation. Before that, maybe we should consider unifying the profiler between omni models and diffusion models. cc @lishunyang12

class Worker(WorkerBase):
        # Torch/CUDA profiler. Enabled and configured through profiler_config.
        self.profiler: Any | None = None
        profiler_config = vllm_config.profiler_config
        if profiler_config.profiler == "torch":
            worker_name = f"{vllm_config.instance_id}-rank-{self.rank}"
            self.profiler = TorchProfilerWrapper(
                profiler_config,
                worker_name=worker_name,
                local_rank=self.local_rank,
                activities=["CPU", "CUDA"],
            )
        elif profiler_config.profiler == "cuda":
            self.profiler = CudaProfilerWrapper(profiler_config)
        else:
            self.profiler = None

@gcanlin
Collaborator

gcanlin commented Feb 3, 2026

BTW, #1136 is implementing online profiling:)

@ahengljh
Contributor Author

ahengljh commented Feb 3, 2026

BTW, #1136 is implementing online profiling:)

I see, I'll remove online profiling part.

@lishunyang12
Collaborator

BTW, #1136 is implementing online profiling:)

I see, I'll remove online profiling part.

Can you provide test results? Maybe a trace graph attached to the description would be good.

@ahengljh ahengljh force-pushed the enable_nsight_profiling branch from 1621416 to b0c7853 on February 3, 2026 08:11
@ahengljh
Contributor Author

ahengljh commented Feb 3, 2026

BTW, #1136 is implementing online profiling:)

I see, I'll remove online profiling part.

Can you provide test results? Maybe a trace graph attached to the description would be good.

I'd love to, but all screenshot uploads are blocked by our company's network policy...

Collaborator

@lishunyang12 lishunyang12 left a comment


I will review it now, sorry for the late response.

Collaborator

@lishunyang12 lishunyang12 left a comment


Thanks for aligning diffusion profiling with vLLM's infrastructure — this is the right direction.

logger.info("Diffusion worker %s: profiler stopped", self.rank)
return None

def execute_model(self, req: OmniDiffusionRequest, od_config: OmniDiffusionConfig) -> DiffusionOutput:
Collaborator


stop_profile() always returns None, which means DiffusionEngine.stop_profile() never gets any trace paths from workers. The elaborate aggregation logic in the engine becomes dead code.

TorchProfilerWrapper.stop() returns a dict with trace file paths — please return that result instead of discarding it:

def stop_profile(self) -> dict | None:
    if self.profiler is not None:
        return self.profiler.stop()
    return None

"""
if self.profiler is not None:
self.profiler.start()
logger.info("Diffusion worker %s: profiler started", self.rank)
Collaborator


The trace_path_template parameter is accepted but never used — vLLM's wrappers get their paths from profiler_config at init time. This is confusing for callers. Consider removing it entirely or at minimum documenting that it's ignored.

worker_name = f"diffusion-rank-{self.rank}"
self.profiler = TorchProfilerWrapper(
profiler_config,
worker_name=worker_name,
Collaborator


Missing activities parameter. vLLM's gpu_worker.py explicitly passes activities=["CPU", "CUDA"]:

self.profiler = TorchProfilerWrapper(
    profiler_config,
    worker_name=worker_name,
    local_rank=self.local_rank,
    activities=["CPU", "CUDA"],  # <-- add this
)

Without it, the torch profiler may not capture CUDA kernels, which defeats the purpose of nsys integration.

profiler_context = (
    self.profiler.annotate_context_manager("diffusion_forward") if self.profiler is not None else nullcontext()
)
with profiler_context:
Collaborator


Good use of annotate_context_manager and step() — this follows vLLM's pattern and gives clean trace segmentation per forward pass.

Comment thread vllm_omni/diffusion/diffusion_engine.py Outdated
output_files["traces"].append(trace_path)
elif isinstance(trace_path, list):
output_files["traces"].extend(trace_path)
successful_traces = len(output_files["traces"])
Collaborator


Since workers always return None right now, the entire for rank, res in enumerate(results) loop body is effectively dead code (the if res is None: continue skips everything). This will become useful after fixing the worker's stop_profile() to return the wrapper's result.
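
For reference, a hedged sketch of what the loop does once workers return real results (names taken from the quoted hunks; `results` is assumed to be the per-rank list collected by the engine):

output_files = {"traces": []}
for rank, res in enumerate(results):
    if res is None:
        continue  # this worker had no active profiler
    trace_path = res  # per the hunk above: a str or a list of str
    if isinstance(trace_path, str):
        output_files["traces"].append(trace_path)
    elif isinstance(trace_path, list):
        output_files["traces"].extend(trace_path)
successful_traces = len(output_files["traces"])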

Comment thread vllm_omni/entrypoints/omni_stage.py Outdated
trace_filename = f"stage_{stage_id}_diffusion_{int(time.time())}"
stage_engine.start_profile(trace_filename=trace_filename)
logger.info("[Stage-%s] Diffusion Torch profiler started", stage_id)
profile_dir = os.environ.get("VLLM_TORCH_PROFILER_DIR")
Collaborator


nit: The comment # Sync call is safe here was left behind, but now this function is named handle_profiler_task_async. The comment is stale/misleading.

@lishunyang12
Collaborator

@ahengljh Hey, aligning the diffusion profiler with vLLM's CudaProfilerWrapper and TorchProfilerWrapper is the right approach — makes nsys and torch profiling work consistently across LLM and diffusion workers. Any blockers on getting this merged?

@hsliuustc0106
Collaborator

resolve conflicts

@hsliuustc0106 hsliuustc0106 added the ready label (to trigger buildkite CI) on Feb 26, 2026
@Gaohan123 Gaohan123 modified the milestones: v0.16.0, v0.18.0 on Mar 3, 2026
@ahengljh ahengljh force-pushed the enable_nsight_profiling branch from d239992 to 8771777 on April 7, 2026 06:36
Signed-off-by: Jinheng Li <ahengljh@gmail.com>
@ahengljh
Contributor Author

ahengljh commented Apr 7, 2026

Learning from #2382, added test files.

@ahengljh
Contributor Author

ahengljh commented Apr 7, 2026

Learning from #2382, added test files.

@Gaohan123 What's left for merging? Just test results?

ahengljh added 2 commits April 8, 2026 15:26
Signed-off-by: Jinheng Li <ahengljh@gmail.com>
Signed-off-by: Jinheng Li <ahengljh@gmail.com>
@lishunyang12
Collaborator

@gcanlin PTAL. Can you help me test whether it also works on NPU?

            return None

        self.profiler.stop()
        if isinstance(self.profiler, OmniTorchProfilerWrapper):
Collaborator


Why do we need to return the result?

Collaborator


Seems that vLLM doesn't return any.

if profiler_type == "torch":
return create_omni_profiler(
profiler_config=profiler_config,
worker_name=f"diffusion-rank-{self.rank}",
Collaborator


Suggested change
-    worker_name=f"diffusion-rank-{self.rank}",
+    worker_name=f"diffusion_rank_{self.rank}",

Comment on lines +157 to +168
try:
    return CudaProfilerWrapper(profiler_config)
except Exception as exc:
    logger.warning(
        "Failed to initialize CUDA profiler on diffusion worker %s: %s",
        self.rank,
        exc,
    )
    return None
if profiler_type is not None:
    logger.warning("Unknown profiler backend %r on diffusion worker %s", profiler_type, self.rank)
return None
Collaborator


Suggested change
-try:
-    return CudaProfilerWrapper(profiler_config)
-except Exception as exc:
-    logger.warning(
-        "Failed to initialize CUDA profiler on diffusion worker %s: %s",
-        self.rank,
-        exc,
-    )
-    return None
-if profiler_type is not None:
-    logger.warning("Unknown profiler backend %r on diffusion worker %s", profiler_type, self.rank)
-return None
+return CudaProfilerWrapper(profiler_config)
+if profiler_type is not None:
+    logger.warning("Unknown profiler backend %r on diffusion worker %s", profiler_type, self.rank)

gcanlin and others added 5 commits April 9, 2026 10:00
Signed-off-by: Canlin Guo <961750412@qq.com>
Signed-off-by: Canlin Guo <961750412@qq.com>
Signed-off-by: Canlin Guo <961750412@qq.com>
Signed-off-by: Jinheng Li <ahengljh@gmail.com>
@gcanlin
Collaborator

gcanlin commented Apr 9, 2026

@ahengljh Could you help fix the UT: https://buildkite.com/vllm/vllm-omni/builds/6142/steps/canvas?sid=019d7002-91f2-48bf-8fad-7f68a8847c8c&tab=output? It seems my change broke the original UT.

@ahengljh
Contributor Author

ahengljh commented Apr 9, 2026

@ahengljh Could you help fix the UT: https://buildkite.com/vllm/vllm-omni/builds/6142/steps/canvas?sid=019d7002-91f2-48bf-8fad-7f68a8847c8c&tab=output? It seems my change broke the original UT.

I am taking a look

Signed-off-by: Jinheng Li <ahengljh@gmail.com>
@gcanlin
Collaborator

gcanlin commented Apr 9, 2026

@ahengljh Could you run one model profiling and show the trace?

@ahengljh
Contributor Author

ahengljh commented Apr 9, 2026

@ahengljh Could you run one model profiling and show the trace?

@bjf-frz may help on this? Have you tried already?

@bjf-frz
Contributor

bjf-frz commented Apr 9, 2026

#2382

@ahengljh Could you run one model profiling and show the trace?

@bjf-frz may help on this? Have you tried already?

No, I didn't. I was testing the PyTorch profiler.

@linyueqian
Collaborator

Tested the nsys approach from this PR on TTS models (Qwen3-TTS + Fish Speech) and can confirm it works well. Here are the trace results @gcanlin asked for.

Test Setup

  • Hardware: 8× NVIDIA H20 141GB (Alibaba Cloud)
  • Models: fishaudio/s2-pro (Fish Speech), Qwen/Qwen3-TTS-12Hz-0.6B-Base
  • Method: Applied the _create_profiler() pattern from this PR to OmniGPUWorkerBase to enable CudaProfilerWrapper on TTS workers (AR + generation stages)

Trace Results

Fish Speech (s2-pro) — 7.7 MB trace, 2 profiled iterations:

Report | Key Numbers
NVTX ranges | Stage 1 (DAC decoder): 90 calls @ 18.7ms avg (87.7%). Stage 0 (AR): 1,664 calls @ 11.7μs
CUDA API | 52K cudaLaunchKernel, 90 cudaGraphLaunch @ 614ms avg
Top kernels | cuDNN convolutions (DAC decoder), CUTLASS GEMMs, fused elementwise

Qwen3-TTS (0.6B-Base) — 271 MB trace, 1 profiled iteration:

Report | Key Numbers
NVTX ranges | Stage 1 (Code2Wav): 3,895 calls @ 8.5ms avg (92.9%). Stage 0 (Talker): 41,210 calls @ 50.6μs
CUDA API | 1.45M cudaLaunchKernel, 62.5K cudaGraphLaunch @ 128ms avg
Potential bottleneck | cudaMemcpyAsync takes 63.3% of CUDA API time (395K calls)

Extending to omni workers

Re: @gcanlin's earlier suggestion about unifying the profiler — OmniGPUWorkerBase currently only creates OmniTorchProfilerWrapper for profiler: torch. Adding cuda backend support is a ~10 line change in _create_profiler():

def _create_profiler(self) -> WorkerProfiler | None:
    profiler_config = self.vllm_config.profiler_config
    profiler_type = getattr(profiler_config, "profiler", None)
    if profiler_type == "torch":
        return create_omni_profiler(...)
    if profiler_type == "cuda":
        return CudaProfilerWrapper(profiler_config)
    return None

This would give all omni workers (TTS, audio, future models) nsys support for free. Happy to open a follow-up PR for this once this one lands.

Notes

  • annotate_context_manager + step() pattern works well — clean NVTX segmentation per forward pass in the trace
  • Agree with @lishunyang12's point about the missing activities=["CPU", "CUDA"] param — worth fixing

Collaborator

@hsliuustc0106 hsliuustc0106 left a comment


lgtm, thx

@hsliuustc0106 hsliuustc0106 merged commit 78bef62 into vllm-project:main Apr 10, 2026
8 checks passed
daixinning pushed a commit to daixinning/vllm-omni that referenced this pull request Apr 13, 2026
Signed-off-by: Jinheng Li <ahengljh@gmail.com>
Signed-off-by: Canlin Guo <961750412@qq.com>
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
Co-authored-by: Canlin Guo <961750412@qq.com>
lengrongfu pushed a commit to lengrongfu/vllm-omni that referenced this pull request May 1, 2026
Signed-off-by: Jinheng Li <ahengljh@gmail.com>
Signed-off-by: Canlin Guo <961750412@qq.com>
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
Co-authored-by: Canlin Guo <961750412@qq.com>
clodaghwalsh17 pushed a commit to clodaghwalsh17/nm-vllm-omni-ent that referenced this pull request May 12, 2026
Signed-off-by: Jinheng Li <ahengljh@gmail.com>
Signed-off-by: Canlin Guo <961750412@qq.com>
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
Co-authored-by: Canlin Guo <961750412@qq.com>

Labels

ready label to trigger buildkite CI

Development

Successfully merging this pull request may close these issues.

[Feature]: Nsight Systems Profiler Support
