
[CPU] Enable torch profiling #28130

Merged
bigPYJ1151 merged 1 commit into vllm-project:main from aditew01:cpu_profile
Nov 6, 2025

Conversation

@aditew01
Contributor

@aditew01 aditew01 commented Nov 5, 2025

Purpose

This PR enables profiling of vLLM models on CPU using torch.profiler.

Usage

export VLLM_TORCH_PROFILER_DIR=example_directory

Example

VLLM_TORCH_PROFILER_DIR=vllm_profile vllm bench throughput --num-prompts 1 --seed 0 --model TinyLlama/TinyLlama-1.1B-Chat-v1.0 --input_len 128 --load-format dummy --profile

Example output for reference:

(EngineCore_DP0 pid=44565) INFO 11-05 13:22:49 [cpu_worker.py:210] -----------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  
(EngineCore_DP0 pid=44565) INFO 11-05 13:22:49 [cpu_worker.py:210]                                                  Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg    # of Calls  
(EngineCore_DP0 pid=44565) INFO 11-05 13:22:49 [cpu_worker.py:210] -----------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  
(EngineCore_DP0 pid=44565) INFO 11-05 13:22:49 [cpu_worker.py:210]                                         _C::onednn_mm        48.73%        1.022s        48.73%        1.022s      89.669us         11392  
(EngineCore_DP0 pid=44565) INFO 11-05 13:22:49 [cpu_worker.py:210]                       _C_cache_ops::reshape_and_cache        16.46%     345.077ms        16.46%     345.077ms     122.542us          2816  
(EngineCore_DP0 pid=44565) INFO 11-05 13:22:49 [cpu_worker.py:210]                               vllm::unified_attention        14.39%     301.668ms        38.74%     812.062ms     288.374us          2816  
(EngineCore_DP0 pid=44565) INFO 11-05 13:22:49 [cpu_worker.py:210]                            Torch-Compiled Region: 1/0         4.47%      93.724ms         4.77%      99.913ms     780.568us           128  
(EngineCore_DP0 pid=44565) INFO 11-05 13:22:49 [cpu_worker.py:210]                                _C::paged_attention_v1         2.88%      60.405ms         2.88%      60.405ms      21.620us          2794  
(EngineCore_DP0 pid=44565) INFO 11-05 13:22:49 [cpu_worker.py:210]                                           aten::slice         1.77%      37.093ms         2.25%      47.086ms       3.124us         15074  
(EngineCore_DP0 pid=44565) INFO 11-05 13:22:49 [cpu_worker.py:210]                                          aten::select         1.16%      24.366ms         1.41%      29.556ms       5.247us          5633  
(EngineCore_DP0 pid=44565) INFO 11-05 13:22:49 [cpu_worker.py:210]                                            aten::view         0.95%      19.870ms         0.96%      20.107ms       1.161us         17326  
(EngineCore_DP0 pid=44565) INFO 11-05 13:22:49 [cpu_worker.py:210]                                   aten::empty_strided         0.88%      18.352ms         0.88%      18.352ms       5.446us          3370  
(EngineCore_DP0 pid=44565) INFO 11-05 13:22:49 [cpu_worker.py:210]                       bytecode_tracing (dynamo_timed)         0.84%      17.627ms         2.97%      62.195ms      62.195ms             1  
(EngineCore_DP0 pid=44565) INFO 11-05 13:22:49 [cpu_worker.py:210]         OutputGraph.call_user_compiler (dynamo_timed)         0.80%      16.811ms         1.18%      24.666ms      24.666ms             1  
(EngineCore_DP0 pid=44565) INFO 11-05 13:22:49 [cpu_worker.py:210]                                      aten::empty_like         0.79%      16.526ms         1.35%      28.325ms       9.818us          2885  
(EngineCore_DP0 pid=44565) INFO 11-05 13:22:49 [cpu_worker.py:210]                                      aten::as_strided         0.76%      15.902ms         0.76%      15.902ms       0.740us         21492  
(EngineCore_DP0 pid=44565) INFO 11-05 13:22:49 [cpu_worker.py:210]                           build_guards (dynamo_timed)         0.68%      14.249ms         0.68%      14.249ms      14.249ms             1  
(EngineCore_DP0 pid=44565) INFO 11-05 13:22:49 [cpu_worker.py:210]                                     Pregraph bytecode         0.47%       9.781ms         0.47%       9.781ms      38.209us           256  
(EngineCore_DP0 pid=44565) INFO 11-05 13:22:49 [cpu_worker.py:210]                                           aten::copy_         0.39%       8.224ms         0.45%       9.451ms       6.399us          1477  
(EngineCore_DP0 pid=44565) INFO 11-05 13:22:49 [cpu_worker.py:210]                      compile_attempt_0 (dynamo_timed)         0.38%       8.031ms         4.53%      94.892ms      94.892ms             1  
(EngineCore_DP0 pid=44565) INFO 11-05 13:22:49 [cpu_worker.py:210]           PyCodeCache.load_by_key_path (dynamo_timed)         0.27%       5.634ms         0.27%       5.634ms       5.634ms             1  
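The summary above is the standard torch.profiler key-averages table. A minimal standalone sketch (not vLLM code; the matmul loop is just a stand-in workload) that produces the same kind of output:

```python
import torch
from torch.profiler import ProfilerActivity, profile

x = torch.randn(64, 64)
with profile(activities=[ProfilerActivity.CPU]) as prof:
    for _ in range(10):
        torch.mm(x, x)  # stand-in for a matmul-heavy forward pass

# Same format as the log above: ops sorted by self CPU time.
table = prof.key_averages().table(sort_by="self_cpu_time_total", row_limit=5)
print(table)
```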

@aditew01 aditew01 requested a review from bigPYJ1151 as a code owner November 5, 2025 13:45
@github-actions

github-actions bot commented Nov 5, 2025

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only fastcheck CI runs, which covers a small and essential subset of CI tests to quickly catch errors.

You can ask your reviewers to trigger select CI tests on top of fastcheck CI.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either: Add ready label to the PR or enable auto-merge.

If you have any questions, please reach out to us on Slack at https://slack.vllm.ai.

🚀

@mergify mergify bot added the v1 label Nov 5, 2025
@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request enables torch.profiler for CPU workers, which is a great addition for performance analysis on CPU. The implementation is clean and follows the existing pattern from the GPU worker. I've suggested a minor improvement to align the CPU worker's profiler initialization more closely with the GPU worker's for consistency, specifically regarding debug logging and trace file compression. Overall, this is a valuable feature.

Comment on lines +43 to +58
            logger.info(
                "Profiling enabled. Traces will be saved to: %s",
                torch_profiler_trace_dir,
            )
            self.profiler = torch.profiler.profile(
                activities=[
                    torch.profiler.ProfilerActivity.CPU,
                ],
                record_shapes=envs.VLLM_TORCH_PROFILER_RECORD_SHAPES,
                profile_memory=envs.VLLM_TORCH_PROFILER_WITH_PROFILE_MEMORY,
                with_stack=envs.VLLM_TORCH_PROFILER_WITH_STACK,
                with_flops=envs.VLLM_TORCH_PROFILER_WITH_FLOPS,
                on_trace_ready=torch.profiler.tensorboard_trace_handler(
                    torch_profiler_trace_dir, worker_name=worker_name, use_gzip=False
                ),
            )
gemini-code-assist bot commented (severity: high):

For consistency with the GPUWorker and to provide better debugging information, it would be beneficial to add a debug log for the profiler configuration. Additionally, enabling gzip compression for the trace files can help save disk space, especially for longer profiling sessions.

            logger.info(
                "Profiling enabled. Traces will be saved to: %s",
                torch_profiler_trace_dir,
            )
            logger.debug(
                "Profiler config: record_shapes=%s,"
                "profile_memory=%s,with_stack=%s,with_flops=%s",
                envs.VLLM_TORCH_PROFILER_RECORD_SHAPES,
                envs.VLLM_TORCH_PROFILER_WITH_PROFILE_MEMORY,
                envs.VLLM_TORCH_PROFILER_WITH_STACK,
                envs.VLLM_TORCH_PROFILER_WITH_FLOPS,
            )
            self.profiler = torch.profiler.profile(
                activities=[
                    torch.profiler.ProfilerActivity.CPU,
                ],
                record_shapes=envs.VLLM_TORCH_PROFILER_RECORD_SHAPES,
                profile_memory=envs.VLLM_TORCH_PROFILER_WITH_PROFILE_MEMORY,
                with_stack=envs.VLLM_TORCH_PROFILER_WITH_STACK,
                with_flops=envs.VLLM_TORCH_PROFILER_WITH_FLOPS,
                on_trace_ready=torch.profiler.tensorboard_trace_handler(
                    torch_profiler_trace_dir, worker_name=worker_name, use_gzip=True
                ),
            )
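As a minimal standalone sketch of the suggested change (assumed worker name and temp directory; not the vLLM worker code), the effect of use_gzip=True is that the trace handler writes a gzip-compressed Chrome trace into the target directory:

```python
import os
import tempfile

import torch
from torch.profiler import ProfilerActivity, profile, tensorboard_trace_handler

trace_dir = tempfile.mkdtemp()
with profile(
    activities=[ProfilerActivity.CPU],
    on_trace_ready=tensorboard_trace_handler(
        trace_dir, worker_name="cpu_worker_0", use_gzip=True
    ),
):
    torch.mm(torch.randn(32, 32), torch.randn(32, 32))

# The handler is invoked when the profiler stops; the trace file name
# ends in .pt.trace.json.gz because use_gzip=True.
files = os.listdir(trace_dir)
print(files)
```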

@aditew01 aditew01 force-pushed the cpu_profile branch 3 times, most recently from dcc09fd to 6f3383a on November 5, 2025 14:59
Signed-off-by: Aditya Tewari <aditya.tewari@arm.com>
@bigPYJ1151 bigPYJ1151 (Member) left a comment

Thanks, LGTM :)

@bigPYJ1151 bigPYJ1151 enabled auto-merge (squash) November 6, 2025 05:18
@github-actions github-actions bot added the ready ONLY add when PR is ready to merge/full CI is needed label Nov 6, 2025
@bigPYJ1151 bigPYJ1151 merged commit 3755c14 into vllm-project:main Nov 6, 2025
47 checks passed
ZhengHongming888 pushed a commit to ZhengHongming888/vllm that referenced this pull request Nov 8, 2025
Signed-off-by: Aditya Tewari <aditya.tewari@arm.com>
devpatelio pushed a commit to SumanthRH/vllm that referenced this pull request Nov 29, 2025
Signed-off-by: Aditya Tewari <aditya.tewari@arm.com>

Labels

ready ONLY add when PR is ready to merge/full CI is needed v1
