
[CPU] Enable torch profiling #28130

Merged
bigPYJ1151 merged 1 commit into vllm-project:main from aditew01:cpu_profile
Nov 6, 2025

Conversation

@aditew01
Contributor

@aditew01 aditew01 commented Nov 5, 2025

Purpose

This PR enables profiling of vLLM models on CPU using torch.profiler.

Usage

export VLLM_TORCH_PROFILER_DIR=example_directory

Example

VLLM_TORCH_PROFILER_DIR=vllm_profile vllm bench throughput --num-prompts 1 --seed 0 --model TinyLlama/TinyLlama-1.1B-Chat-v1.0 --input_len 128 --load-format dummy --profile

Example output for reference:

(EngineCore_DP0 pid=44565) INFO 11-05 13:22:49 [cpu_worker.py:210] -----------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  
(EngineCore_DP0 pid=44565) INFO 11-05 13:22:49 [cpu_worker.py:210]                                                  Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg    # of Calls  
(EngineCore_DP0 pid=44565) INFO 11-05 13:22:49 [cpu_worker.py:210] -----------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  
(EngineCore_DP0 pid=44565) INFO 11-05 13:22:49 [cpu_worker.py:210]                                         _C::onednn_mm        48.73%        1.022s        48.73%        1.022s      89.669us         11392  
(EngineCore_DP0 pid=44565) INFO 11-05 13:22:49 [cpu_worker.py:210]                       _C_cache_ops::reshape_and_cache        16.46%     345.077ms        16.46%     345.077ms     122.542us          2816  
(EngineCore_DP0 pid=44565) INFO 11-05 13:22:49 [cpu_worker.py:210]                               vllm::unified_attention        14.39%     301.668ms        38.74%     812.062ms     288.374us          2816  
(EngineCore_DP0 pid=44565) INFO 11-05 13:22:49 [cpu_worker.py:210]                            Torch-Compiled Region: 1/0         4.47%      93.724ms         4.77%      99.913ms     780.568us           128  
(EngineCore_DP0 pid=44565) INFO 11-05 13:22:49 [cpu_worker.py:210]                                _C::paged_attention_v1         2.88%      60.405ms         2.88%      60.405ms      21.620us          2794  
(EngineCore_DP0 pid=44565) INFO 11-05 13:22:49 [cpu_worker.py:210]                                           aten::slice         1.77%      37.093ms         2.25%      47.086ms       3.124us         15074  
(EngineCore_DP0 pid=44565) INFO 11-05 13:22:49 [cpu_worker.py:210]                                          aten::select         1.16%      24.366ms         1.41%      29.556ms       5.247us          5633  
(EngineCore_DP0 pid=44565) INFO 11-05 13:22:49 [cpu_worker.py:210]                                            aten::view         0.95%      19.870ms         0.96%      20.107ms       1.161us         17326  
(EngineCore_DP0 pid=44565) INFO 11-05 13:22:49 [cpu_worker.py:210]                                   aten::empty_strided         0.88%      18.352ms         0.88%      18.352ms       5.446us          3370  
(EngineCore_DP0 pid=44565) INFO 11-05 13:22:49 [cpu_worker.py:210]                       bytecode_tracing (dynamo_timed)         0.84%      17.627ms         2.97%      62.195ms      62.195ms             1  
(EngineCore_DP0 pid=44565) INFO 11-05 13:22:49 [cpu_worker.py:210]         OutputGraph.call_user_compiler (dynamo_timed)         0.80%      16.811ms         1.18%      24.666ms      24.666ms             1  
(EngineCore_DP0 pid=44565) INFO 11-05 13:22:49 [cpu_worker.py:210]                                      aten::empty_like         0.79%      16.526ms         1.35%      28.325ms       9.818us          2885  
(EngineCore_DP0 pid=44565) INFO 11-05 13:22:49 [cpu_worker.py:210]                                      aten::as_strided         0.76%      15.902ms         0.76%      15.902ms       0.740us         21492  
(EngineCore_DP0 pid=44565) INFO 11-05 13:22:49 [cpu_worker.py:210]                           build_guards (dynamo_timed)         0.68%      14.249ms         0.68%      14.249ms      14.249ms             1  
(EngineCore_DP0 pid=44565) INFO 11-05 13:22:49 [cpu_worker.py:210]                                     Pregraph bytecode         0.47%       9.781ms         0.47%       9.781ms      38.209us           256  
(EngineCore_DP0 pid=44565) INFO 11-05 13:22:49 [cpu_worker.py:210]                                           aten::copy_         0.39%       8.224ms         0.45%       9.451ms       6.399us          1477  
(EngineCore_DP0 pid=44565) INFO 11-05 13:22:49 [cpu_worker.py:210]                      compile_attempt_0 (dynamo_timed)         0.38%       8.031ms         4.53%      94.892ms      94.892ms             1  
(EngineCore_DP0 pid=44565) INFO 11-05 13:22:49 [cpu_worker.py:210]           PyCodeCache.load_by_key_path (dynamo_timed)         0.27%       5.634ms         0.27%       5.634ms       5.634ms             1  
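The summary above is the standard torch.profiler key-averages table. A minimal standalone sketch (not vLLM code; the matmul loop is just a stand-in workload) that produces the same kind of output:

```python
import torch
from torch.profiler import ProfilerActivity, profile

x = torch.randn(64, 64)
with profile(activities=[ProfilerActivity.CPU]) as prof:
    for _ in range(10):
        torch.mm(x, x)  # stand-in for a matmul-heavy forward pass

# Same format as the log above: ops sorted by self CPU time.
table = prof.key_averages().table(sort_by="self_cpu_time_total", row_limit=5)
print(table)
```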

@aditew01 aditew01 requested a review from bigPYJ1151 as a code owner November 5, 2025 13:45
@github-actions

github-actions bot commented Nov 5, 2025

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only fastcheck CI runs, which covers a small and essential subset of CI tests to quickly catch errors.

You can ask your reviewers to trigger select CI tests on top of fastcheck CI.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either: Add ready label to the PR or enable auto-merge.

If you have any questions, please reach out to us on Slack at https://slack.vllm.ai.

🚀

@mergify mergify bot added the v1 label Nov 5, 2025
@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request enables torch.profiler for CPU workers, which is a great addition for performance analysis on CPU. The implementation is clean and follows the existing pattern from the GPU worker. I've suggested a minor improvement to align the CPU worker's profiler initialization more closely with the GPU worker's for consistency, specifically regarding debug logging and trace file compression. Overall, this is a valuable feature.

Comment on lines +43 to +58
            logger.info(
                "Profiling enabled. Traces will be saved to: %s",
                torch_profiler_trace_dir,
            )
            self.profiler = torch.profiler.profile(
                activities=[
                    torch.profiler.ProfilerActivity.CPU,
                ],
                record_shapes=envs.VLLM_TORCH_PROFILER_RECORD_SHAPES,
                profile_memory=envs.VLLM_TORCH_PROFILER_WITH_PROFILE_MEMORY,
                with_stack=envs.VLLM_TORCH_PROFILER_WITH_STACK,
                with_flops=envs.VLLM_TORCH_PROFILER_WITH_FLOPS,
                on_trace_ready=torch.profiler.tensorboard_trace_handler(
                    torch_profiler_trace_dir, worker_name=worker_name, use_gzip=False
                ),
            )
gemini-code-assist bot commented (severity: high):

For consistency with the GPUWorker and to provide better debugging information, it would be beneficial to add a debug log for the profiler configuration. Additionally, enabling gzip compression for the trace files can help save disk space, especially for longer profiling sessions.

            logger.info(
                "Profiling enabled. Traces will be saved to: %s",
                torch_profiler_trace_dir,
            )
            logger.debug(
                "Profiler config: record_shapes=%s,"
                "profile_memory=%s,with_stack=%s,with_flops=%s",
                envs.VLLM_TORCH_PROFILER_RECORD_SHAPES,
                envs.VLLM_TORCH_PROFILER_WITH_PROFILE_MEMORY,
                envs.VLLM_TORCH_PROFILER_WITH_STACK,
                envs.VLLM_TORCH_PROFILER_WITH_FLOPS,
            )
            self.profiler = torch.profiler.profile(
                activities=[
                    torch.profiler.ProfilerActivity.CPU,
                ],
                record_shapes=envs.VLLM_TORCH_PROFILER_RECORD_SHAPES,
                profile_memory=envs.VLLM_TORCH_PROFILER_WITH_PROFILE_MEMORY,
                with_stack=envs.VLLM_TORCH_PROFILER_WITH_STACK,
                with_flops=envs.VLLM_TORCH_PROFILER_WITH_FLOPS,
                on_trace_ready=torch.profiler.tensorboard_trace_handler(
                    torch_profiler_trace_dir, worker_name=worker_name, use_gzip=True
                ),
            )
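As a minimal standalone sketch of the suggested change (assumed worker name and temp directory; not the vLLM worker code), the effect of use_gzip=True is that the trace handler writes a gzip-compressed Chrome trace into the target directory:

```python
import os
import tempfile

import torch
from torch.profiler import ProfilerActivity, profile, tensorboard_trace_handler

trace_dir = tempfile.mkdtemp()
with profile(
    activities=[ProfilerActivity.CPU],
    on_trace_ready=tensorboard_trace_handler(
        trace_dir, worker_name="cpu_worker_0", use_gzip=True
    ),
):
    torch.mm(torch.randn(32, 32), torch.randn(32, 32))

# The handler is invoked when the profiler stops; the trace file name
# ends in .pt.trace.json.gz because use_gzip=True.
files = os.listdir(trace_dir)
print(files)
```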

@aditew01 aditew01 force-pushed the cpu_profile branch 3 times, most recently from dcc09fd to 6f3383a on November 5, 2025 14:59
Signed-off-by: Aditya Tewari <aditya.tewari@arm.com>
@bigPYJ1151 bigPYJ1151 (Member) left a comment

Thanks, LGTM :)

@bigPYJ1151 bigPYJ1151 enabled auto-merge (squash) November 6, 2025 05:18
@github-actions github-actions bot added the ready ONLY add when PR is ready to merge/full CI is needed label Nov 6, 2025
@bigPYJ1151 bigPYJ1151 merged commit 3755c14 into vllm-project:main Nov 6, 2025
47 checks passed
ZhengHongming888 pushed a commit to ZhengHongming888/vllm that referenced this pull request Nov 8, 2025
Signed-off-by: Aditya Tewari <aditya.tewari@arm.com>
devpatelio pushed a commit to SumanthRH/vllm that referenced this pull request Nov 29, 2025
Signed-off-by: Aditya Tewari <aditya.tewari@arm.com>

Labels

ready ONLY add when PR is ready to merge/full CI is needed v1
