[Refactor] profiler config optimize #6141
Conversation
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:
If CI fails, you can run linting and testing checks locally according to Contributing and Testing.
Code Review
This pull request optimizes the profiler configuration for torch_npu. The changes include simplifying the _ExperimentalConfig, enabling data_simplification, and using the with_modules parameter instead of with_stack for better performance when capturing stack traces.
My review focuses on the robustness of the changes. While simplifying the code is good, relying on default values of an internal, experimental API can be risky. I've suggested explicitly setting the parameters to avoid potential future performance regressions if the defaults change. The rest of the changes look like reasonable optimizations.
```diff
 experimental_config = torch_npu.profiler._ExperimentalConfig(
     export_type=torch_npu.profiler.ExportType.Text,
     profiler_level=torch_npu.profiler.ProfilerLevel.Level1,
     msprof_tx=False,
     aic_metrics=torch_npu.profiler.AiCMetrics.AiCoreNone,
     l2_cache=False,
     op_attr=False,
-    data_simplification=False,
     record_op_args=False,
     gc_detect_threshold=None,
+    data_simplification=True,
 )
```
While simplifying the call to _ExperimentalConfig makes the code shorter, it now relies on the default values for several parameters (msprof_tx, l2_cache, op_attr, record_op_args, gc_detect_threshold). Since _ExperimentalConfig is an internal, experimental API, its defaults could change in future torch_npu versions. This could unexpectedly enable a resource-intensive profiling feature, leading to performance degradation. To make the code more robust and future-proof, it's safer to explicitly specify these parameters to ensure the intended behavior.
```python
experimental_config = torch_npu.profiler._ExperimentalConfig(
    export_type=torch_npu.profiler.ExportType.Text,
    profiler_level=torch_npu.profiler.ProfilerLevel.Level1,
    msprof_tx=False,
    aic_metrics=torch_npu.profiler.AiCMetrics.AiCoreNone,
    l2_cache=False,
    op_attr=False,
    data_simplification=True,
    record_op_args=False,
    gc_detect_threshold=None,
)
```
The default values for these parameters are stable and will not change. We prefer to keep the code concise by relying on these defaults.
```python
    profiler_level=torch_npu.profiler.ProfilerLevel.Level1,
    msprof_tx=False,
    aic_metrics=torch_npu.profiler.AiCMetrics.AiCoreNone,
    l2_cache=False,
```
I think it's better to keep these default values in case a user wants to enable one of these parameters but doesn't know the concrete parameter name.
Thanks for the suggestion. I agree that explicitly listing these parameters makes it easier for users to discover and enable them if needed. I have restored the parameters with their default values.
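The trade-off discussed above can be sketched in plain Python: keep every `_ExperimentalConfig` parameter spelled out in one place, so a change in upstream defaults cannot silently alter profiling behaviour, while still allowing concise call sites. `EXPLICIT_DEFAULTS` and `experimental_config_kwargs` are illustrative names, not vLLM Ascend code.

```python
# Explicit defaults for the torch_npu experimental profiler config.
# Listing them here documents every knob a user might want to flip.
EXPLICIT_DEFAULTS = {
    "msprof_tx": False,
    "l2_cache": False,
    "op_attr": False,
    "record_op_args": False,
    "gc_detect_threshold": None,
    "data_simplification": True,
}

def experimental_config_kwargs(**overrides):
    """Start from the explicit defaults, then apply caller overrides."""
    kwargs = dict(EXPLICIT_DEFAULTS)
    kwargs.update(overrides)
    return kwargs

# A caller can enable a single feature without restating the rest:
cfg = experimental_config_kwargs(l2_cache=True)
```

The resulting dict could then be splatted into `_ExperimentalConfig(**cfg)`, keeping the call site short while the defaults stay pinned.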
Pick from #6141 to make the profiler work as expected. This PR also makes sure `VLLM_TORCH_PROFILER_WITH_STACK` and `VLLM_TORCH_PROFILER_WITH_PROFILE_MEMORY` behave the same as in v0.12.0. Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
### What this PR does / why we need it?

This PR optimizes the torch_npu profiler configuration to significantly reduce overhead and trace file size. The key changes include:

**Enable Data Simplification**: Explicitly sets `data_simplification=True` in `_ExperimentalConfig`. This filters out unnecessary intermediate data during profiling, drastically reducing the memory footprint and I/O overhead.

**Use Lightweight Stack Tracing**: Replaces `with_stack` with `with_modules` when `torch_profiler_with_stack` is enabled. In torch_npu, `with_stack` introduces heavy latency; `with_modules` provides equivalent semantic information with much lower overhead.

**Code Simplification**: Removes redundant parameter configurations in `_ExperimentalConfig` by relying on default values, making the codebase cleaner and easier to maintain.

**Test setup:** max length = 50, profiler + stack enabled

**Before optimization:**
- Profiler data size: 651 MB
- Generate time: 3 seconds

**After optimization:**
- Profiler data size: 156 MB (≈76% reduction)
- Generate time: <1 second

### Does this PR introduce _any_ user-facing change?

No API changes. Users profiling on Ascend will experience faster profiling execution and smaller trace files when stack tracing is enabled.

### How was this patch tested?

Manually verified on Ascend NPU by running vLLM with the profiler enabled. Confirmed that trace files are generated correctly, containing the necessary stack/module info, while showing the reported reduction in size and time.

- vLLM version: v0.13.0
- vLLM main: vllm-project/vllm@d682094

Signed-off-by: mengchengTang <745274877@qq.com>
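The stack-tracing change above can be sketched without an NPU: route the stack-tracing request to `with_modules` instead of `with_stack`, driven by the environment variables the PR mentions. `build_profiler_kwargs` is a hypothetical helper for illustration; the real vLLM Ascend wiring may differ.

```python
def build_profiler_kwargs(env: dict) -> dict:
    """Translate VLLM_TORCH_PROFILER_* env vars into profiler kwargs.

    Hypothetical sketch: when stack tracing is requested, prefer
    with_modules (cheap module-level call-site info on torch_npu)
    over with_stack (full Python stacks, heavy latency).
    """
    want_stack = env.get("VLLM_TORCH_PROFILER_WITH_STACK", "1") == "1"
    want_memory = env.get("VLLM_TORCH_PROFILER_WITH_PROFILE_MEMORY", "0") == "1"
    return {
        "with_modules": want_stack,   # lightweight replacement
        "with_stack": False,          # never pay the heavy path
        "profile_memory": want_memory,
    }

# Example: stack tracing requested, memory profiling left off.
kwargs = build_profiler_kwargs({"VLLM_TORCH_PROFILER_WITH_STACK": "1"})
```

The returned dict would be passed to the torch_npu profiler constructor; only the `with_modules`/`with_stack` swap differs from the pre-PR behaviour.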