[Refactor] profiler config optimize #6141
Conversation
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:
If CI fails, you can run linting and testing checks locally according to Contributing and Testing.
Code Review
This pull request optimizes the profiler configuration for torch_npu. The changes include simplifying the _ExperimentalConfig, enabling data_simplification, and using the with_modules parameter instead of with_stack for better performance when capturing stack traces.
My review focuses on the robustness of the changes. While simplifying the code is good, relying on default values of an internal, experimental API can be risky. I've suggested explicitly setting the parameters to avoid potential future performance regressions if the defaults change. The rest of the changes look like reasonable optimizations.
```diff
 experimental_config = torch_npu.profiler._ExperimentalConfig(
     export_type=torch_npu.profiler.ExportType.Text,
     profiler_level=torch_npu.profiler.ProfilerLevel.Level1,
     msprof_tx=False,
     aic_metrics=torch_npu.profiler.AiCMetrics.AiCoreNone,
     l2_cache=False,
     op_attr=False,
-    data_simplification=False,
     record_op_args=False,
     gc_detect_threshold=None,
+    data_simplification=True,
 )
```
While simplifying the call to _ExperimentalConfig makes the code shorter, it now relies on the default values for several parameters (msprof_tx, l2_cache, op_attr, record_op_args, gc_detect_threshold). Since _ExperimentalConfig is an internal, experimental API, its defaults could change in future torch_npu versions. This could unexpectedly enable a resource-intensive profiling feature, leading to performance degradation. To make the code more robust and future-proof, it's safer to explicitly specify these parameters to ensure the intended behavior.
```python
experimental_config = torch_npu.profiler._ExperimentalConfig(
    export_type=torch_npu.profiler.ExportType.Text,
    profiler_level=torch_npu.profiler.ProfilerLevel.Level1,
    msprof_tx=False,
    aic_metrics=torch_npu.profiler.AiCMetrics.AiCoreNone,
    l2_cache=False,
    op_attr=False,
    data_simplification=True,
    record_op_args=False,
    gc_detect_threshold=None,
)
```
The default values for these parameters are stable and will not change. We prefer to keep the code concise by relying on these defaults.
```python
    profiler_level=torch_npu.profiler.ProfilerLevel.Level1,
    msprof_tx=False,
    aic_metrics=torch_npu.profiler.AiCMetrics.AiCoreNone,
    l2_cache=False,
```
I think it's better to keep these default values in case a user wants to enable one of these parameters but doesn't know the concrete parameter name.
Thanks for the suggestion. I agree that explicitly listing these parameters makes it easier for users to discover and enable them if needed. I have restored the parameters with their default values.
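The trade-off discussed above can be sketched in plain Python: keep every `_ExperimentalConfig` parameter spelled out in one place, so a change in upstream defaults cannot silently alter profiling behaviour, while still allowing concise call sites. `EXPLICIT_DEFAULTS` and `experimental_config_kwargs` are illustrative names, not vLLM Ascend code.

```python
# Explicit defaults for the torch_npu experimental profiler config.
# Listing them here documents every knob a user might want to flip.
EXPLICIT_DEFAULTS = {
    "msprof_tx": False,
    "l2_cache": False,
    "op_attr": False,
    "record_op_args": False,
    "gc_detect_threshold": None,
    "data_simplification": True,
}

def experimental_config_kwargs(**overrides):
    """Start from the explicit defaults, then apply caller overrides."""
    kwargs = dict(EXPLICIT_DEFAULTS)
    kwargs.update(overrides)
    return kwargs

# A caller can enable a single feature without restating the rest:
cfg = experimental_config_kwargs(l2_cache=True)
```

The resulting dict could then be splatted into `_ExperimentalConfig(**cfg)`, keeping the call site short while the defaults stay pinned.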
Pick from #6141 to make the profiler work as expected. This PR also makes sure `VLLM_TORCH_PROFILER_WITH_STACK` and `VLLM_TORCH_PROFILER_WITH_PROFILE_MEMORY` behave the same as in v0.12.0. Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
### What this PR does / why we need it?

This PR optimizes the torch_npu profiler configuration to significantly reduce overhead and trace file size. The key changes include:

**Enable Data Simplification**: Explicitly sets `data_simplification=True` in `_ExperimentalConfig`. This filters out unnecessary intermediate data during profiling, drastically reducing the memory footprint and I/O overhead.

**Use Lightweight Stack Tracing**: Replaces `with_stack` with `with_modules` when `torch_profiler_with_stack` is enabled. In torch_npu, `with_stack` introduces heavy latency; `with_modules` provides equivalent semantic information with much lower overhead.

**Code Simplification**: Removes redundant parameter configurations in `_ExperimentalConfig` by relying on default values, making the codebase cleaner and easier to maintain.

**Test setup:** max length = 50, profiler + stack enabled

**Before optimization:**
- Profiler data size: 651 MB
- Generate time: 3 seconds

**After optimization:**
- Profiler data size: 156 MB (≈76% reduction)
- Generate time: <1 second

### Does this PR introduce _any_ user-facing change?

No API changes. Users profiling on Ascend will experience faster profiling execution and smaller trace files when stack tracing is enabled.

### How was this patch tested?

Manually verified on Ascend NPU by running vLLM with the profiler enabled. Confirmed that trace files are generated correctly, containing the necessary stack/module info, while showing the reported reduction in size and time.

- vLLM version: v0.13.0
- vLLM main: vllm-project/vllm@d682094

Signed-off-by: mengchengTang <745274877@qq.com>
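The stack-tracing change above can be sketched without an NPU: route the stack-tracing request to `with_modules` instead of `with_stack`, driven by the environment variables the PR mentions. `build_profiler_kwargs` is a hypothetical helper for illustration; the real vLLM Ascend wiring may differ.

```python
def build_profiler_kwargs(env: dict) -> dict:
    """Translate VLLM_TORCH_PROFILER_* env vars into profiler kwargs.

    Hypothetical sketch: when stack tracing is requested, prefer
    with_modules (cheap module-level call-site info on torch_npu)
    over with_stack (full Python stacks, heavy latency).
    """
    want_stack = env.get("VLLM_TORCH_PROFILER_WITH_STACK", "1") == "1"
    want_memory = env.get("VLLM_TORCH_PROFILER_WITH_PROFILE_MEMORY", "0") == "1"
    return {
        "with_modules": want_stack,   # lightweight replacement
        "with_stack": False,          # never pay the heavy path
        "profile_memory": want_memory,
    }

# Example: stack tracing requested, memory profiling left off.
kwargs = build_profiler_kwargs({"VLLM_TORCH_PROFILER_WITH_STACK": "1"})
```

The returned dict would be passed to the torch_npu profiler constructor; only the `with_modules`/`with_stack` swap differs from the pre-PR behaviour.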