
Conversation

@ApsarasX (Collaborator) commented May 27, 2025

What this PR does / why we need it?

  1. In previous PRs [quantization] Support w8a8 quantization #580 and [Perf] Optimize fused_experts quantization code to save npu memory #784, I saved NPU memory by promptly deleting tensors that were no longer needed. For tensors passed down from upper-layer functions, I wrapped the argument in a list and popped the tensor from the list inside the inner function to drop the last reference. I recently found a cleaner implementation in sglang, the `dispose_tensor` function, and recommend adopting that approach instead.
  2. Dispose of `hidden_states` and `residual` from the previous layer once they are no longer used.
  3. Avoid allocating `self.inputs_embeds` in `ModelRunnerV1` in non-multimodal scenarios.

With these optimizations, running the DeepSeek-R1-W8A8 model with TP=16 and max-model-len=32768 saves 1.3 GB of NPU memory.
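For illustration, the two deletion patterns described in point 1 can be sketched with a tiny stand-in `Buffer` class in place of a real tensor. All names below are hypothetical; sglang's actual `dispose_tensor` works on `torch.Tensor` objects and frees the backing storage in place (e.g. by swapping in an empty tensor via `Tensor.set_`), which this sketch only approximates.

```python
class Buffer:
    """Stand-in for a tensor: holds a chunk of backing storage."""
    def __init__(self, size: int):
        self.data = bytearray(size)

# Old pattern: the caller passes the buffer inside a list so the inner
# function can pop it, dropping the last reference and freeing storage early.
def consume_via_list(box: list) -> int:
    buf = box.pop()          # the list no longer holds a reference
    result = len(buf.data)
    del buf                  # storage can be reclaimed here
    return result

# New pattern (after sglang's dispose_tensor): keep a normal parameter
# signature and release the backing storage in place.
def dispose(buf: Buffer) -> None:
    buf.data = bytearray(0)  # drop the storage immediately

def consume_via_dispose(buf: Buffer) -> int:
    result = len(buf.data)
    dispose(buf)             # caller still holds buf, but its storage is gone
    return result
```

The second pattern avoids threading a throwaway list container through the call chain, which is why the PR switches to it.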

Before
(screenshots: NPU memory usage before the change, 2025-05-27)

After
(screenshots: NPU memory usage after the change, 2025-05-27)

Reference: sgl-project/sglang#6147

Does this PR introduce any user-facing change?

No

How was this patch tested?

@ApsarasX added the `ready` (ready for review) label May 27, 2025
@jianzs left a comment

LGTM

@MengqingCao

LGTM, thanks for your efforts!

@ganyi1996ppo ganyi1996ppo merged commit e3c7f71 into vllm-project:main May 29, 2025
23 checks passed
raindaywhu added a commit to raindaywhu/vllm-ascend that referenced this pull request May 30, 2025
… main

* 'main' of https://github.com/raindaywhu/vllm-ascend:
  [aclgraph] implentment NPUPiecewiseBackend to enable aclgraph (vllm-project#836)
  [Bugfix][V1] Fix deepseek with v1 (vllm-project#958)
  [Perf] Refactor tensor disposal logic to reduce memory usage (vllm-project#966)
zxdukki pushed a commit to zxdukki/vllm-ascend that referenced this pull request Jun 3, 2025
…oject#966)

David9857 pushed a commit to David9857/vllm-ascend that referenced this pull request Jun 3, 2025
…oject#966)

ApsarasX pushed a commit that referenced this pull request Aug 19, 2025
I would like to nominate Wengang Chen (@ApsarasX
https://github.com/ApsarasX) as a maintainer, starting with my +1.

## Reason
Review Quality: He focuses on reviewing the vLLM Ascend core modules and has contributed 100+ high-quality reviews, such as #2326 (comment), #768 (comment), #2312 (comment), #2268 (comment), #2192 (comment), and #2156 (comment). This helped vLLM Ascend v0.9.x and v0.10.x be released with high quality.

Sustained and Quality Contributions: He has a very good habit of sharing his design ideas, development process, and performance test results, as in [#966](#966). He has contributed [many PRs](https://github.com/vllm-project/vllm-ascend/pulls?q=is%3Apr+author%3AApsarasX+is%3Amerged+), including valuable bugfixes and performance improvements.

Community Involvement: He is actively involved in community discussions, is collaborative, and helps users solve problems; he has participated in [120+ PRs and issues](https://github.com/vllm-project/vllm-ascend/issues?q=commenter%3AApsarasX). He was also a speaker at the [vLLM Beijing Meetup](https://mp.weixin.qq.com/s/7n8OYNrCC_I9SJaybHA_-Q).

So I think he's a great addition to the vLLM Ascend Maintainer team.

- ✅ Review Quality: 108+ PRs with valuable reviews (https://github.com/vllm-project/vllm-ascend/pulls?q=commenter%3AApsarasX), e.g. #2326, #768, #2312, #2268, #2192, #2156 (comments)
- ✅ Sustained and Major Contributions: https://github.com/vllm-project/vllm-ascend/pulls/ApsarasX
- ✅ Quality Contribution: well-documented work such as [Perf] Refactor tensor disposal logic to reduce memory usage #966 (https://github.com/vllm-project/vllm-ascend/pulls?q=is%3Apr+author%3AApsarasX+is%3Aclosed)
- ✅ Community Involvement: 7 issues (https://github.com/vllm-project/vllm-ascend/issues?q=is%3Aissue%20state%3Aclosed%20author%3AApsarasX); 120+ PRs and issues (https://github.com/vllm-project/vllm-ascend/issues?q=commenter%3AApsarasX)

Signed-off-by: wangxiyuan <[email protected]>
wangxiaoteng888 pushed a commit to LCAIZJ/vllm-ascend that referenced this pull request Sep 25, 2025
chopper0126 pushed a commit to chopper0126/vllm-ascend that referenced this pull request Sep 26, 2025
chopper0126 pushed a commit to chopper0126/vllm-ascend that referenced this pull request Oct 16, 2025
…oject#966)

Angazenn pushed a commit to Angazenn/vllm-ascend that referenced this pull request Oct 21, 2025
…oject#966)

Angazenn pushed a commit to Angazenn/vllm-ascend that referenced this pull request Oct 21, 2025