
Conversation

@ApsarasX (Collaborator) commented May 27, 2025

What this PR does / why we need it?

  1. In previous PRs [quantization] Support w8a8 quantization #580 and [Perf] Optimize fused_experts quantization code to save npu memory #784, I saved NPU memory by promptly deleting tensors that were no longer needed. For tensors passed down from upper-layer functions, I wrapped the argument in a list and popped the tensor from the list inside the inner function to drop the last reference. I recently found a cleaner implementation in sglang, the `dispose_tensor` function, and recommend adopting that approach instead.
  2. Dispose of `hidden_states` and `residual` from the previous layer once they are no longer used.
  3. Avoid allocating `self.inputs_embeds` in `ModelRunnerV1` in non-multimodal scenarios.

With these optimizations, running the DeepSeek-R1-W8A8 model with TP=16 and max-model-len=32768 saves 1.3 GB of NPU memory.
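For illustration, the two deletion patterns described in point 1 can be sketched with a tiny stand-in `Buffer` class in place of a real tensor. All names below are hypothetical; sglang's actual `dispose_tensor` works on `torch.Tensor` objects and frees the backing storage in place (e.g. by swapping in an empty tensor via `Tensor.set_`), which this sketch only approximates.

```python
class Buffer:
    """Stand-in for a tensor: holds a chunk of backing storage."""
    def __init__(self, size: int):
        self.data = bytearray(size)

# Old pattern: the caller passes the buffer inside a list so the inner
# function can pop it, dropping the last reference and freeing storage early.
def consume_via_list(box: list) -> int:
    buf = box.pop()          # the list no longer holds a reference
    result = len(buf.data)
    del buf                  # storage can be reclaimed here
    return result

# New pattern (after sglang's dispose_tensor): keep a normal parameter
# signature and release the backing storage in place.
def dispose(buf: Buffer) -> None:
    buf.data = bytearray(0)  # drop the storage immediately

def consume_via_dispose(buf: Buffer) -> int:
    result = len(buf.data)
    dispose(buf)             # caller still holds buf, but its storage is gone
    return result
```

The second pattern avoids threading a throwaway list container through the call chain, which is why the PR switches to it.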

Before
(screenshots: NPU memory usage before the change, 2025-05-27)

After
(screenshots: NPU memory usage after the change, 2025-05-27)

Reference: sgl-project/sglang#6147

Does this PR introduce any user-facing change?

No

How was this patch tested?

@ApsarasX added the `ready` (ready for review) label May 27, 2025
@jianzs left a comment

LGTM

@MengqingCao

LGTM, thanks for your efforts!

@ganyi1996ppo ganyi1996ppo merged commit e3c7f71 into vllm-project:main May 29, 2025
23 checks passed
raindaywhu added a commit to raindaywhu/vllm-ascend that referenced this pull request May 30, 2025
… main

* 'main' of https://github.com/raindaywhu/vllm-ascend:
  [aclgraph] implentment NPUPiecewiseBackend to enable aclgraph (vllm-project#836)
  [Bugfix][V1] Fix deepseek with v1 (vllm-project#958)
  [Perf] Refactor tensor disposal logic to reduce memory usage (vllm-project#966)
zxdukki pushed a commit to zxdukki/vllm-ascend that referenced this pull request Jun 3, 2025
…oject#966)

David9857 pushed a commit to David9857/vllm-ascend that referenced this pull request Jun 3, 2025
…oject#966)

ApsarasX pushed a commit that referenced this pull request Aug 19, 2025
I would like to nominate Wengang Chen (@ApsarasX
https://github.com/ApsarasX) as a maintainer, starting with my +1.

## Reason
Review Quality: He focuses on reviewing the vLLM Ascend core modules and has contributed 100+ high-quality reviews, such as #2326 (comment), #768 (comment), #2312 (comment), #2268 (comment), #2192 (comment), and #2156 (comment). This helped vLLM Ascend v0.9.x and v0.10.x be released with high quality.

Sustained and Quality Contributions: He has a very good habit of sharing his design ideas, development process, and performance test results, as in [#966](#966). He has contributed [many PRs](https://github.com/vllm-project/vllm-ascend/pulls?q=is%3Apr+author%3AApsarasX+is%3Amerged+), including valuable bugfixes and performance improvements.

Community Involvement: He is actively involved in community discussions, is collaborative, and helps users solve problems; he has participated in [120+ PRs and issues](https://github.com/vllm-project/vllm-ascend/issues?q=commenter%3AApsarasX). He was also a speaker at the [vLLM Beijing Meetup](https://mp.weixin.qq.com/s/7n8OYNrCC_I9SJaybHA_-Q).

So I think he's a great addition to the vLLM Ascend Maintainer team.

- ✅ Review Quality: 108+ PRs with valuable reviews (https://github.com/vllm-project/vllm-ascend/pulls?q=commenter%3AApsarasX), e.g. #2326, #768, #2312, #2268, #2192, #2156 (comments)
- ✅ Sustained and Major Contributions: https://github.com/vllm-project/vllm-ascend/pulls/ApsarasX
- ✅ Quality Contribution: well-documented work such as [Perf] Refactor tensor disposal logic to reduce memory usage #966 (https://github.com/vllm-project/vllm-ascend/pulls?q=is%3Apr+author%3AApsarasX+is%3Aclosed)
- ✅ Community Involvement: 7 issues (https://github.com/vllm-project/vllm-ascend/issues?q=is%3Aissue%20state%3Aclosed%20author%3AApsarasX); 120+ PRs and issues (https://github.com/vllm-project/vllm-ascend/issues?q=commenter%3AApsarasX)

Signed-off-by: wangxiyuan <[email protected]>
wangxiaoteng888 pushed a commit to LCAIZJ/vllm-ascend that referenced this pull request Sep 25, 2025
chopper0126 pushed a commit to chopper0126/vllm-ascend that referenced this pull request Sep 26, 2025
chopper0126 pushed a commit to chopper0126/vllm-ascend that referenced this pull request Oct 16, 2025
…oject#966)

Angazenn pushed a commit to Angazenn/vllm-ascend that referenced this pull request Oct 21, 2025
…oject#966)

Angazenn pushed a commit to Angazenn/vllm-ascend that referenced this pull request Oct 21, 2025