Conversation

@ApsarasX ApsarasX commented May 7, 2025

What this PR does / why we need it?

In the w8a8 quantization code of `fused_experts`, the output of almost every operator is assigned a new variable name. To save NPU memory, we have to manually `del` these variables to end their lifecycles, which fills the code with `del` statements and looks inelegant.
Therefore, I plan to name the output of most operators `hidden_states`, so that each rebinding ends the lifecycle of the previous `hidden_states`.
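The rebinding idea can be sketched as follows. This is a minimal stand-in, not the actual `fused_experts` code: the `Tensor` class and the three toy operators are hypothetical placeholders for real NPU ops, used only to show how CPython's reference counting frees each intermediate as soon as its name is rebound.

```python
import weakref

class Tensor:
    """Stand-in for an NPU tensor; the real code uses torch tensors."""
    def __init__(self, data):
        self.data = data

def fused_experts(hidden_states: Tensor) -> Tensor:
    # Rebinding `hidden_states` drops this function's reference to the
    # previous intermediate; once no name references a tensor, it is
    # freed immediately -- no `del` statements needed between operators.
    hidden_states = Tensor([v * 2 for v in hidden_states.data])        # toy "gate proj"
    hidden_states = Tensor([max(v, 0.0) for v in hidden_states.data])  # toy "activation"
    hidden_states = Tensor([v + 1 for v in hidden_states.data])        # toy "down proj"
    return hidden_states

x = Tensor([-1.0, 2.0])
probe = weakref.ref(x)   # watch the input tensor's lifetime
out = fused_experts(x)
del x                    # caller drops the last remaining reference
assert probe() is None   # the input tensor has been freed
```

The same principle is what makes the single-name style save memory on NPU: each operator's input is released as soon as its output is bound to the same name.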

Does this PR introduce any user-facing change?

No

How was this patch tested?

@ApsarasX ApsarasX force-pushed the optimize-w8a8-memory branch from 70b8c9f to b95ca84 Compare May 8, 2025 06:49

ApsarasX commented May 8, 2025

@linfeng-yuan @ganyi1996ppo @zzzzwwjj @Yikun

Please review this PR.


@wangxiyuan wangxiyuan left a comment


Please attach the improvement result later to make sure this PR works as expected. Thanks.

dynamic_scale = None

down_out_list = apply_mlp(expand_x,
# place hidden_states in a list to transfer its ownership into the `apply_mlp` function

I'm fine with this quick fix. You should describe the case more clearly later. For example, explain that this copy-to-list action is meant to manage the tensor's lifecycle so that memory usage can be improved.

down_out_list = apply_mlp(expand_x,
# place hidden_states in a list to transfer its ownership into the `apply_mlp` function
hidden_states_wrapper = [expand_x]
del expand_x
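The wrapper pattern in the snippet above can be sketched like this. It is a minimal illustration, not the real `apply_mlp` (which takes more arguments and runs quantized NPU ops); the `Tensor` subclass exists only so the stand-in is weak-referenceable and we can observe the deallocation.

```python
import weakref

class Tensor(list):
    """Weak-referenceable stand-in for an NPU tensor."""

def apply_mlp(hidden_states_wrapper: list) -> Tensor:
    # Pop the tensor so the wrapper list holds no reference to it; once
    # `hidden_states` is rebound below, the input tensor is freed even
    # though the caller's frame is still alive.
    hidden_states = hidden_states_wrapper.pop()
    assert not hidden_states_wrapper
    hidden_states = Tensor(v * 2 for v in hidden_states)  # stand-in for the MLP ops
    return hidden_states

expand_x = Tensor([1.0, 2.0])
probe = weakref.ref(expand_x)
wrapper = [expand_x]     # transfer ownership into the wrapper
del expand_x             # caller drops its own reference before the call
out = apply_mlp(wrapper)
assert probe() is None   # the input tensor was released inside apply_mlp
```

Without the wrapper, the caller's argument binding would keep the input tensor alive for the whole duration of `apply_mlp`, defeating the early release.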

There are still a lot of `del` actions everywhere. We could create a new function like `release_tensor` to make the `del` action clearer for developers.

@wangxiyuan wangxiyuan merged commit 324f819 into vllm-project:main May 9, 2025
14 checks passed
venus-taibai pushed a commit to venus-taibai/vllm-ascend that referenced this pull request May 15, 2025
…e npu memory (vllm-project#784)

Merge branch dev-v0.8.5.508-memory-optimization of [email protected]:Theta/vllm-ascend.git into dev-v0.8.5.508
https://code.alipay.com/Theta/vllm-ascend/pull_requests/9

Signed-off-by: 康安 <[email protected]>


* [Perf] Optimize fused_experts quantization code to save npu memory (vllm-project#784)
ganyi1996ppo pushed a commit that referenced this pull request May 29, 2025
### What this PR does / why we need it?
1. In previous PRs #580
#784, I saved GPU memory
by promptly deleting unnecessary tensors. For tensors passed from
upper-layer functions, I used a list container to transfer the parameter
and then popped the tensor from the list within the inner function to
achieve deletion. Recently, I discovered a better implementation in
sglang—the `dispose_tensor` function and I recommend adopting this
approach.
2. Dispose `hidden_states` and `residual` from the previous layer once
they're no longer used.
3. Avoid generating `self.inputs_embeds` in `ModelRunnerV1` in
non-multimodal scenarios.

With the aforementioned optimizations, using the DeepSeek-R1-W8A8 model
under the conditions of `TP=16` and `max-model-len=32768`, we can save
1.3GB of npu memory.

**Reference**: sgl-project/sglang#6147

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?

---------

Signed-off-by: ApsarasX <[email protected]>
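A minimal sketch of the `dispose_tensor` idea referenced above, modeled on sglang's helper (the exact upstream signature and behavior may differ). It frees a tensor's storage in place by swapping in an empty buffer, which releases the memory even while other references to the tensor object remain alive; this requires a torch environment.

```python
import torch

def dispose_tensor(x: torch.Tensor) -> None:
    """Release x's storage in place (sketch of sglang's dispose_tensor).

    Because set_() replaces the underlying storage, the memory is freed
    immediately, even if upper-layer frames still reference the tensor
    object -- no list-wrapper ownership transfer needed."""
    x.set_(torch.empty((0,), device=x.device, dtype=x.dtype))

hidden_states = torch.ones(1024, 1024)  # ~4 MB here; NPU tensors in the real code
alias = hidden_states                   # e.g. a reference held by a caller's frame
dispose_tensor(hidden_states)
assert alias.numel() == 0               # every alias now sees the emptied tensor
```

This is why it reads better than the wrapper pattern: callers keep their variables, and the callee can still guarantee the memory is gone.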
zxdukki pushed a commit to zxdukki/vllm-ascend that referenced this pull request Jun 3, 2025
…oject#966)
David9857 pushed a commit to David9857/vllm-ascend that referenced this pull request Jun 3, 2025
…oject#966)

chopper0126 pushed a commit to chopper0126/vllm-ascend that referenced this pull request Oct 16, 2025
…llm-project#784)

chopper0126 pushed a commit to chopper0126/vllm-ascend that referenced this pull request Oct 16, 2025
…oject#966)

Angazenn pushed a commit to Angazenn/vllm-ascend that referenced this pull request Oct 21, 2025
…llm-project#784)

Angazenn pushed a commit to Angazenn/vllm-ascend that referenced this pull request Oct 21, 2025
…oject#966)
