[Perf] Optimize fused_experts quantization code to save npu memory #784
Conversation
Force-pushed 4f4df17 to 70b8c9f
Signed-off-by: ApsarasX <[email protected]>
Force-pushed 70b8c9f to b95ca84
@linfeng-yuan @ganyi1996ppo @zzzzwwjj @Yikun Please review this PR.
wangxiyuan left a comment:
Please attach the improvement results later to make sure this PR works as expected. Thanks.
```diff
  dynamic_scale = None
- down_out_list = apply_mlp(expand_x,
+ # place hidden_states in a list to transfer its ownership into the `apply_mlp` function
```
I'm fine with this quick fix. Please describe the case more clearly later; for example, this copy-to-list action aims to deal with the tensor's lifecycle so that memory usage can be improved.
```diff
- down_out_list = apply_mlp(expand_x,
+ # place hidden_states in a list to transfer its ownership into the `apply_mlp` function
+ hidden_states_wrapper = [expand_x]
+ del expand_x
```
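For context, the ownership-transfer pattern under review works roughly like this (a minimal, self-contained sketch; the real `apply_mlp` takes more arguments, and the matmul here is a stand-in for the actual MLP):

```python
import torch

def apply_mlp(hidden_states_wrapper: list, w: torch.Tensor) -> torch.Tensor:
    # Pop the tensor out of the wrapper so this local variable becomes the
    # only reference; the caller has already run `del` on its own name.
    hidden_states = hidden_states_wrapper.pop()
    # Rebinding the name drops the last reference to the input activation,
    # so its memory can be reclaimed before the function returns.
    hidden_states = torch.matmul(hidden_states, w)
    return hidden_states

expand_x = torch.randn(16, 64)
w = torch.randn(64, 64)
hidden_states_wrapper = [expand_x]
del expand_x  # the tensor now lives only inside the wrapper list
down_out = apply_mlp(hidden_states_wrapper, w)
```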
There are still a lot of `del` actions everywhere. We could create a new function like `release_tensor` to make the `del` action clearer for developers.
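One possible shape for such a helper (a sketch only; `release_tensor` is the reviewer's suggested name, not code from this PR, and freeing via a storage resize is just one way it could be implemented):

```python
import torch

def release_tensor(t: torch.Tensor) -> None:
    """Release the tensor's device memory in place.

    Unlike `del`, which only removes one Python name, this frees the
    underlying storage immediately, so stray aliases cannot keep the
    memory alive. Reading from `t` afterwards is invalid.
    """
    t.untyped_storage().resize_(0)

buf = torch.randn(1024, 1024)
release_tensor(buf)  # memory returns to the allocator right away
```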
Referenced in a downstream merge commit: …e npu memory (vllm-project#784). Merge branch dev-v0.8.5.508-memory-optimization of [email protected]:Theta/vllm-ascend.git into dev-v0.8.5.508 (https://code.alipay.com/Theta/vllm-ascend/pull_requests/9). Signed-off-by: 康安 <[email protected]>
Referenced in vllm-project#966:

### What this PR does / why we need it?
1. In previous PRs #580 #784, I saved NPU memory by promptly deleting unnecessary tensors. For tensors passed from upper-layer functions, I used a list container to transfer the parameter and then popped the tensor from the list within the inner function to achieve deletion. Recently, I discovered a better implementation in sglang, the `dispose_tensor` function, and I recommend adopting this approach.
2. Dispose of `hidden_states` and `residual` from the previous layer once they're no longer used.
3. Avoid generating `self.inputs_embeds` in `ModelRunnerV1` in non-multimodal scenarios.

With the aforementioned optimizations, using the DeepSeek-R1-W8A8 model under the conditions of `TP=16` and `max-model-len=32768`, we can save 1.3GB of NPU memory.

**Reference**: sgl-project/sglang#6147

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?

Signed-off-by: ApsarasX <[email protected]>
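For illustration, a `dispose_tensor`-style helper along the lines of the sglang approach referenced above might look like this (a sketch under that assumption; the exact upstream implementation may differ):

```python
import torch

def dispose_tensor(x: torch.Tensor) -> None:
    # Detach `x` from its current storage by pointing it at an empty
    # buffer; if no other alias exists, the original device memory is
    # freed immediately instead of when the name goes out of scope.
    x.set_(torch.empty((0,), device=x.device, dtype=x.dtype))

hidden_states = torch.randn(8, 4096)
new_states = torch.relu(hidden_states)
dispose_tensor(hidden_states)  # old activation buffer released here
```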
What this PR does / why we need it?
In the w8a8 quantization code of `fused_experts`, the output of almost every operator is assigned a new variable name. If we want to save NPU memory, we have to manually `del` these variables to end their lifecycles, which fills the code with `del` statements and looks inelegant. Therefore, I plan to name the output of most operators `hidden_states`, so that each assignment ends the lifecycle of the previous `hidden_states`, as sketched below.
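A minimal illustration of the renaming pattern (hypothetical ops standing in for the real quantization kernels):

```python
import torch

hidden_states = torch.randn(16, 4096)
w1 = torch.randn(4096, 4096)
w2 = torch.randn(4096, 4096)

# Reusing one variable name means every assignment drops the last
# reference to the previous activation, so its memory is reclaimed
# without any explicit `del` statements.
hidden_states = torch.matmul(hidden_states, w1)
hidden_states = torch.relu(hidden_states)
hidden_states = torch.matmul(hidden_states, w2)
```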
Does this PR introduce any user-facing change?
No
How was this patch tested?