[Perf] Optimize fused_experts quantization code to save npu memory #784
Conversation
Force-pushed 4f4df17 to 70b8c9f
Signed-off-by: ApsarasX <[email protected]>
Force-pushed 70b8c9f to b95ca84
@linfeng-yuan @ganyi1996ppo @zzzzwwjj @Yikun Please review this PR.
wangxiyuan left a comment:
Please attach the improvement results later to make sure this PR works as expected. Thanks.
```diff
  dynamic_scale = None
- down_out_list = apply_mlp(expand_x,
+ # place hidden_states in a list to transfer its ownership into the `apply_mlp` function
```
I'm fine with this quick fix. Please describe the case more clearly later; for example, this copy-to-list action aims to deal with the tensor's lifecycle so that memory usage can be improved.
```diff
- down_out_list = apply_mlp(expand_x,
+ # place hidden_states in a list to transfer its ownership into the `apply_mlp` function
+ hidden_states_wrapper = [expand_x]
+ del expand_x
```
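For context, the ownership-transfer pattern under review works roughly like this (a minimal, self-contained sketch; the real `apply_mlp` takes more arguments, and the matmul here is a stand-in for the actual MLP):

```python
import torch

def apply_mlp(hidden_states_wrapper: list, w: torch.Tensor) -> torch.Tensor:
    # Pop the tensor out of the wrapper so this local variable becomes the
    # only reference; the caller has already run `del` on its own name.
    hidden_states = hidden_states_wrapper.pop()
    # Rebinding the name drops the last reference to the input activation,
    # so its memory can be reclaimed before the function returns.
    hidden_states = torch.matmul(hidden_states, w)
    return hidden_states

expand_x = torch.randn(16, 64)
w = torch.randn(64, 64)
hidden_states_wrapper = [expand_x]
del expand_x  # the tensor now lives only inside the wrapper list
down_out = apply_mlp(hidden_states_wrapper, w)
```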
There are still a lot of `del` actions everywhere. We could create a new function like `release_tensor` to make the `del` action clearer for developers.
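One possible shape for such a helper (a sketch only; `release_tensor` is the reviewer's suggested name, not code from this PR, and freeing via a storage resize is just one way it could be implemented):

```python
import torch

def release_tensor(t: torch.Tensor) -> None:
    """Release the tensor's device memory in place.

    Unlike `del`, which only removes one Python name, this frees the
    underlying storage immediately, so stray aliases cannot keep the
    memory alive. Reading from `t` afterwards is invalid.
    """
    t.untyped_storage().resize_(0)

buf = torch.randn(1024, 1024)
release_tensor(buf)  # memory returns to the allocator right away
```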
Referenced in a downstream merge commit: …e npu memory (vllm-project#784). Merge branch dev-v0.8.5.508-memory-optimization of [email protected]:Theta/vllm-ascend.git into dev-v0.8.5.508 (https://code.alipay.com/Theta/vllm-ascend/pull_requests/9). Signed-off-by: 康安 <[email protected]>
Referenced in vllm-project#966:

### What this PR does / why we need it?
1. In previous PRs #580 #784, I saved NPU memory by promptly deleting unnecessary tensors. For tensors passed from upper-layer functions, I used a list container to transfer the parameter and then popped the tensor from the list within the inner function to achieve deletion. Recently, I discovered a better implementation in sglang, the `dispose_tensor` function, and I recommend adopting this approach.
2. Dispose of `hidden_states` and `residual` from the previous layer once they're no longer used.
3. Avoid generating `self.inputs_embeds` in `ModelRunnerV1` in non-multimodal scenarios.

With the aforementioned optimizations, using the DeepSeek-R1-W8A8 model under the conditions of `TP=16` and `max-model-len=32768`, we can save 1.3GB of NPU memory.

**Reference**: sgl-project/sglang#6147

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?

Signed-off-by: ApsarasX <[email protected]>
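For illustration, a `dispose_tensor`-style helper along the lines of the sglang approach referenced above might look like this (a sketch under that assumption; the exact upstream implementation may differ):

```python
import torch

def dispose_tensor(x: torch.Tensor) -> None:
    # Detach `x` from its current storage by pointing it at an empty
    # buffer; if no other alias exists, the original device memory is
    # freed immediately instead of when the name goes out of scope.
    x.set_(torch.empty((0,), device=x.device, dtype=x.dtype))

hidden_states = torch.randn(8, 4096)
new_states = torch.relu(hidden_states)
dispose_tensor(hidden_states)  # old activation buffer released here
```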
What this PR does / why we need it?
In the w8a8 quantization code of `fused_experts`, the output of almost every operator is assigned a new variable name. If we want to save NPU memory, we have to manually `del` these variables to end their lifecycles, which fills the code with `del` statements and looks inelegant. Therefore, I plan to name the output of most operators `hidden_states`, so that each assignment ends the lifecycle of the previous `hidden_states`, as sketched below.
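A minimal illustration of the renaming pattern (hypothetical ops standing in for the real quantization kernels):

```python
import torch

hidden_states = torch.randn(16, 4096)
w1 = torch.randn(4096, 4096)
w2 = torch.randn(4096, 4096)

# Reusing one variable name means every assignment drops the last
# reference to the previous activation, so its memory is reclaimed
# without any explicit `del` statements.
hidden_states = torch.matmul(hidden_states, w1)
hidden_states = torch.relu(hidden_states)
hidden_states = torch.matmul(hidden_states, w2)
```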
Does this PR introduce any user-facing change?
No
How was this patch tested?