Fix incorrect MLAPO weight release in PD mixed scenarios. #4774
wangxiyuan merged 3 commits into vllm-project:main
Conversation
Signed-off-by: ZYang6263 <zy626375@gmail.com>
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:
If CI fails, you can run linting and testing checks locally according to Contributing and Testing.
Code Review
This pull request correctly addresses a bug where MLAPO weights were being released prematurely in mixed prefill-decode scenarios. The approach of moving the weight release to a conditional block that checks for a KV consumer role is sound. However, the current implementation introduces a memory leak in non-PD scenarios by failing to release weights that were previously freed. I have added a critical comment with a suggested fix to ensure memory is managed correctly in all configurations.
    if self.vllm_config.kv_transfer_config is not None and \
            self.vllm_config.kv_transfer_config.is_kv_consumer:
        self.fused_qkv_a_proj.weight = None
        self.fused_qkv_a_proj.deq_scale = None
        self.fused_qkv_a_proj.quant_bias = None
        self.q_proj.weight = None
        self.q_proj.deq_scale = None
        self.q_proj.quant_bias = None
        torch.npu.empty_cache()
The current logic for releasing MLAPO weights only covers the case where the node is a KV consumer in a Prefill-Decode (PD) mixed scenario. This correctly fixes the original bug but introduces a memory leak in non-PD (standalone) scenarios.
Previously, self.fused_qkv_a_proj.weight was released unconditionally after its data was processed. With this change, it is no longer released in non-PD scenarios, as self.vllm_config.kv_transfer_config would be None.
The logic should be updated to release the processed weights in both non-PD scenarios and on the consumer side of PD scenarios. This ensures memory is freed correctly in all configurations.
Suggested change:

    # Before: releases weights only on the KV-consumer side of PD deployments
    if self.vllm_config.kv_transfer_config is not None and \
            self.vllm_config.kv_transfer_config.is_kv_consumer:
        self.fused_qkv_a_proj.weight = None
        self.fused_qkv_a_proj.deq_scale = None
        self.fused_qkv_a_proj.quant_bias = None
        self.q_proj.weight = None
        self.q_proj.deq_scale = None
        self.q_proj.quant_bias = None
        torch.npu.empty_cache()

    # After: also releases weights in standalone (non-PD) deployments
    if self.vllm_config.kv_transfer_config is None or \
            self.vllm_config.kv_transfer_config.is_kv_consumer:
        self.fused_qkv_a_proj.weight = None
        self.fused_qkv_a_proj.deq_scale = None
        self.fused_qkv_a_proj.quant_bias = None
        self.q_proj.weight = None
        self.q_proj.deq_scale = None
        self.q_proj.quant_bias = None
        torch.npu.empty_cache()
…ct#4774)
### What this PR does / why we need it?
Fix incorrect MLAPO weight release in PD mixed scenarios.
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: v0.12.0
- vLLM main: vllm-project/vllm@ad32e3e
Signed-off-by: ZYang6263 <zy626375@gmail.com>
Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com>
Signed-off-by: weijinqian_v1 <weijinqian@huawei.com>
…... (#5192)
### What this PR does / why we need it?
- Problem: In MLA+MLAPO, KV-consumer deployments keep fused_qkv_a_proj/q_proj weights and quant params even though MLAPO uses the prepacked buffers, increasing memory footprint on decode nodes.
- Fix: Conditionally drop those tensors only when `kv_transfer_config.is_kv_consumer` to reclaim memory (consistent with the SFA behavior in #4774).
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: v0.12.0
- vLLM main: vllm-project/vllm@ad32e3e
Signed-off-by: Chen Chen <0109chenchen@gmail.com>
Signed-off-by: zrj026 <zhangrunjiang026@gmail.com>
What this PR does / why we need it?
Fix incorrect MLAPO weight release in PD mixed scenarios.
Does this PR introduce any user-facing change?
How was this patch tested?