Fix incorrect MLAPO weight release in PD mixed scenarios. #4774
wangxiyuan merged 3 commits into vllm-project:main
Conversation
Signed-off-by: ZYang6263 <zy626375@gmail.com>
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:
If CI fails, you can run linting and testing checks locally according to Contributing and Testing.
Code Review
This pull request correctly addresses a bug where MLAPO weights were being released prematurely in mixed prefill-decode scenarios. The approach of moving the weight release to a conditional block that checks for a KV consumer role is sound. However, the current implementation introduces a memory leak in non-PD scenarios by failing to release weights that were previously freed. I have added a critical comment with a suggested fix to ensure memory is managed correctly in all configurations.
    if self.vllm_config.kv_transfer_config is not None and \
            self.vllm_config.kv_transfer_config.is_kv_consumer:
        self.fused_qkv_a_proj.weight = None
        self.fused_qkv_a_proj.deq_scale = None
        self.fused_qkv_a_proj.quant_bias = None
        self.q_proj.weight = None
        self.q_proj.deq_scale = None
        self.q_proj.quant_bias = None
        torch.npu.empty_cache()
The current logic for releasing MLAPO weights only covers the case where the node is a KV consumer in a Prefill-Decode (PD) mixed scenario. This correctly fixes the original bug but introduces a memory leak in non-PD (standalone) scenarios.
Previously, self.fused_qkv_a_proj.weight was released unconditionally after its data was processed. With this change, it is no longer released in non-PD scenarios, as self.vllm_config.kv_transfer_config would be None.
The logic should be updated to release the processed weights in both non-PD scenarios and on the consumer side of PD scenarios. This ensures memory is freed correctly in all configurations.
Suggested change:

    # Before: releases weights only on the KV-consumer side of PD deployments
    if self.vllm_config.kv_transfer_config is not None and \
            self.vllm_config.kv_transfer_config.is_kv_consumer:
        self.fused_qkv_a_proj.weight = None
        self.fused_qkv_a_proj.deq_scale = None
        self.fused_qkv_a_proj.quant_bias = None
        self.q_proj.weight = None
        self.q_proj.deq_scale = None
        self.q_proj.quant_bias = None
        torch.npu.empty_cache()

    # After: also releases weights in standalone (non-PD) deployments
    if self.vllm_config.kv_transfer_config is None or \
            self.vllm_config.kv_transfer_config.is_kv_consumer:
        self.fused_qkv_a_proj.weight = None
        self.fused_qkv_a_proj.deq_scale = None
        self.fused_qkv_a_proj.quant_bias = None
        self.q_proj.weight = None
        self.q_proj.deq_scale = None
        self.q_proj.quant_bias = None
        torch.npu.empty_cache()
…ct#4774)
### What this PR does / why we need it?
Fix incorrect MLAPO weight release in PD mixed scenarios.
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: v0.12.0
- vLLM main: vllm-project/vllm@ad32e3e
Signed-off-by: ZYang6263 <zy626375@gmail.com>
Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com>
Signed-off-by: weijinqian_v1 <weijinqian@huawei.com>
…... (#5192)
### What this PR does / why we need it?
- Problem: In MLA+MLAPO, KV-consumer deployments keep fused_qkv_a_proj/q_proj weights and quant params even though MLAPO uses the prepacked buffers, increasing memory footprint on decode nodes.
- Fix: Conditionally drop those tensors only when `kv_transfer_config.is_kv_consumer` to reclaim memory (consistent with the SFA behavior in #4774).
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: v0.12.0
- vLLM main: vllm-project/vllm@ad32e3e
Signed-off-by: Chen Chen <0109chenchen@gmail.com>
Signed-off-by: zrj026 <zhangrunjiang026@gmail.com>
What this PR does / why we need it?
Fix incorrect MLAPO weight release in PD mixed scenarios.
Does this PR introduce any user-facing change?
How was this patch tested?