
Fix incorrect MLAPO weight release in PD mixed scenarios. #4774

Merged
wangxiyuan merged 3 commits into vllm-project:main from ZYang6263:pr-mlapo-fix
Dec 8, 2025
Conversation

@ZYang6263
Collaborator

@ZYang6263 ZYang6263 commented Dec 8, 2025

What this PR does / why we need it?

Fix incorrect MLAPO weight release in PD mixed scenarios.

Does this PR introduce any user-facing change?

How was this patch tested?

Signed-off-by: ZYang6263 <zy626375@gmail.com>
@ZYang6263 ZYang6263 marked this pull request as ready for review December 8, 2025 05:33
@github-actions
Contributor

github-actions bot commented Dec 8, 2025

👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:

  • A PR should do only one thing, smaller PRs enable faster reviews.
  • Every PR should include unit tests and end-to-end tests to ensure it works and is not broken by future PRs.
  • Fill in the PR description and write a clear commit message to help reviewers and future developers understand the change.

If CI fails, you can run linting and testing checks locally according to the Contributing and Testing guides.

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request correctly addresses a bug where MLAPO weights were being released prematurely in mixed prefill-decode scenarios. The approach of moving the weight release into a conditional block that checks for a KV consumer role is sound. However, the current implementation introduces a memory leak in non-PD scenarios by no longer releasing weights that were previously freed unconditionally. I have added a critical comment with a suggested fix to ensure memory is managed correctly in all configurations.

Comment on lines +673 to 681
if self.vllm_config.kv_transfer_config is not None and \
        self.vllm_config.kv_transfer_config.is_kv_consumer:
    self.fused_qkv_a_proj.weight = None
    self.fused_qkv_a_proj.deq_scale = None
    self.fused_qkv_a_proj.quant_bias = None
    self.q_proj.weight = None
    self.q_proj.deq_scale = None
    self.q_proj.quant_bias = None
    torch.npu.empty_cache()

critical

The current logic for releasing MLAPO weights only covers the case where the node is a KV consumer in a Prefill-Decode (PD) mixed scenario. This correctly fixes the original bug but introduces a memory leak in non-PD (standalone) scenarios.

Previously, self.fused_qkv_a_proj.weight was released unconditionally after its data was processed. With this change, it is no longer released in non-PD scenarios, as self.vllm_config.kv_transfer_config would be None.

The logic should be updated to release the processed weights in both non-PD scenarios and on the consumer side of PD scenarios. This ensures memory is freed correctly in all configurations.

Suggested change

-if self.vllm_config.kv_transfer_config is not None and \
-        self.vllm_config.kv_transfer_config.is_kv_consumer:
+if self.vllm_config.kv_transfer_config is None or \
+        self.vllm_config.kv_transfer_config.is_kv_consumer:
     self.fused_qkv_a_proj.weight = None
     self.fused_qkv_a_proj.deq_scale = None
     self.fused_qkv_a_proj.quant_bias = None
     self.q_proj.weight = None
     self.q_proj.deq_scale = None
     self.q_proj.quant_bias = None
     torch.npu.empty_cache()
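The reviewer's condition can be boiled down to a single predicate: release in standalone (non-PD) runs and on the KV-consumer (decode) side of PD runs, keep the weights on the producer side. A minimal sketch of that predicate, using a hypothetical `KVTransferConfig` stand-in rather than the actual vLLM config class:

```python
# Sketch of the release condition the review proposes (not the vllm-ascend
# source). `KVTransferConfig` is a hypothetical stand-in for illustration.
from dataclasses import dataclass
from typing import Optional


@dataclass
class KVTransferConfig:
    is_kv_consumer: bool


def should_release_weights(kv_transfer_config: Optional[KVTransferConfig]) -> bool:
    # Non-PD deployment: no transfer config, weights already consumed -> release.
    # PD deployment: release only on the consumer (decode) side.
    return kv_transfer_config is None or kv_transfer_config.is_kv_consumer


# Standalone (non-PD): release, avoiding the leak the review flags.
assert should_release_weights(None)
# PD consumer (decode node): release, as the PR intends.
assert should_release_weights(KVTransferConfig(is_kv_consumer=True))
# PD producer (prefill node): keep the weights -- the original bug was
# freeing them here.
assert not should_release_weights(KVTransferConfig(is_kv_consumer=False))
```

Writing it as `is None or is_kv_consumer` rather than duplicating the release block keeps the two cases from drifting apart in future edits.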

@wangxiyuan wangxiyuan merged commit 432b861 into vllm-project:main Dec 8, 2025
17 of 19 checks passed
weijinqian0 pushed a commit to weijinqian0/vllm-ascend that referenced this pull request Dec 9, 2025
…ct#4774)

### What this PR does / why we need it?
Fix incorrect MLAPO weight release in PD mixed scenarios.

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.12.0
- vLLM main:
vllm-project/vllm@ad32e3e

Signed-off-by: ZYang6263 <zy626375@gmail.com>
Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com>
Signed-off-by: weijinqian_v1 <weijinqian@huawei.com>
Clorist33 pushed a commit to Clorist33/vllm-ascend that referenced this pull request Dec 10, 2025
(same commit message as above)
Mercykid-bash pushed a commit to Mercykid-bash/vllm-ascend that referenced this pull request Dec 10, 2025
(same commit message as above)
zzzzwwjj pushed a commit that referenced this pull request Jan 5, 2026
…... (#5192)

### What this PR does / why we need it?

- Problem: In MLA+MLAPO, KV-consumer deployments keep
fused_qkv_a_proj/q_proj weights and quant params even though MLAPO uses
the prepacked buffers, increasing memory footprint on decode nodes.
- Fix: Conditionally drop those tensors only when
`kv_transfer_config.is_kv_consumer` to reclaim memory (consistent with
the SFA behavior #4774 ).

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.12.0
- vLLM main:
vllm-project/vllm@ad32e3e

Signed-off-by: Chen Chen <0109chenchen@gmail.com>
Rozwel-dx pushed a commit to Rozwel-dx/vllm-ascend that referenced this pull request Jan 8, 2026
(same commit message as above)
aipaes pushed a commit to aipaes/vllm-ascend that referenced this pull request Jan 15, 2026
(same commit message as above)
ZRJ026 pushed a commit to ZRJ026/vllm-ascend that referenced this pull request Feb 28, 2026
(same commit message as above)
maoxx241 pushed a commit to maoxx241/vllm-ascend that referenced this pull request Mar 2, 2026
(same commit message as above)
ZRJ026 pushed a commit to ZRJ026/vllm-ascend that referenced this pull request Mar 4, 2026
(same commit message as above)
LCAIZJ pushed a commit to LCAIZJ/vllm-ascend that referenced this pull request Mar 7, 2026
(same commit message as above)

2 participants