[perf] Fix MLAPO weight disposal for KV-consumer MLA in PD-mix deploy... #5192

Merged
zzzzwwjj merged 1 commit into vllm-project:main from kiscad:fix-mlapo-mla
Jan 5, 2026

Conversation

kiscad (Contributor) commented Dec 19, 2025

What this PR does / why we need it?

  • Problem: In MLA+MLAPO, KV-consumer deployments keep fused_qkv_a_proj/q_proj weights and quant params even though MLAPO uses the prepacked buffers, increasing memory footprint on decode nodes.
  • Fix: Conditionally drop those tensors only when kv_transfer_config.is_kv_consumer to reclaim memory (consistent with the SFA behavior in #4774, "Fix incorrect MLAPO weight release in PD mixex scenarios").
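The disposal logic described above can be sketched roughly as follows. This is a minimal illustration, not the PR's actual code: the helper name `dispose_prepacked_weights` and the stand-in `SimpleNamespace` layer are hypothetical, while `fused_qkv_a_proj`, `q_proj`, and `is_kv_consumer` come from the PR description (the real code operates on torch modules and also clears quantization params).

```python
from types import SimpleNamespace

def dispose_prepacked_weights(layer, is_kv_consumer: bool) -> None:
    """Free projection weights that MLAPO has already prepacked.

    Hypothetical sketch: on KV-consumer (decode-only) nodes the MLAPO
    kernel reads its own prepacked buffers, so the original weights are
    dead memory and can be released.
    """
    if not is_kv_consumer:
        # Producer / mixed nodes may still run prefill paths that read
        # the original weights, so nothing is released.
        return
    for name in ("fused_qkv_a_proj", "q_proj"):
        proj = getattr(layer, name, None)
        if proj is not None and getattr(proj, "weight", None) is not None:
            proj.weight = None  # let the allocator reclaim the buffer

# Usage on a stand-in layer:
layer = SimpleNamespace(
    fused_qkv_a_proj=SimpleNamespace(weight=bytearray(1024)),
    q_proj=SimpleNamespace(weight=bytearray(1024)),
)
dispose_prepacked_weights(layer, is_kv_consumer=True)
print(layer.q_proj.weight)  # None
```

On a producer node (`is_kv_consumer=False`) the function is a no-op, which is the conditional behavior the PR adds.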

Does this PR introduce any user-facing change?

How was this patch tested?

gemini-code-assist (bot) left a comment:

Code Review

This pull request aims to fix an issue with MLAPO weight disposal in PD-mix deployments where MLA acts as a KV consumer. The change introduces logic to conditionally clear certain weights and quantization parameters to free up memory on KV consumer nodes after they have been processed for the MLAPO kernel. While the logic for freeing memory seems correct for the intended scenario, it introduces a potential critical issue by making the _process_weights_for_fused_mlapo method non-idempotent. A second invocation on a KV consumer node would lead to a crash. I've added a comment with a suggestion to add a guard to prevent this.
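The guard the review suggests could look like the sketch below. Everything here is hypothetical scaffolding (the flag `_mlapo_processed`, the `SimpleNamespace` layer, the elided prepack step); only the method's purpose and the `is_kv_consumer` condition come from the discussion above.

```python
from types import SimpleNamespace

def process_weights_for_fused_mlapo(layer, is_kv_consumer: bool) -> None:
    # Idempotency guard (the reviewer's suggestion): once the weights are
    # dropped on a KV consumer, a second invocation would try to repack
    # from None and crash, so return early if already processed.
    if getattr(layer, "_mlapo_processed", False):
        return
    layer._mlapo_processed = True

    # ... pack fused_qkv_a_proj / q_proj into the MLAPO buffers (elided) ...

    if is_kv_consumer:
        layer.fused_qkv_a_proj.weight = None
        layer.q_proj.weight = None

layer = SimpleNamespace(
    fused_qkv_a_proj=SimpleNamespace(weight=bytearray(16)),
    q_proj=SimpleNamespace(weight=bytearray(16)),
)
process_weights_for_fused_mlapo(layer, is_kv_consumer=True)
process_weights_for_fused_mlapo(layer, is_kv_consumer=True)  # safe no-op
```

Without the early return, the second call would reach the prepack step with `weight` already set to `None`.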

kiscad changed the title [perf] Fix MLAPO weight disposal for KV-consumer MLA in PD-mix deploy… [perf] Fix MLAPO weight disposal for KV-consumer MLA in PD-mix deploy... Dec 19, 2025
github-actions (bot) commented:
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:

  • A PR should do only one thing, smaller PRs enable faster reviews.
  • Every PR should include unit tests and end-to-end tests to ensure it works and is not broken by future PRs.
  • Write the commit message and fill in the PR description to help reviewers and future developers understand.

If CI fails, you can run linting and testing checks locally according to Contributing and Testing.

jianzs (Collaborator) commented Dec 22, 2025

If this step is taken, it means the decoding node must skip the prefill stage. How is this currently guaranteed? What happens if the request is preempted and needs to be computed?

kiscad (Contributor, author) commented Dec 23, 2025

> If this step is taken, it means the decoding node must skip the prefill stage. How is this currently guaranteed? What happens if the request is preempted and needs to be computed?

A decoding node may never enter the prefill stage.

ZYang6263 (Collaborator) commented:

Similar to SFA, this PR releases unused memory.

jianzs (Collaborator) commented Dec 23, 2025

> A decoding node may never enter the prefill stage.

How can we ensure this assumption holds? What if some requests get preempted due to insufficient KV cache?

Comment on lines -942 to -944

    "ep": {
        "hccl_buffer_size": calculate_ep_buffer_size()
    },

A Collaborator commented: What does this code have to do with the current PR?

kiscad (Contributor, author) replied: Actually, it doesn't. I will move it to another PR.

kiscad force-pushed the fix-mlapo-mla branch 2 times, most recently from 784af55 to 2c6c2e0 (December 23, 2025 07:18)
kiscad (Contributor, author) commented Dec 23, 2025

> How can we ensure this assumption holds? What if some requests get preempted due to insufficient KV cache?

For KV-cache insufficiency/preemption: on consumer nodes we rely on recompute_scheduler_enable. RecomputeScheduler handles allocation failures by dropping the preempted request back to the PD proxy (stop_reason="recomputed") instead of recomputing locally, so the prompt/prefill gets recomputed on a producer and KV is reloaded before decode resumes.
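The fallback described above amounts to a routing decision on allocation failure. The sketch below is purely illustrative (the function name and return shape are invented); only `recompute_scheduler_enable` and the `stop_reason="recomputed"` value come from the comment it follows.

```python
def handle_allocation_failure(request_id: str,
                              recompute_scheduler_enable: bool) -> dict:
    """Hypothetical sketch of the preemption fallback described above.

    On a KV-consumer node the prefill weights have been released, so a
    preempted request cannot be recomputed locally; with the recompute
    scheduler enabled it is handed back to the PD proxy instead.
    """
    if recompute_scheduler_enable:
        # The proxy re-runs prefill on a producer node and reloads KV
        # before decode resumes on the consumer.
        return {"request_id": request_id,
                "action": "return_to_proxy",
                "stop_reason": "recomputed"}
    # Default engine behavior would be a local recompute, which a
    # consumer node with dropped prefill weights cannot serve.
    return {"request_id": request_id, "action": "recompute_locally"}
```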

jianzs (Collaborator) commented Dec 23, 2025

> For KV-cache insufficiency/preemption: on consumer nodes we rely on recompute_scheduler_enable. RecomputeScheduler handles allocation failures by dropping the preempted request back to the PD proxy (stop_reason="recomputed") instead of recomputing locally, so the prompt/prefill gets recomputed on a producer and KV is reloaded before decode resumes.

The recomputed feature is currently only available in the vllm-ascend proxy's examples folder and is not yet a stable solution, so it's not ready for production use.

weijinqian0 added the "ready" (read for review) and "ready-for-test" (start test by label for PR) labels Dec 23, 2025
jianzs (Collaborator) commented Dec 24, 2025

It would be better to adjust the condition for #4774 as well.

kiscad (Contributor, author) commented Dec 24, 2025

> It would be better to adjust the condition for #4774 as well.

Yeah, good suggestion.

github-actions (bot) commented:
This pull request has conflicts, please resolve those before we can evaluate the pull request.

…ments

Signed-off-by: Chen Chen <0109chenchen@gmail.com>
zzzzwwjj merged commit a2daacb into vllm-project:main Jan 5, 2026
19 checks passed
845473182 pushed a commit to 845473182/vllm-ascend that referenced this pull request Jan 6, 2026
…to FIA_rebase

* 'main' of https://github.com/vllm-project/vllm-ascend: (58 commits)
  [Main2Main] Upgrade vllm commit to 0106 (vllm-project#5617)
  [CI]update bisheng version (vllm-project#5621)
  [UT][PCP&DCP] UT for block_table.py (vllm-project#5032)
  [Main2Main] Upgrade vllm commit to 0105 (vllm-project#5595)
  [CI] mv ops to correct path (vllm-project#5615)
  [BugFix] Fix Smoke Testing Bug for DSR1 longseq (vllm-project#5613)
  Revert "[Feat] enable hierarchical mc2 ops on A2 by default (vllm-project#5545)" (vllm-project#5611)
  [TRITON][TEST]Add nightly test for triton split_qkv_rmsnorm_rope (vllm-project#5267)
  [perf] Fix MLAPO weight disposal for KV-consumer MLA in PD-mix deploy... (vllm-project#5192)
  [docs] Correct image about prefill phase of PCP (vllm-project#5598)
  [CI] update triton-ascend version (vllm-project#5584)
  [P/D]Remove mooncake kvpool unused parameter `local_hostname` (vllm-project#5574)
  [Bugfix] record cos and sin cache in AscendRotaryEmbedding (vllm-project#5516)
  [bugfix] fix test_camem failed with triton-ascend (vllm-project#5492)
  [UT]add triton ops ut :  test_fused_qkvzba_split_reshape_cat (vllm-project#5474)
  [CI] Download models from ms (vllm-project#5405)
  Docs: Add A3 Docker image guidance for Atlas A3 machines (vllm-project#5256)
  [Doc] Add NNAL installation guide and requirements (vllm-project#5235)
  Add the requirement of arctic-inference which  speculative decoding with suffix_decode  (vllm-project#5045)
  [BugFix][Fusion] Fix graph fusion failure problem (vllm-project#5253)
  ...
Rozwel-dx pushed a commit to Rozwel-dx/vllm-ascend that referenced this pull request Jan 8, 2026
…... (vllm-project#5192)

### What this PR does / why we need it?

- Problem: In MLA+MLAPO, KV-consumer deployments keep
fused_qkv_a_proj/q_proj weights and quant params even though MLAPO uses
the prepacked buffers, increasing memory footprint on decode nodes.
- Fix: Conditionally drop those tensors only when
`kv_transfer_config.is_kv_consumer` to reclaim memory (consistent with
the SFA behavior vllm-project#4774 ).

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.12.0
- vLLM main:
vllm-project/vllm@ad32e3e

Signed-off-by: Chen Chen <0109chenchen@gmail.com>
aipaes pushed a commit to aipaes/vllm-ascend that referenced this pull request Jan 15, 2026
ZRJ026 pushed a commit to ZRJ026/vllm-ascend that referenced this pull request Feb 28, 2026
maoxx241 pushed a commit to maoxx241/vllm-ascend that referenced this pull request Mar 2, 2026
ZRJ026 pushed a commit to ZRJ026/vllm-ascend that referenced this pull request Mar 4, 2026
LCAIZJ pushed a commit to LCAIZJ/vllm-ascend that referenced this pull request Mar 7, 2026

Labels

ready (read for review), ready-for-test (start test by label for PR)


5 participants