[perf] Fix MLAPO weight disposal for KV-consumer MLA in PD-mix deploy... #5192
zzzzwwjj merged 1 commit into vllm-project:main from
Conversation
Code Review
This pull request fixes an issue with MLAPO weight disposal in PD-mix deployments where MLA acts as a KV consumer. It introduces logic to conditionally clear certain weights and quantization parameters, freeing memory on KV-consumer nodes once they have been processed for the MLAPO kernel. While the memory-freeing logic appears correct for the intended scenario, it makes the `_process_weights_for_fused_mlapo` method non-idempotent: a second invocation on a KV-consumer node would crash. I've added a comment suggesting a guard to prevent this.
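To illustrate the guard suggested in the review, here is a minimal, hypothetical sketch. Only `_process_weights_for_fused_mlapo` comes from the PR; the class name, attribute names, and flag are illustrative stand-ins, not the actual vLLM Ascend implementation.

```python
# Hypothetical sketch of an idempotency guard for the weight-disposal path.
# All names except _process_weights_for_fused_mlapo are illustrative.

class FusedMLAPOLayer:
    def __init__(self, is_kv_consumer: bool):
        self.is_kv_consumer = is_kv_consumer
        self.fused_qkv_a_proj_weight = object()  # stand-in for a real tensor
        self._mlapo_weights_processed = False    # guard flag

    def _process_weights_for_fused_mlapo(self):
        # Without this guard, a second call on a KV-consumer node would
        # try to repack weights that were already freed and crash.
        if self._mlapo_weights_processed:
            return
        self._mlapo_weights_processed = True

        # ... pack weights into the prepacked MLAPO buffers here ...

        if self.is_kv_consumer:
            # Free the originals; MLAPO reads only the prepacked buffers.
            self.fused_qkv_a_proj_weight = None


layer = FusedMLAPOLayer(is_kv_consumer=True)
layer._process_weights_for_fused_mlapo()
layer._process_weights_for_fused_mlapo()  # safe: second call is a no-op
```

With the flag in place, repeated invocations (e.g. from weight reloading paths) become harmless no-ops instead of touching freed tensors.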
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:
If CI fails, you can run linting and testing checks locally according to Contributing and Testing.
If this step is taken, the decoding node must skip the prefill stage. How is that currently guaranteed? What happens if a request is preempted and needs to be recomputed?
aee08d2 to 590e4ad
The decoding node may never enter the prefill stage.
Similar to SFA, this PR releases unused memory.
How can we ensure this assumption holds? What if some requests get preempted due to insufficient KV cache?
vllm_ascend/utils.py (outdated)

```python
"ep": {
    "hccl_buffer_size": calculate_ep_buffer_size()
},
```
What does this code have to do with the current PR?
Actually it is not related; I will move it to another PR.
784af55 to 2c6c2e0
For KV-cache insufficiency/preemption: on consumer nodes we rely on

The
2c6c2e0 to ed4ade8
4c12ca8 to 9b22ba8
It would be better to adjust the condition for #4774 as well.

Yeah, good suggestion.
9b22ba8 to 85c49ef
This pull request has conflicts; please resolve them before we can evaluate the pull request.
85c49ef to 5087904
…ments

Signed-off-by: Chen Chen <0109chenchen@gmail.com>
…... (vllm-project#5192)

### What this PR does / why we need it?
- Problem: In MLA+MLAPO, KV-consumer deployments keep `fused_qkv_a_proj`/`q_proj` weights and quant params even though MLAPO uses the prepacked buffers, increasing memory footprint on decode nodes.
- Fix: Conditionally drop those tensors only when `kv_transfer_config.is_kv_consumer` to reclaim memory (consistent with the SFA behavior vllm-project#4774).

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?
- vLLM version: v0.12.0
- vLLM main: vllm-project/vllm@ad32e3e

Signed-off-by: Chen Chen <0109chenchen@gmail.com>
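The fix described above can be sketched roughly as follows. Only `fused_qkv_a_proj`, `q_proj`, and `is_kv_consumer` come from the PR description; the helper name, the exact attribute names for quant params, and the use of `SimpleNamespace` as a stand-in for real layer objects are assumptions for illustration, not the actual vLLM Ascend code.

```python
# Illustrative sketch: drop the original projection tensors only on
# KV-consumer (decode) nodes, since MLAPO reads the prepacked buffers.
from types import SimpleNamespace

def dispose_mlapo_source_weights(layer, kv_transfer_config):
    """Free weights that MLAPO no longer reads after prepacking (sketch)."""
    # Only decode-side consumers can drop these; producers (prefill)
    # still run the original projections.
    if kv_transfer_config is None or not kv_transfer_config.is_kv_consumer:
        return
    for name in ("fused_qkv_a_proj", "q_proj"):
        proj = getattr(layer, name, None)
        if proj is None:
            continue
        # Clear the weight and its quantization parameters so the
        # allocator can reclaim the memory.
        proj.weight = None
        proj.weight_scale = None
        proj.weight_offset = None

# Stand-in objects for demonstration only.
layer = SimpleNamespace(
    fused_qkv_a_proj=SimpleNamespace(weight="W", weight_scale="s", weight_offset="o"),
    q_proj=SimpleNamespace(weight="W", weight_scale="s", weight_offset="o"),
)
cfg = SimpleNamespace(is_kv_consumer=True)
dispose_mlapo_source_weights(layer, cfg)
```

On a KV producer (`is_kv_consumer=False`) the function returns early and the weights stay intact, which is why the disposal is safe to gate on the transfer config.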