[0.13.0][Feat] Integrate FIA operator in mla_cp._forward_decode#6046
wangxiyuan merged 28 commits into vllm-project:releases/v0.13.0
Conversation
Signed-off-by: 白永斌 <baiyongbin3@h-partners.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Signed-off-by: Bai Yongbin <845473182@qq.com>
Signed-off-by: tongyuzhou <t00886357@china.huawei.com>
Code Review
This pull request integrates the npu_fused_infer_attention_score operator, which appears to be a new or optimized fused NPU attention kernel, into the attention mechanism. The changes adapt the attention logic, particularly the context parallel (CP) and multi-head latent attention (MLA) paths, to use the new operator: parameter preparation, the operator call itself, and output handling are all updated. The _update_out_and_lse method has been removed, with its functionality consolidated into _npu_attn_out_lse_update. Test files have been updated accordingly, mocking the new NPU operators and adjusting the expected input/output shapes and return values, and acl_graph.py now handles the new parameters during graph capture and replay. The changes are consistent across the codebase and appear to be a necessary adaptation to the new NPU operator API.
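For context, the out/lse update that _npu_attn_out_lse_update consolidates is the standard trick for combining partial attention results computed over disjoint KV chunks, as happens under context parallelism: each chunk returns its locally normalized softmax output together with the log-sum-exp (LSE) of its logits, and the partial outputs are then reweighted by the renormalized LSEs. Below is a minimal pure-Python sketch of that math; it is illustrative only (the real code operates on batched NPU tensors, and `chunk_attn`/`merge_out_lse` are hypothetical names, not the project's API):

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def chunk_attn(q, keys, values, scale):
    """Attention for a single query over one KV chunk.

    Returns the chunk-local softmax-weighted output and the
    log-sum-exp (LSE) of the chunk's scaled logits.
    """
    logits = [dot(q, k) * scale for k in keys]
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    weights = [math.exp(x - lse) for x in logits]  # sum to 1 within the chunk
    dim = len(values[0])
    out = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return out, lse

def merge_out_lse(out_a, lse_a, out_b, lse_b):
    """Combine two partial results as if attention ran over both chunks at once."""
    m = max(lse_a, lse_b)
    lse = m + math.log(math.exp(lse_a - m) + math.exp(lse_b - m))
    w_a = math.exp(lse_a - lse)  # w_a + w_b == 1
    w_b = math.exp(lse_b - lse)
    merged = [w_a * a + w_b * b for a, b in zip(out_a, out_b)]
    return merged, lse

# Tiny demo: two KV chunks of two entries each, query dim 3, value dim 2.
scale = 1.0 / math.sqrt(3)
q = [0.3, -0.7, 1.1]
keys = [[0.5, 0.1, -0.2], [1.0, -1.0, 0.0], [-0.3, 0.4, 0.8], [0.2, 0.9, -0.5]]
values = [[1.0, 0.0], [0.0, 1.0], [0.5, -0.5], [-1.0, 2.0]]
out_a, lse_a = chunk_attn(q, keys[:2], values[:2], scale)
out_b, lse_b = chunk_attn(q, keys[2:], values[2:], scale)
merged, lse = merge_out_lse(out_a, lse_a, out_b, lse_b)
```

The merged result is bit-for-bit the same (up to floating-point error) as running attention over the full KV at once, which is why the update can be fused into a single operator call without changing numerics.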
Merge branch 'releases/v0.13.0' of https://github.com/vllm-project/vllm-ascend into FIA_v0.13.0, picking up:
- [0.13.0][Bugfix] Add `synced_cudagraph_mode` to limit mixed graph modes in dp ranks (vllm-project#6011)
This reverts commit 5c1f197.
Later merges of 'releases/v0.13.0' from https://github.com/vllm-project/vllm-ascend into FIA_v0.13.0 picked up:
- [0.13.0][Bugfix] Fix setting of `speculative_config.enforce_eager` for dsv32 (vllm-project#5958)
- [v0.13.0][Bugfix] Fix XliteModelRunner init failure when aclgraph is enabled (vllm-project#5887)
- [0.13.0][Bugfix] Fixed a problem related to embeddings sharing (vllm-project#5972)
- [Bugfix] Fixed precision issues caused by pooled request pooling (vllm-project#6057)
- [0.13.0][Bugfix] fix pcp aclgraph qwen FIA bug (vllm-project#6038)
- [0.13.0][cherry-pick][bugfix] fix bug of triton mrope (vllm-project#6009)
- [0.13.0][bugfix] Resolved memory deallocation failure in the pooling layer under re-computation workloads (vllm-project#6056)
- Revert "[0.13.0][cherry-pick][bugfix] fix bug of triton mrope" (vllm-project#6075)
- [0.13.0][Doc] Supplement PD separation parameters of DeepSeek V3.1 (vllm-project#6054)
- [EPLB][Bugfix][v0.13.0] Incorporate the warm-up of the EPLB into the profile run (vllm-project#6099)
- [EPLB][Bugfix] Dispatch Allgather use log2phy if enable eplb (vllm-project#5933) (vllm-project#6016)
- [0.13.0][CI] fix for CI lint (vllm-project#6093)
- [0.13.0][cherry-pick][bugfix] fix the complex and potentially problematic generate_kv_idx (vllm-project#5955)
- [Feature][Cherry Pick] Enable DispatchGmmCombineDecode when eagle is moe with w8a8, or not moe (vllm-project#6081)
- [v0.13.0][BugFix][Cherry Pick] Fix input parameter bug of dispatch_gmm_combine_decode (vllm-project#5931)
- [0.13.0][Bugfix] Fix Triton operator usage for multimodal models based on the `mrope_interleaved` parameter (vllm-project#6074)
- [v0.13.0][CI] Upgrade to CANN 8.5.0 (vllm-project#6101)
- [EPLB] Config Rename wrapper (vllm-project#6111)
- [v0.13.0][Bugfix] Fix the input constraint checks for the mlapo and bmm_transpose operators (vllm-project#5764) (vllm-project#6088)
What this PR does / why we need it?
Replace `npu_multi_head_latent_attention` with the fused infer attention (FIA) operator `npu_fused_infer_attention_score` in `mla_cp._forward_decode`.
Adjust `mla_attn_dpc_pcp` in `acl_graph.py` accordingly.
pick-from: #5641
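The review notes that the updated tests mock the new NPU operators rather than run them on hardware. The pattern can be sketched with `unittest.mock` as below; everything here is illustrative (a stand-in `FakeTorchNpu` namespace and a simplified `forward_decode` stand in for the real module and `_forward_decode`, which take many more parameters):

```python
from unittest import mock

class FakeTorchNpu:
    """Stand-in for the torch_npu module; the real operator needs Ascend hardware."""
    @staticmethod
    def npu_fused_infer_attention_score(query, kv_cache):
        raise RuntimeError("requires an Ascend NPU")

torch_npu = FakeTorchNpu()

def forward_decode(query, kv_cache):
    # Simplified stand-in for mla_cp._forward_decode: call the fused
    # operator and return its attention output and log-sum-exp.
    out, lse = torch_npu.npu_fused_infer_attention_score(query, kv_cache)
    return out, lse

def test_forward_decode_calls_fia():
    # Patch the operator so the test runs without NPU hardware, then
    # check that forward_decode forwards its arguments and unpacks the
    # (output, lse) pair the operator returns.
    with mock.patch.object(
        FakeTorchNpu, "npu_fused_infer_attention_score",
        return_value=("attn_out", "softmax_lse"),
    ) as fia:
        out, lse = forward_decode("q", "kv")
    fia.assert_called_once_with("q", "kv")
    assert (out, lse) == ("attn_out", "softmax_lse")

test_forward_decode_calls_fia()
```

Mocking at this boundary lets the test suite pin down the call contract (argument order and the two-element return) while staying runnable on CPU-only CI.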
Does this PR introduce any user-facing change?
no
How was this patch tested?