
[0.13.0][Feat] Integrate FIA operator in mla_cp._forward_decode #6046

Merged
wangxiyuan merged 28 commits into vllm-project:releases/v0.13.0 from 845473182:FIA_v0.13.0
Jan 22, 2026
Conversation

@845473182 (Contributor) commented Jan 20, 2026

What this PR does / why we need it?

Replace npu_multi_head_latent_attention with the FIA (npu_fused_infer_attention_score) operator in mla_cp._forward_decode.
Adjust mla_attn_dpc_pcp in acl_graph.py.
pick-from: #5641

Does this PR introduce any user-facing change?

no

How was this patch tested?

白永斌 and others added 17 commits January 20, 2026 16:45
Signed-off-by: 白永斌 <baiyongbin3@h-partners.com>
Signed-off-by: Bai Yongbin <845473182@qq.com>
Signed-off-by: tongyuzhou <t00886357@china.huawei.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
gemini-code-assist bot left a comment


Code Review

This pull request integrates the npu_fused_infer_attention_score operator, which appears to be a new or optimized NPU kernel, into the attention mechanism. The changes focus on adapting the attention logic, particularly the context parallel (CP) and multi-head latent attention (MLA) implementations, to use the new operator: parameter preparation, operator calls, and output handling are all updated.

The _update_out_and_lse method has been removed, with its functionality consolidated into _npu_attn_out_lse_update. Test files mock the new NPU operators and adjust expected input/output shapes and return values, and acl_graph.py now handles the new parameters for graph capture and replay. The changes are consistent across the codebase and appear to be a necessary adaptation to the new NPU operator API.
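For context, an out/LSE update like the consolidated _npu_attn_out_lse_update typically implements the standard log-sum-exp merge for combining partial attention results computed over disjoint key/value chunks, as in context parallelism. A minimal, NPU-free sketch follows; the function name and scalar list layout are illustrative assumptions, not the vllm-ascend implementation.

```python
import math

def merge_out_lse(out_a, lse_a, out_b, lse_b):
    """Merge two partial attention results computed over disjoint
    key/value chunks, given each chunk's output and log-sum-exp (LSE).

    Uses the numerically stable form: rescale both sides by the
    larger LSE before exponentiating.
    """
    merged_out, merged_lse = [], []
    for oa, la, ob, lb in zip(out_a, lse_a, out_b, lse_b):
        m = max(la, lb)
        wa = math.exp(la - m)          # weight of chunk A's softmax mass
        wb = math.exp(lb - m)          # weight of chunk B's softmax mass
        s = wa + wb
        merged_out.append((oa * wa + ob * wb) / s)
        merged_lse.append(m + math.log(s))
    return merged_out, merged_lse
```

Because this merge is associative, partial results from any number of CP ranks can be folded in pairwise, and the result matches attention computed over the full key/value sequence at once.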

白永斌 added 9 commits January 20, 2026 19:58
…lm-ascend into FIA_v0.13.0

* 'releases/v0.13.0' of https://github.com/vllm-project/vllm-ascend:
  [0.13.0][Bugfix] Add `synced_cudagraph_mode` to limit mixed graph modes in dp ranks (vllm-project#6011)
This reverts commit 5c1f197.

Signed-off-by: 白永斌 <baiyongbin3@h-partners.com>
Signed-off-by: 白永斌 <baiyongbin3@h-partners.com>
Signed-off-by: 白永斌 <baiyongbin3@h-partners.com>
Signed-off-by: 白永斌 <baiyongbin3@h-partners.com>
…lm-ascend into FIA_v0.13.0

* 'releases/v0.13.0' of https://github.com/vllm-project/vllm-ascend:
  [0.13.0][Bugfix] Fix setting of `speculative_config.enforce_eager` for dsv32 (vllm-project#5958)
  [v0.13.0][Bugfix] Fix XliteModelRunner init failed when aclgraph is enabled (vllm-project#5887)
  [0.13.0][Bugfix] Fixed a problem related to embeddings sharing (vllm-project#5972)
  [Bugfix]Fixed precision issues caused by pooled request pooling (vllm-project#6057)
  [0.13.0][Bugfix] fix pcp aclgraph qwen FIA bug (vllm-project#6038)
  [0.13.0][cherry-pick][bugfix] fix bug of triton mrope (vllm-project#6009)
  [0.13.0][bugfix] Resolved memory deallocation failure in the pooling layer under re-computation workloads. (vllm-project#6056)
Signed-off-by: 白永斌 <baiyongbin3@h-partners.com>
…lm-ascend into FIA_v0.13.0

* 'releases/v0.13.0' of https://github.com/vllm-project/vllm-ascend:
  [0.13.0][Doc] Supplement PD separation parameters of DeepSeek V3.1 (vllm-project#6054)
  [EPLB][Bugfix][v0.13.0] Incorporate the warm up of the EPLB into the profile run. (vllm-project#6099)
  [EPLB][Bugfix] Dispatch Allgather use log2phy if enable eplb (vllm-project#5933) (vllm-project#6016)
  [0.13.0][CI]fix for CI lint (vllm-project#6093)
  [0.13.0][cherry-pick][bugfix] fix the complex and potentially problematic generate_kv_idx. (vllm-project#5955)
白永斌 added 2 commits January 22, 2026 14:13
…lm-ascend into FIA_v0.13.0

* 'releases/v0.13.0' of https://github.com/vllm-project/vllm-ascend:
  [Feature][Cherry Pick]Enable DispatchGmmCombineDecode when eagle is moe with w8a8, or not moe (vllm-project#6081)
  [v0.13.0][BugFix][Cherry Pick] Fix input parameter bug of dispatch_gmm_combine_decode (vllm-project#5931)
  [0.13.0][Bugfix] Fix Triton operator usage for multimodal models based on the `mrope_interleaved` parameter (vllm-project#6074)
  [v0.13.0][CI] Upgrade to CANN 8.5.0 (vllm-project#6101)
…lm-ascend into FIA_v0.13.0

* 'releases/v0.13.0' of https://github.com/vllm-project/vllm-ascend:
  [EPLB] Config Rename wrapper (vllm-project#6111)
  [v0.13.0][Bugfix] Fix the input constraints checks for the mlapo and bmm_transpose operators (vllm-project#5764) (vllm-project#6088)
@wangxiyuan wangxiyuan merged commit d2bf9ea into vllm-project:releases/v0.13.0 Jan 22, 2026
12 checks passed
starmountain1997 pushed a commit to starmountain1997/vllm-ascend that referenced this pull request Jan 31, 2026
…d_decode (vllm-project#6046)

tangtiangu pushed a commit to tangtiangu/jiusi-vllm-ascend that referenced this pull request Feb 24, 2026
…d_decode (vllm-project#6046)


Labels

ready (read for review), ready-for-test (start test by label for PR)


3 participants