
[Feat] Integrate FIA operator in mla_cp._forward_decode#5641

Merged
wangxiyuan merged 30 commits into vllm-project:main from 845473182:FIA_rebase
Jan 22, 2026

Conversation

Contributor

@845473182 845473182 commented Jan 6, 2026

What this PR does / why we need it?

Replaces npu_multi_head_latent_attention with the FIA (fused infer attention) operator in mla_cp.py _forward_decode.
Adjusts mla_attn_dpc_pcp in acl_graph.py accordingly.

Does this PR introduce any user-facing change?

no

How was this patch tested?

白永斌 added 9 commits December 24, 2025 10:44
Signed-off-by: 白永斌 <baiyongbin3@h-partners.com>

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request integrates the npu_fused_infer_attention_score (FIA) operator into the Multi-Head Latent Attention with Context Parallelism (mla_cp) implementation, replacing the previous npu_multi_head_latent_attention operator. The changes span the attention metadata, the attention implementation itself, and the ACL graph update logic that supports the new operator.

Overall, the changes look reasonable and follow the required refactoring pattern to switch to the new attention kernel. However, I've identified a critical issue in the graph replay logic that could lead to incorrect results, and a high-severity memory leak due to missing weak_ref_tensors wrappers. Please address these issues.

Comment on lines 482 to +485
seq_len = decode_meta.cp_seq_len
if isinstance(seq_len, torch.Tensor):
seq_len = seq_len.tolist()
actual_seq_lengths_kv = seq_len


critical

Dynamic parameters such as block_table, spec_attn_mask, and actual_seq_lengths are not updated during graph replay. They are read from the param tuple which contains values from the time of graph capture. This will cause the replayed graph to execute with stale data, leading to incorrect attention outputs. These parameters must be updated from the current forward_context at every step, similar to how actual_seq_lengths_kv is being updated.

            block_table = decode_meta.block_table
            spec_attn_mask = decode_meta.attn_mask
            actual_seq_lengths = decode_meta.actual_seq_lengths_q
            seq_len = decode_meta.cp_seq_len
            if isinstance(seq_len, torch.Tensor):
                seq_len = seq_len.tolist()
            actual_seq_lengths_kv = seq_len
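The fix the reviewer asks for can be sketched as follows. `DecodeMeta` and `refresh_replay_params` are hypothetical stand-ins for the decode metadata and update hook in `mla_cp.py` (not the actual vllm-ascend API), and the tensor check is duck-typed so the sketch runs without torch:

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class DecodeMeta:
    # Hypothetical stand-in for the decode metadata carried by the
    # current forward_context in mla_cp.py.
    block_table: list
    attn_mask: Any
    actual_seq_lengths_q: list
    cp_seq_len: Any  # torch.Tensor or plain list in the real code


def refresh_replay_params(decode_meta: DecodeMeta) -> dict:
    """Re-read every dynamic parameter from the *current* metadata so the
    replayed graph never runs on values frozen at capture time."""
    seq_len = decode_meta.cp_seq_len
    if hasattr(seq_len, "tolist"):  # isinstance(seq_len, torch.Tensor) in the PR
        seq_len = seq_len.tolist()
    return {
        "block_table": decode_meta.block_table,
        "spec_attn_mask": decode_meta.attn_mask,
        "actual_seq_lengths": decode_meta.actual_seq_lengths_q,
        "actual_seq_lengths_kv": seq_len,
    }
```

Called once per decode step, this keeps block_table, spec_attn_mask, and actual_seq_lengths in sync with actual_seq_lengths_kv, instead of reusing capture-time values from the param tuple.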

Comment thread vllm_ascend/attention/context_parallel/mla_cp.py Outdated
Signed-off-by: 白永斌 <baiyongbin3@h-partners.com>

github-actions bot commented Jan 6, 2026

👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:

  • A PR should do only one thing; smaller PRs enable faster reviews.
  • Every PR should include unit tests and end-to-end tests to ensure it works and is not broken by future PRs.
  • Write the commit message by filling in the PR description to help reviewers and future developers understand.

If CI fails, you can run the linting and testing checks locally according to Contributing and Testing.

845473182 and others added 4 commits January 7, 2026 14:38
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Signed-off-by: Bai Yongbin <845473182@qq.com>
Signed-off-by: 白永斌 <baiyongbin3@h-partners.com>
Signed-off-by: 白永斌 <baiyongbin3@h-partners.com>
@yiz-liu yiz-liu marked this pull request as ready for review January 7, 2026 08:11
@yiz-liu yiz-liu added the ready (read for review) and ready-for-test (start test by label for PR) labels Jan 7, 2026
Collaborator

yiz-liu commented Jan 7, 2026

We can replace ring_mla with FIA now that LSE output is supported. @zzzzwwjj @weijinqian0
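For context on why the LSE output matters: merging partial attention results across context-parallel ranks (as ring_mla does) requires each rank's log-sum-exp statistic. A minimal, framework-free sketch of the standard LSE merge follows; the function name is illustrative, not the vllm-ascend API:

```python
import math


def merge_attn_out_lse(out1, lse1, out2, lse2):
    """Combine two partial attention outputs computed over disjoint KV shards.

    Each (out_i, lse_i) pair is a per-token output vector plus the
    log-sum-exp of that shard's attention logits; reweighting by the
    normalizers makes the result equal to full-softmax attention over
    the union of the shards.
    """
    m = max(lse1, lse2)          # subtract the max for numerical stability
    w1 = math.exp(lse1 - m)      # shard 1 normalizer, scaled by exp(-m)
    w2 = math.exp(lse2 - m)      # shard 2 normalizer, scaled by exp(-m)
    lse = m + math.log(w1 + w2)  # merged log-sum-exp
    out = [(w1 * a + w2 * b) / (w1 + w2) for a, b in zip(out1, out2)]
    return out, lse
```

With equal LSE values the shards contribute equally, e.g. merging ([1.0], 0.0) with ([3.0], 0.0) yields ([2.0], log 2); this is the same trick used by flash-attention-style kernels to stitch blockwise softmax results together.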

Signed-off-by: 白永斌 <baiyongbin3@h-partners.com>

github-actions bot commented Jan 7, 2026

This pull request has conflicts, please resolve those before we can evaluate the pull request.

Signed-off-by: Bai Yongbin <845473182@qq.com>
白永斌 and others added 4 commits January 8, 2026 14:18
Signed-off-by: 白永斌 <baiyongbin3@h-partners.com>
@github-actions

This pull request has conflicts, please resolve those before we can evaluate the pull request.

白永斌 added 8 commits January 20, 2026 16:26
Signed-off-by: 白永斌 <baiyongbin3@h-partners.com>
This reverts commit a2a6f72.

Signed-off-by: 白永斌 <baiyongbin3@h-partners.com>
@wangxiyuan wangxiyuan merged commit 7f91ac2 into vllm-project:main Jan 22, 2026
20 checks passed
wangxiyuan pushed a commit that referenced this pull request Jan 22, 2026
…d_decode (#6046)

### What this PR does / why we need it?
Replace the npu_multi_head_latent_attention with FIA operator in mla_cp
_forward_decode.
Adjust mla_attn_dpc_pcp in acl_graph.py.

pick-from: #5641
### Does this PR introduce _any_ user-facing change?
no

---------

Signed-off-by: 白永斌 <baiyongbin3@h-partners.com>
Signed-off-by: Bai Yongbin <845473182@qq.com>
Signed-off-by: tongyuzhou <t00886357@china.huawei.com>
Co-authored-by: 白永斌 <baiyongbin3@h-partners.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: tongyuzhou <t00886357@china.huawei.com>
845473182 pushed a commit to 845473182/vllm-ascend that referenced this pull request Jan 22, 2026
starmountain1997 pushed a commit to starmountain1997/vllm-ascend that referenced this pull request Jan 31, 2026
…t#5641)

### What this PR does / why we need it?
Replace the npu_multi_head_latent_attention with FIA operator in
mla_cp.py _forward_decode.
Adjust mla_attn_dpc_pcp in acl_graph.py

### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?

- vLLM version: v0.13.0
- vLLM main:
vllm-project/vllm@2f4e654

---------

Signed-off-by: 白永斌 <baiyongbin3@h-partners.com>
Signed-off-by: Bai Yongbin <845473182@qq.com>
Signed-off-by: tongyuzhou <t00886357@china.huawei.com>
Co-authored-by: 白永斌 <baiyongbin3@h-partners.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: tongyuzhou <t00886357@china.huawei.com>
starmountain1997 pushed a commit to starmountain1997/vllm-ascend that referenced this pull request Jan 31, 2026
starmountain1997 pushed a commit to starmountain1997/vllm-ascend that referenced this pull request Jan 31, 2026
tangtiangu pushed a commit to tangtiangu/jiusi-vllm-ascend that referenced this pull request Feb 24, 2026
tangtiangu pushed a commit to tangtiangu/jiusi-vllm-ascend that referenced this pull request Feb 24, 2026
ZRJ026 pushed a commit to ZRJ026/vllm-ascend that referenced this pull request Feb 28, 2026
maoxx241 pushed a commit to maoxx241/vllm-ascend that referenced this pull request Mar 2, 2026
ZRJ026 pushed a commit to ZRJ026/vllm-ascend that referenced this pull request Mar 4, 2026
LCAIZJ pushed a commit to LCAIZJ/vllm-ascend that referenced this pull request Mar 7, 2026

Labels

ready (read for review), ready-for-test (start test by label for PR)


3 participants