
[Feat] Integrate FIA operator in mla_cp._forward_decode#5641

Merged
wangxiyuan merged 30 commits into vllm-project:main from 845473182:FIA_rebase
Jan 22, 2026

Conversation

Contributor

@845473182 845473182 commented Jan 6, 2026

What this PR does / why we need it?

Replaces npu_multi_head_latent_attention with the FIA (fused infer attention) operator in mla_cp.py _forward_decode.
Adjusts mla_attn_dpc_pcp in acl_graph.py accordingly.

Does this PR introduce any user-facing change?

no

How was this patch tested?

白永斌 added 9 commits December 24, 2025 10:44
Signed-off-by: 白永斌 <baiyongbin3@h-partners.com>

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request integrates the npu_fused_infer_attention_score (FIA) operator into the Multi-Head Latent Attention with Context Parallelism (mla_cp) implementation, replacing the previous npu_multi_head_latent_attention operator. The changes span the attention metadata, the attention implementation itself, and the ACL graph update logic that supports the new operator.

Overall, the changes look reasonable and follow the required refactoring pattern to switch to the new attention kernel. However, I've identified a critical issue in the graph replay logic that could lead to incorrect results, and a high-severity memory leak due to missing weak_ref_tensors wrappers. Please address these issues.

Comment on lines 482 to +485
seq_len = decode_meta.cp_seq_len
if isinstance(seq_len, torch.Tensor):
seq_len = seq_len.tolist()
actual_seq_lengths_kv = seq_len


critical

Dynamic parameters such as block_table, spec_attn_mask, and actual_seq_lengths are not updated during graph replay. They are read from the param tuple which contains values from the time of graph capture. This will cause the replayed graph to execute with stale data, leading to incorrect attention outputs. These parameters must be updated from the current forward_context at every step, similar to how actual_seq_lengths_kv is being updated.

            block_table = decode_meta.block_table
            spec_attn_mask = decode_meta.attn_mask
            actual_seq_lengths = decode_meta.actual_seq_lengths_q
            seq_len = decode_meta.cp_seq_len
            if isinstance(seq_len, torch.Tensor):
                seq_len = seq_len.tolist()
            actual_seq_lengths_kv = seq_len
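The fix the reviewer asks for can be sketched as follows. `DecodeMeta` and `refresh_replay_params` are hypothetical stand-ins for the decode metadata and update hook in `mla_cp.py` (not the actual vllm-ascend API), and the tensor check is duck-typed so the sketch runs without torch:

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class DecodeMeta:
    # Hypothetical stand-in for the decode metadata carried by the
    # current forward_context in mla_cp.py.
    block_table: list
    attn_mask: Any
    actual_seq_lengths_q: list
    cp_seq_len: Any  # torch.Tensor or plain list in the real code


def refresh_replay_params(decode_meta: DecodeMeta) -> dict:
    """Re-read every dynamic parameter from the *current* metadata so the
    replayed graph never runs on values frozen at capture time."""
    seq_len = decode_meta.cp_seq_len
    if hasattr(seq_len, "tolist"):  # isinstance(seq_len, torch.Tensor) in the PR
        seq_len = seq_len.tolist()
    return {
        "block_table": decode_meta.block_table,
        "spec_attn_mask": decode_meta.attn_mask,
        "actual_seq_lengths": decode_meta.actual_seq_lengths_q,
        "actual_seq_lengths_kv": seq_len,
    }
```

Called once per decode step, this keeps block_table, spec_attn_mask, and actual_seq_lengths in sync with actual_seq_lengths_kv, instead of reusing capture-time values from the param tuple.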

Comment thread vllm_ascend/attention/context_parallel/mla_cp.py Outdated
Signed-off-by: 白永斌 <baiyongbin3@h-partners.com>

github-actions bot commented Jan 6, 2026

👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:

  • A PR should do only one thing; smaller PRs enable faster reviews.
  • Every PR should include unit tests and end-to-end tests to ensure it works and is not broken by future PRs.
  • Write the commit message by filling in the PR description to help reviewers and future developers understand.

If CI fails, you can run the linting and testing checks locally according to Contributing and Testing.

845473182 and others added 4 commits January 7, 2026 14:38
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Signed-off-by: Bai Yongbin <845473182@qq.com>
Signed-off-by: 白永斌 <baiyongbin3@h-partners.com>
Signed-off-by: 白永斌 <baiyongbin3@h-partners.com>
@yiz-liu yiz-liu marked this pull request as ready for review January 7, 2026 08:11
@yiz-liu yiz-liu added the ready (read for review) and ready-for-test (start test by label for PR) labels Jan 7, 2026
Collaborator

yiz-liu commented Jan 7, 2026

We can replace ring_mla with FIA now that LSE output is supported. @zzzzwwjj @weijinqian0
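For context on why the LSE output matters: merging partial attention results across context-parallel ranks (as ring_mla does) requires each rank's log-sum-exp statistic. A minimal, framework-free sketch of the standard LSE merge follows; the function name is illustrative, not the vllm-ascend API:

```python
import math


def merge_attn_out_lse(out1, lse1, out2, lse2):
    """Combine two partial attention outputs computed over disjoint KV shards.

    Each (out_i, lse_i) pair is a per-token output vector plus the
    log-sum-exp of that shard's attention logits; reweighting by the
    normalizers makes the result equal to full-softmax attention over
    the union of the shards.
    """
    m = max(lse1, lse2)          # subtract the max for numerical stability
    w1 = math.exp(lse1 - m)      # shard 1 normalizer, scaled by exp(-m)
    w2 = math.exp(lse2 - m)      # shard 2 normalizer, scaled by exp(-m)
    lse = m + math.log(w1 + w2)  # merged log-sum-exp
    out = [(w1 * a + w2 * b) / (w1 + w2) for a, b in zip(out1, out2)]
    return out, lse
```

With equal LSE values the shards contribute equally, e.g. merging ([1.0], 0.0) with ([3.0], 0.0) yields ([2.0], log 2); this is the same trick used by flash-attention-style kernels to stitch blockwise softmax results together.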

Signed-off-by: 白永斌 <baiyongbin3@h-partners.com>

github-actions bot commented Jan 7, 2026

This pull request has conflicts, please resolve those before we can evaluate the pull request.

Signed-off-by: Bai Yongbin <845473182@qq.com>
白永斌 and others added 4 commits January 8, 2026 14:18
Signed-off-by: 白永斌 <baiyongbin3@h-partners.com>
@github-actions

This pull request has conflicts, please resolve those before we can evaluate the pull request.

白永斌 added 8 commits January 20, 2026 16:26
Signed-off-by: 白永斌 <baiyongbin3@h-partners.com>
This reverts commit a2a6f72.

Signed-off-by: 白永斌 <baiyongbin3@h-partners.com>
@wangxiyuan wangxiyuan merged commit 7f91ac2 into vllm-project:main Jan 22, 2026
20 checks passed
wangxiyuan pushed a commit that referenced this pull request Jan 22, 2026
…d_decode (#6046)

### What this PR does / why we need it?
Replace the npu_multi_head_latent_attention with FIA operator in mla_cp
_forward_decode.
Adjust mla_attn_dpc_pcp in acl_graph.py.

pick-from: #5641
### Does this PR introduce _any_ user-facing change?
no

---------

Signed-off-by: 白永斌 <baiyongbin3@h-partners.com>
Signed-off-by: Bai Yongbin <845473182@qq.com>
Signed-off-by: tongyuzhou <t00886357@china.huawei.com>
Co-authored-by: 白永斌 <baiyongbin3@h-partners.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: tongyuzhou <t00886357@china.huawei.com>
845473182 pushed a commit to 845473182/vllm-ascend that referenced this pull request Jan 22, 2026
starmountain1997 pushed a commit to starmountain1997/vllm-ascend that referenced this pull request Jan 31, 2026
…t#5641)

### What this PR does / why we need it?
Replace the npu_multi_head_latent_attention with FIA operator in
mla_cp.py _forward_decode.
Adjust mla_attn_dpc_pcp in acl_graph.py

### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?

- vLLM version: v0.13.0
- vLLM main:
vllm-project/vllm@2f4e654

---------

Signed-off-by: 白永斌 <baiyongbin3@h-partners.com>
Signed-off-by: Bai Yongbin <845473182@qq.com>
Signed-off-by: tongyuzhou <t00886357@china.huawei.com>
Co-authored-by: 白永斌 <baiyongbin3@h-partners.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: tongyuzhou <t00886357@china.huawei.com>
starmountain1997 pushed a commit to starmountain1997/vllm-ascend that referenced this pull request Jan 31, 2026
starmountain1997 pushed a commit to starmountain1997/vllm-ascend that referenced this pull request Jan 31, 2026
tangtiangu pushed a commit to tangtiangu/jiusi-vllm-ascend that referenced this pull request Feb 24, 2026
tangtiangu pushed a commit to tangtiangu/jiusi-vllm-ascend that referenced this pull request Feb 24, 2026
ZRJ026 pushed a commit to ZRJ026/vllm-ascend that referenced this pull request Feb 28, 2026
maoxx241 pushed a commit to maoxx241/vllm-ascend that referenced this pull request Mar 2, 2026
ZRJ026 pushed a commit to ZRJ026/vllm-ascend that referenced this pull request Mar 4, 2026
LCAIZJ pushed a commit to LCAIZJ/vllm-ascend that referenced this pull request Mar 7, 2026

Labels

ready (read for review), ready-for-test (start test by label for PR)


3 participants