[Feat] Remove redundant variables after integrating the FIA operator in mla_cp._forward_decode (#5659)
Signed-off-by: 白永斌 <baiyongbin3@h-partners.com>
Signed-off-by: daishixun <dsxsteven@sina.com>
Code Review
This pull request refactors the attention code by removing the batch_seq_mask variable and its associated logic. The cleanup follows the integration of an updated FIA (Fused Infer Attention) operator, which presumably now handles masking for zero-length sequences internally. The changes are consistent across the implementation and test files, leading to cleaner and more maintainable code. Removing the explicit masking logic in _process_attn_out_lse and _compute_prefill_context correctly relies on the improved capabilities of the underlying NPU operators, and the tests are updated accordingly. Overall, this is a good refactoring that simplifies the codebase.
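To make the review's point concrete, here is a minimal, hypothetical sketch (in plain Python, not the actual vllm-ascend code) of the kind of masking the PR removes: before the updated FIA operator, attention outputs for sequences with zero kv tokens on a card could contain garbage and had to be zeroed explicitly; with the new operator, this masking is handled inside the kernel.

```python
# Hypothetical sketch of the removed batch_seq_mask logic: zero out attention
# outputs for sequences that have no kv tokens cached on the local card.
# The real code operates on NPU tensors; lists are used here for illustration.

def mask_zero_len_seqs(attn_out, seq_lens):
    """Replace outputs of zero-length kv sequences with zeros."""
    return [
        out if length > 0 else [0.0] * len(out)
        for out, length in zip(attn_out, seq_lens)
    ]

# The second sequence has no kv tokens on this card, so its (potentially
# corrupted) output is masked to zeros.
attn_out = [[0.5, 0.2], [7.1, -3.0], [0.9, 0.4]]
seq_lens = [4, 0, 2]
print(mask_zero_len_seqs(attn_out, seq_lens))
# → [[0.5, 0.2], [0.0, 0.0], [0.9, 0.4]]
```

With the updated FIA operator performing this internally, the Python-side mask construction and application become dead code, which is what this PR deletes.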
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:
If CI fails, you can run linting and testing checks locally according to Contributing and Testing.
This pull request has conflicts, please resolve those before we can evaluate the pull request.
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Signed-off-by: Bai Yongbin <845473182@qq.com>
This PR should be merged after #5641.
Signed-off-by: Bai Yongbin <845473182@qq.com>
Signed-off-by: tongyuzhou <t00886357@china.huawei.com>
Signed-off-by: dsxsteven <36877507+dsxsteven@users.noreply.github.com>
What this PR does / why we need it?
PCP/DCP splits the kv-cache across different cards. After introducing the parameter
`cp-kv-cache-interleave-size`, the first `size` tokens are cached on Card 0, the next `size` tokens on Card 1, and so on. However, when there are too few tokens, some cards store no key-value pairs at all; those zero-length sequences yield zero or corrupted values and cause precision issues. Previously, additional masking operations were introduced to avoid this precision problem.
Now that the FIA operator is integrated in `mla_cp._forward_decode`, these additional operations can be removed.
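The interleaved placement described above can be sketched as follows (an illustrative model, not the actual vllm-ascend code; the function name is invented for this example):

```python
# Illustrative model of cp-kv-cache-interleave-size: kv tokens are distributed
# round-robin across cp ranks in contiguous blocks of `interleave_size`.

def card_for_token(token_idx: int, interleave_size: int, num_cards: int) -> int:
    """Return the cp rank that caches the given kv token."""
    return (token_idx // interleave_size) % num_cards

# With interleave_size=4 and 2 cards: tokens 0-3 -> card 0, 4-7 -> card 1,
# then tokens 8-9 wrap back to card 0.
print([card_for_token(i, 4, 2) for i in range(10)])
# → [0, 0, 0, 0, 1, 1, 1, 1, 0, 0]

# With only 3 tokens, card 1 caches nothing -- a zero-length kv sequence,
# which is the case that previously required explicit masking.
print([card_for_token(i, 4, 2) for i in range(3)])
# → [0, 0, 0]
```

Under this placement, a short sequence can leave some card with an empty kv slice, which is exactly the precision hazard the old masking code guarded against and the new FIA operator now handles internally.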
Does this PR introduce any user-facing change?
No
How was this patch tested?