
[Bugfix] Fix the issue when the FI Attention operator is used in graph mode with eagle enabled #6932

Closed
chenxi-hh wants to merge 4 commits into vllm-project:main from chenxi-hh:main

Conversation

@chenxi-hh (Collaborator) commented Mar 2, 2026

Fix the issue when the FI Attention operator is used in graph mode with eagle enabled

  • vLLM version: v0.16.0
  • vLLM main: vllm-project/vllm@15d76f7
  • Error Message: RuntimeError: npu_fused_infer_attention_score_out_symint:build/CMakeFiles/torch_npu.dir/compiler_depend.ts:482 NPU function error: call aclnnFusedInferAttentionScoreV3 failed, error code is 561002
    E29999[PID: 65928] 2026-03-03-05:08:21.205.139 (E29999): [InitOpInfoLib][InitOpKernel] Inconsistent format size [1] and data type size [2]![FUNC:InitUnknownFormatAndDtype][FILE:op_kernel_info_constructor.cc][LINE:959]
    When layout is TND, queryT(8) must be equal to the last element of actualSequenceLengthQ(7)[FUNC:CheckFAISeqlenDataInTND][FILE:fused_infer_attention_score_tiling.cpp][LINE:942]
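
The failing invariant reported above can be reproduced with a small standalone check: in TND layout, the total (possibly padded) number of query tokens must equal the last entry of the cumulative query sequence lengths. A minimal sketch in plain Python, with no NPU dependencies; the function name and shape values are illustrative, not the operator's actual API:

```python
def check_tnd_seq_lens(num_query_tokens: int, actual_seq_lengths_q: list[int]) -> None:
    """Mimic the aclnnFusedInferAttentionScoreV3 TND-layout check: the last
    element of the cumulative query lengths must equal the total number of
    query tokens the operator sees."""
    if actual_seq_lengths_q[-1] != num_query_tokens:
        raise RuntimeError(
            f"When layout is TND, queryT({num_query_tokens}) must be equal to "
            f"the last element of actualSequenceLengthQ({actual_seq_lengths_q[-1]})")

# Graph mode captures a fixed query length (here 8), but the real batch only
# produced 7 tokens, so the cumulative lengths end at 7 and the check fails.
try:
    check_tnd_seq_lens(8, [3, 7])
    failed = False
except RuntimeError:
    failed = True
print(failed)  # True

# Padding the cumulative lengths up to the captured shape satisfies the check.
check_tnd_seq_lens(8, [3, 7, 8])
```

This is why the fix pads `actual_seq_lengths_q` up to the graph's captured runtime shape before the operator is invoked.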

…h mode with eagle enabled

Signed-off-by: chenxi-hh <chen464822955@163.com>
@github-actions bot (Contributor) commented Mar 2, 2026

👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:

  • A PR should do only one thing; smaller PRs enable faster reviews.
  • Every PR should include unit tests and end-to-end tests to ensure it works and is not broken by future PRs.
  • Write the commit message by filling in the PR description, to help reviewers and future developers understand.

If CI fails, you can run linting and testing checks locally according to Contributing and Testing.

@gemini-code-assist (Contributor)

Summary of Changes

Hello, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request addresses and resolves a critical bug affecting the FI Attention operator when it is utilized in graph mode with the eagle feature enabled. The fix ensures that sequence lengths are correctly padded and aligned during graph parameter updates, preventing issues that previously arose from mismatched sequence dimensions in this specific operational configuration.

Highlights

  • New Sequence Padding Logic: Introduced a new private helper function, _pad_attention_seq_params, to correctly pad actual_seq_lengths_q and seq_lens based on a runtime_shape and an optional num_speculative_tokens.
  • Graph Mode Bug Fix: Integrated the new padding logic within the update_graph_params method. This ensures proper sequence length handling for the FI Attention operator when used in graph mode with the eagle feature enabled, resolving a specific bug.


Changelog
  • vllm_ascend/attention/attention_v1.py
    • Added a new private helper function _pad_attention_seq_params to manage the padding of sequence lengths for attention operations.
    • Modified the update_graph_params method to conditionally invoke _pad_attention_seq_params when processing attention metadata for draft_step and when eagle is active, ensuring proper sequence length alignment.
Activity
  • No specific activity (comments, reviews, or progress updates) was found for this pull request.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@gemini-code-assist bot left a comment

Code Review

This pull request introduces a fix for an issue with the FI Attention operator in graph mode when Eagle speculative decoding is enabled. The main change is the addition of a _pad_attention_seq_params function to pad attention parameters, ensuring static shapes for graph execution. My review identifies a logical simplification in this new function where a branch of code is unreachable, and I've provided a suggestion to improve its clarity and maintainability.

Comment on lines +355 to +358

    if interpolated and interpolated[-1] < runtime_shape:
        interpolated.append(runtime_shape)
    elif not interpolated and last_val < runtime_shape:
        interpolated = [runtime_shape]

high

The logic for ensuring runtime_shape is in the interpolated list can be simplified. The elif not interpolated branch is unreachable because the outer else block (line 350) is only entered if last_val < runtime_shape, which guarantees that range(...) will produce a non-empty list. Also, since interpolated will never be empty, the check interpolated and ... is redundant.

Suggested change

    - if interpolated and interpolated[-1] < runtime_shape:
    -     interpolated.append(runtime_shape)
    - elif not interpolated and last_val < runtime_shape:
    -     interpolated = [runtime_shape]
    + if interpolated[-1] < runtime_shape:
    +     interpolated.append(runtime_shape)
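
The equivalence claimed above can be checked directly: whenever the outer `else` runs with `last_val < runtime_shape`, `range(last_val, runtime_shape, step)` yields at least one element, so `interpolated` is never empty and the two variants always agree. A quick standalone check (the step values are illustrative; the real code's interpolation stride is not shown in this thread):

```python
def pad_variant_original(interpolated, last_val, runtime_shape):
    # Gemini's quoted version, with the redundant checks.
    interpolated = list(interpolated)
    if interpolated and interpolated[-1] < runtime_shape:
        interpolated.append(runtime_shape)
    elif not interpolated and last_val < runtime_shape:
        interpolated = [runtime_shape]
    return interpolated

def pad_variant_simplified(interpolated, runtime_shape):
    # Gemini's suggested simplification.
    interpolated = list(interpolated)
    if interpolated[-1] < runtime_shape:
        interpolated.append(runtime_shape)
    return interpolated

# Reconstruct what the outer else produces: range(last_val, runtime_shape, step)
# is non-empty whenever last_val < runtime_shape, so interpolated is never [].
for last_val, runtime_shape, step in [(4, 10, 2), (7, 8, 4), (1, 9, 3)]:
    interpolated = list(range(last_val, runtime_shape, step))
    assert interpolated  # the "elif not interpolated" branch is unreachable
    assert (pad_variant_original(interpolated, last_val, runtime_shape)
            == pad_variant_simplified(interpolated, runtime_shape))
print("variants agree")
```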

@chenxi-hh added the ready-for-test (start test by label for PR) and ready (read for review) labels, and removed the ready-for-test (start test by label for PR) label, Mar 3, 2026
@yiz-liu (Collaborator) left a comment

Please don't add this

return attn_metadata


def _pad_attention_seq_params(
Please don't add another padding function while we already have pad_query_start_loc_for_fia, better check why it's not working now, I assume it has something to do with vllm-project/vllm#34043 .

chenxi-hh and others added 2 commits March 3, 2026 22:17
@chenxi-hh chenxi-hh requested a review from wangxiyuan as a code owner March 3, 2026 14:28
…h mode with eagle enabled

Signed-off-by: chenxi-hh <chen464822955@163.com>
@chenxi-hh chenxi-hh closed this Mar 3, 2026

Labels

ready (read for review), ready-for-test (start test by label for PR)


2 participants