
[Bugfix] Fix the issue when the FI Attention operator is used in graph mode with eagle enabled #6932

Closed
chenxi-hh wants to merge 4 commits into vllm-project:main from chenxi-hh:main

Conversation

@chenxi-hh (Collaborator) commented Mar 2, 2026

Fix the issue when the FI Attention operator is used in graph mode with eagle enabled

  • vLLM version: v0.16.0
  • vLLM main: vllm-project/vllm@15d76f7
  • Error Message: RuntimeError: npu_fused_infer_attention_score_out_symint:build/CMakeFiles/torch_npu.dir/compiler_depend.ts:482 NPU function error: call aclnnFusedInferAttentionScoreV3 failed, error code is 561002
    E29999[PID: 65928] 2026-03-03-05:08:21.205.139 (E29999): [InitOpInfoLib][InitOpKernel] Inconsistent format size [1] and data type size [2]![FUNC:InitUnknownFormatAndDtype][FILE:op_kernel_info_constructor.cc][LINE:959]
    When layout is TND, queryT(8) must be equal to the last element of actualSequenceLengthQ(7)[FUNC:CheckFAISeqlenDataInTND][FILE:fused_infer_attention_score_tiling.cpp][LINE:942]
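
The failing invariant reported above can be reproduced with a small standalone check: in TND layout, the total (possibly padded) number of query tokens must equal the last entry of the cumulative query sequence lengths. A minimal sketch in plain Python, with no NPU dependencies; the function name and shape values are illustrative, not the operator's actual API:

```python
def check_tnd_seq_lens(num_query_tokens: int, actual_seq_lengths_q: list[int]) -> None:
    """Mimic the aclnnFusedInferAttentionScoreV3 TND-layout check: the last
    element of the cumulative query lengths must equal the total number of
    query tokens the operator sees."""
    if actual_seq_lengths_q[-1] != num_query_tokens:
        raise RuntimeError(
            f"When layout is TND, queryT({num_query_tokens}) must be equal to "
            f"the last element of actualSequenceLengthQ({actual_seq_lengths_q[-1]})")

# Graph mode captures a fixed query length (here 8), but the real batch only
# produced 7 tokens, so the cumulative lengths end at 7 and the check fails.
try:
    check_tnd_seq_lens(8, [3, 7])
    failed = False
except RuntimeError:
    failed = True
print(failed)  # True

# Padding the cumulative lengths up to the captured shape satisfies the check.
check_tnd_seq_lens(8, [3, 7, 8])
```

This is why the fix pads `actual_seq_lengths_q` up to the graph's captured runtime shape before the operator is invoked.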

…h mode with eagle enabled

Signed-off-by: chenxi-hh <chen464822955@163.com>
@github-actions bot (Contributor) commented Mar 2, 2026

👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:

  • A PR should do only one thing; smaller PRs enable faster reviews.
  • Every PR should include unit tests and end-to-end tests to ensure it works and is not broken by future PRs.
  • Write the commit message by filling in the PR description, to help reviewers and future developers understand.

If CI fails, you can run linting and testing checks locally according to Contributing and Testing.

@gemini-code-assist (Contributor)

Summary of Changes

Hello, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request addresses and resolves a critical bug affecting the FI Attention operator when it is utilized in graph mode with the eagle feature enabled. The fix ensures that sequence lengths are correctly padded and aligned during graph parameter updates, preventing issues that previously arose from mismatched sequence dimensions in this specific operational configuration.

Highlights

  • New Sequence Padding Logic: Introduced a new private helper function, _pad_attention_seq_params, to correctly pad actual_seq_lengths_q and seq_lens based on a runtime_shape and an optional num_speculative_tokens.
  • Graph Mode Bug Fix: Integrated the new padding logic within the update_graph_params method. This ensures proper sequence length handling for the FI Attention operator when used in graph mode with the eagle feature enabled, resolving a specific bug.


Changelog
  • vllm_ascend/attention/attention_v1.py
    • Added a new private helper function _pad_attention_seq_params to manage the padding of sequence lengths for attention operations.
    • Modified the update_graph_params method to conditionally invoke _pad_attention_seq_params when processing attention metadata for draft_step and when eagle is active, ensuring proper sequence length alignment.
Activity
  • No specific activity (comments, reviews, or progress updates) was found for this pull request.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@gemini-code-assist bot left a comment

Code Review

This pull request introduces a fix for an issue with the FI Attention operator in graph mode when Eagle speculative decoding is enabled. The main change is the addition of a _pad_attention_seq_params function to pad attention parameters, ensuring static shapes for graph execution. My review identifies a logical simplification in this new function where a branch of code is unreachable, and I've provided a suggestion to improve its clarity and maintainability.

Comment on lines +355 to +358

    if interpolated and interpolated[-1] < runtime_shape:
        interpolated.append(runtime_shape)
    elif not interpolated and last_val < runtime_shape:
        interpolated = [runtime_shape]

high

The logic for ensuring runtime_shape is in the interpolated list can be simplified. The elif not interpolated branch is unreachable because the outer else block (line 350) is only entered if last_val < runtime_shape, which guarantees that range(...) will produce a non-empty list. Also, since interpolated will never be empty, the check interpolated and ... is redundant.

Suggested change

    - if interpolated and interpolated[-1] < runtime_shape:
    -     interpolated.append(runtime_shape)
    - elif not interpolated and last_val < runtime_shape:
    -     interpolated = [runtime_shape]
    + if interpolated[-1] < runtime_shape:
    +     interpolated.append(runtime_shape)
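
The equivalence claimed above can be checked directly: whenever the outer `else` runs with `last_val < runtime_shape`, `range(last_val, runtime_shape, step)` yields at least one element, so `interpolated` is never empty and the two variants always agree. A quick standalone check (the step values are illustrative; the real code's interpolation stride is not shown in this thread):

```python
def pad_variant_original(interpolated, last_val, runtime_shape):
    # Gemini's quoted version, with the redundant checks.
    interpolated = list(interpolated)
    if interpolated and interpolated[-1] < runtime_shape:
        interpolated.append(runtime_shape)
    elif not interpolated and last_val < runtime_shape:
        interpolated = [runtime_shape]
    return interpolated

def pad_variant_simplified(interpolated, runtime_shape):
    # Gemini's suggested simplification.
    interpolated = list(interpolated)
    if interpolated[-1] < runtime_shape:
        interpolated.append(runtime_shape)
    return interpolated

# Reconstruct what the outer else produces: range(last_val, runtime_shape, step)
# is non-empty whenever last_val < runtime_shape, so interpolated is never [].
for last_val, runtime_shape, step in [(4, 10, 2), (7, 8, 4), (1, 9, 3)]:
    interpolated = list(range(last_val, runtime_shape, step))
    assert interpolated  # the "elif not interpolated" branch is unreachable
    assert (pad_variant_original(interpolated, last_val, runtime_shape)
            == pad_variant_simplified(interpolated, runtime_shape))
print("variants agree")
```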

@chenxi-hh added the ready-for-test (start test by label for PR) and ready (read for review) labels, and removed the ready-for-test (start test by label for PR) label, Mar 3, 2026
@yiz-liu (Collaborator) left a comment

Please don't add this

return attn_metadata


def _pad_attention_seq_params(
Please don't add another padding function while we already have pad_query_start_loc_for_fia, better check why it's not working now, I assume it has something to do with vllm-project/vllm#34043 .

chenxi-hh and others added 2 commits March 3, 2026 22:17
@chenxi-hh chenxi-hh requested a review from wangxiyuan as a code owner March 3, 2026 14:28
…h mode with eagle enabled

Signed-off-by: chenxi-hh <chen464822955@163.com>
@chenxi-hh chenxi-hh closed this Mar 3, 2026

Labels

ready (read for review), ready-for-test (start test by label for PR)


2 participants