
[Bugfix] Fix model run _npu_flash_attention hang issue#4410

Merged
MengqingCao merged 1 commit into vllm-project:main from Semmer2:FixFAHangIssue
Nov 29, 2025

Conversation

@Semmer2 (Contributor) commented Nov 24, 2025

What this PR does / why we need it?

Fixes a hang when the model runs _npu_flash_attention in _forward_prefill_no_cache; the hang was caused by an attention mask with the wrong dtype. The mask is now explicitly converted to torch.bool.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Tested on Qwen2.5-VL and Qwen2.5-Omni.
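The fix boils down to making sure the chunked-prefill attention mask reaches the flash-attention kernel as a boolean tensor. A minimal sketch of the idea, using NumPy as a stand-in for torch and a simplified builder class (not the real vllm-ascend code):

```python
import numpy as np

class AttentionMaskBuilder:
    """Illustrative stand-in for vllm-ascend's mask builder (not the real class)."""

    def __init__(self, max_seq_len: int):
        # Lower-triangular causal mask, stored as 0/1 integers here to
        # mimic a cached mask whose dtype is not boolean.
        self.chunked_prefill_attn_mask = np.tril(
            np.ones((max_seq_len, max_seq_len), dtype=np.int8))

    def get_chunked_prefill_attn_mask(self):
        # The fix: always hand the attention kernel a boolean mask,
        # regardless of how the mask was built or cached internally
        # (the real code does self.chunked_prefill_attn_mask.to(torch.bool)).
        return self.chunked_prefill_attn_mask.astype(bool)

builder = AttentionMaskBuilder(4)
mask = builder.get_chunked_prefill_attn_mask()
print(mask.dtype)  # → bool
```

Centralizing the dtype conversion in one getter means no caller can accidentally pass an integer or float mask to the kernel.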

@github-actions (bot) commented:

👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:

  • A PR should do only one thing; smaller PRs enable faster reviews.
  • Every PR should include unit tests and end-to-end tests to ensure it works and is not broken by future PRs.
  • Write the commit message by filling in the PR description, to help reviewers and future developers understand the change.

If CI fails, you can run the linting and testing checks locally according to Contributing and Testing.

@gemini-code-assist (bot) left a comment

Code Review

The pull request effectively addresses a reported hang issue in _npu_flash_attention by refining the handling of attention mask dtypes. The key change involves refactoring the logic for retrieving the chunked_prefill_attn_mask into a dedicated method, get_chunked_prefill_attn_mask. This new method explicitly ensures the mask is converted to torch.bool, which is crucial for the correct operation of the attention mechanism. This refactoring not only fixes the bug but also enhances code clarity and maintainability by centralizing the dtype conversion for this specific mask.

Comment thread: vllm_ascend/attention/attention_mask.py (outdated)
Comment on lines 68 to 71

    def get_chunked_prefill_attn_mask(self):
        return self.chunked_prefill_attn_mask.to(torch.bool)

    def get_attn_mask(self, max_seq_len: int, dtype: torch.dtype,
@gemini-code-assist (bot) commented (severity: high)

The introduction of get_chunked_prefill_attn_mask and the removal of the conditional logic from get_attn_mask is a significant improvement. This refactoring clearly separates the responsibility of providing the chunked prefill attention mask and explicitly ensures its torch.bool dtype. This directly addresses the reported "wrong attention mask dtype" issue, which was causing hangs in _npu_flash_attention, by enforcing the correct data type for this specific mask. It also makes the get_attn_mask method more focused on its general purpose.

Comment thread: vllm_ascend/worker/model_runner_v1.py (outdated)

        elif attn_state == AscendAttentionState.PrefillCacheHit:
    -       return self.attn_mask_builder.get_attn_mask(
    -           2048, self.dtype, self.device)
    +       return self.attn_mask_builder.get_chunked_prefill_attn_mask()
@gemini-code-assist (bot) commented (severity: high)

Updating the call to use the new get_chunked_prefill_attn_mask() method is a correct and consistent application of the refactored logic. This change ensures that the attention mask used in the PrefillCacheHit state consistently has the torch.bool dtype, which is essential for preventing the _npu_flash_attention hang issue as described in the PR.
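The model runner's mask selection can be pictured as a dispatch on the attention state. The sketch below is a self-contained toy model of that dispatch (the state names follow the snippet above, but the builder and masks are hypothetical plain-Python stand-ins, not the real torch objects):

```python
from enum import Enum, auto

class AscendAttentionState(Enum):
    # Illustrative subset of the states the model runner dispatches on.
    PrefillNoCache = auto()
    PrefillCacheHit = auto()
    DecodeOnly = auto()

class AttnMaskBuilder:
    """Toy builder: a mask is just a tuple tagged with its dtype name."""

    def get_attn_mask(self, max_seq_len, dtype):
        # General-purpose path: dtype follows the model's dtype.
        return ("mask", max_seq_len, dtype)

    def get_chunked_prefill_attn_mask(self):
        # Dedicated path introduced by the fix: always boolean.
        return ("mask", 2048, "bool")

def make_attn_mask(state, builder, dtype="float16"):
    if state == AscendAttentionState.PrefillNoCache:
        return builder.get_attn_mask(128, dtype)
    elif state == AscendAttentionState.PrefillCacheHit:
        # The fix: use the dedicated bool-mask getter here instead of
        # get_attn_mask(2048, self.dtype, self.device).
        return builder.get_chunked_prefill_attn_mask()
    return None

print(make_attn_mask(AscendAttentionState.PrefillCacheHit, AttnMaskBuilder()))
# → ('mask', 2048, 'bool')
```

The point of the design is that the PrefillCacheHit branch can no longer pick up the model's compute dtype by accident; it goes through the one getter that enforces torch.bool.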

@Semmer2 Semmer2 force-pushed the FixFAHangIssue branch 3 times, most recently from f5f2eac to 36d1b5e on November 25, 2025 07:30
Fix model run _npu_flash_attention in _forward_prefill_no_cache hang
issue, it was caused by wrong attention mask dtype.

Signed-off-by: Ting FU <futing10@huawei.com>
@MengqingCao MengqingCao added the ready (read for review) and ready-for-test (start test by label for PR) labels Nov 28, 2025
@MengqingCao (Collaborator) left a comment

Thanks for this fix!

@MengqingCao MengqingCao merged commit 9af3475 into vllm-project:main Nov 29, 2025
51 of 52 checks passed
ChenCangtao pushed a commit to ChenCangtao/vllm-ascend that referenced this pull request Dec 3, 2025
…4410)

Fix model run _npu_flash_attention in _forward_prefill_no_cache hang
issue, it was caused by wrong attention mask dtype.
### How was this patch tested?
Yes, tesed on Qwen2.5-VL and Qwen2.5-Omni

- vLLM version: v0.11.0
- vLLM main:
vllm-project/vllm@2918c1b

Signed-off-by: Ting FU <futing10@huawei.com>
Mercykid-bash pushed a commit to Mercykid-bash/vllm-ascend that referenced this pull request Dec 4, 2025
Mercykid-bash pushed a commit to Mercykid-bash/vllm-ascend that referenced this pull request Dec 4, 2025
Meihan-chen pushed a commit to Meihan-chen/vllm-ascend that referenced this pull request Dec 5, 2025
Clorist33 pushed a commit to Clorist33/vllm-ascend that referenced this pull request Dec 9, 2025
Clorist33 pushed a commit to Clorist33/vllm-ascend that referenced this pull request Dec 10, 2025
Mercykid-bash pushed a commit to Mercykid-bash/vllm-ascend that referenced this pull request Dec 10, 2025

Labels

module:tests, ready (read for review), ready-for-test (start test by label for PR)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants