Use paged_attention_v1 for sliding window decode in rocm_aiter_fa#34378

Merged
zhuohan123 merged 1 commit into vllm-project:main from iseeyuan:export-D93009177
Feb 13, 2026
Conversation

@iseeyuan
Contributor

Summary: Replace unified_attention (Triton) with paged_attention_v1 for the sliding window decode path in AiterFlashAttentionImpl. paged_attention_v1 already supports sliding window natively via its sliding_window parameter, so this unifies the NHD decode path for both sliding window and non-sliding window cases. The sliding window value is recovered from the flash-attn convention (self.sliding_window[0] + 1), which yields 0 (disabled) when no sliding window is configured.
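The recovery convention described above can be sketched as follows. This is a hypothetical illustration, assuming the flash-attn convention in which the sliding window is stored as a `(left, right)` tuple with `left == window_size - 1`, and `left == -1` when no sliding window is configured; the helper name is invented for this sketch and does not appear in the PR.

```python
def recover_sliding_window(sliding_window: tuple[int, int]) -> int:
    """Recover the window size expected by paged_attention_v1.

    Assumes the flash-attn convention: sliding_window[0] is
    window_size - 1 when a window is configured, and -1 otherwise,
    so adding 1 yields 0 (disabled) when no window is set.
    """
    return sliding_window[0] + 1

# A configured window of 4096 would be stored as (4095, 0) under this
# convention; "no sliding window" would be stored as (-1, -1).
assert recover_sliding_window((4095, 0)) == 4096
assert recover_sliding_window((-1, -1)) == 0
```

Because 0 signals "disabled" to paged_attention_v1, the same decode call can serve both the sliding-window and non-sliding-window paths.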

Test Plan: Requires ROCm GPU with a sliding window model (e.g., Mistral) to validate end-to-end. Verified that unified_attention is no longer referenced in the file.

Differential Revision: D93009177

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request refactors the sliding window decode path within AiterFlashAttentionImpl for ROCm. By removing the separate unified_attention (Triton) kernel and instead leveraging the native sliding window support in paged_attention_v1, the change unifies the decode logic for the sliding window and non-sliding window cases, improving code clarity and maintainability. The sliding window parameter is passed correctly, so behavior is preserved while the implementation is streamlined. The changes appear solid and are a good improvement.

iseeyuan pushed a commit to iseeyuan/vllm that referenced this pull request Feb 12, 2026
…lm-project#34378)

Summary:

Replace unified_attention (Triton) with paged_attention_v1 for the sliding window decode path in AiterFlashAttentionImpl. paged_attention_v1 already supports sliding window natively via its sliding_window parameter, so this unifies the NHD decode path for both sliding window and non-sliding window cases. The sliding window value is recovered from the flash-attn convention (self.sliding_window[0] + 1), which yields 0 (disabled) when no sliding window is configured.

Test Plan: Requires ROCm GPU with a sliding window model (e.g., Mistral) to validate end-to-end. Verified that unified_attention is no longer referenced in the file.

Differential Revision: D93009177
Signed-off-by: Martin Yuan <myuan@meta.com>
Collaborator

@houseroad houseroad left a comment


Looks good.

@houseroad houseroad added the `ready` (ONLY add when PR is ready to merge/full CI is needed) label Feb 12, 2026
@zhuohan123 zhuohan123 merged commit 9ea1f59 into vllm-project:main Feb 13, 2026
53 of 56 checks passed
@github-project-automation github-project-automation bot moved this from Todo to Done in AMD Feb 13, 2026
@AndreasKaratzas
Collaborator

AndreasKaratzas commented Feb 15, 2026

This PR caused a recent regression: https://buildkite.com/vllm/amd-ci/builds/4773/steps/canvas?sid=019c5af3-274c-4971-937e-636c0be82f12&tab=output

I'm working on a fix right now, but in the future let's run tests on AMD hardware before making such changes :)
Btw, I see that the Language Models Tests (Standard) also failed in the amd-ci run of this very PR: https://buildkite.com/vllm/amd-ci/builds/4717/steps/canvas?sid=019c53b0-1e30-43af-91d7-8ede16e93052&tab=output

cc @iseeyuan @houseroad

eldarkurtic pushed a commit to eldarkurtic/vllm that referenced this pull request Feb 19, 2026
…lm-project#34378)

Signed-off-by: Martin Yuan <myuan@meta.com>
Co-authored-by: Martin Yuan <myuan@meta.com>
Signed-off-by: Eldar Kurtic <research@neuralmagic.com>
llsj14 pushed a commit to llsj14/vllm that referenced this pull request Mar 1, 2026
…lm-project#34378)

Signed-off-by: Martin Yuan <myuan@meta.com>
Co-authored-by: Martin Yuan <myuan@meta.com>
tunglinwood pushed a commit to tunglinwood/vllm that referenced this pull request Mar 4, 2026
…lm-project#34378)

Signed-off-by: Martin Yuan <myuan@meta.com>
Co-authored-by: Martin Yuan <myuan@meta.com>

Labels

fb-exported (meta-exported), ready (ONLY add when PR is ready to merge/full CI is needed), rocm (Related to AMD ROCm), v1

Projects

Status: Done

4 participants