Use paged_attention_v1 for sliding window decode in rocm_aiter_fa #34378
zhuohan123 merged 1 commit into vllm-project:main
Conversation
Code Review
This pull request refactors the sliding window decode path within the AiterFlashAttentionImpl for ROCm. By removing the separate unified_attention (Triton) kernel and instead leveraging the native sliding window support in paged_attention_v1, the change unifies the decode logic for the sliding window and non-sliding window cases, improving code clarity and maintainability. The sliding window parameter is passed correctly, so behavior is preserved while the implementation is streamlined. The changes look solid and are a good improvement.
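To make the unification concrete, here is a minimal sketch with stubbed kernels. The function names mirror the PR description, but the signatures are hypothetical simplifications, not the real vLLM kernel APIs:

```python
# Stub standing in for the vLLM paged_attention_v1 kernel; per the PR, the
# real kernel supports sliding window natively (0 means disabled).
def paged_attention_v1(sliding_window: int = 0) -> str:
    return f"paged_attention_v1(sliding_window={sliding_window})"

# Stub for the separate Triton kernel that the refactor removes.
def unified_attention_triton(window: int) -> str:
    return f"unified_attention(window={window})"

def decode_before(sliding_window: int) -> str:
    # Old path: sliding-window decode dispatched to a separate kernel.
    if sliding_window > 0:
        return unified_attention_triton(sliding_window)
    return paged_attention_v1()

def decode_after(sliding_window: int) -> str:
    # New path: a single kernel handles both cases.
    return paged_attention_v1(sliding_window=sliding_window)
```

The design win is that decode_after has no branch at all: passing 0 disables the window inside the kernel, so one code path covers both configurations.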
This PR caused a recent regression: https://buildkite.com/vllm/amd-ci/builds/4773/steps/canvas?sid=019c5af3-274c-4971-937e-636c0be82f12&tab=output I'm working on a fix right now. But in the future, let's run tests on AMD hardware before making changes like this :)
…lm-project#34378) Signed-off-by: Martin Yuan <myuan@meta.com> Co-authored-by: Martin Yuan <myuan@meta.com> Signed-off-by: Eldar Kurtic <research@neuralmagic.com>
Summary: Replace unified_attention (Triton) with paged_attention_v1 for the sliding window decode path in AiterFlashAttentionImpl. paged_attention_v1 already supports sliding window natively via its sliding_window parameter, so this unifies the NHD decode path for both sliding window and non-sliding window cases. The sliding window value is recovered from the flash-attn convention (self.sliding_window[0] + 1), which yields 0 (disabled) when no sliding window is configured.
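The window recovery described above can be sketched as follows. This assumes the flash-attn convention stated in the summary: the left bound of self.sliding_window is the window size minus one when enabled, and -1 when no window is configured, so adding 1 yields either the window size or 0 (disabled):

```python
def recover_sliding_window(fa_sliding_window: tuple[int, int]) -> int:
    """Convert a flash-attn-style (left, right) window tuple into the
    single sliding_window value expected by paged_attention_v1.

    Returns 0 (disabled) when no sliding window is configured, because
    flash-attn represents a disabled window with a left bound of -1.
    """
    return fa_sliding_window[0] + 1

# No sliding window configured: flash-attn stores (-1, -1) -> 0 (disabled).
print(recover_sliding_window((-1, -1)))   # 0
# A 4096-token window: flash-attn stores (4095, 0) -> 4096.
print(recover_sliding_window((4095, 0)))  # 4096
```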
Test Plan: Requires ROCm GPU with a sliding window model (e.g., Mistral) to validate end-to-end. Verified that unified_attention is no longer referenced in the file.
Differential Revision: D93009177